CN109961107B - Training method and device for target detection model, electronic equipment and storage medium - Google Patents

Training method and device for target detection model, electronic equipment and storage medium

Info

Publication number
CN109961107B
CN109961107B (application number CN201910315195.1A)
Authority
CN
China
Prior art keywords
loss function
classification network
detection model
network
target detection
Prior art date
Legal status
Active
Application number
CN201910315195.1A
Other languages
Chinese (zh)
Other versions
CN109961107A (en)
Inventor
李永波
李伯勋
俞刚
Current Assignee
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201910315195.1A priority Critical patent/CN109961107B/en
Publication of CN109961107A publication Critical patent/CN109961107A/en
Application granted granted Critical
Publication of CN109961107B publication Critical patent/CN109961107B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The embodiment of the application provides a training method and device for a target detection model, an electronic device and a storage medium, wherein the target detection model comprises a first classification network, and the method comprises the following steps: setting at least one second classification network, wherein the input of the second classification network is the same as the input of the first classification network during training; and training the target detection model based on a total loss function until the total loss function converges, wherein the total loss function comprises the loss function of the target detection model and the loss function of the second classification network. Compared with existing training methods for target detection models, the scheme of the embodiment of the application adds a second classification network and trains the target detection model based on both the loss function of the second classification network and the loss function of the model itself. This effectively strengthens the model's learning of false detections and false alarms, thereby improving the detection precision of the target detection model.

Description

Training method and device of target detection model, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for training a target detection model, an electronic device, and a storage medium.
Background
The task of object detection is to find objects of interest in an image; for example, when the object is a face, face detection aims to detect the face and its corresponding location in the scene. Object detection is one of the important problems in the field of computer vision, and it has long-term research value and wide application requirements in fields such as security inspection and human-computer interaction.
In recent years, with the development of deep neural networks and hardware, target detection technology has advanced rapidly. In practical applications, however, it is often accompanied by a large number of false alarms, that is, some non-target areas are identified as target areas, which seriously hinders the adoption of the technology. Therefore, how to suppress false alarms in a target detection network and improve target detection accuracy is an important problem in the field.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks, in particular, the technical drawback of high false alarm rate in the target detection process.
In a first aspect, an embodiment of the present application provides a method for training a target detection model, where the target detection model includes a first classification network, and the method includes:
setting at least one second classification network, wherein the input of the second classification network is the same as the input of the first classification network during training;
and training the target detection model based on the total loss function until the total loss function is converged, wherein the total loss function comprises the loss function of the target detection model and the loss function of the second classification network.
In an alternative embodiment of the present application, the target detection model comprises a single stage detection network architecture.
In an alternative embodiment of the present application, the single-stage detection network structure comprises a RetinaNet network structure.
In an alternative embodiment of the present application, the second classification network comprises cascaded convolutional layers and fully-connected layers, wherein an input of the convolutional layers is connected to an output of a Backbone network of the RetinaNet network structure.
In an alternative embodiment of the present application, the loss function of the second classification network comprises at least one of a first loss function determined based on an output of the second classification network and a second loss function determined based on an output of the first classification network and an output of the second classification network.
In the embodiment of the present application, the first loss function is:
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
wherein L_C1 represents the first loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, and γ is an adjusting factor.
In an alternative embodiment of the present application, the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
In an alternative embodiment of the present application, the loss function of the target detection model includes a loss function of the first classification network and a loss function of the target box regression network.
In a second aspect, an embodiment of the present application further provides an image detection method, where the method includes:
acquiring an image to be detected;
detecting an image to be detected through a target detection model, wherein the target detection model is obtained through training by a training method of the target detection model in the first aspect of the embodiment of the application;
and obtaining a detection result of the image to be detected based on the output of the target detection model.
In a third aspect, an embodiment of the present application provides a training apparatus for a target detection model, where the target detection model includes a first classification network, and the apparatus includes:
the training supervision network setting module is used for setting at least one second classification network, wherein the input of the second classification network is the same as the input of the first classification network during training;
and the model training module is used for training the target detection model based on the total loss function until the total loss function is converged, wherein the total loss function comprises the loss function of the target detection model and the loss function of the second classification network.
In an alternative embodiment of the present application, the target detection model comprises a single stage detection network architecture.
In an alternative embodiment of the present application, the single-stage detection network structure comprises a RetinaNet network structure.
In an alternative embodiment of the present application, the second classification network comprises a convolutional layer and a fully-connected layer in cascade, wherein an input of the convolutional layer is connected to an output of the Backbone network of the RetinaNet network structure.
In an alternative embodiment of the present application, the loss function of the second classification network comprises at least one of a first loss function determined based on an output of the second classification network and a second loss function determined based on an output of the first classification network and an output of the second classification network.
In an alternative embodiment of the present application, the first loss function is:
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
wherein L_C1 represents the first loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, and γ is an adjusting factor.
In the embodiment of the present application, the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
In an alternative embodiment of the present application, the loss function of the target detection model includes a loss function of the first classification network and a loss function of the target box regression network.
In a fourth aspect, an embodiment of the present application further provides an image detection apparatus, including:
the image acquisition module is used for acquiring an image to be detected;
the image detection module is configured to detect an image to be detected through a target detection model, and obtain a detection result of the image to be detected based on output of the target detection model, where the target detection model is obtained by training through a training method of the target detection model in the first aspect of the embodiment of the present application.
In a fifth aspect, the present application provides an electronic device, comprising: a processor and a memory;
a memory for storing operating instructions;
and the processor is used for executing the method shown in any one of the first aspect and the second aspect of the application by calling the operation instruction.
In a sixth aspect, the present application provides a computer readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method as set forth in any of the first or second aspects of the present application.
The beneficial effect that technical scheme that this application provided brought is:
compared with the existing training mode of the target detection model, the scheme of the embodiment of the application can effectively enhance the learning of the model on false detection and false alarm by adding the second classification network and training the target detection model based on the loss function of the second classification network and the loss function of the model when the target detection model is trained, thereby improving the detection precision of the target detection model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a method for training a target detection model according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a second classification network according to an embodiment of the present application;
fig. 3 is a schematic diagram of another second classification network provided in an embodiment of the present application;
fig. 4 is a schematic diagram of another second classification network provided in an embodiment of the present application;
fig. 5 is a schematic diagram of another second classification network provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of an image detection method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an image detection apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
In order to make the objects, technical solutions and advantages of the present application clearer, the scheme of the embodiments of the present application is described by taking the detection of a face region in an image with an object detection model as an example; the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
An embodiment of the present application provides a training method for a target detection model, where the target detection model includes a first classification network, and as shown in fig. 1, the method may include:
step S110, at least one second classification network is set, wherein the input of the second classification network is the same as the input of the first classification network during training.
The target detection model is used for detecting a target region in an image. For example, it may be a face detection model for detecting a face region in the image, a human body detection model for detecting a human body region, or another object detection model. The output result of the classification network characterizes the class of the target region; for example, for a face detection model, the output may be the probability that a detected region in the input image is a face region.
And step S120, training the target detection model based on the total loss function until the total loss function is converged, wherein the total loss function comprises the loss function of the target detection model and the loss function of the second classification network.
The loss function is used to estimate the degree of inconsistency between the predicted results and the real results of the model. It is a non-negative real-valued function, and the smaller the loss function, the better the robustness of the model; the loss function is the core part of the empirical risk function and an important component of the structural risk function. Convergence of the loss function is a limit concept: generally, if the function value tends to a certain finite value as training proceeds, the loss function is said to converge.
It is understood that the loss function of the object detection model refers to the loss function portion of the model itself, the specific form of which is related to the structure of the object detection model. For example, if the target detection model includes a classification network, the loss function of the target detection model includes a loss function corresponding to the classification network, and if the target detection model further includes a regression network, the loss function of the target detection model includes a loss function corresponding to the classification network and a loss function corresponding to the regression network. The form of the loss function of each sub-network (e.g., classification network, regression network) of the model is also related to the structure of each sub-network.
In an embodiment of the present application, the loss function of the target detection model includes a loss function of the first classification network and a loss function of the target box regression network.
That is, in the embodiment of the present application, the target detection model may include a first classification network and a target box regression network, and the loss function of the target detection model includes a loss function of the first classification network and a loss function of the target box regression network. Accordingly, the total loss function may include a loss function of the first classification network, a loss function of the target box regression network, and a loss function of the second classification network.
In practical applications, the manner of determining whether a loss function (such as the total loss function) converges may be configured according to actual requirements. For example, during training, if the value of the total loss function approaches a finite value, the function may be considered converged. Generally, the smaller the total loss function the better, and its value will keep decreasing and stabilize as the number of training iterations increases. The convergence condition may, for example, be that the difference between the total loss values of two adjacent training iterations is smaller than a set threshold; when the training result meets this condition, the total loss function may be considered converged. Of course, other convergence conditions may be configured according to actual needs, or other ways of determining whether the function converges may be adopted.
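By way of illustration only, the threshold-based convergence condition described above can be sketched as a training loop; the names model, train_loader and total_loss_fn, as well as the SGD optimizer, are assumptions for the sketch and are not prescribed by the present application.

import torch

# Minimal sketch: stop training when the average total loss of two adjacent
# epochs differs by less than a set threshold eps (the convergence condition
# described above). `model`, `train_loader` and `total_loss_fn` are
# illustrative placeholders.
def train_until_converged(model, train_loader, total_loss_fn,
                          lr=1e-3, eps=1e-4, max_epochs=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = total_loss_fn(model, images, labels)  # total loss L
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(train_loader)
        if abs(prev_loss - epoch_loss) < eps:  # convergence condition met
            break
        prev_loss = epoch_loss
    return model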
In the embodiment of the application, when the target detection model is trained, at least one second classification network is additionally arranged, and the target detection model is trained based on the loss function of the second classification network and the loss function of the target detection model. The total loss function comprises the loss function of the target detection model and the loss function of the second classification network, and the loss function of the second classification network can play a role in supervision when the target detection model is trained.
In addition, the second classification network is arranged outside the network structure of the target detection model, so that the network structure of the original target detection model is not influenced, and the detection speed is not influenced because the network structure of the target detection model is not changed when the target detection model is subsequently used for target area detection.
In an alternative embodiment of the present application, the target detection model comprises a single stage detection network architecture.
Of course, the object detection model may also include a multi-stage detection network structure. The single-stage detection network structure, i.e., the one-stage detection network structure, may include, but is not limited to, a YOLO (You Only Look Once) structure, an SSD (Single Shot MultiBox Detector) network structure, or a RetinaNet network structure, for example.
In an alternative embodiment of the present application, the single-stage detection network structure comprises a RetinaNet network structure.
The RetinaNet network structure is a network for detecting targets in an image. It is a single network consisting of a Backbone network and two task-specific sub-networks: the Backbone network computes convolutional features over the whole image, the first sub-network performs an image classification task on the output of the Backbone network (i.e., it is the first classification network), and the second sub-network performs bounding-box regression (i.e., it is the regression network). The first classification network may include cascaded convolutional layers and fully-connected layers. The loss function L_C corresponding to the RetinaNet network structure includes the loss function L_cls of the first classification network and the loss function L_bb of the regression network, i.e., L_C = L_bb + L_cls.
In an optional embodiment of the present application, the second classification network includes a convolutional layer and a fully-connected layer in cascade, where the input of the convolutional layer is connected to the output of the Backbone network of the RetinaNet network structure; the Backbone network is pre-trained in the form of a classifier and serves to extract features from the image.
In practical applications, the second classification network may be designed with reference to the first classification network. Its structural form may be the same as or different from that of the first classification network; when the structures are the same, the network parameters of the two networks may still differ. As an example, if the single-stage detection network structure is a RetinaNet network structure whose first classification network includes cascaded convolutional layers and fully-connected layers, the second classification network may also include cascaded convolutional layers and fully-connected layers, with the input of its convolutional layer connected to the output of the Backbone network of the RetinaNet network structure.
In an example, the RetinaNet network structure is described taking it as the target detection model. As shown in fig. 2, an embodiment of the present application provides a schematic diagram of a RetinaNet network structure together with a second classification network. The network branch labeled Branch-c1 in the figure represents the second classification network, which is composed of a cascaded convolutional layer Conv3 and fully-connected layer FC3; the part inside the dotted frame is the RetinaNet network structure, which comprises the two network branches Branch-c and Branch-b. The branch labeled Branch-c represents the first classification network, composed of the cascaded convolutional layer Conv1 and a fully-connected layer, and the branch labeled Branch-b is the regression network, composed in this example of the cascaded convolutional layer Conv2 and fully-connected layer FC2. In this example, the output of the Branch-b branch may be the coordinates of the target region, and the outputs of the Branch-c and Branch-c1 branches may be the probability that a detected region is a target region.
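For illustration, a minimal PyTorch sketch of the Branch-c1 branch described above follows; the channel sizes and the global pooling step before FC3 are assumptions not fixed by this example.

import torch
import torch.nn as nn

# Sketch of the auxiliary branch Branch-c1 (Conv3 + FC3) attached to the
# Backbone output, as in fig. 2. Channel sizes and the pooling step are
# illustrative assumptions.
class SecondClassificationBranch(nn.Module):
    def __init__(self, in_channels=256, hidden_channels=256):
        super().__init__()
        self.conv3 = nn.Conv2d(in_channels, hidden_channels,
                               kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)       # collapse spatial dimensions
        self.fc3 = nn.Linear(hidden_channels, 1)  # one logit: target vs. background

    def forward(self, backbone_features):
        x = torch.relu(self.conv3(backbone_features))
        x = self.pool(x).flatten(1)
        return torch.sigmoid(self.fc3(x)).squeeze(1)  # p_1 in [0, 1]

Because this branch exists only to supply a supervisory loss during training, it can simply be dropped at inference time, which is consistent with the point above that the original network structure and detection speed are unaffected.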
In an optional embodiment of the present application, the loss function of the second classification network comprises at least one of a first loss function determined based on an output of the second classification network and a second loss function determined based on an output of the first classification network and an output of the second classification network.
That is, in practical applications, the loss function of the second classification network may take different forms: it may include the first loss function determined based on the output of the second classification network, or the second loss function determined based on the output of the first classification network and the output of the second classification network, or both of these loss functions.
In an alternative embodiment of the present application, the first loss function is:
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
wherein L_C1 represents the first loss function, α is a weighting factor, p_1 is the output result of the second classification network, y is the sample label, and γ is an adjusting factor.
The sample label indicates whether a target region exists in the sample image. For example, if the model is used to detect whether a face region exists in an image, y may be set to 1 when the sample image contains a face, and to 0 when it does not (that is, when the sample image is a background image).
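As a sketch, the first loss function above can be written directly in PyTorch; the default values of α and γ and the numerical clamp are assumptions added here, not values given in this application.

import torch

# Sketch of the first loss function L_C1 on the second classification
# network's output p_1. The alpha/gamma defaults and the eps clamp are
# illustrative additions for numerical stability.
def first_loss(p1, y, alpha=0.25, gamma=2.0, eps=1e-7):
    p1 = p1.clamp(eps, 1.0 - eps)
    neg_term = -(1 - alpha) * p1.pow(gamma) * torch.log(1 - p1) * (1 - y)
    pos_term = -alpha * (1 - p1).pow(gamma) * torch.log(p1) * y
    return (neg_term + pos_term).mean()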
In an alternative embodiment of the present application, the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y is the sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
That is, if p ≥ th then M = 1 and the result is considered positive, i.e., a target region is identified; if p < th then M = 0, i.e., no target region is identified.
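Correspondingly, a sketch of the second loss function follows; as in the previous sketch, the th, α and γ defaults and the numerical clamp are illustrative assumptions.

import torch

# Sketch of the second loss function L_C2 with the mask M derived from the
# first classification network's output p. The weighting term
# y*(1-M) + (1-y)*M is nonzero exactly where label and first-network
# prediction disagree (missed targets and false alarms).
def second_loss(p, p1, y, th=0.5, alpha=0.25, gamma=2.0, eps=1e-7):
    p1 = p1.clamp(eps, 1.0 - eps)
    M = (p >= th).float()                    # M = 1 if p >= th, else 0
    weight = y * (1 - M) + (1 - y) * M
    inner = (alpha * p1.pow(gamma) * torch.log(p1) * M
             - (1 - alpha) * (1 - p1).pow(gamma) * torch.log(1 - p1) * (1 - M))
    return (-weight * inner).mean()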
In the following, the specific alternatives for the loss function of the second classification network are described in detail, taking the RetinaNet network structure as the target detection model and illustrating the total loss function with specific examples.
1. The loss function of the second classification network comprises the first loss function L_C1 determined based on the output of the second classification network. In this case, the total loss function L includes the loss function L_C corresponding to the RetinaNet network structure (L_C = L_bb + L_cls) and the first loss function L_C1, e.g., L = L_C1 + L_C.
As an example, as shown in fig. 3, when training the RetinaNet network structure, after a sample image is input to the Backbone network, the output result of the first classification network in the RetinaNet network structure is the probability p, the output result of the regression network is denoted Box, and the output result of the second classification network is the probability p_1. In this example, the total loss function can be expressed as:
L = L_C + L_C1 = L_bb + L_cls + L_C1
wherein:
L_cls = -(1-α)*p^γ*log(1-p)*(1-y) - α*(1-p)^γ*log(p)*y
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
2. The loss function of the second classification network comprises the second loss function L_C2 determined based on the output of the first classification network and the output of the second classification network. In this case, the total loss function L includes the loss function L_C corresponding to the RetinaNet network structure (L_C = L_bb + L_cls) and the second loss function L_C2, e.g., L = L_C2 + L_C.
As an example, as shown in fig. 4, when training the RetinaNet network structure, after a sample image is input to the Backbone network, the output result of the first classification network in the RetinaNet network structure is the probability p, the output result of the regression network is denoted Box, and the output result of the second classification network is the probability p_1. In this example, the total loss function may be expressed as:
L = L_C2 + L_C = L_bb + L_cls + L_C2
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
where the value of M is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
The specific form of L_cls is the same as in the above example; for details, reference may be made to the above embodiment, which is not repeated here.
3. The loss function of the second classification network comprises both the first loss function L_C1 determined based on the output of the second classification network and the second loss function L_C2 determined based on the output of the first classification network and the output of the second classification network. In this case, the total loss function L includes the loss function L_C corresponding to the RetinaNet network structure (L_C = L_bb + L_cls), the first loss function L_C1, and the second loss function L_C2, e.g., L = L_C2 + L_C + L_C1.
As an example, as shown in fig. 5, when training the RetinaNet network structure, after a sample image is input to the Backbone network, the output result of the first classification network in the RetinaNet network structure is the probability p, the output result of the regression network is denoted Box, and the output result of the second classification network is the probability p_1. In this example, the total loss function can be expressed as:
L = L_C2 + L_C + L_C1 = L_bb + L_cls + L_C2 + L_C1
The specific forms of L_cls, L_C2 and L_C1 are the same as in the above examples; for details, reference may be made to the above embodiments, which are not repeated here.
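For illustration, the third variant can be assembled from the sketches above; the smooth-L1 box term is an assumed stand-in for L_bb (whose exact form is not specified here), and L_cls reuses the same focal-style form as L_C1 with p in place of p_1, as in the first example.

import torch.nn.functional as F

# Sketch of total-loss variant 3: L = L_bb + L_cls + L_C2 + L_C1, reusing the
# first_loss/second_loss sketches above. The smooth-L1 regression term is an
# illustrative assumption.
def total_loss(box_pred, box_target, p, p1, y,
               alpha=0.25, gamma=2.0, th=0.5):
    l_bb = F.smooth_l1_loss(box_pred, box_target)   # regression branch Branch-b
    l_cls = first_loss(p, y, alpha, gamma)          # L_cls: same form, with p
    l_c1 = first_loss(p1, y, alpha, gamma)          # L_C1 on Branch-c1
    l_c2 = second_loss(p, p1, y, th, alpha, gamma)  # L_C2 joint term
    return l_bb + l_cls + l_c1 + l_c2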
Based on the target detection model provided by the embodiments of the present application, as shown in fig. 6, an embodiment of the present application further provides an image detection method, including:
step S610, acquiring an image to be detected;
step S620, detecting the image to be detected through the target detection model;
the target detection model is the target detection model trained by the training method of the target detection model in the above embodiment, and the specific implementation manner of the training method may refer to the description of the training method of the target detection model in the above embodiment, which is not described herein again, for example, the target detection model may be a trained RetinaNet network structure.
And step S630, obtaining a detection result of the image to be detected based on the output of the target detection model.
The target region can be set according to actual needs and is not limited to a face region; it may also be, for example, a human body region.
That is to say, when image detection is performed, an image to be detected can be input into the trained target detection model, the target detection model can output a result, and a detection result of the image to be detected can be obtained based on the output result of the target detection model.
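A minimal inference sketch of steps S610 to S630 follows; the preprocessing and the (boxes, scores) output format of the trained model are assumptions for illustration, not requirements of the present application.

import torch
import torchvision.transforms.functional as TF
from PIL import Image

# Sketch: acquire an image (S610), run the trained detection model (S620),
# and keep detections above a score threshold as the result (S630).
def detect(model, image_path, score_threshold=0.5):
    image = Image.open(image_path).convert("RGB")   # step S610
    x = TF.to_tensor(image).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        boxes, scores = model(x)                    # step S620
    keep = scores >= score_threshold                # step S630
    return boxes[keep], scores[keep]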
Based on the same principle as the method shown in fig. 1, an embodiment of the present application further provides a training apparatus 70 for a target detection model, where the target detection model includes a first classification network. As shown in fig. 7, the training apparatus 70 may include a training supervision network setting module 710 and a model training module 720, wherein:
a training supervision network setting module 710, configured to set at least one second classification network, where an input of the second classification network is the same as an input of the first classification network during training;
and a model training module 720, configured to train the target detection model based on the total loss function until the total loss function converges, where the total loss function includes a loss function of the target detection model and a loss function of the second classification network.
In an embodiment of the application, the target detection model includes a single-stage detection network structure.
In the embodiment of the application, the single-stage detection network structure comprises a RetinaNet network structure.
In the embodiment of the present application, the second classification network includes a convolutional layer and a fully connected layer, which are cascaded, where an input of the convolutional layer is connected to an output of the backhaul network of the RetinaNet network structure.
In an embodiment of the application, the loss function of the second classification network comprises at least one of a first loss function determined based on an output of the second classification network and a second loss function determined based on an output of the first classification network and an output of the second classification network.
In the embodiment of the present application, the first loss function is:
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
wherein L_C1 represents the first loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, and γ is an adjusting factor.
In the embodiment of the present application, the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents the sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
In an embodiment of the present application, the loss function of the target detection model includes a loss function of the first classification network and a loss function of the target frame regression network.
The training apparatus of the target detection model in the embodiments of the present application may execute the training method of the target detection model provided in the embodiments of the present application, and its implementation principle is similar. The actions performed by each module of the training apparatus correspond to the steps of the training method in the embodiments of the present application; for a detailed functional description of each module, reference may be made to the description of the corresponding training method above, which is not repeated here.
Based on the same principle as the method shown in fig. 6, an embodiment of the present application further provides an image detection apparatus 80, and as shown in fig. 8, the image detection apparatus 80 may include: an image acquisition module 810 and an image detection module 820, wherein:
an image obtaining module 810, configured to obtain an image to be detected;
the image detection module 820 is used for detecting the image to be detected through the target detection model and obtaining a detection result of the image to be detected based on the output of the target detection model; the target detection model is obtained by training through the training method of the target detection model in the embodiment.
The image detection apparatus of the embodiments of the present application may execute the image detection method provided in the embodiments of the present application, and its implementation principle is similar. The actions performed by each module of the image detection apparatus correspond to the steps of the image detection method in the embodiments of the present application; for a detailed functional description of each module, reference may be made to the description of the corresponding image detection method above, which is not repeated here.
Embodiments of the present application also provide an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing computer operating instructions; and the processor is used for executing the method shown in the embodiment by calling the computer operation instruction.
Yet another embodiment of the present application provides a computer-readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the respective contents of the aforementioned method embodiments.
In an alternative embodiment, an electronic device is provided, as shown in fig. 9, the electronic device 4000 shown in fig. 9 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. It should be noted that the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein. The processor 4001 may also be a combination that performs a computing function, e.g., comprising one or more microprocessors, a combination of DSPs and microprocessors, etc.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in any of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, there is no strict ordering restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and not necessarily in sequence; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A method for training an object detection model, wherein the object detection model is used for detecting an object region in an image, and the object detection model comprises a first classification network, and the method comprises:
setting at least one second classification network, wherein the input of the second classification network is the same as the input of the first classification network during training;
training the target detection model based on a total loss function until the total loss function converges, wherein the total loss function comprises a loss function of the target detection model and a loss function of the second classification network, and the loss function of the second classification network comprises a second loss function determined based on an output of the first classification network and an output of the second classification network;
wherein the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents a sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
2. The method of claim 1, wherein the object detection model comprises a single stage detection network architecture.
3. The method of claim 2, wherein the single stage detection network structure comprises a RetinaNet network structure.
4. The method of claim 3, wherein the second classification network comprises cascaded convolutional layers and fully-connected layers, wherein inputs of the convolutional layers are connected to outputs of a Backbone Backbone network of the RetinaNet network structure.
5. The method of any of claims 1 to 4, wherein the loss function of the second classification network further comprises a first loss function determined based on an output of the second classification network.
6. The method of claim 5, wherein the first loss function is:
L_C1 = -(1-α)*p_1^γ*log(1-p_1)*(1-y) - α*(1-p_1)^γ*log(p_1)*y
wherein L_C1 represents the first loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents a sample label, and γ is an adjusting factor.
7. The method of claim 1, wherein the loss function of the target detection model comprises a loss function of the first classification network and a loss function of a target box regression network.
8. An image detection method, characterized in that the method comprises:
acquiring an image to be detected;
detecting the image to be detected through the target detection model, wherein the target detection model is obtained by training through the method of any one of claims 1 to 7;
and obtaining a detection result of the image to be detected based on the output of the target detection model.
9. An apparatus for training an object detection model, the object detection model being used for detecting an object region in an image, the object detection model comprising a first classification network, the apparatus comprising:
the training supervision network setting module is used for setting at least one second classification network, wherein the input of the second classification network is the same as the input of the first classification network during training;
a model training module, configured to train the target detection model based on a total loss function until the total loss function converges, where the total loss function includes a loss function of the target detection model and a loss function of the second classification network, and the loss function of the second classification network includes a second loss function determined based on an output of the first classification network and an output of the second classification network;
wherein the second loss function is:
L_C2 = -(y*(1-M) + (1-y)*M) * (α*p_1^γ*log(p_1)*M - (1-α)*(1-p_1)^γ*log(1-p_1)*(1-M))
wherein L_C2 represents the second loss function, α is a weighting factor, p_1 is the output result of the second classification network, y represents a sample label, γ is an adjusting factor, and M is a target region result label whose value is determined as follows:
M = 1, if p ≥ th; M = 0, if p < th
where p represents the output result of the first classification network and th represents a preset threshold.
10. An image detection apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be detected;
an image detection module, configured to detect the image to be detected through a target detection model, and obtain a detection result of the image to be detected based on an output of the target detection model, where the target detection model is obtained by training according to the method of any one of claims 1 to 7.
11. An electronic device, characterized in that the electronic device comprises: a processor and a memory;
the memory is used for storing operation instructions;
the processor is used for executing the method of any one of claims 1 to 8 by calling the operation instruction.
12. A computer readable storage medium, characterized in that it stores at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 8.
CN201910315195.1A 2019-04-18 2019-04-18 Training method and device for target detection model, electronic equipment and storage medium Active CN109961107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910315195.1A CN109961107B (en) 2019-04-18 2019-04-18 Training method and device for target detection model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910315195.1A CN109961107B (en) 2019-04-18 2019-04-18 Training method and device for target detection model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109961107A CN109961107A (en) 2019-07-02
CN109961107B true CN109961107B (en) 2022-07-19

Family

ID=67026354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910315195.1A Active CN109961107B (en) 2019-04-18 2019-04-18 Training method and device for target detection model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109961107B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533640B (en) * 2019-08-15 2022-03-01 北京交通大学 Improved YOLOv3 network model-based track line defect identification method
CN110991312A (en) * 2019-11-28 2020-04-10 重庆中星微人工智能芯片技术有限公司 Method, apparatus, electronic device, and medium for generating detection information
CN113139559B (en) * 2020-01-17 2022-06-24 魔门塔(苏州)科技有限公司 Training method of target detection model, and data labeling method and device
CN111768005B (en) * 2020-06-19 2024-02-20 北京康夫子健康技术有限公司 Training method and device for lightweight detection model, electronic equipment and storage medium
CN111950411B (en) * 2020-07-31 2021-12-28 上海商汤智能科技有限公司 Model determination method and related device
CN112085096A (en) * 2020-09-09 2020-12-15 华东师范大学 Method for detecting local abnormal heating of object based on transfer learning
CN112308150B (en) * 2020-11-02 2022-04-15 平安科技(深圳)有限公司 Target detection model training method and device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520229A (en) * 2018-04-04 2018-09-11 北京旷视科技有限公司 Image detecting method, device, electronic equipment and computer-readable medium
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN108875521A (en) * 2017-12-20 2018-11-23 北京旷视科技有限公司 Method for detecting human face, device, system and storage medium
CN109102024A (en) * 2018-08-14 2018-12-28 中山大学 A kind of Layer semantics incorporation model finely identified for object and its implementation
CN109360198A (en) * 2018-10-08 2019-02-19 北京羽医甘蓝信息技术有限公司 Bone marrow cell sorting method and sorter based on deep learning
CN109472214A (en) * 2018-10-17 2019-03-15 福州大学 One kind is taken photo by plane foreign matter image real-time detection method based on deep learning
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109614968A (en) * 2018-10-10 2019-04-12 浙江大学 A kind of car plate detection scene picture generation method based on multiple dimensioned mixed image stylization

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6316298B2 (en) * 2012-09-10 2018-04-25 オレゴン ヘルス アンド サイエンス ユニバーシティ Quantification of local circulation by OCT angiography
US10115039B2 (en) * 2016-03-10 2018-10-30 Siemens Healthcare Gmbh Method and system for machine learning based classification of vascular branches
CN106803071B (en) * 2016-12-29 2020-02-14 浙江大华技术股份有限公司 Method and device for detecting object in image
CN107145857B (en) * 2017-04-29 2021-05-04 深圳市深网视界科技有限公司 Face attribute recognition method and device and model establishment method
CN108460341B (en) * 2018-02-05 2020-04-07 西安电子科技大学 Optical remote sensing image target detection method based on integrated depth convolution network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875521A (en) * 2017-12-20 2018-11-23 北京旷视科技有限公司 Method for detecting human face, device, system and storage medium
CN108520229A (en) * 2018-04-04 2018-09-11 北京旷视科技有限公司 Image detecting method, device, electronic equipment and computer-readable medium
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN109102024A (en) * 2018-08-14 2018-12-28 中山大学 A kind of Layer semantics incorporation model finely identified for object and its implementation
CN109360198A (en) * 2018-10-08 2019-02-19 北京羽医甘蓝信息技术有限公司 Bone marrow cell sorting method and sorter based on deep learning
CN109614968A (en) * 2018-10-10 2019-04-12 浙江大学 A kind of car plate detection scene picture generation method based on multiple dimensioned mixed image stylization
CN109472214A (en) * 2018-10-17 2019-03-15 福州大学 One kind is taken photo by plane foreign matter image real-time detection method based on deep learning
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Automatic Ship Detection Based on RetinaNet Using Multi-Resolution Gaofen-3 Imagery; Yuanyuan Wang et al.; Remote Sensing; 2019-03-05; pp. 1-14 *
Focal Loss for Dense Object Detection; Tsung-Yi Lin et al.; International Conference on Computer Vision (ICCV); 2017-10-29; pp. 2980-2988 *
Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision; Wanxin Tian et al.; https://arxiv.org/abs/1811.08557v1; 2018-11-20; pp. 1-9 *
How to better understand Kaiming He's "Focal Loss"?; Su Jianlin; https://zhuanlan.zhihu.com/p/32423092; 2017-10-28; pp. 1-5 *
Research on Multi-Object Detection and Classification in Traffic Scenes Based on Deep Learning; He Pengpeng; China Master's Theses Full-text Database, Information Science and Technology; 2019-01-15; pp. I138-2735 *

Also Published As

Publication number Publication date
CN109961107A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
CN109961107B (en) Training method and device for target detection model, electronic equipment and storage medium
CN109740534B (en) Image processing method, device and processing equipment
CN107358157B (en) Face living body detection method and device and electronic equipment
US10452893B2 (en) Method, terminal, and storage medium for tracking facial critical area
CN107545263B (en) Object detection method and device
CN109492674B (en) Generation method and device of SSD (solid State disk) framework for target detection
Nguyen et al. Yolo based real-time human detection for smart video surveillance at the edge
CN108229658B (en) Method and device for realizing object detector based on limited samples
CN111985458A (en) Method for detecting multiple targets, electronic equipment and storage medium
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN117173182B (en) Defect detection method, system, equipment and medium based on coding and decoding network
CN113837257A (en) Target detection method and device
WO2024011859A1 (en) Neural network-based face detection method and device
CN111353577B (en) Multi-task-based cascade combination model optimization method and device and terminal equipment
CN111027716A (en) Load prediction method and device
WO2020052170A1 (en) Target object identification method and device, and storage medium
CN113255671B (en) Target detection method, system, device and medium for object with large length-width ratio
Kaur et al. Deep transfer learning based multiway feature pyramid network for object detection in images
CN114913588A (en) Face image restoration and recognition method applied to complex scene
Liu et al. Research on Small Target Pedestrian Detection Algorithm Based on Improved YOLOv3
CN116913259B (en) Voice recognition countermeasure method and device combined with gradient guidance
CN113569727B (en) Method, system, terminal and medium for identifying construction site in remote sensing image
EP4332892A1 (en) Estimating volumes of liquid
Cheng et al. Research on Target Recognition Algorithm Based on Improved Faster-RCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Training methods, devices, electronic devices, and storage media for object detection models

Effective date of registration: 20230404

Granted publication date: 20220719

Pledgee: Shanghai Yunxin Venture Capital Co.,Ltd.

Pledgor: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.

Registration number: Y2023990000192