CN115439699B - Training method of target detection model, target detection method and related products

Info

Publication number: CN115439699B
Authority: CN (China)
Prior art keywords: detection, target, category, sample, training
Legal status: Active (granted)
Application number: CN202211310926.1A
Other languages: Chinese (zh)
Other versions: CN115439699A
Inventors: 贺婉佶, 史晓宇
Current Assignee: Beijing Airdoc Technology Co Ltd
Original Assignee: Beijing Airdoc Technology Co Ltd
Application filed by Beijing Airdoc Technology Co Ltd; priority to CN202211310926.1A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The application discloses a training method for a target detection model, a target detection method, and related products. The target detection model comprises a backbone network and a detection frame classification branch connected to the backbone network, the detection frame classification branch comprising at least one detection category output. The training method comprises: acquiring a sample training set formed from sample images containing detection category targets and/or false positive category targets; assigning two layers of labels to each detection category target and each false positive category target, where the first layer of labels identifies the detection category of each target and the second layer of labels identifies the true category of each target; and training the target detection model using the sample training set carrying the two layers of labels. The training method improves the ability of the target detection model to distinguish and identify false positive category targets.

Description

Training method of target detection model, target detection method and related products
Technical Field
The present application relates generally to the field of image processing technology. More particularly, the present application relates to a training method of a target detection model, a target detection method, a device and a computer readable storage medium.
Background
With the continuous development of artificial intelligence technology, target detection models have been widely applied in image recognition, for example to recognize and detect people or target objects in surveillance video in monitoring scenarios, to recognize and detect facial features in face images in face recognition scenarios, to recognize and detect lesions in medical images, and to recognize and detect cells or microorganisms of interest in microscopic images.
For the commonly used target detection models, although different detection frameworks may adopt different matching principles between gold standard boxes (or ground truth boxes) and anchors (or detection frames), these matching principles generally share a common point: only anchors near a gold standard box are matched with that gold standard box, while anchors that are not near any gold standard box are matched as background.
Because targets corresponding to anchors matched as background are not fed into the detection frame classification branch of the target detection model for learning, the detection frame classification branch can only learn what the categories of interest look like and never learns the appearance of background targets that resemble those categories. From the viewpoint of classification learning, the detection frame classification branch only ever sees what a positive sample is and never sees what a positive sample is not, which is unfavorable for model learning.
In view of this, there is a need to provide a training way that is more beneficial for model learning.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present application proposes, in various aspects, a training method of an object detection model, a method of object detection, an apparatus, and a computer-readable storage medium.
In a first aspect, the present application provides a training method of a target detection model, the target detection model comprising a backbone network and a detection frame classification branch connected to the backbone network, the detection frame classification branch comprising at least one detection class output; the training method comprises the following steps: acquiring a sample training set formed by sample images containing detection category targets and/or false positive category targets; two layers of labels are respectively assigned to each detection category target and each false positive category target, a first layer of labels in the two layers of labels are used for identifying the detection category of each target, and a second layer of labels in the two layers of labels are used for identifying the true category of each target; and training the target detection model by using a sample training set with the two-layer labels.
In some embodiments, prior to training the target detection model using the sample training set, the training method further comprises: obtaining a pre-training set formed by a sample image with a sample label, wherein the sample label is used for identifying a detection category target in the sample image; and pre-training the target detection model by using the pre-training set.
In other embodiments, the training method further comprises: based on the pre-training set, a sample training set comprising two layers of labels is generated.
In still other embodiments, generating the sample training set comprising two layers of labels comprises: performing target detection on the sample images in the pre-training set by using the pre-trained target detection model so as to obtain a plurality of detection results; comparing the detection results with the sample labels to determine false positive category targets corresponding to the background in the pre-training set and detection categories to which the false positive category targets belong from the detection results; and generating the sample training set containing two layers of labels according to the detection category and the real category of each detection category target and each false positive category target in the pre-training set.
In some embodiments, the training method further comprises: a loss function of the object detection model is calculated based on the two-layer tags, wherein the loss function includes a first loss function for a first layer of tags and a second loss function for a second layer of tags.
In other embodiments, the first loss function comprises a cross entropy loss function and the second loss function comprises a contrast learning loss function.
In still other embodiments, the first loss function is expressed as:

$$L_1 = -\frac{1}{B}\sum_{b=1}^{B}\sum_{m=1}^{M_b}\sum_{n=1}^{N} y_{b,m,n}\,\log\left(p_{b,m,n}\right)$$

where $L_1$ represents the first loss function, $B$ represents the batch size, $M_b$ represents the number of detection frames detected in the b-th sample image of a batch and matched as foreground frames, $N$ represents the number of detection categories, $y_{b,m,n}$ represents the label value of the n-th detection category in the first layer label of the m-th detection frame in the b-th sample image, and $p_{b,m,n}$ represents the output probability of the m-th detection frame for the n-th detection category in the b-th sample image.
In some embodiments, the second loss function is expressed as:

$$L_2 = \frac{1}{Q}\sum_{i=1}^{Q}\ell_i$$

$$\ell_i = \frac{-1}{\sum_{k=1}^{Q}\mathbb{1}\left[\tilde{y}_i=\tilde{y}_k\right]-1}\sum_{j=1}^{Q}\left(1-\mathbb{1}\left[i=j\right]\right)\mathbb{1}\left[\tilde{y}_i=\tilde{y}_j\right]\log\frac{\exp\left(z_i\cdot z_j/\tau\right)}{\sum_{c=1}^{Q}\mathbb{1}\left[i\ne c\right]\mathbb{1}\left[y_i=y_c\right]\exp\left(z_i\cdot z_c/\tau\right)}$$

where $L_2$ represents the second loss function, $Q$ represents the number of detection frames detected in all sample images of the batch and matched as foreground frames, $\ell_i$ represents the contrast learning loss of the i-th detection frame in the batch, $\tilde{y}_i$, $\tilde{y}_j$ and $\tilde{y}_k$ represent the second layer labels of the i-th, j-th and k-th detection frames, $y_i$ and $y_c$ represent the first layer labels of the i-th and c-th detection frames, $\tau$ represents the temperature coefficient, and $z_i$, $z_j$ and $z_c$ represent the penultimate-layer feature vectors of the i-th, j-th and c-th detection frames.
In other embodiments, the loss function is a weighted sum of the first loss function and the second loss function.
In still other embodiments, prior to training the target detection model using the sample training set, the training method further comprises: and freezing weight parameters in other network structures except the detection frame classification branches in the target detection model.
In some embodiments, the target detection model further comprises a foreground-background classification branch and a detection frame position regression branch respectively connected with the backbone network; the freezing weight parameters include: and freezing weight parameters in the backbone network, the foreground and background classification branches and the detection frame position regression branches.
In other embodiments, the sample image comprises a medical sample image.
In a second aspect, the present application provides a method of image-based object detection, comprising: inputting an image to be detected into a target detection model trained by the training method according to any one of the first aspect of the application; and performing target detection on the image to be detected by using the target detection model and outputting a detection result.
In a third aspect, there is provided an apparatus for target detection, comprising: a processor for executing program instructions; and a memory storing the program instructions that, when loaded and executed by the processor, cause the processor to perform the method of training the object detection model according to any of the first aspects of the present application or the method of image-based object detection according to any of the second aspects of the present application.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon computer readable instructions which, when executed by one or more processors, implement a method of training an object detection model as described in any of the first aspects of the present application or a method of image-based object detection as described in any of the second aspects of the present application.
Through the above description of the technical solution and embodiments of the present application, a person skilled in the art can understand that, in the solution of the present application, two layers of labels are respectively assigned to each detection category target and each false positive category target. As a result, the detection frame classification branch of the target detection model can not only learn the characteristics of the detection categories in the first layer of labels, but also learn the difference between detection category targets and false positive category targets in the second layer of labels, so that the target detection model learns how to distinguish detection category targets from false positive category targets. This is beneficial to improving the model's ability to distinguish and identify false positive category targets and to improving the detection accuracy for detection category targets.
Further, in some embodiments, by setting the loss function for the two layers of labels, a hierarchical supervision manner can be implemented in the training process of the target detection model, so as to supervise the learning effect of the target detection model on each layer of labels. Still further, in other embodiments, by using a contrast learning loss function, the contrast drawn by the detection frame classification branch between the false positive category targets (or false positive samples) of each category of interest and the corresponding detection category targets (or true positive samples) is strengthened, and the feature distance between the false positive category targets and the detection category targets of each category of interest is increased, so that the recognition capability of the detection frame classification branch is enhanced.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a schematic block diagram illustrating an object detection model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a training method of an object detection model according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a training method of an object detection model according to another embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a sample image in a pre-training set according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating sample images in a sample training set according to an embodiment of the present application;
FIG. 6 is a flow chart illustrating a method of image-based object detection according to an embodiment of the present application;
FIG. 7 is a schematic block diagram illustrating a trained object detection model according to an embodiment of the present application; and
fig. 8 is a schematic block diagram illustrating a system for target detection according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and in the claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the specification and claims of this application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "upon", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting the [described condition or event]" or "in response to detecting the [described condition or event]".
Specific embodiments of the present application are described in detail below with reference to the accompanying drawings. In order to facilitate understanding of the technical solutions of the embodiments of the present application, an exemplary description of the object detection model related to the embodiments of the present application is first described below with reference to fig. 1.
Fig. 1 is a schematic block diagram illustrating an object detection model according to an embodiment of the present application. As shown in fig. 1, the object detection model 100 may include a backbone network 110 and a detection box classification branch 120 connected to the backbone network 110, the detection box classification branch 120 may include at least one detection class output 121. In some embodiments, the Backbone network 110 (or Backbone network) may be used to perform feature extraction on the input image, may output feature maps and/or feature vectors generated based on the input image, and the like. In other embodiments, the backbone network 110 may include a network structure such as a convolutional neural network that can be used to perform feature extraction on the input image. In still other embodiments, the object detection model 100 may be a currently commonly used detection model, for example, a network model structure such as YOLO series or RCNN series may be used, where the backbone network 110 may use a backbone network structure in a network model such as YOLO series or RCNN series.
The detection frame classification branch 120 may be used to classify the targets in the detection frame. In some embodiments, each detection class output 121 may be used to output a probability value for a detection class. The detection category may be a category of interest or a target category, for example, in some application scenarios, the detection task is to detect whether there are apples and pears in the input image, where the apples and pears belong to the detection category, while other category objects in the input image, such as bananas, pomegranates, etc., do not belong to the detection category, and are determined as background in the detection and are not identified by the detection frame. In other embodiments, each detection class output 121 may be implemented using a sigmoid normalization layer.
As further shown in fig. 1, the object detection model 100 may further include a foreground-background classification branch 130 and a detection frame position regression branch 140 connected to the backbone network 110, respectively, wherein the foreground-background classification branch 130 may be used to distinguish between foreground and background of the feature map output by the backbone network 110, and the output foreground object may be marked in the form of a detection frame; the detection frame position regression branch 140 may be used to output position information of the detection frame, including, for example, position coordinates, offsets, detection frame sizes, and the like. In some embodiments, the foreground-background classification branch 130 and the detection frame position regression branch 140 may employ respective branch network structures in a network model, such as the YOLO series or the RCNN series.
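To make this branch layout concrete, the following is a minimal PyTorch-style sketch of such a detection head; the module names, channel sizes and anchor count are illustrative assumptions rather than details taken from the patent.

```python
# Minimal sketch of the head layout described above (PyTorch-style).
# Channel sizes, anchor count and module names are illustrative assumptions.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=3, num_anchors=9):
        super().__init__()
        # Detection frame classification branch: one sigmoid output per detection category.
        self.cls_branch = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=3, padding=1)
        # Foreground/background classification branch.
        self.fg_bg_branch = nn.Conv2d(in_channels, num_anchors * 1, kernel_size=3, padding=1)
        # Detection frame position regression branch (x, y, w, h per anchor).
        self.box_branch = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, feature_map):
        cls_scores = torch.sigmoid(self.cls_branch(feature_map))    # per-category outputs
        fg_bg_scores = torch.sigmoid(self.fg_bg_branch(feature_map))  # foreground vs background
        box_deltas = self.box_branch(feature_map)                     # detection frame positions
        return cls_scores, fg_bg_scores, box_deltas
```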
While an exemplary object detection model according to an embodiment of the present application has been described above in connection with fig. 1, it is to be understood that the above description is exemplary and not limiting, and that the number of detection class outputs 121 may be not limited to three in the illustration, but may be more or less as desired. Having described the object detection model of an embodiment of the present application, an exemplary description of a training method according to an embodiment of the present application will be described below with reference to fig. 2.
FIG. 2 is a flowchart illustrating a training method of an object detection model according to an embodiment of the present application. As shown in fig. 2, training method 200 may include: in step 210, a sample training set formed from sample images containing detection class targets and/or false positive class targets may be acquired.
The content of the sample image described above may be selected according to the scene of the desired application. For example, in some embodiments, the sample image may include a medical sample image for detection of a lesion. In other embodiments, the medical sample image may include one of medical images such as fundus images, brain images, lung images, and the like. In still other embodiments, the medical sample image may be acquired by a medical device such as a fundus camera, OCT (Optical coherence tomography) device, nuclear magnetic resonance device, electronic computed tomography CT device, or the like. For example, in other embodiments, the sample image may comprise, for example, a portrait sample image for detection in the field of surveillance or in the field of face recognition. In still other embodiments, the sample image may include a text sample image, such as for text detection in the field of text recognition or translation. For another example, in some embodiments, the sample image may include a passing vehicle sample image for detection and identification of vehicles in road monitoring, and the like.
The sample training set may include one or more sample images, wherein each sample image may include a detection class object and/or a false positive class object. In some embodiments, each sample image in the sample training set may contain a detection class object and a false positive class object. In other embodiments, a portion of the sample images in the sample training set contain only detection class objects or false positive class objects, and another portion of the sample images contain both detection class objects and false positive class objects. In still other embodiments, a portion of the sample images in the sample training set contain only detection class targets, while another portion of the sample images contain only false positive class targets.
In some embodiments, the detection class object may be an object in the sample image that belongs to a detection class, and the false positive class object may be an object in the sample image that belongs to a false positive class. In other embodiments, the false positive class may be similar to the detection class target, and may be easily misclassified as a class of non-detection targets in the detection class, i.e., false positives corresponding to the detection class. Still taking the above detection class including apples and pears as an example, the pomegranate in the input image may be a non-detection target with respect to apples, and since the pomegranate is similar to apples and is easily misclassified into apples by the target detection model, the pomegranate may be determined as a false positive class target corresponding to this detection class of apples. In some embodiments, detection class objects and/or false positive class objects in the sample image may be marked by a detection box (or bounding box) to facilitate identification.
Next, in step 220, two layers of labels may be respectively assigned to each detection category target and each false positive category target, where the first layer of labels is used to identify the detection category of each target and the second layer of labels is used to identify the true category of each target. Labels may be used to represent the learning objectives of the target detection model. In some embodiments, the detection category of each target may be understood as the category of that target as predicted by the target detection model. For a detection category target, its first layer label is the detection category to which it belongs, and its true category is still that detection category, so its second layer label still represents that detection category, or may be expressed as true positive. For a false positive category target, its first layer label is the detection category into which the model tends to misclassify it, while its true category is not that misclassified detection category, so its second layer label is its true false positive category, or may be expressed as false positive.
In other embodiments, both layers of labels may use one-hot encoded labels (i.e., the data in the label includes only 0 and 1). For example, in one particular embodiment, for a three-class task that includes three detection categories, for a detection category target belonging to the first detection category, its first layer label may be denoted as [1, 0, 0], and its second layer label may be denoted as [1, 0, 0, 0, 0, 0]; for a false positive category target corresponding to the first detection category, its first layer label may be denoted as [1, 0, 0], i.e., the same as the first layer label of its corresponding true positive detection category target, and its second layer label may be denoted as [0, 1, 0, 0, 0, 0]. In this embodiment, each detection category in the first layer of labels is refined into true positive or false positive in the second layer of labels, and thus can be represented by two label positions. In still other embodiments, in the second layer of labels, label 1 may be split into [label 1 true positive, label 1 false positive].
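As an illustration of this label scheme, the sketch below builds the two layers of one-hot labels for a three-category task; the helper name and the convention of placing the true positive slot before the false positive slot within each category are assumptions for illustration only.

```python
# Sketch of the two-layer one-hot labels described above, for three detection categories.
import numpy as np

NUM_CLASSES = 3  # three detection categories

def build_two_layer_labels(detection_class: int, is_true_positive: bool):
    # First layer label: which detection category the target is (or is mistaken for).
    first_layer = np.zeros(NUM_CLASSES, dtype=np.float32)
    first_layer[detection_class] = 1.0
    # Second layer label: each detection category is split into
    # [true positive, false positive], giving 2 * NUM_CLASSES entries.
    second_layer = np.zeros(2 * NUM_CLASSES, dtype=np.float32)
    second_layer[2 * detection_class + (0 if is_true_positive else 1)] = 1.0
    return first_layer, second_layer

# Target of the first detection category:        [1,0,0] / [1,0,0,0,0,0]
# False positive resembling the first category:  [1,0,0] / [0,1,0,0,0,0]
print(build_two_layer_labels(0, True))
print(build_two_layer_labels(0, False))
```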
Flow may then proceed to step 230 where the target detection model may be trained using the sample training set with two layers of labels. The target detection model may include a backbone network and a detection frame classification branch connected to the backbone network, the detection frame classification branch may include at least one detection class output. The object detection model may employ, for example, the object detection model 100 shown in fig. 1, which will not be described in detail herein.
Further, in some embodiments, training method 200 may further comprise: based on the two-layer tags, a loss function of the object detection model may be calculated, wherein the loss function may include a first loss function for the first layer tag and a second loss function for the second layer tag. By setting the loss functions aiming at different layers of labels, the detection frame classification branch can be targeted for hierarchical training.
In still other embodiments, the first loss function may comprise a cross entropy loss function and the second loss function may comprise a contrast learning loss function. Since the contrast learning loss function is set for the model learning second layer label, the contrast learning loss function can realize the supervised contrast learning of the target detection model.
In some embodiments, the first loss function may be expressed in the form of a cross entropy loss function as shown in Equation 1 below:

$$L_1 = -\frac{1}{B}\sum_{b=1}^{B}\sum_{m=1}^{M_b}\sum_{n=1}^{N} y_{b,m,n}\,\log\left(p_{b,m,n}\right) \quad \text{(Equation 1)}$$

where $L_1$ represents the first loss function, $B$ represents the batch size, $M_b$ represents the number of detection frames detected in the b-th sample image of the batch and matched as foreground frames (i.e., the number of detection frames matched as foreground), $N$ represents the number of detection categories, $y_{b,m,n}$ represents the label value of the n-th detection category in the first layer label of the m-th detection frame in the b-th sample image, and $p_{b,m,n}$ represents the output probability of the m-th detection frame for the n-th detection category in the b-th sample image. $y_{b,m,n}=1$ when the n-th detection category is marked in the first layer label of the m-th detection frame in the b-th sample image, and $y_{b,m,n}=0$ otherwise. In some embodiments, the detected detection frames may be matched using, for example, an intersection-over-union (IoU) matching rule or the like to determine the number of detection frames that are matched as foreground frames.
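The sketch below shows one hedged way to compute this first-layer cross-entropy term over the foreground detection frames of a batch; the container layout and function name are illustrative assumptions rather than the patent's implementation.

```python
# Hedged sketch of the first-layer cross-entropy loss (Equation 1).
# probs[b]: per-category output probabilities of the foreground detection frames
# of the b-th sample image (shape M_b x N); first_labels[b]: matching one-hot
# first layer labels. Both are assumed inputs.
import numpy as np

def first_layer_loss(probs, first_labels, eps=1e-7):
    batch_size = len(probs)
    total = 0.0
    for p_b, y_b in zip(probs, first_labels):
        # sum over foreground detection frames m and detection categories n
        total += -np.sum(y_b * np.log(np.clip(p_b, eps, 1.0)))
    return total / batch_size
```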
In other embodiments, the second loss function may be expressed in the form of a contrast learning loss function as shown in Equations 2 and 3 below:

$$L_2 = \frac{1}{Q}\sum_{i=1}^{Q}\ell_i \quad \text{(Equation 2)}$$

$$\ell_i = \frac{-1}{\sum_{k=1}^{Q}\mathbb{1}\left[\tilde{y}_i=\tilde{y}_k\right]-1}\sum_{j=1}^{Q}\left(1-\mathbb{1}\left[i=j\right]\right)\mathbb{1}\left[\tilde{y}_i=\tilde{y}_j\right]\log\frac{\exp\left(z_i\cdot z_j/\tau\right)}{\sum_{c=1}^{Q}\mathbb{1}\left[i\ne c\right]\mathbb{1}\left[y_i=y_c\right]\exp\left(z_i\cdot z_c/\tau\right)} \quad \text{(Equation 3)}$$

where $L_2$ represents the second loss function, $Q$ represents the number of detection frames detected in all sample images of the batch and matched as foreground frames, $\ell_i$ represents the contrast learning loss of the i-th detection frame in the batch, $\tilde{y}_i$ represents the second layer label of the i-th detection frame, $\tilde{y}_k$ the second layer label of the k-th detection frame, $\tilde{y}_j$ the second layer label of the j-th detection frame, $y_c$ the first layer label of the c-th detection frame, $y_i$ the first layer label of the i-th detection frame, $\tau$ the temperature coefficient, $z_i$ the penultimate-layer feature vector of the i-th detection frame (i.e., the last feature layer before the layer operated on by the sigmoid), $z_j$ the penultimate-layer feature vector of the j-th detection frame, and $z_c$ the penultimate-layer feature vector of the c-th detection frame. In some embodiments, the temperature coefficient $\tau$ may be greater than 1.

Further, in the above formulas, $\mathbb{1}\left[\tilde{y}_i=\tilde{y}_k\right]$ takes the value 1 when the second layer label of the i-th detection frame is the same as the second layer label of the k-th detection frame, and 0 otherwise; $\mathbb{1}\left[i=j\right]$ takes the value 1 when $i=j$, and 0 otherwise; $\mathbb{1}\left[\tilde{y}_i=\tilde{y}_j\right]$ takes the value 1 when the second layer label of the i-th detection frame is the same as the second layer label of the j-th detection frame, and 0 otherwise; $\mathbb{1}\left[i\ne c\right]$ takes the value 1 when $i$ and $c$ are not the same, and 0 otherwise; and $\mathbb{1}\left[y_i=y_c\right]$ takes the value 1 when the first layer label of the i-th detection frame is the same as the first layer label of the c-th detection frame, and 0 otherwise.
It can be seen from Equation 3 that the contrast learning loss set according to the embodiment of the present application is computed only among the second layer categories (i.e., the true positive category and the false positive category corresponding to the second layer labels) of detection frames whose first layer labels are the same (i.e., whose model-predicted detection category is the same), and is not computed across second layer categories corresponding to different first layer labels. With this arrangement, the target detection model is made to learn the distinction between a false positive category and its corresponding detection category, rather than the distinction between that false positive category and other detection categories, which helps strengthen the contrast drawn by the detection frame classification branch between each false positive category target and its corresponding detection category target, and thereby helps enhance the recognition capability of the detection frame classification branch.
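A hedged sketch of this second-layer contrast learning term is given below: positives are detection frames sharing the same second layer label, while the denominator is restricted to frames sharing the same first layer label, so the contrast is only drawn between true positives and false positives of the same detection category. The array names, temperature value and use of NumPy are assumptions for illustration.

```python
# Hedged sketch of the second-layer contrast learning loss (Equations 2 and 3).
# features: Q x D penultimate-layer feature vectors of all foreground detection
# frames in the batch; first_labels / second_labels: length-Q integer class ids.
import numpy as np

def second_layer_loss(features, first_labels, second_labels, tau=2.0):
    q = features.shape[0]
    sims = features @ features.T / tau  # pairwise similarities z_i . z_c / tau
    total = 0.0
    for i in range(q):
        # denominator: frames c != i sharing the same FIRST layer label as i
        denom_mask = (first_labels == first_labels[i])
        denom_mask[i] = False
        denom = np.sum(np.exp(sims[i][denom_mask])) + 1e-12
        # positives: frames j != i sharing the same SECOND layer label as i
        pos_mask = (second_labels == second_labels[i])
        pos_mask[i] = False
        n_pos = np.sum(pos_mask)
        if n_pos == 0:
            continue
        total += -np.sum(np.log(np.exp(sims[i][pos_mask]) / denom)) / n_pos
    return total / max(q, 1)
```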
In other embodiments, the loss function may be a weighted sum of the first loss function and the second loss function. With such a setting, the ratio between the first loss function and the second loss function can be adjusted as needed. In still other embodiments, the loss function may be represented by Equation 4 below:

$$L = L_1 + \lambda L_2 \quad \text{(Equation 4)}$$

where $L$ represents the loss function, $L_1$ represents the first loss function, $L_2$ represents the second loss function, and $\lambda$ represents a hyperparameter used to adjust the ratio between the first loss function and the second loss function. In some embodiments, the value of $\lambda$ may be between 0 and 1.
As described above with reference to fig. 2, it may be understood that, by training the target detection model using the sample training set with the two-layer label, two-layer hierarchical categories are established, that is, the first layer category is the detection category predicted by the model, and the second layer category is the true positive category and the false positive category of the detection category to which the first layer category belongs, so that the target detection model can learn the distinction between the true positive category and the false positive category in the learning process of the second layer category.
Further, in some embodiments, training of the target detection model may be further supervised by setting a loss function for the two layers of labels, so that the feature distance between each false positive category target and its corresponding detection category target is increased, thereby improving the recognition capability of the trained target detection model. It should also be appreciated that the above description is exemplary rather than limiting; for example, in some embodiments, a pre-training stage may also be included before training the target detection model using the sample training set, as will be described in detail below in connection with fig. 3.
FIG. 3 is a flowchart illustrating a training method of an object detection model according to another embodiment of the present application. As will be appreciated from the following description, the training method 300 illustrated in FIG. 3 may be one embodied form of the training method 200 described above in connection with FIG. 2. Accordingly, the foregoing description of training method 200 may also be applied to the following description of training method 300.
As shown in fig. 3, training method 300 may include: in step 310, a pre-training set of sample image formations with sample annotations may be obtained, wherein the sample annotations may be used to identify detection class targets in the sample image. In some embodiments, the sample label may be shown in the form of a detection box (or bounding box). In other embodiments, the sample annotation may further comprise a category marking for detecting a category target in the sample image. In still other embodiments, the pre-training set may be selected from the data sets currently available for conventional training of the target detection model.
Next, in step 320, the target detection model may be pre-trained using the pre-training set. The pre-training set may include one or more sample images, some or all of which carry sample annotations. Because the pre-training set only includes the sample annotations, other targets in the sample images besides the annotated ones are hard to match as foreground and thus are not fed into the detection frame classification branch for learning. As a result, the pre-trained target detection model has difficulty distinguishing real detection category targets from false positive category targets, and a false positive problem may exist.
Flow may then proceed to step 330, where a sample training set containing two layers of labels may be generated based on the pre-training set. In some embodiments, targets similar to the detection category targets annotated in the pre-training set (i.e., false positive candidates) can be found automatically in the sample images according to those annotated detection category targets. In other embodiments, such similar targets may be searched for both in the sample images of the pre-training set and in other sample images outside the pre-training set, and the sample images of the pre-training set together with those other sample images may then form the sample training set.
In other embodiments, generating a sample training set comprising two layers of labels based on the pre-training set may comprise: and generating a sample training set containing two layers of labels according to the pre-trained target detection model and the pre-training set. As further shown in fig. 3, step 330 may include steps 331-333 in the illustration, wherein in step 331 (shown in dashed boxes), the pre-trained object detection model may be used to perform object detection on the sample images in the pre-training set to obtain a plurality of detection results. Sample images in the pre-training set may be input into the backbone network of the pre-trained target detection model to output detection results for each sample image in the branch structure of the target detection model. In some embodiments, the detection result may include a detection box and a detection category to which it belongs.
Next, in step 332 (shown in dashed box), the plurality of detection results may be compared with the sample annotations to determine, among the plurality of detection results, the false positive category targets and the detection categories to which they belong (i.e., the detection categories they are predicted to belong to). The detection results in each sample image may be compared with the sample annotations in that sample image. In some embodiments, the sample annotation may be regarded as the gold standard box of a detection category target, and comparing the plurality of detection results with the sample annotations may be understood as comparing, among the detection boxes predicted by the pre-trained target detection model whose probability values exceed a probability threshold, each detection box result with the gold standard boxes, so as to find detection box results whose intersection-over-union with every gold standard box is smaller than an IoU threshold (i.e., false positive detection boxes). A target in a false positive detection box (i.e., a false positive category target) is predicted to belong to a certain detection category, but actually belongs neither to that detection category nor to any other detection category. In a typical training method, since the target in the false positive detection box is not near any gold standard box, it cannot affect the learning of the detection frame classification branch. That is, the false positive category target has no annotation in the pre-training set and corresponds to the background, and during pre-training of the target detection model it is treated as background and does not enter the detection frame classification branch for learning.
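The following is a hedged sketch of this false-positive mining step: high-confidence detection boxes that overlap no gold standard box beyond an IoU threshold are recorded, together with their predicted detection category, as false positive category targets. The threshold values and function names are assumptions rather than values given in the patent.

```python
# Hedged sketch of mining false positive category targets by comparing
# detection results with gold standard boxes (thresholds are assumed).
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def mine_false_positives(detections, gold_boxes, prob_thresh=0.5, iou_thresh=0.5):
    # detections: list of (box, predicted_class, score); gold_boxes: list of (box, class)
    false_positives = []
    for box, cls, score in detections:
        if score < prob_thresh:
            continue  # only consider confident detections
        best_iou = max((iou(box, g) for g, _ in gold_boxes), default=0.0)
        if best_iou < iou_thresh:
            # predicted as class `cls` but matches no gold standard box:
            # record it as a false positive category target of that class
            false_positives.append((box, cls))
    return false_positives
```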
Flow may then proceed to step 333 (shown in dashed box), where a sample training set may be generated that includes two layers of labels based on the detection category and the true category to which each detection category target and each false positive category target respectively belong in the pre-training set. Specifically, for each detection class target, a label value for indicating the detection class to which it belongs may be given to its first layer label, and a label value for indicating its true positive class may be given to its second layer label. For each determined false positive class target, a label value may be assigned to its first layer label for representing its belonging detection class and a label value may be assigned to its second layer label for representing its false positive class.
In some application scenarios, the target detection model may be pre-trained using all sample images in the pre-training set in step 320, and the target detection model after the pre-training may be used to target detect all sample images in the pre-training set in step 331, to obtain detection results of all sample images in the pre-training set, in which case the sample training set generated in step 333 may include all sample images in the pre-training set.
In other applications, a portion of the sample images in the pre-training set may be used to pre-train the target detection model in step 320, and another portion of the sample images in the pre-training set may be used to target detect in step 331, to obtain detection results of the other portion of the sample images in the pre-training set, in which case the sample training set generated in step 333 may include the other portion of the sample images in the pre-training set.
In yet other application scenarios, all of the sample images in the pre-training set may be used to pre-train the target detection model in step 320, while a portion of the sample images in the pre-training set may be used to target detect in step 331 to obtain detection results for a portion of the sample images in the pre-training set, in which case the sample training set generated in step 333 may include a portion of the sample images in the pre-training set. According to the arrangement, the training data can be pretrained by using the big data sample, and only a small amount of training data comprising two layers of labels is needed to be generated for fine tuning training of the model, so that the generation difficulty of the sample training set is reduced.
It will be appreciated that steps 331 and 332 described above may be one of the embodiments of step 210 described above in connection with fig. 2, and step 333 may be one of the embodiments of step 220 described above in connection with fig. 2. As further shown in fig. 3, after the sample training set is generated, step 340 may continue to be performed where the target detection model may be trained using the sample training set with two layers of labels. Step 340 is described in detail above in connection with step 230 of fig. 2, and is not described here.
In other embodiments, training method 300 may further comprise, prior to performing step 340: the weight parameters in the network structure except the detection frame classification branch in the target detection model are frozen. The freezing the weight parameter may be fixing the weight parameter in other network structures to avoid the weight parameter of other network structures from being changed by subsequent operations. In some embodiments, other network structures may include a backbone network. In other embodiments, the target detection model may further include a foreground-background classification branch and a detection frame position regression branch respectively connected to the backbone network; freezing weight parameters in other network structures may include: the weight parameters in the backbone network, foreground-background classification branches and detection frame position regression branches are frozen.
According to the setting, when the sample training set is used for training the target detection model, weight parameters in other network structures except the detection frame classification branch, which are determined after pre-training, are not changed, and only the weight parameters in the detection frame classification branch are subjected to fine adjustment and updating, so that the speed and the efficiency of training the target detection model are improved.
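A minimal sketch of this freezing step is shown below, assuming the model exposes its backbone, foreground-background branch, position regression branch and classification branch as attributes (these attribute names are hypothetical); only the detection frame classification branch keeps trainable parameters.

```python
# Hedged sketch of freezing everything except the detection frame classification
# branch before fine-tuning with the two-layer labels (PyTorch-style attributes assumed).
def freeze_all_but_cls_branch(model):
    for module in (model.backbone, model.fg_bg_branch, model.box_branch):
        for p in module.parameters():
            p.requires_grad = False   # frozen: weights keep their pre-trained values
    for p in model.cls_branch.parameters():
        p.requires_grad = True        # only this branch is fine-tuned
```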
A training method according to another embodiment of the present application is described in detail above with reference to fig. 3, and for easier understanding of a method for generating a sample training set based on a pre-training set, an exemplary description will be given below with reference to fig. 4 and 5.
Fig. 4 is a schematic diagram illustrating a sample image in a pre-training set according to an embodiment of the present application. Fig. 5 is a schematic diagram illustrating sample images in a sample training set according to an embodiment of the present application. As shown in fig. 4, taking a first sample image 400 in the pre-training set as an example, it is assumed that letters a, b, c are three different detection class targets of interest, and each detection class target in the first sample image 400 carries a sample label thereon, for example, the sample label 401 on the detection class target a, the sample label 402 on the detection class target b, and the sample label 403 on the detection class target c, which are shown in the figure. In some embodiments, sample labels 401, 402, and 403 may each be represented in a different color. Also present in the first sample image 400 are non-detection targets O 404, g 405, q 406, and the like, which are not marked by any frame.
When performing object detection on the first sample image 400 using, for example, the object detection model described above in connection with fig. 1 or another commonly used detection network model, the anchors surrounding these non-detection targets 404, 405, and 406 are not near any gold standard box, so these non-detection targets are not matched to any gold standard box and thus cannot affect the loss calculation of the detection frame classification branch. However, in terms of appearance, these non-detection targets O, g, q are rather similar to the detection class targets a, b, c in the illustration (e.g., g is similar to a, O is similar to c, q is similar to b) and are easily detected by the target detection model as belonging to a certain detection class, so these easily misdetected non-detection targets are difficult false positive samples, which may be referred to herein as false positive class targets.
In other embodiments, a second sample image 500 as shown in FIG. 5 may be formed by labeling the false positive class targets O, g, q as shown in FIG. 4, with a detection box 501 (shown in dashed boxes) for labeling the false positive class target g, which is a false positive of the detection class target a; a detection box 502 (shown in dashed box) is used to label a false positive class target q, which is a false positive of the detection class target b; the detection box 503 (shown by a dashed box) is used to label the false positive class object O, which is a false positive of the detection class object c. In other embodiments, the detection boxes 501, 502, and 503 may be represented in different colors, respectively, and may be shown in a form different from the sample label, for example, in the form of a dashed box in the drawings, in a form different from the solid box of the sample label.
In some embodiments, a first sample image 400 in the pre-training set, such as shown in fig. 4, may be detected using a pre-trained object detection model to mine out the false positive class targets O, g, q, so that a second sample image 500 may be generated based on the first sample image 400, and two layers of labels may then be respectively assigned to the detection class targets and the mined false positive class targets in each generated second sample image 500, to form a sample training set containing the two layers of labels.
The training method of the embodiments of the present application is described above with reference to the several drawings, and in another aspect, the present application provides a method for performing object detection based on images, that is, an inference method or a prediction method of an object detection model. This will be described below with reference to fig. 6.
Fig. 6 is a flowchart illustrating a method for image-based object detection according to an embodiment of the present application. As shown in fig. 6, method 600 may include: in step 610, the image to be detected may be input into a target detection model trained according to the training method described above in connection with any of the embodiments of fig. 2-5. Next, in step 620, the image to be detected may be subject to target detection using the target detection model and a detection result may be output.
In some embodiments, the image to be detected may include a medical image, the medical image is input into a target detection model trained according to the training method of the embodiment of the present application, a backbone network in the target detection model may extract a lesion feature in the medical image, a foreground-background classification branch in the target detection model may distinguish a feature map including the lesion feature from a foreground and a background, a detection frame classification branch may identify and classify a detection frame of the lesion feature in the foreground to determine whether the lesion feature in the detection frame belongs to a detection category corresponding to the detection category output, and a detection frame position regression branch may be used to output position information of each detection frame.
In other embodiments, the image to be detected may include a text image, so that the target detection model is used to detect whether the text image contains detection category text or false positive category text of interest; the detection process of the target detection model is similar to that for medical images and is not repeated here. In still other embodiments, the image to be detected may also be one or more of a portrait image, a road monitoring image, a microscopic image, an item image, and the like, selected as desired, for example based on the content contained in the sample images of the sample training set used to train the target detection model.
Further, since the training method according to the embodiments of the application trains the target detection model with two layers of labels, the detection result output when performing target detection on the image to be detected using the target detection model may indicate, for each detection frame, whether the target therein is a true positive or a false positive of its detection category. For ease of understanding, an exemplary description is provided below in connection with FIG. 7.
FIG. 7 is a schematic block diagram illustrating a trained object detection model according to an embodiment of the present application. As shown in fig. 7, the object detection model 700 may include a backbone network 110 and a detection box classification branch 120, a foreground-background classification branch 130, and a detection box position regression branch 140 respectively connected to the backbone network 110, the detection box classification branch 120 may include at least one detection class output 121. For each detection class output 121, it may be used to output a true positive class 701 or a false positive class 702.
Taking the second sample image 500 shown in fig. 5 as an example, assuming that targets within each detection frame in the second sample image 500 have been given two-layer labels, and assuming that three detection class outputs included in the target detection model 700 in the drawing are used to predict three detection classes a, b, and c, respectively, for the trained target detection model 700, it may output, in the detection result of the detection frame classification branch 120, that targets in each detected detection frame in the image to be detected belong to any one of six second-layer classes [ a, non-a, b, non-b, c, non-c ].
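For illustration, the sketch below reads out such a six-way second-layer result for one detection frame; the class-name list and the arg-max readout are assumptions about one possible way to interpret the branch outputs, not the patent's prescribed post-processing.

```python
# Hedged sketch of interpreting the six second-layer scores of one detection frame.
CLASS_NAMES = ["a", "non-a", "b", "non-b", "c", "non-c"]

def interpret_box_scores(scores):
    # scores: iterable of six sigmoid outputs for one detection frame
    best = max(range(len(scores)), key=lambda k: scores[k])
    label = CLASS_NAMES[best]
    is_true_positive = not label.startswith("non-")
    return label, is_true_positive
```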
The foregoing aspects of the embodiments of the present application may be implemented by means of program instructions. Thus, the present application also provides an apparatus for target detection, comprising: a processor for executing program instructions; and a memory storing program instructions that, when loaded and executed by the processor, cause the processor to perform the training method of the object detection model described in any of the embodiments above or to perform the image-based object detection method described above in connection with fig. 6.
FIG. 8 is a schematic block diagram illustrating a system for target detection according to an embodiment of the present application. The system 800 may include a device 801 according to an embodiment of the present application, together with its peripheral devices and external networks, where the device 801 is used for training the target detection model or performing operations such as target detection on an image to be detected, thereby implementing the technical solutions of the embodiments of the present application described above in connection with FIGS. 1 to 7.
As shown in fig. 8, the device 801 may include a CPU 8011, which may be a general purpose CPU, a special purpose CPU, or other execution unit for information processing and program execution. Further, the device 801 may further include a mass memory 8012 and a read-only memory ROM 8013, wherein the mass memory 8012 may be configured to store various types of data including a pre-training set, a sample training set, weight parameters, detection results, and the like, and various programs required to run a neural network, and the ROM 8013 may be configured to store data required for power-on self-test of the device 801, initialization of various functional modules in the system, driving of basic input/output of the system, and booting of the operating system.
Further, the device 801 may also include other hardware platforms or components, such as the TPU 8014, GPU 8015, FPGA 8016, and MLU 8017 shown in the figure. It will be appreciated that although various hardware platforms or components are shown in the device 801, this is by way of example only and not limitation, and one of ordinary skill in the art may add or remove hardware as actually needed. For example, the device 801 may include only a CPU, or may combine the CPU with one or more of the other hardware platforms shown for training or detection tasks.
The device 801 of the present application further comprises a communication interface 8018, whereby it may be connected to a local area network/wireless local area network (LAN/WLAN) 805 via the communication interface 8018, and further to a local server 806 or to the Internet 807 via the LAN/WLAN. Alternatively or additionally, the device 801 of the present application may also be directly connected to the Internet or a cellular network via the communication interface 8018 based on wireless communication technology, such as third-generation ("3G"), fourth-generation ("4G"), or fifth-generation ("5G") wireless communication technology. In some application scenarios, the device 801 of the present application may also access a server 808 and possibly a database 809 of an external network as needed, so as to obtain various known neural network models, data, and modules, and may store various detection data remotely.
Peripheral devices of the device 801 may include a display device 802, an input device 803, and a data transfer interface 804. In one embodiment, the display device 802 may include, for example, one or more speakers and/or one or more visual displays configured to provide voice prompts and/or visual display of the operating procedures or detection results of the present application. The input device 803 may include, for example, a keyboard, a mouse, a microphone, a gesture-capturing camera, or other input buttons or controls configured to receive training data or user instructions. The data transfer interface 804 may include, for example, a serial interface, a parallel interface, a universal serial bus ("USB") interface, a small computer system interface ("SCSI"), Serial ATA, FireWire, PCI Express, or a high-definition multimedia interface ("HDMI"), configured for data transfer and interaction with other devices or systems. According to aspects of the present application, the data transfer interface 804 may receive pre-training data for the pre-training set or training data for the sample training set, and transmit various types of data and results to the device 801.
The above-described CPU 8011, mass memory 8012, read-only memory ROM 8013, TPU 8014, GPU 8015, FPGA 8016, MLU 8017, and communication interface 8018 of the device 801 of the present application may be connected to each other through a bus 8019, and data interaction with peripheral devices is achieved through the bus. In one embodiment, the CPU 8011 may control other hardware components in the device 801 and their peripherals via the bus 8019.
In operation, the processor CPU 8011 of the device 801 of the present application may receive training data or an image to be detected through the input device 803 or the data transfer interface 804, and retrieve computer program instructions or code (e.g., code related to a neural network) stored in the memory 8012 to train the target detection model on the received training data or to perform detection on the image to be detected, thereby obtaining the weight parameters of the trained target detection model or the detection results. After the CPU 8011 determines the detection result by executing the program instructions, the detection result may be displayed on the display device 802 or output by means of a voice prompt. In addition, the device 801 may upload the detection results to a network, such as the remote database 809, via the communication interface 8018.
It should also be appreciated that any module, unit, component, server, computer, terminal, or device executing instructions illustrated herein may include or otherwise have access to a computer-readable medium, such as a storage medium, computer storage medium, or data storage device (removable and/or non-removable), for example a magnetic disk, optical disk, or tape. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data.
Based on the foregoing, the present application also provides a computer-readable storage medium having stored thereon computer-readable instructions that, when executed by one or more processors, implement the training method of the target detection model as described above in connection with any of the embodiments of FIGS. 2-5, or the image-based target detection method as described above in connection with FIG. 6.
The computer-readable storage medium may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), or any other medium that may be used to store the desired information and that may be accessed by an application, a module, or both. Any such computer storage media may be part of, or accessible by, or connectable to, the device. Any of the applications or modules described herein may be implemented using computer-readable/executable instructions that may be stored or otherwise maintained by such computer-readable media.
Through the above description of the training method and embodiments of the target detection model of the present application, those skilled in the art will understand that the training method of the present application trains the target detection model with a sample training set carrying two layers of labels, so that the target detection model can learn the distinction between detection category targets and their corresponding false positive category targets without changing the structure of the target detection model. This helps improve the ability of the target detection model to recognize false positive category targets and improves the detection accuracy of the trained target detection model for detection category targets.
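As a purely illustrative aid (the field names and coordinates below are assumptions introduced for this sketch, not part of the disclosed embodiments), the two-layer labels can be pictured as a pair of annotations per target: a first-layer detection category plus a second-layer flag indicating whether the target is a true positive of that category.

```python
# Hypothetical two-layer label records for targets in one sample image.
# "detection_class" is the first-layer label; "is_true_positive" encodes the
# second-layer label (true positive vs false positive of that class).
sample_labels = [
    {"box": [34, 50, 120, 180], "detection_class": "a", "is_true_positive": True},
    {"box": [200, 40, 260, 110], "detection_class": "a", "is_true_positive": False},  # "non-a"
    {"box": [15, 220, 90, 300], "detection_class": "b", "is_true_positive": True},
]

def second_layer_name(record):
    cls = record["detection_class"]
    return cls if record["is_true_positive"] else f"non-{cls}"

print([second_layer_name(r) for r in sample_labels])  # ['a', 'non-a', 'b']
```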
In some embodiments, training supervision is performed on the second layer of labels by using a contrast learning loss function, so that the feature distance between each false positive category target and its corresponding detection category target is increased, which strengthens the discrimination capability of the detection frame classification branch of the target detection model.
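The following minimal sketch shows one supervised-contrastive formulation consistent with this idea, assuming that positives for a box are boxes sharing its second-layer label while the comparison pool is restricted to boxes sharing its first-layer (detection category) label; the function name, tensor layout, and temperature value are assumptions for illustration, not the claimed formula.

```python
# A hypothetical supervised-contrastive loss over per-box features, where
# positives share the same second-layer label and the comparison pool is
# limited to boxes sharing the same first-layer (detection category) label.
import torch
import torch.nn.functional as F

def contrastive_second_layer_loss(features, first_layer, second_layer, tau=0.07):
    # features:     (Q, D) penultimate-layer feature vectors of foreground boxes
    # first_layer:  (Q,) detection category index of each box
    # second_layer: (Q,) true/false-positive sub-category index of each box
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / tau                               # pairwise similarities
    q = features.shape[0]
    eye = torch.eye(q, dtype=torch.bool, device=features.device)

    # Denominator pool: other boxes of the same detection category, so a true
    # positive is contrasted against its false-positive counterparts.
    pool = (first_layer.unsqueeze(0) == first_layer.unsqueeze(1)) & ~eye
    # Positives: other boxes with the same second-layer label.
    pos = (second_layer.unsqueeze(0) == second_layer.unsqueeze(1)) & ~eye

    losses = []
    for i in range(q):
        if pos[i].sum() == 0:
            continue                                    # no positive pair for this box
        log_denom = torch.logsumexp(sim[i][pool[i]], dim=0)
        losses.append(-(sim[i][pos[i]] - log_denom).mean())
    return torch.stack(losses).mean() if losses else features.new_zeros(())

# Toy usage: categories a/b with true and false positives mixed in one batch.
feats = torch.randn(6, 128)
first = torch.tensor([0, 0, 0, 1, 1, 1])    # detection categories: a, a, a, b, b, b
second = torch.tensor([0, 0, 1, 2, 3, 3])   # second layer: a, a, non-a, b, non-b, non-b
print(contrastive_second_layer_loss(feats, first, second))
```

Restricting the denominator to the same detection category is one way to concentrate the contrastive pressure on separating true positives from their false-positive counterparts rather than from unrelated categories.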
Further, in other embodiments, the pre-training set with sample labels is detected by using the pre-trained target detection model, and the detection results are compared with the sample labels, so that hard false positive samples that the target detection model finds difficult to identify can be mined. This better guides the training of the target detection model and improves the efficiency and reliability of obtaining the sample training set. The method for mining hard false positive samples makes full use of the readily available pre-training set with sample labels, reduces the difficulty of generating the sample training set, and also reduces the difficulty and burden of the foreground-background classification branch in foreground-background classification.
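A minimal sketch of this mining step is given below, assuming (as an illustration only) that a detected box counts as a hard false positive when it overlaps no ground-truth box of the same category above an IoU threshold; the helper names, dictionary fields, and the 0.5 threshold are assumptions rather than part of the disclosure.

```python
# Hypothetical sketch of mining hard false positives from a pre-training set:
# boxes predicted by a pre-trained detector that match no ground-truth box of
# the same class become "non-<class>" false-positive category targets.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mine_false_positives(detections, ground_truth, iou_thresh=0.5):
    """detections / ground_truth: lists of {"box": [...], "cls": str} dicts."""
    mined = []
    for det in detections:
        same_cls = [g for g in ground_truth if g["cls"] == det["cls"]]
        if all(iou(det["box"], g["box"]) < iou_thresh for g in same_cls):
            mined.append({"box": det["box"],
                          "first_layer": det["cls"],            # detection category
                          "second_layer": f"non-{det['cls']}"}) # false positive
    return mined

# Toy usage: one ground-truth box of class "a" and two detections of class "a".
gt = [{"box": [10, 10, 60, 60], "cls": "a"}]
dets = [{"box": [12, 11, 58, 62], "cls": "a"},    # overlaps ground truth -> not mined
        {"box": [150, 30, 210, 90], "cls": "a"}]  # no overlap -> mined as "non-a"
print(mine_false_positives(dets, gt))
```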
While various embodiments of the present application have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present application. It should be understood that various alternatives to the embodiments of the present application described herein may be employed in practicing the application. The appended claims are intended to define the scope of the application and are therefore to cover all equivalents and alternatives falling within the scope of these claims.

Claims (14)

1. A training method of a target detection model, characterized in that the target detection model comprises a backbone network, a detection frame classification branch, a foreground-background classification branch, and a detection frame position regression branch, wherein the detection frame classification branch is connected to the backbone network and comprises at least one detection category output; the training method comprises the following steps:
acquiring a sample training set formed by sample images containing detection category targets and/or false positive category targets;
assigning two layers of labels to each detection category target and each false positive category target respectively, wherein a first layer of label in the two layers of labels is used to identify the detection category of each target, and a second layer of label in the two layers of labels is used to identify the real category of each target, wherein the detection category of each target is a category predicted by the target detection model, and the real category of each target is the true positive category or the false positive category of the detection category to which the target belongs;
training the target detection model by using the sample training set with the two layers of labels; and
calculating a loss function of the target detection model based on the two layers of labels, wherein the loss function comprises a first loss function for the first layer of labels and a second loss function for the second layer of labels.
2. The training method of claim 1, wherein prior to training the target detection model using the sample training set, the training method further comprises:
obtaining a pre-training set formed by a sample image with a sample label, wherein the sample label is used for identifying a detection category target in the sample image;
and pre-training the target detection model by using the pre-training set.
3. The training method of claim 2, further comprising:
based on the pre-training set, a sample training set comprising two layers of labels is generated.
4. The training method of claim 3, wherein generating a sample training set comprising two layers of labels comprises:
performing target detection on the sample images in the pre-training set by using the pre-trained target detection model so as to obtain a plurality of detection results;
comparing the detection results with the sample labels, so as to determine, from the detection results, false positive category targets corresponding to the background in the pre-training set and the detection categories to which the false positive category targets belong; and
generating the sample training set containing two layers of labels according to the detection category and the real category of each detection category target and each false positive category target in the pre-training set.
5. The training method of any one of claims 1 to 4, wherein,
the first loss function comprises a cross entropy loss function and the second loss function comprises a contrast learning loss function.
6. The training method of claim 5, wherein the first loss function is expressed as:

$$\mathcal{L}_{1} = -\frac{1}{B}\sum_{b=1}^{B}\frac{1}{M_{b}}\sum_{m=1}^{M_{b}}\sum_{n=1}^{N} y_{b,m}^{n}\,\log\!\left(p_{b,m}^{n}\right)$$

wherein $\mathcal{L}_{1}$ represents the first loss function, $B$ represents the batch size, $M_{b}$ represents the number of detection frames detected in the b-th sample image in the batch and matched as foreground frames, $N$ represents the number of detection categories, $y_{b,m}^{n}$ represents the label value of the n-th detection category in the first layer label of the m-th detection frame in the b-th sample image, and $p_{b,m}^{n}$ represents the output probability of the m-th detection frame in the n-th detection category in the b-th sample image.
7. The training method of claim 5, wherein the second loss function is expressed as:

$$\mathcal{L}_{2} = \frac{1}{Q}\sum_{i=1}^{Q} l_{i}$$

wherein

$$l_{i} = \frac{-1}{\bigl|\{k \mid y_{k} = y_{i},\, k \neq i\}\bigr|}\sum_{\substack{j \neq i \\ y_{j} = y_{i}}} \log \frac{\exp\!\left(z_{i}\cdot z_{j}/\tau\right)}{\sum_{\substack{c \neq i \\ \hat{y}_{c} = \hat{y}_{i}}} \exp\!\left(z_{i}\cdot z_{c}/\tau\right)}$$

wherein $\mathcal{L}_{2}$ represents the second loss function, $Q$ represents the number of detection frames detected in all sample images in the batch and matched as foreground frames, $l_{i}$ represents the contrast learning loss of the i-th detection frame in the batch, $y_{i}$ represents the second layer label of the i-th detection frame, $y_{k}$ represents the second layer label of the k-th detection frame, $y_{j}$ represents the second layer label of the j-th detection frame, $\hat{y}_{c}$ represents the first layer label of the c-th detection frame, $\hat{y}_{i}$ represents the first layer label of the i-th detection frame, $\tau$ represents the temperature coefficient, $z_{i}$ represents the penultimate layer feature vector of the i-th detection frame, $z_{j}$ represents the penultimate layer feature vector of the j-th detection frame, and $z_{c}$ represents the penultimate layer feature vector of the c-th detection frame.
8. The training method of claim 1, wherein,
the loss function is a weighted sum of the first loss function and the second loss function.
9. The training method of any of claims 1-4, wherein prior to training the target detection model using the sample training set, the training method further comprises:
freezing weight parameters in network structures of the target detection model other than the detection frame classification branch.
10. The training method of claim 9, wherein freezing the weight parameters comprises:
freezing the weight parameters in the backbone network, the foreground-background classification branch, and the detection frame position regression branch.
11. The training method of claim 1, wherein the sample image comprises a medical sample image.
12. A method for image-based target detection, comprising:
inputting an image to be detected into a target detection model trained by the training method according to any one of claims 1 to 11; and
performing target detection on the image to be detected by using the target detection model and outputting a detection result.
13. An apparatus for target detection, comprising:
a processor for executing program instructions; and
a memory storing the program instructions that, when loaded and executed by the processor, cause the processor to perform the training method of the target detection model according to any one of claims 1-11 or the image-based target detection method according to claim 12.
14. A computer-readable storage medium having stored thereon computer-readable instructions that, when executed by one or more processors, implement the training method of the target detection model according to any one of claims 1-11 or the image-based target detection method according to claim 12.
CN202211310926.1A 2022-10-25 2022-10-25 Training method of target detection model, target detection method and related products Active CN115439699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211310926.1A CN115439699B (en) 2022-10-25 2022-10-25 Training method of target detection model, target detection method and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211310926.1A CN115439699B (en) 2022-10-25 2022-10-25 Training method of target detection model, target detection method and related products

Publications (2)

Publication Number Publication Date
CN115439699A CN115439699A (en) 2022-12-06
CN115439699B (en) 2023-06-30

Family

ID=84252480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211310926.1A Active CN115439699B (en) 2022-10-25 2022-10-25 Training method of target detection model, target detection method and related products

Country Status (1)

Country Link
CN (1) CN115439699B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830387A (en) * 2022-12-14 2023-03-21 富联裕展科技(深圳)有限公司 Model training method, training device, classification method and classification device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414631B (en) * 2019-01-29 2022-02-01 腾讯科技(深圳)有限公司 Medical image-based focus detection method, model training method and device
CN114424253A (en) * 2019-11-08 2022-04-29 深圳市欢太科技有限公司 Model training method and device, storage medium and electronic equipment
CN112132815B (en) * 2020-09-25 2023-11-17 广州视源电子科技股份有限公司 Pulmonary nodule detection model training method, detection method and device
AU2021100097A4 (en) * 2021-01-08 2021-04-01 ., Surender Bird recognition method based on deep learning
CN112785565B (en) * 2021-01-15 2024-01-05 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113159147B (en) * 2021-04-08 2023-09-26 平安科技(深圳)有限公司 Image recognition method and device based on neural network and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830387A (en) * 2022-12-14 2023-03-21 富联裕展科技(深圳)有限公司 Model training method, training device, classification method and classification device

Also Published As

Publication number Publication date
CN115439699A (en) 2022-12-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant