CN115713111A - Method for training object detection model and object detection method - Google Patents

Method for training object detection model and object detection method

Info

Publication number
CN115713111A
Authority
CN
China
Prior art keywords
instance
class
source domain
target domain
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110949753.7A
Other languages
Chinese (zh)
Inventor
钟朝亮
汪洁
冯成
张颖
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN202110949753.7A priority Critical patent/CN115713111A/en
Priority to JP2022111473A priority patent/JP2023029236A/en
Publication of CN115713111A publication Critical patent/CN115713111A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method for training an object detection model and an object detection method. According to one embodiment of the present disclosure, a method for training a model includes training an object detection model in an iterative manner, wherein the current training iteration round includes the following operations: reading a source domain data subset and a target domain data subset; determining a detection loss for the source domain data subset and a source domain instance classification feature set; determining a target domain instance classification feature set; determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on a total loss associated with the detection loss and the instance alignment loss. The beneficial effects of the aspects of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.

Description

Method for training object detection model and object detection method
Technical Field
The present disclosure relates generally to image processing, and more particularly, to a method for training an object detection model and an object detection method.
Background
In recent years, with the development of neural network technology, image processing models based on neural networks have been applied in various fields, such as face recognition, object classification, object detection, automatic driving, and behavior recognition.
Generally, an object detection model based on a neural network is trained by using a large number of labeled sample images before object detection is performed, so as to optimize the object detection model to ensure that the model has satisfactory detection performance. After the training is completed, the image to be detected may be input to the object detection model, and after various processing (e.g., feature extraction) of the image to be detected by the object detection model, the object detection model may output the position and type of each object instance included in the image to be detected.
Disclosure of Invention
A brief summary of the disclosure is provided below in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to an aspect of the present disclosure, there is provided a computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network. During training, the current training iteration round includes the following operations: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
According to another aspect of the present disclosure, an object detection method is provided. The method comprises the following steps: training an object detection model by using the model training method; and determining the position and the type of the object in the image to be detected by using the trained object detection model.
According to one aspect of the present disclosure, an apparatus for training a subject detection model is provided. The device comprises: a memory having instructions stored thereon; and one or more processors in communication with the memory to execute the instructions retrieved from the memory, and the instructions cause the one or more processors to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
The beneficial effects of the aspects of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.
Further areas of applicability will become apparent from the description provided herein. The foregoing description is intended for the purpose of illustration only and is not intended to limit the scope of the present disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will be more readily understood from the following description of embodiments thereof with reference to the accompanying drawings. The drawings are only for the purpose of illustrating the principles of the disclosure. The dimensions and relative positioning of the elements in the figures are not necessarily drawn to scale. Like reference numerals may denote like features. In the drawings:
FIG. 1 shows a flow diagram of operations involved in one training iteration round in a method for training an object detection model according to one embodiment of the present disclosure;
FIG. 2 illustrates an exemplary flow diagram of a method for training an object detection model according to one embodiment of the present disclosure;
FIG. 3 illustrates an exemplary flow diagram of a method for determining an instance-level alignment loss according to one embodiment of the present disclosure;
FIG. 4 shows schematic distributions of instance points at different processing stages in a feature space according to an embodiment of the present disclosure;
FIG. 5 illustrates an exemplary flow diagram of an object detection method according to one embodiment of the present disclosure;
FIG. 6 shows a block diagram of the structure of an apparatus for training an object detection model according to one embodiment of the present disclosure;
FIG. 7 shows a block diagram of the structure of an apparatus for training an object detection model according to one embodiment of the present disclosure; and
fig. 8 shows an exemplary block diagram of an information processing apparatus according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.
It is to be understood that the disclosure is not limited to the embodiments described below with reference to the drawings. In this context, where feasible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted in an embodiment.
Computer program code for carrying out operations for aspects of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The method of the present disclosure may be implemented by a circuit having a corresponding functional configuration. The circuitry includes circuitry for a processor.
One aspect of the present disclosure provides a computer-implemented method for training an object detection model M. The object detection model M is based on a neural network and is trained in an iterative manner. In each training iteration round, a batch of training sample images and their label data is input. The operations involved in an exemplary training iteration round will be described with reference to FIG. 1.
FIG. 1 shows an exemplary flowchart of operations involved in one training iteration round Iter[j] in a method for training an object detection model (referred to as the "model training method") according to an embodiment of the disclosure, where j denotes the index of the training iteration round. For ease of discussion, the j-th training iteration round may also be referred to as the "current training iteration round".
In step S101, a source domain data subset $B_s = \{(x_i^s, Y_i^s)\}_{i=1}^{n_s}$ corresponding to at least one fully labeled source domain image and a target domain data subset $B_t = \{(x_i^t, Y_i^t)\}_{i=1}^{n_t}$ corresponding to at least one loosely labeled target domain image are read for the current training iteration round from the source domain data set $D_s = \{(x_i^s, Y_i^s)\}_{i=1}^{N_s}$ with the larger number of labels and the target domain data set $D_t = \{(x_i^t, Y_i^t)\}_{i=1}^{N_t}$ with the smaller number of labels, respectively. Here $n_s$ denotes the number of source domain images read in and $n_t$ denotes the number of target domain images read in; $x_i^s$ denotes a sample image of the source domain and $x_i^t$ denotes a sample image of the target domain; $Y_i^s$ denotes the annotation information of the bounding boxes of the labeled objects in $x_i^s$, and $Y_i^t$ denotes the annotation information of the bounding boxes of the labeled objects in $x_i^t$. Likewise, $(x^s, Y^s)$ and $(x^t, Y^t)$ denote one input sample image and its bounding-box annotation information from the source domain and the target domain, respectively.

The annotation information of a bounding box includes the location of the bounding box and the instance type (sometimes also referred to as the "category") of the object instance of the type of interest in the image. Loose labeling and full labeling are two opposing concepts herein. For the same image containing multiple object instances (e.g., 10 object instances), an annotated image in which fewer instances (e.g., 4 instances) are labeled may be referred to as a loosely annotated image, as opposed to a fully annotated image in which more instances (e.g., all or most of the instances, e.g., 8 instances) are labeled. A loosely labeled image may be an image in which only a few of the instances are labeled. A more specific example: in a fully labeled image, almost all instances of the types of interest are labeled, while in a loosely labeled image only a few of all the instances of the types of interest are labeled. That is, some foreground regions that are actually instances of the types of interest are missing labels in a loosely labeled image, so that these missed instances are treated as background, or are even labeled as instances of the background class. $N_s$ is the number of training images contained in the entire source domain data set, and $N_t$ is the number of training images contained in the entire target domain data set. At each training iteration round, for example, a pair of training images including one source domain image and one target domain image may be input. $N_t \ll N_s$, i.e., the number of source domain images is significantly larger than the number of target domain images, e.g., $N_s/N_t \geq 10$, $\geq 100$, or even $\geq 1000$. At each training iteration round, the total number of labels of the source domain images is greater than the total number of labels of the target domain images. Each training iteration round may reuse images used by a previous training iteration round.
It should be noted that, for a training sample image, if an instance (i.e., foreground) of a type of interest in the image is not labeled and the set of object categories used by the object detection model includes a background class, the unlabeled instance may be treated as the background class. This causes label noise. Loosely labeled target domain images may therefore introduce label noise. In addition, for both fully labeled source domain images and loosely labeled target domain images, a background-class instance whose bounding box has an excessively large intersection over union (IoU) with a foreground instance may contain part of that foreground instance, which also causes label noise. Label noise may cause sample points (instance classification features) to be misaligned, negatively affecting the performance of the object detection model.
In step S103, the at least one fully labeled source domain image $x_i^s$, $(x_i^s, Y_i^s) \in B_s$, is processed by the object detection model M to determine a detection loss $L_{det}$ for the source domain data subset $B_s$ and a source domain instance classification feature set $O_s$ for the at least one fully labeled source domain image. The detection loss $L_{det}$ for the source domain data subset $B_s$ indicates the statistical accuracy, with respect to the annotation information, of the detection results output by the object detection model M when performing object detection on the at least one fully labeled source domain image; it is composed of a classification loss and a regression loss (i.e., a localization loss) of the bounding boxes. The source domain instance classification feature set $O_s$ is composed of the features for classification, given by the object detection model M, of all the source domain images $x_i^s$ read in by the current training iteration round.
In step S105, the at least one loosely labeled target domain image $x_i^t$, $(x_i^t, Y_i^t) \in B_t$, is processed by the object detection model M to determine a target domain instance classification feature set $O_t$ for the at least one loosely labeled target domain image. The target domain instance classification feature set $O_t$ is composed of the features for classification, given by the object detection model M, of all the target domain images $x_i^t$ read in by the current training iteration round.
In step S107, an instance-level alignment loss $L_{ins}$ related to instance feature alignment is determined based on the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$.
In step S109, the object detection model is optimized by adjusting the parameters of the object detection model M based on the total loss $L_{total}$ associated with the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$. The total loss $L_{total}$ is, for example, a linear combination of the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$.
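To make the flow of one training iteration round concrete, the following hedged Python sketch strings steps S103–S109 together; model, its forward_source/forward_target methods, instance_alignment_loss, and lambda_ins are illustrative assumptions of this sketch, not names from the present disclosure.

```python
def train_one_round(model, optimizer, src_batch, tgt_batch,
                    instance_alignment_loss, lambda_ins=1.0):
    """One training iteration round over the batches read in step S101."""
    # S103: source images -> detection loss L_det and instance features O_s.
    l_det, feats_s, labels_s = model.forward_source(src_batch)
    # S105: target images -> instance features O_t only (no target detection
    # loss, since loose labels carry too much label noise).
    feats_t, labels_t = model.forward_target(tgt_batch)
    # S107: instance-level alignment loss L_ins from O_s and O_t.
    l_ins = instance_alignment_loss(feats_s, labels_s, feats_t, labels_t)
    # S109: optimize on the total loss (here a simple linear combination).
    total = l_det + lambda_ins * l_ins
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total.detach())
```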
The model training method of the present disclosure may include a determination of whether to end training. A computer-implemented method for training an object detection model of the present disclosure is described below with reference to fig. 2, in which steps for determining a training end condition are shown.
FIG. 2 illustrates an exemplary flow diagram of a method 200 for training an object detection model M according to one embodiment of the present disclosure. Method 200 is a computer-implemented method for training an object detection model that includes training the object detection model M in an iterative manner. The method 200 comprises steps S101, S103, S105, and S107 of the training iteration round Iter[j] described in FIG. 1.
In step S209-1, it is determined whether a predetermined training end condition is satisfied. In the case that the determination result is yes, ending the training; in the case where the determination result is "no", step S209-2 is executed. The predetermined training end condition may be one of the following conditions: the total loss is less than a predetermined threshold; the total loss has converged. The total loss has converged, for example, meaning that the total loss for the current training iteration round varies by less than a predetermined threshold from the total loss for the previous training iteration round.
In step S209-2, the object detection model M is optimized by adjusting parameters of the object detection model M based on the total loss. Then, the process returns to step S101 to enter the next training iteration round.
Step S109 in fig. 1 may be subdivided into step S209-1 and step S209-2 in fig. 2.
As another alternative implementation of step S109, the following sub-steps may be included: optimizing the object detection model M by adjusting parameters of the object detection model M based on the total loss; and determining whether the number of training iteration rounds has reached a predetermined count. If the determination result is "yes", the training ends; if the determination result is "no", the process returns to step S101 to enter the next training iteration round.
The model training method of the present disclosure utilizes a large amount of source domain labeled data and a small amount of target domain labeled data for training. The use of a small number of loosely labeled images of the target domain can reduce the labeling cost of the training data and shorten the training time.
In one embodiment, the object detection model M is configured to perform object detection on the at least one fully labeled source domain image $x_i^s$ and the at least one loosely labeled target domain image $x_i^t$ based on the same set of object classes Sc. That is, the object class candidate set of the source domain images and the object class candidate set of the target domain images are the same. The set of object classes includes the object types of interest (foreground), for example: cars, buses, motorcycles, bicycles, pedestrians, etc. Further, the set of object classes Sc includes a background class. Typically, the regions outside the annotated regions in the images of the source and target domains both default to background. Several regions randomly selected in the background of the images of the source and target domains can serve as background-class instance regions.
In one embodiment, the object detection model M includes a feature extractor F and an R-network based on the Faster R-CNN (Faster Regions with CNN features) framework. The R-network is configured to determine region-of-interest features of the input image. The R-network is further configured to determine a bounding box with a classification label for each region of interest (ROI) of the input image. The R-network may, for example, comprise a Region Proposal Network (RPN). The feature extractor F performs convolution processing on the input image and outputs a feature map of the image. The region proposal network RPN can output a region-of-interest feature corresponding to each region of interest based on the output (feature map) of the feature extractor F. Each region-of-interest feature characterizes the position of an object instance detected by the model. The localization loss can be determined using the region-of-interest features with reference to the actual location information of the object instances in the annotation information. Regarding Faster R-CNN, reference may be made to the following document:
Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
Further, the R-network may include an additional classification feature extraction layer FC. The additional classification feature extraction layer FC follows and interfaces with the RPN to extract instance classification features for classification from the region-of-interest features determined by the RPN. Each instance classification feature may characterize the classification of an object instance of interest detected by the model in an image. The classification loss is determined using the instance classification features with reference to the labeled classification information of the object instances in the annotation information. Considering that different types of object instances may occur at the same position of an image, it is preferable to provide the additional classification feature extraction layer FC to extract instance classification features for classification, instead of directly determining the type of an object instance using the region-of-interest feature, which is advantageous for improving the performance of the object detection model.
In one embodiment, the R-network of the object detection model M incorporates the SWDA (strong-weak distribution alignment) technique. For SWDA, reference can be made to the following document:
Saito K, Ushiku Y, Harada T, et al. Strong-weak distribution alignment for adaptive object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6956-6965.
In this embodiment, the R-network integrates weak global alignment and strong local alignment. SWDA is an unsupervised domain adaptation (UDA) framework for object detection based on Faster R-CNN. For this purpose, the R-network also comprises a local discriminator $D_l$ and a global discriminator $D_g$. The feature extractor F can be decomposed as $F = F_2 \circ F_1$, where $F_1$ is the extractor related to local features and $F_2$ is the extractor related to global features. Since the images in the target domain are scarce and loosely labeled, minimizing a detection loss on the target domain would not only cause overfitting but could also cause training to fail, because the loosely labeled images of the target domain contain a lot of label noise. Therefore, the training method of the present disclosure calculates the source domain detection loss $L_{det}$ and does not calculate a detection loss for the target domain. For the current training iteration round, the source domain detection loss $L_{det}$ can be calculated as shown in equation (1):

$$L_{det}(F, R) = \frac{1}{n_s} \sum_{i=1}^{n_s} L\big(R(F(x_i^s)),\, Y_i^s\big) \tag{1}$$

where L represents the object detection loss, which consists of the classification loss and the regression loss (i.e., localization loss) of the bounding boxes.
Although instance-level alignment can effectively improve the performance of the object detection model, instance-level alignment alone may not guarantee the performance of the object detection domain-adaptive model. Thus, in this embodiment, the model training method integrates the weak global alignment and the strong local alignment of SWDA. First, image-level features are learned with weak global alignment. For the current training iteration round, the weak global alignment loss $L_{global}$ of the global discriminator $D_g$ is as shown in equation (4):

$$L_{global_s} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \big(1 - D_g(F(x_i^s))\big)^{\gamma} \log D_g(F(x_i^s)) \tag{2}$$

$$L_{global_t} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \big(D_g(F(x_i^t))\big)^{\gamma} \log\big(1 - D_g(F(x_i^t))\big) \tag{3}$$

$$L_{global}(F, D_g) = \frac{1}{2}\big(L_{global_s} + L_{global_t}\big) \tag{4}$$

where $\gamma$ controls the weight given to samples that are harder to classify.
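A PyTorch sketch of this focal-style weak global alignment loss, under the reconstruction of equations (2)–(4) above; $D_g$ is assumed to output the probability that a global image feature comes from the source domain.

```python
import torch

def weak_global_alignment_loss(dg_src, dg_tgt, gamma=5.0, eps=1e-7):
    """dg_src / dg_tgt: D_g outputs in (0, 1) for source / target images."""
    dg_src = dg_src.clamp(eps, 1 - eps)
    dg_tgt = dg_tgt.clamp(eps, 1 - eps)
    # Focal-style modulation: gamma up-weights hard-to-classify samples.
    loss_s = -(((1 - dg_src) ** gamma) * torch.log(dg_src)).mean()
    loss_t = -((dg_tgt ** gamma) * torch.log(1 - dg_tgt)).mean()
    return 0.5 * (loss_s + loss_t)
```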
Second, strong local alignment is used to learn local-level features, such as texture or color. For the current training iteration round, the strong local alignment loss $L_{loc}$ of the local discriminator $D_l$ is as shown in equation (7):

$$L_{loc_s} = \frac{1}{n_s W H} \sum_{i=1}^{n_s} \sum_{w=1}^{W} \sum_{h=1}^{H} \big(D_l(F_1(x_i^s))_{wh}\big)^2 \tag{5}$$

$$L_{loc_t} = \frac{1}{n_t W H} \sum_{i=1}^{n_t} \sum_{w=1}^{W} \sum_{h=1}^{H} \big(1 - D_l(F_1(x_i^t))_{wh}\big)^2 \tag{6}$$

$$L_{loc}(F_1, D_l) = \frac{1}{2}\big(L_{loc_s} + L_{loc_t}\big) \tag{7}$$

where W and H denote the width and height of the features extracted by the feature extractor $F_1$. The adversarial loss $L_{adv}$ implementing the global and local alignment is as shown in equation (8):

$$L_{adv}(F, D) = L_{loc}(F_1, D_l) + L_{global}(F, D_g) \tag{8}$$

That is, the adversarial loss $L_{adv}$ includes the weak global alignment loss $L_{global}$ determined by the global discriminator $D_g$ based on image-level features and the strong local alignment loss $L_{loc}$ determined by the local discriminator $D_l$ based on local-level features.
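A matching sketch of the strong local alignment loss of equations (5)–(7), assuming $D_l$ outputs a per-location map in (0, 1) over the W×H grid of $F_1$'s feature map:

```python
import torch

def strong_local_alignment_loss(dl_src, dl_tgt):
    """dl_src / dl_tgt: D_l output maps of shape (n, 1, H, W) in (0, 1)."""
    loss_s = (dl_src ** 2).mean()            # push source locations toward 0
    loss_t = ((1 - dl_tgt) ** 2).mean()      # push target locations toward 1
    return 0.5 * (loss_s + loss_t)
```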
The instance-level alignment involved in the model training method of the present disclosure is further described below.
In some embodiments of the present disclosure, the model training method performs instance-level alignment on the features extracted by the additional classification feature layer. Unlike traditional methods that align only foreground ROI (region of interest) features, in some embodiments the model training method aligns not only the features of the foreground ROIs but also the features of the reference boxes of the background, corresponding to background-class instances. This is because, with sample-point instance alignment, computing the instance-level alignment loss requires computing the intra-class distance and inter-class distance of each instance separately. However, in some scenes there is only one type of foreground, e.g., detecting cars and ignoring other objects. In such a scenario, if only the foreground were considered, the inter-class distance, and hence the instance alignment loss, could not be calculated. Of course, the sample-point-based instance alignment in this disclosure may also be used to align only foreground classes if there are multiple foreground classes.
In one embodiment, a moving average class center, for example represented by a feature vector, may also be treated as an instance feature and added to the instance classification feature set to assist in determining the instance-level alignment loss. Determining the instance-level alignment loss (e.g., step S107 in FIG. 1) is described below with reference to FIG. 3. FIG. 3 illustrates an exemplary flow diagram of a method 300 for determining an instance-level alignment loss according to one embodiment of this disclosure.
The processing objects of method 300 are the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$. Each instance feature in an instance classification feature set may be referred to as an instance point; the instance points of each class have a distribution in the feature space determined by the respective instance features. Therefore, it is helpful to describe the method 300 with reference to the changes in the distribution of the instance points. FIG. 4 shows schematic distributions of instance points at different processing stages in the feature space according to an embodiment of the present disclosure. FIG. 4 (a_s) shows the source domain instance point distribution corresponding to the initial source domain instance classification feature set $O_s$ (e.g., composed of the classification features, output by the additional classification feature extraction layer FC, of the source domain images $x_i^s$ in the subset $B_s$); FIG. 4 (a_t) shows the target domain instance point distribution corresponding to the initial target domain instance classification feature set $O_t$ (e.g., composed of the classification features, output by the additional classification feature extraction layer FC, of the target domain images $x_i^t$ in the subset $B_t$). In this example, the set of object classes Sc comprises 4 classes, corresponding to k = 0 to 3, where k = 0 corresponds to the background class. An example of background noise is shown in the upper right corner of FIG. 4 (a_t). At this stage, an instance classification feature set may lack feature points of some class. For example, as shown in FIG. 4 (a_s), the source domain feature points lack instance points of k = 3; as shown in FIG. 4 (a_t), the target domain feature points lack instance points of k = 1 and k = 3. FIGs. 4 (a_s) and 4 (a_t) also schematically show that, for the same category, the number of source domain instance points (labels) is greater than the number of target domain instance points (labels).
In step S301, an average class center for each class of the source domain for the current training iteration round is determined based on the source domain instance classification feature set $O_s$. For example, when the set of object classes Sc includes 4 classes, this step typically determines 4 average class centers. The average class center $c_k^s$ of class k of the source domain may be determined, for example, using equation (9), where k is the class index:

$$c_k^s = \frac{1}{|O_s^k|} \sum_{x^s \in O_s^k} x^s \tag{9}$$

where $x^s$ is a source domain instance classification feature in the subset $O_s^k = \{x^s \in O_s \mid y^s = k\}$ of source domain instance classification features of class k in the source domain instance classification feature set $O_s$, and $|O_s^k|$ is the number of object instances of class k, i.e., the number of instance classification features in the subset $O_s^k$.
In step S303, an average class center for each class of the target domain for the current training iteration round is determined based on the target domain instance classification feature set $O_t$. For example, when the set of object classes Sc includes 4 classes, this step determines 4 average class centers. The average class center $c_k^t$ of class k of the target domain may be determined, for example, using equation (10), where k is the class index:

$$c_k^t = \frac{1}{|O_t^k|} \sum_{x^t \in O_t^k} x^t \tag{10}$$

where $x^t$ is a target domain instance classification feature in the subset $O_t^k = \{x^t \in O_t \mid y^t = k\}$ of target domain instance classification features of class k in the target domain instance classification feature set $O_t$, and $|O_t^k|$ is the number of object instances of class k, i.e., the number of instance classification features in the subset $O_t^k$.
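As an illustration of equations (9)/(10), a PyTorch sketch of the per-class average centers; the tensor shapes and function name are assumptions for illustration:

```python
import torch

def average_class_centers(feats, labels, num_classes):
    """feats: (n, d) instance classification features; labels: (n,) class indices.
    Returns a dict {k: mean feature of class k} for classes present in the batch."""
    centers = {}
    for k in range(num_classes):
        mask = labels == k
        if mask.any():                      # a class may be absent in this round
            centers[k] = feats[mask].mean(dim=0)
    return centers
```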
In step S305, for the source domain, a moving average class center for each class of the source domain for the current training iteration round is determined based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round. The moving average class center $C_{k,j}^s$ of the k-th class of the source domain for the j-th training iteration round (the current training iteration round) can be determined using equation (11):

$$C_{k,j}^s = \theta\, C_{k,j-1}^s + (1 - \theta)\, c_k^s \tag{11}$$

where $C_{k,j-1}^s$ is the moving average class center of the k-th class of the source domain for the previous training iteration round, and $\theta$ denotes the moving average coefficient. $C_{k,0}^s$ (i.e., the value used when j = 1) can be set to 0.
For the target domain, in step S307, a moving average class center for each class of the target domain for the current training iteration round is determined based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round. The moving average class center $C_{k,j}^t$ of the k-th class of the target domain for the j-th training iteration round (the current training iteration round) can be determined using equation (12):

$$C_{k,j}^t = \theta\, C_{k,j-1}^t + (1 - \theta)\, c_k^t \tag{12}$$

where $C_{k,j-1}^t$ is the moving average class center of the k-th class of the target domain for the previous training iteration round, and $\theta$ denotes the moving average coefficient. $C_{k,0}^t$ (i.e., the value used when j = 1) can be set to 0.
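Under the reconstruction of equations (11)/(12), the moving-average class-center update can be sketched as follows; the dict-of-tensors state and the coefficient theta are illustrative assumptions:

```python
def update_moving_centers(moving, batch_centers, theta=0.9):
    """moving: {k: C_[k,j-1]} from the previous round (empty dict initially, i.e. 0);
    batch_centers: {k: c_k} from average_class_centers() for the current round."""
    for k, c in batch_centers.items():
        if k in moving:
            moving[k] = theta * moving[k] + (1 - theta) * c   # eq. (11)/(12)
        else:
            moving[k] = (1 - theta) * c                       # previous center was 0
    return moving
```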
In step S309, the source domain instance classification feature set is updated by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set. FIG. 4 (b_s) schematically shows the source domain instance point distribution with the moving average class center of each class of the source domain added, where each solid geometric shape corresponds to a feature point in the source domain representing the moving average class center of one class.
In step S311, the target domain instance classification feature set is updated by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set. FIG. 4 (b_t) schematically shows the target domain instance point distribution with the moving average class center of each class of the target domain added, where each solid geometric shape corresponds to a feature point in the target domain representing the moving average class center of one class. Adding the moving average class centers facilitates computing the cross-domain inter-class and intra-class distances for all instances of all classes.
In step S313, an instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set is determined. It should be noted that, if the moving average class center of a certain class of the object class set Sc is zero in a certain training iteration round (for example, in the first training iteration round), no alignment is performed on the instance point of the instance type, and no instance-level alignment loss for the instance type is calculated, that is, no alignment loss related to the instance type is included in the instance-level alignment loss.
In one embodiment, updating the instance classification feature sets may also include deleting the background class instances. Since the ROIs represented by the reference boxes of the background carry a lot of label noise, in this embodiment the classification features of the corresponding background class instances in the source domain instance classification feature set are deleted while the moving average class center of the background class is retained in the source domain instance classification feature set; likewise, the classification features of the corresponding background class instances in the target domain instance classification feature set are deleted while the moving average class center of the background class is retained in the target domain instance classification feature set. The background deletion operation may be performed according to equations (13) and (14):
$$O_s \leftarrow \{x^s \in O_s \mid y^s \neq 0\} \cup \{C_{k,j}^s\}_{k=0}^{K} \tag{13}$$

$$O_t \leftarrow \{x^t \in O_t \mid y^t \neq 0\} \cup \{C_{k,j}^t\}_{k=0}^{K} \tag{14}$$

where K represents the total number of classes in the set of object classes Sc minus 1 (so that the class index k runs from 0 to K), and $y^s = 0$ or $y^t = 0$ denotes the background class.
Deleting the feature points (classification features) of the background class instances while retaining the moving average class center of the background class helps suppress label noise and improve the performance of the object detection model. In the present disclosure, the operation of "deleting the feature points (classification features) of the background class instances while retaining the background class moving average class center" is also referred to simply as "deleting the background class instances".
FIG. 4 (c_s) schematically shows the distribution of the various classes of instance points of the source domain after the background class instances are deleted, and FIG. 4 (c_t) schematically shows the distribution of the various classes of instance points of the target domain after the background class instances are deleted. In FIGs. 4 (c_s) and 4 (c_t), it can be seen that the true background class instance points, represented by the open triangles, have been removed, while the background class moving average class center points, represented by the filled triangles, remain.
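A sketch of this deletion step on a simple list-of-(feature, label) representation, where label 0 is the background class; the representation and function name are chosen here for illustration only:

```python
def delete_background_instances(instances, centers):
    """instances: list of (feat, label) real instance points;
    centers: dict {k: moving-average center feat}, kept for every class incl. k=0."""
    kept = [(f, y) for (f, y) in instances if y != 0]   # drop noisy background points
    # Centers (including the background-class center) stay in the feature set.
    kept += [(c, k) for k, c in centers.items()]
    return kept
```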
In one embodiment, updating the instance classification feature sets may also include undersampling. It is known that the class imbalance problem in a training sample set can degrade the performance of machine learning. Similarly, an imbalanced instance distribution can negatively impact instance-level alignment. For example, see document 1: the distribution of instances is very uneven on the Cityscapes dataset, with the two categories "car" and "person" accounting for the vast majority of instances.

Document 1: Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3213-3223.
Therefore, to alleviate the performance impact of this problem, in this embodiment updating the instance classification feature sets further includes undersampling the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$. Specifically, the maximum number of instances of each class is limited by discarding instances at each training iteration round. The discarding update may be implemented by a function undersampling() as shown in equations (15) and (16):

$$O_s \leftarrow \mathrm{undersampling}(O_s, N_{max}) \tag{15}$$

$$O_t \leftarrow \mathrm{undersampling}(O_t, N_{max}) \tag{16}$$

where undersampling() is a predefined function that, by randomly discarding instances, limits the maximum number of instances of each class so as not to exceed a given threshold $N_{max}$, and each instance corresponds to a respective instance classification feature. Only true instance features are discarded; the quasi-instance features corresponding to the moving average class centers are not discarded. If an instance classification feature set already contains moving average class centers, the moving average class centers remain in the instance classification feature set after the discarding update (i.e., after undersampling). For $O_s$ and $O_t$, it is possible to check, class by class, whether the number of instances of the respective class is greater than the given threshold, and if the check result is "yes", to reduce the number of instances of that class to $N_{max}$ by randomly discarding instances of that class.

FIG. 4 (d_s) schematically shows the source domain instance point distribution after the discarding update, where some instance points (instance features) of k = 1 and k = 2 are discarded according to the given threshold; FIG. 4 (d_t) schematically shows the target domain instance point distribution after the discarding update, where a fraction of the instance points (instance features) of k = 2 are discarded. The undersampling is performed after the moving average class centers are calculated and before the alignment loss is actually calculated. It is noted that in FIG. 4 the exemplary set of object classes Sc includes a background class, which corresponds to the instance points of k = 0. It can be appreciated that if the undersampling is performed before the background class instances are deleted, the background class instances need not be undersampled, which helps reduce training time. It will be appreciated that the given threshold for each class need not be exactly the same; apart from the background class, however, substantially the same, or even the same, thresholds are preferred.
Undersampling helps balance the instance distribution and improve the performance of the object detection model.
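A sketch of the undersampling() function over the same list representation, with the per-class cap $N_{max}$ as an assumed hyperparameter; only real instance points are discarded, never the moving-average centers (which are kept elsewhere):

```python
import random

def undersampling(instances, n_max, rng=random):
    """instances: list of (feat, label) real instance points.
    Randomly keeps at most n_max instances per class."""
    by_class = {}
    for f, y in instances:
        by_class.setdefault(y, []).append((f, y))
    kept = []
    for y, items in by_class.items():
        if len(items) > n_max:
            items = rng.sample(items, n_max)   # random discard down to the cap
        kept.extend(items)
    return kept
```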
In one embodiment, updating the instance classification feature sets includes, for both the source domain instance classification feature set and the target domain instance classification feature set, adding the moving average class center of each class, deleting the background class instances, and undersampling.
After the updated source domain instance classification feature set and target domain instance classification feature set are obtained, the instance-level alignment loss $L_{ins}$ can be determined based on the alignment of the feature points in the two feature sets. In one embodiment, the instance-level alignment loss $L_{ins}$ is an extended d-SNE loss that takes maximizing the minimum absolute inter-class distance into account. For d-SNE, reference is made to document 2.
Document 2: Xu X, Zhou X, Venkatesan R, et al. d-SNE: Domain adaptation using stochastic neighborhood embedding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
d-SNE is currently a sample-point-based alignment method with good performance. The d-SNE loss is shown in equation (17):

$$L_{d\text{-}SNE} = \sum_{x^t \in O_t} \Big( \sup_{x^s \in O_s^{k}} d(x^s, x^t) \;-\; \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t) \Big), \quad k = y^t \tag{17}$$

where $d(x^s, x^t)$ denotes the square of the Euclidean distance between $x^s$ and $x^t$ in the feature space, and k is the class of $x^t$, i.e., k = $y^t$. sup{ } denotes the upper bound of the intra-class distances between cross-domain features, and inf{ } denotes the lower bound of the inter-class distances between cross-domain features. Thus, the d-SNE loss achieves sample-point-based instance-level alignment by minimizing the largest cross-domain intra-class distance while maximizing the smallest cross-domain inter-class distance. In one example, the instance-level alignment loss of the present disclosure may be determined according to equation (17). Further, for computational efficiency, in one example the d-SNE loss may be defined by equation (18):

$$L_{d\text{-}SNE} = \sum_{x^t \in O_t} \max\Big(0,\; m + \sup_{x^s \in O_s^{k}} d(x^s, x^t) - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\Big) \tag{18}$$

where m is a predefined margin value and max() takes the maximum. m can be determined empirically; one exemplary value is 1. In one example, the instance-level alignment loss of the present disclosure may be determined according to equation (18). However, the implementation of the d-SNE loss shown in equation (18) only increases the relative difference between the maximum intra-class distance and the minimum inter-class distance, and does not maximize the minimum absolute inter-class distance. To address this problem, in one example an improved instance-level alignment loss, i.e., the extended d-SNE loss, is employed, determined using equation (19):

$$L_{ins} = \sum_{x^t \in O_t} \Big( \max\big(0,\; m + \sup_{x^s \in O_s^{k}} d(x^s, x^t) - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\big) + \max\big(0,\; m_2 - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\big) \Big) \tag{19}$$

where $m_2$ is another predefined margin value used to maximize the minimum absolute inter-class distance. $m_2$ can be determined empirically; one exemplary value is 30. The extended d-SNE loss in this embodiment (see equation (19)) uses an additional hyperparameter $m_2$ relative to the original d-SNE loss (see equation (17)), so that the classes can be better separated.
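A PyTorch sketch of the extended d-SNE loss as reconstructed above, using squared Euclidean distances and per-target max intra-class and min inter-class distances; the margins m and m2 and the function name are assumed for illustration:

```python
import torch

def extended_dsne_loss(feats_s, ys, feats_t, yt, m=1.0, m2=30.0):
    """feats_s: (n_s, d), ys: (n_s,); feats_t: (n_t, d), yt: (n_t,)."""
    d2 = torch.cdist(feats_t, feats_s) ** 2           # (n_t, n_s) squared distances
    same = yt.unsqueeze(1) == ys.unsqueeze(0)         # cross-domain same-class mask
    inf = torch.tensor(float("inf"), device=d2.device)
    max_intra = torch.where(same, d2, -inf).amax(dim=1)   # largest intra-class distance
    min_inter = torch.where(~same, d2, inf).amin(dim=1)   # smallest inter-class distance
    valid = torch.isfinite(max_intra) & torch.isfinite(min_inter)
    relative = torch.relu(m + max_intra[valid] - min_inter[valid])   # eq. (18) term
    absolute = torch.relu(m2 - min_inter[valid])                     # extra eq. (19) term
    return (relative + absolute).sum()
```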
Adjusting the model parameters will use the overall objective function. The overall objective function is described further below.
In one embodiment, the total loss may be a linear combination of the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$. Further, the total loss may be a linear combination of the adversarial loss $L_{adv}$, the detection loss $L_{det}$, and the instance-level alignment loss $L_{ins}$. Specifically, the total loss $L_{total}$ can be determined as in equation (20):

$$L_{total} = L_{det}(F, R) - \lambda_1 L_{adv}(F, D) + \lambda_2 L_{ins}(F, R) \tag{20}$$

$\lambda_1$ may, for example, take a value between 0.1 and 1 depending on the sample data set. $\lambda_2 = \min(0.1, p^2)$, where p may gradually increase from 0 to 1 during training. $\lambda_2$ may also be set to a fixed value, for example $\lambda_2 = 1$.
The overall objective function can be defined using the min-max loss function (see equation (21)). Optimizing the object detection model by adjusting the parameters of the object detection model is achieved using the overall objective function:

$$\min_{F,R}\ \max_{D_l, D_g}\ L_{total} \tag{21}$$

where $\min_{F,R}$ represents minimizing the total loss by adjusting the parameters of F and R, and $\max_{D_l, D_g}$ represents maximizing the total loss by adjusting the parameters of $D_l$ and $D_g$. The min-max loss function may be implemented by a gradient reversal layer (GRL). For the min-max loss function, reference is made to document 3.

Document 3: Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 2016, 17(1): 2096-2030.
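A standard gradient reversal layer sketch in PyTorch; this implementation pattern is the common one for GRLs generally and is not taken verbatim from the present disclosure.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    # Insert between the feature extractor and a discriminator so that
    # minimizing the discriminator loss maximizes it w.r.t. the features.
    return GradReverse.apply(x, lam)
```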
FIG. 4 (e) schematically shows the effect of adjusting the parameters of the object detection model on the alignment of the feature points. In FIG. 4 (e), in order to clearly show the effect of parameter adjustment on alignment, the source domain instance points and the target domain instance points determined by the object detection model after parameter adjustment are shown merged in the same feature space. As shown in FIG. 4 (e), after the parameters of the object detection model are adjusted, feature points of the same class tend to be more concentrated and better aligned, the intra-class distances decrease, and feature points of different classes tend to have larger intervals, i.e., larger inter-class distances.
One aspect of the present disclosure provides an object detection method. This method is exemplarily described below with reference to fig. 5.
Fig. 5 illustrates an exemplary flow diagram of an object detection method 500 according to one embodiment of the disclosure.
In step S501, the object detection model M is trained. Specifically, the object detection model M is trained using the model training method of the present disclosure (e.g., the method 200 shown in FIG. 2).
In step S503, an image to be detected is detected. Specifically, the trained object detection model is used to determine the location and type of the object in the image to be detected.
One aspect of the present disclosure provides an apparatus for training an object detection model. The apparatus is described below with reference to FIG. 6. FIG. 6 shows a block diagram of the structure of an apparatus 600 for training an object detection model according to an embodiment of the present disclosure. The apparatus 600 is used to train an object detection model in an iterative manner. The object detection model is based on a neural network. The apparatus 600 comprises: a detection loss determining unit 601, a classification feature set determining unit 603, an alignment loss determining unit 605, a total loss determining unit 607, and an optimizing unit 609. The detection loss determining unit 601 is configured to: determine a detection loss for the source domain data subset based on the source domain data subset with the larger number of labels corresponding to the at least one fully labeled source domain image for the current training iteration round. The classification feature set determining unit 603 is configured to: determine a source domain instance classification feature set for the at least one fully labeled source domain image, and determine a target domain instance classification feature set for the at least one loosely labeled target domain image. The alignment loss determining unit 605 is configured to: determine an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set. The total loss determining unit 607 is configured to: determine a total loss based on the detection loss and the instance alignment loss. The optimizing unit 609 is configured to: optimize the object detection model by adjusting parameters of the object detection model based on the total loss. The source domain data subset and the target domain data subset are from a source domain data set with a larger number of labels and a target domain data set with a smaller number of labels, respectively. There is a correspondence between the apparatus 600 and the method 200, and further details of the apparatus 600 can be found in the description of the method 200 herein. For example, the classification feature set determining unit 603 is further configured to perform at least one of the following operations: determining the moving average class centers of the classes of the source domain and the target domain, adding each moving average class center to the corresponding instance classification feature set, deleting the background class instances in the instance classification feature sets, and undersampling the instance classification feature sets. Optionally, the apparatus 600 may further comprise an adversarial loss determining unit. The adversarial loss determining unit is used to determine an adversarial loss for the source domain data set and the target domain data set. The adversarial loss determining unit is coupled to the total loss determining unit 607 so that the adversarial loss is also included in the total loss.
According to one aspect of the present disclosure, an apparatus for training an object detection model is provided. The apparatus is described below with reference to FIG. 7. FIG. 7 illustrates an apparatus 700 for training an object detection model according to one embodiment of the present disclosure. The device includes: a memory 701 having instructions stored thereon; and one or more processors 703, the one or more processors being capable of communicating with the memory to execute the instructions retrieved from the memory, and the instructions causing the one or more processors to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss. There is a correspondence between the device 700 and the method 200, and further details of the device 700 can be found in the description of the method 200 herein.
One aspect of the present disclosure provides a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to: read, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a source domain data subset having the larger number of labels corresponding to at least one fully labeled source domain image and a target domain data subset having the smaller number of labels corresponding to at least one loosely labeled target domain image for the current training iteration round; process the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; process the at least one loosely labeled target domain image through the object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determine an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimize the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss. There is a correspondence between the program and the method 200, and further details of the program can be found in the description of the method 200 herein.
One aspect of the present disclosure provides a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to implement the method 500.
According to an aspect of the present disclosure, there is also provided an information processing apparatus.
Fig. 8 is an exemplary block diagram of an information processing apparatus 800 according to one embodiment of the present disclosure. In fig. 8, a Central Processing Unit (CPU) 801 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores, as needed, the data required when the CPU 801 executes the various processes.
The CPU 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output interface 805 is also connected to the bus 804.
The following components are connected to the input/output interface 805: an input section 806 including a soft keyboard and the like; an output section 807 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 808 such as a hard disk; and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the Internet, a local area network, a mobile network, or a combination thereof.
A drive 810 is also connected to the input/output interface 805 as needed. A removable medium 811, such as a semiconductor memory, is mounted on the drive 810 as needed, so that a program read therefrom is installed into the storage section 808 as needed.
The CPU 801 may run a program for implementing the method for training an object detection model according to the present disclosure or a program for implementing the object detection method of the present disclosure.
The effects of the scheme of the present disclosure are described below.
The following three scenarios were constructed for experiments comparing the detection accuracy of the scheme of the present disclosure with existing methods: (1) migration from Cityscapes to Foggy Cityscapes (C->F); (2) migration from SIM10K to Cityscapes (S->C, i.e., training with labeled samples of SIM10K together with a small number of labeled samples of Cityscapes); and (3) migration from Udacity to Cityscapes (U->C). The first scenario, C->F, simulates the data deviation caused by domain shift due to weather change. The second scenario, S->C, simulates the data deviation between a virtual world and the real world. The third scenario, U->C, simulates the data deviation between two different real-world settings due to illumination, camera angle, and the like. The experimental results are shown in Tables 1 and 2.
TABLE 1 Experimental results for C->F
[Table 1 is provided as images in the original publication; it lists the detection accuracy (mAP) of the compared methods (Source-only, Target-only, UDA, FUDA, FDA, and PICA+SWDA) in the C->F scenario.]
TABLE 2 Experimental results for S->C and U->C
[Table 2 is provided as an image in the original publication; it lists the detection accuracy (mAP) of the compared methods in the S->C and U->C scenarios.]
The sources of the reference data are as follows.
[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
[2] Saito K, Ushiku Y, Harada T, et al. Strong-weak distribution alignment for adaptive object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6956-6965.
[3] Zhuang C, Han X, Huang W, et al. iFAN: Image-instance full alignment networks for adaptive object detection. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 13122-13129.
[4] Wu A, Han Y, Zhu L, Yang Y. Instance-invariant domain adaptive object detection via progressive disentanglement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[5] Wang T, Zhang X, Yuan L, et al. Few-shot adaptive Faster R-CNN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7173-7182.
Here, Source-only denotes training using only the fully labeled source domain data; Target-only denotes training using only the loosely labeled target domain data; UDA denotes unsupervised domain adaptation methods, which use all unlabeled target domain data for domain adaptation; FUDA denotes few-shot unsupervised domain adaptation methods; FDA denotes few-shot domain adaptation methods; PICA+SWDA denotes the method adopted by the present disclosure, where PICA stands for "point-wise instance and center alignment"; mAP (0.5) denotes mean Average Precision with a threshold of 0.5; and the decimal values in the tables are the detection accuracy mAP.
In the S->C and U->C scenarios, 8 target domain images are used, and only 3 automobiles are labeled in each image; in the C->F scenario, 8 target domain images are used, each image corresponds to one class, and only one instance of the corresponding class is labeled in each image. The FUDA methods use the same 8 images as the FDA methods, but do not use the corresponding annotations.
The experimental results in Tables 1 and 2 indicate that the method of the present disclosure (PICA+SWDA) outperforms the existing methods FAFRCNN and SWDA in all of the C->F, S->C, and U->C scenarios.
Aspects of the present disclosure relate to an additional classification feature extraction layer, a countermeasure loss, the use of a small number of loosely labeled target domain images, moving average class center alignment, deletion of background class instances, undersampling, and an improved instance-level alignment loss. The benefits of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.
While the invention has been described in terms of specific embodiments thereof, it will be appreciated that those skilled in the art will be able to devise various modifications (including combinations and substitutions of features between the embodiments, where appropriate), improvements and equivalents of the invention within the spirit and scope of the appended claims. Such modifications, improvements and equivalents are also intended to be included within the scope of the present invention.
It should be emphasized that the term "comprises/comprising", when used herein, specifies the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Furthermore, the methods of the embodiments of the present invention are not limited to being performed in the time sequence described in the specification or shown in the drawings, and may be performed in other time sequences, in parallel, or independently, where technically possible. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Supplementary note
The present disclosure includes, but is not limited to, the following.
1. A computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network, characterized in that a current training iteration round comprises the operations of:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, a subset of source domain data having the larger number of labels corresponding to at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to at least one loosely labeled target domain image, respectively, for the current training iteration round;
processing the at least one fully labeled source domain image through the object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss. One possible form of such an iteration is sketched below.
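To make the flow of supplementary note 1 concrete, the following is a minimal PyTorch-style sketch of one training iteration. The model's input/output interface, the dictionary keys, and the loss weight lambda_align are illustrative assumptions rather than the disclosure's actual implementation, and the alignment term here is only a placeholder (the disclosed instance-level alignment loss is detailed in supplementary notes 7 to 10).

```python
# Hypothetical sketch of one training iteration of supplementary note 1.
# The model interface and all names here are illustrative assumptions.
import torch

def training_iteration(model, optimizer, source_subset, target_subset,
                       lambda_align=0.1):
    # The source subset is fully labeled; the target subset is loosely
    # labeled (only a few instances per image are annotated).
    src_images, src_boxes, src_classes = source_subset
    tgt_images, tgt_boxes, tgt_classes = target_subset

    # Supervised detection loss and instance classification features
    # on the source domain.
    src_out = model(src_images, src_boxes, src_classes)
    detection_loss = src_out["detection_loss"]
    src_features = src_out["instance_classification_features"]

    # Instance classification features on the target domain.
    tgt_out = model(tgt_images, tgt_boxes, tgt_classes)
    tgt_features = tgt_out["instance_classification_features"]

    # Instance-level alignment loss between the two feature sets
    # (placeholder: squared distance between domain feature means).
    align_loss = ((src_features.mean(0) - tgt_features.mean(0)) ** 2).sum()

    # Total loss and parameter update.
    total_loss = detection_loss + lambda_align * align_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```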
2. The method of supplementary note 1, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
3. The method according to supplementary note 1, wherein the object detection model includes an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
4. The method of supplementary note 3, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
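As one way to realize such a layer, the sketch below adds a small projection head on top of pooled region-of-interest features; the feature dimensions (2048 and 256) and the single-linear-layer design are assumptions for illustration, not the disclosure's actual architecture.

```python
# Hypothetical sketch of the additional classification feature
# extraction layer of supplementary note 4 (dimensions are assumptions).
import torch
import torch.nn as nn

class ClassificationFeatureExtractor(nn.Module):
    """Maps each region-of-interest feature to a compact instance
    classification feature used for classification and alignment."""

    def __init__(self, roi_feat_dim=2048, cls_feat_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(roi_feat_dim, cls_feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, roi_feat_dim) pooled ROI features.
        return self.proj(roi_features)

# Usage sketch: 128 regions of interest from one image.
extractor = ClassificationFeatureExtractor()
roi_features = torch.randn(128, 2048)
instance_classification_features = extractor(roi_features)  # (128, 256)
```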
5. The method according to supplementary note 1, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
6. The method according to supplementary note 5, wherein the R network includes a global discriminator and a local discriminator, and the countermeasure loss includes a weak global alignment loss determined by the global discriminator based on image-level features and a strong local alignment loss determined by the local discriminator based on local-level features.
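Supplementary note 6 follows the strong-weak alignment of reference [2]; the sketch below gives one hedged reading of the two losses, with a focal-style weak global term and a least-squares strong local term. The discriminator output shapes, the gamma value, and the gradient reversal layer that would normally sit between the backbone and the discriminators (not shown) are assumptions.

```python
# Hypothetical sketch of the weak global / strong local adversarial
# losses of supplementary note 6, in the style of reference [2].
import torch
import torch.nn.functional as F

def weak_global_alignment_loss(global_logits, is_source, gamma=5.0):
    # global_logits: (N, 1) discriminator outputs on image-level features.
    # A focal-style modulation down-weights easily classified examples,
    # so the alignment pressure on the global features stays "weak".
    p = torch.sigmoid(global_logits)
    target = torch.full_like(p, 1.0 if is_source else 0.0)
    pt = p if is_source else (1 - p)
    bce = F.binary_cross_entropy(p, target, reduction="none")
    return (((1 - pt) ** gamma) * bce).mean()

def strong_local_alignment_loss(local_map, is_source):
    # local_map: (N, 1, H, W) per-location discriminator outputs on
    # low-level feature maps; a least-squares loss aligns them strongly.
    target = 1.0 if is_source else 0.0
    return ((local_map - target) ** 2).mean()
```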
7. The method of supplementary note 2, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set. A sketch of the moving average class center update is given below.
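The following is a minimal sketch of the moving average class center update and of augmenting an instance classification feature set as described in supplementary note 7; the momentum value of 0.9 and the tensor layout are assumptions for illustration.

```python
# Hypothetical sketch of the moving average class centers of
# supplementary note 7 (the momentum value is an assumption).
import torch

def update_moving_class_centers(features, labels, moving_centers,
                                num_classes, momentum=0.9):
    """features: (N, D) instance classification features of one domain for
    the current training iteration round; labels: (N,) class ids;
    moving_centers: (num_classes, D) centers from the previous round."""
    new_centers = moving_centers.clone()
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            current_center = features[mask].mean(0)  # average class center
            new_centers[c] = (momentum * moving_centers[c]
                              + (1 - momentum) * current_center)
    return new_centers

def augment_with_centers(features, labels, centers):
    # The updated centers are appended to the instance classification
    # feature set before computing the instance-level alignment loss.
    center_labels = torch.arange(centers.size(0))
    return (torch.cat([features, centers], dim=0),
            torch.cat([labels, center_labels], dim=0))
```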
8. The method of supplementary note 7, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
9. The method of supplementary note 7, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set. A combined sketch of this deletion and the undersampling of supplementary note 8 is given below.
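One possible combined realization of supplementary notes 8 and 9 is sketched below: background class instance features are deleted (the background class's moving average center, added per supplementary note 7, remains in the set) and each remaining class is undersampled to a cap, which counters class imbalance. The per-class cap of 16 is an assumption.

```python
# Hypothetical sketch of supplementary notes 8 and 9 (the per-class
# cap is an assumption).
import torch

def drop_background_and_undersample(features, labels, background_class,
                                    max_per_class=16):
    # Delete the classification features of background class instances;
    # the background moving average class center is kept elsewhere.
    keep = labels != background_class
    features, labels = features[keep], labels[keep]

    # Undersample each remaining class to at most max_per_class features.
    kept_indices = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() > max_per_class:
            idx = idx[torch.randperm(idx.numel())[:max_per_class]]
        kept_indices.append(idx)
    if not kept_indices:
        kept = torch.empty(0, dtype=torch.long)
    else:
        kept = torch.cat(kept_indices)
    return features[kept], labels[kept]
```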
10. The method according to supplementary note 1, wherein the instance-level alignment penalty is an extended d-SNE penalty that takes into account maximizing the minimum absolute inter-class distance.
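The exact form of the extended d-SNE penalty is not reproduced here, so the sketch below augments the standard d-SNE hinge (largest intra-class distance versus smallest inter-class distance) with an absolute-margin term that pushes the minimum inter-class distance upward; the margin value and the specific combination of the two terms are assumptions.

```python
# Hypothetical sketch of an extended d-SNE instance alignment loss
# (supplementary note 10); the margin and combination are assumptions.
import torch

def extended_dsne_loss(src_feats, src_labels, tgt_feats, tgt_labels,
                       margin=1.0):
    loss = src_feats.new_zeros(())
    for i in range(tgt_feats.size(0)):
        d = ((src_feats - tgt_feats[i]) ** 2).sum(dim=1)  # squared distances
        same = src_labels == tgt_labels[i]
        if same.any() and (~same).any():
            max_intra = d[same].max()    # farthest same-class source instance
            min_inter = d[~same].min()   # nearest other-class source instance
            # Relative term: keep intra-class distances below inter-class ones.
            relative = torch.relu(max_intra - min_inter)
            # Absolute term: maximize the minimum inter-class distance by
            # penalizing it when it falls below an absolute margin.
            absolute = torch.relu(margin - min_inter)
            loss = loss + relative + absolute
    return loss / max(tgt_feats.size(0), 1)
```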
11. An object detection method, comprising:
training the object detection model using the method of any one of supplementary notes 1 to 10; and
determining the position and the type of an object in an image to be detected by using the trained object detection model. An inference-time usage sketch is given below.
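At detection time the trained model is used directly on the image to be detected; the following minimal usage sketch assumes a dictionary-style output interface, which is an assumption rather than the disclosure's actual API.

```python
# Hypothetical inference sketch for the object detection method of
# supplementary note 11 (the output keys are assumptions).
import torch

@torch.no_grad()
def detect(model, image):
    model.eval()
    out = model(image.unsqueeze(0))  # batch of one image to detect
    # Positions (bounding boxes) and types (class labels) of each instance.
    return out["boxes"], out["classes"], out["scores"]
```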
12. A computer-readable storage medium having a program stored thereon, the program causing a computer running the program to:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round;
processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss.
13. The computer-readable storage medium of supplementary note 12, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
14. The computer-readable storage medium of supplementary note 12, wherein the object detection model includes an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
15. The computer-readable storage medium of supplementary note 14, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
16. The computer-readable storage medium of supplementary note 12, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
17. The computer-readable storage medium of supplementary note 16, wherein the R network includes a global discriminator and a local discriminator, and the countermeasure loss includes a weak global alignment loss determined by the global discriminator based on image-level features and a strong local alignment loss determined by the local discriminator based on local-level features.
18. The computer-readable storage medium of supplementary note 13, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set.
19. The computer-readable storage medium of supplementary note 18, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
20. The computer-readable storage medium of supplementary note 18, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set.

Claims (10)

1. A computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network, characterized in that a current training iteration round comprises the operations of:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a source domain data subset having the larger number of labels corresponding to at least one fully labeled source domain image and a target domain data subset having the smaller number of labels corresponding to at least one loosely labeled target domain image for the current training iteration round;
processing the at least one fully labeled source domain image through the object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss.
2. The method of claim 1, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
3. The method of claim 2, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set.
4. The method of claim 3, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
5. The method of claim 3, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set.
6. The method of claim 1, wherein the instance-level alignment penalty is an extended d-SNE penalty that takes into account maximizing a minimum absolute inter-class distance.
7. The method of claim 1, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
8. The method of claim 1, wherein the object detection model comprises an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
9. The method of claim 8, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
10. An object detection method, comprising:
training the object detection model using the method of any one of claims 1 to 9; and
determining the position and the type of an object in an image to be detected by using the trained object detection model.
CN202110949753.7A 2021-08-18 2021-08-18 Method for training object detection model and object detection method Pending CN115713111A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110949753.7A CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method
JP2022111473A JP2023029236A (en) 2021-08-18 2022-07-11 Method for training object detection model and object detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110949753.7A CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method

Publications (1)

Publication Number Publication Date
CN115713111A (en) 2023-02-24

Family

ID=85229982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949753.7A Pending CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method

Country Status (2)

Country Link
JP (1) JP2023029236A (en)
CN (1) CN115713111A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343050A (en) * 2023-05-26 2023-06-27 成都理工大学 Target detection method for remote sensing image noise annotation based on self-adaptive weight


Also Published As

Publication number Publication date
JP2023029236A (en) 2023-03-03

Similar Documents

Publication Publication Date Title
CN110020592B (en) Object detection model training method, device, computer equipment and storage medium
CN106599836A (en) Multi-face tracking method and tracking system
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
Ye et al. A two-stage real-time YOLOv2-based road marking detector with lightweight spatial transformation-invariant classification
CN113435319B (en) Classification method combining multi-target tracking and pedestrian angle recognition
WO2023038574A1 (en) Method and system for processing a target image
Luo et al. Adversarial style discrepancy minimization for unsupervised domain adaptation
CN115713111A (en) Method for training object detection model and object detection method
Shi et al. License plate localization in complex environments based on improved GrabCut algorithm
CN111582057B (en) Face verification method based on local receptive field
CN111753731A (en) Face quality evaluation method, device and system and training method of face quality evaluation model
CN111428567A (en) Pedestrian tracking system and method based on affine multi-task regression
CN116630685A (en) Method, system, medium, equipment and terminal for defending countermeasure sample
Han et al. Efficient joint model learning, segmentation and model updating for visual tracking
CN113344102B (en) Target image recognition method based on image HOG features and ELM model
CN115311654A (en) Rice appearance automatic extraction method, device, equipment and storage medium
CN112651996A (en) Target detection tracking method and device, electronic equipment and storage medium
Islam et al. Faster R-CNN based traffic sign detection and classification
Budiarsa et al. Face recognition for occluded face with mask region convolutional neural network and fully convolutional network: a literature review
CN109086730A (en) A kind of Handwritten Digit Recognition method, apparatus, equipment and readable storage medium storing program for executing
CN116824306B (en) Training method of pen stone fossil image recognition model based on multi-mode metadata
Yang et al. Pedestrian detection under dense crowd
CN118037738B (en) Asphalt pavement crack pouring adhesive bonding performance detection method and equipment
CN117876383B (en) Yolov5 l-based highway surface strip-shaped crack detection method
CN116778277B (en) Cross-domain model training method based on progressive information decoupling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination