CN115713111A - Method for training object detection model and object detection method - Google Patents

Method for training object detection model and object detection method

Info

Publication number
CN115713111A
Authority
CN
China
Prior art keywords
instance
class
source domain
target domain
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110949753.7A
Other languages
Chinese (zh)
Inventor
钟朝亮
汪洁
冯成
张颖
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN202110949753.7A priority Critical patent/CN115713111A/en
Priority to JP2022111473A priority patent/JP2023029236A/en
Publication of CN115713111A publication Critical patent/CN115713111A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method for training an object detection model and an object detection method. According to one embodiment of the present disclosure, a method for training a model includes training an object detection model in an iterative manner, wherein the current training iteration round includes the following operations: reading a source domain data subset and a target domain data subset; determining a detection loss for the source domain data subset and a source domain instance classification feature set; determining a target domain instance classification feature set; determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on a total loss associated with the detection loss and the instance alignment loss. The beneficial effects of the aspects of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.

Description

Method for training object detection model and object detection method
Technical Field
The present disclosure relates generally to image processing, and more particularly, to a method for training an object detection model and an object detection method.
Background
In recent years, with the development of neural network technology, image processing models based on neural networks have been applied in various fields, such as face recognition, object classification, object detection, automatic driving, and behavior recognition.
Generally, an object detection model based on a neural network is trained by using a large number of labeled sample images before object detection is performed, so as to optimize the object detection model to ensure that the model has satisfactory detection performance. After the training is completed, the image to be detected may be input to the object detection model, and after various processing (e.g., feature extraction) of the image to be detected by the object detection model, the object detection model may output the position and type of each object instance included in the image to be detected.
Disclosure of Invention
A brief summary of the disclosure is provided below in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to an aspect of the present disclosure, there is provided a computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network. During training, the current training iteration round includes the following operations: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
According to another aspect of the present disclosure, an object detection method is provided. The method comprises the following steps: training an object detection model by using the model training method; and determining the position and the type of the object in the image to be detected by using the trained object detection model.
According to one aspect of the present disclosure, an apparatus for training a subject detection model is provided. The device comprises: a memory having instructions stored thereon; and one or more processors in communication with the memory to execute the instructions retrieved from the memory, and the instructions cause the one or more processors to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss.
The beneficial effects of the aspects of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.
Further areas of applicability will become apparent from the description provided herein. The foregoing description is intended for the purpose of illustration only and is not intended to limit the scope of the present disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will be more readily understood from the following description of embodiments thereof with reference to the accompanying drawings. The drawings are only for the purpose of illustrating the principles of the disclosure. The dimensions and relative positioning of the elements in the figures are not necessarily drawn to scale. Like reference numerals may denote like features. In the drawings:
FIG. 1 shows a flow diagram of operations involved in one training iteration round in a method for training an object detection model according to one embodiment of the present disclosure;
FIG. 2 illustrates an exemplary flow diagram of a method for training an object detection model according to one embodiment of the present disclosure;
FIG. 3 illustrates an exemplary flow diagram of a method for determining an instance-level alignment loss according to one embodiment of the present disclosure;
FIG. 4 shows schematic distributions of instance points at different processing stages in a feature space according to an embodiment of the present disclosure;
FIG. 5 illustrates an exemplary flow diagram of an object detection method according to one embodiment of the present disclosure;
FIG. 6 shows a block diagram of the structure of an apparatus for training an object detection model according to one embodiment of the present disclosure;
FIG. 7 shows a block diagram of the structure of an apparatus for training an object detection model according to one embodiment of the present disclosure; and
fig. 8 shows an exemplary block diagram of an information processing apparatus according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.
It is to be understood that the disclosure is not limited to the embodiments described below with reference to the drawings. In this context, where feasible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted in an embodiment.
Computer program code for carrying out operations for aspects of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The method of the present disclosure may be implemented by a circuit having a corresponding functional configuration. The circuitry includes circuitry for a processor.
One aspect of the present disclosure provides a computer-implemented method for training an object detection model M. The object detection model M is based on a neural network and is trained in an iterative manner. In each training iteration round, a batch of training sample images and their label data is input. The operations involved in an exemplary training iteration round will be described with reference to FIG. 1.
FIG. 1 shows an exemplary flowchart of operations involved in one training iteration round Iter[j] in a method for training an object detection model (referred to as the "model training method") according to an embodiment of the disclosure, where j denotes the index of the training iteration round. For ease of discussion, the j-th training iteration round may also be referred to as the "current training iteration round".
In step S101, a source domain data subset $B_s = \{(x_i^s, Y_i^s)\}_{i=1}^{n_s}$ corresponding to at least one fully labeled source domain image and a target domain data subset $B_t = \{(x_i^t, Y_i^t)\}_{i=1}^{n_t}$ corresponding to at least one loosely labeled target domain image are read for the current training iteration round from the source domain data set $D_s = \{(x_i^s, Y_i^s)\}_{i=1}^{N_s}$ with the larger number of labels and the target domain data set $D_t = \{(x_i^t, Y_i^t)\}_{i=1}^{N_t}$ with the smaller number of labels, respectively. Here $n_s$ denotes the number of source domain images read in and $n_t$ denotes the number of target domain images read in; $x_i^s$ denotes a sample image of the source domain and $x_i^t$ denotes a sample image of the target domain; $Y_i^s$ denotes the annotation information of the bounding boxes of the labeled objects in $x_i^s$, and $Y_i^t$ denotes the annotation information of the bounding boxes of the labeled objects in $x_i^t$. Likewise, $(x^s, Y^s)$ and $(x^t, Y^t)$ denote one input sample image and its bounding-box annotation information from the source domain and the target domain, respectively.

The annotation information of a bounding box includes the location of the bounding box and the instance type (sometimes also referred to as the "category") of the object instance of the type of interest in the image. Loose labeling and full labeling are two opposing concepts herein. For the same image containing multiple object instances (e.g., 10 object instances), an annotated image in which fewer instances (e.g., 4 instances) are labeled may be referred to as a loosely annotated image, as opposed to a fully annotated image in which more instances (e.g., all or most of the instances, e.g., 8 instances) are labeled. A loosely labeled image may be an image in which only a few of the instances are labeled. A more specific example: in a fully labeled image, almost all instances of the types of interest are labeled, while in a loosely labeled image only a few of all the instances of the types of interest are labeled. That is, some foreground regions that are actually instances of the types of interest are missing labels in a loosely labeled image, so that these missed instances are treated as background, or are even labeled as instances of the background class. $N_s$ is the number of training images contained in the entire source domain data set, and $N_t$ is the number of training images contained in the entire target domain data set. At each training iteration round, for example, a pair of training images including one source domain image and one target domain image may be input. $N_t \ll N_s$, i.e., the number of source domain images is significantly larger than the number of target domain images, e.g., $N_s/N_t \geq 10$, $\geq 100$, or even $\geq 1000$. At each training iteration round, the total number of labels of the source domain images is greater than the total number of labels of the target domain images. Each training iteration round may reuse images used by a previous training iteration round.
It should be noted that, for a training sample image, if an instance (i.e., foreground) of a type of interest in the image is not labeled and the set of object categories used by the object detection model includes a background class, the unlabeled instance may be treated as the background class. This causes label noise. Loosely labeled target domain images may therefore introduce label noise. In addition, for both fully labeled source domain images and loosely labeled target domain images, a background-class instance whose bounding box has an excessively large intersection over union (IoU) with a foreground instance may contain part of that foreground instance, which also causes label noise. Label noise may cause sample points (instance classification features) to be misaligned, negatively affecting the performance of the object detection model.
In step S103, the at least one fully labeled source domain image $x_i^s$, $(x_i^s, Y_i^s) \in B_s$, is processed by the object detection model M to determine a detection loss $L_{det}$ for the source domain data subset $B_s$ and a source domain instance classification feature set $O_s$ for the at least one fully labeled source domain image. The detection loss $L_{det}$ for the source domain data subset $B_s$ indicates the statistical accuracy, with respect to the annotation information, of the detection results output by the object detection model M when performing object detection on the at least one fully labeled source domain image; it is composed of a classification loss and a regression loss (i.e., a localization loss) of the bounding boxes. The source domain instance classification feature set $O_s$ is composed of the features for classification, given by the object detection model M, of all the source domain images $x_i^s$ read in by the current training iteration round.
In step S105, the at least one loosely labeled target domain image $x_i^t$, $(x_i^t, Y_i^t) \in B_t$, is processed by the object detection model M to determine a target domain instance classification feature set $O_t$ for the at least one loosely labeled target domain image. The target domain instance classification feature set $O_t$ is composed of the features for classification, given by the object detection model M, of all the target domain images $x_i^t$ read in by the current training iteration round.
In step S107, an instance-level alignment loss $L_{ins}$ related to instance feature alignment is determined based on the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$.
In step S109, the object detection model is optimized by adjusting the parameters of the object detection model M based on the total loss $L_{total}$ associated with the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$. The total loss $L_{total}$ is, for example, a linear combination of the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$.
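To make the flow of one training iteration round concrete, the following hedged Python sketch strings steps S103–S109 together; model, its forward_source/forward_target methods, instance_alignment_loss, and lambda_ins are illustrative assumptions of this sketch, not names from the present disclosure.

```python
def train_one_round(model, optimizer, src_batch, tgt_batch,
                    instance_alignment_loss, lambda_ins=1.0):
    """One training iteration round over the batches read in step S101."""
    # S103: source images -> detection loss L_det and instance features O_s.
    l_det, feats_s, labels_s = model.forward_source(src_batch)
    # S105: target images -> instance features O_t only (no target detection
    # loss, since loose labels carry too much label noise).
    feats_t, labels_t = model.forward_target(tgt_batch)
    # S107: instance-level alignment loss L_ins from O_s and O_t.
    l_ins = instance_alignment_loss(feats_s, labels_s, feats_t, labels_t)
    # S109: optimize on the total loss (here a simple linear combination).
    total = l_det + lambda_ins * l_ins
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total.detach())
```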
The model training method of the present disclosure may include a determination of whether to end training. A computer-implemented method for training an object detection model of the present disclosure is described below with reference to fig. 2, in which steps for determining a training end condition are shown.
FIG. 2 illustrates an exemplary flow diagram of a method 200 for training an object detection model M according to one embodiment of the present disclosure. Method 200 is a computer-implemented method for training an object detection model that includes training the object detection model M in an iterative manner. The method 200 comprises steps S101, S103, S105, and S107 of the training iteration round Iter[j] described in FIG. 1.
In step S209-1, it is determined whether a predetermined training end condition is satisfied. In the case that the determination result is yes, ending the training; in the case where the determination result is "no", step S209-2 is executed. The predetermined training end condition may be one of the following conditions: the total loss is less than a predetermined threshold; the total loss has converged. The total loss has converged, for example, meaning that the total loss for the current training iteration round varies by less than a predetermined threshold from the total loss for the previous training iteration round.
In step S209-2, the object detection model M is optimized by adjusting parameters of the object detection model M based on the total loss. Then, the process returns to step S101 to enter the next training iteration round.
Step S109 in fig. 1 may be subdivided into step S209-1 and step S209-2 in fig. 2.
As another alternative implementation of step S109, the following sub-steps may be included: optimizing the object detection model M by adjusting parameters of the object detection model M based on the total loss; and determining whether the number of training iteration rounds has reached a predetermined count. If the determination result is "yes", the training ends; if the determination result is "no", the process returns to step S101 to enter the next training iteration round.
The model training method of the present disclosure utilizes a large amount of source domain labeled data and a small amount of target domain labeled data for training. The use of a small number of loosely labeled images of the target domain can reduce the labeling cost of the training data and shorten the training time.
In one embodiment, the object detection model M is configured to perform object detection on the at least one fully labeled source domain image $x_i^s$ and the at least one loosely labeled target domain image $x_i^t$ based on the same set of object classes Sc. That is, the object class candidate set of the source domain images and the object class candidate set of the target domain images are the same. The set of object classes includes the object types of interest (foreground), for example: cars, buses, motorcycles, bicycles, pedestrians, etc. Further, the set of object classes Sc includes a background class. Typically, the regions outside the annotated regions in the images of the source and target domains both default to background. Several regions randomly selected in the background of the images of the source and target domains can serve as background-class instance regions.
In one embodiment, the object detection model M includes a feature extractor F and an R-network based on the Faster R-CNN (Faster Regions with CNN features) framework. The R-network is configured to determine region-of-interest features of the input image. The R-network is further configured to determine a bounding box with a classification label for each region of interest (ROI) of the input image. The R-network may, for example, comprise a Region Proposal Network (RPN). The feature extractor F performs convolution processing on the input image and outputs a feature map of the image. The region proposal network RPN can output a region-of-interest feature corresponding to each region of interest based on the output (feature map) of the feature extractor F. Each region-of-interest feature characterizes the position of an object instance detected by the model. The localization loss can be determined using the region-of-interest features with reference to the actual location information of the object instances in the annotation information. Regarding Faster R-CNN, reference may be made to the following document:
Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
Further, the R-network may include an additional classification feature extraction layer FC. The additional classification feature extraction layer FC follows and interfaces with the RPN to extract instance classification features for classification from the region-of-interest features determined by the RPN. Each instance classification feature may characterize the classification of an object instance of interest detected by the model in an image. The classification loss is determined using the instance classification features with reference to the labeled classification information of the object instances in the annotation information. Considering that different types of object instances may occur at the same position of an image, it is preferable to provide the additional classification feature extraction layer FC to extract instance classification features for classification, instead of directly determining the type of an object instance using the region-of-interest feature, which is advantageous for improving the performance of the object detection model.
In one embodiment, the R-network of the object detection model M incorporates the SWDA (strong-weak distribution alignment) technique. For SWDA, reference can be made to the following document:
Saito K, Ushiku Y, Harada T, et al. Strong-weak distribution alignment for adaptive object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6956-6965.
In this embodiment, the R-network integrates weak global alignment and strong local alignment. SWDA is an unsupervised domain adaptation (UDA) framework for object detection based on Faster R-CNN. For this purpose, the R-network also comprises a local discriminator $D_l$ and a global discriminator $D_g$. The feature extractor F can be decomposed as $F = F_2 \circ F_1$, where $F_1$ is the extractor related to local features and $F_2$ is the extractor related to global features. Since the images in the target domain are scarce and loosely labeled, minimizing a detection loss on the target domain would not only cause overfitting but could also cause training to fail, because the loosely labeled images of the target domain contain a lot of label noise. Therefore, the training method of the present disclosure calculates the source domain detection loss $L_{det}$ and does not calculate a detection loss for the target domain. For the current training iteration round, the source domain detection loss $L_{det}$ can be calculated as shown in equation (1):

$$L_{det}(F, R) = \frac{1}{n_s} \sum_{i=1}^{n_s} L\big(R(F(x_i^s)),\, Y_i^s\big) \tag{1}$$

where L represents the object detection loss, which consists of the classification loss and the regression loss (i.e., localization loss) of the bounding boxes.
Although instance-level alignment can effectively improve the performance of the object detection model, instance-level alignment alone may not guarantee the performance of the object detection domain-adaptive model. Thus, in this embodiment, the model training method integrates the weak global alignment and the strong local alignment of SWDA. First, image-level features are learned with weak global alignment. For the current training iteration round, the weak global alignment loss $L_{global}$ of the global discriminator $D_g$ is as shown in equation (4):

$$L_{global_s} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \big(1 - D_g(F(x_i^s))\big)^{\gamma} \log D_g(F(x_i^s)) \tag{2}$$

$$L_{global_t} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \big(D_g(F(x_i^t))\big)^{\gamma} \log\big(1 - D_g(F(x_i^t))\big) \tag{3}$$

$$L_{global}(F, D_g) = \frac{1}{2}\big(L_{global_s} + L_{global_t}\big) \tag{4}$$

where $\gamma$ controls the weight given to samples that are harder to classify.
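A PyTorch sketch of this focal-style weak global alignment loss, under the reconstruction of equations (2)–(4) above; $D_g$ is assumed to output the probability that a global image feature comes from the source domain.

```python
import torch

def weak_global_alignment_loss(dg_src, dg_tgt, gamma=5.0, eps=1e-7):
    """dg_src / dg_tgt: D_g outputs in (0, 1) for source / target images."""
    dg_src = dg_src.clamp(eps, 1 - eps)
    dg_tgt = dg_tgt.clamp(eps, 1 - eps)
    # Focal-style modulation: gamma up-weights hard-to-classify samples.
    loss_s = -(((1 - dg_src) ** gamma) * torch.log(dg_src)).mean()
    loss_t = -((dg_tgt ** gamma) * torch.log(1 - dg_tgt)).mean()
    return 0.5 * (loss_s + loss_t)
```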
Second, strong local alignment is used to learn local-level features, such as texture or color. For the current training iteration round, the strong local alignment loss $L_{loc}$ of the local discriminator $D_l$ is as shown in equation (7):

$$L_{loc_s} = \frac{1}{n_s W H} \sum_{i=1}^{n_s} \sum_{w=1}^{W} \sum_{h=1}^{H} \big(D_l(F_1(x_i^s))_{wh}\big)^2 \tag{5}$$

$$L_{loc_t} = \frac{1}{n_t W H} \sum_{i=1}^{n_t} \sum_{w=1}^{W} \sum_{h=1}^{H} \big(1 - D_l(F_1(x_i^t))_{wh}\big)^2 \tag{6}$$

$$L_{loc}(F_1, D_l) = \frac{1}{2}\big(L_{loc_s} + L_{loc_t}\big) \tag{7}$$

where W and H denote the width and height of the features extracted by the feature extractor $F_1$. The adversarial loss $L_{adv}$ implementing the global and local alignment is as shown in equation (8):

$$L_{adv}(F, D) = L_{loc}(F_1, D_l) + L_{global}(F, D_g) \tag{8}$$

That is, the adversarial loss $L_{adv}$ includes the weak global alignment loss $L_{global}$ determined by the global discriminator $D_g$ based on image-level features and the strong local alignment loss $L_{loc}$ determined by the local discriminator $D_l$ based on local-level features.
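A matching sketch of the strong local alignment loss of equations (5)–(7), assuming $D_l$ outputs a per-location map in (0, 1) over the W×H grid of $F_1$'s feature map:

```python
import torch

def strong_local_alignment_loss(dl_src, dl_tgt):
    """dl_src / dl_tgt: D_l output maps of shape (n, 1, H, W) in (0, 1)."""
    loss_s = (dl_src ** 2).mean()            # push source locations toward 0
    loss_t = ((1 - dl_tgt) ** 2).mean()      # push target locations toward 1
    return 0.5 * (loss_s + loss_t)
```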
The instance-level alignment involved in the model training method of the present disclosure is further described below.
In some embodiments of the present disclosure, the model training method performs instance-level alignment on the features extracted by the additional classification feature layer. Unlike traditional methods that align only foreground ROI (region of interest) features, in some embodiments the model training method aligns not only the features of the foreground ROIs but also the features of the reference boxes of the background, corresponding to background-class instances. This is because, with sample-point instance alignment, computing the instance-level alignment loss requires computing the intra-class distance and inter-class distance of each instance separately. However, in some scenes there is only one type of foreground, e.g., detecting cars and ignoring other objects. In such a scenario, if only the foreground were considered, the inter-class distance, and hence the instance alignment loss, could not be calculated. Of course, the sample-point-based instance alignment in this disclosure may also be used to align only foreground classes if there are multiple foreground classes.
In one embodiment, a moving average class center, for example represented by a feature vector, may also be treated as an instance feature and added to the instance classification feature set to assist in determining the instance-level alignment loss. Determining the instance-level alignment loss (e.g., step S107 in FIG. 1) is described below with reference to FIG. 3. FIG. 3 illustrates an exemplary flow diagram of a method 300 for determining an instance-level alignment loss according to one embodiment of this disclosure.
The processing objects of method 300 are the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$. Each instance feature in an instance classification feature set may be referred to as an instance point; the instance points of each class have a distribution in the feature space determined by the respective instance features. Therefore, it is helpful to describe the method 300 with reference to the changes in the distribution of the instance points. FIG. 4 shows schematic distributions of instance points at different processing stages in the feature space according to an embodiment of the present disclosure. FIG. 4 (a_s) shows the source domain instance point distribution corresponding to the initial source domain instance classification feature set $O_s$ (e.g., composed of the classification features, output by the additional classification feature extraction layer FC, of the source domain images $x_i^s$ in the subset $B_s$); FIG. 4 (a_t) shows the target domain instance point distribution corresponding to the initial target domain instance classification feature set $O_t$ (e.g., composed of the classification features, output by the additional classification feature extraction layer FC, of the target domain images $x_i^t$ in the subset $B_t$). In this example, the set of object classes Sc comprises 4 classes, corresponding to k = 0 to 3, where k = 0 corresponds to the background class. An example of background noise is shown in the upper right corner of FIG. 4 (a_t). At this stage, an instance classification feature set may lack feature points of some class. For example, as shown in FIG. 4 (a_s), the source domain feature points lack instance points of k = 3; as shown in FIG. 4 (a_t), the target domain feature points lack instance points of k = 1 and k = 3. FIGs. 4 (a_s) and 4 (a_t) also schematically show that, for the same category, the number of source domain instance points (labels) is greater than the number of target domain instance points (labels).
In step S301, an average class center for each class of the source domain for the current training iteration round is determined based on the source domain instance classification feature set $O_s$. For example, when the set of object classes Sc includes 4 classes, this step typically determines 4 average class centers. The average class center $c_k^s$ of class k of the source domain may be determined, for example, using equation (9), where k is the class index:

$$c_k^s = \frac{1}{|O_s^k|} \sum_{x^s \in O_s^k} x^s \tag{9}$$

where $x^s$ is a source domain instance classification feature in the subset $O_s^k = \{x^s \in O_s \mid y^s = k\}$ of source domain instance classification features of class k in the source domain instance classification feature set $O_s$, and $|O_s^k|$ is the number of object instances of class k, i.e., the number of instance classification features in the subset $O_s^k$.
In step S303, an average class center for each class of the target domain for the current training iteration round is determined based on the target domain instance classification feature set $O_t$. For example, when the set of object classes Sc includes 4 classes, this step determines 4 average class centers. The average class center $c_k^t$ of class k of the target domain may be determined, for example, using equation (10), where k is the class index:

$$c_k^t = \frac{1}{|O_t^k|} \sum_{x^t \in O_t^k} x^t \tag{10}$$

where $x^t$ is a target domain instance classification feature in the subset $O_t^k = \{x^t \in O_t \mid y^t = k\}$ of target domain instance classification features of class k in the target domain instance classification feature set $O_t$, and $|O_t^k|$ is the number of object instances of class k, i.e., the number of instance classification features in the subset $O_t^k$.
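As an illustration of equations (9)/(10), a PyTorch sketch of the per-class average centers; the tensor shapes and function name are assumptions for illustration:

```python
import torch

def average_class_centers(feats, labels, num_classes):
    """feats: (n, d) instance classification features; labels: (n,) class indices.
    Returns a dict {k: mean feature of class k} for classes present in the batch."""
    centers = {}
    for k in range(num_classes):
        mask = labels == k
        if mask.any():                      # a class may be absent in this round
            centers[k] = feats[mask].mean(dim=0)
    return centers
```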
In step S305, for the source domain, a moving average class center for each class of the source domain for the current training iteration round is determined based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round. The moving average class center $C_{k,j}^s$ of the k-th class of the source domain for the j-th training iteration round (the current training iteration round) can be determined using equation (11):

$$C_{k,j}^s = \theta\, C_{k,j-1}^s + (1 - \theta)\, c_k^s \tag{11}$$

where $C_{k,j-1}^s$ is the moving average class center of the k-th class of the source domain for the previous training iteration round, and $\theta$ denotes the moving average coefficient. $C_{k,0}^s$ (i.e., the value used when j = 1) can be set to 0.
For the target domain, in step S307, a moving average class center for each class of the target domain for the current training iteration round is determined based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round. The moving average class center $C_{k,j}^t$ of the k-th class of the target domain for the j-th training iteration round (the current training iteration round) can be determined using equation (12):

$$C_{k,j}^t = \theta\, C_{k,j-1}^t + (1 - \theta)\, c_k^t \tag{12}$$

where $C_{k,j-1}^t$ is the moving average class center of the k-th class of the target domain for the previous training iteration round, and $\theta$ denotes the moving average coefficient. $C_{k,0}^t$ (i.e., the value used when j = 1) can be set to 0.
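Under the reconstruction of equations (11)/(12), the moving-average class-center update can be sketched as follows; the dict-of-tensors state and the coefficient theta are illustrative assumptions:

```python
def update_moving_centers(moving, batch_centers, theta=0.9):
    """moving: {k: C_[k,j-1]} from the previous round (empty dict initially, i.e. 0);
    batch_centers: {k: c_k} from average_class_centers() for the current round."""
    for k, c in batch_centers.items():
        if k in moving:
            moving[k] = theta * moving[k] + (1 - theta) * c   # eq. (11)/(12)
        else:
            moving[k] = (1 - theta) * c                       # previous center was 0
    return moving
```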
In step S309, the source domain instance classification feature set is updated by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set. FIG. 4 (b_s) schematically shows the source domain instance point distribution with the moving average class center of each class of the source domain added, where each solid geometric shape corresponds to a feature point in the source domain representing the moving average class center of one class.
In step S311, the target domain instance classification feature set is updated by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set. FIG. 4 (b_t) schematically shows the target domain instance point distribution with the moving average class center of each class of the target domain added, where each solid geometric shape corresponds to a feature point in the target domain representing the moving average class center of one class. Adding the moving average class centers facilitates computing the cross-domain inter-class and intra-class distances for all instances of all classes.
In step S313, an instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set is determined. It should be noted that, if the moving average class center of a certain class of the object class set Sc is zero in a certain training iteration round (for example, in the first training iteration round), no alignment is performed on the instance point of the instance type, and no instance-level alignment loss for the instance type is calculated, that is, no alignment loss related to the instance type is included in the instance-level alignment loss.
In one embodiment, updating the instance classification feature sets may also include deleting the background class instances. Since the ROIs represented by the reference boxes of the background carry a lot of label noise, in this embodiment the classification features of the corresponding background class instances in the source domain instance classification feature set are deleted while the moving average class center of the background class is retained in the source domain instance classification feature set; likewise, the classification features of the corresponding background class instances in the target domain instance classification feature set are deleted while the moving average class center of the background class is retained in the target domain instance classification feature set. The background deletion operation may be performed according to equations (13) and (14):
$$O_s \leftarrow \{x^s \in O_s \mid y^s \neq 0\} \cup \{C_{k,j}^s\}_{k=0}^{K} \tag{13}$$

$$O_t \leftarrow \{x^t \in O_t \mid y^t \neq 0\} \cup \{C_{k,j}^t\}_{k=0}^{K} \tag{14}$$

where K represents the total number of classes in the set of object classes Sc minus 1 (so that the class index k runs from 0 to K), and $y^s = 0$ or $y^t = 0$ denotes the background class.
Deleting the feature points (classification features) of the background class instances while retaining the moving average class center of the background class helps suppress label noise and improve the performance of the object detection model. In the present disclosure, the operation of "deleting the feature points (classification features) of the background class instances while retaining the background class moving average class center" is also referred to simply as "deleting the background class instances".
FIG. 4 (c_s) schematically shows the distribution of the various classes of instance points of the source domain after the background class instances are deleted, and FIG. 4 (c_t) schematically shows the distribution of the various classes of instance points of the target domain after the background class instances are deleted. In FIGs. 4 (c_s) and 4 (c_t), it can be seen that the true background class instance points, represented by the open triangles, have been removed, while the background class moving average class center points, represented by the filled triangles, remain.
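A sketch of this deletion step on a simple list-of-(feature, label) representation, where label 0 is the background class; the representation and function name are chosen here for illustration only:

```python
def delete_background_instances(instances, centers):
    """instances: list of (feat, label) real instance points;
    centers: dict {k: moving-average center feat}, kept for every class incl. k=0."""
    kept = [(f, y) for (f, y) in instances if y != 0]   # drop noisy background points
    # Centers (including the background-class center) stay in the feature set.
    kept += [(c, k) for k, c in centers.items()]
    return kept
```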
In one embodiment, updating the instance classification feature sets may also include undersampling. It is known that the class imbalance problem in a training sample set can degrade the performance of machine learning. Similarly, an imbalanced instance distribution can negatively impact instance-level alignment. For example, see document 1: the distribution of instances is very uneven on the Cityscapes dataset, with the two categories "car" and "person" accounting for the vast majority of instances.

Document 1: Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3213-3223.
Therefore, to alleviate the performance impact of this problem, in this embodiment updating the instance classification feature sets further includes undersampling the source domain instance classification feature set $O_s$ and the target domain instance classification feature set $O_t$. Specifically, the maximum number of instances of each class is limited by discarding instances at each training iteration round. The discarding update may be implemented by a function undersampling() as shown in equations (15) and (16):

$$O_s \leftarrow \mathrm{undersampling}(O_s, N_{max}) \tag{15}$$

$$O_t \leftarrow \mathrm{undersampling}(O_t, N_{max}) \tag{16}$$

where undersampling() is a predefined function that, by randomly discarding instances, limits the maximum number of instances of each class so as not to exceed a given threshold $N_{max}$, and each instance corresponds to a respective instance classification feature. Only true instance features are discarded; the quasi-instance features corresponding to the moving average class centers are not discarded. If an instance classification feature set already contains moving average class centers, the moving average class centers remain in the instance classification feature set after the discarding update (i.e., after undersampling). For $O_s$ and $O_t$, it is possible to check, class by class, whether the number of instances of the respective class is greater than the given threshold, and if the check result is "yes", to reduce the number of instances of that class to $N_{max}$ by randomly discarding instances of that class.

FIG. 4 (d_s) schematically shows the source domain instance point distribution after the discarding update, where some instance points (instance features) of k = 1 and k = 2 are discarded according to the given threshold; FIG. 4 (d_t) schematically shows the target domain instance point distribution after the discarding update, where a fraction of the instance points (instance features) of k = 2 are discarded. The undersampling is performed after the moving average class centers are calculated and before the alignment loss is actually calculated. It is noted that in FIG. 4 the exemplary set of object classes Sc includes a background class, which corresponds to the instance points of k = 0. It can be appreciated that if the undersampling is performed before the background class instances are deleted, the background class instances need not be undersampled, which helps reduce training time. It will be appreciated that the given threshold for each class need not be exactly the same; apart from the background class, however, substantially the same, or even the same, thresholds are preferred.
Undersampling helps balance the instance distribution and improve the performance of the object detection model.
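A sketch of the undersampling() function over the same list representation, with the per-class cap $N_{max}$ as an assumed hyperparameter; only real instance points are discarded, never the moving-average centers (which are kept elsewhere):

```python
import random

def undersampling(instances, n_max, rng=random):
    """instances: list of (feat, label) real instance points.
    Randomly keeps at most n_max instances per class."""
    by_class = {}
    for f, y in instances:
        by_class.setdefault(y, []).append((f, y))
    kept = []
    for y, items in by_class.items():
        if len(items) > n_max:
            items = rng.sample(items, n_max)   # random discard down to the cap
        kept.extend(items)
    return kept
```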
In one embodiment, updating the instance classification feature sets includes, for both the source domain instance classification feature set and the target domain instance classification feature set, adding the moving average class center of each class, deleting the background class instances, and undersampling.
After the updated source domain instance classification feature set and target domain instance classification feature set are obtained, the instance-level alignment loss $L_{ins}$ can be determined based on the alignment of the feature points in the two feature sets. In one embodiment, the instance-level alignment loss $L_{ins}$ is an extended d-SNE loss that takes maximizing the minimum absolute inter-class distance into account. For d-SNE, reference is made to document 2.
Document 2: Xu X, Zhou X, Venkatesan R, et al. d-SNE: Domain adaptation using stochastic neighborhood embedding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
d-SNE is currently a sample-point-based alignment method with good performance. The d-SNE loss is shown in equation (17):

$$L_{d\text{-}SNE} = \sum_{x^t \in O_t} \Big( \sup_{x^s \in O_s^{k}} d(x^s, x^t) \;-\; \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t) \Big), \quad k = y^t \tag{17}$$

where $d(x^s, x^t)$ denotes the square of the Euclidean distance between $x^s$ and $x^t$ in the feature space, and k is the class of $x^t$, i.e., k = $y^t$. sup{ } denotes the upper bound of the intra-class distances between cross-domain features, and inf{ } denotes the lower bound of the inter-class distances between cross-domain features. Thus, the d-SNE loss achieves sample-point-based instance-level alignment by minimizing the largest cross-domain intra-class distance while maximizing the smallest cross-domain inter-class distance. In one example, the instance-level alignment loss of the present disclosure may be determined according to equation (17). Further, for computational efficiency, in one example the d-SNE loss may be defined by equation (18):

$$L_{d\text{-}SNE} = \sum_{x^t \in O_t} \max\Big(0,\; m + \sup_{x^s \in O_s^{k}} d(x^s, x^t) - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\Big) \tag{18}$$

where m is a predefined margin value and max() takes the maximum. m can be determined empirically; one exemplary value is 1. In one example, the instance-level alignment loss of the present disclosure may be determined according to equation (18). However, the implementation of the d-SNE loss shown in equation (18) only increases the relative difference between the maximum intra-class distance and the minimum inter-class distance, and does not maximize the minimum absolute inter-class distance. To address this problem, in one example an improved instance-level alignment loss, i.e., the extended d-SNE loss, is employed, determined using equation (19):

$$L_{ins} = \sum_{x^t \in O_t} \Big( \max\big(0,\; m + \sup_{x^s \in O_s^{k}} d(x^s, x^t) - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\big) + \max\big(0,\; m_2 - \inf_{x^s \in O_s \setminus O_s^{k}} d(x^s, x^t)\big) \Big) \tag{19}$$

where $m_2$ is another predefined margin value used to maximize the minimum absolute inter-class distance. $m_2$ can be determined empirically; one exemplary value is 30. The extended d-SNE loss in this embodiment (see equation (19)) uses an additional hyperparameter $m_2$ relative to the original d-SNE loss (see equation (17)), so that the classes can be better separated.
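A PyTorch sketch of the extended d-SNE loss as reconstructed above, using squared Euclidean distances and per-target max intra-class and min inter-class distances; the margins m and m2 and the function name are assumed for illustration:

```python
import torch

def extended_dsne_loss(feats_s, ys, feats_t, yt, m=1.0, m2=30.0):
    """feats_s: (n_s, d), ys: (n_s,); feats_t: (n_t, d), yt: (n_t,)."""
    d2 = torch.cdist(feats_t, feats_s) ** 2           # (n_t, n_s) squared distances
    same = yt.unsqueeze(1) == ys.unsqueeze(0)         # cross-domain same-class mask
    inf = torch.tensor(float("inf"), device=d2.device)
    max_intra = torch.where(same, d2, -inf).amax(dim=1)   # largest intra-class distance
    min_inter = torch.where(~same, d2, inf).amin(dim=1)   # smallest inter-class distance
    valid = torch.isfinite(max_intra) & torch.isfinite(min_inter)
    relative = torch.relu(m + max_intra[valid] - min_inter[valid])   # eq. (18) term
    absolute = torch.relu(m2 - min_inter[valid])                     # extra eq. (19) term
    return (relative + absolute).sum()
```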
Adjusting the model parameters will use the overall objective function. The overall objective function is described further below.
In one embodiment, the total loss may be a linear combination of the detection loss $L_{det}$ and the instance-level alignment loss $L_{ins}$. Further, the total loss may be a linear combination of the adversarial loss $L_{adv}$, the detection loss $L_{det}$, and the instance-level alignment loss $L_{ins}$. Specifically, the total loss $L_{total}$ can be determined as in equation (20):

$$L_{total} = L_{det}(F, R) - \lambda_1 L_{adv}(F, D) + \lambda_2 L_{ins}(F, R) \tag{20}$$

$\lambda_1$ may, for example, take a value between 0.1 and 1 depending on the sample data set. $\lambda_2 = \min(0.1, p^2)$, where p may gradually increase from 0 to 1 during training. $\lambda_2$ may also be set to a fixed value, for example $\lambda_2 = 1$.
The overall objective function can be defined using the min-max loss function (see equation (21)). Optimizing the object detection model by adjusting the parameters of the object detection model is achieved using the overall objective function:

$$\min_{F,R}\ \max_{D_l, D_g}\ L_{total} \tag{21}$$

where $\min_{F,R}$ represents minimizing the total loss by adjusting the parameters of F and R, and $\max_{D_l, D_g}$ represents maximizing the total loss by adjusting the parameters of $D_l$ and $D_g$. The min-max loss function may be implemented by a gradient reversal layer (GRL). For the min-max loss function, reference is made to document 3.

Document 3: Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 2016, 17(1): 2096-2030.
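A standard gradient reversal layer sketch in PyTorch; this implementation pattern is the common one for GRLs generally and is not taken verbatim from the present disclosure.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    # Insert between the feature extractor and a discriminator so that
    # minimizing the discriminator loss maximizes it w.r.t. the features.
    return GradReverse.apply(x, lam)
```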
FIG. 4 (e) schematically shows the effect of adjusting the parameters of the object detection model on the alignment of the feature points. In FIG. 4 (e), in order to clearly show the effect of parameter adjustment on alignment, the source domain instance points and the target domain instance points determined by the object detection model after parameter adjustment are shown merged in the same feature space. As shown in FIG. 4 (e), after the parameters of the object detection model are adjusted, feature points of the same class tend to be more concentrated and better aligned, the intra-class distances decrease, and feature points of different classes tend to have larger intervals, i.e., larger inter-class distances.
One aspect of the present disclosure provides an object detection method. This method is exemplarily described below with reference to fig. 5.
Fig. 5 illustrates an exemplary flow diagram of an object detection method 500 according to one embodiment of the disclosure.
In step S501, the object detection model M is trained. Specifically, the object detection model M is trained using the model training method of the present disclosure (e.g., the method 200 shown in FIG. 2).
In step S503, an image to be detected is detected. Specifically, the trained object detection model is used to determine the location and type of the object in the image to be detected.
One aspect of the present disclosure provides an apparatus for training an object detection model. The apparatus is described below with reference to FIG. 6. FIG. 6 shows a block diagram of the structure of an apparatus 600 for training an object detection model according to an embodiment of the present disclosure. The apparatus 600 is used to train an object detection model in an iterative manner. The object detection model is based on a neural network. The apparatus 600 comprises: a detection loss determining unit 601, a classification feature set determining unit 603, an alignment loss determining unit 605, a total loss determining unit 607, and an optimizing unit 609. The detection loss determining unit 601 is configured to: determine a detection loss for the source domain data subset based on the source domain data subset with the larger number of labels corresponding to the at least one fully labeled source domain image for the current training iteration round. The classification feature set determining unit 603 is configured to: determine a source domain instance classification feature set for the at least one fully labeled source domain image, and determine a target domain instance classification feature set for the at least one loosely labeled target domain image. The alignment loss determining unit 605 is configured to: determine an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set. The total loss determining unit 607 is configured to: determine a total loss based on the detection loss and the instance alignment loss. The optimizing unit 609 is configured to: optimize the object detection model by adjusting parameters of the object detection model based on the total loss. The source domain data subset and the target domain data subset are from a source domain data set with a larger number of labels and a target domain data set with a smaller number of labels, respectively. There is a correspondence between the apparatus 600 and the method 200, and further details of the apparatus 600 can be found in the description of the method 200 herein. For example, the classification feature set determining unit 603 is further configured to perform at least one of the following operations: determining the moving average class centers of the classes of the source domain and the target domain, adding each moving average class center to the corresponding instance classification feature set, deleting the background class instances in the instance classification feature sets, and undersampling the instance classification feature sets. Optionally, the apparatus 600 may further comprise an adversarial loss determining unit. The adversarial loss determining unit is used to determine an adversarial loss for the source domain data set and the target domain data set. The adversarial loss determining unit is coupled to the total loss determining unit 607 so that the adversarial loss is also included in the total loss.
According to one aspect of the present disclosure, an apparatus for training an object detection model is provided. The apparatus is described below with reference to FIG. 7. FIG. 7 illustrates an apparatus 700 for training an object detection model according to one embodiment of the present disclosure. The device includes: a memory 701 having instructions stored thereon; and one or more processors 703, the one or more processors being capable of communicating with the memory to execute the instructions retrieved from the memory, and the instructions causing the one or more processors to: reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round; processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; processing at least one loosely labeled target domain image through an object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determining an instance-level alignment loss associated with the instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimizing the object detection model by adjusting parameters of the object detection model based on the total loss associated with the detection loss and the instance alignment loss. There is a correspondence between the device 700 and the method 200, and further details of the device 700 can be found in the description of the method 200 herein.
One aspect of the present disclosure provides a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to: read, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a source domain data subset having the larger number of labels corresponding to at least one fully labeled source domain image and a target domain data subset having the smaller number of labels corresponding to at least one loosely labeled target domain image for the current training iteration round; process the at least one fully labeled source domain image through an object detection model to determine a detection loss for the source domain data subset and a source domain instance classification feature set for the at least one fully labeled source domain image; process the at least one loosely labeled target domain image through the object detection model to determine a target domain instance classification feature set for the at least one loosely labeled target domain image; determine an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and optimize the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss. There is a correspondence between the program and the method 200, and further details of the program can be found in the description of the method 200 herein.
One aspect of the present disclosure provides a computer-readable storage medium having a program stored thereon. The program causes a computer running the program to implement the method 500.
According to an aspect of the present disclosure, there is also provided an information processing apparatus.
Fig. 8 is an exemplary block diagram of an information processing apparatus 800 according to one embodiment of the present disclosure. In fig. 8, a Central Processing Unit (CPU) 801 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores, as needed, the data required when the CPU 801 executes the various processes.
The CPU 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output interface 805 is also connected to the bus 804.
The following components are connected to the input/output interface 805: an input section 806 including a soft keyboard and the like; an output section 807 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 808 such as a hard disk; and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the Internet, a local area network, a mobile network, or a combination thereof.
A drive 810 is also connected to the input/output interface 805 as needed. A removable medium 811, such as a semiconductor memory, is mounted on the drive 810 as needed, so that a program read therefrom is installed into the storage section 808 as needed.
The CPU 801 may run a program for implementing the method for training an object detection model according to the present disclosure or a program for implementing the object detection method of the present disclosure.
The effects of the scheme of the present disclosure are described below.
The following three scenarios were constructed for experiments comparing the detection accuracy of the scheme of the present disclosure with existing methods: (1) migration from Cityscapes to Foggy Cityscapes (C->F); (2) migration from SIM10K to Cityscapes (S->C, i.e., training with labeled samples of SIM10K together with a small number of labeled samples of Cityscapes); and (3) migration from Udacity to Cityscapes (U->C). The first scenario, C->F, simulates the data deviation caused by domain shift due to weather change. The second scenario, S->C, simulates the data deviation between a virtual world and the real world. The third scenario, U->C, simulates the data deviation between two different real-world settings due to illumination, camera angle, and the like. The experimental results are shown in Tables 1 and 2.
TABLE 1 Experimental results for C->F
[Table 1 is provided as images in the original publication; it lists the detection accuracy (mAP) of the compared methods (Source-only, Target-only, UDA, FUDA, FDA, and PICA+SWDA) in the C->F scenario.]
TABLE 2 Experimental results for S->C and U->C
[Table 2 is provided as an image in the original publication; it lists the detection accuracy (mAP) of the compared methods in the S->C and U->C scenarios.]
The sources of the reference data are as follows.
[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
[2] Saito K, Ushiku Y, Harada T, et al. Strong-weak distribution alignment for adaptive object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6956-6965.
[3] Zhuang C, Han X, Huang W, et al. iFAN: Image-instance full alignment networks for adaptive object detection. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 13122-13129.
[4] Wu A, Han Y, Zhu L, Yang Y. Instance-invariant domain adaptive object detection via progressive disentanglement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[5] Wang T, Zhang X, Yuan L, et al. Few-shot adaptive Faster R-CNN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7173-7182.
Here, Source-only denotes training using only the fully labeled source domain data; Target-only denotes training using only the loosely labeled target domain data; UDA denotes unsupervised domain adaptation methods, which use all unlabeled target domain data for domain adaptation; FUDA denotes few-shot unsupervised domain adaptation methods; FDA denotes few-shot domain adaptation methods; PICA+SWDA denotes the method adopted by the present disclosure, where PICA stands for "point-wise instance and center alignment"; mAP (0.5) denotes mean Average Precision with a threshold of 0.5; and the decimal values in the tables are the detection accuracy mAP.
In the S->C and U->C scenarios, 8 target domain images are used, and only 3 automobiles are labeled in each image; in the C->F scenario, 8 target domain images are used, each image corresponds to one class, and only one instance of the corresponding class is labeled in each image. The FUDA methods use the same 8 images as the FDA methods, but do not use the corresponding annotations.
The experimental results in Tables 1 and 2 indicate that the method of the present disclosure (PICA+SWDA) outperforms the existing methods FAFRCNN and SWDA in all of the C->F, S->C, and U->C scenarios.
Aspects of the present disclosure relate to an additional classification feature extraction layer, a countermeasure loss, the use of a small number of loosely labeled target domain images, moving average class center alignment, deletion of background class instances, undersampling, and an improved instance-level alignment loss. The benefits of the present disclosure include at least one of the following: robustness to label noise, overcoming class imbalance, improved instance-level alignment, and improved detection accuracy.
While the invention has been described in terms of specific embodiments thereof, it will be appreciated that those skilled in the art will be able to devise various modifications (including combinations and substitutions of features between the embodiments, where appropriate), improvements and equivalents of the invention within the spirit and scope of the appended claims. Such modifications, improvements and equivalents are also intended to be included within the scope of the present invention.
It should be emphasized that the term "comprises/comprising", when used herein, specifies the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Furthermore, the methods of the embodiments of the present invention are not limited to being performed in the time sequence described in the specification or shown in the drawings, and may be performed in other time sequences, in parallel, or independently, where technically possible. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Supplementary note
The present disclosure includes, but is not limited to, the following.
1. A computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network, characterized in that a current training iteration round comprises the operations of:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, a subset of source domain data having the larger number of labels corresponding to at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to at least one loosely labeled target domain image, respectively, for the current training iteration round;
processing the at least one fully labeled source domain image through the object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss. One possible form of such an iteration is sketched below.
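To make the flow of supplementary note 1 concrete, the following is a minimal PyTorch-style sketch of one training iteration. The model's input/output interface, the dictionary keys, and the loss weight lambda_align are illustrative assumptions rather than the disclosure's actual implementation, and the alignment term here is only a placeholder (the disclosed instance-level alignment loss is detailed in supplementary notes 7 to 10).

```python
# Hypothetical sketch of one training iteration of supplementary note 1.
# The model interface and all names here are illustrative assumptions.
import torch

def training_iteration(model, optimizer, source_subset, target_subset,
                       lambda_align=0.1):
    # The source subset is fully labeled; the target subset is loosely
    # labeled (only a few instances per image are annotated).
    src_images, src_boxes, src_classes = source_subset
    tgt_images, tgt_boxes, tgt_classes = target_subset

    # Supervised detection loss and instance classification features
    # on the source domain.
    src_out = model(src_images, src_boxes, src_classes)
    detection_loss = src_out["detection_loss"]
    src_features = src_out["instance_classification_features"]

    # Instance classification features on the target domain.
    tgt_out = model(tgt_images, tgt_boxes, tgt_classes)
    tgt_features = tgt_out["instance_classification_features"]

    # Instance-level alignment loss between the two feature sets
    # (placeholder: squared distance between domain feature means).
    align_loss = ((src_features.mean(0) - tgt_features.mean(0)) ** 2).sum()

    # Total loss and parameter update.
    total_loss = detection_loss + lambda_align * align_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```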
2. The method of supplementary note 1, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
3. The method according to supplementary note 1, wherein the object detection model includes an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
4. The method of supplementary note 3, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
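As one way to realize such a layer, the sketch below adds a small projection head on top of pooled region-of-interest features; the feature dimensions (2048 and 256) and the single-linear-layer design are assumptions for illustration, not the disclosure's actual architecture.

```python
# Hypothetical sketch of the additional classification feature
# extraction layer of supplementary note 4 (dimensions are assumptions).
import torch
import torch.nn as nn

class ClassificationFeatureExtractor(nn.Module):
    """Maps each region-of-interest feature to a compact instance
    classification feature used for classification and alignment."""

    def __init__(self, roi_feat_dim=2048, cls_feat_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(roi_feat_dim, cls_feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, roi_feat_dim) pooled ROI features.
        return self.proj(roi_features)

# Usage sketch: 128 regions of interest from one image.
extractor = ClassificationFeatureExtractor()
roi_features = torch.randn(128, 2048)
instance_classification_features = extractor(roi_features)  # (128, 256)
```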
5. The method according to supplementary note 1, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
6. The method according to supplementary note 5, wherein the R network includes a global discriminator and a local discriminator, and the countermeasure loss includes a weak global alignment loss determined by the global discriminator based on image-level features and a strong local alignment loss determined by the local discriminator based on local-level features.
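Supplementary note 6 follows the strong-weak alignment of reference [2]; the sketch below gives one hedged reading of the two losses, with a focal-style weak global term and a least-squares strong local term. The discriminator output shapes, the gamma value, and the gradient reversal layer that would normally sit between the backbone and the discriminators (not shown) are assumptions.

```python
# Hypothetical sketch of the weak global / strong local adversarial
# losses of supplementary note 6, in the style of reference [2].
import torch
import torch.nn.functional as F

def weak_global_alignment_loss(global_logits, is_source, gamma=5.0):
    # global_logits: (N, 1) discriminator outputs on image-level features.
    # A focal-style modulation down-weights easily classified examples,
    # so the alignment pressure on the global features stays "weak".
    p = torch.sigmoid(global_logits)
    target = torch.full_like(p, 1.0 if is_source else 0.0)
    pt = p if is_source else (1 - p)
    bce = F.binary_cross_entropy(p, target, reduction="none")
    return (((1 - pt) ** gamma) * bce).mean()

def strong_local_alignment_loss(local_map, is_source):
    # local_map: (N, 1, H, W) per-location discriminator outputs on
    # low-level feature maps; a least-squares loss aligns them strongly.
    target = 1.0 if is_source else 0.0
    return ((local_map - target) ** 2).mean()
```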
7. The method of supplementary note 2, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set. A sketch of the moving average class center update is given below.
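The following is a minimal sketch of the moving average class center update and of augmenting an instance classification feature set as described in supplementary note 7; the momentum value of 0.9 and the tensor layout are assumptions for illustration.

```python
# Hypothetical sketch of the moving average class centers of
# supplementary note 7 (the momentum value is an assumption).
import torch

def update_moving_class_centers(features, labels, moving_centers,
                                num_classes, momentum=0.9):
    """features: (N, D) instance classification features of one domain for
    the current training iteration round; labels: (N,) class ids;
    moving_centers: (num_classes, D) centers from the previous round."""
    new_centers = moving_centers.clone()
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            current_center = features[mask].mean(0)  # average class center
            new_centers[c] = (momentum * moving_centers[c]
                              + (1 - momentum) * current_center)
    return new_centers

def augment_with_centers(features, labels, centers):
    # The updated centers are appended to the instance classification
    # feature set before computing the instance-level alignment loss.
    center_labels = torch.arange(centers.size(0))
    return (torch.cat([features, centers], dim=0),
            torch.cat([labels, center_labels], dim=0))
```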
8. The method of supplementary note 7, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
9. The method of supplementary note 7, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set. A combined sketch of this deletion and the undersampling of supplementary note 8 is given below.
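One possible combined realization of supplementary notes 8 and 9 is sketched below: background class instance features are deleted (the background class's moving average center, added per supplementary note 7, remains in the set) and each remaining class is undersampled to a cap, which counters class imbalance. The per-class cap of 16 is an assumption.

```python
# Hypothetical sketch of supplementary notes 8 and 9 (the per-class
# cap is an assumption).
import torch

def drop_background_and_undersample(features, labels, background_class,
                                    max_per_class=16):
    # Delete the classification features of background class instances;
    # the background moving average class center is kept elsewhere.
    keep = labels != background_class
    features, labels = features[keep], labels[keep]

    # Undersample each remaining class to at most max_per_class features.
    kept_indices = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() > max_per_class:
            idx = idx[torch.randperm(idx.numel())[:max_per_class]]
        kept_indices.append(idx)
    if not kept_indices:
        kept = torch.empty(0, dtype=torch.long)
    else:
        kept = torch.cat(kept_indices)
    return features[kept], labels[kept]
```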
10. The method according to supplementary note 1, wherein the instance-level alignment penalty is an extended d-SNE penalty that takes into account maximizing the minimum absolute inter-class distance.
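The exact form of the extended d-SNE penalty is not reproduced here, so the sketch below augments the standard d-SNE hinge (largest intra-class distance versus smallest inter-class distance) with an absolute-margin term that pushes the minimum inter-class distance upward; the margin value and the specific combination of the two terms are assumptions.

```python
# Hypothetical sketch of an extended d-SNE instance alignment loss
# (supplementary note 10); the margin and combination are assumptions.
import torch

def extended_dsne_loss(src_feats, src_labels, tgt_feats, tgt_labels,
                       margin=1.0):
    loss = src_feats.new_zeros(())
    for i in range(tgt_feats.size(0)):
        d = ((src_feats - tgt_feats[i]) ** 2).sum(dim=1)  # squared distances
        same = src_labels == tgt_labels[i]
        if same.any() and (~same).any():
            max_intra = d[same].max()    # farthest same-class source instance
            min_inter = d[~same].min()   # nearest other-class source instance
            # Relative term: keep intra-class distances below inter-class ones.
            relative = torch.relu(max_intra - min_inter)
            # Absolute term: maximize the minimum inter-class distance by
            # penalizing it when it falls below an absolute margin.
            absolute = torch.relu(margin - min_inter)
            loss = loss + relative + absolute
    return loss / max(tgt_feats.size(0), 1)
```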
11. An object detection method, comprising:
training the object detection model using the method of any one of supplementary notes 1 to 10; and
determining the position and the type of an object in an image to be detected by using the trained object detection model. An inference-time usage sketch is given below.
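At detection time the trained model is used directly on the image to be detected; the following minimal usage sketch assumes a dictionary-style output interface, which is an assumption rather than the disclosure's actual API.

```python
# Hypothetical inference sketch for the object detection method of
# supplementary note 11 (the output keys are assumptions).
import torch

@torch.no_grad()
def detect(model, image):
    model.eval()
    out = model(image.unsqueeze(0))  # batch of one image to detect
    # Positions (bounding boxes) and types (class labels) of each instance.
    return out["boxes"], out["classes"], out["scores"]
```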
12. A computer-readable storage medium having a program stored thereon, the program causing a computer running the program to:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a subset of source domain data having the larger number of labels corresponding to the at least one fully labeled source domain image and a subset of target domain data having the smaller number of labels corresponding to the at least one loosely labeled target domain image for the current training iteration round;
processing the at least one fully labeled source domain image through an object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss.
13. The computer-readable storage medium of supplementary note 12, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
14. The computer-readable storage medium of supplementary note 12, wherein the object detection model includes an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
15. The computer-readable storage medium of supplementary note 14, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
16. The computer-readable storage medium of supplementary note 12, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
17. The computer-readable storage medium of supplementary note 16, wherein the R network includes a global discriminator and a local discriminator, and the countermeasure loss includes a weak global alignment loss determined by the global discriminator based on image-level features and a strong local alignment loss determined by the local discriminator based on local-level features.
18. The computer-readable storage medium of supplementary note 13, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set.
19. The computer-readable storage medium of supplementary note 18, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
20. The computer-readable storage medium of supplementary note 18, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set.

Claims (10)

1. A computer-implemented method for training an object detection model, the method comprising training the object detection model in an iterative manner, and the object detection model being based on a neural network, characterized in that a current training iteration round comprises the operations of:
reading, from the source domain data set having the larger number of labels and the target domain data set having the smaller number of labels, respectively, a source domain data subset having the larger number of labels corresponding to at least one fully labeled source domain image and a target domain data subset having the smaller number of labels corresponding to at least one loosely labeled target domain image for the current training iteration round;
processing the at least one fully labeled source domain image through the object detection model to determine a detection loss for the subset of source domain data and a set of source domain instance classification features for the at least one fully labeled source domain image;
processing the at least one loosely labeled target domain image through the object detection model to determine a set of target domain instance classification features for the at least one loosely labeled target domain image;
determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set; and
optimizing the object detection model by adjusting parameters of the object detection model based on a total loss related to the detection loss and the instance alignment loss.
2. The method of claim 1, wherein the object detection model is trained using the at least one fully labeled source domain image and the at least one loosely labeled target domain image based on a same set of object classes, and the same set of object classes includes a background class.
3. The method of claim 2, wherein determining an instance-level alignment loss related to instance feature alignment based on the source domain instance classification feature set and the target domain instance classification feature set comprises:
determining an average class center of each class of the source domain for the current training iteration round based on the source domain instance classification feature set;
determining an average class center of each class of the target domain for the current training iteration round based on the target domain instance classification feature set;
for the source domain, determining a moving average class center of each class of the source domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
for the target domain, determining a moving average class center of each class of the target domain for the current training iteration round based on the average class center of each class for the current training iteration round and the moving average class center of each class for the previous training iteration round;
updating the source domain instance classification feature set by adding the moving average class center of each class of the source domain for the current training iteration round to the source domain instance classification feature set;
updating the target domain instance classification feature set by adding the moving average class center of each class of the target domain for the current training iteration round to the target domain instance classification feature set; and
determining the instance-level alignment loss between the updated source domain instance classification feature set and the updated target domain instance classification feature set.
4. The method of claim 3, wherein updating the source domain instance classification feature set further comprises: undersampling the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: undersampling the target domain instance classification feature set.
5. The method of claim 3, wherein updating the source domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the source domain instance classification feature set while retaining the moving average class center of the background class in the source domain instance classification feature set; and
updating the target domain instance classification feature set further comprises: deleting the classification features of the corresponding background class instances in the target domain instance classification feature set while retaining the moving average class center of the background class in the target domain instance classification feature set.
6. The method of claim 1, wherein the instance-level alignment penalty is an extended d-SNE penalty that takes into account maximizing a minimum absolute inter-class distance.
7. The method of claim 1, wherein the total loss is further related to a countermeasure loss for the source domain data subset and the target domain data subset.
8. The method of claim 1, wherein the object detection model comprises an R network;
the R network is based on the Faster R-CNN framework;
the R network is configured to determine region-of-interest features of an input image; and
the R network is further configured to determine a bounding box with a classification label for each region of interest of the input image.
9. The method of claim 8, wherein the R network includes an additional classification feature extraction layer; and
the additional classification feature extraction layer is configured to extract an instance classification feature for classification from each region-of-interest feature.
10. An object detection method, comprising:
training the object detection model using the method of any one of claims 1 to 9; and
determining the position and the type of an object in an image to be detected by using the trained object detection model.
CN202110949753.7A 2021-08-18 2021-08-18 Method for training object detection model and object detection method Pending CN115713111A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110949753.7A CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method
JP2022111473A JP2023029236A (en) 2021-08-18 2022-07-11 Method for training object detection model and object detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110949753.7A CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method

Publications (1)

Publication Number Publication Date
CN115713111A (en) 2023-02-24

Family

ID=85229982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949753.7A Pending CN115713111A (en) 2021-08-18 2021-08-18 Method for training object detection model and object detection method

Country Status (2)

Country Link
JP (1) JP2023029236A (en)
CN (1) CN115713111A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343050A (en) * 2023-05-26 2023-06-27 成都理工大学 Target detection method for remote sensing image noise annotation based on self-adaptive weight


Also Published As

Publication number Publication date
JP2023029236A (en) 2023-03-03

Similar Documents

Publication Publication Date Title
CN110020592B (en) Object detection model training method, device, computer equipment and storage medium
CN106599836A (en) Multi-face tracking method and tracking system
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
Ye et al. A two-stage real-time YOLOv2-based road marking detector with lightweight spatial transformation-invariant classification
CN113435319B (en) Classification method combining multi-target tracking and pedestrian angle recognition
WO2023038574A1 (en) Method and system for processing a target image
Luo et al. Adversarial style discrepancy minimization for unsupervised domain adaptation
CN115713111A (en) Method for training object detection model and object detection method
Shi et al. License plate localization in complex environments based on improved GrabCut algorithm
CN111582057B (en) Face verification method based on local receptive field
CN111753731A (en) Face quality evaluation method, device and system and training method of face quality evaluation model
CN111428567A (en) Pedestrian tracking system and method based on affine multi-task regression
CN116630685A (en) Method, system, medium, equipment and terminal for defending countermeasure sample
Han et al. Efficient joint model learning, segmentation and model updating for visual tracking
CN113344102B (en) Target image recognition method based on image HOG features and ELM model
CN115311654A (en) Rice appearance automatic extraction method, device, equipment and storage medium
CN112651996A (en) Target detection tracking method and device, electronic equipment and storage medium
Islam et al. Faster R-CNN based traffic sign detection and classification
Budiarsa et al. Face recognition for occluded face with mask region convolutional neural network and fully convolutional network: a literature review
CN109086730A (en) A kind of Handwritten Digit Recognition method, apparatus, equipment and readable storage medium storing program for executing
CN116824306B (en) Training method of pen stone fossil image recognition model based on multi-mode metadata
Yang et al. Pedestrian detection under dense crowd
CN118037738B (en) Asphalt pavement crack pouring adhesive bonding performance detection method and equipment
CN117876383B (en) Yolov5 l-based highway surface strip-shaped crack detection method
CN116778277B (en) Cross-domain model training method based on progressive information decoupling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination