WO2017124336A1 - Method and system for adapting deep model for object representation from source domain to target domain - Google Patents
- Publication number
- WO2017124336A1 (PCT/CN2016/071501)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- objects
- fine
- criterions
- deep model
- target domain
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
Definitions
- the criterions may contain information indicating which objects should not be inferred to have a same group label.
- the group label may indicate the property, name, classification and the like of the objects. For example, if the system is used for face recognition in a movie, the group label may be the name of the role. If the system is used for object detection in the photo, the group label may be the classification of the object, such as “chair” , “table” and the like.
- the system 100 runs to carry out its functions in an iterative way.
- the units 101-104 may be implemented as an iterative feedback loop.
- the feature extraction unit 101 extracts the features from the input images.
- the inference unit 102 infers group labels for the objects according to the extracted features.
- the criterions discovery unit 103 discovers criterions from the inferred group labels.
- the training unit 104 fine-tunes the deep model according to the discovered criterions. Then the next iteration is performed. This iterative feedback loop ends when the desired performance is achieved or the predetermined running time is reached.
- the deep model is fine-tuned several times and becomes more suitable for the target domain.
- In the starting iteration, the features for objects are extracted from the input images for the target domain by the deep model for the source domain; in the iterations following the starting iteration, the features for objects are extracted from the input images by the deep model fine-tuned in the previous iteration of the iterative feedback loop.
- the deep model fine-tuned in the last iteration is outputted
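The iterative feedback loop of units 101-104 can be sketched as follows; the four callables and their signatures are illustrative assumptions, not taken from the disclosure:

```python
def adapt_model(source_model, target_images, target_priors,
                extract_features, infer_group_labels,
                discover_criterions, fine_tune, num_iterations=3):
    """Sketch of the iterative feedback loop over units 101-104.

    The four callables stand in for the feature extraction, inference,
    criterions discovery and training units; their exact signatures
    are illustrative assumptions.
    """
    model = source_model
    for _ in range(num_iterations):
        # Unit 101: extract features with the current deep model
        # (the source model in the starting iteration, the model
        # fine-tuned in the previous iteration afterwards).
        features = extract_features(model, target_images)
        # Unit 102: infer a group label for each object.
        labels = infer_group_labels(features)
        # Unit 103: discover criterions -- pairs that share a label
        # but should not, according to distances or domain priors.
        criterions = discover_criterions(features, labels, target_priors)
        # Unit 104: fine-tune the model against the criterions.
        model = fine_tune(model, criterions)
    # The model fine-tuned in the last iteration is the output.
    return model
```

Injecting the four units as callables keeps the sketch runnable without committing to any particular deep-learning framework.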
- the feature extraction unit 101 may be configured with a deep convolutional network (DCN) that consists of successive convolutional filter banks. That is, the deep convolutional network is used as the deep model.
- the DCN may be initialized by training on a large source domain for image classification/recognition (e.g., large-scale image classification dataset IMAGENET, or large scale face dataset) , or received from other unit, or inputted by user.
- a large source domain for image classification/recognition e.g., large-scale image classification dataset IMAGENET, or large scale face dataset
- the pre-trained DCN may be a DCN used in DeepID2+.
- the input may be, for example, a 55×47 RGB face image.
- the DCN has a plurality of, for example four, successive convolution layers followed by one fully connected layer.
- Each convolution layer contains learnable filters and is followed by a 2×2 max-pooling layer and Rectified Linear Units (ReLUs) as the activation function. In this embodiment, the number of feature maps generated by each convolution layer will be 128, and the dimension of the face representation generated by the final fully connected layer will be 512.
- the DCN is pre-trained on CelebFace (as an example), with around 290,000 face images from 12,000 identities. The training process is conducted by back-propagation using both the identification and verification loss functions. It should be appreciated that other databases with different numbers of training face images may be applicable.
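As an illustration of the sizes involved, the spatial extent of the feature maps can be traced through the four convolution-plus-pooling stages; the 3×3 valid convolutions assumed below are illustrative, since the kernel sizes are not stated here:

```python
def stage_output_size(h, w, kernel=3, pool=2):
    """Spatial size after one valid convolution followed by max-pooling.

    The 3x3 kernel and floor-division pooling are illustrative
    assumptions; the text does not specify them.
    """
    h, w = h - kernel + 1, w - kernel + 1   # valid convolution
    return h // pool, w // pool             # 2x2 max-pooling

h, w = 55, 47                               # example RGB face input
for layer in range(1, 5):                   # four convolution layers
    h, w = stage_output_size(h, w)
    print(f"after conv{layer}: {h}x{w}, 128 feature maps")
```

Under these assumptions the maps shrink to 26×22, 12×10, 5×4 and finally 1×1 before the fully connected layer produces the 512-dimensional representation.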
- Fig. 2 shows the steps used for the inference unit according to some embodiments of the present application.
- the extracted features are fed into the inference unit 102, which then finds an appropriate group label distribution for the objects in the input images according to the extracted features, i.e., it infers the group label for each object according to its features.
- the process of inference may be implemented by the following steps.
- a judgment score for each of the candidate group label distributions for the objects is computed according to the features of the objects, wherein the higher the similarity between the features of the objects having the same group label is, the higher the judgment score is; i.e., the judgment score represents the degree of appropriateness of the corresponding distribution.
- the judgment scores of the different distributions are compared with each other, and the candidate group label distribution having the highest judgment score is determined.
- group labels for objects are inferred based on the determined distribution.
- the judgment score may be a value of a function that contains variables related to the features of the objects, the relation of the features or the like.
- the group label of each feature in X is denoted by a corresponding label in Y; the labels may be inferred by maximizing a function p (X, Y) :
- v (·, ·) is a pre-computed function that encodes the relation between any pair of features, where a positive relation (i.e., v (·, ·) > 0) means that the features are likely from the same character; otherwise, they belong to different characters.
- the computation of v is a combination of the similarity between appearances of a pair of features (i.e., the similarity between features of a pair of objects) ; and the pairwise spatial and temporal criterions of the features, which may be obtained from input images.
- the group label distribution that gives Eqn. (1) its highest value may be considered the most appropriate distribution and may be determined as the resulting group label distribution; the group labels for the objects can then be inferred.
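Steps S201 to S203 can be sketched as an exhaustive search over candidate group label distributions; the similarity-sum score below is a simplified stand-in for p (X, Y) (it ignores the spatial and temporal terms of v), and all names are illustrative:

```python
from itertools import product

def judgment_score(features, labels, similarity):
    """Sum of pairwise similarities over objects sharing a group label.

    A simplified stand-in for p(X, Y): the higher the similarity among
    same-label objects, the higher the score.
    """
    score = 0.0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if labels[i] == labels[j]:
                score += similarity(features[i], features[j])
    return score

def infer_labels(features, num_groups, similarity):
    """Steps S202/S203: pick the candidate distribution with the highest score."""
    best = max(product(range(num_groups), repeat=len(features)),
               key=lambda labels: judgment_score(features, labels, similarity))
    return list(best)
```

Exhaustive enumeration is exponential in the number of objects, so a practical system would maximize p (X, Y) with approximate inference instead; the sketch only illustrates the "score each candidate distribution and keep the best" logic.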
- Fig. 3 shows the steps used for the criterions discovery unit 103 according to some embodiments of the present application.
- the resulting group labels for objects as well as the input images are fed into the criterions discovery unit 103.
- In the criterions discovery unit 103, the following steps are performed.
- the degrees of difference between objects that are inferred to have the same group label are computed.
- the object pairs having a degree of difference larger than a threshold are chosen as the criterions.
- the object pairs that are inferred with the same group label but should have different group labels according to the target domain prior are chosen as the criterions.
- These criterions will be used in the training unit 104 to fine-tune the DCN of the feature extraction unit 101.
- In some embodiments, step S302 may be omitted; in some embodiments, step S303 may be omitted.
- the degrees of difference between objects that are inferred to have the same group label may be obtained by calculating the distance between the features of each pair of objects in the feature space, for example, by calculating the L2-distance between the features of two objects. Then the top 20% (or another percentage) of object pairs with the largest degree of difference (for example, L2-distance) are chosen as the criterions; that is, the object pairs having a degree of difference larger than a threshold are chosen as the criterions. For example, in the scenario where the 20% of object pairs with the largest degree of difference (for example, L2-distance) are chosen as the criterions, the threshold is the shortest L2-distance in the top 20% of all L2-distances.
- a large L2-distance means that the two objects likely belong to different group labels, so inferring the same label for two objects having a large L2-distance is likely an error; the DCN used to extract the features should be corrected, and the information that “these two objects belong to different labels” will be used as a criterion in the correction process. So, at step S302, the object pairs having a degree of difference larger than a threshold are chosen as the criterions.
- the whole similarity degree of all objects having the same group label may be calculated first, for example, as the trace of the covariance matrix, i.e., trace (Σ_l), wherein Σ_l denotes the covariance matrix of the Gaussian of the l-th group label; the lower the whole similarity degree is, the larger trace (Σ_l) is. Then only the objects with a group label whose trace (Σ_l) is larger than a threshold are considered when calculating the degree of difference between objects that are inferred to have the same group label.
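The distance-based selection of steps S301 and S302 can be sketched as follows (plain Euclidean distance, a fixed 20% fraction, and no trace-based pre-filtering of groups; all are simplifications of the description above):

```python
import math

def discover_distance_criterions(features, labels, top_fraction=0.2):
    """Choose same-label pairs whose L2-distance is in the top fraction.

    Such pairs were likely grouped together in error, so they become
    criterions ("these two objects belong to different labels") for
    fine-tuning. Feature vectors are plain tuples here for illustration.
    """
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    same_label_pairs = [(i, j)
                        for i in range(len(features))
                        for j in range(i + 1, len(features))
                        if labels[i] == labels[j]]
    # Rank pairs by distance, largest first.
    same_label_pairs.sort(key=lambda p: l2(features[p[0]], features[p[1]]),
                          reverse=True)
    # Keep the top fraction; the implied threshold is the smallest
    # distance among the kept pairs.
    keep = max(1, int(top_fraction * len(same_label_pairs)))
    return same_label_pairs[:keep]
```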
- the target domain prior comprises information on the objects in the input images or relationship between objects in the input images.
- the target domain prior can be the context extracted from the subtitle that helps to identify the character’s face.
- Other similar priors can be in a pairwise form: faces appearing in the same frame of a video/movie are unlikely to belong to the same person (negative pair), while any two faces in the same location in neighboring frames are more likely to belong to the same person (positive pair).
- At step S303, object pairs that are inferred to have the same group label but should have different group labels according to the target domain prior are chosen as the criterions.
- the criterions may contain the information on which pairs of objects that are assigned the same group label are actually not the same object.
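The pairwise priors can be derived mechanically from per-frame detections; the sketch below treats "same location" as an identical bounding box, a deliberate simplification (a real system would more likely use box overlap):

```python
def pairwise_priors(detections):
    """Derive positive/negative pairs from (frame_index, bounding_box) records.

    Treating "same location in neighboring frames" as an identical
    bounding box is an illustrative simplification; a practical system
    would use box overlap (IoU) instead.
    """
    positive, negative = [], []
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            frame_i, box_i = detections[i]
            frame_j, box_j = detections[j]
            if frame_i == frame_j:
                # Two faces in one frame are unlikely the same person.
                negative.append((i, j))
            elif abs(frame_i - frame_j) == 1 and box_i == box_j:
                # Same location in neighboring frames: likely the same person.
                positive.append((i, j))
    return positive, negative
```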
- Fig. 4 shows the steps used for the training unit 104 according to some embodiments of the present application.
- the original DCN or DCN used in the previous iteration is fine-tuned according to the discovered criterions.
- the parameters of the DCN are adjusted in order to make the extracted features more consistent with the criterions.
- At step S401, a fine-tuning score for each of the candidate parameter adjustments is computed according to the discovered criterions; at step S402, the candidate parameter adjustment having the highest fine-tuning score is determined as the resulting parameter adjustment of the deep model; and at step S403, the deep model is fine-tuned with the determined parameter adjustment, and the fine-tuned deep model for the target domain is outputted.
- the fine-tuning score may be inversely proportional to a value of a function that contains variables related to the features of the objects, the relation of features or the like.
- the function may be a contrastive loss function that encourages features of the objects of the same group label to be close and those of different group labels to be far away from each other.
- the formulation of the contrastive loss may be:
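The formulation itself is not reproduced in this text. A common margin-based contrastive loss of the kind described (an assumed form consistent with the surrounding description, with f_i and f_j the features of a pair of objects, y_ij = 1 for a same-label pair and y_ij = 0 otherwise, and m a margin) is:

```latex
E_c = \frac{1}{2} \sum_{(i,j)} \Big[\, y_{ij}\, \lVert f_i - f_j \rVert_2^2
      + (1 - y_{ij})\, \max\!\big(0,\; m - \lVert f_i - f_j \rVert_2\big)^2 \Big]
```

Under this form, pairs chosen as criterions enter with y_ij = 0, so minimizing E_c pushes their features apart.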
- the features extracted by the DCN with different parameter adjustments are different, so different values of E_c are obtained; the more consistent the features are with the criterions, the smaller the value of E_c is. By minimizing E_c, the most appropriate parameter adjustment may be obtained; that is, the parameter adjustment that makes E_c smallest is the most appropriate one.
- the candidate parameter adjustments may be included in a parameter adjustment set.
- the process of minimizing E_c may be an iterative process
- the candidate parameter adjustment may be obtained by modifying the parameter adjustment in the previous iteration
- the deep model may be fine-tuned with the determined parameter adjustment.
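Steps S401 to S403 can be sketched as follows; the dictionary-based parameter representation and the injected scoring function are illustrative assumptions, with the score standing in for consistency with the discovered criterions (e.g., the negative of E_c):

```python
def choose_and_apply_adjustment(model_params, candidate_adjustments,
                                fine_tuning_score):
    """Apply the candidate parameter adjustment with the highest score.

    `fine_tuning_score` stands in for evaluating an adjustment against
    the discovered criterions (step S401); representing parameters as
    a dict and applying an adjustment by merging are illustrative
    assumptions, not from the disclosure.
    """
    # Step S402: pick the highest-scoring adjustment.
    best = max(candidate_adjustments, key=fine_tuning_score)
    # Step S403: fine-tune the model with the chosen adjustment.
    return {**model_params, **best}
```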
- the triplet loss or other loss functions may also be used, which learn an embedding in which the distances between positive pairs are smaller than those between negative pairs.
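As a minimal sketch of this alternative, a hinge-style triplet loss on plain Python feature vectors (an illustrative formulation; the text does not fix the exact variant) could be:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on 1-D feature vectors (illustrative).

    Encourages the anchor-positive distance to be smaller than the
    anchor-negative distance by at least `margin`.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```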
- the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a “unit”, “circuit”, “module” or “system”.
- the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware.
- the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory to execute the executable components to perform operations of the system, as discussed in reference to Figs. 1-4.
- the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Abstract
A method for adapting a deep model for object representation from a source domain to a target domain, comprises: extracting, by the deep model for the source domain, features for objects from input images for the target domain; inferring group labels for objects according to the extracted features; discovering criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and fine-tuning the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as a deep model for the target domain. A system for adapting a deep model for object representation from a source domain to a target domain is also disclosed.
Description
The disclosures relate to a method and a system for adapting a deep model for object representation from a source domain to a target domain.
Deep learning approaches have achieved substantial advances for object (e.g., face, dogs, basketball) recognition. However, contemporary deep models, for example, deep convolution networks, usually overfit to the training data distributions, and thus will not be directly generalisable to other unseen target domains. In addition, the annotated data in the unseen target domain is usually not sufficient for training a new deep model. These problems limit deep learning in applications such as object tracking, retrieval, and clustering in unseen images/videos. One example is face clustering in movies, i.e., grouping detected faces into different subsets according to different characters. Clustering faces in movies is extremely challenging since characters’ appearance may vary drastically under different scenes as the story progresses. In addition, the various cinematic styles in different movies make it difficult to learn a universal face representation for all movies. Conventional techniques that assume fixed handcrafted features for clustering are infeasible for this problem: handcrafted features are susceptible to large appearance, illumination, and viewpoint variations, and thus cannot cope with the drastic appearance changes in movies.
Deep learning approaches have achieved substantial advances for object representation learning. These methods could arguably provide a more robust representation for object recognition. However, contemporary deep models for object recognition are trained with web images or photos from albums. These models overfit to the training data distributions and thus will not be directly generalisable to applications in a different target domain.
Therefore, it is desired to provide a method for adapting a deep model from the source domain to the target domain automatically.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a method for adapting a deep model for object representation from a source domain to a target domain, comprising: extracting, by a deep model for the source domain, features for objects from input images for the target domain; inferring group labels for objects according to the extracted features; discovering criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and fine-tuning the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as a deep model for the target domain.
In one embodiment of the present application, the extracting, the inferring, the discovering, and the fine-tuning are implemented in an iterative feedback loop that is performed a predetermined number of times, wherein in the starting iteration of the iterative feedback loop, the features for objects are extracted from the input images for the target domain by the deep model for the source domain, and in the iterations following the starting iteration, the features for objects are extracted from the input images for the target domain by the deep model fine-tuned in the previous iteration of the iterative feedback loop.
In one embodiment of the present application, the inferring comprises: computing, according to the extracted features of the objects, a judgment score for each of candidate group label distributions for the objects; determining a candidate group label distribution having the highest judgment score; and inferring, based on the determined distribution, group labels for objects, wherein the higher the similarity between the features of the objects having the same group label is, the higher the judgment score is.
In one embodiment of the present application, the target domain prior comprises information on the objects in the input images or relationship between objects in the input images.
In one embodiment of the present application, the discovering comprises: computing degrees of difference between objects that are inferred to have the same group label; and choosing pairs of objects having a degree of difference larger than a threshold as the criterions.
In one embodiment of the present application, the discovering comprises: choosing, as the criterions, pairs of objects that are inferred to have the same group label but should have different group labels according to the target domain prior.
In one embodiment of the present application, the fine-tuning comprises: computing a fine-tuning score for each of candidate parameter adjustments according to the discovered criterions; determining the candidate parameter adjustment having the highest fine-tuning score; and fine-tuning the deep model with the determined parameter adjustment, wherein the fine-tuning score indicates the similarity between the objects having a same group label, and the higher the similarity is, the higher the fine-tuning score is.
In an aspect, disclosed is a system for adapting a deep model for object representation from a source domain to a target domain, comprising: a feature extraction unit configured to receive the deep model for the source domain and use the deep model to extract features for objects from input images for the target domain; an inference unit configured to infer group labels for objects according to the extracted features; a criterions discovery unit configured to discover criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and a training unit configured to fine-tune the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as the deep model for the target domain.
In an aspect, disclosed is a system for adapting a deep model for object representation from a source domain to a target domain, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: extracting, by a deep model for the source domain, features for objects from input images for the target domain; inferring group labels for objects according to the extracted features; discovering criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and fine-tuning the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as the deep model for the target domain.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present application are described below with reference to the attached drawings. The drawings are illustrative
and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 shows the overall pipeline of the system for adapting a deep model for object representation from a source domain to a target domain according to some embodiments of the present application;
Fig. 2 shows the steps used for the inference unit according to some embodiments of the present application;
Fig. 3 shows the steps used for the criterions discovery unit according to some embodiments of the present application; and
Fig. 4 shows the steps used for the training unit according to some embodiments of the present application.
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be appreciated by one skilled in the art that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a” , “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising” , when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 shows the overall pipeline of the system for adapting a deep model for object representation from a source domain to a target domain according to some embodiments of the present application. In some embodiments, the deep model may be a deep convolutional network (DCN). The system 100 for adapting a deep model for object representation from a source domain to a target domain comprises a feature extraction unit 101, an inference unit 102, a criterions discovery unit 103 and a training unit 104. The feature extraction unit 101 is configured to extract features for objects from input images for the target domain by a deep model for the source domain; the inference unit 102 is configured to infer group labels for the objects according to the extracted features; the criterions discovery unit 103 is configured to discover criterions based on target domain priors derived from the input images and the inferred group labels; and the training unit 104 is configured to fine-tune the deep model for the source domain according to the discovered criterions and to output the fine-tuned deep model as the deep model for the target domain.
In some embodiments of the present application, the criterions may contain information indicating which objects should not be inferred to have the same group label. The group label may indicate the property, name, classification and the like of the objects. For example, if the system is used for face recognition in a movie, the group label may be the name of the role. If the system is used for object detection in a photo, the group label may be the classification of the object, such as “chair”, “table” and the like.
In some embodiments, the system 100 carries out its functions in an iterative way. In other words, the units 101-104 may be implemented as an iterative feedback loop. Specifically, in each iteration, the feature extraction unit 101 extracts the features from the input images. After that, the inference unit 102 infers group labels for the objects based on the extracted features. Then the criterions discovery unit 103 discovers criterions from the inferred group labels. With the discovered criterions, the training unit 104 fine-tunes the deep model, and the next iteration is performed. This iterative feedback loop ends when the desired performance is achieved or a predetermined running time is reached. In this way, the deep model is fine-tuned several times and becomes more suitable for the target domain. In the starting iteration of the iterative feedback loop, the features for the objects are extracted from the input images for the target domain by the deep model for the source domain; in iterations following the starting iteration, the features are extracted by the deep model fine-tuned in the previous iteration of the iterative feedback loop. At the end of the iterative feedback loop, the deep model fine-tuned in the last iteration is outputted.
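The iterative feedback loop of units 101-104 can be sketched in code. The following is a minimal, illustrative sketch only: the four functions and the linear "model" are hypothetical stand-ins for the units of Fig. 1, not the implementation disclosed here.

```python
import numpy as np

def extract_features(model, images):
    # Unit 101 stand-in: a linear map plays the role of the deep model.
    return images @ model["W"]

def infer_labels(features, n_groups=2):
    # Unit 102 stand-in: assign each object to the nearest of n_groups
    # fixed anchors in feature space (toy inference only).
    anchors = features[:n_groups]
    d = np.linalg.norm(features[:, None] - anchors[None], axis=-1)
    return d.argmin(axis=1)

def discover_criterions(features, labels):
    # Unit 103 stand-in: pairs sharing an inferred label but lying far
    # apart in feature space become criterions.
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if labels[i] == labels[j] and \
                    np.linalg.norm(features[i] - features[j]) > 1.0:
                pairs.append((i, j))
    return pairs

def fine_tune(model, criterions, lr=0.01):
    # Unit 104 stand-in: shrink the weights a little per violated pair.
    model["W"] = model["W"] * (1.0 - lr * len(criterions))
    return model

def adapt(model, images, n_iters=3):
    # The iterative feedback loop: extract -> infer -> discover -> fine-tune.
    for _ in range(n_iters):
        feats = extract_features(model, images)
        labels = infer_labels(feats)
        crits = discover_criterions(feats, labels)
        model = fine_tune(model, crits)
    return model  # outputted as the deep model for the target domain
```

In a real system the fine-tuning step would back-propagate a loss through the DCN; the multiplicative update above merely keeps the loop runnable.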
In some embodiments, the feature extraction unit 101 may be configured with a deep convolutional network (DCN) that consists of successive convolutional filter banks; that is, a deep convolutional network is used as the deep model. The DCN may be initialized by training on a large source domain for image classification/recognition (e.g., the large-scale image classification dataset IMAGENET, or a large-scale face dataset), received from another unit, or input by a user. For example, when the system 100 is used for face recognition, the pre-trained DCN may be the DCN used in DeepID2+. Specifically, the input may be, for example, a 55×47 RGB face image. The DCN has a plurality of, for example four, successive convolution layers followed by one fully connected layer. Each convolution layer contains learnable filters and is followed by a 2×2 max-pooling layer and Rectified Linear Units (ReLUs) as the activation function. In this embodiment, the number of feature maps generated by each convolution layer is 128, and the dimension of the face representation generated by the final fully connected layer is 512. The DCN is pre-trained on CelebFace (as an example), with around 290,000 face images from 12,000 identities. The training process is conducted by back-propagation using both the identification and verification loss functions. It should be appreciated that other databases with different numbers of training face images may be applicable.
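As a rough illustration of how the stated sizes fit together, the sketch below tracks the spatial dimensions of the 55×47 input through four conv + 2×2 max-pool stages. The 3×3 filter size and "valid" (no-padding) convolution are assumptions for illustration only; the disclosure fixes just the input size, the layer count, the 128 feature maps per layer, and the 512-dimensional fully connected output.

```python
def dcn_output_shape(h=55, w=47, n_layers=4, k=3, channels=128):
    """Track spatial dims through conv(k x k, valid) + 2x2 max-pool layers.

    Filter size k and 'valid' padding are illustrative assumptions.
    """
    for _ in range(n_layers):
        h, w = h - k + 1, w - k + 1   # valid convolution shrinks by k-1
        h, w = h // 2, w // 2         # 2x2 max-pooling halves (floor)
    return h, w, channels

h, w, c = dcn_output_shape()
flat = h * w * c   # fed to the 512-d fully connected layer
```

Under these assumptions the final feature map is 1×1×128, which would be flattened and projected by the final fully connected layer to the 512-dimensional face representation.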
Fig. 2 shows the steps used for the inference unit according to some embodiments of the present application. In these embodiments, the extracted features are fed into the inference unit 102, and the inference unit 102 operates to find an appropriate group label distribution for the objects in the input images according to the extracted features, i.e., it infers the group label for each object according to the features thereof. The process of inference may be implemented by the following steps.
At step S201, a judgment score for each of the candidate group label distributions for the objects is computed according to the features of the objects, wherein the higher the similarity between the features of the objects having the same group label, the higher the judgment score; i.e., the judgment score represents the degree of appropriateness of the corresponding distribution. At step S202, the judgment scores of the different distributions are compared with each other, and the candidate group label distribution having the highest judgment score is determined. At step S203, group labels for the objects are inferred based on the determined distribution.
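Steps S201-S203 can be sketched with a toy judgment score. The score below (negative mean intra-group distance) is an illustrative stand-in for the probabilistic score of Eqn. (1), and the exhaustive enumeration of candidate distributions is feasible only at toy scale.

```python
import itertools
import numpy as np

def judgment_score(features, labels):
    # S201: higher when features sharing a label are similar; here the
    # negative mean distance to each group's centroid (a toy stand-in).
    score = 0.0
    for l in set(labels):
        group = features[[i for i, y in enumerate(labels) if y == l]]
        if len(group) > 1:
            score -= np.mean(np.linalg.norm(group - group.mean(axis=0), axis=1))
    return score

def infer_group_labels(features, n_labels=2):
    # S202: compare the scores of all candidate distributions (note the
    # exponential cost; real inference would optimize instead).
    candidates = itertools.product(range(n_labels), repeat=len(features))
    # S203: the best-scoring distribution yields the inferred labels.
    return max(candidates, key=lambda y: judgment_score(features, list(y)))
```

On two tight clusters this recovers the expected grouping (up to label permutation).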
In a specific example, the judgment score may be the value of a function that contains variables related to the features of the objects, the relations between the features, or the like. For extracted features $X = \{x_{ij}\}$, where $x_{ij}$ denotes the feature of the $j$-th object of the $i$-th cluster and the clusters may be predetermined, the group label of each $x_{ij}$ in $X$ is denoted as $y_{ij}$, and the labels $Y = \{y_{ij}\}$ may be inferred by maximizing a function $p(X, Y)$:

$$p(X, Y) \propto \prod_{i,j} p(x_{ij} \mid y_{ij}) \prod_{(i',j') \in \mathcal{N}_{ij}} p(y_{ij}, y_{i'j'}), \qquad (1)$$

where $\mathcal{N}_{ij}$ signifies the set of input images that are the neighbors of $x_{ij}$ in the feature space, and $p(y_{ij} = l, y_{i'j'} = l')$ represents the probability of distributing the group labels $l$ and $l'$ to $x_{ij}$ and $x_{i'j'}$, respectively, i.e., to the objects to which $x_{ij}$ and $x_{i'j'}$ correspond. A Gaussian distribution may be employed to model the first term in Eqn. (1):

$$p(x_{ij} \mid y_{ij} = l) = \mathcal{N}(x_{ij};\, \mu_l, \Sigma_l), \qquad (2)$$

where $\mu_l$ and $\Sigma_l$ denote the mean and covariance matrix of the Gaussian of the $l$-th character, which are obtained and updated in the learning process. The second term in Eqn. (1) is defined as

$$p(y_{ij} = l, y_{i'j'} = l') \propto \exp\big\{\alpha\, \upsilon(x_{ij}, x_{i'j'})\, \big(\mathbb{1}(l = l') - \mathbb{1}(l \neq l')\big)\big\}, \qquad (3)$$

wherein $\mathbb{1}(\cdot)$ is the indicator function and $\alpha$ is a trade-off coefficient between Eqn. (2) and Eqn. (3). Furthermore, $\upsilon(\cdot,\cdot)$ is a pre-computed function that encodes the relation between any pair of features $x_{ij}$ and $x_{i'j'}$, where a positive relation (i.e., $\upsilon(\cdot,\cdot) > 0$) means that the features are likely from the same character; otherwise, they belong to different characters. Specifically, the computation of $\upsilon$ combines the similarity between the appearances of a pair of features (i.e., the similarity between the features of a pair of objects) with the pairwise spatial and temporal criterions of the features, which may be obtained from the input images. For instance, when the system is used for face representation learning and clustering in movies, face images at the same location in two successive frames belong to the same character, while face images appearing in the same frame belong to different characters. In general, Eqn. (3) encourages face images with a positive relation to be assigned the same character. For example, if $\upsilon(x_{ij}, x_{i'j'}) > 0$ and $l = l'$, then $p(y_{ij} = l, y_{i'j'} = l')$ is large; however, if $\upsilon(x_{ij}, x_{i'j'}) > 0$ but $l \neq l'$, then $p(y_{ij} = l, y_{i'j'} = l')$ is small, indicating that the group label distribution violates the pairwise criterions. The group label distribution $Y$ making Eqn. (1) attain its highest value may be considered the most appropriate distribution and determined as the resulting distribution, from which the group labels for the objects are inferred.
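The scoring of a labeling under Eqns. (1)-(3) can be sketched in log form. This is a simplifying sketch only: the diagonal covariances, the explicit neighbor list, and the precomputed υ values are assumptions made so the example is self-contained.

```python
import numpy as np

def log_p(features, labels, mus, sigmas, neighbors, upsilon, alpha=1.0):
    """Log of Eqn. (1) for a flat list of features (illustrative sketch).

    mus, sigmas: per-label Gaussian mean and *diagonal* variance vectors.
    neighbors:   index pairs (a, b) that are in each other's neighbor set.
    upsilon:     precomputed relation v(x_a, x_b) for each such pair.
    """
    s = 0.0
    for x, y in zip(features, labels):            # first term, Eqn. (2)
        d = x - mus[y]
        s += -0.5 * np.sum(d * d / sigmas[y]) \
             - 0.5 * np.sum(np.log(2 * np.pi * sigmas[y]))
    for (a, b), v in zip(neighbors, upsilon):     # second term, Eqn. (3)
        same = 1.0 if labels[a] == labels[b] else -1.0
        s += alpha * v * same                     # rewards agreement when v > 0
    return s
```

A labeling that agrees with a positive relation υ > 0 scores higher than one that splits the related pair, matching the behavior described for Eqn. (3).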
Fig. 3 shows the steps used for the criterions discovery unit 103 according to some embodiments of the present application. After inferring the group labels, the resulting group labels for objects as well as the input images are fed into the criterions discovery unit 103. In the criterions discovery unit 103, the following steps are performed. At step S301, the degrees of difference between objects that are inferred to have the same group label are computed. At step S302, the object pairs having a degree of difference larger than a threshold are chosen as the criterions. And at step S303, the object pairs that are inferred with the same group label but should have different group labels according to the target domain prior are chosen as the criterions. These criterions will be used in the training unit 104 to fine-tune the DCN of the feature extraction unit 101. In some embodiments, step S302 may be omitted; in some embodiments, step S303 may be omitted.
In some embodiments, the degrees of difference between objects that are inferred to have the same group label may be obtained by calculating the distance between the features of each pair of objects in the feature space, for example, the L2-distance between the features of two objects. Then the top 20% (or another percentage) of object pairs with the largest degree of difference (for example, the largest L2-distance) are chosen as the criterions; that is, the object pairs having a degree of difference larger than a threshold are chosen as the criterions. For example, in the scenario where the 20% of object pairs with the largest L2-distance are chosen as the criterions, the threshold is the shortest L2-distance among the top 20% of all L2-distances. A large L2-distance means that the two objects likely belong to different group labels, so an inference that assigns the same group label to two objects with a large L2-distance is likely erroneous; the DCN used to extract the features should be corrected, and the information that "these two objects belong to different labels" is used as a criterion in the correction process. Accordingly, at step S302, the object pairs having a degree of difference larger than a threshold are chosen as the criterions.
In some embodiments, before calculating the degrees of difference between objects that are inferred to have the same group label, an overall similarity degree of all objects having the same group label may first be calculated, for example, as the trace of the covariance matrix, trace(Σl), where Σl denotes the covariance matrix of the Gaussian of the l-th group label; the lower the overall similarity degree, the larger trace(Σl) is. Then only the objects whose group label has trace(Σl) larger than a threshold are considered when calculating the degrees of difference between objects that are inferred to have the same group label.
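Steps S301-S302, including the optional trace(Σl) pre-filter above, can be sketched as follows. The 20% fraction and the use of the empirical covariance trace are taken from the text; treating the threshold as "keep the top fraction of pairwise L2-distances" is the interpretation given above.

```python
import numpy as np
from itertools import combinations

def discover_distance_criterions(features, labels, top_frac=0.2,
                                 trace_thresh=None):
    """S301/S302 sketch: within each inferred group, take the top
    `top_frac` most-distant feature pairs as criterions.

    If `trace_thresh` is given, only groups whose empirical covariance
    trace exceeds it (i.e., groups with low overall similarity) are
    examined, as in the pre-filtering variant described above.
    """
    pairs, dists = [], []
    for l in set(labels):
        idx = [i for i, y in enumerate(labels) if y == l]
        if len(idx) < 2:
            continue
        group = features[idx]
        if trace_thresh is not None and \
                np.trace(np.cov(group.T)) <= trace_thresh:
            continue
        for a, b in combinations(idx, 2):
            pairs.append((a, b))
            dists.append(np.linalg.norm(features[a] - features[b]))  # L2
    if not pairs:
        return []
    k = max(1, int(round(top_frac * len(pairs))))
    order = np.argsort(dists)[::-1]          # largest distances first
    return [pairs[i] for i in order[:k]]
```

Each returned pair carries the information "these two objects belong to different labels" for the training unit 104.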
In some embodiments, the target domain prior comprises information on the objects in the input images or on the relationships between objects in the input images. For example, when the system is used for face tracking or clustering in a movie, the target domain prior can be the context extracted from the subtitles that helps to identify a character's face. Other similar priors can be in a pairwise form: faces appearing in the same frame of a video/movie are unlikely to belong to the same person (negative pair), while any two faces in the same location in neighboring frames are more likely to belong to the same person (positive pair). If a pair of objects are inferred to have the same group label, but it is known from the target domain prior that these two objects should not have the same group label, the label inference for these two objects is likely erroneous; the DCN used to extract the features should be corrected, and the information that "these two objects belong to different labels" is used as a criterion in the correction process. Accordingly, at step S303, object pairs that are inferred to have the same group label but should have different group labels according to the target domain prior are chosen as the criterions.
In some embodiments, the criterions may contain information on which pairs of objects, although distributed the same group label, are actually not the same object.
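Step S303 can be sketched using one of the pairwise priors named in the text: two faces appearing in the same frame are unlikely to be the same person. The `frame_of` input, mapping each object to its frame index, is an assumed representation of the target domain prior.

```python
def discover_prior_criterions(labels, frame_of):
    """S303 sketch: same-frame pairs that were nevertheless inferred to
    share a group label become criterions (negative pairs).

    labels:   inferred group label per object.
    frame_of: frame index per object (assumed prior from the video).
    """
    criterions = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if frame_of[i] == frame_of[j] and labels[i] == labels[j]:
                criterions.append((i, j))  # should have different labels
    return criterions
```

A positive-pair prior (same location in neighboring frames) could be handled symmetrically, flagging pairs inferred to have *different* labels.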
Fig. 4 shows the steps used for the training unit 104 according to some embodiments of the present application. In the training unit, the original DCN, or the DCN used in the previous iteration, is fine-tuned according to the discovered criterions. The parameters of the DCN are adjusted in order to make the extracted features more consistent with the criterions. At step S401, a fine-tuning score for each of the candidate parameter adjustments is computed according to the discovered criterions; at step S402, the candidate parameter adjustment having the highest fine-tuning score is determined as the resulting parameter adjustment of the deep model; and at step S403, the deep model is fine-tuned with the determined parameter adjustment, and the fine-tuned deep model for the target domain is outputted.
In some embodiments, the fine-tuning score may be inversely proportional to the value of a function that contains variables related to the features of the objects, the relations between the features, or the like. For example, for the criterions obtained from the criterions discovery unit 103, the function may be a contrastive loss that encourages the features of objects with the same group label to be close and those with different group labels to be far away from each other. The contrastive loss may be formulated as

$$E_c = \begin{cases} \tfrac{1}{2}\, \lVert x_i - x_j \rVert_2^2, & C(I_i, I_j) = 1, \\[2pt] \tfrac{1}{2}\, \max\!\big(0,\ \tau - \lVert x_i - x_j \rVert_2\big)^2, & C(I_i, I_j) = -1, \end{cases}$$

where $E_c$ is the loss, $I_i$ and $I_j$ are the objects $i$ and $j$, $x_i$ and $x_j$ denote the corresponding features, and $\tau$ is the margin between different identities. $C(I_i, I_j) = 1$ means that the objects $I_i$ and $I_j$ have the same group label, while $C(I_i, I_j) = -1$ means that they have different group labels. When the system is used for face recognition, $I_i$ and $I_j$ may be the face images $i$ and $j$, $x$ may denote the corresponding feature, and $\tau$ may be the margin between different identities; $C(I_i, I_j) = 1$ may mean that the face images $I_i$ and $I_j$ belong to the same person, while $C(I_i, I_j) = -1$ may mean that they belong to different persons. The features extracted by the DCN under different parameter adjustments differ, yielding different values of $E_c$: the more consistent the features are with the criterions, the smaller $E_c$ is. By minimizing $E_c$, the most appropriate parameter adjustment may be obtained; that is, the parameter adjustment making $E_c$ smallest is the most appropriate one. In some embodiments, the candidate parameter adjustments may be included in a parameter adjustment set. In some embodiments, the process of minimizing $E_c$ may be iterative: each candidate parameter adjustment is obtained by modifying the adjustment from the previous iteration, and the process ends when the value of $E_c$ converges. After minimizing $E_c$, the deep model may be fine-tuned with the determined parameter adjustment.
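A numerical sketch of a contrastive loss of this kind (a common formulation, assumed here since the patent's own rendering of the equation is incomplete): same-label pairs are penalized by their squared distance, and different-label pairs only when they fall inside the margin τ.

```python
import numpy as np

def contrastive_loss(x_i, x_j, c, tau=1.0):
    """Contrastive loss E_c for one pair of features.

    c = 1  : same group label, pull the features together.
    c = -1 : different group labels, push them beyond margin tau.
    The exact functional form is an assumed standard formulation.
    """
    d = np.linalg.norm(x_i - x_j)
    if c == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, tau - d) ** 2
```

Identical same-label features incur zero loss; identical different-label features incur the maximal margin penalty, which is exactly the situation the discovered criterions are meant to correct.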
In some embodiments, the triplet loss or other loss functions may also be used, which learn an embedding in which the distances between the positive pairs are smaller than those between the negative pairs.
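For completeness, the triplet alternative can be sketched in its standard form; the margin value is an illustrative assumption, as the patent does not specify one.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss sketch: the anchor-positive distance should
    be smaller than the anchor-negative distance by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```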
As will be appreciated by one skilled in the art, the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “unit”, “circuit”, “module” or “system”. Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present application, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory, that executes the executable components to perform the operations of the system, as discussed with reference to Figs. 1-4. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Although the preferred examples of the present application have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present application.
Obviously, those skilled in the art can make variations or modifications to the present application without departing from the spirit and scope of the present application. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall within the scope of the present application.
Claims (21)
- A method for adapting a deep model for object representation from a source domain to a target domain, comprising: extracting, by the deep model for the source domain, features for objects from input images for the target domain; inferring group labels for objects according to the extracted features; discovering criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and fine-tuning the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as a deep model for the target domain.
- The method of claim 1, wherein the extracting, the inferring, the discovering, and the fine-tuning are implemented in an iterative feedback loop that is performed for predetermined times, wherein in a starting iteration of the iterative feedback loop, the features for objects are extracted from input images for the target domain by the deep model for the source domain, and in iterations following the starting iteration, the features for objects are extracted from input images for the target domain by the fine-tuned deep model fine-tuned in a previous iteration of the iterative feedback loop.
- The method of claim 1 or 2, wherein the inferring comprises: computing, according to the extracted features of the objects, a judgment score for each of candidate group label distributions for the objects; determining a candidate group label distribution having a highest judgment score; and inferring, based on the determined distribution, group labels for objects, wherein the higher the similarity between the features of the objects having a same group label is, the higher the judgment score is.
- The method of claim 1 or 2, wherein the target domain prior comprises information on the objects in the input images or relationship between objects in the input images.
- The method of claim 1 or 2, wherein the discovering comprises: computing degrees of difference between objects that are inferred to have the same group label; and choosing pairs of objects, having a degree of difference larger than a threshold, as the criterions.
- The method of claim 1 or 2, wherein the discovering comprises: choosing pairs of objects from the objects, which are inferred to have the same group label but should have different group labels according to the target domain prior, as the criterions.
- The method of claim 6, wherein the fine-tuning comprises: computing a fine-tuning score for each of candidate parameter adjustments according to the discovered criterions; determining the candidate parameter adjustment having a highest fine-tuning score; and fine-tuning the deep model with the determined parameter adjustment, wherein the fine-tuning score indicates the similarity between the objects having a same group label, and the higher the similarity is, the higher the fine-tuning score is.
- A system for adapting a deep model for object representation from a source domain to a target domain, comprising: a feature extraction unit configured to receive the deep model for the source domain and use the deep model to extract features for objects from input images for the target domain; an inference unit configured to infer group labels for objects according to the extracted features; a criterions discovery unit configured to discover criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and a training unit configured to fine-tune the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as a deep model for the target domain.
- The system of claim 8, wherein the feature extraction unit, the inference unit, the criterions discovery unit, and the training unit are implemented in an iterative feedback loop that is performed for predetermined times, wherein in a starting iteration of the iterative feedback loop, the features for objects are extracted from input images for the target domain by the deep model for the source domain, and in iterations following the starting iteration, the features for objects are extracted from input images for the target domain by the fine-tuned deep model fine-tuned in a previous iteration of the iterative feedback loop.
- The system of claim 8 or 9, wherein the inference unit is configured for: computing, according to the extracted features of the objects, a judgment score for each of candidate group label distributions for the objects; determining a candidate group label distribution having a highest judgment score; and inferring, based on the determined distribution, group labels for objects, wherein the higher the similarity between the features of the objects having a same group label is, the higher the judgment score is.
- The system of claim 8 or 9, wherein the target domain prior comprises information on the objects in the input images or relationship between objects in the input images.
- The system of claim 8 or 9, wherein the criterions discovery unit is configured for: computing degrees of difference between objects that are inferred to have the same group label; and choosing pairs of objects, having a degree of difference larger than a threshold, as the criterions.
- The system of claim 8 or 9, wherein the criterions discovery unit is configured for: choosing pairs of objects from the objects, which are inferred to have the same group label but should have different group labels according to the target domain prior, as the criterions.
- The system of claim 13, wherein the training unit is configured for: computing a fine-tuning score for each of candidate parameter adjustments according to the discovered criterions; determining the candidate parameter adjustment having a highest fine-tuning score; and fine-tuning the deep model with the determined parameter adjustment, wherein the fine-tuning score indicates the similarity between the objects having a same group label, and the higher the similarity is, the higher the fine-tuning score is.
- A system for object representation, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: extracting, by the deep model for the source domain, features for objects from input images for the target domain; inferring group labels for objects according to the extracted features; discovering criterions based on target domain priors derived from the input images and the inferred group labels, wherein the criterions contain information indicating which objects should not be inferred to have a same group label; and fine-tuning the deep model for the source domain according to the discovered criterions, wherein the fine-tuned deep model is outputted as a deep model for the target domain.
- The system of claim 15, wherein the extracting, the inferring, the discovering, and the fine-tuning are implemented in an iterative feedback loop that is performed for predetermined times, wherein in a starting iteration of the iterative feedback loop, the features for objects are extracted from input images for the target domain by the deep model for the source domain, and in iterations following the starting iteration, the features for objects are extracted from input images for the target domain by the fine-tuned deep model fine-tuned in a previous iteration of the iterative feedback loop.
- The system of claim 15 or 16, wherein the inferring comprises: computing, according to the extracted features of the objects, a judgment score for each of candidate group label distributions for the objects; determining a candidate group label distribution having a highest judgment score; and inferring, based on the determined distribution, group labels for objects, wherein the higher the similarity between the features of the objects having a same group label is, the higher the judgment score is.
- The system of claim 15 or 16, wherein the target domain prior comprises information on the objects in the input images or relationship between objects in the input images.
- The system of claim 15 or 16, wherein the discovering comprises: computing degrees of difference between objects that are inferred to have the same group label; and choosing pairs of objects, having a degree of difference larger than a threshold, as the criterions.
- The system of claim 15 or 16, wherein the discovering comprises: choosing pairs of objects from the objects, which are inferred to have the same group label but should have different group labels according to the target domain prior, as the criterions.
- The system of claim 20, wherein the fine-tuning comprises: computing a fine-tuning score for each of candidate parameter adjustments according to the discovered criterions; determining the candidate parameter adjustment having a highest fine-tuning score; and fine-tuning the deep model with the determined parameter adjustment, wherein the fine-tuning score indicates the similarity between the objects having a same group label, and the higher the similarity is, the higher the fine-tuning score is.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201680079452.0A CN108604304A (en) | 2016-01-20 | 2016-01-20 | For adapting the depth model indicated for object from source domain to the method and system of aiming field |
PCT/CN2016/071501 WO2017124336A1 (en) | 2016-01-20 | 2016-01-20 | Method and system for adapting deep model for object representation from source domain to target domain |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/071501 WO2017124336A1 (en) | 2016-01-20 | 2016-01-20 | Method and system for adapting deep model for object representation from source domain to target domain |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017124336A1 true WO2017124336A1 (en) | 2017-07-27 |
Family
ID=59361172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/071501 WO2017124336A1 (en) | 2016-01-20 | 2016-01-20 | Method and system for adapting deep model for object representation from source domain to target domain |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108604304A (en) |
WO (1) | WO2017124336A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112669247A (en) * | 2020-12-09 | 2021-04-16 | 深圳先进技术研究院 | Priori guidance type network for multitask medical image synthesis |
CN113255823B (en) * | 2021-06-15 | 2021-11-05 | 中国人民解放军国防科技大学 | Unsupervised domain adaptation method and unsupervised domain adaptation device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102902966A (en) * | 2012-10-12 | 2013-01-30 | 大连理工大学 | Super-resolution face recognition method based on deep belief networks |
CN103793718A (en) * | 2013-12-11 | 2014-05-14 | 台州学院 | Deep study-based facial expression recognition method |
CN104318215A (en) * | 2014-10-27 | 2015-01-28 | 中国科学院自动化研究所 | Cross view angle face recognition method based on domain robustness convolution feature learning |
CN104616033A (en) * | 2015-02-13 | 2015-05-13 | 重庆大学 | Fault diagnosis method for rolling bearing based on deep learning and SVM (Support Vector Machine) |
CN105160866A (en) * | 2015-08-07 | 2015-12-16 | 浙江高速信息工程技术有限公司 | Traffic flow prediction method based on deep learning nerve network structure |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582813B (en) * | 2009-06-26 | 2011-07-20 | 西安电子科技大学 | Distributed migration network learning-based intrusion detection system and method thereof |
CN101840569B (en) * | 2010-03-19 | 2011-12-07 | 西安电子科技大学 | Projection pursuit hyperspectral image segmentation method based on transfer learning |
US9231851B2 (en) * | 2011-01-31 | 2016-01-05 | Futurewei Technologies, Inc. | System and method for computing point-to-point label switched path crossing multiple domains |
US9681250B2 (en) * | 2013-05-24 | 2017-06-13 | University Of Maryland, College Park | Statistical modelling, interpolation, measurement and anthropometry based prediction of head-related transfer functions |
CN104199023B (en) * | 2014-09-15 | 2017-02-08 | 南京大学 | RFID indoor positioning system based on depth perception and operating method thereof |
- 2016-01-20 CN CN201680079452.0A patent/CN108604304A/en active Pending
- 2016-01-20 WO PCT/CN2016/071501 patent/WO2017124336A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102902966A (en) * | 2012-10-12 | 2013-01-30 | 大连理工大学 | Super-resolution face recognition method based on deep belief networks |
CN103793718A (en) * | 2013-12-11 | 2014-05-14 | 台州学院 | Deep study-based facial expression recognition method |
CN104318215A (en) * | 2014-10-27 | 2015-01-28 | 中国科学院自动化研究所 | Cross-view face recognition method based on domain-robust convolutional feature learning |
CN104616033A (en) * | 2015-02-13 | 2015-05-13 | 重庆大学 | Fault diagnosis method for rolling bearing based on deep learning and SVM (Support Vector Machine) |
CN105160866A (en) * | 2015-08-07 | 2015-12-16 | 浙江高速信息工程技术有限公司 | Traffic flow prediction method based on deep learning neural network structure |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11155809B2 (en) | 2014-06-24 | 2021-10-26 | Bio-Rad Laboratories, Inc. | Digital PCR barcoding |
CN113011568A (en) * | 2021-03-31 | 2021-06-22 | 华为技术有限公司 | Model training method, data processing method and equipment |
CN113159199A (en) * | 2021-04-27 | 2021-07-23 | 广东工业大学 | Cross-domain image classification method based on structural feature enhancement and class center matching |
CN113159199B (en) * | 2021-04-27 | 2022-12-27 | 广东工业大学 | Cross-domain image classification method based on structural feature enhancement and class center matching |
Also Published As
Publication number | Publication date |
---|---|
CN108604304A (en) | 2018-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chaudhuri et al. | Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method | |
US10902243B2 (en) | Vision based target tracking that distinguishes facial feature targets | |
US9449432B2 (en) | System and method for identifying faces in unconstrained media | |
CN108140032B (en) | Apparatus and method for automatic video summarization | |
WO2017124336A1 (en) | Method and system for adapting deep model for object representation from source domain to target domain | |
US9940577B2 (en) | Finding semantic parts in images | |
CN108288051B (en) | Pedestrian re-recognition model training method and device, electronic equipment and storage medium | |
CN108268823B (en) | Target re-identification method and device | |
CN108664526B (en) | Retrieval method and device | |
CN105100894A (en) | Automatic face annotation method and system | |
US9875397B2 (en) | Method of extracting feature of input image based on example pyramid, and facial recognition apparatus | |
CN109460774B (en) | Bird identification method based on improved convolutional neural network | |
Kim et al. | Deep stereo confidence prediction for depth estimation | |
WO2019007253A1 (en) | Image recognition method, apparatus and device, and readable medium | |
US10007678B2 (en) | Image processing apparatus, image processing method, and recording medium | |
Wang et al. | Scene text detection and tracking in video with background cues | |
CN110765882B (en) | Video tag determination method, device, server and storage medium | |
Miclea et al. | Real-time semantic segmentation-based stereo reconstruction | |
Wang et al. | Aspect-ratio-preserving multi-patch image aesthetics score prediction | |
CN109635647B (en) | Multi-picture multi-face clustering method based on constraint condition | |
CN113705596A (en) | Image recognition method and device, computer equipment and storage medium | |
Gallagher et al. | Using context to recognize people in consumer images | |
Lee et al. | Property-specific aesthetic assessment with unsupervised aesthetic property discovery | |
CN110472591A (en) | Occluded pedestrian re-identification method based on deep feature reconstruction |
CN114519863A (en) | Person re-identification method, apparatus, computer device, and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16885609 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 16885609 Country of ref document: EP Kind code of ref document: A1 |