CN109635951A - Unsupervised cross-domain adaptive data calibration method and system based on weighted distribution alignment and geometric feature alignment
- Publication number: CN109635951A
- Application number: CN201811547551.4A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
Abstract
An unsupervised cross-domain adaptive data calibration method and system based on weighted distribution alignment and geometric feature alignment, relating to the technical field of data calibration. The invention aims to effectively improve data calibration accuracy. Weighted distribution alignment weighs the relative importance of the marginal and conditional probability distributions of the sample data, thereby reducing the difference between domains; geometric feature alignment further mines the geometric features of the sample data across domains, and graph Laplacian regularization preserves the geometric structure of the sample space, improving sample separability and data calibration accuracy. Experimental comparison with other methods shows that the developed system, based on the unsupervised cross-domain adaptive data calibration method with weighted distribution alignment and geometric feature alignment, effectively improves data calibration accuracy.
Description
Technical Field
The invention relates to an unsupervised cross-domain adaptive data calibration method and system, in the technical field of data calibration.
Background
The unsupervised domain adaptation problem is a subclass of transfer learning; it aims to solve domain adaptation when the target domain has no labeled data. Previous research has mainly focused on sample-based domain adaptation and feature-transformation-based domain adaptation. Feature-transformation methods can in turn be divided into data-centered and subspace-centered methods. Data-centered methods seek a single transformation that maps source- and target-domain data into a domain-invariant space, reducing the distribution difference while preserving the features of the original space; however, because the original feature space is distorted or stretched by the transformation, these methods do not further exploit the geometric features of the data. Subspace-centered methods operate only on the subspaces and do not explicitly consider the distribution difference between domains after mapping.
Disclosure of Invention
The invention aims to provide an unsupervised cross-domain adaptive data calibration method and system based on weighted distribution alignment and geometric feature alignment, so as to effectively improve data calibration accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the first technical scheme is as follows: an unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment is implemented by the following steps:
the inputs of the method are as follows: X_s, X_t, Y_s; X_s represents the source-domain samples (samples with known labels); X_t represents the target-domain samples (samples to be labeled); Y_s represents the source-domain sample labels;
parameters are as follows:
α is the importance weight for maximizing the variance of the samples to be labeled,
λ is the importance weight for the difference inside the generalized feature transform,
β is the importance weight for maximizing the inter-class variance (between samples of different classes),
μ ∈ [0,1] weighs the importance of the marginal distribution versus the conditional distribution between domains,
δ ∈ [0,1] is the coefficient of the graph Laplacian regularization term (which further mines the marginal distributions),
p is the number of nearest neighbors of a sample,
k is the dimension of the subspaces, and T is the number of iterations;
the outputs of the method are as follows:
the transformation matrices Φ and Ψ; the embedding Z_s obtained by applying Φ to X_s and the embedding Z_t obtained by applying Ψ to X_t; and the adapted classifier f;
step 1: calculate the divergence matrix S_t of the target domain, the inter-class divergence matrix S_b and the intra-class divergence matrix S_w of the data,
and M′_s, M′_t, M′_st, M′_ts, each being the weighted sum of the marginal and conditional probability distribution terms of the source- and target-domain samples plus the corresponding weighted Laplacian regularization term (the aim is to mine the latent knowledge in the conditional and marginal distribution characteristics and so provide prior knowledge for classifying the target-domain samples);
M′_s, M′_t, M′_st, M′_ts are the four partitions of one matrix;
M′_s represents this quantity computed over the source-domain samples;
M′_t represents this quantity computed over the target-domain samples;
M′_st represents this quantity computed between the source-domain and target-domain samples;
M′_ts represents this quantity computed between the target-domain and source-domain samples;
initialize the pseudo labels Ŷ_t of the target domain with a classifier trained in the source domain;
step 2: repeat steps 3 to 6;
step 3: solve the generalized eigenvalue problem, take the eigenvectors corresponding to the first k eigenvalues as the generalized feature transformation U, and from U obtain the source-domain data sample transformation matrix Φ and the target-domain data sample transformation matrix Ψ;
step 4: map the original data into the corresponding subspaces to obtain the embeddings Z_s and Z_t;
step 5: train a classifier on (Z_s, Y_s) to update the pseudo labels Ŷ_t of the target domain;
step 6: update M′_s, M′_t, M′_st, M′_ts;
step 7: repeat until convergence;
step 8: finally obtain the embeddings and the classifier f trained on (Z_s, Y_s).
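The iteration in steps 2 to 7 can be sketched as follows. This is a minimal numpy sketch of the loop structure only: a marginal-only MMD term stands in for the full M′ block construction, a single shared transform stands in for the Φ/Ψ pair, and a 1-NN classifier stands in for f. All helper names and the simplified A/B matrices are illustrative assumptions, not the patent's exact construction.

```python
import numpy as np
from scipy.linalg import eigh

def wdga_loop(Xs, Ys, Xt, k=2, T=3):
    # Samples are stored as columns: Xs is d x n_s, Xt is d x n_t.
    d = Xs.shape[0]
    ns, nt = Xs.shape[1], Xt.shape[1]
    X = np.hstack([Xs, Xt])
    n = ns + nt
    # Marginal-only MMD coefficient matrix (stand-in for the full M' blocks).
    e = np.hstack([np.ones(ns) / ns, -np.ones(nt) / nt])
    M = np.outer(e, e)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    A = X @ M @ X.T + 1e-3 * np.eye(d)           # term to minimize (regularized)
    B = X @ H @ X.T + 1e-3 * np.eye(d)           # variance term to preserve
    Yt = None
    for _ in range(T):
        # Step 3: generalized eigenproblem; the smallest eigenvalues shrink
        # the distribution difference relative to the preserved variance.
        _, vecs = eigh(A, B)
        U = vecs[:, :k]
        # Step 4: embed both domains (one shared U in this simplified sketch).
        Zs, Zt = U.T @ Xs, U.T @ Xt
        # Step 5: 1-NN classifier updates the target pseudo labels.
        d2 = ((Zt.T[:, None, :] - Zs.T[None, :, :]) ** 2).sum(-1)
        Yt = Ys[np.argmin(d2, axis=1)]
        # Step 6 would rebuild M (hence A) from Yt; omitted in this sketch.
    return U, Zs, Zt, Yt
```

In a full implementation, step 6 would recompute the conditional part of the MMD matrix from the refreshed pseudo labels, so that the eigenproblem changes from one iteration to the next.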
The second technical scheme is as follows: an unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment, used to compute the quantities of the first technical scheme and realized by the following steps:
step one: maximize the target-domain data variance
The target-domain data variance is maximized in the corresponding feature subspace to avoid mapping data features onto unrelated dimensions:
max_Ψ tr(Ψᵀ S_t Ψ), where S_t = X_t H_t X_tᵀ and H_t = I_t − (1/m) 1_m 1_mᵀ,
in which S_t is the target-domain divergence matrix, H_t is the centering matrix, 1_m is a column vector whose elements are all 1, m is the number of target-domain samples, and I_t is an identity matrix;
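The divergence matrix S_t above can be computed directly; a small numpy sketch, assuming samples are stored as columns:

```python
import numpy as np

def target_scatter(Xt):
    # S_t = Xt @ Ht @ Xt.T with centering matrix Ht = I - (1/m) 1 1^T,
    # where m is the number of target-domain samples (columns of Xt).
    m = Xt.shape[1]
    Ht = np.eye(m) - np.ones((m, m)) / m
    return Xt @ Ht @ Xt.T
```

Since Ht centers the columns, S_t equals the sum of outer products of the mean-subtracted samples, i.e. (m − 1) times the sample covariance matrix.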
step two: source-domain data separability maintenance
Since the label space of the source domain is available during data calibration, the label information of the source domain is used to control the separability of the transformed source-domain data:
max_Φ tr(Φᵀ S_b Φ) and min_Φ tr(Φᵀ S_w Φ),
where S_b and S_w, the inter-class and intra-class divergence matrices of the data, are defined as follows:
S_b = Σ_c n_s^(c) (m_s^(c) − m_s)(m_s^(c) − m_s)ᵀ, S_w = Σ_c X_s^(c) H_s^(c) (X_s^(c))ᵀ,
where X_s^(c) is the set of source-domain data samples belonging to class c, H_s^(c) = I_s^(c) − (1/n_s^(c)) 1 1ᵀ is the centering matrix for the class-c data, I_s^(c) is an identity matrix, 1 is a column vector whose elements are all 1, n_s^(c) is the number of source-domain data samples belonging to class c, m_s^(c) is the mean of the class-c source samples, and m_s is the mean of all source samples;
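Under the same samples-as-columns convention, S_b and S_w can be computed as in classical linear discriminant analysis; a sketch:

```python
import numpy as np

def class_scatters(Xs, Ys):
    # Between-class (S_b) and within-class (S_w) divergence matrices,
    # samples as columns, as in classical linear discriminant analysis.
    d = Xs.shape[0]
    m_all = Xs.mean(axis=1, keepdims=True)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(Ys):
        Xc = Xs[:, Ys == c]                  # class-c source samples
        nc = Xc.shape[1]
        mc = Xc.mean(axis=1, keepdims=True)  # class mean
        Sb += nc * (mc - m_all) @ (mc - m_all).T
        Sw += (Xc - mc) @ (Xc - mc).T        # equals Xc @ Hc @ Xc.T
    return Sb, Sw
```

A useful sanity check is the classical identity S_b + S_w = S_total, the scatter of all source samples about their global mean.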
step three, weighted distribution alignment
The weighted distribution alignment is used for quantitatively evaluating the importance of the marginal probability distribution and the conditional probability distribution, and is given in the following way:
Dw=(1-μ)D(Ps,Pt)+μD(Qs,Qt)
the marginal and conditional probability distributions are measured by the maximum mean discrepancy (MMD), so the marginal term D(P_s, P_t) and the conditional term D(Q_s, Q_t) can be written as:
D(P_s, P_t) = || (1/n_s) Σ_{x_i ∈ X_s} Uᵀ x_i − (1/n_t) Σ_{x_j ∈ X_t} Uᵀ x_j ||²,
D(Q_s, Q_t) = Σ_{c=1}^{C} || (1/n_s^(c)) Σ_{x_i ∈ X_s^(c)} Uᵀ x_i − (1/n_t^(c)) Σ_{x_j ∈ X_t^(c)} Uᵀ x_j ||²,
where X_s^(c) is the set of samples belonging to class c in the source domain, y_i is the label of x_i, and correspondingly X_t^(c) is the set of samples belonging to class c in the target domain with (pseudo) labels ŷ_j; n_s^(c) is the number of class-c samples in the source domain and n_t^(c) is the number of class-c samples in the target domain;
a classifier is trained in the labeled source domain and applied to the target domain to obtain pseudo labels, and the conditional probability distribution difference between domains is reduced by iterating on the pseudo labels; the weighted distribution alignment can therefore be rewritten in the form:
D_w = tr(Uᵀ X M_w Xᵀ U),
where M_w is the block matrix aligning the weighted marginal and conditional probability distributions of the source- and target-domain samples, with partitions M′_s, M′_st, M′_ts, M′_t as defined in step 1 of the first technical scheme;
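The weighted coefficient matrix can be assembled as below; a numpy sketch in which the function name is illustrative, the μ weighting follows D_w = (1 − μ)D(P_s, P_t) + μD(Q_s, Q_t), and the target pseudo labels come from the current classifier:

```python
import numpy as np

def weighted_mmd_matrix(Ys, Yt_pseudo, ns, nt, mu=0.5):
    # Coefficient matrix M_w such that D_w = tr(U^T X M_w X^T U):
    # (1 - mu) * marginal MMD block plus mu * per-class conditional blocks.
    e = np.hstack([np.ones(ns) / ns, -np.ones(nt) / nt])
    M = (1.0 - mu) * np.outer(e, e)           # marginal part
    for c in np.unique(Ys):
        src = np.where(Ys == c)[0]
        tgt = ns + np.where(Yt_pseudo == c)[0]
        if len(tgt) == 0:                      # class absent in target: skip
            continue
        ec = np.zeros(ns + nt)
        ec[src] = 1.0 / len(src)
        ec[tgt] = -1.0 / len(tgt)
        M += mu * np.outer(ec, ec)             # conditional part for class c
    return M
```

Because every coefficient vector sums to zero, M_w is symmetric with zero row sums, which is a quick structural check on the construction.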
step four: minimize the internal difference of the transformation
Denote the generalized feature transform U = [Φᵀ, Ψᵀ]ᵀ, where Φ acts on the source domain and Ψ on the target domain;
the minimization of the difference inside the transformation can then be written as:
min ||Φ − Ψ||_F²,
which encourages the source and target transformations to remain close;
step five: graph Laplacian regularization
If, for samples x_i and x_j, the marginal probability distributions P_s(x_s) and P_t(x_t) of the data are sufficiently close, then the conditional probability distributions Q_s(y_s|x_s) and Q_t(y_t|x_t) are also sufficiently similar; assuming the labeling function from x_i to x_j is sufficiently smooth, the graph Laplacian regularization term can be expressed as:
tr(Uᵀ X L Xᵀ U),
where W is the graph adjacency matrix with elements W_ij, and W_ij is nonzero only if x_j ∈ N_p(x_i), the set of p nearest neighbors of x_i; L = I − D^{−1/2} W D^{−1/2} is the regularized graph Laplacian matrix with elements L_ij, and D is the diagonal matrix whose diagonal elements are D_ii = Σ_j W_ij; x_i, x_j are elements of the sample space;
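The normalized Laplacian L = I − D^{−1/2} W D^{−1/2} over a p-nearest-neighbor graph can be built as below. The binary edge weights are an assumption for the sketch, since the text fixes only the p-neighbor support of W:

```python
import numpy as np

def graph_laplacian(X, p=3):
    # Normalized graph Laplacian over a p-nearest-neighbor adjacency W
    # (samples as columns). Binary weights are an illustrative assumption.
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise dist^2
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:p + 1]    # p nearest neighbors, skip self
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                   # symmetrize the adjacency
    D = W.sum(axis=1)                        # degrees D_ii = sum_j W_ij
    Dm = np.diag(1.0 / np.sqrt(np.maximum(D, 1e-12)))
    return np.eye(n) - Dm @ W @ Dm
```

The resulting L is symmetric positive semidefinite, which is what makes tr(Uᵀ X L Xᵀ U) a valid smoothness penalty.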
step six: formal solution
In summary, the optimization objective of WDGA is of the form:
max_U (α · TDA + β · BCV) / (WDA + δ · GLR + λ · IDT + β · WCV),
where TDA, BCV, WDA, GLR, IDT and WCV respectively denote target-domain data variance maximization, source-domain data separability (between-class variance) maintenance, weighted distribution alignment, graph Laplacian regularization, transformation internal difference minimization and within-class variance minimization;
substituting the components computed in the five steps above yields an optimization of the trace-ratio form:
max_U tr(Uᵀ A U) / tr(Uᵀ B U),
where A collects the terms to be maximized and B collects the terms to be minimized; to obtain the solution U, set the first derivative ∂J/∂U = 0, which yields the generalized eigenvalue problem:
A U = B U Λ,
where Λ = diag(λ_1, …, λ_k) holds the first k eigenvalues, U = [U_1, …, U_k] holds the eigenvectors belonging to the corresponding eigenvalues, and I is the identity matrix. The second technical scheme corresponds to the first technical scheme.
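The generalized eigenvalue problem A U = B U Λ can be solved with scipy; a toy sketch in which A and B are random stand-ins for the assembled maximization and minimization terms (whether one keeps the largest or smallest eigenvalues follows from which side of the ratio each matrix collects):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, k = 6, 2
G = rng.standard_normal((d, d))
A = G @ G.T                          # symmetric stand-in for the maximized terms
G = rng.standard_normal((d, d))
B = G @ G.T + d * np.eye(d)          # symmetric positive definite denominator
vals, vecs = eigh(A, B)              # solves A u = lam B u, ascending lam
U = vecs[:, -k:]                     # eigenvectors of the k largest eigenvalues
Lam = np.diag(vals[-k:])
# the generalized eigen-relation A U = B U Lam holds columnwise
assert np.allclose(A @ U, B @ U @ Lam, atol=1e-7)
# eigh returns B-orthonormal eigenvectors: U^T B U = I
assert np.allclose(U.T @ B @ U, np.eye(k), atol=1e-7)
```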
The third technical scheme is as follows: an unsupervised cross-domain self-adaptive data calibration system based on weighted distribution alignment and geometric feature alignment comprises an input module, a data calibration module and an output module, wherein the input module is used for reading data to be calibrated and labels and transmitting the data to the data calibration module, the data calibration module is used for calibrating and classifying the input data to be calibrated, and the output module is used for outputting results classified by the data calibration module. The third technical solution is realized based on the algorithm described in the first and second technical solutions.
The invention has the beneficial effects that:
the transfer learning is used as a means for solving the problem of cross-domain identification data labeling, and the reduction of the difference between the source field data and the target field data is important for solving the problem of unsupervised field self-adaptive data calibration. And the method based on the weighted distribution alignment and the geometric feature alignment better solves the problem. Firstly, the importance of marginal probability distribution and conditional probability distribution of sample data can be balanced by weighted distribution alignment, so that the difference between fields is reduced; the geometric feature alignment can further excavate the geometric features of the sample data between domains, and the geometric structure of the sample data space can be well maintained through the graph-Laplacian regularization, so that the sample separability and the data calibration accuracy are improved. Through experimental comparison with other methods, the unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment, which is a system developed by people, can effectively improve the data calibration accuracy.
The invention provides an unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment, which is provided for overcoming various defects of the prior art in solving the field adaptation problem, and develops a set of system on the basis of the data calibration method.
Drawings
FIG. 1 is a graph of the effect of WDGA comparison with data characteristics of three other algorithms; FIG. 2 is a graph of parameters p, δ, μ, T versus accuracy, and FIG. 3 is a graph of the method of the present invention versus accuracy; FIG. 4 is a flow chart of data calibration performed by the data calibration system based on the method of the present invention.
Detailed Description
The unsupervised cross-domain adaptive data calibration method and system based on weighted distribution alignment and geometric feature alignment are explained by combining the accompanying drawings and tables as follows:
as the name suggests, the weighted distribution alignment and the geometric feature alignment are to perform weighted processing on marginal probability distribution and conditional probability distribution of data and perform alignment and perform geometric feature alignment on sample data space features, and simultaneously introduce a Laplace regularization term to further maintain the geometric structure of a sample space, so as to finally obtain an optimization problem which can obtain a closed solution, thereby solving the field adaptation problem.
In the present invention, definitions are given for all parameters except common-sense parameters and intermediate parameters arising in the derivation.
1. Target-domain data variance maximization: to avoid mapping data features onto unrelated dimensions, we circumvent this problem by maximizing the target-domain data variance in the corresponding feature subspace:
max_Ψ tr(Ψᵀ S_t Ψ), where S_t = X_t H_t X_tᵀ and H_t = I_t − (1/m) 1_m 1_mᵀ,
in which S_t is the target-domain divergence matrix, H_t is the centering matrix, and 1_m is a column vector whose elements are all 1.
2. Source-domain data separability maintenance: since the label space of the source domain is available during data calibration, we can use the label information of the source domain to control the separability of the transformed source-domain data:
max_Φ tr(Φᵀ S_b Φ) and min_Φ tr(Φᵀ S_w Φ),
where S_b and S_w, the inter-class and intra-class divergence matrices of the data, are defined as follows:
S_b = Σ_c n_s^(c) (m_s^(c) − m_s)(m_s^(c) − m_s)ᵀ, S_w = Σ_c X_s^(c) H_s^(c) (X_s^(c))ᵀ,
where X_s^(c) is the set of source-domain data samples belonging to class c, H_s^(c) = I_s^(c) − (1/n_s^(c)) 1 1ᵀ is the centering matrix for the class-c data, I_s^(c) is an identity matrix, 1 is a column vector whose elements are all 1, n_s^(c) is the number of source-domain data samples belonging to class c, m_s^(c) is the mean of the class-c source samples, and m_s is the mean of all source samples.
3. Weighted distribution alignment: the purpose of weighted distribution alignment is to quantitatively evaluate the importance of the marginal and conditional probability distributions; in general it can be given as:
D_w = (1 − μ) D(P_s, P_t) + μ D(Q_s, Q_t)
Here the marginal and conditional probability distributions are measured by the maximum mean discrepancy (MMD), so D(P_s, P_t) and D(Q_s, Q_t) can be written as:
D(P_s, P_t) = || (1/n_s) Σ_{x_i ∈ X_s} Uᵀ x_i − (1/n_t) Σ_{x_j ∈ X_t} Uᵀ x_j ||²,
D(Q_s, Q_t) = Σ_{c=1}^{C} || (1/n_s^(c)) Σ_{x_i ∈ X_s^(c)} Uᵀ x_i − (1/n_t^(c)) Σ_{x_j ∈ X_t^(c)} Uᵀ x_j ||²,
where X_s^(c) is the set of samples belonging to class c in the source domain, y_i is the label of x_i, and correspondingly X_t^(c) is the set of samples belonging to class c in the target domain with (pseudo) labels ŷ_j. Since the target domain has no labels, we adopt the idea proposed by Mingsheng Long et al.: train a classifier in the labeled source domain, apply it to the target domain to obtain pseudo labels, and reduce the conditional probability distribution difference between domains by iterating on the pseudo labels.
The weighted distribution alignment can therefore be rewritten in the form:
D_w = tr(Uᵀ X M_w Xᵀ U),
where M_w is the block matrix aligning the weighted marginal and conditional probability distributions of the source- and target-domain samples.
4. Transformation internal difference minimization: denote the generalized feature transform U = [Φᵀ, Ψᵀ]ᵀ, where Φ acts on the source domain and Ψ on the target domain; the minimization of the difference inside the transformation can then be written as:
min ||Φ − Ψ||_F²
5. Graph Laplacian regularization: the domain adaptation problem involves both labeled and unlabeled data. We expect to further exploit the marginal probability distributions P_s and P_t; in other words, unlabeled data may reveal potential data characteristics of the target domain, such as sample variance. The prevailing assumption can be stated as: if, for samples x_i and x_j, the marginal probability distributions P_s(x_s) and P_t(x_t) of the data are sufficiently close, then the conditional probability distributions Q_s(y_s|x_s) and Q_t(y_t|x_t) are also sufficiently similar. Assuming the labeling function from x_i to x_j is sufficiently smooth, the graph Laplacian regularization term can be expressed as:
tr(Uᵀ X L Xᵀ U),
where W is the graph adjacency matrix, W_ij being nonzero only if x_j ∈ N_p(x_i), the set of p nearest neighbors of x_i; L = I − D^{−1/2} W D^{−1/2} is the regularized graph Laplacian matrix, and D is the diagonal matrix whose diagonal elements are D_ii = Σ_j W_ij.
6. Formal solution: in summary, from steps 1 through 5 the optimization objective can be obtained in the trace-ratio form:
max_U tr(Uᵀ A U) / tr(Uᵀ B U),
where A collects the terms to be maximized and B collects the terms to be minimized; to obtain the solution U, set the first derivative ∂J/∂U = 0, which yields the generalized eigenvalue problem:
A U = B U Λ,
where Λ = diag(λ_1, …, λ_k) holds the first k eigenvalues and U = [U_1, …, U_k] holds the eigenvectors belonging to the corresponding eigenvalues. Pseudo code for the weighted distribution alignment and geometric feature alignment (WDGA) algorithm is shown as Algorithm 2-1:
the data calibration system based on the method of the present invention performs data calibration, and the flow chart is shown in fig. 4.
For the invented data calibration system developed based on the weighted distribution alignment and geometric feature alignment (WDGA) algorithm, we tested on the following public data set:
table 1: six image recognition public data sets
Dataset Name | Data | Features | Classes | Domain(s) |
--- | --- | --- | --- | --- |
Office-10 | 1,410 | 800(4,096) | 10 | A,W,D |
Caltech-10 | 1,123 | 800(4,096) | 10 | C |
USPS | 1,800 | 256 | 10 | USPS(U) |
MNIST | 2,000 | 256 | 10 | MNIST(M) |
ImageNet | 7,341 | 4,096 | 5 | ImageNet(I) |
VOC2007 | 3,376 | 4,096 | 5 | VOC(V) |
Through tests based on different features on different data sets, it can be seen that the system developed by us is significantly better than other algorithms in terms of image classification accuracy (as shown in tables 2, 3, 4).
Table 2: accuracy rate based on SURF features on dataset Office + Caltech10
Task | Raw | SA | SDA | PCA | TCA | GFK | JDA | TJM | ARTL | SCA | JGSA | WDGA(P) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
C→A | 36.01 | 49.27 | 49.69 | 36.95 | 45.82 | 46.03 | 45.62 | 46.76 | 44.13 | 45.61 | 51.46 | 53.24 |
C→W | 29.15 | 40.00 | 38.98 | 32.54 | 31.19 | 36.95 | 41.69 | 38.98 | 31.48 | 40.02 | 45.42 | 51.86 |
C→D | 38.22 | 39.49 | 40.13 | 38.22 | 34.39 | 40.76 | 45.22 | 44.59 | 39.53 | 47.08 | 45.86 | 54.14 |
A→C | 34.19 | 39.98 | 39.54 | 34.73 | 42.39 | 40.69 | 39.36 | 39.45 | 36.07 | 39.70 | 41.50 | 41.59 |
A→W | 31.19 | 33.22 | 30.85 | 35.59 | 36.27 | 36.95 | 37.97 | 42.03 | 33.63 | 34.93 | 45.76 | 50.85 |
A→D | 35.67 | 33.76 | 33.76 | 27.39 | 33.76 | 40.13 | 39.49 | 45.22 | 36.87 | 39.47 | 47.13 | 45.86 |
W→C | 28.76 | 35.17 | 34.73 | 26.36 | 29.39 | 24.76 | 31.17 | 30.19 | 29.72 | 31.06 | 33.21 | 33.21 |
W→A | 31.63 | 39.25 | 39.25 | 31.00 | 28.91 | 27.56 | 32.78 | 29.96 | 38.28 | 30.04 | 39.87 | 39.87 |
W→D | 84.71 | 75.16 | 75.80 | 77.07 | 89.17 | 85.35 | 89.17 | 89.17 | 87.89 | 87.34 | 90.45 | 87.26 |
D→C | 29.56 | 34.55 | 35.89 | 29.65 | 30.72 | 29.30 | 31.52 | 31.43 | 30.49 | 30.67 | 29.92 | 34.11 |
D→A | 28.29 | 39.87 | 38.73 | 32.05 | 31.00 | 28.71 | 33.09 | 32.78 | 34.84 | 31.58 | 38.00 | 41.44 |
D→W | 83.73 | 76.95 | 76.95 | 75.93 | 86.10 | 80.34 | 89.49 | 85.42 | 88.56 | 84.42 | 91.86 | 89.49 |
Average | 40.93 | 44.72 | 44.52 | 39.79 | 43.26 | 43.13 | 46.38 | 46.33 | 44.30 | 45.20 | 50.04 | 51.91 |
Table 3: accuracy on datasets MNIST + USPS and ImageNet + VOC2007
Task | Raw | SA | SDA | PCA | TCA | GFK | JDA | TJM | ARTL | SCA | JGSA | WDGA(P) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
M→U | 65.94 | 67.78 | 65.01 | 66.20 | 56.33 | 61.22 | 67.28 | 63.28 | 88.76 | 65.11 | 80.44 | 85.78 |
U→M | 44.70 | 48.80 | 35.73 | 45.04 | 51.20 | 46.45 | 59.65 | 52.25 | 67.66 | 48.03 | 68.15 | 68.30 |
I→V | - | - | - | 58.36 | 63.71 | 59.51 | 63.44 | 63.69 | 62.37 | - | 52.32 | 69.88 |
V→I | - | - | - | 65.13 | 64.86 | 73.79 | 70.16 | 73.02 | 72.22 | - | 70.58 | 82.35 |
Average | - | - | - | 58.68 | 59.02 | 60.24 | 65.13 | 63.06 | 72.75 | - | 67.87 | 76.58 |
Table 4: accuracy rate based on Decaf6 features on dataset Office + Caltech10
Task | DMM | OTGL | PCA | TCA | GFK | JDA | TJM | SCA | ARTL | CORAL | JGSA | WDGA(L) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
C→A | 92.41 | 92.15 | 88.13 | 89.78 | 88.21 | 90.19 | 88.83 | 89.52 | 92.44 | 92.01 | 91.44 | 92.59 |
C→W | 87.49 | 84.17 | 83.37 | 78.32 | 77.59 | 85.42 | 81.37 | 85.38 | 87.76 | 80.02 | 86.78 | 90.17 |
C→D | 90.42 | 87.25 | 84.12 | 85.37 | 86.58 | 85.99 | 84.68 | 87.89 | 86.61 | 84.72 | 93.63 | 94.27 |
A→C | 84.78 | 85.51 | 79.28 | 82.63 | 79.22 | 81.92 | 84.32 | 78.81 | 87.39 | 83.15 | 84.86 | 88.33 |
A→W | 84.73 | 83.05 | 70.86 | 74.19 | 70.92 | 80.68 | 71.92 | 75.93 | 88.48 | 74.61 | 81.02 | 91.19 |
A→D | 92.37 | 85.00 | 82.26 | 81.51 | 82.18 | 81.53 | 76.38 | 85.37 | 85.42 | 84.09 | 88.54 | 94.27 |
W→C | 81.74 | 81.45 | 70.33 | 80.42 | 69.77 | 81.21 | 83.01 | 74.87 | 88.23 | 75.53 | 84.95 | 87.44 |
W→A | 86.46 | 90.62 | 73.47 | 84.08 | 76.83 | 90.71 | 87.59 | 85.03 | 92.37 | 81.17 | 90.71 | 91.75 |
W→D | 98.73 | 96.25 | 99.41 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.36 |
D→C | 83.27 | 84.11 | 71.69 | 82.30 | 71.40 | 80.32 | 83.80 | 78.11 | 87.33 | 76.80 | 86.20 | 85.84 |
D→A | 90.74 | 92.31 | 79.22 | 89.06 | 76.34 | 91.96 | 90.34 | 90.01 | 92.67 | 85.54 | 91.96 | 92.90 |
D→W | 99.27 | 96.29 | 97.99 | 99.73 | 99.28 | 99.32 | 99.25 | 98.57 | 100 | 99.37 | 99.66 | 95.25 |
Average | 89.41 | 88.18 | 81.70 | 85.59 | 81.52 | 87.44 | 85.99 | 85.89 | 90.70 | 84.71 | 89.98 | 91.95 |
As can be seen from fig. 1, in the data processing stage the system preserves the original data characteristics while also gathering samples of the same class as much as possible to improve data separability, thereby improving the accuracy of data calibration and, in turn, of image recognition.
From fig. 2, the relationship between the regularization parameters δ, μ and the parameters p and the iteration number T and the accuracy can be seen.
Fig. 3 shows the classification accuracy achieved by the proposed system's algorithm based on weighted distribution alignment and Laplacian regularization.
Table 5 compares the runtime of our method (WDGA) with that of the JGSA method.
Table 5: WDGA and JGSA algorithm runtime comparison
Task | Data×Features | JGSA | WDGA |
--- | --- | --- | --- |
C→A | 2,081×800 | 18.50 | 19.19 |
M→U | 3,800×256 | 16.21 | 17.09 |
D→A | 1,115×800 | 13.17 | 13.45 |
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.
Claims (3)
1. An unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment is characterized in that the method is realized by the following steps:
the inputs of the method are as follows: X_s, X_t, Y_s; X_s represents the source-domain samples (samples with known labels); X_t represents the target-domain samples (samples to be labeled); Y_s represents the source-domain sample labels;
parameters are as follows:
α is the importance weight for maximizing the variance of the samples to be labeled,
λ is the importance weight for the difference inside the generalized feature transform,
β is the importance weight for maximizing the inter-class variance,
μ ∈ [0,1] weighs the importance of the marginal distribution versus the conditional distribution between domains,
δ ∈ [0,1] is the coefficient of the graph Laplacian regularization term,
p is the number of nearest neighbors of a sample,
k is the dimension of the subspaces, and T is the number of iterations;
the outputs of the method are as follows:
the transformation matrices Φ and Ψ; the embedding Z_s obtained by applying Φ to X_s and the embedding Z_t obtained by applying Ψ to X_t; and the adapted classifier f;
step 1: calculate the divergence matrix S_t of the target domain, the inter-class divergence matrix S_b and the intra-class divergence matrix S_w of the data,
and M′_s, M′_t, M′_st, M′_ts, each being the weighted sum of the marginal and conditional probability distribution terms of the source- and target-domain samples plus the corresponding weighted Laplacian regularization term;
M′_s, M′_t, M′_st, M′_ts are the four partitions of one matrix;
M′_s represents this quantity computed over the source-domain samples;
M′_t represents this quantity computed over the target-domain samples;
M′_st represents this quantity computed between the source-domain and target-domain samples;
M′_ts represents this quantity computed between the target-domain and source-domain samples;
initialize the pseudo labels Ŷ_t of the target domain with a classifier trained in the source domain;
step 2: repeat steps 3 to 6;
step 3: solve the generalized eigenvalue problem, take the eigenvectors corresponding to the first k eigenvalues as the generalized feature transformation U, and from U obtain the source-domain data sample transformation matrix Φ and the target-domain data sample transformation matrix Ψ;
step 4: map the original data into the corresponding subspaces to obtain the embeddings Z_s and Z_t;
step 5: train a classifier on (Z_s, Y_s) to update the pseudo labels Ŷ_t of the target domain;
step 6: update M′_s, M′_t, M′_st, M′_ts;
step 7: repeat until convergence;
step 8: finally obtain the embeddings and the classifier f trained on (Z_s, Y_s).
2. An unsupervised cross-domain adaptive data calibration method based on weighted distribution alignment and geometric feature alignment, used to compute the quantities of claim 1 and realized by the following steps:
step one: maximize the target-domain data variance
the target-domain data variance is maximized in the corresponding feature subspace to avoid mapping data features onto unrelated dimensions:
max_Ψ tr(Ψᵀ S_t Ψ), where S_t = X_t H_t X_tᵀ and H_t = I_t − (1/m) 1_m 1_mᵀ,
in which S_t is the target-domain divergence matrix, H_t is the centering matrix, 1_m is a column vector whose elements are all 1, m is the number of target-domain samples, and I_t is an identity matrix;
step two, source field data separability characteristic maintenance
The method is based on that the label space of the source field is available in the data calibration process, and the separability of the transformed source field data is controlled by using the label information of the source field
where S_b and S_w are the between-class and within-class scatter matrices of the data, defined as follows:
where X_s^(c) is the set of source-domain data samples belonging to class c, H_s^(c) is the centering matrix for the class-c data, I is an identity matrix, 1 is a column vector whose elements are all 1, and n_s^(c) is the number of source-domain data samples belonging to class c;
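A sketch of the between-class and within-class scatter matrices in their usual LDA form; the claim's per-class centering-matrix formulation is assumed to be equivalent to this construction:

```python
import numpy as np

def class_scatters(Xs, ys):
    """Between-class (Sb) and within-class (Sw) scatter matrices; the
    columns of Xs are samples. Standard LDA definitions are assumed."""
    mean = Xs.mean(axis=1, keepdims=True)
    d = Xs.shape[0]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(ys):
        Xc = Xs[:, ys == c]
        mc = Xc.mean(axis=1, keepdims=True)
        nc = Xc.shape[1]                        # n_s^(c): samples in class c
        Sb += nc * (mc - mean) @ (mc - mean).T  # between-class scatter
        Sw += (Xc - mc) @ (Xc - mc).T           # within-class scatter
    return Sb, Sw
```

Separability is preserved by favouring directions with large Sb (classes far apart) and small Sw (classes compact).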
Step three: weighted distribution alignment
Weighted distribution alignment quantitatively evaluates the relative importance of the marginal probability distribution and the conditional probability distribution, and is given by:
D_w = (1 − μ) D(P_s, P_t) + μ D(Q_s, Q_t)
The marginal and conditional probability distributions are compared using the maximum mean discrepancy (MMD), so that the marginal distance D(P_s, P_t) and the conditional distance D(Q_s, Q_t) can be written as:
where X_s^(c) is the set of samples belonging to class c in the source domain and y_s is the label of x_s; correspondingly, X_t^(c) is the set of samples belonging to class c in the target domain and ŷ_t is the label of x_t (in practice a pseudo label); n_s^(c) is the number of class-c samples in the source domain and n_t^(c) is the number of class-c samples in the target domain;
A classifier is trained on the labeled source domain and applied to the target domain to obtain pseudo labels, and the conditional probability distribution gap between the domains is reduced by iterating on these pseudo labels; the weighted distribution alignment can therefore be rewritten as:
where the block matrix for the weighted distribution alignment of the marginal and conditional probability distributions of the source-domain and target-domain samples is given specifically by:
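A minimal sketch of D_w under a linear-kernel (mean-difference) MMD estimate; averaging the per-class terms for the conditional distance is an assumption here, since the full equations were dropped in extraction:

```python
import numpy as np

def mmd_sq(A, B):
    """Squared linear-kernel MMD between sample sets: ||mean(A) - mean(B)||^2."""
    return float(((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())

def weighted_distribution_distance(Xs, ys, Xt, yt_pseudo, mu):
    """D_w = (1 - mu) * D(Ps, Pt) + mu * D(Qs, Qt); rows are samples.
    The conditional term compares per-class means using the target's
    pseudo labels, skipping classes absent from the target."""
    marginal = mmd_sq(Xs, Xt)
    classes = [c for c in np.unique(ys) if (yt_pseudo == c).any()]
    conditional = np.mean([mmd_sq(Xs[ys == c], Xt[yt_pseudo == c])
                           for c in classes])
    return (1 - mu) * marginal + mu * conditional
```

μ = 0 reduces D_w to pure marginal alignment, μ = 1 to pure conditional alignment; intermediate values trade the two off.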
Step four: minimizing the internal difference of the transformation
Denote the generalized feature transformation by U,
Thus, the minimization of the internal difference of the transformation can be written as:
Step five: graph Laplacian regularization
If the marginal probability distributions P_s(x_s) and P_t(x_t) of the data are sufficiently close, then the conditional probability distributions Q_s(y_s|x_s) and Q_t(y_t|x_t) are also sufficiently similar; assuming the prediction is sufficiently smooth from x_i to x_j, the graph Laplacian regularization term can be expressed as:
where W is the graph adjacency matrix and W_ij is an element of W; L is the normalized graph Laplacian matrix and L_ij is an element of L; x_i, x_j are elements of the sample space;
where N_p(x_i) denotes the set of p-nearest neighbours of x_i; L = I − D^(−1/2) W D^(−1/2), where D is a diagonal matrix whose diagonal elements are D_ii = Σ_j W_ij;
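The p-nearest-neighbour graph and the normalized Laplacian L = I − D^(−1/2) W D^(−1/2) described above can be built as follows; binary symmetric edge weights are an assumption, since the claim fixes only the Laplacian's form:

```python
import numpy as np

def normalized_graph_laplacian(X, p=2):
    """Builds a p-nearest-neighbour adjacency W (binary, symmetrized) and
    the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}; rows of X are
    samples. D is the diagonal degree matrix with D_ii = sum_j W_ij."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # exclude self-neighbours
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:p]:      # p nearest neighbours of x_i
            W[i, j] = W[j, i] = 1.0          # symmetrize the adjacency
    Dinv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    return np.eye(n) - Dinv_sqrt @ W @ Dinv_sqrt
```

Penalizing fᵀ L f then forces predictions to vary smoothly between neighbouring samples, which is exactly the smoothness assumption stated above.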
Step six: formal solution
In summary, the optimization objective of WDGA is the following equation:
where TDA, BCV, WDA, GLR, IDT and WCV denote, respectively, the maximization of the target-domain data variance, the maintenance of the source-domain data separability, the weighted distribution alignment, the graph Laplacian regularization, the minimization of the internal difference of the transformation, and the minimization of the within-class variance;
Substituting the components computed in the five steps above into this formula yields the optimized expression:
wherein,
It then follows that:
To obtain the solution U, set the first derivative with respect to U to zero, which yields:
where Λ = diag(λ_1, ..., λ_k) contains the first k eigenvalues, U = [U_1, ..., U_k] consists of the eigenvectors belonging to the corresponding eigenvalues, and I is the identity matrix.
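Setting the derivative to zero leads to a generalized eigenvalue problem of the form A u = λ B u; a sketch of extracting the first k eigenpairs, with A and B as hypothetical stand-ins for the matrices assembled from steps one to five:

```python
import numpy as np

def solve_generalized_eig(A, B, k):
    """Solves A u = lambda * B u via B^{-1} A (valid for invertible B) and
    returns the k smallest eigenvalues with their eigenvectors as the
    columns of U. A and B are stand-ins for the claim's assembled
    alignment and constraint matrices."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(eigvals.real)[:k]     # take the first k eigenvalues
    return eigvals.real[order], eigvecs.real[:, order]
```

For symmetric positive-definite B, a numerically preferable alternative is `scipy.linalg.eigh(A, B)`, which solves the generalized symmetric problem directly.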
3. An unsupervised cross-domain adaptive data calibration system based on weighted distribution alignment and geometric feature alignment, characterized by comprising an input module, a data calibration module and an output module, wherein the input module reads the data to be calibrated together with the labels and transmits them to the data calibration module, the data calibration module calibrates and classifies the input data, and the output module outputs the classification results of the data calibration module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811547551.4A CN109635951A (en) | 2018-12-18 | 2018-12-18 | Unsupervised cross-cutting self-adapting data scaling method and system based on weight distribution alignment and geometrical characteristic alignment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635951A true CN109635951A (en) | 2019-04-16 |
Family
ID=66074962
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113449811A (en) * | 2021-07-16 | 2021-09-28 | 桂林电子科技大学 | Low-illumination target detection method based on MS-WSDA |
WO2022095356A1 (en) * | 2020-11-05 | 2022-05-12 | 平安科技(深圳)有限公司 | Transfer learning method for image classification, related device, and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190416 |