CN112257806B - Heterogeneous user-oriented migration learning method - Google Patents
- Publication number: CN112257806B (application CN202011195428.8A)
- Authority: CN (China)
- Prior art keywords: data, classification, matrix, sample, sample data
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/2135 — Feature extraction by transforming the feature space, e.g. principal component analysis
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06N20/00 — Machine learning
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Neural network learning methods
- Y02D10/00 — Energy efficient computing
Abstract
The invention discloses a transfer learning method for heterogeneous users. First, because the server and the other participants never receive the original data, the risk of privacy disclosure is reduced to a certain extent. Second, domain delimitation and a second round of dimension-reduction screening raise the correlation between the sample data and the classification target, so the method adapts to user heterogeneity, classifies more effectively, and can largely satisfy demands for high classification accuracy. In addition, the cyclic dual-classification algorithm combining Softmax and CNN lets supervised learning guide unsupervised learning, which improves classification accuracy when labeled data are insufficient. The invention selects and delimits the data collected through multiple channels at the local end, so that transfer learning has a sufficient amount of data. On this basis, the requirement of multi-target output is met and classification accuracy is improved.
Description
Technical Field
The invention relates to the technical field of machine learning, and in particular to a transfer learning method oriented to heterogeneous users.
Background
With the continued development and maturation of traditional machine learning, it has become relatively easy to train a good classification model from a large amount of labeled data. In real application scenarios, however, traditional machine learning still cannot fully meet the requirements. On the one hand, labeled data are relatively difficult to obtain: most data generated in daily life carry no labels, manual labeling is too costly, and data collection must also account for personal privacy and security, which further increases its difficulty. On the other hand, traditional machine learning requires re-modeling and re-training every time the data are updated, consuming a great deal of time and resources.
Transfer learning relieves the data pressure of traditional machine learning to a certain extent, but it cannot be performed in every situation, and the transfer effect is influenced by many factors. Most current research uses randomly chosen source-domain data, which yields poor classification accuracy and cannot accommodate user heterogeneity, i.e., cannot meet multi-target classification requirements. When data collected through multiple channels are used, large differences in data correlation reduce the accuracy of the classification result; and if the source and target domains are determined at random, transfer learning cannot exploit its advantage of abundant data: learning efficiency stays low and accuracy cannot be guaranteed. Limited by these factors, transfer learning is not yet widely applied; most studies only propose specific algorithms for classification tasks in one particular field and provide no complete model architecture.
In summary, existing classification models lack a complete pipeline from data acquisition and data processing to the classification algorithm, cannot satisfy multi-target output, and offer no guarantee of classification accuracy.
Disclosure of Invention
The invention aims to provide a transfer learning method for heterogeneous users.
The technical scheme adopted by the invention is as follows:
A transfer learning method for heterogeneous users comprises the following steps:
step 1, a participant performs data acquisition and primary processing at the local end, realizing the first data dimension reduction.
step 2, the server selects and delimits the source and target domains according to the participant's requirements, realizing the second data dimension reduction.
step 3, classification is performed with the S-CNN cyclic classification algorithm.
Further, the specific steps of step 1 are as follows:
step 1-1, each participant locally computes the covariance matrix F = (1/n)(X − X̄)^T (X − X̄) of its raw data X_{n×h}, where n is the number of entries of the participant's local data, h is the data dimension, and X̄ is the matrix of column means;
step 1-2, compute all eigenvalues λ and the corresponding eigenvectors μ of F from |λE − F| = 0, where E is the identity matrix;
step 1-3, sort the eigenvalues λ_i (λ_i ∈ λ) in descending order and select the number of principal components according to the preset threshold r;
step 1-4, output the eigenvector set (μ_1, μ_2, …, μ_r) corresponding to the first r eigenvalues, compute the modulus of each eigenvector, and unitize the r eigenvectors to form the feature matrix A;
step 1-5, compute the projection matrix X'_{n×r} = X_{n×h}A (r < h) to obtain the new data sample X';
step 1-6, the server receives and stores the locally dimension-reduced data sets uploaded by all participants to form the sample data pool P = {X'_1, X'_2, …, X'_N}, where X'_v is the sample data matrix uploaded by the v-th participant and N is the number of participants;
Further, the specific steps of step 2 are as follows:
step 2-1, participant u uploads its classification requirement D_u = (N_u, M_u, acc_u), where N_u is the number of source domains, M_u is the number of categories, and acc_u is the minimum classification accuracy requirement;
step 2-2, the server computes the correlation I(X_v', X_u') between each data sample X_v' in the data pool and the data sample X_u' uploaded by participant u, as follows:
I(X_v', X_u') = Σ_{x_v'∈X_v'} Σ_{x_u'∈X_u'} P(x_v', x_u') log( P(x_v'|x_u') / P(x_v') )
where x' denotes a set of data of the data matrix X'; P(x_v', x_u') is the joint probability distribution of the two data sets x_v' and x_u'; P(x') is the probability distribution of the data x'; P(x_v'|x_u') is the conditional probability distribution of x_v' given x_u'; KL is short for the Kullback-Leibler divergence (distance);
step 2-3, rank the X_v' from high to low by the correlation I(X_v', X_u'); according to the participant's requirement, select the top N_u most correlated samples as the source domains X_S of the current transfer learning, and take the participant's sample data X_u' as the target domain X_T;
step 2-4, perform the second dimension reduction: map the data of the several domains into the same-dimensional space with the transfer component analysis (TCA) algorithm; the feature mapping is accompanied by a reduction of the number of features, and a new data feature sample matrix is finally obtained.
Further, as a preferred embodiment, the specific steps of step 2-4 are as follows:
step 2-4-1, define the kernel matrix K: compute the kernel matrices K_{S,S} and K_{T,T} of X_S and X_T and the cross-domain kernel matrices K_{T,S} and K_{S,T} of the two synthesis domains, then construct K by formula (1):
K = [ K_{S,S} K_{S,T} ; K_{T,S} K_{T,T} ]   (1)
where K is an (n_1+n_2)×(n_1+n_2) matrix, and n_1 and n_2 are the numbers of samples of X_S and X_T mapped into the reproducing kernel Hilbert space (RKHS), respectively;
step 2-4-2, decompose the kernel matrix K as K = (KK^{-1/2})(K^{-1/2}K) using the empirical kernel mapping;
step 2-4-3, from K, compute the feature distance between the source and target domains by formula (2):
Dist(X_S, X_T) = tr(KL)   (2)
where tr(KL) is the trace of the matrix KL, and L is the MMD coefficient matrix with L_{ij} = 1/n_1^2 if x_i, x_j ∈ X_S, L_{ij} = 1/n_2^2 if x_i, x_j ∈ X_T, and L_{ij} = -1/(n_1 n_2) otherwise;
step 2-4-4, compute W ∈ R^{(n_1+n_2)×m} (m < n_1+n_2) by formula (3):
min_W tr(W^T KLKW) + μ tr(W^T W)  s.t.  W^T KHKW = I_m   (3)
where tr(W^T KLKW) is the maximum mean discrepancy (MMD) distance between the empirical means of the two mapped domains X_S and X_T; KW is the empirical kernel mapping result in the m-dimensional space of W; H is the centering matrix and μ a trade-off parameter;
step 2-4-5, output the final source-domain matrix W_{X_S} and target-domain matrix W_{X_T};
Further, as a preferred embodiment, the specific classification steps of step 3 are as follows:
step 3-1, initialization: first train on W_{X_S} to obtain an initialized Softmax classifier;
step 3-2, select a batch of samples from W_{X_T} and initialize the cycle discrimination count q of this batch to 0;
step 3-3, predict the unlabeled sample data in the batch with the Softmax classifier and attach pseudo labels to them, obtaining the primary classification result;
step 3-4, pass the same batch of samples through the CNN classifier to obtain the secondary classification result;
step 3-5, compare the primary and secondary classification results; when the two are inconsistent, let q = q+1 and judge whether q exceeds the threshold Q; if so, delete this batch of data and return to step 3-2, otherwise return to step 3-3;
step 3-6, when the two results are consistent, use the batch of sample data to train the Softmax classifier, obtain the classification accuracy acc, and compare it with the accuracy acc_u required by the participant; when acc is greater than acc_u, output acc and the classification result; otherwise return to step 3-3.
By adopting the above technical scheme, first, the server and the other participants never obtain the original data, so the risk of privacy disclosure is reduced to a certain extent. Second, through domain delimitation and the second dimension-reduction screening, the correlation between the sample data and the classification target is higher; the method adapts to user heterogeneity, classifies more effectively, and can largely meet demands for high classification accuracy. In addition, the cyclic dual-classification algorithm combining Softmax and CNN lets supervised learning guide unsupervised learning, improving classification accuracy when labeled data are insufficient. The invention selects and delimits the data collected through multiple channels at the local end so that transfer learning has a sufficient amount of data. On this basis, the requirement of multi-target output is met and classification accuracy is improved.
Drawings
The invention is described in further detail below with reference to the drawings and detailed description;
FIG. 1 is a flow chart of the transfer learning model adapted to heterogeneous users according to the present invention;
FIG. 2 is a flow chart of the cyclic classification of the S-CNN algorithm of the present invention.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present application clear, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings.
As shown in FIG. 1 and FIG. 2, the invention discloses a transfer learning method for heterogeneous users, which comprises the following steps:
step 1, a participant performs data acquisition and primary processing at the local end, realizing the first data dimension reduction.
step 2, the server selects and delimits the source and target domains according to the participant's requirements, realizing the second data dimension reduction.
step 3, classification is performed with the S-CNN cyclic classification algorithm.
Further, as a preferred embodiment, the specific steps of step 1 are as follows:
step 1-1, each participant locally computes the covariance matrix F = (1/n)(X − X̄)^T (X − X̄) of its raw data X_{n×h}, where n is the number of entries of the participant's local raw data, h is the data dimension, and X̄ is the matrix of column means;
step 1-2, compute all eigenvalues λ and the corresponding eigenvectors μ of F from |λE − F| = 0, where E is the identity matrix;
step 1-3, sort the eigenvalues λ_i (λ_i ∈ λ) in descending order and select the number of principal components according to the preset threshold r;
step 1-4, output the eigenvector set (μ_1, μ_2, …, μ_r) corresponding to the first r eigenvalues, compute the modulus of each eigenvector, and unitize the r eigenvectors to form the feature matrix A;
step 1-5, compute the projection matrix X'_{n×r} = X_{n×h}A (r < h) to obtain the new data sample X';
step 1-6, the server receives and stores the locally dimension-reduced data sets uploaded by all participants, forming the data pool P = {X'_1, X'_2, …, X'_N}, where X'_v is the sample data matrix uploaded by the v-th participant and N is the number of participants;
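As an illustrative sketch (not part of the patent), steps 1-1 to 1-5 can be written with NumPy as follows; the function name and the demo data are assumptions, and the columns are centered before the covariance matrix is formed:

```python
import numpy as np

def local_pca_reduce(X, r):
    """First dimension reduction at the participant's local end (steps 1-1 to 1-5).

    X : (n, h) raw data matrix; r : number of principal components (r < h).
    Returns the projected (n, r) sample matrix X'.
    """
    n, h = X.shape
    Xc = X - X.mean(axis=0)          # center the columns before forming F
    F = (Xc.T @ Xc) / n              # covariance matrix F of the data (step 1-1)
    lam, mu = np.linalg.eigh(F)      # eigenvalues and unit eigenvectors (steps 1-2, 1-4)
    order = np.argsort(lam)[::-1]    # rank eigenvalues from high to low (step 1-3)
    A = mu[:, order[:r]]             # feature matrix A from the top-r eigenvectors
    return Xc @ A                    # projection X' = X A on centered data (step 1-5)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X_red = local_pca_reduce(X, 3)
print(X_red.shape)  # (100, 3)
```

`np.linalg.eigh` already returns unit-norm eigenvectors, so the unitization of step 1-4 is implicit.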
The specific steps of step 2 are as follows:
step 2-1, participant u uploads its classification requirement D_u = (N_u, M_u, acc_u), where N_u is the number of source domains, M_u is the number of categories, and acc_u is the minimum classification accuracy requirement;
step 2-2, the server computes the correlation I(X_v', X_u') between each data sample matrix X_v' in the data pool and the data sample X_u' uploaded by participant u, as follows:
I(X_v', X_u') = Σ_{x_v'∈X_v'} Σ_{x_u'∈X_u'} P(x_v', x_u') log( P(x_v'|x_u') / P(x_v') )
where x' denotes a set of data of the data matrix X'; P(x_v', x_u') is the joint probability distribution of the two data sets x_v' and x_u'; P(x') is the probability distribution of x'; P(x_v'|x_u') is the conditional probability distribution of x_v' given x_u'; KL is short for the Kullback-Leibler divergence (distance);
step 2-3, rank the X_v' from high to low by the correlation I(X_v', X_u'); according to the participant's requirement, select the top N_u most correlated samples as the source domains X_S of the current transfer learning, and take the participant's sample data X_u' as the target domain X_T;
step 2-4, perform the second dimension reduction: map the data of the several domains into the same-dimensional space with the transfer component analysis (TCA) algorithm; the feature mapping is accompanied by a reduction of the number of features, and a new data feature sample matrix is finally obtained.
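A minimal sketch of the correlation computation of step 2-2, assuming the continuous data are first discretized into histogram bins (the binning scheme is an illustrative assumption not specified by the patent):

```python
import numpy as np

def mutual_information(xv, xu, bins=8):
    """Empirical mutual information I(X_v', X_u') between two 1-D data columns.

    Estimates P(xv, xu) and the marginals from a 2-D histogram and sums
    P(xv, xu) * log(P(xv | xu) / P(xv)), matching the formula in step 2-2
    (the ratio P(xv|xu)/P(xv) equals P(xv, xu)/(P(xv)P(xu))).
    """
    joint, _, _ = np.histogram2d(xv, xu, bins=bins)
    p_joint = joint / joint.sum()              # P(xv, xu)
    p_v = p_joint.sum(axis=1, keepdims=True)   # marginal P(xv)
    p_u = p_joint.sum(axis=0, keepdims=True)   # marginal P(xu)
    mask = p_joint > 0                         # skip empty bins (0 * log 0 = 0)
    ratio = p_joint[mask] / (p_v @ p_u)[mask]
    return float(np.sum(p_joint[mask] * np.log(ratio)))

rng = np.random.default_rng(1)
a = rng.normal(size=5000)
# a data pool column identical to the target correlates more than independent noise
print(mutual_information(a, a) > mutual_information(a, rng.normal(size=5000)))  # True
```

This is the KL divergence between the joint distribution and the product of the marginals, which is exactly the quantity the server ranks in step 2-3.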
Further, as a preferred embodiment, the specific steps of step 2-4 are as follows:
step 2-4-1, define the kernel matrix K: compute the kernel matrices K_{S,S} and K_{T,T} of X_S and X_T and the cross-domain kernel matrices K_{T,S} and K_{S,T} of the two synthesis domains, then construct K by formula (1):
K = [ K_{S,S} K_{S,T} ; K_{T,S} K_{T,T} ]   (1)
where K is an (n_1+n_2)×(n_1+n_2) matrix, and n_1 and n_2 are the numbers of samples of X_S and X_T mapped into the reproducing kernel Hilbert space (RKHS), respectively;
step 2-4-2, decompose the kernel matrix K as K = (KK^{-1/2})(K^{-1/2}K) using the empirical kernel mapping;
step 2-4-3, from K, compute the feature distance between the source and target domains by formula (2):
Dist(X_S, X_T) = tr(KL)   (2)
where tr(KL) is the trace of the matrix KL, and L is the MMD coefficient matrix with L_{ij} = 1/n_1^2 if x_i, x_j ∈ X_S, L_{ij} = 1/n_2^2 if x_i, x_j ∈ X_T, and L_{ij} = -1/(n_1 n_2) otherwise;
step 2-4-4, compute W ∈ R^{(n_1+n_2)×m} (m < n_1+n_2) by formula (3):
min_W tr(W^T KLKW) + μ tr(W^T W)  s.t.  W^T KHKW = I_m   (3)
where tr(W^T KLKW) is the maximum mean discrepancy (MMD) distance between the empirical means of the two mapped domains X_S and X_T; KW is the empirical kernel mapping result in the m-dimensional space of W; H is the centering matrix and μ a trade-off parameter;
step 2-4-5, output the final source-domain matrix W_{X_S} and target-domain matrix W_{X_T};
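The TCA steps above admit a closed-form solution via a generalized eigenproblem. Below is a minimal linear-kernel sketch; the trade-off weight `mu`, output dimension `m`, and the demo data are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def tca(Xs, Xt, m=2, mu=1.0):
    """Linear-kernel transfer component analysis (steps 2-4-1 to 2-4-5).

    Returns the mapped source and target matrices W_XS, W_XT with m columns each.
    """
    n1, n2 = len(Xs), len(Xt)
    n = n1 + n2
    X = np.vstack([Xs, Xt])
    K = X @ X.T                              # kernel matrix K of formula (1), linear kernel
    # MMD coefficient matrix L, so that tr(KL) is the domain distance of formula (2)
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])[:, None]
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    # Formula (3) reduces to keeping the top-m eigenvectors of (KLK + mu*I)^-1 KHK
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = np.real(vecs[:, np.argsort(-np.real(vals))[:m]])  # W in R^{(n1+n2) x m}
    Z = K @ W                                # empirical kernel mapping of all samples
    return Z[:n1], Z[n1:]

rng = np.random.default_rng(2)
Ws, Wt = tca(rng.normal(size=(30, 5)), rng.normal(loc=1.0, size=(40, 5)))
print(Ws.shape, Wt.shape)  # (30, 2) (40, 2)
```

The eigenproblem form is the standard closed-form TCA solution; swapping the linear kernel for an RBF kernel only changes the construction of K.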
Further, as a preferred embodiment, the specific classification steps of step 3 are as follows:
step 3-1, initialization: first train on W_{X_S} to obtain an initialized Softmax classifier;
step 3-2, select a batch of samples from W_{X_T} and initialize the cycle discrimination count q of this batch to 0;
step 3-3, predict the unlabeled sample data in the batch with the Softmax classifier and attach pseudo labels to them, obtaining the primary classification result;
step 3-4, pass the same batch of samples through the CNN classifier to obtain the secondary classification result;
step 3-5, compare the primary and secondary classification results; when the two are inconsistent, let q = q+1 and judge whether q exceeds the threshold Q; if so, delete this batch of data and return to step 3-2, otherwise return to step 3-3;
step 3-6, when the two results are consistent, use the batch of sample data to train the Softmax classifier, obtain the classification accuracy acc, and compare it with the accuracy acc_u required by participant u; when acc is greater than acc_u, output acc and the classification result; otherwise return to step 3-3.
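The control flow of steps 3-1 to 3-6 can be sketched as follows. The Softmax and CNN classifiers, the retraining routine, and the accuracy measure are abstracted as caller-supplied functions — stand-in assumptions, since the patent does not fix their implementations in code form:

```python
def s_cnn_loop(batches, softmax_clf, cnn_clf, retrain, accuracy, acc_u, Q=3):
    """S-CNN cyclic dual-classification control flow (steps 3-2 to 3-6).

    softmax_clf / cnn_clf : callables mapping a batch to predicted labels;
    retrain : callable updating the Softmax classifier with pseudo-labeled data;
    accuracy : callable returning the current classification accuracy acc.
    """
    results = []
    for batch in batches:                    # step 3-2: take a batch of target samples
        q = 0                                # cycle discrimination count for this batch
        while True:
            primary = softmax_clf(batch)     # step 3-3: Softmax pseudo-labels
            secondary = cnn_clf(batch)       # step 3-4: CNN classification
            if primary != secondary:         # step 3-5: the two results disagree
                q += 1
                if q > Q:                    # too many disagreements: drop the batch
                    break
                continue                     # otherwise re-judge the batch (back to 3-3)
            retrain(batch, primary)          # step 3-6: consistent -> retrain Softmax
            acc = accuracy()
            if acc > acc_u:
                results.append((primary, acc))  # output acc and the classification result
                break
            # accuracy still below acc_u: loop back to step 3-3

    return results

# Toy demo: classifiers that always agree; each retraining pass raises accuracy by 0.2
acc_state = {"acc": 0.5}

def bump(batch, labels):
    acc_state["acc"] += 0.2

demo = s_cnn_loop(
    batches=[[1, 0, 1]],
    softmax_clf=lambda b: list(b),
    cnn_clf=lambda b: list(b),
    retrain=bump,
    accuracy=lambda: acc_state["acc"],
    acc_u=0.8,
)
print(demo)
```

With the toy callables above, the batch is retrained twice (0.5 → 0.7 → 0.9) before the accuracy requirement acc_u = 0.8 is met and the result is emitted.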
By adopting the above technical scheme, first, the server and the other participants never obtain the original data, so the risk of privacy disclosure is reduced to a certain extent. Second, through domain delimitation and the second dimension-reduction screening, the correlation between the sample data and the classification target is higher; the method adapts to user heterogeneity, classifies more effectively, and can largely meet demands for high classification accuracy. In addition, the cyclic dual-classification algorithm combining Softmax and CNN lets supervised learning guide unsupervised learning, improving classification accuracy when labeled data are insufficient. The invention selects and delimits the data collected through multiple channels at the local end so that transfer learning has a sufficient amount of data. On this basis, the requirement of multi-target output is met and classification accuracy is improved.
It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. Embodiments and features of embodiments in this application may be combined with each other without conflict. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Claims (2)
1. A transfer learning method for heterogeneous users, characterized in that it comprises the following steps:
step 1, a participant performs data acquisition and primary processing at the local end, and applies principal component analysis to the raw data for a first dimension reduction to obtain sample data;
step 2, according to the participant's requirements, the server computes the correlation between the sample data provided by the participant and the other sample data in the data pool; all sample data are then ranked in descending order of correlation; finally, the source and target domains are determined according to the number of source domains specified by the user, realizing the second data dimension reduction; the specific steps of step 2 are as follows:
step 2-1, participant u uploads its classification requirement D_u = (N_u, M_u, acc_u), where N_u is the number of source domains, M_u is the number of categories, and acc_u is the minimum classification accuracy requirement;
step 2-2, the server computes the correlation I(X_v', X_u') between each data sample X_v' in the data pool and the data sample X_u' uploaded by participant u, as follows:
I(X_v', X_u') = Σ_{x_v'∈X_v'} Σ_{x_u'∈X_u'} P(x_v', x_u') log( P(x_v'|x_u') / P(x_v') )
where x' denotes a set of data of X'; P(x_v', x_u') is the joint probability distribution of x_v' and x_u'; P(x') is the probability distribution of x'; KL is short for the Kullback-Leibler divergence (distance);
step 2-3, rank the X_v' from high to low according to I(X_v', X_u'); according to D_u, select the top N_u most correlated samples as the source domains X_S of the current transfer learning, and take X_u' as the target domain X_T;
Step 2-4, performing secondary dimension reduction: mapping data in multiple fields to the same dimensional space by using a migration component analysis TCA algorithm, wherein feature mapping is accompanied with feature number reduction, and finally a new data feature sample matrix W is obtained; the specific steps of the steps 2-4 are as follows:
step 2-4-1, defining a kernel matrix K: respectively calculate X S 、X T And a kernel matrix K of two synthesis domains S,S 、K T,T 、K T,S and KS,T Constructing K by using the formula (1);
wherein K is one (n 1 +n 2 )×(n 1 +n 2 ) N of the matrix of (a) 1 and n2 Respectively X S and XT The number of samples mapped to the Regenerated Kernel Hilbert Space (RKHS);
step 2-4-2, decomposing the kernel matrix K into k= (KK) using an empirical kernel mapping function -1/2 )(K -1/2 K);
Step 2-4-3, calculating the characteristic distance between the source domain and the target domain according to the formula (2) according to K:
Dist(X S ,X T )=tr(KL) (2)
where tr (KL) represents the trace of matrix KL;
step 2-4-4, calculating W according to equation (3):
wherein ,x represents S and XT Maximum mean error (MMD) distance between the empirical mean of two domains, i.e. X S and XT KL distance of two domains; />Is the empirical kernel mapping result of the m-dimensional space of W;
step 2-4-5, outputting the final source domain matrix W XS And a target domain matrix W XT ;
step 3, classify with the S-CNN cyclic classification algorithm: first, input sample data in batches to train the Softmax classifier and obtain the primary classification result; then input the same batch of sample data into the CNN classifier to obtain the secondary classification result, and compare the two classification results; if they are consistent, cyclically feed the batch of sample data back to the Softmax classifier to further optimize it until the accuracy meets the requirement, and finally output the classification result; if they are inconsistent, return the batch to the Softmax classifier for reclassification and a new round of judgment; the specific classification steps of step 3 are as follows:
step 3-1, initialization: first train on W_{X_S} to obtain an initialized Softmax classifier;
step 3-2, select a batch of samples from W_{X_T} and initialize the cycle discrimination count q of this batch to 0;
step 3-3, predict the unlabeled sample data in the batch with the Softmax classifier and attach pseudo labels to them, obtaining the primary classification result;
step 3-4, pass the same batch of samples through the CNN classifier to obtain the secondary classification result;
step 3-5, compare the primary and secondary classification results; when the two are inconsistent, let q = q+1 and judge whether q exceeds the threshold Q; if so, delete this batch of data and return to step 3-2, otherwise return to step 3-3;
step 3-6, when the two results are consistent, use the batch of sample data to train the Softmax classifier, obtain the classification accuracy acc, and compare it with the accuracy acc_u required by participant u; if acc > acc_u, output acc and the classification result; otherwise return to step 3-3.
2. The transfer learning method for heterogeneous users according to claim 1, characterized in that the specific steps of step 1 are as follows:
step 1-1, each participant locally computes the covariance matrix F = (1/n)(X − X̄)^T (X − X̄) of its raw data X_{n×h}, where n is the number of entries of the participant's local data, h is the data dimension, and X̄ is the matrix of column means;
step 1-2, compute all eigenvalues λ and the corresponding eigenvectors μ of F from |λE − F| = 0, where E is the identity matrix;
step 1-3, sort the eigenvalues λ_i (λ_i ∈ λ) in descending order and select the number of principal components according to the preset threshold r;
step 1-4, output the eigenvector set (μ_1, μ_2, …, μ_r) corresponding to the first r eigenvalues, compute the modulus of each eigenvector, and unitize the r eigenvectors to form the feature matrix A;
step 1-5, compute the projection matrix X'_{n×r} = X_{n×h}A (r < h) to obtain the new data sample X'.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202011195428.8A (CN112257806B) | 2020-10-30 | 2020-10-30 | Heterogeneous user-oriented migration learning method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202011195428.8A (CN112257806B) | 2020-10-30 | 2020-10-30 | Heterogeneous user-oriented migration learning method
Publications (2)

Publication Number | Publication Date
---|---
CN112257806A | 2021-01-22
CN112257806B | 2023-06-20
Family
ID=74267558

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202011195428.8A (active) | Heterogeneous user-oriented migration learning method | 2020-10-30 | 2020-10-30

Country Status (1)

Country | Link
---|---
CN | CN112257806B (en)
Citations (3)

Publication number | Priority date | Publication date | Title
---|---|---|---
CN106095099A | 2016-06-12 | 2016-11-09 | A user behavior motion detection and recognition method
CN110134868A | 2019-05-14 | 2019-08-16 | A recommendation method based on user preference heterogeneity analysis
CN111400560A | 2020-03-10 | 2020-07-10 | Method and system for prediction based on a heterogeneous graph neural network model

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US10296846B2 | 2015-11-24 | 2019-05-21 | Xerox Corporation | Adapted domain specific class means classifier

- 2020-10-30: application CN202011195428.8A filed; granted as patent CN112257806B (active)
Non-Patent Citations (1)

Title
---
Semi-supervised SVM transfer learning text classification algorithm; Tan Jianping, Liu Bo, Xiao Yanshan; Wireless Internet Technology, No. 4 *
Also Published As

Publication Number | Publication Date
---|---
CN112257806A | 2021-01-22
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant