CN103177114B - Cross-data-domain transfer learning classification method based on identification manifold - Google Patents

Cross-data-domain transfer learning classification method based on identification manifold

Info

Publication number
CN103177114B
CN103177114B CN201310113911.0A
Authority
CN
China
Prior art keywords
data
domain
target
Prior art date
Legal status
Expired - Fee Related
Application number
CN201310113911.0A
Other languages
Chinese (zh)
Other versions
CN103177114A (en)
Inventor
方正
张仲非
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201310113911.0A priority Critical patent/CN103177114B/en
Publication of CN103177114A publication Critical patent/CN103177114A/en
Application granted granted Critical
Publication of CN103177114B publication Critical patent/CN103177114B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a cross-data-domain transfer learning classification method based on an identification manifold, comprising the following steps: inputting the data of each data domain and the label data used for training, and establishing adjacency graphs on the data for spectral-graph geometric adjustment; combining the input data, the label information, and the established adjacency graphs with the optimization objectives to establish a unified mathematical model; deriving update formulas for the variables from the established model, and updating the hidden factors of each dimension of each data domain, the relational structure shared between domains, and the regression coefficients in an alternating iteration mode until convergence; and using the obtained parameters to perform class-label prediction on the target-domain data, obtaining the predicted class labels of the target-domain data. The invention learns a discriminative data manifold space in which the new representation factors carry a highly discriminative structure that benefits classification while preserving the original cluster manifold structure of the data.

Description

Cross-data-domain transfer learning classification method based on identification manifold
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a cross-data-domain transfer learning classification method based on identification manifold.
Background
In the information age of massive big data, data of all kinds grows explosively, and mining the latent value of that data has become a focus of attention and research. In the internet, mobile communication, and finance sectors, daily activity continuously generates large volumes of data, and classification is one of the most effective techniques for mining potentially useful knowledge from it. For example, internet users send and receive large numbers of e-mails every day; sorting those e-mails for users and automatically identifying spam requires accurate and effective classification technology. As another example, effectively classifying and inspecting data streams at a network router node, so that anomalies and trojan-virus traffic are discovered in time, plays a major role in keeping a network secure and stable. In finance, monitoring and classifying user transaction behavior helps identify malicious fraudulent transactions and thereby avoid the major economic losses they cause.
On the other hand, practical data mining classification problems usually require reliable labeled data as training samples, and obtaining such training data costs considerable manpower, material resources, and time. As a result, only a limited amount of manually classified labeled data is typically available in the domain under study to train a model. However, if a related, similar data domain contains a certain amount of reliably classified data, the target-domain data can still be modeled and accurately classified despite the shortage of training data, by exploiting the relationship between the domains to transfer knowledge. Taking the internet as an example: even if the data at one point in time carries sufficient labels, the data will evolve over time, and a model trained on earlier data may no longer suit future data objects; readjusting or retraining it brings heavy manpower and time costs. How to reuse the information and knowledge in earlier training data, and so reduce the investment that retraining demands, is therefore of great significance for classification problems over data domains at different times. Among the many existing advanced techniques, the most representative is transfer learning, which addresses exactly this knowledge mining problem: how to use the labels and useful information of other data domains to assist clustering, classification, and similar tasks on the target data domain.
In existing transfer learning text mining algorithms, many researchers have proposed mining latent data representation factors and using the relational structure between the hidden factors of the data dimension and those of the feature dimension as the physical quantity shared among multiple domains. The inter-domain relationship established through this shared hidden-factor structure transfers knowledge between data domains to a certain extent, and allows training and classification with the auxiliary domain's labeled data when the target domain has only a few training samples. However, in most hidden-factor mining algorithms of this kind, the learned hidden factors lack the discriminative characteristics needed for accurate classification. Because most hidden factors are obtained through a framework that combines matrix factorization with clustering, the mining of a discriminative structure is neglected while the internal clustering structure of the data is preserved, forfeiting the ability to further improve class prediction accuracy. Moreover, although the latent relationship between the hidden factors of each dimension of the target and auxiliary domains is shared during transfer learning, distribution gaps between domains remain among the finally learned hidden factors. In particular, when the target and auxiliary data domains use the same classification decision function, the classifier may classify the auxiliary domain accurately and yet, because of the inter-domain deviation in data distribution, still fail to achieve the desired classification performance in the target domain.
In view of these defects and shortcomings of existing transfer learning classification methods based on hidden-factor mining, the transfer learning classification technique provided by the invention mines a discriminative structure that benefits classification while keeping a good clustering structure of the data, and greatly reduces the inter-domain deviation of the final hidden factors by adjusting the Maximum Mean Discrepancy (MMD) distance between the different data domains. The problem of transfer learning classification across data domains is thereby effectively solved. Compared with existing transfer learning classification techniques based on hidden-factor mining, the provided classifier is greatly improved in both accuracy and stability.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a cross-data-domain transfer learning classification method based on an identification manifold, which, while performing cross-data-domain transfer learning classification, learns a discriminative data manifold space through the unified combination of joint matrix factorization and a regression discrimination model under certain constraints. The new representation factors of the data in this manifold space carry a highly discriminative structure that benefits classification, while the original clustering manifold structure of the data is maintained. By minimizing the inter-domain data distribution distance MMD (Maximum Mean Discrepancy), the inter-domain difference of the hidden factors learned across the data domains is greatly reduced, further improving the accuracy and stability of the cross-data-domain transfer learning classifier.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a cross-data-domain transfer learning classification method based on identification manifold comprises the following steps:
S10, inputting the data of each data domain and the label data used for training, and establishing adjacency graphs on the data for spectral-graph geometric adjustment;
S20, for the input data, the label information, and the established adjacency graphs, combining the optimization objectives, including a cross-data-domain joint matrix factorization model, an identification regression model, cross-data-domain distance adjustment, and manifold geometry adjustment, to establish a unified mathematical model;
s30, deriving an updating formula of the variables according to the established mathematical model, and updating hidden factors, inter-domain shared relational structures and regression coefficients of each dimensionality of each data domain in an alternating iteration mode until convergence;
S40, performing class-label prediction on the target-domain data by using the obtained parameters, to obtain the predicted class labels of the target-domain data.
Preferably, S10 specifically includes the following steps:
S101, inputting an auxiliary data domain $D_s$ and a target data domain $D_t$, including the labeled data of the auxiliary data domain and the corresponding label information matrix, as well as the data of the target domain; when the target domain has a small amount of labeled data, also inputting the label-indicator matrix $P_t$, which indicates which target-domain data are labeled, together with the label information of the target-domain data. The set $I = \{s, t\}$ supplies the subscripts of the different data domains; when the data domain referred to is $\pi \in I$, the other data domain corresponding to it is denoted by the complementary subscript.
S102, respectively constructing from the input data an adjacency graph of the data dimension of the auxiliary domain and an adjacency graph of the feature dimension, with edge weights between the points of each adjacency graph as follows:
[edge-weight formulas for the two adjacency graphs are rendered as images in the original]
where $N_p(x)$ denotes the p-nearest-neighborhood of data point x, taking p = 5.
Similarly, a data-dimension adjacency graph and a feature-dimension adjacency graph are constructed for the target domain, with edge weights between the points of each adjacency graph as follows:
[edge-weight formulas for the two adjacency graphs are rendered as images in the original]
where $N_p(x)$ denotes the p-nearest-neighborhood of data point x, taking p = 5.
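As an illustration of step S102, the following sketch builds a symmetric p-nearest-neighbor adjacency graph in Python. Since the edge-weight formulas are rendered as images in the original patent, binary 0/1 weights are an assumption here, and all function and variable names are illustrative.

```python
import numpy as np

def knn_adjacency(X, p=5):
    """Build a symmetric p-nearest-neighbor adjacency graph.

    X: (n_points, n_features) array, one row per point.
    Binary 0/1 edge weights are an assumption; the patent's
    weight formulas are given only as images.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2ab
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:p]      # indices of the p nearest neighbors
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)             # symmetrize

# example: a graph over 20 random points, p = 5 as in the patent
W = knn_adjacency(np.random.rand(20, 8), p=5)
```

The same helper serves both the data-dimension graph (rows are documents) and the feature-dimension graph (rows are feature/word vectors).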
Preferably, S20 specifically includes the following steps:
S201, establishing a joint matrix factorization model across data domains:

$\min_{U^\pi, H, V^\pi \ge 0} \sum_{\pi \in I} \| X^\pi - U^\pi H V^\pi \|^2$

The matrix factorization model decomposes the data of the target data domain and the auxiliary data domain into low-dimensional data representations simultaneously, and retains the common knowledge structure between the two domains. Here $U^\pi$ represents the low-dimensional clustering structure of the features of data domain $D^\pi$, with $k_m$ the number of clusters in the feature dimension; $V^\pi$ represents the low-dimensional clustering structure of the data of $D^\pi$ and is also the low-dimensional hidden representation factor of the data, with $k_n$ the number of data clusters; $H$ represents the relational structure between the feature classes and the data classes of $D^\pi$, and the target data domain and the auxiliary data domain share this stable relational structure.
S202, fusing in an identification regression model and imposing a supervision constraint on the low-dimensional hidden representation factor of the data:

$\min_{V^\pi, U^\pi, H, A} \sum_{\pi \in I} \left( \| X^\pi - U^\pi H V^\pi \|^2 + \beta \| Y^\pi P^\pi - A V^\pi P^\pi \|^2 \right) + \alpha \| A \|^2$

where $A$ is the regression coefficient acting on the data hidden factors, and the label-indicator matrix $P_t$ is diagonal: $P^\pi_{ii} = 1$ when the $i$-th data point of data domain $D^\pi$ is labeled and takes part in the supervised regression discrimination constraint, and $P^\pi_{ii} = 0$ otherwise.
S203, reducing the difference between the target data domain and the auxiliary data domain by introducing adjustment of the Maximum Mean Discrepancy (MMD) distance;
the inter-domain difference distance in the data dimension is defined as follows:
the inter-domain difference distance in feature dimensions is defined as follows:
In order to reduce the difference between the target data domain and the auxiliary data domain, the learned data hidden representation factors and the low-dimensional feature clustering structure factors are expected to make the inter-domain difference distance in each dimension as small as possible. The two distance functions are therefore fused, as adjustment terms to be minimized, into the model obtained in step S202, giving the following:
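The MMD adjustment of S203 can be sketched as follows. The patent's exact inter-domain distance formulas are given as images, so this sketch uses the standard empirical (linear-kernel) MMD, the squared distance between the domain means of the hidden factors; the function name and the column-per-data-point layout are assumptions.

```python
import numpy as np

def mmd_distance(Vs, Vt):
    """Empirical (linear-kernel) MMD between two domains' hidden factors.

    Vs, Vt: (k, n_s) and (k, n_t) arrays whose columns are the
    low-dimensional representations of individual data points.
    Returns the squared distance between the two domain means.
    """
    return float(np.sum((Vs.mean(axis=1) - Vt.mean(axis=1)) ** 2))
```

Identical distributions of hidden factors give a distance of zero, which is exactly what the adjustment term drives the model toward.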
S204, preserving the low-dimensional manifold structure of the data: according to spectral-graph geometric theory, the data-dimension adjacency graph of the auxiliary domain obtained in step S102 is used to establish a measure of the smoothness of the data mapping function along geodesics in the low-dimensional manifold space:
where $D_s^v = \mathrm{diag}\big(\textstyle\sum_i (W_s^v)_{ij}\big)$.
A measure of the smoothness of the data feature mapping function along geodesics in the low-dimensional manifold space is likewise established using the feature-dimension adjacency graph of the auxiliary domain obtained in step S102:
where $D_s^u = \mathrm{diag}\big(\textstyle\sum_i (W_s^u)_{ij}\big)$.
Similarly, the data-dimension adjacency graph of the target domain obtained in step S102 is used to establish, on the data dimension of the target domain, a measure of the smoothness of the data mapping function along geodesics in the low-dimensional manifold space:
where $D_t^v = \mathrm{diag}\big(\textstyle\sum_i (W_t^v)_{ij}\big)$.
A measure of the smoothness of the data feature mapping function along geodesics in the low-dimensional manifold space is established on the feature dimension, using the feature-dimension adjacency graph of the target domain obtained in step S102:
where $D_t^u = \mathrm{diag}\big(\textstyle\sum_i (W_t^u)_{ij}\big)$.
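The smoothness measures of S204 are the usual graph-regularization traces built from the Laplacian L = D - W, with the degree matrices defined as above. A minimal sketch (names are illustrative):

```python
import numpy as np

def smoothness(F, W):
    """Graph-smoothness measure tr(F L F^T) with L = D - W.

    F: (k, n) factor matrix whose columns correspond to graph nodes.
    W: (n, n) symmetric adjacency weights.
    For symmetric W this equals 0.5 * sum_ij W_ij * ||F[:,i] - F[:,j]||^2,
    so smooth mappings (similar factors on adjacent nodes) score low.
    """
    D = np.diag(W.sum(axis=0))   # degree matrix, D_jj = sum_i W_ij
    L = D - W                    # unnormalized graph Laplacian
    return float(np.trace(F @ L @ F.T))
```

For the feature-dimension factors, whose rows (not columns) correspond to graph nodes, the same measure applies to the transpose.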
s205: the cross-data-domain transfer learning classification model based on the identification manifold is established as follows:
s.t. $V_s, V_t, U_s, U_t, H \ge 0$
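Putting S201-S205 together, the unified objective can be sketched as below. The relative weights of the MMD and manifold terms (mu and lam here) and the exact form of the feature-dimension MMD are assumptions, since those formulas appear as images in the original; the function and parameter names are illustrative.

```python
import numpy as np

def tlcdm_objective(X, Y, P, V, U, Wv, Wu, H, A,
                    alpha=1.0, beta=1.0, mu=1.0, lam=1.0):
    """Unified objective of the model in S205 (a sketch; the weights
    mu, lam on the MMD and manifold terms are assumptions).

    X, Y, P, V, U, Wv, Wu are dicts keyed by domain 's' / 't':
    X[d] (m x n_d) data, Y[d] (c x n_d) labels, P[d] (n_d x n_d)
    diagonal label indicator, V[d] (k_n x n_d) data hidden factors,
    U[d] (m x k_m) feature factors, Wv/Wu adjacency graphs.
    """
    def laplacian(W):
        return np.diag(W.sum(axis=0)) - W

    obj = alpha * np.sum(A ** 2)                       # regression regularizer
    for d in ('s', 't'):
        obj += np.sum((X[d] - U[d] @ H @ V[d]) ** 2)   # joint factorization
        obj += beta * np.sum((Y[d] @ P[d] - A @ V[d] @ P[d]) ** 2)  # discrimination
        obj += lam * np.trace(V[d] @ laplacian(Wv[d]) @ V[d].T)     # data manifold
        obj += lam * np.trace(U[d].T @ laplacian(Wu[d]) @ U[d])     # feature manifold
    # MMD adjustment on the data and feature dimensions (linear-kernel form assumed)
    obj += mu * np.sum((V['s'].mean(axis=1) - V['t'].mean(axis=1)) ** 2)
    obj += mu * np.sum((U['s'].mean(axis=0) - U['t'].mean(axis=0)) ** 2)
    return float(obj)
```

Every term is a sum of squares or a positive-semidefinite trace, so the objective is nonnegative and decreases monotonically under suitable alternating updates.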
preferably, the alternating iteration in S30 specifically includes the following steps:
S301, updating the auxiliary-domain data hidden factor $V_s$:
where $B_s = A^T Y_s P_s P_s^T$, $B_s^+ = (|B_s| + B_s)/2$, $B_s^- = (|B_s| - B_s)/2$, $E_s = A^T A V_s P_s P_s^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$.
S302, updating the target-domain data hidden factor $V_t$:
where $B_t = A^T Y_t P_t P_t^T$, $B_t^+ = (|B_t| + B_t)/2$, $B_t^- = (|B_t| - B_t)/2$, $E_t = A^T A V_t P_t P_t^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$.
S303, updating the feature-dimension low-dimensional factor $U_s$ of the auxiliary domain:
S304, updating the feature-dimension low-dimensional factor $U_t$ of the target domain:
S305, updating the factor shared between the auxiliary domain and the target domain, i.e. the relational structure between the hidden factors of the data dimension and the hidden factors of the feature dimension, according to the following formula:
wherein
S306, updating the regression coefficient A:
where $\gamma = \alpha / \beta$.
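The positive and negative parts used throughout S301-S306 (e.g. $B^+$, $B^-$, $R^+$, $R^-$) split a matrix into two nonnegative matrices whose difference recovers the original; this is what keeps the multiplicative updates nonnegative. A minimal sketch of that split (the helper name is illustrative):

```python
import numpy as np

def pos_neg_split(M):
    """Split a matrix into nonnegative parts, M = M_plus - M_minus,
    using M_plus = (|M| + M)/2 and M_minus = (|M| - M)/2 as in S301."""
    return (np.abs(M) + M) / 2.0, (np.abs(M) - M) / 2.0

# example: split a matrix with mixed signs
B = np.array([[1.0, -2.0], [0.0, 3.0]])
Bp, Bm = pos_neg_split(B)   # B == Bp - Bm, both parts nonnegative
```

Placing the positive parts in the numerator and the negative parts in the denominator of a multiplicative ratio ensures the updated factors stay nonnegative, as the constraints in S205 require.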
Preferably, S40 further includes the steps of:
S401, using the obtained regression coefficient $A$ and the target-domain document hidden factor $V_t$ to perform class-label prediction on the target-domain documents, obtaining the predicted class labels of the target-domain news documents:
$\tilde{Y}_t = A V_t$;
S402, the class of each datum is determined by the row index of the largest element in the corresponding column of $\tilde{Y}_t$.
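Steps S401-S402 amount to a matrix product followed by a column-wise argmax; a minimal sketch (names are illustrative):

```python
import numpy as np

def predict_labels(A, Vt):
    """Class-label prediction for the target domain (S401-S402).

    A:  (classes x k_n) regression coefficient matrix.
    Vt: (k_n x documents) target-domain hidden factors.
    Y_t = A @ Vt is a (classes x documents) score matrix; each
    document's class is the row index of its largest score.
    """
    Yt = A @ Vt                   # S401: score matrix
    return np.argmax(Yt, axis=0)  # S402: argmax down each column
```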
Compared with the prior art, the invention has the following beneficial effects:
(1) the classifier provided by the embodiment of the invention introduces an identification regression model into the hidden-factor mining algorithm of transfer learning, so that the learned data hidden factors carry a discriminative structure that benefits classification, improving the discrimination power and classification accuracy of the classifier;
(2) the embodiment of the invention mines potentially useful structures of the data while minimizing the inter-domain difference of the learned hidden factors through the Maximum Mean Discrepancy (MMD) distance, reducing the deviation caused by data distribution drift between domains; together with the inter-domain sharing of the feature-dimension relation matrix and the data-dimension clustering structure, this resolves a difficult problem of traditional transfer learning algorithms;
(3) the embodiment of the invention performs joint matrix factorization on the data of the auxiliary domain and the target domain, and through spectral-graph geometric adjustment retains the inherent manifold structure of the data in the subspace of the mined hidden factors, so that the learned hidden factors possess a discriminative classification structure while preserving the clustering structure of the original data, improving the noise resistance and robustness of the classifier;
(4) the embodiment of the invention provides a classifier (TLCDM) for cross-data-domain transfer learning based on identification manifold, and innovatively provides a set of effective iterative parameter-update methods for training the classifier.
Drawings
FIG. 1 is a flowchart illustrating the steps of a cross-data-domain transfer learning classification method based on identification manifold according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications, equivalents and alternatives which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
The embodiment of the present invention provides a classifier (TLCDM) for cross-data-domain transfer learning based on identification manifold. In this embodiment the input data is news text data, and topic classification of the news data is taken as the example; the classification method of the embodiment can equally be applied to various cross-domain data classification problems, such as video data in the target domain with picture data in the auxiliary domain for video classification, or e-mail data of different users as the target and auxiliary domains for spam classification.
Referring to fig. 1, a flowchart illustrating steps of a cross-data domain transfer learning classification method based on identification manifold according to an embodiment of the present invention is shown, which includes the following steps:
S10, inputting the data of each data domain and the class-label data used for training, and establishing adjacency graphs on the data for spectral-graph geometric adjustment. Specifically, the method comprises steps S101 to S102:
S101, inputting an auxiliary data domain $D_s$ and a target data domain $D_t$, including the labeled data of the auxiliary data domain and the corresponding label information matrix, as well as the data of the target domain; when the target domain has a small amount of class-labeled data, also inputting the class-label indicator matrix $P_t$, which indicates which target-domain data are labeled, together with the class-label information of the target-domain data.
S102, for news data, the data dimension is each news document and the feature dimension is the text words in the news. A document adjacency graph and a text-word adjacency graph of the auxiliary domain are constructed respectively, with edge weights between the points of each adjacency graph as follows:
[edge-weight formulas for the two adjacency graphs are rendered as images in the original]
where $N_p(x)$ denotes the p-nearest-neighborhood of object x, taking p = 5.
Similarly, a document adjacency graph and a text-word adjacency graph are constructed for the target domain, with edge weights between the points of each adjacency graph as follows:
[edge-weight formulas for the two adjacency graphs are rendered as images in the original]
where $N_p(x)$ denotes the p-nearest-neighborhood of object x, taking p = 5.
S20, for the input data, the label information, and the created adjacency graphs, combining the optimization objectives, including the cross-data-domain joint matrix factorization model, the identification regression model, the cross-data-domain distance adjustment, and the manifold geometry adjustment, to create a unified mathematical model. Specifically this comprises steps S201 to S205:
s201, establishing a joint matrix decomposition model across data domains:
wherein, for ease of discussion and of expressing the modeling, the set $I = \{s, t\}$ supplies the subscripts of the different data domains; when the data domain referred to is $\pi \in I$, the other data domain corresponding to it is denoted by the complementary subscript.
The matrix factorization model decomposes the documents and text words of the target data domain and the auxiliary data domain into low-dimensional data representations simultaneously, and retains the common knowledge structure between the two data domains. Here $U^\pi$ represents the low-dimensional clustering structure of the text words of data domain $D^\pi$, with $k_m$ the number of text-word clusters; $V^\pi$ represents the low-dimensional clustering structure of the documents and is also the low-dimensional hidden representation factor of the documents, with $k_n$ the number of document clusters; $H$ represents the relational structure between the text-word classes and the document classes of $D^\pi$. Experience shows that the target data domain and the auxiliary data domain share this stable relational structure.
S202, fusing and identifying a regression model, and carrying out supervision constraint on a low-dimensional hidden representation factor of the document:
where $A$ is the regression coefficient acting on the data hidden factors, and the class-label indicator matrix $P_t$ is diagonal: $P^\pi_{ii} = 1$ when the $i$-th document of data domain $D^\pi$ is labeled and takes part in the supervised regression discrimination constraint, and $P^\pi_{ii} = 0$ otherwise.
S203, reducing the difference between the target data domain and the auxiliary data domain by introducing adjustment of the Maximum Mean Discrepancy (MMD) distance.
The inter-domain difference distance in the data dimension is defined as follows:
the inter-domain difference distance in feature dimensions is defined as follows:
In order to reduce the difference between the target data domain and the auxiliary data domain, the inter-domain difference distance defined on the document hidden factors, and that defined on the low-dimensional representation factors of the text words, should both be as small as possible. The two distance functions are therefore fused, as adjustment terms to be minimized, into the model obtained in step S202, giving the following:
S204, maintaining the low-dimensional manifold structure of the data. According to spectral-graph geometric theory, the document adjacency graph of the auxiliary domain obtained in step S102 is used to establish a measure of the smoothness of the document mapping function along geodesics in the low-dimensional manifold space:
where $D_s^v = \mathrm{diag}\big(\textstyle\sum_i (W_s^v)_{ij}\big)$.
A measure of the smoothness of the text-word mapping function along geodesics in the low-dimensional manifold space is established using the text-word adjacency graph of the auxiliary domain obtained in step S102:
where $D_s^u = \mathrm{diag}\big(\textstyle\sum_i (W_s^u)_{ij}\big)$.
Similarly, the document adjacency graph of the target domain obtained in step S102 is used to establish, on the document dimension, a measure of the smoothness of the document mapping function along geodesics in the low-dimensional manifold space:
where $D_t^v = \mathrm{diag}\big(\textstyle\sum_i (W_t^v)_{ij}\big)$.
A measure of the smoothness of the text-word mapping function along geodesics in the low-dimensional manifold space is established on the text-word dimension, using the text-word adjacency graph of the target domain obtained in step S102:
where $D_t^u = \mathrm{diag}\big(\textstyle\sum_i (W_t^u)_{ij}\big)$.
and S205, establishing a cross-data-domain transfer learning classification model based on the identification manifold.
In order to preserve the inherent original structure of the data in the manifold space of each dimension (especially the spatial smoothness of the data) in the target domain and the auxiliary domain, the function smoothness measures of each dimension in the two domains are taken as constraint adjustments of the matrix factorization model and fused into a unified mathematical model. Considering also the non-negativity of the obtained low-dimensional representation factors of all dimensions and of the relational structure matrix, the following cross-data-domain transfer learning classification model based on the identification manifold is finally obtained:
s.t. $V_s, V_t, U_s, U_t, H \ge 0$
The joint matrix factorization model mines the hidden factors; the identification regression model improves their discriminability; the cross-data-domain distance adjustment reduces the distribution difference of the hidden factors of the different data domains; and the manifold geometry adjustment preserves the local clustering structure of the original data. The learned hidden factors thus carry a discriminative classification structure while keeping the clustering structure of the original data, improving the noise resistance and robustness of the classifier.
S30, deriving the update formulas of the variables according to the mathematical model established in S20, and updating the hidden factors of the document and text-word dimensions of each data domain, the inter-domain shared relational structure, and the regression coefficients in an alternating iteration mode until convergence. Each iteration specifically comprises steps S301 to S306:
S301, updating the auxiliary-domain document hidden factor $V_s$:
where $B_s = A^T Y_s P_s P_s^T$, $B_s^+ = (|B_s| + B_s)/2$, $B_s^- = (|B_s| - B_s)/2$, $E_s = A^T A V_s P_s P_s^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$.
S302, updating the target-domain document hidden factor $V_t$:
where $B_t = A^T Y_t P_t P_t^T$, $B_t^+ = (|B_t| + B_t)/2$, $B_t^- = (|B_t| - B_t)/2$, $E_t = A^T A V_t P_t P_t^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$.
S303, updating the auxiliary-domain text-word low-dimensional representation factor $U_s$:
S304, updating the target-domain text-word low-dimensional representation factor $U_t$:
S305, updating the shared structural factor between the auxiliary domain and the target domain, i.e. the relational factor between the clustering structure of documents and the clustering structure of text words. The update formula is as follows:
wherein
S306, updating the regression coefficient A:
where $\gamma = \alpha / \beta$.
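The alternating iteration of S301-S306 can be sketched as a generic driver loop. The concrete multiplicative update formulas are given in the patent as images, so they are represented here by caller-supplied step functions, and the convergence test (maximum absolute change below a tolerance) is an assumption.

```python
import numpy as np

def alternating_fit(params, update_steps, tol=1e-4, max_iter=200):
    """Alternating-iteration driver for S301-S306 (a structural sketch).

    params: dict of current variables (e.g. Vs, Vt, Us, Ut, H, A).
    update_steps: list of functions, each mapping params -> params,
                  applied in order once per iteration (S301..S306).
    Stops when the largest elementwise change across all variables
    falls below tol, or after max_iter iterations.
    """
    for _ in range(max_iter):
        old = {k: v.copy() for k, v in params.items()}
        for step in update_steps:
            params = step(params)
        change = max(np.max(np.abs(params[k] - old[k])) for k in params)
        if change < tol:       # all variables stable: converged
            break
    return params
```

Each `update_steps` entry would implement one of the multiplicative updates S301-S306 on its variable while holding the others fixed.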
S40, performing class-label prediction on the target-domain data using the obtained parameters, to obtain the predicted class labels of the target-domain data.
Specifically, the method comprises the following steps.
S401, using the regression coefficient $A$ and the target-domain document hidden factor $V_t$ obtained in S30 to perform class-label prediction on the target-domain documents, obtaining the predicted class labels of the target-domain news documents:
$\tilde{Y}_t = A V_t$.
S402, the class of each datum is determined by the row index of the largest element in the corresponding column of $\tilde{Y}_t$.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (2)

1. A cross-data-domain transfer learning classification method based on identification manifold is characterized by comprising the following steps:
S10, inputting the data of each data domain and the label data used for training, and establishing adjacency graphs on the data for spectral-graph geometric adjustment;
s20, establishing a unified mathematical model for the input data, label information and the established adjacency graph by combining with an optimization target, wherein the optimization target comprises a cross-data-domain joint matrix decomposition model, an identification regression model, cross-data-domain distance adjustment and manifold geometry adjustment;
s30, deriving an updating formula of the variables according to the established mathematical model, and updating hidden factors, inter-domain shared relational structures and regression coefficients of each dimensionality of each data domain in an alternating iteration mode until convergence;
S40, performing class-label prediction on the data of the target domain by using the obtained parameters, to obtain the predicted class labels of the target-domain data;
wherein, S10 specifically comprises the following steps:
S101, inputting an auxiliary data domain $D_s$ and a target data domain $D_t$, including the labeled data of the auxiliary data domain and the corresponding label information matrix, as well as the data of the target domain; when the target domain has a small amount of labeled data, also inputting the label-indicator matrix $P_t$, a matrix indicating which target-domain data are labeled, together with the label information of the target-domain data; the set $I = \{s, t\}$ supplies the subscripts of the different data domains, and when the data domain referred to is $\pi \in I$, the other data domain corresponding to it is denoted by the complementary subscript;
S102, respectively constructing, from the input data, the adjacency graph $W^v_s$ over the data dimension of the auxiliary domain and the adjacency graph $W^u_s$ over its feature dimension, with the edge weights between the points of the adjacency graphs given by:
[edge-weight formulas shown as images in the original publication]
wherein $N_p(x)$ denotes the $p$ nearest neighbors of the data point $x$, taking $p = 5$,
constructing for the target domain the data-dimension adjacency graph $W^v_t$ and the feature-dimension adjacency graph $W^u_t$, whose edge weights between points are respectively:
[edge-weight formulas shown as images in the original publication]
wherein $N_p(x)$ denotes the $p$ nearest neighbors of the data point $x$, taking $p = 5$;
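As an illustration of step S102, a symmetric 0/1 $p$-nearest-neighbor adjacency graph can be built in a few lines of numpy. This is a hypothetical sketch, not code from the patent: the patent's edge-weight formulas are rendered as images, so the binary-indicator weighting below is only an assumption consistent with the text's $N_p(x)$ (the $p$ nearest neighbors of $x$, $p = 5$), and the function name `knn_adjacency` is invented.

```python
import numpy as np

def knn_adjacency(X, p=5):
    """Symmetric 0/1 p-nearest-neighbor adjacency graph.

    X: (d, n) matrix, one sample per column (data-dimension graphs are built
    over columns; feature-dimension graphs over rows).
    Assumption: W_ij = 1 iff x_i is among the p nearest neighbors of x_j
    or vice versa -- the patent's exact weight formula is an image.
    """
    n = X.shape[1]
    # pairwise squared Euclidean distances between columns
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    W = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(d2[:, j])[:p]   # indices of the p nearest columns
        W[nbrs, j] = 1.0
    return np.maximum(W, W.T)             # symmetrize: OR of the two directions
```

The feature-dimension graph of the same step is obtained by applying the routine to the transposed data matrix.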
S20 specifically comprises the following steps:
S201, establishing the cross-data-domain joint matrix factorization model:
$$\min_{U^\pi,\,H,\,V^\pi \ge 0}\ \sum_{\pi \in I} \left\| X^\pi - U^\pi H V^\pi \right\|^2$$
the matrix factorization model simultaneously decomposes the data of the target data domain and the auxiliary data domain into low-dimensional data representations while preserving the common knowledge structure between the two data domains, wherein $U^\pi \in \mathbb{R}^{m_\pi \times k_m}$ represents the low-dimensional clustering of the features of the $\pi$-th data domain $D_\pi$, $k_m$ being the number of clusters of the feature dimension; $V^\pi \in \mathbb{R}^{k_n \times n_\pi}$ represents the low-dimensional hidden representation factor of the data of $D_\pi$, $k_n$ being the number of clusters of the data; and $H \in \mathbb{R}^{k_m \times k_n}$ represents the relational structure between the feature classes and the data classes of $D_\pi$, this stable relational structure being shared by the target data domain and the auxiliary data domain;
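For illustration, the joint factorization objective of step S201 can be evaluated directly with numpy. This is only a sketch, not the patent's implementation; the function name and the matrix shapes, inferred from $X^\pi \approx U^\pi H V^\pi$, are assumptions.

```python
import numpy as np

def joint_factorization_loss(Xs, Xt, Us, Ut, H, Vs, Vt):
    """Cross-domain joint tri-factorization loss
    sum_pi ||X^pi - U^pi H V^pi||^2 (squared Frobenius norm),
    with the relation structure H shared by the auxiliary (s)
    and target (t) domains.

    Assumed shapes: X^pi is m_pi x n_pi, U^pi is m_pi x k_m,
    H is k_m x k_n, V^pi is k_n x n_pi.
    """
    loss = 0.0
    for X, U, V in ((Xs, Us, Vs), (Xt, Ut, Vt)):
        R = X - U @ H @ V          # residual of the tri-factorization
        loss += np.sum(R * R)      # Frobenius norm squared
    return loss
```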
S202, fusing a discriminative regression model to impose supervision constraints on the low-dimensional hidden representation factors of the data:
$$\min_{V^\pi,\,U^\pi,\,H,\,A}\ \sum_{\pi \in I} \left( \left\| X^\pi - U^\pi H V^\pi \right\|^2 + \beta \left\| Y^\pi P^\pi - A V^\pi P^\pi \right\|^2 \right) + \alpha \| A \|^2$$
wherein $A$ is the regression-coefficient matrix acting on the data hidden factors; the label-indicator matrix $P^\pi$ is diagonal, with $P^\pi_{ii} = 1$ when the $i$-th datum of the $\pi$-th data domain $D_\pi$ is labeled and thus used in the supervised regression discrimination constraint, and $P^\pi_{ii} = 0$ otherwise;
S203, reducing the difference between the target data domain and the auxiliary data domain by introducing Maximum Mean Discrepancy (MMD) distance regularization;
the inter-domain difference distance in the data dimension is defined as follows:
$$\mathrm{Dist}_v(D_s, D_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} v^s_{\cdot i} - \frac{1}{n_t} \sum_{j=1}^{n_t} v^t_{\cdot j} \right\|^2;$$
the inter-domain difference distance in feature dimensions is defined as follows:
$$\mathrm{Dist}_u(D_s, D_t) = \left\| \frac{1}{m_s} \sum_{i=1}^{m_s} u^s_{i\cdot} - \frac{1}{m_t} \sum_{j=1}^{m_t} u^t_{j\cdot} \right\|^2;$$
in order to reduce the difference between the target data domain and the auxiliary data domain, the data hidden representation factors and the low-dimensional feature clustering factors are expected to make the inter-domain difference distance in each dimension as small as possible; the two distance functions are therefore fused into the model obtained in step S202 as regularization terms to be minimized, giving:
$$\min_{V^s, V^t, U^s, U^t, H, A}\ \sum_{\pi \in I} \left( \left\| X^\pi - U^\pi H V^\pi \right\|^2 + \beta \left\| Y^\pi P^\pi - A V^\pi P^\pi \right\|^2 \right) + \alpha \| A \|^2 + \left\| \frac{1}{m_s} \mathbf{1}_{m_s}^T U^s - \frac{1}{m_t} \mathbf{1}_{m_t}^T U^t \right\|^2 + \left\| \frac{1}{n_s} V^s \mathbf{1}_{n_s} - \frac{1}{n_t} V^t \mathbf{1}_{n_t} \right\|^2$$
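The two MMD regularization terms of step S203 are simply squared distances between per-domain means of the hidden factors, which makes them cheap to compute. A hypothetical numpy sketch (the function name is invented):

```python
import numpy as np

def mmd_terms(Us, Ut, Vs, Vt):
    """The two Maximum Mean Discrepancy adjustment terms of step S203:
    data dimension:    || mean_i v^s_{.i} - mean_j v^t_{.j} ||^2
    feature dimension: || mean_i u^s_{i.} - mean_j u^t_{j.} ||^2
    i.e. squared Euclidean distances between the domain means of the
    hidden factors. V^pi: (k_n, n_pi); U^pi: (m_pi, k_m)."""
    dist_v = np.sum((Vs.mean(axis=1) - Vt.mean(axis=1)) ** 2)  # mean over columns
    dist_u = np.sum((Us.mean(axis=0) - Ut.mean(axis=0)) ** 2)  # mean over rows
    return dist_v, dist_u
```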
S204, preserving the low-dimensional manifold structure of the data: according to spectral graph theory, using the data-dimension adjacency graph $W^v_s$ of the auxiliary domain obtained in step S102 to establish a measure of the smoothness of the data mapping function along the geodesics of the low-dimensional manifold space:
$$R^v_s = \frac{1}{2} \sum_{ij} \left\| v^s_{\cdot i} - v^s_{\cdot j} \right\|^2 (W^v_s)_{ij} = \sum_i \mathrm{tr}\!\left( v^s_{\cdot i} (v^s_{\cdot i})^T \right) (D^v_s)_{ii} - \sum_{ij} \mathrm{tr}\!\left( v^s_{\cdot i} (v^s_{\cdot j})^T \right) (W^v_s)_{ij} = \mathrm{tr}\!\left( V^s (D^v_s - W^v_s) (V^s)^T \right)$$
wherein $D^v_s = \mathrm{diag}\!\left( \sum_j (W^v_s)_{ij} \right)$,
and using the feature-dimension adjacency graph $W^u_s$ of the auxiliary domain obtained in step S102 to establish the measure of the smoothness of the data feature mapping function along the geodesics of the low-dimensional manifold space:
$$R^u_s = \frac{1}{2} \sum_{ij} \left\| u^s_{i\cdot} - u^s_{j\cdot} \right\|^2 (W^u_s)_{ij} = \sum_i \mathrm{tr}\!\left( (u^s_{i\cdot})^T u^s_{i\cdot} \right) (D^u_s)_{ii} - \sum_{ij} \mathrm{tr}\!\left( (u^s_{i\cdot})^T u^s_{j\cdot} \right) (W^u_s)_{ij} = \mathrm{tr}\!\left( (U^s)^T (D^u_s - W^u_s) U^s \right)$$
wherein $D^u_s = \mathrm{diag}\!\left( \sum_j (W^u_s)_{ij} \right)$;
similarly, using the data-dimension adjacency graph $W^v_t$ of the target domain $D_t$ obtained in step S102, establishing on the data dimension of $D_t$ a measure of the smoothness of the data mapping function along the geodesics of the low-dimensional manifold space:
$$R^v_t = \frac{1}{2} \sum_{ij} \left\| v^t_{\cdot i} - v^t_{\cdot j} \right\|^2 (W^v_t)_{ij} = \sum_i \mathrm{tr}\!\left( v^t_{\cdot i} (v^t_{\cdot i})^T \right) (D^v_t)_{ii} - \sum_{ij} \mathrm{tr}\!\left( v^t_{\cdot i} (v^t_{\cdot j})^T \right) (W^v_t)_{ij} = \mathrm{tr}\!\left( V^t (D^v_t - W^v_t) (V^t)^T \right)$$
wherein $D^v_t = \mathrm{diag}\!\left( \sum_j (W^v_t)_{ij} \right)$,
and using the feature-dimension adjacency graph $W^u_t$ of the target domain obtained in step S102 to establish, on the feature dimension, the measure of the smoothness of the data feature mapping function along the geodesics of the low-dimensional manifold space:
$$R^u_t = \frac{1}{2} \sum_{ij} \left\| u^t_{i\cdot} - u^t_{j\cdot} \right\|^2 (W^u_t)_{ij} = \sum_i \mathrm{tr}\!\left( (u^t_{i\cdot})^T u^t_{i\cdot} \right) (D^u_t)_{ii} - \sum_{ij} \mathrm{tr}\!\left( (u^t_{i\cdot})^T u^t_{j\cdot} \right) (W^u_t)_{ij} = \mathrm{tr}\!\left( (U^t)^T (D^u_t - W^u_t) U^t \right)$$
wherein $D^u_t = \mathrm{diag}\!\left( \sum_j (W^u_t)_{ij} \right)$;
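Each of the four smoothness measures of step S204 reduces to a quadratic form in the graph Laplacian, $\mathrm{tr}(V(D - W)V^T)$. An illustrative sketch (not from the patent; the function name is invented):

```python
import numpy as np

def smoothness(V, W):
    """Graph-smoothness measure R = tr(V (D - W) V^T) of step S204,
    where D = diag(sum_j W_ij). For symmetric W this equals
    (1/2) * sum_ij ||v_{.i} - v_{.j}||^2 W_ij.

    V: (k, n) hidden factors, one column per graph node; W: (n, n) weights.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                      # combinatorial graph Laplacian
    return np.trace(V @ L @ V.T)

# the feature-dimension version R^u = tr(U^T (D - W) U) is smoothness(U.T, W)
```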
S205, establishing the cross-data-domain transfer learning classification model based on the discriminative manifold as follows:
$$\min_{V^s, V^t, U^s, U^t, H, A}\ \sum_{\pi \in I} \left( \left\| X^\pi - U^\pi H V^\pi \right\|^2 + \beta \left\| Y^\pi P^\pi - A V^\pi P^\pi \right\|^2 \right) + \alpha \| A \|^2 + \sum_{\pi \in I} \lambda \left( R^u_\pi + R^v_\pi \right) + \left\| \frac{1}{m_s} \mathbf{1}_{m_s}^T U^s - \frac{1}{m_t} \mathbf{1}_{m_t}^T U^t \right\|^2 + \left\| \frac{1}{n_s} V^s \mathbf{1}_{n_s} - \frac{1}{n_t} V^t \mathbf{1}_{n_t} \right\|^2$$
$$\text{s.t. } V^s, V^t, U^s, U^t, H \ge 0$$
the alternating iteration in S30 specifically includes the following steps:
S301, updating the auxiliary-domain data hidden factor $V^s$ according to the corresponding update formula (rendered as an image in the original publication),
wherein $B_s = A^T Y^s P^s (P^s)^T$, $B_s^+ = (|B_s| + B_s)/2$, $B_s^- = (|B_s| - B_s)/2$, $E_s = A^T A V^s P^s (P^s)^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$;
S302, updating the target-domain data hidden factor $V^t$ according to the corresponding update formula (rendered as an image in the original publication),
wherein $B_t = A^T Y^t P^t (P^t)^T$, $B_t^+ = (|B_t| + B_t)/2$, $B_t^- = (|B_t| - B_t)/2$, $E_t = A^T A V^t P^t (P^t)^T$, $R = A^T A$, $R^+ = (|R| + R)/2$, $R^- = (|R| - R)/2$;
S303, updating the feature-dimension low-dimensional factor $U^s$ of the auxiliary domain (update formula rendered as an image in the original publication);
S304, updating the feature-dimension low-dimensional factor $U^t$ of the target domain (update formula rendered as an image in the original publication);
S305, updating the factor shared between the auxiliary domain and the target domain: updating the relational structure $H$ between the data-dimension hidden factors and the feature-dimension hidden factors according to the corresponding formula (rendered as an image in the original publication), where $I = \{s, t\}$;
S306, updating the regression coefficient A:
$$A = \left( \sum_{\pi \in I} Y^\pi P^\pi (V^\pi P^\pi)^T \right) \left( \sum_{\pi \in I} V^\pi P^\pi (V^\pi P^\pi)^T + \gamma I \right)^{-1}, \quad \text{where } I = \{s, t\} \text{ and } \gamma = \frac{\alpha}{\beta}.$$
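The S306 update of $A$ is a closed-form ridge-regression solve. It can be sketched as follows (illustrative only; the function name and argument order are invented):

```python
import numpy as np

def update_A(Ys, Ps, Vs, Yt, Pt, Vt, alpha, beta):
    """Closed-form regression-coefficient update of step S306:
    A = (sum_pi Y^pi P^pi (V^pi P^pi)^T)
        (sum_pi V^pi P^pi (V^pi P^pi)^T + gamma I)^{-1},  gamma = alpha/beta.
    P^pi is the diagonal 0/1 label-indicator matrix, so only labeled
    columns contribute to the sums.
    """
    gamma = alpha / beta
    k = Vs.shape[0]
    num = np.zeros((Ys.shape[0], k))
    den = gamma * np.eye(k)
    for Y, P, V in ((Ys, Ps, Vs), (Yt, Pt, Vt)):
        VP = V @ P
        num += Y @ P @ VP.T
        den += VP @ VP.T
    # A = num @ den^{-1}, computed via a linear solve instead of an inverse
    return np.linalg.solve(den.T, num.T).T
```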
2. The cross-data-domain transfer learning classification method based on the discriminative manifold according to claim 1, wherein S40 further comprises the following steps:
S401, performing class-label prediction on the target-domain documents with the obtained regression coefficient $A$ and the target-domain document hidden factor $V^t$, obtaining the predicted class labels of the target-domain documents:

$$\tilde{Y}^t = A V^t;$$
S402, determining the class of each datum as the row index of the largest element in the corresponding column of $\tilde{Y}^t$.
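The prediction rule of steps S401–S402 amounts to a matrix product followed by a column-wise argmax. A minimal illustrative sketch (function name invented):

```python
import numpy as np

def predict_labels(A, Vt):
    """Step S40 prediction: compute Y_hat = A V^t, then assign each datum
    (column) the class given by the row index of its largest entry."""
    Yt_hat = A @ Vt
    return np.argmax(Yt_hat, axis=0)
```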
CN201310113911.0A 2013-04-02 2013-04-02 Transfer learning classification method across data domains based on a discriminative manifold Expired - Fee Related CN103177114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310113911.0A CN103177114B (en) 2013-04-02 2013-04-02 Transfer learning classification method across data domains based on a discriminative manifold


Publications (2)

Publication Number Publication Date
CN103177114A CN103177114A (en) 2013-06-26
CN103177114B true CN103177114B (en) 2016-01-27

Family

ID=48636975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310113911.0A Expired - Fee Related CN103177114B (en) 2013-04-02 2013-04-02 Transfer learning classification method across data domains based on a discriminative manifold

Country Status (1)

Country Link
CN (1) CN103177114B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473366B (en) * 2013-09-27 2017-01-04 浙江大学 A kind of various visual angles are across the sorting technique of data field picture material identification and device
CN103678580B (en) * 2013-12-07 2017-08-08 浙江大学 A kind of multitask machine learning method and its device for text classification
US11062792B2 (en) 2017-07-18 2021-07-13 Analytics For Life Inc. Discovering genomes to use in machine learning techniques
US11139048B2 (en) 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions
CN107563452B (en) * 2017-09-18 2020-03-27 天津师范大学 Cross-domain foundation cloud picture classification method based on discriminant measure learning
CN109492094A (en) * 2018-10-15 2019-03-19 上海电力学院 A kind of mixing multidimensional property data processing method based on density
CN110411724B (en) * 2019-07-30 2021-07-06 广东工业大学 Rotary machine fault diagnosis method, device and system and readable storage medium
CN110928916B (en) * 2019-10-18 2022-03-25 平安科技(深圳)有限公司 Data monitoring method and device based on manifold space and storage medium
CN116538996B (en) * 2023-07-04 2023-09-29 云南超图地理信息有限公司 Laser radar-based topographic mapping system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100011025A1 (en) * 2008-07-09 2010-01-14 Yahoo! Inc. Transfer learning methods and apparatuses for establishing additive models for related-task ranking
US20110320387A1 (en) * 2010-06-28 2011-12-29 International Business Machines Corporation Graph-based transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Transfer Learning with Graph Co-Regularization"; Long Mingsheng et al.; Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence; 2012-07-26; from page 2, right column, penultimate paragraph, to page 4, left column, "Algorithm 1" *


Similar Documents

Publication Publication Date Title
CN103177114B (en) Transfer learning classification method across data domains based on a discriminative manifold
CN110532542B (en) Invoice false invoice identification method and system based on positive case and unmarked learning
WO2021088499A1 (en) False invoice issuing identification method and system based on dynamic network representation
CN109461025B (en) Electric energy substitution potential customer prediction method based on machine learning
Kumar et al. Crime prediction using K-nearest neighboring algorithm
CN106778832B (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN109962909B (en) Network intrusion anomaly detection method based on machine learning
CN109034205A (en) Image classification method based on the semi-supervised deep learning of direct-push
Li et al. Integrating ensemble-urban cellular automata model with an uncertainty map to improve the performance of a single model
CN110990718B (en) Social network model building module of company image lifting system
CN113590698A (en) Artificial intelligence technology-based data asset classification modeling and hierarchical protection method
CN109447110A (en) The method of the multi-tag classification of comprehensive neighbours' label correlative character and sample characteristics
Zhao et al. A review of macroscopic carbon emission prediction model based on machine learning
CN109951499A (en) A kind of method for detecting abnormality based on network structure feature
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
Zhang Financial data anomaly detection method based on decision tree and random forest algorithm
CN103473366A (en) Classification method and device for content identification of multi-view cross data field image
CN103559642A (en) Financial data mining method based on cloud computing
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
Zhang et al. End‐to‐end generation of structural topology for complex architectural layouts with graph neural networks
Liu et al. Learning a similarity metric discriminatively with application to ancient character recognition
CN115099504B (en) Cultural relic security risk element identification method based on knowledge graph completion model
Wang et al. A Novel Multi‐Input AlexNet Prediction Model for Oil and Gas Production
CN116541594A (en) Journal recommendation method based on multi-granularity heterogeneous attribute graph comparison learning
Tang et al. Association Analysis of Abnormal Behavior of Electronic Invoice Based on K-Means and Skip-Gram

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160127

Termination date: 20200402