CN103761532A - Label space dimensionality reducing method and system based on feature-related implicit coding - Google Patents


Info

Publication number: CN103761532A (application CN201410024964.XA; granted as CN103761532B)
Authority: CN (China)
Prior art keywords: matrix, dimensionality reduction, space, dimension, test case
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)
Inventors: 丁贵广, 林梓佳, 林运祯
Original and current assignee: Tsinghua University
Other languages: Chinese (zh)
Application filed by Tsinghua University; priority to CN201410024964.XA


Abstract

The invention provides a label space dimensionality reduction method and system based on feature-related implicit coding. The method comprises the following steps: a training dataset is provided; a feature matrix and a label matrix are constructed from the training dataset; an optimal correlation function between a dimensionality reduction matrix and the feature matrix is derived from the feature matrix, and an optimal recovery error function between the dimensionality reduction matrix and the label matrix is derived from the label matrix; an objective function is constructed from the optimal correlation function and the optimal recovery error function; the objective function is used to optimize the dimensionality reduction matrix, and a decoding matrix is solved from the optimized dimensionality reduction matrix; the optimized dimensionality reduction matrix is used for learning and training to obtain prediction models; features of a test instance are extracted, and the prediction models predict the representation of the test instance in a latent semantic space; this representation is then decoded through the decoding matrix to obtain the classification result of the test instance in the original label space. The label space dimensionality reduction method achieves a high compression ratio, good stability and strong generality.

Description

Label space dimensionality reduction method and system based on feature-related implicit coding
Technical field
The present invention relates to computer software technology, and in particular to a label space dimensionality reduction method and system based on feature-related implicit coding.
Background
Multi-label classification assigns an instance to one or more categories, so that its content can be described more completely and precisely; the categories an instance belongs to are also called its labels. Multi-label classification is widely applied in practice, for example in multi-label text categorization, image semantic annotation and audio sentiment analysis. In recent years, with the rapid emergence and growth of web applications, multi-label classification has faced many challenges and difficulties brought by expanding data volumes, including the rapid growth of the label space. For example, on the picture-sharing website Flickr, a user uploading a picture can choose descriptive words from millions of vocabulary entries. For multi-label applications built on Flickr data, such as web image semantic annotation, each of these words is treated as a distinct label, so the huge number of labels greatly raises the cost of the underlying learning algorithms. At present, most multi-label methods still decompose the problem into multiple binary classification problems, i.e. a separate predictive model is trained for each label to judge whether an instance belongs to that label, and all labels the instance belongs to finally serve as its descriptions. When the label space expands rapidly and the number of labels becomes very large, the number of predictive models these methods must train grows accordingly, so their training cost rises sharply.
Label space dimensionality reduction points out a feasible research direction, and provides technical support, for multi-label classification with very large label spaces; it has gradually become a research focus in recent years, and several notable dimensionality reduction methods have emerged. One line of work exploits the sparsity of the original label space, reducing its dimensionality by compressed sensing and recovering the original label space from the latent semantic space with the corresponding decoding algorithm. Building on this scheme, later researchers unified the reduction process and the learning of the predictive models under a single probabilistic framework and optimized both processes simultaneously, improving classification performance. Principal component analysis has also been applied to label space dimensionality reduction, known as principal label space transformation. Going further, researchers took the correlation between the feature space and the latent semantic space into account and proposed feature-aware conditional principal label space transformation, obtaining a significant performance gain. Others proposed mapping the original label space with linear Gaussian random projections, keeping the signs of the projected values as the dimensionality reduction result, with decoding realized through a series of hypothesis tests based on the Kullback-Leibler divergence. Still others directly apply Boolean matrix decomposition to the label matrix of the training data to obtain a dimensionality reduction matrix and a decoding matrix, where the dimensionality reduction matrix is the reduction result and the decoding matrix is the linear mapping from the latent semantic space back to the original label space.
In current research, the main solutions presuppose an explicit encoding function, usually taken to be linear. However, because of the complexity of high-dimensional spaces, an explicit encoding function may fail to describe accurately the mapping from the original label space to the optimal latent semantic space, degrading the final dimensionality reduction result. In addition, although a small amount of work avoids assuming an explicit encoding function and learns the dimensionality reduction result directly, such work does not take the correlation between the latent semantic space and the feature space into account, so the obtained result may be hard to describe with predictive models learned from the feature space, leading to poor final classification performance.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the above technical problems in the related art. To this end, one object of the present invention is to propose a label space dimensionality reduction method based on feature-related implicit coding that considers information fully, preserves classification performance well, compresses the label space strongly, and offers good stability and generality.
Another object of the present invention is to propose a label space dimensionality reduction system based on feature-related implicit coding.
An embodiment of the first aspect of the present invention proposes a label space dimensionality reduction method based on feature-related implicit coding, comprising the following steps: providing a training dataset; constructing a feature matrix and a label matrix from the training dataset; deriving from the feature matrix an optimal correlation function between a dimensionality reduction matrix and the feature matrix, and deriving from the label matrix an optimal recovery error function between the dimensionality reduction matrix and the label matrix; constructing an objective function from the optimal correlation function and the optimal recovery error function; optimizing the dimensionality reduction matrix with the objective function, and solving for a decoding matrix from the optimized dimensionality reduction matrix; training on the optimized dimensionality reduction matrix to obtain prediction models; extracting the features of a test instance and using the prediction models to predict its representation in the latent semantic space; and decoding that representation with the decoding matrix to obtain the classification result of the test instance in the original label space.
According to the label space dimensionality reduction method based on feature-related implicit coding of the embodiment of the present invention, while learning the dimensionality reduction result, both its recovery error with respect to the label matrix and its correlation with the feature space are fully considered. Through the optimization process, the dimensionality reduction result is guaranteed to recover the label matrix well while remaining describable by predictive models learned from the feature space, so that good multi-label classification performance is obtained at a low training cost.
In some examples, the dimensions of the latent semantic space are mutually orthogonal.
In some examples, the classification result of the test instance in the original label space is binarized.
In some examples, the dimension of the latent semantic space is smaller than that of the original label space.
An embodiment of the second aspect of the present invention proposes a label space dimensionality reduction system based on feature-related implicit coding, comprising: a training module, for learning and training on a training dataset to obtain prediction models; and a prediction module, for obtaining, according to the prediction models, the classification result of a test instance in the original label space.
According to the label space dimensionality reduction system based on feature-related implicit coding of the embodiment of the present invention, while learning the dimensionality reduction result, both its recovery error with respect to the label matrix and its correlation with the feature space are fully considered. Through the optimization process, the dimensionality reduction result is guaranteed to recover the label matrix well while remaining describable by predictive models learned from the feature space, so that good multi-label classification performance is obtained at a low training cost.
In some examples, the training module specifically comprises: a construction module, for constructing a feature matrix and a label matrix from the training data; an optimization module, for deriving from the feature matrix the optimal correlation function between the dimensionality reduction matrix and the feature matrix, and deriving from the label matrix the optimal recovery error function between the dimensionality reduction matrix and the label matrix; a modeling module, for constructing the objective function from the optimal correlation function and the optimal recovery error function, optimizing the dimensionality reduction matrix with the objective function, and then solving for the decoding matrix from the optimized dimensionality reduction matrix; and a learning module, for training on the optimized dimensionality reduction matrix to obtain the prediction models.
In some examples, the dimensions of the latent semantic space are mutually orthogonal.
In some examples, the classification result of the test instance in the original label space is binarized.
In some examples, the dimension of the latent semantic space is smaller than that of the original label space.
Additional aspects and advantages of the present invention are given in part in the following description; they will partly become apparent from the description or be learned through practice of the present invention.
Brief description of the drawings
Fig. 1 is a flowchart of the label space dimensionality reduction method based on feature-related implicit coding according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the label space dimensionality reduction method based on feature-related implicit coding of one embodiment of the present invention;
Fig. 3 is a structural block diagram of the label space dimensionality reduction system based on feature-related implicit coding according to an embodiment of the present invention; and
Fig. 4 is a structural block diagram of the training module of one embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings, where identical or similar reference numbers denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, intended to explain the present invention, and are not to be construed as limiting it.
The main purpose of label space dimensionality reduction is to compress the high-dimensional original label space into a low-dimensional latent semantic space while keeping an acceptable algorithm performance, thereby decomposing the original training process, which learns predictive models from the feature space to the original label space, into the learning of predictive models from the feature space to the latent semantic space plus a decoding process from the latent semantic space back to the original label space. After dimensionality reduction, the number of predictive models required from the feature space to the latent semantic space is much smaller than the number required before reduction. Moreover, if the predictive models are accurate enough and the decoding process is both accurate and efficient, the resulting multi-label classification performance should in theory remain acceptable while the training cost is greatly reduced.
An embodiment of one aspect of the present invention proposes a label space dimensionality reduction method based on feature-related implicit coding, comprising the following steps: providing a training dataset; constructing a feature matrix and a label matrix from the training dataset; deriving from the feature matrix the optimal correlation function between the dimensionality reduction matrix and the feature matrix, and deriving from the label matrix the optimal recovery error function between the dimensionality reduction matrix and the label matrix; constructing an objective function from the optimal correlation function and the optimal recovery error function; optimizing the dimensionality reduction matrix with the objective function, and solving for the decoding matrix from the optimized dimensionality reduction matrix; training on the optimized dimensionality reduction matrix to obtain prediction models; extracting the features of a test instance and using the prediction models to predict its representation in the latent semantic space; and decoding that representation with the decoding matrix to obtain the classification result of the test instance in the original label space.
Fig. 1 is a flowchart of the label space dimensionality reduction method based on feature-related implicit coding according to an embodiment of the present invention, and Fig. 2 is a schematic framework diagram of the method of one embodiment of the present invention. The method is described in detail below with reference to Fig. 1 and Fig. 2.
Step S101: a training dataset is provided.
As shown in the framework diagram of Fig. 2, the method of the present invention comprises a training process and a prediction process. In the training process, a training dataset of a certain size must be given.
Step S102: construct a feature matrix and a label matrix from the training dataset.
Specifically, for a given training dataset containing m instances, a suitable feature type is selected according to the attributes of the data, and for each instance a feature vector x = [x_1, x_2, ..., x_d] is extracted, where x_i is the i-th dimension of x. After the feature vectors of all instances are obtained, they are stacked row by row (in any fixed order) into the required feature matrix X, an m × d matrix, where d is the dimension of the feature vectors.
Meanwhile, for the same training dataset of m instances, the number k of distinct labels appearing in it is counted, and for each instance a label vector y = [y_1, y_2, ..., y_k] is constructed according to its label memberships, where y_j indicates whether the instance belongs to the j-th label: its value is 1 if so and 0 otherwise. Similarly, after the label vectors of all instances are obtained, they are stacked into the label matrix Y in the same order as the feature matrix; Y is an m × k matrix.
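As a concrete illustration of this construction, the following sketch builds X and Y for a tiny hand-made dataset; the instances, feature values and label sets are invented for illustration and are not taken from the patent.

```python
import numpy as np

# m = 3 instances, d = 4 feature dimensions, k = 5 distinct labels
features = [
    [0.2, 1.0, 0.0, 0.5],   # feature vector x of instance 1
    [0.9, 0.1, 0.3, 0.0],   # instance 2
    [0.4, 0.4, 0.8, 0.2],   # instance 3
]
labels = [
    {0, 2},      # instance 1 belongs to labels 0 and 2
    {1},         # instance 2 belongs to label 1
    {2, 3, 4},   # instance 3 belongs to labels 2, 3 and 4
]

X = np.array(features)            # m x d feature matrix
k = 5
Y = np.zeros((len(labels), k))    # m x k binary label matrix
for i, owned in enumerate(labels):
    for j in owned:
        Y[i, j] = 1.0             # y_j = 1 iff instance i has label j

print(X.shape, Y.shape)   # (3, 4) (3, 5)
print(Y[0])               # [1. 0. 1. 0. 0.]
```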
Step S103: derive the optimal correlation function between the dimensionality reduction matrix and the feature matrix from the feature matrix, and derive the optimal recovery error function between the dimensionality reduction matrix and the label matrix from the label matrix.
Specifically, on the one hand, the optimal correlation function between the dimensionality reduction matrix and the feature matrix is derived from the feature matrix.
In practice, following the idea of implicit coding, suppose there is a dimensionality reduction matrix C. The correlation between C and the feature matrix X can be decomposed into the sum of the correlations between each column of C and X. For any column c of C, the correlation between it and X can be described by the cosine correlation, expressed as the function

    ρ = (Xr)^T c / sqrt( (Xr)^T (Xr) · c^T c )

where r is a linear mapping of X, used to project X into the space of c.
Meanwhile, to reduce redundant information in the dimensionality reduction result, the columns of C are assumed to be mutually orthogonal, i.e. the dimensions of the latent semantic space described by C are mutually orthogonal, with corresponding mathematical expression C^T C = E. From C^T C = E it follows that c^T c = 1, and linearly rescaling r does not change the value of the cosine correlation ρ, so the following optimization problem can be constructed to solve for the optimal linear mapping r and hence the optimal correlation between c and X:

    ρ* = max_r (Xr)^T c
    subject to: (Xr)^T (Xr) = 1

By the method of Lagrange multipliers, the optimal linear mapping r* ∝ (X^T X)^{-1} X^T c can be derived; substituting it back into the problem gives the optimal correlation

    ρ* = sqrt( c^T Δ c )

where Δ = X (X^T X)^{-1} X^T.
Therefore, the optimal correlation between the dimensionality reduction matrix C and the feature matrix X can be expressed as

    P* = Σ_{i=1..l} sqrt( C_{·,i}^T Δ C_{·,i} )

where C_{·,i} denotes the i-th column of C.
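The derivation above can be checked numerically. The sketch below, with randomly generated stand-in data, verifies that Δ = X (X^T X)^{-1} X^T is a symmetric idempotent projection and that the cosine correlation attained by r* ∝ (X^T X)^{-1} X^T c indeed equals sqrt(c^T Δ c):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.standard_normal((m, d))

Delta = X @ np.linalg.inv(X.T @ X) @ X.T     # m x m projection onto col(X)
assert np.allclose(Delta, Delta.T)           # symmetric
assert np.allclose(Delta @ Delta, Delta)     # idempotent (a projection)

# For a unit column c, the claimed optimal correlation is sqrt(c^T Delta c):
c = rng.standard_normal(m)
c /= np.linalg.norm(c)
rho_star = np.sqrt(c @ Delta @ c)

# Direct check: with r proportional to (X^T X)^{-1} X^T c, the cosine
# correlation between X r and c attains rho_star.
r = np.linalg.inv(X.T @ X) @ X.T @ c
Xr = X @ r
rho_direct = (Xr @ c) / (np.linalg.norm(Xr) * np.linalg.norm(c))
assert np.allclose(rho_star, rho_direct)
```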
On the other hand, the optimal recovery error function between the dimensionality reduction matrix and the label matrix is derived from the label matrix.
Following the idea of implicit coding, suppose there is a dimensionality reduction matrix C of size m × l, where l is the dimension after reduction, so l << k.
Under the assumption that C exists, the error of recovering the label matrix from C can be expressed as

    ε = || Y − C D ||_fro^2

where D is a linear decoding matrix introduced to keep decoding efficient, and || · ||_fro^2 denotes the squared Frobenius norm of a matrix. Given C, the optimal recovery error is the minimum ε; minimizing ε therefore yields both the optimal decoding matrix D and the optimal recovery error. Constructing the optimization problem

    ε* = min_D || Y − C D ||_fro^2

gives the optimal decoding matrix D* = (C^T C)^{-1} C^T Y. Since C^T C = E is the identity matrix, this simplifies further to D* = C^T Y. Substituting it back into the problem yields the optimal recovery error function

    ε* = Tr[ Y^T Y − Y^T C C^T Y ]

where Tr[·] denotes the trace of a matrix, i.e. the sum of its diagonal entries.
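A numerical check of this closed form, under the assumption C^T C = E (an orthonormal stand-in C is generated here via QR of random data): the decoding matrix D* = C^T Y and the trace expression for ε* agree with the direct Frobenius-norm computation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, l = 8, 5, 2
Y = rng.integers(0, 2, size=(m, k)).astype(float)   # random binary label matrix

# Orthonormal C (C^T C = E), e.g. via QR of a random m x l matrix
C, _ = np.linalg.qr(rng.standard_normal((m, l)))

D_star = C.T @ Y                                     # optimal decoding matrix
eps_direct = np.linalg.norm(Y - C @ D_star, 'fro') ** 2
eps_trace = np.trace(Y.T @ Y - Y.T @ C @ C.T @ Y)    # closed-form error
assert np.allclose(eps_direct, eps_trace)
```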
Step S104: construct the objective function from the optimal correlation function and the optimal recovery error function.
Step S103 yields the optimal recovery error function ε* = Tr[Y^T Y − Y^T C C^T Y] for recovering the label matrix Y from the dimensionality reduction matrix C, as well as the optimal correlation function P* = Σ_{i=1..l} sqrt(C_{·,i}^T Δ C_{·,i}) between C and the feature matrix X. The optimal dimensionality reduction matrix should therefore simultaneously minimize the recovery error function and maximize the correlation function.
By the properties of the matrix trace, ε* = Tr[Y^T Y − Y^T C C^T Y] = Tr[Y^T Y] − Tr[Y^T C C^T Y]; since Tr[Y^T Y] is a constant, minimizing ε* is equivalent to maximizing Tr[Y^T C C^T Y]. For the correlation part, maximizing sqrt(C_{·,i}^T Δ C_{·,i}) is equivalent to maximizing C_{·,i}^T Δ C_{·,i}, so maximizing P* is equivalent to maximizing Σ_{i=1..l} C_{·,i}^T Δ C_{·,i} = Tr[C^T Δ C]. Hence, to minimize ε* and maximize P* simultaneously, the following equivalent objective function can be constructed to solve for the optimal C:

    Ω = max_C Tr[ Y^T C C^T Y ] + α Tr[ C^T Δ C ]
    subject to: C^T C = E

where α is a weight parameter that adjusts the balance between the optimal recovery error and the optimal correlation. By the properties of the trace, the objective function can be rewritten as

    Ω = max_C Tr[ C^T ( Y Y^T + α Δ ) C ]
    subject to: C^T C = E
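The trace rewriting can likewise be verified numerically. The sketch below uses a random symmetric stand-in for Δ and an orthonormal stand-in for C, and checks that the two forms of the objective coincide (a consequence of the cyclic property of the trace):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, l, alpha = 7, 4, 2, 0.5
Y = rng.integers(0, 2, size=(m, k)).astype(float)
A = rng.standard_normal((m, m))
Delta = A @ A.T                                   # any symmetric stand-in for Delta
C, _ = np.linalg.qr(rng.standard_normal((m, l)))  # orthonormal stand-in, C^T C = E

lhs = np.trace(Y.T @ C @ C.T @ Y) + alpha * np.trace(C.T @ Delta @ C)
rhs = np.trace(C.T @ (Y @ Y.T + alpha * Delta) @ C)
assert np.allclose(lhs, rhs)
```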
Step S105: optimize the dimensionality reduction matrix with the objective function, and solve for the decoding matrix from the optimized dimensionality reduction matrix.
Specifically, the optimal dimensionality reduction matrix C is obtained from the objective function Ω of step S104. Solving for C can be decomposed into optimization problems over its individual columns. For the i-th column C_{·,i}, the following optimization subproblem is constructed:

    Ω_i = max_{C_{·,i}} C_{·,i}^T ( Y Y^T + α Δ ) C_{·,i}
    subject to: C_{·,i}^T C_{·,i} = 1, C_{·,j}^T C_{·,i} = 0 (for all j < i)

By the method of Lagrange multipliers, the optimal C_{·,i} must satisfy the optimality condition

    ( Y Y^T + α Δ ) C_{·,i} = λ_i C_{·,i}

where λ_i is the introduced Lagrange multiplier; substituting it back into the subproblem shows that the optimal Ω_i equals λ_i.
From this optimality condition it can be observed that the optimal C_{·,i} is a unit eigenvector of the matrix (Y Y^T + α Δ), and by the orthogonality of eigenvectors the remaining orthogonality constraints are satisfied naturally. The dimensionality reduction matrix C is thus in fact formed by stacking, column by column, the unit eigenvectors corresponding to the largest l eigenvalues of (Y Y^T + α Δ). Solving for C is therefore the process of performing an eigenvalue decomposition of (Y Y^T + α Δ), which yields all eigenvalues of this matrix together with the unit eigenvector of each eigenvalue. Since the complexity of a full eigenvalue decomposition of an m × m matrix is at most cubic in m, and since (Y Y^T + α Δ) is symmetric while in one embodiment of the present invention only the unit eigenvectors of the largest l eigenvalues are needed, the cost of solving for C can be kept lower still, ensuring that the process is efficient enough.
In addition, after C is solved by the eigenvalue decomposition method, the optimal decoding matrix D can be obtained by minimizing the recovery error of restoring the original label matrix Y from C. From the computation in step S103, the optimal decoding matrix is D* = C^T Y.
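The eigendecomposition step and the closed-form decoding matrix can be sketched as follows, on random stand-in data. NumPy's `eigh` is used here as one possible solver for the symmetric matrix; it returns eigenvalues in ascending order, so the last l columns of the eigenvector matrix give C.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, k, l, alpha = 10, 4, 6, 3, 1.0
X = rng.standard_normal((m, d))
Y = rng.integers(0, 2, size=(m, k)).astype(float)

# M = Y Y^T + alpha * Delta, with Delta = X (X^T X)^{-1} X^T
M = Y @ Y.T + alpha * (X @ np.linalg.inv(X.T @ X) @ X.T)

w, V = np.linalg.eigh(M)     # symmetric eigendecomposition, eigenvalues ascending
C = V[:, -l:]                # unit eigenvectors of the l largest eigenvalues
assert np.allclose(C.T @ C, np.eye(l))   # orthonormality comes for free

D_star = C.T @ Y             # optimal decoding matrix, as derived in step S103
```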
Step S106: train on the optimized dimensionality reduction matrix to obtain the prediction models.
According to the optimal dimensionality reduction matrix C obtained in step S105, a prediction model is trained for each dimension of the latent semantic space it describes. Specifically, for the i-th dimension (1 ≤ i ≤ l), the value of the j-th instance in this dimension is C_{j,i}, so the vector formed by the values of all training instances in this dimension is the i-th column C_{·,i} of C. From the feature matrix X of all training instances and the value vector C_{·,i} of the i-th latent dimension, a corresponding prediction model h_i : X → C_{·,i} can be trained to predict the value of any instance in this dimension; its input is the feature vector of an instance, and its output is the instance's value in the i-th dimension.
In practice, the choice of prediction model can be set according to the concrete application; common choices include linear regression. After label space dimensionality reduction, the dimension l of the latent semantic space is usually much smaller than the dimension k of the original label space, so the number of required prediction models is significantly reduced and the training cost is effectively lowered.
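One possible realization of this per-dimension training, using ordinary least squares as the linear regression mentioned above; X and C here are random stand-ins for the matrices of the previous steps, and all l linear models are fit at once as the columns of a single weight matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, l = 10, 4, 3
X = rng.standard_normal((m, d))                   # stand-in feature matrix
C, _ = np.linalg.qr(rng.standard_normal((m, l)))  # stand-in latent code matrix

# One linear model per latent dimension: h_i(x) = x . W[:, i]
# (least-squares fit of X -> C, i.e. the l models fitted jointly)
W, *_ = np.linalg.lstsq(X, C, rcond=None)         # d x l weight matrix

def predict_latent(x):
    """Predicted l-dimensional latent representation of feature vector x."""
    return x @ W

z = predict_latent(X[0])
print(z.shape)   # (3,)
```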
Step S107: extract the features of the test instance, and use the prediction models to predict its representation in the latent semantic space.
Specifically, for a given test instance to be classified, the same features as in the training process are extracted to obtain its d-dimensional feature vector x̂. The l prediction models learned in step S106 are then used to predict the instance's value in each dimension of the latent semantic space, giving its l-dimensional representation vector on the latent semantic space

    ẑ = [ h_1(x̂), h_2(x̂), ..., h_l(x̂) ]
Step S108: decode the representation of the test instance in the latent semantic space with the decoding matrix, to obtain its classification result in the original label space.
Using the optimal decoding matrix D* obtained in step S105, the l-dimensional representation vector of the test instance in the latent semantic space is decoded back into the k-dimensional representation vector ŷ in the original label space, i.e.

    ŷ = ẑ D*

The resulting vector ŷ is real-valued and needs to be binarized with a threshold (usually taken as 0.5). Specifically, if the value in a dimension exceeds the set threshold, it is set to 1, and otherwise to 0, thereby expressing the label memberships of the test instance in the original label space; the labels corresponding to the dimensions with value 1 form the multi-label classification result of the test instance to be classified.
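The decoding and binarization of step S108 can be sketched as follows; ẑ and D* here are random stand-ins for the predicted latent representation and the decoding matrix of the earlier steps.

```python
import numpy as np

rng = np.random.default_rng(5)
l, k = 3, 6
z_hat = rng.standard_normal(l)        # predicted latent representation (stand-in)
D_star = rng.standard_normal((l, k))  # decoding matrix from step S105 (stand-in)

y_real = z_hat @ D_star               # real-valued k-dim vector in label space
y_hat = (y_real > 0.5).astype(int)    # binarize with threshold 0.5
predicted_labels = np.flatnonzero(y_hat)   # indices of the assigned labels
print(y_hat)
```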
In one embodiment of the present invention, the dimensions of the latent semantic space are mutually orthogonal, which minimizes redundant information in the dimensionality reduction result; the method can thus encode more information of the original label space with fewer dimensions, guaranteeing a larger compression ratio of the original label space during reduction.
In one embodiment of the present invention, the dimension of the latent semantic space is smaller than that of the original label space, and no explicit encoding function needs to be preset. This guarantees that the number of prediction models required by the method of the embodiment is significantly reduced, effectively lowering the training cost, and allows the optimal dimensionality reduction result to be learned adaptively under different data scenarios, giving better stability.
For example, experiments on delicious, a standard dataset in the text classification field, verified the effectiveness of the label space dimensionality reduction method based on feature-related implicit coding of the embodiment of the present invention. Specifically, the label space dimension of the delicious dataset was reduced to 10%, 20%, 30%, 40% and 50% of its original size, and the classification performance reachable by the method of the embodiment at each ratio was observed, measured by the label-based average F1 value and the example-based average accuracy (higher is better for both). Table 1 reports the classification performance of the method of the invention, together with the performance reachable, without label space dimensionality reduction, by a linear support vector machine, for comparison before and after reduction. The results in the table show that, while retaining only 10% of the original label space dimension, the method of the embodiment keeps 68% of the average F1 value and 85% of the average accuracy obtained without dimensionality reduction. The method can therefore effectively reduce the dimensionality of the original label space and maintain acceptable classification performance while greatly reducing the training cost.
Table 1 Experimental results of the method of the embodiment of the present invention on the delicious dataset

  Dimension ratio after dimensionality reduction | 10%   | 20%   | 30%   | 40%   | 50%   | Before dimensionality reduction
  Label-based mean F1 value                      | 0.054 | 0.059 | 0.060 | 0.060 | 0.059 | 0.079
  Example-based mean accuracy                    | 0.120 | 0.121 | 0.120 | 0.120 | 0.112 | 0.142
According to the label space dimensionality reduction method based on feature-related implicit coding of the embodiment of the present invention, in the process of learning the dimensionality reduction result, both the recovery error with respect to the label matrix and the correlation with the feature space are fully taken into account. Through the optimization process, it is guaranteed that the dimensionality reduction result can be well recovered to the label matrix, and at the same time can be described by the prediction models learned from the feature space, so that good multi-label classification performance can be obtained at a lower training cost.
Another aspect of an embodiment of the present invention proposes a label space dimensionality reduction system based on feature-related implicit coding, comprising: a training module 100 and a prediction module 200, as shown in Figure 3.
The training module 100 is configured to perform learning and training according to a training dataset to obtain prediction models. The prediction module 200 is configured to obtain the classification result of a test example in the original label space according to the prediction models obtained by the training module 100.
Specifically, as shown in Figure 4, the training module 100 of the embodiment of the present invention comprises: a constructing module 10, an optimization module 20, a modeling module 30 and a learning module 40.
The constructing module 10 is configured to construct a feature matrix and a label matrix according to the training dataset.
Specifically, given a training dataset containing m examples, a suitable feature type is selected according to the attributes of the data itself, and a corresponding feature vector x = [x_1, x_2, ..., x_d] is extracted for each example, where x_i is the i-th dimension of the feature vector x. After the feature vectors of all examples have been obtained, they can be concatenated row by row, in an arbitrary order, into the required feature matrix X, where X is an m × d matrix and d is the dimension of the feature vector.
Meanwhile, for the training dataset containing m examples, the number k of distinct labels occurring in it is counted, and according to the label membership of each example, a corresponding label vector y = [y_1, y_2, ..., y_k] is constructed for it, where y_j indicates whether the example belongs to the j-th label: if so, its value is 1, otherwise 0, and so on. Similarly, after the label vectors of all examples have been obtained, they can be concatenated row by row into the label matrix Y, in the same concatenation order as the feature matrix, where Y is an m × k matrix.
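The construction of the feature matrix X and the label matrix Y described above can be sketched as follows (an illustrative sketch with a toy, hypothetical dataset; NumPy is assumed, and the patent does not prescribe any particular implementation):

```python
import numpy as np

# Hypothetical training data: each example has a d-dimensional feature
# vector and a set of label indices drawn from k distinct labels.
features = [[0.2, 1.0, 0.5], [0.9, 0.1, 0.4], [0.3, 0.8, 0.7], [0.6, 0.2, 0.1]]
label_sets = [{0, 2}, {1}, {0, 1, 2}, {2}]  # label membership per example
k = 3  # number of distinct labels counted in the training set

# Feature matrix X (m x d): one feature vector per row.
X = np.asarray(features, dtype=float)

# Label matrix Y (m x k), in the SAME row order as X:
# Y[i, j] = 1 if example i belongs to label j, else 0.
Y = np.zeros((len(label_sets), k))
for i, labels in enumerate(label_sets):
    for j in labels:
        Y[i, j] = 1.0

print(X.shape, Y.shape)  # (4, 3) (4, 3)
```

Any feature type may replace the toy vectors here; what matters is that X and Y are concatenated in the same example order.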
The optimization module 20 is configured to obtain the optimal correlation function between the dimensionality reduction matrix and the feature matrix by using the feature matrix, and to obtain the optimal recovery error function between the dimensionality reduction matrix and the label matrix according to the label matrix.
Specifically, on the one hand, the optimal correlation function between the dimensionality reduction matrix and the feature matrix is obtained according to the feature matrix.
In actual operation, combined with the implicit coding method, suppose that a dimensionality reduction matrix C exists. The correlation between the dimensionality reduction matrix C and the feature matrix X can be decomposed into the sum of the correlations between each column of C and X. For any column c of the dimensionality reduction matrix C, the correlation between it and the feature matrix X can be described by the cosine correlation, expressed in functional form as follows:

ρ = (Xr)^T c / sqrt((Xr)^T (Xr) · c^T c)

where r is a linear mapping of X, used to project X into the space where c lies.
Meanwhile, in order to reduce the redundant information in the dimensionality reduction result, it is assumed that the columns of the dimensionality reduction matrix C are mutually orthogonal, that is, the dimensions of the latent semantic space described by C are mutually orthogonal; the corresponding mathematical expression is C^T C = E. From C^T C = E it follows that c^T c = 1, and linearly scaling r does not affect the value of the cosine correlation ρ. Therefore, the following optimization function can be constructed to solve for the optimal linear mapping r, and thereby obtain the optimal correlation between c and X:

ρ* = max_r (Xr)^T c
Constraint condition: (Xr)^T (Xr) = 1
By the method of Lagrange multipliers, the optimal linear mapping can be derived as r* = (X^T X)^{-1} X^T c (up to a linear scaling). Substituting r* back into the above optimization function, the optimal correlation is obtained as

ρ* = sqrt(c^T Δ c)

where Δ = X (X^T X)^{-1} X^T.
Therefore, the optimal correlation between the dimensionality reduction matrix C and the feature matrix X can be expressed as the following function:

P* = Σ_{i=1}^{l} sqrt(C_{·,i}^T Δ C_{·,i})

where C_{·,i} denotes the i-th column of the dimensionality reduction matrix C.
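The projection matrix Δ and the optimal correlation above can be checked numerically; the following is an illustrative sketch (NumPy assumed, toy random data, not the patent's code):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.normal(size=(m, d))

# Delta = X (X^T X)^{-1} X^T is the orthogonal projector onto the
# column space of X.
Delta = X @ np.linalg.inv(X.T @ X) @ X.T

# For a unit-norm column c, the optimal cosine correlation with X is
# rho* = sqrt(c^T Delta c), attained by the mapping r* = (X^T X)^{-1} X^T c.
c = rng.normal(size=m)
c /= np.linalg.norm(c)
rho_star = np.sqrt(c @ Delta @ c)

# Check against the definition rho = (X r)^T c with (X r)^T (X r) = 1.
r = np.linalg.inv(X.T @ X) @ X.T @ c
Xr = X @ r
Xr /= np.linalg.norm(Xr)   # scaling r does not change the cosine correlation
print(abs(Xr @ c - rho_star) < 1e-10)  # True
```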
On the other hand, the optimal recovery error function between the dimensionality reduction matrix and the label matrix is obtained according to the label matrix.
Combined with the implicit coding method, suppose that a dimensionality reduction matrix C exists, where C is an m × l matrix, l being the dimension after reduction, so that l << k.
Under the premise that the dimensionality reduction matrix C exists, the error of recovering the label matrix from C can be expressed as the following function:

ε = ||Y − CD||²_F

where D is a linear decoding matrix introduced to guarantee decoding efficiency, and ||·||²_F denotes the square of the Frobenius norm of a matrix. For a given dimensionality reduction matrix C, the optimal recovery error is the minimum ε. Therefore, the optimal decoding matrix D and the optimal recovery error can be obtained by minimizing ε, through constructing the following optimization function:

ε* = min_D ||Y − CD||²_F

Solving yields the optimal decoding matrix D* = (C^T C)^{-1} C^T Y; since C^T C = E is the identity matrix, D* further simplifies to D* = C^T Y. Substituting back into the above optimization function gives the optimal recovery error function ε* = Tr[Y^T Y − Y^T C C^T Y], where Tr[·] denotes the trace of a matrix, that is, the sum of its diagonal entries.
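The closed-form decoding matrix D* = C^T Y and the trace form of the recovery error can be verified numerically. This is an illustrative sketch (NumPy assumed; C is taken with orthonormal columns, as the derivation requires):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, l = 8, 5, 2
Y = (rng.random((m, k)) > 0.5).astype(float)   # toy binary label matrix

# A dimensionality reduction matrix C with orthonormal columns (C^T C = E).
C, _ = np.linalg.qr(rng.normal(size=(m, l)))

# Optimal linear decoding matrix: D* = (C^T C)^{-1} C^T Y = C^T Y.
D_star = C.T @ Y

# At the optimum, eps = ||Y - C D*||_F^2 equals Tr[Y^T Y - Y^T C C^T Y].
eps_direct = np.linalg.norm(Y - C @ D_star, "fro") ** 2
eps_trace = np.trace(Y.T @ Y - Y.T @ C @ C.T @ Y)
print(abs(eps_direct - eps_trace) < 1e-10)  # True
```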
The modeling module 30 is configured to construct an objective function according to the optimal correlation function and the optimal recovery error function, and, after optimizing the dimensionality reduction matrix by applying the objective function, to solve for the decoding matrix using the optimized dimensionality reduction matrix.
Specifically, through the optimization module 20, the optimal recovery error function for recovering the label matrix Y from the dimensionality reduction matrix C, ε* = Tr[Y^T Y − Y^T C C^T Y], as well as its optimal correlation function with the feature matrix X, P* = Σ_{i=1}^{l} sqrt(C_{·,i}^T Δ C_{·,i}), can be obtained. The optimal dimensionality reduction matrix should therefore simultaneously minimize the above optimal recovery error function and maximize the above optimal correlation function.
From the properties of the matrix trace, ε* = Tr[Y^T Y − Y^T C C^T Y] = Tr[Y^T Y] − Tr[Y^T C C^T Y]; since Tr[Y^T Y] is a constant, minimizing ε* is equivalent to maximizing Tr[Y^T C C^T Y]. In addition, for the optimal correlation part, since maximizing sqrt(C_{·,i}^T Δ C_{·,i}) is equivalent to maximizing C_{·,i}^T Δ C_{·,i}, maximizing P* is equivalent to maximizing Σ_{i=1}^{l} C_{·,i}^T Δ C_{·,i} = Tr[C^T Δ C]. Therefore, in order to simultaneously minimize ε* and maximize P*, the following equivalent objective function can be constructed to solve for the optimal dimensionality reduction matrix C:

Ω = max_C Tr[Y^T C C^T Y] + α Tr[C^T Δ C]
Constraint condition: C^T C = E

where α is a weight parameter used to adjust the weighting between the optimal recovery error and the optimal correlation. According to the properties of the matrix trace, the above objective function can be rewritten in the following form:

Ω = max_C Tr[C^T (Y Y^T + α Δ) C]
Constraint condition: C^T C = E
Through the objective function Ω, the optimal dimensionality reduction matrix C can be obtained. Solving for the optimal dimensionality reduction matrix C can be decomposed into optimization problems over its individual columns. For the optimization of the i-th column C_{·,i} of the dimensionality reduction matrix C, the following optimization subproblem can be constructed:

Ω_i = max_{C_{·,i}} C_{·,i}^T (Y Y^T + α Δ) C_{·,i}
Constraint conditions: C_{·,i}^T C_{·,i} = 1, C_{·,j}^T C_{·,i} = 0 (∀ j < i)

Using the method of Lagrange multipliers, it can be derived that the optimal C_{·,i} must satisfy the following optimality condition:

(Y Y^T + α Δ) C_{·,i} = λ_i C_{·,i}

where λ_i is the introduced Lagrange multiplier; substituting back into the above optimization subproblem shows that the optimal Ω_i equals λ_i.
From the above optimality condition it can be observed that the optimal C_{·,i} is a unit eigenvector (Eigenvector) of the matrix (Y Y^T + α Δ), and, owing to the orthogonality of eigenvectors, the remaining orthogonality constraints are satisfied naturally. It can thus be found that the dimensionality reduction matrix C is in fact formed by concatenating, column by column, the unit eigenvectors corresponding to the l largest eigenvalues (Eigenvalue) of the matrix (Y Y^T + α Δ). Therefore, the process of solving for the dimensionality reduction matrix C is the process of performing an eigenvalue decomposition (Eigenvalue decomposition) of the matrix (Y Y^T + α Δ), from which all eigenvalues of this matrix and the unit eigenvector corresponding to each eigenvalue can be obtained. Since the complexity of the eigenvalue decomposition of an m × m matrix is not more than O(m³), and since (Y Y^T + α Δ) is a symmetric matrix while, in one embodiment of the invention, only the unit eigenvectors corresponding to the l largest eigenvalues are needed, the complexity of solving for the dimensionality reduction matrix C can be lower than O(m³), thereby guaranteeing that the process of solving for the dimensionality reduction matrix C is sufficiently efficient.
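The eigendecomposition step above can be sketched as follows (illustrative only, with toy random matrices and NumPy's dense `eigh`; an iterative top-l eigensolver would be used at scale):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, k, l, alpha = 10, 4, 6, 3, 0.5
X = rng.normal(size=(m, d))
Y = (rng.random((m, k)) > 0.5).astype(float)

Delta = X @ np.linalg.inv(X.T @ X) @ X.T
M = Y @ Y.T + alpha * Delta          # symmetric m x m matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order; the
# optimal C concatenates the unit eigenvectors of the l largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(M)
C = eigvecs[:, -l:][:, ::-1]         # columns ordered by decreasing eigenvalue

# C automatically satisfies the constraint C^T C = E.
print(np.allclose(C.T @ C, np.eye(l)))  # True

# The decoding matrix then follows in closed form.
D_star = C.T @ Y
```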
In addition, after the dimensionality reduction matrix C has been solved by the eigenvalue decomposition method, the optimal decoding matrix D can be obtained by minimizing the optimal recovery error of recovering the original label matrix Y from the dimensionality reduction matrix C. From the computation process of the optimization module 20, the optimal decoding matrix is D* = C^T Y.
The learning module 40 is configured to obtain prediction models by learning and training using the optimized dimensionality reduction matrix.
According to the optimal dimensionality reduction matrix C obtained by the modeling module 30, a corresponding prediction model is trained for each dimension of the latent semantic space described by C. Specifically, for the i-th dimension (1 ≤ i ≤ l), the value of the j-th example in this dimension is C_{j,i}, so the vector formed by the values of all training examples in this dimension is the i-th column C_{·,i} of C. From the feature matrix X of all training examples and the value vector C_{·,i} in the i-th dimension of the latent semantic space, a corresponding prediction model h_i : X → C_{·,i} can be obtained by learning and training, for predicting the value of any example in this dimension; its input is the feature vector of a test example, and its output is the value of that example in the i-th dimension.
In actual operation, the choice of prediction model can be set according to the specific application; common choices include linear regression (Linear regression) and the like. After label space dimensionality reduction, the dimension l of the latent semantic space is often much smaller than the dimension k of the original label space, so the number of required prediction models is significantly reduced, which effectively lowers the training cost.
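Taking linear regression as the prediction model, as the text suggests, the per-dimension training step can be sketched as follows (an illustrative sketch under stated assumptions: plain least squares, toy data; the patent leaves the model choice open):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, l = 10, 4, 3
X = rng.normal(size=(m, d))
C = np.linalg.qr(rng.normal(size=(m, l)))[0]   # stand-in latent representation

# One linear model per latent dimension: h_i(x) = x @ W[:, i], fit by
# least squares on (X, C[:, i]); solving all l columns at once stacks the
# weight vectors into a d x l matrix W.
W, *_ = np.linalg.lstsq(X, C, rcond=None)

def predict_latent(x_new):
    """Predict the l-dimensional latent representation of a feature vector."""
    return x_new @ W

z_hat = predict_latent(X[0])
print(z_hat.shape)  # (3,)
```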
In addition, the prediction module 200 of the embodiment of the present invention specifically operates as follows:
(1) When a test example to be classified is given, the same features as in the training process need to be extracted from it, obtaining the d-dimensional feature vector x̂ of the test example.
(2) After the feature vector x̂ of the test example has been obtained, the l prediction models learned by the training module 100 are used to predict the value of the test example in each dimension of the latent semantic space, thereby obtaining its l-dimensional representation vector in the latent semantic space, ẑ = [h_1(x̂), h_2(x̂), ..., h_l(x̂)].
(3) The decoding matrix is used to decode the representation of the test example in the latent semantic space, so as to obtain the classification result of the test example in the original label space.
Using the optimal decoding matrix D* obtained by the modeling module 30, the l-dimensional representation vector of the test example in the latent semantic space is decoded, recovering the k-dimensional representation vector ŷ in the original label space, that is, ŷ = ẑ D*.
The values of the vector ŷ thus obtained are real-valued, and need to be binarized by a set threshold (usually taken as 0.5). Specifically, if the value in a dimension exceeds the set threshold, it takes the value 1, otherwise 0, thereby expressing the label membership of the test example in the original label space; that is, the labels corresponding to the dimensions whose value is 1 constitute the multi-label classification result of the test example to be classified.
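The whole prediction path — latent prediction, decoding, thresholding — can be sketched end to end (illustrative only; linear per-dimension models and toy random data stand in for a trained system):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, k, l = 10, 4, 6, 3
X = rng.normal(size=(m, d))
Y = (rng.random((m, k)) > 0.5).astype(float)
C = np.linalg.qr(rng.normal(size=(m, l)))[0]   # stand-in dimensionality reduction matrix
W, *_ = np.linalg.lstsq(X, C, rcond=None)      # per-dimension linear models
D_star = C.T @ Y                               # decoding matrix D* = C^T Y

def classify(x_new, threshold=0.5):
    z_hat = x_new @ W          # l-dim latent representation z^
    y_hat = z_hat @ D_star     # decode back to the k-dim label space
    return (y_hat > threshold).astype(int)     # binarize at the threshold

labels = classify(rng.normal(size=d))
print(labels.shape)  # (6,)
```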
In one embodiment of the invention, the dimensions of the latent semantic space are mutually orthogonal, which minimizes the redundant information in the dimensionality reduction result, so that the method can encode more of the information in the original label space with a lower dimension, guaranteeing a larger compression ratio of the original label space in the dimensionality reduction process.
In one embodiment of the invention, the dimension of the latent semantic space is smaller than the dimension of the original label space, and no explicit coding function needs to be set in advance, which guarantees that the method of the embodiment of the present invention can learn the optimal dimensionality reduction result adaptively under different data scenarios, with better stability.
According to the label space dimensionality reduction system based on feature-related implicit coding of the embodiment of the present invention, in the process of learning the dimensionality reduction result, both the recovery error with respect to the label matrix and the correlation with the feature space are fully taken into account. Through the optimization process, it is guaranteed that the dimensionality reduction result can be well recovered to the label matrix, and at the same time can be described by the prediction models learned from the feature space, so that good multi-label classification performance can be obtained at a lower training cost.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" mean that specific features, structures, materials or characteristics described in conjunction with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic statements of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine features of the different embodiments or examples described in this specification.
Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limitations of the present invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims (9)

1. A label space dimensionality reduction method based on feature-related implicit coding, characterized by comprising the following steps:
providing a training dataset;
constructing a feature matrix and a label matrix according to the training dataset;
obtaining an optimal correlation function between a dimensionality reduction matrix and the feature matrix according to the feature matrix, and obtaining an optimal recovery error function between the dimensionality reduction matrix and the label matrix according to the label matrix;
constructing an objective function according to the optimal correlation function and the optimal recovery error function;
applying the objective function to optimize the dimensionality reduction matrix, and solving for a decoding matrix according to the optimized dimensionality reduction matrix;
obtaining prediction models by learning and training using the optimized dimensionality reduction matrix;
extracting features of a test example, and using the prediction models to predict the representation of the test example in a latent semantic space; and
using the decoding matrix to decode the representation of the test example in the latent semantic space, so as to obtain the classification result of the test example in the original label space.
2. The method according to claim 1, characterized in that the dimensions of the latent semantic space are mutually orthogonal.
3. The method according to claim 1, characterized in that the classification result of the test example in the original label space is subjected to binarization processing.
4. The method according to claim 1, characterized in that the dimension of the latent semantic space is smaller than the dimension of the original label space.
5. A label space dimensionality reduction system based on feature-related implicit coding, characterized by comprising:
a training module, configured to perform learning and training according to a training dataset to obtain prediction models; and
a prediction module, configured to obtain the classification result of a test example in the original label space according to the prediction models.
6. The system according to claim 5, characterized in that the training module specifically comprises:
a constructing module, configured to construct a feature matrix and a label matrix according to the training dataset;
an optimization module, configured to obtain an optimal correlation function between a dimensionality reduction matrix and the feature matrix according to the feature matrix, and obtain an optimal recovery error function between the dimensionality reduction matrix and the label matrix according to the label matrix;
a modeling module, configured to construct an objective function according to the optimal correlation function and the optimal recovery error function, and, after applying the objective function to optimize the dimensionality reduction matrix, solve for a decoding matrix using the optimized dimensionality reduction matrix; and
a learning module, configured to obtain the prediction models by learning and training using the optimized dimensionality reduction matrix.
7. The system according to claim 5 or 6, characterized in that the dimensions of the latent semantic space are mutually orthogonal.
8. The system according to claim 5, characterized in that the classification result of the test example in the original label space is subjected to binarization processing.
9. The system according to claim 5 or 6, characterized in that the dimension of the latent semantic space is smaller than the dimension of the original label space.
CN201410024964.XA 2014-01-20 2014-01-20 Label space dimensionality reducing method and system based on feature-related implicit coding Active CN103761532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410024964.XA CN103761532B (en) 2014-01-20 2014-01-20 Label space dimensionality reducing method and system based on feature-related implicit coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410024964.XA CN103761532B (en) 2014-01-20 2014-01-20 Label space dimensionality reducing method and system based on feature-related implicit coding

Publications (2)

Publication Number Publication Date
CN103761532A true CN103761532A (en) 2014-04-30
CN103761532B CN103761532B (en) 2017-04-19

Family

ID=50528767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410024964.XA Active CN103761532B (en) 2014-01-20 2014-01-20 Label space dimensionality reducing method and system based on feature-related implicit coding

Country Status (1)

Country Link
CN (1) CN103761532B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952293A (en) * 2016-12-26 2017-07-14 北京影谱科技股份有限公司 A kind of method for tracking target based on nonparametric on-line talking
CN109036553A (en) * 2018-08-01 2018-12-18 北京理工大学 A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge
CN111967501A (en) * 2020-07-22 2020-11-20 中国科学院国家空间科学中心 Load state discrimination method and system driven by telemetering original data
CN114510518A (en) * 2022-04-15 2022-05-17 北京快立方科技有限公司 Self-adaptive aggregation method and system for massive structured data and electronic equipment
WO2023249555A3 (en) * 2022-06-21 2024-02-15 Lemon Inc. Sample processing based on label mapping

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090016599A1 (en) * 2007-07-11 2009-01-15 John Eric Eaton Semantic representation module of a machine-learning engine in a video analysis system
CN102004774A (en) * 2010-11-16 2011-04-06 清华大学 Personalized user tag modeling and recommendation method based on unified probability model
CN102982344A (en) * 2012-11-12 2013-03-20 浙江大学 Support vector machine sorting method based on simultaneously blending multi-view features and multi-label information
CN103176961A (en) * 2013-03-05 2013-06-26 哈尔滨工程大学 Transfer learning method based on latent semantic analysis
CN103514456A (en) * 2013-06-30 2014-01-15 安科智慧城市技术(中国)有限公司 Image classification method and device based on compressed sensing multi-core learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090016599A1 (en) * 2007-07-11 2009-01-15 John Eric Eaton Semantic representation module of a machine-learning engine in a video analysis system
CN102004774A (en) * 2010-11-16 2011-04-06 清华大学 Personalized user tag modeling and recommendation method based on unified probability model
CN102982344A (en) * 2012-11-12 2013-03-20 浙江大学 Support vector machine sorting method based on simultaneously blending multi-view features and multi-label information
CN103176961A (en) * 2013-03-05 2013-06-26 哈尔滨工程大学 Transfer learning method based on latent semantic analysis
CN103514456A (en) * 2013-06-30 2014-01-15 安科智慧城市技术(中国)有限公司 Image classification method and device based on compressed sensing multi-core learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG, Lili: "Research on Feature Selection and Classification Problems in Multi-Label Learning", China Master's Theses Full-text Database *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952293A (en) * 2016-12-26 2017-07-14 北京影谱科技股份有限公司 A kind of method for tracking target based on nonparametric on-line talking
CN106952293B (en) * 2016-12-26 2020-02-28 北京影谱科技股份有限公司 Target tracking method based on nonparametric online clustering
CN109036553A (en) * 2018-08-01 2018-12-18 北京理工大学 A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge
CN109036553B (en) * 2018-08-01 2022-03-29 北京理工大学 Disease prediction method based on automatic extraction of medical expert knowledge
CN111967501A (en) * 2020-07-22 2020-11-20 中国科学院国家空间科学中心 Load state discrimination method and system driven by telemetering original data
CN111967501B (en) * 2020-07-22 2023-11-17 中国科学院国家空间科学中心 Method and system for judging load state driven by telemetering original data
CN114510518A (en) * 2022-04-15 2022-05-17 北京快立方科技有限公司 Self-adaptive aggregation method and system for massive structured data and electronic equipment
CN114510518B (en) * 2022-04-15 2022-07-12 北京快立方科技有限公司 Self-adaptive aggregation method and system for massive structured data and electronic equipment
WO2023249555A3 (en) * 2022-06-21 2024-02-15 Lemon Inc. Sample processing based on label mapping

Also Published As

Publication number Publication date
CN103761532B (en) 2017-04-19

Similar Documents

Publication Publication Date Title
Huang et al. Multi-view intact space clustering
CN107729513B (en) Discrete supervision cross-modal Hash retrieval method based on semantic alignment
AU2018247340B2 (en) Dvqa: understanding data visualizations through question answering
Yin et al. Multi-view clustering via pairwise sparse subspace representation
Zhang et al. Accelerated training for matrix-norm regularization: A boosting approach
CN112711660B (en) Method for constructing text classification sample and method for training text classification model
Wood et al. The sequence memoizer
CN103761532A (en) Label space dimensionality reducing method and system based on feature-related implicit coding
CN105022740A (en) Processing method and device of unstructured data
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN110990596A (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN110704543A (en) Multi-type multi-platform information data self-adaptive fusion system and method
CN113868351A (en) Address clustering method and device, electronic equipment and storage medium
CN111581466A (en) Multi-label learning method for characteristic information with noise
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
Zhang et al. MOON: Multi-hash codes joint learning for cross-media retrieval
Moon et al. Image patch analysis of sunspots and active regions-II. Clustering via matrix factorization
CN113553442A (en) Unsupervised event knowledge graph construction method and system
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
US11907307B1 (en) Method and system for event prediction via causal map generation and visualization
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN116340635A (en) Article recommendation method, model training method, device and equipment
Xu et al. Label distribution changing learning with sample space expanding
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant