CN104199829B - Affection data sorting technique and system - Google Patents


Info

Publication number
CN104199829B
Authority
CN
China
Prior art keywords
document
matrix
word
emotion
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410361587.9A
Other languages
Chinese (zh)
Other versions
CN104199829A (en
Inventor
周光有
王巨宏
蒋杰
薛伟
管刚
赵军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Tencent Cyber Tianjin Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
Tencent Cyber Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science and Tencent Cyber Tianjin Co Ltd
Priority to CN201410361587.9A
Publication of CN104199829A
Application granted
Publication of CN104199829B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Databases & Information Systems
  • Life Sciences & Earth Sciences
  • Artificial Intelligence
  • Bioinformatics & Cheminformatics
  • Bioinformatics & Computational Biology
  • Computer Vision & Pattern Recognition
  • Evolutionary Biology
  • Evolutionary Computation
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

The invention provides a sentiment data classification method and system. The method includes: constructing a document-document graph and a word-word graph corresponding to a training data set, where in the document-document graph the nodes represent documents in the training data set and the geometric information of the edges represents the degree of correlation between documents, and in the word-word graph the nodes represent words in the training data set and the geometric information of the edges represents the degree of correlation between words; constructing graph-based regularization terms in an objective function according to the geometric information of the document-document graph and the word-word graph; optimizing the objective function and outputting a document-emotion matrix; and acquiring documents in a test data set and obtaining, according to the document-emotion matrix, the emotional tendency corresponding to each of those documents. The method and system can improve emotion classification accuracy.

Description

Sentiment data classification method and system
Technical Field
The invention relates to a natural language processing technology, in particular to an emotion data classification method and system.
Background
With the development of Web 2.0, more and more users generate data carrying emotion on web pages, usually in the form of comments and blog posts. Sentiment classification refers to automatically predicting the emotional tendency of such user-generated data, for example predicting whether a comment is positive or negative.
Recently, emotion classification has gained general attention in natural language processing. Emotion classification methods can be divided into supervised and unsupervised emotion analysis. Supervised emotion analysis relies on manually labeled training data, and in some cases labeling is time consuming and expensive, which motivates unsupervised or semi-supervised emotion analysis.
The traditional unsupervised (or semi-supervised) method of sentiment analysis is dictionary-based. Dictionary-based methods employ an emotion vocabulary to determine the overall emotional tendency of a document. However, it is difficult to define a single best emotion vocabulary that covers all words from different domains, and most semi-automatic dictionary-based methods do not yield satisfactory results. A more advanced dictionary-based method is emotion classification based on constrained non-negative matrix tri-factorization (CNMTF), which uses a domain-independent emotion vocabulary as prior knowledge. Experiments show, however, that the classification precision of the CNMTF-based method still needs to be improved.
Disclosure of Invention
In view of the above, it is desirable to provide a method and system for classifying emotion data, which can improve classification accuracy.
A method of sentiment data classification, the method comprising:
constructing a document-document graph and a word-word graph corresponding to a training data set, wherein in the document-document graph, nodes represent documents in the training data set, geometric information of edges represents the correlation degree between the documents, and in the word-word graph, the nodes represent words in the training data set, and the geometric information of the edges represents the correlation degree between the words;
constructing graph-based regularization items in an objective function according to the geometric information of the document-document graph and the word-word graph;
optimizing the objective function and outputting a document-emotion matrix;
and acquiring documents in a test data set, and acquiring emotional tendency corresponding to the documents in the test data set according to the document-emotional matrix.
A sentiment data classification system, the system comprising:
the graph construction module is used for constructing a document-document graph and a word-word graph corresponding to a training data set, wherein in the document-document graph, nodes represent documents in the training data set, geometric information of edges represents the correlation degree between the documents, in the word-word graph, the nodes represent words in the training data set, and the edge attributes represent the correlation degree between the words;
the regularization item construction module is used for constructing graph-based regularization items in an objective function according to the geometric information of the document-document graph and the word-word graph;
the optimization processing module is used for optimizing the objective function and outputting a document-emotion matrix;
and the emotional tendency determining module is used for acquiring the documents in the test data set and acquiring the emotional tendency corresponding to the documents in the test data set according to the document-emotional matrix.
According to the above method and system for classifying emotion data, two graphs corresponding to a training data set are constructed, namely a document-document graph and a word-word graph. When the objective function is constructed, the geometric information in the document space and the word space is fully considered, using the principle that adjacent words or documents tend to have the same emotional tendency. After the objective function is optimized, the output document-emotion matrix is more accurate, so the emotional tendency determined for the documents in a test data set is more accurate and the classification accuracy of the emotion data is improved.
Drawings
FIG. 1 is a flow diagram of a method for sentiment data classification in one embodiment;
FIG. 2 is a block diagram of the emotion data classification system in one embodiment;
FIG. 3 is a graph showing comparison of emotion classification accuracy at different parameters on two different data sets;
FIG. 4 is a graph showing comparison of emotion classification accuracy at different nearest neighbor values on two different data sets;
FIG. 5 is a comparative schematic of parametric analysis of GNMTF on two datasets;
FIG. 6 is a graph comparing emotion classification accuracy under different percentages of labeled documents in the semi-supervised mode.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The method for classifying emotion data provided by the embodiment of the invention determines the emotional tendency of documents in a test data set. The test data set can be a set of emotion data generated by users on the internet, such as comment data and blog data. For a document such as a review, the method determines its emotional tendency, for example whether it is positive or negative. Specifically, the method trains on a training data set, which can be a large collection of emotion data existing on the internet. Training yields an optimal document-emotion matrix, and using this matrix to determine the emotional tendency of the documents in the test data set gives a more accurate classification result. Considering that adjacent words or documents tend to have the same emotional tendency, two graphs are constructed for the training data set during training, namely a document-document graph and a word-word graph. The two graphs are closely related and contain the geometric information in the document space and the word space, respectively. They are used to regularize the non-negative matrix tri-factorization: graph-based regularization terms are constructed in the objective function, and the objective function is then optimized to obtain the optimal document-emotion matrix. Because the objective function contains graph-based regularization terms in addition to the non-negative matrix tri-factorization term, the algorithm adopted by the embodiment of the invention is called the Graph-regularized Non-negative Matrix Tri-factorization (GNMTF) algorithm.
As shown in FIG. 1, in one embodiment, a sentiment data classification method is provided and comprises the following steps:
Step 102: constructing a document-document graph and a word-word graph corresponding to the training data set.
The training data set is a data set used for training, a large amount of emotion data existing in the internet can be selected as training samples, and the training samples can also contain some manually labeled documents. In this embodiment, in the document-document graph, the nodes represent documents in the training dataset, and the geometric information of the edges represents the correlation between the documents. In the word-word graph, nodes represent words in the training data set, and edge attributes represent the degree of correlation between the words. Thus, the two constructed graphs retain the geometric information in the document space and word space, respectively.
Step 104: constructing graph-based regularization terms of the objective function according to the geometric information of the document-document graph and the word-word graph.
In this embodiment, when constructing the objective function, a graph-based regularization term is added on the basis of the CNMTF. CNMTF refers to constrained non-negative matrix tri-factorization, and the constructed objective function comprises non-negative matrix tri-factorization terms and lexical prior knowledge terms. Specifically, a corresponding correlation matrix can be obtained according to the document-document graph, a corresponding correlation matrix can be obtained according to the word-word graph, a laplacian matrix of the corresponding graph can be obtained according to the two correlation matrices, and a graph-based regularization item in the objective function is constructed according to the laplacian matrix, so that the geometric information in the document space and the word space is reserved.
Step 106: optimizing the objective function and outputting a document-emotion matrix.
Specifically, the constructed objective function is monotonically decreased until convergence, and finally, parameters corresponding to the minimization of the objective function are obtained, wherein the parameters comprise the document-emotion matrix. The document-emotion matrix is the optimal document-emotion matrix, and identifies the emotion (i.e. emotional tendency) corresponding to a document.
Step 108: acquiring the documents in the test data set, and obtaining the emotional tendency corresponding to the documents in the test data set according to the output document-emotion matrix.
A test data set is a collection of documents whose emotional tendency is to be determined. For a document in the test data set, the row corresponding to the document is found in the output document-emotion matrix, and the emotional tendency with the maximum value in that row is taken as the emotional tendency of the document. If that tendency is positive, the emotion of the document is positive; if it is negative, the emotion of the document is negative. The document is thus classified according to its emotion.
In the embodiment, by constructing two graphs corresponding to the training data set, namely the document-document graph and the word-word graph, when the objective function is constructed, geometric information in a document space and a word space is fully considered, and the principle that adjacent words or documents often have the same emotional tendency is utilized, after the objective function is optimized, the output optimal document-emotional matrix is more accurate, so that the corresponding emotional tendency is determined more accurately for the documents in the test data set, and the classification accuracy of the emotional data is improved.
The GNMTF algorithm provided by the embodiment of the invention is provided on the basis of CNMTF, and the CNMTF is constrained NMTF (Non-negative Matrix Tri-factorization). For a clearer understanding of the present invention, the following description is made with respect to NMTF and CNMTF and some basic concepts are introduced.
Non-negative matrix tri-factorization (NMTF) can be used for unsupervised (or semi-supervised) sentiment analysis. In these models, the word-document matrix is approximated by the product of three factor matrices, and class labels are assigned to words and documents by solving the optimization problem in equation (1):
$$\min_{U,H,V \ge 0} \; \|X - UHV^T\|_F^2 + \sigma_1 \|U^TU - I\|_F^2 + \sigma_2 \|V^TV - I\|_F^2 \tag{1}$$
where $X \in \mathbb{R}^{m \times n}$ is the word-document matrix, $m$ is the number of words and $n$ is the number of documents. $\sigma_1$ and $\sigma_2$ are shrinkage regularization parameters. $U \in \mathbb{R}^{m \times k}$ is the word-emotion matrix, $V \in \mathbb{R}^{n \times k}$ is the document-emotion matrix, and $H \in \mathbb{R}^{k \times k}$ is the association matrix between U and V. $k$ is the number of emotion categories of a document; here $k = 2$, that is, the emotion classification includes two categories: positive and negative. For example, $V_{i1} = 1$ (or $U_{i1} = 1$) denotes that the emotional tendency of document $i$ (or word $i$) is positive, and $V_{i2} = 1$ (or $U_{i2} = 1$) denotes that the emotional tendency of document $i$ (or word $i$) is negative. $V_{i*} = 0$ (or $U_{i*} = 0$) indicates unknown, i.e. document $i$ (or word $i$) is neither positive nor negative. $\|\cdot\|_F$ is the Frobenius norm, and $I$ is the identity matrix.
For equation (1), following the shrinkage technique, the orthogonality constraints on U and V are approximately satisfied by preventing the second and third terms from becoming too large. By the Lagrange multiplier argument, for any given $\epsilon_1$ and $\epsilon_2$ there exist appropriate $\sigma_1$ and $\sigma_2$ such that $\|U^TU - I\|_F^2 \le \epsilon_1$ and $\|V^TV - I\|_F^2 \le \epsilon_2$.
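As a minimal sketch of how the objective in equation (1) can be evaluated, the following NumPy function computes the reconstruction error plus the two orthogonality penalties. The function name `nmtf_objective` and the toy matrices are assumptions made for illustration and are not part of the patent.

```python
import numpy as np


def nmtf_objective(X, U, H, V, sigma1, sigma2):
    """Value of the NMTF objective: reconstruction error plus the two
    penalties that push U and V towards orthogonality."""
    identity = np.eye(U.shape[1])
    reconstruction = np.linalg.norm(X - U @ H @ V.T, "fro") ** 2
    penalty_u = sigma1 * np.linalg.norm(U.T @ U - identity, "fro") ** 2
    penalty_v = sigma2 * np.linalg.norm(V.T @ V - identity, "fro") ** 2
    return reconstruction + penalty_u + penalty_v


# Hypothetical toy data: 4 words, 3 documents, k = 2 emotion categories.
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
H = np.eye(2)
X = U @ H @ V.T  # a word-document matrix that factorizes exactly

exact_fit = nmtf_objective(X, U, H, V, sigma1=0.0, sigma2=0.0)
with_penalties = nmtf_objective(X, U, H, V, sigma1=1.0, sigma2=1.0)
```

With the penalties switched off the value is zero because X factorizes exactly; switching them on adds a cost because the columns of U and V are orthogonal but not of unit length.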
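Some concepts related to NMTF are as follows.
Given a matrix $X \in \mathbb{R}^{n \times n}$, the trace of X is calculated as:
$$\mathrm{Tr}(X) = \sum_{i=1}^{n} X_{ii} \tag{2}$$
A norm can also be calculated when the matrix is not square. Without loss of generality, let $m \ge n$ and $X \in \mathbb{R}^{m \times n}$; the Frobenius norm of X is calculated as follows:
$$\|X\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2} \tag{3}$$
The trace and norm of a matrix have the following properties:
Property 1: given a matrix $X \in \mathbb{R}^{m \times n}$, then
$$\mathrm{Tr}(X^TX) = \mathrm{Tr}(XX^T) \tag{4}$$
Property 2: given matrices $X, Y \in \mathbb{R}^{n \times n}$ and scalars $a$, $b$, then
$$\mathrm{Tr}(a \cdot X + b \cdot Y) = a \cdot \mathrm{Tr}(X) + b \cdot \mathrm{Tr}(Y) \tag{5}$$
Property 3: given a matrix $X \in \mathbb{R}^{m \times n}$, then
$$\|X\|_F^2 = \mathrm{Tr}(X^TX) = \mathrm{Tr}(XX^T) \tag{6}$$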
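The three properties can be checked numerically. The snippet below is only an illustrative sanity check on random matrices; the matrix sizes, the seed and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))  # non-square matrix with m >= n
A = rng.random((4, 4))
B = rng.random((4, 4))
a, b = 2.0, -0.5

# Property 1: Tr(X^T X) = Tr(X X^T)
prop1 = np.isclose(np.trace(X.T @ X), np.trace(X @ X.T))
# Property 2: the trace is linear
prop2 = np.isclose(np.trace(a * A + b * B), a * np.trace(A) + b * np.trace(B))
# Property 3: squared Frobenius norm equals Tr(X^T X)
prop3 = np.isclose(np.linalg.norm(X, "fro") ** 2, np.trace(X.T @ X))
```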
For the constrained NMTF, i.e., CNMTF, it incorporates lexical prior knowledge into the matrix tri-factorization. For the emotion analysis of unsupervised patterns, the objective function constructed by CNMTF is:
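$$\min_{U,H,V \ge 0} \; \|X - UHV^T\|_F^2 + \alpha \, \mathrm{Tr}\left[(U - U_0)^T C_u (U - U_0)\right] + \sigma_1 \|U^TU - I\|_F^2 + \sigma_2 \|V^TV - I\|_F^2 \tag{7}$$
where $U_0$ represents the lexical prior knowledge about the emotion words in the vocabulary: if word $i$ is positive, then $(U_0)_{i1} = 1$; if word $i$ is negative, then $(U_0)_{i2} = 1$; if the word does not appear in the vocabulary, $(U_0)_{i*} = 0$. $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, and $\alpha > 0$ is a parameter that controls the contribution of the lexical prior knowledge. $C_u \in \{0,1\}^{m \times m}$ is a diagonal matrix whose entry $(C_u)_{ii} = 1$ if the category of the $i$-th word is known, and $(C_u)_{ii} = 0$ otherwise.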
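To make the lexical prior concrete, the sketch below builds the prior matrix U0 and the diagonal indicator matrix C_u from a small emotion lexicon and evaluates the prior term α·Tr[(U − U0)ᵀ C_u (U − U0)]. The helper names, the vocabulary and the word lists are hypothetical examples, not data from the patent.

```python
import numpy as np


def build_lexicon_prior(vocabulary, positive_words, negative_words):
    """Return (U0, Cu): column 0 marks positive words, column 1 negative
    words; Cu has a 1 on the diagonal for every word found in the lexicon."""
    m = len(vocabulary)
    U0 = np.zeros((m, 2))
    known = np.zeros(m)
    for i, word in enumerate(vocabulary):
        if word in positive_words:
            U0[i, 0] = 1.0
            known[i] = 1.0
        elif word in negative_words:
            U0[i, 1] = 1.0
            known[i] = 1.0
    return U0, np.diag(known)


def lexicon_penalty(U, U0, Cu, alpha):
    """Prior term alpha * Tr[(U - U0)^T Cu (U - U0)]."""
    diff = U - U0
    return alpha * np.trace(diff.T @ Cu @ diff)


# Hypothetical vocabulary and lexicon.
vocabulary = ["good", "bad", "phone", "excellent"]
U0, Cu = build_lexicon_prior(vocabulary, {"good", "excellent"}, {"bad"})
penalty_at_prior = lexicon_penalty(U0, U0, Cu, alpha=0.5)
penalty_at_zero = lexicon_penalty(np.zeros_like(U0), U0, Cu, alpha=0.5)
```

Only rows of words that appear in the lexicon contribute to the penalty, so "phone" can take any value in U without cost.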
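For semi-supervised emotion analysis, let $V_0$ represent the manually labeled documents in the training data set: if a document expresses positive emotion, $(V_0)_{i1} = 1$; if it expresses negative emotion, $(V_0)_{i2} = 1$. Therefore, in the semi-supervised mode, the objective function constructed by CNMTF is:
$$\min_{U,H,V \ge 0} \; \|X - UHV^T\|_F^2 + \alpha \, \mathrm{Tr}\left[(U - U_0)^T C_u (U - U_0)\right] + \beta \, \mathrm{Tr}\left[(V - V_0)^T C_v (V - V_0)\right] + \sigma_1 \|U^TU - I\|_F^2 + \sigma_2 \|V^TV - I\|_F^2 \tag{8}$$
where $\beta > 0$ is a parameter that controls the contribution of the manually labeled documents. $C_v \in \{0,1\}^{n \times n}$ is a diagonal matrix whose entry $(C_v)_{ii} = 1$ if the category of the $i$-th document is labeled, and $(C_v)_{ii} = 0$ otherwise.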
After introducing NMTF, CNMTF and some basic concepts, the emotion data classification method provided by the invention is described in detail. The method uses a graph-regularized non-negative matrix tri-factorization algorithm, namely GNMTF. Compared with CNMTF, it incorporates the geometric information in the document space and the word space. The method is based on the manifold assumption: if, in the intrinsic geometry of the document distribution, two documents $X_i$ and $X_j$ are sufficiently close to each other, then their emotional tendencies $V_i$ and $V_j$ should also be close.
Specifically, to model the geometric structure, two graphs are constructed for the training data set: a document-document graph and a word-word graph. The document-document graph describes the geometric information in the document space, formed by document nodes and the edges between them. The word-word graph describes the geometric information in the word space, formed by word nodes and the edges between them. The two graphs are described separately below.
In the document-document graph, nodes represent documents in the training data set, and the geometric information of edges represents the degree of correlation between documents. The correlation matrix of the document-document graph is defined as follows: if either of two documents is among the nearest neighbors of the other, the correlation of the two documents is the cosine between them; otherwise the correlation is 0. That is, the correlation matrix of the document-document graph $G_v$ is defined as:
$$(W_v)_{ij} = \begin{cases} \cos(X_i, X_j), & \text{if } X_i \in N_p(X_j) \text{ or } X_j \in N_p(X_i) \\ 0, & \text{otherwise} \end{cases} \tag{9}$$
where $N_p(X_i)$ denotes the $p$ nearest neighbors of document $X_i$, i.e. the $p$ nodes (documents) closest to the current node (document).
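The nearest-neighbor correlation matrix described above can be sketched in NumPy as follows. This is an illustrative implementation under the assumption that each node is given as a row vector and that nearness is measured by the same cosine similarity; the function name `knn_cosine_affinity` and the toy vectors are not from the patent.

```python
import numpy as np


def knn_cosine_affinity(rows, p):
    """Symmetric affinity matrix: entry (i, j) is cos(row_i, row_j) if either
    row is among the p nearest neighbours of the other, and 0 otherwise.
    Assumes p < number of rows."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0              # avoid dividing by zero for empty rows
    unit = rows / norms
    cos = unit @ unit.T                    # pairwise cosine similarities
    n = cos.shape[0]
    sim = cos.copy()
    np.fill_diagonal(sim, -np.inf)         # a node is not its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :p]   # p most similar other nodes
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), p), nearest.ravel()] = True
    mask = mask | mask.T                   # "either one is a neighbour of the other"
    return np.where(mask, cos, 0.0)


# Hypothetical documents as rows of term weights (3 documents, 4 words).
docs = np.array([[1.0, 1.0, 0.0, 0.0],
                 [1.0, 0.9, 0.0, 0.1],
                 [0.0, 0.0, 1.0, 1.0]])
W_docs = knn_cosine_affinity(docs, p=1)
```

With a word-document matrix X of shape m×n, the document graph would be built from the rows of `X.T` and the word graph from the rows of `X`, since each word is represented by its vector of document entries.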
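Preserving the geometric structure in the document space reduces to minimizing the following loss function:
$$\frac{1}{2} \sum_{i,j=1}^{n} \|V_i - V_j\|^2 (W_v)_{ij} = \mathrm{Tr}(V^T D_v V) - \mathrm{Tr}(V^T W_v V) = \mathrm{Tr}(V^T L_v V) \tag{10}$$
where $V_i$ is the $i$-th row of V, $D_v \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose entries are the column sums (or row sums, because $W_v$ is symmetric) of $W_v$, i.e. $(D_v)_{ii} = \sum_j (W_v)_{ij}$, $W_v$ is the correlation matrix of graph $G_v$, and $L_v = D_v - W_v$ is the Laplacian matrix of graph $G_v$.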
Similarly, in the word-word graph, nodes represent words in the training data set, and the geometric information of edges represents the degree of correlation between words. The correlation matrix of the word-word graph is defined as follows: if either of two words is among the nearest neighbors of the other, the correlation of the two words is the cosine between them; otherwise the correlation is 0. That is, the correlation matrix of the word-word graph $G_u$ is defined as:
$$(W_u)_{ij} = \begin{cases} \cos(w_i, w_j), & \text{if } w_i \in N_p(w_j) \text{ or } w_j \in N_p(w_i) \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
where $N_p(w_j)$ denotes the $p$ nearest neighbors of word $w_j$, i.e. the $p$ nodes (words) closest to the current node (word). Here, the word $w_j$ is represented as the document vector $[X_{j1}, \ldots, X_{jn}]$.
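Preserving the geometric structure in the word space reduces to minimizing the following loss function:
$$\frac{1}{2} \sum_{i,j=1}^{m} \|U_i - U_j\|^2 (W_u)_{ij} = \mathrm{Tr}(U^T D_u U) - \mathrm{Tr}(U^T W_u U) = \mathrm{Tr}(U^T L_u U) \tag{12}$$
where $U_i$ is the $i$-th row of U, $L_u = D_u - W_u$ is the Laplacian matrix of graph $G_u$, $W_u$ is the correlation matrix of graph $G_u$, and $D_u \in \mathbb{R}^{m \times m}$ is a diagonal matrix whose entries are $(D_u)_{ii} = \sum_j (W_u)_{ij}$.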
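The equivalence between the pairwise smoothness penalty and the trace form with the graph Laplacian can be verified numerically. The sketch below assumes a symmetric correlation matrix; the function names and the small matrices are made up for illustration.

```python
import numpy as np


def graph_laplacian(W):
    """L = D - W, where D is the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W


def pairwise_penalty(F, W):
    """0.5 * sum_ij ||F_i - F_j||^2 * W_ij, computed directly from the rows of F."""
    diff = F[:, None, :] - F[None, :, :]
    return 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))


# Hypothetical symmetric correlation matrix for 3 documents
# and a document-emotion matrix with k = 2 columns.
W = np.array([[0.0, 0.9, 0.0],
              [0.9, 0.0, 0.2],
              [0.0, 0.2, 0.0]])
V = np.array([[0.8, 0.2],
              [0.7, 0.3],
              [0.1, 0.9]])

direct = pairwise_penalty(V, W)
trace_form = np.trace(V.T @ graph_laplacian(W) @ V)
```

Both values agree, which is why the regularization term can be written compactly as a trace and differentiated in closed form.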
Further, in one embodiment, the step of constructing graph-based regularization terms in an objective function based on geometric information of the document-document graph and the word-word graph includes: constructing a document-emotion matrix and a word emotion matrix corresponding to the training data set; acquiring a Laplace matrix of a document-document graph and a Laplace matrix of a word-word graph; and constructing regularization items based on the document-document graph in the target function according to the document-emotion matrix corresponding to the training data set and the Laplace matrix of the document-document graph, and constructing regularization items based on the word-word graph in the target function according to the word-emotion matrix corresponding to the training data set and the Laplace matrix of the word-word graph.
Further, the regularization term based on the document-document map is the product of the control parameter of the preset document space and the first trace, and the regularization term based on the word-word map is the product of the control parameter of the preset word space and the second trace. And the constructed graph-based regularization term is the sum of a product of a control parameter of a preset document space and the first trace and a product of a control parameter of a preset word space and the second trace. In this embodiment, the first trace is a trace of a matrix obtained by multiplying a document-emotion matrix, a transposed matrix of a document emotion matrix, and a laplace matrix of a document-document map, and the second trace is a trace of a matrix obtained by multiplying a word-emotion matrix, a transposed matrix of a word-emotion matrix, and a laplace matrix of a word-word map.
In this embodiment, the objective function is constructed by combining the document-document graph $G_v$ and the word-word graph $G_u$ described above with NMTF and CNMTF. Based on the geometric information in the document space and the word space, the constructed objective function contains graph-based regularization terms in addition to the terms of the CNMTF objective function. As with CNMTF, the objective function differs between the unsupervised mode and the semi-supervised mode. For the semi-supervised mode, the objective function constructed in this embodiment is as follows:
$$\min_{U,H,V \ge 0} \; \|X - UHV^T\|_F^2 + \alpha \, \mathrm{Tr}\left[(U - U_0)^T C_u (U - U_0)\right] + \beta \, \mathrm{Tr}\left[(V - V_0)^T C_v (V - V_0)\right] + \gamma \, \mathrm{Tr}(U^T L_u U) + \delta \, \mathrm{Tr}(V^T L_v V) + \sigma_1 \|U^TU - I\|_F^2 + \sigma_2 \|V^TV - I\|_F^2 \tag{13}$$
For the unsupervised mode, the term $\beta \, \mathrm{Tr}[(V - V_0)^T C_v (V - V_0)]$ in equation (13) can be omitted. Here $\gamma \, \mathrm{Tr}(U^T L_u U)$ is the regularization term based on the word-word graph (containing the second trace), and $\delta \, \mathrm{Tr}(V^T L_v V)$ is the regularization term based on the document-document graph (containing the first trace). $\delta > 0$ is the preset control parameter of the document space, i.e. the parameter controlling the contribution of the geometric information in the document space, and $\gamma > 0$ is the preset control parameter of the word space, i.e. the parameter controlling the contribution of the geometric information in the word space. $U_0$ is the emotion vocabulary in an emotion dictionary, and $V_0$ represents the manually labeled documents in the training data set. X is the word-document matrix, U is the word-emotion matrix, V is the document-emotion matrix, and H is the association matrix between U and V.
Thus, training the training data set may ultimately be to solve the optimization problem of equation (13), i.e., to arrive at U, H and V that minimize the objective function in equation (13).
Further, in one embodiment, the step of optimizing the objective function and outputting the document-emotion matrix includes: performing iterative operations for a preset number of times, continuously updating the document-emotion matrix, the word-emotion matrix and the association matrix between the word-emotion matrix and the document-emotion matrix, so that the objective function decreases monotonically until convergence, and outputting the document-emotion matrix that minimizes the objective function.
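Specifically, U is initialized to $U_0$ and V is initialized to $V_0$; then U, H and V are updated repeatedly by the following equations (14) to (16), the number of updates being a preset number, and the objective function is calculated using equation (13):
$$U \leftarrow U \circ \frac{XVH^T + \alpha C_u U_0 + \gamma W_u U + 2\sigma_1 U}{UHV^TVH^T + \alpha C_u U + \gamma D_u U + 2\sigma_1 UU^TU} \tag{14}$$
$$H \leftarrow H \circ \frac{U^TXV}{U^TUHV^TV} \tag{15}$$
$$V \leftarrow V \circ \frac{X^TUH + \beta C_v V_0 + \delta W_v V + 2\sigma_2 V}{VH^TU^TUH + \beta C_v V + \delta D_v V + 2\sigma_2 VV^TV} \tag{16}$$
where the operator $\circ$ is the element-by-element product, the fraction denotes element-by-element division, and $\delta$ and $\gamma$ are the control parameters of the document space and the word space, respectively.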
And obtaining final U, H and V, wherein the output V is the optimal document-emotion matrix.
Solving for the optimal U, H and V yields the GNMTF learning algorithm. The inputs are X, $U_0$, $V_0$, $C_u$, $C_v$, $\alpha$, $\beta$, $\delta$, $\gamma$, $\sigma_1$ and $\sigma_2$. In conjunction with the above description, the specific algorithm is as follows:
Algorithm 1:
1. Initialize U ← U0, V ← V0, H ≥ 0;
2. Construct Wv and Wu;
3. For t ← 1 to Iter, perform steps 4-5;
4. Update U, H and V using equations (14)-(16);
5. Calculate the objective function by equation (13);
6. End.
The final output is U, H and V, where Iter represents the number of iterations.
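Algorithm 1 can be sketched in NumPy as below. Several points are assumptions rather than statements of the patent: the multiplicative update rules are the ones obtained from the objective by the Lagrangian and KKT argument given in the optimization analysis, with δ denoting the document-space control parameter; a small offset of 0.1 is added at initialization because a multiplicative update can never move an entry away from exactly zero; H starts near the identity; and the two correlation matrices are passed in ready-made. All names and the toy data are hypothetical.

```python
import numpy as np

EPS = 1e-12  # keeps the element-wise divisions well defined


def objective(X, U, H, V, U0, V0, Cu, Cv, Wu, Wv, prm):
    """Objective value: tri-factorization error, lexicon and label priors,
    graph regularizers and orthogonality penalties."""
    identity = np.eye(U.shape[1])
    Lu = np.diag(Wu.sum(axis=1)) - Wu
    Lv = np.diag(Wv.sum(axis=1)) - Wv
    return (np.linalg.norm(X - U @ H @ V.T) ** 2
            + prm["alpha"] * np.trace((U - U0).T @ Cu @ (U - U0))
            + prm["beta"] * np.trace((V - V0).T @ Cv @ (V - V0))
            + prm["gamma"] * np.trace(U.T @ Lu @ U)
            + prm["delta"] * np.trace(V.T @ Lv @ V)
            + prm["sigma1"] * np.linalg.norm(U.T @ U - identity) ** 2
            + prm["sigma2"] * np.linalg.norm(V.T @ V - identity) ** 2)


def gnmtf(X, U0, V0, Cu, Cv, Wu, Wv, prm, n_iter=100):
    a, b, g, d = prm["alpha"], prm["beta"], prm["gamma"], prm["delta"]
    s1, s2 = prm["sigma1"], prm["sigma2"]
    # Step 1: initialise from the priors (the 0.1 offset is an assumption).
    U, V = U0 + 0.1, V0 + 0.1
    H = np.eye(U0.shape[1]) + 0.1
    # Step 2: the graphs are given; only their degree matrices are needed here.
    Du, Dv = np.diag(Wu.sum(axis=1)), np.diag(Wv.sum(axis=1))
    history = [objective(X, U, H, V, U0, V0, Cu, Cv, Wu, Wv, prm)]
    for _ in range(n_iter):  # Steps 3-5
        U = U * (X @ V @ H.T + a * Cu @ U0 + g * Wu @ U + 2 * s1 * U) / (
            U @ H @ V.T @ V @ H.T + a * Cu @ U + g * Du @ U
            + 2 * s1 * U @ U.T @ U + EPS)
        H = H * (U.T @ X @ V) / (U.T @ U @ H @ V.T @ V + EPS)
        V = V * (X.T @ U @ H + b * Cv @ V0 + d * Wv @ V + 2 * s2 * V) / (
            V @ H.T @ U.T @ U @ H + b * Cv @ V + d * Dv @ V
            + 2 * s2 * V @ V.T @ V + EPS)
        history.append(objective(X, U, H, V, U0, V0, Cu, Cv, Wu, Wv, prm))
    return U, H, V, history


# Hypothetical toy corpus: 6 words x 4 documents, two clear topics.
X = np.array([[3, 3, 0, 0], [3, 2, 0, 0], [2, 3, 0, 0],
              [0, 0, 3, 3], [0, 0, 2, 3], [0, 1, 3, 2]], dtype=float)
U0 = np.zeros((6, 2)); U0[0, 0] = 1.0; U0[3, 1] = 1.0   # two lexicon words
Cu = np.diag([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
V0 = np.zeros((4, 2)); V0[0, 0] = 1.0; V0[3, 1] = 1.0   # two labelled documents
Cv = np.diag([1.0, 0.0, 0.0, 1.0])
Wu = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    Wu[i, j] = Wu[j, i] = 1.0
Wv = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
prm = dict(alpha=0.5, beta=0.5, gamma=0.1, delta=0.1, sigma1=0.01, sigma2=0.01)

U, H, V, history = gnmtf(X, U0, V0, Cu, Cv, Wu, Wv, prm)
labels = np.argmax(V, axis=1).tolist()  # 0 = positive, 1 = negative
```

The `history` list records the objective value before the first update and after every iteration, so the decrease described in the text can be inspected for a given data set and parameter choice.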
A document-emotion matrix is obtained, which describes the emotional tendency corresponding to a document. Therefore, the further step of obtaining the emotional tendency corresponding to the document in the test data set according to the document-emotion matrix is: for a document in the test data set, acquiring a row where the document is located in the output document-emotion matrix, and acquiring the corresponding emotion tendency of taking the maximum value in the row as the emotion tendency corresponding to the document.
Specifically, in this embodiment, for the output document-emotion matrix V, $V_{ij}$ represents the degree to which the $i$-th document corresponds to emotion $j$, and the $j$ with the maximum value gives the emotional tendency of the $i$-th document. Thus, the emotion of a document $X_i$ in the test data set can be inferred by the following formula:
$$j^{*} = \arg\max_{j \in \{p,\, n\}} V_{ij} \tag{17}$$
where $j$ represents the emotion corresponding to the $i$-th document, $p$ represents that the emotion is positive, and $n$ represents that the emotion is negative.
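The row-wise maximum rule can be illustrated in a few lines. The column order (column 0 positive, column 1 negative) follows the convention used for the emotion matrices above; the matrix values and the function name are made up for the example.

```python
import numpy as np

LABELS = ("positive", "negative")  # column 0 = positive, column 1 = negative


def classify_documents(V):
    """Assign each document the emotion whose column holds the row maximum."""
    V = np.asarray(V, dtype=float)
    return [LABELS[j] for j in np.argmax(V, axis=1)]


# Hypothetical document-emotion matrix for three documents.
V = [[0.9, 0.1],
     [0.2, 0.7],
     [0.55, 0.45]]
predicted = classify_documents(V)
```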
The embodiment of the invention provides an emotion data classification method based on GNMTF, which combines geometric information in a document space and a word space and fully considers the principle that if two documents are close enough in the geometric space, the emotion tendencies of the two documents are close enough and if the two words are close enough in the geometric space, the emotion tendencies of the two words are close enough. The constructed target function containing two graph-based regularization items is used for determining the corresponding emotional tendency of one document in the test data set according to a training result obtained by learning the target function, so that a more accurate emotional data classification result can be obtained.
As shown in FIG. 2, in one embodiment, there is also provided an emotion data classification system, comprising:
the graph construction module 202 is configured to construct a document-document graph and a word-word graph corresponding to the training data set, where in the document-document graph, nodes represent documents in the training data set, geometric information of edges represents correlation between the documents, and in the word-word graph, nodes represent words in the training data set, and geometric information of edges represents correlation between the words.
And a regularization term construction module 204 for constructing graph-based regularization terms in the objective function according to the geometric information of the document-document graph and the word-word graph.
And the optimization processing module 206 is configured to perform optimization processing on the objective function and output a document-emotion matrix.
And an emotional tendency determining module 208, configured to acquire the document in the test data set, and acquire, according to the output document-emotional matrix, an emotional tendency corresponding to the document in the test data set.
Further, the relevancy matrix of the document-document map is defined as: if any one of the two documents is the nearest neighbor of the other document, the correlation degree of the two documents is the cosine between the two documents, otherwise, the correlation degree is 0. The relevancy matrix of a word-word graph is defined as: if any of the two words is the nearest neighbor of the other word, the correlation degree of the two words is the cosine between the two words, otherwise, the correlation degree is 0.
In one embodiment, regularizer construction module 204 is configured to construct a document-emotion matrix and a word-emotion matrix corresponding to the training data set; acquiring a Laplace matrix of a document-document graph and a Laplace matrix of a word-word graph; and constructing regularization items based on the document-document graph in the target function according to the document-emotion matrix corresponding to the training data set and the Laplace matrix of the document-document graph, and constructing regularization items based on the word-word graph in the target function according to the word-emotion matrix corresponding to the training data set and the Laplace matrix of the word-word graph.
Further, in one embodiment, the regularization term based on the document-document map constructed by the regularization term construction module 204 is a product of the control parameter of the preset document space and the first trace, and the regularization term based on the word-word map is a product of the control parameter of the preset word space and the second trace. And the constructed graph-based regularization term is the sum of a product of a control parameter of a preset document space and the first trace and a product of a control parameter of a preset word space and the second trace.
The first trace is a trace of a matrix obtained by multiplying the document-emotion matrix, the transposed matrix of the document-emotion matrix and the Laplace matrix of the document-document graph, and the second trace is a trace of a matrix obtained by multiplying the word-emotion matrix, the transposed matrix of the word-emotion matrix and the Laplace matrix of the word-word graph.
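As a hedged illustration of the trace terms described above, the sketch below assumes the common unnormalized graph Laplacian L = D − W (D being the diagonal degree matrix) and reads the trace as Tr(SᵀLS), where S is an emotion matrix with one row per document or word. The names `laplacian`, `trace_regularizer`, `pairwise_form` and the control parameter `gamma` are hypothetical. Under these assumptions the trace equals half the weighted sum of squared distances between the rows of connected nodes, so it is small when neighboring items have similar emotion rows.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W (assumed form)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def trace_regularizer(S, L):
    """Tr(S^T L S), accumulated column by column of S."""
    n, k = len(S), len(S[0])
    total = 0.0
    for c in range(k):
        for i in range(n):
            for j in range(n):
                total += S[i][c] * L[i][j] * S[j][c]
    return total

def pairwise_form(S, W):
    """0.5 * sum_ij W_ij * ||s_i - s_j||^2, the equivalent smoothness form."""
    n = len(S)
    return 0.5 * sum(
        W[i][j] * sum((a - b) ** 2 for a, b in zip(S[i], S[j]))
        for i in range(n) for j in range(n)
    )

# Toy relevancy matrix and emotion matrix (made-up values).
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5], [0.0, 0.5, 0.0]]
S = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
gamma = 1.0  # hypothetical control parameter of the document (or word) space
reg = gamma * trace_regularizer(S, laplacian(W))
print(round(reg, 6), round(pairwise_form(S, W), 6))
```

In this toy case the third item is connected to the second but has a very different emotion row, so that single edge contributes most of the penalty.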
In one embodiment, the optimization processing module 206 is configured to perform iterative operations for a preset number of iterations, continuously updating the document-emotion matrix, the word-emotion matrix and the association matrix between the document-emotion matrix and the word-emotion matrix, so that the objective function decreases monotonically with the updated matrices until convergence, and to output the document-emotion matrix that minimizes the objective function.
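The iterative scheme can be illustrated with a simplified sketch. The code below is not the update rules of equations (14) to (16): it uses the standard multiplicative updates for plain non-negative matrix tri-factorization X ≈ U H Vᵀ, without the graph-regularization and prior-knowledge terms, simply to show how three factors are updated in turn while the objective is tracked. The function names, the small `eps` added to each denominator, the toy matrix `X` and all parameter values are assumptions made for illustration.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def objective(X, U, H, V):
    """Squared Frobenius norm of X - U H V^T."""
    R = matmul(matmul(U, H), transpose(V))
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))

def mult_update(M, numer, denom, eps=1e-12):
    """Element-wise M * numer / denom; keeps every entry non-negative."""
    return [[m * nu / (de + eps) for m, nu, de in zip(mr, nr, dr)]
            for mr, nr, dr in zip(M, numer, denom)]

def nmtf(X, k, iters=100, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(k)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    Xt = transpose(X)
    history = [objective(X, U, H, V)]
    for _ in range(iters):
        B = matmul(H, transpose(V))                      # k x n
        U = mult_update(U, matmul(X, transpose(B)),
                        matmul(U, matmul(B, transpose(B))))
        A = matmul(U, H)                                 # m x k
        V = mult_update(V, matmul(Xt, A),
                        matmul(V, matmul(transpose(A), A)))
        Ut, Vt = transpose(U), transpose(V)
        H = mult_update(H, matmul(matmul(Ut, X), V),
                        matmul(matmul(matmul(Ut, U), H), matmul(Vt, V)))
        history.append(objective(X, U, H, V))
    return U, H, V, history

# Toy 4 x 4 non-negative matrix with two blocks (made-up data).
X = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]]
U, H, V, history = nmtf(X, k=2, iters=50)
print(round(history[0], 3), round(history[-1], 3))
```

Recording the objective after every sweep makes it easy to check the monotone decrease that the embodiment relies on, and to stop early once the change falls below a tolerance.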
In one embodiment, the emotional tendency determination module 208 is configured to, for a document in the test data set, obtain the row of that document in the output document-emotion matrix, and take the emotional tendency with the maximum value in that row as the emotional tendency of the document.
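The row-wise selection described above amounts to an argmax over each row. A minimal sketch follows, assuming two emotion classes; the label order in `LABELS`, the function name and the matrix values are hypothetical.

```python
# Hypothetical label order: column 0 = positive, column 1 = negative.
LABELS = ("positive", "negative")

def emotional_tendency(row, labels=LABELS):
    """Return the label of the column holding the largest value in the row."""
    best = max(range(len(row)), key=lambda c: row[c])
    return labels[best]

# Made-up rows of a document-emotion matrix for three test documents.
doc_emotion = [[0.82, 0.18], [0.30, 0.70], [0.55, 0.45]]
predictions = [emotional_tendency(row) for row in doc_emotion]
print(predictions)
```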
The GNMTF-based emotion data classification method provided by the embodiment of the invention is analyzed theoretically below to demonstrate the soundness of the GNMTF algorithm. The theoretical analysis covers optimization, convergence and computational complexity, as follows:
1) Optimization analysis
Without loss of generality, the Lagrangian function of the constrained optimization problem, showing only the terms in U, is as follows:
where Ψ is the Lagrange multiplier for the non-negativity constraint U ≥ 0.
The partial derivative of the Lagrangian function with respect to U is:
Using the KKT condition Ψ ∘ U = 0, where ∘ denotes the element-wise product, we obtain:
which results in the update rule in equation (14). By the same derivation, the update rules for the other variables H and V in the GNMTF optimization can be obtained, as shown in equations (15) and (16).
2) Convergence analysis
The convergence of the multiplicative updates given by equations (14) to (16) is demonstrated below, using an auxiliary function defined as follows:
Definition 1.1: G(u, u′) is an auxiliary function of F(u) if G(u, u′) ≥ F(u), where equality holds if and only if u = u′.
Lemma 1.1: If G is an auxiliary function of F, then F is non-increasing under the update u(t+1) = argmin_u G(u, u(t)).
Proof: By Definition 1.1, F(u(t+1)) ≤ G(u(t+1), u(t)) ≤ G(u(t), u(t)) = F(u(t)).
Next, we show that equation (14) is exactly the update rule of Lemma 1.1 with a suitable auxiliary function. The quantities required with respect to U_ij are calculated as:
Lemma 1.2: The function
is a suitable auxiliary function for the objective function with respect to U_ij.
Proof of Lemma 1.2: Equality clearly holds when the two arguments of the auxiliary function coincide, so we only need to verify that the auxiliary function is an upper bound of the objective function. To this end, the objective function is expanded as a Taylor series:
By algebraic manipulation, the following four inequalities hold:
Based on the above four inequalities, the upper bound holds, and therefore Lemma 1.2 holds.
Proof of Theorem 1.1: According to Lemma 1.1 and Lemma 1.2, the update rule for U can be obtained by minimizing the auxiliary function. Setting the derivative of the auxiliary function with respect to U_ij to zero yields the update rule in equation (14).
By Lemma 1.1 and Lemma 1.2, the objective function is therefore monotonically non-increasing under the update of U. Since the objective function has a lower bound of 0, the updates converge, which verifies Theorem 1.1 and the correctness of Algorithm 1.
3) Computational complexity analysis
Table 1 below shows the number of computational operations, where m > k and n > k.
TABLE 1
Based on the update rules in Algorithm 1, the number of arithmetic operations in each iteration of GNMTF can be counted easily. It is important to note that C_u is a diagonal matrix, so each row of C_u contains at most one non-zero element. Therefore, computing C_u U requires zero additions and only mk multiplications. Likewise, each of C_u U_0, C_v V, C_v V_0, D_u U and D_v V requires zero additions and only mk multiplications. Furthermore, W_u is a sparse matrix: if a p-nearest neighbor graph is used, each row of W_u has p non-zero elements on average. Thus, only mpk additions and mpk multiplications are required to compute W_u U, and W_v V is computed in the same manner. Assuming that the multiplicative updates stop after Iter iterations, the time cost of the multiplicative updates is O(Iter × mnk). The overall running time of the GNMTF algorithm is therefore similar to that of standard NMTF and CNMTF.
The following describes the effect achieved by the emotion data classification method based on GNMTF according to the embodiment of the present invention with reference to specific experiments.
In this experiment, the GNMTF-based emotion data classification method was applied to two data sets: the Movie data set and the Amazon data set. The Movie data set, which is widely used for emotion analysis in the literature, includes 1000 positive and 1000 negative reviews drawn from the IMDB (Internet Movie Database) archive of the rec.arts.movies.reviews newsgroup. The Amazon data set is heterogeneous, severely unbalanced and large in scale; a smaller, previously released version of it was used, which covers 4 product types (kitchen, books, DVD and e-books) and contains 4000 positive and 4000 negative reviews.
For both data sets, the 8000 words with the highest document frequency were selected to generate the vocabulary, stop words were removed, and a normalized term-frequency representation was used. To construct the lexical prior knowledge matrix U_0, the sentiment lexicon of Hu & Liu (2004) was used, which contains 2006 positive words (e.g., "beauty") and 4783 negative words (e.g., "vexation").
The emotion data classification method has an unsupervised mode and a semi-supervised mode. The first experiment explores the benefits of GNMTF in the unsupervised mode (namely C_v = 0), so that the term β·Tr[(V − V_0)^T C_v (V − V_0)] in equation (13) vanishes. For this unsupervised mode, the parameters are set empirically to α = β = γ = 1, σ1 = σ2 = 1 and Iter = 100. The GNMTF algorithm is then run 10 times to remove the randomness introduced by random initialization. In the experiments, the proposed GNMTF was compared with the other methods listed below, giving the following four classes of methods:
(1) vocabulary-based methods (abbreviated as LBM);
(2) document clustering methods: the most representative clustering methods were selected, namely k-means, NMTF, information-theoretic co-clustering (ITCC) and Euclidean co-clustering (ECC), with the number of clusters set to 2; none of these methods can use a sentiment lexicon;
(3) constrained NMTF (i.e., CNMTF);
(4) GNMTF: the graph-regularized non-negative matrix tri-factorization algorithm provided by the embodiment of the invention. In this algorithm, a p-nearest neighbor graph is constructed using cosine similarity, and the number of nearest neighbors p is set empirically to 10 in both the document space and the word space.
Table 2 shows the emotion classification accuracy obtained with the different methods on the two data sets. The LBM method is taken as the baseline, and the increase or decrease of each other method relative to the baseline is shown in parentheses.
TABLE 2
#   Method    Movie data set    Amazon data set
1   LBM       0.632             0.580
2   K-means   0.543 (-8.9%)     0.535 (-4.5%)
3   NMTF      0.561 (-7.1%)     0.547 (-3.3%)
4   ECC       0.678 (+4.6%)     0.642 (+6.2%)
5   ITCC      0.714 (+8.2%)     0.655 (+7.5%)
6   CNMTF     0.695 (+6.3%)     0.658 (+7.8%)
7   GNMTF     0.736 (+10.4%)    0.705 (+12.5%)
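The parenthesized figures in Table 2 are consistent with absolute differences in accuracy from the LBM baseline, expressed in percentage points, rather than relative changes. The short check below recomputes them from the accuracies in the table; the variable and function names are illustrative only.

```python
# Accuracies from Table 2 as (Movie, Amazon); LBM is the baseline.
BASELINE = (0.632, 0.580)
TABLE = {
    "K-means": ((0.543, 0.535), (-8.9, -4.5)),
    "NMTF":    ((0.561, 0.547), (-7.1, -3.3)),
    "ECC":     ((0.678, 0.642), (4.6, 6.2)),
    "ITCC":    ((0.714, 0.655), (8.2, 7.5)),
    "CNMTF":   ((0.695, 0.658), (6.3, 7.8)),
    "GNMTF":   ((0.736, 0.705), (10.4, 12.5)),
}

def delta_points(acc, base):
    """Difference from the baseline in percentage points, to one decimal."""
    return round((acc - base) * 100, 1)

recomputed = {
    name: tuple(delta_points(a, b) for a, b in zip(accs, BASELINE))
    for name, (accs, _) in TABLE.items()
}
for name, (accs, reported) in TABLE.items():
    print(name, recomputed[name], "reported:", reported)
```

For example, GNMTF on the Movie data set gives 0.736 − 0.632 = 0.104, i.e. +10.4 percentage points, whereas the relative improvement over the baseline would be about 16.5%.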
The following can be concluded from Table 2:
1) The lexicon-based method LBM outperforms the document clustering methods k-means and NMTF, indicating the value of a sentiment lexicon for emotion classification (row 1 compared with rows 2 and 3 in Table 2).
2) NMTF, CNMTF and GNMTF all outperform k-means, confirming that algorithms based on matrix tri-factorization are more effective for emotion classification (row 2 compared with rows 3, 6 and 7 in Table 2).
3) Both CNMTF and GNMTF take into account lexical prior knowledge from an off-the-shelf sentiment lexicon and achieve better performance than NMTF, indicating the importance of lexical prior knowledge in learning emotion classification (row 3 compared with rows 6 and 7 in Table 2).
4) On both data sets, GNMTF achieves the best performance and is notably better than the existing CNMTF, showing the benefit of geometric information and graph-based regularization (row 6 compared with row 7 in Table 2).
In experiments, the influence of important parameters on the GNMTF algorithm was also investigated.
The parameters α, β and γ were empirically given equal weights in the experiment (α = β = γ = 1). To study the effect of these three parameters, one parameter is considered at a time: it is set to different values to observe its effect in the unsupervised mode, while the two remaining parameters are set to 0 (e.g., α ∈ [0, 10], β = γ = 0). The results of the experiment on the two data sets are shown in fig. 3, where the x-axis indicates the different values of α, β and γ. As can be seen from fig. 3, GNMTF achieves relatively good performance when the parameter is between 0.1 and 1. The lexical prior knowledge also yields relatively good results across both data sets and the different parameter settings.
As described above, GNMTF uses p-nearest neighbor graphs to capture geometric information in the word space and the document space. The success of GNMTF relies on the assumption that two neighboring words or documents have the same emotional tendency. Different values of p are set for the document space (or word space), while p is fixed to 10 for the word space (or document space). The experimental results on the two data sets are shown in fig. 4, where GNMTF-document means that different values of p are set in the document space with p fixed to 10 in the word space, and GNMTF-word means that different values of p are set in the word space with p fixed to 10 in the document space. As can be seen from fig. 4, the performance of GNMTF-document and GNMTF-word decreases as p becomes larger.
The convergence properties of GNMTF were also investigated in the experiments. Fig. 5 shows the convergence curves of GNMTF on the two data sets, where the y-axis is the value of the objective function and the x-axis is the number of iterations. As can be seen from fig. 5, the multiplicative updates of GNMTF converge quickly, typically within 50 iterations.
The second experiment explores the benefit of GNMTF in the semi-supervised mode, in which some manually labeled documents are used. For the semi-supervised mode of GNMTF, Iter is set empirically to 100 in the experiments, and σ1 and σ2 are set equal, for both the document space and the word space. For a fair comparison, α = β = 1 is used for CNMTF. GNMTF is also compared with some representative semi-supervised methods, such as the consistency method and the semi-supervised learning method using Gaussian fields and harmonic functions (abbreviated GFHF).
The experimental results are shown in fig. 6. As can be seen from fig. 6, GNMTF outperforms CNMTF on both data sets over the whole range of numbers of labeled documents. It can therefore be concluded that using geometric information still improves the emotion classification accuracy in the semi-supervised mode.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above-mentioned embodiments express only several embodiments of the present invention, and although their description is relatively specific and detailed, they are not to be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of sentiment data classification, the method comprising:
constructing a document-document graph and a word-word graph corresponding to a training data set, wherein in the document-document graph, nodes represent documents in the training data set, geometric information of edges represents the correlation degree between the documents, and in the word-word graph, the nodes represent words in the training data set, and the geometric information of the edges represents the correlation degree between the words;
constructing a document-emotion matrix and a word-emotion matrix corresponding to the training data set;
acquiring a Laplace matrix of the document-document graph and a Laplace matrix of the word-word graph;
constructing regularization items based on a document-document graph in an objective function according to a document-emotion matrix corresponding to the training data set and a Laplace matrix of the document-document graph, and constructing regularization items based on a word-word graph in the objective function according to a word-emotion matrix corresponding to the training data set and the Laplace matrix of the word-word graph;
optimizing the objective function and outputting a document-emotion matrix;
and acquiring documents in a test data set, and acquiring the emotional tendency corresponding to the documents in the test data set according to the document-emotion matrix.
2. The method of claim 1, wherein the relevancy matrix of the document-document graph is defined as: if any one of the two documents is the nearest neighbor of the other document, the correlation degree of the two documents is the cosine between the two documents, otherwise, the correlation degree is 0;
the relevancy matrix of the word-word graph is defined as: if any of the two words is the nearest neighbor of the other word, the correlation degree of the two words is the cosine between the two words, otherwise, the correlation degree is 0.
3. The method according to claim 1, wherein the regularization term based on the document-document graph is a product of a preset control parameter of a document space and a first trace, and the regularization term based on the word-word graph is a product of a preset control parameter of a word space and a second trace;
the graph-based regularization term is the sum of a product of a preset control parameter of a document space and a first trace and a product of a preset control parameter of a word space and a second trace;
the first trace is a trace of a matrix obtained by multiplying the document-emotion matrix, the transposed matrix of the document-emotion matrix and the Laplace matrix of the document-document graph, and the second trace is a trace of a matrix obtained by multiplying the word-emotion matrix, the transposed matrix of the word-emotion matrix and the Laplace matrix of the word-word graph.
4. The method of claim 1, wherein the step of optimizing the objective function and outputting the document-emotion matrix comprises:
carrying out iterative operations for a preset number of iterations, continuously updating the document-emotion matrix, the word-emotion matrix and the correlation matrix between the document-emotion matrix and the word-emotion matrix, monotonically decreasing the objective function according to the updated document-emotion matrix, word-emotion matrix and correlation matrix until convergence, and outputting the document-emotion matrix which minimizes the objective function.
5. The method of claim 1, wherein the step of acquiring the emotional tendency corresponding to the documents in the test data set according to the document-emotion matrix is:
for a document in the test data set, acquiring the row of the document in the document-emotion matrix, and taking the emotional tendency with the maximum value in the row as the emotional tendency corresponding to the document.
6. An emotion data classification system, characterized in that the system comprises:
the graph construction module is used for constructing a document-document graph and a word-word graph corresponding to a training data set, wherein in the document-document graph, nodes represent documents in the training data set, and geometric information of edges represents the correlation degree between the documents;
the regularization item construction module is used for constructing a document-emotion matrix and a word-emotion matrix corresponding to the training data set; acquiring a Laplace matrix of the document-document graph and a Laplace matrix of the word-word graph; constructing regularization items based on a document-document graph in an objective function according to a document-emotion matrix corresponding to the training data set and a Laplace matrix of the document-document graph, and constructing regularization items based on a word-word graph in the objective function according to a word-emotion matrix corresponding to the training data set and the Laplace matrix of the word-word graph;
the optimization processing module is used for optimizing the objective function and outputting a document-emotion matrix;
and the emotional tendency determining module is used for acquiring the documents in the test data set and acquiring the emotional tendency corresponding to the documents in the test data set according to the document-emotion matrix.
7. The system of claim 6, wherein the relevancy matrix for a document-document graph is defined as: if any one of the two documents is the nearest neighbor of the other document, the correlation degree of the two documents is the cosine between the two documents, otherwise, the correlation degree is 0;
the relevancy matrix of the word-word graph is defined as: if any of the two words is the nearest neighbor of the other word, the correlation degree of the two words is the cosine between the two words, otherwise, the correlation degree is 0.
8. The system of claim 6, wherein the regularization term based on the document-document graph is a product of a preset control parameter of a document space and a first trace, and the regularization term based on the word-word graph is a product of a preset control parameter of a word space and a second trace;
the graph-based regularization term is the sum of a product of a preset control parameter of a document space and a first trace and a product of a preset control parameter of a word space and a second trace;
the first trace is a trace of a matrix obtained by multiplying the document-emotion matrix, the transposed matrix of the document-emotion matrix and the Laplace matrix of the document-document graph, and the second trace is a trace of a matrix obtained by multiplying the word-emotion matrix, the transposed matrix of the word-emotion matrix and the Laplace matrix of the word-word graph.
9. The system of claim 6, wherein the optimization processing module is configured to perform iterative operations for a preset number of iterations, continuously update the document-emotion matrix, the word-emotion matrix and the correlation matrix between the document-emotion matrix and the word-emotion matrix, monotonically decrease the objective function according to the updated document-emotion matrix, word-emotion matrix and correlation matrix until convergence, and output the document-emotion matrix that minimizes the objective function.
10. The system of claim 6, wherein the emotional tendency determination module is configured to, for a document in the test data set, obtain the row of the document in the document-emotion matrix, and take the emotional tendency with the maximum value in the row as the emotional tendency corresponding to the document.
CN201410361587.9A 2014-07-25 2014-07-25 Affection data sorting technique and system Active CN104199829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410361587.9A CN104199829B (en) 2014-07-25 2014-07-25 Affection data sorting technique and system

Publications (2)

Publication Number Publication Date
CN104199829A CN104199829A (en) 2014-12-10
CN104199829B true CN104199829B (en) 2017-07-04

Family

ID=52085122

Country Status (1)

Country Link
CN (1) CN104199829B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant