WO2015192798A1 - Topic mining method and apparatus - Google Patents

Topic mining method and apparatus (主题挖掘方法和装置)

Info

Publication number
WO2015192798A1
WO2015192798A1 · PCT/CN2015/081897 · CN2015081897W
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
document
word
target
message vector
Prior art date
Application number
PCT/CN2015/081897
Other languages
English (en)
French (fr)
Inventor
曾嘉
袁明轩
张世明
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2015192798A1 publication Critical patent/WO2015192798A1/zh
Priority to US15/383,606 priority Critical patent/US20170097962A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The embodiments of the present invention relate to information technologies, and in particular, to a topic mining method and apparatus.
  • Topic mining uses the Latent Dirichlet Allocation (LDA) machine learning model to cluster semantically related words in a large-scale document set, so as to obtain, in the form of a probability distribution, the topic of each document in the set, that is, the central idea the author expresses through the document.
  • In the prior art, topic mining first trains the LDA model on training documents with the Belief Propagation (BP) algorithm, to determine the model parameters of the trained LDA model, namely the word-topic matrix Φ and the document-topic matrix θ; the word-document matrix of a document to be tested is then input into the trained LDA model for topic mining, yielding a document-topic matrix θ' that indicates the topic distribution of the document to be tested.
  • The BP algorithm involves a large amount of iterative computation: in each iteration, a message vector is computed for every non-zero element of the word-document matrix according to the current document-topic matrix and current word-topic matrix of the LDA model, and the current document-topic matrix and current word-topic matrix are then updated according to all of these message vectors, until the message vectors, the current document-topic matrix and the current word-topic matrix reach a convergence state. Because every iteration must compute a message vector for every non-zero element and update both matrices from all message vectors, the computational load is heavy, topic mining is inefficient, and the existing topic mining method applies only when the word-document matrix is a discrete bag-of-words matrix.
  • The embodiments of the invention provide a topic mining method and apparatus, so as to reduce the amount of computation of topic mining and improve the efficiency of topic mining.
  • An aspect of an embodiment of the present invention provides a method for mining a topic, including:
  • calculating the non-zero elements in a word-document matrix of a training document according to a current document-topic matrix and a current word-topic matrix of a Latent Dirichlet Allocation (LDA) model, to obtain message vectors M_n of the non-zero elements; determining target message vectors ObjectM_n from the message vectors M_n of the non-zero elements according to the residuals of the message vectors, where the target message vectors are the message vectors ranked within a preset top ratio when sorted by residual in descending order, and the preset ratio is greater than 0 and less than 1; updating the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n; determining, among the non-zero elements of the word-document matrix, the target elements ObjectE_n corresponding to the target message vectors ObjectM_n; executing, for the (n+1)-th time, the iterative process of calculating the target elements ObjectE_n determined the n-th time according to the current document-topic matrix and current word-topic matrix of the LDA model to obtain their message vectors M_{n+1}, determining target message vectors ObjectM_{n+1} from M_{n+1} according to the residuals, updating the current document-topic matrix and current word-topic matrix according to ObjectM_{n+1}, and determining from the word-document matrix the target elements ObjectE_{n+1} corresponding to ObjectM_{n+1}, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach a convergence state; and determining the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and performing topic mining on a document to be tested by using the LDA model with the determined parameters.
  • In a first possible implementation, the determining of the target message vectors ObjectM_n from the message vectors M_n of the non-zero elements according to the residuals includes: calculating the residuals of the message vectors of the non-zero elements; querying, among the calculated residuals, the target residuals ranked within the preset top ratio in descending order, where the preset ratio is determined according to the efficiency of topic mining and the accuracy of the topic mining result; and determining, from the message vectors M_n of the non-zero elements, the target message vectors ObjectM_n corresponding to the target residuals.
  • In a second possible implementation, the residual of the message vector of a non-zero element is calculated according to the formula $r^{n}_{w,d}(k)=x_{w,d}\,\bigl|\mu^{n}_{w,d}(k)-\mu^{n-1}_{w,d}(k)\bigr|$, where $r^{n}_{w,d}(k)$ is the residual, $k=1,2,\ldots,K$, K is the preset number of topics, $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector computed for the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n-1}_{w,d}(k)$ is the k-th element value of the corresponding message vector in the (n-1)-th execution.
  • In a third possible implementation, the querying of the target residuals ranked within the preset top ratio in descending order includes: computing the cumulative residual matrix $R^{n}_{w,k}=\sum_{d=1}^{D} r^{n}_{w,d}(k)$ from the residuals, where $r^{n}_{w,d}(k)$ is the k-th element value of the residual of the message vector of the element in row w, column d in the n-th execution and $R^{n}_{w,k}$ is the element value in row w, column k of the cumulative residual matrix in the n-th execution; determining, in each row of the cumulative residual matrix, the columns $k^{*}$ of the elements ranked within the top ratio $\lambda_k$ in descending order, where $0<\lambda_k\le 1$; accumulating the elements so determined in each row to obtain a sum value for each row; determining the rows $w^{*}$ whose sum values rank within the top ratio $\lambda_w$ in descending order, where $0<\lambda_w\le 1$ and $\lambda_k\times\lambda_w\neq 1$; and determining the residuals $r^{n}_{w,d}(k)$ satisfying $w\in w^{*}$ and $k\in k^{*}$ as the target residuals.
  • In a fourth possible implementation, the target message vectors ObjectM_n corresponding to the target residuals are determined from the message vectors M_n of the non-zero elements.
  • In a fifth possible implementation, updating the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n includes: computing $\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$ and using $\theta^{n}_{k,d}$ to update the element value in row k, column d of the current document-topic matrix; and computing $\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$ and using $\phi^{n}_{k,w}$ to update the element value in row k, column w of the current word-topic matrix, where $k=1,2,\ldots,K$, K is the preset number of topics, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector computed for $x_{w,d}$ in the n-th iteration.
  • In a sixth possible implementation, calculating the non-zero elements of the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model to obtain the message vectors M_n includes: in the n-th execution of the iterative process, computing the k-th element value of the message vector of the element $x_{w,d}$ in row w, column d of the word-document matrix as $\mu^{n}_{w,d}(k)\propto\dfrac{\bigl(\theta^{n-1}_{k,d}+\alpha\bigr)\bigl(\phi^{n-1}_{k,w}+\beta\bigr)}{\sum_{w=1}^{W}\phi^{n-1}_{k,w}+W\beta}$, where $k=1,2,\ldots,K$ and K is the preset number of topics; $w=1,2,\ldots,W$ and W is the length of the word list; $d=1,2,\ldots,D$ and D is the number of training documents; $\theta^{n-1}_{k,d}$ is the element value in row k, column d of the current document-topic matrix; $\phi^{n-1}_{k,w}$ is the element value in row k, column w of the current word-topic matrix; and α and β are preset coefficients whose values are positive numbers.
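  • As a concrete illustration, the following is a minimal numpy sketch of this per-element update, assuming dense θ (K×D) and φ (K×W) arrays and normalizing the K values into a distribution; the function and argument names are ours, not the patent's.

```python
import numpy as np

def message_update(theta, phi, w, d, alpha, beta):
    # mu_k ~ (theta[k,d] + alpha) * (phi[k,w] + beta) / (sum_w phi[k,w] + W*beta)
    K, W = phi.shape
    mu = (theta[:, d] + alpha) * (phi[:, w] + beta) / (phi.sum(axis=1) + W * beta)
    return mu / mu.sum()  # normalize over the K topics
```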
  • In a seventh possible implementation, before the non-zero elements of the word-document matrix are calculated to obtain the message vectors M_n, the method further includes: determining an initial message vector $\mu^{0}_{w,d}(k)$ for each non-zero element $x_{w,d}$ of the word-document matrix, where $k=1,2,\ldots,K$ and K is the preset number of topics, and obtaining the initial current document-topic matrix and the initial current word-topic matrix from the initial message vectors.
  • A second aspect of the embodiments of the present invention provides a topic mining apparatus, including:
  • a message vector calculation module, configured to calculate the non-zero elements in a word-document matrix of a training document according to a current document-topic matrix and a current word-topic matrix of a Latent Dirichlet Allocation LDA model, to obtain message vectors M_n of the non-zero elements; a first screening module, configured to determine target message vectors ObjectM_n from the message vectors M_n according to the residuals of the message vectors of the non-zero elements, where the target message vectors are the message vectors ranked within a preset top ratio when sorted by residual in descending order, and the preset ratio is greater than 0 and less than 1; an update module, configured to update the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n; a second screening module, configured to determine, among the non-zero elements of the word-document matrix, the target elements ObjectE_n corresponding to the target message vectors ObjectM_n; an execution module, configured to execute, for the (n+1)-th time, the iterative process of calculating the target elements determined the n-th time to obtain their message vectors M_{n+1}, determining target message vectors ObjectM_{n+1} from M_{n+1} according to the residuals, updating the current document-topic matrix and current word-topic matrix according to ObjectM_{n+1}, and determining from the word-document matrix the target elements ObjectE_{n+1} corresponding to ObjectM_{n+1}, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach a convergence state; and a topic mining module, configured to determine the converged matrices as parameters of the LDA model and to perform topic mining on a document to be tested by using the LDA model with the determined parameters.
  • In a first possible implementation, the first screening module includes: a calculating unit, configured to calculate the residuals of the message vectors of the non-zero elements; a query unit, configured to query, among the calculated residuals, the target residuals ranked within the preset top ratio in descending order, where the preset ratio is determined according to the efficiency of topic mining and the accuracy of the topic mining result; and a screening unit, configured to determine, from the message vectors M_n of the non-zero elements, the target message vectors ObjectM_n corresponding to the target residuals.
  • The query unit is specifically configured to compute the cumulative residual matrix $R^{n}_{w,k}=\sum_{d=1}^{D} r^{n}_{w,d}(k)$ from the residuals, where $r^{n}_{w,d}(k)$ is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution and $R^{n}_{w,k}$ is the element value in row w, column k of the cumulative residual matrix; to determine, in each row, the columns $k^{*}$ of the elements ranked within the top ratio $\lambda_k$ in descending order, where $0<\lambda_k\le 1$; to accumulate the determined elements in each row to obtain a sum value per row; to determine the rows $w^{*}$ whose sum values rank within the top ratio $\lambda_w$ in descending order, where $0<\lambda_w\le 1$ and $\lambda_k\times\lambda_w\neq 1$; and to determine the residuals $r^{n}_{w,d}(k)$ satisfying $w\in w^{*}$ and $k\in k^{*}$ as the target residuals.
  • The screening unit is specifically configured to determine, from the message vectors M_n of the non-zero elements, the target message vectors ObjectM_n corresponding to the target residuals $r^{n}_{w,d}(k)$.
  • In a fifth possible implementation, the update module includes: a first update unit, configured to compute $\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$ and to update the element value in row k, column d of the current document-topic matrix of the LDA model with $\theta^{n}_{k,d}$; and a second update unit, configured to compute $\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$ and to update the element value in row k, column w of the current word-topic matrix with $\phi^{n}_{k,w}$.
  • The message vector calculation module is specifically configured to compute, in the n-th execution of the iterative process, the k-th element value of the message vector of the element $x_{w,d}$ in row w, column d of the word-document matrix as $\mu^{n}_{w,d}(k)\propto\dfrac{\bigl(\theta^{n-1}_{k,d}+\alpha\bigr)\bigl(\phi^{n-1}_{k,w}+\beta\bigr)}{\sum_{w=1}^{W}\phi^{n-1}_{k,w}+W\beta}$, where $k=1,2,\ldots,K$ and K is the preset number of topics; $w=1,2,\ldots,W$ and W is the length of the word list; $d=1,2,\ldots,D$ and D is the number of training documents; $\theta^{n-1}_{k,d}$ is the element value in row k, column d of the current document-topic matrix; $\phi^{n-1}_{k,w}$ is the element value in row k, column w of the current word-topic matrix; and α and β are preset coefficients whose values are positive numbers.
  • With the topic mining method and apparatus, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and the current document-topic matrix and current word-topic matrix are updated only according to the target message vectors determined in the current iteration; in the subsequent iteration, only the target elements of the word-document matrix that correspond to the target message vectors determined in the previous iteration are recalculated according to the updated matrices. This avoids computing all non-zero elements of the word-document matrix in every iteration and avoids updating the current document-topic matrix and current word-topic matrix according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
  • FIG. 1 is a schematic flowchart of a topic mining method according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of a topic mining method according to another embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of a topic mining apparatus according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of a topic mining apparatus according to another embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of a topic mining apparatus according to still another embodiment of the present invention;
  • FIG. 6 is an architectural diagram of a topic mining apparatus applied to network public opinion analysis.
  • FIG. 1 is a schematic flowchart of a topic mining method according to an embodiment of the present invention. As shown in FIG. 1, this embodiment may include:
  • 101. Calculate the non-zero elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model, to obtain the message vectors of the non-zero elements (e.g., M_n).
  • The word-document matrix is in the form of a bag-of-words matrix or of a term frequency-inverse document frequency (TF-IDF) matrix. If the iterative process comprising steps 101 to 103 is being executed for the first time, the target elements may be all non-zero elements of the word-document matrix; otherwise, the target elements are those determined in step 103 of the previous iteration.
  • Optionally, if the word-document matrix is in bag-of-words form, it can be calculated directly to obtain the message vectors of the target elements; alternatively, the bag-of-words word-document matrix can first be converted into TF-IDF form, and the word-document matrix in TF-IDF form is then calculated to obtain the message vectors of the target elements.
  • The message vector indicates the likelihood that an element of the word-document matrix relates to each topic; for example, the message vector μ_{w,d}(k) indicates the likelihood that the element in row w, column d of the word-document matrix relates to the k-th topic, and when the total number of topics is K, 1 ≤ k ≤ K, i.e. the message vector μ_{w,d}(k) has length K.
  • It should be noted that the word-document matrix indicates the number of times a word occurs in a document. Taking a word-document matrix in bag-of-words form as an example, each row of the matrix corresponds to one word and each column to one document; each non-zero element value indicates the number of times the word of that row appears in the document of that column, and a zero element value indicates that the word does not appear in that document.
  • Each row of the word-topic matrix corresponds to a word and each column to a topic; an element value indicates the probability that the topic of the element's column involves the word of the corresponding row.
  • Each row of the document-topic matrix corresponds to a document and each column to a topic; an element value indicates the probability that the document of the element's row involves the topic of the element's column.
  • 102. Determine target message vectors (e.g., ObjectM_n) from the message vectors of the non-zero elements according to the residuals of the message vectors.
  • The target message vectors are the message vectors ranked within the preset top ratio when the residuals are sorted in descending order; the residual is used to indicate the degree of convergence of a message vector.
  • Optionally, the residuals of the message vectors are calculated, the target residuals ranked within the top ratio (λ_k × λ_w) in descending order are queried among them, and the message vectors corresponding to the target residuals are determined as the target message vectors; these have larger residuals and a lower degree of convergence. (λ_k × λ_w) is greater than 0 and less than 1, i.e. 0 < (λ_k × λ_w) < 1, and its value is determined according to topic mining efficiency and the accuracy of the topic mining result: the smaller the value, the less computation and the higher the efficiency, but the larger the error of the result; the larger the value, the more computation and the lower the efficiency, but the smaller the error.
  • 103. Update the current document-topic matrix and the current word-topic matrix according to the target message vectors.
  • 104. Determine, among the non-zero elements of the word-document matrix, the target elements (e.g., ObjectE_n) corresponding to the target message vectors.
  • Optionally, among the non-zero elements of the word-document matrix, the elements corresponding to the previously determined target message vectors are queried and determined as the target elements. Thus, the next time the step of calculating the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the LDA model is executed, only the target elements determined this time are calculated to obtain their message vectors. Since the number of target elements determined in each execution of this step is smaller than the number determined in the previous execution, the amount of computation for the message vectors of the target elements keeps decreasing, which in turn keeps decreasing the amount of computation for updating the current document-topic matrix and current word-topic matrix according to the target message vectors, improving efficiency.
  • 105. According to the target elements determined the n-th time in the word-document matrix (e.g., ObjectE_n), execute for the (n+1)-th time the iterative process of the foregoing calculating, determining and updating steps, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach a convergence state.
  • Specifically, the (n+1)-th execution calculates, according to the current document-topic matrix and current word-topic matrix of the LDA model, the target elements determined the n-th time in the word-document matrix of the training document, to obtain their message vectors (e.g., M_{n+1}); determines target message vectors (e.g., ObjectM_{n+1}) from those message vectors according to their residuals; updates the current document-topic matrix and current word-topic matrix according to the target message vectors determined the (n+1)-th time; and determines, from the word-document matrix, the target elements (e.g., ObjectE_{n+1}) corresponding to the target message vectors determined the (n+1)-th time.
  • It should be noted that when the message vectors, the document-topic matrix and the word-topic matrix reach the convergence state, the message vectors and matrices obtained in the (n+1)-th iteration are correspondingly similar to those obtained in the n-th iteration; that is, the differences between the message vectors, between the document-topic matrices, and between the word-topic matrices obtained in the (n+1)-th and n-th iterations all approach zero. In other words, no matter how many more iterations are executed, the message vectors, document-topic matrix and word-topic matrix change little and are stable.
  • 106. Determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and perform topic mining by using the LDA model with the determined parameters.
  • In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and only the target message vectors determined in the current iteration are used to update the current document-topic matrix and current word-topic matrix, so that in the subsequent iteration, based on the matrices updated in the previous iteration, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous iteration are calculated. This avoids calculating all non-zero elements of the word-document matrix in every iteration and avoids updating the current document-topic matrix and current word-topic matrix according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
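  • To make the flow of FIG. 1 concrete, here is a minimal, hedged numpy sketch of the screened iteration. It collapses λ_k × λ_w into a single `ratio` over messages, uses a dense matrix, and applies the θ/φ update incrementally (adding only the change contributed by the re-selected target messages); these are implementation choices of ours, not details fixed by the text.

```python
import numpy as np

def residual_bp_lda(X, K, alpha=0.01, beta=0.01, ratio=0.3, n_iter=100, tol=1e-4):
    W, D = X.shape
    rows, cols = np.nonzero(X)                       # first pass: all non-zero elements
    mu = np.random.rand(len(rows), K)
    mu /= mu.sum(axis=1, keepdims=True)              # initial message vectors
    weighted = X[rows, cols][:, None] * mu
    theta = np.zeros((K, D)); phi = np.zeros((K, W))
    np.add.at(theta.T, cols, weighted)               # theta[k,d] = sum_w x_{w,d} mu(k)
    np.add.at(phi.T, rows, weighted)                 # phi[k,w]  = sum_d x_{w,d} mu(k)
    active = np.arange(len(rows))                    # indices of current target elements
    for _ in range(n_iter):
        w_i, d_i = rows[active], cols[active]
        new = (theta[:, d_i].T + alpha) * (phi[:, w_i].T + beta) \
              / (phi.sum(axis=1) + W * beta)
        new /= new.sum(axis=1, keepdims=True)        # message vectors of target elements
        res = (X[w_i, d_i][:, None] * np.abs(new - mu[active])).sum(axis=1)
        top = res.argsort()[::-1][:max(1, int(ratio * len(active)))]
        sel = active[top]                            # target messages = largest residuals
        delta = X[rows[sel], cols[sel]][:, None] * (new[top] - mu[sel])
        np.add.at(theta.T, cols[sel], delta)         # update only from target messages
        np.add.at(phi.T, rows[sel], delta)
        mu[sel] = new[top]
        if res[top].max() < tol:                     # crude convergence check
            break
        active = sel                                 # next pass recomputes only these
    return theta, phi
```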
  • FIG. 2 is a schematic flowchart of a topic mining method according to another embodiment of the present invention. The word-document matrix in this embodiment is in bag-of-words form. As shown in FIG. 2, the embodiment includes:
  • 201. Initialize the current document-topic matrix and the current word-topic matrix of the LDA model.
  • Optionally, a message vector is determined for each non-zero element of the word-document matrix of the training document; the message vector includes K elements, each element corresponds to one topic, and the message vector represents the probability that the word of the element in the document indicated by the word-document matrix relates to each topic. The current document-topic matrix and current word-topic matrix are obtained by calculation from the initial message vectors $\mu^{0}_{w,d}(k)$, where k = 1, 2, ..., K, w = 1, 2, ..., W, d = 1, 2, ..., D.
  • W is the length of the word list, that is, the number of words contained in the word list, equal to the number of rows of the word-document matrix; D is the number of training documents; K is the preset number of topics, which can be set by the user before topic mining; the larger the number of topics, the larger the amount of computation. W, D and K are all positive integers.
  • Further, before step 201, each training document is checked for whether it contains each word of a standard dictionary and for the number of occurrences of that word, and the statistical result is used to generate a word-document matrix in bag-of-words form: each row of the matrix corresponds to one word and each column to one document, each non-zero element value indicates the number of times the word of that row appears in the document of that column, and a zero element value indicates that the word does not appear in that document.
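  • A minimal sketch of this statistics step, assuming whitespace-tokenized documents and a fixed dictionary list (the names are ours):

```python
from collections import Counter
import numpy as np

def word_document_matrix(docs, vocab):
    # X[w, d] = number of times vocab word w occurs in document d (0 if absent)
    index = {word: w for w, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)), dtype=int)
    for d, doc in enumerate(docs):
        for word, count in Counter(doc.split()).items():
            if word in index:                 # only words from the standard dictionary
                X[index[word], d] = count
    return X

# e.g. word_document_matrix(["topic model topic", "model mining"],
#                           ["topic", "model", "mining"])
```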
  • If this step is executed for the first time, it is determined to be the first execution of the iterative process and the target elements are all non-zero elements of the word-document matrix; otherwise, the target elements are those determined in the previous execution of the iterative process.
  • Here $\mu^{n}_{w,d}(k)$ denotes the message vector computed, in the n-th execution of the iterative process, for the element in row w, column d of the word-document matrix on the k-th topic; α and β are preset coefficients, generally called the hyper-parameters of the LDA model.
  • λ_k and λ_w must be preset before topic mining, with 0 < λ_k ≤ 1, 0 < λ_w ≤ 1 and λ_k × λ_w ≠ 1.
  • 206. Using the correspondence between the target message vectors $\mu^{n}_{w,d}(k)$ and the elements $x_{w,d}$ of the word-document matrix, the target elements of the word-document matrix corresponding to the target message vectors are determined among the target elements in step 201.
  • 207. The target message vectors $\mu^{n}_{w,d}(k)$ are substituted into $\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$, and $\theta^{n}_{k,d}$ is used to update the current document-topic matrix; they are likewise substituted into $\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$, and $\phi^{n}_{k,w}$ is used to update the current word-topic matrix.
  • 208. Determine whether the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix have reached the convergence state; if so, execute step 209, otherwise repeat steps 202 to 207.
  • In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and only the target message vectors determined in the current iteration are used to update the current document-topic matrix and current word-topic matrix, so that in the subsequent iteration, based on the matrices updated in the previous iteration, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous iteration are calculated. This avoids calculating all non-zero elements in every iteration and updating the matrices from all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
  • Further, the target residuals ranked within the top preset ratio are queried from the residuals in descending order as follows: in each row of the cumulative residual matrix obtained from the residuals, a quick sort algorithm or an insertion sort algorithm is used to determine the columns of the elements ranked within the top preset ratio in descending order; the elements determined in each row are then accumulated to obtain the sum value of each row, and a quick sort or insertion sort algorithm is used to determine the rows whose sum values rank within the top preset ratio; the residuals on the determined rows and columns are taken as the target residuals. This speeds up the query for the target residuals and thereby improves the efficiency of topic mining.
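  • In numpy, the partial ordering that the quick sort or insertion sort provides here can be obtained with np.argpartition; a hedged sketch of the row/column screening on the cumulative residual matrix (names are ours):

```python
import numpy as np

def top_ratio_rows_cols(R, lam_k, lam_w):
    # R: W x K cumulative residual matrix; returns the selected rows w* and,
    # per selected row, the columns k* of its top-lam_k entries.
    W, K = R.shape
    nk = max(1, int(lam_k * K))
    cols = np.argpartition(R, -nk, axis=1)[:, -nk:]        # top-nk columns per row
    row_sums = np.take_along_axis(R, cols, axis=1).sum(axis=1)
    nw = max(1, int(lam_w * W))
    rows = np.argpartition(row_sums, -nw)[-nw:]            # top-nw rows by sum
    return rows, cols[rows]
```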
  • FIG. 3 is a schematic structural diagram of a topic mining apparatus according to an embodiment of the present invention. As shown in FIG. 3, the apparatus includes: a message vector calculation module 31, a first screening module 32, a second screening module 33, an update module 34, an execution module 35 and a topic mining module 36.
  • The message vector calculation module 31 is configured to calculate the non-zero elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model, to obtain the message vectors of the non-zero elements.
  • the first screening module 32 is coupled to the message vector calculation module 31 for determining a target message vector from the message vector of the non-zero element according to the residual of the non-zero element message vector.
  • The target message vectors are the message vectors ranked within a preset top ratio when the residuals are sorted in descending order; the preset ratio is greater than 0 and less than 1, and the residual is used to indicate the degree of convergence of a message vector.
  • the second screening module 33 is connected to the first screening module 32, and is configured to determine a target element corresponding to the target message vector from the non-zero elements in the word-document matrix.
  • The update module 34 is coupled to the first screening module 32 and is configured to update the current document-topic matrix and current word-topic matrix of the LDA model according to the target message vectors.
  • The execution module 35 is connected to the message vector calculation module 31 and the update module 34 and is configured to execute, for the (n+1)-th time, the iterative process of calculating the target elements determined the n-th time in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the LDA model, obtaining the message vectors of those target elements, determining target message vectors from them according to their residuals, updating the current document-topic matrix and current word-topic matrix according to the target message vectors determined the (n+1)-th time, and determining, from the word-document matrix, the target elements corresponding to the target message vectors determined the (n+1)-th time, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach the convergence state.
  • The topic mining module 36 is connected to the execution module 35 and is configured to determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and to perform topic mining on the document to be tested by using the LDA model with the determined parameters.
  • In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and only the target message vectors determined in the current iteration are used to update the current document-topic matrix and current word-topic matrix, so that in the subsequent iteration, based on the matrices updated in the previous iteration, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous iteration are calculated. This avoids calculating all non-zero elements in every iteration and updating the matrices from all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
  • FIG. 4 is a schematic structural diagram of a topic mining apparatus according to another embodiment of the present invention.
  • The first screening module 32 in this embodiment further includes: a calculating unit 321, a query unit 322 and a screening unit 323.
  • The calculating unit 321 is configured to calculate the residuals of the message vectors of the non-zero elements.
  • The query unit 322 is connected to the calculating unit 321 and is configured to query, among the calculated residuals, the target residuals ranked within the top ratio (λ_k × λ_w) in descending order; (λ_k × λ_w) is greater than 0 and less than 1, and the preset ratio is determined according to the efficiency of topic mining and the accuracy of the topic mining result.
  • The query unit 322 is specifically configured to compute the cumulative residual matrix $R^{n}_{w,k}=\sum_{d=1}^{D} r^{n}_{w,d}(k)$ from the residuals, where $r^{n}_{w,d}(k)$ is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution and $R^{n}_{w,k}$ is the element value in row w, column k of the cumulative residual matrix; to determine, in each row of the cumulative residual matrix, the columns $k^{*}$ of the elements ranked within the top ratio λ_k in descending order, where 0 < λ_k ≤ 1; to accumulate the elements determined in each row to obtain the sum value of each row; to determine the rows $w^{*}$ whose sum values rank within the top ratio λ_w in descending order, where 0 < λ_w ≤ 1; and to determine the residuals $r^{n}_{w,d}(k)$ satisfying $w\in w^{*}$ and $k\in k^{*}$ as the target residuals.
  • The screening unit 323 is connected to the query unit 322 and is configured to determine, from the message vectors of the non-zero elements, the target message vectors corresponding to the target residuals; specifically, the screening unit 323 determines, from the message vectors M_n of the non-zero elements, the target message vectors ObjectM_n corresponding to the target residuals $r^{n}_{w,d}(k)$.
  • the update module 34 includes: a first update unit 341 and a second update unit 342.
  • The first update unit 341 is configured to compute $\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$ from the target message vectors, obtaining the element value in row k, column d of the updated current document-topic matrix of the LDA model, and to update the current document-topic matrix with $\theta^{n}_{k,d}$; here k = 1, 2, ..., K, K is the preset number of topics, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector computed for $x_{w,d}$ in the n-th execution of the iterative process.
  • The second update unit 342 is configured to compute $\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$, obtaining the element value in row k, column w of the updated current word-topic matrix of the LDA model, and to update the current word-topic matrix with $\phi^{n}_{k,w}$.
  • Further, the topic mining apparatus also includes: a determining module 41, a first obtaining module 42 and a second obtaining module 43.
  • The determining module 41 is configured to determine the initial message vector $\mu^{0}_{w,d}(k)$ of each non-zero element in the word-document matrix.
  • The first obtaining module 42 is connected to the determining module 41 and the message vector calculation module 31 and is configured to obtain the initial current document-topic matrix from the initial message vectors.
  • The second obtaining module 43 is connected to the determining module 41 and the message vector calculation module 31 and is configured to obtain the initial current word-topic matrix from the initial message vectors.
  • The message vector calculation module 31 is specifically configured to compute, in the n-th execution of the iterative process, the k-th element value of the message vector of the element $x_{w,d}$ in row w, column d of the word-document matrix as $\mu^{n}_{w,d}(k)\propto\dfrac{\bigl(\theta^{n-1}_{k,d}+\alpha\bigr)\bigl(\phi^{n-1}_{k,w}+\beta\bigr)}{\sum_{w=1}^{W}\phi^{n-1}_{k,w}+W\beta}$, where k = 1, 2, ..., K and K is the preset number of topics; w = 1, 2, ..., W and W is the length of the word list; d = 1, 2, ..., D and D is the number of training documents; and α and β are preset coefficients whose values are positive numbers.
  • The function modules of the topic mining apparatus provided in this embodiment can be used to execute the process of the topic mining method shown in FIG. 1 and FIG. 2; the specific working principle is not described here again, and for details, refer to the description of the method embodiments.
  • In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and only the target message vectors determined in the current iteration are used to update the current document-topic matrix and current word-topic matrix, so that in the subsequent iteration, based on the matrices updated in the previous iteration, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous iteration are calculated. This avoids calculating all non-zero elements in every iteration and updating the matrices from all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
  • Further, the target residuals ranked within the top preset ratio are queried from the residuals in descending order by using a quick sort algorithm or an insertion sort algorithm, in each row of the cumulative residual matrix obtained from the residuals, to determine the columns of the elements ranked within the top preset ratio in descending order; the elements determined in each row are then accumulated to obtain the sum value of each row, and a quick sort or insertion sort algorithm is used to determine the rows whose sum values rank within the top preset ratio; the residuals on the determined rows and columns are determined as the target residuals. This speeds up the query for the target residuals and thereby improves the efficiency of topic mining.
  • FIG. 5 is a schematic structural diagram of a topic mining apparatus according to still another embodiment of the present invention. As shown in FIG. 5, the apparatus in this embodiment may include: a memory 51, a communication interface 52, and a processor 53.
  • the memory 51 is for storing a program.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 51 may include a high speed RAM memory and may also include a non-volatile memory such as at least one disk memory.
  • the communication interface 52 is configured to obtain a word-document matrix of the training document.
  • The processor 53 is configured to execute the program stored in the memory 51, so as to: calculate the non-zero elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model, to obtain the message vectors of the non-zero elements; determine target message vectors from the message vectors of the non-zero elements according to their residuals, the target message vectors being the message vectors ranked within a preset top ratio when the residuals are sorted in descending order, the preset ratio being greater than 0 and less than 1; update the current document-topic matrix and current word-topic matrix of the LDA model according to the target message vectors; determine, among the non-zero elements of the word-document matrix, the target elements corresponding to the target message vectors; repeatedly execute the iterative process of calculating the previously determined target elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the LDA model to obtain their message vectors, determining target message vectors from them according to the residuals of the message vectors of the previously determined target elements, updating the current document-topic matrix and current word-topic matrix according to the target message vectors determined this time, and determining from the word-document matrix the target elements corresponding to the determined target message vectors, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach the convergence state; and determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and perform topic mining on the document to be tested by using the LDA model with the determined parameters.
  • the function modules of the theme mining device provided in this embodiment can be used to execute the process of the topic mining method shown in FIG. 1 and FIG. 2 , and the specific working principle is not described here. For details, refer to the description of the method embodiment.
  • FIG. 6 is an architectural diagram of a topic mining apparatus applied to network public opinion analysis. The apparatus may obtain documents to be tested from a content server and then select training documents from the documents to be tested, or additionally select documents involving different topics related to the documents to be tested as training documents; the more topics the training documents cover, the higher the accuracy of topic mining. The training documents are then processed with the topic mining method provided in the foregoing embodiments to determine the parameters of the LDA model, and the LDA model with the determined parameters may be used to perform topic mining on documents to be tested formed from microblog posts, web page text and the like on the network; the mined topics of the documents to be tested are sent to a network public opinion analysis server for network public opinion analysis.
  • In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and only the target message vectors determined in the current iteration are used to update the current document-topic matrix and current word-topic matrix, so that in the subsequent iteration, based on the matrices updated in the previous iteration, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous iteration are calculated. This avoids calculating all non-zero elements in every iteration and updating the matrices from all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
  • The foregoing program may be stored in a computer readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Fuzzy Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A topic mining method and apparatus: each time the iterative process is executed, target message vectors are determined from the message vectors according to the residuals of the message vectors, so that the current document-topic matrix and current word-topic matrix are updated only according to the target message vectors, and then, according to the current document-topic matrix and current word-topic matrix, only the target elements of the word-document matrix corresponding to the target message vectors are calculated. This avoids calculating all non-zero elements of the word-document matrix in every iteration and avoids updating the current document-topic matrix and current word-topic matrix according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.

Description

Topic mining method and apparatus

Technical Field

The embodiments of the present invention relate to information technologies, and in particular, to a topic mining method and apparatus.
Background

Topic mining is the process of clustering semantically related words in a large-scale document set by means of the Latent Dirichlet Allocation (LDA) machine learning model, so as to obtain, in the form of a probability distribution, the topic of each document in the set, that is, the central idea the author expresses through the document.

In the prior art, topic mining first trains the LDA model on training documents with the Belief Propagation (BP) algorithm to determine the model parameters of the trained LDA model, namely the word-topic matrix Φ and the document-topic matrix θ, and then inputs the word-document matrix of the document to be tested into the trained LDA model for topic mining, thereby obtaining a document-topic matrix θ' that indicates the topic distribution of the document to be tested. The BP algorithm involves a large amount of iterative computation: the process of calculating every non-zero element of the word-document matrix according to the current document-topic matrix and current word-topic matrix of the LDA model to obtain a message vector for every non-zero element, and then updating the current document-topic matrix and current word-topic matrix according to all of those message vectors, is repeated many times until the message vectors, the current document-topic matrix and the current word-topic matrix reach a convergence state. Because each iteration must compute a message vector for every non-zero element of the word-document matrix and update the current document-topic matrix and current word-topic matrix according to all message vectors, the amount of computation is large and topic mining is inefficient; moreover, the existing topic mining method is applicable only when the word-document matrix is a discrete bag-of-words matrix.
Summary

The embodiments of the present invention provide a topic mining method and apparatus, so as to reduce the amount of computation of topic mining and improve the efficiency of topic mining.
One aspect of the embodiments of the present invention provides a topic mining method, including:

calculating the non-zero elements in a word-document matrix of a training document according to a current document-topic matrix and a current word-topic matrix of a Latent Dirichlet Allocation (LDA) model, to obtain message vectors $M_n$ of the non-zero elements; determining target message vectors $ObjectM_n$ from the message vectors $M_n$ of the non-zero elements according to the residuals of the message vectors of the non-zero elements, where the target message vectors are the message vectors ranked within a preset top ratio when sorted by residual in descending order, and the preset ratio is greater than 0 and less than 1; updating the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors $ObjectM_n$; determining, among the non-zero elements of the word-document matrix, the target elements $ObjectE_n$ corresponding to the target message vectors $ObjectM_n$; executing, for the (n+1)-th time, the iterative process of calculating, according to the current document-topic matrix and current word-topic matrix of the LDA model, the target elements $ObjectE_n$ determined the n-th time in the word-document matrix of the training document to obtain their message vectors $M_{n+1}$, determining target message vectors $ObjectM_{n+1}$ from the message vectors $M_{n+1}$ according to the residuals of the message vectors of the target elements determined the n-th time, updating the current document-topic matrix and current word-topic matrix according to the target message vectors $ObjectM_{n+1}$ determined the (n+1)-th time, and determining, from the word-document matrix, the target elements $ObjectE_{n+1}$ corresponding to the target message vectors $ObjectM_{n+1}$ determined the (n+1)-th time, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach a convergence state; and determining the current document-topic matrix that has reached the convergence state and the current word-topic matrix that has reached the convergence state as parameters of the LDA model, and performing topic mining on a document to be tested by using the LDA model with the determined parameters.
In a first possible implementation of the first aspect, the determining of the target message vectors $ObjectM_n$ from the message vectors $M_n$ of the non-zero elements according to the residuals of the message vectors of the non-zero elements includes: calculating the residuals of the message vectors of the non-zero elements; querying, among the calculated residuals, the target residuals ranked within the preset top ratio in descending order, where the preset ratio is determined according to the efficiency of topic mining and the accuracy of the topic mining result; and determining, from the message vectors $M_n$ of the non-zero elements, the target message vectors $ObjectM_n$ corresponding to the target residuals.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the calculating of the residuals of the message vectors of the non-zero elements includes: calculating the residuals according to the formula

$$r^{n}_{w,d}(k)=x_{w,d}\,\bigl|\mu^{n}_{w,d}(k)-\mu^{n-1}_{w,d}(k)\bigr|$$

where $r^{n}_{w,d}(k)$ is the residual of the message vector of the non-zero element, $k=1,2,\ldots,K$, K is the preset number of topics, $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n-1}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating the element in row w, column d of the word-document matrix in the (n-1)-th execution of the iterative process.
With reference to the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the querying, among the calculated residuals, of the target residuals ranked within the preset top ratio in descending order includes: calculating the cumulative residual matrix from the residuals $r^{n}_{w,d}(k)$ according to the formula

$$R^{n}_{w,k}=\sum_{d=1}^{D} r^{n}_{w,d}(k)$$

where $r^{n}_{w,d}(k)$ is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, and $R^{n}_{w,k}$ is the element value in row w, column k of the cumulative residual matrix in the n-th execution of the iterative process; determining, in each row of the cumulative residual matrix, the columns $k^{*}$ of the elements ranked within the top ratio $\lambda_k$ in descending order, where $0<\lambda_k\le 1$; accumulating the elements determined in each row to obtain the sum value corresponding to each row; determining the rows $w^{*}$ corresponding to the sum values ranked within the top ratio $\lambda_w$ in descending order, where $0<\lambda_w\le 1$ and $\lambda_k\times\lambda_w\neq 1$; and determining the residuals $r^{n}_{w,d}(k)$ satisfying $w\in w^{*}$ and $k\in k^{*}$ as the target residuals.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the determining, from the message vectors $M_n$ of the non-zero elements, of the target message vectors $ObjectM_n$ corresponding to the target residuals includes: determining, from the message vectors $M_n$ of the non-zero elements, the target message vectors $ObjectM_n$, i.e. the message vectors $\mu^{n}_{w,d}(k)$ corresponding to the target residuals $r^{n}_{w,d}(k)$.
With reference to the second, third and fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, the updating of the current document-topic matrix and current word-topic matrix of the LDA model according to the target message vectors $ObjectM_n$ includes:

calculating, according to the formula

$$\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$$

the element value $\theta^{n}_{k,d}$ in row k, column d of the updated current document-topic matrix of the LDA model, and updating the element value in row k, column d of the current document-topic matrix of the LDA model with $\theta^{n}_{k,d}$, where $k=1,2,\ldots,K$, K is the preset number of topics, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating $x_{w,d}$ in the n-th execution of the iterative process; and calculating, according to the formula

$$\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$$

the element value $\phi^{n}_{k,w}$ in row k, column w of the updated current word-topic matrix of the LDA model, and updating the element value in row k, column w of the current word-topic matrix of the LDA model with $\phi^{n}_{k,w}$.
With reference to the first aspect and its first, second, third and fourth possible implementations, in a sixth possible implementation of the first aspect, the calculating of the non-zero elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model to obtain the message vectors $M_n$ of the non-zero elements includes: in the n-th execution of the iterative process, calculating, according to the formula

$$\mu^{n}_{w,d}(k)\propto\frac{\bigl(\theta^{n-1}_{k,d}+\alpha\bigr)\bigl(\phi^{n-1}_{k,w}+\beta\bigr)}{\sum_{w=1}^{W}\phi^{n-1}_{k,w}+W\beta}$$

the k-th element value $\mu^{n}_{w,d}(k)$ of the message vector of the element $x_{w,d}$ in row w, column d of the word-document matrix, where $k=1,2,\ldots,K$, K is the preset number of topics, $w=1,2,\ldots,W$, W is the length of the word list, $d=1,2,\ldots,D$, D is the number of training documents, $\theta^{n-1}_{k,d}$ is the element value in row k, column d of the current document-topic matrix, $\phi^{n-1}_{k,w}$ is the element value in row k, column w of the current word-topic matrix, and α and β are preset coefficients whose values are positive numbers.
With reference to the first aspect and its first, second, third and fourth possible implementations, in a seventh possible implementation of the first aspect, before the non-zero elements in the word-document matrix of the training document are calculated according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model to obtain the message vectors $M_n$ of the non-zero elements, the method further includes: determining the initial message vector $\mu^{0}_{w,d}(k)$ of each non-zero element in the word-document matrix, where $k=1,2,\ldots,K$, K is the preset number of topics, and $\mu^{0}_{w,d}(k)$ is the initial message vector of the non-zero element $x_{w,d}$ in row w, column d of the word-document matrix; obtaining from the initial message vectors the element values of the initial current document-topic matrix; and obtaining from the initial message vectors the element values of the initial current word-topic matrix.
A second aspect of the embodiments of the present invention provides a topic mining apparatus, including:

a message vector calculation module, configured to calculate the non-zero elements in a word-document matrix of a training document according to a current document-topic matrix and a current word-topic matrix of a Latent Dirichlet Allocation LDA model, to obtain message vectors $M_n$ of the non-zero elements; a first screening module, configured to determine target message vectors $ObjectM_n$ from the message vectors $M_n$ of the non-zero elements according to the residuals of the message vectors of the non-zero elements, where the target message vectors are the message vectors ranked within a preset top ratio when sorted by residual in descending order, and the preset ratio is greater than 0 and less than 1; an update module, configured to update the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors $ObjectM_n$; a second screening module, configured to determine, among the non-zero elements of the word-document matrix, the target elements $ObjectE_n$ corresponding to the target message vectors $ObjectM_n$; an execution module, configured to execute, for the (n+1)-th time, the iterative process of calculating, according to the current document-topic matrix and current word-topic matrix of the LDA model, the target elements $ObjectE_n$ determined the n-th time in the word-document matrix of the training document to obtain their message vectors $M_{n+1}$, determining target message vectors $ObjectM_{n+1}$ from the message vectors $M_{n+1}$ according to the residuals of the message vectors of the target elements determined the n-th time, updating the current document-topic matrix and current word-topic matrix according to the target message vectors $ObjectM_{n+1}$ determined the (n+1)-th time, and determining, from the word-document matrix, the target elements $ObjectE_{n+1}$ corresponding to the target message vectors $ObjectM_{n+1}$ determined the (n+1)-th time, until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach a convergence state; and a topic mining module, configured to determine the current document-topic matrix that has reached the convergence state and the current word-topic matrix that has reached the convergence state as parameters of the LDA model, and to perform topic mining on a document to be tested by using the LDA model with the determined parameters.
In a first possible implementation of the second aspect, the first screening module includes: a calculating unit, configured to calculate the residuals of the message vectors of the non-zero elements; a query unit, configured to query, among the calculated residuals, the target residuals ranked within the preset top ratio in descending order, where the preset ratio is determined according to the efficiency of topic mining and the accuracy of the topic mining result; and a screening unit, configured to determine, from the message vectors $M_n$ of the non-zero elements, the target message vectors $ObjectM_n$ corresponding to the target residuals.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the calculating unit is specifically configured to calculate the residuals of the message vectors of the non-zero elements according to the formula

$$r^{n}_{w,d}(k)=x_{w,d}\,\bigl|\mu^{n}_{w,d}(k)-\mu^{n-1}_{w,d}(k)\bigr|$$

where $r^{n}_{w,d}(k)$ is the residual of the message vector of the non-zero element, $k=1,2,\ldots,K$, K is the preset number of topics, $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n-1}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating the element in row w, column d of the word-document matrix in the (n-1)-th execution of the iterative process.
With reference to the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the query unit is specifically configured to calculate the cumulative residual matrix from the residuals $r^{n}_{w,d}(k)$ according to the formula

$$R^{n}_{w,k}=\sum_{d=1}^{D} r^{n}_{w,d}(k)$$

where $r^{n}_{w,d}(k)$ is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, and $R^{n}_{w,k}$ is the element value in row w, column k of the cumulative residual matrix in the n-th execution of the iterative process; to determine, in each row of the cumulative residual matrix, the columns $k^{*}$ of the elements ranked within the top ratio $\lambda_k$ in descending order, where $0<\lambda_k\le 1$; to accumulate the elements determined in each row to obtain the sum value corresponding to each row; to determine the rows $w^{*}$ corresponding to the sum values ranked within the top ratio $\lambda_w$ in descending order, where $0<\lambda_w\le 1$ and $\lambda_k\times\lambda_w\neq 1$; and to determine the residuals $r^{n}_{w,d}(k)$ satisfying $w\in w^{*}$ and $k\in k^{*}$ as the target residuals.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the screening unit is specifically configured to determine, from the message vectors $M_n$ of the non-zero elements, the target message vectors $ObjectM_n$, i.e. the message vectors $\mu^{n}_{w,d}(k)$ corresponding to the target residuals $r^{n}_{w,d}(k)$.
With reference to the second, third and fourth possible implementations of the second aspect, in a fifth possible implementation of the second aspect, the update module includes: a first update unit, configured to calculate, according to the formula

$$\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$$

the element value $\theta^{n}_{k,d}$ in row k, column d of the updated current document-topic matrix of the LDA model, and to update the element value in row k, column d of the current document-topic matrix of the LDA model with $\theta^{n}_{k,d}$, where $k=1,2,\ldots,K$, K is the preset number of topics, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating $x_{w,d}$ in the n-th execution of the iterative process; and a second update unit, configured to calculate, according to the formula

$$\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$$

the element value $\phi^{n}_{k,w}$ in row k, column w of the updated current word-topic matrix of the LDA model, and to update the element value in row k, column w of the current word-topic matrix of the LDA model with $\phi^{n}_{k,w}$.
With reference to the second aspect and its first, second, third and fourth possible implementations, in a sixth possible implementation of the second aspect, the message vector calculation module is specifically configured to calculate, in the n-th execution of the iterative process and according to the formula

$$\mu^{n}_{w,d}(k)\propto\frac{\bigl(\theta^{n-1}_{k,d}+\alpha\bigr)\bigl(\phi^{n-1}_{k,w}+\beta\bigr)}{\sum_{w=1}^{W}\phi^{n-1}_{k,w}+W\beta}$$

the k-th element value $\mu^{n}_{w,d}(k)$ of the message vector of the element $x_{w,d}$ in row w, column d of the word-document matrix, where $k=1,2,\ldots,K$, K is the preset number of topics, $w=1,2,\ldots,W$, W is the length of the word list, $d=1,2,\ldots,D$, D is the number of training documents, $\theta^{n-1}_{k,d}$ is the element value in row k, column d of the current document-topic matrix, $\phi^{n-1}_{k,w}$ is the element value in row k, column w of the current word-topic matrix, and α and β are preset coefficients whose values are positive numbers.
With reference to the second aspect and its first, second, third and fourth possible implementations, in a seventh possible implementation of the second aspect, the apparatus further includes: a determining module, configured to determine the initial message vector $\mu^{0}_{w,d}(k)$ of each non-zero element in the word-document matrix, where $k=1,2,\ldots,K$, K is the preset number of topics, and $\mu^{0}_{w,d}(k)$ is the initial message vector of the non-zero element $x_{w,d}$ in row w, column d of the word-document matrix; a first obtaining module, configured to obtain from the initial message vectors the element values of the initial current document-topic matrix; and a second obtaining module, configured to obtain from the initial message vectors, according to a corresponding formula, the element values of the initial current word-topic matrix.
With the topic mining method and apparatus provided in the embodiments of the present invention, each time the iterative process is executed, target message vectors are determined from the message vectors according to the residuals of the message vectors, and the current document-topic matrix and current word-topic matrix are updated only according to the target message vectors determined in the current execution of the iterative process, so that in subsequent executions of the iterative process, according to the current document-topic matrix and current word-topic matrix, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution are calculated. This avoids calculating all non-zero elements of the word-document matrix in every iteration and avoids updating the current document-topic matrix and current word-topic matrix according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a schematic flowchart of a topic mining method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a topic mining method according to another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a topic mining apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a topic mining apparatus according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a topic mining apparatus according to still another embodiment of the present invention;

FIG. 6 is an architectural diagram of a topic mining apparatus applied to network public opinion analysis.
Detailed Description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the following clearly describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
FIG. 1 is a schematic flowchart of a topic mining method according to an embodiment of the present invention. As shown in FIG. 1, this embodiment may include:

101. Calculate the non-zero elements in the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the Latent Dirichlet Allocation LDA model, to obtain the message vectors of the non-zero elements (e.g., $M_n$).
The word-document matrix is in the form of a bag-of-words matrix or of a term frequency-inverse document frequency (TF-IDF) matrix. If the iterative process comprising steps 101 to 103 is executed for the first time, the target elements may be all non-zero elements of the word-document matrix; otherwise, the target elements are those determined in step 103 of the previous execution of the iterative process.

Optionally, if the word-document matrix is in bag-of-words form, it may be calculated directly to obtain the message vectors of the target elements of the word-document matrix; alternatively, the word-document matrix in bag-of-words form may be converted into TF-IDF form, and the word-document matrix in TF-IDF form is then calculated to obtain the message vectors of the target elements. The message vector indicates the likelihood that an element of the word-document matrix relates to each topic; for example, the message vector $\mu_{w,d}(k)$ indicates the likelihood that the element in row w, column d of the word-document matrix relates to the k-th topic. When the total number of topics is K, $1\le k\le K$, i.e. the message vector $\mu_{w,d}(k)$ has length K.

It should be noted that the word-document matrix indicates the number of times a word occurs in a document. Taking a word-document matrix in bag-of-words form as an example, each row of the matrix corresponds to one word and each column to one document; each non-zero element value indicates the number of times the word of the element's row appears in the document of the element's column, and a zero element value indicates that the word of the element's row does not appear in the document of the element's column. Each row of the word-topic matrix corresponds to a word and each column to a topic; an element value indicates the probability that the topic of the element's column involves the word of the corresponding row. Each row of the document-topic matrix corresponds to a document and each column to a topic; an element value indicates the probability that the document of the element's row involves the topic of the element's column.
102. Determine target message vectors (e.g., $ObjectM_n$) from the message vectors of the non-zero elements according to the residuals of the message vectors of the non-zero elements.

The target message vectors are the message vectors ranked within the preset top ratio when the residuals are sorted in descending order; the residual is used to indicate the degree of convergence of a message vector.

Optionally, the residuals of the message vectors are calculated, the target residuals ranked within the top ratio $(\lambda_k\times\lambda_w)$ in descending order are queried among the calculated residuals, and the message vectors corresponding to the target residuals are determined as the target message vectors; these target message vectors have larger residuals and a lower degree of convergence. $(\lambda_k\times\lambda_w)$ is greater than 0 and less than 1, i.e. $0<(\lambda_k\times\lambda_w)<1$, and its value is determined according to topic mining efficiency and the accuracy of the topic mining result. Specifically, the smaller the value of $(\lambda_k\times\lambda_w)$, the smaller the amount of computation and the higher the topic mining efficiency, but the larger the error of the topic mining result; the larger the value, the larger the amount of computation and the lower the efficiency, but the smaller the error of the result.
103. Update the current document-topic matrix and the current word-topic matrix according to the target message vectors.

Specifically, calculation is performed from the message vectors according to

$$\theta^{n}_{k,d}=\sum_{w=1}^{W}x_{w,d}\,\mu^{n}_{w,d}(k)$$

to obtain $\theta^{n}_{k,d}$, and the element value in row k, column d of the current document-topic matrix of the LDA model is updated with $\theta^{n}_{k,d}$, where $k=1,2,\ldots,K$, K is the preset number of topics, $x_{w,d}$ is the element value in row w, column d of the word-document matrix, and $\mu^{n}_{w,d}(k)$ is the k-th element value of the message vector obtained by calculating $x_{w,d}$ in the n-th execution of the iterative process; calculation is performed according to

$$\phi^{n}_{k,w}=\sum_{d=1}^{D}x_{w,d}\,\mu^{n}_{w,d}(k)$$

to obtain $\phi^{n}_{k,w}$, and the element value in row k, column w of the current word-topic matrix of the LDA model is updated with $\phi^{n}_{k,w}$.
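As a direct, non-incremental rendering of these two formulas, the following numpy sketch recomputes θ and φ from a set of tracked message vectors; it assumes a dense word-document matrix X, with mu[i] the message vector of the non-zero element at (rows[i], cols[i]), and the function and argument names are ours.

```python
import numpy as np

def update_theta_phi(X, mu, rows, cols, K):
    W, D = X.shape
    weighted = X[rows, cols][:, None] * mu     # x_{w,d} * mu_{w,d}(k)
    theta = np.zeros((K, D)); phi = np.zeros((K, W))
    np.add.at(theta.T, cols, weighted)         # theta[k,d] = sum over w
    np.add.at(phi.T, rows, weighted)           # phi[k,w]  = sum over d
    return theta, phi
```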
104. Determine, among the non-zero elements of the word-document matrix, the target elements (e.g., $ObjectE_n$) corresponding to the target message vectors.

Optionally, among the non-zero elements of the word-document matrix, the elements corresponding to the previously determined target message vectors are queried, and the elements of the word-document matrix corresponding to the target message vectors are determined as the target elements. Thus, when the step of calculating the word-document matrix of the training document according to the current document-topic matrix and current word-topic matrix of the LDA model to obtain the message vectors of the target elements is next executed, only the target elements of the word-document matrix determined this time are calculated to obtain their message vectors. Since the number of target elements determined in each execution of this step is smaller than the number determined in the previous execution, the amount of computation for the message vectors of the target elements of the word-document matrix keeps decreasing, which in turn keeps decreasing the amount of computation for updating the current document-topic matrix and current word-topic matrix according to the target message vectors, improving efficiency.
105. According to the target elements (e.g., $ObjectE_n$) determined the n-th time in the word-document matrix, execute for the (n+1)-th time the iterative process of the foregoing steps of calculating message vectors, determining and updating, until the message vectors of the screened target elements (e.g., $ObjectE_p$), the current document-topic matrix and the current word-topic matrix reach a convergence state.

Specifically, the (n+1)-th execution calculates, according to the current document-topic matrix and current word-topic matrix of the LDA model, the target elements determined the n-th time in the word-document matrix of the training document, to obtain the message vectors (e.g., $M_{n+1}$) of the target elements determined the n-th time; determines target message vectors (e.g., $ObjectM_{n+1}$) from the message vectors of the target elements determined the n-th time according to their residuals; updates the current document-topic matrix and current word-topic matrix according to the target message vectors determined the (n+1)-th time; and determines, from the word-document matrix, the target elements (e.g., $ObjectE_{n+1}$) corresponding to the target message vectors determined the (n+1)-th time; the iterative process continues until the message vectors of the screened target elements, the current document-topic matrix and the current word-topic matrix reach the convergence state.

It should be noted that when the message vectors, the document-topic matrix and the word-topic matrix reach the convergence state, the message vectors, document-topic matrix and word-topic matrix obtained in the (n+1)-th execution of the iterative process are correspondingly similar to those obtained in the n-th execution; that is, the difference between the message vectors obtained in the (n+1)-th and n-th executions, the difference between the document-topic matrices obtained in the (n+1)-th and n-th executions, and the difference between the word-topic matrices obtained in the (n+1)-th and n-th executions all approach zero. In other words, no matter how many more iterations are executed, the message vectors, the document-topic matrix and the word-topic matrix change little and are stable.
106. Determine the current document-topic matrix that has reached the convergence state and the current word-topic matrix that has reached the convergence state as the parameters of the LDA model, and perform topic mining by using the LDA model with the determined parameters.

In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to the residuals of the message vectors, and then only the target message vectors determined in the current execution of the iterative process are used to update the current document-topic matrix and current word-topic matrix, so that in subsequent executions, according to the current document-topic matrix and current word-topic matrix updated in the previous execution, only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution are calculated. This avoids calculating all non-zero elements of the word-document matrix in every iteration and avoids updating the current document-topic matrix and current word-topic matrix according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves the efficiency of topic mining.
FIG. 2 is a schematic flowchart of a topic mining method according to another embodiment of the present invention. In this embodiment, the word-document matrix is in bag-of-words form. As shown in FIG. 2, this embodiment includes the following steps:
201. Initialize the current document-topic matrix θ^0_{k,d} and the current word-topic matrix φ^0_{k,w} of the LDA model.
Optionally, determine an initial message vector for each nonzero element of the word-document matrix of the training documents, subject to μ^0_{w,d}(k) ≥ 0 and Σ_k μ^0_{w,d}(k) = 1. Each message vector contains K elements, each element corresponding to one topic, and expresses the probabilities that the word indicated by the word-document matrix involves each topic in the corresponding document. For example, the initial message vector μ^0_{w,d}(k) expresses the probability that the element x_{w,d} in row w, column d of the word-document matrix involves the k-th topic. From the initial message vectors, compute the current document-topic matrix θ^0_{k,d} = Σ_w x_{w,d}·μ^0_{w,d}(k), and compute the current word-topic matrix φ^0_{k,w} = Σ_d x_{w,d}·μ^0_{w,d}(k), where k = 1, 2, ..., K, w = 1, 2, ..., W, and d = 1, 2, ..., D. W is the vocabulary length, that is, the number of words in the vocabulary, equal to the number of rows of the word-document matrix; D is the number of training documents; K is the preset number of topics, which may be set by the user before topic mining, and the larger the number of topics, the larger the amount of computation. W, D, and K are all positive integers.
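A minimal Python sketch of this initialization follows, under the constraints μ^0_{w,d}(k) ≥ 0 and Σ_k μ^0_{w,d}(k) = 1 stated above; random draws and the seed are illustrative choices, not fixed by the text.

```python
import numpy as np

def initialize(X, K, seed=0):
    """X: W x D bag-of-words matrix. Returns initial messages, theta (K x D), phi (K x W)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    messages, theta, phi = {}, np.zeros((K, D)), np.zeros((K, W))
    for w, d in zip(*np.nonzero(X)):
        mu = rng.random(K)
        mu /= mu.sum()                       # normalize so the K entries sum to 1
        messages[(w, d)] = mu
        theta[:, d] += X[w, d] * mu          # theta[k, d] = sum_w x_{w,d} * mu(k)
        phi[:, w] += X[w, d] * mu            # phi[k, w]  = sum_d x_{w,d} * mu(k)
    return messages, theta, phi
```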
Further, before step 201, count, for each training document, whether it contains each word of a standard dictionary and how many times that word appears, and generate the bag-of-words word-document matrix from the statistics: each row of the matrix corresponds to one word and each column to one document; each nonzero element value is the number of times the word of that row appears in the document of that column, and a zero element value means the word does not appear in that document.
202. According to the current document-topic matrix and the current word-topic matrix, compute the word-document matrix of the training documents, to obtain the message vectors of the target elements of the word-document matrix.
If step 202 is being executed for the first time, this is the first execution of the iterative process, and the target elements are all nonzero elements of the word-document matrix; otherwise, the target elements are those determined in the previous execution of the iterative process.
Optionally, substitute the current document-topic matrix θ^0_{k,d}, the current word-topic matrix φ^0_{k,w}, and n = 1 into the message-update formula μ^n_{w,d}(k) ∝ (θ^{n−1}_{k,d} + α)·(φ^{n−1}_{k,w} + β) / (Σ_{w′} φ^{n−1}_{k,w′} + W·β), normalized so that Σ_k μ^n_{w,d}(k) = 1, to obtain the message vector μ^1_{w,d}(k) of each nonzero element of the word-document matrix, where n is the execution count of the iterative process (n = 1 for the first execution), μ^1_{w,d}(k) is the message vector of the element x_{w,d} in row w, column d of the word-document matrix on the k-th topic in the first execution, and μ^n_{w,d}(k) is the message vector obtained by computing the element in row w, column d of the word-document matrix on the k-th topic in the n-th execution of the iterative process. α and β are preset coefficients, commonly called the hyperparameters of the LDA model; their values are non-negative, for example {α = 0.01, β = 0.01}.
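A minimal Python sketch of one such message update follows, using the belief-propagation form reconstructed above; the exact expression in the published text is an image, so this form is an assumption consistent with the belief-propagation literature cited by this publication.

```python
import numpy as np

def update_message(theta, phi, w, d, alpha=0.01, beta=0.01):
    """theta: K x D document-topic matrix; phi: K x W word-topic matrix.
    Returns the new length-K message for element (w, d)."""
    K, W = phi.shape
    mu = (theta[:, d] + alpha) * (phi[:, w] + beta) / (phi.sum(axis=1) + W * beta)
    return mu / mu.sum()                     # normalize over the K topics
```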
It should be noted that the iterative process starts from the first execution of step 202, which is recorded as the first execution of the iterative process, with n = 1.
203. Compute the residuals of the message vectors.
Optionally, substitute n = 1 and the message vectors μ^n_{w,d}(k) into the formula r^n_{w,d}(k) = x_{w,d}·|μ^n_{w,d}(k) − μ^{n−1}_{w,d}(k)| to obtain the residual r^1_{w,d}(k) of the message vector μ^1_{w,d}(k), where x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^{n−1}_{w,d}(k) is the message vector obtained by computing the element in row w, column d of the word-document matrix on the k-th topic in the (n−1)-th execution of the iterative process.
204. Determine target residuals from the residuals.
Optionally, substitute the residuals r^n_{w,d}(k) and n = 1 into the formula r^n(w,k) = Σ_d r^n_{w,d}(k) to obtain the cumulative residual matrix, where r^1(w,k) is the element value in row w, column k of the cumulative residual matrix in the first execution of the iterative process. In each row of the cumulative residual matrix, use a quicksort or insertion-sort algorithm to determine the columns k* of the elements ranked within the top proportion λ_k in descending order; accumulate the elements determined in each row to obtain a sum value for each row; use a quicksort or insertion-sort algorithm to determine the rows w* whose sum values are ranked within the top proportion λ_w in descending order; and determine the residuals r^n_{w*,d}(k*) lying on the determined rows and columns as the target residuals. λ_k and λ_w must be preset before topic mining, with 0 < λ_k ≤ 1, 0 < λ_w ≤ 1, and λ_k × λ_w ≠ 1.
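The two-stage screening of step 204 can be sketched in Python as follows; np.argsort is used here in place of the quicksort or insertion sort named above, both being just ways of ranking, and the function name is illustrative.

```python
import numpy as np

def screen_target_rows_cols(R, lam_k, lam_w):
    """R: cumulative residual matrix (one row per word, one column per topic).
    Returns the selected rows and, for each selected row, its selected columns."""
    W, K = R.shape
    n_cols = max(1, int(K * lam_k))
    n_rows = max(1, int(W * lam_w))
    cols = np.argsort(-R, axis=1)[:, :n_cols]                   # top-lam_k columns per row
    row_sums = np.take_along_axis(R, cols, axis=1).sum(axis=1)  # per-row sum of kept entries
    rows = np.argsort(-row_sums)[:n_rows]                       # top-lam_w rows by that sum
    return rows, cols[rows]
```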
Alternatively and optionally, compute the cumulative residual matrix r^n(d,k) = Σ_w r^n_{w,d}(k) from the residuals r^n_{w,d}(k), where r^n(d,k) is the element value in row d, column k of the cumulative residual matrix in the n-th execution of the iterative process. In each row of the cumulative residual matrix, use a quicksort or insertion-sort algorithm to determine the columns k* of the target elements ranked within the top proportion λ_k in descending order; accumulate the target elements determined in each row to obtain a sum value for each row; use a quicksort or insertion-sort algorithm to determine the rows d* whose sum values are ranked within the top proportion λ_w in descending order; and determine the residuals r^n_{w,d*}(k*) that satisfy both selections as the target residuals. λ_k and λ_w must be preset before topic mining, with 0 < λ_k ≤ 1, 0 < λ_w ≤ 1, and λ_k × λ_w ≠ 1.
205. Determine the message vectors corresponding to the target residuals as the target message vectors.
Optionally, based on the correspondence between the target residuals r^n_{w,d}(k) and the message vectors μ^n_{w,d}(k), substitute n = 1 and determine the message vector μ^1_{w,d}(k) corresponding to each target residual r^1_{w,d}(k); these message vectors are the target message vectors.
206. Re-determine, from the word-document matrix, the target elements corresponding to the target message vectors.
Optionally, based on the correspondence between the target message vectors μ^n_{w,d}(k) and the elements x_{w,d} of the word-document matrix, determine, among the target elements of the current execution of step 202, the target elements x_{w,d} of the word-document matrix that correspond to the target message vectors μ^n_{w,d}(k).
207. Update the current document-topic matrix and the current word-topic matrix according to the target message vectors.
Optionally, substitute the target message vectors μ^n_{w,d}(k) into the formula θ^n_{k,d} = Σ_w x_{w,d}·μ^n_{w,d}(k) and use θ^n_{k,d} to update the current document-topic matrix; substitute the target message vectors μ^n_{w,d}(k) into the formula φ^n_{k,w} = Σ_d x_{w,d}·μ^n_{w,d}(k) and use φ^n_{k,w} to update the current word-topic matrix.
It should be noted that steps 202 to 207 constitute one complete iteration; once step 207 finishes, that iteration of the iterative process is complete.
208. Judge whether the message vectors of the screened target elements, the current document-topic matrix, and the current word-topic matrix have reached the convergence state; if so, execute step 209; otherwise, repeat steps 202 to 207.
Optionally, substitute the residuals r^n_{w,d}(k) into the formula r^n(k) = Σ_{w,d} r^n_{w,d}(k) and judge whether r^n(k) divided by W approaches zero. If so, it is determined that the message vectors of the screened target elements, the current document-topic matrix, and the current word-topic matrix have converged to a stable state; otherwise, it is determined that they have not reached the convergence state.
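Step 208 can be sketched as follows in Python, reading r^n(k) as the per-topic sum of the current residuals; the threshold eps is an illustrative stand-in for "approaches zero".

```python
import numpy as np

def step_208_converged(residuals, K, W, eps=1e-4):
    """residuals: dict {(w, d): length-K residual vector} from the current iteration."""
    r_n = np.zeros(K)
    for r in residuals.values():
        r_n += r                             # r_n(k) = sum over elements of r_{w,d}(k)
    return bool(np.all(r_n / W < eps))       # converged when r_n(k)/W is near zero
```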
209. Determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and use the LDA model with the determined parameters to perform topic mining on documents to be tested.
In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and the current document-topic matrix and the current word-topic matrix are updated only according to the target message vectors determined in the current execution, so that a subsequent execution computes only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution, according to the matrices updated in the previous execution. This avoids computing all nonzero elements of the word-document matrix in every iteration and avoids updating the matrices according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves topic mining efficiency. In addition, when searching the residuals for the target residuals ranked within the preset top proportion in descending order, this embodiment determines, in each row of the cumulative residual matrix computed from the residuals, the columns of the elements ranked within the preset top proportion by quicksort or insertion sort, accumulates the determined elements of each row into a per-row sum, determines the rows whose sums rank within the preset top proportion likewise by quicksort or insertion sort, and takes the elements lying on the determined rows and columns as the target residuals, which speeds up the search for target residuals and further improves topic mining efficiency.
FIG. 3 is a schematic structural diagram of a topic mining apparatus according to an embodiment of the present invention. As shown in FIG. 3, the apparatus includes: a message vector computation module 31, a first screening module 32, a second screening module 33, an update module 34, an execution module 35, and a topic mining module 36.
The message vector computation module 31 is configured to compute the nonzero elements of the word-document matrix of the training documents according to the current document-topic matrix and the current word-topic matrix of the latent Dirichlet allocation (LDA) model, to obtain message vectors of the nonzero elements.
The first screening module 32, connected to the message vector computation module 31, is configured to determine target message vectors from the message vectors of the nonzero elements according to the residuals of those message vectors. The target message vectors are the message vectors ranked within a preset top proportion when the residuals are sorted in descending order; the preset proportion is less than 1 and greater than 0, and a residual indicates the degree of convergence of a message vector.
The second screening module 33, connected to the first screening module 32, is configured to determine, from the nonzero elements of the word-document matrix, the target elements corresponding to the target message vectors.
The update module 34, connected to the first screening module 32, is configured to update the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors.
The execution module 35, connected to the message vector computation module 31 and the update module 34, is configured to execute, for the (n+1)-th time, the iterative process of computing, according to the current document-topic matrix and the current word-topic matrix of the LDA model, the target elements of the word-document matrix determined in the n-th execution to obtain their message vectors; determining target message vectors from the message vectors of the target elements determined in the n-th execution, according to their residuals; updating the current document-topic matrix and the current word-topic matrix according to the target message vectors determined in the (n+1)-th execution; and determining, from the word-document matrix, the target elements corresponding to the target message vectors determined in the (n+1)-th execution; until the message vectors of the screened target elements, the current document-topic matrix, and the current word-topic matrix reach the convergence state.
The topic mining module 36, connected to the execution module 35, is configured to determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and use the LDA model with the determined parameters to perform topic mining on documents to be tested.
In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and the current document-topic matrix and the current word-topic matrix are updated only according to the target message vectors determined in the current execution, so that a subsequent execution computes only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution, according to the matrices updated in the previous execution. This avoids computing all nonzero elements of the word-document matrix in every iteration and avoids updating the matrices according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves topic mining efficiency.
FIG. 4 is a schematic structural diagram of a topic mining apparatus according to another embodiment of the present invention. As shown in FIG. 4, on the basis of the previous embodiment, the first screening module 32 in this embodiment further includes: a computation unit 321, a query unit 322, and a screening unit 323.
The computation unit 321 is configured to compute the residuals of the message vectors of the nonzero elements.
Optionally, the computation unit 321 is specifically configured to compute the residual r^n_{w,d}(k) = x_{w,d}·|μ^n_{w,d}(k) − μ^{n−1}_{w,d}(k)| of the message vector μ^n_{w,d}(k), where k = 1, 2, ..., K, K is the preset number of topics, μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^{n−1}_{w,d}(k) is the k-th element value of the message vector obtained by computing that element in the (n−1)-th execution of the iterative process.
The query unit 322, connected to the computation unit 321, is configured to search the computed residuals, in descending order, for the target residuals ranked within the preset top proportion (λ_k × λ_w), where (λ_k × λ_w) is less than 1 and greater than 0, and the preset proportion is determined according to the topic mining efficiency and the accuracy of the topic mining result.
Optionally, the query unit 322 is specifically configured to: compute the cumulative residual matrix r^n(w,k) = Σ_d r^n_{w,d}(k) from the residuals r^n_{w,d}(k), where r^n_{w,d}(k) is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, and r^n(w,k) is the element value in row w, column k of the cumulative residual matrix in the n-th execution; determine, in each row of the cumulative residual matrix, the columns k* of the target elements ranked within the top proportion λ_k in descending order, where λ_k is less than or equal to 1 and greater than 0; accumulate the target elements determined in each row to obtain a sum value for each row; determine the rows w* whose sum values are ranked within the top proportion λ_w in descending order, where λ_w is less than or equal to 1 and greater than 0; and determine the residuals r^n_{w*,d}(k*) that satisfy both selections as the target residuals.
The screening unit 323, connected to the query unit 322, is configured to determine, from the message vectors of the nonzero elements, the target message vectors corresponding to the target residuals.
Optionally, the screening unit 323 is specifically configured to determine, from the message vectors of the nonzero elements, the message vector μ^n_{w,d}(k) corresponding to the target residual r^n_{w,d}(k).
Further, the update module 34 includes a first update unit 341 and a second update unit 342.
The first update unit 341 is configured to compute, according to the formula θ^n_{k,d} = Σ_w x_{w,d}·μ^n_{w,d}(k), the updated element value θ^n_{k,d} in row k, column d of the current document-topic matrix of the LDA model, and use θ^n_{k,d} to update the element value in row k, column d of the current document-topic matrix of the LDA model, where k = 1, 2, ..., K, K is the preset number of topics, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing x_{w,d} in the n-th execution of the iterative process.
The second update unit 342 is configured to compute, according to the formula φ^n_{k,w} = Σ_d x_{w,d}·μ^n_{w,d}(k), the updated element value φ^n_{k,w} in row k, column w of the current word-topic matrix of the LDA model, and use φ^n_{k,w} to update the element value in row k, column w of the current word-topic matrix of the LDA model.
Further, the topic mining apparatus also includes a determining module 41, a first obtaining module 42, and a second obtaining module 43.
The determining module 41 is configured to determine the initial message vector μ^0_{w,d}(k) of each nonzero element of the word-document matrix, where k = 1, 2, ..., K, K is the preset number of topics, μ^0_{w,d}(k) ≥ 0, Σ_k μ^0_{w,d}(k) = 1, and μ^0_{w,d}(k) is the k-th element of the initial message vector of the nonzero element x_{w,d} in row w, column d of the word-document matrix.
The first obtaining module 42, connected to the determining module 41 and the message vector computation module 31, is configured to compute the current document-topic matrix θ^0_{k,d} = Σ_w x_{w,d}·μ^0_{w,d}(k).
The second obtaining module 43, connected to the determining module 41 and the message vector computation module 31, is configured to compute the current word-topic matrix φ^0_{k,w} = Σ_d x_{w,d}·μ^0_{w,d}(k).
Further, the message vector computation module 31 is specifically configured to compute, in the n-th execution of the iterative process, the k-th element value μ^n_{w,d}(k) of the message vector of the element x_{w,d} in row w, column d of the word-document matrix according to the formula μ^n_{w,d}(k) ∝ (θ^{n−1}_{k,d} + α)·(φ^{n−1}_{k,w} + β) / (Σ_{w′} φ^{n−1}_{k,w′} + W·β), where k = 1, 2, ..., K, K is the preset number of topics, w = 1, 2, ..., W, W is the vocabulary length, d = 1, 2, ..., D, D is the number of training documents, θ^{n−1}_{k,d} is the element value in row k, column d of the current document-topic matrix, φ^{n−1}_{k,w} is the element value in row k, column w of the current word-topic matrix, and α and β are preset coefficients whose values are positive.
The functional modules of the topic mining apparatus provided in this embodiment may be used to execute the topic mining method procedures shown in FIG. 1 and FIG. 2; their specific working principles are not repeated here, and reference may be made to the description of the method embodiments.
In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and the current document-topic matrix and the current word-topic matrix are updated only according to the target message vectors determined in the current execution, so that a subsequent execution computes only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution, according to the matrices updated in the previous execution. This avoids computing all nonzero elements of the word-document matrix in every iteration and avoids updating the matrices according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves topic mining efficiency. In addition, when searching the residuals for the target residuals ranked within the preset top proportion in descending order, this embodiment determines, in each row of the cumulative residual matrix computed from the residuals, the columns of the elements ranked within the preset top proportion by quicksort or insertion sort, accumulates the determined elements of each row into a per-row sum, determines the rows whose sums rank within the preset top proportion likewise by quicksort or insertion sort, and takes the elements lying on the determined rows and columns as the target residuals, which speeds up the search for target residuals and further improves topic mining efficiency.
FIG. 5 is a schematic structural diagram of a topic mining apparatus according to yet another embodiment of the present invention. As shown in FIG. 5, the apparatus in this embodiment may include a memory 51, a communication interface 52, and a processor 53.
The memory 51 is configured to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory 51 may include a high-speed RAM memory, and may further include a non-volatile memory, for example, at least one disk memory.
The communication interface 52 is configured to obtain the word-document matrix of the training documents.
The processor 53 is configured to execute the program stored in the memory 51, so as to: compute the nonzero elements of the word-document matrix of the training documents according to the current document-topic matrix and the current word-topic matrix of the latent Dirichlet allocation LDA model, to obtain message vectors of the nonzero elements; determine target message vectors from the message vectors of the nonzero elements according to their residuals, where the target message vectors are the message vectors ranked within a preset top proportion when the residuals are sorted in descending order, and the preset proportion is less than 1 and greater than 0; update the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors; determine, from the nonzero elements of the word-document matrix, the target elements corresponding to the target message vectors; repeatedly execute the iterative process of computing, according to the current document-topic matrix and the current word-topic matrix of the LDA model, the previously determined target elements of the word-document matrix to obtain their message vectors, determining target message vectors from the message vectors of the previously determined target elements according to their residuals, updating the current document-topic matrix and the current word-topic matrix according to the target message vectors determined this time, and determining, from the word-document matrix, the target elements corresponding to the target message vectors determined this time, until the message vectors of the screened target elements, the current document-topic matrix, and the current word-topic matrix reach the convergence state; and determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and use the LDA model with the determined parameters to perform topic mining on documents to be tested.
The functional modules of the topic mining apparatus provided in this embodiment may be used to execute the topic mining method procedures shown in FIG. 1 and FIG. 2; their specific working principles are not repeated here, and reference may be made to the description of the method embodiments.
An embodiment of the present invention further provides an application scenario of the topic mining apparatus:
When performing semantics-based information processing such as network public opinion analysis or personalized information push, topic mining must first be performed on the documents to be tested in the network, to obtain the topic of each document, that is, the central idea the author expresses through the document. Only then can analysis be performed based on the topics of the documents to be tested, with the analysis results used for personalized information push, network public opinion early warning, and the like.
As a possible application scenario of topic mining, before network public opinion analysis is performed, topic mining needs to be performed on the documents to be tested, consisting of microblog posts, the text content of web pages, and the like, to obtain the topic of each document. Specifically, FIG. 6 is an architecture diagram of the topic mining apparatus applied to network public opinion analysis. Documents to be tested may be acquired from a content server; training documents are then selected from the documents to be tested, or alternatively, documents involving various different topics and distinct from the documents to be tested are selected as training documents. The more topics the training documents cover, the higher the topic mining accuracy. The training documents are then processed with the topic mining method provided in the foregoing embodiments to determine the parameters of the LDA model. After the parameters of the LDA model are determined, the LDA model may be used to perform topic mining on the documents to be tested, consisting of microblog posts, the text content of web pages, and the like, and the obtained topics of the documents to be tested are sent to a network public opinion analysis server for network public opinion analysis.
In this embodiment, each time the iterative process is executed, target message vectors are determined from the message vectors according to their residuals, and the current document-topic matrix and the current word-topic matrix are updated only according to the target message vectors determined in the current execution, so that a subsequent execution computes only the target elements of the word-document matrix corresponding to the target message vectors determined in the previous execution, according to the matrices updated in the previous execution. This avoids computing all nonzero elements of the word-document matrix in every iteration and avoids updating the matrices according to all message vectors, which greatly reduces the amount of computation, speeds up topic mining, and improves topic mining efficiency.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing method embodiments may be implemented by program-instruction-related hardware. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions recorded in the foregoing embodiments, or equivalent replacements may be made to some or all of the technical features therein; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (16)

  1. A topic mining method, comprising:
    computing nonzero elements of a word-document matrix of training documents according to a current document-topic matrix and a current word-topic matrix of a latent Dirichlet allocation (LDA) model, to obtain message vectors M_n of the nonzero elements;
    determining target message vectors ObjectM_n from the message vectors M_n of the nonzero elements according to residuals of the message vectors, wherein the target message vectors are the message vectors ranked within a preset top proportion when the residuals are sorted in descending order, and the preset proportion is less than 1 and greater than 0;
    updating the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n;
    determining, from the nonzero elements of the word-document matrix, target elements ObjectE_n corresponding to the target message vectors ObjectM_n;
    executing, for the (n+1)-th time, an iterative process of computing, according to the current document-topic matrix and the current word-topic matrix of the LDA model, the target elements ObjectE_n of the word-document matrix determined in the n-th execution, to obtain message vectors M_{n+1} of the target elements ObjectE_n determined in the n-th execution; determining target message vectors ObjectM_{n+1} from the message vectors M_{n+1} according to the residuals of the message vectors of the target elements determined in the n-th execution; updating the current document-topic matrix and the current word-topic matrix according to the target message vectors ObjectM_{n+1} determined in the (n+1)-th execution; and determining, from the word-document matrix, target elements ObjectE_{n+1} corresponding to the target message vectors ObjectM_{n+1} determined in the (n+1)-th execution; until the message vectors of the screened target elements ObjectE_p, the current document-topic matrix, and the current word-topic matrix reach a convergence state; and
    determining the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and performing topic mining on documents to be tested by using the LDA model with the determined parameters.
  2. The topic mining method according to claim 1, wherein the determining target message vectors ObjectM_n from the message vectors M_n of the nonzero elements according to the residuals of the message vectors comprises:
    computing the residuals of the message vectors of the nonzero elements;
    searching the computed residuals, in descending order, for target residuals ranked within the preset top proportion, wherein the preset proportion is determined according to topic mining efficiency and the accuracy of the topic mining result; and
    determining, from the message vectors M_n of the nonzero elements, the target message vectors ObjectM_n corresponding to the target residuals.
  3. The topic mining method according to claim 2, wherein the computing the residuals of the message vectors of the nonzero elements comprises:
    computing the residuals of the message vectors of the nonzero elements according to the formula r^n_{w,d}(k) = x_{w,d}·|μ^n_{w,d}(k) − μ^{n−1}_{w,d}(k)|, wherein r^n_{w,d}(k) is the residual of the message vector of a nonzero element, k = 1, 2, ..., K, K is the preset number of topics, μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^{n−1}_{w,d}(k) is the k-th element value of the message vector obtained by computing the element in row w, column d of the word-document matrix in the (n−1)-th execution of the iterative process.
  4. The topic mining method according to claim 2 or 3, wherein the searching the computed residuals, in descending order, for the target residuals ranked within the preset top proportion comprises:
    computing a cumulative residual matrix from the residuals r^n_{w,d}(k) according to the formula r^n(w,k) = Σ_d r^n_{w,d}(k), wherein r^n_{w,d}(k) is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, and r^n(w,k) is the element value in row w, column k of the cumulative residual matrix in the n-th execution of the iterative process;
    determining, in each row of the cumulative residual matrix, the columns k* of the elements ranked within the top proportion λ_k in descending order, wherein 0 < λ_k ≤ 1;
    accumulating the elements determined in each row to obtain a sum value corresponding to each row;
    determining the rows w* whose sum values are ranked within the top proportion λ_w in descending order, wherein 0 < λ_w ≤ 1 and λ_k × λ_w ≠ 1; and
    determining the residuals r^n_{w*,d}(k*) that lie on the determined rows and columns as the target residuals.
  5. The topic mining method according to claim 4, wherein the determining, from the message vectors M_n of the nonzero elements, the target message vectors ObjectM_n corresponding to the target residuals comprises:
    determining, from the message vectors M_n of the nonzero elements, the target message vectors ObjectM_n, namely the message vectors μ^n_{w*,d}(k*) corresponding to the target residuals r^n_{w*,d}(k*).
  6. The topic mining method according to any one of claims 3 to 5, wherein the updating the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n comprises:
    computing, according to the formula θ^n_{k,d} = Σ_w x_{w,d}·μ^n_{w,d}(k), the updated element value θ^n_{k,d} in row k, column d of the current document-topic matrix of the LDA model, and using θ^n_{k,d} to update the element value in row k, column d of the current document-topic matrix of the LDA model, wherein k = 1, 2, ..., K, K is the preset number of topics, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing x_{w,d} in the n-th execution of the iterative process; and
    computing, according to the formula φ^n_{k,w} = Σ_d x_{w,d}·μ^n_{w,d}(k), the updated element value φ^n_{k,w} in row k, column w of the current word-topic matrix of the LDA model, and using φ^n_{k,w} to update the element value in row k, column w of the current word-topic matrix of the LDA model.
  7. The topic mining method according to any one of claims 1 to 5, wherein the computing the nonzero elements of the word-document matrix of the training documents according to the current document-topic matrix and the current word-topic matrix of the latent Dirichlet allocation LDA model, to obtain the message vectors M_n of the nonzero elements, comprises:
    computing, in the n-th execution of the iterative process, the k-th element value μ^n_{w,d}(k) of the message vector of the element x_{w,d} in row w, column d of the word-document matrix according to the formula μ^n_{w,d}(k) ∝ (θ^{n−1}_{k,d} + α)·(φ^{n−1}_{k,w} + β) / (Σ_{w′} φ^{n−1}_{k,w′} + W·β), wherein k = 1, 2, ..., K, K is the preset number of topics, w = 1, 2, ..., W, W is the vocabulary length, d = 1, 2, ..., D, D is the number of training documents, θ^{n−1}_{k,d} is the element value in row k, column d of the current document-topic matrix, φ^{n−1}_{k,w} is the element value in row k, column w of the current word-topic matrix, and α and β are preset coefficients whose values are positive.
  8. The topic mining method according to any one of claims 1 to 5, wherein before the computing the nonzero elements of the word-document matrix of the training documents according to the current document-topic matrix and the current word-topic matrix of the latent Dirichlet allocation LDA model, to obtain the message vectors M_n of the nonzero elements, the method further comprises:
    determining an initial message vector μ^0_{w,d}(k) for each nonzero element of the word-document matrix, wherein k = 1, 2, ..., K, K is the preset number of topics, μ^0_{w,d}(k) ≥ 0, Σ_k μ^0_{w,d}(k) = 1, and μ^0_{w,d}(k) is the k-th element of the initial message vector of the nonzero element x_{w,d} in row w, column d of the word-document matrix;
    computing the current document-topic matrix according to the formula θ^0_{k,d} = Σ_w x_{w,d}·μ^0_{w,d}(k), wherein μ^0_{w,d}(k) is the initial message vector and θ^0_{k,d} is the element value in row k, column d of the current document-topic matrix; and
    computing the current word-topic matrix according to the formula φ^0_{k,w} = Σ_d x_{w,d}·μ^0_{w,d}(k), wherein μ^0_{w,d}(k) is the initial message vector and φ^0_{k,w} is the element value in row k, column w of the current word-topic matrix.
  9. A topic mining apparatus, comprising:
    a message vector computation module, configured to compute nonzero elements of a word-document matrix of training documents according to a current document-topic matrix and a current word-topic matrix of a latent Dirichlet allocation (LDA) model, to obtain message vectors M_n of the nonzero elements;
    a first screening module, configured to determine target message vectors ObjectM_n from the message vectors M_n of the nonzero elements according to residuals of the message vectors, wherein the target message vectors are the message vectors ranked within a preset top proportion when the residuals are sorted in descending order, and the preset proportion is less than 1 and greater than 0;
    an update module, configured to update the current document-topic matrix and the current word-topic matrix of the LDA model according to the target message vectors ObjectM_n;
    a second screening module, configured to determine, from the nonzero elements of the word-document matrix, target elements ObjectE_n corresponding to the target message vectors ObjectM_n;
    an execution module, configured to execute, for the (n+1)-th time, an iterative process of computing, according to the current document-topic matrix and the current word-topic matrix of the LDA model, the target elements ObjectE_n of the word-document matrix determined in the n-th execution, to obtain message vectors M_{n+1} of the target elements ObjectE_n determined in the n-th execution; determining target message vectors ObjectM_{n+1} from the message vectors M_{n+1} according to the residuals of the message vectors of the target elements determined in the n-th execution; updating the current document-topic matrix and the current word-topic matrix according to the target message vectors ObjectM_{n+1} determined in the (n+1)-th execution; and determining, from the word-document matrix, target elements ObjectE_{n+1} corresponding to the target message vectors ObjectM_{n+1} determined in the (n+1)-th execution; until the message vectors of the screened target elements ObjectE_p, the current document-topic matrix, and the current word-topic matrix reach a convergence state; and
    a topic mining module, configured to determine the converged current document-topic matrix and the converged current word-topic matrix as parameters of the LDA model, and perform topic mining on documents to be tested by using the LDA model with the determined parameters.
  10. The topic mining apparatus according to claim 9, wherein the first screening module comprises:
    a computation unit, configured to compute the residuals of the message vectors of the nonzero elements;
    a query unit, configured to search the computed residuals, in descending order, for target residuals ranked within the preset top proportion, wherein the preset proportion is determined according to topic mining efficiency and the accuracy of the topic mining result; and
    a screening unit, configured to determine, from the message vectors M_n of the nonzero elements, the target message vectors ObjectM_n corresponding to the target residuals.
  11. The topic mining apparatus according to claim 10, wherein:
    the computation unit is specifically configured to compute the residuals of the message vectors of the nonzero elements according to the formula r^n_{w,d}(k) = x_{w,d}·|μ^n_{w,d}(k) − μ^{n−1}_{w,d}(k)|, wherein r^n_{w,d}(k) is the residual of the message vector of a nonzero element, k = 1, 2, ..., K, K is the preset number of topics, μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^{n−1}_{w,d}(k) is the k-th element value of the message vector obtained by computing the element in row w, column d of the word-document matrix in the (n−1)-th execution of the iterative process.
  12. The topic mining apparatus according to claim 10 or 11, wherein:
    the query unit is specifically configured to: compute a cumulative residual matrix from the residuals r^n_{w,d}(k) according to the formula r^n(w,k) = Σ_d r^n_{w,d}(k), wherein r^n_{w,d}(k) is the k-th element value of the residual of the message vector of the element in row w, column d of the word-document matrix in the n-th execution of the iterative process, and r^n(w,k) is the element value in row w, column k of the cumulative residual matrix in the n-th execution of the iterative process; determine, in each row of the cumulative residual matrix, the columns k* of the elements ranked within the top proportion λ_k in descending order, wherein 0 < λ_k ≤ 1; accumulate the elements determined in each row to obtain a sum value corresponding to each row; determine the rows w* whose sum values are ranked within the top proportion λ_w in descending order, wherein 0 < λ_w ≤ 1 and λ_k × λ_w ≠ 1; and determine the residuals r^n_{w*,d}(k*) that lie on the determined rows and columns as the target residuals.
  13. The topic mining apparatus according to claim 12, wherein:
    the screening unit is specifically configured to determine, from the message vectors M_n of the nonzero elements, the target message vectors ObjectM_n, namely the message vectors μ^n_{w*,d}(k*) corresponding to the target residuals r^n_{w*,d}(k*).
  14. The topic mining apparatus according to any one of claims 11 to 13, wherein the update module comprises:
    a first update unit, configured to compute, according to the formula θ^n_{k,d} = Σ_w x_{w,d}·μ^n_{w,d}(k), the updated element value θ^n_{k,d} in row k, column d of the current document-topic matrix of the LDA model, and use θ^n_{k,d} to update the element value in row k, column d of the current document-topic matrix of the LDA model, wherein k = 1, 2, ..., K, K is the preset number of topics, x_{w,d} is the element value in row w, column d of the word-document matrix, and μ^n_{w,d}(k) is the k-th element value of the message vector obtained by computing x_{w,d} in the n-th execution of the iterative process; and
    a second update unit, configured to compute, according to the formula φ^n_{k,w} = Σ_d x_{w,d}·μ^n_{w,d}(k), the updated element value φ^n_{k,w} in row k, column w of the current word-topic matrix of the LDA model, and use φ^n_{k,w} to update the element value in row k, column w of the current word-topic matrix of the LDA model.
  15. The topic mining apparatus according to any one of claims 9 to 13, wherein:
    the message vector computation module is specifically configured to compute, in the n-th execution of the iterative process, the k-th element value μ^n_{w,d}(k) of the message vector of the element x_{w,d} in row w, column d of the word-document matrix according to the formula μ^n_{w,d}(k) ∝ (θ^{n−1}_{k,d} + α)·(φ^{n−1}_{k,w} + β) / (Σ_{w′} φ^{n−1}_{k,w′} + W·β), wherein k = 1, 2, ..., K, K is the preset number of topics, w = 1, 2, ..., W, W is the vocabulary length, d = 1, 2, ..., D, D is the number of training documents, θ^{n−1}_{k,d} is the element value in row k, column d of the current document-topic matrix, φ^{n−1}_{k,w} is the element value in row k, column w of the current word-topic matrix, and α and β are preset coefficients whose values are positive.
  16. The topic mining apparatus according to any one of claims 9 to 13, wherein the apparatus further comprises:
    a determining module, configured to determine an initial message vector μ^0_{w,d}(k) for each nonzero element of the word-document matrix, wherein k = 1, 2, ..., K, K is the preset number of topics, μ^0_{w,d}(k) ≥ 0, Σ_k μ^0_{w,d}(k) = 1, and μ^0_{w,d}(k) is the k-th element of the initial message vector of the nonzero element x_{w,d} in row w, column d of the word-document matrix;
    a first obtaining module, configured to compute the current document-topic matrix according to the formula θ^0_{k,d} = Σ_w x_{w,d}·μ^0_{w,d}(k), wherein μ^0_{w,d}(k) is the initial message vector and θ^0_{k,d} is the element value in row k, column d of the current document-topic matrix; and
    a second obtaining module, configured to compute the current word-topic matrix according to the formula φ^0_{k,w} = Σ_d x_{w,d}·μ^0_{w,d}(k), wherein μ^0_{w,d}(k) is the initial message vector and φ^0_{k,w} is the element value in row k, column w of the current word-topic matrix.