CN106776547B - Document theme generation method and device - Google Patents

Document theme generation method and device

Info

Publication number
CN106776547B
Authority
CN
China
Prior art keywords
word
document
theme
relation data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611089622.1A
Other languages
Chinese (zh)
Other versions
CN106776547A (en)
Inventor
董从娇
龚珊珊
滕一勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ADVANCED DIGITAL TECHNOLOGY Co Ltd
Original Assignee
BEIJING ADVANCED DIGITAL TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING ADVANCED DIGITAL TECHNOLOGY Co Ltd
Priority to CN201611089622.1A
Publication of CN106776547A
Application granted
Publication of CN106776547B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a document theme generation method and device. The method comprises the following steps: segmenting the documents of a document set and extracting words; counting inter-word relation data representing the semantic relevance between every two extracted words; counting word document relation data representing the importance of each word in each document; iteratively updating the document theme relation data, the word theme relation data and an adjustment factor until a set end condition is reached; and generating the document themes of the document set from the word theme relation data obtained by the iterative updating. Because the finally generated word theme relation data is jointly constrained by the word document relation data and the inter-word relation data, the semantic relations between words are taken into account in the process of generating the document themes, which improves the accuracy of the generated document themes.

Description

Document theme generation method and device
Technical Field
The invention relates to the field of text analysis, in particular to a document theme generation method and device.
Background
In the field of text analysis, topic model techniques are often used to quickly grasp the key content described by a document. Given a set of documents, the probability of each word appearing in each document can be obtained by segmenting the documents and calculating the word frequency of each word in each document. A topic model learns the probability of each word appearing in each topic and the probability of each topic appearing in each document by training on the probability of each word appearing in each document.
In the traditional topic model building process, a large number of meaningless topics are generated because the semantic relevance between words is ignored: many documents contain the same words, but different combinations of those words express different meanings.
Disclosure of Invention
In view of the above, the present invention has been made to provide a document theme generation method and apparatus that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided a document theme generation method, including:
segmenting the documents of the document set and extracting words;
counting the inter-word relation data representing the semantic correlation between every two extracted words;
counting word document relation data representing the importance of each word in each document;
randomly generating document theme relation data representing the relevance of each document to each preset theme and word theme relation data representing the relevance of each word to each theme;
generating an adjustment factor according to the word theme relation data and the inter-word relation data;
iteratively updating the document theme relation data, the word theme relation data and the adjustment factor to reach a set end condition according to the relation among the document theme relation data, the word theme relation data and the word document relation data and the relation among the word theme relation data, the adjustment factor and the word relation data, so that the target probability of simultaneously generating the document theme relation data, the word theme relation data and the adjustment factor under the condition of determining the word document relation data and the word relation data reaches a set requirement;
and generating the document theme of the document set by using the word theme relation data obtained by iterative updating.
Preferably, the iteratively updating the document theme relationship data, the term theme relationship data and the adjustment factor to reach a set end condition, so that the target probability of simultaneously generating the document theme relationship data, the term theme relationship data and the adjustment factor under the condition of determining the term document relationship data and the interword relationship data reaches a set requirement includes:
in the (N + 1) th iteration, generating a first adjusting value of the word theme relation data in the current iteration according to the latest document theme relation data, the word theme relation data and an adjusting factor, and updating the word theme relation data according to the first adjusting value and a set learning rate constant;
in the (N + 1) th iteration, generating a second adjustment value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data, and updating the document theme relation data according to the second adjustment value and a set learning rate constant;
in the (N + 1) th iteration, generating a third adjusting value of the adjusting factor in the current iteration according to the latest word theme relation data and the adjusting factor, and updating the adjusting factor according to the third adjusting value and a set learning rate constant;
and ending the iterative updating until the set ending condition is reached, so that the target probability reaches the set requirement.
Preferably, the segmenting the documents of the document set and extracting the words includes:
performing word segmentation on the documents of the document set;
the remaining words excluding the set unnecessary words are extracted.
Preferably, the set unnecessary words include set stop words and recognized words having no practical meaning.
Preferably, the inter-word relationship data statistically characterizing semantic relevance between every two of all extracted words includes:
converting all extracted words into word vectors according to semantic relevance;
and carrying out similarity calculation on the word vectors corresponding to all the extracted words, and obtaining the inter-word relation data.
Preferably, the term document relationship data that statistically characterizes the importance of each term in each document includes:
calculating the frequency of occurrence of each term in each document and the logarithm of the quotient of the total number of documents divided by the number of documents containing the term;
multiplying the occurrence frequency and the corresponding logarithm for each word to obtain word document relation data which characterizes the importance of each word in each document.
Preferably, before the randomly generating document theme relationship data representing the relevance of each document to each set theme and word theme relationship data representing the relevance of each word to the theme, the method further comprises:
and carrying out normalization processing on the word document relation data to obtain the word document relation data after the normalization processing.
According to another aspect of the present invention, there is provided a document theme generation apparatus, including:
the document word segmentation module is used for segmenting the documents of the document set and extracting words;
the inter-word relation data statistics module is used for counting the inter-word relation data representing the semantic relevance between every two extracted words;
the word document relation data statistics module is used for counting word document relation data representing the importance of each word in each document;
the data random generation module is used for randomly generating document theme relation data representing the relevance of each document to each preset theme and word theme relation data representing the relevance of each word to each theme;
the adjusting factor generating module is used for generating adjusting factors according to the word theme relation data and the inter-word relation data;
the iterative updating module is used for iteratively updating the document theme relation data, the word theme relation data and the adjusting factor to reach a set end condition according to the relation among the document theme relation data, the word theme relation data and the word document relation data and the relation among the word theme relation data, the adjusting factor and the word relation data, so that the target probability of simultaneously generating the document theme relation data, the word theme relation data and the adjusting factor under the condition of determining the word document relation data and the word relation data reaches a set requirement;
and the document theme generating module is used for generating the document theme of the document set according to the word theme relation data obtained by iterative updating.
Preferably, the iterative update module comprises:
the word theme relation data updating submodule is used for generating a first adjusting value of the word theme relation data in the iteration according to the latest document theme relation data, the word theme relation data and the adjusting factor in the (N + 1) th iteration and updating the word theme relation data according to the first adjusting value and a set learning rate constant;
the document theme relation data updating submodule is used for generating a second adjusting value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data in the N +1 th iteration and updating the document theme relation data according to the second adjusting value and a set learning rate constant;
the adjustment updating submodule is used for generating a third adjustment value of the adjustment factor in the current iteration according to the latest word theme relation data and the adjustment factor in the (N + 1) th iteration, and updating the adjustment factor according to the third adjustment value and a set learning rate constant;
and the iteration ending submodule is used for ending the iteration updating until the set ending condition is reached, so that the target probability reaches the set requirement.
Preferably, the document word segmentation module comprises:
the document word segmentation sub-module is used for segmenting the documents of the document set;
and the word extraction submodule is used for extracting the residual words excluding the set unnecessary words.
Preferably, the set unnecessary words include set stop words and recognized words having no practical meaning.
Preferably, the interword relationship data statistics module comprises:
the word vector conversion submodule is used for converting all the extracted words into word vectors according to the semantic correlation;
and the similarity calculation operator module is used for calculating the similarity between every two word vectors corresponding to all the extracted words to obtain the inter-word relation data.
Preferably, the word document relationship data statistics module comprises:
the logarithm calculation submodule is used for calculating the occurrence frequency of each word in each document and the logarithm of the quotient of the total document number divided by the number of the documents containing the word;
and the word document relation data calculation submodule is used for multiplying the occurrence frequency and the corresponding logarithm for each word to obtain word document relation data representing the importance of each word in each document.
Preferably, the apparatus further comprises:
and the word document relation data normalization module is used for performing normalization processing on the word document relation data before randomly generating document theme relation data representing the relevance of each document to each set theme and word theme relation data representing the relevance of each word to the theme to obtain the word document relation data after the normalization processing.
In summary, according to the embodiments of the present invention, the document theme relation data, the word theme relation data and the adjustment factor are iteratively updated, according to the relationship among the document theme relation data, the word theme relation data and the word document relation data and the relationship among the word theme relation data, the adjustment factor and the inter-word relation data, until the target probability reaches the set requirement, and the document theme of the document set is generated from the word theme relation data obtained by the iterative updating. In the process of generating the word theme relation data, the word theme relation data is influenced by both the word document relation data and the inter-word relation data, so that the finally generated word theme relation data is jointly constrained by the word document relation data and the inter-word relation data; the semantic relations between words are thus taken into account in the process of generating the document theme, which improves the accuracy of the generated document theme.
Drawings
FIG. 1 is a flowchart of the steps of one embodiment of a document theme generation method of the present invention;
FIG. 2 is a flowchart of the steps of another embodiment of a document theme generation method of the present invention;
fig. 3 is a block diagram showing the structure of an embodiment of the document theme generation apparatus of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a document theme generation method of the present invention is shown, which may specifically include the following steps:
step 101, performing word segmentation on the documents of the document set and extracting words.
In the embodiment of the invention, the document set refers to a set consisting of a plurality of documents, and each document is segmented to obtain a word list corresponding to that document. There are many word segmentation methods, and the chosen method is not limited in this embodiment. Words are then extracted from the segmentation result, either all of them or a subset; which words are extracted may be decided according to actual needs, which is also not limited in this embodiment.
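As an illustration of this step, the following sketch assumes the open-source jieba segmenter and a small hand-written stop-word list; both choices and the sample sentences are illustrative, since the embodiment does not prescribe a particular segmentation tool or word list.

```python
# A minimal sketch of step 101, assuming the jieba segmenter and a hypothetical
# stop-word list; the patent does not prescribe a specific tool.
import jieba

documents = [
    "文档主题生成方法和装置涉及文本分析领域",
    "主题模型通过词频学习文档的主题分布",
]
stop_words = {"和", "的", "通过"}  # hypothetical stop-word list

def segment_and_extract(doc):
    # Segment the document, then keep only words outside the stop-word set.
    return [w for w in jieba.lcut(doc) if w.strip() and w not in stop_words]

word_lists = [segment_and_extract(d) for d in documents]
print(word_lists)
```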
And 102, counting the data of the relationship between words representing the semantic relevance between every two extracted words.
In the embodiments of the present invention, semantic relevance refers to the degree of semantic similarity between words and words. The data of the word-word relationship refers to data representing semantic correlation between words, and all the data of the word-word relationship can be regarded as a data matrix. There are many ways to calculate the relationship data representing the semantic relevance between words, and this embodiment does not limit the specific calculation way.
In step 103, word document relationship data characterizing the importance of each word in each document is counted.
In embodiments of the present invention, each document may be thought of as a collection of multiple terms, and for each term, the importance in the document may be characterized by the Term Frequency (TF) or term frequency-inverse document frequency (TF-IDF) with which the term appears in the document. The term document relationship data refers to data representing the importance of each term in each document, and all term document relationship data can be obtained by counting the data representing the importance of each term in each document, and specifically all term document relationship data can be regarded as a data matrix.
And 104, randomly generating document theme relation data representing the relevance of each document to each preset theme and word theme relation data representing the relevance of each word to each theme.
In the embodiment of the invention, each document has a plurality of implicit topics, and a topic refers to a concept or an aspect that can be expressed as a series of related words. If a document relates to a topic, the words related to that topic appear in it with higher frequency. In a specific implementation, the number of themes needs to be set; it is assumed that each document involves the set number of preset themes.
If described mathematically, a topic is a conditional probability distribution over all words: the more closely a word is related to a topic, the greater its conditional probability, and vice versa. For a document, each word is obtained by the process of "selecting a topic with a certain probability and selecting a word from that topic with a certain probability". The probability of each word occurring in a document generated in this way is:

p(word | document) = Σ_topic p(word | topic) · p(topic | document)

where p(word | document) represents the probability of the word occurring in the document, p(word | topic) represents the probability of the word occurring in the topic, and p(topic | document) represents the probability of the topic occurring in the document.
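A small numeric illustration of this mixture formula (the probabilities below are made-up toy values, not data from the embodiment):

```python
# p(word|document) = sum over topics of p(word|topic) * p(topic|document),
# computed as a matrix product on toy probabilities.
import numpy as np

p_word_topic = np.array([[0.6, 0.1],    # rows: words, columns: topics
                         [0.3, 0.2],
                         [0.1, 0.7]])
p_topic_doc = np.array([[0.8, 0.3],     # rows: topics, columns: documents
                        [0.2, 0.7]])

p_word_doc = p_word_topic @ p_topic_doc  # rows: words, columns: documents
print(p_word_doc)                        # each column sums to 1
```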
The relevance of each document to each preset topic refers to the degree of relevance between each document and each preset topic, and can be represented by the probability of each topic appearing in each document. The document theme relationship data may refer to a probability of occurrence of each theme in each document, and specifically, all the document theme relationship data may be regarded as a data matrix. The relevance of each word to each preset theme refers to the degree of relevance of each word to each theme, and can be represented by the occurrence probability of each word in each theme. The term topic relationship data may refer to the probability of each term appearing in each topic, and specifically, all term topic relationship data may be regarded as a data matrix.
The randomly generated document theme relationship data and word theme relationship data are data of the probability of occurrence of each theme in each document and data of the probability of occurrence of each word in each theme randomly generated. The specific random generation method is not limited in this embodiment.
And 105, generating an adjusting factor according to the word theme relation data and the word relation data.
In the embodiment of the present invention, the adjustment factor refers to parameter data determined jointly by the word theme relation data and the inter-word relation data. If all the inter-word relation data are regarded as one matrix, all the word theme relation data as another matrix, and all the adjustment factors as a third matrix, then the matrix of the inter-word relation data can be decomposed into the matrix of the word theme relation data and the matrix of the adjustment factors. According to this relationship, all the adjustment factors can be calculated once all the inter-word relation data and all the word theme relation data are available.
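As one possible illustration only, and not necessarily the procedure used by the embodiment, the adjustment factors can be obtained from given inter-word relation data and word theme relation data by a regularized least-squares fit of the decomposition described above; the matrix sizes and the regularization constant eps below are assumptions.

```python
# Sketch: derive the adjustment-factor matrix Z from the inter-word matrix C
# and the word-topic matrix W via a ridge-regularized least-squares fit of
# C ≈ W^T Z. Values are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
m, r = 5, 2                      # m words, r topics
W = rng.random((r, m))           # word theme relation data
C = rng.random((m, m))           # inter-word relation data (toy values)

eps = 1e-6                       # small regularizer to keep the solve stable
Z = np.linalg.solve(W @ W.T + eps * np.eye(r), W @ C)  # r x m
print(np.linalg.norm(C - W.T @ Z))  # reconstruction error of the fit
```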
And 106, iteratively updating the document theme relation data, the word theme relation data and the adjustment factor to reach a set end condition according to the relation among the document theme relation data, the word theme relation data and the word document relation data and the relation among the word theme relation data, the adjustment factor and the word relation data, so that the target probability of simultaneously generating the document theme relation data, the word theme relation data and the adjustment factor under the condition of determining the word document relation data and the word relation data reaches a set requirement.
In the embodiment of the present invention, if all the document theme relation data, word theme relation data and word document relation data are respectively regarded as matrices, then the matrix corresponding to all the word document relation data can be decomposed into the matrix corresponding to all the document theme relation data and the matrix corresponding to all the word theme relation data. Likewise, if all the word theme relation data, the adjustment factors and the inter-word relation data are regarded as matrices, then the matrix corresponding to all the inter-word relation data can be decomposed into the matrix corresponding to all the word theme relation data and the matrix corresponding to all the adjustment factors. These are the relationship among the document theme relation data, the word theme relation data and the word document relation data, and the relationship among the word theme relation data, the adjustment factor and the inter-word relation data.
In the iterative updating, in the (N + 1)th iteration, the new document theme relation data, word theme relation data and adjustment factor generated by that iteration are determined from the word document relation data, the inter-word relation data, and the latest document theme relation data, word theme relation data and adjustment factor.
Specifically, assume that the matrix R corresponding to all the word document relation data has m rows and n columns, representing m words and n documents respectively; the matrix C corresponding to all the inter-word relation data has m rows and m columns and represents the semantic relevance between every two words; the matrix D corresponding to all the document theme relation data has r rows and n columns, representing the r preset themes and the n documents respectively; the matrix W corresponding to all the word theme relation data has r rows and m columns, representing the r preset themes and the m words respectively; and the matrix Z corresponding to all the adjustment factors has r rows and m columns, so that each word document entry can be approximated as R_ij ≈ W_i^T D_j and each inter-word entry as C_ik ≈ W_i^T Z_k, where W_i, D_j and Z_k are column vectors of W, D and Z.
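A toy setup of these matrices is sketched below; the dimensions follow the definitions above, while the values are random placeholders.

```python
# Toy setup of the matrices named above (dimensions only; values are random).
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 4, 3                  # m words, n documents, r preset themes

R = rng.random((m, n))             # word document relation data (e.g. TF-IDF)
C = rng.random((m, m))             # inter-word relation data (similarities)
D = rng.random((r, n))             # document theme relation data, random init
W = rng.random((r, m))             # word theme relation data, random init
Z = rng.random((r, m))             # adjustment factors

print(R.shape, C.shape, D.shape, W.shape, Z.shape)
```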
Assuming that, given the matrix W corresponding to all the word theme relation data and the matrix D corresponding to all the document theme relation data, each entry of the matrix R corresponding to all the word document relation data obeys a normal distribution, then:

p(R | W, D, σ_R²) = ∏_{i=1..m} ∏_{j=1..n} [ N(R_ij | W_i^T D_j, σ_R²) ]^(I_ij^R)

wherein p(R | W, D, σ_R²) is the probability of the matrix R corresponding to all the word document relation data appearing under the condition that the matrix W corresponding to all the word theme relation data and the matrix D corresponding to all the document theme relation data are given and the variance is set to σ_R²; W_i is the i-th column vector of the matrix W and D_j is the j-th column vector of the matrix D; N(x | μ, σ²) denotes the normal distribution with mean μ and variance σ²; and I_ij^R is an indicator function which is 1 if the word document relation data of word i in document j is not zero and 0 otherwise.
Assuming that vectors corresponding to each column of the matrix W corresponding to all the term topic relationship data and vectors corresponding to each column of the matrix D corresponding to all the document topic relationship data obey normal distribution with zero mean, then:
p(W | σ_W²) = ∏_{i=1..m} N(W_i | 0, σ_W² I)

p(D | σ_D²) = ∏_{j=1..n} N(D_j | 0, σ_D² I)

wherein p(W | σ_W²) denotes the probability of generating the matrix W corresponding to all the word theme relation data when the variance is set to σ_W², and p(D | σ_D²) denotes the probability of generating the matrix D corresponding to all the document theme relation data when the variance is set to σ_D²; N(x | 0, σ² I) denotes the normal distribution with mean 0 and covariance σ² I, I denotes the identity matrix, W_i is the i-th column vector of the matrix W, and D_j is the j-th column vector of the matrix D.
Assuming that the matrix C corresponding to all the inter-word relationship data follows normal distribution, then:
p(C | W, Z, σ_C²) = ∏_{i=1..m} ∏_{k=1..m} [ N(C_ik | W_i^T Z_k, σ_C²) ]^(I_ik^C)

wherein p(C | W, Z, σ_C²) is the probability of the matrix C corresponding to all the inter-word relation data appearing under the condition that the matrix W corresponding to all the word theme relation data and the matrix Z corresponding to all the adjustment factors are given and the variance is set to σ_C²; W_i is the i-th column vector of the matrix W and Z_k is the k-th column vector of the matrix Z; N(x | μ, σ²) denotes the normal distribution with mean μ and variance σ²; and I_ik^C is an indicator function which is 1 if the inter-word relation data between word i and word k is not zero and 0 otherwise.
Assuming that the vector corresponding to each column of the matrix Z corresponding to all the adjustment factors obeys a normal distribution with zero mean, then:

p(Z | σ_Z²) = ∏_{k=1..m} N(Z_k | 0, σ_Z² I)

wherein p(Z | σ_Z²) denotes the probability of generating the matrix Z corresponding to all the adjustment factors when the variance is set to σ_Z²; N(x | 0, σ² I) denotes the normal distribution with mean 0 and covariance σ² I, I denotes the identity matrix, and Z_k is the k-th column vector of the matrix Z.
The target probability is the probability of simultaneously generating document theme relationship data, word theme relationship data and an adjustment factor under the condition of determining all word document relationship data and all inter-word relationship data, and can be expressed as follows according to a Bayesian formula:
p(W, D, Z | R, C) = p(R, C | W, D, Z) · p(W, D, Z) / p(R, C) ∝ p(R | W, D, σ_R²) · p(C | W, Z, σ_C²) · p(W | σ_W²) · p(D | σ_D²) · p(Z | σ_Z²)

wherein p(W, D, Z | R, C) represents the probability of generating the matrix D corresponding to all the document theme relation data, the matrix W corresponding to all the word theme relation data and the matrix Z corresponding to all the adjustment factors under the condition that the matrix R corresponding to all the word document relation data and the matrix C corresponding to all the inter-word relation data are determined; p(R, C | W, D, Z) represents the probability of generating the matrices R and C under the condition that the matrices D, W and Z are determined; p(W, D, Z) represents the probability of generating the matrices D, W and Z; and p(R, C) represents the probability of generating the matrix R corresponding to all the word document relation data and the matrix C corresponding to all the inter-word relation data.
Taking the natural logarithm of the target probability gives:

ln p(W, D, Z | R, C) = ln p(R | W, D, σ_R²) + ln p(C | W, Z, σ_C²) + ln p(W | σ_W²) + ln p(D | σ_D²) + ln p(Z | σ_Z²) − ln p(R, C)

Making the target probability p(W, D, Z | R, C) reach its maximum value is equivalent to finding the maximum point of the above formula. In the process of iterative updating, the target probability is continuously made to approach this maximum value.
Then, after the iteration reaches the set end condition, the target probability may reach the set requirement, and the specific set end condition may be the set iteration number, or the target probability exceeds the set end threshold. The set iteration number and the set ending threshold may be obtained by debugging according to the setting requirement of actual needs, which is not limited in this embodiment.
And step 107, generating the document theme of the document set by the iteratively updated word theme relation data.
In the embodiment of the invention, the word topic relation data obtained by iterative update represents the relevance of each topic and each word. A topic refers to a concept or aspect that may be embodied as a series of related words. And generating a document theme of the document set by using the word theme relation data, wherein the document theme is specifically composed of the probability of each word appearing in the theme.
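A minimal sketch of this step, with toy word theme relation data standing in for the iteratively updated matrix and a hypothetical vocabulary:

```python
# Read off each theme as its highest-probability words, given word theme
# relation data (toy values; in practice this is the iteratively updated W).
import numpy as np

vocabulary = ["bank", "loan", "rate", "game", "team", "score"]
# rows: themes, columns: words; values play the role of p(word | topic)
word_topic = np.array([[0.30, 0.25, 0.20, 0.10, 0.10, 0.05],
                       [0.05, 0.05, 0.10, 0.30, 0.25, 0.25]])

top_k = 3
for t, row in enumerate(word_topic):
    top_words = [vocabulary[i] for i in np.argsort(row)[::-1][:top_k]]
    print(f"theme {t}: {top_words}")
```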
In summary, according to the embodiments of the present invention, the document theme relation data, the word theme relation data and the adjustment factor are iteratively updated, according to the relationship among the document theme relation data, the word theme relation data and the word document relation data and the relationship among the word theme relation data, the adjustment factor and the inter-word relation data, until the target probability reaches the set requirement, and the document theme of the document set is generated from the word theme relation data obtained by the iterative updating. In the process of generating the word theme relation data, the word theme relation data is influenced by both the word document relation data and the inter-word relation data, so that the finally generated word theme relation data is jointly constrained by the word document relation data and the inter-word relation data; the semantic relations between words are thus taken into account in the process of generating the document theme, which improves the accuracy of the generated document theme.
In the embodiment of the present invention, preferably, one implementation manner of performing word segmentation on the documents in the document set and extracting the words is as follows: performing word segmentation on the documents of the document set; the remaining words excluding the set unnecessary words are extracted.
Specifically, all documents in the document set are subjected to word segmentation to obtain all words, set unnecessary words are removed, and the rest words are extracted.
In the embodiment of the present invention, it is preferable that the set unnecessary words include a set stop word and a recognized word having no practical meaning.
The set stop words refer to words which are manually input and are not automatically generated. The recognized words without practical meanings refer to words without practical meanings such as pronouns, auxiliary words and the like which are automatically recognized according to the part of speech.
In the embodiment of the present invention, preferably, one implementation manner of the inter-word relationship data statistically representing semantic relevance between every two of all extracted words is to convert all extracted words into word vectors according to the semantic relevance; and carrying out similarity calculation on the word vectors corresponding to all the extracted words, and obtaining the inter-word relation data.
Specifically, all the words are converted into vector form using a word-vector training model according to semantic similarity, and then the similarity between every two vectors is calculated to obtain the inter-word relation data. For example, the extracted words are converted into vector form using the open-source tool Word2vec (word to vector) of Google, and then the similarity between vectors is calculated with a cosine similarity or Pearson similarity method; the resulting pairwise similarities form the inter-word relation data. For example, if the vector forms of word a and word b are V_a and V_b, the cosine similarity between word a and word b is calculated as:

cos(V_a, V_b) = (V_a · V_b) / (|V_a| · |V_b|)
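A small sketch of this computation, assuming word vectors are already available (random toy vectors stand in for Word2vec output) and using cosine similarity:

```python
# Build the inter-word relation data C as a pairwise cosine-similarity matrix.
import numpy as np

rng = np.random.default_rng(2)
word_vectors = rng.normal(size=(5, 8))   # 5 words, 8-dimensional toy vectors

norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
unit = word_vectors / norms              # normalize each vector to unit length
C = unit @ unit.T                        # C[a, b] = cos(V_a, V_b)
print(np.round(C, 3))
```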
in the embodiment of the present invention, preferably, one implementation of the term document relationship data that statistically characterizes the importance of each term in each document is to calculate the frequency of occurrence of each term in each document and the logarithm of the quotient of the total number of documents divided by the number of documents containing the term; multiplying the occurrence frequency and the corresponding logarithm for each word to obtain word document relation data which characterizes the importance of each word in each document.
Specifically, the frequency of occurrence is the number of occurrences of a word in the current article divided by the total number of words in the current article. The occurrence frequency of each word in each document is calculated, the logarithm of the quotient of the total number of documents divided by the number of documents containing the word is calculated, the occurrence frequency and the corresponding logarithm are multiplied for each word, and the obtained data can represent the importance of each word in each document. And calculating to obtain data corresponding to each word to each document to form word document relation data.
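A minimal sketch of this word document relation data (TF-IDF) computation on a toy document set:

```python
# TF-IDF as described above: term frequency times the logarithm of total
# documents divided by the number of documents containing the word.
import math
from collections import Counter

docs = [["word", "topic", "model", "topic"],
        ["word", "vector", "model"],
        ["topic", "vector", "vector"]]

vocabulary = sorted({w for d in docs for w in d})
n_docs = len(docs)
doc_freq = {w: sum(1 for d in docs if w in d) for w in vocabulary}

tfidf = []  # rows: words, columns: documents (the matrix R)
for w in vocabulary:
    idf = math.log(n_docs / doc_freq[w])
    row = []
    for d in docs:
        tf = Counter(d)[w] / len(d)   # occurrences divided by words in document
        row.append(tf * idf)
    tfidf.append(row)

for w, row in zip(vocabulary, tfidf):
    print(w, [round(v, 3) for v in row])
```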
Referring to fig. 2, a flowchart illustrating steps of another embodiment of the document theme generation method of the present invention is shown, which may specifically include the following steps:
step 201, performing word segmentation on the documents in the document set and extracting words.
Step 202, counting the data of the relationship between words representing the semantic correlation between every two extracted words.
At step 203, word document relationship data characterizing the importance of each word in each document is counted.
Step 204, carrying out normalization processing on the word document relation data to obtain the word document relation data after the normalization processing.
In the embodiments of the present invention, there are many methods for normalizing data, and this embodiment does not limit the choice. One normalization method is to subtract the minimum value of the word document relation data from each value, and then divide the result by the difference between the maximum value and the minimum value of the word document relation data, obtaining the normalized word document relation data.
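A short illustration of this min-max normalization on toy word document relation data:

```python
# Min-max normalization: subtract the minimum, then divide by (max - min).
import numpy as np

R = np.array([[0.0, 2.0, 1.0],
              [4.0, 3.0, 5.0]])

R_norm = (R - R.min()) / (R.max() - R.min())
print(R_norm)   # all values now lie in [0, 1]
```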
Step 205, randomly generating document theme relation data representing the relevance of each document to each set theme and word theme relation data representing the relevance of each word to each theme.
And step 206, generating an adjustment factor according to the word theme relation data and the word relation data.
And step 207, in the (N + 1) th iteration, generating a first adjustment value of the word theme relationship data in the current iteration according to the latest document theme relationship data, the word theme relationship data and the adjustment factor, and updating the word theme relationship data according to the first adjustment value and a set learning rate constant.
In the embodiment of the invention, the first adjustment value is used to search iteratively until the target probability reaches its maximum. Taking the negative of the natural logarithm of the target probability and deleting the constant terms gives the following formula:

L(R, C, W, D, Z) = (1/2) Σ_{i,j} I_ij^R (R_ij − W_i^T D_j)² + (λ_C/2) Σ_{i,k} I_ik^C (C_ik − W_i^T Z_k)² + (λ_W/2) Σ_i ||W_i||² + (λ_D/2) Σ_j ||D_j||² + (λ_Z/2) Σ_k ||Z_k||²

wherein L(R, C, W, D, Z) is the new function obtained by taking the negative of the natural logarithm of the target probability and deleting the constant terms, λ_C = σ_R²/σ_C², λ_W = σ_R²/σ_W², λ_D = σ_R²/σ_D² and λ_Z = σ_R²/σ_Z², and λ_D = λ_W = λ_Z = λ is set.

Finding the maximum value of the target probability is equivalent to finding the minimum point of the above formula L(R, C, W, D, Z). Using the gradient descent method, the partial derivative of L with respect to W_i is calculated:

∂L/∂W_i = Σ_j I_ij^R (W_i^T D_j − R_ij) D_j + λ_C Σ_k I_ik^C (W_i^T Z_k − C_ik) Z_k + λ W_i

which gives the first adjustment value of the word theme relation data in the current iteration.
As can be seen from the formula, since the term document relationship data and the inter-term relationship data are determined, the first adjustment value may be determined according to the latest document topic relationship data, term topic relationship data, and adjustment factor.
And updating the word theme relation data by the first adjusting value and the set learning rate constant. The set learning rate constant is a constant set in advance for controlling the amount of change per iteration data update.
And subtracting the product of the set learning rate constant and the first adjusting value from the word topic relation data generated in the nth iteration to obtain new word topic relation data generated in the (N + 1) th iteration. The set learning rate constant is a constant set in advance for controlling the amount of change per iteration data update.
And 208, in the (N + 1) th iteration, generating a second adjustment value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data, and updating the document theme relation data according to the second adjustment value and a set learning rate constant.
In the embodiment of the present invention, finding the maximum value of the target probability is equivalent to finding the minimum point of the above formula L(R, C, W, D, Z). Using the gradient descent method, the partial derivative of L with respect to D_j is calculated:

∂L/∂D_j = Σ_i I_ij^R (W_i^T D_j − R_ij) W_i + λ D_j

which gives the second adjustment value of the document theme relation data in the current iteration.
As can be seen from the formula, since the word document relationship data and the inter-word relationship data are determined, the second adjustment value may be determined according to the latest document topic relationship data and word topic relationship data.
And subtracting the product of the set learning rate constant and the second adjusting value from the document theme relation data generated in the nth iteration to obtain new document theme relation data generated in the (N + 1) th iteration. The set learning rate constant is a constant set in advance for controlling the amount of change per iteration data update.
And 209, in the (N + 1) th iteration, generating a third adjusting value of the adjusting factor in the current iteration according to the latest word theme relation data and the adjusting factor, and updating the adjusting factor according to the third adjusting value and a set learning rate constant.
In the embodiment of the present invention, finding the maximum value of the target probability is equivalent to finding the minimum point of the above formula L(R, C, W, D, Z). Using the gradient descent method, the partial derivative of L with respect to Z_k is calculated:

∂L/∂Z_k = λ_C Σ_i I_ik^C (W_i^T Z_k − C_ik) W_i + λ Z_k

which gives the third adjustment value of the adjustment factor in the current iteration.
As can be seen from the formula, since the word document relationship data and the inter-word relationship data are determined, the third adjustment value can be determined according to the latest word topic relationship data and adjustment factor.
And subtracting the product of the set learning rate constant and the third adjustment value from the adjustment factor generated in the Nth iteration gives the new adjustment factor generated in the (N + 1)th iteration. The set learning rate constant is a constant set in advance for controlling the amount of change in each iterative data update.
And step 210, ending the iterative updating until the set ending condition is reached, so that the target probability reaches the set requirement.
In the embodiment of the present invention, after the iteration reaches the set end condition, the target probability may reach the set requirement, and the specific set end condition may be a set iteration number, or a target probability exceeding a set end threshold. The set iteration number and the set ending threshold may be obtained by debugging according to the setting requirement of actual needs, which is not limited in this embodiment.
Specifically, the target probability exceeding the set ending threshold is equivalent to the value of the formula L (R, C, W, D, Z) being less than the set threshold. For example: the set iteration number is 1000 loops, but in a certain iteration within 1000 loops, the value of the L (R, C, W, D, Z) function is smaller than the set threshold value of 0.00001, and then the target probability is considered to reach the set requirement.
And judging whether the set ending condition is reached, and ending the iteration if the set ending condition is reached. If the set end condition is not met, the iterative execution of step 207, step 208 and step 209 is continued. And the execution sequence of step 207, step 208 and step 209 is not limited in the embodiment of the present invention.
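The following sketch ties steps 204 to 210 together on toy data, using the objective and gradients reconstructed above; the learning rate, the λ values, the number of themes, the indicator thresholds, the iteration cap and the loss threshold are illustrative assumptions rather than values fixed by the embodiment.

```python
# End-to-end sketch of the iterative updating on toy data.
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 8, 5, 3                     # words, documents, themes
R = rng.random((m, n))                # word document relation data (normalized)
C = rng.random((m, m))                # inter-word relation data
I_R = (R > 0.3).astype(float)         # indicator: nonzero word-document entries
I_C = (C > 0.3).astype(float)         # indicator: nonzero inter-word entries

D = 0.1 * rng.random((r, n))          # document theme relation data, random init
W = 0.1 * rng.random((r, m))          # word theme relation data, random init
Z = 0.1 * rng.random((r, m))          # adjustment factors

lam_c, lam = 1.0, 0.01                # lambda_C and the shared lambda
eta = 0.05                            # set learning rate constant
max_iter, loss_threshold = 1000, 1e-5 # set end conditions

def loss():
    e_r = I_R * (R - W.T @ D)
    e_c = I_C * (C - W.T @ Z)
    return (0.5 * (e_r ** 2).sum() + 0.5 * lam_c * (e_c ** 2).sum()
            + 0.5 * lam * ((W ** 2).sum() + (D ** 2).sum() + (Z ** 2).sum()))

for it in range(max_iter):
    e_r = I_R * (W.T @ D - R)         # residual on word document relation data
    e_c = I_C * (W.T @ Z - C)         # residual on inter-word relation data
    grad_W = D @ e_r.T + lam_c * (Z @ e_c.T) + lam * W   # first adjustment value
    grad_D = W @ e_r + lam * D                           # second adjustment value
    grad_Z = lam_c * (W @ e_c) + lam * Z                 # third adjustment value
    W, D, Z = W - eta * grad_W, D - eta * grad_D, Z - eta * grad_Z
    if loss() < loss_threshold:       # set end condition on the objective value
        break

print(f"stopped after {it + 1} iterations, loss = {loss():.5f}")
```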
And step 211, generating the document theme of the document set by the iteratively updated word theme relation data.
In summary, according to the embodiments of the present invention, the document theme relation data, the word theme relation data and the adjustment factor are generated and then iteratively updated according to the relationship among the document theme relation data, the word theme relation data and the word document relation data and the relationship among the word theme relation data, the adjustment factor and the inter-word relation data; after the iteration reaches the set end condition, the target probability reaches the set requirement, and the document theme of the document set is generated from the word theme relation data obtained by the iterative updating. In the process of generating the word theme relation data, the word theme relation data is influenced by both the word document relation data and the inter-word relation data, so that the finally generated word theme relation data is jointly constrained by the word document relation data and the inter-word relation data; the semantic relations between words are thus taken into account in the process of generating the document theme, which improves the accuracy of the generated document theme.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 3, a block diagram of an embodiment of the document theme generation apparatus of the present invention is shown, and may specifically include the following modules:
the document word segmentation module 301 is configured to segment the documents of the document set and extract words;
an inter-word relationship data statistics module 302, configured to count inter-word relationship data representing semantic relevance between every two extracted words;
a term document relationship data statistics module 303, configured to count term document relationship data representing importance of each term in each document;
a data random generation module 304, configured to randomly generate document theme relationship data representing the relevance of each document to each preset theme, and word theme relationship data representing the relevance of each word to each theme;
an adjustment factor generating module 305, configured to generate an adjustment factor according to the word topic relationship data and the inter-word relationship data;
an iteration updating module 306, configured to update the document theme relationship data, the term theme relationship data, and the adjustment factor to reach a set end condition according to the relationship between the document theme relationship data, the term theme relationship data, and the term document relationship data, and the relationship between the term theme relationship data, the adjustment factor, and the term relationship data, so that a target probability that the document theme relationship data, the term theme relationship data, and the adjustment factor are generated at the same time under the condition that the term document relationship data and the term relationship data are determined reaches a set requirement;
and the document theme generating module 307 is configured to generate document themes of the document set according to the iteratively updated word theme relation data.
In the embodiment of the present invention, preferably, the iterative update module includes:
the word theme relation data updating submodule is used for generating a first adjusting value of the word theme relation data in the iteration according to the latest document theme relation data, the word theme relation data and the adjusting factor in the (N + 1) th iteration and updating the word theme relation data according to the first adjusting value and a set learning rate constant;
the document theme relation data updating submodule is used for generating a second adjusting value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data in the N +1 th iteration and updating the document theme relation data according to the second adjusting value and a set learning rate constant;
the adjustment updating submodule is used for generating a third adjustment value of the adjustment factor in the current iteration according to the latest word theme relation data and the adjustment factor in the (N + 1) th iteration, and updating the adjustment factor according to the third adjustment value and a set learning rate constant;
and the iteration ending submodule is used for ending the iteration updating until the set ending condition is reached, so that the target probability reaches the set requirement.
In the embodiment of the present invention, preferably, the document word segmentation module includes:
the document word segmentation sub-module is used for segmenting the documents of the document set;
and the word extraction submodule is used for extracting the residual words excluding the set unnecessary words.
In the embodiment of the present invention, it is preferable that the set unnecessary words include a set stop word and a recognized word having no practical meaning.
In the embodiment of the present invention, preferably, the inter-word relationship data statistics module includes:
the word vector conversion submodule is used for converting all the extracted words into word vectors according to the semantic correlation;
and the similarity calculation operator module is used for calculating the similarity between every two word vectors corresponding to all the extracted words to obtain the inter-word relation data.
In the embodiment of the present invention, preferably, the word document relationship data statistics module includes:
the logarithm calculation submodule is used for calculating the occurrence frequency of each word in each document and the logarithm of the quotient of the total document number divided by the number of the documents containing the word;
and the word document relation data calculation submodule is used for multiplying the occurrence frequency and the corresponding logarithm for each word to obtain word document relation data representing the importance of each word in each document.
In the embodiment of the present invention, preferably, the apparatus further includes:
and the word document relation data normalization module is used for performing normalization processing on the word document relation data before randomly generating document theme relation data representing the relevance of each document to each set theme and word theme relation data representing the relevance of each word to the theme to obtain the word document relation data after the normalization processing.
In summary, according to the embodiments of the present invention, the document theme relation data, the word theme relation data and the adjustment factor are iteratively updated, according to the relationship among the document theme relation data, the word theme relation data and the word document relation data and the relationship among the word theme relation data, the adjustment factor and the inter-word relation data, until the target probability reaches the set requirement, and the document theme of the document set is generated from the word theme relation data obtained by the iterative updating. In the process of generating the word theme relation data, the word theme relation data is influenced by both the word document relation data and the inter-word relation data, so that the finally generated word theme relation data is jointly constrained by the word document relation data and the inter-word relation data; the semantic relations between words are thus taken into account in the process of generating the document theme, which improves the accuracy of the generated document theme.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The document theme generation method and apparatus provided by the present invention are described in detail above. Specific examples are used herein to explain the principles and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. At the same time, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A document theme generation method, comprising:
segmenting the documents of the document set and extracting words;
counting the inter-word relation data representing the semantic correlation between every two extracted words;
counting word document relation data representing the importance of each word in each document;
randomly generating document theme relation data representing the relevance of each document to each preset theme and word theme relation data representing the relevance of each word to each theme;
generating an adjustment factor according to the word theme relation data and the inter-word relation data;
iteratively updating the document theme relation data, the word theme relation data and the adjustment factor, according to the relation among the document theme relation data, the word theme relation data and the word document relation data and the relation among the word theme relation data, the adjustment factor and the inter-word relation data, until a set end condition is reached, so that the target probability of simultaneously generating the document theme relation data, the word theme relation data and the adjustment factor, given the word document relation data and the inter-word relation data, meets a set requirement, comprising: in the N-th iteration, generating the latest word theme relation data, the latest document theme relation data and the latest adjustment factor; in the (N+1)-th iteration, generating a first adjustment value of the word theme relation data in the current iteration according to the latest document theme relation data, word theme relation data and adjustment factor generated in the N-th iteration, and updating the word theme relation data according to the first adjustment value and a set learning rate constant; in the (N+1)-th iteration, generating a second adjustment value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data generated in the N-th iteration, and updating the document theme relation data according to the second adjustment value and the set learning rate constant; and in the (N+1)-th iteration, generating a third adjustment value of the adjustment factor in the current iteration according to the latest word theme relation data and adjustment factor generated in the N-th iteration, and updating the adjustment factor according to the third adjustment value and the set learning rate constant, until the set end condition is reached and the iterative updating ends, so that the target probability meets the set requirement;
and generating the document theme of the document set by using the word theme relation data obtained by iterative updating.
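For illustration, the iterative updating of claim 1 can be sketched in code. The sketch below assumes a least-squares objective in which the word document relation data X is approximated by the product of the document theme relation data Theta and the word theme relation data Phi, and the inter-word relation data S is approximated by a * Phi.T @ Phi with a scalar adjustment factor a; the function name, the scalar form of the adjustment factor, the fixed weight reg and the convergence test are illustrative assumptions, not the formulas of the patent itself.

import numpy as np

def fit_topics(X, S, n_topics, lr=1e-3, reg=0.1, n_iter=500, tol=1e-6, seed=0):
    # X: (n_docs, n_words) word document relation data (e.g. normalized TF-IDF)
    # S: (n_words, n_words) inter-word relation data (e.g. cosine similarity, symmetric)
    # Assumed objective: ||X - Theta @ Phi||^2 + reg * ||S - a * Phi.T @ Phi||^2
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    Theta = rng.random((n_docs, n_topics))   # randomly generated document theme relation data
    Phi = rng.random((n_topics, n_words))    # randomly generated word theme relation data
    a = 1.0                                  # adjustment factor (a single scalar by assumption)
    prev_loss = np.inf
    for _ in range(n_iter):
        E = X - Theta @ Phi                  # error against the word document relation data
        R = S - a * (Phi.T @ Phi)            # error against the inter-word relation data
        # first adjustment value: computed from Theta, Phi and the adjustment factor
        grad_Phi = -2.0 * Theta.T @ E - 4.0 * reg * a * (Phi @ R)
        # second adjustment value: computed from Theta and Phi
        grad_Theta = -2.0 * E @ Phi.T
        # third adjustment value: computed from Phi and the adjustment factor
        grad_a = -2.0 * reg * np.sum(R * (Phi.T @ Phi))
        # update all three quantities with the set learning rate constant
        Phi -= lr * grad_Phi
        Theta -= lr * grad_Theta
        a -= lr * grad_a
        loss = np.sum(E * E) + reg * np.sum(R * R)
        if abs(prev_loss - loss) < tol:      # set end condition: the objective stops changing
            break
        prev_loss = loss
    return Theta, Phi, a

Note how the three gradients mirror the dependency structure recited in the claim: the first adjustment value is computed from the document theme data, the word theme data and the adjustment factor, the second from the document theme data and the word theme data, and the third from the word theme data and the adjustment factor.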
2. The method of claim 1, wherein segmenting the documents of the document set and extracting words comprises:
performing word segmentation on the documents of the document set;
and extracting the remaining words after excluding the set unnecessary words.
3. The method of claim 2, wherein the set unnecessary words include set stop words and recognized words having no actual meaning.
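As a sketch of claims 2 and 3, word segmentation and extraction for Chinese text could look as follows, assuming the third-party jieba tokenizer; the stop-word set and the digit and single-character filters stand in for the set unnecessary words and are placeholders rather than the patent's actual lists.

import jieba  # third-party Chinese word segmentation package (an assumed choice)

# Illustrative stop-word set; in practice a full list would be loaded from a file.
STOP_WORDS = {"的", "了", "和", "是", "在"}

def segment_and_extract(document):
    # Segment the document, then keep only the remaining words after excluding
    # set unnecessary words (stop words, digits, single characters with no real meaning).
    tokens = jieba.lcut(document)
    return [t for t in tokens
            if t.strip()
            and t not in STOP_WORDS
            and not t.isdigit()
            and len(t) > 1]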
4. The method of claim 1, wherein counting the inter-word relation data representing the semantic correlation between every two extracted words comprises:
converting all the extracted words into word vectors that capture semantic correlation;
and performing pairwise similarity calculation on the word vectors corresponding to all the extracted words to obtain the inter-word relation data.
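A minimal sketch of claim 4, assuming word2vec-style embeddings trained with gensim and cosine similarity as the similarity measure; neither choice is mandated by the claim.

import numpy as np
from gensim.models import Word2Vec  # assumed embedding tool; any model that yields word vectors would do

def inter_word_relation(tokenized_docs, vocabulary):
    # Train word vectors on the segmented documents, then take pairwise cosine
    # similarity between the vectors as the inter-word relation data.
    model = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1, seed=0)
    vectors = np.stack([model.wv[w] for w in vocabulary])
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T  # S[i, j] = cosine similarity between word i and word j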
5. The method of claim 1, wherein counting the word document relation data representing the importance of each word in each document comprises:
calculating, for each word, the frequency of occurrence in each document and the logarithm of the quotient of the total number of documents divided by the number of documents containing the word;
and multiplying, for each word, the occurrence frequency by the corresponding logarithm to obtain the word document relation data representing the importance of each word in each document.
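Claim 5 describes a TF-IDF style weighting. A minimal sketch, assuming the frequency of occurrence is taken as the relative term frequency within each document (raw counts would also fit the claim) and using the natural logarithm:

import math
from collections import Counter

def word_document_relation(tokenized_docs, vocabulary):
    # Word document relation data: term frequency multiplied by
    # log(total number of documents / number of documents containing the word).
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    table = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = []
        for w in vocabulary:
            tf = counts[w] / max(len(doc), 1)
            idf = math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            row.append(tf * idf)
        table.append(row)
    return table  # one row per document, one column per extracted word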
6. The method of claim 1, wherein, before randomly generating the document theme relation data representing the relevance of each document to each preset theme and the word theme relation data representing the relevance of each word to each theme, the method further comprises:
normalizing the word document relation data to obtain the normalized word document relation data.
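Claim 6 does not fix a normalization scheme; the sketch below assumes a simple per-document normalization in which each row of the word document relation data is scaled to sum to one, which is one common choice.

import numpy as np

def normalize_rows(X):
    # Normalize the word document relation data so each document's weights sum to 1.
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    return X / np.clip(row_sums, 1e-12, None)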
7. A document theme generation apparatus, comprising:
the document word segmentation module is used for segmenting the documents of the document set and extracting words;
the inter-word relation data statistics module is used for counting the inter-word relation data representing the semantic relevance between every two extracted words;
the word document relation data statistics module is used for counting word document relation data representing the importance of each word in each document;
the data random generation module is used for randomly generating document theme relation data representing the relevance of each document to each preset theme and word theme relation data representing the relevance of each word to each theme;
the adjusting factor generating module is used for generating adjusting factors according to the word theme relation data and the inter-word relation data;
an iterative update module, used for iteratively updating the document theme relation data, the word theme relation data and the adjustment factor, according to the relation among the document theme relation data, the word theme relation data and the word document relation data and the relation among the word theme relation data, the adjustment factor and the inter-word relation data, until a set end condition is reached, so that the target probability of simultaneously generating the document theme relation data, the word theme relation data and the adjustment factor, given the word document relation data and the inter-word relation data, meets a set requirement, wherein the iterative update module comprises: the N-th iteration generates the latest word theme relation data, the latest document theme relation data and the latest adjustment factor; the word theme relation data updating submodule is used for generating, in the (N+1)-th iteration, a first adjustment value of the word theme relation data in the current iteration according to the latest document theme relation data, word theme relation data and adjustment factor generated in the N-th iteration, and updating the word theme relation data according to the first adjustment value and a set learning rate constant; the document theme relation data updating submodule is used for generating, in the (N+1)-th iteration, a second adjustment value of the document theme relation data in the current iteration according to the latest document theme relation data and the latest word theme relation data generated in the N-th iteration, and updating the document theme relation data according to the second adjustment value and the set learning rate constant; the adjustment factor updating submodule is used for generating, in the (N+1)-th iteration, a third adjustment value of the adjustment factor in the current iteration according to the latest word theme relation data and adjustment factor generated in the N-th iteration, and updating the adjustment factor according to the third adjustment value and the set learning rate constant; and the iteration ending submodule is used for ending the iterative updating when the set end condition is reached, so that the target probability meets the set requirement;
and the document theme generating module is used for generating the document theme of the document set according to the word theme relation data obtained by iterative updating.
8. The apparatus of claim 7, wherein the document word segmentation module comprises:
the document word segmentation sub-module is used for segmenting the documents of the document set;
and the word extraction submodule is used for extracting the remaining words after excluding the set unnecessary words.
9. The apparatus of claim 8, wherein the set unnecessary words include set stop words and recognized words having no actual meaning.
10. The apparatus of claim 7, wherein the inter-word relation data statistics module comprises:
the word vector conversion submodule is used for converting all the extracted words into word vectors according to the semantic correlation;
and the similarity calculation submodule is used for performing pairwise similarity calculation on the word vectors corresponding to all the extracted words to obtain the inter-word relation data.
11. The apparatus of claim 7, wherein the word document relation data statistics module comprises:
the logarithm calculation submodule is used for calculating, for each word, the frequency of occurrence in each document and the logarithm of the quotient of the total number of documents divided by the number of documents containing the word;
and the word document relation data calculation submodule is used for multiplying, for each word, the occurrence frequency by the corresponding logarithm to obtain the word document relation data representing the importance of each word in each document.
12. The apparatus of claim 7, further comprising:
and the word document relation data normalization module is used for normalizing the word document relation data, before the document theme relation data representing the relevance of each document to each preset theme and the word theme relation data representing the relevance of each word to each theme are randomly generated, to obtain the normalized word document relation data.
CN201611089622.1A 2016-11-30 2016-11-30 Document theme generation method and device Active CN106776547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611089622.1A CN106776547B (en) 2016-11-30 2016-11-30 Document theme generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611089622.1A CN106776547B (en) 2016-11-30 2016-11-30 Document theme generation method and device

Publications (2)

Publication Number Publication Date
CN106776547A (en) 2017-05-31
CN106776547B (en) 2020-02-07

Family

ID=58913426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611089622.1A Active CN106776547B (en) 2016-11-30 2016-11-30 Document theme generation method and device

Country Status (1)

Country Link
CN (1) CN106776547B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9317809B1 (en) * 2013-09-25 2016-04-19 Emc Corporation Highly scalable memory-efficient parallel LDA in a shared-nothing MPP database
CN105243083A (en) * 2015-09-08 2016-01-13 百度在线网络技术(北京)有限公司 Document topic mining method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
On-line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking; Loulwah AlSumait et al.; 2008 Eighth IEEE International Conference on Data Mining; 2008-12-15; pp. 3-12 *
A topic feature selection algorithm based on word-sense dimensionality reduction (一种基于词义降维的主题特征选择算法); Xiao Lei et al.; Computer Applications and Software (计算机应用与软件); 2016-03-31; Vol. 33, No. 3; pp. 244-247 and 263 *

Also Published As

Publication number Publication date
CN106776547A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
US11086912B2 (en) Automatic questioning and answering processing method and automatic questioning and answering system
US11222167B2 (en) Generating structured text summaries of digital documents using interactive collaboration
CN110874531B (en) Topic analysis method and device and storage medium
CN109299480B (en) Context-based term translation method and device
CN107168954B (en) Text keyword generation method and device, electronic equipment and readable storage medium
US9852337B1 (en) Method and system for assessing similarity of documents
US7983902B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
CN107766318B (en) Keyword extraction method and device and electronic equipment
CN107180084B (en) Word bank updating method and device
CN108021545B (en) Case course extraction method and device for judicial writing
CN109918660B (en) Keyword extraction method and device based on TextRank
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
JP2009537901A (en) Annotation by search
CN107506389B (en) Method and device for extracting job skill requirements
CN111291177A (en) Information processing method and device and computer storage medium
US20130006611A1 (en) Method and system for extracting shadow entities from emails
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN106469097B (en) A kind of method and apparatus for recalling error correction candidate based on artificial intelligence
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN113553806B (en) Text data enhancement method, device, equipment and medium
GB2575580A (en) Supporting interactive text mining process with natural language dialog
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
CN116227466A (en) Sentence generation method, device and equipment with similar semantic different expressions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant