CN116432639B - News element word mining method based on improved BTM topic model - Google Patents

News element word mining method based on improved BTM topic model

Info

Publication number
CN116432639B
CN116432639B CN202310634323.5A CN202310634323A
Authority
CN
China
Prior art keywords
words
word
topic
biterm
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310634323.5A
Other languages
Chinese (zh)
Other versions
CN116432639A (en)
Inventor
赵丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202310634323.5A priority Critical patent/CN116432639B/en
Publication of CN116432639A publication Critical patent/CN116432639A/en
Application granted granted Critical
Publication of CN116432639B publication Critical patent/CN116432639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a news element word mining method based on an improved BTM topic model. The method first selects representative words in a text as the feature representation of the text, constructs a dictionary from the representative words, sets a window, forms a biterm from any two representative words within the window, and builds a biterm list from the resulting biterms; it then uses the improved BTM model to generate the topic-word distribution and the document-topic distribution; finally, according to the contribution of each word to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to topics is increased, and the general words that appear in every topic are removed to obtain the topic element words. Because the invention selects representative words from the short text to generate biterms before topic modeling is carried out, it eliminates the problem that word pairs formed by high-frequency general words and domain-specific words distort topic modeling and cause words from different domains to be mixed into a single topic.

Description

News element word mining method based on improved BTM topic model
Technical Field
The invention relates to the field of text mining, in particular to a news element word mining method based on an improved BTM topic model.
Background
LDA is a probabilistic topic model (PTM) that implicitly infers hidden topics from documents based on higher-order word co-occurrence patterns. Traditional PTMs such as PLSA and LDA are unsupervised techniques that implicitly infer hidden topics from word co-occurrence. However, because microblog texts are short, the document-word co-occurrence matrix is sparse, which makes the statistical model difficult to learn and the analysis results unsatisfactory; feature selection and topic partitioning on such sparse data therefore become the main difficulties of short-text analysis.
Unlike earlier topic models, which model topics at the document level, the prior-art BTM topic model converts documents into word pairs. A word pair (biterm) is formed by two words that co-occur in a document after preprocessing; each word pair is then modeled by sampling. In this way, topic modeling of individual documents is converted into learning topics from word pairs over the whole corpus, which overcomes the sparseness of single short microblog texts and captures microblog text information better than traditional topic models. However, because BTM models topics mainly through word co-occurrence, it has the following drawbacks in practice:
(1) News texts contain strongly domain-specific words that occur with low frequency; because these words appear rarely, they co-occur with topics rarely, so certain strongly domain-specific words never appear in any topic;
(2) Word pairs are constructed for all words in the whole text. Certain high-frequency words with weak domain specificity co-occur with various strongly domain-specific low-frequency words to form word pairs, which makes the meaning of some topics difficult to interpret after topic modeling. For example, one topic (its first 10 words) reads: "salmon piece remuneration movie stars standard film and television company actor contract industry". This topic is hard to interpret; it arises because words with high frequency but weak domain specificity, such as "company" and "standard", co-occur frequently in documents with strongly domain-specific words such as "salmon" and "film", and are therefore grouped into the same topic.
(3) Although BTM is a topic model suited to short texts, when the word-pair window is large the number of extracted biterms becomes large, and when the number of topics is also large, Gibbs sampling of topics becomes very time-consuming.
Disclosure of Invention
In view of the foregoing, it is a primary object of the present invention to provide a method for mining news element words based on an improved BTM topic model, so as to solve the above-mentioned technical problems.
The invention provides a method for mining news element words based on an improved BTM topic model, which comprises the following steps:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
step 2, constructing a dictionary from the representative words, setting a window, constructing a biterm from any two representative words in the window, and constructing a biterm list from a plurality of biterms;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
and step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words.
In the invention, representative words are selected from the short text to generate biterms before topic modeling is carried out, which eliminates the distorting effect of word pairs formed by high-frequency general words and domain-specific words on topic modeling and avoids the problem of words from different domains being mixed into a single topic.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a method of mining news element words based on an improved BTM topic model in accordance with the present invention;
FIG. 2 is a probability map of an improved BTM topic model in a method for mining news element words based on the improved BTM topic model according to the present invention;
FIG. 3 is a graph comparing the perplexity of the three models in an embodiment of the present invention;
FIG. 4 is a graph comparing the mean topic KL divergence of the three models in an embodiment of the present invention;
FIG. 5 is a schematic diagram of news element words mined by the improved BTM topic model of the present invention;
FIG. 6 is a schematic diagram of news element words mined based on a BTM topic model in the prior art.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly.
Referring to fig. 1, the invention provides a method for mining news element words based on an improved BTM topic model, wherein the method comprises the following steps:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
taking microblog text as an example: because a microblog text is short, its first sentence is usually the title or a summarizing sentence, and the words in it carry important information about the text, so they are given higher weight. Therefore, when selecting representative words, the invention introduces a word position weight factor for the words in the text and gives higher weight to the words selected from the first sentence of the text.
In order to select words that carry important information in a text, and considering the contribution of the selected subject words to the whole text, information entropy, i.e. the information content of a word, is introduced. Let w_1 and w_2 be two keywords in the text; even if the number of texts in which w_1 appears equals the number of texts in which w_2 appears, the IDF values of the two keywords should not simply be treated as equal.
According to information theory, if the distribution of word w_1 is more concentrated than that of word w_2, then w_1 has a higher correlation with the text in which it appears. The invention uses the information content of a word as a control coefficient on the word's tf-idf value in the final result, i.e. a word's tf-idf value is modulated by its information content: the larger the information content, the larger the resulting tf-idf value, and the representative words in a text are selected according to these word weights. The word position weight factor is expressed as λ; the words in the first sentence of the text are given the weight factor λ, and when calculating the weights of the words in the rest of the text, λ takes the value 0.
The weight of a word at the remaining positions is computed from the word's revised tf-idf value and its information entropy, where W(w_i) denotes the weight of word w_i at the remaining positions, H(w_i) denotes the information entropy (i.e. the information content) of word w_i, δ is the word-frequency revision value, n_i is the number of documents containing word w_i, N is the total number of documents, tf_i is the word frequency of w_i, avg(tf_i) is the average frequency of occurrence of word w_i, and m_i denotes the number of texts in the text corpus that contain word w_i.
The information entropy of word w_i is expressed as:

H(w_i) = -Σ p(w_i) · log p(w_i)

where p(w_i) denotes the probability of word w_i occurring in the whole text.
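As an illustration only, the following Python sketch shows one way the word scoring described above could be implemented. It assumes the documents are already word-segmented and whitespace-separated, and the multiplicative combination of the position factor λ, the revised tf-idf value, and the entropy H(w) is an assumption, since the weight expression above defines the quantities involved but is given only symbolically.

```python
import math
from collections import Counter

def representative_words(docs, lam=0.3, delta=0.5, top_n=5):
    """Score words with an entropy-modulated tf-idf plus a first-sentence
    position bonus, then keep the top-scoring words of each document as its
    representative words (hypothetical combination of the patent's terms)."""
    N = len(docs)
    doc_tf = [Counter(doc.split()) for doc in docs]      # term counts per document
    df = Counter(w for tf in doc_tf for w in tf)         # documents containing each word

    def entropy(w):
        # information content of w over its distribution across documents
        counts = [tf[w] for tf in doc_tf if w in tf]
        total = sum(counts)
        return -sum(c / total * math.log(c / total) for c in counts)

    result = []
    for doc, tf in zip(docs, doc_tf):
        first_sentence = set(doc.split("。")[0].split())  # crude first-sentence cut
        scores = {}
        for w, f in tf.items():
            idf = math.log(N / (df[w] + delta))           # idf with revision value delta
            pos = lam if w in first_sentence else 0.0     # word position weight factor
            scores[w] = (1.0 + pos) * f * idf * entropy(w)
        result.append([w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]])
    return result
```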
Step 2, constructing a dictionary by using representative words, setting a window, constructing biterm by using any two representative words in the window, and constructing a biterm list by using a plurality of biterm;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
referring to fig. 2, in step 3, the method for using the similarity between the seed word and the biterm word as the prior knowledge of the sampling specifically includes:
step 3.1, obtaining the biterm topic joint probability distribution of the improved BTM model by using the known text feature representation and the unknown topic distribution;
step 3.2, obtaining the conditional probability distribution of the theme required by Gibbs sampling by utilizing the biterm theme joint probability distribution of the improved BTM model by using an application chain method;
step 3.3, calculating the similarity of the biterm words and the seed words, and obtaining an improved BTM model sampling conditional probability model based on the similarity information and the occurrence times of the words in the subject in the conditional probability distribution of the subject;
and 3.4, setting a similarity threshold, and defining the similarity lower than the similarity threshold as 0.
This scheme compensates for the negative influence of word co-occurrence sparsity on topic generation: when a biterm is sampled, the probability that a biterm whose similarity to a seed word exceeds the similarity threshold belongs to the topic corresponding to that seed word is raised in proportion to the similarity. The word-topic distribution of the topic model of the present invention can thus be improved based on the semantic similarity between the given seed words and the words in the window.
The similarity matrix S between the seed words and the biterm words is defined as:

S_{i,s} = sim(w_i, w_s) if sim(w_i, w_s) ≥ ε, and S_{i,s} = 0 otherwise

where sim(w_i, w_s) denotes the similarity between biterm word w_i and seed word w_s, and ε is the specified similarity threshold.
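The thresholded similarity matrix S can be sketched as follows. Cosine similarity over word embeddings is an assumption here; the patent specifies the thresholding rule but not the underlying similarity measure.

```python
import numpy as np

def similarity_matrix(word_vecs, seed_vecs, eps=0.3):
    """S[i, s] = similarity between biterm word i and seed word s, with any
    value below the threshold eps set to 0, as in step 3.4.
    word_vecs: (V, d) word embeddings; seed_vecs: (S, d) seed-word embeddings."""
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    s = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    sim = w @ s.T                  # cosine similarity matrix
    sim[sim < eps] = 0.0           # similarities below the threshold are defined as 0
    return sim
```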
The biterm-topic joint probability distribution of the improved BTM model satisfies the following relation:

P(z, b) = θ_z · φ_{w_i|z} · φ_{w_j|z}

where θ denotes the topic probability distribution of the word pairs (biterms), and θ_z denotes the probability that a text belongs to topic z; the expression of θ_z is:

θ_z = (n_z + α) / (|B| + K·α)

where n_z denotes the number of biterms assigned to topic z, |B| denotes the total number of biterms, K denotes the number of topics, and α denotes the first Dirichlet prior parameter.
φ_{w|z} denotes the probability that each word of a biterm belongs to topic z; it is computed from n_{w|z}, the number of times word w occurs in topic z, n_{·|z}, the number of biterms belonging to topic z, M, the number of all documents, the similarity matrix S between the biterm words and the seed words, and β, the second Dirichlet prior parameter.
The sampling conditional probability model of the improved BTM model gives the conditional probability P(z | z_{-b}, B) of assigning a biterm to topic z, where B denotes the set of biterms, b = (w_i, w_j) is a biterm drawn from a document, w_i and w_j are two different representative words, n_{w_i|z} and n_{w_j|z} denote the numbers of times w_i and w_j occur in topic z, M denotes the total number of words, n_{w|z} denotes the number of times word w occurs in topic z, and z_{-b} denotes the topic assignments of all word pairs other than b. The latent topic-word distribution φ and the latent document-topic distribution θ are given conjugate Dirichlet prior distributions with parameters β and α, respectively.
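The sketch below illustrates one collapsed Gibbs sampling sweep consistent with the description above: the standard BTM conditional probability is augmented with seed-word similarity so that biterms similar to a topic's seed words are more likely to be assigned to that topic. The exact way the similarity term enters the counts (here as additive pseudo-counts in the word-topic factor) is an assumption; the patent only states that the similarity serves as prior knowledge for sampling.

```python
import numpy as np

def gibbs_sweep(biterms, z, n_z, n_wz, sim, seed_topic, alpha=0.98, beta=0.003, rng=None):
    """One sweep of collapsed Gibbs sampling over all biterms.
    biterms    : list of (wi, wj) word-id pairs
    z          : current topic assignment per biterm
    n_z        : (K,) biterm count per topic
    n_wz       : (V, K) word count per topic
    sim        : (V, S) thresholded word-to-seed similarity matrix
    seed_topic : (S,) topic index of each seed word."""
    rng = rng or np.random.default_rng()
    K, V = n_z.shape[0], n_wz.shape[0]
    bonus = np.zeros((V, K))                   # similarity pseudo-counts per word and topic
    for s, k in enumerate(seed_topic):
        bonus[:, k] += sim[:, s]
    for i, (wi, wj) in enumerate(biterms):
        k_old = z[i]                           # remove the biterm's current assignment
        n_z[k_old] -= 1; n_wz[wi, k_old] -= 1; n_wz[wj, k_old] -= 1
        denom = 2 * n_z + V * beta
        p = (n_z + alpha) \
            * (n_wz[wi] + bonus[wi] + beta) / denom \
            * (n_wz[wj] + bonus[wj] + beta) / denom
        k_new = rng.choice(K, p=p / p.sum())   # sample a new topic for the biterm
        z[i] = k_new
        n_z[k_new] += 1; n_wz[wi, k_new] += 1; n_wz[wj, k_new] += 1
    return z, n_z, n_wz
```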
And step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words.
Further, in step 4, the method of increasing, according to the contribution of the words to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to the topic specifically includes:
step 4.1, letting the probability of word w occurring in topic z be φ_{z,w}; the higher the probability that word w occurs in topic z, the more representative word w is of topic z;
K denotes the number of generated topics, k_w denotes the number of topics containing word w, and the inverse topic frequency ITF_w of word w is obtained from K and k_w;
step 4.2, multiplying the probability φ_{z,w} of word w occurring in topic z by the inverse topic frequency ITF_w to obtain the contribution R_{z,w} of word w to topic z, which is expressed as:

R_{z,w} = φ_{z,w} · ITF_w

step 4.3, setting a representative threshold and a mean value; when the contribution is greater than the representative threshold, the contribution of the subject word remains R_{z,w}; when the contribution is smaller than the representative threshold, the mean value is taken as the contribution of word w to topic z.
The contribution of the subject word is therefore expressed as:

R'_{z,w} = R_{z,w} if R_{z,w} > τ, and R'_{z,w} equals the mean value otherwise

where τ denotes the representative threshold.
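A sketch of step 4 follows. The inverse topic frequency is assumed here to take the IDF-like form log(K / k_w), and the cutoff used to decide whether a topic "contains" a word is likewise an assumption; the patent defines the quantities but not these exact choices.

```python
import numpy as np

def topic_element_words(phi, rep_threshold=0.01, top_n=10):
    """phi: (K, V) topic-word distribution.  Weight each word's probability by
    its inverse topic frequency, replace sub-threshold contributions with the
    mean contribution, and keep the top words of each topic as element words."""
    K, V = phi.shape
    k_w = np.maximum((phi > 1.0 / V).sum(axis=0), 1)   # topics "containing" word w (assumed cutoff)
    itf = np.log(K / k_w)                              # assumed inverse topic frequency
    contrib = phi * itf                                # contribution of word w to topic z
    contrib = np.where(contrib > rep_threshold, contrib, contrib.mean())
    return [np.argsort(-contrib[k])[:top_n] for k in range(K)]
```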
To verify the effectiveness of the improved BTM topic model of the present invention, microblog text data is used as an example. Microblog text data covering the period from September through August 2018 was obtained by a crawler, 92,747 documents in total, and the experiment was performed on this data. The text was segmented and part-of-speech tagged with the Baidu word segmenter, words occurring fewer than 10 times or more than 6000 times in the documents were deleted, and the representative words provided by the Baidu search index were adopted as expert domain knowledge.
The initial model parameters in this embodiment are set as follows:
the first Dirichlet prior parameter α is set to 0.98, the second Dirichlet prior parameter β is set to 0.003, the number of topics is set to n_topics = 20, the number of iterations is set to n_iter = 300, the similarity threshold ε is set to 0.3, and the window size is set to 15.
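For reference, the embodiment's settings could be collected into a single configuration object, e.g. (the names used here are illustrative only):

```python
# Hypothetical configuration mirroring the parameter values of this embodiment.
config = {
    "alpha": 0.98,         # first Dirichlet prior parameter
    "beta": 0.003,         # second Dirichlet prior parameter
    "n_topics": 20,        # number of topics
    "n_iter": 300,         # number of Gibbs sampling iterations
    "sim_threshold": 0.3,  # similarity threshold
    "window": 15,          # biterm window size
}
```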
The improved BTM topic model of the present invention, the BTM topic model, and the LDA topic model were compared in terms of topic cohesion and topic differentiation; all three models used the same dataset, and the number of topics was set to T = 20. Perplexity and KL divergence reflect topic quality through the cohesion and the differentiation of the topic-word distributions, respectively, and the present invention evaluates from these two aspects.
Perplexity is a widely used metric for evaluating the modeling accuracy of a topic model: the smaller the perplexity value, the better the accuracy of the topic modeling. The perplexity is expressed as:

perplexity = exp( - Σ_d Σ_n log p(w_{dn}) / Σ_d N_d )

where N_d denotes the number of words in document d and w_{dn} denotes the n-th word of document d. The comparison of the perplexity of the three models is shown in FIG. 3.
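Perplexity can be computed from the learned distributions as in the sketch below; obtaining a per-document topic mixture theta_d is assumed to have been done beforehand (in BTM it is usually inferred from the document's biterms).

```python
import numpy as np

def perplexity(docs, theta_d, phi):
    """docs   : list of documents, each a list of word ids
    theta_d: (D, K) per-document topic distribution
    phi    : (K, V) topic-word distribution
    Returns exp(-sum_d sum_n log p(w_dn) / sum_d N_d)."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = float(theta_d[d] @ phi[:, w])     # p(w|d) = sum_k p(k|d) * p(w|k)
            log_lik += np.log(max(p_w, 1e-12))
            n_words += 1
    return float(np.exp(-log_lik / max(n_words, 1)))
```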
As can be seen from FIG. 3, the perplexity of all three models begins to stabilize after the number of iterations exceeds 200, but the perplexity curves of BTM and of the improved BTM of the present invention converge faster and reach smaller values, so the BTM model performs better than the LDA model and can effectively mine latent topics from short documents. The improved BTM proposed by the invention has lower perplexity than the BTM model, so the topic model proposed by the invention mines news element words from microblogs more effectively. The main reasons are that BTM is better suited to topic modeling of short texts and that, by using the seed words provided by experts, the semantic information of words is made explicit through the increased semantic relevance between words and topics, which helps improve the differentiation between topics; hence the improved BTM model achieves a better effect.
KL divergence: topic differentiation is another important indicator for evaluating topic models, and is reflected mainly in the difference between the word distributions under different topics. The larger the KL divergence, the better the topic differentiation. The KL divergence is expressed as:

D_KL(P || Q) = Σ_w P(w) · log( P(w) / Q(w) )

where P and Q denote the word distributions of two topics. The comparison of the mean topic KL divergence of the three models is shown in FIG. 4.
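The mean topic KL divergence plotted in FIG. 4 can be computed as below; averaging D_KL over all ordered pairs of topics is an assumption about how the per-pair values are aggregated.

```python
import numpy as np

def mean_topic_kl(phi, eps=1e-12):
    """phi: (K, V) topic-word distribution.  Mean of
    D_KL(P || Q) = sum_w P(w) * log(P(w) / Q(w)) over all ordered topic pairs."""
    p = phi + eps                                  # avoid log(0)
    K = p.shape[0]
    kls = [float(np.sum(p[i] * np.log(p[i] / p[j])))
           for i in range(K) for j in range(K) if i != j]
    return float(np.mean(kls))
```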
As can be seen from FIG. 4, the KL divergences of the BTM and the improved BTM models are larger, and that of the improved BTM model is the largest, which indicates that the model meets the requirements of the existing domain classification standards and that the introduction of seed words has a guiding effect on topic generation, so that both the distinctiveness of the topic words and the differences between topics are improved.
Another method of measuring the effect of a topic model is by comparing topics generated by the topic model with words under the topic.
From the distribution of words in the topics in FIG. 5 and FIG. 6: although the words are close to the topics in both the improved BTM and the BTM models, and the two models are comparable in effectiveness, some strongly domain-specific, low-frequency words in the BTM results co-occur with the topics only rarely because of their low frequency, so certain strongly domain-specific words do not appear in the topics, which lowers topic quality. The LDA topic model is not suitable for topic mining on short texts; most of the words under its generated topics are high-frequency words and cannot reveal the meaning that the topics represent.
Compared with the prior art, the invention has the following beneficial effects:
1. Word pairs formed by high-frequency general words and domain-specific words from different fields distort topic modeling and cause words from different domains to be mixed into a single topic; by selecting representative words from the short text to generate the biterms before topic modeling is carried out, the problem of ill-defined topic meaning is eliminated.
2. For each topic a seed word is given, and based on the semantic similarity between the given seed words and the words in the window, the similarity between the seed words and the biterm words is used as prior knowledge for sampling; this improves the word-topic distribution of the BTM topic model, raises the probability assigned to words that are low in frequency but strongly representative, and makes such words stand out as much as possible.
3. The general words appearing in every topic of the topic-word distribution are removed using the inverse topic frequency to obtain the element words, which highlights the element words under each topic, improves the topic-word probability distribution, raises the probability that low-frequency representative words in a topic are assigned to the domain topic, and thereby improves the recognizability of the domain topics.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (1)

1. A method of mining news element words based on an improved BTM topic model, the method comprising the steps of:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
step 2, constructing a dictionary from the representative words, setting a window, constructing a biterm from any two representative words in the window, and constructing a biterm list from a plurality of biterms;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words;
in step 1, when selecting representative words, a word position weight factor is introduced for the words in the text, and the words in the first sentence of the text are given higher weight;
the information content of a word is used as a control coefficient on the tf-idf value of the subject word, so that the weights of the words at the remaining positions in the text are calculated, and representative words in the text are selected according to these word weights;
the word position weight factor is expressed as λ; the words in the first sentence of the text are given the weight factor λ; when calculating the weights of the words in the rest of the text, λ takes the value 0;
the weight of a word at the remaining positions is computed from the word's revised tf-idf value and its information entropy, wherein W(w_i) denotes the weight of word w_i at the remaining positions, H(w_i) denotes the information entropy, i.e. the information content, of word w_i, δ is the word-frequency revision value, n_i is the number of documents containing word w_i, N is the total number of documents, tf_i is the word frequency of w_i, avg(tf_i) is the average frequency of occurrence of word w_i, and m_i denotes the number of texts in the text corpus that contain word w_i;
the information entropy of word w_i is expressed as:

H(w_i) = -Σ p(w_i) · log p(w_i)

wherein p(w_i) denotes the probability of word w_i occurring in the whole text;
in step 3, the method of using the similarity between the seed words and the biterm words as prior knowledge for sampling specifically includes:
step 3.1, obtaining the biterm-topic joint probability distribution of the improved BTM model from the known text feature representation and the unknown topic distribution;
step 3.2, deriving, from the biterm-topic joint probability distribution of the improved BTM model, the conditional probability distribution over topics required by Gibbs sampling by applying the chain rule;
step 3.3, calculating the similarity between the biterm words and the seed words, and obtaining the sampling conditional probability model of the improved BTM model based on this similarity information and on the number of occurrences of the words in each topic in the conditional probability distribution over topics;
step 3.4, setting a similarity threshold, and defining any similarity lower than the similarity threshold as 0;
the biterm-topic joint probability distribution of the improved BTM model satisfies the following relation:

P(z, b) = θ_z · φ_{w_i|z} · φ_{w_j|z}

wherein θ denotes the topic probability distribution of the word pairs (biterms), and θ_z denotes the probability that a text belongs to topic z;
the expression of θ_z is:

θ_z = (n_z + α) / (|B| + K·α)

wherein n_z denotes the number of biterms assigned to topic z, |B| denotes the total number of biterms, K denotes the number of topics, and α denotes the first Dirichlet prior parameter;
φ_{w|z} denotes the probability that each word of a biterm belongs to topic z, and is computed from n_{w|z}, the number of times word w occurs in topic z, n_{·|z}, the number of biterms belonging to topic z, M, the number of all documents, the similarity matrix S between the biterm words and the seed words, and β, the second Dirichlet prior parameter;
the sampling conditional probability model of the improved BTM model gives the conditional probability P(z | z_{-b}, B) of assigning a biterm to topic z, wherein B denotes the set of biterms, b = (w_i, w_j) is a biterm drawn from a document, w_i and w_j are two different representative words, n_{w_i|z} and n_{w_j|z} denote the numbers of times w_i and w_j occur in topic z, M denotes the total number of words, n_{w|z} denotes the number of times word w occurs in topic z, and z_{-b} denotes the topic assignments of all word pairs other than b;
the similarity matrix S between the seed words and the biterm words is defined as:

S_{i,s} = sim(w_i, w_s) if sim(w_i, w_s) ≥ ε, and S_{i,s} = 0 otherwise

wherein sim(w_i, w_s) denotes the similarity between biterm word w_i and seed word w_s, and ε denotes the specified similarity threshold;
in step 4, the method of increasing, according to the contribution of the words to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to the topic specifically comprises the following steps:
step 4.1, letting the probability of word w occurring in topic z be φ_{z,w}, wherein the higher the probability that word w occurs in topic z, the more representative word w is of topic z;
K denotes the number of generated topics, k_w denotes the number of topics containing word w, and the inverse topic frequency ITF_w of word w is obtained from K and k_w;
step 4.2, multiplying the probability φ_{z,w} of word w occurring in topic z by the inverse topic frequency ITF_w to obtain the contribution R_{z,w} of word w to topic z, which is expressed as:

R_{z,w} = φ_{z,w} · ITF_w

step 4.3, setting a representative threshold and a mean value; when the contribution is greater than the representative threshold, the contribution of the subject word remains R_{z,w}; when the contribution is smaller than the representative threshold, the mean value is taken as the contribution of word w to topic z;
the contribution of the subject word is expressed as:

R'_{z,w} = R_{z,w} if R_{z,w} > τ, and R'_{z,w} equals the mean value otherwise

wherein τ denotes the representative threshold.
CN202310634323.5A 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model Active CN116432639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310634323.5A CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310634323.5A CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Publications (2)

Publication Number Publication Date
CN116432639A CN116432639A (en) 2023-07-14
CN116432639B true CN116432639B (en) 2023-08-25

Family

ID=87089359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310634323.5A Active CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Country Status (1)

Country Link
CN (1) CN116432639B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN110597993A (en) * 2019-09-17 2019-12-20 昆明理工大学 Microblog hot topic data mining method
CN111368072A (en) * 2019-08-20 2020-07-03 河北工程大学 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970595B2 (en) * 2018-06-20 2021-04-06 Netapp, Inc. Methods and systems for document classification using machine learning
US10943070B2 (en) * 2019-02-01 2021-03-09 International Business Machines Corporation Interactively building a topic model employing semantic similarity in a spoken dialog system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111368072A (en) * 2019-08-20 2020-07-03 河北工程大学 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN110597993A (en) * 2019-09-17 2019-12-20 昆明理工大学 Microblog hot topic data mining method
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved short text classification method based on BTM topic features; Zheng Cheng et al.; Computer Engineering and Applications; pp. 95-100 *

Also Published As

Publication number Publication date
CN116432639A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN109858028B (en) Short text similarity calculation method based on probability model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
US9015035B2 (en) User modification of generative model for determining topics and sentiments
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
CN109960724B (en) Text summarization method based on TF-IDF
CN106372061B (en) Short text similarity calculation method based on semantics
US20230351212A1 (en) Semi-supervised method and apparatus for public opinion text analysis
US20200019611A1 (en) Topic models with sentiment priors based on distributed representations
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN110413768B (en) Automatic generation method of article titles
CN110807326B (en) Short text keyword extraction method combining GPU-DMM and text features
CN106997379B (en) Method for merging similar texts based on click volumes of image texts
CN109101490B (en) Factual implicit emotion recognition method and system based on fusion feature representation
CN116521882A (en) Domain length text classification method and system based on knowledge graph
CN114328939B (en) Natural language processing model construction method based on big data
CN116595975A (en) Aspect-level emotion analysis method for word information enhancement based on sentence information
CN111259156A (en) Hot spot clustering method facing time sequence
CN116432639B (en) News element word mining method based on improved BTM topic model
CN110674293B (en) Text classification method based on semantic migration
CN114996442B (en) Text abstract generation system combining abstract degree discrimination and abstract optimization
Cai et al. Indonesian automatic text summarization based on a new clustering method in sentence level
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN116151258A (en) Text disambiguation method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant