CN116432639B - News element word mining method based on improved BTM topic model - Google Patents

News element word mining method based on improved BTM topic model

Info

Publication number
CN116432639B
CN116432639B CN202310634323.5A CN202310634323A
Authority
CN
China
Prior art keywords
words
word
topic
biterm
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310634323.5A
Other languages
Chinese (zh)
Other versions
CN116432639A (en)
Inventor
赵丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202310634323.5A priority Critical patent/CN116432639B/en
Publication of CN116432639A publication Critical patent/CN116432639A/en
Application granted granted Critical
Publication of CN116432639B publication Critical patent/CN116432639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a news element word mining method based on an improved BTM topic model. The method first selects representative words in a text as the feature representation of the text, constructs a dictionary from the representative words, sets a window, forms a biterm from any two representative words within the window, and builds a biterm list from the resulting biterms; it then uses the improved BTM model to generate the topic-word distribution and the document-topic distribution; finally, according to the contribution of each word to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to topics is increased, and the general words that appear in every topic are removed to obtain the topic element words. Because the invention selects representative words from the short text to generate biterms before topic modeling is carried out, it eliminates the problem that word pairs formed by high-frequency general words and domain-specific words distort topic modeling and cause words from different domains to be mixed into a single topic.

Description

News element word mining method based on improved BTM topic model
Technical Field
The invention relates to the field of text mining, in particular to a news element word mining method based on an improved BTM topic model.
Background
LDA is a probabilistic topic model (PTM) that implicitly infers hidden topics from documents based on higher-order word co-occurrence patterns. Traditional PTMs such as PLSA and LDA are unsupervised techniques that implicitly infer hidden topics from word co-occurrence. However, because microblog texts are short, the document-word co-occurrence matrix is sparse, which makes the statistical model difficult to learn and the analysis results unsatisfactory; feature selection and topic partitioning on such sparse data therefore become the main difficulties of short-text analysis.
Unlike earlier topic models, which model topics at the document level, the prior-art BTM topic model converts documents into word pairs. A word pair (biterm) is formed by two words that co-occur in a document after preprocessing; each word pair is then modeled by sampling. In this way, topic modeling of individual documents is converted into learning topics from word pairs over the whole corpus, which overcomes the sparseness of single short microblog texts and captures microblog text information better than traditional topic models. However, because BTM models topics mainly through word co-occurrence, it has the following drawbacks in practice:
(1) News texts contain strongly domain-specific words that occur with low frequency; because these words appear rarely, they co-occur with topics rarely, so certain strongly domain-specific words never appear in any topic;
(2) Word pairs are constructed for all words in the whole text. Certain high-frequency words with weak domain specificity co-occur with various strongly domain-specific low-frequency words to form word pairs, which makes the meaning of some topics difficult to interpret after topic modeling. For example, one topic (its first 10 words) reads: "salmon piece remuneration movie stars standard film and television company actor contract industry". This topic is hard to interpret; it arises because words with high frequency but weak domain specificity, such as "company" and "standard", co-occur frequently in documents with strongly domain-specific words such as "salmon" and "film", and are therefore grouped into the same topic.
(3) Although BTM is a topic model suited to short texts, when the word-pair window is large the number of extracted biterms becomes large, and when the number of topics is also large, Gibbs sampling of topics becomes very time-consuming.
Disclosure of Invention
In view of the foregoing, it is a primary object of the present invention to provide a method for mining news element words based on an improved BTM topic model, so as to solve the above-mentioned technical problems.
The invention provides a method for mining news element words based on an improved BTM topic model, which comprises the following steps:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
step 2, constructing a dictionary from the representative words, setting a window, constructing a biterm from any two representative words in the window, and constructing a biterm list from a plurality of biterms;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
and step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words.
In the invention, representative words are selected from the short text to generate biterms before topic modeling is carried out, which eliminates the distorting effect of word pairs formed by high-frequency general words and domain-specific words on topic modeling and avoids the problem of words from different domains being mixed into a single topic.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a method of mining news element words based on an improved BTM topic model in accordance with the present invention;
FIG. 2 is a probability map of an improved BTM topic model in a method for mining news element words based on the improved BTM topic model according to the present invention;
FIG. 3 is a graph comparing the perplexity of the three models in an embodiment of the present invention;
FIG. 4 is a graph comparing the mean topic KL divergence of the three models in an embodiment of the present invention;
FIG. 5 is a schematic diagram of news element words mined by the improved BTM topic model of the present invention;
FIG. 6 is a schematic diagram of news element words mined based on a BTM topic model in the prior art.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly.
Referring to fig. 1, the invention provides a method for mining news element words based on an improved BTM topic model, wherein the method comprises the following steps:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
taking microblog text as an example: because a microblog text is short, its first sentence is usually the title or a summarizing sentence, and the words in it carry important information about the text, so they are given higher weight. Therefore, when selecting representative words, the invention introduces a word position weight factor for the words in the text and gives higher weight to the words selected from the first sentence of the text.
In order to select words that carry important information in a text, and considering the contribution of the selected subject words to the whole text, information entropy, i.e. the information content of a word, is introduced. Let w_1 and w_2 be two keywords in the text; even if the number of texts in which w_1 appears equals the number of texts in which w_2 appears, the IDF values of the two keywords should not simply be treated as equal.
According to information theory, if the distribution of word w_1 is more concentrated than that of word w_2, then w_1 has a higher correlation with the text in which it appears. The invention uses the information content of a word as a control coefficient on the word's tf-idf value in the final result, i.e. a word's tf-idf value is modulated by its information content: the larger the information content, the larger the resulting tf-idf value, and the representative words in a text are selected according to these word weights. The word position weight factor is expressed as λ; the words in the first sentence of the text are given the weight factor λ, and when calculating the weights of the words in the rest of the text, λ takes the value 0.
The weight of a word at the remaining positions is computed from the word's revised tf-idf value and its information entropy, where W(w_i) denotes the weight of word w_i at the remaining positions, H(w_i) denotes the information entropy (i.e. the information content) of word w_i, δ is the word-frequency revision value, n_i is the number of documents containing word w_i, N is the total number of documents, tf_i is the word frequency of w_i, avg(tf_i) is the average frequency of occurrence of word w_i, and m_i denotes the number of texts in the text corpus that contain word w_i.
The information entropy of word w_i is expressed as:

H(w_i) = -Σ p(w_i) · log p(w_i)

where p(w_i) denotes the probability of word w_i occurring in the whole text.
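As an illustration only, the following Python sketch shows one way the word scoring described above could be implemented. It assumes the documents are already word-segmented and whitespace-separated, and the multiplicative combination of the position factor λ, the revised tf-idf value, and the entropy H(w) is an assumption, since the weight expression above defines the quantities involved but is given only symbolically.

```python
import math
from collections import Counter

def representative_words(docs, lam=0.3, delta=0.5, top_n=5):
    """Score words with an entropy-modulated tf-idf plus a first-sentence
    position bonus, then keep the top-scoring words of each document as its
    representative words (hypothetical combination of the patent's terms)."""
    N = len(docs)
    doc_tf = [Counter(doc.split()) for doc in docs]      # term counts per document
    df = Counter(w for tf in doc_tf for w in tf)         # documents containing each word

    def entropy(w):
        # information content of w over its distribution across documents
        counts = [tf[w] for tf in doc_tf if w in tf]
        total = sum(counts)
        return -sum(c / total * math.log(c / total) for c in counts)

    result = []
    for doc, tf in zip(docs, doc_tf):
        first_sentence = set(doc.split("。")[0].split())  # crude first-sentence cut
        scores = {}
        for w, f in tf.items():
            idf = math.log(N / (df[w] + delta))           # idf with revision value delta
            pos = lam if w in first_sentence else 0.0     # word position weight factor
            scores[w] = (1.0 + pos) * f * idf * entropy(w)
        result.append([w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]])
    return result
```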
Step 2, constructing a dictionary by using representative words, setting a window, constructing biterm by using any two representative words in the window, and constructing a biterm list by using a plurality of biterm;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
referring to fig. 2, in step 3, the method for using the similarity between the seed word and the biterm word as the prior knowledge of the sampling specifically includes:
step 3.1, obtaining the biterm topic joint probability distribution of the improved BTM model by using the known text feature representation and the unknown topic distribution;
step 3.2, obtaining the conditional probability distribution of the theme required by Gibbs sampling by utilizing the biterm theme joint probability distribution of the improved BTM model by using an application chain method;
step 3.3, calculating the similarity of the biterm words and the seed words, and obtaining an improved BTM model sampling conditional probability model based on the similarity information and the occurrence times of the words in the subject in the conditional probability distribution of the subject;
and 3.4, setting a similarity threshold, and defining the similarity lower than the similarity threshold as 0.
This scheme compensates for the negative influence of word co-occurrence sparsity on topic generation: when a biterm is sampled, the probability that a biterm whose similarity to a seed word exceeds the similarity threshold belongs to the topic corresponding to that seed word is raised in proportion to the similarity. The word-topic distribution of the topic model of the present invention can thus be improved based on the semantic similarity between the given seed words and the words in the window.
The similarity matrix S between the seed words and the biterm words is defined as:

S_{i,s} = sim(w_i, w_s) if sim(w_i, w_s) ≥ ε, and S_{i,s} = 0 otherwise

where sim(w_i, w_s) denotes the similarity between biterm word w_i and seed word w_s, and ε is the specified similarity threshold.
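The thresholded similarity matrix S can be sketched as follows. Cosine similarity over word embeddings is an assumption here; the patent specifies the thresholding rule but not the underlying similarity measure.

```python
import numpy as np

def similarity_matrix(word_vecs, seed_vecs, eps=0.3):
    """S[i, s] = similarity between biterm word i and seed word s, with any
    value below the threshold eps set to 0, as in step 3.4.
    word_vecs: (V, d) word embeddings; seed_vecs: (S, d) seed-word embeddings."""
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    s = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    sim = w @ s.T                  # cosine similarity matrix
    sim[sim < eps] = 0.0           # similarities below the threshold are defined as 0
    return sim
```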
The biterm-topic joint probability distribution of the improved BTM model satisfies the following relation:

P(z, b) = θ_z · φ_{w_i|z} · φ_{w_j|z}

where θ denotes the topic probability distribution of the word pairs (biterms), and θ_z denotes the probability that a text belongs to topic z; the expression of θ_z is:

θ_z = (n_z + α) / (|B| + K·α)

where n_z denotes the number of biterms assigned to topic z, |B| denotes the total number of biterms, K denotes the number of topics, and α denotes the first Dirichlet prior parameter.
φ_{w|z} denotes the probability that each word of a biterm belongs to topic z; it is computed from n_{w|z}, the number of times word w occurs in topic z, n_{·|z}, the number of biterms belonging to topic z, M, the number of all documents, the similarity matrix S between the biterm words and the seed words, and β, the second Dirichlet prior parameter.
The sampling conditional probability model of the improved BTM model gives the conditional probability P(z | z_{-b}, B) of assigning a biterm to topic z, where B denotes the set of biterms, b = (w_i, w_j) is a biterm drawn from a document, w_i and w_j are two different representative words, n_{w_i|z} and n_{w_j|z} denote the numbers of times w_i and w_j occur in topic z, M denotes the total number of words, n_{w|z} denotes the number of times word w occurs in topic z, and z_{-b} denotes the topic assignments of all word pairs other than b. The latent topic-word distribution φ and the latent document-topic distribution θ are given conjugate Dirichlet prior distributions with parameters β and α, respectively.
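The sketch below illustrates one collapsed Gibbs sampling sweep consistent with the description above: the standard BTM conditional probability is augmented with seed-word similarity so that biterms similar to a topic's seed words are more likely to be assigned to that topic. The exact way the similarity term enters the counts (here as additive pseudo-counts in the word-topic factor) is an assumption; the patent only states that the similarity serves as prior knowledge for sampling.

```python
import numpy as np

def gibbs_sweep(biterms, z, n_z, n_wz, sim, seed_topic, alpha=0.98, beta=0.003, rng=None):
    """One sweep of collapsed Gibbs sampling over all biterms.
    biterms    : list of (wi, wj) word-id pairs
    z          : current topic assignment per biterm
    n_z        : (K,) biterm count per topic
    n_wz       : (V, K) word count per topic
    sim        : (V, S) thresholded word-to-seed similarity matrix
    seed_topic : (S,) topic index of each seed word."""
    rng = rng or np.random.default_rng()
    K, V = n_z.shape[0], n_wz.shape[0]
    bonus = np.zeros((V, K))                   # similarity pseudo-counts per word and topic
    for s, k in enumerate(seed_topic):
        bonus[:, k] += sim[:, s]
    for i, (wi, wj) in enumerate(biterms):
        k_old = z[i]                           # remove the biterm's current assignment
        n_z[k_old] -= 1; n_wz[wi, k_old] -= 1; n_wz[wj, k_old] -= 1
        denom = 2 * n_z + V * beta
        p = (n_z + alpha) \
            * (n_wz[wi] + bonus[wi] + beta) / denom \
            * (n_wz[wj] + bonus[wj] + beta) / denom
        k_new = rng.choice(K, p=p / p.sum())   # sample a new topic for the biterm
        z[i] = k_new
        n_z[k_new] += 1; n_wz[wi, k_new] += 1; n_wz[wj, k_new] += 1
    return z, n_z, n_wz
```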
And step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words.
Further, in step 4, the method of increasing, according to the contribution of the words to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to the topic specifically includes:
step 4.1, letting the probability of word w occurring in topic z be φ_{z,w}; the higher the probability that word w occurs in topic z, the more representative word w is of topic z;
K denotes the number of generated topics, k_w denotes the number of topics containing word w, and the inverse topic frequency ITF_w of word w is obtained from K and k_w;
step 4.2, multiplying the probability φ_{z,w} of word w occurring in topic z by the inverse topic frequency ITF_w to obtain the contribution R_{z,w} of word w to topic z, which is expressed as:

R_{z,w} = φ_{z,w} · ITF_w

step 4.3, setting a representative threshold and a mean value; when the contribution is greater than the representative threshold, the contribution of the subject word remains R_{z,w}; when the contribution is smaller than the representative threshold, the mean value is taken as the contribution of word w to topic z.
The contribution of the subject word is therefore expressed as:

R'_{z,w} = R_{z,w} if R_{z,w} > τ, and R'_{z,w} equals the mean value otherwise

where τ denotes the representative threshold.
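A sketch of step 4 follows. The inverse topic frequency is assumed here to take the IDF-like form log(K / k_w), and the cutoff used to decide whether a topic "contains" a word is likewise an assumption; the patent defines the quantities but not these exact choices.

```python
import numpy as np

def topic_element_words(phi, rep_threshold=0.01, top_n=10):
    """phi: (K, V) topic-word distribution.  Weight each word's probability by
    its inverse topic frequency, replace sub-threshold contributions with the
    mean contribution, and keep the top words of each topic as element words."""
    K, V = phi.shape
    k_w = np.maximum((phi > 1.0 / V).sum(axis=0), 1)   # topics "containing" word w (assumed cutoff)
    itf = np.log(K / k_w)                              # assumed inverse topic frequency
    contrib = phi * itf                                # contribution of word w to topic z
    contrib = np.where(contrib > rep_threshold, contrib, contrib.mean())
    return [np.argsort(-contrib[k])[:top_n] for k in range(K)]
```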
To verify the effectiveness of the improved BTM topic model of the present invention, microblog text data is used as an example. Microblog text data covering the period from September through August 2018 was obtained by a crawler, 92,747 documents in total, and the experiment was performed on this data. The text was segmented and part-of-speech tagged with the Baidu word segmenter, words occurring fewer than 10 times or more than 6000 times in the documents were deleted, and the representative words provided by the Baidu search index were adopted as expert domain knowledge.
The initial model parameters in this embodiment are set as follows:
the first Dirichlet prior parameter α is set to 0.98, the second Dirichlet prior parameter β is set to 0.003, the number of topics is set to n_topics = 20, the number of iterations is set to n_iter = 300, the similarity threshold ε is set to 0.3, and the window size is set to 15.
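For reference, the embodiment's settings could be collected into a single configuration object, e.g. (the names used here are illustrative only):

```python
# Hypothetical configuration mirroring the parameter values of this embodiment.
config = {
    "alpha": 0.98,         # first Dirichlet prior parameter
    "beta": 0.003,         # second Dirichlet prior parameter
    "n_topics": 20,        # number of topics
    "n_iter": 300,         # number of Gibbs sampling iterations
    "sim_threshold": 0.3,  # similarity threshold
    "window": 15,          # biterm window size
}
```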
The improved BTM topic model of the present invention, the BTM topic model, and the LDA topic model were compared in terms of topic cohesion and topic differentiation; all three models used the same dataset, and the number of topics was set to T = 20. Perplexity and KL divergence reflect topic quality through the cohesion and the differentiation of the topic-word distributions, respectively, and the present invention evaluates from these two aspects.
Perplexity is a widely used metric for evaluating the modeling accuracy of a topic model: the smaller the perplexity value, the better the accuracy of the topic modeling. The perplexity is expressed as:

perplexity = exp( - Σ_d Σ_n log p(w_{dn}) / Σ_d N_d )

where N_d denotes the number of words in document d and w_{dn} denotes the n-th word of document d. The comparison of the perplexity of the three models is shown in FIG. 3.
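Perplexity can be computed from the learned distributions as in the sketch below; obtaining a per-document topic mixture theta_d is assumed to have been done beforehand (in BTM it is usually inferred from the document's biterms).

```python
import numpy as np

def perplexity(docs, theta_d, phi):
    """docs   : list of documents, each a list of word ids
    theta_d: (D, K) per-document topic distribution
    phi    : (K, V) topic-word distribution
    Returns exp(-sum_d sum_n log p(w_dn) / sum_d N_d)."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = float(theta_d[d] @ phi[:, w])     # p(w|d) = sum_k p(k|d) * p(w|k)
            log_lik += np.log(max(p_w, 1e-12))
            n_words += 1
    return float(np.exp(-log_lik / max(n_words, 1)))
```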
As can be seen from FIG. 3, the perplexity of all three models begins to stabilize after the number of iterations exceeds 200, but the perplexity curves of BTM and of the improved BTM of the present invention converge faster and reach smaller values, so the BTM model performs better than the LDA model and can effectively mine latent topics from short documents. The improved BTM proposed by the invention has lower perplexity than the BTM model, so the topic model proposed by the invention mines news element words from microblogs more effectively. The main reasons are that BTM is better suited to topic modeling of short texts and that, by using the seed words provided by experts, the semantic information of words is made explicit through the increased semantic relevance between words and topics, which helps improve the differentiation between topics; hence the improved BTM model achieves a better effect.
KL divergence: topic differentiation is another important indicator for evaluating topic models, and is reflected mainly in the difference between the word distributions under different topics. The larger the KL divergence, the better the topic differentiation. The KL divergence is expressed as:

D_KL(P || Q) = Σ_w P(w) · log( P(w) / Q(w) )

where P and Q denote the word distributions of two topics. The comparison of the mean topic KL divergence of the three models is shown in FIG. 4.
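The mean topic KL divergence plotted in FIG. 4 can be computed as below; averaging D_KL over all ordered pairs of topics is an assumption about how the per-pair values are aggregated.

```python
import numpy as np

def mean_topic_kl(phi, eps=1e-12):
    """phi: (K, V) topic-word distribution.  Mean of
    D_KL(P || Q) = sum_w P(w) * log(P(w) / Q(w)) over all ordered topic pairs."""
    p = phi + eps                                  # avoid log(0)
    K = p.shape[0]
    kls = [float(np.sum(p[i] * np.log(p[i] / p[j])))
           for i in range(K) for j in range(K) if i != j]
    return float(np.mean(kls))
```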
As can be seen from FIG. 4, the KL divergences of the BTM and the improved BTM models are larger, and that of the improved BTM model is the largest, which indicates that the model meets the requirements of the existing domain classification standards and that the introduction of seed words has a guiding effect on topic generation, so that both the distinctiveness of the topic words and the differences between topics are improved.
Another method of measuring the effect of a topic model is by comparing topics generated by the topic model with words under the topic.
From the distribution of words in the topics in FIG. 5 and FIG. 6: although the words are close to the topics in both the improved BTM and the BTM models, and the two models are comparable in effectiveness, some strongly domain-specific, low-frequency words in the BTM results co-occur with the topics only rarely because of their low frequency, so certain strongly domain-specific words do not appear in the topics, which lowers topic quality. The LDA topic model is not suitable for topic mining on short texts; most of the words under its generated topics are high-frequency words and cannot reveal the meaning that the topics represent.
Compared with the prior art, the invention has the following beneficial effects:
1. Word pairs formed by high-frequency general words and domain-specific words from different fields distort topic modeling and cause words from different domains to be mixed into a single topic; by selecting representative words from the short text to generate the biterms before topic modeling is carried out, the problem of ill-defined topic meaning is eliminated.
2. For each topic a seed word is given, and based on the semantic similarity between the given seed words and the words in the window, the similarity between the seed words and the biterm words is used as prior knowledge for sampling; this improves the word-topic distribution of the BTM topic model, raises the probability assigned to words that are low in frequency but strongly representative, and makes such words stand out as much as possible.
3. The general words appearing in every topic of the topic-word distribution are removed using the inverse topic frequency to obtain the element words, which highlights the element words under each topic, improves the topic-word probability distribution, raises the probability that low-frequency representative words in a topic are assigned to the domain topic, and thereby improves the recognizability of the domain topics.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (1)

1. A method of mining news element words based on an improved BTM topic model, the method comprising the steps of:
step 1, acquiring a plurality of texts, selecting representative words in the texts, and taking the representative words as the feature representation of the texts;
step 2, constructing a dictionary from the representative words, setting a window, constructing a biterm from any two representative words in the window, and constructing a biterm list from a plurality of biterms;
step 3, setting initial parameters according to domain knowledge, setting seed words of known topics in the specific domain, performing Gibbs sampling using the similarity between the seed words and the biterm words as prior knowledge for sampling, counting the number of times texts are assigned to topics and the number of times words are assigned to topics until the Gibbs parameters converge, thereby obtaining an improved BTM model, and generating the topic-word distribution and the document-topic distribution from the final sampling results of the improved BTM model;
step 4, according to the contribution of the words to the topics, increasing the probability that low-frequency representative words in the topic-word distribution are assigned to the topics, and then removing the words with low contribution in each topic to obtain the topic element words;
in step 1, when selecting representative words, a word position weight factor is introduced for the words in the text, and the words in the first sentence of the text are given higher weight;
the information content of a word is used as a control coefficient on the tf-idf value of the subject word, so that the weights of the words at the remaining positions in the text are calculated, and representative words in the text are selected according to these word weights;
the word position weight factor is expressed as λ; the words in the first sentence of the text are given the weight factor λ; when calculating the weights of the words in the rest of the text, λ takes the value 0;
the weight of a word at the remaining positions is computed from the word's revised tf-idf value and its information entropy, wherein W(w_i) denotes the weight of word w_i at the remaining positions, H(w_i) denotes the information entropy, i.e. the information content, of word w_i, δ is the word-frequency revision value, n_i is the number of documents containing word w_i, N is the total number of documents, tf_i is the word frequency of w_i, avg(tf_i) is the average frequency of occurrence of word w_i, and m_i denotes the number of texts in the text corpus that contain word w_i;
the information entropy of word w_i is expressed as:

H(w_i) = -Σ p(w_i) · log p(w_i)

wherein p(w_i) denotes the probability of word w_i occurring in the whole text;
in step 3, the method of using the similarity between the seed words and the biterm words as prior knowledge for sampling specifically includes:
step 3.1, obtaining the biterm-topic joint probability distribution of the improved BTM model from the known text feature representation and the unknown topic distribution;
step 3.2, deriving, from the biterm-topic joint probability distribution of the improved BTM model, the conditional probability distribution over topics required by Gibbs sampling by applying the chain rule;
step 3.3, calculating the similarity between the biterm words and the seed words, and obtaining the sampling conditional probability model of the improved BTM model based on this similarity information and on the number of occurrences of the words in each topic in the conditional probability distribution over topics;
step 3.4, setting a similarity threshold, and defining any similarity lower than the similarity threshold as 0;
the biterm-topic joint probability distribution of the improved BTM model satisfies the following relation:

P(z, b) = θ_z · φ_{w_i|z} · φ_{w_j|z}

wherein θ denotes the topic probability distribution of the word pairs (biterms), and θ_z denotes the probability that a text belongs to topic z;
the expression of θ_z is:

θ_z = (n_z + α) / (|B| + K·α)

wherein n_z denotes the number of biterms assigned to topic z, |B| denotes the total number of biterms, K denotes the number of topics, and α denotes the first Dirichlet prior parameter;
φ_{w|z} denotes the probability that each word of a biterm belongs to topic z, and is computed from n_{w|z}, the number of times word w occurs in topic z, n_{·|z}, the number of biterms belonging to topic z, M, the number of all documents, the similarity matrix S between the biterm words and the seed words, and β, the second Dirichlet prior parameter;
the sampling conditional probability model of the improved BTM model gives the conditional probability P(z | z_{-b}, B) of assigning a biterm to topic z, wherein B denotes the set of biterms, b = (w_i, w_j) is a biterm drawn from a document, w_i and w_j are two different representative words, n_{w_i|z} and n_{w_j|z} denote the numbers of times w_i and w_j occur in topic z, M denotes the total number of words, n_{w|z} denotes the number of times word w occurs in topic z, and z_{-b} denotes the topic assignments of all word pairs other than b;
the similarity matrix S between the seed words and the biterm words is defined as:

S_{i,s} = sim(w_i, w_s) if sim(w_i, w_s) ≥ ε, and S_{i,s} = 0 otherwise

wherein sim(w_i, w_s) denotes the similarity between biterm word w_i and seed word w_s, and ε denotes the specified similarity threshold;
in step 4, the method of increasing, according to the contribution of the words to a topic, the probability that low-frequency representative words in the topic-word distribution are assigned to the topic specifically comprises the following steps:
step 4.1, letting the probability of word w occurring in topic z be φ_{z,w}, wherein the higher the probability that word w occurs in topic z, the more representative word w is of topic z;
K denotes the number of generated topics, k_w denotes the number of topics containing word w, and the inverse topic frequency ITF_w of word w is obtained from K and k_w;
step 4.2, multiplying the probability φ_{z,w} of word w occurring in topic z by the inverse topic frequency ITF_w to obtain the contribution R_{z,w} of word w to topic z, which is expressed as:

R_{z,w} = φ_{z,w} · ITF_w

step 4.3, setting a representative threshold and a mean value; when the contribution is greater than the representative threshold, the contribution of the subject word remains R_{z,w}; when the contribution is smaller than the representative threshold, the mean value is taken as the contribution of word w to topic z;
the contribution of the subject word is expressed as:

R'_{z,w} = R_{z,w} if R_{z,w} > τ, and R'_{z,w} equals the mean value otherwise

wherein τ denotes the representative threshold.
CN202310634323.5A 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model Active CN116432639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310634323.5A CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310634323.5A CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Publications (2)

Publication Number Publication Date
CN116432639A CN116432639A (en) 2023-07-14
CN116432639B true CN116432639B (en) 2023-08-25

Family

ID=87089359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310634323.5A Active CN116432639B (en) 2023-05-31 2023-05-31 News element word mining method based on improved BTM topic model

Country Status (1)

Country Link
CN (1) CN116432639B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN110597993A (en) * 2019-09-17 2019-12-20 昆明理工大学 Microblog hot topic data mining method
CN111368072A (en) * 2019-08-20 2020-07-03 河北工程大学 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970595B2 (en) * 2018-06-20 2021-04-06 Netapp, Inc. Methods and systems for document classification using machine learning
US10943070B2 (en) * 2019-02-01 2021-03-09 International Business Machines Corporation Interactively building a topic model employing semantic similarity in a spoken dialog system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method
CN110134958A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text Topics Crawling method based on semantic word network
CN111368072A (en) * 2019-08-20 2020-07-03 河北工程大学 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN110597993A (en) * 2019-09-17 2019-12-20 昆明理工大学 Microblog hot topic data mining method
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved short text classification method based on BTM topic features; Zheng Cheng et al.; Computer Engineering and Applications; pp. 95-100 *

Also Published As

Publication number Publication date
CN116432639A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN109858028B (en) Short text similarity calculation method based on probability model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
US9015035B2 (en) User modification of generative model for determining topics and sentiments
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
CN109960724B (en) Text summarization method based on TF-IDF
CN106372061B (en) Short text similarity calculation method based on semantics
US20230351212A1 (en) Semi-supervised method and apparatus for public opinion text analysis
US20200019611A1 (en) Topic models with sentiment priors based on distributed representations
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN110413768B (en) Automatic generation method of article titles
CN110807326B (en) Short text keyword extraction method combining GPU-DMM and text features
CN106997379B (en) Method for merging similar texts based on click volumes of image texts
CN109101490B (en) Factual implicit emotion recognition method and system based on fusion feature representation
CN116521882A (en) Domain length text classification method and system based on knowledge graph
CN114328939B (en) Natural language processing model construction method based on big data
CN116595975A (en) Aspect-level emotion analysis method for word information enhancement based on sentence information
CN111259156A (en) Hot spot clustering method facing time sequence
CN116432639B (en) News element word mining method based on improved BTM topic model
CN110674293B (en) Text classification method based on semantic migration
CN114996442B (en) Text abstract generation system combining abstract degree discrimination and abstract optimization
Cai et al. Indonesian automatic text summarization based on a new clustering method in sentence level
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN116151258A (en) Text disambiguation method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant