CN108399162A

CN108399162A - Topic discovery method based on a phrase-bag topic model

Info

Publication number: CN108399162A
Application number: CN201810233489.5A
Authority: CN
Inventors: 潘丽敏; 李筱雅; 罗森林; 郭佳
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-08-14

Abstract

The present invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the field of natural language processing and machine learning. Its purpose is to solve the problem that the bag-of-words model loses the association information between words and therefore cannot accurately reflect topic information. The invention first uses the FP-growth algorithm to quickly generate frequent phrases, and then mines candidate phrases using the property that text data follows a Gaussian distribution. Theme modeling is then carried out under the phrase-bag assumption, and the "theme-phrase" probability distribution is corrected using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme. Finally, each topic is expressed with the generated theme phrases. The invention offers high accuracy in theme distribution and topic discovery and highly readable topic descriptions, is useful for monitoring public opinion on microblogs, and has good application and promotion value.

Description

Topic discovery method based on a phrase-bag topic model

Technical field

The present invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the field of natural language processing and machine learning.

Background art

Microblogs are an important platform on which users follow current events and express their views. However, the huge volume of topic discussion contains much content that is irrelevant to the topic or redundant, so users cannot obtain effective information quickly and accurately. Topic discovery can therefore help users obtain information intuitively and conveniently. Most existing topic discovery methods express a topic with isolated words, whereas a topic presented in phrase form is more readable and more informative and helps users grasp topic information accurately and comprehensively.

Social text vectors are high-dimensional and sparse, which degrades topic discovery. To address this, existing topic discovery methods mainly use topic models to mine hidden semantic information. Two basic problems nevertheless remain to be solved: 1. how to overcome the loss of effective information caused by the bag-of-words model, which impairs theme clustering and limits the accuracy of topic discovery; 2. how to overcome the poor readability and potential ambiguity of topics expressed with isolated theme words.

Summary of the invention

The purpose of the present invention is to solve the problems that the bag-of-words model loses the association information between words and that theme words cannot accurately reflect topic information. To this end, a microblog topic discovery method based on a phrase-bag topic model is proposed.

The design principle of the present invention is as follows. The data are first preprocessed, including filtering noisy tags, Chinese word segmentation, and stop-word removal. Frequent phrases are then mined with the FP-growth algorithm and recombined according to the Gaussian distribution property of text to generate candidate phrases. Finally, a phrase topic model is used for analysis to generate topic phrases. A schematic of the microblog topic discovery method based on the phrase-bag topic model is shown in Fig. 1.

The technical solution of the present invention is realized through the following steps:

Step 1: preprocess the data. The detailed process is as follows:

Step 1.1: filter the noise symbols in the microblog data set and convert between traditional and simplified Chinese characters.

Step 1.2: perform word segmentation and part-of-speech tagging on the data set using a Chinese word segmentation tool.

Step 1.3: remove microblog texts containing fewer than 4 effective words.

Step 2: on the basis of step 1, mine phrases. The detailed process is as follows:

Step 2.1: generate frequent phrases using the FP-growth association algorithm and count their occurrences.

Step 2.2: generate high-quality candidate phrases using the Gaussian distribution property of text.

Step 3: on the basis of step 2, generate topic phrases. The detailed process is as follows:

Step 3.1: carry out theme modeling under the phrase-bag assumption, and correct the "theme-phrase" probability distribution using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme.

Step 3.2: express each topic with the generated theme phrases.

Advantageous effects

Compared with LDA-based topic discovery methods, the topic discovery method based on the phrase-bag topic model used in the present invention incorporates into the topic model the association that the words in a phrase belong to one theme, which improves theme clustering.

Compared with TopMine-based topic discovery methods, the present invention additionally takes into account the difference between the probability distributions of the terms {w_{d,g,i}} in a phrase under theme z_j and under theme z_i, and uses it to correct the "theme-phrase" probability distribution, thereby obtaining better topic expression results.

Brief description of the drawings

Fig. 1 is a schematic diagram of the topic discovery method of the present invention;

Fig. 2 is a comparison chart for different numbers of themes in the detailed embodiment;

Fig. 3 is a comparison chart for different numbers of iterations in the detailed embodiment.

Detailed description of the embodiments

To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to the accompanying drawings and examples.

The detailed process is as follows:

Step 1: preprocess the data.

Step 1.1: use regular expressions to filter noise symbols such as html tags from the microblog data set, and convert between traditional and simplified Chinese characters.

Step 1.2: perform word segmentation and part-of-speech tagging on the data set using the NLPIR Chinese word segmentation system.

Step 1.3: remove microblog texts containing fewer than 4 effective words, where effective words are generally nouns, verbs, adjectives, numerals, time words, and the like.
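By way of a non-limiting illustration, steps 1.1 and 1.3 may be sketched in Python as follows. The sketch assumes that segmentation and part-of-speech tagging (step 1.2) have already produced `(word, tag)` pairs. The tag prefixes in `EFFECTIVE_POS_PREFIXES` and all function names are assumptions introduced for the example and are not taken from the NLPIR system. Traditional/simplified conversion is not shown.

```python
import re

# Assumed POS tag prefixes for "effective" words:
# noun, verb, adjective, numeral, time word. The tag names are hypothetical.
EFFECTIVE_POS_PREFIXES = ("n", "v", "a", "m", "t")

TAG_RE = re.compile(r"<[^>]+>")  # matches html tags such as <br/> or <a href=...>


def strip_noise(text):
    """Step 1.1 (partial): remove html tags, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def count_effective(tagged_tokens):
    """Count (word, tag) pairs whose tag starts with an effective prefix."""
    return sum(
        1 for _, pos in tagged_tokens if pos.startswith(EFFECTIVE_POS_PREFIXES)
    )


def keep_document(tagged_tokens, min_effective=4):
    """Step 1.3: keep a post only if it has at least min_effective effective words."""
    return count_effective(tagged_tokens) >= min_effective
```

A post would be passed through `strip_noise` before segmentation, and `keep_document` would be applied to the tagged output to discard short posts.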

Step 2: on the basis of step 1, mine phrases.

Step 2.1: mine frequent phrases. Frequent phrases are extracted using the following two rules:

(1) Downward closure principle: if a phrase P is not frequent, then any phrase containing P is also not frequent;

(2) Data antimonotonicity: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n;

First, frequent-item patterns are mined with the FP-growth algorithm according to the downward closure principle: a set of active indices is maintained, and phrases are generated in accordance with downward closure. The active indices are the index information of the frequent phrases of length n in a document. Then, according to the data antimonotonicity rule, it is determined whether a document needs further mining.
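By way of a non-limiting illustration, the active-index bookkeeping of step 2.1 may be sketched in Python as follows. The sketch counts contiguous phrases level by level and applies both pruning rules. It does not build an FP-tree, so it is a simplified stand-in for the FP-growth mining named above, not the method itself. The function name `mine_frequent_phrases` and the parameter `min_support` are assumptions introduced for the example.

```python
from collections import Counter


def mine_frequent_phrases(docs, min_support):
    """docs: list of token lists. Returns {phrase tuple: count} for frequent phrases."""
    counts = Counter((tok,) for doc in docs for tok in doc)
    frequent = {p: c for p, c in counts.items() if c >= min_support}

    # active[d]: start positions in document d of frequent phrases of length n
    active = {}
    for d, doc in enumerate(docs):
        idx = [i for i, tok in enumerate(doc) if (tok,) in frequent]
        if idx:
            active[d] = idx

    n = 1
    while active:
        level = Counter()
        for d, idx in active.items():
            doc, idx_set = docs[d], set(idx)
            for i in idx:
                # Downward closure: a phrase of length n + 1 starting at i can be
                # frequent only if the length-n phrases at i and i + 1 are frequent.
                if i + 1 in idx_set:
                    level[tuple(doc[i:i + n + 1])] += 1
        new_frequent = {p: c for p, c in level.items() if c >= min_support}
        frequent.update(new_frequent)
        n += 1

        next_active = {}
        for d, idx in active.items():
            kept = [i for i in idx if tuple(docs[d][i:i + n]) in new_frequent]
            # Data antimonotonicity: a document with no frequent phrase of
            # length n is dropped and never examined again.
            if kept:
                next_active[d] = kept
        active = next_active
    return frequent
```

With `docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]` and `min_support=2`, the second document is dropped once no frequent phrase of length 3 is found in it, while the other two continue to be mined.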

Step 2.2: candidate phrase generation. On the basis of step 2.1, a bottom-up agglomerative method is applied to every document. A frequent phrase is combined with its adjacent left and right words to form new phrases, and the importance sig of each new phrase is calculated; phrases whose sig reaches the threshold are added to the candidate phrase set, and the phrase at the corresponding position in the document is updated. Next, the phrase with the highest importance, Best, is selected from the candidate phrase set and merged with its left and right words in the document; a merged phrase that reaches the threshold is added to the candidate phrase set, and Best is removed from the set. This process is iterated until the sig value of Best falls below the threshold or the candidate phrase set is empty.
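By way of a non-limiting illustration, the bottom-up merging idea may be sketched in Python as follows. The sketch is a simplification: it repeatedly merges the adjacent pair with the highest score in one document, instead of maintaining the candidate phrase set and the phrase Best described above. The scoring function `sig` is passed in as a callable, and the names `merge_phrases` and `threshold` are assumptions introduced for the example.

```python
def merge_phrases(tokens, sig, threshold):
    """Greedily merge adjacent units while the best score reaches the threshold.

    tokens: list of words of one document.
    sig: callable(left_tuple, right_tuple) -> float (assumed scoring function).
    Returns the document as a list of phrase tuples.
    """
    units = [(t,) for t in tokens]
    while len(units) > 1:
        scores = [sig(units[i], units[i + 1]) for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no adjacent pair is significant enough to merge
        # Replace the two adjacent units by their concatenation.
        units[best:best + 2] = [units[best] + units[best + 1]]
    return units
```

For example, with a toy score table in which ("new",)+("york",) scores 5.0 and ("new", "york")+("city",) scores 3.0, a threshold of 2.0 turns `["visit", "new", "york", "city"]` into the units `("visit",)` and `("new", "york", "city")`.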

The probability distribution of the number of occurrences f(P) of a phrase P in the corpus is: h₀(f(P)) = N(L·p(P), L·p(P)·(1 − p(P))) ≈ N(L·p(P), L·p(P)), where p(P) is the probability of success of the Bernoulli trial for phrase P. For a phrase P₀ formed from phrase P₁ and phrase P₂, the mean frequency of P₀ is calculated as: [formula not available in the source text]. The importance sig(P₁, P₂) is: [formula not available in the source text]. This index measures the degree of association between P₁ and P₂: the larger the index, the greater the probability that P₁ and P₂ belong to the same phrase.
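By way of a non-limiting illustration, one possible form of the two quantities may be sketched in Python as follows. The formulas in the sketch are assumptions: they follow the form commonly used in TopMine-style phrase mining (expected frequency under independence, and a score in standard deviations with the variance approximated by the observed count) and may differ from the formulas intended here. `L` is taken to be the number of Bernoulli trials, and the function names are introduced for the example.

```python
import math


def expected_frequency(f1, f2, L):
    """Assumed mean count of P1 followed by P2 under independence.

    Equals L * p(P1) * p(P2) with p(P) estimated as f(P) / L.
    """
    return f1 * f2 / L


def significance(f0, f1, f2, L):
    """Assumed importance: (observed - expected) / sqrt(observed).

    f0 is the observed count of the merged phrase P0; the variance of the
    normal approximation is approximated by f0 itself (assumption).
    """
    if f0 == 0:
        return float("-inf")  # a phrase that never occurs is never merged
    return (f0 - expected_frequency(f1, f2, L)) / math.sqrt(f0)
```

With f(P₁) = 100, f(P₂) = 200 and L = 10000, the assumed expected frequency is 2; an observed count of 16 for P₀ then gives (16 − 2) / 4 = 3.5.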

Step 3: on the basis of step 2, generate topic phrases.

Step 3.1: the content of every document is segmented into candidate phrases and words, converting the document from bag-of-words form into phrase-bag form. Within a generated phrase the terms are strongly associated, so the phrase topic model assumes that all words in a phrase share one latent theme. The statistical properties of the probability distributions of the terms {w_{d,g,i}} in a phrase under the same theme are used to correct the Gibbs sampling algorithm for parameter estimation. The parameter optimization equation is: [formula not available in the source text]

The "theme-phrase" distribution is corrected using the improved Gibbs sampling scheme: the "theme-phrase" probability distribution is updated when the words in a phrase have the same theme.
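By way of a non-limiting illustration, the shared-theme assumption may be sketched in Python as follows. The sketch computes, for one phrase, an unnormalised weight per theme using a collapsed Gibbs update of the kind used in phrase-constrained LDA, and then samples a single theme for the whole phrase. It does not include the Sa-function correction, since the parameter optimization equation is not available in the text. The symmetric hyperparameters `alpha` and `beta`, the count layout, and all function names are assumptions introduced for the example.

```python
import random


def phrase_theme_weights(phrase, doc_theme, theme_word, theme_total,
                         alpha, beta, vocab_size):
    """Unnormalised weight of each theme for a phrase whose words share a theme.

    The counts are expected to exclude the phrase being resampled.
    doc_theme[k]:   number of words in this document assigned to theme k
    theme_word[k]:  dict mapping word -> count under theme k
    theme_total[k]: total number of words assigned to theme k
    """
    weights = []
    for k in range(len(doc_theme)):
        w = 1.0
        for j, word in enumerate(phrase):
            # The j-th word of the phrase sees j earlier words of the same
            # phrase already counted under theme k.
            w *= alpha + doc_theme[k] + j
            w *= (beta + theme_word[k].get(word, 0)) / (
                vocab_size * beta + theme_total[k] + j
            )
        weights.append(w)
    return weights


def sample_theme(weights, rng=random):
    """Draw one theme index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Because one theme is drawn per phrase, every word in the phrase receives the same assignment, which is the constraint the phrase-bag assumption imposes.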

Step 3.2: after sampling converges, the phrases are sorted by "theme-phrase" probability, and for each theme the top six theme phrases are selected to express the topic.
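By way of a non-limiting illustration, the selection in step 3.2 may be sketched in Python as follows. The nested-dictionary layout of `theme_phrase_prob` and the function name are assumptions introduced for the example.

```python
def top_theme_phrases(theme_phrase_prob, top_n=6):
    """Return, for each theme, the top_n phrases by "theme-phrase" probability.

    theme_phrase_prob: dict mapping theme id -> dict mapping phrase -> probability
    (hypothetical layout).
    """
    return {
        theme: [
            phrase
            for phrase, _ in sorted(
                probs.items(), key=lambda kv: kv[1], reverse=True
            )[:top_n]
        ]
        for theme, probs in theme_phrase_prob.items()
    }
```

A theme with fewer than six phrases simply returns all of its phrases in descending order of probability.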

Claims

1. A topic discovery method based on a phrase-bag topic model, characterized in that:

Frequent phrases are first generated quickly using the FP-growth algorithm, and candidate phrases are then mined using the property that text data follows a Gaussian distribution; theme modeling is then carried out under the phrase-bag assumption, and the "theme-phrase" probability distribution is corrected using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme; finally, the topic is expressed with the generated theme phrases; the method specifically comprises the following steps:

Step 1: the data set is input into a preprocessing module; noise symbols such as html tags are filtered from the microblog data set using regular expressions, and traditional and simplified Chinese characters are converted; word segmentation and part-of-speech tagging are then performed on the data set using a word segmentation tool, and microblog texts containing fewer than 4 effective words are removed;

Step 2: phrase mining. Frequent phrases are first extracted using two rules while their occurrences are counted. The two rules are: (1) the downward closure principle: if a phrase P is not frequent, then any phrase containing P is also not frequent; (2) data antimonotonicity: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n. The property that text follows a Gaussian distribution is then used to merge frequent items with their left and right words to form new phrases;

Step 3: theme modeling is carried out; the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme is used to correct the "theme-phrase" probability distribution, and finally the topic is expressed with the generated theme phrases.

2. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, frequent-item patterns are mined with the FP-growth algorithm according to the downward closure principle, a set of active indices is maintained and phrases are generated in accordance with downward closure, and the data antimonotonicity rule is then used to determine whether a document needs further mining.

3. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, frequent phrases are combined with their left and right words using a bottom-up agglomerative merging method to generate new phrases, with the phrase importance sig as the guide, iteratively merging the phrases with the highest degree of association.

4. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the association that exists between the words in a frequent phrase is used to supplement the inter-word association information lost by the LDA model.

5. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the statistical properties of the probability distributions of the terms {w_{d,g,i}} in a phrase under the same theme are used to correct the Gibbs sampling algorithm for parameter estimation.