CN108399162A - Topic discovery method based on a phrase-bag topic model - Google Patents
Topic discovery method based on a phrase-bag topic model
- Publication number
- CN108399162A (application CN201810233489.5A)
- Authority
- CN
- China
- Prior art keywords
- phrase
- topic
- theme
- frequent
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the fields of natural language processing and machine learning. It aims to solve the problem that the bag-of-words model loses the association information between words, so that topic words cannot accurately reflect topic information. The method first uses the FP-growth algorithm to quickly generate frequent phrases, then mines candidate phrases by exploiting the property that text data follows a Gaussian distribution. Topic modeling is then performed under the phrase-bag assumption, and the "topic-phrase" probability distribution is corrected using the Sa function of the "topic-word" probability distributions of the words in a phrase under the same topic. Finally, topics are expressed with the generated topic phrases. The method yields accurate topic distributions, high topic discovery accuracy, and highly readable topic descriptions; it is useful for monitoring microblog public opinion and has good application and promotion value.
Description
Technical field
The invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the fields of natural language processing and machine learning.
Background art
Microblogs are an important platform on which users follow current events and express their views. However, the volume of topic discussion is huge and contains much irrelevant and redundant content, so users cannot obtain useful information quickly and accurately. Topic discovery can therefore help users obtain information intuitively and conveniently. Most existing topic discovery methods express topics with isolated words, whereas topics presented as phrases are more readable and more informative, and help users grasp topic information accurately and comprehensively.
Social text vectors are high-dimensional and sparse, which degrades topic discovery. Existing topic discovery methods therefore mainly use topic models to mine hidden semantic information. Two basic problems remain for such methods: 1. how to overcome the loss of useful information caused by the bag-of-words model, which harms topic clustering and limits topic discovery accuracy; 2. how to overcome the poor readability and the ambiguity of topics expressed with isolated topic words.
Summary of the invention
The purpose of the invention is to solve the problem that the bag-of-words model loses the association information between words, so that topic words cannot accurately reflect topic information. To this end, a microblog topic discovery method based on a phrase-bag topic model is proposed.
The design principle of the invention is as follows. The data are first preprocessed, which includes filtering noisy tags, Chinese word segmentation, and stop-word removal. Frequent phrases are then mined with the FP-growth algorithm and recombined, using the Gaussian distribution property of text, to generate candidate phrases. Finally, a phrase topic model is used for analysis to generate topic phrases. A schematic of the microblog topic discovery method based on a phrase-bag topic model is shown in Fig. 1.
The technical solution of the invention is implemented through the following steps:
Step 1: preprocess the data, as follows:
Step 1.1: filter the noise symbols in the microblog dataset and convert between traditional and simplified Chinese characters.
Step 1.2: perform word segmentation and part-of-speech tagging on the dataset with a Chinese word segmentation tool.
Step 1.3: remove microblog posts that contain fewer than 4 effective words.
Step 2: on the basis of step 1, mine phrases, as follows:
Step 2.1: generate frequent phrases with the FP-growth association algorithm and count their occurrences.
Step 2.2: generate high-quality candidate phrases using the Gaussian distribution property of text.
Step 3: on the basis of step 2, generate topic phrases, as follows:
Step 3.1: perform topic modeling under the phrase-bag assumption, and correct the "topic-phrase" probability distribution using the Sa function of the "topic-word" probability distributions of the words in a phrase under the same topic.
Step 3.2: express each topic with the generated topic phrases.
Advantageous effects
Compared with LDA-based topic discovery methods, the phrase-bag topic model method of the invention incorporates into the topic model the association that the words in a phrase belong to the same topic, which improves topic clustering.
Compared with TopMine-based topic discovery methods, the invention additionally accounts for the difference between the probability distributions of the words $\{w_{d,g,i}\}$ in a phrase under topic $z_j$ and under topic $z_i$, and uses it to correct the "topic-phrase" probability distribution, which yields better topic descriptions.
Description of the drawings
Fig. 1 is a schematic of the topic discovery method of the invention;
Fig. 2 compares results under different numbers of topics in the specific embodiment;
Fig. 3 compares results under different numbers of iterations in the specific embodiment.
Detailed description of the embodiments
To better illustrate the objects and advantages of the invention, embodiments of the method are described in further detail below with reference to the drawings and examples.
The detailed process is:
Step 1: preprocess the data.
Step 1.1: use regular expressions to filter noise symbols such as HTML tags from the microblog dataset, and convert between traditional and simplified Chinese characters.
Step 1.2: perform word segmentation and part-of-speech tagging on the dataset with the NLPIR Chinese word segmentation system.
Step 1.3: remove microblog posts with fewer than 4 effective words; effective words are generally nouns, verbs, adjectives, numerals, time words, and the like.
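The preprocessing in steps 1.1 and 1.3 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the tag-matching regular expression, the function names, and the part-of-speech prefixes (`n`, `v`, `a`, `m`, `t`) are assumptions, and the real tag set depends on the segmenter used in step 1.2. Segmentation and traditional/simplified conversion are left out because they need external tools.

```python
import re

# Assumed coarse POS tag prefixes for "effective" words (noun, verb,
# adjective, numeral, time word); adjust to the segmenter's actual tag set.
EFFECTIVE_POS_PREFIXES = ("n", "v", "a", "m", "t")

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML-like tags


def strip_noise(text):
    """Step 1.1 (partial): remove HTML-like tags and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def effective_words(tagged_tokens):
    """Keep the words whose POS tag starts with an effective prefix."""
    return [w for w, pos in tagged_tokens
            if pos.startswith(EFFECTIVE_POS_PREFIXES)]


def keep_document(tagged_tokens, min_effective=4):
    """Step 1.3: keep a post only if it has at least `min_effective` effective words."""
    return len(effective_words(tagged_tokens)) >= min_effective
```

Here `tagged_tokens` is assumed to be a list of `(word, pos_tag)` pairs produced by whatever segmenter is used.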
Step 2: on the basis of step 1, mine phrases.
Step 2.1: mine frequent phrases. Frequent phrases are extracted using the following two rules:
(1) Downward closure principle: if phrase P is not frequent, then any phrase containing P is also not frequent;
(2) Antimonotonicity of the data: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n.
First, following the downward closure principle, frequent patterns are mined with the FP-growth algorithm: a set of active indices is maintained, and phrases are generated in accordance with downward closure. The active indices are the positions in a document of the frequent phrases of length n. Then, according to the antimonotonicity rule, it is decided whether a document needs further mining.
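The two pruning rules can be illustrated with a short sketch. It is not an FP-growth implementation: it counts contiguous word sequences level by level, which is a simpler technique swapped in here only to show how active indices, downward closure, and the antimonotonicity rule interact. The function name and the `min_support` parameter are assumptions.

```python
from collections import Counter


def mine_frequent_phrases(docs, min_support):
    """docs: list of token lists. Returns {phrase tuple: count} for every
    contiguous phrase that occurs at least `min_support` times."""
    counts = Counter()
    for doc in docs:
        for tok in doc:
            counts[(tok,)] += 1

    # Active indices: start positions of phrases that may still be extended.
    active = {d: list(range(len(doc))) for d, doc in enumerate(docs)}
    n = 2  # length of the phrases counted in the current pass
    while active:
        next_active = {}
        for d, idxs in active.items():
            doc = docs[d]
            # Downward closure: only start positions whose (n-1)-word
            # phrase is frequent can begin a frequent n-word phrase.
            frequent = {i for i in idxs
                        if counts[tuple(doc[i:i + n - 1])] >= min_support}
            # An n-word phrase at i needs frequent sub-phrases at i and i + 1.
            starts = [i for i in sorted(frequent) if i + 1 in frequent]
            for i in starts:
                counts[tuple(doc[i:i + n])] += 1
            # Antimonotonicity: a document with no candidates is dropped
            # and never scanned again.
            if starts:
                next_active[d] = starts
        active = next_active
        n += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

Dropping documents from `active` as soon as they have no candidate start positions is what keeps later passes cheap: long phrases are only searched for in the few documents that can still contain them.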
Step 2.2: generate candidate phrases. On the basis of step 2.1, a bottom-up agglomerative procedure is applied to every document. Each frequent phrase is combined with its adjacent left and right words to form new phrases, and the importance sig of each new phrase is computed; phrases that reach the threshold are added to the candidate phrase set, and the phrase at the corresponding position in the document is updated. Next, the phrase Best with the highest importance is selected from the candidate phrase set and merged with its left and right words in the document; merged phrases that reach the threshold are added to the candidate phrase set, and Best is removed. This process is iterated until the sig value of Best falls below the threshold or the candidate phrase set is empty.
The number of occurrences of phrase P in the corpus follows the distribution $h_0(f(P)) = \mathcal{N}(Lp(P),\ Lp(P)(1-p(P))) \approx \mathcal{N}(Lp(P),\ Lp(P))$, where $p(P)$ is the probability that a Bernoulli trial for phrase P succeeds. For a phrase $P_0$ formed from phrases $P_1$ and $P_2$, the expected frequency of $P_0$ and the importance $sig(P_1, P_2)$ are each given by a formula that is not preserved in this text. The index $sig(P_1, P_2)$ measures the degree of association between $P_1$ and $P_2$: the larger it is, the more likely $P_1$ and $P_2$ belong to the same phrase.
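Since the formulas for the expected frequency and for sig are not reproduced here, the sketch below assumes the form commonly used in ToPMine-style phrase mining: under the null hypothesis that $P_1$ and $P_2$ occur independently, the expected frequency is $\mu_0 = f(P_1) f(P_2) / L$, and $sig(P_1, P_2) \approx (f(P_0) - \mu_0) / \sqrt{f(P_0)}$. Taking $L$ to be the total number of tokens in the corpus is also an assumption. The merge loop is a simplified version of the bottom-up procedure: it repeatedly merges the adjacent pair with the highest sig and does not maintain a separate candidate set.

```python
import math


def significance(f1, f2, f12, total_tokens):
    """Assumed ToPMine-style score: standard deviations by which the observed
    frequency f12 of the merged phrase exceeds its expectation under independence."""
    if f12 == 0:
        return float("-inf")
    mu0 = f1 * f2 / total_tokens
    return (f12 - mu0) / math.sqrt(f12)


def segment(doc, counts, total_tokens, threshold):
    """Greedy bottom-up merging of adjacent phrases in one document.
    doc: list of tokens; counts: {phrase tuple: frequency}."""
    phrases = [(tok,) for tok in doc]
    while len(phrases) > 1:
        scored = [
            (significance(counts.get(phrases[i], 0),
                          counts.get(phrases[i + 1], 0),
                          counts.get(phrases[i] + phrases[i + 1], 0),
                          total_tokens), i)
            for i in range(len(phrases) - 1)
        ]
        best_sig, i = max(scored)
        if best_sig < threshold:  # stop once the best merge is below threshold
            break
        phrases[i:i + 2] = [phrases[i] + phrases[i + 1]]
    return phrases
```

The `counts` dictionary is assumed to come from a frequent phrase mining pass such as step 2.1; phrases missing from it are treated as having frequency zero and are never merged.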
Step 3: on the basis of step 2, generate topic phrases.
Step 3.1: the content of every document is segmented into candidate phrases and words, which converts the document from bag-of-words form into phrase-bag form. Within a generated phrase the words are strongly associated, so the phrase topic model assumes that all words in a phrase share one latent topic. The statistical properties of the probability distributions of the words $\{w_{d,g,i}\}$ in a phrase under the same topic are used to correct the Gibbs sampling algorithm for parameter estimation; the parameter optimization equation is not preserved in this text. The "topic-phrase" distribution is corrected with the improved Gibbs sampling scheme: when the words in a phrase have the same topic, the "topic-phrase" probability distribution is updated.
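The parameter update equation is not reproduced here, so the following is only a sketch of the underlying idea that all words in a phrase share one topic. It assumes a PhraseLDA-style collapsed Gibbs conditional, in which the weight of topic $k$ for a phrase is the product over its words of $(\alpha + N_{d,k} + j)(\beta + N_{w_j,k}) / (V\beta + N_k + j)$, and it omits the Sa-function correction described in the text. The function name, the symmetric hyperparameters `alpha` and `beta`, and the input format are assumptions.

```python
import random


def gibbs_phrase_topics(docs, vocab_size, num_topics,
                        alpha=0.1, beta=0.01, iters=50, seed=0):
    """docs: list of documents; each document is a list of phrases; each
    phrase is a tuple of integer word ids (a single word is a 1-tuple)."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # words in doc d with topic k
    n_wk = [[0] * num_topics for _ in range(vocab_size)]  # word w assigned to topic k
    n_k = [0] * num_topics                                # all words assigned to topic k

    def update(d, phrase, k, delta):
        for w in phrase:
            n_dk[d][k] += delta
            n_wk[w][k] += delta
            n_k[k] += delta

    # Random initialisation: one topic per phrase, shared by all its words.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for phrase, k in zip(doc, z[d]):
            update(d, phrase, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for g, phrase in enumerate(doc):
                update(d, phrase, z[d][g], -1)  # remove the current assignment
                weights = []
                for k in range(num_topics):
                    p = 1.0
                    for j, w in enumerate(phrase):
                        p *= ((alpha + n_dk[d][k] + j) * (beta + n_wk[w][k])
                              / (vocab_size * beta + n_k[k] + j))
                    weights.append(p)
                # Sample one topic for the whole phrase.
                z[d][g] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, phrase, z[d][g], +1)
    return z, n_wk, n_k
```

Sampling one topic per phrase, instead of one per word, is what carries the word association found during phrase mining into the topic model.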
Step 3.2: after sampling converges, the phrases are ranked by their "topic-phrase" probability, and the top six topic phrases of each topic are chosen to express that topic.
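The ranking in step 3.2 can be sketched as below. The sketch estimates the "topic-phrase" probability as the relative frequency of a phrase among all phrases assigned to a topic, which is an assumption and not the corrected distribution described in step 3.1. The function name and input format are likewise illustrative.

```python
from collections import Counter, defaultdict


def top_phrases(docs, assignments, top_n=6):
    """docs: list of documents, each a list of phrase tuples.
    assignments: topic id for each phrase, in the same nested layout.
    Returns {topic: [(phrase, estimated probability), ...]}."""
    per_topic = defaultdict(Counter)
    for doc, topics in zip(docs, assignments):
        for phrase, k in zip(doc, topics):
            per_topic[k][phrase] += 1

    result = {}
    for k, counter in per_topic.items():
        total = sum(counter.values())
        # Sort by descending count; break ties by phrase for a stable order.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[k] = [(p, c / total) for p, c in ranked[:top_n]]
    return result
```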
Claims (5)
1. A topic discovery method based on a phrase-bag topic model, characterized in that:
frequent phrases are first quickly generated with the FP-growth algorithm, and candidate phrases are then mined using the property that text data follows a Gaussian distribution; topic modeling is then performed under the phrase-bag assumption, and the "topic-phrase" probability distribution is corrected using the Sa function of the "topic-word" probability distributions of the words in a phrase under the same topic; finally, topics are expressed with the generated topic phrases;
the method specifically comprises the following steps:
Step 1: the dataset is input into a preprocessing module; noise symbols such as HTML tags in the microblog dataset are filtered with regular expressions, and traditional and simplified Chinese characters are converted; word segmentation and part-of-speech tagging are then performed on the dataset with a word segmentation tool, and microblog posts with fewer than 4 effective words are removed;
Step 2: phrase mining: frequent phrases are first extracted using two rules while their occurrences are counted, the two rules being (1) the downward closure principle: if phrase P is not frequent, then any phrase containing P is also not frequent; and (2) antimonotonicity of the data: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n; then, using the property that text follows a Gaussian distribution, frequent items are merged with their left and right words to form new phrases;
Step 3: topic modeling is performed; the "topic-phrase" probability distribution is corrected using the Sa function of the "topic-word" probability distributions of the words in a phrase under the same topic, and finally topics are expressed with the generated topic phrases.
2. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, following the downward closure principle, frequent patterns are mined with the FP-growth algorithm; a set of active indices is maintained and phrases are generated in accordance with downward closure; the antimonotonicity rule is then used to decide whether a document needs further mining.
3. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, frequent phrases are combined with their left and right words using a bottom-up agglomerative merging method to generate new phrases, and, guided by the importance sig of the phrases, the phrase with the highest degree of association is merged iteratively.
4. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the association between words within frequent phrases is used to supplement the inter-word association information lost by the LDA model.
5. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the statistical properties of the probability distributions of the words $\{w_{d,g,i}\}$ in a phrase under the same topic are used to correct the Gibbs sampling algorithm for parameter estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810233489.5A CN108399162A (en) | 2018-03-21 | 2018-03-21 | Topic discovery method based on a phrase-bag topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810233489.5A CN108399162A (en) | 2018-03-21 | 2018-03-21 | Topic discovery method based on a phrase-bag topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108399162A (en) | 2018-08-14 |
Family
ID=63093035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810233489.5A Pending CN108399162A (en) | 2018-03-21 | 2018-03-21 | Topic discovery method based on a phrase-bag topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108399162A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134951A (en) * | 2019-04-29 | 2019-08-16 | 淮阴工学院 | A kind of method and system for analyzing the potential theme phrase of text data |
CN111178048A (en) * | 2019-12-31 | 2020-05-19 | 微梦创科网络科技(中国)有限公司 | Smooth phrase topic model-based topic extraction method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100114859A1 (en) * | 2008-10-31 | 2010-05-06 | Yahoo! Inc. | System and method for generating an online summary of a collection of documents |
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
CN105573985A (en) * | 2016-03-04 | 2016-05-11 | 北京理工大学 | Sentence expression method based on Chinese sentence meaning structural model and topic model |
Non-Patent Citations (2)
Title |
---|
张琴 等: "基于PhraseLDA模型的主题短语挖掘方法研究", 《图书情报工作》 * |
杨柯帆: "中文微博短文本主题挖掘方法研究与原型系统开发", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134951A (en) * | 2019-04-29 | 2019-08-16 | 淮阴工学院 | A kind of method and system for analyzing the potential theme phrase of text data |
CN110134951B (en) * | 2019-04-29 | 2021-08-31 | 淮阴工学院 | Method and system for analyzing text data potential subject phrases |
CN111178048A (en) * | 2019-12-31 | 2020-05-19 | 微梦创科网络科技(中国)有限公司 | Smooth phrase topic model-based topic extraction method and device |
CN111178048B (en) * | 2019-12-31 | 2023-08-01 | 微梦创科网络科技(中国)有限公司 | Topic extraction method and device based on smooth phrase topic model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN106294593B (en) | In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
Viola et al. | Learning to extract information from semi-structured text using a discriminative context free grammar | |
CN108363816A (en) | Open entity relation extraction method based on sentence justice structural model | |
CN108874878A (en) | A kind of building system and method for knowledge mapping | |
CN105608218A (en) | Intelligent question answering knowledge base establishment method, establishment device and establishment system | |
US20100241416A1 (en) | Adaptive pattern learning for bilingual data mining | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN103914494A (en) | Method and system for identifying identity of microblog user | |
CN106874410A (en) | Chinese microblogging text mood sorting technique and its system based on convolutional neural networks | |
US10303770B2 (en) | Determining confidence levels associated with attribute values of informational objects | |
CN109697288B (en) | Instance alignment method based on deep learning | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN101404033A (en) | Automatic generation method and system for noumenon hierarchical structure | |
CN108363725A (en) | A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label | |
CN110188359B (en) | Text entity extraction method | |
CN112948543A (en) | Multi-language multi-document abstract extraction method based on weighted TextRank | |
CN109086355A (en) | Hot spot association relationship analysis method and system based on theme of news word | |
CN104281565A (en) | Semantic dictionary constructing method and device | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN114997288A (en) | Design resource association method | |
CN110377695A (en) | A kind of public sentiment subject data clustering method, device and storage medium | |
US10339223B2 (en) | Text processing system, text processing method and storage medium storing computer program | |
CN106610949A (en) | Text feature extraction method based on semantic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180814 |