CN108399162A - Topic discovery method based on a phrase-bag topic model

Info

Publication number
CN108399162A
Authority
CN
China
Prior art keywords
phrase
topic
theme
frequent
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810233489.5A
Other languages
Chinese (zh)
Inventor
潘丽敏
李筱雅
罗森林
郭佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201810233489.5A
Publication of CN108399162A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the fields of natural language processing and machine learning. Its purpose is to solve the problem that the bag-of-words model loses the association information between words and that isolated theme words cannot accurately reflect topic information. The invention first uses the FP-growth algorithm to quickly generate frequent phrases, and then mines candidate phrases using the property that text data follows a Gaussian distribution. Theme modeling is then carried out under the phrase-bag assumption, and the "theme-phrase" probability distribution is corrected using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme. Finally, topics are expressed with the generated theme phrases. The invention achieves high accuracy in theme distribution and topic discovery and produces highly readable topic expressions. It is useful for monitoring microblog public opinion and has good application and promotion value.

Description

Topic discovery method based on a phrase-bag topic model
Technical field
The present invention relates to a topic discovery method based on a phrase-bag topic model and belongs to the fields of natural language processing and machine learning.
Background art
Microblogs are an important platform on which users follow current events and express their opinions. However, the volume of topic discussion is huge and contains much content that is irrelevant to the topic or redundant, so users cannot obtain useful information quickly and accurately. Topic discovery can therefore help users obtain information intuitively and conveniently. Most existing topic discovery methods express topics with isolated words, whereas topics presented as phrases are more readable and more informative, and help users grasp topic information accurately and comprehensively.
Social text vectors are high-dimensional and sparse, which makes topic discovery ineffective. To address this, existing topic discovery methods mainly mine hidden semantic information with topic models. However, two basic problems remain to be solved for such methods: 1. how to overcome the loss of useful information caused by the bag-of-words model, which degrades theme clustering and limits the accuracy of topic discovery; 2. how to overcome the poor readability and the ambiguity of topics expressed with isolated theme words.
Summary of the invention
The purpose of the present invention is to solve the problem that the bag-of-words model loses the association information between words and that theme words cannot accurately reflect topic information. To this end, a microblog topic discovery method based on a phrase-bag topic model is proposed.
The design principle of the present invention is as follows. The data are first preprocessed, including filtering noisy tags, Chinese word segmentation, and stop-word removal. Frequent phrases are then mined with the FP-growth algorithm and recombined using the Gaussian distribution characteristics of text to generate candidate phrases. Finally, a phrase topic model is used for analysis to generate topic phrases. A schematic of the microblog topic discovery method based on the phrase-bag topic model is shown in Fig. 1.
The technical solution of the present invention is achieved by the following steps:
Step 1: Preprocess the data. The detailed process is as follows:
Step 1.1: Filter the noise symbols in the microblog dataset and convert between traditional and simplified Chinese characters.
Step 1.2: Perform word segmentation and part-of-speech tagging on the dataset using a Chinese word segmentation tool.
Step 1.3: Remove microblog texts with fewer than 4 effective words.
Step 2: On the basis of step 1, mine phrases. The detailed process is as follows:
Step 2.1: Generate frequent phrases using the FP-growth association algorithm and count their occurrences.
Step 2.2: Generate high-quality candidate phrases using the Gaussian distribution characteristics of text.
Step 3: On the basis of step 2, generate topic phrases. The detailed process is as follows:
Step 3.1: Carry out theme modeling under the phrase-bag assumption, and correct the "theme-phrase" probability distribution using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme.
Step 3.2: Express topics with the generated theme phrases.
Beneficial effects
Compared with LDA-based topic discovery methods, the topic discovery method based on the phrase-bag topic model used by the present invention incorporates into the topic model the constraint that the words in a phrase belong to one theme, which improves theme clustering.
Compared with the TopMine-based topic discovery method, the present invention additionally accounts for the difference between the probability distributions of the lexical items $\{w_{d,g,i}\}$ of a phrase under theme $z_j$ and under theme $z_i$, thereby correcting the "theme-phrase" probability distribution and yielding better topic expressions.
Description of the drawings
Fig. 1 is a schematic of the topic discovery method of the present invention;
Fig. 2 is a comparison chart for different numbers of themes in the detailed embodiment;
Fig. 3 is a comparison chart for different numbers of iterations in the detailed embodiment.
Detailed description of the embodiments
To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to the accompanying drawings and examples.
The detailed process is:
Step 1: Preprocess the data.
Step 1.1: Use regular expressions to filter noise symbols such as html tags from the microblog dataset, and convert between traditional and simplified Chinese characters.
Step 1.2: Perform word segmentation and part-of-speech tagging on the dataset using the NLPIR Chinese word segmentation system.
Step 1.3: Remove microblog texts with fewer than 4 effective words. Effective words generally refer to nouns, verbs, adjectives, numerals, time words, and the like.
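Steps 1.1 and 1.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the patent: segmentation and part-of-speech tagging (step 1.2) are assumed to have been done already by an external tool, so each post arrives as a list of `(word, pos)` pairs; the tag letters in `EFFECTIVE_POS`, the URL filter, and all function names are hypothetical; traditional/simplified conversion is omitted.

```python
import re

# Assumption: coarse POS letters for noun, verb, adjective, numeral, time word.
# The actual tag set depends on the segmentation tool that is used.
EFFECTIVE_POS = {"n", "v", "a", "m", "t"}

TAG_RE = re.compile(r"<[^>]+>")        # html-like tags
URL_RE = re.compile(r"https?://\S+")   # links, treated here as noise (assumption)


def strip_noise(text: str) -> str:
    """Step 1.1 (partial): remove html tags and URLs, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def effective_words(tagged):
    """Keep words whose coarse POS letter (first character) is effective."""
    return [word for word, pos in tagged if pos[:1] in EFFECTIVE_POS]


def keep_document(tagged, min_effective: int = 4) -> bool:
    """Step 1.3: keep a post only if it has at least `min_effective` effective words."""
    return len(effective_words(tagged)) >= min_effective
```

A post is cleaned with `strip_noise` before segmentation, and `keep_document` is applied to the tagged output to discard posts that are too short.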
Step 2: On the basis of step 1, mine phrases.
Step 2.1: Mine frequent phrases. Frequent phrases are extracted using the following two rules:
(1) Downward closure principle: if phrase P is not frequent, then any phrase containing P is not frequent either;
(2) Data anti-monotonicity: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n.
First, following the downward closure principle, frequent patterns are mined with the FP-growth algorithm. A set of active indices is maintained and combined with the downward closure criterion to generate phrases; the active indices are the positions in a document of the frequent phrases of length n. Then, according to the data anti-monotonicity rule, it is determined whether a document needs to be mined further.
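The two pruning rules and the active-index bookkeeping can be sketched as follows. This is a minimal level-wise sketch of contiguous frequent-phrase mining, not an FP-tree implementation of FP-growth; the function name `mine_frequent_phrases` and the `min_support` parameter are illustrative assumptions, and documents are assumed to be already segmented into token lists.

```python
from collections import Counter


def mine_frequent_phrases(docs, min_support):
    """Return {phrase tuple: count} for contiguous phrases with count >= min_support."""
    counts = Counter((tok,) for doc in docs for tok in doc)
    frequent = {p: c for p, c in counts.items() if c >= min_support}

    # active[d] holds the start positions in document d of frequent phrases of length n.
    active = {}
    for d, doc in enumerate(docs):
        idxs = [i for i, tok in enumerate(doc) if (tok,) in frequent]
        if idxs:
            active[d] = idxs

    n = 1
    while active:
        candidate_counts = Counter()
        starts_by_doc = {}
        for d, idxs in active.items():
            idx_set = set(idxs)
            # Downward closure: a phrase of length n+1 starting at i can only be
            # frequent if its length-n prefix (at i) and suffix (at i+1) are frequent.
            starts = [i for i in idxs if i + 1 in idx_set]
            if starts:
                starts_by_doc[d] = starts
                for i in starts:
                    candidate_counts[tuple(docs[d][i:i + n + 1])] += 1
        n += 1
        new_frequent = {p: c for p, c in candidate_counts.items() if c >= min_support}
        frequent.update(new_frequent)

        # Anti-monotonicity: a document with no frequent phrase of length n
        # is dropped and never examined for longer phrases.
        active = {}
        for d, starts in starts_by_doc.items():
            kept = [i for i in starts if tuple(docs[d][i:i + n]) in new_frequent]
            if kept:
                active[d] = kept
    return frequent
```

Each pass only extends positions that survived the previous pass, so the work per level shrinks as documents run out of frequent phrases.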
Step 2.2: Generate candidate phrases. On the basis of step 2.1, a bottom-up agglomerative method is applied to every document. Each frequent phrase is combined with its adjacent left and right words to form a new phrase, and the importance sig of the new phrase is computed. Phrases that reach the threshold are added to the candidate phrase set, and the phrase at the corresponding position in the document is updated. The phrase Best with the highest importance is then chosen from the candidate phrase set and merged with its left and right neighboring words in the document; the merged phrase is added to the candidate phrase set if it reaches the threshold, and phrase Best is removed. The above process is iterated, and the iteration ends when the sig value of phrase Best falls below the threshold or the candidate phrase set is empty.
The number of occurrences $f(P)$ of phrase $P$ in the corpus follows the distribution

$$h_0(f(P)) = \mathcal{N}\big(Lp(P),\ Lp(P)(1-p(P))\big) \approx \mathcal{N}\big(Lp(P),\ Lp(P)\big),$$

where $p(P)$ is the success probability of the Bernoulli trial for phrase $P$. For a phrase $P_0$ formed from phrase $P_1$ and phrase $P_2$, the mean frequency of $P_0$ is computed as: [formula not reproduced in the source text]. The importance $\mathrm{sig}(P_1, P_2)$ is: [formula not reproduced in the source text]. This index measures the degree of association between $P_1$ and $P_2$: the larger the index, the more likely it is that $P_1$ and $P_2$ belong to the same phrase.
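The formulas for the expected frequency and for sig are not reproduced in this text. The sketch below therefore assumes the significance score commonly used in ToPMine-style phrase mining: the expected count of the merged phrase under independence is $L\,p(P_1)\,p(P_2)$, and the score is (observed - expected) / sqrt(observed), with $L$ taken as the total token count. Whether the patent uses exactly this form is an assumption. The merging loop is also simplified: it greedily merges the best adjacent pair in one document instead of maintaining a global candidate phrase set. The names `significance` and `segment` are hypothetical.

```python
import math


def significance(f12, f1, f2, total_tokens):
    """Assumed ToPMine-style score: (observed - expected) / sqrt(observed)."""
    if f12 == 0:
        return float("-inf")
    expected = f1 * f2 / total_tokens  # L * p(P1) * p(P2) with p(P) = f(P) / L
    return (f12 - expected) / math.sqrt(f12)


def segment(doc, counts, total_tokens, threshold):
    """Greedy bottom-up merging of adjacent phrases within one document."""
    phrases = [(tok,) for tok in doc]
    while len(phrases) > 1:
        scored = []
        for i in range(len(phrases) - 1):
            left, right = phrases[i], phrases[i + 1]
            s = significance(counts.get(left + right, 0),
                             counts.get(left, 0),
                             counts.get(right, 0),
                             total_tokens)
            scored.append((s, -i))  # -i makes the earliest pair win ties
        best_sig, neg_i = max(scored)
        if best_sig < threshold:
            break  # no adjacent pair is significant enough to merge
        i = -neg_i
        phrases[i:i + 2] = [phrases[i] + phrases[i + 1]]
    return phrases
```

Here `counts` is a phrase-frequency table such as the output of a frequent-phrase miner, and `threshold` plays the role of the sig threshold in step 2.2.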
Step 3: On the basis of step 2, generate topic phrases.
Step 3.1: The content of every document is segmented into candidate phrases and words, which converts the document from bag-of-words form into phrase-bag form. The lexical items within a generated phrase are strongly associated, so the phrase topic model assumes that all words in a phrase share one latent theme. The statistical properties of the probability distributions of the lexical items $\{w_{d,g,i}\}$ of a phrase under the same theme are used to correct the Gibbs sampling algorithm for parameter estimation. The parameter optimization equation is: [equation not reproduced in the source text]
The improved Gibbs sampling scheme corrects the "theme-phrase" distribution: the "theme-phrase" probability distribution is updated when the words in a phrase have the same theme.
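The phrase-bag assumption (all words of a phrase share one latent theme) can be illustrated with a collapsed Gibbs sampler that draws one theme per phrase. The conditional used below is the phrase-constrained LDA form known from PhraseLDA-style models and is an assumption here; the patent's own corrected update (the Sa-function correction) is not reproduced in this text and is not implemented. The function name, the hyperparameters `alpha` and `beta`, and the integer word-id encoding are illustrative.

```python
import random


def phrase_gibbs(docs, num_topics, vocab_size, iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs[d][g] is a phrase given as a list of integer word ids.

    Returns (z, n_wk): z[d][g] is the theme of phrase g in document d,
    n_wk[w][k] counts how often word w is assigned to theme k.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]
    n_wk = [[0] * num_topics for _ in range(vocab_size)]
    n_k = [0] * num_topics

    def update(d, phrase, k, delta):
        # Every word of the phrase is counted under the phrase's single theme.
        for w in phrase:
            n_dk[d][k] += delta
            n_wk[w][k] += delta
            n_k[k] += delta

    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for phrase, k in zip(doc, z[d]):
            update(d, phrase, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for g, phrase in enumerate(doc):
                update(d, phrase, z[d][g], -1)  # exclude the current phrase
                weights = []
                for k in range(num_topics):
                    p = 1.0
                    # Product over the words of the phrase; repeated words inside
                    # one phrase are not treated specially (an approximation).
                    for j, w in enumerate(phrase):
                        p *= ((alpha + n_dk[d][k] + j) * (beta + n_wk[w][k])
                              / (vocab_size * beta + n_k[k] + j))
                    weights.append(p)
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                z[d][g] = k_new
                update(d, phrase, k_new, +1)
    return z, n_wk
```

Single words are handled as phrases of length one, in which case the update reduces to the usual collapsed Gibbs step for LDA.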
Step 3.2: After sampling converges, phrases are ranked by their "theme-phrase" probability, and the top six theme phrases of each theme are chosen to express the topic.
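The ranking in step 3.2 can be sketched as follows. As an assumption, the "theme-phrase" probability is estimated here by the relative frequency with which each phrase is assigned to a theme in the final sample; the patent's corrected distribution may differ. The function name `top_phrases` and its arguments are hypothetical.

```python
from collections import Counter, defaultdict


def top_phrases(docs, assignments, top_n=6):
    """Return {theme: [(phrase, estimated probability), ...]} with at most top_n entries.

    docs[d][g] is a phrase (a sequence of words); assignments[d][g] is its theme.
    """
    per_theme = defaultdict(Counter)
    for doc, themes in zip(docs, assignments):
        for phrase, k in zip(doc, themes):
            per_theme[k][tuple(phrase)] += 1

    result = {}
    for k, counter in per_theme.items():
        total = sum(counter.values())
        # Sort by descending count; ties are broken by the phrase itself.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[k] = [(phrase, count / total) for phrase, count in ranked[:top_n]]
    return result
```

With `top_n=6` this yields the six highest-ranked phrases per theme, which are then used to express the topic.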

Claims (5)

1. A topic discovery method based on a phrase-bag topic model, characterized in that:
Frequent phrases are first quickly generated with the FP-growth algorithm, and candidate phrases are then mined using the property that text data follows a Gaussian distribution; theme modeling is then carried out under the phrase-bag assumption, and the "theme-phrase" probability distribution is corrected using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme; finally, topics are expressed with the generated theme phrases. The method specifically comprises the following steps:
Step 1: The dataset is input into a preprocessing module; regular expressions are used to filter noise symbols such as html tags from the microblog dataset, and traditional and simplified Chinese characters are converted; a word segmentation tool is then used to segment the dataset and tag parts of speech, and microblog texts with fewer than 4 effective words are removed;
Step 2: Phrases are mined. Frequent phrases are first extracted using two rules while their occurrences are counted. The two rules are: (1) downward closure principle: if phrase P is not frequent, then any phrase containing P is not frequent either; (2) data anti-monotonicity: if a document contains no frequent phrase of length n, then it cannot contain any frequent phrase of length greater than n. Then, using the property that text follows a Gaussian distribution, frequent items are merged with their left and right neighboring words to form new phrases;
Step 3: Theme modeling is carried out; the "theme-phrase" probability distribution is corrected using the Sa function of the "theme-word" probability distributions of the words in a phrase under the same theme, and topics are finally expressed with the generated theme phrases.
2. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, frequent patterns are mined with the FP-growth algorithm following the downward closure principle; a set of active indices is maintained and combined with the downward closure criterion to generate phrases; the data anti-monotonicity rule is then used to determine whether a document needs to be mined further.
3. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 2, a bottom-up agglomerative merging method is used to combine frequent phrases with their left and right neighboring words to generate new phrases, and, guided by the phrase importance sig, the phrase with the highest degree of association is merged iteratively.
4. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the association relationships that exist between the words of a frequent phrase are used to supplement the inter-word association information lost by the LDA model.
5. The topic discovery method based on a phrase-bag topic model according to claim 1, characterized in that: in step 3, the statistical properties of the probability distributions of the lexical items $\{w_{d,g,i}\}$ of a phrase under the same theme are used to correct the Gibbs sampling algorithm for parameter estimation.
Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201810233489.5A | CN108399162A (en) | 2018-03-21 | 2018-03-21 | Topic discovery method based on a phrase-bag topic model

Publications (1)

Publication Number | Publication Date
CN108399162A | 2018-08-14

Family

ID=63093035

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201810233489.5A | Pending | CN108399162A (en) | 2018-03-21 | 2018-03-21 | Topic discovery method based on a phrase-bag topic model

Country Status (1)

Country | Link
CN (1) | CN108399162A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114859A1 (en) * 2008-10-31 2010-05-06 Yahoo! Inc. System and method for generating an online summary of a collection of documents
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
CN105573985A (en) * 2016-03-04 2016-05-11 北京理工大学 Sentence expression method based on Chinese sentence meaning structural model and topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
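张琴 et al.: "基于PhraseLDA模型的主题短语挖掘方法研究" [Research on topic phrase mining methods based on the PhraseLDA model], 《图书情报工作》 *
杨柯帆: "中文微博短文本主题挖掘方法研究与原型系统开发" [Research on topic mining methods for Chinese microblog short texts and development of a prototype system], 《中国优秀硕士学位论文全文数据库信息科技辑》 *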

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134951A (en) * 2019-04-29 2019-08-16 淮阴工学院 A kind of method and system for analyzing the potential theme phrase of text data
CN110134951B (en) * 2019-04-29 2021-08-31 淮阴工学院 Method and system for analyzing text data potential subject phrases
CN111178048A (en) * 2019-12-31 2020-05-19 微梦创科网络科技(中国)有限公司 Smooth phrase topic model-based topic extraction method and device
CN111178048B (en) * 2019-12-31 2023-08-01 微梦创科网络科技(中国)有限公司 Topic extraction method and device based on smooth phrase topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180814