CN106096014A - Text clustering method for mixed-length text sets based on DMR - Google Patents

Text clustering method for mixed-length text sets based on DMR

Info

Publication number
CN106096014A
CN106096014A CN201610469360.5A CN201610469360A CN106096014A CN 106096014 A CN106096014 A CN 106096014A CN 201610469360 A CN201610469360 A CN 201610469360A CN 106096014 A CN106096014 A CN 106096014A
Authority
CN
China
Prior art keywords
text
text set
long
dmr
mixing length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610469360.5A
Other languages
Chinese (zh)
Inventor
黄瑞章
闫盈盈
王瑞
钟文良
黄庭
李晶
陈功
刘博伟
朱坤
王振军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Farming Technology Co Ltd
Guizhou University
Original Assignee
Guizhou Farming Technology Co Ltd
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Farming Technology Co Ltd and Guizhou University
Priority to CN201610469360.5A
Publication of CN106096014A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text clustering method for mixed-length text sets based on DMR. The invention operates on a mixed-length text set and is therefore more general than methods designed for traditional long text sets. It uses the DMR method to determine the prior parameters of the model, improving on traditional clustering, in which prior values are set manually. Compared with conventional handling of mixed text sets, the long and short text sets share the same topics. The invention is simple and works well in use.

Description

Text clustering method for mixed-length text sets based on DMR
Technical Field
The present invention relates to the field of computer technology, and in particular to a text clustering method for mixed-length text sets based on DMR.
Background
With the arrival of the big-data era, mining latent topic information from massive text data has become increasingly important.
To find valuable content that meets user needs within massive data, the text mining field commonly uses text clustering methods. Text clustering divides a given text data set into multiple classes such that the texts within each class are semantically highly similar while the semantic similarity between classes is very low. Text clustering methods are now widely used in text mining, especially in fields such as information retrieval and intelligent search engines.
Texts are divided into long texts and short texts according to their length. Existing techniques for clustering long texts alone are quite mature, and some results have also been achieved for clustering short texts alone. However, short texts have two main characteristics: their features are highly sparse and they depend strongly on context. Clustering methods for short texts therefore still need exploration and improvement. For the same reason, clustering of mixed-length text sets still cannot achieve a good clustering effect.
At present, text clustering algorithms based on probabilistic topic models keep emerging, and they usually cluster long texts (news, blogs, e-mail and so on) well. With the explosive growth of social media such as microblogs, mining the semantics hidden in such short texts has become very important. Because the features of short texts are sparse and strongly context dependent, directly applying long-text clustering methods yields unsatisfactory results. Real-world text sets contain both long and short texts, yet clustering methods for mixed-length text sets are still immature and leave much room for improvement.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a text clustering method for mixed-length text sets based on DMR that achieves a better clustering effect than the prior art.
The present invention is realized as follows: a text clustering method for mixed-length text sets based on DMR, comprising the following steps:
1) perform text preprocessing on the original mixed-length text set;
2) divide the preprocessed text set into a long text set and a short text set;
3) model the text set using the DMR method;
4) from the model, obtain the topic-word distribution of the whole corpus and the respective document-topic distributions of the long and short texts;
5) cluster the mixed-length texts according to these distributions.
In the text preprocessing of step 1), the text set is a collection of papers related to databases, graphics and images, or computer networks; preprocessing includes word segmentation and stop-word removal.
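The preprocessing of step 1) can be illustrated with a minimal sketch. It assumes English text, a simple regular-expression tokenizer, and a tiny hypothetical stop-word list (`STOP_WORDS`); none of these details come from the patent, which only names word segmentation and stop-word removal.

```python
import re

# Hypothetical stop-word list for illustration; a real system would load a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to", "is"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("A Clustering Method for the Mixed-Length Text Set")
print(tokens)
```

For Chinese text, the regular expression would have to be replaced by a proper word segmenter.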
In step 2), the preprocessed text set is divided into a long text set and a short text set: texts whose content is shorter than 140 characters form the short text set, and all other texts form the long text set. The Abstract of each paper in the text set is assigned to the long text set, and the title of each paper is assigned to the short text set.
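The length rule of step 2) can be sketched as follows. The function name `split_by_length` and the use of Python's `len` as the character count are assumptions made for illustration.

```python
SHORT_TEXT_MAX_CHARS = 140  # threshold stated in the description

def split_by_length(documents):
    """Return (long_set, short_set): texts under 140 characters are short."""
    long_set, short_set = [], []
    for doc in documents:
        if len(doc) < SHORT_TEXT_MAX_CHARS:
            short_set.append(doc)
        else:
            long_set.append(doc)
    return long_set, short_set

long_set, short_set = split_by_length(["x" * 139, "x" * 140, "A short title"])
print(len(long_set), len(short_set))
```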
The modeling of step 3) uses the long text set to assist the modeling of the short text set; both have the same topic-word distribution.
In the modeling process, Dirichlet-multinomial regression (the DMR method) is used as the log-linear prior of the document-topic distribution. Because the long and short text sets carry different prior information, they produce different prior parameters. The prior information used by this model is whether the current document is a long text or a short text: a long text is labeled 1 and a short text is labeled 0.
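A log-linear prior of this kind can be sketched as below. In the usual formulation of Dirichlet-multinomial regression, the Dirichlet parameter of topic k for document d is the exponential of a dot product between the document's feature vector and a per-topic weight vector. The sketch assumes a feature vector made of a constant bias term plus the 1/0 long/short label; the bias term, the name `dmr_alpha`, and the example weights in `lam` are illustrative assumptions, not values from the patent.

```python
import math

def dmr_alpha(is_long, lam):
    """Return one Dirichlet parameter per topic for a document.

    lam[k] is a (bias_k, weight_k) pair for topic k (hypothetical values);
    is_long is 1 for a long text and 0 for a short text.
    """
    features = (1.0, float(is_long))  # assumed: bias term plus long/short label
    return [math.exp(sum(f * w for f, w in zip(features, lam_k))) for lam_k in lam]

lam = [(0.0, 1.0), (-1.0, 0.0)]   # two topics, illustrative weights
alpha_long = dmr_alpha(1, lam)    # prior for a long text
alpha_short = dmr_alpha(0, lam)   # prior for a short text
print(alpha_long, alpha_short)
```

With a nonzero weight on the label, long and short texts receive different priors for the same topic, which is how the two text sets end up with different prior parameters.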
Because short texts suffer from sparse features and strong context dependence, using long-text information to assist the short texts when modeling a mixed-length text set helps produce better results. In the modeling process, the long and short text sets have the same topic-word distribution, so long-text information can assist the short texts.
Compared with the prior art, the present invention operates on a mixed-length text set and is therefore more general than methods designed for traditional long text sets. It uses the DMR method to determine the prior parameters of the model, improving on traditional clustering, in which prior values are set manually. Compared with conventional handling of mixed text sets, the long and short text sets share the same topics. The invention is simple and works well in use.
Brief Description of the Drawings
Fig. 1 is the execution flow chart of an embodiment of the present invention;
Fig. 2 is the model of an embodiment of the present invention.
Detailed Description
Embodiment 1 of the present invention: a text clustering method for mixed-length text sets based on DMR. The flow of this embodiment is shown in Fig. 1:
Step s1 is performed first to obtain the mixed text set to be clustered; this embodiment uses a data set derived from Twitter;
Next, step s2 is performed to preprocess the mixed-length text set. English text requires tokenization, stop-word removal, stemming and similar work. After preprocessing, redundant information has been removed from the texts, so the text set becomes concise and tidy, which saves resources and simplifies computation;
So that the long texts can better assist the short texts, step s3 is performed: the Abstract of each paper in the paper collection is extracted and placed in the long text set, forming the auxiliary text set, while the title of each paper is extracted and placed in the short text set, forming the text set to be assisted;
After the long and short texts have been divided, step s4 is performed to build the model. In this model the long and short text sets share one topic-word distribution matrix, which is the essence of using long texts to assist short texts. On the other hand, because the Dirichlet-multinomial regression method determines different prior parameters for each set, their respective topic distributions also differ, as shown in Fig. 2.
The symbols in the model are explained first. The main symbols used in this example are shown in Table 1.
Table 1
The generative process of the model in this example is described below:
After the model has been built, step s5 of the present invention is performed: each word in every document is assigned a random topic, which serves as the initial state of the Markov chain.
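The random initialization of step s5 can be sketched as follows, assuming each document is a list of tokens and topics are numbered from 0. The name `init_topics` and the fixed seed are illustrative choices.

```python
import random

def init_topics(docs, num_topics, seed=0):
    """Assign a uniformly random topic id to every word of every document."""
    rng = random.Random(seed)
    return [[rng.randrange(num_topics) for _ in doc] for doc in docs]

docs = [["text", "cluster", "topic"], ["short", "title"]]
z = init_topics(docs, num_topics=4)
print(z)
```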
Because this embodiment uses a mixed-length text set, updating a text's topics takes one of two paths: step s6 is performed for a long text and step s7 for a short text. Both steps use Gibbs sampling to update the topics, and the update rule is as follows:
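The rule itself is not reproduced in this text. Purely as an illustration, the sketch below uses the collapsed Gibbs update commonly used for DMR-style topic models, in which the weight of topic k for a word w in document d is proportional to (n_dk + alpha_dk) * (n_kw + beta) / (n_k + V * beta). This form, the symmetric `beta`, and all names are assumptions and may differ from the patent's own formula. The topic-word counts `n_kw` are shared by long and short texts, while `alpha_d` depends on the document type.

```python
import random

def resample_token(d, i, docs, z, n_dk, n_kw, n_k, alpha_d, beta, vocab_size, rng):
    """Resample the topic of word i in document d (assumed standard update)."""
    w, old = docs[d][i], z[d][i]
    # Remove the current assignment from all counts.
    n_dk[d][old] -= 1
    n_kw[old][w] -= 1
    n_k[old] -= 1
    # Unnormalized conditional weight of each topic.
    weights = [
        (n_dk[d][k] + alpha_d[k]) * (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
        for k in range(len(n_k))
    ]
    new = rng.choices(range(len(n_k)), weights=weights)[0]
    # Add the new assignment back.
    n_dk[d][new] += 1
    n_kw[new][w] += 1
    n_k[new] += 1
    z[d][i] = new
    return new

docs = [[0, 1, 1], [2, 0]]   # documents as lists of word ids
z = [[0, 1, 0], [1, 1]]      # current topic of each word
K, V = 2, 3
n_dk = [[0] * K for _ in docs]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        n_dk[d][k] += 1
        n_kw[k][w] += 1
        n_k[k] += 1

rng = random.Random(0)
resample_token(0, 1, docs, z, n_dk, n_kw, n_k,
               alpha_d=[0.5, 0.5], beta=0.01, vocab_size=V, rng=rng)
```

A full sampler would sweep this update over every word of every document, passing the long-text prior in step s6 and the short-text prior in step s7.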
When Gibbs sampling reaches a converged state, the parameter estimates can be obtained from statistics of the sampling results.
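One common way to turn converged counts into estimates is to smooth them with the priors and normalize. The sketch below assumes that approach, with a per-document prior `alpha_d` and a symmetric word prior `beta`; the function names are hypothetical.

```python
def estimate_theta(n_dk_row, alpha_d):
    """Document-topic distribution from one document's topic counts."""
    total = sum(n_dk_row) + sum(alpha_d)
    return [(n + a) / total for n, a in zip(n_dk_row, alpha_d)]

def estimate_phi(n_kw_row, beta):
    """Topic-word distribution from one topic's word counts."""
    total = sum(n_kw_row) + len(n_kw_row) * beta
    return [(n + beta) / total for n in n_kw_row]

theta = estimate_theta([3, 1], [0.5, 0.5])
phi = estimate_phi([2, 0, 1], 0.01)
print(theta, phi)
```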
Step s8 is performed to obtain the document-topic distribution of the short texts, step s9 to obtain the topic-word distribution of the whole corpus, and step s10 to obtain the document-topic distribution of the long texts.
Step s11 of the present invention is performed to cluster the texts.
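The text does not say how clusters are read off the distributions. A common choice, assumed here, is to assign each document to its most probable topic; `assign_clusters` is a hypothetical helper.

```python
def assign_clusters(theta):
    """Map each document to the index of its highest-probability topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

theta = [[0.7, 0.3], [0.2, 0.8], [0.55, 0.45]]
labels = assign_clusters(theta)
print(labels)
```

Because long and short texts share the same topics, the same topic index denotes the same cluster in both sets.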
The above are embodiments of the present invention. It should be noted that those skilled in the art can make improvements without departing from the principle of the invention, and such improvements are also considered to fall within the protection scope of the present invention.

Claims (4)

1. A text clustering method for mixed-length text sets based on DMR, characterized by comprising the following steps:
1) perform text preprocessing on the original mixed-length text set;
2) divide the preprocessed text set into a long text set and a short text set;
3) model the text set using the DMR method;
4) from the model, obtain the topic-word distribution of the whole corpus and the respective document-topic distributions of the long and short texts;
5) cluster the mixed-length texts according to these distributions.
2. The text clustering method for mixed-length text sets based on DMR according to claim 1, characterized in that: in the text preprocessing of step 1), the text set is a collection of papers related to databases, graphics and images, or computer networks, and preprocessing includes word segmentation and stop-word removal.
3. The text clustering method for mixed-length text sets based on DMR according to claim 1, characterized in that: in step 2), the preprocessed text set is divided into a long text set and a short text set; texts whose content is shorter than 140 characters form the short text set, and all other texts form the long text set; the Abstract of each paper in the text set is assigned to the long text set, and the title of each paper is assigned to the short text set.
4. The text clustering method for mixed-length text sets based on DMR according to claim 3, characterized in that: the modeling of step 3) uses the long text set to assist the modeling of the short text set, and both have the same topic-word distribution.
CN201610469360.5A 2016-06-25 2016-06-25 Text clustering method for mixed-length text sets based on DMR Pending CN106096014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610469360.5A CN106096014A (en) 2016-06-25 2016-06-25 Text clustering method for mixed-length text sets based on DMR

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610469360.5A CN106096014A (en) 2016-06-25 2016-06-25 Text clustering method for mixed-length text sets based on DMR

Publications (1)

Publication Number Publication Date
CN106096014A true CN106096014A (en) 2016-11-09

Family

ID=57252516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610469360.5A Pending CN106096014A (en) 2016-06-25 2016-06-25 Text clustering method for mixed-length text sets based on DMR

Country Status (1)

Country Link
CN (1) CN106096014A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999516A (en) * 2011-09-15 2013-03-27 北京百度网讯科技有限公司 Method and device for classifying text
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
江雨燕: "《融合DSTM和USTM方法的主题模型》", 《计算机科学与探索》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method
CN107992549B (en) * 2017-11-28 2022-11-01 南京信息工程大学 Dynamic short text stream clustering retrieval method
CN109086345A (en) * 2018-07-12 2018-12-25 北京奇艺世纪科技有限公司 A kind of content identification method, content distribution method, device and electronic equipment
CN109815336A (en) * 2019-01-28 2019-05-28 无码科技(杭州)有限公司 A kind of text polymerization and system
CN109815336B (en) * 2019-01-28 2021-07-09 无码科技(杭州)有限公司 Text aggregation method and system
CN111309906A (en) * 2020-02-09 2020-06-19 北京工业大学 Long and short mixed type text classification optimization method based on integrated neural network

Similar Documents

Publication Publication Date Title
CN101950284B (en) Chinese word segmentation method and system
CN107992481B (en) Regular expression matching method, device and system based on multi-way tree
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN103077164B (en) Text analyzing method and text analyzer
CN103678412B (en) A kind of method and device of file retrieval
CN104933113A (en) Expression input method and device based on semantic understanding
CN106096014A (en) Text clustering method for mixed-length text sets based on DMR
CN103617157A (en) Text similarity calculation method based on semantics
CN102419778A (en) Information searching method for discovering and clustering sub-topics of query statement
CN110188359B (en) Text entity extraction method
CN102622346B (en) Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database
CN110991184B (en) Relay protection fixed value self-adaptive checking method based on comprehensive dictionary characteristics
CN105760524A (en) Multi-level and multi-class classification method for science news headlines
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN114495143B (en) Text object recognition method and device, electronic equipment and storage medium
CN104536830A (en) KNN text classification method based on MapReduce
CN107577713B (en) Text handling method based on electric power dictionary
CN105404677A (en) Tree structure based retrieval method
Liu et al. Chinese named entity recognition based on rules and conditional random field
CN103150409A (en) Method and system for recommending user search word
CN103207921A (en) Method for automatically extracting terms from Chinese electronic document
CN102541935A (en) Novel Chinese Web document representing method based on characteristic vectors
CN105426490A (en) Tree structure based indexing method
CN105718441A (en) Method and device for searching UI modules with similar functions between different platforms
CN105512270A (en) Method and device for determining related objects

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161109