CN103593454A - Mining method and system for microblog text classification - Google Patents

Info

Publication number
CN103593454A
Authority
CN
China
Prior art keywords
microblogging
lexical item
text
feature
classification
Prior art date
Application number
CN201310591482.8A
Other languages
Chinese (zh)
Inventor
罗军
章昉
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to CN201310591482.8A
Publication of CN103593454A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention relates to a mining method for microblog text classification. The method comprises the following steps: obtaining existing microblog data; analyzing and preprocessing the obtained microblog texts; traversing the lexical item set of the microblog texts and removing stop-word lexical items; calculating the chi-square test (CHI) value of each lexical item in an original feature lexical item set, which is the lexical item set of all the microblog texts, and taking the N lexical items with the highest values as a feature lexical item set; and performing association rule mining on the N lexical items and adding the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog, thereby improving the classification precision of microblog texts. The invention further relates to a mining system for microblog text classification. The method and system simplify association rule mining on raw microblog texts, greatly reduce the volume of data to be analyzed, and improve the classification precision of microblog texts.

Description

Mining method and system for microblog text classification

Technical field

The present invention relates to a mining method and system for microblog text classification.

Background

Microblogs have become an important platform and medium for social interaction. China has more than 400 million microblog users; Twitter has more than 500 million users who send over 200 million messages per day, making it the second-largest social networking site after Facebook. In recent years microblogs have become the birthplace of countless hot topics and trends. With the popularity of social networking sites such as Sina Weibo and Tencent Weibo in China, microblogs and other social media have not only become platforms on which internet users publish, share, and spread information, but have also accumulated large-scale data on user behavior. In May 2012, Lu Yi, deputy general manager of Sina's microblog division, stated that Sina Weibo had more than 300 million registered users, that 60% of active users logged in from mobile terminals, and that users published more than 100 million microblog posts per day on average. The volume of microblog data is thus growing steadily, so mining microblog data is feasible, novel, and practical, and it has attracted wide attention from academia in China and abroad.

In microblog text classification, association rules can effectively improve classification precision. For an association rule X => Y, the support is the percentage of transactions in the data set that contain both item X and item Y, i.e. a probability; the confidence is the percentage of transactions containing item X that also contain item Y, i.e. a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is treated as a strong association rule. These thresholds are set manually according to the needs of the mining task.
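The two measures can be illustrated with a minimal Python sketch. The function names and the toy data are assumptions made for illustration only; each transaction is taken to be the set of lexical items in one post.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Share of the transactions containing x that also contain y."""
    sup_x = support(transactions, [x])
    if sup_x == 0:
        return 0.0  # x never occurs, so the rule x => y has no evidence
    return support(transactions, [x, y]) / sup_x


# Toy data (hypothetical): four posts, each reduced to a set of lexical items.
posts = [
    {"rain", "umbrella"},
    {"rain", "umbrella", "bus"},
    {"rain", "bus"},
    {"sun", "bus"},
]

print(support(posts, ["rain", "umbrella"]))    # both items occur in 2 of 4 posts
print(confidence(posts, "rain", "umbrella"))   # 2 of the 3 posts containing "rain"
```

A rule would count as strong in this sketch when both values reach the manually chosen thresholds.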

Existing association rule algorithms fall mainly into two classes: the Apriori algorithm and the FP-tree frequent itemset algorithm.

Apriori algorithm: first, all frequent itemsets are found, i.e. the itemsets that occur at least as often as the predefined minimum support. Strong association rules, which must satisfy both the minimum support and the minimum confidence, are then generated from the frequent itemsets. The frequent itemsets found are used to produce the desired rules: all rules that contain only items of the set, where the right-hand side of each rule has exactly one item. Once these rules have been generated, only those whose confidence exceeds the minimum confidence given by the user are retained. A recursive method is used to generate all frequent itemsets.
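The level-wise search for frequent itemsets can be sketched as follows. This is a simplified illustration of the general Apriori idea rather than the exact procedure referred to above; it covers only the frequent-itemset stage, not rule generation, and the function name `apriori` is an assumption.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (as frozensets)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One full pass over the data for each level of candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result
```

The repeated pass over all transactions inside `frequent` is the source of the repeated database scans, and the join step is the source of the large candidate sets.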

FP-tree frequent itemset algorithm: this algorithm uses a divide-and-conquer strategy. After the first scan, the frequent itemsets in the database are compressed into a frequent pattern tree (FP-tree) that still retains the association information. The FP-tree is then divided into a number of conditional databases, each associated with one frequent itemset of length 1, and these conditional databases are mined separately. When the raw data volume is very large, partitioning can also be applied so that each FP-tree fits into main memory. Experiments show that FP-growth adapts well to rules of different lengths and is far more efficient than the Apriori algorithm.

However, for short texts such as microblog posts, the Apriori algorithm generates a large number of candidate sets and may need to scan the database repeatedly, which greatly increases mining complexity and mining time. Although the FP-tree frequent itemset algorithm improves efficiency, it is still not efficient on short text.

Summary of the invention

In view of this, it is necessary to provide a mining method and system for microblog text classification.

The invention provides a mining method for microblog text classification, comprising the following steps:
a. obtaining existing microblog data;
b. analyzing and preprocessing the obtained microblog texts;
c. traversing the lexical item set of the microblog texts and removing stop-word lexical items;
d. calculating the chi-square test (CHI) value of each lexical item in the original feature lexical item set and taking the N lexical items with the highest values as the feature lexical item set, the original feature lexical item set being the lexical item set of all the microblog texts;
e. performing association rule mining on the N lexical items and adding the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog, so as to improve microblog text classification precision.

The microblog data comprise: user ID, user name, and microblog text.

Step b comprises removing special symbols such as punctuation marks from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the lexical item set of the microblog text, and manually classifying the microblog.

The feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.

The chi-square test (CHI) value is calculated as follows. For each word, count: a, the number of microblog texts in the category that contain the word; b, the number of microblog texts not in the category that contain the word; c, the number of microblog texts in the category that do not contain the word; and d, the number of microblog texts not in the category that do not contain the word. Then z1 = a*d - b*c and CHI = (z1*z1*float(N)) / ((a+c)*(a+b)*(b+d)*(c+d)).

Step e comprises: traversing every microblog in the obtained microblog data and forming 2-tuples (pairs) from the feature lexical item set of each microblog; setting the thresholds for support and confidence; and obtaining the strong association rules according to the set support and confidence thresholds, then adding the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog.

The present invention also provides a mining system for microblog text classification, comprising an acquisition module, a preprocessing module, an extraction module, a computing module, and a mining module that are electrically connected to one another, wherein: the acquisition module is used to obtain existing microblog data; the preprocessing module is used to analyze and preprocess the obtained microblog texts; the extraction module is used to traverse the lexical item set of the microblog texts and remove stop-word lexical items; the computing module is used to calculate the chi-square test (CHI) value of each lexical item in the original feature lexical item set and to take the N lexical items with the highest values as the feature lexical item set, the original feature lexical item set being the lexical item set of all the microblog texts; and the mining module is used to perform association rule mining on the N lexical items and to add the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog, so as to improve microblog text classification precision.

The microblog data comprise: user ID, user name, and microblog text.

The preprocessing module is used to remove special symbols such as punctuation marks from the microblog text, remove non-Chinese characters, and perform word segmentation to obtain the lexical item set of the microblog text.

The feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.

The mining method and system for microblog text classification of the present invention take into account the structure of microblog text. Considering that microblog posts are short texts and that association rules are needed for them, the invention proposes a simple and effective association rule mining method for microblog text classification. Compared with previous association rule mining methods, the time complexity is greatly reduced, the volume of data to be analyzed is greatly reduced, and microblog text classification precision is significantly improved.

Brief description of the drawings

Fig. 1 is a flow chart of the mining method for microblog text classification of the present invention;

Fig. 2 is a hardware architecture diagram of the mining system for microblog text classification of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 shows the operation flow of a preferred embodiment of the mining method for microblog text classification of the present invention.

Step S401: obtain existing microblog data. Specifically, existing data are obtained from a microblog website. Owing to the limits of the analysis technique, this embodiment obtains only microblog data whose content is in Chinese. The microblog data comprise: user ID, user name, and microblog text.

Step S402: analyze and preprocess the obtained microblog texts. Specifically, each microblog text is initialized: special symbols such as punctuation marks are removed, non-Chinese characters are removed, and word segmentation is performed, yielding the lexical item set of the microblog text. The microblog is then classified manually.
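The cleaning part of this step might look like the Python sketch below. The Unicode range used to decide what counts as a Chinese character and the character-level `segment` placeholder are assumptions for illustration; a real system would call a proper Chinese word segmenter, and the manual classification step is not covered.

```python
import re

# Runs of characters in the CJK Unified Ideographs block. Treating this
# block as "Chinese characters" is an assumption of this sketch.
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def clean_text(text):
    """Drop punctuation, symbols, and every other non-Chinese character."""
    return "".join(_CHINESE_RUN.findall(text))


def segment(text):
    """Placeholder segmenter: one lexical item per character.

    A real pipeline would replace this with a Chinese word segmenter.
    """
    return list(text)


def preprocess(text):
    """Return the lexical items of one microblog text."""
    return segment(clean_text(text))


print(clean_text("今天天气好!! http://t.cn/abc @user 哈哈"))
```

Removing non-Chinese characters also removes URLs, user mentions, and Latin-script words, which matches the embodiment's restriction to Chinese content.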

Step S403: perform feature extraction on the microblog texts by traversing the lexical item set of the microblog texts and removing stop-word lexical items.
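A minimal sketch of the stop-word filter, assuming the lexical items of a post are held in a list and the stop words in a set. The three stop words shown are hypothetical; a real list would be loaded from a file.

```python
def remove_stop_words(terms, stop_words):
    """Return the lexical items that are not stop words, keeping their order."""
    return [t for t in terms if t not in stop_words]


# Hypothetical stop-word list used only for this example.
STOP_WORDS = frozenset({"的", "了", "是"})

print(remove_stop_words(["今天", "的", "天气", "是", "好"], STOP_WORDS))
```

Holding the stop words in a set keeps each membership check at constant expected cost, so the traversal is linear in the number of lexical items.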

Step S404: perform feature selection on the microblog data. Specifically, the chi-square test (CHI) value of each lexical item in the original feature lexical item set is calculated, and the N lexical items with the highest values are taken as the feature lexical item set. The original feature lexical item set is the lexical item set of all the microblog texts. The feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.

The chi-square test (CHI) value is calculated as follows:

For each word, count: a, the number of microblog texts in the category that contain the word; b, the number of microblog texts not in the category that contain the word; c, the number of microblog texts in the category that do not contain the word; and d, the number of microblog texts not in the category that do not contain the word.

z1 = a*d - b*c.

CHI = (z1*z1*float(N)) / ((a+c)*(a+b)*(b+d)*(c+d)).
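The calculation can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the text: the four counts are read as the cells of the usual 2x2 contingency table (item present or absent, text inside or outside the category), and N in the formula is taken to be the total number of texts, a+b+c+d, which differs from the N that denotes the number of features kept. The function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """CHI value of one lexical item for one category.

    a: texts in the category that contain the item
    b: texts outside the category that contain the item
    c: texts in the category that do not contain the item
    d: texts outside the category that do not contain the item
    """
    n = a + b + c + d  # assumed meaning of N in the formula
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a row or column of the table is empty
    z1 = a * d - b * c
    return z1 * z1 * float(n) / denom


def top_n_features(docs, labels, category, n_features):
    """Rank lexical items by CHI for `category` and keep the highest n_features."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = b = c = d = 0
        for doc, label in zip(docs, labels):
            has_term = term in doc
            in_category = label == category
            if has_term and in_category:
                a += 1
            elif has_term:
                b += 1
            elif in_category:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n_features]
```

Because z1 is squared, an item that is strongly tied to the absence of a category scores as high as one tied to its presence; the sketch leaves that behavior unchanged.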

Step S405: perform association rule mining on the N lexical items. The specific steps are as follows:

1. Traverse every microblog in the obtained microblog data and form 2-tuples (pairs) from the feature lexical item set of each microblog. Each pair is added to MAP<(lexical item x, lexical item y), count>, where count is the number of times the pair occurs.

2. The number of occurrences of each lexical item has already been calculated during feature selection. Set the thresholds for support and confidence.

2.1. Filter out the pairs whose count is less than (total number of microblogs in the microblog data) * (support threshold);

2.2. support(x=>y) = count / (total number of microblogs in the microblog data);

2.3. confidence(x=>y) = count / (a+b).

3. Obtain the strong association rules according to the support and confidence thresholds set above. The lexical items strongly associated with the feature lexical items of a microblog text are added to the feature lexical item set of that microblog, so as to improve microblog text classification precision.
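The pair-based mining steps above can be sketched in Python as follows. The function names are hypothetical, and two choices are assumptions of this sketch: pairs are counted as ordered (x, y) tuples so that x => y and y => x are scored separately, and the a+b term in the confidence formula is taken to be the number of posts that contain x.

```python
from collections import Counter
from itertools import permutations


def mine_strong_rules(posts, min_support, min_confidence):
    """Return {(x, y): (support, confidence)} for every strong rule x => y.

    `posts` holds the feature lexical items of each post.
    """
    posts = [set(p) for p in posts]
    total = len(posts)
    # Number of posts containing each item (the role played by a+b).
    term_count = Counter(t for post in posts for t in post)
    # MAP<(x, y), count>: number of posts in which x and y occur together.
    pair_count = Counter(
        pair for post in posts for pair in permutations(sorted(post), 2)
    )

    rules = {}
    for (x, y), count in pair_count.items():
        if count < total * min_support:
            continue  # pair is too rare to meet the support threshold
        sup = count / total
        conf = count / term_count[x]
        if conf >= min_confidence:
            rules[(x, y)] = (sup, conf)
    return rules


def expand_features(post, rules):
    """Add every y for which some x in the post has a strong rule x => y."""
    return set(post) | {y for (x, y) in rules if x in post}
```

Restricting the search to pairs of already selected feature items means a single pass over the posts is enough, with no candidate generation beyond length 2.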

Fig. 2 shows the hardware architecture of the mining system for microblog text classification of the present invention. The system comprises an acquisition module, a preprocessing module, an extraction module, a computing module, and a mining module that are electrically connected to one another.

The acquisition module is used to obtain existing microblog data. Specifically, the acquisition module obtains existing data from a microblog website. Owing to the limits of the analysis technique, this embodiment obtains only microblog data whose content is in Chinese. The microblog data comprise: user ID, user name, and microblog text.

The processing module is used to denoise and enhance the acquired image in preparation for later processing and screening. Specifically, the processing module performs denoising and enhancement on the acquired image to improve its resolution.

The preprocessing module is used to analyze and preprocess the obtained microblog texts. Specifically, the preprocessing module initializes each microblog text: special symbols such as punctuation marks are removed, non-Chinese characters are removed, and word segmentation is performed, yielding the lexical item set of the microblog text. The microblog is then classified manually.

The extraction module is used to perform feature extraction on the microblog texts: it traverses the lexical item set of the microblog texts and removes stop-word lexical items.

The computing module is used to perform feature selection on the microblog data. Specifically, the computing module calculates the chi-square test (CHI) value of each lexical item in the original feature lexical item set and takes the N lexical items with the highest values as the feature lexical item set. The original feature lexical item set is the lexical item set of all the microblog texts. The feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.

The computing module calculates the chi-square test (CHI) value as follows:

For each word, count: a, the number of microblog texts in the category that contain the word; b, the number of microblog texts not in the category that contain the word; c, the number of microblog texts in the category that do not contain the word; and d, the number of microblog texts not in the category that do not contain the word.

z1 = a*d - b*c.

CHI = (z1*z1*float(N)) / ((a+c)*(a+b)*(b+d)*(c+d)).

The mining module is used to perform association rule mining on the N lexical items, as follows:

First, the mining module traverses every microblog in the obtained microblog data and forms 2-tuples (pairs) from the feature lexical item set of each microblog. Each pair is added to MAP<(lexical item x, lexical item y), count>, where count is the number of times the pair occurs.

Then, since the number of occurrences of each lexical item has already been calculated during feature selection, the thresholds for support and confidence are set: pairs whose count is less than (total number of microblogs in the microblog data) * (support threshold) are filtered out; support(x=>y) = count / (total number of microblogs in the microblog data); confidence(x=>y) = count / (a+b).

Finally, the strong association rules are obtained according to the support and confidence thresholds set above. The lexical items strongly associated with the feature lexical items of a microblog text are added to the feature lexical item set of that microblog, so as to improve microblog text classification precision.

Although the present invention has been described with reference to the presently preferred embodiments, those skilled in the art will understand that the above preferred embodiments serve only to illustrate the invention and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A mining method for microblog text classification, characterized in that the method comprises the following steps:
a. obtaining existing microblog data;
b. analyzing and preprocessing the obtained microblog texts;
c. traversing the lexical item set of the microblog texts and removing stop-word lexical items;
d. calculating the chi-square test (CHI) value of each lexical item in the original feature lexical item set and taking the N lexical items with the highest values as the feature lexical item set, the original feature lexical item set being the lexical item set of all the microblog texts;
e. performing association rule mining on the N lexical items and adding the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog, so as to improve microblog text classification precision.
2. The method as claimed in claim 1, characterized in that the microblog data comprise: user ID, user name, and microblog text.
3. The method as claimed in claim 2, characterized in that step b comprises removing special symbols such as punctuation marks from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the lexical item set of the microblog text, and manually classifying the microblog.
4. The method as claimed in claim 3, characterized in that the feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.
5. The method as claimed in claim 4, characterized in that the chi-square test (CHI) value is calculated as follows:
For each word, count: a, the number of microblog texts in the category that contain the word; b, the number of microblog texts not in the category that contain the word; c, the number of microblog texts in the category that do not contain the word; and d, the number of microblog texts not in the category that do not contain the word;
z1 = a*d - b*c;
CHI = (z1*z1*float(N)) / ((a+c)*(a+b)*(b+d)*(c+d)).
6. The method as claimed in claim 5, characterized in that step e comprises:
traversing every microblog in the obtained microblog data and forming 2-tuples (pairs) from the feature lexical item set of each microblog;
setting the thresholds for support and confidence;
obtaining the strong association rules according to the set support and confidence thresholds, and adding the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog.
7. A mining system for microblog text classification, characterized in that the system comprises an acquisition module, a preprocessing module, an extraction module, a computing module, and a mining module that are electrically connected to one another, wherein:
the acquisition module is used to obtain existing microblog data;
the preprocessing module is used to analyze and preprocess the obtained microblog texts;
the extraction module is used to traverse the lexical item set of the microblog texts and remove stop-word lexical items;
the computing module is used to calculate the chi-square test (CHI) value of each lexical item in the original feature lexical item set and to take the N lexical items with the highest values as the feature lexical item set, the original feature lexical item set being the lexical item set of all the microblog texts;
the mining module is used to perform association rule mining on the N lexical items and to add the lexical items strongly associated with the feature lexical items of a microblog text to the feature lexical item set of that microblog, so as to improve microblog text classification precision.
8. The system as claimed in claim 7, characterized in that the microblog data comprise: user ID, user name, and microblog text.
9. The system as claimed in claim 8, characterized in that the preprocessing module is used to remove special symbols such as punctuation marks from the microblog text, remove non-Chinese characters, and perform word segmentation to obtain the lexical item set of the microblog text.
10. The system as claimed in claim 9, characterized in that the feature lexical item set is sorted by mutual information value, where N is user-defined and N is less than the total number of lexical items.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310591482.8A CN103593454A (en) 2013-11-21 2013-11-21 Mining method and system for microblog text classification

Publications (1)

Publication Number Publication Date
CN103593454A true CN103593454A (en) 2014-02-19

Family

ID=50083595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310591482.8A CN103593454A (en) 2013-11-21 2013-11-21 Mining method and system for microblog text classification

Country Status (1)

Country Link
CN (1) CN103593454A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6272478B1 (en) * 1997-06-24 2001-08-07 Mitsubishi Denki Kabushiki Kaisha Data mining apparatus for discovering association rules existing between attributes of data
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN101510204A (en) * 2009-03-02 2009-08-19 南京航空航天大学 Abnormal enquiry and monitor method based on target condition association rule database
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
麦艺华: "《面向中文微博的社会网络分析及应用》", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361008A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Microblog classification method based on dictionary or/and threshold value
CN105653533A (en) * 2014-11-13 2016-06-08 腾讯数码(深圳)有限公司 Method and device for updating classified associated word set
CN105653533B (en) * 2014-11-13 2019-10-25 腾讯数码(深圳)有限公司 A kind of method and apparatus updating classification associated set of words
WO2017101342A1 (en) * 2015-12-15 2017-06-22 乐视控股(北京)有限公司 Sentiment classification method and apparatus
CN107302474A (en) * 2017-07-04 2017-10-27 四川无声信息技术有限公司 The feature extracting method and device of network data application
CN107302474B (en) * 2017-07-04 2020-02-04 四川无声信息技术有限公司 Feature extraction method and device for network data application
CN107391489A (en) * 2017-07-31 2017-11-24 阿里巴巴集团控股有限公司 A kind of text analyzing method and device
CN107391489B (en) * 2017-07-31 2020-09-25 阿里巴巴集团控股有限公司 Text analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140219