CN103593454A

CN103593454A - Mining method and system for microblog text classification

Info

Publication number: CN103593454A
Application number: CN201310591482.8A
Authority: CN
Inventors: 罗军; 章昉
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-11-21
Filing date: 2013-11-21
Publication date: 2014-02-19

Abstract

The invention relates to a mining method for microblog text classification, comprising the following steps: acquiring existing microblog data; analyzing and preprocessing the acquired microblog text; searching and traversing the term set of the microblog text and removing stop-word terms; calculating the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts; and performing association rule mining on the N terms, adding the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy. The invention also relates to a mining system for microblog text classification. The invention can effectively reduce the complexity of mining association rules from raw microblog texts, greatly reduce the amount of data to be analyzed, and improve the classification accuracy of microblog texts.

Description

Mining method and system for microblog text classification

Technical Field

The invention relates to a mining method and system for microblog text classification.

Background

Microblogs have become one of the most important platforms and media for social interaction. China has more than 400 million microblog users, while Twitter has more than 500 million users who send more than 200 million messages a day, making it the second-largest social networking site after Facebook. In recent years, microblogs have been the birthplace of countless hot topics and trends. With the popularity of social networking sites such as Sina Weibo and Tencent Weibo in China, social media such as microblogs have not only become a platform for netizens to publish, share, and disseminate information, but have also accumulated large-scale behavioral data on those netizens. In May 2012, Lu Yi, deputy general manager of the Sina Weibo business division, stated that Sina Weibo had more than 300 million registered users, that 60% of its active users logged in through mobile terminals, and that users published more than 100 million microblog posts per day on average. The volume of microblog data is therefore growing steadily, which makes mining microblog data feasible, innovative, and practical, and the topic has attracted extensive attention from academia in China and abroad.

In microblog text classification, association rules can effectively improve classification accuracy. The support of an association rule in a data set is the percentage of transactions in the data set that contain both item X and item Y, that is, a probability. The confidence is the percentage of transactions that contain item Y among those that already contain item X, that is, a conditional probability. A rule that satisfies both the minimum support threshold and the minimum confidence threshold is a strong association rule. These thresholds are set manually according to the needs of the mining task.
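The two measures defined above can be illustrated with a minimal sketch. The function names and the toy transactions are assumptions introduced for illustration only; each transaction is modeled as a Python set of items.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain both x and y."""
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0  # the rule x => y is undefined when x never occurs
    return sum(1 for t in with_x if y in t) / len(with_x)


def is_strong(transactions, x, y, min_support, min_confidence):
    """A rule is strong when it meets both manually chosen thresholds."""
    return (support(transactions, x, y) >= min_support
            and confidence(transactions, x, y) >= min_confidence)
```

With four toy transactions `{"a","b"}`, `{"a","c"}`, `{"a","b","c"}`, `{"b"}`, the rule a => b appears in two of four transactions and in two of the three transactions that contain a.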

Existing association rule algorithms fall into two main classes: the Apriori algorithm and the FP-tree frequent itemset algorithm (FP-growth).

Apriori algorithm: first, find all frequent itemsets, that is, itemsets that occur at least as often as the predefined minimum support. All frequent itemsets are generated level by level in a recursive manner. Strong association rules are then generated from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. Using the frequent itemsets found, all rules that contain only items of a given itemset are generated, with exactly one item on the right-hand side of each rule. Once these rules are generated, only those whose confidence exceeds the user-specified minimum confidence are retained.
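The level-wise frequent itemset generation described above can be sketched as follows. This is a minimal, unoptimized illustration of the general Apriori idea, not an implementation taken from the patent; the function name and data layout (a list of Python sets) are assumptions.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    n = len(transactions)

    def sup(itemset):
        # Support = fraction of transactions that contain every item of itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sup(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Each level requires another scan of the data to count candidates.
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent
```

The repeated scans in `sup` and the growing candidate sets are exactly the costs that the background section attributes to Apriori.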

FP-tree frequent itemset algorithm: this algorithm uses a divide-and-conquer strategy. After a first scan, the frequent items in the database are compressed into a frequent pattern tree (FP-tree) that still retains their association information. The FP-tree is then split into a number of conditional pattern bases, each associated with a frequent itemset of length 1, and these conditional bases are mined separately. When the amount of raw data is large, partitioning can also be applied so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths and is far more efficient than the Apriori algorithm.

However, for short texts such as microblogs, the Apriori algorithm generates a large number of candidate sets and may need to scan the database repeatedly, which greatly increases mining complexity and mining time. Although the FP-tree frequent itemset algorithm improves efficiency, it is still not efficient enough for short texts.

Summary of the Invention

In view of this, it is necessary to provide a mining method and system for microblog text classification.

The present invention provides a mining method for microblog text classification. The method includes the following steps: a. acquiring existing microblog data; b. analyzing and preprocessing the acquired microblog text; c. searching and traversing the term set of the microblog text and removing stop-word terms; d. calculating the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts; e. performing association rule mining on the N terms and adding the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

The microblog data includes: user ID, user name, and microblog text.

Step b includes removing special symbols such as punctuation marks from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the term set of the microblog text, and manually classifying the microblog.

The feature term set is sorted by mutual information value, where N is user-defined and N is less than the total number of terms.

The chi-square test (CHI) value is calculated as follows. For each word, compute: the number a of microblog texts in this category that contain the word; the number b of microblog texts not in this category that contain the word; the number c of microblog texts in this category that do not contain the word; and the number d of microblog texts not in this category that do not contain the word; z1=a*d-b*c; CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).

Step e includes: traversing each microblog in the acquired microblog data and forming two-term pairs (2-tuples) from the feature term set of each microblog; setting thresholds for support and confidence; and, according to the set support and confidence thresholds, selecting the strong association rules and adding the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set.

The present invention also provides a mining system for microblog text classification, including an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another, wherein: the acquisition module is used to acquire existing microblog data; the preprocessing module is used to analyze and preprocess the acquired microblog text; the extraction module is used to search and traverse the term set of the microblog text and remove stop-word terms; the calculation module is used to calculate the chi-square test (CHI) value of each term in the original feature term set and to take the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts; and the mining module is used to perform association rule mining on the N terms and to add the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

The preprocessing module is used to remove special symbols such as punctuation marks from the microblog text, remove non-Chinese characters, and perform word segmentation to obtain the term set of the microblog text.

The mining method and system for microblog text classification of the present invention take the textual structure of microblogs into account. Addressing the short-text nature of microblogs and the need for association rules over microblog text, the invention proposes a simple and effective association rule mining method for microblog text classification. Compared with previous association rule mining methods, the time complexity of the present invention is greatly reduced, the amount of data to be analyzed is greatly reduced, and microblog text classification accuracy is significantly improved.

Brief Description of the Drawings

Fig. 1 is a flow chart of the mining method for microblog text classification of the present invention;

Fig. 2 is a hardware architecture diagram of the mining system for microblog text classification of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flow chart of a preferred embodiment of the mining method for microblog text classification of the present invention.

Step S401: acquire existing microblog data. Specifically, the existing data on a microblog website is acquired. Owing to the limits of the analysis technology, this embodiment acquires only microblog data whose content is in Chinese. The microblog data includes: user ID, user name, and microblog text.

Step S402: analyze and preprocess the acquired microblog text. Specifically, each microblog text is initialized: after special symbols such as punctuation marks are removed, non-Chinese characters are removed, and word segmentation is performed, the term set of the microblog text is obtained, and the microblog is manually classified.
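The cleaning part of step S402 can be sketched as follows. The patent does not name a word segmenter, so the `segment` parameter is a hypothetical hook for one; the fallback of splitting into single characters is only a placeholder, and treating the Unicode range U+4E00 to U+9FFF as "Chinese characters" is an assumption.

```python
import re

# Runs of characters in the CJK Unified Ideographs block. Using this range to
# stand for "Chinese characters" is an assumption made for illustration.
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def preprocess(text, segment=None):
    """Drop punctuation and non-Chinese characters, then segment into terms.

    `segment` is a hypothetical callable mapping a run of Chinese characters
    to a list of terms. Without one, each character becomes its own term,
    which is a placeholder and not real word segmentation.
    """
    if segment is None:
        segment = list  # placeholder: character-level split
    terms = []
    for run in _CHINESE_RUN.findall(text):
        terms.extend(segment(run))
    return terms
```

In practice a dedicated Chinese word segmenter would be passed as `segment`; the manual category labeling mentioned in the step is outside the scope of this sketch.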

Step S403: perform feature extraction on the microblog text, that is, search and traverse the term set of the microblog text and remove stop-word terms.
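A minimal sketch of the stop-word filtering in step S403 follows. The function name and the example stop-word list are assumptions; the patent does not specify which stop-word list is used.

```python
def remove_stop_words(terms, stop_words):
    """Traverse the term list and keep only terms that are not stop words."""
    stop = set(stop_words)  # set membership keeps the traversal linear
    return [t for t in terms if t not in stop]
```

For example, filtering `["我", "喜欢", "的", "微博"]` against the stop words `{"的", "我"}` leaves `["喜欢", "微博"]`.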

Step S404: perform feature selection on the microblog data. Specifically, the chi-square test (CHI) value is calculated for each term in the original feature term set, and the N terms with the highest values are taken as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is sorted by mutual information value, where N is user-defined and N is less than the total number of terms.

The chi-square test (CHI) value is calculated as follows:

For each word, compute: the number a of microblog texts in this category that contain the word; the number b of microblog texts not in this category that contain the word; the number c of microblog texts in this category that do not contain the word; and the number d of microblog texts not in this category that do not contain the word.

z1=a*d-b*c.

CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)). In this formula, N presumably denotes the total number of microblog texts, a+b+c+d, as in the standard chi-square statistic, rather than the number of selected terms.
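The CHI calculation and top-N selection can be sketched as follows. Two points are assumptions rather than statements from the patent: the N in the formula is read as the total number of texts (a+b+c+d), and terms are scored against a single target category because the text does not say how scores are combined across categories. All function and parameter names are illustrative.

```python
def chi_square(a, b, c, d):
    """CHI value from the four document counts defined in the text."""
    n = a + b + c + d  # assumption: N in the formula is the total text count
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    z1 = a * d - b * c
    return (z1 * z1 * float(n)) / denom


def top_n_terms(docs, labels, category, n_terms):
    """Rank every term by its CHI value for `category`; keep the best n_terms.

    docs: list of term sets (one per microblog); labels: manual categories.
    """
    in_cat = sum(1 for label in labels if label == category)
    out_cat = len(docs) - in_cat
    vocab = set().union(*docs) if docs else set()
    scores = {}
    for term in vocab:
        a = sum(1 for doc, lab in zip(docs, labels) if lab == category and term in doc)
        b = sum(1 for doc, lab in zip(docs, labels) if lab != category and term in doc)
        scores[term] = chi_square(a, b, in_cat - a, out_cat - b)
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]
```

A term that occurs only inside the category (or only outside it) gets a high score, while a term spread evenly across categories scores zero.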

Step S405: perform association rule mining on the N terms. The specific steps are as follows:

1. Traverse each microblog in the acquired microblog data, form two-term pairs (2-tuples) from the feature term set of each microblog, and add each pair to MAP<(term x, term y), count>, where count is the number of times the pair occurs.

2. The number of occurrences of each term has already been calculated during feature selection. Set the thresholds for support and confidence.

2.1. Filter out the pairs whose count is less than the total number of microblogs in the microblog data multiplied by the set support threshold;

2.2. support(x=>y)=count/(total number of microblogs in the microblog data);

2.3. confidence(x=>y)=count/(a+b), where a+b is the number of microblog texts that contain term x.

3. According to the support and confidence thresholds set above, select the strong association rules. Add the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
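The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each microblog's feature terms are given as a Python set, ordered pairs (x, y) are counted so that x => y and y => x are evaluated separately, and the function names are invented for this example.

```python
from collections import Counter
from itertools import permutations


def mine_strong_rules(docs, min_support, min_confidence):
    """Return {x: set of y} for every strong pairwise rule x => y.

    docs: list of feature-term sets, one per microblog.
    """
    total = len(docs)
    # Number of microblogs containing each term (a+b in the text's notation).
    term_count = Counter(t for doc in docs for t in doc)
    # Step 1: MAP<(term x, term y), count> over all ordered pairs per microblog.
    pair_count = Counter()
    for doc in docs:
        for pair in permutations(sorted(doc), 2):
            pair_count[pair] += 1
    rules = {}
    for (x, y), count in pair_count.items():
        # Step 2.1: drop pairs below total * support threshold.
        if count < total * min_support:
            continue
        # Step 2.3: confidence(x=>y) = count / (number of texts containing x).
        if count / term_count[x] >= min_confidence:
            rules.setdefault(x, set()).add(y)
    return rules


def expand_features(doc, rules):
    """Step 3: add the strongly associated terms to a microblog's feature set."""
    expanded = set(doc)
    for term in doc:
        expanded |= rules.get(term, set())
    return expanded
```

Because only pairs of terms that already survived CHI selection are counted, a single pass over the microblogs suffices, which is where the claimed reduction in mining cost relative to Apriori-style candidate generation would come from.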

Fig. 2 is a hardware architecture diagram of the mining system for microblog text classification of the present invention. The system includes an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another.

The acquisition module is used to acquire existing microblog data. Specifically, the acquisition module acquires the existing data on a microblog website. Owing to the limits of the analysis technology, this embodiment acquires only microblog data whose content is in Chinese. The microblog data includes: user ID, user name, and microblog text.

The processing module is used to perform denoising and enhancement preprocessing on the acquired images in preparation for later processing and screening. Specifically, the processing module performs denoising and enhancement on the acquired images separately, so as to improve their recognizability.

The preprocessing module is used to analyze and preprocess the acquired microblog text. Specifically, the preprocessing module initializes each microblog text: after special symbols such as punctuation marks are removed, non-Chinese characters are removed, and word segmentation is performed, the term set of the microblog text is obtained, and the microblog is manually classified.

The extraction module is used to perform feature extraction on the microblog text, that is, the extraction module searches and traverses the term set of the microblog text and removes stop-word terms.

The calculation module is used to perform feature selection on the microblog data. Specifically, the calculation module calculates the chi-square test (CHI) value of each term in the original feature term set and takes the N terms with the highest values as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is sorted by mutual information value, where N is user-defined and N is less than the total number of terms.

The calculation module calculates the chi-square test (CHI) value as follows:

z1=a*d-b*c.

The mining module is used to perform association rule mining on the N terms, as follows:

The mining module first traverses each microblog in the acquired microblog data, forms two-term pairs (2-tuples) from the feature term set of each microblog, and adds each pair to MAP<(term x, term y), count>, where count is the number of times the pair occurs.

Then, since the number of occurrences of each term has already been calculated during feature selection, the thresholds for support and confidence are set: pairs whose count is less than the total number of microblogs in the microblog data multiplied by the set support threshold are filtered out; support(x=>y)=count/(total number of microblogs in the microblog data); confidence(x=>y)=count/(a+b).

Finally, according to the support and confidence thresholds set above, the strong association rules are selected. The terms strongly associated with the feature terms in a microblog text are added to that microblog's feature term set, so as to improve microblog text classification accuracy.

Although the present invention has been described with reference to the current preferred embodiments, those skilled in the art should understand that the above preferred embodiments are intended only to illustrate the present invention and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A mining method for microblog text classification, characterized in that the method comprises the following steps:

a. acquiring existing microblog data;

b. analyzing and preprocessing the acquired microblog text;

c. searching and traversing the term set of the microblog text and removing stop-word terms;

d. calculating the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts;

e. performing association rule mining on the N terms and adding the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

2. The method of claim 1, characterized in that the microblog data comprises: user ID, user name, and microblog text.

3. The method of claim 2, characterized in that step b comprises removing special symbols such as punctuation marks from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the term set of the microblog text, and manually classifying the microblog.

4. The method of claim 3, characterized in that the feature term set is sorted by mutual information value, wherein N is user-defined and N is less than the total number of terms.

5. The method of claim 4, characterized in that the chi-square test (CHI) value is calculated as follows:

for each word, computing: the number a of microblog texts in this category that contain the word; the number b of microblog texts not in this category that contain the word; the number c of microblog texts in this category that do not contain the word; and the number d of microblog texts not in this category that do not contain the word;

z1=a*d-b*c;

CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).

6. The method of claim 5, characterized in that step e comprises:

traversing each microblog in the acquired microblog data and forming two-term pairs (2-tuples) from the feature term set of each microblog;

setting thresholds for support and confidence;

according to the set support and confidence thresholds, selecting the strong association rules and adding the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set.

7. A mining system for microblog text classification, characterized in that the system comprises an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another, wherein:

the acquisition module is used to acquire existing microblog data;

the preprocessing module is used to analyze and preprocess the acquired microblog text;

the extraction module is used to search and traverse the term set of the microblog text and remove stop-word terms;

the calculation module is used to calculate the chi-square test (CHI) value of each term in the original feature term set and to take the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts;

the mining module is used to perform association rule mining on the N terms and to add the terms strongly associated with the feature terms in a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

8. The system of claim 7, characterized in that the microblog data comprises: user ID, user name, and microblog text.

9. The system of claim 8, characterized in that the preprocessing module is used to remove special symbols such as punctuation marks from the microblog text, remove non-Chinese characters, and perform word segmentation to obtain the term set of the microblog text.

10. The system of claim 9, characterized in that the feature term set is sorted by mutual information value, wherein N is user-defined and N is less than the total number of terms.