CN103593454A - Mining method and system for microblog text classification - Google Patents

Mining method and system for microblog text classification

Info

Publication number
CN103593454A
Authority
CN
China
Prior art keywords
lexical item
microblogging
text
microblog
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310591482.8A
Other languages
Chinese (zh)
Inventor
罗军
章昉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-11-21
Filing date
2013-11-21
Publication date
2014-02-19
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310591482.8A
Publication of CN103593454A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a mining method for microblog text classification, comprising the following steps: acquiring existing microblog data; analyzing and preprocessing the acquired microblog text; traversing the term set of the microblog text and removing stop-word terms; computing the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts; and mining association rules over the N terms and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy. The invention also relates to a mining system for microblog text classification. The invention effectively reduces the complexity of association rule mining on raw microblog text, greatly reduces the amount of data to be analyzed, and improves microblog text classification accuracy.

Figure 201310591482

Description

Mining method and system for microblog text classification

Technical field

The invention relates to a mining method and system for microblog text classification.

Background

Microblogging has become one of the most important platforms and media for social interaction. China has more than 400 million microblog users, while Twitter has more than 500 million users who send more than 200 million messages per day, making it the second-largest social networking site after Facebook. In recent years, microblogs have become the birthplace of countless hot topics and trends. With the popularity in China of social networking sites such as Sina Weibo and Tencent Weibo, social media such as microblogs have not only become platforms where internet users publish, share, and spread information, but have also accumulated large-scale data on user behavior. In May 2012, Lu Yi, deputy general manager of the Sina Weibo business unit, stated that Sina Weibo had more than 300 million registered users, that 60% of active users logged in from mobile devices, and that users posted more than 100 million microblogs per day on average. The volume of microblog data is therefore growing steadily, which makes mining microblog data feasible, novel, and practical, and the topic has attracted wide attention from academia in China and abroad.

In microblog text classification, association rules can effectively improve classification accuracy. The support of an association rule X ⇒ Y in a data set is the percentage of transactions in the data set that contain both X and Y, that is, a probability. The confidence is the percentage of the transactions containing X that also contain Y, that is, a conditional probability. A rule that satisfies both the minimum support threshold and the minimum confidence threshold is treated as a strong association rule. These thresholds are set manually according to the needs of the mining task.
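
The two measures defined above can be made concrete with a minimal Python sketch. The function names and the toy `posts` data are illustrative assumptions, not part of the patent text; each transaction is taken to be the set of terms of one microblog.

```python
def support(transactions, x, y):
    """Fraction of transactions that contain both x and y."""
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0  # x never occurs, so the rule x => y has no evidence
    return sum(1 for t in with_x if y in t) / len(with_x)


# Hypothetical toy data: one term set per microblog.
posts = [
    {"phone", "battery", "screen"},
    {"phone", "battery"},
    {"phone", "camera"},
    {"movie", "ticket"},
]

print(support(posts, "phone", "battery"))     # share of posts with both terms
print(confidence(posts, "phone", "battery"))  # share of "phone" posts that also have "battery"
```

Note that confidence is not symmetric: in the toy data every post containing "battery" also contains "phone", but not the other way round.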

Existing association rule algorithms fall mainly into two classes: the Apriori algorithm and the FP-tree frequent itemset algorithm (FP-growth).

Apriori algorithm: it first finds all frequent itemsets, that is, itemsets that occur at least as often as a predefined minimum support. It then generates strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. Rules are generated from each frequent itemset found, using only the items of that itemset, with exactly one item on the right-hand side of each rule. Once the rules are generated, only those whose confidence exceeds the user-specified minimum confidence are kept. The frequent itemsets themselves are generated level by level in a recursive manner.
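
As a rough illustration of the Apriori description above, the following Python sketch generates frequent itemsets level by level and then derives rules with a single item on the right-hand side. It is a simplified teaching version under stated assumptions (transactions are Python sets; the function names are hypothetical), not an optimized implementation and not the method claimed by the invention.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # One pass over the data per level to count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent_item, confidence) triples."""
    out = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            conf = sup / frequent[x]  # every subset of a frequent set is frequent
            if conf >= min_confidence:
                out.append((x, y, conf))
    return out
```

The repeated data pass inside the `while` loop is the cost the background section refers to: each level of candidates requires another scan of all transactions.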

FP-tree frequent itemset algorithm: it uses a divide-and-conquer strategy. After a first scan, it compresses the frequent items of the database into a frequent-pattern tree (FP-tree) while retaining their association information. It then splits the FP-tree into a number of conditional pattern bases, each associated with one frequent itemset of length 1, and mines each conditional base separately. When the original data set is very large, partitioning can also be applied so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths and is substantially more efficient than the Apriori algorithm.

However, for short texts such as microblogs, the Apriori algorithm generates a large number of candidate itemsets and may need to scan the database repeatedly, which greatly increases mining complexity and mining time. The FP-tree frequent itemset algorithm improves efficiency, but its efficiency on short texts is still not high.

Summary of the invention

In view of this, it is necessary to provide a mining method and system for microblog text classification.

The invention provides a mining method for microblog text classification, comprising the following steps:
a. acquiring existing microblog data;
b. analyzing and preprocessing the acquired microblog text;
c. traversing the term set of the microblog text and removing stop-word terms;
d. computing the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts;
e. mining association rules over the N terms and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

The microblog data comprise: user ID, user name, and microblog text.

Step b comprises removing punctuation marks and other special symbols from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the term set of the microblog text, and manually classifying the microblog.

The feature term set is ordered by mutual information value, where N is user-defined and N is less than the total number of terms.

The chi-square test (CHI) value is computed as follows. For each word, count: a, the number of microblog texts in this category that contain the word; b, the number of microblog texts not in this category that contain the word; c, the number of microblog texts in this category that do not contain the word; and d, the number of microblog texts that are not in this category and do not contain the word. Then z1=a*d-b*c and CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).

Step e comprises: traversing each microblog in the acquired microblog data and forming two-term tuples (pairs) from the feature term set of each microblog; setting the support and confidence thresholds; and, according to the set support and confidence thresholds, selecting the strong association rules and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set.

The invention also provides a mining system for microblog text classification, comprising an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another, wherein: the acquisition module acquires existing microblog data; the preprocessing module analyzes and preprocesses the acquired microblog text; the extraction module traverses the term set of the microblog text and removes stop-word terms; the calculation module computes the chi-square test (CHI) value of each term in the original feature term set and takes the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts; and the mining module mines association rules over the N terms and adds the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.

The microblog data comprise: user ID, user name, and microblog text.

The preprocessing module removes punctuation marks and other special symbols from the microblog text, removes non-Chinese characters, and performs word segmentation to obtain the term set of the microblog text.

The feature term set is ordered by mutual information value, where N is user-defined and N is less than the total number of terms.

The mining method and system for microblog text classification of the invention take the text structure of microblogs into account. Addressing the short-text nature of microblog text and the need for association rules over it, the invention proposes a simple and effective association rule mining method for microblog text classification. Compared with earlier association rule mining methods, the time complexity is greatly reduced, the amount of data to be analyzed is greatly reduced, and microblog text classification accuracy is significantly improved.

Description of the drawings

FIG. 1 is a flowchart of the mining method for microblog text classification of the invention;

FIG. 2 is a hardware architecture diagram of the mining system for microblog text classification of the invention.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flowchart of a preferred embodiment of the mining method for microblog text classification of the invention.

Step S401: acquire existing microblog data. Specifically, the data already available on a microblog website are acquired. Owing to the limits of the analysis technique, this embodiment acquires only microblog data whose content is in Chinese. The microblog data comprise: user ID, user name, and microblog text.

Step S402: analyze and preprocess the acquired microblog text. Specifically, each microblog text is initialized: punctuation marks and other special symbols are removed, non-Chinese characters are removed, and word segmentation is performed, which yields the term set of the microblog text. The microblog is also classified manually.
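
The preprocessing in step S402 might look roughly like the Python sketch below. The patent does not name a word segmenter, so the sketch accepts any `segment` callable and falls back to overlapping character bigrams purely as a stand-in; both `char_bigrams` and `preprocess` are hypothetical names. Treating only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese characters" is likewise an assumption.

```python
import re

# Runs of CJK Unified Ideographs (U+4E00..U+9FFF). Everything else, including
# punctuation, Latin letters, digits, URLs, and emoji, acts as a separator.
_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_bigrams(run):
    """Stand-in segmenter (assumption): overlapping character bigrams."""
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def preprocess(text, segment=char_bigrams):
    """Strip non-Chinese characters and return the list of terms of one microblog."""
    terms = []
    for run in _CJK_RUN.findall(text):
        terms.extend(segment(run))
    return terms


print(preprocess("今天天气好!! http://t.cn/abc @user"))
```

In practice a dictionary-based or statistical Chinese word segmenter would be passed as `segment`; the bigram fallback only keeps the sketch self-contained.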

Step S403: perform feature extraction on the microblog text, that is, traverse the term set of the microblog text and remove stop-word terms.
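
Step S403 amounts to a filter over the term list. A minimal Python sketch follows; the stop-word list shown is a tiny hypothetical sample, since the patent does not specify which list is used.

```python
# Hypothetical stop-word sample; a real list would be much longer.
STOP_WORDS = frozenset({"的", "了", "是", "在", "我"})


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Return the terms of one microblog with stop-word terms dropped, order kept."""
    return [t for t in terms if t not in stop_words]


print(remove_stop_words(["我", "喜欢", "的", "电影"]))
```

Storing the stop words in a `frozenset` keeps each membership test constant-time, which matters when the filter runs over every term of every microblog.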

Step S404: perform feature selection on the microblog data. Specifically, the chi-square test (CHI) value is computed for each term in the original feature term set, and the N terms with the highest values are taken as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is ordered by mutual information value, where N is user-defined and N is less than the total number of terms.

The chi-square test (CHI) value is computed as follows:

For each word, count: a, the number of microblog texts in this category that contain the word; b, the number of microblog texts not in this category that contain the word; c, the number of microblog texts in this category that do not contain the word; and d, the number of microblog texts that are not in this category and do not contain the word.

z1=a*d-b*c.

CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).
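
A minimal Python sketch of the CHI computation and top-N selection follows. The text uses N both for the number of selected terms and for the factor in the CHI formula; the sketch assumes the latter is the total number of microblog texts, a+b+c+d, as in the standard chi-square statistic, and names the number of selected terms `n_terms` to avoid the clash. The function names, the single-category interface, and the alphabetical tie-break are illustrative assumptions.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """CHI value from the four counts; N is assumed to be a + b + c + d."""
    n = a + b + c + d
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category carries no signal
    z1 = a * d - b * c
    return (z1 * z1 * float(n)) / denom


def top_n_terms(docs, labels, category, n_terms):
    """Rank terms by CHI with respect to one category and keep the best n_terms.

    docs is a list of term lists (one per microblog); labels holds the manually
    assigned category of each microblog.
    """
    in_cat = sum(1 for lab in labels if lab == category)
    out_cat = len(docs) - in_cat
    a_counts, b_counts = Counter(), Counter()
    for terms, lab in zip(docs, labels):
        target = a_counts if lab == category else b_counts
        target.update(set(terms))  # count each term once per microblog
    scores = {}
    for term in set(a_counts) | set(b_counts):
        a, b = a_counts[term], b_counts[term]
        scores[term] = chi_square(a, b, in_cat - a, out_cat - b)
    # Highest CHI first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]
```

Because z1 is squared, a term that is strongly tied to the other categories scores as high as one tied to the target category; CHI measures dependence, not its direction.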

Step S405: mine association rules over the N terms. The specific steps are as follows:

1. Traverse each microblog in the acquired microblog data and form two-term tuples (pairs) from the feature term set of each microblog. Add each tuple to MAP<(term x, term y), count>, where count is the number of times the tuple occurs.

2. The number of occurrences of each term has already been computed during feature selection. Set the support and confidence thresholds.

2.1. Filter out the tuples whose count is less than (total number of microblogs in the microblog data) * (the set support threshold);

2.2. support(x=>y) = count / (total number of microblogs in the microblog data);

2.3. confidence(x=>y) = count / (a+b), where a+b is the number of microblog texts that contain term x.

3. According to the support and confidence thresholds set above, select the strong association rules. Add the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
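
Steps 1 to 3 above can be sketched in Python as follows. A `Counter` keyed by term pairs plays the role of MAP<(term x, term y), count>. The function names are hypothetical, and the sketch assumes that a strong rule x => y means "when x is among a microblog's feature terms, add y", which is one reading of step 3.

```python
from collections import Counter
from itertools import combinations


def mine_strong_pairs(docs, feature_terms, min_support, min_confidence):
    """Return {x: {y, ...}} for the strong rules x => y among the feature terms."""
    total = len(docs)
    term_count = Counter()  # microblogs containing each feature term (a + b)
    pair_count = Counter()  # MAP<(term x, term y), count>
    for terms in docs:
        feats = sorted(set(terms) & feature_terms)
        term_count.update(feats)
        pair_count.update(combinations(feats, 2))  # step 1: two-term tuples
    strong = {}
    for (x, y), count in pair_count.items():
        if count < total * min_support:  # step 2.1: support filter
            continue
        # Step 2.2: support = count / total, which is >= min_support here.
        for left, right in ((x, y), (y, x)):
            conf = count / term_count[left]  # step 2.3: count / (a + b)
            if conf >= min_confidence:
                strong.setdefault(left, set()).add(right)
    return strong


def expand_features(terms, feature_terms, strong):
    """Step 3: add the strongly associated terms to one microblog's feature set."""
    feats = set(terms) & feature_terms
    extra = set()
    for t in feats:
        extra |= strong.get(t, set())
    return feats | extra
```

Restricting the pairs to the N selected feature terms is what keeps the pair map small: a microblog with m feature terms contributes at most m*(m-1)/2 pairs, and a single pass over the data suffices.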

FIG. 2 is a hardware architecture diagram of the mining system for microblog text classification of the invention. The system comprises an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another.

The acquisition module acquires existing microblog data. Specifically, the acquisition module acquires the data already available on a microblog website. Owing to the limits of the analysis technique, this embodiment acquires only microblog data whose content is in Chinese. The microblog data comprise: user ID, user name, and microblog text.

The processing module performs denoising and enhancement preprocessing on the acquired images in preparation for later processing and screening. Specifically, the processing module applies denoising and enhancement to the acquired images to improve their recognizability.

The preprocessing module analyzes and preprocesses the acquired microblog text. Specifically, the preprocessing module initializes each microblog text: punctuation marks and other special symbols are removed, non-Chinese characters are removed, and word segmentation is performed, which yields the term set of the microblog text. The microblog is also classified manually.

The extraction module performs feature extraction on the microblog text, that is, it traverses the term set of the microblog text and removes stop-word terms.

The calculation module performs feature selection on the microblog data. Specifically, the calculation module computes the chi-square test (CHI) value of each term in the original feature term set and takes the N terms with the highest values as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is ordered by mutual information value, where N is user-defined and N is less than the total number of terms.

The calculation module computes the chi-square test (CHI) value as follows:

For each word, count: a, the number of microblog texts in this category that contain the word; b, the number of microblog texts not in this category that contain the word; c, the number of microblog texts in this category that do not contain the word; and d, the number of microblog texts that are not in this category and do not contain the word.

z1=a*d-b*c.

CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).

The mining module mines association rules over the N terms, as follows:

The mining module first traverses each microblog in the acquired microblog data and forms two-term tuples (pairs) from the feature term set of each microblog. Each tuple is added to MAP<(term x, term y), count>, where count is the number of times the tuple occurs.

Next, since the number of occurrences of each term has already been computed during feature selection, the support and confidence thresholds are set: tuples whose count is less than (total number of microblogs in the microblog data) * (the set support threshold) are filtered out; support(x=>y) = count / (total number of microblogs in the microblog data); confidence(x=>y) = count / (a+b).

Finally, according to the support and confidence thresholds set above, the strong association rules are selected. The terms strongly associated with the feature terms of a microblog text are added to that microblog's feature term set, so as to improve microblog text classification accuracy.

Although the invention has been described with reference to the presently preferred embodiments, those skilled in the art will understand that these embodiments serve only to illustrate the invention and do not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A mining method for microblog text classification, characterized in that the method comprises the following steps:
a. acquiring existing microblog data;
b. analyzing and preprocessing the acquired microblog text;
c. traversing the term set of the microblog text and removing stop-word terms;
d. computing the chi-square test (CHI) value of each term in the original feature term set and taking the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts;
e. mining association rules over the N terms and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
2. The method of claim 1, characterized in that the microblog data comprise: user ID, user name, and microblog text.
3. The method of claim 2, characterized in that step b comprises removing punctuation marks and other special symbols from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the term set of the microblog text, and manually classifying the microblog.
4. The method of claim 3, characterized in that the feature term set is ordered by mutual information value, wherein N is user-defined and N is less than the total number of terms.
5. The method of claim 4, characterized in that the chi-square test (CHI) value is computed as follows:
for each word, count: a, the number of microblog texts in this category that contain the word; b, the number of microblog texts not in this category that contain the word; c, the number of microblog texts in this category that do not contain the word; and d, the number of microblog texts that are not in this category and do not contain the word;
z1=a*d-b*c;
CHI=(z1*z1*float(N))/((a+c)*(a+b)*(b+d)*(c+d)).
6. The method of claim 5, characterized in that step e comprises:
traversing each microblog in the acquired microblog data and forming two-term tuples from the feature term set of each microblog;
setting the support and confidence thresholds;
according to the set support and confidence thresholds, selecting the strong association rules and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set.
7. A mining system for microblog text classification, characterized in that the system comprises an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another, wherein:
the acquisition module acquires existing microblog data;
the preprocessing module analyzes and preprocesses the acquired microblog text;
the extraction module traverses the term set of the microblog text and removes stop-word terms;
the calculation module computes the chi-square test (CHI) value of each term in the original feature term set and takes the N terms with the highest values as the feature term set, the original feature term set being the set of terms of all microblog texts;
the mining module mines association rules over the N terms and adds the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
8. The system of claim 7, characterized in that the microblog data comprise: user ID, user name, and microblog text.
9. The system of claim 8, characterized in that the preprocessing module removes punctuation marks and other special symbols from the microblog text, removes non-Chinese characters, and performs word segmentation to obtain the term set of the microblog text.
10. The system of claim 9, characterized in that the feature term set is ordered by mutual information value, wherein N is user-defined and N is less than the total number of terms.
CN201310591482.8A 2013-11-21 2013-11-21 Mining method and system for microblog text classification Pending CN103593454A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310591482.8A CN103593454A (en) 2013-11-21 2013-11-21 Mining method and system for microblog text classification

Publications (1)

Publication Number Publication Date
CN103593454A true CN103593454A (en) 2014-02-19

Family

ID=50083595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310591482.8A Pending CN103593454A (en) 2013-11-21 2013-11-21 Mining method and system for microblog text classification

Country Status (1)

Country Link
CN (1) CN103593454A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6272478B1 (en) * 1997-06-24 2001-08-07 Mitsubishi Denki Kabushiki Kaisha Data mining apparatus for discovering association rules existing between attributes of data
CN101510204A (en) * 2009-03-02 2009-08-19 南京航空航天大学 Abnormal enquiry and monitor method based on target condition association rule database
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
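Mai Yihua (麦艺华): "Social Network Analysis and Applications for Chinese Microblogs" (《面向中文微博的社会网络分析及应用》), China Master's Theses Full-text Database, Information Science and Technology series *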

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361008A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Microblog classification method based on dictionary or/and threshold value
CN105653533A (en) * 2014-11-13 2016-06-08 腾讯数码(深圳)有限公司 Method and device for updating classified associated word set
CN105653533B (en) * 2014-11-13 2019-10-25 腾讯数码(深圳)有限公司 Method and device for updating classified associated word set
WO2017101342A1 (en) * 2015-12-15 2017-06-22 乐视控股(北京)有限公司 Sentiment classification method and apparatus
CN107302474A (en) * 2017-07-04 2017-10-27 四川无声信息技术有限公司 Feature extraction method and device for network data application
CN107302474B (en) * 2017-07-04 2020-02-04 四川无声信息技术有限公司 Feature extraction method and device for network data application
CN107357925A (en) * 2017-07-26 2017-11-17 深圳中泓在线股份有限公司 Personal ledger method for microblog and WeChat
CN107391489A (en) * 2017-07-31 2017-11-24 阿里巴巴集团控股有限公司 Text analysis method and device
CN107391489B (en) * 2017-07-31 2020-09-25 阿里巴巴集团控股有限公司 Text analysis method and device
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium
CN112733828A (en) * 2020-12-30 2021-04-30 航天信息股份有限公司 Method and system for character recognition

Similar Documents

Publication Publication Date Title
US11715315B2 (en) Systems, methods and computer readable media for identifying content to represent web pages and creating a representative image from the content
CN103593454A (en) Mining method and system for microblog text classification
CN104281882B (en) Method and system for predicting social network information popularity based on user characteristics
CN102722709B (en) Method and device for identifying garbage pictures
CN104765729B (en) Cross-platform microblog community account matching method
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN105005594A (en) Abnormal Weibo user identification method
CN104239539A (en) Microblog information filtering method based on multi-information fusion
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN107291886A (en) Microblog topic detection method and system based on incremental clustering algorithm
CN107944032B (en) Method and apparatus for generating information
CN106909669B (en) Method and device for detecting promotion information
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN104615715A (en) Social network event analyzing method and system based on geographic positions
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN105224593A (en) Method for mining frequently co-occurring accounts in short-lived online transactions
CN105573971B (en) Table reconfiguration device and method
CN103455593A (en) Service competitiveness realization system and method based on social contact network
CN106681980A (en) Method and device for analyzing junk short messages
US20160283582A1 (en) Device and method for detecting similar text, and application
CN105677757A (en) Big data similarity join method based on prefix-affix filtering
US9332031B1 (en) Categorizing accounts based on associated images
CN105589935A (en) Social group recognition method
CN107832611A (en) Bot program detection and classification method combining dynamic and static features
CN104268214B (en) User gender identification method and system based on microblog user relations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140219