CN103593454A - Mining method and system for microblog text classification - Google Patents
- Publication number
- CN103593454A
- Authority
- CN
- China
- Prior art keywords
- lexical item
- microblogging
- text
- microblog
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a mining method for microblog text classification, comprising the following steps: acquiring existing microblog data; analyzing and preprocessing the acquired microblog text; traversing the term set of the microblog text and removing stop-word terms; computing the chi-square (CHI) test value for each term in the original feature term set and taking the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts; and performing association rule mining on the N terms and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy. The invention also relates to a mining system for microblog text classification. The invention effectively reduces the complexity of association rule mining over the original microblog texts, greatly reduces the amount of data to be analyzed, and improves microblog text classification accuracy.
Description
Technical Field
The invention relates to a mining method and system for microblog text classification.
Background
Microblogging has become one of the important platforms and media for social interaction. China has more than 400 million microblog users, while Twitter has more than 500 million users and more than 200 million messages sent per day, making it the second largest social networking site after Facebook. In recent years, microblogs have become the birthplace of countless hot topics and trends. With the popularity of social networking sites such as Sina Weibo and Tencent Weibo in China, microblogs and other social media have not only become platforms on which Internet users publish, share, and spread information, but have also accumulated user behavior data on a large scale. In May 2012, Lu Yi, deputy general manager of the Sina Weibo business unit, stated that Sina Weibo had more than 300 million registered users, that 60% of its active users logged in from mobile devices, and that users posted more than 100 million microblogs per day on average. The volume of microblog data is thus growing steadily, so mining microblog data is feasible, novel, and practical, and has attracted wide attention from academia in China and abroad.
In microblog text classification, association rules can effectively improve classification accuracy. The support of an association rule X => Y in a data set is the percentage of transactions in the data set that contain both item X and item Y, that is, a probability; the confidence is the percentage of transactions containing item X that also contain item Y, that is, a conditional probability. A rule that satisfies both the minimum support threshold and the minimum confidence threshold is regarded as a strong association rule. These thresholds are set manually according to the mining requirements.
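The two measures defined above can be illustrated with a minimal Python sketch. The function names (`support`, `confidence`, `is_strong`) and the toy transactions are assumptions made for illustration only; they do not appear in the patent.

```python
def support(transactions, x, y):
    """Fraction of transactions containing both x and y: P(x, y)."""
    if not transactions:
        return 0.0
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y: P(y | x)."""
    with_x = sum(1 for t in transactions if x in t)
    if with_x == 0:
        return 0.0
    both = sum(1 for t in transactions if x in t and y in t)
    return both / with_x


def is_strong(transactions, x, y, min_support, min_confidence):
    """A rule x => y is strong when it meets both user-set thresholds."""
    return (support(transactions, x, y) >= min_support
            and confidence(transactions, x, y) >= min_confidence)


# Hypothetical toy data: each transaction is the term set of one microblog.
transactions = [{"雨", "伞"}, {"雨", "伞"}, {"雨"}, {"晴"}]
```

With this toy data, the rule 伞 => 雨 has higher confidence than 雨 => 伞 although both have the same support, which shows that confidence, unlike support, depends on the direction of the rule.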
Existing association rule algorithms fall into two main classes: the Apriori algorithm and the FP-tree frequent-itemset algorithm.
Apriori algorithm: first find all frequent itemsets, that is, itemsets that occur at least as frequently as the predefined minimum support. Strong association rules are then generated from the frequent itemsets; these rules must satisfy the minimum support and the minimum confidence. The frequent itemsets found are used to generate the desired rules: all rules containing only items of the set are generated, with exactly one item on the right-hand side of each rule. Of the generated rules, only those whose confidence exceeds the user-specified minimum are kept. All frequent itemsets are generated by a recursive, level-by-level method.
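The frequent-itemset stage of Apriori described above can be sketched as follows. This is a simplified textbook-style illustration, not code from the patent; the function name `apriori` and the toy transactions are assumptions, and rule generation is omitted for brevity.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # One pass over the data per level to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


# Hypothetical toy data.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(transactions, min_support=0.5)
```

The repeated pass over `transactions` inside the loop is the repeated database scan, and `candidates` is the candidate set, that the following paragraphs identify as the cost of Apriori.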
FP-tree frequent-itemset algorithm: this algorithm uses a divide-and-conquer strategy. After the first scan, the frequent itemsets in the database are compressed into a frequent-pattern tree (FP-tree) while the association information is retained. The FP-tree is then split into a number of conditional databases, each associated with a frequent itemset of length 1, and these conditional databases are mined separately. When the original data volume is large, partitioning can also be applied so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths and is far more efficient than the Apriori algorithm.
However, for short texts such as microblogs, the Apriori algorithm generates a large number of candidate sets and may need to scan the database repeatedly, which greatly increases mining complexity and mining time. The FP-tree frequent-itemset algorithm improves efficiency, but for short texts its efficiency is still low.
Summary of the Invention
In view of this, it is necessary to provide a mining method and system for microblog text classification.
The invention provides a mining method for microblog text classification, comprising the following steps: a. acquiring existing microblog data; b. analyzing and preprocessing the acquired microblog text; c. traversing the term set of the microblog text and removing stop-word terms; d. computing the chi-square (CHI) test value for each term in the original feature term set and taking the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts; e. performing association rule mining on the N terms and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
The microblog data comprises: user ID, user name, and microblog text.
Step b comprises removing punctuation and other special symbols from the microblog text, removing non-Chinese characters, and performing word segmentation to obtain the term set of the microblog text, and manually classifying the microblog.
The feature term set is sorted by mutual information value, where N is user-defined and smaller than the total number of terms.
The chi-square (CHI) test value is computed as follows. For each term, count: a, the number of microblog texts in the category that contain the term; b, the number of microblog texts not in the category that contain the term; c, the number of microblog texts in the category that do not contain the term; d, the number of microblog texts that are not in the category and do not contain the term. Then z1 = a*d - b*c and CHI = (z1*z1*N) / ((a+c)*(a+b)*(b+d)*(c+d)), N here being the total number of microblog texts, a+b+c+d, as in the standard chi-square statistic.
Step e comprises: traversing each microblog in the acquired microblog data and forming pairs (2-tuples) from the feature term set of each microblog; setting thresholds for support and confidence; and, according to these thresholds, extracting the strong association rules and adding the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set.
The invention also provides a mining system for microblog text classification, comprising an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another, wherein: the acquisition module is configured to acquire existing microblog data; the preprocessing module is configured to analyze and preprocess the acquired microblog text; the extraction module is configured to traverse the term set of the microblog text and remove stop-word terms; the calculation module is configured to compute the chi-square (CHI) test value for each term in the original feature term set and take the N terms with the highest values as the feature term set, where the original feature term set is the set of terms of all microblog texts; and the mining module is configured to perform association rule mining on the N terms and add the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
The microblog data comprises: user ID, user name, and microblog text.
The preprocessing module is configured to remove punctuation and other special symbols from the microblog text, remove non-Chinese characters, and perform word segmentation to obtain the term set of the microblog text.
The feature term set is sorted by mutual information value, where N is user-defined and smaller than the total number of terms.
The mining method and system for microblog text classification of the invention take the text structure of microblogs into account. Given that microblog texts are short and that association rules are needed for them, the invention proposes a simple and effective association rule mining method for microblog text classification. Compared with earlier association rule mining methods, the invention greatly reduces time complexity and the amount of data to be analyzed, and markedly improves microblog text classification accuracy.
Brief Description of the Drawings
Fig. 1 is a flowchart of the mining method for microblog text classification according to the invention;
Fig. 2 is a hardware architecture diagram of the mining system for microblog text classification according to the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of a preferred embodiment of the mining method for microblog text classification according to the invention.
Step S401: acquire existing microblog data. Specifically, data already available on a microblog website is acquired. Owing to the limits of the analysis technology, this embodiment acquires only microblog data whose content is in Chinese. The microblog data comprises: user ID, user name, and microblog text.
Step S402: analyze and preprocess the acquired microblog text. Specifically, each microblog text is initialized: punctuation and other special symbols are removed, non-Chinese characters are removed, and word segmentation is performed, yielding the term set of the microblog text; the microblog is then classified manually.
Step S403: perform feature extraction on the microblog text, that is, traverse the term set of the microblog text and remove stop-word terms.
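Steps S402 and S403 can be sketched in Python as follows. The patent does not name a word segmenter, so the sketch takes the segmenter as a parameter and defaults to a placeholder that emits overlapping character bigrams; a real Chinese segmenter (for example `jieba.lcut` from the jieba library) could be passed in instead. The names `preprocess` and `char_bigrams`, and the stop-word list, are assumptions for illustration.

```python
import re

# Runs of CJK Unified Ideographs. Everything else (punctuation, Latin
# letters, digits, emoji) is discarded, covering both removal steps of S402.
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_bigrams(run):
    """Placeholder segmenter (an assumption): overlapping character bigrams."""
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def preprocess(text, segment=char_bigrams, stop_words=frozenset()):
    """Return the list of terms of one microblog text.

    S402: keep only Chinese characters and segment them into terms.
    S403: traverse the terms and drop those in the stop-word list.
    """
    terms = []
    for run in _CHINESE_RUN.findall(text):
        terms.extend(segment(run))
    return [t for t in terms if t not in stop_words]


# Hypothetical input and stop-word list.
terms = preprocess("今天天气好! Hello 123", stop_words={"天天"})
```

Passing the segmenter as an argument keeps the cleaning and stop-word logic independent of whichever segmentation tool is actually used.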
Step S404: perform feature selection on the microblog data. Specifically, the chi-square (CHI) test value is computed for each term in the original feature term set, and the N terms with the highest values are taken as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is sorted by mutual information value, where N is user-defined and smaller than the total number of terms.
The chi-square (CHI) test value is computed as follows:
For each term, count: a, the number of microblog texts in the category that contain the term; b, the number of microblog texts not in the category that contain the term; c, the number of microblog texts in the category that do not contain the term; d, the number of microblog texts that are not in the category and do not contain the term.
z1 = a*d - b*c.
CHI = (z1*z1*N) / ((a+c)*(a+b)*(b+d)*(c+d)). Here N is taken to be the total number of microblog texts, a+b+c+d, as in the standard chi-square statistic; it is distinct from the number N of selected feature terms.
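The CHI computation and top-N selection of step S404 can be sketched as below. Two points are assumptions not stated in the patent: the total count in the formula is taken as a+b+c+d, and, because the patent does not say how per-category CHI values are combined into one score per term, the sketch uses the maximum over categories. The names `chi_square` and `select_features` and the toy corpus are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """CHI = (ad - bc)^2 * n / ((a+c)(a+b)(b+d)(c+d)), with n = a+b+c+d."""
    n = a + b + c + d
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate contingency table
    z1 = a * d - b * c
    return z1 * z1 * n / denom


def select_features(docs, labels, top_n):
    """docs: list of term lists; labels: manually assigned category per doc."""
    total = len(docs)
    cat_sizes = Counter(labels)
    df = Counter()      # term -> number of docs containing it (a + b)
    df_cat = Counter()  # (term, category) -> docs in category containing it (a)
    for terms, label in zip(docs, labels):
        for t in set(terms):
            df[t] += 1
            df_cat[(t, label)] += 1
    scores = {}
    for t in df:
        best = 0.0
        for cat, size in cat_sizes.items():
            a = df_cat[(t, cat)]
            b = df[t] - a
            c = size - a
            d = total - size - b
            # Assumption: score a term by its maximum CHI over categories.
            best = max(best, chi_square(a, b, c, d))
        scores[t] = best
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n], scores


# Hypothetical toy corpus with two categories.
docs = [["球", "赛"], ["球"], ["股", "市"], ["股"]]
labels = ["sports", "sports", "finance", "finance"]
top_terms, scores = select_features(docs, labels, top_n=2)
```

The per-term document counts in `df` correspond to the quantity a+b that the association rule step later reuses as the denominator of confidence.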
Step S405: perform association rule mining on the N terms. The specific steps are as follows:
1. Traverse each microblog in the acquired microblog data, form pairs (2-tuples) from the feature term set of each microblog, and add each pair to MAP<(term x, term y), count>, where count is the number of times the pair occurs.
2. The number of occurrences of each term was already computed during feature selection. Set the thresholds for support and confidence.
2.1. Filter out the pairs whose count is less than (total number of microblogs in the microblog data) * (support threshold);
2.2. support(x=>y) = count / (total number of microblogs in the microblog data);
2.3. confidence(x=>y) = count / (a+b), where a+b is the number of microblog texts containing term x.
3. According to the support and confidence thresholds set above, extract the strong association rules. Add the terms strongly associated with the feature terms of a microblog text to that microblog's feature term set, so as to improve microblog text classification accuracy.
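The pair-based mining of step S405 can be sketched as follows. The patent does not state whether the pairs are ordered; this sketch assumes unordered pairs and evaluates the confidence in both directions. The names `mine_pair_rules` and `expand_features`, the thresholds, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(docs, feature_terms, min_support, min_confidence):
    """Return {x: set of y} for every strong rule x => y between feature terms."""
    total = len(docs)
    features = set(feature_terms)
    term_count = Counter()  # term -> number of microblogs containing it (a + b)
    pair_count = Counter()  # plays the role of MAP<(term x, term y), count>
    for terms in docs:
        present = sorted(features.intersection(terms))
        term_count.update(present)
        pair_count.update(combinations(present, 2))  # step 1: form the pairs
    rules = {}
    for (x, y), count in pair_count.items():
        if count < total * min_support:  # step 2.1: support filter
            continue
        for lhs, rhs in ((x, y), (y, x)):
            # Step 2.3: confidence(lhs => rhs) = count / (docs containing lhs)
            if count / term_count[lhs] >= min_confidence:
                rules.setdefault(lhs, set()).add(rhs)
    return rules


def expand_features(terms, feature_terms, rules):
    """Step 3: a microblog's feature terms plus their strongly associated terms."""
    features = set(feature_terms)
    own = [t for t in dict.fromkeys(terms) if t in features]
    extra = sorted({y for x in own for y in rules.get(x, ())} - set(own))
    return own + extra


# Hypothetical toy data: term lists of four microblogs.
docs = [["a", "b"], ["a", "b"], ["a", "c"], ["b"]]
rules = mine_pair_rules(docs, {"a", "b", "c"},
                        min_support=0.5, min_confidence=0.6)
```

Because only pairs of the N selected feature terms are counted, a single pass over the data suffices and no candidate itemsets longer than two are generated, which is the simplification over Apriori and FP-growth that the method relies on.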
Fig. 2 is a hardware architecture diagram of the mining system for microblog text classification according to the invention. The system comprises an acquisition module, a preprocessing module, an extraction module, a calculation module, and a mining module that are electrically connected to one another.
The acquisition module is configured to acquire existing microblog data. Specifically, the acquisition module acquires data already available on a microblog website. Owing to the limits of the analysis technology, this embodiment acquires only microblog data whose content is in Chinese. The microblog data comprises: user ID, user name, and microblog text.
The processing module is configured to perform denoising and enhancement preprocessing on the acquired images in preparation for later processing and screening. Specifically, the processing module performs denoising and enhancement on the acquired images to improve image recognizability.
The preprocessing module is configured to analyze and preprocess the acquired microblog text. Specifically, the preprocessing module initializes each microblog text: punctuation and other special symbols are removed, non-Chinese characters are removed, and word segmentation is performed, yielding the term set of the microblog text; the microblog is then classified manually.
The extraction module is configured to perform feature extraction on the microblog text, that is, the extraction module traverses the term set of the microblog text and removes stop-word terms.
The calculation module is configured to perform feature selection on the microblog data. Specifically, the calculation module computes the chi-square (CHI) test value for each term in the original feature term set and takes the N terms with the highest values as the feature term set. The original feature term set is the set of terms of all microblog texts. The feature term set is sorted by mutual information value, where N is user-defined and smaller than the total number of terms.
The calculation module computes the chi-square (CHI) test value as follows:
For each term, count: a, the number of microblog texts in the category that contain the term; b, the number of microblog texts not in the category that contain the term; c, the number of microblog texts in the category that do not contain the term; d, the number of microblog texts that are not in the category and do not contain the term.
z1 = a*d - b*c.
CHI = (z1*z1*N) / ((a+c)*(a+b)*(b+d)*(c+d)), N here being the total number of microblog texts, a+b+c+d.
The mining module is configured to perform association rule mining on the N terms, as follows:
The mining module first traverses each microblog in the acquired microblog data, forms pairs (2-tuples) from the feature term set of each microblog, and adds each pair to MAP<(term x, term y), count>, where count is the number of times the pair occurs.
Then, since the number of occurrences of each term was already computed during feature selection, the thresholds for support and confidence are set: pairs whose count is less than (total number of microblogs in the microblog data) * (support threshold) are filtered out; support(x=>y) = count / (total number of microblogs in the microblog data); confidence(x=>y) = count / (a+b), where a+b is the number of microblog texts containing term x.
Finally, according to the support and confidence thresholds set above, the strong association rules are extracted. The terms strongly associated with the feature terms of a microblog text are added to that microblog's feature term set, so as to improve microblog text classification accuracy.
Although the invention has been described with reference to the presently preferred embodiments, those skilled in the art will understand that these embodiments only illustrate the invention and do not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310591482.8A CN103593454A (en) | 2013-11-21 | 2013-11-21 | Mining method and system for microblog text classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310591482.8A CN103593454A (en) | 2013-11-21 | 2013-11-21 | Mining method and system for microblog text classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103593454A true CN103593454A (en) | 2014-02-19 |
Family
ID=50083595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310591482.8A Pending CN103593454A (en) | 2013-11-21 | 2013-11-21 | Mining method and system for microblog text classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103593454A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6272478B1 (en) * | 1997-06-24 | 2001-08-07 | Mitsubishi Denki Kabushiki Kaisha | Data mining apparatus for discovering association rules existing between attributes of data |
CN101634983A (en) * | 2008-07-21 | 2010-01-27 | 华为技术有限公司 | Method and device for text classification |
CN101510204A (en) * | 2009-03-02 | 2009-08-19 | 南京航空航天大学 | Abnormal enquiry and monitor method based on target condition association rule database |
CN101655837A (en) * | 2009-09-08 | 2010-02-24 | 北京邮电大学 | Method for detecting and correcting error on text after voice recognition |
Non-Patent Citations (1)
Title |
---|
麦艺华: "《面向中文微博的社会网络分析及应用》", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361008A (en) * | 2014-10-11 | 2015-02-18 | 北京中搜网络技术股份有限公司 | Microblog classification method based on dictionary or/and threshold value |
CN105653533A (en) * | 2014-11-13 | 2016-06-08 | 腾讯数码(深圳)有限公司 | Method and device for updating classified associated word set |
CN105653533B (en) * | 2014-11-13 | 2019-10-25 | 腾讯数码(深圳)有限公司 | A kind of method and apparatus updating classification associated set of words |
WO2017101342A1 (en) * | 2015-12-15 | 2017-06-22 | 乐视控股(北京)有限公司 | Sentiment classification method and apparatus |
CN107302474A (en) * | 2017-07-04 | 2017-10-27 | 四川无声信息技术有限公司 | The feature extracting method and device of network data application |
CN107302474B (en) * | 2017-07-04 | 2020-02-04 | 四川无声信息技术有限公司 | Feature extraction method and device for network data application |
CN107357925A (en) * | 2017-07-26 | 2017-11-17 | 深圳中泓在线股份有限公司 | Personal ledger method in microblogging wechat |
CN107391489A (en) * | 2017-07-31 | 2017-11-24 | 阿里巴巴集团控股有限公司 | A kind of text analyzing method and device |
CN107391489B (en) * | 2017-07-31 | 2020-09-25 | 阿里巴巴集团控股有限公司 | Text analysis method and device |
WO2022116444A1 (en) * | 2020-12-01 | 2022-06-09 | 平安科技(深圳)有限公司 | Text classification method and apparatus, and computer device and medium |
CN112733828A (en) * | 2020-12-30 | 2021-04-30 | 航天信息股份有限公司 | Method and system for character recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20140219 |