CN102222072A - Method and device for information classification - Google Patents

Method and device for information classification

Info

Publication number
CN102222072A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
category
number
times
information
ratio
Prior art date
Application number
CN 201010151119
Other languages
Chinese (zh)
Inventor
李战胜
王迪
薛晔伟
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention discloses a method and device for information classification, belonging to the field of Internet information retrieval. The method comprises: obtaining the search results for a piece of information in a search engine and recording the class of each search result; counting the number of occurrences of each class among all the recorded classes; and choosing a class for the information according to the number of occurrences of each class. The device comprises a class recording module, an occurrence counting module and a class choosing module. Because the information is classified according to how often each class occurs in the search results, the accuracy of classifying the information is increased.

Description

A method and apparatus for classifying information

Technical Field

[0001] The present invention relates to the field of Internet information retrieval, and in particular to a method and apparatus for classifying information.

Background

[0002] With the spread of the Internet and the rapid growth in the volume of information, web data mining has become a research focus in the field of information retrieval, and text classification, as an important web data mining technique, has been widely studied and applied. Several statistical classification and machine learning methods, such as naive Bayes, k-nearest neighbor (kNN), decision trees, and support vector machines (SVM), have been applied to text classification with quite good results.

[0003] Text classification methods such as naive Bayes, k-nearest neighbor, decision trees, and support vector machines can each be divided into two processes, training and prediction. Training: first, collect a large-scale corpus corresponding to the classification scheme; second, represent the corpus with features; finally, build a model from the features of the corpus to form a classifier. Prediction: first, represent the new text with features; second, input the features of the new text into the classifier; finally, output the category of the new text.

[0004] In recent years, however, short texts of all kinds, such as article abstracts, e-mails, online instant messages, and the queries (query strings) that users submit when searching, have emerged in large numbers. When applying the prior-art text classification methods to these short texts, which are brief and varied in structure, the inventors found at least the following disadvantage:

[0005] The prior-art text classification methods all derive the category of a new text from the features of that text. A short text itself contains very little information, so a category derived from its features has low accuracy; that is, the accuracy of classifying short texts is low.

Summary

[0006] To improve the accuracy of classifying information, embodiments of the present invention provide a method and apparatus for classifying information. The technical solutions are as follows:

[0007] A method for classifying information, the method comprising:

[0008] obtaining the search results for the information in a search engine, and recording the category of each search result;

[0009] counting, among all the recorded categories, the number of times each category occurs;

[0010] selecting a category for the information according to the number of times each category occurs.

[0011] Further, before obtaining the search results for the information in the search engine, the method further comprises:

[0012] labeling the search results with categories.

[0013] Further, selecting a category for the information according to the number of times each category occurs specifically comprises:

[0014] calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0015] selecting a category for the information according to the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.
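
The ratio in [0014] can be written compactly. The notation below is introduced here only for illustration and does not appear in the original: let $C$ be the set of recorded categories and $n_c$ the number of recorded search results whose category is $c$. Then the ratio for category $c$ is

```latex
r_c = \frac{n_c}{\sum_{c' \in C} n_{c'}}, \qquad \text{so that} \qquad \sum_{c \in C} r_c = 1 .
```

Because every ratio shares the same denominator, ranking categories by $r_c$ gives the same order as ranking them by $n_c$; the ratio matters mainly when it is compared against a fixed threshold.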

[0016] Further, selecting a category for the information according to the number of times each category occurs specifically comprises:

[0017] calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0018] selecting a category for the information according to both the number of times each category occurs and the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.

[0019] An apparatus for classifying information, the apparatus comprising:

[0020] a category recording module, configured to obtain the search results for the information in a search engine and record the category of each search result;

[0021] a counting module, configured to count, after the category recording module has recorded the category of each search result, the number of times each category occurs among all the recorded categories;

[0022] a category selection module, configured to select a category for the information according to the number of times each category occurs, after the counting module has counted the number of times each category occurs.

[0023] Further, the apparatus further comprises:

[0024] a category labeling module, configured to label the search results with categories before the category recording module obtains the search results for the information in the search engine.

[0025] Further, the category selection module specifically comprises:

[0026] a first ratio calculation unit, configured to calculate, after the counting module has counted the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0027] a first category selection unit, configured to select a category for the information according to the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories, after the first ratio calculation unit has calculated that ratio.

[0028] Further, the category selection module specifically comprises:

[0029] a second ratio calculation unit, configured to calculate, after the counting module has counted the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0030] a second category selection unit, configured to select a category for the information according to both the number of times each category occurs and the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories, after the second ratio calculation unit has calculated that ratio.

[0031] The beneficial effects of the technical solutions provided by the embodiments of the present invention are:

[0032] Classifying information by the number of times each category occurs in the search results improves the accuracy of classification. Classifying in this way requires no training set and no prior training, which reduces the workload, and it can handle categories that cannot be trained from a training set (such as brain teasers and jokes). Search results can be updated in real time, which makes the classification more timely. Because the information can be classified simply by counting the number of times each category occurs, the method is simple to implement.

Brief Description of the Drawings

[0033] FIG. 1 is a flowchart of a method for classifying information provided by Embodiment 1 of the present invention;

[0034] FIG. 2 is a flowchart of a method for classifying information provided by Embodiment 2 of the present invention;

[0035] FIG. 3 is a schematic structural diagram of an apparatus for classifying information provided by Embodiment 3 of the present invention;

[0036] FIG. 4 is a schematic structural diagram of another apparatus for classifying information provided by Embodiment 3 of the present invention.

Detailed Description

[0037] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0038] Embodiment 1

[0039] Referring to FIG. 1, an embodiment of the present invention provides a method for classifying information, comprising:

[0040] 101: Obtain the search results for the information in a search engine, and record the category of each search result.

[0041] 102: Among all the recorded categories, count the number of times each category occurs.

[0042] 103: Select a category for the information according to the number of times each category occurs.
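
Steps 101 to 103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `fetch_results` is a hypothetical stand-in for a search engine call (the patent names no API), each result is assumed to be a dict with an optional `category` key, and step 103 uses the most-frequent rule, which is only one of the selection options the description gives later.

```python
from collections import Counter
from typing import Callable, Optional


def classify(info: str,
             fetch_results: Callable[[str], list[dict]]) -> Optional[str]:
    """Pick a category for `info` from the categories of its search results."""
    # Step 101: obtain the search results and record each result's category.
    recorded = [result.get("category") for result in fetch_results(info)]
    # Step 102: count how often each category occurs. Skipping unlabeled
    # results here is an assumption of this sketch.
    counts = Counter(c for c in recorded if c is not None)
    if not counts:
        return None
    # Step 103: select the most frequent category.
    return counts.most_common(1)[0][0]


def fake_search(query: str) -> list[dict]:
    # Hypothetical stub standing in for a real search engine.
    return [{"category": "Culture"}, {"category": "Politics"},
            {"category": "Culture"}, {}]
```

Passing the search function in as a parameter keeps the counting logic independent of any particular search engine.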

[0043] Further, before obtaining the search results for the information in the search engine, the method further comprises:

[0044] labeling the search results with categories.

[0045] Further, selecting a category for the information according to the number of times each category occurs specifically comprises:

[0046] calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0047] selecting a category for the information according to the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.

[0048] Alternatively, selecting a category for the information according to the number of times each category occurs specifically comprises:

[0049] calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories;

[0050] selecting a category for the information according to both the number of times each category occurs and the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.

[0051] The method for classifying information described in this embodiment classifies information by the number of times each category occurs in the search results, which improves the accuracy of classification. Classifying in this way requires no training set and no prior training, which reduces the workload, and it can handle categories that cannot be trained from a training set (such as brain teasers and jokes). Search results can be updated in real time, which makes the classification more timely. Because the information can be classified simply by counting the number of times each category occurs, the method is simple to implement.

[0052] The method for classifying information provided by the embodiments of the present invention can be used to classify various kinds of information, such as long texts and short texts. For a better understanding of the embodiments, the method is further explained below using the classification of short texts as an example.

[0053] Embodiment 2

[0054] Referring to FIG. 2, an embodiment of the present invention provides a method for classifying information, comprising:

[0055] 201: Obtain the search results for the short text in a search engine, and record the category of each search result.

[0056] The short text may be an article abstract, an e-mail, an online instant message, the title of an article, a user's search query, and so on. The search results for the short text may be obtained in a manner similar to the way the prior art obtains search results through search engines such as Baidu and Google. The content of a search result may be an article, a passage of text, and so on. Note that in the embodiments of the present invention most of the obtained search results are labeled with a category, namely the category to which the content of the search result belongs. For example, suppose the short text is "What is the capital of China?" and 11 search results are obtained for it: results 1 and 7 have the category Politics; results 2, 4, 8, and 10 have the category Culture; results 5, 6, and 11 have the category Economy; and results 3 and 9 have no category. The corresponding records are <result 1, Politics>, <result 2, Culture>, <result 3, none>, <result 4, Culture>, <result 5, Economy>, ..., <result 11, Economy>.
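
The records in this example can be written down as a small data structure. The sketch below assumes each record is a (result index, category) pair and uses `None` for an unlabeled result; the variable names are illustrative and do not come from the patent.

```python
# Category of each of the 11 search results in the example; None = unlabeled.
labels_by_result = {
    1: "Politics", 7: "Politics",
    2: "Culture", 4: "Culture", 8: "Culture", 10: "Culture",
    5: "Economy", 6: "Economy", 11: "Economy",
    3: None, 9: None,
}

# Records in result order, mirroring <result 1, Politics>, <result 2, Culture>, ...
records = sorted(labels_by_result.items())

for index, category in records:
    print(f"<result {index}, {category if category is not None else 'none'}>")
```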

[0057] Moreover, in the early stage of using the classification method described in the embodiments, when the vast majority of search results have not yet been labeled with a category, the search results may be labeled in advance in the following ways. First, the categories may be labeled manually: based on the content of a search result, a person judges which category it belongs to and marks the search result with that category. Second, when a user searches through the search engine, the user selects the category of the short text he or she entered, and the category selected by the user is used as the category of the search results retrieved for that short text. Third, a prior-art classification method may be used to recommend a category for each search result automatically, and the recommended category is used as the label of that search result. Note that labeling is not limited to these three ways; any other feasible method may be chosen according to the actual application, and no specific limitation is imposed here.

[0058] 202: Among all the recorded categories, count the number of times each category occurs.

[0059] For example, for the example in step 201, the counts are: Politics: 2; Culture: 4; Economy: 3; no category: 2.
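
The count in step 202 is a plain frequency tally. A minimal sketch using the standard-library `collections.Counter`, with the 11 categories of the example listed in result order and `None` standing for "no category" (that encoding is an assumption of the sketch):

```python
from collections import Counter

# Categories of results 1..11 from the example in step 201.
recorded = ["Politics", "Culture", None, "Culture", "Economy", "Economy",
            "Politics", "Culture", None, "Culture", "Economy"]

counts = Counter(recorded)
for category, n in counts.items():
    print(category, n)
```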

[0060] 203: Select a category for the short text according to the number of times each category occurs.

[0061] Specifically, the category with the largest number of occurrences may be taken as the category of the short text. For example, for the example in step 201, Culture is selected as the category of the short text "What is the capital of China?".

[0062] Alternatively, the number of occurrences of each category may be checked in turn against a preset count threshold, and the categories whose number of occurrences is greater than the preset count threshold are taken as the categories of the short text. The preset count threshold can be set flexibly according to the actual application. For example, for the example in step 201, when the preset count threshold is set to 4, Culture is selected as the category of the short text "What is the capital of China?"; when the preset count threshold is set to 2, Culture and Economy are selected as the categories of the short text. That is, more than one category may satisfy the requirement when selecting in this way.
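
Both selection rules are short to express in code. The sketch below is an illustration with hypothetical function names; it leaves unlabeled results out of the candidates and reads "greater than" strictly. Under that strict reading a threshold of 4 selects nothing here, since Culture occurs exactly 4 times, so the sketch uses 3 to single out Culture.

```python
def most_frequent(counts: dict[str, int]) -> str:
    """Rule of [0061]: the category with the largest count."""
    return max(counts, key=counts.get)


def above_threshold(counts: dict[str, int], threshold: int) -> list[str]:
    """Rule of [0062]: every category whose count is strictly greater."""
    return [c for c, n in counts.items() if n > threshold]


# Counts from the example; the two unlabeled results are omitted (assumption).
counts = {"Politics": 2, "Culture": 4, "Economy": 3}

top = most_frequent(counts)
over_two = above_threshold(counts, 2)
over_three = above_threshold(counts, 3)
```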

[0063] Note that selection is not limited to these two ways; any other feasible method may be chosen according to the actual application, and no specific limitation is imposed here.

[0064] Specifically, selecting a category for the short text according to the number of times each category occurs may comprise: calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories; and selecting a category for the short text according to that ratio.

[0065] For example, for the example in step 201, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories is: Politics: 2/11; Culture: 4/11; Economy: 3/11; no category: 2/11.
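
These ratios follow directly from the counts. A minimal sketch using the standard-library `fractions.Fraction` so that the values stay exact; as in the text, unlabeled results (encoded here as `None`, an assumption of the sketch) are included in the total of 11.

```python
from fractions import Fraction

# Counts from the example; None stands for "no category".
counts = {"Politics": 2, "Culture": 4, "Economy": 3, None: 2}

total = sum(counts.values())
ratios = {category: Fraction(n, total) for category, n in counts.items()}

for category, ratio in ratios.items():
    print(category, ratio)
```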

[0066] Selecting a category for the short text according to the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories is implemented similarly to selecting according to the number of occurrences. For example, the category with the largest ratio may be taken as the category of the short text; or the ratio of each category may be checked in turn against a preset ratio threshold, and the categories whose ratio is greater than the preset ratio threshold are taken as the categories of the short text. The preset ratio threshold can be set flexibly according to the actual application.
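
A sketch of ratio-based selection under the same assumptions as before: `select_by_ratio` is a hypothetical name, unlabeled results count toward the total but are not candidates, the comparison is strict, and the threshold of 1/4 is chosen only for illustration.

```python
from fractions import Fraction
from typing import Optional


def select_by_ratio(counts: dict, ratio_threshold: Optional[Fraction] = None) -> list:
    """Return the largest-ratio category, or all categories above a threshold."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Unlabeled results (key None) stay in the total but are not candidates.
    ratios = {c: Fraction(n, total) for c, n in counts.items() if c is not None}
    if not ratios:
        return []
    if ratio_threshold is None:
        return [max(ratios, key=ratios.get)]
    return [c for c, r in ratios.items() if r > ratio_threshold]


counts = {"Politics": 2, "Culture": 4, "Economy": 3, None: 2}
best = select_by_ratio(counts)
over_quarter = select_by_ratio(counts, Fraction(1, 4))
```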

[0067] Alternatively, selecting a category for the short text according to the number of times each category occurs may specifically comprise: calculating, according to the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories; and selecting a category for the short text according to both the number of times each category occurs and that ratio. That is, the ratio and the count may be combined to make the decision, and the implementation is similar to using either one alone. For example, the category for which both the number of occurrences and the ratio are largest may be taken as the category of the short text; or the number of occurrences and the ratio of each category may be checked in turn against a preset count threshold and a preset ratio threshold respectively, and the categories whose number of occurrences is greater than the preset count threshold and whose ratio is also greater than the preset ratio threshold are taken as the categories of the short text.
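
The combined rule can be sketched as a filter with two conditions. The function name and the threshold values below are hypothetical, and the comparisons are read strictly.

```python
from fractions import Fraction


def select_combined(counts: dict, count_threshold: int,
                    ratio_threshold: Fraction) -> list:
    """Keep categories whose count AND ratio both exceed their thresholds."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [c for c, n in counts.items()
            if c is not None              # unlabeled results are not candidates
            and n > count_threshold
            and Fraction(n, total) > ratio_threshold]


counts = {"Politics": 2, "Culture": 4, "Economy": 3, None: 2}
# Economy passes the count test (3 > 2) but fails the ratio test (3/11 < 3/10).
selected = select_combined(counts, 2, Fraction(3, 10))
```

Requiring both conditions guards against a category that clears the count threshold only because the result list is long, and against one that clears the ratio threshold only because very few results were returned.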

[0068] The method for classifying information described in this embodiment classifies short texts by the number of times each category occurs in the search results, which improves the accuracy of short text classification. Classifying in this way requires no training set and no prior training, which reduces the workload, and it can handle categories that cannot be trained from a training set (such as brain teasers and jokes). Search results can be updated in real time, which makes the classification of short texts more timely. Because short texts can be classified simply by counting the number of times each category occurs, the method is simple to implement.

[0069] Embodiment 3

[0070] Referring to FIG. 3, an embodiment of the present invention provides an apparatus for classifying information, comprising:

[0071] a category recording module 301, configured to obtain the search results for the information in a search engine and record the category of each search result.

[0072] a counting module 302, configured to count, after the category recording module 301 has recorded the category of each search result, the number of times each category occurs among all the recorded categories.

[0073] a category selection module 303, configured to select a category for the information according to the number of times each category occurs, after the counting module 302 has counted the number of times each category occurs.

[0074] Further, referring to FIG. 4, the apparatus further comprises:

[0075] a category labeling module 304, configured to label the search results with categories before the category recording module 301 obtains the search results for the information in the search engine.

[0076] Further, in this embodiment of the present invention, the category selection module 303 specifically comprises:

[0077] a first ratio calculation unit, configured to calculate, after the counting module 302 has counted the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.

[0078] a first category selection unit, configured to select a category for the information according to the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories, after the first ratio calculation unit has calculated that ratio.

[0079] Further, the category selection module 303 specifically comprises:

[0080] a second ratio calculation unit, configured to calculate, after the counting module 302 has counted the number of times each category occurs, the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories.

[0081] a second category selection unit, configured to select a category for the information according to both the number of times each category occurs and the ratio of the number of times each category occurs to the total number of occurrences of all recorded categories, after the second ratio calculation unit has calculated that ratio.
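
One way to picture how modules 301 to 303 fit together is as three small classes chained in order. This is only a sketch: the class and method names are invented for illustration, the `search` callable is a hypothetical stand-in that returns one category (or `None`) per search result, and module 303 is shown with the most-frequent rule rather than the ratio units.

```python
from collections import Counter
from typing import Callable, Optional


class CategoryRecordingModule:  # corresponds to module 301
    def __init__(self, search: Callable[[str], list[Optional[str]]]):
        self._search = search  # hypothetical search-engine stand-in

    def record(self, info: str) -> list[Optional[str]]:
        return list(self._search(info))


class CountingModule:  # corresponds to module 302
    def count(self, categories: list[Optional[str]]) -> Counter:
        return Counter(categories)


class CategorySelectionModule:  # corresponds to module 303
    def select(self, counts: Counter) -> Optional[str]:
        labeled = {c: n for c, n in counts.items() if c is not None}
        return max(labeled, key=labeled.get) if labeled else None


class InformationClassifier:
    """Chains the three modules: record, then count, then select."""

    def __init__(self, search: Callable[[str], list[Optional[str]]]):
        self.recorder = CategoryRecordingModule(search)
        self.counter = CountingModule()
        self.selector = CategorySelectionModule()

    def classify(self, info: str) -> Optional[str]:
        recorded = self.recorder.record(info)
        return self.selector.select(self.counter.count(recorded))
```

A ratio-based or combined selection unit could replace `CategorySelectionModule` without touching the other two classes.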

[0082] The apparatus for classifying information described in this embodiment classifies information by the number of times each category occurs in the search results, which improves the accuracy of classification. Classifying in this way requires no training set and no prior training, which reduces the workload, and it can handle categories that cannot be trained from a training set (such as brain teasers and jokes). Search results can be updated in real time, which makes the classification more timely. Because the information can be classified simply by counting the number of times each category occurs, the apparatus is simple to implement.

[0083] All or part of the technical solutions provided by the above embodiments may be implemented through software programming, with the software program stored in a readable storage medium such as a hard disk, optical disc, or floppy disk in a computer.

[0084] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

  1. 1. 一种信息分类的方法,其特征在于,所述方法包括:获取信息在搜索引擎中的搜索结果,并记录每条搜索结果的类别; 在记录的所有类别中,统计每种类别出现的次数; 根据所述每种类别出现的次数,为所述信息选择类别。 1. A method of classification of information, wherein, said method comprising: acquiring information search result in the search engine, and records each search result category; recorded in all categories, each category appear statistical times; the number of times of the occurrence of each category, the information for the selected category.
  2. 2.根据权利要求1所述的信息分类的方法,其特征在于,所述获取信息在搜索引擎中的搜索结果之前还包括:对所述搜索结果进行类别的标注。 The method of classification of information according to claim 1, characterized in that, prior to obtaining the search result information in the search engine further comprises: tagging the search result categories.
  3. 3.根据权利要求1所述的信息分类的方法,其特征在于,所述根据所述每种类别出现的次数,为所述信息选择类别具体包括:根据所述每种类别出现的次数,计算出所述每种类别出现的次数与记录的所有类别出现的总次数之间的比例;根据所述每种类别出现的次数与记录的所有类别出现的总次数之间的比例,为所述信息选择类别。 3. The method of classification of information according to claim 1, wherein said each category based on the number of the occurrence of the information to select a category comprises: according to the number of times each category appear calculated the ratio between the total number of all occurrences of each category categories appear with the recorded number of times; according to the ratio between the total number of all categories for each category the number of times of the recording occurs occurring, said information select a category.
  4. 4.根据权利要求1所述的信息分类的方法,其特征在于,所述根据所述每种类别出现的次数,为所述信息选择类别具体包括:根据所述每种类别出现的次数,计算出所述每种类别出现的次数与记录的所有类别出现的总次数之间的比例;根据所述每种类别出现的次数,以及所述每种类别出现的次数与记录的所有类别出现的总次数之间的比例,为所述信息选择类别。 4. The method of classification of information according to claim 1, wherein said each category based on the number of the occurrence of the information to select a category comprises: according to the number of times each category appear calculated the ratio between the total number of all occurrences of each category categories appear with the recorded number of times; the total number of times according to the occurrence of each category appear, all categories and each category the number of times of occurrence of the recording the ratio between the frequency of the selected information category.
  5. A device for information classification, characterized in that the device comprises: a category recording module, configured to obtain the search results for a piece of information in a search engine and to record the category of each search result; an occurrence counting module, configured to count, among all the recorded categories, the number of times each category occurs, after the category recording module has recorded the category of each search result; and a category selection module, configured to select a category for the information according to the number of times each category occurs, after the occurrence counting module has counted the occurrences of each category.
  6. The device for information classification according to claim 5, characterized in that the device further comprises: a category labeling module, configured to label the search results with categories before the category recording module obtains the search results for the information in the search engine.
  7. The device for information classification according to claim 5, characterized in that the category selection module specifically comprises: a first ratio calculation unit, configured to calculate, after the occurrence counting module has counted the occurrences of each category, the ratio between the number of times each category occurs and the total number of occurrences of all the recorded categories; and a first category selection unit, configured to select a category for the information according to that ratio, after the first ratio calculation unit has calculated it.
  8. The device for information classification according to claim 5, characterized in that the category selection module specifically comprises: a second ratio calculation unit, configured to calculate, after the occurrence counting module has counted the occurrences of each category, the ratio between the number of times each category occurs and the total number of occurrences of all the recorded categories; and a second category selection unit, configured to select a category for the information according to both the number of times each category occurs and that ratio, after the second ratio calculation unit has calculated it.
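
The counting and selection steps recited in the claims above can be sketched in Python as follows. This is only an illustrative sketch, not the applicant's implementation: the function names `count_categories` and `select_categories`, and the thresholds `min_count` and `min_ratio`, are assumptions introduced here. The claims state that the count, the ratio, or both are used to select a category, but they do not say how, so a simple threshold test is assumed.

```python
from collections import Counter
from typing import Iterable, List


def count_categories(result_categories: Iterable[str]) -> Counter:
    """Count how often each category occurs among the recorded search results."""
    return Counter(result_categories)


def select_categories(
    result_categories: Iterable[str],
    min_count: int = 1,      # hypothetical threshold on the occurrence count
    min_ratio: float = 0.0,  # hypothetical threshold on count / total occurrences
) -> List[str]:
    """Return the categories that pass both thresholds, most frequent first.

    Setting only min_count selects by count, setting only min_ratio selects
    by ratio, and setting both selects by count and ratio together.
    """
    counts = count_categories(result_categories)
    total = sum(counts.values())
    if total == 0:
        # No search results were recorded, so no category can be selected.
        return []

    selected = [
        (category, n)
        for category, n in counts.items()
        if n >= min_count and n / total >= min_ratio
    ]
    # Sort by descending count, then by name so the order is deterministic.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return [category for category, _ in selected]


if __name__ == "__main__":
    # Categories recorded for the search results of one piece of information.
    recorded = ["sports", "sports", "music", "sports", "news"]
    print(select_categories(recorded, min_count=2))
    print(select_categories(recorded, min_count=1, min_ratio=0.5))
```

In this sketch the three modules of the device claims map onto the list of recorded categories (category recording), `count_categories` (occurrence counting), and `select_categories` (category selection).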
CN 201010151119 2010-04-19 2010-04-19 Method and device for information classification CN102222072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010151119 CN102222072A (en) 2010-04-19 2010-04-19 Method and device for information classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010151119 CN102222072A (en) 2010-04-19 2010-04-19 Method and device for information classification

Publications (1)

Publication Number Publication Date
CN102222072A (en) 2011-10-19

Family

ID=44778627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010151119 CN102222072A (en) 2010-04-19 2010-04-19 Method and device for information classification

Country Status (1)

Country Link
CN (1) CN102222072A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609859A (en) * 2004-11-26 2005-04-27 孙斌 Search results clustering method
WO2006036781A2 (en) * 2004-09-22 2006-04-06 Perfect Market Technologies, Inc. Search engine using user intent
CN1781100A (en) * 2003-04-29 2006-05-31 国际商业机器公司 System and method for generating refinement categories for a set of search results
CN101179472A (en) * 2007-05-31 2008-05-14 腾讯科技(深圳)有限公司 Network resource searching method and searching system
CN101211368A (en) * 2007-12-25 2008-07-02 北京搜狗科技发展有限公司 Method for classifying search term, device and search engine system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104094262A (en) * 2012-02-10 2014-10-08 谷歌公司 Search result categorization
WO2014036975A1 (en) * 2012-09-10 2014-03-13 腾讯科技(深圳)有限公司 Method and device for presenting social network search results and storage medium
CN103646341A (en) * 2013-11-29 2014-03-19 北京奇虎科技有限公司 A method and an apparatus for recommending website-provided objects
CN103646341B (en) * 2013-11-29 2018-06-22 北京奇虎科技有限公司 Method and apparatus for recommending website-provided objects
CN105468627A (en) * 2014-09-04 2016-04-06 纬创资通股份有限公司 Method and system for shielding and filtering web page contents
CN104391977A (en) * 2014-12-05 2015-03-04 北京国双科技有限公司 Web page keyword occurrence frequency detection method and device
CN104391977B (en) * 2014-12-05 2018-04-03 北京国双科技有限公司 Web page keyword occurrence frequency detection method and device
CN104503833A (en) * 2014-12-22 2015-04-08 广州唯品会网络技术有限公司 Optimal task scheduling method and apparatus
CN104951551A (en) * 2015-06-26 2015-09-30 深圳市腾讯计算机系统有限公司 Data classifying method and system

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C12 Rejection of a patent application after its publication