CN103246655A

CN103246655A - Text categorizing method, device and system

Info

Publication number: CN103246655A
Application number: CN2012100243714A
Authority: CN
Inventors: 何晓宁; 勇凤伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-02-03
Filing date: 2012-02-03
Publication date: 2013-08-14

Abstract

The invention is applicable to the technical field of internet text categorizing and provides a text categorizing method, device and system. The method includes: extracting characteristics of texts to be categorized; and categorizing the texts to be categorized according to the characteristics of the texts to be categorized to obtain normal texts and junk texts, wherein the characteristics include word properties in the texts to be categorized. By using the property of each word in the texts as the characteristics to conduct characteristic extraction and categorizing, characteristic space is greatly reduced, a relatively complex and precise categorizing model can be selected from a categorizer to categorize the texts to be categorized, and the categorizing accuracy is greatly improved.

Description

A text classification method, device, and system

Technical field

The invention belongs to the technical field of internet text classification, and in particular relates to a text classification method, device, and system.

Background

The openness and interactivity of the internet have brought with them the problem of junk text: some malicious users publish large amounts of political, advertising, and pornographic content over the internet, seriously harming public network security. The text that users upload therefore needs to be classified so that junk text can be filtered out.

Existing text classification methods extract features based on words. Because any language has an enormous vocabulary, word-based feature extraction suffers, on the one hand, from a huge feature space, which limits classifier performance. On the other hand, relative to that huge feature space, the amount of training text that can be obtained is much smaller. Both problems lead to unsatisfactory classification results. Moreover, for encyclopedia-type text such as "searching encyclopaedia", the words that appear are highly variable and cover an extremely wide range, and each text to be classified concerns a brand-new encyclopedia entry, that is, content the classifier has never learned. If a word-based feature extraction method is used, the training text therefore has to be updated frequently, which degrades classification results.

Summary of the invention

The purpose of the embodiments of the invention is to provide a text classification method, intended to solve the problem that existing text classification methods, which extract features based on words, classify poorly.

An embodiment of the invention is implemented as a text classification method, the method comprising:

extracting features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;

classifying the text to be classified according to its features, to obtain normal text and junk text.

Another purpose of the embodiments of the invention is to provide a text classification device, the device comprising:

a feature extraction module, configured to extract features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;

a classification module, configured to classify the text to be classified according to its features, to obtain normal text and junk text.

Another purpose of the embodiments of the invention is to provide a text classification system, the system comprising the text classification device described above.

The embodiments of the invention use the part of speech of each word in the text as the feature for feature extraction and classification. This greatly reduces the feature space, so a relatively complex and accurate classification model can be selected in the classifier to classify the text to be classified, which greatly improves classification accuracy.

Description of drawings

Fig. 1 is a flowchart of the text classification method provided by the first embodiment of the invention;

Fig. 2 is a flowchart of the text classification method provided by the second embodiment of the invention;

Fig. 3 is a flowchart of the classification model update method provided by the third embodiment of the invention;

Fig. 4 is a structural diagram of the text classification device provided by the fourth embodiment of the invention.

Detailed description

To make the purpose, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only serve to explain the invention and are not intended to limit it.

Fig. 1 shows the implementation flow of the text classification method provided by the first embodiment of the invention, detailed as follows:

In step S101, features of the text to be classified are extracted, the features comprising the parts of speech involved in the text to be classified.

In this embodiment, the parts of speech of the words in the text to be classified are extracted as features; the part of speech of each word is obtained by performing word segmentation on the text to be classified. For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", word segmentation yields the six words <1949, China, people, republic, declare, establish>, whose parts of speech are respectively <time word, proper noun, noun, noun, verb, verb>. The text to be classified therefore carries the four features <time word, proper noun, noun, verb>.
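
As a minimal sketch of step S101, assume the text has already been segmented and tagged by some external tool; the patent does not name one. The tagged tokens below are hard-coded from the example above, and the function name `extract_pos_features` is hypothetical. The sketch collects the distinct parts of speech in order of first appearance:

```python
from typing import List, Tuple

# Hypothetical output of a word-segmentation and POS-tagging tool for the
# example sentence; the patent does not specify which tool is used.
TAGGED_TOKENS: List[Tuple[str, str]] = [
    ("1949", "time word"),
    ("China", "proper noun"),
    ("people", "noun"),
    ("republic", "noun"),
    ("declare", "verb"),
    ("establish", "verb"),
]


def extract_pos_features(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return the distinct parts of speech, in order of first appearance."""
    features: List[str] = []
    for _word, pos in tagged:
        if pos not in features:
            features.append(pos)
    return features


print(extract_pos_features(TAGGED_TOKENS))
```

Six words collapse to four features here, which illustrates why a part-of-speech feature space stays small regardless of vocabulary size.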

In step S102, the text to be classified is classified according to its features, to obtain normal text and junk text.

In this embodiment, once its features have been extracted, the text to be classified has been converted into features that the classification model can recognize. The features of the text are therefore input into a preset classification model for feature matching, which classifies the text and determines whether it is junk text or normal text.

In this embodiment, classified texts whose type is junk text are then filtered out, which safeguards the quality of text content on the internet.

Specifically, as a refinement of the first embodiment, Fig. 2 shows the implementation flow of the text classification method provided by the second embodiment of the invention, detailed as follows:

In step S201, the text to be classified is segmented into word strings.

For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", segmentation yields the six word strings <1949, China, people, republic, declare, establish>.

In step S202, the part of speech of each word string is extracted as a feature of the text to be classified.

For example, the six words <1949, China, people, republic, declare, establish> correspond respectively to the parts of speech <time word, proper noun, noun, noun, verb, verb>, and these parts of speech are the features of the text to be classified.

In step S203, the feature value of each feature in the text to be classified is calculated, the feature value being the ratio of the number of occurrences of each part of speech in the text to the total number of part-of-speech occurrences in the text.

For example, the text to be classified "The People's Republic of China declared its establishment in 1949" contains four types of part of speech. The features <time word> and <proper noun> each occur once, and the features <noun> and <verb> each occur twice, so the feature values corresponding to <time word, proper noun, noun, verb> are <1/6, 1/6, 1/3, 1/3>.
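
The feature-value calculation of step S203 can be sketched as follows. This is an illustration only: the function name `pos_feature_values` is hypothetical, and exact fractions are used so that the result can be compared directly with the worked example.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List


def pos_feature_values(pos_tags: List[str]) -> Dict[str, Fraction]:
    """Map each part of speech to its share of all tags in the text."""
    if not pos_tags:
        return {}
    counts = Counter(pos_tags)
    total = len(pos_tags)
    return {pos: Fraction(count, total) for pos, count in counts.items()}


# Parts of speech of the six words in the example sentence.
tags = ["time word", "proper noun", "noun", "noun", "verb", "verb"]
values = pos_feature_values(tags)
for pos, value in values.items():
    print(pos, value)
```

Because each value is a share of the same total, the feature values of one text always sum to 1.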

In step S204, the features of the text to be classified and their corresponding feature values are input into the preset classification model to classify the text, yielding a classified text.

In this embodiment, as for the preset classification model, a language has only a few dozen parts of speech, so the part-of-speech feature space is small and a relatively complex and accurate nonlinear classification model can be used. In this embodiment, the classification model can therefore be trained and built with a machine learning method such as the gradient boosting decision tree (GBDT). In a preliminary training phase, manually classified text samples are input, and the model learns from the features, the feature values, and the texts in the samples, producing a classification model that classifies text by feature matching. For the underlying principles of the model, refer to the relevant literature; they are not repeated here.

In this embodiment, after the preset classification model has classified the text to be classified, texts classified as junk text can be deleted. As one embodiment of the invention, texts classified as normal text can additionally go through manual screening to further improve classification accuracy.
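
The handling of classified texts described above might look like the following sketch. Everything named here is an assumption made for illustration: the label constants, the function `route_texts`, and in particular `placeholder_model`, which is an arbitrary threshold rule standing in for the trained model and is not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

NORMAL, JUNK = 0, 1  # hypothetical label constants

Features = Dict[str, float]


def route_texts(
    items: List[Tuple[str, Features]],
    classify: Callable[[Features], int],
) -> Tuple[List[str], List[str]]:
    """Return (texts kept for manual screening, texts deleted as junk)."""
    review_queue: List[str] = []
    deleted: List[str] = []
    for text, features in items:
        if classify(features) == JUNK:
            deleted.append(text)
        else:
            review_queue.append(text)
    return review_queue, deleted


def placeholder_model(features: Features) -> int:
    """Arbitrary stand-in rule, NOT the trained model from the patent."""
    return JUNK if features.get("noun", 0.0) > 0.6 else NORMAL


# Each item pairs a text with its part-of-speech feature values (made up).
items = [
    ("text A", {"noun": 0.9, "verb": 0.1}),
    ("text B", {"noun": 0.2, "verb": 0.8}),
]
review_queue, deleted = route_texts(items, placeholder_model)
print(review_queue, deleted)
```

Passing the classifier in as a callable keeps the routing logic independent of whichever model is eventually used.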

Fig. 3 shows the implementation flow of the classification model update method provided by the third embodiment of the invention, detailed as follows:

In step S301, the normal text obtained by classification is used to optimize the preset classification model.

In this embodiment, when updating the classification model, the type of each classified text, normal text or junk text, is obtained first. Specifically, the type can be obtained by checking the binary classification label of each classified text, for example with 0 and 1 denoting normal text and junk text respectively. In this embodiment, a number of classified texts whose type is normal text make up the training corpus, which is used for further training of the classification model.
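
A minimal sketch of assembling the training corpus from binary labels, assuming classified texts arrive as (text, label) pairs; the pair layout and the name `build_training_corpus` are hypothetical, while the 0/1 convention follows the example above.

```python
from typing import List, Tuple

NORMAL_LABEL = 0  # 0 marks normal text, as in the example above
JUNK_LABEL = 1    # 1 marks junk text


def build_training_corpus(classified: List[Tuple[str, int]]) -> List[str]:
    """Keep only the texts whose binary label marks them as normal."""
    corpus: List[str] = []
    for text, label in classified:
        if label not in (NORMAL_LABEL, JUNK_LABEL):
            raise ValueError(f"unexpected label {label!r} for text {text!r}")
        if label == NORMAL_LABEL:
            corpus.append(text)
    return corpus


classified_texts = [("entry one", 0), ("spam offer", 1), ("entry two", 0)]
corpus = build_training_corpus(classified_texts)
print(corpus)
```

Rejecting labels other than 0 and 1 is a design choice of this sketch, so that a corrupted label does not silently enter the corpus.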

After the text types are obtained, the features of the normal text are extracted and the feature value of each feature is calculated, in order to train the preset classification model. In this embodiment, the training method depends on the model used, and no specific training method is prescribed here. Taking GBDT as an example, the training algorithm traverses the training data to build a decision tree, then traverses all samples again and builds a second decision tree for the samples that the first tree misclassified. After N such iterations, the classification model consists of N decision trees.
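
The iterative construction of N trees can be illustrated with a deliberately simplified stand-in: gradient boosting with squared-error loss, where each "tree" is a one-split stump fitted to the residual errors of the trees before it. This is not the patent's exact procedure, which describes building later trees on misclassified samples. All names, the toy data, and the hyperparameters below are assumptions for illustration.

```python
from typing import List, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]  # (base score, learning rate, stumps)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    best: Stump = (0, 0.0, 0.0, 0.0)
    best_err = float("inf")
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left) if left else 0.0
            right_mean = sum(right) / len(right) if right else 0.0
            err = sum((r - left_mean) ** 2 for r in left)
            err += sum((r - right_mean) ** 2 for r in right)
            if err < best_err:
                best_err = err
                best = (j, threshold, left_mean, right_mean)
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def train_gbdt(X: List[List[float]], y: List[int],
               n_trees: int = 5, learning_rate: float = 0.5) -> Model:
    """Each round fits one stump to what the previous rounds got wrong."""
    base = sum(y) / len(y)
    scores = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [target - score for target, score in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [score + learning_rate * stump_predict(stump, row)
                  for score, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: List[float]) -> int:
    """Return 1 (junk) if the boosted score reaches 0.5, else 0 (normal)."""
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0


# Toy feature values for <time word, proper noun, noun, verb>; invented data.
X_train = [[0.1, 0.1, 0.4, 0.4], [0.0, 0.2, 0.4, 0.4],
           [0.0, 0.0, 0.9, 0.1], [0.0, 0.1, 0.8, 0.1]]
y_train = [0, 0, 1, 1]
model = train_gbdt(X_train, y_train)
print(len(model[2]), predict(model, [0.0, 0.1, 0.85, 0.05]))
```

With only a few dozen part-of-speech features, the exhaustive split search in `fit_stump` stays cheap; a production system would more likely rely on an established GBDT library.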

In step S302, the preset classification model is updated according to the optimization result.

In this embodiment, the N decision trees obtained are stored in a model file to update the preset classification model, so that the classification program can load them the next time it classifies text.
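
Storing the trees in a model file and loading them for the next classification run could be sketched as below. The JSON layout, the per-tree dictionary keys, and the function names are all assumptions; the patent does not describe a file format.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def save_model(trees: List[Dict[str, Any]], path: str) -> None:
    """Write the trees to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"trees": trees}, handle)
    # os.replace overwrites the old model file in a single step.
    os.replace(tmp_path, path)


def load_model(path: str) -> List[Dict[str, Any]]:
    """Read the trees back for the next classification run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["trees"]


# Hypothetical one-split trees; a real model file would hold all N trees.
example_trees = [
    {"feature": 2, "threshold": 0.4, "left": -0.5, "right": 0.5},
    {"feature": 2, "threshold": 0.4, "left": -0.25, "right": 0.25},
]

with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "model.json")
    save_model(example_trees, model_path)
    restored = load_model(model_path)

print(len(restored))
```

Writing to a temporary file first means a classification program reading the model concurrently sees either the old file or the new one, never a partially written one.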

Fig. 4 shows the structure of the text classification device provided by the fourth embodiment of the invention. For ease of explanation, only the parts relevant to this embodiment are shown.

As shown in Fig. 4, the text classification device can run in a text classification system such as a search engine or a database, as a software unit running in that system, and specifically comprises:

A feature extraction module 41, configured to extract features of the text to be classified, the features comprising the parts of speech involved in the text to be classified.

A classification module 42, configured to classify the text to be classified according to its features, to obtain normal text and junk text.

Specifically, the feature extraction module 41 comprises:

A segmentation submodule 411, configured to segment the text to be classified into word strings.

An extraction submodule 412, configured to extract the part of speech of each word string as a feature of the text to be classified.

A calculation submodule 413, configured to calculate the feature value of each feature of the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to the total number of part-of-speech occurrences in the text.

The classification module 42 comprises:

An input submodule 421, configured to input the features of the text to be classified and their corresponding feature values into the preset classification model, to classify the text to be classified.

Further, the text classification device also comprises:

An optimization module 43, configured to use the normal text obtained by classification to optimize the preset classification model.

An update module 44, configured to update the preset classification model according to the optimization result.

A filtering module 45, configured to filter out the junk text obtained by classification.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A text classification method, characterized in that the method comprises:

2. The method of claim 1, characterized in that the step of extracting features of the text to be classified specifically comprises:

segmenting the text to be classified into word strings;

extracting the part of speech of each word string as a feature of the text to be classified;

calculating the feature value of each feature in the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to the total number of part-of-speech occurrences in the text.

3. The method of claim 2, characterized in that the step of classifying the text to be classified according to its features specifically comprises:

inputting the features of the text to be classified and their corresponding feature values into a preset classification model, to classify the text to be classified.

4. The method of claim 1, characterized in that the method further comprises:

using the normal text obtained by classification to optimize the preset classification model;

updating the preset classification model according to the optimization result.

5. The method of claim 1, characterized in that the method further comprises:

filtering out the junk text obtained by classification.

6. A text classification device, characterized in that the device comprises:

7. The device of claim 6, characterized in that the feature extraction module comprises:

a segmentation submodule, configured to segment the text to be classified into word strings;

an extraction submodule, configured to extract the part of speech of each word string as a feature of the text to be classified;

a calculation submodule, configured to calculate the feature value of each feature of the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to the total number of part-of-speech occurrences in the text;

and the classification module comprises:

an input submodule, configured to input the features of the text to be classified and their corresponding feature values into a preset classification model, to classify the text to be classified.

8. The device of claim 6, characterized in that the device further comprises:

an optimization module, configured to use the normal text obtained by classification to optimize the preset classification model;

an update module, configured to update the preset classification model according to the optimization result.

9. The device of claim 6, characterized in that the device further comprises:

a filtering module, configured to filter out the junk text obtained by classification.

10. A text classification system, characterized in that the system comprises the text classification device of any one of claims 6 to 9.