Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the invention and are not intended to limit it.
The embodiments of the invention perform feature extraction and classification using the part of speech of each word in a text as the feature. This greatly reduces the feature space, so the classifier can adopt a relatively complex and accurate classification model to classify the text to be classified, which improves classification accuracy.
Fig. 1 shows the implementation flow of the text classification method provided by the first embodiment of the invention, detailed as follows:
In step S101, features of the text to be classified are extracted, the features comprising the parts of speech involved in the text to be classified.
In this embodiment, the parts of speech of the words in the text to be classified are extracted as features; the part of speech of each word is obtained by performing word segmentation on the text. For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", word segmentation yields the six words <1949, China, people, republic, declare, establish>, whose respective parts of speech are <time word, proper noun, noun, noun, verb, verb>. The text to be classified therefore carries the four features <time word, proper noun, noun, verb>.
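The reduction from six tagged words to four features can be sketched as follows. This is a minimal illustration only: the tagged word list stands in for the output of a word segmenter and part-of-speech tagger, which the embodiment does not name, and `extract_pos_features` is a hypothetical helper.

```python
from typing import List, Tuple

# Assumed output of a segmenter/tagger for the example sentence;
# the actual segmentation tool is not specified in this embodiment.
TAGGED_WORDS: List[Tuple[str, str]] = [
    ("1949", "time word"),
    ("China", "proper noun"),
    ("people", "noun"),
    ("republic", "noun"),
    ("declare", "verb"),
    ("establish", "verb"),
]


def extract_pos_features(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return the distinct parts of speech in order of first appearance."""
    features: List[str] = []
    for _word, pos in tagged:
        if pos not in features:
            features.append(pos)
    return features


features = extract_pos_features(TAGGED_WORDS)
```

Under these assumptions, `features` holds the four parts of speech named in the example, in the order they first appear.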
In step S102, the text to be classified is classified according to its features as either normal text or spam text.
In this embodiment, once its features have been extracted, the text to be classified has been converted into features that the classification model can recognize. Feeding these features into the preset classification model for feature matching therefore classifies the text, determining whether it is spam text or normal text.
In this embodiment, classified text whose classification type is spam text is then filtered out, which safeguards the quality of text content on the Internet.
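The classify-then-filter flow of steps S101 and S102 might look like the sketch below. The label values and the `toy_classify` function are assumptions made for illustration; `toy_classify` is a keyword placeholder standing in for the preset classification model, not the part-of-speech model described here.

```python
from typing import Callable, Iterable, List

# Label convention assumed for this sketch.
NORMAL, SPAM = 0, 1


def filter_spam(texts: Iterable[str], classify: Callable[[str], int]) -> List[str]:
    """Keep only the texts that the classifier labels as normal."""
    return [text for text in texts if classify(text) == NORMAL]


def toy_classify(text: str) -> int:
    """Placeholder classifier; a real system would call the trained model."""
    return SPAM if "free prize" in text.lower() else NORMAL


kept = filter_spam(["Hello world", "Claim your FREE PRIZE now"], toy_classify)
```

Passing the classifier in as a callable keeps the filtering step independent of whichever model is eventually used.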
Specifically, as a refinement of the first embodiment, Fig. 2 shows the implementation flow of the text classification method provided by the second embodiment of the invention, detailed as follows:
In step S201, the text to be classified is segmented into words.
For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", segmentation yields the six words <1949, China, people, republic, declare, establish>.
In step S202, the part of speech of each word is extracted as a feature of the text to be classified.
For example, the six words <1949, China, people, republic, declare, establish> correspond respectively to the parts of speech <time word, proper noun, noun, noun, verb, verb>; these parts of speech are the features of the text to be classified.
In step S203, the feature value of each feature in the text to be classified is calculated, the feature value being the ratio of the number of occurrences of each part of speech to the total number of part-of-speech occurrences in the text to be classified.
For example, the text to be classified "The People's Republic of China declared its establishment in 1949" contains four distinct parts of speech across its six words. The features <time word> and <proper noun> each occur once, and the features <noun> and <verb> each occur twice, so the feature values corresponding to <time word, proper noun, noun, verb> are <1/6, 1/6, 1/3, 1/3>.
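The feature value calculation of step S203 can be reproduced with a short sketch. The function name `pos_feature_values` is hypothetical; exact fractions are used so the result matches the worked example without rounding.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List


def pos_feature_values(pos_tags: List[str]) -> Dict[str, Fraction]:
    """Map each part of speech to (its occurrences) / (total tag occurrences)."""
    counts = Counter(pos_tags)
    total = len(pos_tags)
    # An empty tag list yields an empty dict, so no division by zero occurs.
    return {pos: Fraction(n, total) for pos, n in counts.items()}


# Parts of speech of the six words in the example sentence.
EXAMPLE_TAGS = ["time word", "proper noun", "noun", "noun", "verb", "verb"]
values = pos_feature_values(EXAMPLE_TAGS)
```

Because every word contributes exactly one tag, the feature values of a text always sum to 1.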
In step S204, the features of the text to be classified and their corresponding feature values are input into the preset classification model to classify the text, yielding classified text.
In this embodiment, because a language has only a few dozen parts of speech, the part-of-speech feature space is small, and the preset classification model can be a comparatively complex and accurate nonlinear model. The classification model can therefore be trained and built with a machine learning method such as gradient boosting decision trees (GBDT). In a preliminary training phase, manually classified text samples are input, and the method learns from the features and feature values of the samples together with their manually assigned classes, producing a classification model that classifies text by feature matching. The underlying principles of the model are described in the relevant literature and are not repeated here.
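Because the tag inventory is small and fixed, each text can be represented as a short fixed-length vector before being passed to a model. The sketch below assumes an abbreviated six-tag inventory (`TAGSET`) purely for illustration; a real tagger's inventory would contain a few dozen tags.

```python
from typing import Dict, List, Sequence

# Hypothetical, abbreviated tag inventory used only for this sketch.
TAGSET: Sequence[str] = (
    "time word", "proper noun", "noun", "verb", "adjective", "adverb",
)


def to_vector(feature_values: Dict[str, float],
              tagset: Sequence[str] = TAGSET) -> List[float]:
    """Lay feature values out in tagset order; absent tags get 0.0."""
    return [float(feature_values.get(tag, 0.0)) for tag in tagset]


vector = to_vector(
    {"time word": 1 / 6, "proper noun": 1 / 6, "noun": 1 / 3, "verb": 1 / 3}
)
```

The vector length equals the size of the tag inventory regardless of how long the text is, which is what keeps the feature space small.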
In this embodiment, after the text to be classified has been classified by the preset classification model, text classified as spam text can be deleted. In one embodiment of the invention, text classified as normal text can additionally undergo manual screening to further improve classification accuracy.
Fig. 3 shows the implementation flow of the classification model update method provided by the third embodiment of the invention, detailed as follows:
In step S301, the preset classification model is optimized using the normal text obtained by classification.
In this embodiment, updating the classification model begins by obtaining the type of each classified text, which is either normal text or spam text. Specifically, the type can be obtained by reading the binary classification label of each classified text, for example with 0 indicating normal text and 1 indicating spam text. In this embodiment, a number of classified texts whose type is normal text make up the training corpus used for further training of the classification model.
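Reading the binary labels and collecting the normal texts into a training corpus can be sketched as follows. The 0/1 convention follows the example in the text; the function name and the `(text, label)` pair layout are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

NORMAL, SPAM = 0, 1  # label values taken from the example in the text


def split_by_label(classified: Iterable[Tuple[str, int]]) -> Tuple[List[str], List[str]]:
    """Separate classified texts into (normal, spam) by their binary label."""
    normal: List[str] = []
    spam: List[str] = []
    for text, label in classified:
        if label == NORMAL:
            normal.append(text)
        elif label == SPAM:
            spam.append(text)
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return normal, spam


training_corpus, discarded = split_by_label(
    [("quarterly report attached", 0), ("win money now", 1), ("see you at noon", 0)]
)
```

Rejecting labels other than 0 and 1 makes a corrupted label visible instead of silently dropping the text.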
Once the text types have been obtained, the features of the normal text are extracted and the feature value of each feature is calculated, in order to train the preset classification model. The training method depends on the model adopted and is not limited here. Taking GBDT as an example, the training algorithm traverses the training data to build a decision tree, then traverses all samples again and builds a second decision tree targeting the samples that the first tree misclassified. After N such iterations, the classification model consists of N decision trees.
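The iterative idea, in which each new tree corrects the errors left by the trees before it, can be illustrated with a deliberately simplified sketch. It is not the embodiment's training procedure: it uses one-split regression stumps fitted to squared-error residuals, and the names, learning rate, and toy data are all assumptions.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]  # (feature index, threshold, left value, right value)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Pick the single split that minimizes squared error on the residuals."""
    best, best_err = (0, 0.0, 0.0, 0.0), float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv = sum(left) / len(left) if left else 0.0
            rv = sum(right) / len(right) if right else 0.0
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def train_boosted(X, y, n_trees: int = 5, lr: float = 0.5):
    """Fit n_trees stumps, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    trees: List[Stump] = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, lr, trees


def predict_label(model, row: List[float]) -> int:
    """Return 1 (spam) if the boosted score reaches 0.5, else 0 (normal)."""
    base, lr, trees = model
    score = base + lr * sum(stump_predict(s, row) for s in trees)
    return 1 if score >= 0.5 else 0


# Toy feature vectors (two part-of-speech ratios) with 0 = normal, 1 = spam.
X = [[0.1, 0.5], [0.2, 0.4], [0.6, 0.1], [0.7, 0.2]]
y = [0, 0, 1, 1]
model = train_boosted(X, y, n_trees=5)
```

After N rounds the model is simply the list of N stumps plus the base score, which mirrors the statement that the classification model consists of N decision trees.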
In step S302, the preset classification model is updated according to the optimization result.
In this embodiment, the N decision trees obtained are saved to the model file to update the preset classification model, so that the classification program can load them the next time text is classified.
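One way to persist the trees and reload them at the next classification run is sketched below. The JSON layout, the file name, and the list-based tree encoding are assumptions; the embodiment does not specify a model file format.

```python
import json
import os
import tempfile
from typing import List


def save_model(path: str, trees: List[list]) -> None:
    """Write the trees to a temporary file, then swap it in atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"trees": trees}, f)
    os.replace(tmp_path, path)  # readers never see a half-written model


def load_model(path: str) -> List[list]:
    """Read the trees back for use by the classification program."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["trees"]


# Hypothetical trees encoded as [feature index, threshold, left value, right value].
trees = [[0, 0.2, -0.5, 0.5], [0, 0.2, -0.25, 0.25]]
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model_path, trees)
reloaded = load_model(model_path)
```

Writing to a temporary file and renaming it means a classifier that loads the model mid-update sees either the old trees or the new ones, never a partial file.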
The embodiments of the invention perform feature extraction and classification using the part of speech of each word in a text as the feature. This greatly reduces the feature space, so the classifier can adopt a relatively complex and accurate classification model to classify the text to be classified, which improves classification accuracy.
Fig. 4 shows the structure of the text classification device provided by the fourth embodiment of the invention; for ease of description, only the parts relevant to this embodiment are shown.
As shown in Fig. 4, the text classification device can run in a text classification system such as a search engine or a database, and can be a software unit running on such a system. It specifically comprises:
Feature extraction module 41, configured to extract features of the text to be classified, the features comprising the parts of speech involved in the text to be classified.
Classification module 42, configured to classify the text to be classified according to its features as either normal text or spam text.
Specifically, feature extraction module 41 comprises:
Segmentation submodule 411, configured to segment the text to be classified into words.
Extraction submodule 412, configured to extract the part of speech of each word as a feature of the text to be classified.
Calculation submodule 413, configured to calculate the feature value of each feature of the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech to the total number of part-of-speech occurrences in the text to be classified.
Classification module 42 comprises:
Input submodule 421, configured to input the features of the text to be classified and their corresponding feature values into the preset classification model to classify the text to be classified.
Further, the text classification device also comprises:
Optimization module 43, configured to optimize the preset classification model using the normal text obtained by classification.
Update module 44, configured to update the preset classification model according to the optimization result.
Filtering module 45, configured to filter out the spam text obtained by classification.
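The division of labor among modules 41, 42, and 45 might be wired together as in the sketch below. The class names, the callable types, and the toy tagger and model are all hypothetical stand-ins; modules 43 and 44 are omitted for brevity.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[str], List[Tuple[str, str]]]  # text -> [(word, part of speech)]
Model = Callable[[Dict[str, float]], int]        # feature values -> 0 (normal) / 1 (spam)


class FeatureExtractionModule:  # module 41
    def __init__(self, tagger: Tagger) -> None:
        self.tagger = tagger

    def extract(self, text: str) -> Dict[str, float]:
        tags = [pos for _word, pos in self.tagger(text)]  # submodules 411 and 412
        counts, total = Counter(tags), len(tags)
        return {pos: n / total for pos, n in counts.items()}  # submodule 413


class ClassificationModule:  # module 42
    def __init__(self, model: Model) -> None:
        self.model = model

    def classify(self, features: Dict[str, float]) -> int:
        return self.model(features)  # input submodule 421


class FilteringModule:  # module 45
    def filter(self, labeled: List[Tuple[str, int]]) -> List[str]:
        return [text for text, label in labeled if label == 0]


class TextClassificationDevice:
    def __init__(self, tagger: Tagger, model: Model) -> None:
        self.extractor = FeatureExtractionModule(tagger)
        self.classifier = ClassificationModule(model)
        self.filtering = FilteringModule()

    def run(self, texts: List[str]) -> List[str]:
        labeled = [(t, self.classifier.classify(self.extractor.extract(t))) for t in texts]
        return self.filtering.filter(labeled)


def toy_tagger(text: str) -> List[Tuple[str, str]]:
    """Placeholder tagger: tags 'buy' as a verb and everything else as a noun."""
    return [(w, "verb" if w == "buy" else "noun") for w in text.split()]


def toy_model(features: Dict[str, float]) -> int:
    """Placeholder model: flags texts whose verb ratio exceeds one half."""
    return 1 if features.get("verb", 0.0) > 0.5 else 0


device = TextClassificationDevice(toy_tagger, toy_model)
kept_texts = device.run(["report meeting notes", "buy buy now"])
```

Injecting the tagger and model as callables lets an updated model be swapped in without touching the extraction or filtering code.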
The embodiments of the invention perform feature extraction and classification using the part of speech of each word in a text as the feature. This greatly reduces the feature space, so the classifier can adopt a relatively complex and accurate classification model to classify the text to be classified, which improves classification accuracy.
The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.