CN103246686A - Method and device for text classification, and method and device for characteristic processing of text classification - Google Patents


Info

Publication number
CN103246686A
CN103246686A (application CN2012100332084A)
Authority
CN
China
Prior art keywords
text classification
characteristic
feature
learning
feature word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100332084A
Other languages
Chinese (zh)
Inventor
许文奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN2012100332084A priority Critical patent/CN103246686A/en
Publication of CN103246686A publication Critical patent/CN103246686A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and apparatus, and a feature processing method and apparatus for text classification. The feature processing method comprises: obtaining a feature set of learning materials used for text classification; calculating, for each feature word, the sum of its information gain values over all classification categories; and extracting a predetermined number of feature words from the feature set as learning features for text classification, such that the learning features are a subset of the feature words remaining in the feature set after stop words are removed, wherein the information gain sums of the extracted feature words are greater than those of the non-extracted feature words. Applying the method and apparatus to feature extraction for text classification effectively keeps noise features out of the machine learning process, thereby improving the accuracy of text classification, greatly reducing the scale of the feature library, and reducing memory usage.

Description

Text classification method and apparatus, and feature processing method and apparatus for text classification
Technical field
The present application relates to the field of data processing, and in particular to a text classification method and apparatus and a feature processing method and apparatus for text classification.
Background art
Machine learning algorithms depend on the extraction of effective features to achieve good learning results; extracting valid features while avoiding interference from noise features is an important way to improve machine learning performance.
At present, when obtaining learning features for machine learning, all words are usually taken as features. This makes the feature library huge, so memory occupation during machine learning is large; moreover, many noise features are mixed in, so text classification performs poorly.
In order to remove noise features, the words remaining after stop-word deletion may be taken as features, but this only removes noise features to a limited degree: the feature library remains large, memory occupation during machine learning remains high, and because noise features still flow in, the text classification result is not greatly improved.
For the problem in the related art that the feature library for text classification is large, causing high memory occupation during machine learning, no effective solution has yet been proposed.
Summary of the invention
The main purpose of the present application is to provide a text classification method and apparatus and a feature processing method and apparatus for text classification, so as to solve the problem that the feature library for text classification is large, causing high memory occupation during machine learning.
To achieve this goal, according to one aspect of the present application, a feature processing method for text classification is provided. The method comprises: obtaining a feature set of learning materials used for text classification, wherein the feature set comprises a plurality of feature words; calculating, for each feature word, the sum of its information gain values over all classification categories; and extracting a predetermined number of feature words from the feature set as learning features for text classification, such that the learning features are a subset of the feature words remaining in the feature set after stop words are removed, wherein the information gain sums of the extracted feature words are all greater than those of the non-extracted feature words.
To achieve this goal, according to another aspect of the present application, a text classification method is provided. The method comprises: performing feature extraction using any feature processing method for text classification provided by the present application, to obtain learning features for text classification; training on the learning features to obtain a classification model; and using the classification model to classify a text to be classified.
To achieve this goal, according to another aspect of the present application, a feature processing apparatus for text classification is provided, which is used to execute any feature processing method for text classification proposed in the present application.
To achieve this goal, according to another aspect of the present application, a feature processing apparatus for text classification is provided, comprising: an acquisition module for obtaining a feature set of learning materials used for text classification, wherein the feature set comprises a plurality of feature words; a computing module for calculating, for each feature word, the sum of its information gain values over all classification categories; and an extraction module for extracting a predetermined number of feature words from the feature set as learning features for text classification, such that the learning features are a subset of the feature words remaining after stop words are removed, wherein the information gain sums of the extracted feature words are greater than those of the non-extracted feature words.
To achieve this goal, according to a further aspect of the present application, a text classification apparatus is provided, which is used to execute any text classification method proposed in the present application.
To achieve this goal, according to a further aspect of the present application, a text classification apparatus is provided, comprising: any feature processing apparatus for text classification provided by the present application, used for feature extraction to obtain learning features for text classification; a training module for training on the learning features to obtain a classification model; and a classification module for classifying a text to be classified using the classification model.
With the feature processing method for text classification provided by the present application, a predetermined number of feature words are extracted as learning features according to the size of their information gain sums, so that the feature library is formed from a subset of the feature words remaining after stop words are removed from the full feature set; this shrinks the feature library and reduces memory occupation. Further, because the information gain sums of the noise words in the feature set are smaller than those of the non-noise words, taking only the feature words with larger information gain sums as the feature library removes some or all of the noise words among the non-stop words. The learning features for text classification therefore contain few or no noise words, which improves the training result, and a text classification method that performs feature extraction in this way achieves higher classification precision. This solves the prior-art problem that the feature library for text classification is large and causes high memory occupation during machine learning, thereby achieving the effect of reducing the feature library for text classification and reducing memory occupation during machine learning.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments recorded in the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
Fig. 1 is a block diagram of a text classification apparatus according to an embodiment of the present application;
Fig. 2 is a block diagram of a feature processing apparatus for text classification according to a first embodiment of the present application;
Fig. 3 is a block diagram of a feature processing apparatus for text classification according to a second embodiment of the present application;
Fig. 4 is a flowchart of a text classification method according to an embodiment of the present application;
Fig. 5 is a flowchart of a feature processing method for text classification according to a first embodiment of the present application; and
Fig. 6 is a flowchart of a feature processing method for text classification according to a second embodiment of the present application.
Embodiments
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the protection scope of the present application.
First, a text classification apparatus of an embodiment of the present application is described. As shown in Fig. 1, the text classification apparatus comprises: a feature processing device 20, a training module 40 and a classification module 60.
Before the machine learning task of text classification, a certain amount of learning material must be provided to the machine. The machine here refers to a modernized intelligent electronic device that can run according to a program and automatically process massive data at high speed, such as a common PC or server. Learning materials refer to texts whose categories have been manually labeled; these texts generally come from the environment of actual use. Taking the placement of library books into different categories as an example, books covering all categories must first be randomly sampled and manually labeled by category; this labeled portion of the books can then serve as the data for machine learning.
After the learning materials are obtained, the feature processing device 20 performs feature extraction on them to obtain learning features for text classification. Unlike the prior art, the feature processing device 20 of this embodiment does not directly take, as learning features, all feature words obtained by segmenting the learning materials and removing stop words; instead, it selects among the feature words obtained by segmentation, and takes the selected subset as the learning features for text classification. The selection criterion is the size of each feature word's information gain sum over all classification categories: the feature words with larger sums are taken as the learning features. The learning features for text classification are thus a subset of the feature words remaining after stop-word removal, and the information gain sums of the extracted feature words are all greater than those of the non-extracted feature words.
Stop words are function words used in computer retrieval, i.e. non-retrieval words, for example "的" and "了" in Chinese, or "a" and "of" in English. Because such words appear in nearly every text, they make no particular contribution to text classification and have no significant discriminative power.
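As an illustration (not part of the patent), stop-word filtering can be sketched as follows; the stop-word list here is a tiny assumed sample, and real systems use much larger curated lists:

```python
# Illustrative stop-word list mixing the Chinese and English examples above.
STOP_WORDS = {"的", "了", "a", "an", "of", "the"}

def remove_stop_words(words):
    """Drop stop words: they occur in nearly all texts, so they carry
    no significant class-discriminating signal."""
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words(["a", "sale", "of", "books"]))  # → ['sale', 'books']
```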
The training module 40 trains on the learning features extracted by the feature processing device 20 to obtain a classification model. The training module 40 completes the machine learning process; any pattern recognition method can be used for training, for example support vector machines or neural networks.
After the classification model is obtained through the training module 40, the classification module 60 uses the model to classify a text to be classified. For example, after the machine learns from the portion of books used as learning materials and obtains a book classification model, it can classify the remaining books.
In this technical solution, the feature processing device 20 extracts a predetermined number of feature words with larger information gain sums as learning features for text classification; the training module 40 trains on these learning features to obtain a classification model; and the classification module 60 uses this model to classify a text to be classified.
With this technical solution, during text classification the feature processing device 20 extracts a predetermined number of feature words to form the feature library. By choosing a suitably small predetermined number, the feature library is shrunk further than stop-word removal alone achieves, which reduces memory occupation when the training module 40 learns. Further, the feature processing device 20 can extract learning features containing few or no noise words, which improves the training precision of the training module 40 and in turn the classification precision of the classification module 60.
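The three-module flow of Fig. 1 can be sketched end to end on toy data. Everything here is illustrative: the documents, the selected feature set, and the counting-based scoring rule are all assumptions standing in for the real feature selection and for the SVM or neural-network training the patent mentions:

```python
from collections import Counter, defaultdict

def extract_features(docs, keep):
    """Feature processing (device 20): keep only the selected learning features."""
    return [(words & keep, label) for words, label in docs]

def train(docs):
    """Training (module 40): tally feature counts per class.
    A stand-in for a real learner such as an SVM."""
    model = defaultdict(Counter)
    for words, label in docs:
        model[label].update(words)
    return model

def classify(model, words):
    """Classification (module 60): pick the class whose features overlap most."""
    return max(model, key=lambda c: sum(model[c][w] for w in words))

docs = [({"sale", "cheap", "buy"}, "spam"),
        ({"meeting", "report", "deadline"}, "work"),
        ({"sale", "discount"}, "spam"),
        ({"report", "agenda"}, "work")]
selected = {"sale", "cheap", "discount", "report", "meeting", "deadline"}
model = train(extract_features(docs, selected))
print(classify(model, {"sale", "buy"}))  # → spam
```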
Second, a feature processing apparatus for text classification of an embodiment of the present application is described. As shown in Fig. 2, the feature processing apparatus for text classification comprises: an acquisition module 22, a computing module 24 and an extraction module 26.
The acquisition module 22 obtains the feature set of the learning materials used for text classification, wherein the feature set comprises a plurality of feature words. The acquisition module 22 may directly receive a feature set input by the user, or may receive learning materials input by the user and segment them to obtain the feature words.
The computing module 24 calculates, for each feature word, the sum of its information gain values over all text categories. The information gain value refers to the effective reduction of expected information or information entropy (usually measured in bits), and can be used to decide which variable to split on at which level. It represents the amount of information a feature word brings to a category: the larger the information gain, the better the feature word is for that category, i.e. the more the word belongs to that category, and the higher the classification accuracy when the word is used for classification. Specifically, the information gain sum of a feature word over all classification categories can be calculated with the following method.
Suppose the feature is t and the categories are C1 to Cn; then the information gain sum of feature t is:
IG(t) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})
where P(Ci) denotes the probability that category Ci occurs, P(t) denotes the probability that feature t occurs, P(Ci|t) denotes the probability of Ci given that feature t occurs, and P(Ci|t̄) denotes the probability of Ci given that feature t does not occur.
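As an illustrative sketch (not part of the patent), this formula can be computed directly over a labeled corpus. The corpus representation used here — each document as a set of feature words plus a class label — is an assumption for the example:

```python
import math
from collections import Counter

def information_gain(docs, term):
    """Information gain sum of `term` over all classes, per the formula above.
    docs: list of (set_of_feature_words, class_label) pairs."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    with_t = [label for words, label in docs if term in words]
    without_t = [label for words, label in docs if term not in words]

    def weighted_log_sum(labels):
        # sum of P(Ci|·) * log2 P(Ci|·); empty condition contributes 0
        total = len(labels)
        if total == 0:
            return 0.0
        return sum((k / total) * math.log2(k / total)
                   for k in Counter(labels).values())

    prior = sum((c / n) * math.log2(c / n) for c in class_counts.values())
    return (-prior
            + (len(with_t) / n) * weighted_log_sum(with_t)
            + (len(without_t) / n) * weighted_log_sum(without_t))

docs = [({"sale"}, "spam"), ({"sale"}, "spam"),
        ({"report"}, "work"), ({"report"}, "work")]
print(information_gain(docs, "sale"))     # → 1.0 (perfectly class-separating)
print(information_gain(docs, "missing"))  # → 0.0 (no class signal)
```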
Using the above calculation method, the computing module 24 first calculates the information gain value of each feature word in each classification category, then sums the information gain values over all categories for each feature word to obtain its information gain sum. The extraction module 26 extracts a predetermined number of feature words from the feature set as learning features for text classification, such that the learning features are a subset of the feature words remaining after stop words are removed; the information gain sums of the extracted feature words are all greater than those of the non-extracted feature words, and the remaining feature words with smaller information gain sums (including stop words and some or all noise words) are rejected.
In this technical solution, the acquisition module 22 obtains the feature set of the learning materials for text classification, the computing module 24 calculates the information gain sum of each feature word in the set, and the extraction module 26 extracts a predetermined number of feature words with larger information gain sums from the feature set as the learning features for text classification.
With this technical solution, during feature extraction for text classification, the computing module 24 calculates the information gain sum of each feature word, and the extraction module 26 extracts a predetermined number of feature words by the size of their information gain sums to form the feature library. The predetermined number is determined so that the feature library contains no stop words and few or no noise words, which shrinks the feature library and reduces memory occupation during subsequent text training. Further, because the information gain sums of noise words in text classification are small, with a suitable predetermined number the extraction module 26 can exclude noise words from the extracted learning features, which improves the training result after feature extraction and yields a better classification result.
Preferably, when extracting a subset of the feature words in the feature set, the extraction module 26 can use either of the following two ways to extract the feature words with larger information gain sums as learning features for text classification:
Way one: the extraction module 26 comprises a sorting submodule and a first extraction submodule. The sorting submodule sorts all feature words in the feature set by the size of their information gain sums; ascending or descending order and any sorting algorithm may be used. The first extraction submodule extracts, in descending order of information gain sum, a preset percentage of the feature words in the feature set as learning features; the preset percentage should be small enough that the extracted words are a subset of the feature words remaining after stop-word removal. For example, after the sorting submodule sorts all feature words in ascending order to obtain a feature word sequence, the first extraction submodule extracts, starting from the tail of the sequence, 30 percent of the total feature words as learning features; the extraction module 26 thereby extracts the 30 percent of feature words with the larger sums, and this 30 percent contains no stop words or noise words.
Way two: the extraction module 26 comprises a judging submodule and a second extraction submodule. The judging submodule judges whether each feature word's information gain sum is greater than a preset value; this value can be set according to the actual text classification task and serves as the threshold for feature selection, so that the feature words extracted through this threshold are a subset of the feature words remaining after stop-word removal. The second extraction submodule extracts, as learning features, the feature words in the feature set whose information gain sums are greater than the preset value. The extraction module 26 thus extracts all feature words in the feature set whose information gain sums exceed the preset value.
With way one, the size of the feature library can be determined precisely; with way two, the information gain sums of the feature words in the feature library can be bounded precisely.
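Both ways can be sketched over a mapping from feature word to information gain sum; the toy gain values below are assumptions for illustration only:

```python
def select_top_percent(ig_sum, percent):
    """Way one: sort by information gain sum and keep the top fraction."""
    ranked = sorted(ig_sum, key=ig_sum.get, reverse=True)
    return ranked[:int(len(ranked) * percent)]

def select_above_threshold(ig_sum, threshold):
    """Way two: keep every feature word whose gain sum exceeds the preset value."""
    return {w for w, g in ig_sum.items() if g > threshold}

gains = {"discount": 0.92, "report": 0.71, "meeting": 0.55,
         "的": 0.03, "了": 0.01}  # assumed toy gain sums
print(select_top_percent(gains, 0.4))      # → ['discount', 'report']
print(select_above_threshold(gains, 0.5))  # the three words above the threshold
```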
Third, another feature processing apparatus for text classification of an embodiment of the present application is described. As shown in Fig. 3, the feature processing apparatus for text classification comprises: an acquisition module 22, a computing module 24 and an extraction module 26, wherein the acquisition module 22 comprises: an obtaining submodule 222, a segmentation submodule 224, a screening submodule 226 and a statistics submodule 228.
The acquisition module 22 obtains the feature set of the learning materials used for text classification in the following way: the obtaining submodule 222 obtains the learning materials for text classification, for example by directly receiving learning materials input by the user; the segmentation submodule 224 performs word segmentation on the learning materials to obtain a plurality of feature words. Since words serve as features in text classification, the learning materials must be segmented into meaningful words; for example, segmenting "我爱北京天安门" ("I love Beijing Tiananmen") yields "我" (I), "爱" (love), "北京" (Beijing) and "天安门" (Tiananmen) as feature words. Word segmentation belongs to the field of natural language processing and has various algorithms; the segmentation submodule 224 may use any segmentation algorithm to segment the learning materials. The screening submodule 226 removes the stop words from the feature words produced by the segmentation submodule 224, and the statistics submodule 228 tallies the feature words remaining after stop-word removal to obtain the feature set of the learning materials for text classification.
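The acquisition flow of Fig. 3 (obtain → segment → screen → tally) can be sketched as follows. The input is assumed to be pre-segmented, since real Chinese text would first pass through a word segmenter (for example, the jieba library); the stop-word list is likewise an assumption:

```python
from collections import Counter

def build_feature_set(segmented_docs, stop_words):
    """Screen out stop words, then tally the remaining feature words
    into the feature set of the learning materials."""
    feature_set = Counter()
    for words in segmented_docs:
        feature_set.update(w for w in words if w not in stop_words)
    return feature_set

docs = [["我", "爱", "北京", "天安门"], ["我", "爱", "上海"]]
print(build_feature_set(docs, stop_words={"我"}))
# Counter with 爱 → 2 and 北京, 天安门, 上海 → 1 each
```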
The computing module 24 and extraction module 26 in this embodiment are identical to the computing module and extraction module in the embodiment shown in Fig. 2, and are not described again here.
In this technical solution, when obtaining the feature set of the learning materials for text classification, the acquisition module 22 first uses the segmentation submodule 224 to segment the learning materials obtained by the obtaining submodule 222, then removes the stop words from the resulting feature words through the screening submodule 226, and finally tallies the remaining feature words through the statistics submodule 228 to obtain the feature set.
With this technical solution, the acquisition module 22 removes the stop words from the feature words before the computing module 24 calculates the information gain sums, so that this embodiment achieves the technical effect of the embodiment shown in Fig. 2 while further reducing the workload of the computing module 24.
Corresponding to the embodiments of the text classification apparatus of the present application, the present application also provides embodiments of a text classification method.
Fourth, a text classification method of an embodiment of the present application is described. As shown in Fig. 4, the text classification method comprises the following steps:
Step S12: obtain learning features for text classification through feature extraction; this step can be performed by the feature processing device 20 in the text classification apparatus shown in Fig. 1. This step does not directly take, as learning features, all feature words obtained by segmenting the learning materials and removing stop words; instead, it selects among the feature words obtained by segmentation and takes the selected subset as the learning features. The selection criterion is the size of each feature word's information gain sum over all classification categories: the feature words with larger sums are taken as the learning features. The learning features are thus a subset of the feature words remaining after stop-word removal, and the information gain sums of the extracted feature words are all greater than those of the non-extracted feature words.
Step S14: train on the learning features to obtain a classification model; this step can be performed by the training module 40 in the text classification apparatus shown in Fig. 1. This step completes the machine learning process; any pattern recognition method can be used for training, for example support vector machines or neural networks.
Step S16: use the classification model to classify a text to be classified; this step can be performed by the classification module 60 in the text classification apparatus shown in Fig. 1.
In this technical solution, a predetermined number of feature words with larger information gain sums are first extracted as learning features for text classification; these learning features are then trained to obtain a classification model; finally, the model is used to classify a text to be classified.
With this technical solution, during text classification a predetermined number of feature words are extracted in step S12 to form the feature library. By choosing a suitably small predetermined number, the feature library is shrunk further than stop-word removal alone achieves, which reduces memory occupation during the learning of step S14. Further, step S12 can extract learning features containing few or no noise words, which improves the training precision of step S14 and in turn the classification precision of step S16.
Corresponding to the embodiments of the feature processing apparatus for text classification of the present application, the present application also provides embodiments of a feature processing method for text classification.
Fifth, a feature processing method for text classification of an embodiment of the present application is described. As shown in Fig. 5, the feature processing method comprises the following steps:
Step S102: obtain the characteristic set for the learning materials of text classification, wherein, characteristic set comprises a plurality of feature words, and this step can adopt the acquisition module 22 in the document sorting apparatus shown in Figure 2 to carry out.Can directly receive the characteristic set of user's input, also can receive the learning materials of user's input, learning materials be carried out participle obtain the feature word.
Step S104: calculate the information gain value sum of each feature word in all class categories, this step can adopt the computing module 24 in the document sorting apparatus shown in Figure 2 to carry out.
Wherein, the information gain value refers to effective reduction (weighing with " byte " usually) of expectation information or information entropy, can determine which type of level to select which type of variable to classify at according to it.The information gain value is used for representing the quantity of information that a feature word brings such other, and more big this feature word of more representing of information gain value is more good for this classification, and computing method are described hereinbefore particularly, are not giving unnecessary details herein.This step at first calculates the information gain value of each feature word in each class categories, then to the information gain value summation in corresponding all class categories of each feature word, obtains the information gain value sum of each feature word correspondence.
Step S106: extract a predetermined number of feature words from the feature set as the learning features used for text classification, so that the learning features used for text classification are a subset of the feature words remaining after stop words are removed from the feature set. The sum of information gain values of each extracted feature word is greater than that of any unextracted feature word, so the feature words with smaller sums (comprising the stop words and some or all of the noise words) are rejected. This step may be performed by the extraction module 26 in the text classification device shown in Figure 2.
In the present technical solution, the feature set of the learning materials used for text classification is first obtained; the sum of information gain values corresponding to each feature word in the feature set is then calculated; and finally a predetermined number of feature words with larger information gain sums are extracted from the feature set as the learning features for text classification.
With the present technical solution, during feature extraction for text classification, step S104 calculates the sum of information gain values for each feature word, and step S106 extracts a predetermined number of feature words by the size of that sum to form the feature library. The predetermined number is determined with the aim of excluding the stop words and all (or at least part) of the noise words, so the feature library is reduced, lowering the memory occupied during text training after feature extraction. Further, because the noise words in text classification have small information gain sums, a suitable predetermined number in step S106 keeps the noise words out of the extracted learning features, which improves the training effect after feature extraction and in turn yields a better classification result.
Preferably, when step S106 extracts a subset of the feature words from the feature set, either of the following two modes may be used to extract the feature words with larger information gain sums as the learning features for text classification:
Mode one: sort the feature words in the feature set by the size of their information gain sums; either ascending or descending order and any sorting algorithm may be used. Then, in descending order of information gain sum, extract a preset percentage of the feature words in the feature set as the learning features for text classification. The preset percentage should be small enough that the extracted feature words are a subset of the feature words remaining after stop words are removed from the feature set.
Mode two: judge whether the sum of information gain values of each feature word is greater than a preset value. The preset value, which may be set according to the actual text classification task, serves as a threshold for feature selection; the threshold ensures that the extracted feature words are a subset of the feature words remaining after stop words are removed from the feature set. Then extract the feature words in the feature set whose information gain sums are greater than the preset value as the learning features for text classification.
With mode one, the size of the feature library can be determined exactly. With mode two, the information gain sums of the feature words admitted to the feature library can be controlled exactly.
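The two extraction modes can be sketched as follows; the function names and the `ig_sums` dictionary input are illustrative assumptions, not part of the patent.

```python
def select_top_percent(ig_sums, percent):
    """Mode one: keep the highest-scoring `percent` of the feature words,
    in descending order of information gain sum."""
    ranked = sorted(ig_sums, key=ig_sums.get, reverse=True)
    keep = int(len(ranked) * percent)
    return ranked[:keep]

def select_above_threshold(ig_sums, preset_value):
    """Mode two: keep every feature word whose information gain sum
    exceeds the preset value."""
    return {w for w, s in ig_sums.items() if s > preset_value}

# toy scores: real words score high, stop/noise words score near zero
scores = {"price": 0.9, "ship": 0.5, "the": 0.02, "um": 0.01}
by_percent = select_top_percent(scores, 0.5)        # fixes the library size
by_threshold = select_above_threshold(scores, 0.1)  # fixes the score floor
```

Both calls keep "price" and "ship" and reject the two low-scoring words, illustrating the trade-off the text describes: mode one pins the library size, mode two pins the minimum score.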
Finally, another feature processing method for text classification according to an embodiment of the present application is described. As shown in Figure 6, the feature processing method for text classification comprises the following steps:
Step S202: obtain the learning materials used for text classification; the learning materials may be received directly from user input. This step may be performed by the obtaining submodule 222 in the text classification device shown in Figure 3.
Step S204: perform word segmentation on the learning materials for text classification to obtain a plurality of feature words. This step may be performed by the segmentation submodule 224 in the text classification device shown in Figure 3. Since words serve as the features in text classification, the learning materials need to be segmented into meaningful words. For example, segmenting "I love Beijing Tiananmen" yields the words "I", "love", "Beijing" and "Tiananmen" as feature words. Word segmentation belongs to the field of natural language processing and has various algorithms; step S204 may segment the learning materials with any word segmentation algorithm.
Step S206: remove the stop words from the plurality of feature words. This step may be performed by the screening submodule 226 in the text classification device shown in Figure 3. Stop words are the function words ignored in computer retrieval, i.e. non-retrieval words, such as common particles in Chinese or words such as "a" and "of" in English. Because such words occur in nearly every text, they contribute nothing to text classification and have no significant discriminative power.
Step S208: tally the feature words remaining after the stop words are removed from the plurality of feature words, obtaining the feature set of the learning materials used for text classification.
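A minimal sketch of steps S206–S208, assuming the documents have already been segmented into word lists (in practice a dedicated word segmentation tool would perform step S204) and using a toy English stop-word list for illustration:

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a"}   # toy stop-word list, for illustration only

def build_feature_set(segmented_docs):
    """Steps S206-S208: drop stop words from each segmented document,
    then tally the remaining feature words into the feature set."""
    counts = Counter()
    for words in segmented_docs:
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# step S204 is assumed already done: each document is a word list
segmented = [["I", "love", "Beijing", "Tiananmen"],
             ["I", "love", "the", "code"]]
feature_set = build_feature_set(segmented)
```

The resulting `feature_set` contains every non-stop feature word with its document tally, ready for the information gain computation of step S210.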
Step S210: calculate the sum of the information gain values of each feature word over all classification categories.
Step S212: extract a predetermined number of feature words from the feature set as the learning features used for text classification, so that the learning features used for text classification are a subset of the feature words remaining after stop words are removed from the feature set, wherein the sum of information gain values of each extracted feature word is greater than that of any unextracted feature word.
In the present technical solution, when the feature set of the learning materials for text classification is obtained, the learning materials obtained in step S202 are first segmented in step S204; the stop words are then removed from the segmented feature words in step S206; and finally the feature words remaining after stop-word removal are tallied in step S208 to obtain the feature set of the learning materials for text classification.
With the present technical solution, the stop words are removed from the feature words before step S210 calculates the information gain sums, so this embodiment achieves the technical effect of the embodiment shown in Figure 5 while further reducing the workload of step S210.
As can be seen from the above description of the embodiments, the feature set obtained from the text classification learning materials is processed as follows in the embodiments of the present application: first, the sum of the information gain values of each feature word in the feature set over all classification categories is calculated; then, with the information gain sum as the selection condition, a predetermined number of feature words with larger information gain sums are selected from the feature set as the learning features for text classification. Thus the feature library in the embodiments of the present application is not composed of the full feature set after stop-word removal, but of a predetermined number of feature words with larger information gain sums, i.e. a subset of the feature words remaining after stop words are removed from the full feature set. The feature library is thereby reduced, lowering memory usage. At the same time, because the noise words in text classification have small information gain sums, a suitably sized predetermined number keeps the selected learning features free of noise words (or limits them to a small number), improving the text training effect.
For convenience of description, the above devices are described in terms of separate functional modules. Of course, when implementing the present application, the functions of the modules may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and comprises instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in parts thereof.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and the relevant parts may refer to the description of the method embodiments.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments comprising any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network; in a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
Although the present application has been described through embodiments, those of ordinary skill in the art will appreciate that the present application has many variations and modifications that do not depart from its spirit, and it is intended that the appended claims cover these variations and modifications without departing from the spirit of the present application.

Claims (10)

1. A feature processing method for text classification, characterized by comprising:
obtaining a feature set of learning materials used for text classification, wherein the feature set comprises a plurality of feature words;
calculating a sum of information gain values of each feature word over all classification categories; and
extracting a predetermined number of feature words from the feature set as learning features used for text classification, so that the learning features used for text classification are a subset of the feature words remaining after stop words are removed from the feature set, wherein the sum of information gain values of each extracted feature word is greater than the sum of information gain values of any unextracted feature word.
2. The feature processing method for text classification according to claim 1, characterized in that
after the plurality of feature words are obtained, the method further comprises: removing stop words from the plurality of feature words,
wherein the step of calculating the sum of information gain values of each feature word over all classification categories comprises: calculating the sum of information gain values, over all classification categories, of each feature word remaining after the stop words are removed.
3. The feature processing method for text classification according to claim 1, characterized in that the step of obtaining the feature set of the learning materials for text classification comprises:
obtaining the learning materials used for text classification;
performing word segmentation on the learning materials used for text classification to obtain a plurality of feature words; and
tallying the plurality of feature words to obtain the feature set of the learning materials used for text classification.
4. The feature processing method for text classification according to any one of claims 1 to 3, characterized in that the step of extracting the predetermined number of feature words from the feature set as the learning features used for text classification comprises:
sorting the feature words in the feature set by the size of their sums of information gain values; and
extracting, in descending order of the sum of information gain values, a preset percentage of the feature words in the feature set as the learning features used for text classification.
5. The feature processing method for text classification according to any one of claims 1 to 3, characterized in that the step of extracting the predetermined number of feature words from the feature set as the learning features used for text classification comprises:
judging whether the sum of information gain values of each feature word is greater than a preset value; and
extracting the feature words in the feature set whose sums of information gain values are greater than the preset value as the learning features used for text classification.
6. A text classification method, characterized by comprising:
performing feature extraction using the feature processing method for text classification according to any one of claims 1 to 5 to obtain learning features used for text classification;
training on the learning features to obtain a classification model; and
performing text classification on a text to be classified using the classification model.
7. A feature processing device for text classification, characterized by comprising:
an acquisition module, configured to obtain a feature set of learning materials used for text classification, wherein the feature set comprises a plurality of feature words;
a computing module, configured to calculate a sum of information gain values of each feature word over all classification categories; and
an extraction module, configured to extract a predetermined number of feature words from the feature set as learning features used for text classification, so that the learning features used for text classification are a subset of the feature words remaining after stop words are removed from the feature set, wherein the sum of information gain values of each extracted feature word is greater than the sum of information gain values of any unextracted feature word.
8. The feature processing device for text classification according to claim 7, characterized in that
the acquisition module further comprises: a screening submodule, configured to remove stop words from the plurality of feature words after the plurality of feature words are obtained,
wherein the computing module is configured to calculate the sum of information gain values, over all classification categories, of each feature word remaining after the stop words are removed.
9. The feature processing device for text classification according to claim 7, characterized in that the acquisition module comprises:
an obtaining submodule, configured to obtain the learning materials used for text classification;
a segmentation submodule, configured to perform word segmentation on the learning materials used for text classification to obtain a plurality of feature words; and
a statistics submodule, configured to tally the plurality of feature words to obtain the feature set of the learning materials used for text classification.
10. A text classification device, characterized by comprising:
the feature processing device for text classification according to any one of claims 7 to 9, configured to perform feature extraction to obtain learning features used for text classification;
a training module, configured to train on the learning features to obtain a classification model; and
a classification module, configured to perform text classification on a text to be classified using the classification model.
CN2012100332084A 2012-02-14 2012-02-14 Method and device for text classification, and method and device for characteristic processing of text classification Pending CN103246686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100332084A CN103246686A (en) 2012-02-14 2012-02-14 Method and device for text classification, and method and device for characteristic processing of text classification


Publications (1)

Publication Number Publication Date
CN103246686A true CN103246686A (en) 2013-08-14

Family

ID=48926210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100332084A Pending CN103246686A (en) 2012-02-14 2012-02-14 Method and device for text classification, and method and device for characteristic processing of text classification

Country Status (1)

Country Link
CN (1) CN103246686A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105678587A (en) * 2016-01-12 2016-06-15 腾讯科技(深圳)有限公司 Recommendation feature determination method and information recommendation method and device
WO2017028416A1 (en) * 2015-08-19 2017-02-23 小米科技有限责任公司 Classifier training method, type recognition method, and apparatus
CN106663037A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Feature processing tradeoff management
CN108733733A (en) * 2017-04-21 2018-11-02 为朔生物医学有限公司 Categorization algorithms for biomedical literatures, system based on machine learning and storage medium
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN113254644A (en) * 2021-06-07 2021-08-13 成都数之联科技有限公司 Model training method, non-complaint work order processing method, system, device and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PEI YINGBO: "Research and Implementation of Feature Selection Methods in Chinese Text Classification", China Master's Theses Full-text Database (Electronic Journal) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663037A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Feature processing tradeoff management
CN106663038A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Feature processing recipes for machine learning
WO2017028416A1 (en) * 2015-08-19 2017-02-23 小米科技有限责任公司 Classifier training method, type recognition method, and apparatus
RU2643500C2 (en) * 2015-08-19 2018-02-01 Сяоми Инк. Method and device for training classifier and recognizing type
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105678587A (en) * 2016-01-12 2016-06-15 腾讯科技(深圳)有限公司 Recommendation feature determination method and information recommendation method and device
CN108733733A (en) * 2017-04-21 2018-11-02 为朔生物医学有限公司 Categorization algorithms for biomedical literatures, system based on machine learning and storage medium
CN108733733B (en) * 2017-04-21 2022-03-08 为朔生物医学有限公司 Biomedical text classification method, system and storage medium based on machine learning
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN110019639B (en) * 2017-07-18 2023-04-18 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN113254644A (en) * 2021-06-07 2021-08-13 成都数之联科技有限公司 Model training method, non-complaint work order processing method, system, device and medium
CN113254644B (en) * 2021-06-07 2021-09-17 成都数之联科技有限公司 Model training method, non-complaint work order processing method, system, device and medium

Similar Documents

Publication Publication Date Title
CN103246686A (en) Method and device for text classification, and method and device for characteristic processing of text classification
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN106776574B (en) User comment text mining method and device
CN108710651A (en) A kind of large scale customer complaint data automatic classification method
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN106528528A (en) A text emotion analysis method and device
CN106156372B (en) A kind of classification method and device of internet site
CN101587493A (en) Text classification method
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN102332028A (en) Webpage-oriented unhealthy Web content identifying method
CN104915327A (en) Text information processing method and device
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN111159404B (en) Text classification method and device
CN103106275A (en) Text classification character screening method based on character distribution information
CN108629047B (en) Song list generation method and terminal equipment
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN108897769A (en) Network implementations text classification data set extension method is fought based on production
CN102411592B (en) Text classification method and device
CN105046270A (en) Application classification model constructing method and system and application classification method and system
CN108153899B (en) Intelligent text classification method
CN110263155A (en) The training method and system of data classification method, data classification model
CN110287311A (en) File classification method and device, storage medium, computer equipment
CN107818173B (en) Vector space model-based Chinese false comment filtering method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1183723

Country of ref document: HK

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130814

WD01 Invention patent application deemed withdrawn after publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: WD

Ref document number: 1183723

Country of ref document: HK