CN101937436B - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN101937436B
CN101937436B · CN200910088411A · CN101937436A
Authority
CN
China
Prior art keywords
classification
decision unit
weight
sentence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200910088411
Other languages
Chinese (zh)
Other versions
CN101937436A (en)
Inventor
张翼
陈儒
王震
高立琦
刘桂平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN 200910088411 priority Critical patent/CN101937436B/en
Publication of CN101937436A publication Critical patent/CN101937436A/en
Application granted granted Critical
Publication of CN101937436B publication Critical patent/CN101937436B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

An embodiment of the invention relates to a text classification method in the communications field. The method comprises the following steps: splitting a text to be classified into sentences, performing dependency syntax analysis on each sentence, and extracting all dependency pairs as extracted decision units; retrieving the categories of the extracted decision units from a knowledge base, where the knowledge base stores decision units serving as classification evidence together with their categories and weights; accumulating the weight sums of the extracted decision units by category; and taking the category with the maximum weight sum as the category of the text to be classified. An embodiment of the invention also provides a corresponding text classification device. The text classification method and device provided by the embodiments of the invention achieve high classification accuracy and low redundancy, and resolve category conflicts effectively by means of syntactic distance.

Description

Text classification method and device
Technical field
The present invention relates to the field of text mining, and in particular to a text classification method and device.
Background technology
Online forums are one of the typical forms of participation in modern network life. As the number of posts grows, a mechanism is increasingly needed to sort published posts into different categories, which both facilitates forum content management and makes it much easier for users to find topics of interest. Many forums currently offer a classification feature, but most rely on the user to select a category or supply labels when publishing a post. The problem with this approach is that many users are unwilling to select a category or provide labels, and some users deliberately attach many irrelevant labels in order to increase the read count of their posts.
Because of the above problems, the posts of online forums need to be classified automatically. Text classification (Text Categorization) is the process of assigning a document represented as text to one or more predefined categories according to some algorithm. Assigning each text exactly one category is called "hard classification"; assigning multiple categories is "soft classification". Unless otherwise stated, the discussion below refers to hard classification. Existing methods fall into two broad classes: rule-based classification and classification based on statistical learning. Because most forum posts are short — characterized by few features, non-standard wording, and heavy omission of background knowledge relevant to classification — such posts often lack statistical information, and methods based on statistical learning cannot be applied. Rule-based classification algorithms are therefore generally used.
A rule-based classification algorithm formalizes its rules as <w1, w2, ..., wr, C>, meaning that when the words w1, w2, ..., wr all appear in a text, the text is assigned to category C.
In the course of making the invention, the inventors found that existing rule-based classification algorithms have at least the following problems:
1) Large redundancy. Such algorithms only consider whether a combination of feature terms has good classification ability. As long as every element of the feature set has good classification ability, any combination of the set's elements also has good classification ability; as a consequence, the number of mined rules grows exponentially in some cases.
2) Rules may be satisfied "by accident". There is no mutual relationship between the terms of a rule, which means that when the rule is applied to actual classification, a text is considered to satisfy the rule as long as the feature terms co-occur in it. For example, suppose the rule <emotion, world → emotion> assigns a text to the "emotion" class whenever "emotion" and "world" appear in it together. Applied to the sentence "What a waste of emotion playing 'World of Warcraft'", the rule is satisfied "by accident": "emotion" and "world" have no semantic relation here, and the real category should be "games".
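The "accidental" satisfaction problem can be seen in a minimal sketch of the prior-art rule check. The tokenized sentence below is invented for illustration; the only behaviour taken from the text is that a rule fires on bare co-occurrence, with no grammatical relation required.

```python
def rule_matches(rule_words, text_words):
    """Prior-art style rule: fire when every rule word appears in the text;
    no grammatical relation between the words is required."""
    return set(rule_words) <= set(text_words)

# Hypothetical tokenization: the rule <emotion, world -> emotion> fires on a
# sentence about the game "World of Warcraft" merely because both words co-occur.
sentence = ["wasted", "emotion", "playing", "world", "of", "warcraft"]
print(rule_matches(["emotion", "world"], sentence))  # True: a spurious match
```

This is exactly the failure the dependency-based decision units are designed to avoid.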
Summary of the invention
Embodiments of the invention provide a text classification method and device that achieve high classification accuracy and low redundancy when classifying the posts of an online forum.
Embodiments of the invention are achieved through the following technical solutions:
A text classification method comprises: splitting a text to be classified into sentences; performing dependency syntax analysis on each sentence and extracting all dependency pairs as extracted decision units; retrieving the categories of the extracted decision units from a knowledge base, where the knowledge base stores decision units serving as classification evidence together with their categories and weights; accumulating the weight sums of the extracted decision units by category; and taking the category with the maximum weight sum as the category of the text to be classified.
A text classification device comprises: an acquiring unit for splitting a text to be classified into sentences, performing dependency syntax analysis on each sentence, and extracting all dependency pairs as extracted decision units; a retrieval unit for retrieving the categories of the extracted decision units from a knowledge base, where the knowledge base stores decision units serving as classification evidence together with their categories and weights; a computing unit for accumulating the weight sums of the extracted decision units by category; and a category determining unit for taking the category with the maximum weight sum as the category of the text to be classified.
As can be seen from the technical solutions above, the text classification method and device provided by the embodiments of the invention avoid the classification errors caused by words that co-occur "by accident", giving high classification accuracy and low redundancy.
Description of drawings
Fig. 1 is a flowchart of a text classification method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of a dependency parse tree in a conflict-resolution example of the invention;
Fig. 3 is a flowchart of building the knowledge base according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the dependency parse tree of a sentence according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the dependency parse tree of another sentence according to an embodiment of the invention;
Fig. 6 is a schematic structural diagram of a text classification device according to an embodiment of the invention;
Fig. 7 is a schematic structural diagram of the knowledge-base building unit according to an embodiment of the invention;
Fig. 8 is a schematic structural diagram of the conflict-resolution unit according to an embodiment of the invention.
Embodiment
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
An embodiment of the invention provides a text classification method which, as shown in Fig. 1, comprises the following steps:
Step 10: split the text to be classified into sentences, perform dependency syntax analysis on each sentence, and extract all dependency pairs as extracted decision units;
The decision-unit acquisition method of this embodiment uses a statistics-based "dependency syntax analysis" technique. Dependency syntax analysis performs grammatical analysis sentence by sentence: a sentence is viewed as a tree in which each parent-child relation represents a grammatical modification relation, called a "dependency relation"; the child node depends on the parent node, and each parent-child pair forms a dependency pair, referred to here as a candidate decision unit. Every tree has a unique root node, which has no parent and therefore modifies no other constituent; the word at this node is called the "core word". The core word is the core of the whole sentence: when the sentence is progressively contracted, the core word is what finally remains.
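The pair extraction in step 10 can be sketched as follows. The parse representation (index-to-word map plus head/dependent/relation arcs) and the example sentence are assumptions; any real dependency parser would supply them, and the pair ordering follows the examples given later in the text.

```python
def extract_decision_units(words, arcs):
    """words: index -> word; arcs: (head_idx, dep_idx, relation) triples,
    with head index 0 denoting the virtual root whose child is the core word.
    Returns one (w1, w2, relation) candidate decision unit per arc."""
    units = []
    for head, dep, rel in arcs:
        if head == 0:
            units.append(("EOS", words[dep], rel))        # root arc: <EOS, core word, HED>
        else:
            units.append((words[dep], words[head], rel))  # modifier first, as in the text's examples
    return units

# Hypothetical parse of "company employees will travel":
# root -> travel (HED), travel -> employees (SBV), employees -> company (ATT)
words = {1: "company", 2: "employees", 3: "will", 4: "travel"}
arcs = [(0, 4, "HED"), (4, 2, "SBV"), (2, 1, "ATT"), (4, 3, "ADV")]
print(extract_decision_units(words, arcs))
```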
Since dependency syntax analysis itself is prior art, the invention imposes no specific restriction on it here; the decision-unit acquisition method is described in detail in the embodiments below.
Optionally, this step also comprises filtering the extracted decision units: decision units that contain stop words or whose dependency relation is not among the predefined dependency relations are filtered out. The specific filtering method is introduced below in the embodiment of building the knowledge base.
Step 11: retrieve the categories of the extracted decision units from the knowledge base, where the knowledge base stores decision units serving as classification evidence together with their categories and weights;
Since the knowledge base stores decision units of a plurality of categories together with their weights, retrieving the extracted decision units in the knowledge base determines the category and weight of each extracted decision unit from the retrieval result.
The process of building the knowledge base is described in detail in the embodiments below.
Step 12: accumulate the weight sums of the extracted decision units by category;
Accumulating by category means that if several extracted decision units belong to the same category, the sum of their weights is computed as the weight of that category;
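Step 12 and the final decision of step 15 amount to a grouped sum followed by an arg-max. The retrieval results below are invented figures for illustration.

```python
from collections import defaultdict

def accumulate_by_category(retrieved):
    """retrieved: (category, weight) pairs, one per decision unit found in
    the knowledge base. Returns category -> summed weight (step 12)."""
    totals = defaultdict(float)
    for category, weight in retrieved:
        totals[category] += weight
    return dict(totals)

# Hypothetical retrieval results for one text
hits = [("tourism", 2.03), ("job market", 3.34), ("tourism", 0.50)]
totals = accumulate_by_category(hits)
winner = max(totals, key=totals.get)  # step 15: category with the maximum weight sum
print(totals, winner)
```

Note that with these invented prior weights alone "job market" would win; the conflict-resolution step below is what can overturn such a decision.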
Step 13 (optional): judge whether the number of categories whose weight sum is greater than 1 is itself greater than 1;
If more than one category has a weight sum greater than 1, execute step 14, i.e. perform conflict-resolution processing; otherwise execute step 15.
Step 14: conflict-resolution processing;
A sentence often contains many decision units, and they may belong to different categories, which produces a conflict; conflict-resolution processing is therefore needed, and comprises:
1) assign each decision unit extracted from each sentence a syntactic distance equal to its level number in the dependency tree;
In a dependency parse tree, every node can be reached from the core word through some number of dependency arcs. Since words closer to the core word better represent the central meaning of the sentence, this embodiment measures that closeness by the number of arcs from the core word, referred to as the "syntactic distance". For example, the decision unit whose dependency relation is HED (the root node) may be assigned syntactic distance 0; traversing the dependency tree from the root node, each decision unit is then assigned a syntactic distance equal to its level number in the tree, counted from 0, so that the level number represents the dependency pair's syntactic distance from the core word;
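The distance assignment of step 1) can be sketched as a breadth-first traversal from the virtual root. The arc representation and the example tree are assumptions carried over from earlier sketches.

```python
from collections import deque

def syntactic_distances(arcs):
    """arcs: (head_idx, dep_idx, relation) triples, with head 0 the virtual
    root. Returns arc -> syntactic distance: the HED arc gets 0, and every
    other arc gets its level number in the tree, counted from the core word."""
    children = {}
    for head, dep, rel in arcs:
        children.setdefault(head, []).append((dep, rel))
    dist, queue = {}, deque([(0, 0)])        # (node, level of arcs leaving it)
    while queue:
        node, level = queue.popleft()
        for dep, rel in children.get(node, []):
            dist[(node, dep, rel)] = level
            queue.append((dep, level + 1))
    return dist

# Hypothetical tree: root -> travel (HED), travel -> employees (SBV),
# employees -> company (ATT); the ATT arc sits two arcs below the core word.
arcs = [(0, 4, "HED"), (4, 2, "SBV"), (2, 1, "ATT")]
print(syntactic_distances(arcs))
```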
2) adjust the weight of each extracted decision unit according to the formula weight = w / (d + n), where weight is the adjusted weight, w is the weight before adjustment, d is the syntactic distance, and n is a natural number. This formula gives high weights to decision units with a high prior weight (the weight before adjustment) and a small syntactic distance.
3) accumulate the adjusted weights of the extracted decision units by category to obtain the final weight of each category.
For example, consider judging the category of the sentence "The company's employees will go on a tour together next weekend", whose dependency parse tree is shown in Fig. 2. In the figure, each arrow starts from a parent node and points to a child node; the parent is the head word and the child its modifier, and the symbol on each arc indicates the modification relation, i.e. the dependency relation. First the decision unit whose dependency relation is HED is found and initialized, giving decision unit (1) <EOS, tour, HED, 0, NIL>: the core word is "tour", HED is the dependency relation, 0 is the assigned syntactic distance, and NIL indicates that its weight is not yet determined (the circled numbers before the decision units are only for convenience of the following description). A breadth-first traversal of the dependency tree from the core word "tour" assigns each decision unit its syntactic distance, giving decision unit (2) <company, employee, ATT, 2, NIL> (decision units not present in the knowledge base are not listed here), where "company" and "employee" are two feature words, ATT is their dependency relation, and 2 is the assigned syntactic distance. Retrieval from the prior knowledge base gives decision unit (1) a prior weight of 2.03 in the category "tourism", and decision unit (2) a prior weight of 3.34 in the category "job market". After adjustment, the weight of (1) is 2.03/(0+1) = 2.03 and the weight of (2) is 3.34/(2+1) = 1.11; the text therefore belongs to "tourism" with weight 2.03 and to "job market" with weight 1.11, so the final decision is "tourism". The adjustment process of this example sentence shows that the text contains a category conflict: judged by the prior weights alone it would be assigned to the "job market" class, but after adjustment this is corrected to the "tourism" class.
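The arithmetic of this worked example can be checked directly with the adjustment formula weight = w / (d + n), taking n = 1 as the example's figures imply.

```python
def adjusted_weight(w, d, n=1):
    """weight = w / (d + n): prior weight w, syntactic distance d,
    natural number n (n = 1 matches the arithmetic of the worked example)."""
    return w / (d + n)

# The example's figures: the HED unit (distance 0, prior 2.03, "tourism")
# versus the ATT unit (distance 2, prior 3.34, "job market").
tourism = adjusted_weight(2.03, 0)
job_market = adjusted_weight(3.34, 2)
print(tourism, round(job_market, 2))  # the conflict flips to "tourism"
```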
Embodiments of the invention therefore generally consider the dependency relations listed in Table 1 below:
Table 1
(Table 1 appears as an image in the original publication.)
Step 15: take the category with the maximum weight sum as the category of the text to be classified.
If categories with identical weights still remain at this point, any one of them may be selected as the category of the text to be classified.
In the text classification process, when determining the weight of a decision unit, the embodiment of the invention considers both the existing prior weight and the contextual information of the specific sentence in which the unit appears, dynamically adjusting the prior weight and thereby resolving conflicts more effectively.
The process of building the knowledge base provided by the embodiment of the invention is now described in detail, including the decision-unit acquisition method. A decision unit in this embodiment is a collocation pair of feature words that can serve as classification evidence and that stand in a definite proper grammatical relation (dependency relation). It can be formally represented as the five-tuple <w1, w2, type, weight, C>, where w1 and w2 are the two feature words, type is their grammatical relation (dependency relation), and weight is the weight of this decision unit in category C, representing its classification ability. The knowledge-base building process, shown in Fig. 3, comprises the following steps:
Step 30: perform dependency syntax analysis on the corpus sentence by sentence to obtain the dependency tree of each sentence;
Fig. 4 and Fig. 5 are dependency-analysis diagrams of two sentences that both contain the words "emotion" and "world". The arcs with arrows in the figures represent the parent-child relations between words, i.e. modification relations; each arrow starts from a parent node and points to a child node, the parent being the head word and the child its modifier. The symbol on an arc indicates the modification type, i.e. the dependency relation between the parent and child nodes. As can be seen from Fig. 4, "emotion" and "world" have no dependency relation in the example sentence of Fig. 4, whereas in the example sentence of Fig. 5 they stand in an attributive relation. The dependency relation thus determines whether a given word pair is an "accidental" co-occurrence or a "necessary" one; in other words, this relation makes it possible to avoid the classification errors caused by "accidental" co-occurrence.
Taking the example sentence of Fig. 5 as an example, the dependency tree obtained by dependency syntax analysis can be represented as a string: a series of space-separated dependency pairs of the form "[Num1]w1_[Num2]w2(type)", each pair corresponding to one arc in Fig. 5, where Num1 and Num2 are the index numbers of the words in the dependency tree.
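A parser for this serialized tree format can be sketched with a regular expression. Only the token shape "[Num1]w1_[Num2]w2(type)" is taken from the description above; the sample string and its word indices are invented.

```python
import re

TOKEN = re.compile(r"\[(\d+)\]([^_\s]+)_\[(\d+)\]([^()\s]+)\(([A-Z]+)\)")

def parse_tree_string(s):
    """Recover (Num1, w1, Num2, w2, type) tuples, one per dependency arc."""
    return [(int(i), w1, int(j), w2, rel)
            for i, w1, j, w2, rel in TOKEN.findall(s)]

sample = "[1]company_[2]employee(ATT) [2]employee_[4]travel(SBV)"
print(parse_tree_string(sample))
```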
Step 31: extract the dependency pairs of each dependency tree as candidate decision units;
From each dependency tree all of its dependency pairs are extracted; each dependency pair contains two feature words and has a specific dependency relation;
Each dependency pair is represented in the form <w1, w2, type, NIL, C>, where NIL indicates that its weight is not yet determined.
A special case must be considered during extraction: for a structure such as "the world of emotion", syntactic analysis produces two dependency pairs, <emotion, of, ATT> and <of, world, DE>, which need to be merged into the single dependency pair <emotion, world, ATT>.
Step 32: filter out candidate decision units that contain a predefined stop word, and filter out candidate decision units whose dependency relation is not among the predefined dependency relations;
The stop words of this embodiment mainly comprise function words, auxiliary words, pronouns, and the like. Since dependency pairs containing stop words, such as <it, is, SBV>, carry essentially no classification information, stop-word filtering removes every decision unit whose w1 or w2 is a stop word.
In addition, dependency pairs with certain dependency relations also carry essentially no classification information, for example numeral-classifier pairs (QUN), adverbial-particle pairs (DI), and mood-particle pairs (MT).
Step 33: calculate the weight of each candidate decision unit in each category;
The weight w(du) of each candidate decision unit in each category can be calculated as follows:
w(du) = max_{i ∈ C} f_i / (N − f_i + 1)
where f_i is the frequency with which the candidate decision unit du occurs in category i, N is the total frequency of the candidate decision unit across all categories, and C is the set of categories. This formula gives a higher weight to a decision unit that appears frequently in one class and rarely in the others; a higher weight indicates stronger classification ability.
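The weight formula of step 33 can be computed directly from per-category frequency counts. The frequency figures below are invented to illustrate the stated behaviour.

```python
def unit_weight(freqs):
    """freqs: category -> frequency f_i of the candidate unit du.
    Computes w(du) = max over i of f_i / (N - f_i + 1), with N the unit's
    total frequency over all categories; returns (best category, weight)."""
    n = sum(freqs.values())
    best = max(freqs, key=lambda c: freqs[c] / (n - freqs[c] + 1))
    return best, freqs[best] / (n - freqs[best] + 1)

# A unit concentrated in one category scores far higher than an evenly
# spread one, matching the stated intent of the formula.
print(unit_weight({"tourism": 9, "job market": 1}))   # ('tourism', 4.5)
print(unit_weight({"tourism": 5, "job market": 5}))
```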
Step 34: store the decision units whose weights are higher than a specific threshold into the knowledge base as the acquired decision units, and store the category and weight of each such decision unit.
This step performs weight filtering: the decision units of all categories are sorted from high to low by the weights calculated in step 33, the decision units above a specific threshold are stored into the knowledge base, and the category and weight of each stored decision unit are stored with it, to serve as evidence for text classification.
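The threshold filtering of step 34 can be sketched as below. The candidate triples and the threshold value are assumptions; the embodiment leaves the threshold unspecified.

```python
def build_knowledge_base(scored_units, threshold):
    """scored_units: (unit, category, weight) triples from step 33.
    Sort by weight and keep only units whose weight exceeds the threshold."""
    ranked = sorted(scored_units, key=lambda x: x[2], reverse=True)
    return {unit: (category, weight)
            for unit, category, weight in ranked if weight > threshold}

cands = [(("company", "employee", "ATT"), "job market", 3.34),
         (("EOS", "travel", "HED"), "tourism", 2.03),
         (("good", "day", "ATT"), "life", 0.20)]
kb = build_knowledge_base(cands, threshold=1.0)
print(sorted(kb))  # the low-weight unit is filtered out
```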
The knowledge-base building process of this embodiment mines from the corpus the decision units that occur frequently in each category. These decision units indicate word pairs that are often used with a fixed grammatical relation in a certain class, so they can serve as evidence for classifying new texts while avoiding the classification errors caused by words that co-occur "by accident". A knowledge base composed of a sufficient number of decision units can then be used to classify new texts.
An embodiment of the invention also provides a text classification device which, as shown in Fig. 6, comprises:
an acquiring unit 61 for splitting a text to be classified into sentences, performing dependency syntax analysis on each sentence, and extracting all dependency pairs as extracted decision units;
a retrieval unit 62 for retrieving the categories of the extracted decision units from the knowledge base, where the knowledge base stores decision units serving as classification evidence together with their categories and weights;
a computing unit 63 for accumulating the weight sums of the extracted decision units by category;
a category determining unit 64 for taking the category with the maximum weight sum as the category of the text to be classified.
The device may further comprise:
a knowledge-base building unit 60 for building the knowledge base that stores the decision units serving as classification evidence together with their categories and weights.
As shown in Fig. 7, the knowledge-base building unit may further comprise:
a first subunit 601 for performing dependency syntax analysis on the corpus sentence by sentence to obtain the dependency tree of each sentence;
a second subunit 602 for extracting the dependency pairs of each dependency tree as candidate decision units, each dependency pair containing two feature words and having a specific dependency relation;
a third subunit 603 for filtering out candidate decision units that contain a predefined stop word and candidate decision units whose dependency relation is not among the predefined dependency relations;
a fourth subunit 604 for calculating the weight of each candidate decision unit in each category; the weight w(du) of each candidate decision unit in each category can be calculated as follows:
w(du) = max_{i ∈ C} f_i / (N − f_i + 1)
where f_i is the frequency with which the candidate decision unit du occurs in category i, N is the total frequency of the candidate decision unit across all categories, and C is the set of categories;
a fifth subunit 605 for storing the decision units whose weights are higher than a specific threshold into the knowledge base as the acquired decision units, and for storing the category and weight of each such decision unit.
The device may further comprise:
a conflict-resolution unit 65 for performing conflict-resolution processing when more than one category has a weight sum greater than 1.
As shown in Fig. 8, the conflict-resolution unit further comprises:
a syntactic-distance determining unit 651 for assigning each decision unit extracted from each sentence a syntactic distance equal to its level number in the dependency tree;
an adjustment unit 652 for adjusting the weight of each extracted decision unit according to the formula weight = w / (d + n), where weight is the adjusted weight, w is the weight before adjustment, d is the syntactic distance, and n is a natural number;
a final-weight calculation unit 653 for accumulating the adjusted weights of the extracted decision units by category to obtain the final weight of each category.
In the text classification process, when determining the weight of a decision unit, the embodiment of the invention considers both the existing prior weight and the contextual information of the specific sentence in which the unit appears, dynamically adjusting the prior weight and thereby resolving conflicts more effectively. Moreover, the process of building the knowledge base used as the classification benchmark effectively avoids the classification errors caused by words that co-occur "by accident".
To confirm the beneficial effects of the text classification algorithm provided by the embodiments above, the inventors carried out the following test, classifying texts with a method A and with the method B of the embodiment of the invention. Method A is implemented according to the basic idea of the existing rule-based classification method: it extracts all binary word pairs that co-occur within a sentence of the corpus, without requiring any direct grammatical relation between the two words; it assigns weights to each word pair according to a certain formula; and it stores the word pairs whose weights reach a certain threshold in a knowledge base as evidence for classifying new texts. Two indexes are mainly considered when calculating a weight: first, the entropy of a word, which determines how much category information the word carries; second, the average distance between the two words of a pair in the sentences where they co-occur, which determines how much certainty the pair carries. At classification time, all word pairs in each sentence of the text to be classified are extracted, the corresponding categories and weights are retrieved from the knowledge base, and the text is assigned to the category with the maximum weight sum.
Both classification methods used the same corpus of 10,000 texts. After threshold adjustment, method A extracted 100,000 rules and method B extracted 80,000 decision units. The main reason algorithm A produces so many more rules is its large redundancy.
Table 2 below shows the distribution of the test corpus; Tables 3 and 4 are the detailed evaluation results of methods A and B respectively.
Table 2: distribution of the test corpus
Fashion   Life   Emotion   Sports   Job market   Games   Entertainment   Total
260       281    638       302      195          228     320             2224
Table 3: evaluation results of method A (the table appears as an image in the original publication)
Table 4: evaluation results of method B (the table appears as an image in the original publication)
The evaluation results show that method B improves on method A in every index. In particular, although method B uses 20% fewer decision units than method A, its recall is nearly 10 percentage points higher, which indicates that the decision units mined by the invention have high accuracy and low redundancy. Moreover, algorithm B correctly resolved 75% of the category conflicts that occurred in the test, which also shows that conflict resolution using syntactic distance achieves better results.
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, for example a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to encompass them.

Claims (12)

1. A text classification method, characterized in that it comprises:
splitting a text to be classified into sentences, performing dependency syntax analysis on each sentence, and extracting all dependency pairs as extracted decision units; a decision unit is a feature-word collocation that can serve as a classification basis and has a specific grammatical relation, and can be formally represented as a five-tuple <w1, w2, type, weight, C>, wherein w1 and w2 are two feature words, type is their grammatical relation, and weight is the weight of the decision unit in category C, representing its classification capability;
filtering out decision units that contain predefined stop words and decision units whose dependency relation does not belong to the predefined dependency relations; retrieving the categories of the extracted decision units from a knowledge base, wherein the knowledge base stores decision units serving as classification bases together with their categories and weights;
accumulating, per category, the weight sums of the extracted decision units;
taking the category with the maximum weight sum as the category of the text to be classified.
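The classification flow of claim 1 can be sketched as follows, assuming a dependency parser has already produced the (word, word, relation) pairs. The knowledge-base entries, relation tags (SBV, VOB, ATT), and stop-word list here are illustrative assumptions, not part of the claim:

```python
from collections import defaultdict

# Hypothetical knowledge base: each decision unit (w1, w2, relation)
# maps to its category and weight, mirroring the five-tuple
# <w1, w2, type, weight, C> of claim 1.
KNOWLEDGE_BASE = {
    ("stock", "rise", "SBV"): ("finance", 0.8),
    ("team", "win", "SBV"): ("sports", 0.9),
    ("score", "goal", "VOB"): ("sports", 0.7),
}

STOP_WORDS = {"the", "a", "of"}                 # assumed stop-word list
ALLOWED_RELATIONS = {"SBV", "VOB", "ATT"}       # assumed predefined relations

def classify(dependency_pairs):
    """Accumulate per-category weight sums and return the top category."""
    totals = defaultdict(float)
    for w1, w2, rel in dependency_pairs:
        # filter: stop words and dependency relations outside the predefined set
        if w1 in STOP_WORDS or w2 in STOP_WORDS or rel not in ALLOWED_RELATIONS:
            continue
        entry = KNOWLEDGE_BASE.get((w1, w2, rel))
        if entry:
            category, weight = entry
            totals[category] += weight
    # category with the maximum weight sum wins; None if nothing matched
    return max(totals, key=totals.get) if totals else None
```

With the sample knowledge base, a text yielding the pairs ("team", "win", "SBV"), ("score", "goal", "VOB"), and ("stock", "rise", "SBV") accumulates 1.6 for sports versus 0.8 for finance, so it is classified as sports.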
2. the method for claim 1 is characterized in that, also comprises setting up described knowledge base, specifically comprises:
Be that unit carries out interdependent syntactic analysis with the sentence with corpus, obtain the dependency tree of each sentence;
What extract every dependency tree is interdependent to as candidate's decision package, and each is interdependent to comprising two feature words, and has specific dependence;
Calculate the weights of each candidate's decision package in of all categories;
The decision package that weights is higher than specific threshold stores knowledge base into as classification foundation, and stores affiliated classification and the weights of decision package that described weights are higher than specific threshold.
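The knowledge-base construction of claim 2, combined with the claim 3 weight and a threshold filter, might be sketched as follows. The input format and category names are assumptions, and dependency parsing of the corpus is taken as already done:

```python
from collections import Counter, defaultdict

def build_knowledge_base(corpus_pairs_by_category, threshold):
    """Knowledge-base construction per claim 2 (sketch).
    corpus_pairs_by_category maps a category name to the list of
    (w1, w2, relation) pairs extracted from that category's documents."""
    # count each candidate decision unit's frequency per category
    freq = defaultdict(Counter)
    for category, pairs in corpus_pairs_by_category.items():
        for unit in pairs:
            freq[unit][category] += 1
    kb = {}
    for unit, by_cat in freq.items():
        N = sum(by_cat.values())  # total frequency across all categories
        # best category under the claim 3 score f_i / (N - f_i + 1)
        cat, f = max(by_cat.items(), key=lambda kv: kv[1] / (N - kv[1] + 1))
        w = f / (N - f + 1)
        if w > threshold:  # only units above the threshold enter the knowledge base
            kb[unit] = (cat, w)
    return kb
```

For example, a pair seen 3 times in sports and once in finance gets weight 3/(4−3+1) = 1.5 for sports, so it survives a threshold of 1.0; a pair spread evenly across categories scores low and is dropped.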
3. The method of claim 2, characterized in that the weight w(du) of each candidate decision unit in each category is calculated as follows:
w(du) = argmax_{i ∈ C} f_i / (N - f_i + 1)
wherein f_i denotes the frequency with which candidate decision unit du occurs in category i, N denotes the total frequency with which the candidate decision unit occurs across all categories, and C is the set of categories.
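A direct transcription of the claim 3 weight, assuming the per-category frequencies have already been counted; a unit concentrated in one category scores high, while a unit spread evenly across categories scores low:

```python
def decision_unit_weight(freq_by_category):
    """Weight of a candidate decision unit per claim 3:
    the maximum over categories i of f_i / (N - f_i + 1),
    where N is the unit's total frequency across all categories."""
    N = sum(freq_by_category.values())
    return max(f / (N - f + 1) for f in freq_by_category.values())
```

For frequencies {sports: 9, finance: 1}, N = 10 and the weight is 9/(10−9+1) = 4.5; for the evenly spread {a: 5, b: 5} it is only 5/(10−5+1) ≈ 0.83.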
4. The method of claim 2, characterized in that, after extracting the dependency pairs of every dependency tree as candidate decision units and before calculating the weight of each candidate decision unit in each category, the method further comprises:
filtering out candidate decision units that contain predefined stop words and candidate decision units whose dependency relation does not belong to the predefined dependency relations.
5. The method of any one of claims 1 to 4, characterized in that it further comprises:
if the number of categories having the maximum weight sum is greater than 1, performing conflict resolution processing.
6. The method of claim 5, characterized in that the conflict resolution processing comprises:
assigning, as the syntactic distance of each decision unit extracted from each sentence, the level number of that decision unit in its dependency tree;
adjusting the weight of each extracted decision unit according to the formula weight = w/(d + n), wherein w is the weight before adjustment, d is the syntactic distance, and n is a natural number;
accumulating, per category, the adjusted weights of the extracted decision units to obtain the final weight of each category.
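The conflict resolution of claim 6 can be sketched as follows; taking n = 1 is an assumed choice of the natural number in weight = w/(d + n), and the input tuples are an assumed representation. Decision units nearer the root of the dependency tree (smaller syntactic distance) keep more of their weight:

```python
def resolve_conflict(extracted_units, n=1):
    """Conflict resolution per claim 6 (sketch).
    Each extracted unit is (category, weight, depth), where depth is the
    level of the dependency pair in its dependency tree."""
    totals = {}
    for category, w, d in extracted_units:
        # adjust each weight by syntactic distance: weight = w / (d + n)
        totals[category] = totals.get(category, 0.0) + w / (d + n)
    # the category with the largest final (adjusted) weight wins
    return max(totals, key=totals.get)
```

Two units that tie on raw weight, say ("sports", 0.9, 1) and ("finance", 0.9, 3), no longer tie after adjustment: sports gets 0.9/2 = 0.45 while finance gets 0.9/4 = 0.225, so the shallower unit decides the category.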
7. A text classification device, characterized in that it comprises:
an acquiring unit, configured to split a text to be classified into sentences, perform dependency syntax analysis on each sentence, and extract all dependency pairs as extracted decision units; a decision unit is a feature-word collocation that can serve as a classification basis and has a specific grammatical relation, and can be formally represented as a five-tuple <w1, w2, type, weight, C>, wherein w1 and w2 are two feature words, type is their grammatical relation, and weight is the weight of the decision unit in category C, representing its classification capability;
a retrieval unit, configured to retrieve the categories of the extracted decision units from a knowledge base, wherein the knowledge base stores decision units serving as classification bases together with their categories and weights;
a computing unit, configured to accumulate, per category, the weight sums of the extracted decision units;
a category determining unit, configured to take the category with the maximum weight sum as the category of the text to be classified.
8. The device of claim 7, characterized in that it further comprises:
a knowledge base establishing unit, configured to establish the knowledge base storing decision units serving as classification bases together with their categories and weights.
9. The device of claim 8, characterized in that the knowledge base establishing unit further comprises:
a first subunit, configured to perform dependency syntax analysis on a corpus sentence by sentence to obtain the dependency tree of each sentence;
a second subunit, configured to extract the dependency pairs of every dependency tree as candidate decision units, each dependency pair comprising two feature words and having a specific dependency relation;
a third subunit, configured to filter out candidate decision units that contain predefined stop words and candidate decision units whose dependency relation does not belong to the predefined dependency relations;
a fourth subunit, configured to calculate the weight of each candidate decision unit in each category;
a fifth subunit, configured to store decision units whose weights exceed a specific threshold into the knowledge base, together with the categories and weights of the decision units whose weights exceed the specific threshold.
10. The device of claim 9, characterized in that the fourth subunit calculates the weight w(du) of each candidate decision unit in each category as follows:
w(du) = argmax_{i ∈ C} f_i / (N - f_i + 1)
wherein f_i denotes the frequency with which candidate decision unit du occurs in category i, N denotes the total frequency with which the candidate decision unit occurs across all categories, and C is the set of categories.
11. The device of any one of claims 7 to 10, characterized in that it further comprises:
a conflict resolution unit, configured to perform conflict resolution processing when the number of categories having the maximum weight sum is greater than 1.
12. The device of claim 11, characterized in that the conflict resolution unit further comprises:
a syntactic distance determining unit, configured to assign, as the syntactic distance of each decision unit extracted from each sentence, the level number of that decision unit in its dependency tree;
an adjustment unit, configured to adjust the weight of each extracted decision unit according to the formula weight = w/(d + n), wherein w is the weight before adjustment, d is the syntactic distance, and n is a natural number;
a final weight calculation unit, configured to accumulate, per category, the adjusted weights of the extracted decision units to obtain the final weight of each category.
CN 200910088411 2009-06-29 2009-06-29 Text classification method and device Active CN101937436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910088411 CN101937436B (en) 2009-06-29 2009-06-29 Text classification method and device


Publications (2)

Publication Number Publication Date
CN101937436A CN101937436A (en) 2011-01-05
CN101937436B true CN101937436B (en) 2013-09-25

Family

ID=43390770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910088411 Active CN101937436B (en) 2009-06-29 2009-06-29 Text classification method and device

Country Status (1)

Country Link
CN (1) CN101937436B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214208B (en) * 2011-04-27 2014-04-09 百度在线网络技术(北京)有限公司 Method and equipment for generating structured information entity based on non-structured text
CN103514151A (en) * 2012-06-29 2014-01-15 富士通株式会社 Dependency grammar analysis method and device and auxiliary classifier training method
CN104123291B (en) * 2013-04-25 2017-09-12 华为技术有限公司 A kind of method and device of data classification
CN106598935B (en) * 2015-10-16 2019-04-23 北京国双科技有限公司 A kind of method and device of determining document emotion tendency
CN105373808B (en) * 2015-10-28 2018-11-20 小米科技有限责任公司 Information processing method and device
CN106354762B (en) * 2016-08-17 2020-03-20 海信集团有限公司 Service positioning method and device for interactive statements
CN106713083B (en) * 2016-11-24 2020-06-26 海信集团有限公司 Intelligent household equipment control method, device and system based on knowledge graph
CN108549723B (en) * 2018-04-28 2022-04-05 北京神州泰岳软件股份有限公司 Text concept classification method and device and server
CN108897832B (en) * 2018-06-22 2021-09-03 申报家(广州)智能科技发展有限公司 Method and device for automatically analyzing value information
CN112560488A (en) * 2020-12-07 2021-03-26 北京明略软件系统有限公司 Noun phrase extraction method, system, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997012485A1 (en) * 1995-09-25 1997-04-03 Philips Electronics N.V. Method and device for transmitting and receiving teletext pages
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN101178714A (en) * 2006-12-20 2008-05-14 腾讯科技(深圳)有限公司 Web page classification method and device
CN101268465A (en) * 2005-09-20 2008-09-17 法国电信公司 Method for sorting a set of electronic documents




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant