CN101944099B - Method for automatically classifying text documents using an ontology - Google Patents

Info

Publication number
CN101944099B
CN101944099B CN2010102101070A CN201010210107A CN101944099B CN 101944099 B CN101944099 B CN 101944099B CN 2010102101070 A CN2010102101070 A CN 2010102101070A CN 201010210107 A CN201010210107 A CN 201010210107A CN 101944099 B CN101944099 B CN 101944099B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010102101070A
Other languages
Chinese (zh)
Other versions
CN101944099A (en
Inventor
郭雷
方俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Tianying Environmental Protection Energy Co ltd
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Application filed by Northwestern Polytechnical University
Priority to CN2010102101070A
Publication of CN101944099A
Application granted
Publication of CN101944099B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for automatically classifying text documents using an ontology. First, the feature information of each text document is represented by a weighted keyword set. Next, the feature information of each classification category is represented by an ontology that has undergone ontology disambiguation and ontology expansion, and the ontology is converted into a weighted word-sense set by analyzing its structural characteristics. Finally, the Earth Mover's Distance method is used to compute the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology; from it the similarity value between the text document and the category is computed, and the text documents are classified and ranked according to that similarity value. With this method, text documents can be classified automatically and the accuracy of text document classification can be improved.

Description

A method for automatically classifying text documents using an ontology
Technical field
The present invention relates to a method for automatically classifying text documents using an ontology, and belongs to fields such as computer information processing and information retrieval. It is suitable for classifying massive collections of web text documents quickly and accurately.
Background art
Text document classification has long been a focus of attention because it makes the organization of text documents more efficient and better supports users in browsing and searching for information. At first, classification was done by hand, but as the volume of text documents grew, manual classification became infeasible, and automatic text document classification became a major research topic.
Text document classification is generally divided into three stages: first, the feature information of the text documents and of the classification categories is extracted; next, a classifier computes the similarity value between each text document and each category; finally, each text document is assigned to categories according to these similarity values.
Traditional machine learning methods have been applied to automatic text document classification, including neural networks, Bayesian methods, support vector machines and the k-nearest-neighbor method. These methods first collect a set of classified text documents by hand, then use this set to train a classifier, and finally use the trained classifier to assign text documents to categories. These machine learning methods have the following shortcomings:
1) Training a classifier requires manually collecting a large set of classified text documents, which is tedious, and a different document set must be collected by hand for each different set of categories;
2) Traditional machine learning methods do not consider the semantic relations between words, so it is difficult to improve classification accuracy.
To overcome these shortcomings of machine learning methods, the present invention proposes a method for automatically classifying text documents using an ontology.
Summary of the invention
Technical problem to be solved
To overcome the shortcomings of current machine-learning-based methods, the present invention proposes using an ontology to classify text documents automatically, so that text documents can be classified and ranked quickly and accurately.
Technical solution
The idea of the invention is as follows. An ontology is used to represent the feature information of each classification category, and classification is performed in real time using the semantic similarity value between a text document and the ontology. This removes the training process, and as ontologies are continually updated and evolve, the precision and recall of the ontology-based classification method will keep improving. In addition, the ontology-based method takes the semantic relations between words into account when computing the similarity value between a text document and an ontology, which improves classification accuracy.
The invention is characterized in that it proposes that an ontology can effectively represent the feature information of a classification category, represents that feature information with an ontology that has undergone disambiguation and expansion, and classifies using the semantic similarity value between the text document to be classified and the ontology.
The basic process of the invention is as follows. First, a weighted keyword set is used to represent the feature information of each text document. Next, an ontology that has undergone disambiguation and expansion is used to represent the feature information of each classification category, and the ontology is converted into a weighted word-sense set by analyzing its structural characteristics. Finally, the Earth Mover's Distance (EMD) method is used to compute the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology, where the similarity value between an individual word sense and a word is measured by a method based on WordNet glosses. This semantic similarity value is then used to compute the similarity value between the text document and the category, and the text documents are classified and ranked according to that similarity value.
A method for automatically classifying text documents using an ontology, characterized by the following steps:
(1) Extract the keyword set of each text document in the set of text documents to be classified using the keyword extraction algorithm KEA (Keyphrases Extraction Algorithm), obtaining the weighted keyword set of the text document; search the Semantic Web search engine Swoogle using the name of each category in the given category set as the query term, and take the top-ranked ontology in the search results as the ontology representing that category; apply ontology disambiguation and ontology expansion to the ontology representing each category to obtain the new ontology representing that category;
The ontology disambiguation process is:
First, for each concept word in the ontology, select the words within distance L of it in the ontology as the contexts of that concept word; the value range of L is [3,5];
Then, compute the semantic relatedness $\mathrm{relateness}(s_i, \mathit{con}_j)$ between the $i$-th possible word sense $s_i$ of each concept word and the $j$-th context $\mathit{con}_j$ of that concept word by the semantic relatedness formula
$$\mathrm{relateness}(s_i, \mathit{con}_j) = \frac{\mathit{NumOfOverlaps}\_s_i\mathit{con}_j}{(\mathit{wordNumInGlossOf}\,s_i + \mathit{wordNumInGlossOf}\,\mathit{con}_j)/2}$$
and compute the average semantic relatedness $\mathrm{Rel}(s_i)$ of the $i$-th possible word sense $s_i$ of each concept word by
$$\mathrm{Rel}(s_i) = \frac{\sum_{j=1}^{J} \mathrm{relateness}(s_i, \mathit{con}_j)}{J};$$
where $i = 1, 2, \ldots, I$, and $I$ is the number of possible word senses of the concept word; $j = 1, 2, \ldots, J$, and $J$ is the number of contexts of the concept word; $\mathit{wordNumInGlossOf}\,s_i$ is the number of words in the WordNet gloss of $s_i$; $\mathit{wordNumInGlossOf}\,\mathit{con}_j$ is the number of words in the WordNet gloss of $\mathit{con}_j$; $\mathit{NumOfOverlaps}\_s_i\mathit{con}_j$ is the number of identical words among the words of the WordNet gloss of $s_i$ and the WordNet gloss of $\mathit{con}_j$; the possible word senses are the word senses defined in WordNet;
Finally, select the possible word sense with the largest average semantic relatedness Rel as the concept sense of the concept word;
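The disambiguation step above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: glosses are passed in as plain strings rather than read from WordNet, words are compared as distinct lowercase tokens (the text does not say whether repeated words count once), and the function names `relatedness`, `average_relatedness` and `disambiguate` as well as the sample glosses are hypothetical.

```python
def relatedness(gloss_a, gloss_b):
    """Gloss-overlap relatedness: number of shared words divided by the
    mean word count of the two glosses (distinct lowercase words, an assumption)."""
    words_a = set(gloss_a.lower().split())
    words_b = set(gloss_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    overlaps = len(words_a & words_b)
    return overlaps / ((len(words_a) + len(words_b)) / 2)


def average_relatedness(sense_gloss, context_glosses):
    """Rel(s_i): mean relatedness between one candidate sense and all contexts."""
    if not context_glosses:
        return 0.0
    total = sum(relatedness(sense_gloss, g) for g in context_glosses)
    return total / len(context_glosses)


def disambiguate(candidate_glosses, context_glosses):
    """Return the key of the candidate sense with the largest Rel value."""
    return max(
        candidate_glosses,
        key=lambda s: average_relatedness(candidate_glosses[s], context_glosses),
    )


# Hypothetical glosses, made up for illustration (not taken from WordNet).
senses = {
    "bank#1": "a financial institution that accepts deposits and lends money",
    "bank#2": "sloping land beside a body of water",
}
contexts = [
    "a sum of money that a financial institution lends",
    "money placed in deposits",
]
best = disambiguate(senses, contexts)
```

With distinct-word counting the relatedness value always lies in [0, 1], which is compatible with the [0.6, 1] threshold range used later for expansion; a token-count reading of the formula would not guarantee that bound.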
The ontology expansion process is:
Using the semantic relatedness formula
$$\mathrm{relateness}(\hat{s}_p, s'_{pq}) = \frac{\mathit{NumOfOverlaps}\_\hat{s}_p s'_{pq}}{(\mathit{wordNumInGlossOf}\,\hat{s}_p + \mathit{wordNumInGlossOf}\,s'_{pq})/2}$$
compute, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each word sense in its hypernym sense set and in its hyponym sense set in WordNet, and decide as follows: for each word sense in the hypernym sense set, if its semantic relatedness to the concept sense is greater than the given threshold one, add that word sense to the parent-class set of the concept sense; for each word sense in the hyponym sense set, if its semantic relatedness to the concept sense is greater than the given threshold two, add that word sense to the subclass set of the concept sense; add all word senses in the synonym sense set of each concept sense in WordNet to the same-class set of the concept sense;
where $\hat{s}_p$ is the $p$-th concept sense of the disambiguated ontology, $p = 1, 2, \ldots, P$, and $P$ is the number of concept senses of the disambiguated ontology; $s'_{pq}$ is the $q$-th word sense in the hypernym sense set or hyponym sense set of $\hat{s}_p$, $q = 1, 2, \ldots, Q$, and $Q$ is the number of word senses in the hypernym sense set or hyponym sense set; $\mathit{wordNumInGlossOf}\,\hat{s}_p$ is the number of words in the WordNet gloss of $\hat{s}_p$; $\mathit{wordNumInGlossOf}\,s'_{pq}$ is the number of words in the WordNet gloss of $s'_{pq}$; $\mathit{NumOfOverlaps}\_\hat{s}_p s'_{pq}$ is the number of identical words among the words of the WordNet gloss of $\hat{s}_p$ and the WordNet gloss of $s'_{pq}$;
The value range of the given threshold one and threshold two is [0.6,1];
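The expansion rule can be illustrated with a small Python sketch. It assumes that the hypernym, hyponym and synonym senses of a concept sense have already been looked up and are supplied as dictionaries or lists, and that relatedness is computed over distinct lowercase gloss words; the function names, the dictionary layout and the sample glosses are hypothetical and serve only to show the thresholding logic.

```python
def relatedness(gloss_a, gloss_b):
    """Shared-word count over the mean gloss size (distinct words, an assumption)."""
    words_a = set(gloss_a.lower().split())
    words_b = set(gloss_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / ((len(words_a) + len(words_b)) / 2)


def expand_concept(concept_gloss, hypernyms, hyponyms, synonyms,
                   threshold_one=0.6, threshold_two=0.6):
    """Select the senses to attach to one concept sense.

    hypernyms / hyponyms: dict mapping a sense id to its gloss.
    synonyms: iterable of sense ids, all of which are kept.
    """
    parents = [s for s, gloss in hypernyms.items()
               if relatedness(concept_gloss, gloss) > threshold_one]
    subclasses = [s for s, gloss in hyponyms.items()
                  if relatedness(concept_gloss, gloss) > threshold_two]
    return {
        "parent_class": parents,
        "subclass": subclasses,
        "same_class": list(synonyms),
    }


# Hypothetical one-letter "glosses", chosen so the arithmetic is easy to follow.
result = expand_concept(
    "a b c d",
    hypernyms={"h1": "a b c d e", "h2": "x y z a"},   # 4/4.5 and 1/4
    hyponyms={"l1": "a b c", "l2": "q"},              # 3/3.5 and 0
    synonyms=["s1", "s2"],
)
```

Synonyms are added unconditionally, as in the text, while hypernyms and hyponyms must clear their respective thresholds; raising the thresholds toward 1 keeps the expanded ontology smaller and closer to the original concepts.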
(2) Compute the weighted word-sense set of the new ontology representing each category, specifically:
First, convert the ontology into a directed graph consisting of a vertex set and a directed edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is a containment relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;
Then, compute the weight of each concept sense by $\mathit{weight} = \dfrac{1}{\mathit{layer}^{1/4}}$;
where weight is the weight of the concept sense, and layer is the layer number of the vertex corresponding to this concept sense;
The layer number of a vertex is the shortest-path distance from the concept sense corresponding to the vertex to the ontology root;
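A minimal Python sketch of this weighting step follows. It assumes the ontology is given as a list of (child, parent) edges and finds layer numbers with a breadth-first search from the root. The text does not say how the root itself (distance 0) is weighted, where the formula would divide by zero, so the sketch assigns it weight 1.0; that choice, the function name `layer_weights` and the sample edges are assumptions made for illustration.

```python
from collections import deque


def layer_weights(edges, root):
    """Weight each concept sense by layer ** (-1/4).

    edges: iterable of (child, parent) pairs, i.e. edges point child -> parent.
    The layer of a vertex is its shortest-path distance from the root.
    """
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)

    # Breadth-first search from the root gives shortest-path layer numbers.
    layer = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layer:
                layer[child] = layer[node] + 1
                queue.append(child)

    # Assumption: the root (layer 0) gets weight 1.0 because 1 / 0 is undefined.
    return {node: (1.0 if depth == 0 else depth ** -0.25)
            for node, depth in layer.items()}


# Hypothetical fragment of an Arts ontology.
weights = layer_weights(
    [("Music", "Arts"), ("Jazz", "Music"), ("Bebop", "Jazz"), ("Jazz", "Arts")],
    root="Arts",
)
```

In the sample, Jazz has both a two-step path and a direct edge to the root, so the shortest-path rule places it on layer 1 and Bebop on layer 2.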
(3) Compute the similarity value Sim(d, o) between the text document and the category by Sim(d, o) = 1 - EMD(d, o); if the similarity value Sim(d, o) between the text document and the category is greater than the given threshold δ, classify the text document into this category; otherwise do not classify it into this category;
where d is the weighted keyword set of the text document, and o is the weighted word-sense set of the ontology; EMD(d, o) is the semantic similarity value between the text document and the ontology computed by the Earth Mover's Distance method; the value range of the given threshold δ is [0.5,0.6];
(4) Sort all text documents classified under each category in descending order of the similarity value Sim(d, o).
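Steps (3) and (4) amount to a threshold test followed by a sort, which the following Python sketch illustrates. It assumes the EMD values have already been computed and normalized to [0, 1]; the function name `classify_and_rank`, the input layout and the sample values are hypothetical.

```python
def classify_and_rank(emd_values, delta=0.5):
    """Assign documents to categories and rank them by similarity.

    emd_values: dict mapping (document, category) to EMD(d, o).
    Returns a dict mapping each category to a list of (document, Sim)
    pairs sorted by Sim in descending order.
    """
    ranked = {}
    for (doc, category), distance in emd_values.items():
        similarity = 1.0 - distance          # Sim(d, o) = 1 - EMD(d, o)
        if similarity > delta:               # keep only Sim above the threshold
            ranked.setdefault(category, []).append((doc, similarity))
    for category in ranked:
        ranked[category].sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Made-up EMD values, for illustration only.
result = classify_and_rank({
    ("a1", "Arts"): 0.2,
    ("a2", "Arts"): 0.4,
    ("a1", "Sports"): 0.7,
    ("s1", "Sports"): 0.3,
}, delta=0.5)
arts_order = [doc for doc, _ in result["Arts"]]
sports_order = [doc for doc, _ in result["Sports"]]
```

Because each (document, category) pair is tested independently, a document can land in more than one category or in none, which is consistent with the rule as stated.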
Beneficial effects
The method of the invention uses an ontology to represent the feature information of a category and classifies in real time by computing the semantic similarity value between a text document and the ontology, which removes the training process and improves classification accuracy. In addition, the invention uses disambiguation to turn the words that represent the ontology into word senses, which solves the problem of inaccurate similarity values caused by polysemy, improves the precision of the semantic similarity computation, and further improves classification precision. On the basis of ontology disambiguation, the invention expands the ontology automatically using WordNet, which enriches the conceptual content of the ontology, improves the accuracy of the subsequent similarity computation, and relieves the labor of creating ontologies by hand.
Description of drawings
Fig. 1: basic flowchart of the method of the invention
Detailed description of the embodiments
The invention is now further described with reference to the accompanying drawing:
The ontology-based text document classification method proposed by the invention was implemented in Java and Perl. The implementation is as follows:
The ontology-based text document classification method is divided into the following four steps:
Step 1: Construction of the text document keyword sets. Here, the keyword extraction algorithm KEA is used to extract the weighted keyword set of each text document in the set of text documents to be classified. Specifically, for each text document $d_i$ in the set of text documents to be classified $D = \{d_1, d_2, \ldots, d_{|D|}\}$ ($|D|$ is the number of text documents in $D$), first use naive Bayes estimation over three feature attributes, namely the frequency tf×idf with which a word (an existing word) occurs in the text document, the average position Occurrence at which the word occurs in the text document, and the number of letters Length in the word, to compute for each word in $d_i$ the probability Pr that it is a topic word by the following formula:
$$\Pr = \Pr[T \mid yes] \times \Pr[O \mid yes] \times \Pr[L \mid yes] \times \Pr[yes] \quad (1)$$
where Pr[T|yes], Pr[O|yes] and Pr[L|yes] denote, respectively, the probability that the word is a topic word given the current values of the three feature attributes tf×idf, Occurrence and Length; Pr[yes] denotes the ratio of the number of text documents in the text document set that contain topic words to the number of text documents that do not.
Then, select the n words with the largest Pr values (n is usually 4 to 6) as the keywords of text document $d_i$, obtaining the weighted keyword set of $d_i$, and represent $d_i$ by this weighted keyword set, i.e. $d_i = \{URL_i, (t_{i1}, tw_{i1}), \ldots, (t_{ij}, tw_{ij}), \ldots\}$, where $t_{ij}$ is a keyword extracted as described above and $tw_{ij}$ is the weight of keyword $t_{ij}$, namely its Pr value computed by formula (1).
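The scoring and selection in Step 1 can be sketched as follows. This is not the KEA implementation: it only illustrates formula (1) and the top-n selection, and it assumes that the three conditional probabilities for each candidate word have already been estimated elsewhere (in KEA they come from a trained naive Bayes model). The function names, the input layout and the probability values are hypothetical.

```python
def topic_word_score(p_tfidf, p_occurrence, p_length, p_yes):
    """Formula (1): product of the three conditional probabilities and the prior."""
    return p_tfidf * p_occurrence * p_length * p_yes


def weighted_keywords(candidates, p_yes, n=5):
    """Return the n highest-scoring words as (word, weight) pairs.

    candidates: dict mapping a word to (Pr[T|yes], Pr[O|yes], Pr[L|yes]).
    """
    scored = [(word, topic_word_score(pt, po, pl, p_yes))
              for word, (pt, po, pl) in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up probabilities for three candidate words.
keywords = weighted_keywords(
    {"opera": (0.8, 0.5, 0.5), "the": (0.1, 0.5, 0.2), "stage": (0.5, 0.4, 0.5)},
    p_yes=0.5,
    n=2,
)
```

The returned weights are the Pr values themselves, matching the statement that the weight of a keyword is its value from formula (1).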
Step 2: Ontology preprocessing. First, search the Semantic Web search engine Swoogle using the name of each category in the given category set as the query term, and represent that category by the top-ranked ontology in the search results. In this way, the category set $CA = \{ca_1, ca_2, \ldots, ca_{|CA|}\}$ is represented by the ontology set $O = \{o_1, o_2, \ldots, o_{|O|}\}$, where $|O|$ is the number of ontologies in the ontology set $O$ and $|CA|$ is the number of categories in the category set $CA$, with $|O| = |CA|$. Each category corresponds to one ontology, that is, ontology $o_m$ represents the feature information of category $ca_m$, written $ca_m := o_m$.
Next, apply the ontology disambiguation of Step 2.1 and the ontology expansion of Step 2.2 to each ontology $o_m$. The invention uses the word senses defined in WordNet as the lexical representation of the ontology, and sets the path distance between any two concept words within the same piece of knowledge to 1.
Step 2.1: Ontology disambiguation. A word may correspond to several word senses, which reduces the precision of the semantic similarity computation. To eliminate the ambiguity of the words that represent the ontology, the ontology is disambiguated, that is, the context of each word in the ontology is used to determine its correct word sense. Specifically:
First, the words within distance L of concept word $s$ in the ontology are chosen as the contexts of $s$, giving the context set $Con = \{\mathit{con}_1, \ldots, \mathit{con}_j, \ldots\}$ of $s$, where $\mathit{con}_j$ is the $j$-th context of $s$; the value range of L is [3,5];
Then, formula (2) is used to compute the average semantic relatedness $\mathrm{Rel}(s_i)$ between each word sense $s_i$ of $s$ in WordNet ($i = 1, \ldots, N_i$, where $N_i$ is the number of word senses of $s$ in WordNet) and all contexts in its context set $Con$:
$$\mathrm{Rel}(s_i) = \frac{\sum_{j=1}^{|Con|} \mathrm{relateness}(s_i, \mathit{con}_j)}{|Con|} \quad (2)$$
where $|Con|$ is the number of contexts of $s$, i.e. the number of words in the context set $Con$; $\mathrm{relateness}(s_i, \mathit{con}_j)$ is the semantic relatedness between the $i$-th word sense $s_i$ and its $j$-th context $\mathit{con}_j$, computed as follows:
$$\mathrm{relateness}(s_i, \mathit{con}_j) = \frac{\mathit{NumOfOverlaps}\_s_i\mathit{con}_j}{(\mathit{wordNumInGlossOf}\,s_i + \mathit{wordNumInGlossOf}\,\mathit{con}_j)/2} \quad (3)$$
where $\mathit{wordNumInGlossOf}\,s_i$ is the number of words in the WordNet gloss of $s_i$, $\mathit{wordNumInGlossOf}\,\mathit{con}_j$ is the number of words in the WordNet gloss of $\mathit{con}_j$, and $\mathit{NumOfOverlaps}\_s_i\mathit{con}_j$ is the number of identical words among the words of the WordNet gloss of $s_i$ and the WordNet gloss of $\mathit{con}_j$;
Finally, the word sense with the largest average semantic relatedness Rel is selected as the correct word sense of $s$, i.e. the concept sense of $s$. Because concept words that occur together in the same ontology are semantically related, the correct word sense is the one with the greatest semantic relatedness to the neighboring concept words.
Every concept word in the ontology is disambiguated by the above process.
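As a worked illustration of formulas (2) and (3), consider a candidate sense whose gloss has 10 words and two contexts whose glosses have 6 and 14 words, sharing 4 and 3 words with the sense gloss respectively. All of these counts are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{aligned}
\mathrm{relateness}(s_i, \mathit{con}_1) &= \frac{4}{(10 + 6)/2} = \frac{4}{8} = 0.5 \\
\mathrm{relateness}(s_i, \mathit{con}_2) &= \frac{3}{(10 + 14)/2} = \frac{3}{12} = 0.25 \\
\mathrm{Rel}(s_i) &= \frac{0.5 + 0.25}{2} = 0.375
\end{aligned}
```

The same computation would be repeated for every other candidate sense of the concept word, and the sense with the largest Rel value would be kept.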
Step 2.2: Ontology expansion. After the ontology disambiguation of Step 2.1, the ontology is represented by concept senses. Ontology expansion adds related word senses to the concept senses of the disambiguated ontology, thereby enriching the ontology.
First, for each concept sense $cs_k$ ($k = 1, \ldots, N_k$, where $N_k$ is the number of concept senses in the ontology) of the disambiguated ontology, obtain its hypernym sense set (hypernymy), hyponym sense set (hyponymy) and synonym sense set (synonym) in WordNet, denoted $\mathrm{hypernym}(cs_k) = \{a_{k1}, a_{k2}, \ldots\}$, $\mathrm{hyponym}(cs_k) = \{b_{k1}, b_{k2}, \ldots\}$ and $\mathrm{synonym}(cs_k) = \{c_{k1}, c_{k2}, \ldots\}$ respectively, and set two thresholds hypernym_value and hyponym_value, whose value range is [0.6,1].
Then, compute by formula (4) the semantic relatedness between concept sense $cs_k$ and each word sense $a_{kp}$ ($p = 1, 2, \ldots, P$, where $P$ is the number of word senses in the hypernym sense set) in the hypernym sense set $\mathrm{hypernym}(cs_k)$; if the semantic relatedness of $cs_k$ and $a_{kp}$ is greater than the given threshold hypernym_value, add $a_{kp}$ to the parent-class set of $cs_k$. Compute by formula (5) the semantic relatedness between concept sense $cs_k$ and each word sense $b_{kq}$ ($q = 1, 2, \ldots, Q$, where $Q$ is the number of word senses in the hyponym sense set) in the hyponym sense set $\mathrm{hyponym}(cs_k)$; if the semantic relatedness of $cs_k$ and $b_{kq}$ is greater than the given threshold hyponym_value, add $b_{kq}$ to the subclass set of $cs_k$. Add all word senses $c_{kl}$ ($l = 1, 2, \ldots, L$, where $L$ is the number of word senses in the synonym sense set) in the synonym sense set $\mathrm{synonym}(cs_k)$ to the same-class set of $cs_k$.
$$\mathrm{relateness}(cs_k, a_{kp}) = \frac{\mathit{NumOfOverlaps}\_cs_k a_{kp}}{(\mathit{wordNumInGlossOf}\,cs_k + \mathit{wordNumInGlossOf}\,a_{kp})/2} \quad (4)$$
$$\mathrm{relateness}(cs_k, b_{kq}) = \frac{\mathit{NumOfOverlaps}\_cs_k b_{kq}}{(\mathit{wordNumInGlossOf}\,cs_k + \mathit{wordNumInGlossOf}\,b_{kq})/2} \quad (5)$$
where $\mathit{wordNumInGlossOf}\,cs_k$, $\mathit{wordNumInGlossOf}\,a_{kp}$ and $\mathit{wordNumInGlossOf}\,b_{kq}$ are the numbers of words in the WordNet glosses of $cs_k$, $a_{kp}$ and $b_{kq}$ respectively; $\mathit{NumOfOverlaps}\_cs_k a_{kp}$ is the number of identical words among the words of the WordNet gloss of $cs_k$ and the WordNet gloss of $a_{kp}$; $\mathit{NumOfOverlaps}\_cs_k b_{kq}$ is the number of identical words among the words of the WordNet gloss of $cs_k$ and the WordNet gloss of $b_{kq}$.
Every concept sense in the ontology is expanded as described above.
After Step 2, the new ontology set $O^* = \{o_1^*, o_2^*, \ldots, o_{|O|}^*\}$ representing the feature information of the categories is obtained, where $o_m^*$ is ontology $o_m$ after ontology disambiguation and ontology expansion.
In this embodiment, the Jena package is used to operate on the ontologies, and the JAWS package is used to access WordNet.
Step 3: Construction of the ontology word-sense sets. For each ontology $o_m^*$ ($m = 1, 2, \ldots, |O|$) after ontology disambiguation and ontology expansion:
First, convert ontology $o_m^*$ into a directed graph $G$, i.e. $G = (V, E)$, where $V$ is the vertex set, $V = \{v_1, v_2, \ldots, v_{|V|}\}$ ($|V|$ is the number of vertices in $V$), and $E$ is the directed edge set, $E = \{e_1, e_2, \ldots, e_{|E|}\}$ ($|E|$ is the number of directed edges in $E$). Each vertex of the directed graph is a concept sense in the ontology; each directed edge is a containment relation between two concept senses and points from the child concept sense to the parent concept sense. In the directed graph $G$, the layer number of a vertex is the shortest-path distance from its corresponding concept sense to the ontology root. On the principle that the higher the level of a concept sense, the greater its contribution, the invention uses formula (6) to compute the weight $sw_k$ of concept sense $cs_k$:
$$sw_k = \frac{1}{(\mathrm{layer}(v_k))^{1/4}} \quad (6)$$
where $\mathrm{layer}(v_k)$ is the layer number of the vertex $v_k$ corresponding to concept sense $cs_k$.
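To give a feel for how slowly the weight in formula (6) decays with depth, the following values are evaluated for a few layer numbers chosen purely for illustration.

```latex
\begin{aligned}
\mathrm{layer} = 1 &: \quad sw = 1^{-1/4} = 1 \\
\mathrm{layer} = 2 &: \quad sw = 2^{-1/4} \approx 0.841 \\
\mathrm{layer} = 4 &: \quad sw = 4^{-1/4} = 1/\sqrt{2} \approx 0.707 \\
\mathrm{layer} = 16 &: \quad sw = 16^{-1/4} = 0.5
\end{aligned}
```

A concept sense has to sit sixteen layers below the root before its weight halves, so the fourth root keeps deep concepts influential while still favoring concepts near the root. Note that the formula is undefined for a layer number of 0, so how the root itself is weighted is left open by the text.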
After this processing, ontology $o_m^*$ can be expressed as the weighted ontology word-sense set $o_m^* = \{(cs_1, sw_1), (cs_2, sw_2), \ldots, (cs_{|o_m^*|}, sw_{|o_m^*|})\}$, where $cs_k$ is a concept sense in ontology $o_m^*$, $sw_k$ is its corresponding weight, and $|o_m^*|$ is the number of concept senses in ontology $o_m^*$.
Every ontology in the ontology set $O^*$ is processed as described above.
Step 4: Classification and ranking. After the three steps above, each text document is represented by a weighted keyword set and each category by a weighted ontology word-sense set. Whether a text document is assigned to a given category is now decided from the similarity value between the text document and the category: the larger the similarity value, the closer the relation between the text document and the category, and the more likely the text document belongs to the category.
First, the Earth Mover's Distance measure is used to compute the semantic similarity value between the text document and the ontology. Specifically:
For text document $d = \{(t_1, tw_1), (t_2, tw_2), \ldots, (t_{|d|}, tw_{|d|})\}$ ($t$ is a keyword of the text document and $tw$ is the weight of the keyword) and ontology $o = \{(cs_1, sw_1), (cs_2, sw_2), \ldots, (cs_{|o|}, sw_{|o|})\}$ ($cs$ is a concept sense in the ontology and $sw$ is the weight of the concept sense), a weighted graph $G = (V, W)$ can be obtained from them, where $W$ is the distance matrix whose element $w_{ij}$ is the semantic similarity value between keyword $t_i$ of the text document ($i = 1, 2, \ldots, |d|$, where $|d|$ is the number of keywords of text document $d$) and concept sense $cs_j$ of the ontology ($j = 1, 2, \ldots, |o|$, where $|o|$ is the number of concept senses of ontology $o$).
The vertex set of $G$ consists of the keywords of $d$ and the concept senses of $o$, and $W$ is the edge set of $G$. For the weighted graph $G$, the goal of the semantic similarity computation is to find a path (flow) $F = \{f_{ij},\ i = 1, \ldots, p,\ j = 1, \ldots, q\}$ ($f_{ij}$ is the edge between $t_i$ and $cs_j$) that minimizes the value of EMD(d, o) in the following formula:
$$\mathrm{EMD}(d, o) = \frac{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij} w_{ij}}{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij}} \quad (7)$$
EMD(d, o) is the semantic similarity value between the text document and the ontology.
The similarity value between the text document and the category is then:
$$\mathrm{Sim}(d, o) = 1 - \mathrm{EMD}(d, o) \quad (8)$$
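The minimization in formula (7) is a transportation problem, and one way to sketch it without external libraries is a small min-cost-flow routine. The Python below is an illustrative sketch under assumptions, not the patent's implementation: it treats $w_{ij}$ as a ground distance in [0, 1] (for example one minus a gloss-based relatedness, which is an assumption), limits the flow out of keyword $t_i$ by $tw_i$ and into concept sense $cs_j$ by $sw_j$ as in the usual EMD formulation, and uses successive shortest paths with Bellman-Ford, which is adequate only for small inputs. The function name `emd` and the sample numbers are hypothetical.

```python
def emd(doc_weights, onto_weights, dist):
    """Earth Mover's Distance between two weighted sets.

    doc_weights[i] is tw_i, onto_weights[j] is sw_j, dist[i][j] is w_ij.
    Returns total transport cost divided by total flow, as in formula (7).
    """
    p, q = len(doc_weights), len(onto_weights)
    source, sink, n = p + q, p + q + 1, p + q + 2
    graph = [[] for _ in range(n)]  # each edge: [to, capacity, cost, reverse index]

    def add_edge(u, v, capacity, cost):
        graph[u].append([v, capacity, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i, w in enumerate(doc_weights):
        add_edge(source, i, w, 0.0)
    for j, w in enumerate(onto_weights):
        add_edge(p + j, sink, w, 0.0)
    for i in range(p):
        for j in range(q):
            add_edge(i, p + j, float("inf"), dist[i][j])

    eps = 1e-12
    total_flow = total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [float("inf")] * n
        best[source] = 0.0
        prev = [None] * n
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if best[u] == float("inf"):
                    continue
                for k, (v, capacity, cost, _) in enumerate(graph[u]):
                    if capacity > eps and best[u] + cost < best[v] - eps:
                        best[v] = best[u] + cost
                        prev[v] = (u, k)
                        changed = True
            if not changed:
                break
        if prev[sink] is None:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * best[sink]
    return total_cost / total_flow if total_flow > 0 else 0.0


# Made-up weights and ground distances: two keywords, two concept senses.
distance = emd([0.5, 0.5], [0.5, 0.5], [[0.1, 0.2], [0.1, 0.9]])
similarity = 1.0 - distance  # formula (8)
```

In the sample, the cheapest plan pairs the first keyword with the second concept sense and the second keyword with the first, which is why a greedy nearest-match assignment would overestimate the distance. When the two weight totals differ, the total flow equals the smaller total, so dividing by the total flow in formula (7) keeps the result on the same scale as the ground distances.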
After the similarity value Sim between the text document and the category is obtained, a threshold δ is set to decide whether the text document is classified into the category: if the similarity value Sim between the text document and the category is greater than δ, the text document is classified into the category; otherwise it is not. Because Sim expresses how closely the text document is related to the category, it is also used to rank the text documents under each category: the larger the similarity value Sim, the higher the text document is ranked. The value range of threshold δ is [0.5,0.6].
Example experiment: On the basis of the implementation, a set of experiments was carried out to evaluate the invention. In the experiments, threshold δ was 0.5 and the context distance range L was 3. Selection of text documents: 26 web page text documents were chosen from the website (http://dmoz.org) of the Open Directory Project (the largest directory project in which web pages are classified by hand). Of these web pages, 11 belong to the Arts category, 8 to the Sports category and 7 to the Games category. The 26 web pages and their URLs are listed in Table 1. The prefixes "a", "s" and "g" denote text documents belonging to the Arts, Sports and Games categories respectively.
Table 1. Web page text documents collected from http://dmoz.org/
[Table 1 appears as an image in the original publication and is not reproduced here.]
Selection of ontologies: First, the Arts, Sports and Games ontologies were extracted from the RDF file (structure.rdf.u8.gz) that represents the directory structure of the Open Directory Project. The Arts ontology contains 521 concepts, the Sports ontology 602 concepts and the Games ontology 558 concepts. These ontologies were then processed by ontology disambiguation and ontology expansion, with hypernym_value and hyponym_value both set to 0.9. After processing, the Arts, Sports and Games ontologies contained 1557, 1809 and 1719 concepts respectively.
The method of the invention was used to classify the 26 web page text documents; the results are shown in the table below:
Table 2. Classification and ranking results
[Table 2 appears as an image in the original publication and is not reproduced here.]
Table 2 gives the text documents contained in the Arts, Sports and Games categories together with their similarity values, with the text documents in each category ranked by similarity value. From these results the overall precision and recall of the classification method were computed, as shown in Table 3:
Table 3. Performance of the classification algorithm
Recall    Precision
96.2%     83.9%
To evaluate the performance of the ranking method, the ranked lists produced in Table 2 were compared with manually produced ranked lists, which are shown in the table below:
Table 4. Manually produced ranked lists
[Table 4 appears as an image in the original publication and is not reproduced here.]
Suppose $\tau_i$ is the list of classified text documents in category $c_i$ after ranking by the ranking algorithm, and $\tau_i^*$ is the manually produced reference ranked list. The closeness of the two lists is then computed by the following formula:
$$S = \frac{\sum_{i=1}^{|C|} S'(\tau_i, \tau_i^*)}{|C|} \quad (9)$$
where $S'(\tau_i, \tau_i^*)$ is the number of identical elements at the same rank positions in list $\tau_i$ and list $\tau_i^*$, and $|C|$ is the total number of elements in list $\tau_i$ or list $\tau_i^*$. Averaging the closeness S computed by the above formula for each category list gives an average closeness of 79.1% between the ranking results of the invention and the reference ranked lists of Table 4.
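One reading of this evaluation is a per-category positional agreement that is then averaged over the categories, which is consistent with Table 5 ((78.5 + 75.0 + 83.7) / 3 ≈ 79.1). The Python sketch below follows that reading; it is an interpretation rather than the authors' evaluation code, and the function names and sample lists are hypothetical.

```python
def closeness(ranked, reference):
    """Fraction of rank positions at which both lists hold the same element."""
    if not reference:
        return 0.0
    same = sum(1 for a, b in zip(ranked, reference) if a == b)
    return same / len(reference)


def average_closeness(ranked_by_category, reference_by_category):
    """Mean of the per-category closeness values."""
    scores = [closeness(ranked_by_category[c], reference_by_category[c])
              for c in reference_by_category]
    return sum(scores) / len(scores)


# Made-up ranked lists for two categories.
produced = {"Arts": ["a1", "a2", "a3", "a4"], "Games": ["g1", "g2"]}
manual = {"Arts": ["a1", "a3", "a2", "a4"], "Games": ["g1", "g2"]}
arts_score = closeness(produced["Arts"], manual["Arts"])
overall = average_closeness(produced, manual)
```

Positional agreement is a strict measure: swapping two adjacent documents costs two positions, so a ranking that is nearly right can still score noticeably below 100%.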
Table 5. Performance evaluation of the ranking method
S_Arts (%)    S_Sports (%)    S_Games (%)    Average closeness (%)
78.5          75.0            83.7           79.1
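As a check on the reported average, assuming it is the unweighted arithmetic mean of the three per-directory values:

```latex
\bar{S} = \frac{S_{Arts} + S_{Sports} + S_{Games}}{3}
        = \frac{78.5 + 75.0 + 83.7}{3}
        = \frac{237.2}{3}
        \approx 79.1\%
```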
The preliminary evaluation experiment shows that the classification and ranking of the present invention achieve good performance: both precision and recall are relatively high, and the ranking is also effective. This is, on the one hand, because the ontologies created by the ontology construction algorithm are fairly complete and, on the other hand, because semantic information is taken into account when calculating the similarity value between a text document and an ontology. In addition, because the present invention eliminates the need to collect a training set manually, and its classification performance can improve further as the ontologies evolve, the method of classifying and ranking text documents using ontologies has good prospects.

Claims (1)

1. A method for automatically classifying text documents using an ontology, characterized by the following steps:
(1) Extract the keyword set of each text document in the collection of text documents to be classified using the keyword extraction algorithm KEA (Keyphrases Extraction Algorithm), obtaining a weighted keyword set for each text document; search the Semantic Web search engine Swoogle using the name of each classification directory in the given directory set as the query term; take the top-ranked ontology in the returned results as the ontology representing that classification directory; perform ontology disambiguation and ontology expansion on the ontology representing each classification directory to obtain a new ontology representing that classification directory;
The ontology disambiguation process is:
First, select the words within distance L of each concept word in the ontology as the context of that concept word;
The value range of L is [3, 5];
Then, by the semantic relatedness formula

$$\text{Relateness}(s_i, con_j) = \frac{\text{NumOfOverlaps\_}s_i con_j}{(\text{WordNumInGlossOf}\,s_i + \text{WordNumInGlossOf}\,con_j)/2}$$

calculate the semantic relatedness Relateness($s_i$, $con_j$) between the i-th possible word sense $s_i$ of each concept word and the j-th context $con_j$ of that concept word, and by
Figure FSB00000638971200012
calculate the average semantic relatedness Rel($s_i$) of the i-th possible word sense $s_i$ of each concept word;
where i = 1, 2, ..., I, and I is the number of possible word senses of the concept word; j = 1, 2, ..., J, and J is the number of contexts of the concept word; WordNumInGlossOf $s_i$ is the number of words in the WordNet gloss of $s_i$; WordNumInGlossOf $con_j$ is the number of words in the WordNet gloss of $con_j$; NumOfOverlaps_$s_i con_j$ is the number of identical words shared by the WordNet gloss of $s_i$ and the WordNet gloss of $con_j$; the possible word senses are the word senses defined in WordNet;
Finally, select the possible word sense with the largest average semantic relatedness Rel value as the concept sense of the concept word;
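The disambiguation step can be sketched as follows. The sketch makes several assumptions that the claim does not fix: glosses are passed in as plain strings rather than looked up in WordNet, words are split on whitespace, overlaps are counted as distinct shared words, and the average relatedness is the arithmetic mean over the contexts. All names and the toy glosses are hypothetical.

```python
def tokens(gloss):
    """Lower-case whitespace tokenisation (an assumption, not from the claim)."""
    return gloss.lower().split()


def relatedness(gloss_a, gloss_b):
    """Shared words divided by the mean gloss length."""
    a, b = tokens(gloss_a), tokens(gloss_b)
    overlaps = len(set(a) & set(b))      # distinct words present in both glosses
    mean_length = (len(a) + len(b)) / 2
    return overlaps / mean_length if mean_length else 0.0


def disambiguate(sense_glosses, context_glosses):
    """Return the sense whose gloss has the highest mean relatedness to the contexts."""
    def average(gloss):
        total = sum(relatedness(gloss, c) for c in context_glosses)
        return total / len(context_glosses)
    return max(sense_glosses, key=lambda sense: average(sense_glosses[sense]))


# Hypothetical glosses written for illustration; they are not WordNet text.
senses = {
    "bank#river": "sloping land beside a river",
    "bank#finance": "institution that accepts money deposits",
}
contexts = ["money kept in an account", "a loan of money from an institution"]
chosen = disambiguate(senses, contexts)
```

In practice the glosses would come from a WordNet interface, and stop words such as "a" would probably be filtered out so that they do not count as overlaps.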
The ontology expansion process is:
Using the semantic relatedness formula

$$\text{Relateness}(\hat{s}_p, s'_{pq}) = \frac{\text{NumOfOverlaps\_}\hat{s}_p s'_{pq}}{(\text{WordNumInGlossOf}\,\hat{s}_p + \text{WordNumInGlossOf}\,s'_{pq})/2}$$

calculate, for each concept sense of the ontology after ontology disambiguation, the semantic relatedness between that concept sense and each word sense in its hypernym sense set and its hyponym sense set in WordNet; and judge: for each word sense in the hypernym sense set, if the semantic relatedness between it and the concept sense is greater than given threshold one, add that word sense to the parent-class set of the concept sense; for each word sense in the hyponym sense set, if the semantic relatedness between it and the concept sense is greater than given threshold two, add that word sense to the subclass set of the concept sense; add all word senses in the synonym sense set of each concept sense in WordNet to the same-class set of the concept sense;
where $\hat{s}_p$ denotes the p-th concept sense of the ontology after ontology disambiguation, p = 1, 2, ..., P, and P is the number of concept senses of the ontology after ontology disambiguation; $s'_{pq}$ denotes the q-th word sense in the hypernym sense set or hyponym sense set of $\hat{s}_p$, q = 1, 2, ..., Q, and Q is the number of word senses in the hypernym sense set or hyponym sense set; WordNumInGlossOf $\hat{s}_p$ is the number of words in the WordNet gloss of $\hat{s}_p$; WordNumInGlossOf $s'_{pq}$ is the number of words in the WordNet gloss of $s'_{pq}$; NumOfOverlaps_$\hat{s}_p s'_{pq}$ is the number of identical words shared by the WordNet gloss of $\hat{s}_p$ and the WordNet gloss of $s'_{pq}$;
The value range of both given threshold one and given threshold two is [0.6, 1];
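The expansion rule can be sketched as a simple filter. The sketch assumes the hypernym, hyponym and synonym senses have already been retrieved from WordNet and that a relatedness function is supplied by the caller. The default thresholds of 0.9 mirror the hypernym_value and hyponym_value used in the experiment, on the assumption that those parameters correspond to threshold one and threshold two. The function name, dictionary keys and toy scores are hypothetical.

```python
def expand_concept(concept, hypernyms, hyponyms, synonyms, relatedness,
                   threshold_one=0.9, threshold_two=0.9):
    """Select the senses to attach to one disambiguated concept sense."""
    return {
        # hypernym senses that are related closely enough become parent classes
        "parents": [s for s in hypernyms
                    if relatedness(concept, s) > threshold_one],
        # hyponym senses that are related closely enough become subclasses
        "children": [s for s in hyponyms
                     if relatedness(concept, s) > threshold_two],
        # every synonym sense is added without a threshold test
        "same_class": list(synonyms),
    }


# Hypothetical relatedness scores for illustration only.
scores = {
    ("painting", "art"): 0.95,
    ("painting", "creation"): 0.50,
    ("painting", "watercolor"): 0.92,
    ("painting", "mural"): 0.70,
}
expanded = expand_concept(
    "painting",
    hypernyms=["art", "creation"],
    hyponyms=["watercolor", "mural"],
    synonyms=["picture"],
    relatedness=lambda concept, sense: scores[(concept, sense)],
)
```

Using separate thresholds for hypernyms and hyponyms lets the expansion be tuned to grow the ontology upward and downward at different rates.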
(2) Calculate the weighted word-sense set of the new ontology representing each classification directory, specifically:
First, convert the ontology into a directed graph consisting of a vertex set and a directed-edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;
Then, calculate the weight of each concept sense by
Figure FSB00000638971200031
;
where weight denotes the weight of the concept sense, and layer denotes the layer number of the vertex corresponding to that concept sense;
The layer number of a vertex is the shortest-path distance from the concept sense corresponding to that vertex to the ontology root;
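The layer number defined above can be computed with a breadth-first search, as in the sketch below. It assumes the ontology is given as a mapping from each child concept sense to its parent concept senses and that the root is at layer 0, which the claim does not state. The weight formula itself appears only as an image in the source and is therefore not reproduced. The function name and the toy graph are hypothetical.

```python
from collections import deque


def vertex_layers(child_to_parents, root):
    """Shortest-path distance of every reachable concept sense from the root."""
    # Edges point child -> parent, so invert them to walk down from the root.
    children = {}
    for child, parents in child_to_parents.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    layers = {root: 0}          # assumption: the root is at layer 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layers:     # first visit in BFS is the shortest path
                layers[child] = layers[node] + 1
                queue.append(child)
    return layers


# Hypothetical toy ontology; "Jazz" has two parents, one of them the root.
edges = {
    "Painting": ["Arts"],
    "Music": ["Arts"],
    "Watercolor": ["Painting"],
    "Jazz": ["Music", "Arts"],
}
layers = vertex_layers(edges, "Arts")
```

Breadth-first search is used because a concept sense may have several parents, and only the shortest route to the root determines its layer.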
(3) Calculate the similarity value Sim(d, o) between the text document and the classification directory by Sim(d, o) = 1 - EMD(d, o); if the similarity value Sim(d, o) between the text document and the classification directory is greater than a given threshold δ, classify the text document into that classification directory; otherwise, do not classify the text document into that classification directory;
where d is the weighted keyword set of the text document, and o is the weighted word-sense set of the ontology;
EMD(d, o) is the semantic similarity value between the text document and the ontology calculated using the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];
(4) Sort all text documents classified under each classification directory in descending order of similarity value Sim(d, o).
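Steps (3) and (4) can be sketched together for a single classification directory. The sketch assumes the Earth Mover's Distance between each document and the directory's ontology has already been computed and normalised to [0, 1]; the EMD computation itself is not shown. The function name and the toy distances are hypothetical.

```python
def classify_and_rank(emd_by_doc, delta=0.5):
    """Keep documents with Sim = 1 - EMD above delta, sorted by Sim descending."""
    accepted = []
    for doc, emd in emd_by_doc.items():
        similarity = 1.0 - emd
        if similarity > delta:          # strict comparison, as in step (3)
            accepted.append((doc, similarity))
    accepted.sort(key=lambda pair: pair[1], reverse=True)
    return accepted


# Hypothetical precomputed distances between three documents and one ontology.
emd_to_arts = {"d1": 0.25, "d2": 0.75, "d3": 0.125}
ranking = classify_and_rank(emd_to_arts, delta=0.5)
```

Running the function once per directory allows a document to be assigned to several directories, since each decision depends only on that directory's threshold test.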
CN2010102101070A 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body Expired - Fee Related CN101944099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102101070A CN101944099B (en) 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102101070A CN101944099B (en) 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body

Publications (2)

Publication Number Publication Date
CN101944099A CN101944099A (en) 2011-01-12
CN101944099B true CN101944099B (en) 2012-05-30

Family

ID=43436091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102101070A Expired - Fee Related CN101944099B (en) 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body

Country Status (1)

Country Link
CN (1) CN101944099B (en)

Also Published As

Publication number Publication date
CN101944099A (en) 2011-01-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Fang Jun

Inventor after: Guo Lei

Inventor after: Yang Ning

Inventor before: Guo Lei

Inventor before: Fang Jun

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: GUO LEI FANG JUN TO: FANG JUN GUO LEI YANG NING

ASS Succession or assignment of patent right

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

Owner name: JIANGSU T.Y. ENVIRONMENTAL ENERGY CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI'AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140814

Address after: No. 268, Yellow Sea Avenue (West), Haian, Jiangsu Province, 226600

Patentee after: JIANGSU TIANYING ENVIRONMENTAL PROTECTION ENERGY Co.,Ltd.

Patentee after: Northwestern Polytechnical University

Address before: No. 127, Friendship West Road, Xi'an, Shaanxi Province, 710072

Patentee before: Northwestern Polytechnical University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530
