Summary of the invention
Technical problem to be solved
To overcome the shortcomings of current machine-learning-based methods, the present invention proposes using an ontology to classify text documents automatically, so that text documents can be classified and ranked quickly and accurately.
Technical solution
The idea of the invention is as follows. An ontology is used to represent the feature information of each classification category, and the semantic similarity value between a text document and the ontology is used to classify the document in real time, which eliminates the training process. As the ontology is continuously updated and evolves, the precision and recall of the ontology-based classification method keep improving. In addition, the ontology-based method takes the semantic relations between words into account when computing the similarity value between a text document and the ontology, which improves classification precision.
The invention is characterized in that an ontology is proposed as an effective representation of the feature information of a classification category; the ontology, after disambiguation and expansion, represents the feature information of the category, and classification is performed using the semantic similarity value between the text document to be classified and the ontology.
The basic process of the invention is as follows. First, a weighted keyword set is used to represent the feature information of a text document. Then, an ontology that has undergone disambiguation and expansion is used to represent the feature information of a category, and, by analysing the structural features of the ontology, the ontology is converted into a weighted word-sense set. Finally, the Earth Mover's Distance method is used to compute the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology, where the similarity value between a single word sense and a word is measured by a method based on WordNet glosses. This semantic similarity value is used to compute the similarity value between the text document and the category, and text documents are classified and ranked according to that similarity value.
A method for automatically classifying text documents using an ontology, characterized by the following steps:
(1) The keyword extraction algorithm KEA (Keyphrase Extraction Algorithm) is used to extract the keyword set of every text document in the set of text documents to be classified, yielding the weighted keyword set of each text document. The name of each category in the given category set is used as a search term in the Semantic Web search engine Swoogle, and the top-ranked ontology in the search results is taken as the ontology representing that category. Ontology disambiguation and ontology expansion are applied to the ontology representing each category, yielding a new ontology representing that category;
The ontology disambiguation process is:
First, for each concept word in the ontology, the words within distance L of it in the ontology are selected as the context of that concept word; L takes a value in [3, 5];
Then, the semantic relatedness relateness($s_i$, $con_j$) between the i-th possible word sense $s_i$ of each concept word and the j-th context $con_j$ of that concept word is computed by the semantic relatedness formula [formula not reproduced in this text], and the average semantic relatedness Rel($s_i$) of the i-th possible word sense $s_i$ of each concept word is computed by

$$Rel(s_i) = \frac{1}{J}\sum_{j=1}^{J} relateness(s_i, con_j)$$

Here, i = 1, 2, ..., I, where I is the number of possible word senses of the concept word, and j = 1, 2, ..., J, where J is the number of contexts of the concept word. wordNumInGlossOf$s_i$ denotes the number of words in the WordNet gloss of $s_i$, wordNumInGlossOf$con_j$ denotes the number of words in the WordNet gloss of $con_j$, and NumOfOverlaps_$s_i con_j$ denotes the number of words shared by the WordNet gloss of $s_i$ and the WordNet gloss of $con_j$. A possible word sense is a word sense defined in WordNet;
Finally, the possible word sense with the largest average semantic relatedness Rel is selected as the concept sense of the concept word;
The ontology expansion process is:
Using the semantic relatedness formula [formula not reproduced in this text], compute, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each word sense in its hypernym sense set and its hyponym sense set in WordNet, and decide as follows. For each word sense in the hypernym sense set, if its semantic relatedness with the concept sense is greater than the given threshold one, the word sense is added to the parent-class set of the concept sense. For each word sense in the hyponym sense set, if its semantic relatedness with the concept sense is greater than the given threshold two, the word sense is added to the subclass set of the concept sense. All word senses in the synonym sense set of each concept sense in WordNet are added to the equivalent-class set of the concept sense;
Here, p = 1, 2, ..., P indexes the concept senses of the disambiguated ontology, and P is the number of concept senses of the disambiguated ontology. $s'_{pq}$ denotes the q-th word sense in the hypernym sense set (or hyponym sense set) of the p-th concept sense, q = 1, 2, ..., Q, where Q is the number of word senses in that hypernym (or hyponym) sense set. The formula uses the number of words in the WordNet gloss of the p-th concept sense, the number of words in the WordNet gloss of $s'_{pq}$ (wordNumInGlossOf$s'_{pq}$), and the number of words shared by the two glosses. [The symbol for the p-th concept sense is not reproduced in this text.]
Threshold one and threshold two each take a value in [0.6, 1];
(2) Compute the weighted word-sense set of the new ontology representing each category, specifically:
First, the ontology is converted into a directed graph consisting of a vertex set and a directed-edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;
Then, the weight of each concept sense is computed from the layer of its vertex [formula not reproduced in this text];
Here, weight denotes the weight of the concept sense, and layer denotes the layer number of the vertex corresponding to the concept sense;
The layer number of a vertex is the shortest-path distance from the concept sense corresponding to the vertex to the root of the ontology;
(3) Compute the similarity value Sim(d, o) between the text document and the category by Sim(d, o) = 1 - EMD(d, o). If Sim(d, o) is greater than the given threshold δ, the text document is classified into this category; otherwise it is not classified into this category;
Here, d is the weighted keyword set of the text document, o is the weighted word-sense set of the ontology, and EMD(d, o) is the semantic similarity value between the text document and the ontology computed by the Earth Mover's Distance method. The given threshold δ takes a value in [0.5, 0.6];
(4) All text documents classified into a category are ranked in descending order of their similarity value Sim(d, o).
Beneficial effects
The method of the invention uses an ontology to represent the feature information of a category and classifies in real time by computing the semantic similarity value between a text document and the ontology, which eliminates the training process and improves classification precision. In addition, the invention uses disambiguation to turn the words of the ontology into word senses. This resolves the inaccurate similarity values caused by polysemy, improves the precision of the semantic similarity computation, and further improves the precision of classification. Building on ontology disambiguation, the invention expands the ontology automatically by using WordNet, which enriches the concept content of the ontology, improves the precision of the subsequent similarity computation, and reduces the labour of creating ontologies manually.
Embodiment
The invention is now further described with reference to the accompanying drawings:
The method for classifying text documents using an ontology proposed by the invention was implemented in Java and Perl. The implementation proceeds as follows:
Text document classification using an ontology is divided into the following four steps:
Step 1: Construction of the text document keyword set. The keyword extraction algorithm KEA is used to extract the weighted keyword set of every text document in the set of text documents to be classified. Specifically, consider each text document $d_i$ in the set $D = \{d_1, d_2, \ldots, d_{|D|}\}$ (|D| denotes the number of text documents in D). First, naive Bayes estimation is applied using three feature attributes: the tf×idf frequency of a word in the text document, the mean position Occurrence at which the word appears in the text document, and the number of letters Length in the word. For each word in $d_i$, the probability Pr that it is a keyphrase is computed by the following formula:
Pr=Pr[T|yes]×Pr[O|yes]×Pr[L|yes]×Pr[yes] (1)
Here, Pr[T|yes], Pr[O|yes] and Pr[L|yes] are the probabilities associated with the current values of the three feature attributes tf×idf, Occurrence and Length, respectively, for a word that is a keyphrase; Pr[yes] is the ratio of the number of text documents in the set that contain keyphrases to the number of text documents that do not.
Then, the n words with the largest Pr values (n is usually 4 to 6) are selected as the keywords of text document $d_i$, yielding the weighted keyword set of $d_i$. The text document is represented by this weighted keyword set, i.e. $d_i = \{URL_i, (t_{i1}, tw_{i1}), \ldots, (t_{ij}, tw_{ij}), \ldots\}$, where $t_{ij}$ is a keyword extracted as described above and $tw_{ij}$ is the weight of keyword $t_{ij}$, namely its Pr value computed by formula (1).
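The scoring and selection in Step 1 can be sketched as follows. This is a minimal illustration of formula (1) only, assuming the four probabilities have already been estimated elsewhere (for example by a trained KEA model); the function names `keyphrase_score` and `top_keywords` and the sample numbers are hypothetical and do not come from KEA itself.

```python
def keyphrase_score(p_tfidf_yes, p_occ_yes, p_len_yes, p_yes):
    """Formula (1): Pr = Pr[T|yes] * Pr[O|yes] * Pr[L|yes] * Pr[yes]."""
    return p_tfidf_yes * p_occ_yes * p_len_yes * p_yes


def top_keywords(candidates, n=5):
    """Return the n candidate words with the largest Pr as (word, weight) pairs.

    candidates maps each word to a tuple
    (Pr[T|yes], Pr[O|yes], Pr[L|yes], Pr[yes]); these values are assumed
    to be given, their estimation is outside this sketch.
    """
    scored = [(word, keyphrase_score(*probs)) for word, probs in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical probabilities for three candidate words of one document.
example = {
    "ontology": (0.9, 0.8, 0.7, 0.5),
    "the": (0.1, 0.5, 0.2, 0.5),
    "classification": (0.8, 0.6, 0.7, 0.5),
}
weighted_keywords = top_keywords(example, n=2)
```

The resulting pairs play the role of $(t_{ij}, tw_{ij})$ in the weighted keyword set.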
Step 2: Ontology preprocessing. First, the name of each category in the given category set is used as a search term in the Semantic Web search engine Swoogle, and the top-ranked ontology in the search results is used to represent that category. In this way, the category set $CA = \{ca_1, ca_2, \ldots, ca_{|CA|}\}$ is represented by the ontology set $O = \{o_1, o_2, \ldots, o_{|O|}\}$, where |O| is the number of ontologies in O and |CA| is the number of categories in CA, with |O| = |CA|. Each category corresponds to one ontology, that is, ontology $o_m$ represents the feature information of category $ca_m$, written $ca_m := o_m$.
Next, each ontology $o_m$ undergoes the ontology disambiguation of Step 2.1 and the ontology expansion of Step 2.2. The invention uses the word senses defined in WordNet as the lexical representation of the ontology, and sets the path distance between any two concept words in the same piece of knowledge to 1.
Step 2.1: Ontology disambiguation. A word may correspond to several word senses, and this reduces the precision of the semantic similarity computation. To remove the ambiguity of the words in the ontology, the ontology is disambiguated, that is, the context of each word in the ontology is used to determine its correct word sense. Specifically:
First, the words within distance L of a concept word s in the ontology are chosen as the context of s, giving the context set $Con = \{con_1, \ldots, con_j, \ldots\}$ of s, where $con_j$ is the j-th context of s; L takes a value in [3, 5];
Then, formula (2) is used to compute, for each word sense $s_i$ of concept word s in WordNet ($i = 1, \ldots, N_i$, where $N_i$ is the number of word senses of s in WordNet), the average semantic relatedness Rel($s_i$) between $s_i$ and all contexts in the context set Con:

$$Rel(s_i) = \frac{1}{|Con|}\sum_{j=1}^{|Con|} relateness(s_i, con_j) \qquad (2)$$

Here, |Con| is the number of contexts of concept word s, i.e. the number of words in the context set Con, and relateness($s_i$, $con_j$) is the semantic relatedness between the i-th word sense $s_i$ and its j-th context $con_j$, computed by formula (3) [formula (3) is not reproduced in this text].
In formula (3), wordNumInGlossOf$s_i$ is the number of words in the WordNet gloss of $s_i$, wordNumInGlossOf$con_j$ is the number of words in the WordNet gloss of $con_j$, and NumOfOverlaps_$s_i con_j$ is the number of words shared by the WordNet gloss of $s_i$ and the WordNet gloss of $con_j$;
Finally, the word sense with the largest average semantic relatedness Rel is selected as the correct word sense of concept word s, i.e. the concept sense of s. Because concept words that occur together in the same ontology are semantically related, the correct word sense is the one with the largest semantic relatedness to the neighbouring concept words.
Every concept word in the ontology is disambiguated by the above process.
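The selection rule of Step 2.1 can be sketched as below. Formula (3) is not reproduced in this text, so the `relatedness` function here is only an assumption: it divides the number of shared gloss words by the mean of the two gloss lengths, which keeps the value in [0, 1]. The glosses are passed in as plain strings rather than fetched from WordNet, and all function names and sample glosses are hypothetical.

```python
def relatedness(gloss_a, gloss_b):
    """Gloss-overlap relatedness (ASSUMED form, not formula (3) itself).

    overlap = number of distinct words shared by the two glosses,
    normalised by the mean number of words in the two glosses.
    """
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    return overlap / ((len(words_a) + len(words_b)) / 2.0)


def average_relatedness(sense_gloss, context_glosses):
    """Formula (2): mean relatedness of one sense to all context glosses."""
    if not context_glosses:
        return 0.0
    total = sum(relatedness(sense_gloss, g) for g in context_glosses)
    return total / len(context_glosses)


def disambiguate(sense_glosses, context_glosses):
    """Return the sense id whose gloss has the largest average relatedness."""
    return max(
        sense_glosses,
        key=lambda s: average_relatedness(sense_glosses[s], context_glosses),
    )


# Hypothetical glosses for two senses of one concept word and two contexts.
senses = {
    "bank.n.01": "sloping land beside a body of water",
    "bank.n.02": "a financial institution that accepts deposits and lends money",
}
contexts = [
    "money in the form of deposits held by a financial institution",
    "a sum of money lent at interest",
]
chosen = disambiguate(senses, contexts)
```

In a full implementation the glosses would come from WordNet (the embodiment uses the JAWS package for this), and the context glosses would belong to the words within distance L of the concept word.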
Step 2.2: Ontology expansion. After the ontology disambiguation of Step 2.1, the ontology is represented by concept senses. Ontology expansion adds related word senses to the concept senses of the disambiguated ontology, thereby enriching the ontology.
First, for each concept sense $cs_k$ in the disambiguated ontology ($k = 1, \ldots, N_k$, where $N_k$ is the number of concept senses in the ontology), its hypernym sense set (hypernymy), hyponym sense set (hyponymy) and synonym sense set (synonym) in WordNet are obtained. These three relation sets of $cs_k$ are written $hypernym(cs_k) = \{a_{k1}, a_{k2}, \ldots\}$, $hyponym(cs_k) = \{b_{k1}, b_{k2}, \ldots\}$ and $synonym(cs_k) = \{c_{k1}, c_{k2}, \ldots\}$. Two thresholds, hypernym_value and hyponym_value, are set; each takes a value in [0.6, 1].
Then, formula (4) is used to compute the semantic relatedness between concept sense $cs_k$ and each word sense $a_{kp}$ in the hypernym sense set $hypernym(cs_k)$ (p = 1, 2, ..., P, where P is the number of word senses in the hypernym sense set). If the semantic relatedness between $cs_k$ and $a_{kp}$ is greater than the given threshold hypernym_value, $a_{kp}$ is added to the parent-class set of $cs_k$. Formula (5) is used to compute the semantic relatedness between $cs_k$ and each word sense $b_{kq}$ in the hyponym sense set $hyponym(cs_k)$ (q = 1, 2, ..., Q, where Q is the number of word senses in the hyponym sense set). If the semantic relatedness between $cs_k$ and $b_{kq}$ is greater than the given threshold hyponym_value, $b_{kq}$ is added to the subclass set of $cs_k$. All word senses $c_{kl}$ in the synonym sense set $synonym(cs_k)$ (l = 1, 2, ..., L, where L is the number of word senses in the synonym sense set) are added to the equivalent-class set of $cs_k$.
In formulas (4) and (5) [not reproduced in this text], wordNumInGlossOf$cs_k$, wordNumInGlossOf$a_{kp}$ and wordNumInGlossOf$b_{kq}$ are the numbers of words in the WordNet glosses of $cs_k$, $a_{kp}$ and $b_{kq}$, respectively; NumOfOverlaps_$cs_k a_{kp}$ is the number of words shared by the WordNet gloss of $cs_k$ and the WordNet gloss of $a_{kp}$; NumOfOverlaps_$cs_k b_{kq}$ is the number of words shared by the WordNet gloss of $cs_k$ and the WordNet gloss of $b_{kq}$.
Every concept sense in the ontology is expanded as described above.
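The threshold rule of Step 2.2 can be sketched as follows. The sketch assumes the relatedness values of formulas (4) and (5) have already been computed and are supplied as dictionaries; the function name `expand_concept`, the result keys and the sample senses are hypothetical.

```python
def expand_concept(hypernym_scores, hyponym_scores, synonyms,
                   hypernym_value=0.9, hyponym_value=0.9):
    """Expand one concept sense with related WordNet senses.

    hypernym_scores / hyponym_scores map a candidate sense to its
    precomputed relatedness with the concept sense. A hypernym is kept
    as a parent class, and a hyponym as a subclass, only if its
    relatedness is strictly greater than the corresponding threshold.
    All synonyms are kept unconditionally.
    """
    parents = sorted(s for s, r in hypernym_scores.items() if r > hypernym_value)
    children = sorted(s for s, r in hyponym_scores.items() if r > hyponym_value)
    return {
        "parents": parents,
        "children": children,
        "equivalents": sorted(synonyms),
    }


# Hypothetical relatedness values for one concept sense.
expansion = expand_concept(
    hypernym_scores={"activity": 0.95, "event": 0.40},
    hyponym_scores={"team_sport": 0.92, "water_sport": 0.91, "blood_sport": 0.55},
    synonyms={"athletics"},
)
```

The default thresholds of 0.9 mirror the setting reported in the example experiment; any value in [0.6, 1] is permitted by the method.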
After Step 2, a new ontology set representing the feature information of the categories is obtained, in which each element is an ontology that has undergone ontology disambiguation and ontology expansion.
In this embodiment, the Jena package is used to operate on the ontologies, and the JAWS package is used to access WordNet.
Step 3: Construction of the ontology word-sense set. For each ontology obtained after ontology disambiguation and ontology expansion, the ontology is first converted into a directed graph G = (V, E), where V is the vertex set, $V = \{v_1, v_2, \ldots, v_{|V|}\}$ (|V| is the number of vertices in V), and E is the directed-edge set, $E = \{e_1, e_2, \ldots, e_{|E|}\}$ (|E| is the number of directed edges in E). Each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense. In the directed graph G, the layer number of a vertex is the shortest-path distance from its corresponding concept sense to the root of the ontology. Following the principle that a concept sense at a higher level contributes more, the invention uses formula (6) to compute the weight $sw_k$ of concept sense $cs_k$ [formula (6) is not reproduced in this text].
In formula (6), $layer(v_k)$ is the layer number of the vertex $v_k$ corresponding to concept sense $cs_k$.
After this processing, the ontology can be expressed as a weighted ontology word-sense set of the form $\{(cs_1, sw_1), (cs_2, sw_2), \ldots\}$, where each $cs_k$ is a concept sense in the ontology, $sw_k$ is its corresponding weight, and the number of pairs equals the number of concept senses in the ontology.
Every ontology in the ontology set is processed as described above.
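The layer computation of Step 3 can be sketched with a breadth-first search. Edges are given as (child, parent) pairs, matching the edge direction described above. Formula (6) is not reproduced in this text, so the weight function used here, 1 / (1 + layer), is purely a hypothetical placeholder that merely respects the stated principle that higher levels weigh more; it can be replaced by passing a different `weight_of_layer`.

```python
from collections import deque


def vertex_layers(edges, root):
    """Shortest-path distance from the ontology root to every reachable vertex.

    edges is an iterable of (child, parent) pairs; the search walks the
    edges in reverse, from each parent down to its children.
    """
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    layers = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layers:  # first visit gives the shortest distance
                layers[child] = layers[node] + 1
                queue.append(child)
    return layers


def sense_weights(layers, weight_of_layer=lambda layer: 1.0 / (1 + layer)):
    """Map each concept sense to a weight derived from its layer.

    The default weight function is an ASSUMPTION standing in for formula (6).
    """
    return {vertex: weight_of_layer(layer) for vertex, layer in layers.items()}


# Hypothetical fragment of an Arts ontology; "jazz" has two parents.
edges = [("painting", "arts"), ("music", "arts"), ("jazz", "music"), ("jazz", "arts")]
layers = vertex_layers(edges, "arts")
weights = sense_weights(layers)
```

Pairing each concept sense with its weight gives the weighted ontology word-sense set used in Step 4.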
Step 4: Classification and ranking. After the three steps above, a text document is represented by a weighted keyword set and a category is represented by a weighted ontology word-sense set. Whether a text document is assigned to a given category is now decided from the similarity value between the text document and the category: the larger the similarity value, the closer the relation between the text document and the category, and the more likely the text document belongs to the category.
First, the Earth Mover's Distance measure is used to compute the semantic similarity value between the text document and the ontology. Specifically:
For a text document $d = \{(t_1, tw_1), (t_2, tw_2), \ldots, (t_{|d|}, tw_{|d|})\}$ (t denotes a keyword of the text document and tw the weight of the keyword) and an ontology $o = \{(cs_1, sw_1), (cs_2, sw_2), \ldots, (cs_{|o|}, sw_{|o|})\}$ (cs denotes a concept sense in the ontology and sw the weight of the concept sense), a weighted graph can be obtained from them [graph definition not reproduced in this text]. Here, W is a distance matrix whose element $w_{ij}$ is the semantic similarity value between keyword $t_i$ of the text document (i = 1, 2, ..., |d|, where |d| is the number of keywords of text document d) and concept sense $cs_j$ of the ontology (j = 1, 2, ..., |o|, where |o| is the number of concept senses of ontology o).
In the weighted graph, the keywords of d and the concept senses of o form the vertex set, and W is the edge set. For this weighted graph, the goal of the semantic similarity computation is to find a flow $F = \{f_{ij}\}$, i = 1, ..., |d|, j = 1, ..., |o| ($f_{ij}$ is associated with the edge between $t_i$ and $cs_j$), that minimises the value EMD(d, o) given by formula (7) [formula (7) is not reproduced in this text].
EMD(d, o) is the semantic similarity value between the text document and the ontology.
The similarity value between the text document and the category is then:
Sim(d,o)=1-EMD(d,o) (8)
After the similarity value Sim between the text document and the category has been obtained, a threshold δ is set to decide whether the text document is classified into this category. If the similarity value Sim between the text document and the category is greater than δ, the text document is classified into the category; otherwise it is not. Because Sim expresses how closely the text document is related to the category, it is also used to rank the text documents within each category: the larger the value of Sim, the higher the text document is ranked. The threshold δ takes a value in [0.5, 0.6].
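The decision and ranking rule of Step 4 can be sketched as follows. The sketch assumes the EMD(d, o) values have already been computed by some Earth Mover's Distance solver and lie in [0, 1]; it applies only formula (8), the threshold test and the descending sort. The function name `classify_and_rank` and the sample values are hypothetical.

```python
def classify_and_rank(emd_values, delta=0.5):
    """Assign documents to categories and rank them by similarity.

    emd_values maps (document, category) to a precomputed EMD(d, o).
    A document joins a category only if Sim = 1 - EMD is strictly
    greater than delta; within a category, documents are sorted by
    Sim in descending order.
    """
    ranked = {}
    for (doc, category), emd in emd_values.items():
        sim = 1.0 - emd  # formula (8)
        if sim > delta:
            ranked.setdefault(category, []).append((doc, sim))
    for docs in ranked.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical EMD values for three documents and two categories.
emd_values = {
    ("a2", "Arts"): 0.35,
    ("a1", "Arts"): 0.20,
    ("s1", "Arts"): 0.70,
    ("s1", "Sports"): 0.10,
}
result = classify_and_rank(emd_values, delta=0.5)
```

Note that a document may be assigned to more than one category, or to none, since each (document, category) pair is tested independently against δ.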
Example experiment: Based on the implemented program, a set of experiments was carried out to evaluate the invention. In the experiments, the threshold δ was 0.5 and the context distance range L was 3. Selection of text documents: 26 web page text documents were chosen from the website of the Open Directory Project (http://dmoz.org), the largest directory project in which web pages are classified by human editors. Of these web pages, 11 belong to the Arts category, 8 to the Sports category and 7 to the Games category. The 26 web pages and their URLs are listed in Table 1. The prefixes "a", "s" and "g" mark text documents belonging to the Arts, Sports and Games categories, respectively.
Table 1. Web page text documents collected from http://dmoz.org/
Selection of ontologies: First, the Arts, Sports and Games ontologies were extracted from the RDF file (structure.rdf.u8.gz) that describes the category structure of the Open Directory Project. The Arts ontology contains 521 concepts, the Sports ontology 602 concepts and the Games ontology 558 concepts. These ontologies then underwent ontology disambiguation and ontology expansion, during which hypernym_value and hyponym_value were both set to 0.9. After processing, the Arts, Sports and Games ontologies contain 1557, 1809 and 1719 concepts, respectively.
The method of the invention was applied to classify these 26 web page text documents; the results are shown in the following table:
Table 2. Classification and ranking results
Table 2 gives the text documents assigned to the Arts, Sports and Games categories together with their similarity values, with the text documents in each category ranked by similarity value. The overall precision and recall of the classification method were computed from these results and are shown in Table 3:
Table 3. Performance of the classification algorithm
Recall | Precision
96.2% | 83.9%
To evaluate the performance of the ranking method, the ranked lists produced in Table 2 were compared with manually produced ranked lists, which are shown in the following table:
Table 4. Manually produced ranked lists
Suppose $\tau_i$ is the list obtained by applying the ranking algorithm to the classified text documents in category $c_i$, and compare it with the manually produced standard ranked list for the same category. The closeness S of the two lists is computed as:

$$S = \frac{\text{number of elements at the same rank position in both lists}}{|C|}$$

Here, |C| is the total number of elements in list $\tau_i$ or in the standard list. The closeness S computed with this formula for each category list is then averaged; the average closeness between the ranking results of the invention and the standard ranked lists of Table 4 is 79.1%.
Table 5. Performance evaluation of the ranking method
S_Arts (%) | S_Sports (%) | S_Games (%) | Average closeness (%)
78.5 | 75.0 | 83.7 | 79.1
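The closeness measure used for Table 5 can be sketched as below, reading it as the fraction of rank positions at which the two lists hold the same element. The function name `ranking_closeness` and the sample lists are hypothetical; only the three per-category values are taken from Table 5.

```python
def ranking_closeness(ranked, reference):
    """Fraction of positions where both lists hold the same element.

    Both lists are assumed to contain the same number of elements, as
    implied by the single total |C| in the definition.
    """
    if len(ranked) != len(reference):
        raise ValueError("lists must have the same length")
    if not reference:
        return 0.0
    same = sum(1 for a, b in zip(ranked, reference) if a == b)
    return same / len(reference)


# Hypothetical lists: positions 1 and 4 agree, positions 2 and 3 are swapped.
closeness = ranking_closeness(["a1", "a2", "a3", "a4"], ["a1", "a3", "a2", "a4"])

# Averaging the per-category values reported in Table 5.
average_closeness = round((78.5 + 75.0 + 83.7) / 3, 1)
```

The averaged value agrees with the 79.1% reported in the last column of Table 5.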
The preliminary evaluation shows that the classification and ranking of the invention perform well: both precision and recall are relatively high, and the ranking is also effective. This is partly because the ontologies produced by the ontology construction algorithm are fairly complete, and partly because semantic information is taken into account when computing the similarity value between a text document and an ontology. Furthermore, because the invention removes the need to collect a training set manually, and because its classification performance can improve as the ontologies evolve, the method of using ontologies to classify and rank text documents has good prospects.