CN101944099B

CN101944099B - Method for automatically classifying text documents using an ontology

Info

Publication number: CN101944099B
Application number: CN2010102101070A
Authority: CN
Inventors: 郭雷; 方俊
Original assignee: Northwestern Polytechnical University
Current assignee: Jiangsu Tianying Environmental Protection Energy Co ltd; Northwestern Polytechnical University
Priority date: 2010-06-24
Filing date: 2010-06-24
Publication date: 2012-05-30
Anticipated expiration: 2030-06-24
Also published as: CN101944099A

Abstract

The invention relates to a method for automatically classifying text documents using an ontology, comprising the following steps: first, the feature information of a text document is represented by a weighted keyword set; next, the feature information of each classification category is represented by an ontology that has undergone ontology disambiguation and ontology expansion, and the ontology is converted into a weighted word-sense set by analyzing its structural features; finally, the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology is calculated with the Earth Mover's Distance method, the similarity between the text document and the category is derived from it, and the text documents are classified and ranked according to that similarity. With this method, text documents can be classified automatically and the accuracy of text document classification can be improved.

Description

A method for automatically classifying text documents using an ontology

Technical field

The present invention relates to a method that uses an ontology to classify text documents automatically, and belongs to fields such as computer information processing and information retrieval. It is suitable for classifying massive collections of network text documents automatically, quickly, and accurately.

Background art

To organize text documents more efficiently and to better support users in browsing and searching for information, text document classification has long been a focus of attention. At first, text documents were classified manually, but as text document resources grew, manual classification became infeasible, so automatic text document classification became a focus of research.

Text document classification generally has three stages: first, the feature information of the text documents and of the classification categories is extracted; then, a classifier calculates the similarity between each text document and each category; finally, each text document is assigned to categories according to these similarity values.

Traditional machine learning methods have been applied to automatic text document classification, including neural networks, Bayesian methods, support vector machines, and the k-nearest-neighbor method. These methods first collect a set of already classified text documents by hand, then use this set to train a classifier, and finally use the trained classifier to assign text documents to the categories. These machine learning classification methods have the following shortcomings:

1) Training a classifier with traditional machine learning requires manually collecting a large set of classified text documents, which is tedious, and a different set must be collected by hand for each different set of categories;

2) Traditional machine learning methods do not consider the semantic relations between words, so it is difficult to improve classification accuracy.

To overcome these shortcomings of machine learning methods, the present invention proposes a method that uses an ontology to classify text documents automatically.

Summary of the invention

Technical problem to be solved

To overcome the shortcomings of current machine-learning-based methods, the present invention proposes using an ontology to classify text documents automatically, so that text documents can be classified and ranked automatically, quickly, and accurately.

Technical solution

The idea of the invention is as follows: an ontology is used to represent the feature information of a classification category, and the semantic similarity value between a text document and the ontology is used to classify in real time. This eliminates the training process, and as ontologies are continually updated and evolve, the precision and recall of the ontology-based classification method will continue to improve. In addition, the ontology-based method takes the semantic relations between words into account when calculating the similarity between a text document and an ontology, which improves classification accuracy.

The invention is characterized in that it proposes that an ontology can effectively represent the feature information of a classification category, that a disambiguated and expanded ontology is used to represent that feature information, and that classification is performed using the semantic similarity value between the text document to be classified and the ontology.

The basic process of the invention is as follows: first, a weighted keyword set is used to represent the feature information of a text document; next, an ontology that has undergone disambiguation and expansion is used to represent the feature information of a classification category, and the ontology is converted into a weighted word-sense set by analyzing its structural features; finally, the Earth Mover's Distance method is used to calculate the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology, where the similarity between an individual word sense and a word is measured with a method based on WordNet glosses. This semantic similarity value is used to calculate the similarity between the text document and the category, and text documents are classified and ranked according to that similarity.

A method for automatically classifying text documents using an ontology, characterized by the following steps:

(1) Use the keyphrase extraction algorithm KEA (Keyphrase Extraction Algorithm) to extract keywords from every text document in the set of text documents to be classified, obtaining a weighted keyword set for each text document. For each classification category in the given category set, search the Semantic Web search engine Swoogle using the category name as the query term, and take the top-ranked ontology in the search results as the ontology representing that category. Apply ontology disambiguation and ontology expansion to the ontology representing each category to obtain a new ontology representing that category;

The ontology disambiguation process is as follows:

First, for each concept word in the ontology, select the words within distance L of the concept word as its context; the value range of L is [3, 5];

Then, using the semantic relatedness formula

relateness(s_{i}, con_{j}) = \frac{NumOfOverlaps(s_{i}, con_{j})}{(wordNumInGlossOf(s_{i}) + wordNumInGlossOf(con_{j})) / 2}

calculate the semantic relatedness relateness(s_{i}, con_{j}) between the i-th possible sense s_{i} of each concept word and the j-th context con_{j} of that concept word, and, using

Rel(s_{i}) = \frac{\sum_{j=1}^{J} relateness(s_{i}, con_{j})}{J}

calculate the average semantic relatedness Rel(s_{i}) of the i-th possible sense s_{i} of each concept word;

Here, i = 1, 2, ..., I, where I is the number of possible senses of the concept word; j = 1, 2, ..., J, where J is the number of contexts of the concept word; wordNumInGlossOf(s_{i}) is the number of words in the WordNet gloss of s_{i}; wordNumInGlossOf(con_{j}) is the number of words in the WordNet gloss of con_{j}; NumOfOverlaps(s_{i}, con_{j}) is the number of words that the WordNet gloss of s_{i} and the WordNet gloss of con_{j} have in common. The possible senses are the senses defined in WordNet;

Finally, select the possible sense with the largest average semantic relatedness Rel as the concept sense of the concept word;

The ontology expansion process is as follows:

Using the semantic relatedness formula

relateness(\hat{s}_{p}, s'_{pq}) = \frac{NumOfOverlaps(\hat{s}_{p}, s'_{pq})}{(wordNumInGlossOf(\hat{s}_{p}) + wordNumInGlossOf(s'_{pq})) / 2}

calculate, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each sense in its WordNet hypernym sense set and in its WordNet hyponym sense set, and then decide as follows: for each sense in the hypernym sense set, if its semantic relatedness to the concept sense is greater than a given threshold one, add the sense to the parent-class set of the concept sense; for each sense in the hyponym sense set, if its semantic relatedness to the concept sense is greater than a given threshold two, add the sense to the subclass set of the concept sense; add all senses in the WordNet synonym sense set of each concept sense to the similar-class set of that concept sense;

Here, \hat{s}_{p} is the p-th concept sense of the disambiguated ontology, p = 1, 2, ..., P, where P is the number of concept senses in the disambiguated ontology; s'_{pq} is the q-th sense in the hypernym sense set or hyponym sense set of \hat{s}_{p}, q = 1, 2, ..., Q, where Q is the number of senses in that hypernym or hyponym sense set; wordNumInGlossOf(\hat{s}_{p}) is the number of words in the WordNet gloss of \hat{s}_{p}; wordNumInGlossOf(s'_{pq}) is the number of words in the WordNet gloss of s'_{pq}; NumOfOverlaps(\hat{s}_{p}, s'_{pq}) is the number of words that the WordNet gloss of \hat{s}_{p} and the WordNet gloss of s'_{pq} have in common;

The value range of the given threshold one and threshold two is [0.6, 1];

(2) Calculate the weighted word-sense set of the new ontology representing each classification category, as follows:

First, convert the ontology into a directed graph consisting of a vertex set and a directed-edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;

Then, calculate the weight of each concept sense by

weight = \frac{1}{layer^{1/4}}

Here, weight is the weight of the concept sense, and layer is the layer number of the vertex corresponding to that concept sense;

The layer number of a vertex is the shortest-path distance from the concept sense corresponding to the vertex to the root of the ontology;

(3) Calculate the similarity Sim(d, o) between the text document and the classification category by Sim(d, o) = 1 - EMD(d, o). If Sim(d, o) is greater than a given threshold δ, classify the text document into that category; otherwise, do not classify it into that category;

Here, d is the weighted keyword set of the text document, o is the weighted word-sense set of the ontology, and EMD(d, o) is the semantic similarity value between the text document and the ontology calculated with the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];

(4) Sort all text documents classified into each category in descending order of their similarity Sim(d, o).

Beneficial effects

The method of the invention uses an ontology to represent the feature information of a category and classifies in real time by calculating the semantic similarity value between a text document and the ontology, which eliminates the training process and improves classification accuracy. In addition, the invention uses disambiguation to turn the words in the ontology into word senses, which solves the problem of inaccurate similarity values caused by polysemy, improves the precision of the semantic similarity calculation, and thereby further improves classification precision. On the basis of ontology disambiguation, the invention uses WordNet to expand the ontology automatically, which enriches the concept content of the ontology, improves the accuracy of the subsequent similarity calculation, and avoids the labor of creating ontologies by hand.

Description of the drawings

Fig. 1: basic flowchart of the method of the invention

Embodiment

The invention is now further described with reference to the accompanying drawing:

We implemented the proposed ontology-based text document classification method in Java and Perl. The implementation is as follows:

The ontology-based text document classification method is divided into the following four steps:

Step 1: Construction of the text document keyword sets. Here, the keyphrase extraction algorithm KEA is used to extract a weighted keyword set for every text document in the set of text documents to be classified. Specifically, for each text document d_{i} in the set D = {d_{1}, d_{2}, ..., d_{|D|}} (|D| is the number of text documents in D), naive Bayes estimation is first applied using three feature attributes: the frequency tf×idf with which a word (an existing word) occurs in the text document, the average position Occurrence at which the word occurs in the text document, and the number of letters Length in the word. For each word in d_{i}, the probability Pr that it is a topic word is calculated with the following formula:

Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[yes]    (1)

Here, Pr[T|yes], Pr[O|yes], and Pr[L|yes] are the probabilities that the word is a topic word given the current values of the three feature attributes tf×idf, Occurrence, and Length, respectively; Pr[yes] is the ratio of the number of text documents in the text document set that contain topic words to the number that do not.

Then, the n words with the largest Pr values (n is usually 4 to 6) are selected as the keywords of text document d_{i}, giving the weighted keyword set of d_{i}, and d_{i} is represented by this weighted keyword set, i.e., d_{i} = {URL_{i}, (t_{i1}, tw_{i1}), ..., (t_{ij}, tw_{ij}), ...}, where t_{ij} is a keyword extracted as described above and tw_{ij} is the weight of keyword t_{ij}, namely its Pr value calculated by formula (1).
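The scoring and selection in Step 1 can be sketched as follows. This is only an illustration of formula (1), not the KEA implementation: the conditional probabilities are assumed to have been estimated already, and the names `keyphrase_score`, `top_keywords`, and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def keyphrase_score(p_t: float, p_o: float, p_l: float, p_yes: float) -> float:
    """Formula (1): Pr = Pr[T|yes] * Pr[O|yes] * Pr[L|yes] * Pr[yes]."""
    return p_t * p_o * p_l * p_yes


def top_keywords(
    candidates: Dict[str, Tuple[float, float, float]],
    p_yes: float,
    n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the n candidate words with the largest Pr, paired with Pr as weight.

    candidates maps each word to (Pr[T|yes], Pr[O|yes], Pr[L|yes]);
    these inputs are assumed to come from a trained naive Bayes model.
    """
    scored = [
        (word, keyphrase_score(p_t, p_o, p_l, p_yes))
        for word, (p_t, p_o, p_l) in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    # Made-up probabilities for three candidate words of one document.
    demo = {
        "ontology": (0.8, 0.5, 0.5),
        "the": (0.1, 0.5, 0.5),
        "wordnet": (0.6, 0.5, 0.5),
    }
    for word, weight in top_keywords(demo, p_yes=0.4, n=2):
        print(word, round(weight, 4))
```

The returned pairs correspond to the (t_{ij}, tw_{ij}) entries of the weighted keyword set.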

Step 2: Ontology preprocessing. First, for each classification category in the given category set, the Semantic Web search engine Swoogle is searched using the category name as the query term, and the top-ranked ontology in the search results is used to represent that category. In this way, the category set CA = {ca_{1}, ca_{2}, ..., ca_{|CA|}} is represented by the ontology set O = {o_{1}, o_{2}, ..., o_{|O|}}, where |O| is the number of ontologies in O and |CA| is the number of categories in CA, with |O| = |CA|. Each category corresponds to one ontology; that is, ontology o_{m} represents the feature information of category ca_{m}, written ca_{m} := o_{m}.

Next, each ontology o_{m} undergoes the ontology disambiguation of step 2.1 and the ontology expansion of step 2.2. The invention uses the word senses defined in WordNet as the lexical representation of the ontology, and sets the path distance between any two concept words in the same piece of knowledge to 1.

Step 2.1: Ontology disambiguation. A word may correspond to several senses, and this reduces the precision of the semantic similarity calculation. To eliminate the ambiguity of the words in the ontology, the ontology is disambiguated: the context of each word in the ontology is used to determine its correct sense. Specifically:

First, the words within distance L of concept word s in the ontology are chosen as the context of s, giving the context set Con = {con_{1}, ..., con_{j}, ...} of s, where con_{j} is the j-th context of s; the value range of L is [3, 5];

Then, formula (2) is used to calculate, for each sense s_{i} of concept word s in WordNet (i = 1, ..., N_{i}, where N_{i} is the number of senses of s in WordNet), the average semantic relatedness Rel(s_{i}) between s_{i} and all contexts in the context set Con:

Rel(s_{i}) = \frac{\sum_{j=1}^{|Con|} relateness(s_{i}, con_{j})}{|Con|}    (2)

Here, |Con| is the number of contexts of concept word s, i.e., the number of words in the context set Con; relateness(s_{i}, con_{j}) is the semantic relatedness between the i-th sense s_{i} and the j-th context con_{j}, calculated as follows:

relateness(s_{i}, con_{j}) = \frac{NumOfOverlaps(s_{i}, con_{j})}{(wordNumInGlossOf(s_{i}) + wordNumInGlossOf(con_{j})) / 2}    (3)

Here, wordNumInGlossOf(s_{i}) is the number of words in the WordNet gloss of s_{i}, wordNumInGlossOf(con_{j}) is the number of words in the WordNet gloss of con_{j}, and NumOfOverlaps(s_{i}, con_{j}) is the number of words that the WordNet gloss of s_{i} and the WordNet gloss of con_{j} have in common;

Finally, the sense with the largest average semantic relatedness Rel is selected as the correct sense of concept word s, i.e., the concept sense of s. Because concept words that appear together in the same ontology are semantically related, the correct sense is the one with the greatest semantic relatedness to the neighboring concept words.

Every concept word in the ontology is disambiguated by the above process.
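A minimal sketch of the disambiguation in step 2.1, following formulas (2) and (3). It makes several assumptions that the source does not settle: glosses are passed in as plain strings rather than read from WordNet, words are split on whitespace, and the overlap counts distinct shared words. The sense labels and gloss texts are made up for illustration.

```python
from typing import Dict, List


def relatedness(gloss_a: str, gloss_b: str) -> float:
    """Formula (3): shared gloss words divided by the mean gloss length."""
    words_a = gloss_a.lower().split()
    words_b = gloss_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    # Assumption: the overlap counts distinct words present in both glosses.
    overlaps = len(set(words_a) & set(words_b))
    return overlaps / ((len(words_a) + len(words_b)) / 2)


def average_relatedness(sense_gloss: str, context_glosses: List[str]) -> float:
    """Formula (2): mean relatedness between one sense and all contexts."""
    if not context_glosses:
        return 0.0
    total = sum(relatedness(sense_gloss, gloss) for gloss in context_glosses)
    return total / len(context_glosses)


def disambiguate(sense_glosses: Dict[str, str], context_glosses: List[str]) -> str:
    """Pick the sense whose gloss has the largest average relatedness Rel."""
    return max(
        sense_glosses,
        key=lambda sense: average_relatedness(sense_glosses[sense], context_glosses),
    )


if __name__ == "__main__":
    # Hypothetical glosses for two senses of one concept word.
    senses = {
        "bank#1": "sloping land beside a body of water",
        "bank#2": "a financial institution that accepts deposits",
    }
    # Hypothetical glosses of the context words found within distance L.
    contexts = [
        "a medium of exchange accepted as financial payment",
        "a sum that a financial institution lends",
    ]
    print(disambiguate(senses, contexts))
```

In a real system the glosses would be looked up in WordNet for each candidate sense and for each context word.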

Step 2.2: Ontology expansion. After the ontology disambiguation of step 2.1, the ontology is represented by concept senses. Ontology expansion adds related senses to the concept senses of the disambiguated ontology, thereby enriching the ontology.

First, for each concept sense cs_{k} (k = 1, ..., N_{k}, where N_{k} is the number of concept senses in the ontology) of the disambiguated ontology, its hypernym sense set (hypernymy), hyponym sense set (hyponymy), and synonym sense set (synonym) in WordNet are obtained. These three relation sense sets of cs_{k} are denoted hypernym(cs_{k}) = {a_{k1}, a_{k2}, ...}, hyponym(cs_{k}) = {b_{k1}, b_{k2}, ...}, and synonym(cs_{k}) = {c_{k1}, c_{k2}, ...}. Two thresholds, hypernym_value and hyponym_value, are set, each with value range [0.6, 1].

Then, formula (4) is used to calculate the semantic relatedness between concept sense cs_{k} and each sense a_{kp} (p = 1, 2, ..., P, where P is the number of senses in the hypernym sense set) in the hypernym sense set hypernym(cs_{k}); if the semantic relatedness between cs_{k} and a_{kp} is greater than the threshold hypernym_value, a_{kp} is added to the parent-class set of cs_{k}. Formula (5) is used to calculate the semantic relatedness between cs_{k} and each sense b_{kq} (q = 1, 2, ..., Q, where Q is the number of senses in the hyponym sense set) in the hyponym sense set hyponym(cs_{k}); if the semantic relatedness between cs_{k} and b_{kq} is greater than the threshold hyponym_value, b_{kq} is added to the subclass set of cs_{k}. All senses c_{kl} (l = 1, 2, ..., L, where L is the number of senses in the synonym sense set) in the synonym sense set synonym(cs_{k}) are added to the similar-class set of cs_{k}.

relateness(cs_{k}, a_{kp}) = \frac{NumOfOverlaps(cs_{k}, a_{kp})}{(wordNumInGlossOf(cs_{k}) + wordNumInGlossOf(a_{kp})) / 2}    (4)

relateness(cs_{k}, b_{kq}) = \frac{NumOfOverlaps(cs_{k}, b_{kq})}{(wordNumInGlossOf(cs_{k}) + wordNumInGlossOf(b_{kq})) / 2}    (5)

Here, wordNumInGlossOf(cs_{k}), wordNumInGlossOf(a_{kp}), and wordNumInGlossOf(b_{kq}) are the numbers of words in the WordNet glosses of cs_{k}, a_{kp}, and b_{kq}, respectively; NumOfOverlaps(cs_{k}, a_{kp}) is the number of words that the WordNet gloss of cs_{k} and the WordNet gloss of a_{kp} have in common; NumOfOverlaps(cs_{k}, b_{kq}) is the number of words that the WordNet gloss of cs_{k} and the WordNet gloss of b_{kq} have in common.

Every concept sense in the ontology is expanded as described above.
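The expansion of a single concept sense might look like the sketch below. It assumes the hypernym, hyponym, and synonym candidates and their glosses have already been fetched from WordNet; the function names, the dictionary layout, and the sample glosses are hypothetical, and the gloss overlap again counts distinct shared words.

```python
from typing import Dict, Iterable, Set, Tuple


def relatedness(gloss_a: str, gloss_b: str) -> float:
    """Formulas (4) and (5): shared gloss words over the mean gloss length."""
    words_a = gloss_a.lower().split()
    words_b = gloss_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    return len(set(words_a) & set(words_b)) / ((len(words_a) + len(words_b)) / 2)


def expand_concept(
    concept_gloss: str,
    hypernyms: Dict[str, str],
    hyponyms: Dict[str, str],
    synonyms: Iterable[str],
    hypernym_value: float = 0.9,
    hyponym_value: float = 0.9,
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (parent-class set, subclass set, similar-class set) for one sense.

    hypernyms and hyponyms map a candidate sense name to its gloss.
    Candidates are kept only if their relatedness exceeds the threshold;
    all synonyms are kept unconditionally, as in step 2.2.
    """
    parents = {
        sense for sense, gloss in hypernyms.items()
        if relatedness(concept_gloss, gloss) > hypernym_value
    }
    children = {
        sense for sense, gloss in hyponyms.items()
        if relatedness(concept_gloss, gloss) > hyponym_value
    }
    return parents, children, set(synonyms)


if __name__ == "__main__":
    parents, children, similar = expand_concept(
        concept_gloss="a game played with a ball",
        hypernyms={"h1": "a game played with a ball", "h2": "an activity for fun"},
        hyponyms={"l1": "a race on water"},
        synonyms=["s1", "s2"],
    )
    print(parents, children, similar)
```

The default thresholds of 0.9 match the values used in the experiment reported later in this document.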

After step 2, a new ontology set representing the feature information of the classification categories is obtained, in which each o_{m}^{*} is the ontology o_{m} after ontology disambiguation and ontology expansion.

In this embodiment, the Jena package is used to operate on ontologies, and the JAWS package is used to access WordNet.

Step 3: Construction of the ontology word-sense sets. For each ontology o_{m}^{*} (m = 1, 2, ..., |O|) obtained after ontology disambiguation and ontology expansion:

First, ontology o_{m}^{*} is converted into a directed graph G = (V, E), where V is the vertex set, V = {v_{1}, v_{2}, ..., v_{|V|}} (|V| is the number of vertices in V), and E is the directed-edge set, E = {e_{1}, e_{2}, ..., e_{|E|}} (|E| is the number of directed edges in E). Each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense. In G, the layer number of a vertex is the shortest-path distance from its corresponding concept sense to the ontology root. Following the principle that the higher the level of a concept sense, the greater its contribution, the invention uses formula (6) to calculate the weight sw_{k} of concept sense cs_{k}:

sw_{k} = \frac{1}{(layer(v_{k}))^{1/4}}    (6)

Here, layer(v_{k}) is the layer number of the vertex v_{k} corresponding to concept sense cs_{k}.

After this processing, ontology o_{m}^{*} can be expressed as a weighted ontology word-sense set

o_{m}^{*} = {(cs_{1}, sw_{1}), (cs_{2}, sw_{2}), ..., (cs_{|o_{m}^{*}|}, sw_{|o_{m}^{*}|})}

Here, cs_{k} is a concept sense in ontology o_{m}^{*}, sw_{k} is its corresponding weight, and |o_{m}^{*}| is the number of concept senses in ontology o_{m}^{*}.

Every ontology in the ontology set is processed as described above.
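Step 3 can be sketched as a breadth-first search followed by formula (6). The edge list format (child, parent) and the function names are assumptions made for this illustration. Note that formula (6) is undefined for the root itself, whose shortest-path distance is 0; the source does not say how the root is weighted, so this sketch assigns it weight 1.0 as a labeled assumption.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def layers_from_root(edges: Iterable[Tuple[str, str]], root: str) -> Dict[str, int]:
    """Shortest-path distance of every reachable concept sense from the root.

    Each edge is (child, parent), matching the direction used in the
    directed graph G (child concept sense -> parent concept sense).
    """
    children: Dict[str, List[str]] = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)

    layer = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layer:  # first visit gives the shortest distance
                layer[child] = layer[node] + 1
                queue.append(child)
    return layer


def sense_weights(edges: Iterable[Tuple[str, str]], root: str) -> Dict[str, float]:
    """Formula (6): sw_k = 1 / layer(v_k) ** (1/4).

    Assumption: the root (layer 0) gets weight 1.0, because formula (6)
    would otherwise divide by zero.
    """
    weights = {}
    for node, depth in layers_from_root(edges, root).items():
        weights[node] = 1.0 if depth == 0 else depth ** -0.25
    return weights


if __name__ == "__main__":
    # Hypothetical fragment of an Arts ontology.
    demo_edges = [("music", "arts"), ("painting", "arts"), ("jazz", "music")]
    for sense, weight in sorted(sense_weights(demo_edges, "arts").items()):
        print(sense, round(weight, 4))
```

Senses closer to the root receive larger weights, which matches the stated principle that higher-level concept senses contribute more.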

Step 4: Classification and ranking. After the three steps above, each text document is represented by a weighted keyword set and each classification category by a weighted ontology word-sense set. Next, whether a text document is assigned to a given category is decided according to the similarity between the text document and the category: the larger the similarity, the closer the relation between the text document and the category, and the more likely the text document belongs to that category.

First, the Earth Mover's Distance measure is used to calculate the semantic similarity value between the text document and the ontology. Specifically:

For a text document d = {(t_{1}, tw_{1}), (t_{2}, tw_{2}), ..., (t_{|d|}, tw_{|d|})} (t is a keyword of the text document and tw is the weight of the keyword) and an ontology o = {(cs_{1}, sw_{1}), (cs_{2}, sw_{2}), ..., (cs_{|o|}, sw_{|o|})} (cs is a concept sense in the ontology and sw is the weight of the concept sense), a weighted graph can be obtained from them. Here, W is the distance matrix, whose element w_{ij} is the semantic similarity value between keyword t_{i} of the text document (i = 1, 2, ..., |d|, where |d| is the number of keywords of text document d) and concept sense cs_{j} of the ontology (j = 1, 2, ..., |o|, where |o| is the number of concept senses of ontology o).

In this weighted graph, the keywords of d and the concept senses of o form the vertex set, and W is the edge set. For the weighted graph, the goal of the semantic similarity calculation is to find a flow F = {f_{ij}, i = 1, ..., p, j = 1, ..., q} (f_{ij} is the flow on the edge between t_{i} and cs_{j}) that minimizes the value of EMD(d, o) in the following formula:

EMD(d, o) = \frac{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij} w_{ij}}{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij}}    (7)

EMD(d, o) is the semantic similarity value between the text document and the ontology.

The similarity between the text document and the classification category is then:

Sim(d, o) = 1 - EMD(d, o)    (8)
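Formula (7) does not list the constraints on the flow F. The worked example below assumes the standard Earth Mover's Distance constraints, which the source does not state, and uses made-up weights and distances for a document with two keywords and an ontology with two concept senses. It treats w_{ij} as a ground distance in [0, 1].

```latex
% Assumed standard EMD constraints (not stated in the source):
\[
f_{ij} \ge 0, \qquad \sum_{j} f_{ij} \le tw_{i}, \qquad \sum_{i} f_{ij} \le sw_{j}, \qquad
\sum_{i}\sum_{j} f_{ij} = \min\Big(\sum_{i} tw_{i}, \sum_{j} sw_{j}\Big)
\]
% Hypothetical data: d = {(t_1, 0.6), (t_2, 0.4)}, o = {(cs_1, 0.5), (cs_2, 0.5)}
\begin{align*}
W &= \begin{pmatrix} 0.1 & 0.8 \\ 0.7 & 0.2 \end{pmatrix}, \qquad
F = \begin{pmatrix} x & 0.6 - x \\ 0.5 - x & x - 0.1 \end{pmatrix}, \qquad 0.1 \le x \le 0.5 \\
\sum_{i,j} f_{ij} w_{ij} &= 0.1x + 0.8(0.6 - x) + 0.7(0.5 - x) + 0.2(x - 0.1) = 0.81 - 1.2x \\
x = 0.5 \;&\Rightarrow\; EMD(d, o) = \frac{0.81 - 0.6}{1} = 0.21, \qquad
Sim(d, o) = 1 - 0.21 = 0.79
\end{align*}
```

Because both weight sets sum to 1 in this example, every constraint holds with equality and the total flow is 1, so the single parameter x describes all feasible flows. The cost decreases in x, so the minimum is at x = 0.5.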

Once the similarity Sim between the text document and the classification category has been obtained, a threshold δ is set to decide whether the text document is classified into that category. If the similarity Sim between the text document and the category is greater than the threshold δ, the text document is classified into that category; otherwise it is not. The similarity Sim between the text document and the category expresses how closely the two are related, so Sim is also used to rank the text documents within each category: the larger the similarity Sim, the higher the text document is ranked. The value range of the threshold δ is [0.5, 0.6].
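The thresholding and ranking just described can be sketched as follows. The EMD values are taken as given inputs; the function name, the nested dictionary layout, and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_and_rank(
    emd: Dict[str, Dict[str, float]],
    delta: float = 0.5,
) -> Dict[str, List[Tuple[str, float]]]:
    """Assign documents to categories and rank them by similarity.

    emd maps category -> {document id -> EMD(d, o)}.
    A document is kept in a category when Sim(d, o) = 1 - EMD(d, o)
    is strictly greater than delta; kept documents are sorted by
    descending similarity. A document may appear in several categories.
    """
    result: Dict[str, List[Tuple[str, float]]] = {}
    for category, per_document in emd.items():
        kept = [
            (doc, 1.0 - distance)
            for doc, distance in per_document.items()
            if 1.0 - distance > delta
        ]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        result[category] = kept
    return result


if __name__ == "__main__":
    # Made-up EMD values for three documents and two categories.
    demo = {
        "Arts": {"a1": 0.2, "a2": 0.1, "s1": 0.7},
        "Sports": {"a1": 0.8, "a2": 0.9, "s1": 0.3},
    }
    for category, ranked in classify_and_rank(demo, delta=0.5).items():
        print(category, [doc for doc, _ in ranked])
```

With the default delta of 0.5, this is the lower end of the [0.5, 0.6] range given above.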

Example experiment: Based on the implemented program, an experiment was carried out to evaluate the invention. In the experiment, the threshold δ was 0.5 and the context distance range L was 3. Selection of text documents: 26 web page text documents were chosen from the website (http://dmoz.org) of the Open Directory Project (the largest directory project in which web pages are classified by human editors). Of these web pages, 11 belong to the Arts category, 8 to the Sports category, and 7 to the Games category. The 26 web pages and their URLs are shown in Table 1. The prefixes "a", "s", and "g" mark text documents belonging to the Arts, Sports, and Games categories, respectively.

Table 1: Web page text documents collected from http://dmoz.org/

Selection of ontologies: First, the Arts, Sports, and Games ontologies were extracted from the RDF file (structure.rdf.u8.gz) that represents the directory structure of the Open Directory Project. The Arts ontology contains 521 concepts, the Sports ontology 602 concepts, and the Games ontology 558 concepts. These ontologies then underwent ontology disambiguation and ontology expansion, with hypernym_value and hyponym_value both set to 0.9. After processing, the Arts, Sports, and Games ontologies contain 1557, 1809, and 1719 concepts, respectively.

These 26 web page text documents were classified with the method of the invention; the results are shown in the following table:

Table 2: Classification and ranking results

Table 2 gives the text documents contained in the Arts, Sports, and Games categories together with their similarity values, with the text documents in each category sorted by similarity. The overall precision and recall of the classification method calculated from these results are shown in Table 3:

Table 3: Performance of the classification algorithm

Recall	Precision
96.2%	83.9%

To evaluate the performance of the ranking method, the ranked lists produced in Table 2 were compared with manually generated ranked lists, which are shown in the following table:

Table 4: Manually produced ranked lists

Suppose τ_{i} is the list of classified text documents in category c_{i} after ranking by the ranking algorithm, and τ_{i}^{*} is the manual reference ranked list. The closeness of the two lists is then calculated with the following formula:

S = \frac{\sum_{i=1}^{|C|} S'(τ_{i}, τ_{i}^{*})}{|C|}    (9)

Here, S'(τ_{i}, τ_{i}^{*}) is the number of identical elements at the same rank position in list τ_{i} and list τ_{i}^{*}, and |C| is the total number of elements in list τ_{i} or list τ_{i}^{*}. The closeness S calculated with the formula above for each category list is averaged; the average closeness between the ranking results of the invention and the reference ranked lists of Table 4 is 79.1%.

Table 5: Performance evaluation of the ranking method

S_Arts (%)	S_Sports (%)	S_Games (%)	Average closeness (%)
78.5	75.0	83.7	79.1
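One way to read the closeness measure of formula (9) is sketched below: for each category, count the documents that occupy the same rank position in both lists, divide by the list length, and then average over the categories. This reading, the use of the longer list as the denominator when the lengths differ, and all names and sample data are assumptions.

```python
from typing import Dict, List


def list_closeness(ranked: List[str], reference: List[str]) -> float:
    """Fraction of rank positions holding the same document in both lists.

    Assumption: when the lists differ in length, the longer length is
    used as the denominator.
    """
    length = max(len(ranked), len(reference))
    if length == 0:
        return 0.0
    same = sum(1 for got, expected in zip(ranked, reference) if got == expected)
    return same / length


def average_closeness(
    ranked_by_category: Dict[str, List[str]],
    reference_by_category: Dict[str, List[str]],
) -> float:
    """Mean of the per-category closeness values."""
    categories = list(reference_by_category)
    if not categories:
        return 0.0
    total = sum(
        list_closeness(ranked_by_category.get(cat, []), reference_by_category[cat])
        for cat in categories
    )
    return total / len(categories)


if __name__ == "__main__":
    # Made-up rankings, not the data behind Table 5.
    produced = {"Arts": ["a1", "a2", "a3", "a4"], "Games": ["g2", "g1"]}
    manual = {"Arts": ["a1", "a3", "a2", "a4"], "Games": ["g2", "g1"]}
    print(average_closeness(produced, manual))
```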

This preliminary evaluation shows that the classification and ranking of the invention perform well: both precision and recall are fairly high, and the ranking is also effective. This is partly because the ontologies created by the ontology construction algorithm are fairly complete, and partly because semantic information is taken into account when calculating the similarity between text documents and ontologies. In addition, because the invention removes the burden of manually collecting training sets, and its classification performance can improve further as the ontologies evolve, using ontologies to classify and rank text documents has good prospects.

Claims

1. A method for automatically classifying text documents using an ontology, characterized by the following steps:

The ontology disambiguation process is as follows:

First, for each concept word in the ontology, select the words within distance L of the concept word as its context;

The value range of L is [3, 5];

Then, using the semantic relatedness formula

relateness(s_{i}, con_{j}) = \frac{NumOfOverlaps(s_{i}, con_{j})}{(wordNumInGlossOf(s_{i}) + wordNumInGlossOf(con_{j})) / 2}

The ontology expansion process is as follows:

Using the semantic relatedness formula

relateness(\hat{s}_{p}, s'_{pq}) = \frac{NumOfOverlaps(\hat{s}_{p}, s'_{pq})}{(wordNumInGlossOf(\hat{s}_{p}) + wordNumInGlossOf(s'_{pq})) / 2}

Here, \hat{s}_{p} is the p-th concept sense of the disambiguated ontology, p = 1, 2, ..., P, where P is the number of concept senses in the disambiguated ontology; s'_{pq} is the q-th sense in the hypernym sense set or hyponym sense set of \hat{s}_{p};

Then, calculate the weight of each concept sense by

weight = \frac{1}{layer^{1/4}}

Here, d is the weighted keyword set of the text document, and o is the weighted word-sense set of the ontology;

EMD(d, o) is the semantic similarity value between the text document and the ontology calculated with the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];