CN101286159A

CN101286159A - EMD-based method for metrizing the semantic similarity distance between documents

Info

Publication number: CN101286159A
Application number: CNA200810018390XA
Authority: CN
Inventors: 郭雷; 王晓东; 方俊
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2008-06-05
Filing date: 2008-06-05
Publication date: 2008-10-15
Anticipated expiration: 2028-06-05
Also published as: CN101286159B

Abstract

The invention relates to an EMD-based method for metrizing the semantic similarity distance between documents, belonging to fields such as information retrieval and data mining. The method is characterized in that: first, the documents are preprocessed and each is expressed as a tf-idf term weight vector; second, the widths of the document vectors and their width difference are calculated; third, the semantic similarity distances between the feature terms are calculated; fourth, the total weights are balanced by inserting a virtual term into a document vector; the resulting document vectors are then normalized; finally, the simplified EMD calculation is completed according to the full-matching criterion and the most-similar-first priority criterion. The method makes the EMD-based document semantic similarity distance a metric, improves the resolving power of the original algorithm and widens its range of application, and simplifies the EMD calculation, which raises computing speed and suits applications with stricter real-time requirements.

Description

An EMD-based method for metrizing the semantic similarity distance between documents

Technical field

The present invention relates to an EMD-based method for metrizing the semantic similarity distance between documents, and belongs to the fields of information retrieval and data mining.

Background technology

Document similarity measures calculate the degree of similarity between documents and play a key role in fields such as information retrieval and data mining. They are the basic computation underlying higher-level document processing such as classification, filtering, clustering, retrieval, and organization and management, and their performance directly affects the overall effectiveness and quality of information retrieval and data mining. A document similarity measure can take the form of a similarity coefficient or of a similarity distance, and the two can be converted into each other.

Common document similarity distance measures, such as the Euclidean distance and the Hamming distance, treat the feature terms (words) of a document as mutually orthogonal and ignore the semantic relations between different feature terms. They compare documents by matching identical word forms one to one, so their accuracy is limited. To introduce the semantic relations between different feature terms into the distance calculation and to establish many-to-many matching between document feature terms, some researchers have combined the EMD (Earth Mover's Distance) algorithm, which is widely used in image retrieval, with the WordNet electronic dictionary to obtain an EMD-based document semantic similarity distance algorithm, which effectively improves the accuracy of the calculation.

The EMD-based document similarity distance is computed as follows.

Let document $A = \{(t_{A,1}, w_{A,1}), (t_{A,2}, w_{A,2}), \dots, (t_{A,N}, w_{A,N})\}$ and document $B = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \dots, (t_{B,N}, w_{B,N})\}$. Let $D = \{d_{i,j}\}$, where $d_{i,j}$ is the semantic similarity distance between feature terms $t_{A,i}$ and $t_{B,j}$, and let

$$W = \sum_{i=1}^{N} w_{A,i}, \qquad U = \sum_{j=1}^{N} w_{B,j}.$$

Further, let there be a matching $F = \{f_{i,j}\}$, where $f_{i,j}$ is the amount of $w_{A,i}$ matched to $w_{B,j}$ over the distance $d_{i,j}$, satisfying:

$$f_{i,j} \geq 0, \quad i = 1, \dots, N; \; j = 1, \dots, N \qquad (1)$$

$$\sum_{j=1}^{N} f_{i,j} \leq w_{A,i}, \quad i = 1, \dots, N \qquad (2)$$

$$\sum_{i=1}^{N} f_{i,j} \leq w_{B,j}, \quad j = 1, \dots, N \qquad (3)$$

$$\sum_{i=1}^{N} \sum_{j=1}^{N} f_{i,j} = \min(W, U) \qquad (4)$$

If the feature term sets of A and B are regarded as two groups of vertices (likewise below) and connected to form the relation graph $G = \{A, B, D\}$, the minimum total matching work $\mathrm{Work}(A, B)$ is obtained as follows:

$$\mathrm{Work}(A, B) = \min_{f \in F} \sum_{i=1}^{N} \sum_{j=1}^{N} f_{i,j} \, d_{i,j} \qquad (5)$$

The similarity distance between A and B is then defined as the EMD between the sets A and B:

$$\mathrm{EMD}(A, B) = \frac{\mathrm{Work}(A, B)}{\min(W, U)} \qquad (6)$$

In summary, in the EMD calculation the feature terms of A can be regarded as piles of earth with masses $w_{A,i}$ and the feature terms of B as holes with capacities $w_{B,j}$ (or vice versa). Finding the EMD similarity distance between documents A and B then amounts to finding the transport plan that fills the holes with earth at the least total cost, where earth moves along paths of length $d_{i,j}$. Here $f_{i,j}$ is the flow on each path, and the EMD similarity distance is the ratio of the minimum total transport work to the smaller of the total earth mass and the total hole capacity. EMD is in effect a linear programming algorithm for the transportation problem.
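To make constraints (1) to (4) and the ratio in (6) concrete, the following minimal sketch checks a given flow for feasibility and evaluates its cost. The function name `flow_cost_ratio`, the tolerance, and the sample weights and distances are illustrative assumptions, not part of the patent. The returned value equals the EMD only when the supplied flow is optimal; finding that optimum requires a transportation-problem solver, which is not shown here.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def flow_cost_ratio(f: Matrix, d: Matrix, w_a: Sequence[float],
                    w_b: Sequence[float], tol: float = 1e-9) -> float:
    """Validate flow f against constraints (1)-(4); return sum(f*d) / min(W, U)."""
    rows, cols = range(len(w_a)), range(len(w_b))
    total_a, total_b = sum(w_a), sum(w_b)  # W and U
    if any(f[i][j] < -tol for i in rows for j in cols):
        raise ValueError("constraint (1) violated: negative flow")
    if any(sum(f[i][j] for j in cols) > w_a[i] + tol for i in rows):
        raise ValueError("constraint (2) violated: a term of A is over-shipped")
    if any(sum(f[i][j] for i in rows) > w_b[j] + tol for j in cols):
        raise ValueError("constraint (3) violated: a term of B is over-filled")
    shipped = sum(f[i][j] for i in rows for j in cols)
    if abs(shipped - min(total_a, total_b)) > tol:
        raise ValueError("constraint (4) violated: total flow != min(W, U)")
    work = sum(f[i][j] * d[i][j] for i in rows for j in cols)
    return work / min(total_a, total_b)


# Hypothetical two-term documents and a hand-picked feasible flow.
w_a, w_b = [0.6, 0.4], [0.5, 0.5]
d = [[0.0, 1.0], [1.0, 0.0]]
f = [[0.5, 0.1], [0.0, 0.4]]
print(flow_cost_ratio(f, d, w_a, w_b))
```

In this example only 0.1 of weight has to cross a non-zero distance, so the ratio is about 0.1.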

As the background above shows, the existing EMD-based document similarity distance has an important shortcoming: it does not satisfy the positivity axiom or the triangle inequality axiom of the definition of a metric. In practice the algorithm suffers from a serious partial-matching problem, which gives it poor resolving power.
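The positivity failure can be seen in the smallest possible case. The sketch below is an illustration under stated assumptions, not text from the patent: it takes two one-term documents, for which constraint (4) forces the single flow to equal min(W, U), and evaluates formula (6). The function name `single_term_emd` and the sample weights are hypothetical.

```python
def single_term_emd(w_a: float, w_b: float, d: float) -> float:
    """Formula (6) for two one-term documents with weights w_a, w_b and distance d."""
    flow = min(w_a, w_b)  # constraint (4): the only flow must equal min(W, U)
    return flow * d / min(w_a, w_b)


# Same term in both documents (distance 0) but different weights:
# the documents differ, yet the unbalanced EMD reports distance 0.
print(single_term_emd(1.0, 2.0, 0.0))  # 0.0
```

Because the surplus weight in the heavier document is never matched, it contributes nothing to the distance, which is the partial-matching problem described above.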

Summary of the invention

Technical problem to be solved

To remove the defect that the existing EMD document semantic similarity distance algorithm does not satisfy the metric axioms, the present invention proposes a method for metrizing EMD-based document similarity, which turns the original algorithm into a metric.

Technical solution

The idea of the invention is as follows. For EMD to satisfy the metric axioms fully, two conditions must be guaranteed: the function that computes the similarity between feature terms (called the base function) must itself be a metric, and the total weights of the document feature term sets must be equal in the space over which EMD is computed. The former is easy to achieve; the latter is the main problem to solve. Having chosen a base function that satisfies the metric axioms, the invention makes up the weight difference between the document vectors with a virtual term, then normalizes the feature term weights of the two padded document vectors, and only then carries out the EMD calculation, thereby obtaining a strict EMD-based metric of document semantic similarity distance.

The technical features of the invention are that it introduces the concepts of document width and virtual term into the EMD-based calculation of document semantic similarity distance, and that it proposes a simplified EMD calculation based on the most-similar-first priority criterion and the full-matching criterion. The specific steps are:

1. First, preprocess the two documents in the corpus whose semantic similarity distance is to be calculated: remove stop words and represent each document as a tf-idf term weight vector, with A as the left vector and B as the right vector;

2. For the left vector A and the right vector B, calculate the document widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ and the document width difference $W_{AB} = \left| \|A\|_{tfidf} - \|B\|_{tfidf} \right|$;

3. Using a WordNet-based word similarity distance tool, calculate the similarity distances between the feature terms with non-zero weights in the left and right document vectors, and store them in the similarity distance record list;

4. Define the weight of the virtual term and the similarity distance between the virtual term and the other feature terms, and write that distance into the record list of step 3. The weight of the virtual term equals the width difference of the left and right document vectors obtained in step 2; the similarity distance between the virtual term and the other feature terms is the maximum of the similarity distances between the feature terms of the left and right document vectors;

5. If the document widths of the left and right document vectors are unequal, that is, the width difference is not 0, a virtual term must be inserted. If the left vector is wider than the right vector, insert the virtual term built in step 4 into the right vector; otherwise, insert it into the left vector;

6. After inserting the virtual term, normalize the document vectors: divide each term weight by the total weight of its document vector and replace the original weight with the quotient, so that the new left and right document vectors each have a total weight of 1;

7. Carry out the simplified EMD calculation according to the most-similar-first priority criterion and the full-matching criterion.
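Steps 5 and 6 above can be sketched as follows. This is a minimal illustration, assuming the documents are given as plain lists of term weights; the function name `pad_and_normalize` and the sample weights are hypothetical. After padding, both vectors have total weight max(W, U), so dividing by that value gives each a total weight of 1.

```python
from typing import List, Sequence, Tuple


def pad_and_normalize(w_a: Sequence[float],
                      w_b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Append a virtual weight to the lighter vector, then scale both to sum to 1."""
    total_a, total_b = sum(w_a), sum(w_b)
    scale = max(total_a, total_b)
    if scale == 0:
        raise ValueError("both documents are empty")
    a, b = list(w_a), list(w_b)
    gap = abs(total_a - total_b)  # width difference = weight of the virtual term
    if total_a > total_b:
        b.append(gap)             # right vector is lighter: pad it
    elif total_b > total_a:
        a.append(gap)             # left vector is lighter: pad it
    return [w / scale for w in a], [w / scale for w in b]


# Hypothetical weights: A has width 5, B has width 2.5.
left, right = pad_and_normalize([3.0, 2.0], [2.5])
print(left, right)
```

When the widths are already equal, no virtual term is appended and the vectors are only rescaled.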

The document width is defined as follows. Let X be the set of feature terms of a document vector and x a feature term. Given a mapping $M: x \to \mathbb{R}^{+} \cup \{0\}$, $x \in X$, whose values are called the distribution values of X under the distribution M, the sum

$$\|X\|_M = \sum_{x \in X} M(x)$$

is the document width of X under the distribution M, denoted $\|X\|_M$. When $X = \varnothing$, $\|X\|_M = 0$.

The document width difference is the difference between the widths of the left and right document vectors; its value is non-negative.
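A minimal sketch of these two definitions, assuming a document is stored as a dictionary from term to tf-idf weight; the function names and the sample documents are illustrative only.

```python
from typing import Mapping


def document_width(doc: Mapping[str, float]) -> float:
    """Sum of the (non-negative) term weights; 0 for an empty document."""
    if any(w < 0 for w in doc.values()):
        raise ValueError("term weights must be non-negative")
    return sum(doc.values())


def width_difference(doc_a: Mapping[str, float], doc_b: Mapping[str, float]) -> float:
    """Absolute difference of the two document widths, hence never negative."""
    return abs(document_width(doc_a) - document_width(doc_b))


# Hypothetical tf-idf weights.
doc_a = {"bank": 0.5, "loan": 0.25}
doc_b = {"river": 0.5}
print(document_width(doc_a), document_width(doc_b), width_difference(doc_a, doc_b))
```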

The similarity distance between the virtual term and the other feature terms is: the mean of the similarity distances between the feature terms of the left and right document vectors.
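The text gives two ways of assigning the virtual term's distance: the maximum of the recorded distances (step 4) and their mean (the paragraph above). The sketch below covers both, assuming the record list of step 3 is available as a flat list of distances; the function name, the `method` parameter and the sample values are hypothetical.

```python
from typing import Sequence


def virtual_term_distance(recorded: Sequence[float], method: str = "mean") -> float:
    """Distance assigned between the virtual term and every real feature term."""
    if not recorded:
        raise ValueError("the similarity distance record list is empty")
    if method == "max":
        return max(recorded)
    if method == "mean":
        return sum(recorded) / len(recorded)
    raise ValueError("method must be 'max' or 'mean'")


# Hypothetical recorded distances between non-zero-weight terms.
distances = [0.25, 0.75]
print(virtual_term_distance(distances, "max"), virtual_term_distance(distances, "mean"))
```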

The most-similar-first priority criterion is as follows. When calculating the similarity between documents, term pairs with a smaller similarity distance should receive higher priority when the document vectors are matched; that is, the most similar terms (possibly synonyms or near-synonyms) take part in the "transport" of weight first. The term pair with the smallest similarity distance receives the highest matching priority, and the priority of the other pairs decreases as their similarity distance increases. Applying this criterion reduces the computational load of the EMD algorithm.

The full-matching criterion is as follows. A polysemous word usually takes only one sense within a given document, so a term is rarely matched to several words at once. A term pair can therefore be matched in full; only when one of the two weights exceeds the other does the remaining weight need to be matched again. Applying this criterion effectively reduces the number of iterations of the EMD algorithm.

Under these criteria, the simplified EMD calculation first searches the similarity distance record list of step 3 for the minimum similarity distance and transports all of the "earth" (that is, the weight) of the left-vector feature term on that pair into the "hole" of the right vector. If the hole's capacity is insufficient, the surplus stays in the left vector; if the earth is insufficient, the unfilled capacity stays in the right vector. The amount actually transported is the flow on that path. After each transport, the next smallest similarity distance in the record list of step 3 is searched for in the same way, until the weights of all left-vector terms have been transported.

The document semantic similarity distance is calculated by the following formula:

$$\mathrm{EMD}(A, B) = \min_{f \in F} \sum_{i=1}^{N} \sum_{j=1}^{N+1} f_{i,j} \, d_{i,j} \qquad (7)$$
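The simplified calculation described above can be sketched as a greedy pass over the term pairs in order of increasing distance. This is an illustration under stated assumptions: the weight vectors are already padded and normalized so that each sums to 1, the distance matrix already contains the virtual term's column or row, and the function name `simplified_emd` and the sample data are hypothetical. A greedy pass of this kind is a simplification and is not guaranteed to reach the linear-programming optimum in every case.

```python
from typing import Sequence


def simplified_emd(w_a: Sequence[float], w_b: Sequence[float],
                   d: Sequence[Sequence[float]]) -> float:
    """Greedy transport: most similar pairs first, each pair matched in full."""
    a, b = list(w_a), list(w_b)  # remaining "earth" and remaining "hole" capacity
    pairs = sorted((d[i][j], i, j) for i in range(len(a)) for j in range(len(b)))
    total = 0.0
    for dist, i, j in pairs:     # most-similar-first: ascending distance
        flow = min(a[i], b[j])   # full matching: move as much as fits
        if flow <= 0:
            continue             # one side is exhausted, skip this pair
        a[i] -= flow
        b[j] -= flow
        total += flow * dist     # accumulate f_ij * d_ij as in formula (7)
    return total


# Hypothetical normalized weights and distances.
w_left, w_right = [0.5, 0.5], [0.25, 0.75]
dist = [[0.0, 0.4], [0.6, 0.2]]
print(simplified_emd(w_left, w_right, dist))  # 0.2
```

Sorting the pairs once replaces the repeated search for the next minimum in the record list; the visiting order is the same.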

Beneficial effect

The invention proposes an EMD-based method for metrizing the semantic similarity distance between documents. Inserting a virtual term balances the weights of the document vectors and removes the defect whereby the EMD algorithm, by not distinguishing the weight difference between document vectors during the calculation, falls into partial matching. This improves the resolving power of the algorithm and widens its range of application.

The method also simplifies the EMD calculation and so raises computing speed, making it suitable for applications with stricter real-time requirements.

Description of drawings

Fig. 1: Basic flow chart of the method of the invention

Embodiment

The invention is now further described with reference to the accompanying drawing:

The embodiment uses two English text documents from the Reuters-21578 corpus. The hardware environment is a P4 3.0 GHz CPU with 512 MB of memory and an 80 GB hard disk; the software environment is Windows XP Professional with the NTFS file system. Perl tools and WordNet 2.1 are used to calculate the semantic distances, the virtual term similarity distance is assigned by method 2, and the results are stored. The main program is implemented in VC++ 6.0.

1. Preprocessing. After stop words are removed, each document is expressed as an N-dimensional vector according to the VSM (vector space model): the left document $A = \{(t_{A,1}, w_{A,1}), (t_{A,2}, w_{A,2}), \dots, (t_{A,N}, w_{A,N})\}$ and the right document $B = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \dots, (t_{B,N}, w_{B,N})\}$, where t is a feature term (which can be a phrase, a word, and so on; words are generally used) and w is the tf-idf weight of term t.

2. Calculate the document widths and the document width difference.

The widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ of documents A and B are the sums of the tf-idf weights of their respective feature terms.

The difference between the widths of documents A and B is denoted $W_{AB}$, with $W_{AB} = \left| \|A\|_{tfidf} - \|B\|_{tfidf} \right|$;

3. Calculate the similarity distances between feature terms. The Lesk algorithm in the WordNet-based Perl word similarity distance tool can be called to calculate the word similarity distances. The results are stored in the similarity distance record list $\{d_{i,j}\}$.

4. Build the virtual term. The virtual term is $(t_v, w_v)$ with $w_v = W_{AB}$; here $t_v$ carries no actual lexical meaning. The similarity distance between the virtual term and the other terms is denoted $d_{iv}$ and is assigned by method 2, that is, it takes the mean $\bar{d}$,

$$\bar{d} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} d_{i,j},$$

where $w_{A,i} \neq 0$ and $w_{B,j} \neq 0$. Store $d_{iv}$ in the similarity distance record list $\{d_{i,j}\}$.

5. Insert the virtual term into B. Because $\|A\|_{tfidf} \geq \|B\|_{tfidf}$, the virtual term above must be inserted into the right document B, giving $B' = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \dots, (t_{B,N}, w_{B,N}), (t_v, w_v)\}$.

6. Normalization. Let $w'_{A,i} = w_{A,i} / \max(W, U)$, $1 \leq i \leq N$; $w'_{B,j} = w_{B,j} / \max(W, U)$, $1 \leq j \leq N$; and $w'_v = w_v / \max(W, U)$. This gives $A' = \{(t_{A,1}, w'_{A,1}), (t_{A,2}, w'_{A,2}), \dots, (t_{A,N}, w'_{A,N})\}$ and $B'' = \{(t_{B,1}, w'_{B,1}), (t_{B,2}, w'_{B,2}), \dots, (t_v, w'_v)\}$.

7. Simplified EMD calculation. To simplify the computation of $\mathrm{EMD}(A', B'')$, the most-similar-first priority criterion is applied first: the simplified calculation selects the smallest $d_{i,j}$ in the similarity distance record list $\{d_{i,j}\}$ and obtains the corresponding pair $(i, j)$.

From the pair $(i, j)$ of $d_{i,j}$, take the smaller of the weights $w_{A,i}$ and $w_{B,j}$, that is, $\min(w_{A,i}, w_{B,j})$. If $\min(w_{A,i}, w_{B,j}) = 0$, go to the next step; otherwise set $f_{i,j} = \min(w_{A,i}, w_{B,j})$. If $w_{A,i} \leq w_{B,j}$, then $w_{B,j} = w_{B,j} - w_{A,i}$ and $w_{A,i} = 0$; otherwise ($w_{A,i} > w_{B,j}$), $w_{A,i} = w_{A,i} - w_{B,j}$ and $w_{B,j} = 0$.

Obtain the next pair $(i, j)$ according to the most-similar-first priority criterion and calculate the matching value as above, until the weights of all feature words have been matched.

Calculate the similarity distance between A and B by formula (7).

This completes the EMD-based metric calculation of the semantic similarity distance between documents A and B.

The method solves the problem that the EMD-based document semantic similarity distance algorithm does not satisfy the metric axioms, and simplifies the original EMD calculation. It can be applied to document triangle indexing, for which the EMD document semantic similarity distance was previously unsuitable, and it improves both the resolving power and the computational efficiency of the calculation.
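As a hedged illustration of steps 4 to 7, the trace below works through hypothetical numbers that are not taken from the patent: a left document with two non-zero terms of weight 3 and 2, and a right document with one non-zero term of weight 2.5. The mean distance is taken here over the two recorded distances, which is an assumption about how the mean in step 4 applies when the documents have different numbers of non-zero terms.

```latex
\begin{aligned}
& W = 3 + 2 = 5, \quad U = 2.5, \quad W_{AB} = |5 - 2.5| = 2.5, \quad w_v = 2.5 \\
& \text{weights of } A' = (0.6,\ 0.4), \quad \text{weights of } B'' = (0.5,\ 0.5_{\text{virtual}})
  \quad \text{(all divided by } \max(W, U) = 5\text{)} \\
& d_{1,1} = 0.2, \quad d_{2,1} = 0.7, \quad d_{1,v} = d_{2,v} = \bar{d} = \tfrac{0.2 + 0.7}{2} = 0.45 \\
& \text{select } d_{1,1}: \ f_{1,1} = \min(0.6, 0.5) = 0.5, \quad w_{A,1} \to 0.1, \ w_{B,1} \to 0 \\
& \text{select } d_{1,v}: \ f_{1,v} = \min(0.1, 0.5) = 0.1, \quad w_{A,1} \to 0, \ w_v \to 0.4 \\
& \text{select } d_{2,v}: \ f_{2,v} = \min(0.4, 0.4) = 0.4, \quad w_{A,2} \to 0, \ w_v \to 0 \\
& \text{select } d_{2,1}: \ \min(0, 0) = 0, \ \text{no flow} \\
& \mathrm{EMD}(A, B) = 0.5 \cdot 0.2 + 0.1 \cdot 0.45 + 0.4 \cdot 0.45 = 0.325
\end{aligned}
```

The tie between $d_{1,v}$ and $d_{2,v}$ can be broken either way; both orders give the same total in this example.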

Claims

1. An EMD-based method for metrizing the semantic similarity distance between documents, characterized by the following steps:

1) First, preprocess the two documents in the corpus whose semantic similarity distance is to be calculated: remove stop words and represent each document as a tf-idf term weight vector, with A as the left vector and B as the right vector;

2) For the left vector A and the right vector B, calculate the document widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ and the document width difference $W_{AB} = \left| \|A\|_{tfidf} - \|B\|_{tfidf} \right|$;

3) Using a WordNet-based word similarity distance tool, calculate the similarity distances between the feature terms with non-zero weights in the left and right document vectors, and store them in the similarity distance record list;

4) Define the weight of the virtual term and the similarity distance between the virtual term and the other feature terms, and write that distance into the record list of step 3. The weight of the virtual term equals the width difference of the left and right document vectors obtained in step 2; the similarity distance between the virtual term and the other feature terms is the maximum of the similarity distances between the feature terms of the left and right document vectors;

5) If the document widths of the left and right document vectors are unequal, that is, the width difference is not 0, a virtual term must be inserted. If the left vector is wider than the right vector, insert the virtual term built in step 4 into the right vector; otherwise, insert it into the left vector;

6) After inserting the virtual term, normalize the document vectors: divide each term weight by the total weight of its document vector and replace the original weight with the quotient, so that the new left and right document vectors each have a total weight of 1;

7) Carry out the simplified EMD calculation according to the full-matching criterion and the most-similar-first priority criterion.

2. The EMD-based method for metrizing the semantic similarity distance between documents according to claim 1, characterized in that:

The document width is defined as follows: let X be the set of feature terms of a document vector and x a feature term; given a mapping $M: x \to \mathbb{R}^{+} \cup \{0\}$, $x \in X$, whose values are called the distribution values of X under the distribution M, the document width of X under the distribution M is $\|X\|_M = \sum_{x \in X} M(x)$, and $\|X\|_M = 0$ when $X = \varnothing$.

3. The EMD-based method for metrizing the semantic similarity distance between documents according to claim 1, characterized in that: the document width difference is the difference between the widths of the left and right document vectors, and its value is non-negative.

4. The EMD-based method for metrizing the semantic similarity distance between documents according to claim 1, characterized in that: the similarity distance between the virtual term and the other feature terms is the mean of the similarity distances between the feature terms of the left and right document vectors.

5. The EMD-based method for metrizing the semantic similarity distance between documents according to claim 1, characterized in that: the most-similar-first priority criterion is that, when the document similarity distance is calculated, the term pair with the smallest similarity distance receives the highest matching priority, and the matching priority of the other pairs decreases as their similarity distance increases.

6. The EMD-based method for metrizing the semantic similarity distance between documents according to claim 1, characterized in that: the full-matching criterion is that a term pair is matched in full, and only when one of the two weights exceeds the other does the remaining weight need to be matched again.