CN101286159B - Document meaning similarity distance metrization method based on EMD - Google Patents

Info

Publication number
CN101286159B
Authority
CN
China
Prior art keywords
document
item
similarity distance
emd
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200810018390XA
Other languages
Chinese (zh)
Other versions
CN101286159A (en)
Inventor
郭雷
王晓东
方俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN200810018390XA
Publication of CN101286159A
Application granted
Publication of CN101286159B
Expired - Fee Related (current legal status)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an EMD-based method for metrizing document semantic similarity distance, belonging to fields such as information retrieval and data mining. The method is characterized as follows: first, each document is preprocessed and represented as a tf-idf term weight vector; second, the widths of the document vectors and their width difference are computed; third, the semantic similarity distances between the feature words are computed; fourth, a virtual item is inserted into a document vector to make up the total weight; the resulting document vectors are then normalized; finally, a simplified EMD computation is carried out according to the full matching criterion and the most-similar-highest-priority criterion. The method turns the EMD-based document semantic similarity distance into a metric, improves the discriminating power of the original algorithm and widens its range of application. It also simplifies the EMD computation and increases its speed, which makes it suitable for applications with stricter real-time requirements.

Description

An EMD-based method for metrizing document semantic similarity distance
Technical field
The present invention relates to an EMD-based method for metrizing the semantic similarity distance between documents, and belongs to the fields of information retrieval and data mining.
Background art
A document similarity measure computes how similar two documents are. It is of key importance in fields such as information retrieval and data mining, where it underlies higher-level document organization and management tasks such as classification, filtering, clustering and search, and its performance directly affects the overall effectiveness and quality of information retrieval and data mining. A document similarity measure can take the form of a similarity coefficient or of a similarity distance, and the two can be converted into each other.
Common document similarity distance measures, such as the Euclidean distance and the Hamming distance, treat the feature items (terms) of a document as mutually orthogonal and ignore the semantic relations between different feature items. They compare documents by "one-to-one" matching of identical word forms, so their accuracy is poor. To bring the semantic relations between different feature items into the distance computation and to establish "many-to-many" matching between the feature items of two documents, some researchers have built a document semantic similarity distance algorithm on the EMD (Earth Mover's Distance) algorithm, which is widely used in image retrieval, and on the WordNet electronic dictionary, effectively improving the accuracy of the computation.
The EMD-based document similarity distance is computed as follows.
Let document $A = \{(t_{A,1}, w_{A,1}), (t_{A,2}, w_{A,2}), \ldots, (t_{A,N}, w_{A,N})\}$ and document $B = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \ldots, (t_{B,N}, w_{B,N})\}$. Let $D = \{d_{i,j}\}$, where $d_{i,j}$ is the semantic similarity distance between feature items $t_{A,i}$ and $t_{B,j}$, and let $W = \sum_{i=1}^{N} w_{A,i}$ and $U = \sum_{j=1}^{N} w_{B,j}$. Further, let $F = \{f_{i,j}\}$ be a matching, where $f_{i,j}$ is the amount of $w_{A,i}$ matched to $w_{B,j}$ over the distance $d_{i,j}$, subject to:
$$f_{i,j} \ge 0, \quad i = 1, \ldots, N;\ j = 1, \ldots, N \tag{1}$$
$$\sum_{j=1}^{N} f_{i,j} \le w_{A,i}, \quad i = 1, \ldots, N \tag{2}$$
$$\sum_{i=1}^{N} f_{i,j} \le w_{B,j}, \quad j = 1, \ldots, N \tag{3}$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} f_{i,j} = \min(W, U) \tag{4}$$
Treat the sets of feature items of A and B, also denoted A and B (likewise below), as two groups of vertices, and connect the two groups to form the relation graph G = {A, B, D}. The minimum total matching work Work(A, B) is then:
$$\mathrm{Work}(A, B) = \min_{f \in F} \sum_{i=1}^{N} \sum_{j=1}^{N} f_{i,j}\, d_{i,j} \tag{5}$$
The similarity distance between A and B is then defined as the EMD between the sets A and B:
$$\mathrm{EMD}(A, B) = \frac{\mathrm{Work}(A, B)}{\min(W, U)} \tag{6}$$
In summary, the EMD computation can regard the feature items of A as piles of earth with masses $w_{A,i}$ and the feature items of B as holes with capacities $w_{B,j}$ (or vice versa). Finding the EMD similarity distance between documents A and B then amounts to finding the transport plan that moves the earth into the holes along paths of length $d_{i,j}$ at minimum total cost. Here $f_{i,j}$ is the flow on each path, and the EMD similarity distance is the ratio of the minimum total transport work to the smaller of the total earth mass and the total hole capacity. EMD is in fact a linear programming formulation of the transportation problem.
As the background above shows, the existing EMD-based document similarity distance has an important weakness: it does not satisfy the positivity axiom or the triangle inequality axiom of a metric. In practice this appears as a serious partial matching problem, which gives the algorithm poor discriminating power.
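A small worked case, constructed here for illustration and not taken from the patent, shows how partial matching breaks positivity. It assumes N = 2, two documents over the same two feature items, and a base distance with $d_{1,1} = d_{2,2} = 0$ and $d_{1,2} = d_{2,1} = d > 0$:

```latex
A = \{(t_1, 1), (t_2, 0)\}, \quad B = \{(t_1, 1), (t_2, 3)\}, \quad
W = 1,\ U = 4,\ \min(W, U) = 1
% The flow f_{1,1} = 1 (all other f_{i,j} = 0) satisfies constraints (1)-(4), so
\mathrm{Work}(A, B) = f_{1,1}\, d_{1,1} = 1 \cdot 0 = 0
\mathrm{EMD}(A, B) = \frac{\mathrm{Work}(A, B)}{\min(W, U)} = \frac{0}{1} = 0
\quad \text{although } A \neq B
```

The three units of weight on $t_2$ in B are never matched, so the distance cannot tell B apart from A.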
Summary of the invention
Technical problem to be solved
To remove the defect that the existing EMD document semantic similarity distance algorithm does not satisfy the metric axioms, the present invention proposes an EMD-based document similarity metrization method that turns the original algorithm into a metric.
Technical solution
The idea of the invention is as follows. For EMD to satisfy the metric axioms fully, two conditions must hold: the function that computes the similarity between feature items (called the base function) must itself be a metric, and within the computation space of EMD every document's set of feature items must have the same total weight. The former is easy to achieve; the latter is the main problem to solve. Having chosen a base function that satisfies the metric axioms, the invention fills the weight difference between the document vectors with a virtual item, then normalizes the feature item weights of the two padded document vectors, and only then carries out the EMD computation, thereby obtaining a strict EMD-based measure of document semantic similarity distance.
The invention is technically characterized in that it introduces the concepts of document width and virtual item into the EMD-based computation of document semantic similarity distance, and proposes a simplified EMD computation based on the most-similar-highest-priority criterion and the full matching criterion. The concrete steps are:
1. Preprocess the two documents in the corpus whose semantic similarity distance is to be computed: remove stop words and represent each document as a tf-idf term weight vector, with A the left vector and B the right vector;
2. For the left vector A and the right vector B, compute the document widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ and the document width difference $W_{AB} = \left|\, \|A\|_{tfidf} - \|B\|_{tfidf} \,\right|$;
3. Using a WordNet-based word similarity distance tool, compute the similarity distances between the feature items with non-zero weight in the left and right document vectors, and store them in a similarity distance record list;
4. Define the weight of the virtual item and its similarity distance to the other feature items, and write these distances into the record list of step 3. The weight of the virtual item equals the width difference of the left and right document vectors obtained in step 2. The similarity distance between the virtual item and the other feature items is the maximum of the similarity distances between the feature items of the left and right document vectors;
5. If the document widths of the left and right vectors are unequal, i.e. the document width difference is not 0, a virtual item must be inserted. If the left vector is wider than the right vector, insert the virtual item built in step 4 into the right vector; otherwise insert it into the left vector;
6. After inserting the virtual item, normalize the document vectors: divide each weight in a document vector by the total weight of that vector and replace the original weight with the quotient, so that the new left and right document vectors each have total weight 1;
7. Carry out the simplified EMD computation according to the most-similar-highest-priority criterion and the full matching criterion.
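Steps 2, 5 and 6 above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes each document vector is a dict mapping a term to its tf-idf weight, and the names `VIRTUAL`, `width` and `pad_and_normalize` are hypothetical.

```python
VIRTUAL = "<virtual>"  # hypothetical placeholder name for the virtual item


def width(vec):
    """Document width: the sum of the (non-negative) term weights."""
    return sum(vec.values())


def pad_and_normalize(left, right):
    """Balance two weight vectors with a virtual item, then normalize both to total weight 1."""
    left, right = dict(left), dict(right)  # work on copies, leave the inputs untouched
    wa, wb = width(left), width(right)
    diff = abs(wa - wb)  # document width difference W_AB
    if diff > 0:
        # The narrower vector receives the virtual item, whose weight is W_AB.
        (right if wa > wb else left)[VIRTUAL] = diff
    total = max(wa, wb)  # after padding, both vectors have this total weight
    if total == 0:
        return left, right  # two empty documents: nothing to normalize
    left = {t: w / total for t, w in left.items()}
    right = {t: w / total for t, w in right.items()}
    return left, right
```

For example, `pad_and_normalize({"oil": 3.0, "price": 1.0}, {"oil": 1.0, "market": 1.0})` pads the right vector with a virtual item of weight 2 and then divides every weight by 4.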
The document width is defined as follows. Let X be the set of feature items of a document vector and x a feature item. Given a mapping $M: x \to \mathbb{R}^{+} \cup \{0\}$, $x \in X$, $M(x)$ is called the distribution value of x under the distribution M, and $\sum_{x \in X} M(x)$ is the document width of X under the distribution M, denoted $\|X\|_M$. When $X = \varnothing$, $\|X\|_M = 0$.
The document width difference is the difference between the widths of the left and right document vectors; it is non-negative.
The similarity distance between the virtual item and the other feature items may also be taken as the mean of the similarity distances between the feature items of the left and right document vectors.
The most-similar-highest-priority criterion is as follows. When computing the similarity between documents, the item pairs with the smallest similarity distance should be given the highest priority in matching the document vectors. That is, the most similar items (possibly synonyms or near-synonyms) take part in the "transport" of weight first: the item pair with the smallest similarity distance receives the highest matching priority, and the priorities of the other pairs decrease as the similarity distance between their items increases. Applying this criterion reduces the computational burden of the EMD algorithm.
The full matching criterion is as follows. A polysemous word usually takes only one of its senses within a given document, so an item is seldom matched to several words at once. Item pairs can therefore be matched fully; only when one weight exceeds the other does the remaining weight need to be matched again. Applying this criterion effectively reduces the number of iterations of the EMD algorithm.
Under these criteria, the simplified EMD computation first searches the similarity distance record list of step 3 for the minimum similarity distance, and transports the whole "earth" amount (i.e. the weight) of the left-vector feature item on that pair into the corresponding "hole" of the right vector. If the capacity of the "hole" is insufficient, the surplus stays in the left vector; if the amount of "earth" is insufficient, the unfilled capacity stays in the right vector. The amount actually transported is the flow on that path. After each transport, the next smallest similarity distance is searched for in the record list of step 3 in the same way, until the weights of all left-vector items have been transported.
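The transport loop just described can be sketched as follows. This is an illustrative reading of the two criteria, not code from the patent: the function name `simplified_emd` and the dict-based inputs are assumptions, and both vectors are assumed to be already padded and normalized to the same total weight. Being greedy, the result may differ from the exact linear programming optimum of the transportation problem.

```python
def simplified_emd(left, right, dist):
    """Greedy EMD sketch for two weight vectors of equal total weight.

    left, right: dict mapping term -> weight (already padded and normalized)
    dist: dict mapping (left_term, right_term) -> similarity distance
    """
    supply = dict(left)   # remaining "earth" per left item
    demand = dict(right)  # remaining "hole" capacity per right item
    work = 0.0
    # Most-similar-highest-priority: visit pairs by increasing distance.
    for (i, j), d in sorted(dist.items(), key=lambda kv: kv[1]):
        flow = min(supply.get(i, 0.0), demand.get(j, 0.0))
        if flow <= 0.0:
            continue  # one side of this pair is already used up
        # Full matching: move as much weight as this pair can carry.
        work += flow * d
        supply[i] -= flow
        demand[j] -= flow
    return work
```

With `left = {"a": 0.5, "b": 0.5}`, `right = {"a": 0.5, "c": 0.5}` and distances `("a","a"): 0.0`, `("b","c"): 0.4`, `("b","a"): 0.6`, `("a","c"): 0.8`, the loop first moves 0.5 along the zero-distance pair and then 0.5 along `("b","c")`.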
The document semantic similarity distance is computed according to the following formula:
$$\mathrm{EMD}(A, B) = \min_{f \in F} \sum_{i=1}^{N} \sum_{j=1}^{N+1} f_{ij}\, d_{ij} \tag{7}$$
Beneficial effects
The present invention proposes an EMD-based method for metrizing document semantic similarity distance. By inserting a virtual item, it balances the weights of the document vectors and removes the defect that the EMD algorithm, by not accounting for the weight difference between document vectors during computation, falls into partial matching. This improves the discriminating power of the algorithm and widens its range of application.
The method also simplifies the computation of the EMD algorithm and increases its speed, making it suitable for applications with stricter real-time requirements.
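The balancing effect can be checked on a small case built here for illustration (it is not taken from the patent). Assume a left document with one feature item, a right document with two, $d(t_1, t_1) = 0$, and a virtual item distance $d_v > 0$:

```latex
A = \{(t_1, 1)\}, \quad B = \{(t_1, 1), (t_2, 3)\}, \quad W_{AB} = |1 - 4| = 3
% A is the narrower vector, so it receives the virtual item (t_v, 3);
% both vectors are then divided by their common total weight 4:
A' = \{(t_1, \tfrac14), (t_v, \tfrac34)\}, \quad B' = \{(t_1, \tfrac14), (t_2, \tfrac34)\}
% t_1 is matched to t_1 at distance 0; the virtual item must carry the rest to t_2:
\mathrm{EMD}(A', B') = \tfrac14 \cdot 0 + \tfrac34 \cdot d_v = \tfrac34\, d_v > 0
```

Without the virtual item, all of A's weight would match $t_1$ at distance 0 and the distance would be 0 even though the documents differ.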
Description of drawings
Fig. 1: basic flowchart of the method of the invention
Embodiment
The invention is now further described with reference to the accompanying drawing:
The embodiment uses two English text documents from the Reuters-21578 corpus. The hardware environment is a P4 3.0 GHz CPU, 512 MB of RAM and an 80 GB hard disk, running Windows XP Professional with the NTFS file system. Semantic distances are computed with Perl tools and WordNet 2.1, the virtual item similarity distance is assigned by method 2, and the results are stored; the main program is implemented in VC++ 6.0.
1. Preprocessing. After stop words are removed, each document is represented as an n-dimensional vector according to the VSM model: left document $A = \{(t_{A,1}, w_{A,1}), (t_{A,2}, w_{A,2}), \ldots, (t_{A,N}, w_{A,N})\}$ and right document $B = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \ldots, (t_{B,N}, w_{B,N})\}$, where t is a feature item (which may be a word group, a phrase, a word, etc.; usually a word) and w is the tf-idf weight of item t.
2. Compute the document widths and the document width difference.
The widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ of documents A and B are the sums of the tf-idf weights of their respective feature items.
The difference between the widths of documents A and B is denoted $W_{AB}$, with $W_{AB} = \left|\, \|A\|_{tfidf} - \|B\|_{tfidf} \,\right|$;
3. Compute the similarity distances between feature items. The word similarity distances can be computed by calling the Lesk algorithm in the WordNet-based word similarity distance Perl tool. Store the results in the similarity distance record list $\{d_{i,j}\}$.
4. Build the virtual item. The virtual item is $(t_v, w_v)$ with $w_v = W_{AB}$; here $t_v$ has no actual lexical meaning. The similarity distance between the virtual item and the other items is denoted $d_{iv}$ and is assigned by method 2, i.e. the mean $\bar{d} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}$, where $w_{A,i} \ne 0$ and $w_{B,j} \ne 0$. Store $d_{iv}$ in the similarity distance record list $\{d_{i,j}\}$.
5. Insert the virtual item into B. Because $\|A\|_{tfidf} \ge \|B\|_{tfidf}$, the virtual item above must be inserted into the right document B, giving $B' = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \ldots, (t_{B,N}, w_{B,N}), (t_v, w_v)\}$.
6. Normalization. Let $w_{A,i}' = w_{A,i} / \max(W, U)$, $1 \le i \le N$; $w_{B,j}' = w_{B,j} / \max(W, U)$, $1 \le j \le N$; and $w_v' = w_v / \max(W, U)$. This gives $A' = \{(t_{A,1}, w_{A,1}'), (t_{A,2}, w_{A,2}'), \ldots, (t_{A,N}, w_{A,N}')\}$ and $B'' = \{(t_{B,1}, w_{B,1}'), (t_{B,2}, w_{B,2}'), \ldots, (t_{B,N}, w_{B,N}'), (t_v, w_v')\}$.
7. Simplified EMD computation. To simplify the computation of EMD(A', B''), first apply the most-similar-highest-priority criterion: select the smallest $d_{i,j}$ in the similarity distance record list $\{d_{i,j}\}$ and obtain the index pair (i, j).
For the pair (i, j) given by $d_{ij}$, take the smaller of the two weights, $\min(w_{A,i}, w_{B,j})$. If $\min(w_{A,i}, w_{B,j}) = 0$, go on to the next step; otherwise set $f_{i,j} = \min(w_{A,i}, w_{B,j})$. If $w_{A,i} \le w_{B,j}$, then $w_{B,j} = w_{B,j} - w_{A,i}$ and $w_{A,i} = 0$; otherwise ($w_{A,i} > w_{B,j}$), $w_{A,i} = w_{A,i} - w_{B,j}$ and $w_{B,j} = 0$.
Obtain the next index pair (i, j) according to the most-similar-highest-priority criterion and compute its matching value by the method above, until the weights of all feature words have been matched.
Compute the similarity distance between A and B according to formula (7).
This completes the EMD-based semantic similarity distance measurement for documents A and B.
This method resolves the failure of the EMD document semantic similarity distance algorithm to satisfy the metric axioms, and simplifies the original EMD computation. It can be used for triangle-inequality-based document indexing, for which the EMD document semantic similarity distance was previously unsuitable, and both the discriminating power and the computational efficiency of the computation are improved.
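The virtual item distance of step 4 can be sketched for both assignment rules mentioned in the description (maximum and mean). The function names, the `"<virtual>"` key and the dict-based distance list are hypothetical. The sketch also assumes that the record list holds exactly the pairs to be averaged, so dividing by its length corresponds to the $1/N^2$ factor only when all N x N pairs are present.

```python
def virtual_item_distance(dist, method="mean"):
    """Distance assigned between the virtual item and every real feature item.

    dist: dict mapping (left_term, right_term) -> similarity distance
    method: "max" for the maximum rule, anything else for the mean rule
    """
    values = list(dist.values())
    if not values:
        return 0.0  # no real pairs to derive a distance from
    if method == "max":
        return max(values)
    return sum(values) / len(values)


def add_virtual_distances(dist, left_terms, virtual="<virtual>", method="mean"):
    """Return a copy of dist extended with (left_term, virtual) entries.

    This covers the embodiment's case, where the virtual item sits in the right vector.
    """
    d_v = virtual_item_distance(dist, method)
    extended = dict(dist)
    for term in left_terms:
        extended[(term, virtual)] = d_v
    return extended
```

If the virtual item is inserted into the left vector instead, the same value would be stored under `(virtual, right_term)` keys.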

Claims (6)

1. An EMD-based method for metrizing document semantic similarity distance, characterized by the following steps:
1) preprocess the two documents in the corpus whose semantic similarity distance is to be computed: first remove stop words, then represent each document as a tf-idf term weight vector according to the VSM model, with A the left vector and B the right vector, $A = \{(t_{A,1}, w_{A,1}), (t_{A,2}, w_{A,2}), \ldots, (t_{A,N}, w_{A,N})\}$, $B = \{(t_{B,1}, w_{B,1}), (t_{B,2}, w_{B,2}), \ldots, (t_{B,N}, w_{B,N})\}$, where t is a feature item and w is the tf-idf weight of item t; the feature item is a word group, a phrase or a word;
2) for the left vector A and the right vector B, compute the document widths $\|A\|_{tfidf}$ and $\|B\|_{tfidf}$ and the document width difference $W_{AB}$, where $\|A\|_{tfidf}$ is the sum of the tf-idf weights of the feature items of A, $\|B\|_{tfidf}$ is the sum of the tf-idf weights of the feature items of B, and $W_{AB} = \left|\, \|A\|_{tfidf} - \|B\|_{tfidf} \,\right|$;
3) call the Lesk algorithm in the WordNet-based word similarity distance Perl tool to compute the similarity distances between the feature items with non-zero weight in the left and right document vectors, and store them in a similarity distance record list;
4) define the weight of the virtual item and its similarity distance to the other feature items, and write these distances into the record list of step 3; the weight of the virtual item equals the width difference of the left and right document vectors obtained in step 2; the similarity distance between the virtual item and the other feature items is the maximum of the similarity distances between the feature items of the left and right document vectors;
5) if the document widths of the left and right vectors are unequal, i.e. the document width difference is not 0, a virtual item must be inserted: if the left vector is wider than the right vector, insert the virtual item built in step 4 into the right vector; otherwise insert it into the left vector;
6) after inserting the virtual item, normalize the document vectors: divide each weight in a document vector by the total weight of that vector and replace the original weight with the quotient, so that the new left and right document vectors each have total weight 1;
7) carry out the simplified EMD computation according to the full matching criterion and the most-similar-highest-priority criterion, wherein the most-similar-highest-priority criterion is: when computing the document similarity distance, the item pair with the smallest similarity distance is given the highest matching priority, and the matching priorities of the other pairs decrease as the similarity distance between their items increases; and the full matching criterion is: item pairs are matched fully, and only when one weight exceeds the other does the remaining weight need to be matched again.
2. The EMD-based method for metrizing document semantic similarity distance according to claim 1, characterized in that the document width is defined as follows: let X be the set of feature items of a document vector and x a feature item; given a mapping $M: x \to \mathbb{R}^{+} \cup \{0\}$, $x \in X$, $M(x)$ is called the distribution value of x under the distribution M, and $\sum_{x \in X} M(x)$ is the document width of X under the distribution M, denoted $\|X\|_M$; when $X = \varnothing$, $\|X\|_M = 0$.
3. The EMD-based method for metrizing document semantic similarity distance according to claim 1, characterized in that the document width difference is the difference between the widths of the left and right document vectors, and this value is non-negative.
4. The EMD-based method for metrizing document semantic similarity distance according to claim 1, characterized in that the similarity distance between the virtual item and the other feature items is the mean of the similarity distances between the feature items of the left and right document vectors.
5. The EMD-based method for metrizing document semantic similarity distance according to claim 1, characterized in that the most-similar-highest-priority criterion is: when computing the document similarity distance, the item pair with the smallest similarity distance is given the highest matching priority, and the matching priorities of the other pairs decrease as the similarity distance between their items increases.
6. The EMD-based method for metrizing document semantic similarity distance according to claim 1, characterized in that the full matching criterion is: item pairs are matched fully, and only when one weight exceeds the other does the remaining weight need to be matched again.
CN200810018390XA 2008-06-05 2008-06-05 Document meaning similarity distance metrization method based on EMD Expired - Fee Related CN101286159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810018390XA CN101286159B (en) 2008-06-05 2008-06-05 Document meaning similarity distance metrization method based on EMD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810018390XA CN101286159B (en) 2008-06-05 2008-06-05 Document meaning similarity distance metrization method based on EMD

Publications (2)

Publication Number Publication Date
CN101286159A CN101286159A (en) 2008-10-15
CN101286159B true CN101286159B (en) 2010-06-23

Family

ID=40058370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810018390XA Expired - Fee Related CN101286159B (en) 2008-06-05 2008-06-05 Document meaning similarity distance metrization method based on EMD

Country Status (1)

Country Link
CN (1) CN101286159B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750315A (en) * 2012-04-25 2012-10-24 北京航空航天大学 Rapid discovering method of conceptual relations based on sovereignty iterative search

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104833535A (en) * 2015-05-15 2015-08-12 西南交通大学 Railway vehicle tire tread scratch detection method
US11176186B2 (en) 2020-03-27 2021-11-16 International Business Machines Corporation Construing similarities between datasets with explainable cognitive methods
CN113239222B (en) * 2021-01-19 2023-10-31 佳木斯大学 Image retrieval method based on image information extraction and EMD distance improvement

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750315A (en) * 2012-04-25 2012-10-24 北京航空航天大学 Rapid discovering method of conceptual relations based on sovereignty iterative search
CN102750315B (en) * 2012-04-25 2016-03-23 北京航空航天大学 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right

Also Published As

Publication number Publication date
CN101286159A (en) 2008-10-15

Similar Documents

Publication Publication Date Title
CN103617157B (en) Based on semantic Text similarity computing method
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN102722531B (en) Query method based on regional bitmap indexes in cloud environment
US7401073B2 (en) Term-statistics modification for category-based search
US20120203717A1 (en) Learning Similarity Function for Rare Queries
US20090307209A1 (en) Term-statistics modification for category-based search
CN110569289B (en) Column data processing method, equipment and medium based on big data
CN101286159B (en) Document meaning similarity distance metrization method based on EMD
CN102156728B (en) Improved personalized summary system based on user interest model
CN106372122A (en) Wiki semantic matching-based document classification method and system
US20090182797A1 (en) Consistent contingency table release
CN104036051A (en) Database mode abstract generation method based on label propagation
CN110175184A (en) A kind of lower drill method, system and the electronic equipment of data dimension
CN103714154A (en) Method for determining optimum cluster number
Zhou et al. Efficient approaches to k representative g-skyline queries
CN104951562A (en) Image retrieval method based on VLAD (vector of locally aggregated descriptors) dual self-adaptation
Sargolzaei et al. Pagerank problem, survey and future research directions
Goyal et al. Lossy conservative update (LCU) sketch: Succinct approximate count storage
CN109614074A (en) Approximate adder reliability degree calculation method based on probability transfer matrix model
CN103440292A (en) Method and system for retrieving multimedia information based on bit vector
CN105956012A (en) Database mode abstract method based on graphical partition strategy
CN110442678A (en) A kind of text words weighing computation method and system, storage medium and terminal
CN104731889A (en) Query result size estimation method
CN109408514A (en) A kind of water conservancy census data method for digging based on closure segment cube
CN112765960B (en) Text matching method and device and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100623

Termination date: 20130605