CN101763405A

CN101763405A - Approximate character string searching technology based on synonym rule

Info

Publication number: CN101763405A
Application number: CN200910222333A
Authority: CN
Inventors: 陆嘉恒
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-11-16
Filing date: 2009-11-16
Publication date: 2010-06-30

Abstract

The invention belongs to the field of information technology and relates to approximate string matching, in particular to a new approximate string search technique based on synonym rules. The method uses synonym rules to extend two classes of traditional similarity functions. The first extension modifies the functions based on edit operations: it introduces a synonym-based substring replacement operation, performs a bounded number of replacements of approximately matching substrings, and computes a new edit distance, enabling efficient approximate search that exploits synonym information. The second extension modifies the functions based on the number of common grams and tokens: it uses the synonym rules to expand the gram set of a string, compares the expanded gram set with the gram sets of the strings in the search collection, and judges the similarity between the query string and each candidate string by the number of grams they share, again enabling efficient approximate string search.

Description

Approximate character string search technique based on the synonym rule

Technical field

The present invention relates to approximate string search in applications such as information retrieval and data querying, and in particular to an approximate string search technique for strings that differ in written form but semantically refer to the same entity.

Background technology

Many applications need approximate string search in order to find strings that are written differently but refer to the same real-world entity. Various string similarity functions exist, such as the Levenshtein (edit) distance, Hamming distance, cosine measure, Jaccard coefficient, Dice similarity, and BM25. These functions fall into two classes. One class is based on edit operations, such as edit distance and Hamming distance. The other class is based on the number of words or grams that two strings have in common, such as the cosine measure, Jaccard coefficient, and Dice similarity. There is also work that extends the traditional edit distance, for example by adding a "move" operation or by supporting substring replacement within the edit distance.

The edit distance is the minimum number of unit-cost operations needed to transform one string into the other, where the operations are insertion, deletion, and substitution of a single character. For strings of lengths m and n it can be computed by dynamic programming in O(mn) time, and the Four-Russians technique reduces this to O(mn/log n). Functions based on edit distance judge the similarity of two strings by the number, or cost, of the edit operations. Several extensions of the edit distance have also been proposed, such as a transposition operation that swaps adjacent characters, a unit-cost move operation that relocates an arbitrary substring, and the affine edit distance, which extends the classical edit distance to allow prefix abbreviations. In practice these techniques describe only special cases such as prefix abbreviations and cannot handle synonym replacement in general.
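As a point of reference for the O(mn) dynamic program mentioned above, the following is a minimal Python sketch of the classical edit distance. The function name `edit_distance` is chosen here for illustration and does not come from the patent.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical edit (Levenshtein) distance, computed row by row in O(|s|*|t|) time."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]                           # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("Conferece", "Conference"))  # one missing letter
```

Only two rows of the table are kept at a time, which is enough when just the final distance is needed.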

The similarity functions above provide only a low-level similarity criterion and cannot accurately judge the similarity of two strings that are related under stricter, semantic conditions. Approximate string search built on them therefore cannot effectively find strings that are written differently but are semantically similar or refer to the same entity. Many applications, however, come with synonym information: strings that are considered equivalent in a particular application, or that refer to the same entity even though their written forms may differ completely. Such synonyms can be very different orthographically yet carry the same meaning. When strings containing synonyms are matched approximately, the synonym information reduces the difference between them, so the resulting edit distance is smaller than the one produced by a traditional edit-distance function, which makes it easier to retrieve strings that are semantically similar or identical.

Summary of the invention

Traditional algorithms for approximate string search have difficulty recognizing strings that differ greatly in written form but are semantically similar. To overcome this, the present invention uses synonym rules to improve approximate string search and proposes methods that extend the two classes of traditional similarity functions.

First, the invention uses synonym rules to improve the similarity function based on edit distance, giving an approximate string search technique based on synonym rules and edit distance. This step A mainly comprises: A1, preprocessing the strings; A2, performing the approximate search. It is characterized in that the approximate search is carried out only after the strings have been preprocessed.

In this aspect, the preprocessing of the strings in step A1 comprises: A1-1, computing the gram set G(s) of each string s in S; A1-2, using the synonym rules to expand G(s) into a new set EG(s). It is characterized in that the synonym rules are integrated into the strings, so that the traditional gram set of a string is expanded with synonym information.

In this aspect, the expansion of G(s) in step A1-2 can be described as follows: predefine a threshold δ, search the synonym rule set P for rules <a, b> that satisfy ed(s[i, j], a)/|a| ≤ δ, and insert the grams of string b into G(s). In this aspect, the approximate search of step A2 comprises: A2-1, for each gram g of the query string t, retrieving the inverted list of g and finding the candidate strings s that have T grams in common with t; A2-2, for each candidate string s obtained in A2-1, computing its sbed(s, t, P, δ) value with respect to the query string t (the edit distance between the two strings that includes the synonym-based substring replacement operation) and removing the non-matching strings according to this value. It is characterized in that a set of candidate strings is obtained first, and the matching strings are then found using the edit distance that includes the synonym-based substring replacement operation.

In this aspect, step A2-2 can be described as follows:

A2-2-1: given a predefined threshold δ (the minimum similarity that a substring must have with a string in a synonym rule for the substring replacement operation to be applied), find a synonym rule <a, b> in P and check whether the condition ed(s[i1, i2], a)/|a| ≤ δ holds;

A2-2-2: use the synonym rules that satisfy the condition in step A2-2-1 to perform synonym-based substring replacement on string s, that is, replace a substring of the string with the corresponding string from the synonym rule; this operation has unit cost;

A2-2-3: the sbed(s, t, P, δ) value is the cost of transforming s into t using substring replacements under the synonym rules P together with the traditional edit operations;

A2-2-4: remove from the data set the strings whose sbed(s, t, P, δ) value computed in step A2-2-3 does not meet the requirement; the remaining strings are the approximate matches. It is characterized in that the synonym rules and strings eligible for substring replacement are filtered first, that a unit-cost substring replacement operation integrating the synonym rules is introduced, and that an edit distance including this substring replacement operation is computed.

Second, the invention uses synonym rules to improve the similarity functions based on the number of common grams, giving an approximate string search technique based on synonym rules and the number of common grams. This step B mainly comprises three steps: B1, generating an inverted list for each gram of the strings s in the data set, and recursively expanding each string using the synonym rules; B2, expanding the gram set of the query string t by the same method as in B1; B3, performing approximate search over the strings in the data set. It is characterized in that the approximate search is carried out only after the gram sets of both the data strings and the query string have been recursively expanded.

Step B1 further comprises: B1-1, predefining an expansion similarity threshold γ, searching the synonym rule set P for rules <a, b> that satisfy similarity(s', a) ≥ γ, where s' is a substring of s, and adding all grams of a and b to the gram set of s; B1-2, continuing to add grams from synonym rules to the gram set of s until no new synonym rule is applicable. It is characterized in that synonym rules that are sufficiently similar to the string are used to expand its gram set, and that this process is recursive.

Step B3 further comprises:

B3-1, obtaining the candidate strings by solving the T-overlap problem;

B3-2, removing the non-matching strings by computing the actual similarity values (sbjc/sbcs/sbdc, denoting the synonym-based Jaccard coefficient, cosine similarity, and Dice similarity, respectively). It is characterized in that candidate strings meeting a certain similarity are obtained first, and only for these candidates is the actual similarity to the query string computed, yielding the matching approximate strings.

Step B3-2 can be further described as follows: the similarity of strings s and t is computed from the expanded gram sets EGP(s, P, γ) and EGP(t, P, γ) obtained in steps B1 and B2. It is characterized in that gram sets expanded with synonym rules are used to compute the similarity between strings.

The beneficial effect of the invention is that it remedies the inability of traditional similarity functions to recognize strings that differ in written form but are semantically identical, so that approximate string search returns more results that are semantically similar and meaningful, and an efficient approximate string search function can be realized.

Description of drawings

Fig. 1: Flowchart of approximate string search using synonym rules with the edit-distance-based function.

Fig. 2: Computing the sbed value of two similar strings by dynamic programming; the entries marked in bold red show the path that leads to the final result.

Fig. 3: Improvement of the approximate string search technique based on the number of common grams using synonym rules.

Fig. 4: Three synonym rules used to expand the gram set of a string.

Fig. 5: A rule <a, b> can be applied to string s if s and a share enough string fragments (that is, count ≥ |GP(a)| × γ) and all of these fragments lie within one range (the maximum distance is at most |GP(a)|/γ).

Fig. 6: The rule <a, b> can be applied to string s, and GP'(s) is replaced by GP(b). After position |GP'(s)| + pmin, the positions of the string fragments are shifted by |GP(b)| - |GP'(s)|.

Fig. 7: An example illustrating selective expansion of string fragments.

Fig. 8: (a-c) performance of each class of functions; (d-f) lookup time for computing nsbed compared with the time for computing the dynamic programming table.

Fig. 9: Number of recursions of gram-set expansion in the algorithm that computes the sbjc/sbcs/sbdc values.

Fig. 10: (a-c) search performance of each class of functions; (d-f) number of false positives per query on the three data sets.

Embodiment

For a fuller understanding of the invention and its advantages, the invention is described in further detail below with reference to the drawings and specific embodiments.

I. For clarity, the following definitions are introduced first.

Synonym tuple: a synonym tuple (or synonym rule) is a pair of strings of the form <a1, a2>, where a1 and a2 both denote the same entity in the real world.

For example, the following are synonym tuples:

<William, Bill>;

<car, automobile>;

<Very Large Data Bases, VLDB>.

Let P be the set of synonym tuples for one or more domains, such as conference titles, personal names, and place names. Conceptually, the two strings in each tuple are symmetric: for every tuple <a1, a2> ∈ P there is a corresponding tuple <a2, a1> ∈ P. Common abbreviations, acronyms, and shortened forms of strings can all be expressed as synonym tuples.

Gram: let Σ be an alphabet. For a string s over Σ, |s| denotes the length of s, s[i] denotes the i-th character of s (counting from 1), and s[i, j] denotes the substring from the i-th to the j-th character. Given a string s and a positive integer q, a string fragment of length q cut from s is expressed as a pair (g, p), where g is the fragment that starts at the p-th character of s, i.e. g = s[p, p+q-1]. The set of string fragments of s is denoted GP(s, q).

For example, with q = 3 and s = university, GP(s, q) = {(uni, 1), (niv, 2), (ive, 3), (ver, 4), (ers, 5), (rsi, 6), (sit, 7), (ity, 8)}. In addition, when a string fragment is mentioned without explicitly stating its length, the fragment set of s (with the length left unspecified) is written G(s).
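The definition of GP(s, q) can be sketched in a few lines of Python. The function name `positional_qgrams` is an illustrative choice, and the 1-based positions follow the convention of the definition above.

```python
def positional_qgrams(s: str, q: int) -> list[tuple[str, int]]:
    """Return GP(s, q): every length-q fragment of s paired with its 1-based start position."""
    return [(s[p:p + q], p + 1) for p in range(len(s) - q + 1)]


for gram, pos in positional_qgrams("university", 3):
    print(gram, pos)
```

A string of length |s| yields |s| - q + 1 fragments, and a string shorter than q yields none.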

sbed(s, t, P, δ): given a synonym rule set P, the synonym-based edit distance between two strings s and t is the minimum cost of transforming s into t using insertion, deletion, substitution, and substring replacement operations. Insertion, deletion, substitution, and substring replacement all have unit cost. The synonym-based edit distance is written sbed(s, t, P, δ), or simply sbed(s, t) when P and δ are clear from the context.

Substring similarity threshold δ: the minimum similarity that a substring must have with a string in a synonym rule for the substring replacement operation to be applied.

Expansion similarity threshold γ: given a query string s, a synonym rule <a, b> in P is selected if there exists a substring s' of s such that similarity(s', a) ≥ γ.

II. The concrete implementation steps of the invention are described in detail below:

Step A: search for approximate strings using the edit-distance-based similarity function improved with synonym rules.

The concrete steps of step A are described below with reference to Fig. 1:

Step A1: preprocess the strings in the string set S.

Step A1-1: compute the gram set G(s) of each string s in S. This set is obtained by sliding a window of q characters over s, so GP(s, q) contains |s| - q + 1 string fragments. The gram set G(t) of the query string t is obtained in the same way.

Step A1-2: use the synonym rules to expand G(s) into a new set EG(s).

Predefine a threshold δ, search the synonym rule set P for rules <a, b> that satisfy ed(s[i, j], a)/|a| ≤ δ, and insert the grams of string b into G(s).

Step A2: obtain the candidate strings that meet a certain degree of similarity, then compute the similarity between each candidate string and the query string from the sbed value.

Step A2 can be further decomposed into the following steps:

Step A2-1: build an inverted list for every gram that occurs in the data set; the elements of a gram's inverted list are the strings in which that gram occurs. Then, using the gram set G(t) of the query string obtained in step A1, retrieve the inverted list of each gram in G(t) and find in the lists the candidate strings s that have T grams in common with the query string t, where T is a predefined threshold. This is known as the T-overlap problem.

The T-overlap problem: let G(s) be the set of q-grams of string s and let δ be a distance threshold (0 ≤ δ ≤ 1). Using the inverted lists, find the set of strings a for which at least Ta grams of a occur in G(s), where Ta = |a| - q + 1 - δ|a|q.

The problem thus becomes one of finding, among these strings, those that are sufficiently similar to a substring of s. The T-overlap problem is solved as follows:

First, build an inverted-list index over the grams of the strings in the synonym rules: for each gram g, the list consists of the ids of all synonym rules that contain g.

Then, let G(s) be the set of q-grams of string s and let δ be a distance threshold, 0 ≤ δ ≤ 1.

Finally, find the set of strings a whose grams occur at least Ta times in G(s), where Ta = |a| - q + 1 - δ|a|q.
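The three steps above can be sketched as follows. This is an illustrative Python sketch rather than the patented implementation: the function names are invented here, the index maps each gram to the ids of the rule strings that contain it, and shared grams are counted as distinct grams, which is a simplifying assumption.

```python
from collections import defaultdict


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings: list[str], q: int) -> dict[str, set[int]]:
    """Map every gram to the ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, text in enumerate(strings):
        for g in qgrams(text, q):
            index[g].add(sid)
    return index


def t_overlap_candidates(s: str, strings: list[str], index, q: int, delta: float) -> list[str]:
    """Return the strings a sharing at least Ta = |a| - q + 1 - delta*|a|*q grams with s."""
    counts = defaultdict(int)
    for g in set(qgrams(s, q)):            # assumption: count distinct grams of s
        for sid in index.get(g, ()):
            counts[sid] += 1
    result = []
    for sid, shared in counts.items():
        a = strings[sid]
        t_a = len(a) - q + 1 - delta * len(a) * q
        if shared >= t_a:
            result.append(a)
    return result


rule_strings = ["World Wide Web", "WWW", "Conf", "Conference"]
index = build_inverted_index(rule_strings, 3)
print(sorted(t_overlap_candidates("WWW Conferece", rule_strings, index, 3, 0.1)))
```

Strings that share no gram with s never appear in the counts, so they are never returned even when Ta is zero or negative; a fuller implementation would have to decide how to treat that case.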

Step A2-2: for each candidate string s obtained in A2-1, compute its sbed(s, t, P, δ) value with respect to the query string t (the edit distance between the two strings that includes the synonym-based substring replacement operation), and remove the non-matching strings according to this value.

Step A2-2 can be further decomposed as follows:

Step A2-2-1: given the predefined threshold δ, find a synonym rule <a, b> in P and check whether string s satisfies ed(s[i1, i2], a)/|a| ≤ δ and whether string t satisfies ed(t[j1, j2], b)/|b| ≤ δ.

Step A2-2-2: perform synonym-based substring replacement on the strings s that satisfy the condition in step A2-2-1, that is, replace a substring of the string with the corresponding string from the synonym rule; this operation has unit cost.

Step A2-2-3: compute the sbed(s, t, P, δ) value efficiently.

Initialization: define a matrix M[0..|s|, 0..|t|] with M[i, j] = sbed(s[1, i], t[1, j]) (the final goal is to compute M[|s|, |t|], i.e. sbed(s, t)). For each integer 0 ≤ i ≤ |s|, define M[i, 0] = i; for each integer 0 ≤ j ≤ |t|, define M[0, j] = j.

For each pair of integers 1 ≤ i ≤ |s| and 1 ≤ j ≤ |t|, M[i, j] is defined by a recursive function, and the values of the matrix M are computed from it.

The first three cases of the recursive function correspond to the traditional edit operations: inserting, deleting, and substituting a character of the strings that index the matrix M. The fourth case corresponds to the synonym-based substring replacement operation. The first three cases are computed directly from the definition of the recursive function. The fourth case requires searching the substrings of the two strings: find substrings s[m, i] of s and t[n, j] of t and a synonym rule tuple <a, b> ∈ P such that ed(s[m, i], a)/|a| ≤ δ and ed(t[n, j], b)/|b| ≤ δ.

Approximate string matching is performed against the strings on both sides of the synonym rules. For string s and the left-hand strings of the rules, find a substring s[m, i] of s and a rule <a, b> such that ed(s[m, i], a)/|a| ≤ δ. The same operation is then performed for string t and the right-hand strings of the rules. Finally, the two steps above are repeated with the sides of the rules exchanged, because if <a, b> ∈ P then <b, a> ∈ P.

Example 1 (see Fig. 2). Fig. 2 shows how the matrix is filled while computing the sbed value of the strings "World Wide Web Conf" and "WWW Conferece". Note that "Conferece" contains a typo. With the two synonym tuples <World Wide Web, WWW> and <Conf, Conference>, their sbed value is 3.
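One possible reading of the dynamic program of step A2-2-3 is sketched below in Python. The text does not spell out how the fourth case is charged, so the sketch assumes that a replacement through a rule <a, b> costs 1 plus the edit distances ed(s[m, i], a) and ed(t[n, j], b) of the two approximate matches. This assumption is chosen because it reproduces the value 3 of Example 1; it is not confirmed by the source, and all function names are illustrative.

```python
def edit_distance(x, y):
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def approx_spans(text, pattern, delta):
    """Spans (start, end, d) with d = ed(text[start:end], pattern) and d/|pattern| <= delta."""
    spans, limit = [], delta * len(pattern)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if abs((end - start) - len(pattern)) > limit:
                continue  # the length gap alone already exceeds the budget
            d = edit_distance(text[start:end], pattern)
            if d <= limit:
                spans.append((start, end, d))
    return spans


def sbed(s, t, rules, delta):
    pairs = set(rules) | {(b, a) for a, b in rules}   # P is symmetric
    ops = {}  # (end in s, end in t) -> [(start in s, start in t, cost)]
    for a, b in pairs:
        t_spans = approx_spans(t, b, delta)
        for ms, es, ds in approx_spans(s, a, delta):
            for mt, et, dt in t_spans:
                # assumed cost: unit replacement plus the residual edits
                ops.setdefault((es, et), []).append((ms, mt, 1 + ds + dt))
    M = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        M[i][0] = i
    for j in range(len(t) + 1):
        M[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            best = min(M[i - 1][j] + 1,                            # deletion
                       M[i][j - 1] + 1,                            # insertion
                       M[i - 1][j - 1] + (s[i - 1] != t[j - 1]))   # substitution or match
            for m, n, cost in ops.get((i, j), []):                 # substring replacement
                best = min(best, M[m][n] + cost)
            M[i][j] = best
    return M[len(s)][len(t)]


rules = [("World Wide Web", "WWW"), ("Conf", "Conference")]
print(sbed("World Wide Web Conf", "WWW Conferece", rules, 0.2))
```

Under this assumption the first rule contributes 1, and the second contributes 1 for the replacement plus 1 for the typo in "Conferece". The threshold δ = 0.2 is an arbitrary choice for the demonstration.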

Step A2-2-4: remove from the data set the strings whose sbed(s, t, P, δ) value computed in step A2-2-3 does not meet the requirement. That is, the edit distance computed with the edit operations that include synonym-based substring replacement is compared with the predefined minimum similarity that s and t must satisfy; if the requirement is not met, the string s is removed from the data set. The remaining strings are the approximate matches.

Step B: search for approximate strings using the improved similarity functions based on the number of common grams.

The concrete implementation of step B is described below with reference to Fig. 3:

Step B1: generate an inverted list for each gram of the strings s in the data set, and recursively expand the gram set of each string using the synonym rules.

Step B1-1: predefine an expansion similarity threshold γ, search the synonym rule set P for rules <a, b> that satisfy similarity(s', a) ≥ γ, where s' is a substring of s, and add all grams of a and b to the gram set of s, yielding the expanded gram set EGP(s, P, γ) of s.

Step B1-1-1: find the synonym rules in P that can be used to expand the gram set of string s.

A synonym rule <a, b> ∈ P can be applied to expand a string s under the following condition. Given a string s, let GP(s) = {(g, p)} denote the set of string fragments of s, and let γ be the user-defined expansion similarity threshold. The synonym rule can be applied to s if there exists a subset

GP'(s) \subseteq GP(s)

such that

|GP'(s) \cap GP(a)| / |GP(a)| \geq \gamma    (1)

and, for all (g_1, p_1), (g_2, p_2) \in GP'(s),

|p_1 - p_2| \leq |GP(a)| / \gamma.    (2)

As shown in Fig. 5, the first condition states that GP'(s) and GP(a) must share enough identical string fragments. The second condition states that, because string fragments that are far apart in GP(s) should not be considered together, the fragments in GP'(s) must lie within a range determined by a and γ.

Using the definition above, the two conditions are evaluated for each synonym rule, and the rules <a, b> in P that can be applied to string s are found accordingly.

Example 2 (see Fig. 4). Consider the three rules in Fig. 4 and the string s = "Intl Conf on Management of Data", with fragment length q = 3 and γ = 0.9. First, there exists GP'(s) = {(Int, 1), (ntl, 2)} with |GP'(s) ∩ GP(Intl)| / |GP(Intl)| = 1, and any two fragments (g1, p1), (g2, p2) ∈ GP'(s) satisfy |p1 - p2| ≤ |1 - 2| ≤ 2/0.9, so rule r1 can be applied to string s. Similarly, rule r2 can also be applied to s.
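The applicability test of conditions (1) and (2) can be sketched as below. This is a simplified Python illustration: `rule_applies` is a hypothetical name, grams are compared by their text only, and |GP(a)| is taken as the number of distinct grams of a, which are assumptions made here rather than details stated in the source.

```python
def positional_qgrams(s: str, q: int) -> list[tuple[str, int]]:
    return [(s[p:p + q], p + 1) for p in range(len(s) - q + 1)]


def rule_applies(s: str, a: str, q: int, gamma: float) -> bool:
    """Check whether some subset GP'(s) satisfies conditions (1) and (2) for rule side a."""
    grams_a = {g for g, _ in positional_qgrams(a, q)}
    if not grams_a:
        return False                       # a is shorter than q: nothing to match
    hits = [(g, p) for g, p in positional_qgrams(s, q) if g in grams_a]
    max_span = len(grams_a) / gamma        # condition (2): positions within |GP(a)|/gamma
    for i, (_, p_lo) in enumerate(hits):
        window = {g for g, p in hits[i:] if p - p_lo <= max_span}
        if len(window) / len(grams_a) >= gamma:   # condition (1): enough shared grams
            return True
    return False


s = "Intl Conf on Management of Data"
print(rule_applies(s, "Intl", 3, 0.9), rule_applies(s, "Conf", 3, 0.9))
```

Each hit is tried as the leftmost fragment of GP'(s), so the window check covers every candidate subset whose positions fit inside the allowed range.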

Step B1-1-2: for the strings a and b of each synonym rule obtained in step B1-1-1, add their grams to GP(s) to obtain GP'(s).

This process is illustrated in Fig. 6. When rules are applied repeatedly and the set GP(s) grows as grams are added, the second inequality above must still be checked for the newly added fragments, so a position has to be computed for every fragment that is newly inserted into GP(s) from string a. Suppose the position of g in a is pa and the smallest position in GP'(s) is pmin; then the new position of g in GP(s) is pa + pmin. In addition, after the synonym rule <a, b> has been applied, all fragments after position |GP'(s)| + pmin are shifted by |GP(b)| - |GP'(s)| positions. If |GP(b)| > |GP'(s)|, these fragments move to the right; otherwise they move to the left.

Step B1-1-3: finally, repeat steps B1-1-1 and B1-1-2 until no synonym rule that satisfies the conditions remains.

Note that the fragment expansion in this process is recursive. Example 3 below illustrates the recursive expansion.

Example 3. The expansion of the gram set is described in detail here, continuing Example 2. After the rule <Intl, International> has been applied, the newly added fragments are (nte, 2), (ter, 3), (ern, 4), (rna, 5), (nat, 6), (ati, 7), (tio, 8), (ion, 9), (ona, 10), and (nal, 11). All fragments after "Intl" are shifted by 13 - 4 = 9 positions and added to the expanded fragment set. Similarly, after the rule <Conf, Conference> has been applied, since the fragment "Conf" starts at the 5th position in s, the newly added fragments are (nfe, 7), (fer, 8), (ere, 9), (ren, 10), (enc, 11), and (nce, 12). The fragments after "Conf" are shifted right by 10 - 4 = 6 positions. GP'(s) therefore contains all the fragments from the rules <Intl, International> and <Conf, Conference>.

Step B1-2: continue adding grams from synonym rules to the gram set of s until no new synonym rule is applicable.
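The recursive expansion of step B1 can be pictured as a fixed-point loop. The Python sketch below is deliberately simplified: it ignores fragment positions (and therefore condition (2) and the position shifts), and it treats a rule as applicable when the fraction of the grams of a already present in the set reaches γ. The name `expand_grams` and these simplifications are assumptions made for illustration.

```python
def gram_set(s: str, q: int) -> set[str]:
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def expand_grams(s: str, rules: list[tuple[str, str]], q: int, gamma: float) -> set[str]:
    """Repeatedly add the grams of applicable rules until no new rule applies."""
    grams = gram_set(s, q)
    applied = set()
    changed = True
    while changed:
        changed = False
        for a, b in rules:
            if (a, b) in applied:
                continue
            grams_a = gram_set(a, q)
            if grams_a and len(grams & grams_a) / len(grams_a) >= gamma:
                grams |= grams_a | gram_set(b, q)   # add all grams of a and b
                applied.add((a, b))
                changed = True                      # newly added grams may enable more rules
    return grams


rules = [("Intl", "International"), ("Conf", "Conference")]
expanded = expand_grams("Intl Conf", rules, 3, 0.9)
print(sorted(expanded))
```

Because every rule is applied at most once, the loop terminates after at most as many passes as there are rules.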

Step B2: expand the gram set GP(t) of the query string t by the same method as in B1.

Step B3: perform approximate search over the strings in the data set.

Step B3-1: obtain the candidate strings by solving the T-overlap problem. This step is similar to step A2-1 of scheme A and is not described again here.

Step B3-2: remove the non-matching strings by computing the actual similarity values (sbjc/sbcs/sbdc, denoting the synonym-based Jaccard coefficient, cosine similarity, and Dice similarity, respectively).

Given two strings s and t, a set P of synonym rule tuples, and an expansion similarity threshold γ, let R = EGP(s, P, γ) and S = EGP(t, P, γ) denote the fragment sets obtained by selective fragment expansion. The synonym-based Jaccard similarity (abbreviated "sbjc"), the synonym-based cosine similarity (abbreviated "sbcs"), and the synonym-based Dice similarity (abbreviated "sbdc") between the two strings are defined as follows:

sbjc(s, t, P, \gamma) = |R \cap S| / |R \cup S|

sbcs(s, t, P, \gamma) = |R \cap S| / \sqrt{|R| \cdot |S|}

sbdc(s, t, P, \gamma) = 2 |R \cap S| / (|R| + |S|)

Example 4. Let s = "abcd" and t = "efgh". The six synonym rules in Fig. 7 form a graph with three connected components. Rules r1, r3, r4, and r6 can be applied directly to the strings s and t. However, rules r1 and r3 are applied selectively to s and t, respectively, because CID(r1) = CID(r3) = 1. With fragment length q = 2 we therefore have |GP(s)| = |GP(t)| = 3 and |EGP(s)| = |EGP(t)| = 6, so sbjc(s, t) = 0.5, sbcs(s, t) = 0.67, and sbdc(s, t) = 0.67. Note that Jaccard(s, t) = Cosine(s, t) = Dice(s, t) = 0. The proposed similarity functions thus apply the rules correctly and reflect the semantic similarity of s and t.
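The three set-based measures are easy to check numerically. In the Python sketch below, the expanded sets R and S are hypothetical: the rules of Fig. 7 are not reproduced in the text, so the sets are merely chosen to have the cardinalities of Example 4 (six grams each, four in common).

```python
import math


def sbjc(R: set, S: set) -> float:
    """Jaccard coefficient of the expanded gram sets."""
    return len(R & S) / len(R | S) if R | S else 0.0


def sbcs(R: set, S: set) -> float:
    """Cosine similarity of the expanded gram sets."""
    return len(R & S) / math.sqrt(len(R) * len(S)) if R and S else 0.0


def sbdc(R: set, S: set) -> float:
    """Dice similarity of the expanded gram sets."""
    return 2 * len(R & S) / (len(R) + len(S)) if R or S else 0.0


# Hypothetical expanded sets with |R| = |S| = 6 and |R ∩ S| = 4.
R = {"ab", "bc", "cd", "ef", "xy", "yz"}
S = {"ef", "fg", "gh", "ab", "xy", "yz"}
print(round(sbjc(R, S), 2), round(sbcs(R, S), 2), round(sbdc(R, S), 2))
```

With these cardinalities the functions give 4/8, 4/6, and 8/12, which agree with the values 0.5, 0.67, and 0.67 reported in Example 4.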

Whether a candidate string s matches the string t is decided from the similarity values computed above.

III. Experimental results

1. Preparation:

The experiments evaluate the performance of the extended similarity functions, the time needed to compute the similarities, the scalability with larger numbers of synonym rules, and the performance of the approximate search algorithm on large data sets of strings with a large number of synonym rules. All experiments were run on a dual-core computer with a 2.13 GHz CPU, 2 GB of memory, and a 250 GB hard disk, running Ubuntu Linux. All algorithms were implemented in C++ and compiled with the GNU compiler.

Data sets (four data sets were used: conference titles, addresses, Babel synonyms, and journal titles):

Conference titles: 724 synonym rules relating abbreviations of computer science conference titles to their full names; the abbreviations of conference titles, such as "ICDE", "SIGMOD", and "VLDB", were extracted from these synonym rules.

Addresses: 284 synonym rules commonly used for U.S. place names and personal names, such as <St, Street>, <CA, California>, and <Danny, Daniel>, together with a collection of common personal names, street names, city names, state names, and postal codes. From these data, 100,000,000 U.S. addresses were generated, each consisting of a personal name, a street name, a city name, and a postal code. To test whether the proposed functions can handle misspellings through approximate string matching, spelling errors were also added to the addresses.

Babel: this data set contains 9,136 synonym tuples of computer science abbreviations and acronyms. A set of 100,000,000 strings was generated by concatenating two words chosen at random from the synonym tuples.

Journal titles: this data set contains 46,696 synonym tuples, each consisting of the full name and the abbreviation of a journal title. In addition, there is a set of 6,164 strings of abbreviated journal titles.

2. Performance of the functions

In the first experiment, the conference name data set was used. The personal home pages of 10 computer science researchers were selected at random, their publication records were looked up, and 50 conference names were chosen. These data are a good sample of how such names are written on the Web. These names are called query records; for each one, the 350 conference names were scanned one by one to search for approximate records. For each query record Q and a given similarity function f, a predefined threshold δ is used to search for strings s such that f(Q, s) ≥ δ. For each query record, the correct conference name was found by manually browsing the conference name records. The purpose of this experiment is to determine whether approximate functions based on synonym rules can use synonym tuples to efficiently find conference names that take various forms in practice but in fact refer to the same conference.

The results returned by each approximate function were compared with the correct results. The improved approximate functions were evaluated with the following measures:

Precision = correct responses returned / responses returned;

Recall = correct responses returned / correct responses;

F-measure = 2 * precision * recall / (precision + recall).
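The three measures above can be computed directly from the set of returned records and the set of correct records. The following is a minimal sketch in Python; the function name `precision_recall_f` and the convention of returning 0.0 for empty sets are assumptions made for illustration and are not part of the original experiments.

```python
def precision_recall_f(returned, correct):
    """Return (precision, recall, F-measure) for returned vs. correct records.

    Hypothetical helper: empty inputs yield 0.0 rather than raising.
    """
    returned, correct = set(returned), set(correct)
    hits = len(returned & correct)  # correct responses that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, if four records are returned and two of them are among three correct records, precision is 2/4, recall is 2/3, and the F-measure is 4/7.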

Tables 1, 2, and 3 show the experimental results, from which the following conclusions can be drawn. (1) The original Jaccard coefficient, cosine, and Dice functions cannot use synonym rules for matching, and their matching quality is low. The two kinds of similarity functions proposed by the present invention are clearly superior and greatly improve the match quality. (2) Comparing the extended edit distance function nsbed with the extended gram-set functions sbjc/sbcs/sbdc, Tables 1 and 2 show that nsbed has higher precision but lower recall. The reason is that nsbed allows only one replacement operation on a substring, whereas the three approximate functions sbjc/sbcs/sbdc can perform repeated replacement operations on the same substring. These three functions can therefore use synonym tuples to retrieve semantically similar entities efficiently. So when the user wants high precision, nsbed can be used; when the user wants high recall, sbjc/sbcs/sbdc can be considered.

Table 1: Precision (conference names) (jc: Jaccard, cs: Cosine, dc: Dice)

Threshold	0.95	0.90	0.85	0.80	0.75	0.70	0.65
nsbed	1.00	1.00	1.00	1.00	0.88	0.88	0.88
sbjc	0.97	0.83	0.61	0.40	0.31	0.22	0.16
sbcs	0.94	0.43	0.30	0.17	0.11	0.08	0.05
sbdc	0.94	0.45	0.31	0.18	0.12	0.09	0.06
ned	1.00	1.00	1.00	1.00	0.60	0.60	0.60
jc/cs/dc	1.00	1.00	1.00	1.00	1.00	1.00	1.00

Table 2: Recall (conference names)

Threshold	0.95	0.90	0.85	0.80	0.75	0.70	0.65
nsbed	0.18	0.24	0.28	0.28	0.28	0.28	0.28
sbjc	0.58	0.60	0.62	0.62	0.72	0.78	0.82
sbcs	0.60	0.62	0.78	0.82	0.84	0.90	0.92
sbdc	0.60	0.62	0.78	0.82	0.84	0.88	0.90
ned	0.06	0.06	0.06	0.06	0.06	0.06	0.06
jc/cs/dc	0.06	0.06	0.06	0.06	0.06	0.06	0.06

Table 3: F-measure (conference names) (the best values are obtained by sbjc, sbcs, and sbdc at δ = 0.95)

Threshold	0.95	0.90	0.85	0.80	0.75	0.70	0.65
nsbed	0.18	0.24	0.28	0.28	0.28	0.28	0.28
sbjc	0.58	0.60	0.62	0.62	0.72	0.78	0.82
sbcs	0.60	0.62	0.78	0.82	0.84	0.90	0.92
sbdc	0.60	0.62	0.78	0.82	0.84	0.88	0.90
ned	0.06	0.06	0.06	0.06	0.06	0.06	0.06
jc/cs/dc	0.06	0.06	0.06	0.06	0.06	0.06	0.06

3. Computational efficiency of the functions

To evaluate the performance of the proposed approximate functions, a number of synonym rules were selected and string tuples were generated and passed to each function to compute their similarity. The last three data sets were used because they are large. For each of the three data sets, a number of synonym rules were selected at random from the existing rules. Two strings were then selected at random from the data set to form a string tuple; 10,000 string tuples were generated from these strings. All of the above approximate functions were run on each string tuple and their running times were measured. The average running time was computed over all string tuples. To measure the scalability of the algorithms, different numbers of synonym rules were used.

Fig. 8 (a-c) shows the average running time of each class of function for different numbers of synonym rules. For nsbed, the running time includes the time to perform the approximate lookup and to compute the dynamic programming table. For sbjc/sbcs/sbdc, the running time includes the time to perform the approximate string lookup and to selectively expand the string gram set, plus the time to compute the intersection and union of the final sets.

When sbjc, sbcs, and sbdc use the same structure, their performance is always the same. For the Jaccard coefficient, this time is the time to compute the union and intersection of the sets.
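The set operations shared by the three gram-based functions can be illustrated as follows. This is a minimal sketch of the standard Jaccard, cosine, and Dice coefficients on plain gram sets, without any synonym expansion; the helper `grams`, the gram length q = 2, and the handling of empty sets are assumptions made for illustration.

```python
import math

def grams(s, q=2):
    """Set of q-grams (substrings of length q) of s; q=2 is an assumed default."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cosine(a, b):
    """|a ∩ b| / sqrt(|a| * |b|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if (a and b) else 0.0

def dice(a, b):
    """2 * |a ∩ b| / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0
```

For "VLDB" and "VLDBJ" the gram sets share three grams out of four distinct ones, so the Jaccard value is 0.75, while Dice gives 6/7 and cosine gives 3/√12. All three functions need only the sizes of the two sets and of their intersection, which is consistent with their running times being the same when they share one structure.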

For the address data set, computing the nsbed value of two strings is less efficient than computing sbjc. This is because the main operations in sbjc are set union and intersection, whereas nsbed requires a dynamic programming algorithm and substring matching, and set union and intersection are cheaper.
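The cost of the dynamic programming step can be seen from the classic edit distance computation, which fills a table of size (|s|+1) × (|t|+1). The sketch below shows only the traditional edit distance; the synonym-based substring replacement that nsbed adds on top of it is not reproduced here, and the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(s, t):
    """Classic Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Every cell of the table is visited once, so the work grows with the product of the two string lengths, in contrast to the set intersection and union used by sbjc.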

For the Babel and journal data sets, where the number of synonym rules is larger, the performance of sbjc is much worse than that of nsbed. This is because sbjc supports repeated, iterative expansion with synonym rules. As the number of synonym rules increases, the time spent iterating also increases, and the running time grows accordingly. sbjc has higher matching quality than nsbed in terms of the F-measure. In other words, sbjc/sbcs/sbdc spend more time in order to obtain high-quality matching records.

Fig. 8 (d-f) shows the lookup time and the time to compute the dynamic programming table in the computation of nsbed. The lookup time is the time used to perform approximate string matching (including the time to solve the T-overlap problem and to remove false positives). When the number of synonym rules is greater than 50, the lookup time dominates the computation time of nsbed.

Fig. 9 shows the average number of iterations of string gram-set expansion under two expansion strategies: one expands iteratively, and the other expands iteratively and selectively. As shown in Fig. 9, selective expansion requires clearly fewer iterations.

In addition, the number of iterations generally increases as the number of synonym rules increases. For the Babel data, however, the number of expansion iterations begins to decrease once the synonym rule set exceeds 6,000 rules (see Fig. 9b).
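The iterative expansion whose iteration count is reported in Fig. 9 can be sketched as a fixed-point loop: grams contributed by applicable synonym rules are added until no rule adds anything new. The code below is only an illustration of the non-selective strategy under simplifying assumptions that do not come from the original text: a rule <a, b> is treated as applicable when every gram of a is already in the set, rules are treated as symmetric, and the names `grams` and `expand_grams` are hypothetical.

```python
def grams(s, q=2):
    """Set of q-grams of s; q=2 is an assumed default."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def expand_grams(s, rules, q=2):
    """Return (expanded gram set, number of passes that added grams).

    Assumption: rule <a, b> applies when all grams of a are present;
    the selective strategy of the original method is not modelled.
    """
    expanded = grams(s, q)
    pending = [(grams(a, q), grams(b, q)) for a, b in rules]
    pending += [(gb, ga) for ga, gb in pending]  # treat rules as symmetric (assumption)
    iterations = 0
    changed = True
    while changed:
        changed = False
        for lhs, rhs in pending:
            if lhs and lhs <= expanded and not rhs <= expanded:
                expanded |= rhs  # add the grams of the synonym
                changed = True
        if changed:
            iterations += 1
    return expanded, iterations
```

With the rules `[("cd", "ef"), ("ab", "cd")]` and the string "ab", the first pass adds the gram "cd" and the second pass adds "ef", so two iterations are needed. Chains of rules of this kind are one way in which a larger rule set can lead to more iterations.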

4. Efficiency of the search algorithm

The following analyzes the experimental results of approximate string search using the approximate functions based on synonym rules. 100 records were selected at random from each data set as query records. In all experiments, the similarity threshold for matching synonym rules and the similarity threshold for retrieval were both 0.9.

Figure 10 (a-c) shows the average search time of the four approximate functions on the three data sets. This time includes the merging time and the final post-processing time. On all data sets, nsbed and sbjc perform better than sbcs and sbdc, because the merging threshold T of nsbed and sbjc is stricter than that of the other two approximate functions. A stricter threshold means fewer intermediate results and less time spent merging. Figure 10 (d-f) reports the number of false positives for each approximate function, which confirms the above analysis: sbcs and sbdc have a looser threshold T and generate a large number of intermediate results that are not final results. In most cases, the larger the similarity threshold, the fewer the false positives. The one exception is the journal data set: for nsbed, when the threshold is greater than 0.9, the number of false positives increases slightly (see Figure 10f). This is because the number of candidate strings in this range stays the same while the number of final results decreases.
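The merging step governed by the threshold T can be illustrated with inverted lists over grams: the lists of the query's grams are merged by counting, and a string becomes a candidate when it shares at least T grams with the query. This is a minimal sketch on plain gram sets, without synonym expansion; the names `build_index` and `t_overlap_candidates`, the gram length q = 2, and the simple counting merge are assumptions made for illustration and do not reproduce the original merging algorithm.

```python
from collections import Counter, defaultdict

def grams(s, q=2):
    """Set of q-grams of s; q=2 is an assumed default."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(strings, q=2):
    """Map each gram to the list of string IDs containing it, in ID order."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].append(sid)
    return index

def t_overlap_candidates(query, index, T, q=2):
    """IDs of strings sharing at least T grams with the query."""
    counts = Counter()
    for g in grams(query, q):
        counts.update(index.get(g, ()))  # one count per shared gram
    return sorted(sid for sid, c in counts.items() if c >= T)
```

Candidates returned by this step still have to be verified with the actual similarity function; those that fail verification are the false positives counted in Figure 10 (d-f). A smaller T admits more candidates, which is why a looser threshold increases both the merging time and the number of false positives.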

In the present invention, synonym rules are used to extend two types of approximate functions for string similarity measurement and approximate string search. The function based on edit operations is improved by adding a substring replacement operation based on synonym rules. The function based on common string grams and tokens is improved by recursively expanding the string gram (token) set so that it incorporates synonym information. The proposed efficient algorithms then perform approximate string search based on the improved approximate functions. Finally, experiments demonstrate the effectiveness and efficiency of the method of the invention.

The present invention also has broad prospects for further development. For example, because the algorithms of the invention are a natural extension of traditional approximate functions, existing approximate string join algorithms and data cleaning methods can easily be enhanced by incorporating synonym rules to achieve efficient approximate string search. The efficient approximate string search technique proposed by the invention can find, in a data set, approximate strings that are not only similar to the query string in form but also related to it semantically. The invention can therefore be applied in settings such as search engines and information query systems, while remaining highly efficient.

In addition, other advantages and modifications will readily occur to those of ordinary skill in the art. Therefore, the invention in its broader aspects is not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims

1. An approximate string search technique that extends two classes of traditional approximate functions based on synonym rules, the specific steps comprising:

A. an approximate string search technique based on edit distance that uses synonym rules;

B. an approximate string search technique based on the number of common grams that uses synonym rules.

2. The technique according to step A of claim 1, wherein the specific steps comprise the following:

Step A2: perform the approximate search.

3. The technique according to claim 2, wherein step A1 is further described as:

Step A1-1: compute the gram set G(s) of each string s in S;

Step A1-2: use the synonym rules to expand G(s) into a new set EG(s);

Step A1-3: build, for each gram in the set EG(s), an inverted list ordered by string ID.

4. The technique according to claim 3, wherein step A1-2 is further described as:

5. The technique according to claim 2, wherein step A2 further proceeds as:

Step A2-1: for each gram g in the query string t, retrieve the inverted list built for g in step A1-3, and find the candidate strings s that have T grams in common with the query string t.

6. The technique according to claim 5, wherein step A2-2 can be further described as follows:

Step A2-2-1: according to a preset threshold δ (this threshold indicates the minimum similarity to a string in a synonym rule at which the substring replacement operation may be performed on the string), find a synonym rule P<a, b> and judge whether the condition ed(s[i1, i2], a)/|a| ≤ δ is satisfied.

Step A2-2-2: use the synonym rule P that satisfies the condition in step A2-2-1 to perform the synonym-rule-based substring replacement operation on the string s, that is, replace a substring of the string with the substring given in the synonym rule; perform the same operation on the query string t.

Step A2-2-3: compute the value sbed(s, t, P, δ), which is the cost of transforming the string s into t using substring replacement operations based on the synonym rule P together with traditional edit operations.

Step A2-2-4: remove from the data set the approximate strings that do not satisfy the sbed(s, t, P, δ) value computed in step A2-2-3; the remaining approximate strings are the matches.

7. The technique according to step B of claim 1, wherein the specific steps comprise the following:

8. The technique according to claim 7, wherein step B1 further proceeds as:

Step B1-2: add the grams of the corresponding synonym rules to the gram set of s, until no new synonym rule is applicable.

9. The technique according to claim 7, wherein step B3 further proceeds as:

Step B3-1: obtain the candidate strings by solving the T-overlap problem.

Step B3-2: remove the non-matching strings by computing the actual similarity value (sbjc/sbcs/sbdc denote the Jaccard coefficient, cosine similarity function, and Dice similarity function based on synonym rules).

10. The technique according to claim 9, wherein step B3-2 further proceeds as:

Compute the similarity value of the strings s and t from the expanded gram sets EGP(s, P, γ) and EGP(t, P, γ) obtained in steps B1 and B2 of claim 1.