WO2015178756A1

WO2015178756A1 - Adaptive-window edit distance algorithm computation

Info

Publication number: WO2015178756A1
Application number: PCT/MY2015/050026
Authority: WO
Inventors: Binti Mohamad Yassin YASZRINA; A/l Karuppiah ETTIKAN KANDASAMY; Binti Maidin AZIZAH; Koong WAH YAN; Ngo CHUAN HAI
Original assignee: Mimos Berhad
Priority date: 2014-05-23
Filing date: 2015-05-06
Publication date: 2015-11-26
Also published as: MY173084A

Abstract

The present invention relates to an apparatus and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order. The apparatus for parallel moving adaptive window filtering edit distance computation includes: a central processing unit (1), a storage unit (2), a memory module (6), a plurality of input devices (4), a random access memory (7) connected to the memory module (6), an Input-Output hub (3) connected to the memory module (6), the input devices (4) and the storage unit (2), and a parallel computation acceleration device (5) connected to the central processing unit (1) via the memory module (6), whereby the parallel computation acceleration device (5) further includes: a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation, an eliminator unit to extract at least one unmatched string from the first string and second string, a comparator unit to compute the unmatched string, wherein the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated, and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.

Description

ADAPTIVE-WINDOW EDIT DISTANCE ALGORITHM COMPUTATION

FIELD OF INVENTION

The present invention relates to an apparatus and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order.

BACKGROUND ART

The rapid digitization of various data sources eases the extraction of information, allowing relevant authorities to enhance and optimize their systems using a much larger amount of information than was previously available. However, most of this information is converted from older paper-based documents and is prone to data entry mistakes such as typographical errors or duplicate entries of updated and stale information. In addition, older digital systems tend to have limited storage and may contain truncated information compared with newer systems. Data cleansing is therefore imperative when upgrading a database system, to ensure that the information is correct, validated and up to date, and hence useful. Data cleansing is used during the ETL (extract, transform, load) process when building a data warehouse. Considered the main problem for data warehouses, data cleansing (or cleaning) is essential to ensure data integrity and data correctness.

In the Malaysian context, data cleansing is particularly challenging because Malaysia is a heterogeneous society with different naming conventions and recordkeeping practices left over from British colonial times. Malaysia is composed of more than 100 ethnic groups. The three main ethnic groups are Malays, Chinese, and Indians, each with a different naming convention. As such, a traditional edit distance based approach may not yield the most accurate results. Cleansing 15 million names against a reference database of similar size (e.g. 15 million names) leads to 225 × 10^12 combinations, i.e. 225 trillion, which illustrates the complexity of the matching process. Since such a cleansing process needs multiple rules, the magnitude of the complexity is enormous.

One example, disclosed in Malaysian Patent Application No. PI 2013004420, describes an optimization of the Levenshtein/Edit Distance algorithm and a new extended application of that algorithm. However, the Edit Distance algorithm cannot cover all cases of name/string comparison, since it takes the whole string for comparison. This leads to unfair comparisons, as in the example below:

Table 1

From Table 1, we can see that the 2 names are similar, but the Levenshtein ratio (LR) score is only 16%. It is calculated using:

LR = round((1 - levenshtein(string1, string2) / max(length(string1), length(string2))) * 100);
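The ratio above can be sketched in Python as follows. This is a minimal illustration, assuming the classic insert/delete/substitute edit distance with unit costs; the function names `levenshtein` and `levenshtein_ratio` are chosen here for illustration, and treating two empty strings as a 100% match is an assumption that the source does not address.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(t) + 1))          # distances from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_ratio(s: str, t: str) -> int:
    """LR = round((1 - distance / longer length) * 100)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100  # assumption: two empty strings count as identical
    return round((1 - levenshtein(s, t) / longest) * 100)


if __name__ == "__main__":
    print(levenshtein_ratio("kitten", "sitting"))
```

Because the whole string is compared at once, two names that contain the same words in a different order can score poorly under this ratio, which is the weakness the description goes on to address.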

There are several other factors identified that will give a low Levenshtein score, such as:

1. Different spelling for interfixes that show relationships (bin / binti / anak lelaki, etc.)

2. Prefixes from titles conferred by Malaysian monarchs and/or professional titles

3. Usage of unofficial names or nicknames

4. Inverted name order

5. Name change in National Registration Department (NRD) (prompting addition of @ between the old & new name)

Therefore, a new application of optimized Edit distance algorithm is needed.

In view of the above, there is a need to accelerate the data cleansing of personal information for a Malaysian or any other national organization. There is also a need for a system that increases the accuracy of name/string verification for low-accuracy matches between two or more input sources for data cleansing.

SUMMARY OF INVENTION

The present invention provides an apparatus for parallel moving adaptive window filtering edit distance computation comprising: a central processing unit (1); a storage unit (2); a memory module (6); a plurality of input devices (4); a random access memory (7) connected to the memory module (6); an Input-Output hub (3) connected to the memory module (6), the input devices (4) and the storage unit (2); and a parallel computation acceleration device (5) connected to the central processing unit (1) via the memory module (6), wherein the parallel computation acceleration device (5) further comprises: a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation; an eliminator unit to extract at least one unmatched string from the first string and second string; a comparator unit to compute the unmatched string, in which the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated; and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.

Preferably, the edit distance adaptive window determining unit further comprises: at least one parallelized adaptive window computation unit to calculate the edit distance between the unmatched strings by using an adaptive window filter traversing along the critical path; and at least one feedback based filtering threshold adjustment unit to initialize and continually move the said window according to the input of said adaptive window computation unit.

Preferably, the parallel computation acceleration device further includes an arrangement unit to compute the unmatched string, in which the order of the sub-string is determined by sequential weighted string order. Further, the results of the computation are based on at least one optimized ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.

A further aspect of the present invention provides a method for parallel moving adaptive window filtering edit distance computation comprising the steps of: executing a matching operation using a central processing unit (1) for a reference string from at least one reference list from a storage unit (2) that is most similar to a search string from at least one search list from a storage unit (2) using a match filtering scoping unit; extracting at least one unmatched string from the reference string and search string using an eliminator unit; splitting the unmatched string (20) into at least one sub-string by using a space character and removing (21) the sub-string which contains at least one patronymic character using a comparator unit; performing edit distance computation (22) between the sub-strings of the reference string and search string using at least one edit distance adaptive window determining unit having at least one traversing critical path to obtain at least one computation result; and tagging the unmatched string (23) with the computation results.

Preferably, executing the matching operation (15) for a reference string from at least one reference list that is most similar to a search string from at least one search list using a match filtering scoping unit further comprises the steps of: extracting at least one matched string from the reference string and search string using an eliminator unit (16); and tagging the matched string with the matched results (18).

Preferably, performing edit distance computation (22) on the sub-string further comprises the steps of: calculating the edit distance between the unmatched strings by using at least one parallelized adaptive window computation unit having an adaptive window filter traversing along the critical path; and initializing and moving the adaptive window filter according to the input of the adaptive window computation unit.

Preferably, performing edit distance computation (10) on the sub-string further comprises the step of computation on the unmatched string (12) to determine the order of the sub-strings by sequential weighted string order (11) using an arrangement unit.

Preferably, tagging the unmatched string (17) with the computation results further comprises the step of optimizing at least one ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.

BRIEF DESCRIPTION OF PREFERRED EMBODIMENT

The invention will now be described in greater detail, by way of example, with reference to the drawings, in which:

Figure 1 illustrates the System Architecture of the adaptive-window edit distance algorithm.

Figure 2 illustrates the Process Flow for performing computation using adaptive-window edit distance algorithm.

Figure 3 illustrates the Process Flow of extracting at least one unmatched string.

Figure 4 illustrates the Process Flow of splitting the unmatched string and removing unmatched string.

Figure 5 illustrates the Process Flow of updating the weighted order parameters for matching of sub-strings.

Figure 6 illustrates the Process Flow of updating the weighted order parameters for non-matching of sub-strings.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As illustrated in Figure 1, the system for performing computation using the adaptive-window edit distance algorithm comprises a central processing unit (1) connected to a Random Access Memory (7), a parallel computation acceleration device (5) and an Input Output Hub (3) via a Memory Hub (6), wherein the Memory Hub (6) is coupled to a storage unit (2) and at least one Input Device (4).

The parallel computation acceleration device (5) further comprises a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation, an eliminator unit to extract at least one unmatched string from the first string and second string, a comparator unit to compute the unmatched string, in which the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated, and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.

The request for matching is received by the central processing unit (1) and sent to the parallel computation acceleration device (5) for accelerated processing via the Memory Hub (6).

Then, data (compressed or uncompressed) from the storage unit (2) is transferred via the Input Output Hub (3) and the Memory Hub (6) to the parallel computation acceleration device (5) for accelerated processing.

The process flow of the parallel computation acceleration device is illustrated in Figure 2. As illustrated in Figure 2, the process is initiated by segmenting the string into sub-strings so that the comparison can be done for each sub-string. Firstly, the parallel exact match comparison is executed to filter out the 100% matches. From the system point of view, the data input tables, called input1 and input2, come from a specified database (8).

As further illustrated in Figures 2 and 3, the fields/columns compared should be equivalent, e.g. Names vs Names. The number of fields/columns is not specified, so as to keep the method general. For this example, only 1 column is taken from each table, namely the full name column. The invention then sorts the 2 input columns, filters out the exact-match records by comparing each record, and keeps only the records that are not 100% matched in the input for further processing (9).
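The exact-match filtering step can be sketched as below. This is only an illustration under stated assumptions: each input is a single column of full names held in a Python list, set membership stands in for the record-by-record comparison of the sorted columns, and the function name `split_exact_matches` and the sample name "AHMAD ALI" are hypothetical.

```python
def split_exact_matches(input1, input2):
    """Separate 100% matches from the records needing further processing.

    Returns (matched, unmatched1, unmatched2), each sorted.
    """
    common = set(input1) & set(input2)       # names present in both columns
    matched = sorted(common)
    unmatched1 = sorted(n for n in input1 if n not in common)
    unmatched2 = sorted(n for n in input2 if n not in common)
    return matched, unmatched1, unmatched2


if __name__ == "__main__":
    col1 = ["RUSEE LIM", "AHMAD ALI"]                 # hypothetical sample rows
    col2 = ["AHMAD ALI", "RUSEE LIM RUSEE RAMAN"]
    matched, rest1, rest2 = split_exact_matches(col1, col2)
    print(matched, rest1, rest2)
```

Only `rest1` and `rest2` would be passed on to the word-by-word comparison, which keeps the expensive edit distance work away from records that already agree exactly.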

As further illustrated in Figures 2 and 4, the next process is the parallel forward and backward word-by-word comparison using the optimized edit distance algorithm (10). This process involves several steps, as listed below:

The process begins by splitting each full string using whitespace as the delimiter (20). After splitting, the sub-strings of the first string are called data1 and the sub-strings of the second string are called data2. Next, patronymic characters such as @, bin, binti, etc. are removed from both data1 and data2 (21). The adaptive-window edit distance is then performed for the forward and backward word comparison between data1 and data2 (22). Finally, the compared names are tagged with the similarity percentage output (23).

The process of performing the adaptive-window edit distance (22) further comprises the steps of: comparing each data1 sub-string with each data2 sub-string; calculating the edit distance for each pair, in one direction only, using the Adaptive Window Edit Distance method (the optimized edit distance algorithm); and finally calculating the average of the maximum similarity scores for the forward and backward directions for output.
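The description does not spell out how the adaptive window is computed. As a rough intuition only, the sketch below uses a swapped-in standard technique, a banded Levenshtein computation whose band (window) around the main diagonal is widened until the result is provably exact. It should not be read as the claimed method; the names `banded_levenshtein` and `adaptive_window_distance` and the doubling strategy are assumptions made for illustration.

```python
from typing import Optional


def banded_levenshtein(a: str, b: str, window: int) -> Optional[int]:
    """Edit distance using only cells with |i - j| <= window.

    Returns None when the distance is not guaranteed to fit in the window.
    """
    n, m = len(a), len(b)
    if abs(n - m) > window:
        return None
    inf = n + m + 1                      # larger than any real distance
    prev = [j if j <= window else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - window), min(m, i + window)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    # A result within the window is exact; anything larger is unreliable.
    return prev[m] if prev[m] <= window else None


def adaptive_window_distance(a: str, b: str, start: int = 1) -> int:
    """Widen the window (by doubling, an assumption) until the band suffices."""
    window = max(1, start, abs(len(a) - len(b)))
    while True:
        d = banded_levenshtein(a, b, window)
        if d is not None:
            return d
        window *= 2


if __name__ == "__main__":
    print(adaptive_window_distance("kitten", "sitting"))
```

The appeal of a narrow window is that similar words are resolved after visiting only a thin strip of the distance matrix, while dissimilar words trigger a wider retry.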

Table 2

Thereafter, the similarity scores calculated for the forward direction are reused for the backward direction to reduce the number of cycles, since the score for a given pair is the same in both directions. Table 2 above illustrates each pair compared when data1 (RUSEE LIM) is compared against data2 (RUSEE LIM RUSEE RAMAN). For simplicity, this example uses an exact match score, in which an exact match scores 100 and anything less than 100 is treated as 0: 100 means the pair matches 100% and 0 means the pair does not match. In the forward direction, both sub-strings in data1 match sub-strings in data2. In the backward direction, 3 sub-strings are matched while 1 is not matched. The forward and backward scores are then calculated as follows:

$$\text{forward score} = \frac{100 + 100}{2} = 100\%$$

where 2 denotes the total number of data1 sub-strings.

$$\text{backward score} = \frac{100 + 100 + 100 + 0}{4} = 75\%$$

$$\text{similarity score} = \max(\text{forward}, \text{backward})$$
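The forward and backward calculation for the RUSEE LIM example can be sketched as follows. This is a minimal sketch under stated assumptions: names are upper-cased before comparison, the patronymic list holds only the three tokens named in the description (the source leaves the rest open with "etc."), "@" is removed only when it stands alone as a token, and the exact match score from the example replaces the adaptive-window edit distance. All function names are hypothetical.

```python
PATRONYMICS = {"@", "BIN", "BINTI"}   # assumption: only the tokens the text names


def tokens(name: str) -> list:
    """Split on whitespace and drop patronymic tokens."""
    return [w for w in name.upper().split() if w not in PATRONYMICS]


def exact_score(a: str, b: str) -> int:
    """Simplified pair score used in the example: 100 or 0."""
    return 100 if a == b else 0


def directional_score(src: list, dst: list, score=exact_score) -> float:
    """Average, over src sub-strings, of the best score against any dst sub-string."""
    if not src:
        return 0.0
    return sum(max((score(s, d) for d in dst), default=0) for s in src) / len(src)


def similarity(name1: str, name2: str):
    """Return (forward, backward, similarity) with similarity = max of the two."""
    data1, data2 = tokens(name1), tokens(name2)
    forward = directional_score(data1, data2)
    backward = directional_score(data2, data1)
    return forward, backward, max(forward, backward)


if __name__ == "__main__":
    print(similarity("RUSEE LIM", "RUSEE LIM RUSEE RAMAN"))
```

For this pair the sketch gives a forward score of 100, a backward score of 75 and a similarity score of 100. Passing a different `score` function, such as one based on an edit distance ratio, would give partial credit to misspelled sub-strings instead of 0.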

The next process is the sequential weighted order, which determines the order accuracy of the sub-strings. The sequential weighted order process is executed by retrieving the first data2 sub-string and searching for a match in data1. Once a sub-string matches, the process includes the next data2 sub-string and compares it as well. If the first data2 sub-string has no match, the process moves on to the next sub-string of data2. The result gives the maximum number of sub-strings matched in order.

An example of a sequential weighted order score is as below:

$$\text{Order score} = \frac{2}{3} \times 100 \approx 67\%$$

As illustrated in Figures 5 and 6, the method of performing the parallel sequential weighted order accuracy further comprises the steps of: initializing the parameters for data1 and data2; determining whether the first sub-string of data1 (index1) matches the first sub-string of data2 (index2); if the compared sub-strings match, proceeding to update the parameters, otherwise proceeding to find a match; and finally tagging the output with the maximum order percentage.

As illustrated in Figure 5, the parameters for matching sub-strings are updated as follows: index1 and index2 are incremented for data1 and data2; the next sub-strings of data1 and data2 continue to be compared until the end of the sub-strings; and the maximum sequence count is tracked.

As illustrated in Figure 6, a match for non-matching sub-strings is found as follows: only index1 is incremented, so that the next sub-string in data1 is compared with the first sub-string in data2; the method then determines whether the data1 sub-string (index1) matches the data2 sub-string (index2), continuing until the end of the sub-strings; and the maximum sequence count is tracked.
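One possible reading of the sequence counting in Figures 5 and 6 is the longest run of consecutive sub-strings that appear in the same order in both names. The sketch below follows that reading with a simple brute-force scan. The normalisation is an assumption, since the source gives only a single worked score: here the count is divided by the number of sub-strings in the longer name. The function names and sample tokens are hypothetical.

```python
def max_sequence_count(data1: list, data2: list) -> int:
    """Length of the longest run of consecutive sub-strings shared in order."""
    best = 0
    for i in range(len(data1)):
        for j in range(len(data2)):
            k = 0
            # Advance both indexes together while the sub-strings keep matching.
            while (i + k < len(data1) and j + k < len(data2)
                   and data1[i + k] == data2[j + k]):
                k += 1
            best = max(best, k)
    return best


def order_score(data1: list, data2: list) -> int:
    """Order percentage; dividing by the longer list is an assumption."""
    total = max(len(data1), len(data2))
    if total == 0:
        return 0
    return round(max_sequence_count(data1, data2) / total * 100)


if __name__ == "__main__":
    print(order_score(["RUSEE", "LIM", "RAMAN"], ["RUSEE", "LIM"]))
    print(order_score(["LIM", "RUSEE"], ["RUSEE", "LIM"]))
```

Under these assumptions, two sub-strings matched in sequence out of three score 67, while an inverted two-word name scores 50 because only one sub-string at a time lines up in order.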

Finally, the last process is calculating the final weighted average score. The proposed invention gives the user the choice of giving more precedence to the similarity match or to the order match, or of giving both equal priority.

$$\text{Final score} = \alpha \cdot \text{Similarity} + \beta \cdot \text{Order}$$

where

$$\alpha + \beta = 1$$

α = weight given to the similarity score

β = weight given to the order score

The user decides whether to give more priority to α or β.
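The weighted combination can be sketched as below. This is a minimal sketch; the function name `final_score`, the default of equal weights and the range check on the similarity weight are assumptions added for illustration.

```python
def final_score(similarity: float, order: float, alpha: float = 0.5) -> float:
    """Final score = alpha * similarity + beta * order, with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    beta = 1.0 - alpha        # the order weight follows from alpha + beta = 1
    return alpha * similarity + beta * order


if __name__ == "__main__":
    # Equal priority to similarity and order.
    print(final_score(100, 67))
    # Similarity only: the order score is ignored.
    print(final_score(100, 67, alpha=1.0))
```

Taking only the similarity weight as input and deriving the order weight keeps the constraint that the two weights sum to 1 from being violated by the caller.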

The present invention provides a system and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order.

Claims

1. An apparatus for parallel moving adaptive window filtering edit distance computation comprising:

a central processing unit (1);

a storage unit (2);

a memory module (6);

a plurality of input devices (4);

a random access memory (7) connected to the memory module (6);

an Input-Output hub (3) connected to the memory module (6), the input devices (4) and the storage unit (2); and

a parallel computation acceleration device (5) connected to the central processing unit (1) via the memory module (6),

wherein the parallel computation acceleration device (5) further comprises:

a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation;

an eliminator unit to extract at least one unmatched string from the first string and second string;

a comparator unit to compute the unmatched string, in which the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated; and

at least one edit distance adaptive window determining unit to prepare for adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.

2. The apparatus according to claim 1, wherein the at least one edit distance adaptive window determining unit further comprises:

at least one parallelized adaptive window computation unit to calculate edit distance between the unmatched string by using an adaptive window filter traversing along the critical path; and

at least one feedback based filtering threshold adjustment unit to initialize and continually move the said window according to input of said adaptive window computation unit.

3. The apparatus according to claim 1, wherein the parallel computation acceleration device (5) further includes an arrangement unit to compute the unmatched string, in which the order of the sub-string is determined by sequential weighted string order.

4. The apparatus according to claim 1, wherein the results of the computation are based on at least one optimized ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.

5. A method for parallel moving adaptive window filtering edit distance computation comprising steps of:

executing a matching operation (15) using a central processing unit (1) for a reference string from at least one reference list from a storage unit (2) that is most similar to a search string from at least one search list from a storage unit (2) using a match filtering scoping unit;

extracting at least one unmatched string from the reference string and search string using an eliminator unit;

splitting the unmatched string (20) into at least one sub-string by using a space character and removing (21) the sub-string which contains at least one patronymic character using a comparator unit;

performing edit distance computation (22) between the sub-string of the reference string and search string using at least one edit distance adaptive window determining unit having at least one traversing critical path for obtaining at least one computation results; and

tagging the unmatched string (23) with the computation results.

6. The method according to claim 5, wherein executing matching operation (15) for a reference string from at least one reference list that is most similar to a search string from at least one search list using a match filtering scoping unit further comprising steps of:

extracting at least one matched string from the reference string and search string using an eliminator unit (16); and

tagging the matched string with the matched results (18).

7. The method according to claim 5, wherein performing edit distance computation (22) on the sub-string further comprising steps of:

calculating edit distance between the unmatched string by using at least one parallelized adaptive window computation unit having an adaptive window filter traversing along the critical path; and

initializing and moving the adaptive window filter according to input of the adaptive window computation unit.

8. The method according to claim 5, wherein the performing edit distance computation (10) on the sub-string further comprising steps of computation on the unmatched string (12) to determine the order of the sub-string by sequential weighted string order (11) using an arrangement unit.

9. The method according to claim 5, wherein the tagging the unmatched string (17) with the computation results further comprising steps of optimizing at least one ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.