WO2015178756A1 - Adaptive-window edit distance algorithm computation - Google Patents

Adaptive-window edit distance algorithm computation

Info

Publication number
WO2015178756A1
Authority
WO
WIPO (PCT)
Prior art keywords
string
computation
unit
unmatched
edit distance
Prior art date
Application number
PCT/MY2015/050026
Other languages
French (fr)
Inventor
Binti Mohamad Yassin YASZRINA
A/l Karuppiah ETTIKAN KANDASAMY
Binti Maidin AZIZAH
Koong WAH YAN
Ngo CHUAN HAI
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015178756A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques


Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an apparatus and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order. The apparatus for parallel moving adaptive window filtering edit distance computation includes: a central processing unit (1), a storage unit (2), a memory module (6), a plurality of input devices (4), a random access memory (7) connected to the memory module (6), an Input-Output hub (3) connected to the memory module (6), the input devices (4) and the storage unit (2), and a parallel computation acceleration device (5) connected to the central processing unit (1) via the memory module (6), whereby the parallel computation acceleration device (5) further includes: a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation, an eliminator unit to extract at least one unmatched string from the first string and second string, a comparator unit to compute the unmatched string, wherein the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated, and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.

Description

ADAPTIVE-WINDOW EDIT DISTANCE ALGORITHM COMPUTATION
FIELD OF INVENTION
The present invention relates to an apparatus and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order.
BACKGROUND ART
The rapid digitization of various data sources eases the extraction of information, allowing the relevant authorities to enhance and optimize their systems using far more information than they had before. However, most of this information is converted from older paper-based documents and is prone to data-entry mistakes such as typographical errors or duplicate entries of updated and stale information. In addition, older digital systems tend to have limited storage and may contain truncated information compared with newer systems. Data cleansing is therefore imperative when upgrading a database system, to ensure that the information is correct, validated and up to date, and hence useful. Data cleansing is used during the ETL (extract, transform, load) process when building a data warehouse. Considered a main problem in data warehousing, data cleansing (or cleaning) is essential to ensure data integrity and data correctness.
In the Malaysian context, data cleansing is particularly challenging because Malaysia is a heterogeneous society with different naming conventions and record-keeping practices left over from British colonial times. Malaysia is composed of more than 100 ethnic groups. The three main ethnic groups are Malays, Chinese and Indians, each with a different naming convention. As such, a traditional edit-distance-based approach may not yield the most accurate results. Cleansing 15 million names against a reference database of similar size (e.g. 15 million names) leads to 225 × 10^12 (hundreds of trillions of) combinations, which illustrates the complexity of the matching process. Since such a cleansing process needs multiple rules, the magnitude of the complexity is enormous.
One example, disclosed in Malaysian Patent Application No. PI 2013004420, describes an optimization of the Levenshtein (edit distance) algorithm; the present disclosure is a new, extended application of that algorithm. However, the edit distance algorithm alone cannot cover all cases of name/string comparison, since it takes the whole string for comparison. This leads to unfair comparisons, as in the example below:
[Table 1: name-comparison example, provided as an image in the original publication]
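From Table 1, we can see that the two names are similar, but the Levenshtein ratio (LR) score is only 16%. It is calculated using:
\[ \mathrm{LR} = \operatorname{round}\left(\left(1 - \frac{\operatorname{levenshtein}(\text{string1}, \text{string2})}{\max(\operatorname{length}(\text{string1}), \operatorname{length}(\text{string2}))}\right) \times 100\right) \]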
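The Levenshtein ratio above can be sketched in a few lines of Python. This is a minimal illustration using the textbook dynamic-programming edit distance, not the optimized algorithm of the cited application; the function names `levenshtein` and `levenshtein_ratio` are chosen here for illustration, and returning 100 for two empty strings is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions and substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(string1: str, string2: str) -> int:
    """LR = round((1 - distance / longer length) * 100)."""
    longest = max(len(string1), len(string2))
    if longest == 0:
        return 100  # assumption: two empty strings count as identical
    return round((1 - levenshtein(string1, string2) / longest) * 100)
```

Because the ratio is computed over the whole string, an inserted word or a swapped name order lowers the score sharply even when the individual words agree.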
There are several other factors that give a low Levenshtein score, such as:
1. Different spellings of interfixes that show relationships (bin/binti/anak lelaki, etc.)
2. Prefixes from titles conferred by Malaysian monarchs and/or professional titles
3. Usage of unofficial names or nicknames
4. Inverted name order
5. Name change at the National Registration Department (NRD) (prompting the addition of @ between the old and new names)
Therefore, a new application of the optimized edit distance algorithm is needed.
In view of the above, there is a need to accelerate the data cleansing of personal information for a Malaysian or any other national organization. There is also a need for a system that increases the accuracy of name/string verification for low-accuracy matches between two or more input sources for data cleansing.
SUMMARY OF INVENTION
The present invention provides an apparatus for parallel moving adaptive window filtering edit distance computation comprising: a central processing unit (1); a storage unit (2); a memory module (6); a plurality of input devices (4); a random access memory (7) connected to the memory module (6); an Input-Output hub (3) connected to the memory module (6), the input devices (4) and the storage unit (2); and a parallel computation acceleration device (5) connected to the central processing unit (1) via the memory module (6), wherein the parallel computation acceleration device (5) further comprises: a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation; an eliminator unit to extract at least one unmatched string from the first string and second string; a comparator unit to compute the unmatched string, in which the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated; and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.
Preferably, the edit distance adaptive window determining unit further comprises: at least one parallelized adaptive window computation unit to calculate the edit distance between the unmatched strings by using an adaptive window filter traversing along the critical path; and at least one feedback-based filtering threshold adjustment unit to initialize and continually move said window according to the input of said adaptive window computation unit.
Preferably, the parallel computation acceleration device further includes an arrangement unit to compute the unmatched string, in which the order of the sub-strings is determined by sequential weighted string order. Further, the results of the computation are based on at least one optimized ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.
A further aspect of the present invention provides a method for parallel moving adaptive window filtering edit distance computation comprising steps of: executing a matching operation using a central processing unit (1) for a reference string from at least one reference list from a storage unit (2) that is most similar to a search string from at least one search list from a storage unit (2) using a match filtering scoping unit; extracting at least one unmatched string from the reference string and search string using an eliminator unit; splitting the unmatched string (20) into at least one sub-string by using a space character and removing at least one patronymic character (21) from the unmatched string using a comparator unit; performing edit distance computation (22) between the sub-strings of the reference string and search string using at least one edit distance adaptive window determining unit having at least one traversing critical path to obtain at least one computation result; and tagging the unmatched string (23) with the computation results.
Preferably, executing the matching operation (15) for a reference string from at least one reference list that is most similar to a search string from at least one search list using a match filtering scoping unit further comprises steps of: extracting at least one matched string from the reference string and search string using an eliminator unit (16); and tagging the matched string with the matched results (18).
Preferably, performing edit distance computation (22) on the sub-string further comprises steps of: calculating the edit distance between the unmatched strings by using at least one parallelized adaptive window computation unit having an adaptive window filter traversing along the critical path; and initializing and moving the adaptive window filter according to the input of the adaptive window computation unit.
Preferably, performing edit distance computation (10) on the sub-string further comprises steps of computation on the unmatched string (12) to determine the order of the sub-strings by sequential weighted string order (11) using an arrangement unit.
Preferably, tagging the unmatched string (17) with the computation results further comprises steps of optimizing at least one ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.
BRIEF DESCRIPTION OF PREFERRED EMBODIMENT
The invention will now be described in greater detail, by way of example with reference to the drawings, in which:
Figure 1 illustrates the System Architecture of the adaptive-window edit distance algorithm.
Figure 2 illustrates the Process Flow for performing computation using adaptive-window edit distance algorithm.
Figure 3 illustrates the Process Flow of extracting at least one unmatched string.
Figure 4 illustrates the Process Flow of splitting the unmatched string and removing unmatched string.
Figure 5 illustrates the Process Flow of updating the weighted order parameters for matching of sub-strings.
Figure 6 illustrates the Process Flow of updating the weighted order parameters for non-matching of sub-strings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
As illustrated in Figure 1, the system for performing computation using the adaptive-window edit distance algorithm comprises a central processing unit (1) connected to Random Access Memory (7), a parallel computation acceleration device (5) and an Input Output Hub (3) via a Memory Hub (6), wherein the Memory Hub (6) is coupled to a storage unit (2) and at least one Input Device (4).
The parallel computation acceleration device (5) further comprises a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for a matching operation, an eliminator unit to extract at least one unmatched string from the first string and second string, a comparator unit to compute the unmatched string, in which the unmatched string is split into at least one sub-string using a space character and at least one patronymic character is eliminated, and at least one edit distance adaptive window determining unit to prepare the adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.
The request for matching is received by the central processing unit (1) and sent to the parallel computation acceleration device (5) for accelerated processing via the Memory Hub (6).
Then, data (compressed or uncompressed) from a storage unit (2) is transferred via the Input Output Hub (3) and Memory Hub (6) to the parallel computation acceleration device (5) for accelerated processing.
The process flow of the parallel computation acceleration device is illustrated in Figure 2. As illustrated in Figure 2, the process is initiated by segmenting the string into sub-strings so that the comparison can be done for each sub-string. Firstly, the parallel exact match comparison is executed to filter out the 100% matches. From the system point of view, the data input tables come from a specified database and are called input1 and input2 (8).
As further illustrated in Figures 2 and 3, the fields/columns compared should be equivalent, e.g. Names vs Names. The number of fields/columns is not specified, so as to keep the method general. For this example we take only one column from each table, which is the full name column. The invention then sorts the two input columns, filters out the exactly matching records by comparing each record, and keeps only the records that are not matched 100% in the input for further processing (9).
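The exact-match filtering step (9) can be illustrated with a short Python sketch. It assumes each input column is a plain list of full-name strings and that an exact match means string equality; the function name `filter_exact_matches` and the sample name "AHMAD BIN ALI" are hypothetical and are not taken from the publication.

```python
def filter_exact_matches(input1, input2):
    """Split two name columns into exact matches and leftover records.

    Returns (matched, unmatched1, unmatched2), each sorted, so that only
    the records not matched 100% are carried forward for fuzzy comparison.
    """
    exact = set(input1) & set(input2)  # names present in both columns
    matched = sorted(exact)
    unmatched1 = sorted(name for name in input1 if name not in exact)
    unmatched2 = sorted(name for name in input2 if name not in exact)
    return matched, unmatched1, unmatched2


# Hypothetical sample data for illustration only.
matched, rest1, rest2 = filter_exact_matches(
    ["RUSEE LIM", "AHMAD BIN ALI"],
    ["AHMAD BIN ALI", "RUSEE LIM RUSEE RAMAN"],
)
```

Removing the exact matches first keeps the more expensive edit distance stage from running on pairs whose result is already known.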
As further illustrated in Figures 2 and 4, the next process is the parallel forward and backward word-by-word comparison using the optimized edit distance algorithm (10). This process involves several steps, as listed below:
The process is executed by splitting the full string using whitespace as the delimiter (20). After splitting, the sub-strings of the first string are called data1 and the sub-strings of the second string are called data2. Then, patronymic characters such as @, bin, binti, etc. are removed from both data1 and data2 (21). Then, the adaptive-window edit distance (22) is performed for forward and backward word comparison from data1 to data2. Finally, the compared names are tagged with the similarity percentage output (23).
The process of performing the adaptive-window edit distance (22) further comprises the steps of comparing each data1 sub-string with each data2 sub-string, then calculating the edit distance for each pair using the Adaptive Window Edit Distance method (optimized edit distance algorithm) for only one direction, and finally calculating the average of the maximum similarity scores for forward and backward for output.
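Steps (20) and (21) described above can be sketched as follows. This is a minimal illustration under stated assumptions: names are upper-cased before comparison, "@" is treated as a separator, and the patronymic list contains only the tokens named in the text (bin, binti); the identifiers `DEFAULT_PATRONYMICS` and `split_name` and the sample names other than RUSEE LIM are hypothetical.

```python
# Assumption: only the tokens named in the text; a real list would be longer.
DEFAULT_PATRONYMICS = frozenset({"BIN", "BINTI"})


def split_name(full_name, patronymics=DEFAULT_PATRONYMICS):
    """Split a full name on whitespace and drop patronymic tokens.

    "@" (used between an old and a new name) is replaced by a space so
    that it never survives as a token or glues two names together.
    """
    cleaned = full_name.upper().replace("@", " ")
    return [token for token in cleaned.split() if token not in patronymics]


data1 = split_name("RUSEE LIM")
data2 = split_name("Ahmad bin Ali @ Abu")  # hypothetical name
```

The resulting token lists play the role of data1 and data2 in the word-by-word comparison.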
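[Table 2: pairwise comparison of the data1 (RUSEE LIM) sub-strings against the data2 (RUSEE LIM RUSEE RAMAN) sub-strings, provided as an image in the original publication]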
Thereafter, the similarity scores calculated for the forward direction are reused for the backward score to reduce the number of cycles, since the score for a given pair is the same in both directions. Table 2 shows each pair compared between data1 (RUSEE LIM) and data2 (RUSEE LIM RUSEE RAMAN). For simplicity, this example uses an exact-match score: an exact match scores 100 and anything less than 100 is treated as 0. A score of 100 means the pair matches 100% and 0 means the pair does not match. In the forward direction, both sub-strings in data1 match sub-strings in data2. In the backward direction, three sub-strings of data2 are matched while one is not. The forward and backward scores are then calculated as follows:
\[ \text{forward score} = \frac{100 + 100}{2} = 100\% \]
where 2 denotes the total number of data1 sub-strings.
\[ \text{backward score} = \frac{100 + 100 + 100 + 0}{4} = 75\% \]
\[ \text{similarity score} = \max(\text{forward}, \text{backward}) \]
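The worked example above can be reproduced with a short Python sketch. It uses the simplified exact-match scoring from the example (100 or 0) rather than the adaptive-window edit distance, and it recomputes the pair scores in each direction instead of reusing them as the description suggests; the function names are illustrative assumptions.

```python
def exact_score(a, b):
    """Simplified pair score from the example: 100 for identical, else 0."""
    return 100 if a == b else 0


def directional_score(source, target, score=exact_score):
    """Average, over the source sub-strings, of each one's best match in target."""
    if not source:
        return 0.0
    best = [max((score(s, t) for t in target), default=0) for s in source]
    return sum(best) / len(source)


def similarity_scores(data1, data2, score=exact_score):
    """Return (forward, backward, similarity) as percentages."""
    forward = directional_score(data1, data2, score)
    backward = directional_score(data2, data1, score)
    return forward, backward, max(forward, backward)


forward, backward, similarity = similarity_scores(
    ["RUSEE", "LIM"], ["RUSEE", "LIM", "RUSEE", "RAMAN"]
)
```

Passing a different `score` function, for example a per-word Levenshtein ratio, gives graded scores in place of the 100/0 simplification.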
The next process is the sequential weighted order, which determines the order accuracy of the sub-strings. For example, the sequential weighted order process retrieves the first data2 sub-string and looks for a match in data1. Once a sub-string matches, the process includes and compares the next sub-string of data2. If the first data2 sub-string has no match, the process moves on to the next sub-string of data2. The result is the maximum number of order matches.
An example of the sequential weighted order score is given below:
\[ \text{Order score} = \frac{2}{3} \times 100 = 67\% \]
As illustrated in Figures 5 and 6, the method of performing parallel sequential weighted order accuracy further comprises the steps of: initializing the parameters for data1 and data2, then determining whether the first sub-string of data1 (index1) matches the first sub-string of data2 (index2). If the compared sub-strings match, the method proceeds to update the parameters; otherwise, if the compared sub-strings do not match, it proceeds to find a match. Finally, the output is tagged with the maximum order percentage.
As illustrated in Figure 5, the parameters for matching sub-strings are updated as follows: index1 and index2 are incremented for data1 and data2. Then, the next sub-strings of data1 and data2 are compared until the end of the sub-strings. Then, the maximum sequence count is tracked.
As illustrated in Figure 6, a match for non-matching sub-strings is found as follows: only index1 is incremented, so that the next sub-string in data1 is compared with the first sub-string in data2. Then, it is determined whether the sub-string of data1 (index1) matches the first sub-string of data2 (index2), continuing until the end of the sub-strings. Then, the maximum sequence count is tracked.
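One possible reading of the sequential weighted order procedure is a search for the longest run of consecutive sub-strings that appear in the same order in both lists. The sketch below follows that reading and is an interpretation, not the flow of Figures 5 and 6. The choice of denominator (the number of sub-strings in the longer list) is an assumption made here, and the function names and the A/B/C/D tokens are hypothetical.

```python
def max_sequence_count(data1, data2):
    """Length of the longest run of consecutive equal sub-strings."""
    best = 0
    for start1 in range(len(data1)):
        for start2 in range(len(data2)):
            run = 0
            # Advance both indexes together while the sub-strings keep matching.
            while (start1 + run < len(data1)
                   and start2 + run < len(data2)
                   and data1[start1 + run] == data2[start2 + run]):
                run += 1
            best = max(best, run)
    return best


def order_score(data1, data2):
    """Order percentage; dividing by the longer list length is an assumption."""
    longest = max(len(data1), len(data2))
    if longest == 0:
        return 0.0
    return max_sequence_count(data1, data2) / longest * 100


score = order_score(["A", "B", "C"], ["A", "B", "D"])  # hypothetical tokens
```

With two of three sub-strings in sequence, this reading gives a score that rounds to 67%.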
Finally, the last process is calculating the final weighted average score. The proposed invention gives the user the choice of whether to give more precedence to the similarity match, to the order match, or equal priority to both.
\[ \text{Final score} = \alpha \cdot \text{Similarity} + \beta \cdot \text{Order} \]
where
\[ \alpha + \beta = 1 \]
α = weight given to the similarity score
β = weight given to the order score
It is up to the user to give more priority to α or β.
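The weighted combination can be written as a small Python function. This is an illustrative sketch: the name `final_score`, the default of equal weights, and the range check on alpha are assumptions added here.

```python
def final_score(similarity, order, alpha=0.5):
    """Final score = alpha * similarity + beta * order, with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    return alpha * similarity + beta * order


equal_priority = final_score(100.0, 50.0)                    # alpha = beta = 0.5
similarity_first = final_score(100.0, 67.0, alpha=0.75)      # favours similarity
```

Setting alpha to 1 ignores word order entirely, while setting it to 0 scores on order alone.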
The present invention provides a system and method for performing computation using an adaptive-window edit distance algorithm to determine matching possibilities by optimizing the ratio of similarity values and weighted order of strings, wherein the computation on a string is performed using parallel forward and backward computation on each sub-string to determine sequential order.

Claims

1. An apparatus for parallel moving adaptive window filtering edit distance computation comprising:
a central processing unit (1);
a storage unit (2);
a memory module (6);
a plurality of input devices (4);
a random access memory (7) having connected to the memory module (6);
an Input-Output hub (3) having connected to the memory module (6), the input devices (4) and the storage unit (2); and
a parallel computation acceleration device (5) having connected to the central processing unit (1) via memory module (6),
wherein the parallel computation acceleration device (5) further comprises:
a match filtering scoping unit to extract a first string from at least one reference list that is most similar to a second string from at least one search list for matching operation;
an eliminator unit to extract at least one unmatched string from the first string and second string;
a comparator unit to compute the unmatched string, in which the unmatched string is split to at least one sub-string using a space character and eliminate at least one patronymic character; and
at least one edit distance adaptive window determining unit to prepare for adaptive window setting based on the unmatched string and to calculate adaptive distance values on at least one critical path.
2. The apparatus according to claim 1, wherein the at least one edit distance adaptive window determining unit further comprising:
at least one parallelized adaptive window computation unit to calculate edit distance between the unmatched string by using an adaptive window filter traversing along the critical path; and
at least one feedback based filtering threshold adjustment unit to initialize and continually move the said window according to input of said adaptive window computation unit.
3. The apparatus according to claim 1, wherein the parallel computation acceleration device (5) further includes an arrangement unit to compute the unmatched string, in which the order of the sub-string is determined by sequential weighted string order.
4. The apparatus according to claim 1, wherein the results of the computation are based on at least one optimized ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.
5. A method for parallel moving adaptive window filtering edit distance computation comprising steps of:
executing matching operation (15) using a central processing unit (1) for a reference string from at least one reference list from a storage unit (2) that is most similar to a search string from at least one search list from a storage unit (2) using a match filtering scoping unit;
extracting at least one unmatched string from the reference string and search string using an eliminator unit;
splitting the unmatched string (20) to at least one sub-string by using a space character to remove the unmatched string (21) which contains at least one patronymic character using a comparator unit;
performing edit distance computation (22) between the sub-string of the reference string and search string using at least one edit distance adaptive window determining unit having at least one traversing critical path for obtaining at least one computation results; and
tagging the unmatched string (23) with the computation results.
6. The method according to claim 5, wherein executing matching operation (15) for a reference string from at least one reference list that is most similar to a search string from at least one search list using a match filtering scoping unit further comprising steps of:
extracting at least one matched string from the reference string and search string using an eliminator unit (16); and
tagging the matched string with the matched results (18).
7. The method according to claim 5, wherein performing edit distance computation (22) on the sub-string further comprising steps of:
calculating edit distance between the unmatched string by using at least one parallelized adaptive window computation unit having an adaptive window filter traversing along the critical path; and
initializing and moving the adaptive window filter according to input of the adaptive window computation unit.
8. The method according to claim 5, wherein the performing edit distance computation (10) on the sub-string further comprising steps of computation on the unmatched string (12) to determine the order of the sub-string by sequential weighted string order (11) using an arrangement unit.
9. The method according to claim 5, wherein the tagging the unmatched string (17) with the computation results further comprising steps of optimizing at least one ratio between the parallel adaptive edit distance operation results and the sequential weighted string order results.
PCT/MY2015/050026 2014-05-23 2015-05-06 Adaptive-window edit distance algorithm computation WO2015178756A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2014701350 2014-05-23
MYPI2014701350A MY173084A (en) 2014-05-23 2014-05-23 Adaptive-window edit distance algorithm computation

Publications (1)

Publication Number Publication Date
WO2015178756A1 (en) 2015-11-26

Family

ID=53490223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2015/050026 WO2015178756A1 (en) 2014-05-23 2015-05-06 Adaptive-window edit distance algorithm computation

Country Status (2)

Country Link
MY (1) MY173084A (en)
WO (1) WO2015178756A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090182728A1 (en) * 2008-01-16 2009-07-16 Arlen Anderson Managing an Archive for Approximate String Matching
EP2664997A2 (en) * 2012-05-18 2013-11-20 Xerox Corporation System and method for resolving named entity coreference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Levenshtein distance", 28 November 2013 (2013-11-28), pages 1 - 6, XP055159192, Retrieved from the Internet <URL:http://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=583605633> [retrieved on 20141217] *
BILENKO M ET AL: "Adaptive name matching in information integration", IEEE INTELLIGENT SYSTEMS, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 18, no. 5, 1 September 2003 (2003-09-01), pages 16 - 23, XP011101993, ISSN: 1094-7167, DOI: 10.1109/MIS.2003.1234765 *
DANY BRESLAUERT ET AL: "Parallel string matching algorithms*", 6 January 1994 (1994-01-06), XP055206902, Retrieved from the Internet <URL:http://www.dtic.mil/dtic/tr/fulltext/u2/a274502.pdf> [retrieved on 20150807] *
EGECIOGLU O ET AL: "Parallel algorithms for fast computation of normalized edit distances", PARALLEL AND DISTRIBUTED PROCESSING, 1996., EIGHTH IEEE SYMPOSIUM ON NEW ORLEANS, LA, USA 23-26 OCT. 1996, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 23 October 1996 (1996-10-23), pages 496 - 503, XP010205754, ISBN: 978-0-8186-7683-3, DOI: 10.1109/SPDP.1996.570374 *
UWE DRAISBACH ET AL: "Adaptive Windows for Duplicate Detection", 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2012) : ARLINGTON, VIRGINIA, USA, 1 - 5 APRIL 2012, IEEE, PISCATAWAY, NJ, 1 April 2012 (2012-04-01), pages 1073 - 1083, XP032453948, ISBN: 978-1-4673-0042-1, DOI: 10.1109/ICDE.2012.20 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862062A (en) * 2017-11-15 2018-03-30 中国银行股份有限公司 A kind of information query method, device and electronic equipment
CN108183956A (en) * 2017-12-29 2018-06-19 武汉大学 A kind of critical path extracting method of communication network
CN108183956B (en) * 2017-12-29 2020-05-12 武汉大学 Method for extracting key path of propagation network
CN111831869A (en) * 2020-06-30 2020-10-27 深圳价值在线信息科技股份有限公司 Method and device for checking duplicate of character string, terminal equipment and storage medium
CN111831869B (en) * 2020-06-30 2023-11-03 深圳价值在线信息科技股份有限公司 Character string duplicate checking method, device, terminal equipment and storage medium
CN114880314A (en) * 2022-05-23 2022-08-09 烟台聚禄信息科技有限公司 Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system

Also Published As

Publication number Publication date
MY173084A (en) 2019-12-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15732059; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15732059; Country of ref document: EP; Kind code of ref document: A1)