WO2015088314A1 - An apparatus and method for parallel moving adaptive window filtering edit distance computation

Info

Publication number
WO2015088314A1
Authority
WO
WIPO (PCT)
Prior art keywords
string
window
edit distance
character
filtering
Prior art date
Application number
PCT/MY2014/000157
Other languages
French (fr)
Inventor
Chuan Hai NGO
Yaszrina MOHAMAD YASSIN
Ettikan Kandasamy Karuppiah
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015088314A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • the present invention relates to an apparatus and method for parallel moving adaptive window filtering edit distance computation.
  • the invention relates to an apparatus and method that utilizes an adaptive window filtering mechanism to accelerate edit distance computation for increased efficiency and shorter times by searching along a critical path that provides a high hit rate for string similarity matching.
  • Damerau-Levenshtein edit distance is a string metric between two strings, obtained by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
  • inefficiency is observed in computing Damerau or Levenshtein edit distance values, as computation power and time are wasted on calculating the values for the whole matrix of the entire region.
  • Computation required in Damerau-Levenshtein edit distance is M x N for a string of length M versus a string of length N. This is a problem when comparing multiple rows against multiple rows of strings.
  • FIG. 1.0 illustrates an example of computation utilizing Damerau or Levenshtein edit distance values for comparing two data strings. As illustrated in FIG. 1.0, the circled area indicates unnecessary calculation whereas the highlighted portion is the critical path. The problem to be addressed is the computation of edit distance through an adaptive moving window filter method using parallel computation devices, which filters out inefficient searches such as those outside of the critical path.
  • United States Patent No. US 7,584,173 B2 (hereinafter referred to as US 173 Patent) is entitled "Edit Distance String Search".
  • In the US 173 Patent, an improved method to speed up edit distance computation is provided wherein the longest prefix is utilized for matching acceleration. It is suitable for systems with many similar prefixes; otherwise the gains are small.
  • the index of the "first different character" from the previous text supports column re-use for shared prefixes.
  • Another mechanism that provides an improved method to speed up edit distance computation is disclosed in United States Patent No. US 6,295,524 B1 (hereinafter referred to as US 524 Patent).
  • In the US 524 Patent, historical data is utilized to reduce computation time, as the edit distance costs or similarity of two data strings are optimized, or the costs are minimized, based upon a learning algorithm using examples of similar strings.
  • the invention as disclosed in the US 524 Patent is suitable for systems with small variations of data, which enable fast lookup, as a wide variation of data requires a large lookup table and slow lookup time.
  • the US 524 Patent does not compute edit distance values based on the past values and along a critical path that provides a high hit rate for string similarity matching as proposed in the present invention, and does not provide length-based string filtering in which the strings are sorted based on character offset as proposed in the present invention.
  • Another example introduces an algorithm which modifies the dynamic programming method to reduce the amount of data storage and eliminate control flow divergence, in a paper entitled "An Algorithm for Fast Edit Distance Computation on GPUS" by Reza Farivar and Harshit Kharbanda of University of Illinois, Shivaram Venkataraman of University of California, and Roy H. Campbell of University of Illinois.
  • a modified algorithm is provided to reduce the memory requirement (data storage) and eliminate control flow divergence by embedding the condition variables in the program logic to ensure all threads execute the same instruction regardless of the different data items.
  • the score matrix is divided into a virtual grid of quadrants and a reverse look-up method that is not parallelizable is utilized.
  • Faster processing is provided in a very long string comparison (100 basis points) as it is designed for Graphic Processing Units (GPUs).
  • the invention as disclosed in the said paper does not compute edit distance values based on the past values and along a critical path that provides a high hit rate for string similarity matching as proposed in the present invention, and does not provide length-based string filtering in which the strings are sorted based on character offset as proposed in the present invention.
  • the problem described above leads to the solution proposed in the present invention, which utilizes an adaptive window filtering engine that moves in parallel, wherein the window acts as a bandpass filter to compute only the edit distance values based on the past values and moves along a critical path to eliminate unnecessary computation.
  • the apparatus (200) comprises at least one central processing unit (202); at least one storage unit (210); at least one memory module (204, 206); at least one I/O hub (212); a plurality of input devices (214); and at least one parallel computation acceleration device (208).
  • the at least one parallel computation acceleration device (208) further comprises at least one match filtering scoping unit (302), which scopes a region of interest of a reference list that is most similar to a search string for matching operation; and at least one edit distance adaptive window determining unit (304) to prepare for window setting based on user input and to calculate distance values on the critical path.
  • Another aspect of the invention provides a method (500) for parallel moving adaptive window filtering edit distance computation.
  • the method (500) comprises steps of extracting from input and reference source to obtain parameter inputs from both search and reference list (502). It is further scoped into the match filtering scoping unit (504) for initialization of the adaptive window (506). The window position is set to compute the edit distance (508) and an output is provided which determines if the string is tagged as match or not match (510). Further, the process concludes (514) if it has reached the end of character offset (512). Else, the process repeats from scoping into the match filtering unit until it is concluded that it has reached the end of character offset.
  • Yet another aspect of the invention provides for the step for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via the match filtering scoping unit (504).
  • the said step further comprises sorting the reference table according to character offset (602) and matching the search string with the reference string according to character offset (604). Thereafter, the scope of region is located (606).
  • Still another aspect of the invention provides for the step for initializing the adaptive window (506).
  • the said step further comprises determining the approve level for each pair of strings (804).
  • the length difference of the strings is determined (806).
  • the window frame is determined by adding the size of the base frame and the approve level, where the maximum frame size is a constant (816).
  • a further aspect of the invention provides for the step for providing feedback on the new window position if it is not the final character of the compare string (908), which further comprises steps of locating the lowest edit distance value in the window closest to the right edge of the window (1002); and providing a feedback of the start of the next window position for the next row at the minimum column (1004).
  • Still another aspect of the invention provides for the step for comparing end of character if it is the final character of base string of compare string (910).
  • the said step further comprises steps of determining if it is the final column of the base string (1102) and obtaining the last edit distance value (EDvalue) in the last row of the compare string and the last column of the base string if it is the final column of the base string (1104). It is further determined whether the output is tagged as match (1108) or not match (1110).
  • Yet another aspect of the invention provides for the step for moving the window through rows of the compare string if it is not the final column of the base string (1112).
  • the said step further comprises steps of checking if the end of the window is hitting the last column of the base string (1202).
  • the window is moved down each row to calculate the edit distance value to match the similarity of the strings (1204).
  • FIG. 1.0 illustrates an example of computation utilizing Damerau or Levenshtein edit distance values for comparing two data strings.
  • FIG. 2.0 illustrates the general architecture of the apparatus of the present invention.
  • FIG. 3.0 illustrates the detailed architecture of the apparatus within the parallel computation acceleration device.
  • FIG. 4.0 illustrates a block diagram of the process flow of the parallel computation acceleration device (208).
  • FIG. 5.0 is a flowchart illustrating the general methodology of the present invention for parallel moving adaptive window filtering edit distance computation.
  • FIG. 6.0 is a flowchart illustrating the steps for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via match filtering scoping unit.
  • FIG. 7.0 is a diagram which illustrates an example of sorting by character offset.
  • FIG. 8.0 is a flowchart illustrating the steps for initializing adaptive window.
  • FIG. 9.0 is a flowchart illustrating the steps for computing edit distance via edit distance adaptive window determining unit.
  • FIG. 10.0 is a flowchart illustrating steps for providing feedback on new window position if it is not the final character of compare string.
  • FIG. 11.0 is a flowchart illustrating the steps for comparing end of character if it is the final character of base string of compare string.
  • FIG. 12.0 is a flowchart illustrating steps for moving window through rows of compare string if it is not the final column of base string.
  • Referring to FIG. 2.0, a general architecture of the apparatus of the present invention is illustrated.
  • the apparatus (200) for parallel moving adaptive window filtering edit distance computation comprises a central processing unit (202); a storage unit (210); a memory module (204, 206); an I/O (input output) hub (212); input devices (214); and a parallel computation acceleration device (208).
  • the parallel computation acceleration device (208) comprises a match filtering scoping unit (302, 410) and an edit distance adaptive window determining unit (304, 513).
  • the process flow of the parallel computation acceleration device (208) is illustrated in FIG. 4.0 wherein the match filtering scoping unit (302, 410) scopes down a region of interest of a reference list that is most similar to a search string to be matched for matching operation. This is to ensure full coverage of matching of the entire column when the steps are iterated with the reference list being sorted according to a predefined character offset.
  • the edit distance adaptive window determining unit (304, 513) further comprises a parallelized adaptive window computation unit (308, 414) and a feedback based filtering threshold adjustment unit (306, 412).
  • the edit distance adaptive window determining unit (304, 513) prepares for adaptive window setting based on user input and calculates adaptive distance values on critical path while the parallelized adaptive window computation unit (308, 414) within the edit distance adaptive window determining unit (304, 513) calculates edit distance between all characters specified by using an adaptive window filter traversing along a critical path to minimize computation and memory demands. Further, the feedback based filtering threshold adjustment unit (306, 412) initializes and continually moves the window according to the input of the adaptive window computation unit.
  • a general method (500) of an embodiment of the invention is illustrated in FIG. 5.0 wherein the method (500) begins by extracting from input and reference source to obtain parameter inputs from both search and reference list (502).
  • Examples of parameter inputs include at least a string, length and index from both the search and reference list.
  • a reference list is a list being compared against and a search list is a list to be matched.
  • the region of interest of a reference list that is most similar to a search string is scoped down for matching operation via the match filtering scoping unit (504).
  • the adaptive window is initialized (506) and the edit distance is computed via edit distance adaptive window determining unit (508) to provide an output which determines if string is tagged as match or not match (510).
  • Character offset is text or numeral from 0-N wherein N is the length of character offset of the reference string.
  • A detailed description of the steps for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via a match filtering scoping unit (504) is illustrated in FIG. 6.0 along with the illustrations as depicted in FIG. 7.0.
  • FIG. 7.0 is a diagram which illustrates an example of sorting by character offset. As illustrated in FIG. 6.0, the reference table is sorted according to the character offset (602) wherein the exact matching of a search string (search) is performed with reference string according to the character offset (604); and the scope of region is located wherein the start is indicated by offsetStart and the end is offsetEnd (606).
  • the window frame is determined by adding the size of base frame and approve level, where maximum frame size is a constant (816).
  • the edit distance values for characters covered by the window length between the n_c character of the compare string and the base string window are computed (904) and the final character of the compare string is determined (906), followed by providing a feedback on the new window position if it is not the final character of the compare string (908); and comparing the end of character if it is the final character of base string of the compare string (910).
  • the steps for providing a feedback on the new window position if it is not the final character of the compare string further comprise steps of locating the lowest edit distance value min_w in the window closest to the right edge of the window (1002); and providing feedback of the start of the next window position for the next row at the min_w column (1004).
  • the detailed steps for comparing the end of character if it is the final character of base string of compare string (910) are illustrated in FIG. 11.0, which begins by determining if it is the final column of the base string (1102) and thereafter obtaining the last edit distance value (EDvalue) in the last row of the compare string and the last column of the base string if it is the final column of the base string (1104).
  • the window is moved through rows of the compare string if it is not the final column of the base string (1112).
  • a detailed description of the steps for moving the window through rows of the compare string is initiated by checking if the end of the window is hitting the last column of the base string if it is not the final column of the base string (1202). Thereafter, the window is moved down each row and the edit distance value for the window length is calculated between the n_c character of the compare string and the base string window (1204).
  • the present invention utilizes an adaptive window filtering engine that moves in parallel, wherein the window acts as a bandpass filter to compute only the edit distance values based on the past values and move along a critical path to eliminate unnecessary computation.
  • the length-based string filtering is performed automatically for region scoping of the whole column, wherein the strings are sorted based on a character offset. This is to ensure that all possibilities of matching have been covered for the whole column.
  • a feedback mechanism is provided for the automatic threshold based filtering to initialize the adaptive window setting based on user input to prepare for the threshold-based filtering and to calculate the adaptive edit distance values on the critical path; and to obtain the final adaptive edit distance value to determine whether the comparison is the intended match or not.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an apparatus (200) and method (500) for parallel moving adaptive window filtering edit distance computation that utilizes an adaptive window filtering mechanism to accelerate edit distance computation for increased efficiency and shorter times by searching along a critical path that provides a high hit rate for string similarity matching. The apparatus (200) of the present invention comprises at least one central processing unit (202); at least one storage unit (210); at least one memory module (204, 206); at least one I/O hub (212); a plurality of input devices (214); and at least one parallel computation acceleration device (208). The at least one parallel computation acceleration device (208) further comprises at least one match filtering scoping unit (302), which scopes a region of interest of a reference list that is most similar to a search string for matching operation; and at least one edit distance adaptive window determining unit (304) to prepare for window setting based on user input and to calculate distance values on the critical path. String filtering is performed automatically, wherein strings are sorted based on character offset to ensure all possibilities of matching are covered. Further, a feedback mechanism is provided for the automatic threshold-based filtering to initialize the adaptive window setting based on user input to prepare for the threshold-based filtering and to calculate the distance values on the critical path; and to obtain the final distance value to determine whether the comparison is the intended match or not.

Description

AN APPARATUS AND METHOD FOR PARALLEL MOVING ADAPTIVE WINDOW FILTERING EDIT DISTANCE COMPUTATION
FIELD OF INVENTION
The present invention relates to an apparatus and method for parallel moving adaptive window filtering edit distance computation. In particular, the invention relates to an apparatus and method that utilizes an adaptive window filtering mechanism to accelerate edit distance computation for increased efficiency and shorter times by searching along a critical path that provides a high hit rate for string similarity matching.
BACKGROUND ART
Currently, in information theory, Damerau-Levenshtein edit distance is a string metric between two strings, obtained by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. However, inefficiency is observed in computing Damerau or Levenshtein edit distance values, as computation power and time are wasted on calculating the values for the whole matrix of the entire region. Computation required in Damerau-Levenshtein edit distance is M x N for a string of length M versus a string of length N. This is a problem when comparing multiple rows against multiple rows of strings. FIG. 1.0 illustrates an example of computation utilizing Damerau or Levenshtein edit distance values for comparing two data strings. As illustrated in FIG. 1.0, the circled area indicates unnecessary calculation whereas the highlighted portion is the critical path. The problem to be addressed is the computation of edit distance through an adaptive moving window filter method using parallel computation devices, which filters out inefficient searches such as those outside of the critical path.
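For reference, the full-matrix computation described in this background can be sketched as follows. This is a minimal illustration and is not taken from the specification: it assumes the restricted (optimal string alignment) form of the Damerau-Levenshtein distance, and the function name `osa_distance` is hypothetical. Every interior cell of the M x N matrix is filled, which is the cost the present invention seeks to reduce.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative only: fills the whole (len(a)+1) x (len(b)+1) matrix.
    """
    m, n = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For two strings of lengths M and N the nested loops visit M x N cells regardless of how similar the strings are, so comparing every row of one list against every row of another multiplies that cost by the number of row pairs.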
One example, which determines for a search string which, if any, of the strings in a text list have an edit distance from the search string less than a threshold by utilizing dynamic programming on a grid with search string characters corresponding to rows and text characters corresponding to columns, is disclosed in United States Patent No. US 7,584,173 B2 (hereinafter referred to as US 173 Patent) entitled "Edit Distance String Search". In the US 173 Patent, an improved method to speed up edit distance computation is provided wherein the longest prefix is utilized for matching acceleration. It is suitable for systems with many similar prefixes; otherwise the gains are small. The index of the "first different character" from the previous text supports column re-use for shared prefixes. The list of "(prefix, index of next text lacking the prefix)" supports avoiding computation for texts that share a prefix that causes the edit distance to exceed a threshold. However, the US 173 Patent does not compute edit distance values based on the past values and along a critical path that provides a high hit rate for string similarity matching as proposed in the present invention. Further, it does not provide length-based string filtering in which the strings are sorted based on character offset as proposed in the present invention.
Another mechanism that provides an improved method to speed up edit distance computation is disclosed in United States Patent No. US 6,295,524 B1 (hereinafter referred to as US 524 Patent). In the US 524 Patent, historical data is utilized to reduce computation time, as the edit distance costs or similarity of two data strings are optimized, or the costs are minimized, based upon a learning algorithm using examples of similar strings. The invention as disclosed in the US 524 Patent is suitable for systems with small variations of data, which enable fast lookup, as a wide variation of data requires a large lookup table and slow lookup time. However, the US 524 Patent does not compute edit distance values based on the past values and along a critical path that provides a high hit rate for string similarity matching as proposed in the present invention, and does not provide length-based string filtering in which the strings are sorted based on character offset as proposed in the present invention.
Another example introduces an algorithm which modifies the dynamic programming method to reduce the amount of data storage and eliminate control flow divergence, in a paper entitled "An Algorithm for Fast Edit Distance Computation on GPUS" by Reza Farivar and Harshit Kharbanda of University of Illinois, Shivaram Venkataraman of University of California, and Roy H. Campbell of University of Illinois. In the said paper, a modified algorithm is provided to reduce the memory requirement (data storage) and eliminate control flow divergence by embedding the condition variables in the program logic to ensure all threads execute the same instruction regardless of the different data items. Further, the score matrix is divided into a virtual grid of quadrants and a reverse look-up method that is not parallelizable is utilized. Faster processing is provided in a very long string comparison (100 basis points) as it is designed for Graphic Processing Units (GPUs). However, the invention as disclosed in the said paper does not compute edit distance values based on the past values and along a critical path that provides a high hit rate for string similarity matching as proposed in the present invention, and does not provide length-based string filtering in which the strings are sorted based on character offset as proposed in the present invention.
The problem described above leads to the solution proposed in the present invention, which utilizes an adaptive window filtering engine that moves in parallel, wherein the window acts as a bandpass filter to compute only the edit distance values based on the past values and moves along a critical path to eliminate unnecessary computation.
SUMMARY OF INVENTION
The present invention relates to an apparatus and method for parallel moving adaptive window filtering edit distance computation. In particular, the invention relates to an apparatus and method that utilizes an adaptive window filtering mechanism to accelerate edit distance computation for increased efficiency and shorter times by searching along a critical path that provides a high hit rate for string similarity matching.
One aspect of the invention provides an apparatus (200) for parallel moving adaptive window filtering edit distance computation. The apparatus (200) comprising at least one central processing unit (202); at least one storage unit (210); at least one memory module (204, 206); at least one I/O hub (212); a plurality of input devices (214); and at least one parallel computation acceleration device (208). The at least one parallel computation acceleration device (208) further comprising at least one match filtering scoping unit (302) which scopes a region of interest of a reference list that is most similar to a search string for matching operation; at least one edit distance adaptive window determining unit (304) to prepare for window setting based on user input and to calculate distance values on critical path.
Another aspect of the invention provides a method (500) for parallel moving adaptive window filtering edit distance computation. The method (500) comprises steps of extracting from input and reference source to obtain parameter inputs from both search and reference list (502). It is further scoped into the match filtering scoping unit (504) for initialization of the adaptive window (506). The window position is set to compute the edit distance (508) and an output is provided which determines if the string is tagged as match or not match (510). Further, the process concludes (514) if it has reached the end of character offset (512). Else, the process repeats from scoping into the match filtering unit until it is concluded that it has reached the end of character offset.
Yet another aspect of the invention provides for the step for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via the match filtering scoping unit (504). The said step further comprises sorting the reference table according to character offset (602) and matching the search string with the reference string according to character offset (604). Thereafter, the scope of region is located (606).
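The scoping step can be pictured with the following sketch. It rests on one reading of the specification that is an assumption here: "sorting according to character offset" is taken to mean ordering the reference strings by the single character found at a given offset, and "exact matching" is taken to mean selecting the contiguous run of reference strings whose character at that offset equals that of the search string. The function name `scope_region` and the half-open `[offset_start, offset_end)` convention are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def scope_region(reference: List[str], search: str, offset: int) -> Tuple[List[str], int, int]:
    """Sort `reference` by the character at `offset` and locate the run of
    entries that match `search` at that offset (assumed interpretation)."""

    def key(s: str) -> str:
        # Strings shorter than the offset sort first under the empty key.
        return s[offset] if offset < len(s) else ""

    ordered = sorted(reference, key=key)       # step 602 (assumed meaning)
    keys = [key(s) for s in ordered]
    target = key(search)
    offset_start = bisect_left(keys, target)   # first entry with a matching character
    offset_end = bisect_right(keys, target)    # one past the last matching entry
    return ordered, offset_start, offset_end


# Usage: only ordered[offset_start:offset_end] is passed on to the edit distance stage.
ordered, offset_start, offset_end = scope_region(
    ["apple", "maple", "ample", "apply", "angle"], "apply", 1
)
```

Under this reading, repeating the call for each offset from 0 to N corresponds to the loop that runs until the end of character offset is reached.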
Still another aspect of the invention provides for the step for initializing the adaptive window (506). The said step further comprises determining the approve level for each pair of strings (804). The length difference of the strings is determined (806). The window frame is determined by adding the size of the base frame and the approve level, where the maximum frame size is a constant (816).
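The window frame computation of step 816 can be written as a one-line rule. The sketch below assumes that "maximum frame size is a constant" means the sum is capped at that constant; the value of `MAX_FRAME` and the function name `window_frame` are hypothetical, and how the length difference of step 806 feeds into the base frame is not modeled.

```python
MAX_FRAME = 8  # hypothetical value; the specification only states that the maximum is a constant


def window_frame(base_frame: int, approve_level: int, max_frame: int = MAX_FRAME) -> int:
    """Window frame = base frame + approve level, capped at a constant maximum
    (the capping behaviour is an assumption)."""
    if base_frame < 1 or approve_level < 0:
        raise ValueError("base_frame must be positive and approve_level non-negative")
    return min(base_frame + approve_level, max_frame)
```

A larger approve level therefore widens the window, so more cells around the critical path are computed, up to the fixed maximum.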
A further aspect of the invention provides for the step for providing feedback on the new window position if it is not the final character of the compare string (908), which further comprises steps of locating the lowest edit distance value in the window closest to the right edge of the window (1002); and providing a feedback of the start of the next window position for the next row at the minimum column (1004).
Still another aspect of the invention provides for the step for comparing end of character if it is the final character of base string of compare string (910). The said step further comprises steps of determining if it is the final column of the base string (1102) and obtaining the last edit distance value (EDvalue) in the last row of the compare string and the last column of the base string if it is the final column of the base string (1104). It is further determined whether the output is tagged as match (1108) or not match (1110).
Yet another aspect of the invention provides for the step for moving the window through rows of the compare string if it is not the final column of the base string (1112). The said step further comprises steps of checking if the end of the window is hitting the last column of the base string (1202). The window is moved down each row to calculate the edit distance value to match the similarity of the strings (1204).
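The window movement described in these aspects can be illustrated with a single-threaded sketch. It is an approximation under stated assumptions, not the claimed implementation: plain Levenshtein costs are used without transposition, cells outside the window are treated as unreachable, the window start is clamped so the window never slides past the last column, and the handling of a window that never reaches the final column (steps 1112 and 1202 to 1204) is not modeled and yields `None`. The names `windowed_distance` and `tag_match`, and the rule that a match means the final value does not exceed the approve level, are hypothetical.

```python
from typing import Optional

INF = float("inf")


def windowed_distance(compare: str, base: str, width: int) -> Optional[int]:
    """Edit distance computed only inside a moving window of `width` columns.

    Rows follow `compare`, columns follow `base`. Returns None when the
    window does not cover the last column on the final row.
    """
    if width < 2:
        raise ValueError("width must be at least 2 so the window can advance")
    n = len(base)
    last_start = max(0, n + 1 - width)  # right-most allowed window start
    start = 0
    # Row 0: turning the empty prefix into base[:j] costs j insertions.
    prev = {j: j for j in range(min(width, n + 1))}
    for i in range(1, len(compare) + 1):
        cur = {}
        for j in range(start, min(start + width, n + 1)):
            if j == 0:
                cur[j] = i  # deleting i characters of compare
                continue
            cost = 0 if compare[i - 1] == base[j - 1] else 1
            cur[j] = min(
                prev.get(j, INF) + 1,         # deletion (cell above)
                cur.get(j - 1, INF) + 1,      # insertion (cell to the left)
                prev.get(j - 1, INF) + cost,  # substitution or match (diagonal)
            )
        # Feedback: lowest value in the window, right-most column on ties.
        low = min(cur.values())
        min_col = max(j for j, v in cur.items() if v == low)
        start = min(min_col, last_start)      # next row's window starts here
        prev = cur
    return prev.get(n)


def tag_match(compare: str, base: str, width: int, approve_level: int) -> bool:
    """Tag the pair as match when the final value is within the approve level."""
    value = windowed_distance(compare, base, width)
    return value is not None and value <= approve_level
```

Each row costs at most `width` cell updates instead of one per column of the base string. Because every computed value is the cost of a real edit sequence, the result is never lower than the true distance, but it may be higher when the critical path leaves the window.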
The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS
To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings in which:
FIG. 1.0 illustrates an example of computation utilizing Damerau or Levenshtein edit distance values for comparing two data strings.
FIG. 2.0 illustrates the general architecture of the apparatus of the present invention.
FIG. 3.0 illustrates the detailed architecture of the apparatus within the parallel computation acceleration device.
FIG. 4.0 illustrates a block diagram of the process flow of the parallel computation acceleration device (208).
FIG. 5.0 is a flowchart illustrating the general methodology of the present invention for parallel moving adaptive window filtering edit distance computation.
FIG. 6.0 is a flowchart illustrating the steps for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via match filtering scoping unit.
FIG. 7.0 is a diagram which illustrates an example of sorting by character offset.
FIG. 8.0 is a flowchart illustrating the steps for initializing adaptive window.
FIG. 9.0 is a flowchart illustrating the steps for computing edit distance via edit distance adaptive window determining unit.
FIG. 10.0 is a flowchart illustrating steps for providing feedback on new window position if it is not the final character of compare string.
FIG. 11.0 is a flowchart illustrating the steps for comparing end of character if it is the final character of base string of compare string.
FIG. 12.0 is a flowchart illustrating steps for moving window through rows of compare string if it is not the final column of base string.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention relates to an apparatus and method for parallel moving adaptive window filtering edit distance computation. In particular, the invention relates to an apparatus and method that utilizes an adaptive window filtering mechanism to accelerate edit distance computation for increased efficiency and shorter times by searching along a critical path that provides a high hit rate for string similarity matching.
Hereinafter, this specification will describe the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.
Reference is first made to FIGs. 2.0, 3.0 and 4.0 respectively. Referring to FIG. 2.0, a general architecture of the apparatus of the present invention is illustrated. As illustrated in FIG. 2.0, the apparatus (200) for parallel moving adaptive window filtering edit distance computation comprises a central processing unit (202); a storage unit (210); a memory module (204, 206); an I/O (input/output) hub (212); input devices (214); and a parallel computation acceleration device (208). Further, as illustrated in FIGs. 4.0 and 5.0, the parallel computation acceleration device (208) comprises a match filtering scoping unit (302, 410) and an edit distance adaptive window determining unit (304, 513).
The process flow of the parallel computation acceleration device (208) is illustrated in FIG. 4.0 wherein the match filtering scoping unit (302, 410) scopes down a region of interest of a reference list that is most similar to a search string to be matched for matching operation. This is to ensure full coverage of matching of the entire column when the steps are iterated with the reference list being sorted according to a predefined character offset. The edit distance adaptive window determining unit (304, 513) further comprises a parallelized adaptive window computation unit (308, 414) and a feedback based filtering threshold adjustment unit (306, 412). The edit distance adaptive window determining unit (304, 513) prepares for adaptive window setting based on user input and calculates adaptive distance values on critical path while the parallelized adaptive window computation unit (308, 414) within the edit distance adaptive window determining unit (304, 513) calculates edit distance between all characters specified by using an adaptive window filter traversing along a critical path to minimize computation and memory demands. Further, the feedback based filtering threshold adjustment unit (306, 412) initializes and continually moves the window according to the input of the adaptive window computation unit.
A general method (500) of an embodiment of the invention is illustrated in FIG. 5.0, wherein the method (500) begins by extracting from the input and reference source to obtain parameter inputs from both the search list and the reference list (502). Examples of parameter inputs include at least a string, a length and an index from both the search list and the reference list. A reference list is a list being compared against and a search list is a list to be matched. Thereafter, the region of interest of the reference list that is most similar to a search string is scoped down for the matching operation via the match filtering scoping unit (504). Subsequently, the adaptive window is initialized (506) and the edit distance is computed via the edit distance adaptive window determining unit (508) to provide an output which determines whether the string is tagged as match or not match (510). It is further determined whether the end of character offset is 'YES' (512); the process ends if it is 'YES' (514), otherwise the said steps are iterated, starting again from scoping down a region of interest for the matching operation via the match filtering scoping unit (504). The character offset is a text or numeral from 0 to N, wherein N is the length of character offset of the reference string.
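The control flow of method (500) can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the function name run_method, the three callables scope, init_window and compute, the per-reference inner loop, and the rule that a reference string stays tagged as match once any offset has matched it are all hypothetical and are not taken from the specification.

```python
def run_method(search, reference_list, max_offset, scope, init_window, compute):
    """Iterate steps 504 to 512 over character offsets 0..max_offset.

    scope(reference_list, search, offset) -> iterable of candidate references (504)
    init_window(ref, search) -> (base, compare, frame) or None to skip the pair (506)
    compute(base, compare, frame) -> True for match, False for not match (508, 510)
    All three callables are hypothetical stand-ins supplied by the caller.
    """
    tags = {}
    for offset in range(max_offset + 1):          # loop until end of character offset (512)
        for ref in scope(reference_list, search, offset):
            setup = init_window(ref, search)
            if setup is None:                     # pair rejected, move to the next pair
                continue
            base, compare, frame = setup
            is_match = compute(base, compare, frame)
            # Assumption: a reference matched at any offset remains tagged as match.
            tags[ref] = tags.get(ref, False) or is_match
    return tags
```

Passing the three steps in as callables keeps the sketch independent of how each unit is actually realized on the acceleration device.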
A detailed description of the steps for scoping down a region of interest of a reference list that is most similar to a search string for matching operation via a match filtering scoping unit (504) is illustrated in FIG. 6.0 along with the illustrations as depicted in FIG. 7.0. FIG. 7.0 is a diagram which illustrates an example of sorting by character offset. As illustrated in FIG. 6.0, the reference table is sorted according to the character offset (602) wherein the exact matching of a search string (search) is performed with reference string according to the character offset (604); and the scope of region is located wherein the start is indicated by offsetStart and the end is offsetEnd (606).
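One possible reading of steps (602) to (606) can be sketched in Python as follows. The function name scope_region, the use of binary search, the empty-string key for strings shorter than the offset, and the choice of an exclusive offsetEnd are assumptions made for illustration; the specification only states that the table is sorted by character offset, exact matching is performed at that offset, and the region is bounded by offsetStart and offsetEnd.

```python
from bisect import bisect_left, bisect_right


def scope_region(reference_list, search, offset):
    """Sort by the character at `offset` and locate the matching region.

    Returns (sorted_refs, offset_start, offset_end), where
    sorted_refs[offset_start:offset_end] are the reference strings whose
    character at `offset` equals the search string's character there.
    """
    def key(s):
        # Assumption: strings shorter than the offset sort first with an empty key.
        return s[offset] if offset < len(s) else ""

    sorted_refs = sorted(reference_list, key=key)      # step 602
    keys = [key(s) for s in sorted_refs]
    target = key(search)                               # step 604: exact match on one character
    offset_start = bisect_left(keys, target)           # step 606
    offset_end = bisect_right(keys, target)            # exclusive end (assumption)
    return sorted_refs, offset_start, offset_end


refs = ["apple", "banana", "avocado", "cherry", "blueberry"]
sorted_refs, start, end = scope_region(refs, "apricot", 0)
print(sorted_refs[start:end])
```

Repeating the call for each offset, as the general method does, is what lets every reference string that shares at least one aligned character with the search string be visited.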
Referring to FIG. 8.0, the steps for initializing the adaptive window (506) begin by calculating the approve level (approve) for each pair of reference string (ref) and search string (search) (804). Thereafter, the length difference (diff) between the reference string (ref) and the search string (search) is determined (806). The difference is then checked to confirm whether it is less than or equal to the approve level (diff <= approve) (808). If it is, the method proceeds to determine whether the reference length is more than the search length (Length(ref) > Length(search)) (810). If the difference is not less than or equal to the approve level, the operation is aborted and the method moves to a new pair of strings (818, 820). If the reference length is more than the search length (Length(ref) > Length(search)), the base string is set as the reference string (ref) and the compare string is set as the search string (search) (814). If the reference length is less than the search length, the base string is set as the search string (search) and the compare string is set as the reference string (ref) (812). Subsequently, the window frame is determined by adding the size of the base frame and the approve level, where the maximum frame size is a constant (816).
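The initialization steps (804) to (820) can be sketched in Python as follows. Several details are assumptions: the specification does not say how the approve level is calculated, so it is passed in as a parameter; base_frame and max_frame are hypothetical constants; the cap at max_frame is one reading of "maximum frame size is a constant"; and the equal-length case, which the text does not cover, is placed in the branch that uses the search string as base.

```python
def init_window(ref, search, approve, base_frame=1, max_frame=16):
    """Return (base, compare, frame), or None when the pair is rejected.

    approve    -- approve level for this pair (804), supplied by the caller
    base_frame -- hypothetical base frame size
    max_frame  -- hypothetical constant upper bound on the frame size
    """
    diff = abs(len(ref) - len(search))        # step 806
    if diff > approve:                        # step 808 fails: abort this pair (818, 820)
        return None
    if len(ref) > len(search):                # step 810
        base, compare = ref, search           # step 814
    else:
        base, compare = search, ref           # step 812 (equal lengths land here by assumption)
    frame = min(base_frame + approve, max_frame)   # step 816
    return base, compare, frame


print(init_window("kitten", "sitting", approve=2))
```

Rejecting pairs on length difference alone is cheap, since two strings whose lengths differ by more than the approve level cannot have an edit distance within it.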
Reference is now made to FIGs. 9.0, 10.0, 11.0 and 12.0 respectively. As illustrated in FIG. 9.0, the step for computing the edit distance via the edit distance adaptive window determining unit (508) further comprises the steps of setting the window position to the first column of the base string and the compare string character position nc = 0 (902). The edit distance values for the characters covered by the window length, between the nc character of the compare string and the base string window, are computed (904), and the final character of the compare string is determined (906). Feedback on the new window position is provided if it is not the final character of the compare string (908), and the end of character is compared if it is the final character of base string of the compare string (910).
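To illustrate the general principle of computing only the cells near the critical path, the sketch below uses a swapped-in, well-known technique: a fixed diagonal band with a cut-off threshold. It is not the adaptive moving window of FIGs. 9.0 to 12.0, whose window position is adjusted row by row through feedback. It also computes plain Levenshtein distance (insertion, deletion, substitution) and omits the transposition operation of Damerau-Levenshtein. The function name banded_edit_distance is hypothetical; approve plays the role of the approve level.

```python
def banded_edit_distance(base, compare, approve):
    """Levenshtein distance if it is <= approve, otherwise None.

    Only cells with |row - column| <= approve are evaluated; every other
    cell is treated as exceeding the threshold.
    """
    n, m = len(base), len(compare)
    if abs(n - m) > approve:
        return None
    over = approve + 1                        # stands for "greater than approve"
    # Border row: distance from the empty prefix of compare to prefixes of base.
    prev = [j if j <= approve else over for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [over] * (n + 1)
        if i <= approve:
            cur[0] = i                        # border column, inside the band
        lo, hi = max(1, i - approve), min(n, i + approve)
        for j in range(lo, hi + 1):
            cost = 0 if compare[i - 1] == base[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # substitution or match
                         prev[j] + 1,         # character of compare left unmatched
                         cur[j - 1] + 1,      # character of base left unmatched
                         over)                # clamp values beyond the threshold
        prev = cur
    return prev[n] if prev[n] <= approve else None


print(banded_edit_distance("sitting", "kitten", 3))
```

With a band of half-width approve, each row touches at most 2 * approve + 1 cells instead of the full row, which is the kind of saving in computation and memory that a window restricted to the critical path aims for.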
As illustrated in FIG. 10.0, the steps for providing feedback on the new window position if it is not the final character of the compare string (908) further comprise the steps of locating the lowest edit distance value minw in the window closest to the right edge of the window (1002); and providing feedback of the start of the next window position for the next row at the minw column (1004). The detailed steps for comparing the end of character if it is the final character of base string of compare string (910) are illustrated in FIG. 11.0, which begins by determining if it is the final column of the base string (1102) and thereafter obtaining the last edit distance value (EDvalue) in the last row of the compare string and the last column of the base string if it is the final column of the base string (1104). It is further determined whether the last edit distance value (EDvalue) is less than or equal to the approve level (1106); the output is tagged as match if the last edit distance value EDvalue in the last row of the compare string is approved (1108), and the output is tagged as not match if the last edit distance value EDvalue in the last row of the compare string is not approved (1110). Alternatively, the window is moved through the rows of the compare string if it is not the final column of the base string (1112). The steps for moving the window through the rows of the compare string begin by checking whether the end of the window is hitting the last column of the base string (1202). Thereafter, the window is moved down each row and the edit distance value for the window length is calculated between the nc character of the compare string and the base string window (1204).
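The feedback step (1002, 1004) and the final tagging step (1106 to 1110) can be sketched in Python as follows. The function names next_window_start and tag_output are hypothetical, and "closest to the right edge" is read here as a tie-break: among equal lowest values in the window, the rightmost one is chosen. The window is assumed to hold at least one value.

```python
def next_window_start(window_values, window_start):
    """Return the base-string column of minw for the next row's window.

    window_values -- edit distance values of the current row inside the window
    window_start  -- base-string column of the first value in window_values
    """
    lowest = min(window_values)
    # Scan from the right edge so that ties resolve to the rightmost column (1002).
    for k in range(len(window_values) - 1, -1, -1):
        if window_values[k] == lowest:
            return window_start + k           # start of the next window (1004)


def tag_output(ed_value, approve):
    """Tag the pair from the last edit distance value EDvalue (1106 to 1110)."""
    return "match" if ed_value <= approve else "not match"


print(next_window_start([2, 1, 1, 3], window_start=4))
print(tag_output(2, approve=2))
```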
The present invention utilizes an adaptive window filtering engine that moves in parallel, wherein the window acts as a bandpass filter to compute only the edit distance values based on the past values and moves along a critical path to eliminate unnecessary computation. In the present invention, the length-based string filtering is performed automatically for region scoping of the whole column, wherein the strings are sorted based on a character offset. This is to ensure that all possibilities of matching have been covered for the whole column. Further, a feedback mechanism is provided for the automatic threshold-based filtering to initialize the adaptive window setting based on user input, to prepare for the threshold-based filtering, to calculate the adaptive edit distance values on the critical path, and to obtain the final adaptive edit distance value to determine whether the comparison is the intended match or not.
The present invention may be embodied in other specific forms without departing from its essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

1. An apparatus (200) for parallel moving adaptive window filtering edit distance computation comprising:
at least one central processing unit (202);
at least one storage unit (210);
at least one memory module (204, 206);
at least one I/O hub (212);
a plurality of input devices (214); and
at least one parallel computation acceleration device (208)
characterized in that
the at least one parallel computation acceleration device (208) further comprising:
at least one match filtering scoping unit (302) to scope down a region of interest of a reference list that is most similar to a search string for matching operation;
at least one edit distance adaptive window determining unit (304) to prepare for adaptive window setting based on user input and to calculate adaptive distance values on critical path.
2. An apparatus according to Claim 1, wherein the at least one edit distance adaptive window determining unit (304) further comprising:
at least one parallelized adaptive window computation unit (308) to calculate edit distance between all characters specified by using an adaptive window filter traversing along a critical path to minimize computation and memory demands; and
at least one feedback based filtering threshold adjustment unit (306) to initialize and continually move the said window according to input of said adaptive window computation unit.
3. A method (500) for parallel moving adaptive window filtering edit distance computation comprising steps of:
extracting from input and reference source to obtain parameter inputs from both search and reference list (502);
scoping down a region of interest of a reference list that is most similar to a search string for matching operation via match filtering scoping unit (504);
initializing adaptive window (506);
computing edit distance via edit distance adaptive window determining unit (508);
providing output which determines if string is tagged as match or not match (510);
determining if end of character offset is 'YES' (512);
ending the process if said character offset is 'YES' (514) else iterating steps by scoping down a region of interest for matching operation via match filtering scoping unit (504)
characterized in that
computing edit distance via edit distance adaptive window determining unit (508) further comprises steps of:
setting window position to first column of base string and compare string character position nc = 0 (902);
computing edit distance values for characters covered by window length between nc character of compare string and base string window (904);
determining final character of compare string (906);
providing feedback on new window position if it is not the final character of compare string (908); and
comparing end of character if it is the final character of base string of compare string (910).
4. A method according to Claim 3, wherein scoping down a region of interest of a reference list that is most similar to a search string for matching operation via match filtering scoping unit (504) further comprises steps of:
sorting reference table according to character offset (602);
performing exact matching of search string (search) with reference string according to character offset (604); and
locating scope of region wherein start is indicated by offsetStart and end is offsetEnd (606).
5. A method according to Claim 3, wherein initializing adaptive window (506) further comprises steps of:
determining approve level (approve) for each pair of reference string and search string (804);
determining length difference between reference string and search string (806);
determining said difference to confirm if difference is less than or equal to approve level (808);
proceeding to determine if reference length is more than the search length (810) if difference is less than or equal to approve level else if difference is not less than or equal to approve level, abort operation and move to a new pair of string (818, 820);
if reference length is more than search length set the base string as reference, and set the compare string as search (814);
if reference length is less than search length, set the base string as search, and set the compare string as reference (812); and
determining window frame by adding size of base frame and approve level, where maximum frame size is a constant (816).
6. A method according to Claim 3, wherein providing feedback on new window position if it is not the final character of compare string (908) further comprises steps of:
locating lowest edit distance value minw in window closest to right edge of window (1002); and
providing feedback of the start of next window position for the next row at the minw column (1004).
7. A method according to Claim 3, wherein comparing end of character if it is the final character of base string of compare string (910) further comprises steps of:
determining if it is the final column of base string (1102);
obtaining last edit distance value (EDvalue) in last row of compare string and last column of base string if it is the final column of base string (1104);
determining if last edit distance value (EDvalue) is less than or equal to approve level (1106);
tagging output as match if last edit distance value EDvalue in last row of compare string is approved (1108);
tagging output as not match if last edit distance value EDvalue in last row of compare string is not approved (1110); and
moving window through rows of compare string if it is not the final column of base string (1112).
8. A method according to Claim 3, wherein reference list is a list being compared against and search list is a list to be matched.
9. A method according to Claim 4, wherein character offset is text or numeral from 0-N wherein N is the length of character offset of the reference string.
10. A method according to Claim 7, wherein moving window through rows of compare string if it is not the final column of base string (1112) further comprises steps of:
checking if end of window is hitting last column of base string (1202); and
moving window down each row and calculating edit distance value for window length between nc character of compare string and base string window (1204).
PCT/MY2014/000157 2013-12-09 2014-06-04 An apparatus and method for parallel moving adaptive windo filtering edit distance computation WO2015088314A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2013004420A MY163947A (en) 2013-12-09 2013-12-09 An apparatus and method for parallel moving adaptive window filtering edit distance computation
MYPI2013004420 2013-12-09

Publications (1)

Publication Number Publication Date
WO2015088314A1 true WO2015088314A1 (en) 2015-06-18

Family

ID=51743530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000157 WO2015088314A1 (en) 2013-12-09 2014-06-04 An apparatus and method for parallel moving adaptive windo filtering edit distance computation

Country Status (2)

Country Link
MY (1) MY163947A (en)
WO (1) WO2015088314A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883245A (en) * 2021-02-28 2021-06-01 湖南工商大学 GPU (graphics processing Unit) stream-based rapid parallel character string matching method and system
CN114117145A (en) * 2020-08-27 2022-03-01 东北大学秦皇岛分校 String filtering algorithm based on bit operation and SIMD parallel operation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6295524B1 (en) 1996-10-29 2001-09-25 Nec Research Institute, Inc. Learning edit distance costs
US20040220920A1 (en) * 2003-02-24 2004-11-04 Bax Eric Theodore Edit distance string search
US7584173B2 (en) 2003-02-24 2009-09-01 Avaya Inc. Edit distance string search
US20070085716A1 (en) * 2005-09-30 2007-04-19 International Business Machines Corporation System and method for detecting matches of small edit distance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Levenshtein distance", 28 November 2013 (2013-11-28), pages 1 - 6, XP055159192, Retrieved from the Internet <URL:http://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=583605633> [retrieved on 20141217] *

Also Published As

Publication number Publication date
MY163947A (en) 2017-11-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14786372

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14786372

Country of ref document: EP

Kind code of ref document: A1