WO2017070771A1 - System and method for determining common subsequences - Google Patents

System and method for determining common subsequences

Info

Publication number
WO2017070771A1
Authority
WO
WIPO (PCT)
Prior art keywords
strings
common
string
substrings
substring
Prior art date
Application number
PCT/CA2015/051088
Other languages
English (en)
Inventor
Chad Ternent
Darren Redfern
Original Assignee
Intelliresponse Systems Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelliresponse Systems Inc.
Priority to CA3003061A
Priority to PCT/CA2015/051088
Publication of WO2017070771A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/02Comparing digital values
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/02Indexing scheme relating to groups G06F7/02 - G06F7/026
    • G06F2207/025String search, i.e. pattern matching, e.g. find identical word or best match in a string

Definitions

  • This relates to data processing, and more particularly, to determining common subsequences of two or more strings.
  • sequences to be analyzed may be represented in a computer as a string— a sequence of terms— each term of the string representing an element of the sequence. These strings may then be analyzed and compared by the computer in order to determine common substrings.
  • a computer-implemented method of generating a list of substrings that are common to at least two strings in a plurality of strings, wherein each of the plurality of strings comprises a sequence of terms, and wherein each of the substrings comprises a sequence of one or more of the terms, the method comprising: forming a reverse index of the terms in the plurality of strings, the reverse index identifying, for each of the terms, the one or more strings containing that term and position therein; arranging the plurality of strings in an order; and for each one of the strings in the order: determining, using the reverse index, substrings common to the one of the strings and subsequent ones of the strings in the order; and for each one of those common substrings, saving an indication associating that common substring with the one of the strings and the subsequent ones of the strings in which that substring is found.
  • a computer system for generating a list of substrings that are common to at least two strings in a plurality of strings, the system comprising: at least one processor; a memory in communication with the at least one processor; instructions stored in the memory that, when executed by the at least one processor, cause the computer system to: form a reverse index of the terms in the plurality of strings, the reverse index identifying, for each of the terms, the one or more strings containing that term and position therein; arrange the plurality of strings in an order; and for each one of the strings in the order: determine, using the reverse index, substrings common to the one of the strings and subsequent ones of the strings in the order; and for each one of those common substrings, save an indication associating that common substring with the one of the strings and the subsequent one of the strings in which that substring is found.
  • a non-transitory computer readable storage medium storing instructions that, when executed, adapt a computer to: form a reverse index of the terms in the plurality of strings, the reverse index identifying, for each of the terms, the one or more strings containing that term and position therein; arrange the plurality of strings in an order; and for each one of the strings in the order: determine, using the reverse index, substrings common to the one of the strings and subsequent ones of the strings in the order; and for each one of those common substrings, save an indication associating that common substring with the one of the strings and the subsequent one of the strings in which that substring is found.
  • FIG. 1 is a high level block diagram of a computing device, exemplary of an embodiment;
  • FIG. 2 illustrates the software organization of the computer of FIG. 1;
  • FIG. 3 is a flowchart depicting example blocks performed by the string processing software of FIG. 2;
  • FIG. 4 illustrates a representation of a reverse index, exemplary of an embodiment, for an example set of strings;
  • FIG. 5 is a further flowchart depicting example blocks performed by the string processing software of FIG. 2;
  • FIGS. 6A and 6B illustrate a source code listing depicting pseudo-code exemplary of an embodiment.
  • FIG. 1 is a high level block diagram of a computing device, exemplary of an embodiment. As will become apparent, the computing device includes software that analyzes two or more strings to determine longest common substrings.
  • the computing device 10 includes one or more processors 12, a memory 14, and one or more I/O interfaces 16 in communication over bus 18.
  • One or more processors 12 may be one or more Intel x86, Intel x64, AMD x86-64, PowerPC, ARM processors or the like.
  • Memory 14 may include random-access memory, read-only memory, or persistent storage such as a hard disk, a solid-state drive or the like.
  • Read-only memory or persistent storage is a computer-readable medium.
  • a computer-readable medium may be organized using a file system, controlled and administered by an operating system governing overall operation of the computing device.
  • One or more I/O interfaces 16 may serve to interconnect the computing device with peripheral devices, such as for example, keyboards, mice, and the like.
  • One or more I/O interfaces 16 may also comprise communication devices such as, for example network controllers, modems, and the like that may serve to interconnect the computing device with one or more computer networks such as, for example, a local area network (LAN) or the Internet.
  • Software comprising instructions is executed by one or more processors 12 from a computer-readable medium.
  • software may be loaded into random-access memory from persistent storage of memory 14 or from one or more devices via I/O interfaces 16 for execution by one or more processors 12.
  • software may be loaded and executed by one or more processors 12 directly from read-only memory.
  • FIG. 2 depicts a simplified organization of example software components stored within memory 14 of computing device 10. As illustrated these software components include operating system (OS) software 20, and string processing software 22.
  • OS software 20 may be, for example, Microsoft Windows, UNIX, Linux, Mac OSX, or the like. OS software 20 allows string processing software 22 to access one or more processors 12, memory 14, and one or more I/O interfaces 16 of the computing device.
  • String processing software 22 adapts computing device 10, in
  • String processing software 22 is executed by one or more processors 12 so as to process a plurality of strings to generate a list of substrings common to at least two strings of the plurality.
  • Each of the strings and substrings comprises a sequence of one or more terms. The terms may, or may not, be individually delimited.
  • the contents of the strings may represent anything that can be represented as an ordered sequence.
  • input may be supplied as two sequences of terms (each term in the string consisting of a single character and optionally separated from adjacent terms by a space): "A B D B C A D" and "B A B B A".
  • a subsequence is any sequence of terms contained within another sequence, up to and including the entire sequence.
  • Each term may be made up of one or more characters in some character set such as, for example, Unicode, ASCII, or EBCDIC. Within the string, the terms may be delimited by particular characters. For example, terms may be separated by a space character or a comma. As an example, a string could be a query string comprising one or more words, each word a sequence of one or more characters, the words separated by spaces.
  • each term may have a pre-defined size - e.g. a single character; two characters; or the like.
  • terms may represent non-character data such as numbers.
  • Numbers may be integers represented by various encodings such as binary coded decimal, ones-complement, twos-complement, or fixed point.
  • Numbers may also be floating point numbers represented by some arithmetic format such as those set out in the various IEEE floating point standards and the like.
  • terms may represent specialized data such as proteins.
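The delimited and fixed-size term layouts described above can be illustrated with a minimal sketch. The function names `split_delimited` and `split_fixed` are hypothetical and are not taken from the patent; the sketch assumes terms are plain character data.

```python
def split_delimited(text, delimiter=" "):
    """Split a string into terms separated by a delimiter character.

    Empty terms (e.g. from doubled delimiters) are dropped; that choice
    is an assumption of this sketch.
    """
    return [term for term in text.split(delimiter) if term]


def split_fixed(text, size):
    """Split a string into terms of a pre-defined size, in characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


print(split_delimited("cancel my credit card"))
print(split_fixed("ABDBCAD", 1))
```

Either function yields a list of terms, which is the form the later sketches operate on.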
  • Strings may have been previously stored in memory 14.
  • the terms of each string may be stored contiguously in memory in the format to be processed.
  • the terms of each string may be stored in various data structures, such as for example a compressed form, from which the sequence of terms comprising a string may be reconstructed by appropriate processing.
  • the strings may be received using one or more I/O interfaces 16 and processed in a streaming fashion as received, either with or without intermediate buffering to, for example, memory 14.
  • Blocks S300 and onward are performed by one or more processors 12 executing string processing software 22 at computing device 10.
  • one or more processors 12 form a data structure known as a reverse index according to string processing software 22.
  • A reverse index is a data structure known to persons skilled in natural language processing. Same or similar data structures may also be referred to variously as an "inverted index" or a "positional postings list". Fundamental to any such data structure is that it identifies, for each term found in any of one or more strings, the one or more strings containing that term and the position of the term in each string.
  • Reverse indexes may be generated according to standard methods known to skilled persons. According to such methods, a single reverse index may be generated for the entire body of input; that is, all strings of the plurality of strings.
  • more than one reverse index may be generated for a given set of strings, each index considering some subset of the set of strings such that, for example, all strings are considered in at least one reverse index.
  • one or more processors 12 consult reverse indexes in executing string processing software 22. Where there is more than one reverse index, one or more processors 12 may consult one or more of the more than one reverse indexes during processing such as according to the strings being compared at a given step.
  • FIG. 4 illustrates a representation of an example reverse index for an example set of strings.
  • Reverse index 40 as illustrated is constructed over two example strings: a first string, "cancel credit card" and a second string, "cancel my credit card".
  • Entry 42 shows that the term "credit" is found in the first string at a position of offset 2 and in the second string at a position of offset 3.
  • Entry 48 shows that the term "my" is found only in the second string at position of offset 2. The absence of an entry corresponding to the first string implies that the term is not found in that string.
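As a rough illustration of the entries just described, the following sketch builds a positional reverse index over the two example strings. The function name `build_reverse_index` and the dictionary layout are assumptions of this sketch, not the representation of FIG. 4. Offsets are counted from 1 here only so that the results agree with the offsets quoted for entries 42 and 48.

```python
from collections import defaultdict


def build_reverse_index(strings, first_offset=1):
    """Map each term to {string number: [offsets of the term in that string]}.

    Strings are numbered from 1 and terms are assumed to be space-delimited.
    """
    index = defaultdict(dict)
    for string_no, text in enumerate(strings, start=1):
        for offset, term in enumerate(text.split(), start=first_offset):
            index[term].setdefault(string_no, []).append(offset)
    return dict(index)


index = build_reverse_index(["cancel credit card", "cancel my credit card"])
print(index["credit"])  # {1: [2], 2: [3]}
print(index["my"])      # {2: [2]}
```

The absence of key `1` under `"my"` plays the role of the missing entry for the first string.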
  • Reverse indexes may be represented in memory 14 in various formats such as, for example, in an array, a linked list, a hash table, or some combination thereof.
  • an array could be maintained with an element corresponding to each term in a set of strings, with the element pointing to a linked list of records, each record indicating the various strings in which that term occurs and its position therein.
  • a multidimensional array could be maintained with each row of the multidimensional array corresponding to a term in a set of strings, and each column corresponding to a string in the set of strings. Then, each element could be set to the position of that term in that string. If the term does not occur in that string, an indication of this could instead be stored in the element such as by, for example, storing a sentinel value, such as, for example, the maximum integer, in that element.
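The multidimensional-array representation with a sentinel value might look like the following sketch. The names `SENTINEL` and `build_position_table` are hypothetical. Two assumptions are made: offsets count from 1, and only the first occurrence of a term in a string is stored, since a single array element holds a single position.

```python
import sys

SENTINEL = sys.maxsize  # stands in for "term does not occur in this string"


def build_position_table(strings):
    """Return (terms, table) where table[row][col] is the offset of
    terms[row] in strings[col], or SENTINEL if the term is absent."""
    term_lists = [text.split() for text in strings]
    terms = sorted({term for tl in term_lists for term in tl})
    row_of = {term: row for row, term in enumerate(terms)}
    table = [[SENTINEL] * len(term_lists) for _ in terms]
    for col, tl in enumerate(term_lists):
        for offset, term in enumerate(tl, start=1):
            row = row_of[term]
            if table[row][col] == SENTINEL:  # keep the first occurrence only
                table[row][col] = offset
    return terms, table


terms, table = build_position_table(["cancel credit card", "cancel my credit card"])
for term, row in zip(terms, table):
    print(term, ["-" if v == SENTINEL else v for v in row])
```

A dense table like this trades memory for constant-time lookup, which is one of the trade-offs between representations that the description goes on to mention.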
  • the representation of a reverse index in memory may also vary in other ways.
  • the indication of a particular string in a reverse index may have some other format such as, for example, a pointer to a memory location or an index into some other data structure.
  • the offset into a string in a reverse index may have some other representation. As an example, rather than the first location in a string being represented as offset 0, it may be represented as offset 1. As another example, offsets may count right-to-left rather than the left-to-right counting illustrated in FIG. 4.
  • the indication of a particular string and the offset therein may be combined in some reverse indexes such as by providing, for example, a pointer to a memory location falling within the in-memory representation of the string or an index into some other data structure.
  • Various representations of a reverse index may offer trade-offs between computing resources consumed for construction and/or consultation of the reverse index such as, for example, requiring more or fewer instructions to be executed or more or fewer accesses to memory be performed, and the storage resources required.
  • various representations may have greater or lesser performance according to considerations such as, for example, the performance of a cache (not illustrated) of one or more processors 12.
  • analysis of performance of a representation of a reverse index may be performed according to techniques known to skilled persons, such as, for example, profiling.
  • the strings of the plurality of strings are arranged in an order. Arranging of the strings of the plurality of strings in an order may or may not entail actual processing of the strings.
  • the strings may be processed and placed into an ordered data structure.
  • the strings may be processed and their ordering noted in a data structure that contains some indication, such as a pointer, of the string in each position.
  • the strings may already have some natural order due to the nature of their storage or the order in which they are being received.
  • block 304 may not entail substantive data processing.
  • the ordering may simply be the order in which strings are received.
  • strings may be filtered before arranging them in an order such that, for example, only a subset of the plurality of strings may feature in the order.
  • one or more processors 12 identify the next string in the order for processing. Strings may be processed starting with the first string in the order and ending with the last string in the order. Alternatively, strings may be processed starting with some other string in the order. Optionally, only a subset of the strings in the order may be identified for processing, for example by applying filtering criteria when identifying the next string, so that only strings meeting those criteria are identified for processing and the other strings in the order are discarded.
  • the string identified at block 306 is processed relative to the subsequent strings in the order to ascertain, using the reverse index, substrings common to the identified string and subsequent strings in the order.
  • Each common substring so identified is common to at least a string pair comprising the identified string and a subsequent string in the order. In some cases, a substring may be common to more than one such pair.
  • indications are saved associating identified common substrings with the string identified at block 306 and with the subsequent string with which each substring is common as identified at block 308.
  • Indications associating a common substring with a string may be, for example, maintained by way of a plurality of lists, each list associated with a string of the plurality of strings. For example, saving an indication associating a common substring with the string identified at block 306 and a subsequent string to which it is common may comprise inserting an element indicating the common substring into the list associated with the string identified at block 306 and into the list associated with the subsequent string to which it is common as identified at block 308.
  • the element inserted into such a list may comprise a hash of the common substring. Then, on subsequent insertions into the list, the hash of the item being inserted may be compared to the hashes within elements of the list, with a new element inserted only if no matching hash is found. If a matching hash is found, the element may be maintained by, for example, increasing an instance count. Alternatively, if a matching hash is found, the list may be left undisturbed. In this way, the list may be maintained as a set. Conveniently, in either case, a suitable checksum may be used in lieu of a hash to similar effect.
  • Additionally or alternatively, the element being inserted into such a list may comprise the actual common substring.
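The hash-based list maintenance described above can be sketched as follows. The element layout (a dictionary with `hash` and `count` keys), the choice of SHA-256, and the separator byte are all assumptions made for illustration; the description does not name a particular hash function.

```python
import hashlib


def substring_hash(terms):
    """Hash a common substring given as a sequence of terms."""
    data = "\x1f".join(terms).encode("utf-8")  # unit separator between terms
    return hashlib.sha256(data).hexdigest()


def save_indication(elements, terms):
    """Insert into the list `elements` only if no matching hash is found;
    otherwise increase the instance count of the matching element."""
    digest = substring_hash(terms)
    for element in elements:
        if element["hash"] == digest:
            element["count"] += 1
            return element
    element = {"hash": digest, "count": 1}
    elements.append(element)
    return element


per_string_list = []
save_indication(per_string_list, ["credit", "card"])
save_indication(per_string_list, ["cancel"])
save_indication(per_string_list, ["credit", "card"])
print([(e["hash"][:8], e["count"]) for e in per_string_list])
```

Dropping the `count` update and simply returning on a match gives the variant in which the list is left undisturbed.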
  • the element being inserted may comprise a length and an offset into the string with which the list is associated according to which the common substring may be located within that string.
  • the element being inserted into such a list may comprise a pointer to the other string with which the common substring is common. Then, the element being inserted may comprise a length and an offset into the other string using which the common substring may be located within that string. Alternatively, the pointer may point at a memory location offset from the start of the other string so as to incorporate the offset into the pointer.
  • one or more processors 12 determine whether or not strings in the order remain to be processed. For example, if the last string in the order was previously identified for processing at block 306, processing may be complete. If strings in the order remain to be processed, control flow returns to block 306.
  • each pair of strings is only considered once. This is possible because the method takes advantage of the fact that the determining of the common substrings of a first string and a second string is commutative. Additionally, by the definition of the problem, no string need be compared to itself. Thus, string processing software 22 may, according to FIG. 3, exploit these properties to lessen the overall number of string pairs that must be considered to be less than the cardinality of the cross-product of the ordering with itself as might otherwise be considered in a more naive procedure.
  • one or more processors 12 executing string processing software 22 may require fewer computations to determine substrings common to at least two strings of a plurality of strings as compared to more naive procedures.
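To make the saving concrete, the sketch below counts the string pairs visited when each string is compared only with subsequent strings in the order, against the full cross-product of the order with itself. The function names are hypothetical.

```python
from itertools import combinations


def pairs_considered_once(count):
    """Index pairs (i, j) with i < j: each string against later strings only."""
    return list(combinations(range(count), 2))


def naive_pairs(count):
    """Every ordered pair, including a string paired with itself."""
    return [(i, j) for i in range(count) for j in range(count)]


count = 4
print(len(pairs_considered_once(count)), "of", len(naive_pairs(count)))  # 6 of 16
```

For n strings this is n(n-1)/2 pairs rather than n squared.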
  • string processing software 22 may embody additional functionality by way of instructions that when executed by one or more processors 12 cause additional processing.
  • one or more processors 12 may, according to string processing software 22, identify a longest of the common substrings associated with each string of the plurality. As a specific example, for the two example strings above, a longest common substring "A B" of each of the two strings may be identified. In some embodiments, only such a longest common substring may be identified for each pairing.
  • each identification associating a common substring with a string may be processed to determine which of the associated common substrings is the longest. Additionally or alternatively, an indication associated with a string may be maintained during the processing to track the longest common substring identified thus far for that string. In alternate embodiments, no indications may be saved at block 308, and saving may comprise instead, for one or more strings, only maintenance of such a longest common substring indication for that string.
  • one or more processors 12 may, according to string processing software 22, identify, for a term found in a string of the plurality of strings, a longest of the common substrings associated with that string as contains that term.
  • a longest common substring of the first string containing term "A" is "A B".
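The two selections described above (a longest common substring, and a longest common substring containing a given term) can be checked against the example strings with a brute-force sketch. This enumeration is for illustration only and is not the reverse-index procedure of FIG. 3; the function names are hypothetical.

```python
def common_substrings(first, second):
    """Return the set of contiguous term runs (as tuples) found in both lists."""
    def runs(terms):
        return {tuple(terms[i:j])
                for i in range(len(terms))
                for j in range(i + 1, len(terms) + 1)}
    return runs(first) & runs(second)


def longest(substrings, containing=None):
    """Pick a longest substring, optionally restricted to those containing a term."""
    candidates = [s for s in substrings if containing is None or containing in s]
    return max(candidates, key=len, default=None)


first = "A B D B C A D".split()
second = "B A B B A".split()
common = common_substrings(first, second)
print(" ".join(longest(common)))                  # A B
print(" ".join(longest(common, containing="A")))  # A B
```

When several common substrings share the maximum length, this sketch returns an arbitrary one of them.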
  • a weight may be associated with the terms occurring in the strings of the plurality of strings. For example, a weight may be associated with terms identified as "stop words" for the purposes of particular processing, such as "a" or "the". The weight associated with one or more of such terms could be fractional or even zero. Alternatively, there may be a preprocessing step in which one or more of the strings of the plurality of strings are processed to determine a weight between zero and one associated with any or all of the one or more terms of the strings.
  • a weight may be associated with each common substring. For example, where weights are associated with terms, the weight of a common substring may be set equal to the sum of the weights associated with the terms of that substring.
  • One or more processors 12 may perform further processing according to the weight associated with a substring. For example, for one or more strings of the plurality of strings, a highest weighted of the common substrings associated with that string may be identified such as by way of, for example, maintaining of indications of weights associated with one or more substrings during processing. Additionally or alternatively, where a weight is associated with each common substring, one or more processors 12 may identify, for one or more terms as found in a string of the plurality of strings, the highest weighted common substring associated with that string as contains that term.
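A minimal sketch of the weighting just described, assuming a default weight of 1.0 for any term without an explicit weight (the description does not fix a default). The names `substring_weight` and `highest_weighted` and the sample weights are hypothetical.

```python
def substring_weight(terms, weights, default=1.0):
    """Weight of a substring: the sum of the weights of its terms."""
    return sum(weights.get(term, default) for term in terms)


def highest_weighted(substrings, weights, containing=None):
    """Pick the highest weighted substring, optionally restricted to
    substrings containing a given term."""
    candidates = [s for s in substrings if containing is None or containing in s]
    return max(candidates, key=lambda s: substring_weight(s, weights), default=None)


weights = {"the": 0.0, "my": 0.5}  # illustrative stop-word weights
substrings = [("cancel", "the"), ("credit", "card"), ("my",)]
print(highest_weighted(substrings, weights))                    # ('credit', 'card')
print(highest_weighted(substrings, weights, containing="the"))  # ('cancel', 'the')
```

With a zero weight, a stop word contributes nothing, so "cancel the" scores the same as "cancel" alone.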
  • FIG. 5 is a flowchart depicting example blocks 500 and onward as may be performed by one or more processors 12, such as according to string processing software 22, in performing block 308 of FIG. 3.
  • a next string of the strings subsequent to the string identified for processing at block 306 is identified for processing relative to the earlier identified string.
  • subsequent strings may be processed starting with the first string in the order following the string identified at block 306.
  • subsequent strings may be processed in some other order such as, for example, some natural order as may exist according to a data structure storing the strings in memory 14.
  • a string identified for processing at block 306 is hereinafter, for the purposes of the discussion of FIG. 5, referred to as the first string.
  • a next string identified for processing at block 502 is hereinafter, for the purposes of the discussion of FIG. 5, referred to as the second string.
  • An in-progress substring is associated with each string of the order. The in-progress substring may be initialized to, for example, the null or empty string.
  • a next term of the terms of the first string is identified for processing.
  • the terms of the first string may be processed starting with a first term in the sequence comprising that string and ending with a last term in the sequence comprising that string.
  • strings may be processed right-to-left, starting with a last term in the sequence comprising that string and ending with a first term in the sequence comprising that string, or alternatively again, in some other order.
  • one or more processors 12 determine whether the next term is proximate to the in-progress substring for the second string in the second string. If so determined, control flow proceeds to block 510 else to block 508.
  • a term may be considered proximate to an in-progress common substring in the second string, if, for example, the term is adjacent to the in-progress common substring in the second string.
  • the term "C" is proximate to an in-progress common substring "A B" for a second string "X A B C D", whereas terms "X" and "D" are not.
  • any term may be considered proximate to a null or empty in-progress substring in a second string provided the term is found in the second string.
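The proximity test can be sketched as below, under two assumptions that the examples suggest but do not state outright: "adjacent" means immediately following the in-progress substring (which is why "X" does not qualify), and the in-progress substring is tracked as a hypothetical `(start, length)` pair into the second string, with `None` for the empty case.

```python
def is_proximate(term, in_progress, second):
    """Return True if `term` is proximate to the in-progress substring.

    `second` is the second string as a list of terms; `in_progress` is a
    (start, length) pair into it, or None for the null or empty substring.
    """
    if in_progress is None:
        return term in second  # any term found in the second string qualifies
    start, length = in_progress
    following = start + length
    return following < len(second) and second[following] == term


second = "X A B C D".split()
in_progress = (1, 2)  # the run "A B"
print([t for t in second if is_proximate(t, in_progress, second)])  # ['C']
```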
  • Such an application of weights may have application, for example, in giving less precedence to, or even ignoring, "stop words" in strings.
  • processing at block 506 may include using a reverse index to determine whether the next term occurs in the second string via a lookup of the term in the index. Additionally or alternatively, a position of the term in the second string may be determined by way of such a look-up.
  • the in-progress common substring for the second string is updated by appending the next term to that string.
  • the in-progress common substring is also identified as a common substring of the first string and the second string.
  • one or more indications are saved associating the identified common substring with the first string and the second string.
  • one or more processors 12 determine whether or not terms in the first string remain to be processed. For example, if processing is proceeding left-to-right and the last term in the first string was the last identified for processing at block 504, processing of the first string is complete. If processing of the first string is completed, control flow proceeds to block 516, otherwise control flow returns to block 504.
  • one or more processors 12 determine whether or not subsequent strings in the order remain to be processed. For example, if strings in the order are being processed first-to-last and the last string in the order was the last identified for processing at block 502, processing of all subsequent strings is completed. If processing is completed, control flow proceeds to block 518, otherwise control flow returns to block 502.
  • Reset of in-progress substrings may comprise de-allocation. Additionally or alternatively, reset of an in-progress substring may be flagged such as by, for example, clearing or resetting that in-progress substring to a special value such as the null or empty string or to some other sentinel value.
  • processing proceeds in what may be described as "string-major" order, where each possible second string is considered relative to all terms of the first string before another second string is considered. This is, however, merely exemplary.
  • processing could instead proceed in a "term-major" order where each term of the first string is considered relative to every possible second string before another term is considered.
  • both orderings may be functionally equivalent where an in-progress substring is maintained for each string of the order.
  • one may be preferable to the other such as, for example, for reasons of improving the performance of a CPU cache of one or more processors 12 or memory 14 such as by, for example, reducing cache misses.
  • processing of one or more strings or terms could occur in parallel, such as is, for example, possible when one or more processors 12 comprises two or more processors.
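Putting the blocks of FIG. 3 and FIG. 5 together in "string-major" order, one possible sketch follows. It is not the pseudo-code of FIGS. 6A and 6B. The names are hypothetical, offsets are 0-based, and, as an assumption made to cope with terms that repeat within the second string, one in-progress run is tracked per matching position rather than a single in-progress substring.

```python
from collections import defaultdict


def build_reverse_index(term_lists):
    """term -> {string number -> [offsets of the term in that string]}."""
    index = defaultdict(lambda: defaultdict(list))
    for string_no, terms in enumerate(term_lists):
        for offset, term in enumerate(terms):
            index[term][string_no].append(offset)
    return index


def find_common_substrings(term_lists):
    """Return {string number -> set of common substrings (tuples of terms)}."""
    index = build_reverse_index(term_lists)
    found = defaultdict(set)
    for first_no, first in enumerate(term_lists):                # block 306
        for second_no in range(first_no + 1, len(term_lists)):   # block 502
            runs = {}  # end offset in second string -> in-progress run length
            for i, term in enumerate(first):                     # block 504
                positions = index[term].get(second_no, [])       # index lookup
                # A run is extended when the term follows it in the second
                # string; runs that are not extended are dropped (reset).
                runs = {p: runs.get(p - 1, 0) + 1 for p in positions}
                longest_run = max(runs.values(), default=0)
                for k in range(1, longest_run + 1):
                    substring = tuple(first[i - k + 1:i + 1])
                    found[first_no].add(substring)   # save for the first string
                    found[second_no].add(substring)  # and for the second string
    return found


strings = ["A B D B C A D", "B A B B A"]
result = find_common_substrings([s.split() for s in strings])
print(sorted(" ".join(s) for s in result[0]))  # ['A', 'A B', 'B']
```

Swapping the two inner loops would give the "term-major" order; with per-string run state kept separately, the saved results are the same.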
  • FIGS. 6A and 6B illustrate a source code listing depicting pseudo-code exemplary of an embodiment.
  • Pseudo-code listing 600 is written in a hypothetical ALGOL-like language with a syntax somewhat similar to C, C++, C#, or Java.
  • Pseudo-code listing 600 is exemplary of an embodiment and is in no way limiting of the invention nor is it exemplary of all embodiments.
  • Pseudo-code listing 600 may be translated by a skilled person into a source code listing in any one of a plurality of programming languages and then may be compiled into machine code for execution by one or more processors of one or more computing devices.
  • pseudo-code listing 600 may be translated into a source code listing in a language suitable for processing by a language interpreter executing on one or more computer devices.
  • the first line of pseudo-code listing 600 declares an array, "inquiries", that may be initialized by suitable code to a series of input strings, referred to as "inquiry strings" in the listing.
  • the second line of pseudo-code listing 600 declares an array
  • the third line of pseudo-code listing 600 declares an array of sets, "themeSet", each set storing common substrings of a corresponding inquiry string in the array declared at the first line.
  • applying the present invention to such a may use less computing resources than common dynamic programming approaches.
  • string processing software 22 may be configured to determine the longest non-overlapping subsequences of a plurality of strings.
  • string processing software 22 may be employed in the processing of query strings.
  • common substrings may be used to determine common "themes" amongst the plurality of strings. These themes may then be employed in subsequent processing such as to, for example, facilitate grouping or classification of queries.
  • an exemplary application might compare the two example strings that feature in reverse index 40 as query strings.
  • the longest non-overlapping common substrings of the two example strings may be determined to be "cancel" and "credit card".
  • Such an exemplary application may then use "cancel" and "credit card" as "themes" in subsequent processing of one or more of the strings.
  • Fractional values may be assigned to some terms in the query strings, substantially as described above, so as to give less precedence to, or even ignore, "stop words".
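The "themes" of the two example query strings can be reproduced with a short sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, which finds the longest matching block and then recurses to its left and right; this is an assumption made for illustration, not the reverse-index method described above, and it only finds blocks that appear in the same relative order in both strings. The function name `themes` is hypothetical.

```python
from difflib import SequenceMatcher


def themes(first, second):
    """Return non-overlapping common term runs of two space-delimited strings."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [" ".join(a[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks()
            if m.size > 0]  # the final block is a zero-length terminator


print(themes("cancel credit card", "cancel my credit card"))
# ['cancel', 'credit card']
```

The resulting list could then serve as grouping keys when classifying queries.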

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-implemented system and method of generating a list of substrings that are common to at least two strings in a plurality of strings is disclosed. Each of the plurality of strings comprises a sequence of terms, and each of the substrings comprises a sequence of one or more of those terms. The method includes forming a reverse index of the terms in the plurality of strings. The reverse index identifies, for each of the terms, the one or more strings containing that term and its position therein. The method includes arranging the plurality of strings in an order; and, for each one of the strings in the order, determining, using the reverse index, substrings common to the one of the strings and subsequent ones of the strings in the order; and, for each one of those common substrings, saving an indication associating that common substring with the one of the strings and the subsequent ones of the strings in which that substring is found.
PCT/CA2015/051088 2015-10-26 2015-10-26 System and method for determining common subsequences WO2017070771A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CA3003061A CA3003061A1 (fr) 2015-10-26 2015-10-26 System and method for determining common subsequences
PCT/CA2015/051088 WO2017070771A1 (fr) 2015-10-26 2015-10-26 System and method for determining common subsequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CA2015/051088 WO2017070771A1 (fr) 2015-10-26 2015-10-26 System and method for determining common subsequences

Publications (1)

Publication Number Publication Date
WO2017070771A1 (fr) 2017-05-04

Family

ID=58629620

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2015/051088 WO2017070771A1 (fr) 2015-10-26 2015-10-26 System and method for determining common subsequences

Country Status (2)

Country Link
CA (1) CA3003061A1 (fr)
WO (1) WO2017070771A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7610281B2 (en) * 2006-11-29 2009-10-27 Oracle International Corp. Efficient computation of document similarity
US7617231B2 (en) * 2005-12-07 2009-11-10 Electronics And Telecommunications Research Institute Data hashing method, data processing method, and data processing system using similarity-based hashing algorithm
US7756859B2 (en) * 2005-12-19 2010-07-13 Intentional Software Corporation Multi-segment string search

Also Published As

Publication number Publication date
CA3003061A1 (fr) 2017-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15906869

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3003061

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.08.2018)

122 Ep: pct application non-entry in european phase

Ref document number: 15906869

Country of ref document: EP

Kind code of ref document: A1