WO2018038697A1 - Prefix burrows-wheeler transformation with fast operations on compressed data - Google Patents

Prefix burrows-wheeler transformation with fast operations on compressed data

Info

Publication number
WO2018038697A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
search
prefix
pbwt
lexeme
Prior art date
Application number
PCT/US2015/030233
Other languages
French (fr)
Inventor
Jeremy Bruestle
Original Assignee
Spiral Genetics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spiral Genetics, Inc.
Priority to CN201580038227.8A (patent CN108292307A)
Priority to EP15894507.1A (patent EP3323203A4)
Publication of WO2018038697A1
Priority to HK18116146.9A (patent HK1257025A1)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/16 Automatic learning of transformation rules, e.g. from examples
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3068 Precoding preceding compression, e.g. Burrows-Wheeler transformation

Definitions

  • When performing data processing, a computer is used to take data, which may be represented as a structure in a computer memory and/or a file format on a persistent computer memory, and to perform operations, called data operations, on the data.
  • Data operations are typically performed on data that is demarcated into discrete sets, called data sets.
  • Typical data operations on data sets in the course of data processing may include searching, which is retrieval of a desired subset of a data set; sorting, which is re-organizing the data set; and transformation, which is converting the data set from one representation to another.
  • Data operations are generally improved either by reducing the amount of working memory used to perform the operation, or by improving the processing efficiency of the operation so as to reduce processing time.
  • The amount of working memory and the processing efficiency are subject to an optimization tradeoff. Reducing the amount of working memory in an operation often results in lower processing efficiency. Conversely, increasing processing efficiency often results in a larger amount of memory used during processing. It is relatively rare to achieve reduced memory utilization and greater processing efficiency in the same optimization.
  • The data processed is comprised of genetic information which defines an organism. Genetic information is comprised of a series of base pairs, adenine-thymine and guanine-cytosine. The more complex the organism, the more base pairs are used to define the organism. For example, the Escherichia coli bacterium uses approximately 4.6 million base pairs. In contrast, simple viruses may use as few as a few thousand base pairs.
  • A major application of bioinformatics is the analysis of genetic conditions in human beings, in the search for medical therapies.
  • The genetic information for a human being is 3.2 billion base pairs. Accordingly, every byte allocated to a base pair in an effort to improve processing potentially adds an additional 3.2 GB of working memory.
  • With sequence comparisons across different instances of human beings or other organisms under analysis, the amount of memory used during data processing may rapidly expand to an unmanageable amount.
  • Figure 1 is a top level diagram of a Prefix Burrows-Wheeler Transform.
  • Figure 2 is a flow chart of an exemplary performance of a Prefix Burrows-Wheeler Transform.
  • Figures 3a, 3b, 3c, 3d, 3e, 3f, 3g, 3h, 3i, 3j and 3k are illustrations of an exemplary data set undergoing a Prefix Burrows-Wheeler Transform.
  • Figure 4 is an exemplary compressed data memory layout for a prefix table in a Prefix Burrows-Wheeler Transform.
  • Figure 5 is a flow chart of an exemplary k-Mer search data operation on data transformed by a Prefix Burrows-Wheeler Transform.
  • Figures 6a, 6b, 6c, 6d and 6e are illustrations of exemplary data transformed by a Prefix Burrows-Wheeler Transform being traversed via an exemplary k-Mer search.
  • Figure 7 is a diagram of an exemplary hardware, software and communications environment for a Prefix Burrows-Wheeler Transform.
  • PBWT: Prefix Burrows-Wheeler Transform
  • Optimizing data operations may take advantage of insight into the data populating a data set.
  • The set of all possible data that may populate a data set is called the domain of a data set.
  • Some domains are open, where any datum may be included in the data set.
  • An example of an open domain is a representation of analogue signal data where the frequency and amplitude of the signal that may be represented is not initially restricted.
  • Other domains are closed, where only a limited set of data may be admitted into a data set.
  • An example of a closed domain is text data where the only data admitted are text characters.
  • EBCDIC: Extended Binary Coded Decimal Interchange Code
  • IBM Extended ASCII 256
  • A closed domain need not be comprised of an entire alphabet for a natural language.
  • Genetic base pairs are comprised of adenine-thymine and guanine-cytosine pairs.
  • A computer program to convert lowercase letters all into uppercase letters in ASCII might have been implemented by a routine that tested each character as to whether it was a lowercase or uppercase character. If the character was a lowercase letter, the program might have invoked a lookup table to convert the lowercase letter into its corresponding uppercase character.
  • A search data operation may be existential, where the search data operation simply determines whether the pattern exists in the data set or not. For example, an existential search for the letter 'C' in data set "ABBBCCCD" might simply return the value "TRUE" since the letter 'C' was indeed in the data set.
  • A search data operation may be cardinal, where the search data operation not only tests for existence, but also counts the number of instances of the pattern to be detected. For example, a cardinal search for the letter 'C' in data set "ABBBCCCD" might return 3, since the letter 'C' appeared three times in the data set.
  • A search operation may be positional, where the search data operation returns the position of the pattern to be detected. For example, a positional search for the letter 'C' in data set "ABBBCCCD" might return 5, since the first instance of the letter 'C' in the data set was in the fifth position.
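  • The three kinds of search can be illustrated with a minimal Python sketch on the example data set. The function names are hypothetical and chosen only for illustration; positions are 1-based to match the text, and 0 is an assumed convention for "not found".

```python
def existential_search(data: str, pattern: str) -> bool:
    """Report only whether the pattern occurs in the data set."""
    return pattern in data


def cardinal_search(data: str, pattern: str) -> int:
    """Count the non-overlapping occurrences of the pattern."""
    return data.count(pattern)


def positional_search(data: str, pattern: str) -> int:
    """Return the 1-based position of the first occurrence, or 0 if absent."""
    return data.find(pattern) + 1


data_set = "ABBBCCCD"
print(existential_search(data_set, "C"))  # True
print(cardinal_search(data_set, "C"))     # 3
print(positional_search(data_set, "C"))   # 5
```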
  • Information on how a data set is represented in memory may be used to optimize the implementation of a data operation.
  • A computer program may not receive data in an optimized format for a data operation.
  • Data as it arrives in its original data representation, often called "raw data", may be converted into an alternative data representation which is optimized for a particular data operation. This conversion is called a transformation.
  • Transformations come in many forms. Transformations may order data according to predetermined conventions. Such transformations are called "sorting" data operations. Other transformations may reduce the size of the data representation. Such transformations are called "compression" data operations. Yet other transformations may generate lookup tables and/or indices to aid in the performance of a data operation. The following subsections will discuss these and other transformations.
  • A transformation is in itself a data operation and accordingly may be optimized. However, typically transformations convert data into a data representation optimized for a particular data operation. The converted data is persisted on a persistent memory device in a representation called a "data format", or "format" for short. Since the particular data operation typically operates only on the optimized data format, transforming data is sometimes called "preprocessing data." Specifically, the processing used to transform the data is persisted in the data format, so that the processing need not be repeated every time the particular data operation is performed. Thus preprocessing is in and of itself a way to optimize the particular data operation.
  • The desired result is both to reduce the working memory used in implementing a particular data operation and to increase the processing efficiency.
  • One route would then be to perform a data transformation that both performs a compression and creates a data representation optimized for a particular data operation.
  • A common data operation is to perform a sort, which is to order data in a data set according to a predetermined convention. Sorting is a transformation which generally optimizes searching. Specifically, when a data set is in a particular order, a computer program scanning the data set can determine where a pattern to be detected is likely to be. Binary search and B-Tree searches are well-known examples of search algorithms that rely on sorted data.
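  • As a brief illustration of why sorted data speeds up searching, the following Python sketch performs a binary search with the standard library `bisect` module. The function name `sorted_contains` and the sample lexemes are assumptions made for this example.

```python
from bisect import bisect_left


def sorted_contains(sorted_items, target):
    """Binary search: valid only because sorted_items is already in order."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


# Sorting is the preprocessing step; each later lookup takes O(log n) comparisons.
lexemes = sorted(["TGC", "ACTG", "GC", "CCGC"])
print(sorted_contains(lexemes, "GC"))  # True
print(sorted_contains(lexemes, "GG"))  # False
```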
  • A lexicographical sort is a sort where the predetermined convention used to order the data set is to posit an order on the characters comprising a closed data domain of the data set.
  • The data in the data set is comprised of sequences of characters, called "lexemes".
  • A lexeme made up of alphanumeric characters is also called a "string."
  • The set of lexemes in the data set is called a "lexicon". Lexemes that are lexicographically sorted are sorted according to the order and the places of the characters in the lexeme.
  • A common application of a lexicographical sort is to perform an alphabetical sort on strings.
  • The letters of the alphabet have a posited order, where each letter is known to precede or succeed another letter.
  • Letters are generally read from left to right.
  • A letter placed to the left of another letter in a string takes precedence in ordering the string with respect to other strings.
  • "do" precedes "hi" because 'd' precedes 'h' in alphabetical order.
  • Even though 'o' succeeds 'i' in alphabetical order, the leftmost characters in the string have precedence when performing an alphabetical sort.
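  • A lexicographical sort over a posited character order can be sketched in Python as follows. The `RANK` table and the `lexicographic_key` function are hypothetical names; the sample words are chosen to show that the leftmost differing character decides the order.

```python
# The posited order of the closed domain: here, the lowercase alphabet.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def lexicographic_key(lexeme):
    """Map each character to its rank; lists compare left to right."""
    return [RANK[ch] for ch in lexeme]


words = ["hi", "do", "dig", "d"]
print(sorted(words, key=lexicographic_key))  # ['d', 'dig', 'do', 'hi']
```

  • "do" sorts before "hi" because 'd' precedes 'h', even though the second character 'o' succeeds 'i'.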
  • a transformation that reduces the size of a data representation in computer memory is a compression data operation. Transformations may be "lossy" where information with respect to a particular data operation is lost via the transformation. Transformations may be lossless where information with respect to a particular data operation is preserved during the transformation.
  • One example of compression is run length encoding.
  • Consider the data set comprised of the 15 characters "AAABBBBCCCCCBBB".
  • The data set might be represented in 8 characters, "3A4B5C3B", meaning that there are "three A's, four B's, five C's followed by three B's." Note that the amount of memory used to represent the data set has effectively been reduced, i.e., the data set has been compressed.
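  • Run length encoding of this example can be sketched in Python as follows. The function names are hypothetical, and the decoder assumes that the data itself contains no digit characters.

```python
from itertools import groupby


def run_length_encode(data: str) -> str:
    """Replace each run of identical characters with '<count><char>'."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(data))


def run_length_decode(encoded: str) -> str:
    """Invert run_length_encode; assumes symbols are never digits."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch          # accumulate a possibly multi-digit count
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


raw = "AAABBBBCCCCCBBB"
packed = run_length_encode(raw)
print(packed)                       # 3A4B5C3B
print(run_length_decode(packed) == raw)  # True
```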
  • Data operations may be optimized by utilizing supplementary data structures that contain information on how to perform an operation and/or information on the layout of a data representation.
  • Such supplementary data structures are called "lookups”. If a lookup is in the form of a set of tuples, then the lookup is called a “lookup table.” If the lookup stores positional information of data in the data representation, then the lookup is called an "index.”
  • An example of a lookup is used in Huffman encoding compression. Characters in a textual data set are analyzed for frequency. A lookup is generated that maps each character used in the textual data set with a prefix-free code, which is a bit string where each string is guaranteed never to be the prefix of another bit string. More frequent characters are mapped in the lookup to shorter prefix codes. Then the textual data set is transformed by replacing each instance of a character with its respective prefix code per the lookup.
  • The lookup in a Huffman encoding is typically in the form of a tree.
  • The lookup for a Huffman encoded data set may also be represented as a table. Also note that whenever the Huffman encoded data set is transferred, so too is the lookup. In this way, a recipient may decode the Huffman encoded data set.
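  • The following Python sketch builds a Huffman lookup in table form and uses it to encode and decode a data set. It is one conventional construction using a priority queue, not a method taken from this document; the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_table(text: str) -> dict:
    """Build a prefix-free code table (the lookup) from character frequencies."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie breaker, partial code table).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                      # a single symbol still needs one bit
        return {ch: "0" for ch in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # two least frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in t1.items()}
        merged.update({ch: "1" + code for ch, code in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(text: str, table: dict) -> str:
    return "".join(table[ch] for ch in text)


def huffman_decode(bits: str, table: dict) -> str:
    inverse = {code: ch for ch, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:              # unambiguous because codes are prefix-free
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "AAABBBBCCCCC"
table = huffman_table(text)
bits = huffman_encode(text, table)
print(table, len(bits))
print(huffman_decode(bits, table) == text)
```

  • The recipient needs `table` as well as `bits` in order to decode, which is why the lookup travels with the encoded data set.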
  • Consider the data set "AAABBBBCCCCC". If a programmer were to optimize positional search data operations, and also knew that the data set was guaranteed to be in order, the programmer might prepare an index with three records as follows: "A, 1"; "B, 4"; and "C, 8". Thus in order to find the location of the first 'B' in the data set, instead of scanning the characters in the data set, the positional search data operation need only look up the position in the index, resulting in faster data processing. Note that generating the index in the first place might have made use of a full scan of the characters in the data set. However, by generating and storing the index during preprocessing, the index provides an optimization that may be realized whenever a positional search data operation is performed.
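  • A minimal Python sketch of such an index follows. The function name `build_index` is hypothetical; positions are 1-based to match the records given in the text.

```python
def build_index(sorted_data: str) -> dict:
    """Record the 1-based position of the first occurrence of each character."""
    index = {}
    for position, ch in enumerate(sorted_data, start=1):
        index.setdefault(ch, position)   # keep only the first position seen
    return index


# One full scan during preprocessing...
index = build_index("AAABBBBCCCCC")
print(index)        # {'A': 1, 'B': 4, 'C': 8}
# ...then each positional search is a single lookup.
print(index["B"])   # 4
```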
  • PBWT is a novel modification of techniques used in block-sorting compression, such as in a Burrows-Wheeler Transform, to optimize for existential searches in data sets with closed data domains.
  • The applications of PBWT extend beyond bioinformatics. However, for illustrative purposes, the following discussion will use bioinformatics data as an exemplary context to describe PBWT and its associated data operations.
  • FIG. 1 is a top level diagram 100 of a PBWT and an existential search operation.
  • Bioinformatics sequences are shown as the data to be operated on.
  • Raw data 102, comprised of a data set of lexemes made up of a closed domain of characters, is received by a PBWT transformation routine 104.
  • The PBWT transformation routine 104 converts the raw data into a prefix table 106 and an offset table 108.
  • The prefix table 106 provides a lookup to identify all single character prefixes for any substring of a lexeme in the data set.
  • The offset table 108 provides a lookup to identify the offset location of lexemes in the prefix table starting with a given single character.
  • The prefix table 106 and the offset table 108 represent a significant compression of the original raw data.
  • An existential search routine 110 may search the prefix table 106 and offset table 108 for an arbitrary string 112 over the domain of characters. The existential search can accordingly determine whether the string 112 exists 114 in the representation. Via PBWT, the existential search may be performed significantly more quickly than prior techniques, without loss of accuracy.
  • Figure 2 is a flow chart 200 of an exemplary performance of a PBWT.
  • Figures 3a through 3k illustrate different stages 300a - 300k in generating a PBWT prefix table and a PBWT offset table in performing a PBWT.
  • a computing device receives raw data in block
  • The raw data generally arrives in the form of data strings 300a as shown in Figure 3a.
  • The data strings 300a are four genetic sequences, each with four nucleic acids. It is to be stressed that the actual lengths of the sequences can vary, and the data set domain may be any closed domain where the characters comprising the domain have a posited order, lending themselves to a lexicographical sort.
  • The received data will be analyzed and subjected to character rotation techniques to identify all prefixes.
  • The identity and location of the prefixes will be used to generate a prefix table.
  • The suffixes for each sequence are generated and added to the data set.
  • The result 300b is shown in Figure 3b.
  • The sequence ACTG has suffixes of CTG, TG and G.
  • Once the data set includes all suffixes, in block 206 of Figure 2, the data set is lexicographically sorted.
  • The result 300c of the lexicographical sort is shown in Figure 3c.
  • While the lexicographical sort 206 is used to guarantee the lexicographical order of the lexemes in the data set, it also aids in removing lexemes that are prefix matches, as in block 208 of Figure 2.
  • The sorted data set may be linearly scanned. If a lexeme is identical to or is a prefix of the immediately succeeding lexeme, then it is a prefix match and is removed.
  • The suffix GC for ATGC is identical to the suffix GC for CCGC, and thus the duplicate is removed.
  • The suffix TG is a prefix of TGC and is also removed.
  • The result 300d of the removed prefix matches is shown in Figure 3d.
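  • The suffix generation, lexicographical sort and prefix match removal described above can be sketched in Python. The figures are not reproduced here, so the four input sequences (ACTG, ATGC, CCGC, CTGC) are an assumption inferred from the lexemes named in the text; the function names are hypothetical.

```python
def suffixes_of(sequence: str) -> list:
    """All suffixes of a sequence, including the sequence itself."""
    return [sequence[i:] for i in range(len(sequence))]


def remove_prefix_matches(sorted_lexemes: list) -> list:
    """Linear scan: drop a lexeme identical to, or a prefix of, its successor."""
    kept = []
    for i, lexeme in enumerate(sorted_lexemes):
        is_last = i + 1 == len(sorted_lexemes)
        if not is_last and sorted_lexemes[i + 1].startswith(lexeme):
            continue
        kept.append(lexeme)
    return kept


# Assumed raw data; not confirmed by the text, which does not list Figure 3a.
raw_sequences = ["ACTG", "ATGC", "CCGC", "CTGC"]

data_set = [s for seq in raw_sequences for s in suffixes_of(seq)]
data_set.sort()  # default string order matches A < C < G < T
unique_lexemes = remove_prefix_matches(data_set)
print(unique_lexemes)
# ['ACTG', 'ATGC', 'CCGC', 'CGC', 'CTGC', 'GC', 'TGC']
```

  • Checking only the immediate successor is sufficient because, in sorted order, every lexeme that begins with a given lexeme directly follows it.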
  • Each lexeme is duplicated and a delimiter is added to the end of each original lexeme.
  • The delimiter is a member of the closed data domain, but is lexicographically ordered after all the elements in the lexemes. This is done so that the lexicographical order of the lexemes is the same with the delimiter as it is without the delimiter.
  • Each new duplicate lexeme has its first letter rotated to the end of the lexeme and separated by a separator character.
  • The separator is also a member of the closed data domain, but is lexicographically ordered before all the elements in the lexemes. Again, this is done so that the lexicographical order of the rotated lexeme is the same with the separator as it is without the separator.
  • Figure 2 to generate prefix pairs.
  • The table is scanned lexeme by lexeme in lexicographical order.
  • The lexeme is first tested to determine whether it is a rotated or non-rotated lexeme. This may be readily ascertained by detecting a delimiter. If there is a delimiter, it is a non-rotated lexeme. Otherwise, it is a rotated lexeme.
  • A non-rotated lexeme is then associated with all rotated characters between itself and the previous non-rotated lexeme.
  • CCGC$ will be associated with 'G' because it is the sole rotated character between CCGC$ and the previous non-rotated lexeme ATGC$.
  • TGC$ will be associated with both 'A' and 'C' since those two characters are the rotated characters between TGC$ and the previous non-rotated lexeme GC$.
  • The rotated characters may be readily extracted by virtue of detecting the separator character.
  • The result 300g of the prefix associations is shown in Figure 3g.
  • The rotated characters are the single character prefixes. Because of the rotation 212 and subsequent lexicographical sort 214, each rotated lexeme sorts just ahead of the non-rotated lexeme that begins with the characters following the rotated character. This is why single character prefixes will immediately precede a non-rotated lexeme.
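  • The duplication, rotation, sort and prefix association steps can be sketched in Python as follows. The delimiter '$' follows the text; the separator character '^' is a hypothetical choice, as are the function names. The input lexemes are the seven unique lexemes named in the text, and the ordering is imposed with an explicit rank table so that the separator sorts first and the delimiter sorts last.

```python
SEPARATOR, DELIMITER = "^", "$"   # '^' is an assumed separator character
RANK = {SEPARATOR: 0, "A": 1, "C": 2, "G": 3, "T": 4, DELIMITER: 5}


def sort_key(entry: str) -> list:
    """Order entries with the separator before, and the delimiter after, A/C/G/T."""
    return [RANK[ch] for ch in entry]


def prefix_associations(lexemes: list) -> list:
    entries = []
    for lexeme in lexemes:
        entries.append(lexeme + DELIMITER)                  # original, delimited
        entries.append(lexeme[1:] + SEPARATOR + lexeme[0])  # first letter rotated
    entries.sort(key=sort_key)

    table, pending = [], []
    for entry in entries:
        if entry.endswith(DELIMITER):
            # Non-rotated: collect rotated characters seen since the last one.
            table.append((entry[:-1], sorted(pending)))
            pending = []
        else:
            pending.append(entry.split(SEPARATOR)[1])       # the rotated character
    return table


lexemes = ["ACTG", "ATGC", "CCGC", "CGC", "CTGC", "GC", "TGC"]
for lexeme, prefixes in prefix_associations(lexemes):
    print(lexeme, prefixes)
```

  • Under these assumptions, CCGC is associated with G and TGC with A and C, which agrees with the examples in the text.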
  • The length of each lexeme remaining in the data set is computed, and associated with that lexeme, as in block 220 of Figure 2. Specifically, the number of characters in the lexeme is counted. The count may be performed via a linear, sequential scan of the lexeme. Alternatively, the size of the lexeme in memory may be detected and the character count performed by dividing the memory size by a known fixed memory size of each character. For example, in the C programming language, the expression "sizeof(*pLexeme)/sizeof(char);" obtains the memory size of a lexeme pointed to by pointer pLexeme and divides it by the size of a character data type; this holds when pLexeme points to a fixed-size character array rather than to a single character. In the case of genetic data, the data may not be represented as characters, but potentially as two-bit values since there are only four nucleic acids. (Note that additional bits may be used in order to include separators and delimiters.)
  • The length of the lexeme is then associated with the lexeme. This length will later be compared with patterns being searched for to determine whether a search should be terminated because the lexeme size is smaller than the pattern being searched.
  • The result 300i of the added lexeme lengths is shown in Figure 3i.
  • A prefix count is computed for each character in the closed data domain on a per lexeme basis. This need not be done during preprocessing. Since prefix tables are significantly smaller than their original raw data, it is conceivably practical to compute the prefix counts in each session in which at least one search is to be performed. However, an implementer may opt to preprocess the prefix table and calculate prefix counts as in block 222 of Figure 2.
  • To calculate prefix counts, each lexeme in the data set is associated with a prefix count corresponding to each character in the closed data domain that was used in a lexeme in the data set (minus the delimiters and separators). The data set is scanned in lexicographical order.
  • The counts for each prefix in the previous lexeme are added to the previous prefix counts of the previous lexeme.
  • The first lexeme is ACTG. Because there are no prefixes in the previous lexeme (indeed there is no previous lexeme to have a prefix in the first place), all nucleic acids have a prefix count of 0. This remains true for subsequent lexemes ATGC and CCGC. But note that for CGC, the previous lexeme had a prefix of G. Accordingly, we increment the prefix count of G for lexeme CGC to 1. The next lexeme, CTGC, increments the prefix count for C by 1. Since the previous prefix count for G was 1, it too remains at 1 for CTGC. This process is repeated until the last lexeme.
  • In the prefix count representation 300j in Figure 3j, note the addition of row 7, which shows the total prefix count taking into account the A and C prefixes in the last lexeme TGC.
  • The prefix counts will be used during searches as part of computing existence search bounds for lexemes.
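  • The running prefix counts can be sketched in Python as follows. The prefix table is hard-coded here: the prefixes of CCGC, CGC and TGC are stated in the text, while those of CTGC and GC are assumptions obtained by applying the construction to assumed input sequences ACTG, ATGC, CCGC and CTGC. The function name is hypothetical.

```python
DOMAIN = "ACGT"

# (lexeme, single character prefixes) in lexicographical order.
PREFIX_TABLE = [
    ("ACTG", ""), ("ATGC", ""), ("CCGC", "G"), ("CGC", "C"),
    ("CTGC", "A"),   # assumed
    ("GC", "CT"),    # assumed
    ("TGC", "AC"),
]


def prefix_counts(prefix_table):
    """For each row, count the prefixes seen in all earlier rows, per character."""
    running = dict.fromkeys(DOMAIN, 0)
    rows = []
    for _lexeme, prefixes in prefix_table:
        rows.append(dict(running))      # counts before this row's own prefixes
        for ch in prefixes:
            running[ch] += 1
    rows.append(dict(running))          # extra final row holding the totals
    return rows


counts = prefix_counts(PREFIX_TABLE)
for offset, row in enumerate(counts):
    print(offset, row)
```

  • Row 3 (CGC) shows a count of 1 for G, row 4 (CTGC) shows 1 for both C and G, and the extra row 7 holds the totals.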
  • The size of a PBWT prefix table is linearly proportional to the number of unique lexemes, rather than the total number of lexemes in a received set of raw lexemes. Unlike prior art techniques, where adding additional redundant lexemes will increase the amount of memory to store the lexeme information, PBWT prefix tables will not increase.
  • It is to be emphasized that an existential search data operation, such as the k-Mer search operation described below, and other related data operations will only make use of the prefixes, the lengths and the prefix count information. Accordingly, during such data operations, the lexemes need not be stored in memory. Furthermore, the prefix count information may also be compressed or eliminated via rollup techniques. A compressed data layout of a prefix table in memory is described with respect to Figure 4 below.
  • A PBWT offset table may also be pre-computed as in block 224 of Figure 2.
  • The PBWT offset table is a lookup that stores the offset position of the first lexeme in the lexicographically sorted lexemes in the prefix table that start with a particular character in the closed data set. Note that offset positions are not cardinal positions. Offsets denote the amount to add to a base number to find a desired datum, in this case a particular lexeme. Thus offsets start with 0, whereas cardinal positions start with 1.
  • The associated PBWT offset table 300k for prefix table 300j is shown in Figure 3k.
  • The first instance of a lexeme starting with 'A', in this case ACTG, in prefix table 300j is offset position 0. Accordingly, 'A' is associated with 0 in the offset table 300k.
  • The first lexeme starting with 'C', in this case CCGC, in prefix table 300j is in offset position 2. Accordingly, 'C' is associated with 2 in the offset table 300k.
  • The process is repeated for nucleic acids 'G' and 'T'.
  • Offset tables are relatively inexpensive, computer processing-wise, to compute. Accordingly, an implementer may opt to pre-compute the offset table, or may opt to compute the offset table for every session where at least one search is to be performed.
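  • Computing the offset table is a single pass over the sorted lexemes, as the following Python sketch shows. The function name is hypothetical; the lexeme list is the set of seven unique lexemes named in the text. The offsets for 'A' and 'C' match the values stated above, while those for 'G' and 'T' simply follow from this lexeme order.

```python
def offset_table(sorted_lexemes: list) -> dict:
    """Map each leading character to the 0-based offset of its first lexeme."""
    offsets = {}
    for offset, lexeme in enumerate(sorted_lexemes):
        offsets.setdefault(lexeme[0], offset)   # keep the first offset only
    return offsets


lexemes = ["ACTG", "ATGC", "CCGC", "CGC", "CTGC", "GC", "TGC"]
print(offset_table(lexemes))  # {'A': 0, 'C': 2, 'G': 5, 'T': 6}
```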
  • PBWT enables the amount of storage used to store the lexeme data to be greatly compressed.
  • Figure 4 provides an illustration 400 of compression in PBWT.
  • For the prefixes 402 associated with the lexeme 406, we can create a bit mask where a bit corresponds to each of the four nucleic acids, ACGT. Since we do not have prefixes of A, G and T, the 1st, 3rd and 4th positions are 0. But because we do have a prefix of C, the 2nd position is set to 1. Accordingly, the ACGT prefix bit map can be encoded in the four bits 0100.
  • The prefix counts may be generated dynamically and therefore need not be coded.
  • By the time the PBWT routines reach row CGC, they will have stored, in working memory rather than in persistent memory, a count of 1 prefix for G and 0 for A, C and T. Upon reading the C prefix for row CGC, they will increment, in working memory, the prefix count of C to 1. Thus, in working memory, the PBWT routines will store a count of 1 for prefixes C and G, and 0 for A and T.
  • This technique of tracking counts may be referred to as a "rollup technique.”
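  • The bit mask encoding and the rollup of counts can be sketched in Python as follows. Bit masks are shown as four-character strings for readability, where a real layout would pack them into bits; the function names are hypothetical, and the masks for rows CTGC and GC rest on assumed prefix associations not stated in the text.

```python
DOMAIN = "ACGT"


def prefix_bitmask(prefixes: str) -> str:
    """Encode a row's prefixes as four bits in A, C, G, T order."""
    return "".join("1" if ch in prefixes else "0" for ch in DOMAIN)


def rollup(bitmasks):
    """Rebuild running prefix counts in working memory from the bit masks alone."""
    counts = [0, 0, 0, 0]
    for mask in bitmasks:
        yield tuple(counts)             # counts as they stand before this row
        for i, bit in enumerate(mask):
            if bit == "1":
                counts[i] += 1
    yield tuple(counts)                 # totals after the last row


print(prefix_bitmask("C"))  # 0100

# Rows ACTG, ATGC, CCGC, CGC, CTGC, GC, TGC.
row_prefixes = ["", "", "G", "C", "A", "CT", "AC"]
masks = [prefix_bitmask(p) for p in row_prefixes]
for offset, counts in enumerate(rollup(masks)):
    print(offset, dict(zip(DOMAIN, counts)))
```

  • Only the bit masks and lexeme lengths need to be persisted; the counts are regenerated on each pass.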
  • PBWT is optimized for existence searching.
  • the closed data domain is comprised of the four nucleic acids, adenine, cytosine, guanine and thymine, typically represented with their first letters A, C, G and T respectively.
  • Data sets are typically sequences of these nucleic acids.
  • a k-Mer is a permutation of these nucleic acids in the form of a sequence.
  • a k-Mer search is an existence search for that particular k-Mer pattern in a data set.
  • FIG. 5 is a flow chart 500 of an exemplary existence search using PBWT.
  • Figures 6a through 6f illustrate different stages 600a - 600f traversing the PBWT data, where the PBWT data has been prepared as described with respect to Figure 2 and Figures 3a through 3k.
  • a PBWT existence search starts with making a search boundary using the first and last records of the prefix table with a prefix count as the starting and ending boundaries respectively. Note that the last record is the dummy record while preparing the prefix counts in block 222 of Figure 2. As we iterate, we walk the bounds closer, and closer together, until we determine that we have found the search pattern, or that the search pattern does not exist in the data set.
  • The search pattern is scanned from right to left.
  • The first character to be retrieved from GTGC is the letter 'C'.
  • In decision block 512 of Figure 5, we test to see if X < Y. If it is, we conclude that the search pattern exists in the data set as in block 514. Otherwise, we report that the search pattern does not exist in the data set as in block 516.
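  • The text here gives only the outline of the search, so the following Python sketch fills in the narrowing step with the backward-search update commonly used with Burrows-Wheeler-based indexes. That update rule, the length check, the hard-coded tables and all names are assumptions for illustration, not the procedure of Figure 5. The tables correspond to lexemes ACTG, ATGC, CCGC, CGC, CTGC, GC and TGC.

```python
DOMAIN = "ACGT"
LENGTHS = [4, 4, 4, 3, 4, 2, 3]                    # lexeme lengths per row
PREFIXES = ["", "", "G", "C", "A", "CT", "AC"]     # single character prefixes per row
OFFSETS = {"A": 0, "C": 2, "G": 5, "T": 6}         # offset table


def build_counts(prefixes):
    """Running prefix counts per row, plus a final row of totals."""
    running = dict.fromkeys(DOMAIN, 0)
    rows = []
    for row_prefixes in prefixes:
        rows.append(dict(running))
        for ch in row_prefixes:
            running[ch] += 1
    rows.append(dict(running))
    return rows


COUNTS = build_counts(PREFIXES)


def kmer_exists(pattern: str) -> bool:
    x, y = 0, len(LENGTHS)          # bounds: first record and the final totals record
    matched = 0
    for ch in reversed(pattern):    # scan the pattern from right to left
        if ch not in OFFSETS:
            return False
        # Assumed narrowing step: map both bounds into the block of lexemes
        # that start with ch, using the running prefix counts.
        x, y = OFFSETS[ch] + COUNTS[x][ch], OFFSETS[ch] + COUNTS[y][ch]
        matched += 1
        if x >= y:
            return False            # bounds met: pattern is absent
        if LENGTHS[x] < matched:
            return False            # lexeme is shorter than the pattern so far
    return True                     # X < Y after consuming the whole pattern


for kmer in ["TGC", "CTG", "ACTG", "GCC", "GTGC"]:
    print(kmer, kmer_exists(kmer))
```

  • Under these assumed tables, TGC, CTG and ACTG are reported as present, while GCC and GTGC are reported as absent. GCC is rejected by the length check: the bounds narrow to the lexeme GC, which is shorter than the three characters matched.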
  • Figure 7 illustrates several possible embodiments of a hardware, software and communications environment 700 for PBWT techniques.
  • Client device 702 is any computing device. Exemplary computing devices include without limitation personal computers, tablet computers, smart phones, and smart televisions and/or media players.
  • PBWT techniques have the advantage of being very compact and very efficient with processing.
  • While PBWT techniques may be brought to bear on a typical networked client device 702 accessing a remote server, PBWT may also realistically be implemented on a standalone computer.
  • PBWT might be on a client device 702 that is a portable laptop, or a portable embedded system, or a standalone station such as a kiosk.
  • A researcher in the field may have a custom computing device that contains an integrated computer running PBWT.
  • A research lab may have an enclosed station that also contains an integrated computer running PBWT.
  • A client device 702 may have a processor 704 and a memory 706.
  • Client device 702's memory 706 is any computer-readable media which may store several software components including an application 708 and/or an operating system 710.
  • a software component is a set of computer executable instructions stored together as a discrete whole. Examples of software components include binary executables such as static libraries, dynamically linked libraries, and executable programs. Other examples of software components include interpreted executables that are executed on a run time such as servlets, applets, p-Code binaries, and Java binaries. Software components may run in kernel mode and/or user mode.
  • Computer-readable media includes at least two types of computer-readable media, namely computer storage media and communications media.
  • Computer storage media includes volatile and non-volatile, removable and nonremovable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
  • communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media.
  • user equipment device 702 may have a network interface 712.
  • the network interface 712 may be one or more network interfaces including Ethernet, Wi-Fi, or any number of other physical and data link standard interfaces. In the case where the user need only do operations on a standalone single machine, the network interface 712 is optional.
  • Client 702 may communicate to a server 716.
  • Server 716 is any computing device that may participate in a network.
  • the network may be, without limitation, a local area network ("LAN"), a virtual private network ("VPN"), a cellular network, or the Internet.
  • the client network interface 712 may ultimately connect to remote networked storage 714, or to server 716 via server network interface 718.
  • Server network interface 718 may be one or more network interfaces as described with respect to client network interface 712.
  • Server 716 also has a processor 720 and memory 722.
  • memory 722 is any computer-readable media including both computer storage media and communication media.
  • memory 722 stores software which may include an application 724 and/or an operating system 726.
  • Memory 722 may also store applications 724 that may include, without limitation, an application server and a database management system.
  • client device 702 may be configured with an application server and data management system to support a multi-tier configuration.
  • Server 716 may include a data store 728 accessed by the data management system.
  • the data store 728 may be configured as a relational database, an object-oriented database, a NoSQL database, and/or a columnar database, or any configuration to support scalable persistence.
  • the server 716 need not be on site or operated by the client enterprise.
  • the server 716 may be hosted in the Internet on a cloud installation 730.
  • the cloud installation 730 may represent a plurality of disaggregated servers which provide virtual web application server 732 functionality and virtual database 734 functionality.
  • Cloud 730 services 732, 734 may be made accessible via cloud infrastructure 736.
  • Cloud infrastructure 736 not only provides access to cloud services 732, 734 but also billing services.
  • Cloud infrastructure 736 may provide additional service abstractions such as Platform as a Service ("PAAS"), Infrastructure as a Service ("IAAS"), and Software as a Service ("SAAS").
  • PBWT techniques may be advantageously applied to large data sets with an ordered character set, where the character set is finite, and where search data operations show the existence of an instance of the item to be searched for in the data set, rather than its specific location.
  • non-string data may be encoded to a finite, ordered character set, and PBWT techniques applied to the encoding.
  • signal wave data from a cellular phone may be sampled, the amplitudes encoded into strings, and the resulting encoded strings searched with PBWT.
  • Such a technique could be used to detect patterns of power/transmission loss in cell phones.
  • PBWT techniques are lossless in the sense that the information content itself is not lost. However, PBWT techniques may be considered lossy in the sense that the location of a substring may be lost. Accordingly, PBWT has strong applicability in scenarios searching for an existence of an instance rather than location. For example, PBWT may be able to determine that there is an error in a cellular transmission, but it may not be able to determine where in the data set the error occurred. Examples of such scenarios are as follows.
  • Plagiarism is the passing off of another author's works as one's own. Ordinarily, portions of another author's works are marked with citations. Copyrighted material is usually associated with copyright notices, where permission from the original author or the author's agents has been obtained. However, with plagiarism, the plagiarist misappropriates the credit for another author's expressions.
  • Plagiarism detection may make use of PBWT techniques by selecting passages suspected of plagiarism as a search pattern. A corpus of known literature of the field is then processed into PBWT and the selected passage searched for. If there is a match, then it is likely that plagiarism has been detected.
  • Although PBWT performs exact matches, and thus paraphrased plagiarism might not be detected, PBWT has the advantage of being fast, portable due to its compression, and able to compare against a much larger corpus of known literature.
  • Open source clearance relates to companies ensuring that their proprietary software does not contain open source. Often developers may copy source code from the internet and other sources without being aware that the source code was subject to an open source license. Since many open source licenses oblige the surrender of intellectual property rights, such additions may result in code intended to be proprietary losing those intellectual property rights. This phenomenon is known as "tainting.”
  • An open source clearance is a due diligence search by a company, prior to releasing proprietary source code to detect whether any open source had been added.
  • PBWT techniques may be brought to bear by searching for suspect code snippets in a corpus of open source code compressed into a PBWT prefix table and offset table.
  • PBWT will not detect paraphrased or modified source code. However, it will detect simple cut and pasted source code and will have the advantage of being able to perform comparisons against a very large corpus of source code known to be from open source repositories.
  • PBWT techniques are particularly useful in the context of high speed DNA sequencing.
  • DNA sequencing involves reading short sequences, called "reads", that are a portion of a genome, and then re-assembling the reads into the original genome.
  • DNA sequencing generally may be performed via one of two strategies.
  • One strategy is in mapping sequencing, where the reads are mapped to an existing backbone genome structure, akin to mapping details of the genome to an outline of the genome.
  • Another strategy is in de-novo genome assembly, where reads are assembled by matching one read to another via searching for predetermined sequences. De-novo genome assembly derives its name from the possibility of the resulting sequences being novel sequences.
  • PBWT techniques may be used in a k-Mer search to locate the backbone location in the case of mapping sequencing and for another read in the case of de-novo assembly.
  • all the reads may be stored in memory in PBWT format.
  • a read to be matched to another read has a k-Mer identified to be used as a search pattern. The identified k-Mer is usually a known overlap between reads.
  • This predetermined overlap k-Mer is then searched on all the reads stored in memory using a PBWT k-Mer search, thereby identifying whether a match exists or not. Accordingly, this information may be used in the construction of graph data structures, such as read overlap graphs and de Bruijn graphs, representing possible overlaps used to reconstruct the genome.
  • Another example of applying PBWT techniques in bioinformatics is in searching for a genome defect.
  • the researcher may then set that particular genetic sequence as the search pattern and then perform a k-Mer search on a person under study. If the sequence is detected, it could then be correlated with the incidence or the absence of the birth defect in the person under study.
  • Another potential bioinformatics application of PBWT techniques is to search for biological tags in plants.
  • Governments may seek to track the source of botanical plants.
  • some plants may be genetically modified and subject to certain regulations.
  • some states have legalized marijuana and may wish to track plant sources as part of regulation.
  • One way to track plants is to add a genetic tag to the plants to be tracked. Later, if a regulator or law enforcement officer confiscates some plants, they may be rapidly analyzed using PBWT.
  • the biological tag in the form of a k-Mer, is set as the search pattern.
  • with a PBWT k-Mer search, one may identify the presence or the absence of a genetic tag in a plant. Accordingly, the source of the plant, or the contraband source of a plant, could be readily identified.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The Prefix Burrows-Wheeler Transform ("PBWT") is described to provide data operations on data sets even if the data set has been compressed. Techniques to set up a PBWT, including an offset table and a prefix table, and techniques to apply data operations on data sets transformed by PBWT are also described. Data operations include k-Mer substring search. General applications of techniques using PBWT, such as plagiarism searches and open source clearance, are described. Bioinformatics applications of the PBWT, such as genomic analysis and genomic tagging, are also described.

Description

PREFIX BURROWS-WHEELER TRANSFORMATION WITH FAST OPERATIONS ON COMPRESSED DATA
BACKGROUND
[0001] When performing data processing, a computer is used to take data, which may be represented as a structure in a computer memory and/or a file format on a persistent computer memory, and to perform operations, called data operations, on the data. Data operations are typically performed on data that is demarcated into discrete sets, called data sets. Typical data operations on data sets in the course of data processing may include searching, which is retrieval of a desired subset of a data set; sorting, which is re-organizing the data set; and transformation, which is converting the data set from one representation to another.
[0002] Over time processing power available for data processing has increased rapidly, but in many cases the amount of data applied to data processing techniques has increased even more rapidly. Accordingly, data processing is in need of improved searching, sorting, transformation, and other data operations.
[0003] Data operations are generally improved either by reducing the amount of working memory used to perform the operation, or by improving the processing efficiency of the operation so as to reduce processing time. In most cases, working memory and processing efficiency are subject to an optimization tradeoff. Reducing the amount of working memory in an operation often results in lower processing efficiency. Conversely, increasing processing efficiency often results in a larger amount of memory used during processing. It is relatively rare to achieve reduced memory utilization and greater processing efficiency in the same optimization.
[0004] Nonetheless, for large data sets, which are data sets so large that performing data operations is too slow to enable interactive processing, improving processing efficiency at the expense of memory utilization may render the optimization impractical. Increasing the size of a very large data set may result in the amount of memory utilized being larger than the amount of memory available. Accordingly, even if an optimization for a data operation's processing improvement is significant, it may not be available for implementation because the amount of available memory is insufficient. Thus many optimization techniques are impractical for large data set applications.
[0005] Presently there are many large data set applications. Some examples include document processing, image processing, multimedia processing and bioinformatics. For example, in the case of bioinformatics, the data processed is comprised of genetic information which defines an organism. Genetic information is comprised of a series of base pairs, adenine-thymine and guanine-cytosine. The more complex the organism, the more base pairs are used to define the organism. For example, the Escherichia coli bacterium uses approximately 4.6 million base pairs. In contrast, simple viruses may use as few as a few thousand base pairs.
[0006] A major application of bioinformatics is in the analysis of genetic conditions in human beings, in the search for medical therapies. The genetic information for a human being comprises 3.2 billion base pairs. Accordingly, every byte allocated to a base pair in an effort to improve processing potentially adds an additional 3.2 GB of working memory. When performing sequence comparisons, with different instances of human beings or other organisms under analysis, the amount of memory used during data processing may rapidly expand to an unmanageable amount.
[0007] Accordingly, there is a need for techniques to improve processing speed of data operations on large data sets, such as in bioinformatics, while reducing the amount of memory used.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The Detailed Description is set forth with reference to the accompanying figures.
[0009] Figure 1 is a top level diagram of a Prefix Burrows-Wheeler Transform.
[0010] Figure 2 is a flow chart of an exemplary performance of a Prefix Burrows-Wheeler Transform.
[0011] Figures 3a, 3b, 3c, 3d, 3e, 3f, 3g, 3h, 3i, 3j and 3k are illustrations of an exemplary data set undergoing a Prefix Burrows-Wheeler Transform.
[0012] Figure 4 is an exemplary compressed data memory layout for a prefix table in a Prefix Burrows-Wheeler Transform.
[0013] Figure 5 is a flow chart of an exemplary k-Mer search data operation on data transformed by a Prefix Burrows-Wheeler Transform.
[0014] Figures 6a, 6b, 6c, 6d and 6e are illustrations of exemplary data transformed by a Prefix Burrows-Wheeler Transform being traversed via an exemplary k-Mer search.
[0015] Figure 7 is a diagram of an exemplary hardware, software and communications environment for a Prefix Burrows-Wheeler Transform.
DETAILED DESCRIPTION
Context of Prefix Burrows-Wheeler Transform Techniques
Overview
[0016] Systems and methods to perform data operations on large data sets optimized for both performance time and memory are disclosed herein. Specifically, Prefix Burrows-Wheeler Transform ("PBWT") techniques are disclosed. In order to aid disclosure, this section describes underlying PBWT.
Closed Domains of Data and Data Representation
[0017] Optimizing data operations may take advantage of insight into the data populating a data set. Specifically, the set of all possible data that may populate a data set is called the domain of a data set. Some domains are open, where any datum may be included in the data set. An example of an open domain is a representation of analogue signal data where the frequency and amplitude of the signal that may be represented is not initially restricted. Other domains are closed, where only a limited set of data may be admitted into a data set. An example of a closed domain is text data where the only data admitted are text characters. In some systems, text characters are limited to American Standard Code for Information Interchange ("ASCII") characters, a predefined set of 128 characters representing Latin alphanumeric characters, punctuation marks and common formatting characters. Accordingly each ASCII character could be represented in seven bits of information (i.e., 2^7 = 128). There are other closed domains comprised of 256 alphanumeric characters which may include the Extended Binary Coded Decimal Interchange Code ("EBCDIC") character set and the IBM Extended ASCII character set, both of which have characters represented by eight bits of information (i.e., 2^8 = 256).
[0018] Note that a closed domain need not be comprised of an entire alphabet for a natural language. As previously mentioned, genetic base pairs are comprised of adenine-thymine and guanine-cytosine pairs. Accordingly, a data domain for genetic information may be the closed set of adenine, thymine, guanine and cytosine, commonly denoted A, T, G and C. As there are only four elements in this closed domain, these elements may be represented in two bits (i.e., 2^2 = 4).
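The two-bit representation described above can be sketched as follows. This is a minimal illustration only: the particular code assignment (A=00, C=01, G=10, T=11) and the function names `pack` and `unpack` are assumptions made for the example and are not taken from the disclosure.

```python
# Hypothetical 2-bit codes; any fixed assignment of the four bases would work.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i holds the base whose code is i


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of known length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(bases))


packed = pack("GTGC")
print(bin(packed), unpack(packed, 4))
```

Under this assumed encoding a four-base sequence fits in a single byte, whereas a one-byte-per-character representation would use four bytes.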
[0019] Understanding the domain of a data set and understanding how that domain is represented in a computer memory may aid in identifying efficiencies in data operations. For example, consider the example where a computer program works with ASCII characters comprised solely of uppercase and lowercase alphabetic characters, i.e., no numbers, punctuation, or formatting characters. The uppercase letter 'A' is represented by the number 65 (or 0b1000001 in binary), and the lowercase letter 'a' is represented by the number 97 (or 0b1100001 in binary). Note that the lowercase character has a value of 32 more than the uppercase character. Thus toggling the sixth bit correspondingly toggles the character between uppercase and lowercase letters.
[0020] A computer program to convert lowercase letters all into uppercase letters in ASCII might have been implemented by a routine that tested each character as to whether it was a lowercase or uppercase character. If the character was a lowercase letter, the program might have invoked a lookup table to convert the lowercase letter into its corresponding uppercase character.
[0021] However, a much more efficient and equivalent program could be implemented making use of the knowledge of how ASCII uppercase and lowercase characters are coded, and simply have applied an exclusive-or bitmask of (0b0100000) to each character. Thus, no memory would have been used for the lookup table, and a single operation (the exclusive-or) need have been used per character rather than the two operations of testing for case and for replacing. Accordingly, this implementation represents both an improvement in the amount of working memory used and increase in processing efficiency at the same time.
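The bitmask approach can be sketched as below, assuming input made up solely of ASCII letters as stated in the text. The function names are hypothetical. Note that an exclusive-or flips the case of every letter, so it matches the test-and-replace routine only when the input is known to be all lowercase; clearing the same bit with an AND mask forces uppercase regardless of the input case, and is shown alongside for comparison.

```python
CASE_BIT = 0b0100000  # decimal 32, the difference between 'a' (97) and 'A' (65)


def toggle_case(text: str) -> str:
    """Flip the case of every ASCII letter with a single XOR per character."""
    return bytes(b ^ CASE_BIT for b in text.encode("ascii")).decode("ascii")


def to_upper(text: str) -> str:
    """Force uppercase by clearing the case bit (AND with its complement)."""
    return bytes(b & ~CASE_BIT for b in text.encode("ascii")).decode("ascii")


print(toggle_case("abc"))  # all-lowercase input becomes uppercase
print(toggle_case("AbC"))  # mixed input has every letter flipped
print(to_upper("AbC"))     # mixed input becomes all uppercase
```

Neither function uses a lookup table or a per-character case test, which is the point the paragraph above makes.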
Search Techniques and Optimization
[0022] In the foregoing discussion, we've described how insight as to how a data set is represented in memory may be used to optimize a data operation. Insight as to what specific data operation is to be performed may also be used to optimize implementation of that data operation. For example, searching is a common data operation where a pattern is detected in a data set. However, search data operations have different degrees of fidelity.
[0023] A search data operation may be existential, where the search data operation simply determines whether the pattern exists in the data set or not. For example, an existential search for the letter 'C' in data set "ABBBCCCD" might simply return the value "TRUE" since the letter 'C' was indeed in the data set.
[0024] A search data operation may be cardinal, where the search data operation not only tests for existence, but also counts the number of instances of the pattern to be detected. For example, a cardinal search for the letter 'C' in data set "ABBBCCCD" might return 3, since the letter 'C' appeared three times in the data set.
[0025] A search operation may be positional, where the search data operation returns the position of the pattern to be detected. For example, a positional search for the letter 'C' in data set "ABBBCCCD" might return 5, since the first instance of the letter 'C' in the data set was in the fifth position.
[0026] Thus if it was understood that a computer program was to be optimized for existential search, rather than cardinal search, an implementation might be to scan a data set until it found the first instance of the pattern to be detected. This contrasts with a cardinal search, which might scan the entire data set as it counted the number of instances of the pattern to be detected. Thus a search for the letter 'C' in a data set comprised of the letter 'C' followed by a million 'A' characters would terminate at the first character if it were an existential search, whereas a cardinal search would be obliged to scan the subsequent million 'A' characters with no guarantee of finding another 'C' character. This is an example of how understanding the fidelity of a data operation may be used to optimize implementation of that data operation.
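The three degrees of search fidelity can be illustrated with a short sketch. The function names are hypothetical, and the convention of returning `None` when a positional search finds nothing is an assumption of this example.

```python
from typing import Optional


def existential_search(data: str, pattern: str) -> bool:
    """Report only whether the pattern occurs; may stop at the first match."""
    return pattern in data


def cardinal_search(data: str, pattern: str) -> int:
    """Count every occurrence, which requires scanning the whole data set."""
    count = 0
    start = data.find(pattern)
    while start != -1:
        count += 1
        start = data.find(pattern, start + 1)
    return count


def positional_search(data: str, pattern: str) -> Optional[int]:
    """Return the 1-based position of the first occurrence, or None."""
    index = data.find(pattern)
    return index + 1 if index != -1 else None


data = "ABBBCCCD"
print(existential_search(data, "C"))
print(cardinal_search(data, "C"))
print(positional_search(data, "C"))
```

For the data set "ABBBCCCD" these calls give the same answers as the three examples in the text.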
Transformations of Data Representation
[0027] As shown above, information on how a data set is represented in memory may be used to optimize the implementation of a data operation. However, a computer program may not receive data in an optimized format for a data operation. Data as it arrives in its original data representation, often called "raw data", may be converted into an alternative data representation which is optimized for a particular data operation. This conversion is called a transformation.
[0028] Transformations come in many forms. Transformations may order data according to predetermined conventions. Such transformations are called "sorting" data operations. Other transformations may reduce the size of the data representation. Such transformations are called "compression" data operations. Yet other transformations may generate lookup tables and/or indices to aid in the performance of a data operation. The following subsections will discuss these and other transformations.
[0029] A transformation is in itself a data operation and accordingly may be optimized. However, typically transformations convert data into a data representation optimized for a particular data operation. The converted data is persisted on a persistent memory device in what is called a "data format", or "format" for short. Since the particular data operation typically operates only on the optimized data format, transforming data is sometimes called "preprocessing data." Specifically, the processing used to transform the data is persisted in the data format, so that the processing need not be repeated every time the particular data operation is performed. Thus preprocessing is in and of itself a way to optimize the particular data operation.
[0030] As stated above, for large data sets (as well as in other contexts), the desired result is to reduce both the working memory used in implementing a particular data operation, and to increase the processing efficiency. One route would then be to perform a data transformation that performs both a compression and creates a data representation optimized for a particular data operation.
Lexicographical Sorting
[0031] A common data operation is to perform a sort, which is to order data in a data set according to a predetermined convention. Sorting is a transformation which generally optimizes searching. Specifically, when a data set is in a particular order, a computer program scanning the data set can determine where a pattern to be detected is likely to be. Binary search and B-Tree searches are well-known examples of search algorithms that rely on sorted data.
[0032] A lexicographical sort is a sort where the predetermined convention used to order the data set is to posit an order on the characters comprising a closed data domain of the data set. The data in the data set is comprised of sequences of characters, called "lexemes". A lexeme made up of alphanumeric characters is also called a "string." The set of lexemes in the data set is called a "lexicon". Lexemes that are lexicographically sorted are sorted according to the order and the places of the characters in the lexeme.
[0033] A common application of a lexicographical sort is to perform an alphabetical sort on strings. The letters of the alphabet have a posited order, where each letter is known to precede or succeed another letter. In English, letters are generally read from left to right. Thus a letter placed to the left of another letter in a string takes precedence in ordering the string with respect to other strings. Accordingly, "do" precedes "hi" because 'd' precedes 'h' in alphabetical order. Although 'o' succeeds 'i' in alphabetical order, the leftmost characters in the string have precedence when performing an alphabetical sort.
Compression Techniques
[0034] A transformation that reduces the size of a data representation in computer memory is a compression data operation. Transformations may be "lossy" where information with respect to a particular data operation is lost via the transformation. Transformations may be lossless where information with respect to a particular data operation is preserved during the transformation.
[0035] One example of compression is run length encoding. Consider the data set comprised of the 15 characters, "AAABBBBCCCCCBBB". The data set might be represented in 8 characters, "3A4B5C3B", meaning that there are "three A's, four B's, five C's followed by three B's." Note that the amount of memory used to represent the data set has effectively been reduced, i.e., the data set has been compressed.
[0036] Now consider the same data set compressed as "3A7B5C". Specifically, the information about the run of four B's and the run of three B's has been combined. With respect to an existential search and a cardinal search, the transformation has been lossless. Specifically, both an existential search and a cardinal search are able to provide an accurate result with the new compressed data representation without loss of information. However, with respect to a positional search, the transformation has been lossy. While the positional search could determine that the letter 'B' was in the data set, it could not determine whether there was a 'B' in the last position or any other position. As this positional information has been lost, the transformation has been lossy with respect to a positional search.
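The two compressed forms discussed above can be reproduced with a brief sketch. The function names are hypothetical; the second function merges all runs of the same character, which keeps the counts but discards positions.

```python
from collections import Counter
from itertools import groupby


def run_length_encode(data: str) -> str:
    """Encode each run of identical characters as <count><char>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(data))


def merged_counts(data: str) -> str:
    """Combine all runs of a character into one total, losing run positions."""
    counts = Counter(data)
    return "".join(f"{counts[char]}{char}" for char in sorted(counts))


data = "AAABBBBCCCCCBBB"
print(run_length_encode(data))
print(merged_counts(data))
```

The merged form still answers "is there a 'B'?" and "how many 'B's?" correctly, but the original string can no longer be rebuilt from it, which is why it is lossy only with respect to positional search.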
[0037] Note that whether a transformation is lossy or lossless is with respect to a specific data operation. From the perspective of the existential and cardinal searches, those data operations not only are able to provide the correct result, they are also able to provide the correct result faster. Thus this data transformation may be said to be optimized for existential and cardinal searches. Such tradeoffs are common in design optimizations for data representations. For example, many database formats trade off performance for record insert data operations in exchange for fast read/search data operations.
Lookups/Indexing
[0038] Data operations may be optimized by utilizing supplementary data structures that contain information on how to perform an operation and/or information on the layout of a data representation. Such supplementary data structures are called "lookups". If a lookup is in the form of a set of tuples, then the lookup is called a "lookup table." If the lookup stores positional information of data in the data representation, then the lookup is called an "index."
[0039] An example of a lookup is used in Huffman encoding compression. Characters in a textual data set are analyzed for frequency. A lookup is generated that maps each character used in the textual data set with a prefix-free code, which is a bit string where each string is guaranteed never to be the prefix of another bit string. More frequent characters are mapped in the lookup to shorter prefix codes. Then the textual data set is transformed by replacing each instance of a character with its respective prefix code per the lookup.
[0040] The lookup in a Huffman encoding is typically in the form of a tree. However, it may also be represented as a table. Also note that whenever the Huffman encoded data set is transferred, so too is the lookup. In this way, a recipient may decode the Huffman encoded data set.
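As an illustration of a lookup held as a table rather than a tree, the following sketch builds a Huffman code table with the standard greedy merge of the two least frequent entries. The function names and the tie-breaking scheme are assumptions of this example, and different tie-breaking can yield different, equally valid, code tables.

```python
import heapq
from collections import Counter


def huffman_lookup(text: str) -> dict:
    """Build a character -> prefix-free bit string lookup table."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (frequency, unique tie breaker, partial code table).
    heap = [(freq, i, {char: ""}) for i, (char, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend one bit to every code in each merged subtree.
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text: str, lookup: dict) -> str:
    return "".join(lookup[char] for char in text)


def decode(bits: str, lookup: dict) -> str:
    reverse = {code: char for char, code in lookup.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # safe because no code is a prefix of another
            out.append(reverse[current])
            current = ""
    return "".join(out)


text = "AAAABBC"
table = huffman_lookup(text)
print(table, decode(encode(text, table), table) == text)
```

The decoder needs the same table as the encoder, which is why the lookup travels with the encoded data set.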
[0041] Turning to indexes, consider the 12 character string "AAABBBBCCCCC". If a programmer were to optimize positional search data operations, and also knew that the data set was guaranteed to be in order, the programmer might prepare an index with three records as follows: "A, 1"; "B, 4"; and "C, 8". Thus in order to find the location of the first 'B' in the data set, instead of scanning the characters in the data set, the positional search data operation need only look up the position in the index, resulting in faster data processing. Note that generating the index in the first place might have made use of a full scan of the characters in the data set. However, by generating and storing the index during preprocessing, the index provides an optimization that may be realized whenever a positional search data operation is performed.
Performing a Prefix Burrows-Wheeler Transform
Overview
[0042] This section describes how to perform a Prefix Burrows-Wheeler Transform ("PBWT"). PBWT is a novel modification of techniques used in block-sorting compression, such as in a Burrows-Wheeler Transform, to optimize for existential searches in data sets with closed data domains. As will be described in other sections, applications extend beyond bioinformatics. However, for illustrative purposes, the following discussion will use bioinformatics data as an exemplary context to describe PBWT and its associated data operations.
[0043] Figure 1 is a top level diagram 100 of a PBWT and an existential search operation. In the present illustration, bioinformatics sequences are shown as the data to be operated on. Raw data 102 comprised of a data set of lexemes made up of a closed domain of characters is received by a PBWT transformation routine 104. The PBWT transformation routine 104 converts the raw data into a prefix table 106 and an offset table 108. The prefix table 106 provides a lookup to identify all single character prefixes for any substring of a lexeme in the data set. The offset table 108 provides a lookup to identify the offset location of lexemes in the prefix table starting with a given single character. The prefix table 106 and the offset table 108 represent a significant compression of the original raw data.
[0044] An existential search routine 110 may then search the prefix table 106 and offset table 108 for an arbitrary string 112 drawn from the domain of characters. The existential search can accordingly determine whether the string 112 exists 114 in the representation. Via PBWT, the existential search may be performed significantly more quickly than prior techniques, without loss of accuracy.
[0045] Figure 2 is a flow chart 200 of an exemplary performance of a PBWT. Figures 3a through 3k illustrate different stages 300a - 300k in generating a PBWT prefix table and a PBWT offset table in performing a PBWT.
Receive Raw Data
[0046] In performing a PBWT, a computing device receives raw data in block 202. The raw data generally arrives in the form of data strings 300a as shown in Figure 3a. Here, the data strings 300a are four genetic sequences each with four nucleic acids. It is to be stressed that the actual lengths of the sequences can vary, and the data set domain may be any closed domain where the characters comprising the domain have a posited order, lending themselves to a lexicographical sort.
Generate Suffixes for Each Sequence
[0047] Once raw data is received, the received data will be analyzed and subjected to character rotation techniques to identify all prefixes. The identity and location of the prefixes will be used to generate a prefix table. In block 204 of Figure 2, the suffixes for each sequence are generated and added to the data set. The result 300b is shown in Figure 3b. For example, the sequence ACTG has suffixes CTG, TG and G.
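By way of non-limiting illustration, the suffix generation of block 204 might be sketched in Python as follows. The function name with_suffixes is hypothetical and is used here only for explanation.

```python
def with_suffixes(sequences):
    """Return every sequence followed by each of its proper suffixes."""
    expanded = []
    for sequence in sequences:
        # start == 0 yields the sequence itself; later starts yield its suffixes.
        for start in range(len(sequence)):
            expanded.append(sequence[start:])
    return expanded


print(with_suffixes(["ACTG"]))  # ['ACTG', 'CTG', 'TG', 'G']
```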
Perform a Lexicographical Sort and Remove Prefix Matches
[0048] Now that the data set includes all suffixes, in block 206 of Figure 2, the data set is lexicographically sorted. The result 300c of the lexicographical sort is shown in Figure 3c.
[0049] While the lexicographical sort 206 is used to guarantee the lexicographical order of the lexemes in the data set, it also aids in removing lexemes that are prefix matches as in block 208 of Figure 2. Specifically, in one embodiment, the sorted data set may be linearly scanned. If a lexeme is identical to or is a prefix of the immediately succeeding lexeme, then it is a prefix match and is removed. For example the suffix GC for ATGC is identical to the suffix GC for CCGC, and thus the duplicate is removed. Similarly, the suffix TG is a prefix of TGC and is also removed. The result 300d of the removed prefix matches is shown in Figure 3d.
[0050] Further note that due to the initial lexicographical sort 206, the result 300d with the prefix matches removed is also lexicographically sorted.
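As a hedged sketch of blocks 206 and 208, the sort and the linear scan for prefix matches might look as follows in Python. The function name is hypothetical, and the four input sequences (ACTG, ATGC, CCGC and CTGC) are an assumption inferred from the lexemes named in this description, since Figure 3a is not reproduced here.

```python
def sort_and_remove_prefix_matches(lexemes):
    """Sort lexicographically, then drop any lexeme that is identical to,
    or a prefix of, its immediate successor."""
    ordered = sorted(lexemes)  # plain string order is lexicographical for A, C, G, T
    kept = []
    for position, lexeme in enumerate(ordered):
        successor = ordered[position + 1] if position + 1 < len(ordered) else None
        # Identical lexemes also satisfy startswith, so duplicates are dropped too.
        if successor is None or not successor.startswith(lexeme):
            kept.append(lexeme)
    return kept


# Assumed sequences, each followed by its suffixes.
data = [
    "ACTG", "CTG", "TG", "G",
    "ATGC", "TGC", "GC", "C",
    "CCGC", "CGC", "GC", "C",
    "CTGC", "TGC", "GC", "C",
]
print(sort_and_remove_prefix_matches(data))
```

Under these assumed inputs the surviving lexemes are ACTG, ATGC, CCGC, CGC, CTGC, GC and TGC, which remain in lexicographical order.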
Perform a First Letter Rotation
[0051] The data set will now be subjected to letter rotation techniques in order to determine the location of relative prefixes. In block 210 of Figure 2, each lexeme is duplicated and a delimiter is added to the end of each original lexeme. Note that the delimiter is a member of the closed data domain, but is lexicographically ordered after all the elements in the lexemes. This is done so that the lexicographical order of the lexemes is the same with the delimiter as it is without the delimiter.
[0052] In block 212 of Figure 2, each new duplicate lexeme has its first letter rotated to the end of the lexeme and separated by a separator character. The separator is also a member of the closed data domain, but is lexicographically ordered before all the elements in the lexemes. Again, this is done so that the lexicographical order of the rotated lexeme is the same with the separator as it is without the separator.
[0053] The result 300e of the duplication, first letter rotation, and the addition of the delimiter and separator is shown in Figure 3e. In this Figure, the delimiter is the '$' character and the separator is the '/' character.
Lexicographically Sort the Delimited Original and Rotated Lexemes
[0054] In block 214 of Figure 2, the delimited/rotated/separated lexemes are lexicographically sorted together. Once this sort is performed, the data is in condition to generate a PBWT prefix table. The result of the lexicographical sort 300f is shown in Figure 3f.
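Blocks 210 through 214 might be sketched as follows, purely as an illustration. The names ORDER, sort_key and rotate_and_sort are hypothetical. Because '$' precedes the letters in ASCII, the sketch assumes an explicit ordering string so that the separator '/' sorts before, and the delimiter '$' after, every nucleic acid character.

```python
ORDER = "/ACGT$"  # assumed ordering: separator first, delimiter last


def sort_key(lexeme):
    """Translate a lexeme into ranks so the posited character order is used."""
    return [ORDER.index(ch) for ch in lexeme]


def rotate_and_sort(lexemes):
    """Delimit each original lexeme, add a first-letter-rotated duplicate,
    and sort both kinds of lexeme together."""
    delimited = [lexeme + "$" for lexeme in lexemes]
    rotated = [lexeme[1:] + "/" + lexeme[0] for lexeme in lexemes]
    return sorted(delimited + rotated, key=sort_key)


lexemes = ["ACTG", "ATGC", "CCGC", "CGC", "CTGC", "GC", "TGC"]
for entry in rotate_and_sort(lexemes):
    print(entry)
```

For instance, the rotated lexeme C/G (from GC) sorts between ATGC$ and CCGC$, and TGC/A and TGC/C sort between GC$ and TGC$.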
Generate a Non-Rotated and Prefix Paired Lookup Table
[0055] The lexicographically sorted data 300f is then scanned in block 216 of Figure 2 to generate prefix pairs. Specifically, the table is scanned lexeme by lexeme in lexicographical order. The lexeme is first tested to determine whether it is a rotated or non-rotated lexeme. This may be readily ascertained by detecting a delimiter. If there is a delimiter, it is a non-rotated lexeme. Otherwise, it is a rotated lexeme.
[0056] A non-rotated lexeme is then associated with all rotated characters between itself and the previous non-rotated lexeme. For example, in 300f, CCGC$ will be associated with 'G' because it is the sole rotated character between CCGC$ and the previous non-rotated lexeme ATGC$. Similarly, in 300f, TGC$ will be associated with both 'A' and 'C' since those two characters are the rotated characters between TGC$ and the previous non-rotated lexeme GC$. The rotated characters may be readily extracted by virtue of detecting the separator character. The result 300g of the prefix associations is shown in Figure 3g.
[0057] Once the prefixes have been associated with their respective non-rotated lexemes, the rotated duplicates may be removed and the delimiters removed as in block 218 of Figure 2. The result 300h of the removals is shown in Figure 3h.
[0058] As can be seen, the rotated characters are the single character prefixes. Because of the rotation 212 and subsequent lexicographical sort 214, each rotated lexeme sorts immediately before the non-rotated lexeme that begins with the rotated lexeme's remaining characters. This is why single character prefixes will immediately precede a non-rotated lexeme.
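The scan of blocks 216 and 218 might be sketched as follows. The function name build_prefix_pairs is hypothetical, and the input list is the sorted data as reconstructed from this description under the assumption that '/' is the separator and '$' the delimiter.

```python
def build_prefix_pairs(sorted_entries):
    """Pair each non-rotated lexeme with the rotated characters that
    appear between it and the previous non-rotated lexeme."""
    table = []
    pending = []  # rotated characters seen since the last non-rotated lexeme
    for entry in sorted_entries:
        if entry.endswith("$"):
            # Non-rotated lexeme: record it without its delimiter.
            table.append((entry[:-1], pending))
            pending = []
        else:
            # Rotated lexeme: the character after the separator is the prefix.
            pending.append(entry.split("/")[1])
    return table


sorted_entries = [
    "ACTG$", "ATGC$", "C/G", "CCGC$", "CGC/C", "CGC$", "CTG/A", "CTGC$",
    "GC/C", "GC/T", "GC$", "TGC/A", "TGC/C", "TGC$",
]
for lexeme, prefixes in build_prefix_pairs(sorted_entries):
    print(lexeme, prefixes)
```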
Add Lexeme Lengths to Finalize Prefix Table Generation
[0059] To complete the prefix table, the length of each lexeme remaining in the data set is computed, and associated with that lexeme, as in block 220 of Figure 2. Specifically, the number of characters in the lexeme is counted. The count may be performed via a linear, sequential scan of the lexeme. Alternatively, the size of the lexeme in memory may be detected and the character count performed by dividing the memory size by a known fixed memory size of each character. For example, in the C programming language, the expression "sizeof(*pLexeme)/sizeof(char)" obtains the memory size of a lexeme pointed to by pointer pLexeme (assuming pLexeme points to a fixed-size character array rather than to a single character) and divides it by the size of a character data type. In the case of genetic data, the data may not be represented as characters, but potentially as two-bit values since there are only four nucleic acids. (Note that additional bits may be used in order to include separators and delimiters.)
[0060] Regardless of how the length of the lexeme is computed as in 220, the length is then associated with the lexeme. As shown below, this length will later be compared with patterns being searched for to determine whether a search should be terminated because the lexeme size is smaller than the pattern being searched. The result 300i of the added lexeme lengths is shown in Figure 3i.
Calculate Prefix Counts
[0061] In order to use the prefix table for searches, a prefix count is computed for each character in the closed data domain on a per lexeme basis. This need not be done during preprocessing. Since prefix tables are significantly smaller than their original raw data, it is conceivably practical to compute the prefix counts in each session in which at least one search is to be performed. However, an implementer may opt to preprocess the prefix table and calculate prefix counts as in block 222 of Figure 2.
[0062] To calculate prefix counts, each lexeme in the data set is associated with a prefix count corresponding to each character in the closed data domain that was used in a lexeme in the data set (minus the delimiters and separators). The data set is scanned in lexicographical order. For each lexeme, the counts for each prefix in the previous lexeme are added to the previous prefix counts of the previous lexeme. For example, in prefix count representation 300j in Figure 3j, the first lexeme is ACTG. Because there are no prefixes in the previous lexeme (indeed, there is no previous lexeme to have a prefix in the first place), all nucleic acids have a prefix count of 0. This remains true for subsequent lexemes ATGC and CCGC. But note that for CGC, the previous lexeme had a prefix of G. Accordingly, we increment the prefix count of G for lexeme CGC to 1. The next lexeme, CTGC, increments the prefix count for C by 1. Since the previous prefix count for G was 1, it too remains at 1 for CTGC. This process is repeated until the last lexeme.
[0063] After the last lexeme, a dummy record is added, to take into account any prefixes in the last lexeme. In prefix count representation 300j in Figure 3j, note the addition of row 7 which shows the total prefix count taking into account the A and C prefixes in the last lexeme TGC. The prefix counts will be used during searches as part of computing existence search bounds for lexemes.
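A minimal sketch of the prefix count computation of block 222, including the dummy record, is given below. The function name prefix_counts is hypothetical, and the per-row prefixes are those reconstructed from this description rather than copied from Figure 3j.

```python
ALPHABET = "ACGT"


def prefix_counts(prefixes_per_row):
    """Return one count row per lexeme plus a trailing dummy row. Each row
    holds the number of prefixes seen in all preceding rows."""
    running = {ch: 0 for ch in ALPHABET}
    rows = []
    for prefixes in prefixes_per_row:
        rows.append(dict(running))  # counts exclude the current row's prefixes
        for ch in prefixes:
            running[ch] += 1
    rows.append(dict(running))  # dummy record accounts for the last lexeme
    return rows


# Prefixes for ACTG, ATGC, CCGC, CGC, CTGC, GC, TGC (assumed).
table = prefix_counts(["", "", "G", "C", "A", "CT", "AC"])
for row_number, counts in enumerate(table):
    print(row_number, counts)
```

Under these assumptions the row for CGC has a count of 1 for G, and the dummy row 7 has counts of 2, 3, 1 and 1 for A, C, G and T respectively.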
[0064] Note that an immediate compression benefit is that the size of a PBWT prefix table is linearly proportional to the number of unique lexemes, rather than the total number of lexemes in a received set of raw lexemes. Unlike prior art techniques, where adding additional redundant lexemes will increase the amount of memory to store the lexeme information, PBWT prefix tables will not increase.
[0065] It is to be emphasized that an existential search data operation, such as the k-Mer search operation described below, and other related data operations will only make use of the prefixes, the lengths and the prefix count information. Accordingly, during such data operations, the lexemes need not be stored in memory. Furthermore, the prefix count information may also be compressed or eliminated via rollup techniques. A compressed data layout of a prefix table in memory is described with respect to Figure 4 below.
Generate an Offset Table
[0066] To perform existence searches with PBWT, a PBWT offset table may also be pre-computed as in block 224 of Figure 2. The PBWT offset table is a lookup that stores the offset position of the first lexeme in the lexicographically sorted lexemes in the prefix table that starts with a particular character in the closed data domain. Note that offset positions are not cardinal positions. Offsets denote the amount to add to a base number to find a desired datum, in this case a particular lexeme. Thus offsets start with 0, whereas cardinal positions start with 1. The associated PBWT offset table 300k for prefix table 300j is shown in Figure 3k.
[0067] As can be seen, the first instance of a lexeme starting with 'A', in this case ACTG, in prefix table 300j is offset position 0. Accordingly, 'A' is associated with 0 in the offset table 300k. Similarly, the first lexeme starting with 'C', in this case CCGC, in prefix table 300j is in offset position 2. Accordingly, 'C' is associated with 2 in the offset table 300k. The process is repeated for nucleic acids 'G' and 'T'.
[0068] As with prefix counts, offset tables are relatively inexpensive, computer processing-wise, to compute. Accordingly, an implementer may opt to pre-compute the offset table, or may opt to compute the offset table for every session where at least one search is to be performed.
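For illustration only, an offset table as in block 224 might be derived from the sorted lexemes as follows. The function name build_offset_table is hypothetical, and the lexeme list is the one reconstructed from this description.

```python
def build_offset_table(sorted_lexemes):
    """Map each leading character to the 0-based offset of the first
    lexeme that starts with it."""
    offsets = {}
    for offset, lexeme in enumerate(sorted_lexemes):
        # setdefault keeps only the first offset seen for each character.
        offsets.setdefault(lexeme[0], offset)
    return offsets


lexemes = ["ACTG", "ATGC", "CCGC", "CGC", "CTGC", "GC", "TGC"]
print(build_offset_table(lexemes))  # {'A': 0, 'C': 2, 'G': 5, 'T': 6}
```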
Compressed PBWT Data Layout in Memory
[0069] As mentioned above, PBWT enables the amount of storage used to store the lexeme data to be greatly compressed. Consider a set of 1,000 genomic sequences, each having 100 base pairs denoted by a single nucleic acid. Since there are four nucleic acids, conceivably only 2 bits (4 = 2^2) need be used for each nucleic acid. Accordingly, 1,000 sequences x 100 nucleic acids per sequence x 2 bits per nucleic acid = 200,000 bits. Figure 4 provides an illustration 400 of compression in PBWT.
[0070] Consider the first row. For lexeme 402 CGC, it has a length 404 of 3, and a prefix 406 of C. All the preceding rows (not shown) had a prefix count 408 of 0 for A, C and T, and a prefix count 408 of 1 for G. Since the lexeme itself is not used in an existential search, the lexeme CGC need not be coded at all. The length may be coded in 7 bits (assuming that we never have a length greater than 127, since 2^7 = 128). Here the length is 3, so the binary encoding is 0000011. As for coding the prefixes 406 associated with the lexeme 402, we can create a bit mask where a bit corresponds to each of the four nucleic acids, ACGT. Since we do not have prefixes of A, G and T, the 1st, 3rd and 4th positions are 0. But because we do have a prefix of C, the 2nd position is set to 1. Accordingly, the ACGT prefix bit map can be encoded in the four bits 0100.
[0071] Turning to the prefix counts, note that the prefix counts may be generated dynamically and therefore need not be coded. Consider the second row CTGC. Using the above techniques, it may be encoded as 0000100 (length 404) and 1000 (prefixes associated 406). After reading the first row CGC, in working memory, rather than in persistent memory, the PBWT routines will have stored a count of 1 prefix for G and 0 for A, C and T. Upon reading the C prefix for row CGC, it will increment, in working memory, the prefix count of C to 1. Thus, in working memory, the PBWT routines will store a count of 1 for prefixes C and G, and 0 for A and T. The key benefit is that persistent storage need not store the prefix counts, thereby saving memory in exchange for processing time. This technique of tracking counts may be referred to as a "rollup technique."
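The 11-bit record layout and the rollup of prefix counts in working memory might be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and packing the 7-bit length ahead of the 4-bit ACGT mask in a single integer is an assumption about layout, not a statement of how the records must be stored.

```python
ALPHABET = "ACGT"  # bit-mask order used in the text


def encode_record(length, prefixes):
    """Pack a lexeme length (7 bits) and an ACGT prefix mask (4 bits)."""
    if not 0 <= length < 128:
        raise ValueError("length must fit in 7 bits")
    mask = 0
    for ch in prefixes:
        mask |= 1 << (len(ALPHABET) - 1 - ALPHABET.index(ch))
    return (length << 4) | mask


def decode_record(record):
    """Recover (length, prefixes) from a packed record."""
    prefixes = "".join(
        ch for i, ch in enumerate(ALPHABET)
        if record & (1 << (len(ALPHABET) - 1 - i))
    )
    return record >> 4, prefixes


def rollup(records, counts):
    """Advance running prefix counts in working memory, record by record."""
    for record in records:
        _, prefixes = decode_record(record)
        for ch in prefixes:
            counts[ch] += 1
    return counts


rows = [encode_record(3, "C"), encode_record(4, "A")]  # CGC, then CTGC
print([format(r, "011b") for r in rows])  # ['00000110100', '00001001000']
print(rollup(rows[:1], {"A": 0, "C": 0, "G": 1, "T": 0}))
```

The second print shows the counts held in working memory after the row for CGC has been read: 1 for C and G, and 0 for A and T.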
[0072] Note that the aforementioned rollup technique describes a pure tradeoff of storage (persistent) memory with respect to processing time. Prior art literature describes variations of rollups, for example with respect to FM Indexes and Burrows-Wheeler Transforms (BWT). In some variations, a working bit may be added in storage to every nth record to aid in rollup.
[0073] However, in the aforementioned rollup technique, note that in storage, only 11 bits per record are used rather than 200 bits per record. Thus our 1,000 sequences only need 11,000 bits rather than the 200,000 bits in raw form, achieving a 94.5% compression in this example, without loss of accuracy.
Exemplary Operations on Data Transformed by the Prefix Burrows-Wheeler Transform
Existence Searches, k-Mer Searches and Prefix Burrows-Wheeler Transform
[0074] As previously mentioned, PBWT is optimized for existence searching. For example, in bioinformatics, the closed data domain is comprised of the four nucleic acids, adenine, cytosine, guanine and thymine, typically represented with their first letters A, C, G and T respectively. Data sets are typically sequences of these nucleic acids. A k-Mer is a permutation of these nucleic acids in the form of a sequence. A k-Mer search is an existence search for that particular k-Mer pattern in a data set.
Performing an Existence Search
[0075] To perform an existence search, the prefix table, the prefix counts and the offset tables are computed in advance, either via preprocessing, or in the case of the prefix counts and offset tables, potentially interactively. Figure 5 is a flow chart 500 of an exemplary existence search using PBWT, and Figures 6a through 6f illustrate different stages 600a - 600f traversing the PBWT data, where the PBWT data has been prepared as described with respect to Figure 2 and Figures 3a through 3k.
Initialization
[0076] A PBWT existence search starts with making a search boundary using the first and last records of the prefix table with a prefix count as the starting and ending boundaries respectively. Note that the last record is the dummy record added while preparing the prefix counts in block 222 of Figure 2. As we iterate, we walk the bounds closer and closer together, until we determine that we have found the search pattern, or that the search pattern does not exist in the data set.
[0077] We can represent the search boundary using the following notation:
[X, Y) where X represents the start boundary, and Y represents one lexeme past the end of the data set, which is the ending boundary. Note that the square bracket denotes inclusion, and the closing parenthesis denotes exclusion, where Y is not necessarily a lexeme, and may be a dummy record.
[0078] Consider a search for GTGC in the prefix table in 600a of Figure 6a. We initialize in block 502 of Figure 5 by setting X to 0 and Y to 7.
Retrieve a Character from the Search Pattern and Modify the Search Boundary
[0079] To perform an existence search using PBWT, the search pattern is scanned from right to left. In block 504 of Figure 5, we retrieve a character from the search pattern. Hence the first character to be retrieved from GTGC is the letter 'C'.
[0080] To modify the search boundaries, we first look at the prefix counts associated with the search boundaries, as in block 506 of Figure 5. For the start boundary X, note that in prefix table 600a of Figure 6a, in row X (here 0), the prefix count for 'C' is 0. We then add the offset corresponding to 'C' in the offset table (here 2), and have a new start boundary of 2.
[0081] Similarly for the ending boundary Y, note that in prefix table 600a in Figure 6a, in row Y (here 7), the prefix count for 'C' is 3. Again, the offset corresponding to 'C' in the offset table is 2. By adding, we have a new ending boundary of 5.
[0082] Thus we have modified the search boundary to [2, 5) as illustrated in prefix table 600b in Figure 6b.
[0083] As in decision block 508 of Figure 5, we test to see if the size associated with the starting boundary X is greater than or equal to the size of the search pattern we have processed so far. In our example, the starting boundary X is 2, and the size associated with starting boundary X (at 2) is 4. Since we have only processed 'C' at this time, the size of the search pattern we have processed thus far is 1. Since 4 >= 1, we continue processing. If the size of the search pattern processed thus far were larger than the size associated with X, we would increment X. This is a way of indicating that the search pattern whose existence we are testing is larger than the lexeme, and therefore does not exist in the data set, as will be seen in decision block 512 of Figure 5.
[0084] If there is a subsequent character in the search pattern, as tested in block 510, we continue processing back in block 504. Otherwise we evaluate the results in block 512.
[0085] In decision block 512 of Figure 5, we test to see if X < Y. If it is, we conclude that the search pattern exists in the data set as in block 514. Otherwise, we report that the search pattern does not exist in the data set as in block 516.
[0086] To continue the example, the next character in GTGC is 'G'. So in block 504 of Figure 5, we retrieve the 'G'. As in block 506, at X=2, the prefix count for G is 0 and at Y=5, the prefix count for G is 1. We check the offset table for 'G' and note that we should add 5 to X and Y resulting in a new search boundary of [5, 6). We illustrate this in result 600c in Figure 6c.
[0087] As in block 508 in Figure 5, we check to see if the size of the lexeme is large enough to accommodate our search pattern thus far. Since the length of the search pattern thus far is 2 (i.e., the length of GC), the size associated with X (here 5) is 2, and 2 >= 2, we conclude that the lexeme can accommodate our search pattern thus far. Since we have additional characters in the search pattern to process, we continue processing back to block 504.
[0088] Further continuing the example, in block 504 of Figure 5 we retrieve the next character in GTGC which is 'T'. As in block 506, at X=5, the prefix count for T is 0 and at Y=6, the prefix count for T is 1. We check the offset table for 'T' and note that we should add 6 to X and Y resulting in a new search boundary of [6, 7). We illustrate this in result 600d in Figure 6d.
[0089] As in block 508 of Figure 5, we check to see if the size of the lexeme is large enough to accommodate our search pattern thus far. Since the length of the search pattern thus far is 3 (i.e., the length of TGC), the size associated with X (here 6) is 3, and 3 >= 3, we conclude that the lexeme can accommodate our search pattern thus far. Since we have additional characters in the search pattern to process, we continue processing back to block 504.
[0090] With the final character in this illustrative search pattern, we show how a search determines that a search string does not exist in a data set. In block 504 of Figure 5, we retrieve the next character in GTGC which is 'G'. As in block 506, at X=6, the prefix count for G is 1 and at Y=7, the prefix count for G is 1. We check the offset table for 'G' and note that we should add 5 to X and Y resulting in a new search boundary of [6, 6). We illustrate this in result 600e in Figure 6e.
[0091] As in block 508 of Figure 5, we check to see if the size of the lexeme is large enough to accommodate our search pattern thus far. Since the length of the search pattern thus far is 4 (i.e., the length of GTGC), and the size associated with X (here 6) is 3, which is less than 4, we increment X by 1 per block 508. This results in a search boundary of [7, 6). Since X > Y, we conclude that the lexeme cannot accommodate our search pattern of GTGC and terminate processing. As in block 516, we report that GTGC does not exist in our data set.
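The existence search walked through above might be sketched as follows. This is a hedged illustration, not a definitive implementation: the function name exists is hypothetical, the tables are reconstructed from the values quoted in this description rather than copied from the figures, and returning early once the boundary becomes empty is an added assumption that does not change the reported result.

```python
ALPHABET = "ACGT"

# Tables for the running example, reconstructed from the text (assumption).
LENGTHS = [4, 4, 4, 3, 4, 2, 3]  # ACTG ATGC CCGC CGC CTGC GC TGC
COUNTS = [                       # prefix counts per row, columns A C G T
    (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 1, 0),
    (0, 1, 1, 0), (1, 1, 1, 0), (1, 2, 1, 1),
    (2, 3, 1, 1),                # row 7: dummy record
]
OFFSETS = {"A": 0, "C": 2, "G": 5, "T": 6}


def exists(pattern, lengths=LENGTHS, counts=COUNTS, offsets=OFFSETS):
    """Report whether pattern occurs anywhere in the transformed data set."""
    x, y = 0, len(lengths)  # search boundary [X, Y)
    # Scan the pattern right to left; processed is the length handled so far.
    for processed, ch in enumerate(reversed(pattern), start=1):
        if ch not in offsets:
            return False  # no lexeme starts with this character
        column = ALPHABET.index(ch)
        x = offsets[ch] + counts[x][column]
        y = offsets[ch] + counts[y][column]
        # A lexeme shorter than the processed pattern cannot contain it.
        if x < len(lengths) and lengths[x] < processed:
            x += 1
        if x >= y:
            return False  # boundary is empty; stop early
    return x < y


print(exists("GTGC"))  # False, as in the worked example
print(exists("TGC"))   # True
```

Each character of the pattern costs two count lookups and two additions, so the work grows with the pattern length rather than with the number of rows in the table.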
[0092] Note that the number of processing operations is linear with respect to the number of characters in the search pattern. This is because we iterate for every character in the search pattern. However, note that the time to search for the search pattern remains constant regardless of the size of the PBWT representation. Because the PBWT search relies on offsets, and because the offsets simply increase with the addition of new data, the time to search is independent of the size of the PBWT representation. Thus, regardless of the size of the database, the amount of time to search for a particular search pattern remains the same. Thus an advantage of PBWT is that performance will not suffer by adding more data, thereby lending itself to superior scalability.
Exemplary Hardware, Software and Communications Environment
Computing Device
[0093] Figure 7 illustrates several possible embodiments of a hardware, software and communications environment 700 for PBWT techniques.
[0094] Client device 702 is any computing device. Exemplary computing devices include without limitation personal computers, tablet computers, smart phones, and smart televisions and/or media players.
[0095] PBWT techniques have the advantage of being very compact and very efficient with processing. Although PBWT techniques may be brought to bear on a typical networked client device 702 accessing a remote server, PBWT may be realistically implemented on a standalone computer. Accordingly, PBWT might be on a client device 702 that is a portable laptop, or a portable embedded system, or a standalone station such as a kiosk. For example, a researcher in the field may have a custom computing device that contains an integrated computer running PBWT. Alternatively, a research lab may have an enclosed station that also contains an integrated computer running PBWT.
[0096] A client device 702 may have a processor 704 and a memory 706. Client device 702's memory 706 is any computer-readable media which may store several software components including an application 708 and/or an operating system 710. In general, a software component is a set of computer executable instructions stored together as a discrete whole. Examples of software components include binary executables such as static libraries, dynamically linked libraries, and executable programs. Other examples of software components include interpreted executables that are executed on a run time such as servlets, applets, p-Code binaries, and Java binaries. Software components may run in kernel mode and/or user mode.
[0097] Computer-readable media includes, at least, two types of computer-readable media, namely computer storage media and communications media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media.
[0098] To participate in a communications environment, client device 702 may have a network interface 712. The network interface 712 may be one or more network interfaces including Ethernet, Wi-Fi, or any number of other physical and data link standard interfaces. In the case where the user need only do operations on a standalone single machine, the network interface 712 is optional.
Client-Server/Multi-Tier
[0099] Client 702 may communicate with a server 716. Server 716 is any computing device that may participate in a network. The network may be, without limitation, a local area network ("LAN"), a virtual private network ("VPN"), a cellular network, or the Internet. The client network interface 712 may ultimately connect to remote networked storage 714, or to server 716 via server network interface 718. Server network interface 718 may be one or more network interfaces as described with respect to client network interface 712.
[00100] Server 716 also has a processor 720 and memory 722. As per the preceding discussion regarding client device 702, memory 722 is any computer-readable media including both computer storage media and communication media.
[00101] In particular, memory 722 stores software which may include an application 724 and/or an operating system 726. Memory 722 may also store applications 724 that may include without limitation, an application server and a database management system. In this way, client device 702 may be configured with an application server and data management system to support a multi-tier configuration.
[00102] Server 716 may include a data store 728 accessed by the data management system. The data store 728 may be configured as a relational database, an object-oriented database, a NoSQL database, and/or a columnar database, or any configuration to support scalable persistence.
Cloud
[00103] The server 716 need not be on site or operated by the client enterprise. The server 716 may be hosted in the Internet on a cloud installation 730. The cloud installation 730 may represent a plurality of disaggregated servers which provide virtual web application server 732 functionality and virtual database 734 functionality. Cloud 730 services 732, 734 may be made accessible via cloud infrastructure 736. Cloud infrastructure 736 not only provides access to cloud services 732, 734 but also billing services. Cloud infrastructure 736 may provide additional service abstractions such as Platform as a Service ("PAAS"), Infrastructure as a Service ("IAAS"), and Software as a Service ("SAAS").
Applications of Prefix Burrows-Wheeler Transform
Common Contexts of PBWT Applications
[00104] PBWT techniques may be advantageously applied to large data sets with an ordered character set, where the character set is finite, and where search data operations show the existence of an instance of the item to be searched for in the data set, rather than the specific location.
[00105] The specification of an ordered and finite character set is a result of the PBWT relationships to string searching techniques. However, non-string data may be encoded to a finite, ordered character set, and PBWT techniques applied to the encoding. For example, signal wave data from a cellular phone may be sampled, amplitudes encoded into strings, and the resulting encoded strings searched with PBWT. Such a technique could be used to detect patterns of power/transmission loss in cell phones.
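As one hypothetical way of encoding non-string data, sampled amplitudes might be quantized into a small ordered alphabet as sketched below. The function name, the four-letter alphabet and the uniform quantization levels are all assumptions chosen for illustration and are not part of the original disclosure.

```python
def encode_amplitudes(samples, low, high, alphabet="ABCD"):
    """Quantize samples in [low, high] into characters of an ordered,
    finite alphabet using equally wide levels."""
    levels = len(alphabet)
    span = high - low
    encoded = []
    for sample in samples:
        clamped = min(max(sample, low), high)  # keep outliers in range
        level = int((clamped - low) / span * levels)
        encoded.append(alphabet[min(level, levels - 1)])  # top edge maps to last level
    return "".join(encoded)


print(encode_amplitudes([0.0, 0.3, 0.6, 1.0], low=0.0, high=1.0))  # ABCD
```

The resulting strings have a closed, ordered character domain and could therefore be processed in the same manner as the nucleic acid sequences described above.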
[00106] PBWT techniques are lossless in the sense that the information content itself is not lost. However, PBWT techniques may be considered lossy in the sense that the location of a substring may be lost. Accordingly, PBWT has strong applicability in scenarios searching for an existence of an instance rather than location. For example, PBWT may be able to determine that there is an error in a cellular transmission, but it may not be able to determine where in the data set the error occurred. Examples of such scenarios are as follows.
Plagiarism Searches
[00107] Plagiarism is the passing off of another author's works as one's own. Ordinarily, portions of another author's works are marked with citations. Copyrighted material is usually associated with copyright notices, where permission from the original author or the author's agents has been obtained. However, with plagiarism, the plagiarist misappropriates the credit for another author's expressions.
[00108] Plagiarism exists in both professional and academic publishing circles. Detecting plagiarism is difficult due to the large amount of material that may be plagiarized. Accordingly, plagiarism detection is a candidate for automation.
[00109] Plagiarism detection may make use of PBWT techniques by selecting passages suspected of plagiarism as a search pattern. A corpus of known literature of the field is then processed into PBWT and the selected passage searched for. If there is a match, then it is likely that plagiarism has been detected.
[00110] PBWT performs exact matches, so paraphrased plagiarism might not be detected. However, PBWT has the advantage of being fast, portable due to its compression, and able to compare against a much larger corpus of known literature.
Open Source Clearance
[00111] Open source clearance relates to companies ensuring that their proprietary software does not contain open source. Often developers may copy source code from the internet and other sources without being aware that the source code was subject to an open source license. Since many open source licenses oblige the surrender of intellectual property rights, such additions may result in code intended to be proprietary losing those intellectual property rights. This phenomenon is known as "tainting."
[00112] An open source clearance is a due diligence search by a company, prior to releasing proprietary source code, to detect whether any open source had been added. PBWT techniques may be brought to bear by searching for suspect code snippets in a corpus of open source code compressed into a PBWT prefix table and offset table.
[00113] As with a plagiarism detector, PBWT will not detect paraphrased or modified source code. However, it will detect simple cut and pasted source code and will have the advantage of being able to perform comparisons against a very large corpus of source code known to be from open source repositories.
Bioinformatics Specific Applications of Prefix Burrows-Wheeler Transform
Genome Assembly
[00114] PBWT techniques are particularly useful in the context of high speed DNA sequencing. DNA sequencing involves reading short sequences, called "reads", that are a portion of a genome, and then re-assembling the reads into the original genome. DNA sequencing generally may be performed via one of two strategies. One strategy is in mapping sequencing, where the reads are mapped to an existing backbone genome structure, akin to mapping details of the genome to an outline of the genome. Another strategy is in de-novo genome assembly, where reads are assembled by matching one read to another via searching for predetermined sequences. De-novo genome assembly derives its name from the possibility of the resulting sequences being novel sequences.
[00115] In both mapping and de-novo sequencing, consider a read to be matched to a location on a backbone or to another read respectively. PBWT techniques may be used in a k-Mer search to locate the backbone location in the case of mapping sequencing and for another read in the case of de-novo assembly. For example, in the case of de-novo assembly, all the reads may be stored in memory in PBWT format. A read to be matched to another read has a k-Mer identified to be used as a search pattern. The identified k-Mer is usually a known overlap between reads. This predetermined overlap k-Mer is then searched on all the reads stored in memory using a PBWT k-Mer search, thereby identifying whether a match exists or not. Accordingly, this information may be used in the construction of graph data structures, such as read overlap graphs and de Bruijn graphs, representing possible overlaps used to reconstruct the genome.
[00116] In the case of de-novo assembly, typically a large number of reads are matched to other reads. Since the PBWT format is merely being searched, rather than modified, matching of multiple reads may be performed in parallel. Accordingly, a dispatcher using parallel techniques, such as map-reduce, may spawn parallel read match operations, thereby considerably speeding up de-novo sequencing.
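The dispatch pattern described above may be sketched as follows, again by way of illustration only. The sketch assumes an immutable set of k-Mers as a stand-in for the read-only PBWT data, and uses a thread pool from the Python standard library in place of a map-reduce framework; the names kmer_set and match_all are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def kmer_set(reads, k):
    """Immutable set of all k-Mers in the reads.

    Assumption: stands in for the read-only PBWT data. Because it
    is never modified, many workers may query it at once without locks.
    """
    return frozenset(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def match_all(index, queries, workers=4):
    """Dispatch one existence query per k-Mer and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the queries.
        hits = list(pool.map(index.__contains__, queries))
    return dict(zip(queries, hits))

reads = ["ACGTAC", "GTACGG"]
index = kmer_set(reads, 4)
results = match_all(index, ["GTAC", "ACGG", "TTTT"])
```

Threads are used here only to show the structure of the dispatcher. In CPython, CPU-bound work of this kind would generally need a process pool or a distributed framework to obtain an actual speed-up.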
Searching for Genomic Defect
[00117] Another example of applying PBWT techniques in bioinformatics is in searching for a genome defect. Consider the case of a researcher who suspects that a particular genetic sequence of nucleic acids is a marker for a particular birth defect. The researcher may then set that particular genetic sequence as the search pattern and then perform a k-Mer search on a person under study. If the sequence is detected, it could then be correlated with the incidence or the absence of the birth defect in the person under study.
[00118] It is worth observing that there are other ways to perform a k-Mer search. However, the advent of PBWT techniques enables the processing to be performed in a small number of days rather than several months. Accordingly, PBWT techniques provide an improvement in processing speed of several factors, hitherto not available with prior art techniques.
Searching for Plant Tags
[00119] Another potential bioinformatics application of PBWT techniques is to search for biological tags in plants. Governments may seek to track the source of botanical plants. For example, some plants may be genetically modified and subject to certain regulations. By way of another example, some states have legalized marijuana and may wish to track plant sources as part of regulation. One way to track plants is to add a genetic tag to the plants to be tracked. Later, if a regulator or law enforcement officer confiscates some plants, they may be rapidly analyzed using PBWT. Specifically, the biological tag, in the form of a k-Mer, is set as the search pattern. Using a PBWT k-Mer search, one may identify the tag, or the absence of a genetic tag, in a plant. Accordingly, the source of the plant, or the contraband source of a plant, could be readily identified.
[00120] Again, it is worth observing that the speed and efficiency of PBWT makes such a scenario possible. Using prior art methods, the slowness of processing would create such a massive backlog of genetic tag searches as to render the process impractical. However, with PBWT, the compressed data is easily stored even in an embedded system, and the amount of time to perform the k-Mer search is sufficiently small as to render genetic tagging practical as a regulatory tool.
Conclusion
[00121] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

WHAT IS CLAIMED IS:
1. A system comprising a software component to perform an existence search data operation on a prefix table representation of a set of raw lexemes, comprising:
a processor;
a memory communicatively coupled to the processor;
a persistent memory communicatively coupled to the processor storing a prefix table representation of a set of raw lexemes;
an existential search software component stored in the memory configured to receive a search pattern and perform a lexeme search for the search pattern in the prefix table representation.
2. The system of claim 1, wherein the amount of persistent memory used to store the prefix table representation of a set of raw lexemes is linearly proportional to the number of unique lexemes in the set of raw lexemes.
3. The system of claim 1, comprising an offset table stored in the persistent memory, wherein the existential search makes use of the offset table.
4. The system of claim 1, wherein the set of raw lexemes is genome data.
5. The system of claim 1, comprising a second processor specifically configured to perform at least a portion of an existential search.
6. The system of claim 1, wherein the system is hosted in a standalone portable computer.
7. The system of claim 1, wherein the prefix table representation is a prefix burrows-wheeler transform (PBWT) representation.
8. The system of claim 7, wherein the time to perform the existential search remains constant regardless of the number of lexemes used to generate the PBWT representation.
9. The system of claim 7, wherein the PBWT representation comprises a PBWT offset table.
10. The system of claim 7, wherein the existential search software component is further configured to perform rollup techniques to track prefix counts.
PCT/US2015/030233 2014-05-13 2015-05-11 Prefix burrows-wheeler transformation with fast operations on compressed data WO2018038697A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201580038227.8A CN108292307A (en) 2014-05-13 2015-05-11 With the quick operating prefix Burrow-Wheeler transformation to compressed data
EP15894507.1A EP3323203A4 (en) 2014-05-13 2015-05-11 Prefix burrows-wheeler transformation with fast operations on compressed data
HK18116146.9A HK1257025A1 (en) 2014-05-13 2018-12-17 Prefix burrows-wheeler transformation with fast operations on compressed data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/277,038 2014-05-13
US14/277,038 US9600599B2 (en) 2014-05-13 2014-05-13 Prefix burrows-wheeler transformation with fast operations on compressed data

Publications (1)

Publication Number Publication Date
WO2018038697A1 (en) 2018-03-01

Family

ID=54538681

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/030233 WO2018038697A1 (en) 2014-05-13 2015-05-11 Prefix burrows-wheeler transformation with fast operations on compressed data

Country Status (5)

Country Link
US (2) US9600599B2 (en)
EP (1) EP3323203A4 (en)
CN (1) CN108292307A (en)
HK (1) HK1257025A1 (en)
WO (1) WO2018038697A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7448012B1 (en) 2004-04-21 2008-11-04 Qi-De Qian Methods and system for improving integrated circuit layout
GB2567390B (en) * 2016-09-02 2021-10-06 Hitachi High Tech Corp Method for generating text string dictionary, method for searching text string dictionary, and system for processing text string dictionary
TWI719537B (en) * 2019-07-16 2021-02-21 國立清華大學 Text comparison method, system and computer program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009295A1 (en) * 2001-03-14 2003-01-09 Victor Markowitz System and method for retrieving and using gene expression data from multiple sources
US20090063465A1 (en) * 2007-08-29 2009-03-05 Yahoo! Inc. System and method for string processing and searching using a compressed permuterm index
US7734671B1 (en) * 2007-10-09 2010-06-08 The United States Of America As Represented By The Director, National Security Agency Method of sorting text and string searching
US20120330567A1 (en) 2011-06-21 2012-12-27 Illumina Cambridge Limited Methods and systems for data analysis
WO2013119770A1 (en) * 2012-02-08 2013-08-15 Dow Agrosciences Llc Data analysis of dna sequences

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4796003A (en) * 1984-06-28 1989-01-03 American Telephone And Telegraph Company Data compaction
US6075470A (en) * 1998-02-26 2000-06-13 Research In Motion Limited Block-wise adaptive statistical data compressor
US6751624B2 (en) * 2000-04-04 2004-06-15 Globalscape, Inc. Method and system for conducting a full text search on a client system by a server system
US6611832B1 (en) * 2000-10-30 2003-08-26 International Business Machines Corp. Data structure for longest-matching prefix searching and updating method for search table data structures
WO2004002044A2 (en) * 2002-02-01 2003-12-31 John Fairweather A system for exchanging binary data
US7028042B2 (en) * 2002-05-03 2006-04-11 Jorma Rissanen Lossless data compression system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A. J. COX ET AL., LARGE-SCALE COMPRESSION OF GENOMIC SEQUENCE DATABASES WITH THE BURROWS-WHEELER TRANSFORM
ANONYMOUS, COMPRESSION OF GENOMIC SEQUENCING DATA
H. LI ET AL., FAST AND ACCURATE SHORT READ ALIGNMENT WITH BURROWS-WHEELER TRANSFORM
R. DURBIN, EFFICIENT HAPLOTYPE MATCHING AND STORAGE USING THE POSITIONAL BURROWS-WHEELER TRANSFORM (PBWT)
See also references of EP3323203A4

Also Published As

Publication number Publication date
CN108292307A (en) 2018-07-17
US20170147542A1 (en) 2017-05-25
US10558739B2 (en) 2020-02-11
US20150331907A1 (en) 2015-11-19
US9600599B2 (en) 2017-03-21
HK1257025A1 (en) 2019-10-11
EP3323203A4 (en) 2019-03-27
EP3323203A1 (en) 2018-05-23

Legal Events

Date Code Title Description
NENP Non-entry into the national phase. Ref country code: DE
REEP Request for entry into the european phase. Ref document number: 2015894507. Country of ref document: EP
WWE Wipo information: entry into national phase. Ref document number: 2015894507. Country of ref document: EP
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 15894507. Country of ref document: EP. Kind code of ref document: A1