US20180268000A1 - Apparatus and Method for Distributed Query Processing Utilizing Dynamically Generated In-Memory Term Maps - Google Patents
- Publication number: US20180268000A1
- Authority: US (United States)
- Prior art keywords: term, row, query, values, file
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
- G06F16/221 — Column-oriented storage; management thereof
- G06F16/24561 — Intermediate data storage techniques for performance improvement
- G06F16/2471 — Distributed queries
- G06F16/256 — Integrating or interfacing systems involving database management systems in federated or virtual databases
- G06F17/30283; G06F17/30315; G06F17/30501; G06F17/30545
Definitions
- This invention relates generally to distributed query processing in a computer network. More particularly, this invention relates to techniques for distributed query processing utilizing dynamically generated in-memory term maps.
- Apache Parquet™ (“Parquet”) is an open source column-oriented file format that is designed for fast scanning and efficient compression of data. Parquet stores data so that the values in each column are physically stored in contiguous locations. Due to the columnar storage, Parquet can use per-column compression techniques that are optimized for each data type, and queries that use only a subset of all columns do not need to read the entire data for each row.
- Apache Lucene™ (“Lucene”) is an open source information retrieval library that supports indexing and search as well as text analysis, numeric aggregations, spellchecking, and many other features.
- Elasticsearch™ is an open source distributed search engine built on top of Lucene that provides (near) real time query and advanced analytics functionality for big data. Elasticsearch can quickly generate analytical results even on large datasets containing billions of records.
- Many analytics tasks require calculations on only a subset of all rows in the dataset. Lucene supports calculations on subsets of the data by efficiently filtering the entire dataset down to just the target rows, using the concept of an “inverted index”, or term map.
- A general map is a dictionary that contains a set of unique “keys”, where each key has an associated value. A term map is a special kind of map in which the keys are “terms” (e.g., words) and the values are arrays of Boolean (true/false) flags. Each column in the dataset has its own term map. There is one entry in a column's term map for each unique term (e.g., word) found in that column, for any row in the dataset. The Boolean flags indicate whether the corresponding term occurs in a specific row.
- As an example, consider columnar file 10 of FIG. 1. The data set has five rows, numbered 1 to 5. Each row has a single column (“gender”) containing either the string “F” or “M”.
- Table 12 shows how the original dataset of five rows is converted into a term map with two entries, where the value for each entry is an array of five Boolean flags (one for each row). The flag's value is “Yes” if the corresponding row contains that term; otherwise the value is “No”. One can easily determine the set of rows containing “M” for the Gender by consulting the flags in the term map entry for “M”.
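The conversion can be sketched in a few lines (illustrative Python, not from the patent; the five gender values below are hypothetical, since FIG. 1's exact rows are not reproduced in this text):

```python
def build_term_map(column_values):
    """Build a term map: each unique term maps to one Boolean flag per row."""
    n_rows = len(column_values)
    term_map = {}
    for row, term in enumerate(column_values):
        flags = term_map.setdefault(term, [False] * n_rows)
        flags[row] = True
    return term_map

# Hypothetical five-row "gender" column.
gender = ["F", "M", "F", "F", "M"]
gender_map = build_term_map(gender)

# Rows containing "M" can be read straight off the flag array (1-based IDs).
rows_with_m = [i + 1 for i, flag in enumerate(gender_map["M"]) if flag]
```

Filtering to the rows containing a given term is then a linear scan over an in-memory Boolean array, with no disk access.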
- A typical analytics query requires calculating an aggregation of some column value, where these aggregations are performed separately for each unique value in a different column. For example, the aggregation could be the average of values in an Age column, grouped by values in a Gender column. The results would typically be two values: the average Age for all rows where the Gender column contains “M”, and the average Age for all rows where the Gender column contains “F”.
- When using a Lucene index for aggregations, the original (pre-tokenized) values for a column are required; for example, when summing a numeric column the original numeric value is required. Typically these “row values” (called “document values” by Lucene) are read from the Lucene index files persistently stored on disk. This approach is significantly slower than the filtering operations that use in-memory term map data.
- A Lucene index consisting of term maps, row values and other data structures must be created before Elasticsearch can query the data. Creating a Lucene index is very time-consuming, and it typically requires 2-3 times more storage than the original data set. This means significant resources are required to leverage Elasticsearch for fast analytics on big data, and there is a significant delay before any new data can be queried. Therefore, it would be desirable to avoid paying the cost of creating and maintaining an index for columnar data (e.g., a Lucene index).
- A system has a master node with instructions executed by a master node processor to receive a query over a network from a client machine and distribute query segments over the network. Worker nodes receive the query segments. Each worker node includes instructions executed by a worker node processor to construct, from a columnar file (stored locally or in a distributed file system), a term map characterizing a term from the columnar file, row identifications from the columnar file and a Boolean indicator for each row identification that characterizes whether the term is present in the row specified by the row identification. The term map is cached in dynamic memory. Values responsive to the query segment are collected from the term map and sent to the master node. The master node aggregates values from the worker nodes to form a result that is returned to the client machine over the network.
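The master's final aggregation step might be sketched as follows (illustrative Python; the per-group `(sum, count)` partial-aggregate format is an assumption, not something stated in the patent):

```python
def merge_partials(worker_results):
    """Combine per-worker partial aggregates (sum, count) into final averages."""
    totals = {}
    for partial in worker_results:
        for group, (s, c) in partial.items():
            acc_s, acc_c = totals.get(group, (0, 0))
            totals[group] = (acc_s + s, acc_c + c)
    # Divide once at the master, after all partials have arrived.
    return {group: s / c for group, (s, c) in totals.items() if c}

# Hypothetical partial results from two worker nodes: average Age by Gender.
worker_1 = {"M": (60, 2), "F": (51, 3)}
worker_2 = {"M": (40, 1)}
result = merge_partials([worker_1, worker_2])
```

Shipping `(sum, count)` pairs rather than per-worker averages keeps the merge associative, so the master can combine any number of workers in any order.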
- The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
- FIG. 1 illustrates conversion of columnar data to a term map.
- FIG. 2 illustrates a system configured in accordance with an embodiment of the invention.
- FIG. 3 illustrates distributed query processing operations performed in accordance with an embodiment of the invention.
- FIG. 4 illustrates conversion of columnar data to a tokenized gender term map in accordance with an embodiment of the invention.
- FIG. 5 illustrates conversion of columnar data to a tokenized age term map in accordance with an embodiment of the invention.
- FIG. 6 illustrates interoperable executable code components utilized in accordance with an embodiment of the invention.
- Like reference numerals refer to corresponding parts throughout the several views of the drawings.
- FIG. 2 illustrates a system 100 configured in accordance with an embodiment of the invention.
- The system 100 is configured for distributed query processing. It includes a client computer 102 in communication with a master node 104 via a network 106, which may be any combination of wired and wireless networks.
- The client computer 102 includes standard components, such as a central processing unit 110 connected to input/output devices 112 via a bus 114. The input/output devices 112 may include a keyboard, mouse, touch display and the like. A network interface circuit 116 is also connected to the bus 114 and provides connectivity to network 106. A memory 120 is also connected to the bus 114. The memory 120 stores a client module 122, which includes instructions executed by the central processing unit 110. The client module 122 may be a browser or application that provides connectivity to the master node 104 for the purpose of submitting queries. The client computer 102 may be a desktop computer, mobile computer, tablet, mobile device and the like.
- The server 104 also includes standard components, such as a central processing unit 130, input/output devices 132, a bus 134 and a network interface circuit 136. A memory 140 is connected to bus 134. The memory 140 stores a master query processor 142 with instructions executed by the central processing unit 130. The master query processor 142 develops a plan for a received query, distributes query segments to worker nodes 150_1 through 150_N, aggregates values received from the worker nodes and supplies a final result, as demonstrated below.
- FIG. 2 also illustrates worker nodes 150_1 through 150_N. Each worker node includes standard components, such as a central processing unit 151, input/output devices 152, a bus 154 and a network interface circuit 156.
- A memory 160 is connected to the bus 154. The memory 160 includes instructions executed by the central processing unit 151 to implement distributed query processing in accordance with embodiments of the invention. In one embodiment, the memory 160 stores a client query processor 162, which coordinates the processing of a query segment assigned to it by the master query processor 142.
- The memory 160 also stores a columnar file 164. Unlike the prior art, which would have an associated index for the columnar file 164, no such index exists in this case. Instead, a term map module 166 is used to dynamically generate term maps in response to a query. The term map module 166 stores generated term maps in a term map cache 168. The term map cache 168 is an in-memory resource (i.e., in non-persistent memory, such as RAM) and therefore provides results faster than a persistently stored index.
- FIG. 3 illustrates processing operations associated with an embodiment of the invention. A query is received 300. For example, a query is generated by the client module 122 of the client machine 102 and is received by the master query processor 142. Query segments are then distributed to worker nodes with columnar files 302, such as worker nodes 150_1 through 150_N, each with a columnar file 164.
- At each worker node it is determined whether a required term map is available 304. The client query processor 162 may implement this operation by accessing the term map cache 168. If a term map is available (304—Yes), processing proceeds to block 308. If a term map is not available (304—No), a term map is constructed and cached 306. The term map module 166 may be used to construct the term map and place it in the term map cache 168.
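The check-then-build flow of blocks 304–306 can be sketched as follows (illustrative Python; the actual module runs inside a Java/Lucene stack, and the class and method names here are hypothetical):

```python
class TermMapModule:
    """Sketch of check 304 / operation 306: return a cached term map or build one."""

    def __init__(self, columnar_file):
        self.columnar_file = columnar_file   # column name -> list of row values
        self.term_map_cache = {}             # in-memory cache (cf. term map cache 168)

    def get_term_map(self, column):
        if column in self.term_map_cache:    # 304 - Yes: reuse the cached map
            return self.term_map_cache[column]
        values = self.columnar_file[column]  # 306: build from the columnar file
        term_map = {}
        for row, value in enumerate(values):
            term_map.setdefault(str(value), [False] * len(values))[row] = True
        self.term_map_cache[column] = term_map
        return term_map

module = TermMapModule({"gender": ["F", "M", "F"]})
first = module.get_term_map("gender")    # built on first access
second = module.get_term_map("gender")   # served from the cache
```

Only the first query touching a column pays the construction cost; subsequent query segments reuse the cached map.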
- Values are then collected from in-memory term maps 308. Examples of this operation are provided below. Values are then aggregated from the worker nodes 310. That is, the master machine 104 collects values, via network 106, from the worker machines 150_1 through 150_N and aggregates those values. A query result is then returned 312. For example, the master query processor 142 may supply client machine 102 with the query result. The foregoing operations are more fully appreciated with reference to the following examples.
- The master query processor 142 and each client query processor 162 include distributed Lucene and Elasticsearch code. In addition, they include proprietary code to implement operations of the invention. This proprietary code is referred to herein as Parlene. Parlene implements the same API as a regular Lucene-based index; this API is the only communication path between Lucene and Elasticsearch. By responding to requests with the data that Elasticsearch expects, Parlene can effectively “mimic” a Lucene index using columnar data (e.g., a Parquet file).
- In one embodiment, Parlene implements four API methods: fields( ), getNumericDocValues( ), getBinaryDocValues( ), and document( ). The first method is used to access each column's term map, while the last three return row values.
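A toy sketch of what such an index “mimic” might expose (a Python stand-in for illustration only; Parlene itself implements Lucene's Java interfaces, and the simplified snake_case signatures below are assumptions):

```python
class ParleneSketch:
    """Toy stand-in for an index 'mimic' backed by columnar data."""

    def __init__(self, columns):
        self.columns = columns  # column name -> list of row values

    def fields(self):
        # Access path to each column's term map.
        return {name: self._term_map(name) for name in self.columns}

    def _term_map(self, name):
        values = self.columns[name]
        tm = {}
        for row, v in enumerate(values):
            tm.setdefault(str(v), [False] * len(values))[row] = True
        return tm

    def get_numeric_doc_values(self, name, row):
        return int(self.columns[name][row])       # original numeric row value

    def get_binary_doc_values(self, name, row):
        return str(self.columns[name][row]).encode()

    def document(self, row):
        # All stored column values for one row.
        return {name: vals[row] for name, vals in self.columns.items()}

p = ParleneSketch({"gender": ["F", "M"], "age": [17, 43]})
```

The point of the shape is that the caller never learns whether the answers come from a Lucene index or from columnar data.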
- Parlene creates the in-memory term map on demand, in response to an analytics request. Once created, term maps are cached in the term map cache 168 and re-used.
- In FIG. 4, the Gender column contains strings of varying lengths. In order to support a “starts with” query such as “all entries where the Gender column starts with ‘FE’”, the text must be split into pieces via tokenization. The keys in the resulting term map contain both a position and the character at that position, as shown in Gender term map 402. For example, the string “FEM” for row #4 is converted into three terms: “1F”, “2E”, and “3M”.
- A query such as “all rows where the Gender column starts with ‘FE’” becomes a test to find all rows with the value “Yes” in the Gender term map for both of the terms “1F” and “2E”. In this example, only rows 2 and 4 have “Yes” for both of those terms.
- Parlene supports fast sub-string queries with small in-memory term maps.
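The position-character tokenization and the resulting “starts with” test can be sketched as follows (illustrative Python; the Gender values are hypothetical, chosen so that rows 2 and 4 match as in the text):

```python
def tokenize_string(value):
    """Split a string into position+character terms: "FEM" -> ["1F", "2E", "3M"]."""
    return [f"{i + 1}{ch}" for i, ch in enumerate(value)]

def build_string_term_map(values):
    n = len(values)
    term_map = {}
    for row, value in enumerate(values):
        for term in tokenize_string(value):
            term_map.setdefault(term, [False] * n)[row] = True
    return term_map

def starts_with(term_map, prefix, n_rows):
    """A "starts with" query is an AND over the prefix's position terms."""
    prefix_terms = tokenize_string(prefix)
    absent = [False] * n_rows
    return [
        row + 1                                  # 1-based row IDs, as in the figures
        for row in range(n_rows)
        if all(term_map.get(t, absent)[row] for t in prefix_terms)
    ]

# Hypothetical Gender column of varying-length strings.
gender = ["M", "FE", "MALE", "FEM", "F"]
gmap = build_string_term_map(gender)
```

Because each term covers one character position, the map stays small even for long strings, yet any prefix test reduces to a handful of Boolean AND operations.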
- FIG. 5 illustrates columnar data 500 in which the Age column contains integers. In order to support a range query such as “all entries where the Age column is greater than 37”, the integer values must be split into components via tokenization.
- The keys in the resulting term map 502 encode both a position and the value at that position. For example, the numeric value 38 for row #5 is converted into two terms: 8 and 30. This represents a value of 8 in the “ones position” and a value of 30 in the “tens position”.
- A query such as “all rows where the Age column is greater than 37” becomes a test to find all rows with the value “Yes” in the Age term map for the terms 100 OR 90 OR 80 OR 70 OR 60 OR 50 OR 40 OR (30 AND (8 OR 9)). Any row with a 1 in the hundreds position (key 100) is necessarily greater than 37. Likewise, the row is a match if it has 90, 80, and so on down to 40. If the row has the term 30, then one must ensure that the value in the ones position is greater than 7, hence the check for 8 or 9. In this example, only row 4 has “Yes” for the 100 term, and only row 5 has “Yes” for both the terms 30 and 8.
- Accordingly, Parlene supports fast numeric range queries with small in-memory term maps. If a request is made to get the row value for a numeric column, and that column has a term map, then Parlene can reconstruct the original value: given the target row number, each term that has “Yes” for that row can be combined to recreate the original value. For example, if a request is made for the Age column value of row #2, the Age term map contains the terms 000, 10, and 7 with “Yes” for row #2; combining these gives the original value of 17. Using this approach, Parlene supports fast retrieval of numeric values from the in-memory term map, without requiring a disk access.
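The numeric tokenization, the range test, and the value reconstruction can be sketched together (illustrative Python; the Age values below are hypothetical apart from row #2's 17, and zero digits are folded into a single 0 term rather than the text's 000):

```python
def build_numeric_term_map(values):
    """One term per decimal position (digit x place): 38 -> terms 8 and 30."""
    n = len(values)
    term_map = {}
    for row, v in enumerate(values):
        place = 1
        while v:
            term_map.setdefault((v % 10) * place, [False] * n)[row] = True
            v //= 10
            place *= 10
    return term_map

def greater_than_37(term_map, n_rows):
    """Evaluate 100 OR 90 OR ... OR 40 OR (30 AND (8 OR 9)) against the flags."""
    absent = [False] * n_rows
    def has(term, row):
        return term_map.get(term, absent)[row]
    return [
        row + 1                                  # 1-based row IDs, as in the figures
        for row in range(n_rows)
        if any(has(t, row) for t in (100, 90, 80, 70, 60, 50, 40))
        or (has(30, row) and (has(8, row) or has(9, row)))
    ]

def reconstruct(term_map, row):
    """Recreate a row's original value by summing the terms flagged "Yes" for it."""
    return sum(term for term, flags in term_map.items() if flags[row])

# Hypothetical Age column; row #2's value of 17 follows the example in the text.
ages = [30, 17, 9, 104, 38]
age_map = build_numeric_term_map(ages)
```

Reconstruction works because the per-position terms of a value are disjoint decimal components, so summing the flagged terms recovers the original integer without touching disk.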
- A request may originate from the client module 122, for example via an analytics query API supported by the master query processor 142. The analytics query API converts the request into an Elasticsearch request so that Elasticsearch can compute the results. Elasticsearch is designed for processing such queries in real time by using indexing technology. The aggregation request is sent to an Elasticsearch node as JSON.
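The JSON body itself is not reproduced in this text. For the running example below (payment values grouped by state, restricted to gender “F”), a request of roughly this shape could be used (an illustrative Elasticsearch query-DSL sketch; the field names and the choice of a sum aggregation are assumptions):

```json
{
  "size": 0,
  "query": { "term": { "gender": "F" } },
  "aggs": {
    "by_state": {
      "terms": { "field": "state" },
      "aggs": { "total_payment": { "sum": { "field": "payment" } } }
    }
  }
}
```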
- Elasticsearch distributes this request to all nodes in the Elasticsearch cluster that contain one or more pieces (shards) of the index. In this case, the master query processor 142 includes an Elasticsearch code segment that communicates with Elasticsearch code segments of each client query processor 162. The Elasticsearch query parser communicates with each shard of the index that contains one or more columnar files 164 (e.g., Parquet files). Note here that the Elasticsearch query processor relies upon columnar file locations instead of indexes for the columnar files.
- Each worker node then makes a request to Parlene (via a ParleneLeafReader) for the “gender” term map.
- Parlene first checks to see if this term map already exists in the in-memory cache. If so, it is returned immediately; this corresponds to the check 304 of FIG. 3. Otherwise, Parlene must first build a term map from the “gender” column in the Parquet file; this corresponds to operation 306 of FIG. 3.
- The ParleneLeafReader makes a request to a ParquetBridge to build the term map. The ParquetBridge is code configured in accordance with an embodiment of the invention that uses Parquet files instead of Lucene indexes as the source for both term maps and row values. It produces optimized in-memory data structures for term maps. Elasticsearch can thereby provide fast analytics directly from Parquet files without having to first create Lucene indexes.
- The ParquetBridge makes a request to Parquet to iterate over every record in the “gender” column. Here, the proprietary code of the ParquetBridge leverages a native capability of the open source Parquet code that is designed for quick reading from individual columns. For each record, Parquet materializes (creates in memory) the value of that record's “gender” column. The ParquetBridge creates the term map during this iteration by first generating terms from the raw data read from the file (via tokenization), and then setting the appropriate flag for each row containing the corresponding term.
- The resulting term map is returned by the ParquetBridge to the ParleneLeafReader, which adds it to the in-memory cache and returns it to Elasticsearch.
- The exemplary query is only looking at females, so Elasticsearch only considers the results from the term map entry for gender “F”: row IDs 1, 2 and 4 in the current example. Elasticsearch iterates over every row for gender “F” by the ID found in the term map. This is done by a Lucene “HitCollector”. For each such row, a request is made to Parlene to return the “state” and “payment” column values from the row defined by that ID.
- The ParleneLeafReader calls the ParquetBridge to extract these column values for the row. The ParquetBridge has a choice of how to extract the values: if the corresponding column has a term map in the cache, then the value can be reconstructed using the term map, as described previously; otherwise, a request can be made to Parquet to materialize the requested column value for the requested row (by its ID).
- The Elasticsearch HitCollector adds each “payment” value to the appropriate group; there is one group for each unique value in the “state” column. Each Elasticsearch node returns the HitCollector results for each shard to the node that originated the query, in this case the master node 104.
- The originating Elasticsearch node combines the per-shard results to create a final result, which is then returned to the requester, in this example client machine 102.
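The per-shard collection and the final combination step can be sketched as follows (illustrative Python; the shard contents, states and payment amounts are hypothetical, with the gender flags matching the rows 1, 2 and 4 of the example):

```python
def collect_shard(rows, female_flags):
    """Per-shard HitCollector sketch: group "payment" by "state" for gender-"F" rows."""
    groups = {}
    for row_id, (state, payment) in rows.items():
        if female_flags[row_id - 1]:             # row IDs come from the "F" flag array
            groups.setdefault(state, []).append(payment)
    return groups

def combine_shards(shard_results):
    """Originating-node sketch: merge per-shard groups into one final grouping."""
    final = {}
    for groups in shard_results:
        for state, payments in groups.items():
            final.setdefault(state, []).extend(payments)
    return final

# Hypothetical shard data: row ID -> (state, payment).
shard = {1: ("CA", 10.0), 2: ("NY", 5.0), 3: ("CA", 99.0),
         4: ("CA", 7.5), 5: ("NY", 1.0)}
flags = [True, True, False, True, False]         # rows 1, 2 and 4 contain "F"
groups = collect_shard(shard, flags)
```

The same merge runs regardless of how many shards respond, so the originating node needs no knowledge of how the data was partitioned.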
- FIG. 6 illustrates interoperable executable code components utilized in accordance with an embodiment of the invention.
- Elasticsearch 600 includes a search manager 602. Each Java® Virtual Machine (JVM) has one search manager 602. This class is responsible for instantiating the correct directory reader, and each directory reader organizes one shard. The Parlene directory reader 604 manages the Parlene-specific caches (e.g., term map cache 168). Each instance provides one Parlene leaf reader.
- Parlene 606 includes a leaf reader 608, which corresponds to Lucene's leaf reader. It is mainly responsible for providing row values and terms for each requested column. Parlene also includes a Parquet bridge 610, which provides methods for reading relevant data from a columnar file, in this case a Parquet file. The bridge 610 has two main tasks: to read and analyze a complete column when needed, and to read concrete rows requested by Lucene's collectors.
- Parquet includes a Parquet handle 614, which provides methods to access the Parquet file. It is responsible for providing a row reader 616 and a column reader 618. The row reader 616 reads and materializes only the requested row and the requested columns, which causes many random accesses. The column reader 618 materializes an entire column and therefore reduces the number of random accesses per row. The row reader 616 and column reader 618 operate on a Parquet file 620.
- An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations.
- the media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts.
- Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices.
- Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter.
- For example, an embodiment of the invention may be implemented using JAVA®, C++, or another object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
Abstract
Description
- This invention relates generally to distributed query processing in a computer network. More particularly, this invention relates to techniques for distributed query processing utilizing dynamically generated in-memory term maps.
- Apache Parquet™ (“Parquet”) is an open source column-oriented file format that is designed for fast scanning and efficient compression of data. Parquet stores data where the values in each column are physically stored in contiguous memory locations. Due to the columnar storage, Parquet can use per-column compression techniques that are optimized for each data type, and queries that use only a sub-set of all columns do not need to reach the entire data for each row.
- Apache Lucene™ (“Lucene”) is an open source information retrieval library that supports indexing and search as well as text analysis, numeric aggregations, spellchecking, and many other features.
- Elasticsearch™ is an open source distributed search engine built on top of Lucene that provides (near) real time query and advanced analytics functionality for big data. Elasticsearch can quickly generate analytical results even on large datasets containing billions of records.
- Many analytics tasks require calculations on only a sub-set of all rows in the dataset. Lucene supports calculations on sub-sets of the data by efficiently filtering the entire dataset down to just the target rows, by using the concept of an “inverted index” or a term map.
- A general map is a dictionary that contains a set of unique “keys”. Each key has an associated value. A term map is a special kind of map, where the keys are “terms” (e.g., words) and the values are arrays of Boolean (true/false) flags. Each column in the dataset has its own term map. There is one entry in a column's term map for each unique term (e.g., word) found in that column, for any row in the dataset. The Boolean flags indicate whether the corresponding term occurs in a specific row. As an example, consider
columnar file 10 ofFIG. 1 . The data set has five rows, numbered 1 to 5. Each row has a single column (“gender”) containing either the string “F” or “M”. - Table 12 shows how the original dataset of five rows is converted into a term map with two entries, where the value for each entry is an array of five Boolean flags (one for each row). The flag's value is “Yes” if the corresponding row contains that term, otherwise the value is “No”.
- One can easily determine the set of rows containing “M” for the Gender; only
rows - A typical analytics query requires calculating an aggregation of some column value, where these aggregations are performed separately for each unique value in a different column. For example, the aggregation could be the average of values in an Age column, grouped by values in a Gender column. The results would typically be two values, one being the average Age for all rows where the Gender column contains “M”, and the other being the average Age for all rows where the Gender column contains “F”.
- When using a Lucene index for aggregations, the original (pre-tokenized) values for a column are required; for example, when summing a numeric column the original numeric value is required. Typically these “row values” (which are called “document values” by Lucene) are read from the Lucene index files persistently stored on the disk. This approach is significantly slower than any of the filtering operations that use in-memory term map data.
- A Lucene index consisting of term maps, row values and other data structures must be created before Elasticsearch can query the data. Creating a Lucene index is very time-consuming, and it typically requires 2-3 times more storage than the original data set. This means significant resources are required to leverage Elasticsearch for fast analytics on big data, and there is a significant delay before any new data can be queried.
- Therefore, it would be desirable if one could avoid paying the cost of creating and maintaining an index for columnar data (e.g., a Lucene index).
- A system has a master node with instructions executed by a master node processor to receive a query over a network from a client machine and distribute query segments over the network. Worker nodes receive the query segments. Each worker node includes instructions executed by a worker node processor to construct from a columnar file (stored locally or in a distribute file system) a term map characterizing a term from the columnar file, row identifications from the columnar file and a Boolean indicator for each row identification that characterizes whether the term is present in the row specified by the row identification. The term map is cached in dynamic memory. Values responsive to the query segment are collected from the term map. The values are sent to the master node. The master node aggregates values from the worker nodes to form a result that is returned to the client machine over the network.
- The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
-
FIG. 1 illustrates conversion of columnar data to a term map. -
FIG. 2 illustrates a system configured in accordance with an embodiment of the invention. -
FIG. 3 illustrates distributed query processing operations performed in accordance with an embodiment of the invention. -
FIG. 4 illustrates conversion of columnar data to a tokenized gender term map in accordance with an embodiment of the invention. -
FIG. 5 illustrates conversion of columnar data to a tokenized age term map in accordance with an embodiment of the invention. -
FIG. 6 illustrates interoperable executable code components utilized in accordance with an embodiment of the invention. - Like reference numerals refer to corresponding parts throughout the several views of the drawings.
-
FIG. 2 illustrates asystem 100 configured in accordance with an embodiment of the invention. Thesystem 100 is configured for distributed query processing. Thesystem 100 includes aclient computer 102 in communication with amaster node 104 via abus 106, which may be any combination of wired and wireless networks. - The
client computer 102 includes standard components, such as acentral processing unit 110 connected to input/output devices 112 via abus 114. The input/output devices 112 may include a keyboard, mouse, touch display and the like. Anetwork interface circuit 116 is also connected to thebus 114. Thenetwork interface circuit 116 provides connectivity tonetwork 106. Amemory 120 is also connected to the bus. Thememory 120 stores aclient module 122, which includes instructions executed by thecentral processing unit 110. Theclient module 122 may be a browser or application that provides connectivity toserver 104 for the purpose of submitting queries. Theclient computer 102 may be a desktop computer, mobile computer, tablet, mobile device and the like. - The
server 104 also includes standard components, such as acentral processing unit 130, input/output devices 132, abus 134 and anetwork interface circuit 136. Amemory 140 is connected tobus 134. The memory stores amaster query processor 142 with instructions executed by thecentral processing unit 130. Themaster query processor 142 develops a plan for a received query, distributes query segments to worker nodes 150_1 through 150_N, aggregates values received from worker nodes 150_1 through 150_N and supplies a final result, as demonstrated below. -
FIG. 2 also illustrates worker nodes 150_1 through 150 13 N. Each worker nodes includes standard components, such as acentral processing unit 151, input/output devices 152,bus 154 and anetwork interface circuit 156. Amemory 160 is connected to thebus 154. Thememory 160 includes instructions executed by thecentral processing unit 151 to implement distributed query processing in accordance with embodiments of the invention. In one embodiment, thememory 160 stores aclient query processor 162. Theclient query processor 162 coordinates the processing of a query segment assigned to it by themaster query processor 142. Thememory 160 also stores acolumnar file 164. Unlike the prior art, which would have an associated index for thecolumnar file 164, no such index exists in this case. Instead, aterm map module 166 is used to dynamically generate term maps in response to a query. Theterm map module 166 stores generated term maps in aterm map cache 168. Theterm map cache 168 is an in-memory resource (i.e., in non-persistent memory, such as RAM) and therefore provides results faster than a persistently stored index. -
FIG. 3 illustrates processing operations associated with an embodiment of the invention. A query is received 300. For example, a query is generated by the client module 122 of the client machine 102 and is received by the master query processor 142. Query segments are then distributed to worker nodes with columnar files 302, such as worker nodes 150_1 through 150_N, each with a columnar file 164. - At each worker node it is determined whether a required term map is available 304. The
client query processor 162 may implement this operation by accessing term map cache 168. If a term map is available (304—Yes), processing proceeds to block 308. If a term map is not available (304—No), a term map is constructed and cached 306. The term map module 166 may be used to construct the term map and place it in term map cache 168. - Values are then collected from in-memory term maps 308. Examples of this operation are provided below. Values are then aggregated from the
worker nodes 310. That is, the master machine 104 collects values, via network 106, from the worker machines 150_1 through 150_N and aggregates such values. A query result is then returned 312. For example, the master query processor 142 may supply client machine 102 with the query result. The foregoing operations are more fully appreciated with reference to the following examples. - The
master query processor 142 and each client query processor 162 include distributed Lucene and Elasticsearch code. In addition, the master query processor 142 and each client query processor 162 include proprietary code to implement operations of the invention. This proprietary code is referred to herein as Parlene. Parlene implements the same API as a regular Lucene-based index. This API is the only communication path between Lucene and Elasticsearch. By responding to requests with the data that Elasticsearch expects, Parlene can effectively “mimic” a Lucene index using columnar data (e.g., a Parquet file). - In one embodiment, Parlene implements four API methods: fields(), getNumericDocValues(), getBinaryDocValues(), and document(). The first method is used to access each column's term map, while the last three methods return row values.
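The “mimic” pattern, answering a caller through the interface it already expects while sourcing the data from columnar storage, can be illustrated with a toy sketch. The class and method names below merely echo the four methods listed above; they are not the actual Lucene or Parlene signatures, and the dict-of-lists table is a stand-in for a columnar file.

```python
# Toy illustration of the "mimic" pattern: expose the interface a caller
# expects (names echo fields()/document()), but back it with columnar data.
# This is NOT the actual Lucene API; all signatures are simplified stand-ins.

class ColumnarBackedReader:
    def __init__(self, columns):
        self._columns = columns            # column name -> list of values
        self._term_maps = {}               # built on demand, then cached

    def fields(self, column):
        """Return the (lazily built) term map for a column."""
        if column not in self._term_maps:
            tm = {}
            for row_id, value in enumerate(self._columns[column], start=1):
                tm.setdefault(value, set()).add(row_id)
            self._term_maps[column] = tm
        return self._term_maps[column]

    def document(self, row_id):
        """Return the stored values for one row, as the caller expects."""
        return {name: col[row_id - 1] for name, col in self._columns.items()}

reader = ColumnarBackedReader({"gender": ["M", "F", "F"], "age": [25, 17, 42]})
print(sorted(reader.fields("gender")["F"]))  # [2, 3]
print(reader.document(2))                    # {'gender': 'F', 'age': 17}
```

Because the caller only ever sees these methods, it cannot tell whether the answers came from an inverted index or from a columnar scan.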
- When Elasticsearch calls the fields() method to request the term map for a column, Parlene creates the in-memory term map on demand, in response to an analytics request. Once created, term maps are cached and re-used in each
term map cache 168. - Consider the
columnar file 400 of FIG. 4. The Gender column contains strings of varying lengths. In order to support a “starts with” query such as “all entries where the Gender column starts with ‘FE’”, the text must be split into pieces via tokenization. - The keys in the resulting term map contain both a position and the character at that position, as shown in
Gender term map 402. For example, the string “FEM” for row #4 is converted into three terms: “1F”, “2E”, and “3M”. - A query such as “all rows where the Gender column starts with ‘FE’” becomes a test to find all rows with the value “Yes” in the Gender term map for the terms “1F” and “2E”. In this example, only the rows with “Yes” for both terms match. -
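The positional tokenization and “starts with” test just described can be sketched as follows; the sample column is illustrative, not the data of FIG. 4, and the helper names are hypothetical.

```python
# Sketch of the position+character tokenization from the Gender example:
# "FEM" -> "1F", "2E", "3M". The sample column below is illustrative only.

def tokenize_string(value):
    return [f"{i}{ch}" for i, ch in enumerate(value, start=1)]

def build_string_term_map(column):
    term_map = {}
    for row_id, value in enumerate(column, start=1):
        for term in tokenize_string(value):
            term_map.setdefault(term, set()).add(row_id)
    return term_map

def starts_with(term_map, prefix):
    """'Starts with' = rows flagged for every positional term of the prefix."""
    row_sets = [term_map.get(f"{i}{ch}", set())
                for i, ch in enumerate(prefix, start=1)]
    return set.intersection(*row_sets)

column = ["M", "F", "M", "FEM", "FEMALE"]
tm = build_string_term_map(column)
print(tokenize_string("FEM"))         # ['1F', '2E', '3M']
print(sorted(starts_with(tm, "FE")))  # [4, 5]
```

A single-character prefix degenerates to an ordinary term lookup on the first position, so exact-prefix and starts-with queries share one mechanism.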
FIG. 5 illustrates columnar data 500. The Age column contains integers. In order to support a range query such as “all entries where the Age column is greater than 37”, the integer values must be split into components via tokenization. - The keys in the resulting
term map 502 encode both a position and the value at that position. For example, the numeric value 38 for row #5 is converted into two terms: 8 and 30. This represents a value of 8 at the “ones position”, and a value of 30 at the “tens position”. - A query such as “all rows where the Age column is greater than 37” becomes a test to find all rows with the value “Yes” in the Age term map for the
terms 100 OR 90 OR 80 OR 70 OR 60 OR 50 OR 40 OR (30 AND (8 OR 9)). Any row with a 1 in the hundreds position (key is 100) is going to be greater than 37. Likewise, it is known that the row is a match if it has 90, 80, and so on down to 40. If the row has a term of 30, then one must ensure that the value for that row in the ones position is greater than 7, thus the check for 8 or 9. In this example, only row 4 has “Yes” for the 100 term, and only row 5 has “Yes” for the terms 30 and 8. - Using this approach, Parlene supports fast numeric range queries with small in-memory term maps. If a request is made to get the row value for a numeric column, and that column has a term map, then Parlene can reconstruct the original value. Given the target row number, each term that has “Yes” for that row can be combined to create the original value. For example, if a request is made for the Age column value of
row #2, then the Age term map contains the terms 10 and 7 with “Yes” for row #2. Combining these gives us the original value of 17. Using this approach, Parlene supports fast retrieval of numeric values from the in-memory term map, without requiring a disk access. - What follows is a step-by-step description of how a query is processed using Parlene. A request may originate from the
client module 122. For example, the request may be via an analytics query API supported by the master query processor 142. Consider a request for the total payments made by women for each state. The SQL equivalent would be: SELECT state, SUM(payment) FROM index WHERE gender = 'F' GROUP BY state; - The analytics query API converts this into an Elasticsearch request so that Elasticsearch can compute the results. Elasticsearch is designed for processing such queries in real time by using indexing technology. The aggregation request, which is sent to an Elasticsearch node as JSON, looks like:
-
{
  "query": {
    "filter": [
      { "term": { "gender": "F" } }
    ]
  },
  "aggs": {
    "bucket_by_state": {
      "terms": { "field": "state" },
      "aggs": {
        "sum(payment)": {
          "sum": { "field": "payment" }
        }
      }
    }
  }
}
- Elasticsearch distributes this request to all nodes in the Elasticsearch cluster that contain one or more pieces (shards) of the index. As previously indicated, the
master query processor 142 includes an Elasticsearch code segment that communicates with Elasticsearch code segments of each client query processor 162. In particular, the Elasticsearch query parser communicates with each shard of the index that contains one or more columnar files 164 (e.g., Parquet files). Note here that the Elasticsearch query processor relies upon columnar file locations instead of indexes for the columnar files. - Each worker node then makes a request to Parlene (via a ParleneLeafReader) for the “gender” term map. Parlene first checks to see if this term map already exists in the in-memory cache. If so, it is returned immediately. This corresponds to the
check 304 of FIG. 3. Otherwise, Parlene must first build a term map from the “gender” column in the Parquet file. This corresponds to operation 306 of FIG. 3. - The ParleneLeafReader makes a request to a ParquetBridge to build the term map. The ParquetBridge is code configured in accordance with an embodiment of the invention that uses Parquet files instead of Lucene indexes as the source for both term maps and row values. It produces optimized in-memory data structures for term maps. Elasticsearch can thus provide fast analytics directly from Parquet files without having to first create Lucene indexes.
- More particularly, the ParquetBridge makes a request to Parquet to iterate over every record in the “gender” column. Thus, the proprietary code of the ParquetBridge leverages a native capability of the open source Parquet code that is designed for quick reading from individual columns. For each record, Parquet materializes (creates in memory) the value for that record's “gender” column. The ParquetBridge creates the term map during this iteration, by first generating terms from the raw data read from the file (via tokenization), and then setting the appropriate flag for each row containing the corresponding term.
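The iteration just described (materialize each record's value, generate terms via tokenization, then set a flag for each matching row) can be sketched with the place-value scheme of FIG. 5, where 38 becomes the terms 8 and 30. The Python list stands in for a materialized Parquet column; all names are illustrative, not the actual ParquetBridge code.

```python
# Sketch of building a term map by iterating a column, using the place-value
# tokenization of FIG. 5 (38 -> terms 8 and 30). The Python list stands in
# for a materialized Parquet column; names are illustrative.

def tokenize_number(value):
    """Split an integer into non-zero place-value terms: 38 -> [8, 30]."""
    terms, place = [], 1
    while value > 0:
        digit = value % 10
        if digit:
            terms.append(digit * place)
        value //= 10
        place *= 10
    return terms

def build_numeric_term_map(column):
    term_map = {}
    for row_id, value in enumerate(column, start=1):      # iterate every record
        for term in tokenize_number(value):               # generate terms
            term_map.setdefault(term, set()).add(row_id)  # set the row flag
    return term_map

ages = [25, 17, 42, 103, 38]
tm = build_numeric_term_map(ages)
print(tokenize_number(38))   # [8, 30]
print(sorted(tm[30]))        # [5] -- rows with a 3 in the tens position
```

Note that zero digits produce no term, so the map stays small even for wide value ranges.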
- Once all records have been processed, the resulting term map is returned by the ParquetBridge to the ParleneLeafReader. The ParleneLeafReader adds the term map to the in-memory cache, and returns it to Elasticsearch.
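Once such a term map is cached, a range filter like the “Age greater than 37” example from FIG. 5 reduces to set unions and intersections over it. The term map below hand-encodes illustrative ages (25, 17, 42, 103, 38 for rows 1 through 5); the data and helper name are not from the patent.

```python
# Evaluating the "Age > 37" decomposition from FIG. 5 as set operations over
# a cached term map. The map below encodes ages [25, 17, 42, 103, 38] by row;
# the data and helper names are illustrative.

term_map = {
    20: {1}, 5: {1},     # row 1: 25
    10: {2}, 7: {2},     # row 2: 17
    40: {3}, 2: {3},     # row 3: 42
    100: {4}, 3: {4},    # row 4: 103
    30: {5}, 8: {5},     # row 5: 38
}

def rows(term):
    return term_map.get(term, set())

# 100 OR 90 OR 80 OR 70 OR 60 OR 50 OR 40 OR (30 AND (8 OR 9))
matches = (rows(100) | rows(90) | rows(80) | rows(70) | rows(60)
           | rows(50) | rows(40) | (rows(30) & (rows(8) | rows(9))))
print(sorted(matches))  # [3, 4, 5] -- the rows with ages 42, 103 and 38
```

Missing terms simply contribute the empty set, so the union-intersection expression needs no special cases.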
- The exemplary query is only looking at females and therefore Elasticsearch only looks at the results from the term map for gender “F”. These are the row IDs of the matching rows. - The ParleneLeafReader calls the ParquetBridge to extract the needed column values for each matching row. The ParquetBridge has a choice of how to extract these values. If the corresponding column has a term map in the cache, then the value can be reconstructed using the term map, as described previously. Otherwise, a request to Parquet can be made to materialize the requested column for the requested row (by its id).
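The extraction choice just described can be sketched as follows: rebuild a numeric value from cached place-value terms when a term map exists, otherwise fall back to reading the file. Here materialize_from_file stands in for a Parquet read, and all names are hypothetical.

```python
# Sketch of the value-extraction choice: reconstruct from a cached term map
# when possible, else materialize from the columnar file.
# materialize_from_file stands in for a Parquet read; names are hypothetical.

def get_row_value(cached_maps, column, row_id, materialize_from_file):
    term_map = cached_maps.get(column)
    if term_map is not None:
        # Fast path: sum every term flagged "Yes" for this row (e.g. 10 + 7 = 17).
        return sum(term for term, rows in term_map.items() if row_id in rows)
    # Slow path: random access into the columnar file by row id.
    return materialize_from_file(column, row_id)

cached_maps = {"age": {10: {2}, 7: {2}, 20: {1}, 5: {1}}}  # ages: row 1 = 25, row 2 = 17
from_file = lambda col, row: {"state": {1: "CA", 2: "NY"}}[col][row]

print(get_row_value(cached_maps, "age", 2, from_file))    # 17, reconstructed in memory
print(get_row_value(cached_maps, "state", 1, from_file))  # CA, materialized from "file"
```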
- The Elasticsearch HitCollector adds each “payment” value to the appropriate group. There is one group for each unique value in the “state” column. Each Elasticsearch node returns the HitCollector results for each shard to the node that originated the query, the
master node 104, in this case. - The originating Elasticsearch node combines the per-shard results to create a final result, which is then returned to the requester,
client machine 102, in this example. -
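End to end, the walkthrough above computes the same answer as a filtered group-by sum, shown here on illustrative rows (the sample data is not from the patent):

```python
# The walkthrough above is equivalent to this filtered group-by sum
# (WHERE gender = 'F' ... SUM(payment) ... GROUP BY state), on sample data.

rows = [
    {"gender": "F", "state": "CA", "payment": 100},
    {"gender": "M", "state": "CA", "payment": 50},
    {"gender": "F", "state": "NY", "payment": 200},
    {"gender": "F", "state": "CA", "payment": 25},
]

totals = {}
for row in rows:
    if row["gender"] == "F":                   # term filter on gender
        state = row["state"]                   # bucket_by_state
        totals[state] = totals.get(state, 0) + row["payment"]  # sum(payment)

print(totals)  # {'CA': 125, 'NY': 200}
```

In the distributed setting, each shard produces partial totals like these and the originating node merges them.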
FIG. 6 illustrates interoperable executable code components utilized in accordance with an embodiment of the invention. Elasticsearch 600 includes a search manager 602. In one embodiment, each Java® Virtual Machine (JVM) has one search manager 602. This class is responsible for instantiating the correct directory reader. Each directory reader organizes one shard. The Parlene directory reader 604 manages the Parlene-specific caches (e.g., term map cache 168). Each instance provides one Parlene leaf reader. -
Parlene 606 includes a leaf reader 608, which corresponds to Lucene's leaf reader. It is mainly responsible for providing row values and terms for each requested column. Parlene also includes a Parquet Bridge 610. The Parquet Bridge 610 provides methods for reading relevant data from a columnar file, in this case a Parquet file. The bridge 610 has two main tasks: the first is to read and analyze a complete column when needed; the second is to read concrete rows requested by Lucene's collectors. -
Parquet 616 includes a Parquet handle 614, which provides methods to access the Parquet file. It is responsible for providing a row reader 616 and a column reader 618. The row reader 616 reads and materializes only the requested row and the requested columns. This causes many random accesses. The column reader 618 materializes the entire column and therefore reduces the number of random accesses per row. The row reader 616 and column reader 618 operate on a Parquet file 620. - An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using JAVA®, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
- The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
Claims (6)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/464,232 US10585913B2 (en) | 2017-03-20 | 2017-03-20 | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps |
EP18772109.7A EP3602351B1 (en) | 2017-03-20 | 2018-03-19 | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps |
PCT/US2018/023169 WO2018175336A1 (en) | 2017-03-20 | 2018-03-19 | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/464,232 US10585913B2 (en) | 2017-03-20 | 2017-03-20 | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps |
Publications (2)
Publication Number | Publication Date |
---|---|
US20180268000A1 true US20180268000A1 (en) | 2018-09-20 |
US10585913B2 US10585913B2 (en) | 2020-03-10 |
Family
ID=63519391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/464,232 Expired - Fee Related US10585913B2 (en) | 2017-03-20 | 2017-03-20 | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps |
Country Status (3)
Country | Link |
---|---|
US (1) | US10585913B2 (en) |
EP (1) | EP3602351B1 (en) |
WO (1) | WO2018175336A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100257198A1 (en) * | 2009-04-02 | 2010-10-07 | Greeenplum, Inc. | Apparatus and method for integrating map-reduce into a distributed relational database |
US20120310916A1 (en) * | 2010-06-04 | 2012-12-06 | Yale University | Query Execution Systems and Methods |
US20150363167A1 (en) * | 2014-06-16 | 2015-12-17 | International Business Machines Corporation | Flash optimized columnar data layout and data access algorithms for big data query engines |
US20160350367A1 (en) * | 2015-05-27 | 2016-12-01 | Mark Fischer | Mechanisms For Querying Disparate Data Storage Systems |
US9952894B1 (en) * | 2014-01-27 | 2018-04-24 | Microstrategy Incorporated | Parallel query processing |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2290562A1 (en) * | 2009-08-24 | 2011-03-02 | Amadeus S.A.S. | Segmented main-memory stored relational database table system with improved collaborative scan algorithm |
US20120016901A1 (en) | 2010-05-18 | 2012-01-19 | Google Inc. | Data Storage and Processing Service |
US8972337B1 (en) * | 2013-02-21 | 2015-03-03 | Amazon Technologies, Inc. | Efficient query processing in columnar databases using bloom filters |
US9342557B2 (en) * | 2013-03-13 | 2016-05-17 | Cloudera, Inc. | Low latency query engine for Apache Hadoop |
US9477731B2 (en) * | 2013-10-01 | 2016-10-25 | Cloudera, Inc. | Background format optimization for enhanced SQL-like queries in Hadoop |
US9892150B2 (en) * | 2015-08-03 | 2018-02-13 | Sap Se | Unified data management for database systems |
US11709833B2 (en) * | 2016-06-24 | 2023-07-25 | Dremio Corporation | Self-service data platform |
-
2017
- 2017-03-20 US US15/464,232 patent/US10585913B2/en not_active Expired - Fee Related
-
2018
- 2018-03-19 EP EP18772109.7A patent/EP3602351B1/en active Active
- 2018-03-19 WO PCT/US2018/023169 patent/WO2018175336A1/en unknown
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11762830B2 (en) | 2017-07-06 | 2023-09-19 | Palantir Technologies Inc. | Selecting backing stores based on data request |
US11132347B2 (en) * | 2017-07-06 | 2021-09-28 | Palantir Technologies Inc. | Selecting backing stores based on data request |
US11188531B2 (en) | 2018-02-27 | 2021-11-30 | Elasticsearch B.V. | Systems and methods for converting and resolving structured queries as search queries |
US11914592B2 (en) | 2018-02-27 | 2024-02-27 | Elasticsearch B.V. | Systems and methods for processing structured queries over clusters |
US11226955B2 (en) | 2018-06-28 | 2022-01-18 | Oracle International Corporation | Techniques for enabling and integrating in-memory semi-structured data and text document searches with in-memory columnar query processing |
US11461270B2 (en) | 2018-10-31 | 2022-10-04 | Elasticsearch B.V. | Shard splitting |
CN111221851A (en) * | 2018-11-27 | 2020-06-02 | 北京京东尚科信息技术有限公司 | Lucene-based mass data query and storage method and device |
US10997204B2 (en) * | 2018-12-21 | 2021-05-04 | Elasticsearch B.V. | Cross cluster replication |
US11580133B2 (en) | 2018-12-21 | 2023-02-14 | Elasticsearch B.V. | Cross cluster replication |
CN109840266A (en) * | 2019-01-25 | 2019-06-04 | 网联清算有限公司 | Storage system building method and device |
US11943295B2 (en) | 2019-04-09 | 2024-03-26 | Elasticsearch B.V. | Single bi-directional point of policy control, administration, interactive queries, and security protections |
US11431558B2 (en) | 2019-04-09 | 2022-08-30 | Elasticsearch B.V. | Data shipper agent management and configuration systems and methods |
US20210124620A1 (en) * | 2019-04-12 | 2021-04-29 | Elasticsearch B.V. | Frozen Indices |
US10891165B2 (en) * | 2019-04-12 | 2021-01-12 | Elasticsearch B.V. | Frozen indices |
US11556388B2 (en) * | 2019-04-12 | 2023-01-17 | Elasticsearch B.V. | Frozen indices |
US11182093B2 (en) | 2019-05-02 | 2021-11-23 | Elasticsearch B.V. | Index lifecycle management |
US11586374B2 (en) | 2019-05-02 | 2023-02-21 | Elasticsearch B.V. | Index lifecycle management |
US11238035B2 (en) | 2020-03-10 | 2022-02-01 | Oracle International Corporation | Personal information indexing for columnar data storage format |
US11514697B2 (en) | 2020-07-15 | 2022-11-29 | Oracle International Corporation | Probabilistic text index for semi-structured data in columnar analytics storage formats |
WO2022015392A1 (en) * | 2020-07-15 | 2022-01-20 | Oracle International Corporation | Probabilistic text index for semi-structured data in columnar analytics storage formats |
US11604674B2 (en) | 2020-09-04 | 2023-03-14 | Elasticsearch B.V. | Systems and methods for detecting and filtering function calls within processes for malware behavior |
US11586587B2 (en) | 2020-09-24 | 2023-02-21 | Speedata Ltd. | Hardware-implemented file reader |
WO2022064313A1 (en) * | 2020-09-24 | 2022-03-31 | Speedata Ltd. | Hardware-implemented file reader |
US11586624B2 (en) * | 2020-09-28 | 2023-02-21 | Databricks, Inc. | Integrated native vectorized engine for computation |
US11874832B2 (en) | 2020-09-28 | 2024-01-16 | Databricks, Inc. | Integrated native vectorized engine for computation |
US11734291B2 (en) * | 2020-10-21 | 2023-08-22 | Ebay Inc. | Parallel execution of API calls using local memory of distributed computing devices |
US20240037104A1 (en) * | 2021-08-09 | 2024-02-01 | Hefei Swaychip Information Technology Inc. | A system and method for hierarchical database operation accelerator |
CN116401259A (en) * | 2023-06-08 | 2023-07-07 | 北京江融信科技有限公司 | Automatic pre-creation index method and system for elastic search database |
Also Published As
Publication number | Publication date |
---|---|
EP3602351B1 (en) | 2022-11-09 |
US10585913B2 (en) | 2020-03-10 |
EP3602351A4 (en) | 2020-11-18 |
EP3602351A1 (en) | 2020-02-05 |
WO2018175336A1 (en) | 2018-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10585913B2 (en) | Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps | |
US11475034B2 (en) | Schemaless to relational representation conversion | |
JP7130600B2 (en) | Implementing semi-structured data as first-class database elements | |
US9367574B2 (en) | Efficient query processing in columnar databases using bloom filters | |
AU2016359060B2 (en) | Storing and retrieving data of a data cube | |
EP3028137B1 (en) | Generating a multi-column index for relational databases by interleaving data bits for selectivity | |
US9158843B1 (en) | Addressing mechanism for data at world wide scale | |
US10180992B2 (en) | Atomic updating of graph database index structures | |
US8862566B2 (en) | Systems and methods for intelligent parallel searching | |
US9965641B2 (en) | Policy-based data-centric access control in a sorted, distributed key-value data store | |
JP6434154B2 (en) | Identifying join relationships based on transaction access patterns | |
CN103620601A (en) | Joining tables in a mapreduce procedure | |
US10860562B1 (en) | Dynamic predicate indexing for data stores | |
US10157234B1 (en) | Systems and methods for transforming datasets | |
CN106708996A (en) | Method and system for full text search of relational database | |
CN113051268A (en) | Data query method, data query device, electronic equipment and storage medium | |
CN111506621A (en) | Data statistical method and device | |
US10248668B2 (en) | Mapping database structure to software | |
US20220019784A1 (en) | Probabilistic text index for semi-structured data in columnar analytics storage formats | |
Patel et al. | Online analytical processing for business intelligence in big data | |
CN113918605A (en) | Data query method, device, equipment and computer storage medium | |
US11520763B2 (en) | Automated optimization for in-memory data structures of column store databases | |
US20230153455A1 (en) | Query-based database redaction | |
US20130297573A1 (en) | Character Data Compression for Reducing Storage Requirements in a Database System | |
McClean et al. | A comparison of mapreduce and parallel database management systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: DATAMEER, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KRUGLER, KEN;MCMANUS, MATTHEW;VOSS, PETER;AND OTHERS;SIGNING DATES FROM 20170201 TO 20170310;REEL/FRAME:041649/0243 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
ZAAA | Notice of allowance and fees due |
Free format text: ORIGINAL CODE: NOA |
|
ZAAB | Notice of allowance mailed |
Free format text: ORIGINAL CODE: MN/=. |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |