US20220164345A1 - Managed query execution platform, and methods thereof - Google Patents
- Publication number
- US20220164345A1 (U.S. application Ser. No. 17/643,965)
- Authority
- US
- United States
- Prior art keywords
- data
- query execution
- query
- computing machines
- indexed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4004—Coupling between buses
- G06F13/4027—Coupling between buses using bus bridges
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2272—Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Definitions
- the present disclosure relates generally to the field of computer data storage.
- a database is an organized collection of digital data, stored and accessed electronically. Database designers typically organize the data to model aspects of reality in a way that enables the retrieval of the information through computer processes.
- a database management system (DBMS) is a computer software that interacts with end users, applications, and the database itself to capture and analyze data.
- DBMS allows the definition, creation, querying, updating, and administration of databases. Databases may be defined as row or column-based. In a row-based database, all of the information for a record in a table is stored together. This makes simple, record-centric activities (e.g., recording a transaction, viewing the details of an account, etc.) fast.
- a data lake is a storage repository that holds a large amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in the data lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
- Data lakes may reside in cloud environments where additional storage can be added on demand.
- data stored in the data lake is accessed by applications for deep learning, AI, analytics (e.g., business analytics), and the like to provide access to both historical and real-time data.
- Querying of the data lakes as required by such applications is typically performed by elastic query engines (e.g., Presto®, Apache Drill®, Spark®, Impala®, etc.)
- Data lakes provide means for decoupling data processing and data storage. That is, data is stored on a data lake and processed via a separate computing engine. Although data lakes provide flexibility in maintaining different types of data from different sources and are a cost-efficient storage solution, such a storage architecture cannot be efficiently scaled up. Specifically, utilization of elastic query engines requires cloud computing resources, such as CPU, memory, and allocation of virtual machines. The ever-increasing demands of applications would be constrained by the allocated computing resources. To improve the performance of executed applications, more computing resources can be allocated. However, this increases the overall cost of running such applications. Further, the scaling of computing resources would reduce efficiency, degrade performance, and increase maintenance burden.
- Another disadvantage of executing data applications over data lakes is the transformation of data from existing sources to a data lake. This typically requires engineering the data from data sources into new models in order to make the data accessible in the data lake, and is typically performed with an extract, transform, load (ETL) tool. As data sources' formats, contents, and volumes change dynamically, as do user and application demands, the transformation process can require months to complete.
- Certain embodiments disclosed herein include a query execution fabric that includes: a plurality of computing machines; and a plurality of storage devices communicatively connected to the plurality of computing machines using a storage communication protocol and configured to maintain a plurality of indexed views comprising index information and a plurality of data chunks, wherein the index information includes at least one local index per data chunk and at least one global index per each of the indexed views of the plurality of indexed views, and wherein each of the plurality of computing machines is configured to: verify a global index to locate segments of the plurality of indexed views; and verify a plurality of local indexes to locate the data chunks containing values related to a plurality of statements.
- FIG. 1 is a network diagram utilized to describe the various disclosed embodiments.
- FIG. 2 is a diagram illustrating the architecture of the managed query execution platform.
- FIG. 3 is a schematic diagram of a compute machine and a storage device utilized by the managed query execution platform according to an embodiment.
- FIG. 4A is an illustration of a common view.
- FIG. 4B is an illustration of an indexed view.
- FIG. 5 is a flowchart illustrating a process for executing a query on an indexed view according to an embodiment.
- FIG. 6 is a diagram illustrating the execution of an example query according to an embodiment.
- FIG. 7 is a schematic block diagram of a hardware layer implemented by each computing machine according to an embodiment.
- a managed query execution platform enables, among other tasks, execution of data analytics applications over data lakes.
- the platform further allows execution of such applications on the data stored in data lakes without the need to provide new models to transform the data in the data lakes, through acceleration of, for example, structured query language (SQL) queries and commands.
- the disclosed embodiments enable easy scaling of and support for more data applications (on a larger dataset) without allocating additional computing resources.
- FIG. 1 is an example network diagram 100 utilized to describe the various disclosed embodiments.
- a managed query execution platform 110 is deployed between a plurality of data applications 120 and at least one data lake 130 .
- the managed query execution platform 110 is configured to accelerate SQL commands and queries requested by the data applications 120 , thereby improving their overall performance.
- a data application 120 may include any data analytics (e.g., business analytics) application, deep-learning application, artificial intelligence (AI) application, and any data-driven application.
- the data lake 130 is a storage repository that holds a vast amount of raw data in its native format until it is needed.
- the data lake 130 is hosted in a cloud computing platform.
- Azure® Data Lake is a scalable data storage and analytics service.
- the service is hosted in the Azure® cloud platform.
- Amazon® Simple Storage Service (S3) may host the storage.
- Another cloud-based data lake is Hadoop. It should be noted that the disclosed embodiments are not limited to data lakes and are also applicable to any type of object storage, such as Cassandra®.
- the disclosed embodiments may also be applicable to data streams, databases, data warehouses, and the like.
- the managed query execution platform 110 includes a plurality of computing machines and storage devices (both shown in FIG. 2 ) that allow the acceleration of SQL queries and commands. As will be discussed in more detail below, acceleration is achieved by querying indexed views generated from the data stored in the data lake 130; the views are indexed and stored in the storage devices of the query execution platform 110. Further acceleration is provided by the implementation of an elastic query engine by each computing machine.
- a database manager (DBM) device 140 is also connected to the managed query execution platform 110 .
- the DBM device 140 may be any client device allowing a user to configure the platform 110 . Such a configuration may include creation of indexed views, creation of settings related to data indexing, creation of data synchronization settings, definition of connections to data lakes, and the like.
- the DBM device 140 may also generate reports related to the operation of the managed query execution platform 110, including, for example, an average time to execute a certain type of query, current load, number of active indexed views, and so on.
- the managed query execution platform 110, the data applications 120, and the data lake 130 may be communicatively connected over a network (not shown). Further, the managed query execution platform 110, the data applications 120, and the data lake 130 may be executed and/or deployed or integrated in the same cloud computing platform or different cloud computing platforms.
- the cloud computing platform may be a public cloud, a private cloud, or a hybrid cloud.
- FIG. 2 shows an example diagram illustrating the architecture of the managed query execution platform 110 according to an embodiment.
- the managed query execution platform 110 includes a plurality of computing machines 210-1 through 210-n connected to a plurality of storage devices 220-1 through 220-M over a fabric 230. Each machine 210 can communicate with each storage device 220 over the fabric 230.
- the managed query execution platform further includes a database 240 .
- the database 240 is configured to keep query execution history and configuration parameters and settings; examples of such are discussed above.
- Each computing machine 210 may be a physical machine (e.g., a server) or a virtual entity (e.g., a virtual machine, a software container, a micro-service, etc.) executed over a physical machine or hardware layer. According to disclosed embodiments, each computing machine 210 may be configured to execute queries received by a data application ( 120 , FIG. 1 ) to allow fast processing of such queries. Each computing machine 210 requires a hardware layer for execution, an example of which is provided with reference to FIG. 7 .
- Each storage device 220 may include a solid-state drive (SSD) using persistent media such as Flash or non-volatile random-access memory (NVRAM).
- SSDs are a type of non-volatile computer memory storage devices which use integrated circuit assemblies as memory to store data.
- a storage device 220 may include any random-access media.
- Compared to electromechanical hard disk drives (HDDs), SSDs have no moving mechanical components, such as spinning disks and movable read/write heads. This makes SSDs typically more resistant to physical shock, able to run silently, and quicker to access, with lower latency.
- the fabric 230 may include an Ethernet fabric, an Infiniband fabric, and the like.
- the fabric 230 may enable a storage communication protocol such as, but not limited to, remote direct memory access (RDMA) over Converged Ethernet (RoCE), the Internet Wide Area RDMA Protocol (iWARP), Non-Volatile Memory Express (NVMe), NVMe over Fabrics (NVMeF), and the like.
- the managed query execution platform 110 is configured as a shared-everything distributed architecture. That is, each computing machine 210 can access data stored in any storage device 220 . In some configurations, a group of computing machines 210 and storage devices 220 may be clustered. Each cluster may serve a specific tenant and/or a set of indexed views.
- one of the computing machines 210 operates as an "orchestrator" while the others operate as "workers".
- An orchestrator is configured to plan and schedule execution of SQL queries or statements to the workers.
- the orchestrator may also be configured to allocate additional workers (computing machines) on demand.
- the functionality of a computing machine is determined, in part, by its query engine.
- a computing machine 210 implements a query engine 211 and an indexing engine 212 .
- the query engine 211 allows execution of SQL queries over data lakes.
- the query engine 211 is configured to parse SQL queries and plan their execution by pushing the parsed statements to the indexing engine 212 .
- a query engine 211 may be realized using a third-party SQL engine, such as, but not limited to, Presto®, Apache Drill, Spark®, Impala, and the like.
- a third-party SQL engine is configured with a plurality of connectors to allow access to data in various formats. Examples of such data formats include optimized row columnar (ORC), Parquet, comma-separated values (CSV), JavaScript® Object Notation (JSON), and the like.
- the storage devices 220 may be configured to store indexed views (and their contents).
- an indexed view is an SQL view (or dataset) segmented into a series of ordered vertical segments, each having a specified size or comprising a specified number of records.
- a data chunk includes a header and column data from the respective indexed view.
- a data chunk may further include index information.
- the data chunks and index information together constitute an indexed view, or a portion thereof. The contents of a data chunk and index information are discussed below.
- a storage device 220 maintains a plurality of memory pages 221 .
- a memory page may include index information (at least a local index) and data chunk (not shown).
- a complete indexed view may reside in memory pages 221 of one or more storage devices 220.
- when a query is executed by a query engine 211 on an indexed view, the query may access data in one or more relevant columns by first serially verifying the data values in each column's data chunks based, at least in part, on chunk metadata and the chunk index.
- the query may be able to focus only on the relevant parts of the view and skip data chunks where the metadata indicates that the values searched for are not within the range stored in the segment.
- the query then may be configured to produce a bitmap of view rows which are responsive to the query.
- a computing engine 210 may retrieve only those data chunks which contain data responsive to the query.
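The chunk-skipping and bitmap construction described above can be sketched as follows. This is a hypothetical illustration: the chunk layout, field names, and function are assumptions for demonstration, not the platform's actual implementation.

```python
# Hypothetical sketch of metadata-based chunk pruning; layout and names
# are illustrative assumptions, not the patented implementation.

def prune_and_scan(chunks, predicate, lo, hi):
    """Skip chunks whose [min, max] range cannot contain matches, then
    scan the surviving chunks to build a bitmap of responsive rows."""
    bitmap = []
    for chunk in chunks:
        if chunk["max"] < lo or chunk["min"] > hi:
            # Range is disjoint from the searched range: skip the chunk
            bitmap.extend([False] * len(chunk["values"]))
            continue
        bitmap.extend(predicate(v) for v in chunk["values"])
    return bitmap

chunks = [
    {"min": 20, "max": 30, "values": [20, 25, 30]},  # pruned, never scanned
    {"min": 7, "max": 15, "values": [7, 8, 15]},
    {"min": 0, "max": 7, "values": [3, 5, 7]},
]
mask = prune_and_scan(chunks, lambda v: v in (3, 8), lo=3, hi=8)
```

Only chunks whose min-max range overlaps the searched range are scanned; the rest contribute all-false entries to the bitmap without being read.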
- column-oriented databases store the values of a single column contiguously. This allows array-processing of the values of a column, wherein rows may further be constructed from column values if required.
- Query processing in columnar databases works on columns, so a columnar representation can improve the performance of queries that only access a small subset of columns. Because every page read in a columnar database can access the values of many rows at once, a query may retrieve more values per I/O operation than with row storage.
- the unique structure of columnar databases may provide for faster analytical query activities, known as Online Analytical Processing (OLAP). OLAP queries are typically used to derive new information from existing data (e.g., data aggregates, transformations, and/or calculations).
- OLAP queries typically only need to access a subset of columns in a table
- a columnar representation of data allows for skipping irrelevant columns.
- Columnar databases may also provide for improved data compression ratios, because the encoding is done on multiple values that are of the same type and may have less entropy.
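As an illustration of why same-type, low-entropy column values compress well, the sketch below uses run-length encoding, one common columnar encoding; it is an example only, not necessarily the encoding the platform uses.

```python
# Illustrative run-length encoding (RLE) of a low-entropy column value
# sequence; RLE is just one example of a columnar encoding.

def rle_encode(column):
    """Collapse consecutive repeats into [value, count] runs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] runs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

col = ["US", "US", "US", "DE", "DE", "US"]
encoded = rle_encode(col)  # three runs instead of six values
```

Repeated values of the same type collapse into short runs, which is why grouping a column's values together tends to improve compression ratios.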
- Online transaction processing (OLTP) queries typically retrieve or modify individual records.
- OLTP queries executed on a columnar database will typically require an increased number of separate disk-access operations.
- OLTP queries are often triggered by end user requests, and therefore response time may be crucial.
- a columnar database structure which allows for a reduction in disk access operations may be capable of providing the advantages of a columnar data representation while reducing the associated computational cost and time penalty as compared to traditional columnar database structures.
- the managed query execution platform 110 is configured for storing data in an SQL view.
- An SQL view is a virtual table based on the result-set of an SQL statement.
- a view contains rows and columns, just like a real table.
- the fields in a view are fields from one or more real tables in the database.
- the indexed views support SQL statements including Filters, Joins, and aggregations.
- a Filter may be any SQL WHERE condition, on any column, within an SQL statement that can use an index. Indices are used for point lookups, range queries and string matching of data saved in the storage devices 220 .
- a join statement is any SQL JOIN command using the index of a key column. The index is most useful for dimensional JOINs: combining a fact table with a filtered dimension table.
- An aggregation command includes SQL aggregations and groupings, which can use the indexes for acceleration.
- each indexed view, created and utilized by the managed query execution platform is segmented into a series of ordered vertical segments, each segment having a specified size or comprising a specified number of records.
- FIGS. 4A and 4B illustrate the segmentation of views according to an embodiment.
- FIG. 4A is an illustration of a common (SQL) view 400 of a columnar database structure.
- the view 400 may be populated with data received in any format, e.g., comma-separated values (CSV), JSON, and the like.
- the data may be one or more database tables.
- the data utilized to populate the view 400 is saved in the data lake 130 and may be changed from time to time.
- the managed query execution platform 110 is configured to create indexed views based on the data stored in the data lake 130 .
- the indexed views are saved in the storage devices 220 of the managed query execution platform 110, thereby allowing fast execution of queries and a scalable solution.
- views are predefined and can be created using an SQL "CREATE VIEW" command.
- FIG. 4B is a schematic illustration of an example indexed view 410 created according to an embodiment.
- the indexed view 410 is vertically segmented into a plurality of segments, e.g., segments 0, . . . , N.
- the intersection of each column A, . . . , N and each segment, creates a data chunk, e.g., chunks A- 0 , B- 2 , C- 1 , and so on.
- each data chunk may have a size of 64 kilo-rows.
- each data chunk may have a size that is larger or smaller than 64 kilo-rows, e.g., 128 kilo-rows, and so on.
- index information and a header are generated for each of the data chunks.
- the data chunk header may contain metadata such as a range of values (minimum-maximum), an average value, a mean value, the number of NOT NULL values, histograms, compression metadata, and so on.
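A per-chunk header of the kind listed above might be computed as in this sketch; the field names and dictionary layout are assumptions for illustration, not the platform's actual header format.

```python
# Hypothetical per-chunk header computation; field names are assumptions
# based on the metadata kinds described (range, mean, NOT NULL count).

def build_chunk_header(values):
    """Derive summary metadata from a chunk's column values, ignoring
    NULLs (represented here as None)."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
        "not_null": len(present),
    }

header = build_chunk_header([4, None, 10, 2, None, 8])
```

Such a header lets a query engine decide, from the min-max range alone, whether a chunk can possibly contain values responsive to a filter.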
- the index information may include any one of: a local index, a meta-index, and global index.
- the local index is information regarding the location of specific data within the data chunk.
- the meta-index covers an entire column, comprising all the local indexes of the data chunks within that column of an indexed view. There is a plurality of local indexes indexing the data chunks of a column. When scanning a specific column, the local indexes of its data chunks are scanned in parallel.
- the global index is an index of the entire indexed view, including all individual segments and local and meta indexes. It should be noted that additional and/or other index structures may be used.
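The three-level structure (global index over segments, meta-index over a column's chunks, local index within a chunk) can be sketched as a nested lookup. All names and the in-memory layout below are illustrative assumptions, not the actual on-disk format.

```python
# Illustrative three-level lookup (global -> meta -> local); the real
# index layout is not specified here, so this is only a sketch.

def locate(global_index, view, column, value):
    """Walk the global index to the view's segments, the meta-index to
    the requested column, then each chunk's local index to row offsets."""
    hits = []
    for segment in global_index[view]["segments"]:
        meta_index = segment["columns"].get(column)
        if meta_index is None:
            continue
        for chunk in meta_index["chunks"]:
            # The local index maps value -> row offsets within the chunk
            hits.extend((chunk["id"], row)
                        for row in chunk["local"].get(value, []))
    return hits

global_index = {
    "INDEXED_VIEW_1": {"segments": [
        {"columns": {"B": {"chunks": [
            {"id": "B-1", "local": {8: [12, 18]}},
            {"id": "B-2", "local": {3: [2]}},
        ]}}},
    ]},
}
rows = locate(global_index, "INDEXED_VIEW_1", "B", 8)
```

Each level narrows the search: segments first, then columns, then chunk-local row positions, so only relevant pages need to be touched.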
- index information is generated by the indexing engine 212 in a computing machine 210 using an inline indexing process. Using this process, all the dimensions (columns) of an indexed view are automatically indexed. Data is indexed as it is loaded, at the rate of the data flow, without any user intervention or post-processing.
- the inline index process is adaptive to the data, where the index information, and particularly each local index is changed as the data in the view changes.
- the data chunks of an indexed view are stored in one or more storage devices 220 .
- the data chunks may be encoded and compressed prior to storage.
- each data chunk may be stored on one or more memory pages of a storage device 220 having a native page size defined by, for example, the NVMe protocol.
- larger page sizes may be used, such as 16 kilobytes or more.
- each memory page of a storage device 220 stores only data from a single data chunk.
- a created indexed view (and its contents) is loaded to the storage devices 220 from the data lake based on the index information data layout.
- the managed query execution platform 110 is configured to automatically synchronize, in real time, the data in the indexed view with the data stored in the data lake. As the view is always indexed across all dimensions, consistently fast performance is provided for any query on any dimension that can use an index, including filtering and joins. The synchronization may be performed by any of the computing machines 210 .
- data changes on the data lake are automatically tracked by each indexed view.
- the respective view can be automatically synchronized in near-real-time to the data lake.
- the synchronization of data may be of data addition and/or deletion.
- changes in the data lake are detected based on notifications received from the data lake and/or a user storing data therein; polling, i.e., periodically scanning the data lake for changes; and/or monitoring changes in the database's catalog, such as table structure changes.
- the synchronization of an indexed view may be incremental or non-incremental.
- the former applies when the SQL definition is incremental, in which case any change to the data is efficiently reflected in the view; the latter occurs when the SQL definition is not incremental and resembles a data transformation workload (for example, one that includes a JOIN operation).
- any change of data may require a full view re-computation.
- the data synchronization may be performed either on a predefined schedule, or via manual on-demand synchronizations.
- FIG. 5 is an example flowchart 500 illustrating a process for executing a query on an indexed view according to an embodiment.
- the process may be performed by the managed query execution platform and, in particular, by one or more computing machines in the platform.
- a query is received from a data application and processed.
- the processing of a query includes parsing, planning, and scheduling a distributed query plan across the compute machines.
- the parsing may include breaking the query into statements (also known as "push-downs") to allow different statements to execute on different machines. That is, the processing of a query can be performed in parallel. All processing is performed in the memory of a computing machine.
- the following query may be processed:
- the query calls for retrieving specific rows from columns A and C, where column B has a value of 3 or 8.
- two independent statements can be parsed to be processed in parallel: one looking for values in column B equal to 3; and the other looking for values in column B equal to 8.
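Executing the two parsed statements in parallel might look like the following sketch, where a thread pool stands in for distribution across computing machines; the helper names are assumptions for illustration.

```python
# Sketch of running the two independent statements (B = 3, B = 8) in
# parallel; ThreadPoolExecutor is a stand-in for distributing statements
# across worker machines, and scan_for is an assumed helper.
from concurrent.futures import ThreadPoolExecutor

def scan_for(column, target):
    """Return the row positions in a column equal to the target value."""
    return [i for i, v in enumerate(column) if v == target]

column_b = [8, 1, 3, 8, 5, 3]
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(scan_for, column_b, t) for t in (3, 8)]
    # Union the per-statement row sets, as required for an OR condition
    matches = sorted(set().union(*(f.result() for f in futures)))
```

Because the two statements are independent, their results can be computed by different machines and merged afterward.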
- a global index of the queried indexed view is verified. This is performed in order to locate segments, and data chunks in such segments, of the queried indexed view. For example, the segments of the INDEXED VIEW_ 1 are verified. S 520 may also include verifying the meta-index to locate the queried columns in the indexed view.
- the plurality of local indexes are verified to locate the data chunks containing values related to one or more statements in the processed queries.
- the plurality of local indexes are verified or scanned in parallel. This may be performed in a distributed manner by a number of computing machines. Alternatively or collectively, data chunks' headers are verified to determine chunk value ranges, based on header metadata. In the above example query, local indices of data chunks containing the column B data are verified.
- indexes of the relevant data chunks are accessed in order to retrieve the data responsive to the statements (and hence the query).
- the data chunks are accessed in parallel from a number of storage devices, responsive to a number of statements executed in parallel.
- relevant rows in each data chunk which contain data responsive to the query are accessed. It should be noted that this step can be performed for every statement in the processed query.
- the retrieved data is calculated or compiled responsive to the query.
- the data from data chunks are retrieved, in parallel from a number of storage devices, responsive to a number of statements executed in parallel.
- the data chunks are decoded or decompressed.
- the compiled query results are returned. For example, the results are returned to the data application that issued the query.
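The steps above (receive and parse, locate via indexes, retrieve, compile, return) can be condensed into a toy end-to-end sketch over an in-memory view; the data layout and all names are illustrative assumptions, not the platform's implementation.

```python
# Toy end-to-end sketch of the flow above (parse -> locate -> retrieve ->
# compile); the in-memory "view" and all names are assumptions.

view = {
    "A": ["a0", "a1", "a2", "a3"],
    "B": [1, 3, 9, 8],
    "C": ["c0", "c1", "c2", "c3"],
}

def execute(view, select_cols, where_col, targets):
    """Locate rows where the filter column holds a target value, then
    compile the selected columns for those rows into result tuples."""
    rows = [i for i, v in enumerate(view[where_col]) if v in targets]
    return [tuple(view[c][i] for c in select_cols) for i in rows]

# Mirrors the running example: select columns A and C where B is 3 or 8
result = execute(view, ["A", "C"], "B", {3, 8})
```

In the actual platform these steps would be distributed, with only the relevant chunks of columns A, B, and C read from the storage devices.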
- FIG. 6 is a diagram illustrating the execution of the above example query, according to an embodiment.
- column 600 -B includes, e.g., 3 chunks (B- 0 , B- 1 , B- 2 ), where only chunks B- 1 (value range 7-15) and B- 2 (value range 0-7) contain values responsive to the WHERE condition in the query. Thus, the segment B- 0 is not accessed.
- the local indices of data chunks B- 1 and B- 2 are accessed to identify the relevant rows in each data chunk which contain data responsive to the query.
- the relevant data chunks may be detected.
- specified rows have been identified within data chunks B- 1 (e.g., rows 12 , 18 , 62 ) and B- 2 (e.g., rows 2 , 95 ) as containing the values responsive to the WHERE clause of the query.
- Data chunks A- 1 , A- 2 , C- 1 , and C- 2 associated with columns A and C may be decoded.
- the disclosed embodiments provide technical improvements over existing solutions.
- the data used in the demonstration included rideshare information collected from a database comprising 5 billion records, representing 25 million rideshare trips recorded over 3 months, with a user base of 5 million users and 200,000 drivers.
- the time for processing a query has been reduced from minutes to seconds.
- the time for processing a query for "rides of specified user" is 0.6 s, using the disclosed embodiments, in comparison to 1 minute and 36 seconds using the existing Presto® solution.
- the time for processing a query for "rides within specified area, age group, and time of the day" is 3.1 seconds, using the disclosed embodiments, in comparison to 1 minute using the existing Presto® solution.
- FIG. 7 is an example schematic block diagram of a hardware layer 700 implemented by each computing machine according to an embodiment.
- the hardware layer 700 includes a processing circuitry 710 coupled to a memory 720 , a storage 730 , and a network interface 740 .
- the components of the hardware layer 700 may be communicatively connected via a bus 750 .
- the processing circuitry 710 may be realized as one or more hardware logic components and circuits.
- illustrative types of hardware logic components include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), Application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information.
- the memory 720 may be volatile (e.g., RAM, etc.), non-volatile (e.g., ROM, flash memory, etc.), or a combination thereof.
- computer readable instructions to implement one or more embodiments disclosed herein may be stored in the storage 730 .
- the memory 720 is configured to store software.
- Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by the processing circuitry 710 , cause the processing circuitry 710 to perform the various processes described herein.
- the storage 730 may be magnetic storage, optical storage, and the like, and may be realized, for example, as flash memory or other memory technology, or any other medium which can be used to store the desired information.
- the network interface 740 allows the hardware layer 700 to communicate with servers running the data applications and with infrastructures hosting data lakes.
- the network interface 740 may also include means for communicating with storage devices using storage communication protocols.
- the various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof.
- the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices.
- the application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
- the machine is implemented on a computer platform having hardware such as one or more central processing units ("CPUs"), a memory, and input/output interfaces.
- the computer platform may also include an operating system and microinstruction code.
- a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
- the phrase "at least one of" followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including "at least one of A, B, and C," the system can include A alone; B alone; C alone; 2A; 2B; 2C; 3A; A and B in combination; B and C in combination; A and C in combination; A, B, and C in combination; 2A and C in combination; A, 3B, and 2C in combination; and the like.
- any reference to an element herein using a designation such as "first," "second," and so forth does not generally limit the quantity or order of those elements. Rather, these designations are generally used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not mean that only two elements may be employed there or that the first element must precede the second element in some manner. Also, unless stated otherwise, a set of elements comprises one or more elements.
Abstract
A query execution fabric is provided. The query execution fabric includes: a plurality of computing machines; and a plurality of storage devices communicatively connected to the plurality of computing machines using a storage communication protocol and configured to maintain a plurality of indexed views comprising index information and a plurality of data chunks, wherein the index information includes at least one local index per data chunk and at least one global index per each of the indexed views of the plurality of indexed views, and wherein each of the plurality of computing machines is configured to: verify a global index to locate segments of the plurality of indexed views; and verify a plurality of local indexes to locate the data chunks containing values related to a plurality of statements.
Description
- This application is a continuation of U.S. patent application Ser. No. 16/597,984 filed Oct. 10, 2019, now allowed, which claims the benefit of U.S. Provisional Application No. 62/744,179 filed on Oct. 11, 2018, the contents of which are hereby incorporated by reference.
- The present disclosure relates generally to the field of computer data storage.
- A database is an organized collection of digital data, stored and accessed electronically. Database designers typically organize the data to model aspects of reality in a way that enables the retrieval of the information through computer processes. A database management system (DBMS) is a computer software that interacts with end users, applications, and the database itself to capture and analyze data. A DBMS allows the definition, creation, querying, updating, and administration of databases. Databases may be defined as row or column-based. In a row-based database, all of the information for a record in a table is stored together. This makes simple, record-centric activities (e.g., recording a transaction, viewing the details of an account, etc.) fast.
- In a columnar database, the data in a table is partitioned so that each field is stored together. This makes analytical query activities (e.g., aggregating data across multiple fields of all of the records) much faster. This is because the queries can avoid reading the fields that are not needed for the query. The trade-off for the columnar approach is that record addition or update operations require multiple separate disk-access operations.
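- The row-versus-column trade-off described above can be illustrated with a short Python sketch. The records and values here are invented for illustration and do not appear in the disclosure:

```python
# The same three records stored row-wise and column-wise, showing why an
# aggregate over one field touches less data in the columnar layout.

# Row layout: all fields of a record stored together.
rows = [
    {"id": 1, "city": "NYC", "fare": 12.5},
    {"id": 2, "city": "SFO", "fare": 30.0},
    {"id": 3, "city": "NYC", "fare": 7.25},
]

# Columnar layout: each field stored contiguously.
columns = {
    "id": [1, 2, 3],
    "city": ["NYC", "SFO", "NYC"],
    "fare": [12.5, 30.0, 7.25],
}

# Aggregating fares reads only the "fare" column...
total_columnar = sum(columns["fare"])

# ...whereas the row layout touches every field of every record.
total_row = sum(record["fare"] for record in rows)
```

Conversely, appending one new record means one write in the row layout but one write per column in the columnar layout, which is the multiple-disk-access trade-off noted above.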
- A data lake is a storage repository that holds a large amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in the data lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
- Data lakes may reside in cloud environments where additional storage can be added on demand. Typically, data stored in the data lake is accessed by applications for deep learning, AI, analytics (e.g., business analytics), and the like to provide access to both historical and real-time data. Querying of the data lakes, as required by such applications, is typically performed by elastic query engines (e.g., Presto®, Apache Drill®, Spark®, Impala®, etc.).
- Data lakes provide means for decoupling data processing and data storage. That is, data is stored on a data lake and processed via a separate computing engine. Although data lakes provide flexibility in maintaining different types of data from different sources and are a cost-efficient storage solution, such a storage architecture cannot be efficiently scaled up. Specifically, utilization of elastic query engines requires cloud computing resources, such as CPU, memory, and allocation of virtual machines. The ever-increasing demands of applications would be constrained by the allocated computing resources. To improve the performance of executed applications, more computing resources can be allocated. However, this increases the overall cost of running such applications. Further, the scaling of computing resources would reduce efficiency, degrade performance, and increase maintenance.
- Another disadvantage of executing data applications over data lakes is the transformation of data from existing sources to a data lake. This typically requires engineering the data from data sources into new models in order to make the data accessible in the data lake. This is typically performed with an extract, transform, load (ETL) tool. As data sources' formats, contents, and volumes are dynamically changing, as well as user and application demands, the transformation process would require months to complete.
- It would therefore be advantageous to provide a solution that would overcome the challenges noted above.
- A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term "some embodiments" or "certain embodiments" may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
- Certain embodiments disclosed herein include a query execution fabric that includes: a plurality of computing machines; and a plurality of storage devices communicatively connected to the plurality of computing machines using a storage communication protocol and configured to maintain a plurality of indexed views comprising index information and a plurality of data chunks, wherein the index information includes at least one local index per data chunk and at least one global index per each of the indexed views of the plurality of indexed views, and wherein each of the plurality of computing machines is configured to: verify a global index to locate segments of the plurality of indexed views; and verify a plurality of local indexes to locate the data chunks containing values related to a plurality of statements.
- The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
-
FIG. 1 is a network diagram utilized to describe the various disclosed embodiments. -
FIG. 2 is a diagram illustrating the architecture of the managed query execution platform. -
FIG. 3 is a schematic diagram of a compute machine and a storage device utilized by the managed query execution platform according to an embodiment. -
FIG. 4A is an illustration of a common view. -
FIG. 4B is an illustration of an indexed view. -
FIG. 5 is a flowchart illustrating a process for executing a query on an indexed view according to an embodiment. -
FIG. 6 is a diagram illustrating the execution of an example query according to an embodiment. -
FIG. 7 is a schematic block diagram of a hardware layer implemented by each computing machine according to an embodiment.
- It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
- By way of example to disclosed embodiments, a managed query execution platform is provided. The platform enables, among other tasks, execution of data analytics applications over data lakes. The platform further allows execution of such applications on the data stored in data lakes without the need to provide new models to transform the data in the data lakes through acceleration of, for example, standard query language (SQL) queries and commands. As such, the disclosed embodiments enable easy scaling of and support for more data applications (on a larger dataset) without allocating additional computing resources.
-
FIG. 1 is an example network diagram 100 utilized to describe the various disclosed embodiments. As illustrated in FIG. 1, a managed query execution platform 110 is deployed between a plurality of data applications 120 and at least one data lake 130. In an embodiment, the managed query execution platform 110 is configured to accelerate SQL commands and queries requested by the data applications 120, thereby improving their overall performance. A data application 120 may include any data analytics (e.g., business analytics) application, deep-learning application, artificial intelligence (AI) application, or any other data-driven application.
- The data lake 130 is a storage repository that holds a vast amount of raw data in its native format until it is needed. The data lake 130 is hosted in a cloud computing platform. For example, Azure® Data Lake is a scalable data storage and analytics service hosted in the Azure® cloud platform. As another example, Amazon® Simple Storage Service (S3) may host the storage. Another cloud-based data lake is Hadoop. It should be noted that the disclosed embodiments are not limited to data lakes and are also applicable to any type of object storage, such as Cassandra®. The disclosed embodiments may also be applicable to data streams, databases, data warehouses, and the like.
- The managed query execution platform 110 includes a plurality of computing machines and storage devices (both shown in FIG. 2) that allow the acceleration of SQL queries and commands. As will be discussed in more detail below, acceleration is achieved by querying indexed views generated based on the data stored in the data lake 130; the views are indexed and stored in the storage devices of the query execution platform 110. Further acceleration is provided by the implementation of an elastic query engine by each computing machine.
- In an embodiment, a database manager (DBM) device 140 is also connected to the managed query execution platform 110. The DBM device 140 may be any client device allowing a user to configure the platform 110. Such a configuration may include creation of indexed views, creation of settings related to data indexing, creation of data synchronization settings, definition of connections to data lakes, and the like. The DBM device 140 may also generate reports related to the operation of the managed query execution platform 110, including, for example, an average time to execute a certain type of query, the current load, the number of active indexed views, and so on.
- In one configuration, the managed query execution platform 110, the data applications 120, and the data lake 130 may be communicatively connected over a network (not shown). Further, the managed query execution platform 110, the data applications 120, and the data lake 130 may be executed and/or deployed or integrated in the same cloud computing platform or in different cloud computing platforms. The cloud computing platform may be a public cloud, a private cloud, or a hybrid cloud.
- FIG. 2 shows an example diagram illustrating the architecture of the managed query execution platform 110 according to an embodiment. The managed query execution platform 110 includes a plurality of computing machines 210-1 through 210-n connected to a plurality of storage devices 220-1 through 220-M over a fabric 230. Each machine 210 can communicate with each storage device 220 over the fabric 230. In some configurations, the managed query execution platform further includes a database 240. The database 240 is configured to keep query execution history and configuration parameters and settings; examples of such are discussed above.
- Each computing machine 210 may be a physical machine (e.g., a server) or a virtual entity (e.g., a virtual machine, a software container, a micro-service, etc.) executed over a physical machine or hardware layer. According to the disclosed embodiments, each computing machine 210 may be configured to execute queries received from a data application (120, FIG. 1) to allow fast processing of such queries. Each computing machine 210 requires a hardware layer for execution, an example of which is provided with reference to FIG. 7.
- Each storage device 220 may include a solid-state persistent drive (SSD), such as Flash or non-volatile random-access memory (NVRAM). SSDs are a type of non-volatile computer memory storage device which uses integrated circuit assemblies as memory to store data. A storage device 220 may include any random-access media. Compared to electromechanical hard disk drives (HDDs), SSDs have no moving mechanical components, such as spinning disks and movable read/write heads. This makes SSDs typically more resistant to physical shock, able to run silently, and quicker in access time with lower latency.
- The fabric 230 may include an Ethernet fabric, an Infiniband fabric, and the like. In an embodiment, the fabric 230 may enable a storage communication protocol such as, but not limited to, remote direct memory access (RDMA) over Converged Ethernet (RoCE), the internet Wide Area RDMA Protocol (iWARP), Non-Volatile Memory Express (NVMe), NVMe over fabric (NVMeF), and the like.
- The managed
query execution platform 110 is configured as a shared-everything distributed architecture. That is, each computing machine 210 can access data stored in any storage device 220. In some configurations, a group of computing machines 210 and storage devices 220 may be clustered. Each cluster may serve a specific tenant and/or a set of indexed views.
- In some embodiments, one of the computing machines 210 operates as an "orchestrator" while the others operate as "workers". An orchestrator is configured to plan and schedule the execution of SQL queries or statements by the workers. The orchestrator may also be configured to allocate additional workers (computing machines) on demand. The functionality of a computing machine is determined, in part, by its query engine.
- In an embodiment, illustrated in FIG. 3, a computing machine 210 implements a query engine 211 and an indexing engine 212. The query engine 211 allows execution of SQL queries over data lakes. Specifically, the query engine 211 is configured to parse SQL queries and plan their execution by pushing the parsed statements to the indexing engine 212. A query engine 211 may be realized using a third-party SQL engine, such as, but not limited to, Presto®, Apache Drill, Spark®, Impala, and the like. Typically, such a third-party SQL engine is configured with a plurality of connectors to allow access to data in various formats. Examples of such data formats include optimized row columnar (ORC), Parquet, comma-separated values (CSV), JavaScript® object notation (JSON), and the like.
- The storage devices 220 may be configured to store indexed views (and their contents). According to the disclosed embodiments, an indexed view is an SQL view (or dataset) segmented into a series of ordered vertical segments, each having a specified size or comprising a specified number of records. With respect to each column in a view (database), the intersection of the segments and the column creates data chunks. A data chunk includes a header and column data from the respective indexed view. A data chunk may further include index information. The data chunks and index information together constitute an indexed view, or a portion thereof. The contents of a data chunk and its index information are discussed below.
- As illustrated in FIG. 3, a storage device 220 maintains a plurality of memory pages 221. A memory page may include index information (at least a local index) and a data chunk (not shown). A complete indexed view may reside in memory pages 221 of one or more storage devices 220.
- In another embodiment, when a query is executed, by a query engine 211, on an indexed view, the query may access data in one or more relevant columns by first serially verifying the data values in each column data chunk based, at least in part, on chunk metadata and a chunk index. Thus, the query may be able to focus only on the relevant parts of the view and skip data chunks where the metadata indicates that the values searched for are not within the range stored in the segment. The query may then produce a bitmap of view rows which are responsive to the query. In some embodiments, based on the produced bitmap, a computing engine 210 may retrieve only those data chunks which contain data responsive to the query.
- However, a potential downside of the columnar approach is that it is less well suited for transactional record activity, known as online transaction processing or OLTP. OLTP queries typically retrieve or modify individual records. OLTP queries executed on a columnar database will typically require an increased number of separate disk-access operations. OLTP queries are often triggered by end user requests, and therefore response time may be crucial.
- Accordingly, a columnar database structure which allows for a reduction in disk access operations may be capable of providing the advantages of a columnar data representation while reducing the associated computational cost and time penalty as compared to traditional columnar database structures.
- In some embodiments, the managed
query execution platform 110 is configured for storing data in an SQL view. An SQL view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database. - SQL statements (e.g., WHERE, JOIN) can be added to a view and present the data as if the data was retrieved from one single table. The indexed views support SQL statements including Filters, Joins, and aggregations.
- For example, a Filter may be any SQL WHERE condition, on any column, within an SQL statement that can use an index. Indices are used for point lookups, range queries and string matching of data saved in the
storage devices 220. A join statement is any SQL JOIN command using the index of a key column. The index is most useful for dimensional JOINs: combining a fact table with a filtered dimension table. An aggregation command includes SQL aggregations and grouping can use the indexes for acceleration. - As noted above, each indexed view, created and utilized by the managed query execution platform, is segmented into a series of ordered vertical segments, each segment having a specified size or comprising a specified number of records.
FIGS. 4A and 4B illustrate the segmentation of views according to an embodiment. -
FIG. 4A is an illustration of a common (SQL) view 400 of a columnar database structure. The view 400 may be populated with data received in any format, e.g., comma-separated values (CSV), JSON, and the like. The data may be one or more database tables. The data utilized to populate the view 400 is saved in the data lake 130 and may be changed from time to time.
- The managed query execution platform 110 is configured to create indexed views based on the data stored in the data lake 130. The indexed views are saved in the storage devices 220 of the managed query execution platform 110, thereby allowing fast execution of queries and a scalable solution. In an embodiment, views are predefined and can be created using an SQL "Create View" command.
-
FIG. 4B is a schematic illustration of an example indexed view 410 created according to an embodiment. The indexed view 410 is vertically segmented into a plurality of segments, e.g., segments 0, . . . , N. The intersection of each column A, . . . , N and each segment creates a data chunk, e.g., chunks A-0, B-2, C-1, and so on. In some example configurations, each data chunk may have a size of 64 kilo-rows. In other example configurations, each data chunk may have a size that is larger or smaller than 64 kilo-rows, e.g., 128 kilo-rows, and so on.
- In some embodiments, the index information may include any one of: a local index, a meta-index, and global index. The local index is information regarding the location of specific data within the data chunk. The meta-index is an entire column, comprising all indexes of each data chunk within the column of an indexed view. There are a plurality of local indexes indexing data chunks of a column. When scanning for a specific column the plurality of local indexes of data chunks are scanned in parallel. The global index is an index of the entire index view including all individual segments and local and meta indexes. It should be noted that additional and/or other index structures may be used.
- In an embodiment, index information is generated by the
indexing engine 212 in acomputing machine 210 using an inline indexing process. Using this process, all the dimensions (columns) of an indexed view are automatically indexed. Data is indexed as it is loaded, at the rate of the data flow, without any user intervention or post-processing. The inline index process is adaptive to the data, where the index information, and particularly each local index is changed as the data in the view changes. - The data chunks of an indexed view are stored in one or
more storage devices 220. The data chunks may be encoded and compressed prior to storage. In some embodiments, each data chunk may be stored on one or more memory pages of astorage device 220 having a native page size defined by, for example, the NVMe protocol. In some embodiments, larger page sizes may be used, such as 16 kilobytes or more. In some embodiments, each storage device's 220 memory page stores only data from a single data chunk. - A created indexed view (and its contents) is loaded to the
storage devices 220 from the data lake based on the index information data layout. Once loaded, the managedquery execution platform 110 is configured to automatically synchronize in real-time to the data in the indexed view with the data stored in the data lake. As the view is always indexed across all dimensions, consistently fast performance on any query on any dimension that can use an index, including filtering and joins are provided. The synchronization may be performed by any of thecomputing machines 210. - Specifically, data changes on the data lake are automatically tracked by each indexed view. When a data change is detected, the respective view can be automatically synchronized in near-real-time to the data lake. The synchronization of data may be of data addition and/or deletion.
- In an embodiment, changes in the data lake are detected based on notifications received from the data lake and/or a user storing data therein; polling, i.e., periodically scanning of the data lake for changes; and/or monitoring changes in the database's catalog, such as table structure changes.
- In an embodiment, the synchronization of an indexed view may be incremental and non-incremental. The former is defined when the SQL definition is incremental, any change of the data will be efficiently reflected in the view, while the latter occurs when the SQL definition is not incremental and is more like a data transformation workload (for example, includes a JOIN operation). In this case, any change of data may require a full view re-computation. In both embodiments, the data synchronization may be performed either on a predefined schedule, or via manual on-demand synchronizations.
-
FIG. 5 is an example flowchart 500 illustrating a process for executing a query on an indexed view according to an embodiment. The process may be performed by the managed query execution platform and, in particular, by one or more computing machines in the platform.
- For example, the following query may be processed:
-
- SELECT
- COLUMNS A, C FROM INDEXED VIEW_1
- WHERE COLUMN B=3 OR B=8
- SELECT
- The query calls for retrieving specific rows from columns A and C, where column B has a value of 3 or 8. In this example, 2 independent statements can be parsed to be processed in parallel: one looking for values in column B equal to 3; and the other looking for values in column B equal to 8.
- At S520, a global index of the queried indexed view is verified. This is performed in order to locate segments and data chunks in such segments of the queried indexed view. For example, the segments of the INDEXED VIEW_1 are verified. S520 may also include verifying the mate-index to locate the queried columns in the indexed view.
- At S530, the plurality of local indexes are verified to locate the data chunks containing values related to one or more statements in the processed queries. The plurality of local indexes are verified or scanned in parallel. This may be performed in a distributed manner by a number of computing machines. Alternatively or collectively, data chunks' headers are verified to determine chunk value ranges, based on header metadata. In the above example query, local indices of data chunks containing the column B data are verified.
- At S540, indexes of the relevant data chunks are accessed in order to retrieve the data responsive to the statements (and hence the query). The data chunks are accessed in parallel from a number of storage devices, responsive to a number of statements executed in parallel. In an embodiment, relevant rows in each data chunk which contain data responsive to the query are accessed. It should be noted that this step can be performed for every statement in the processed query.
- At S550, the retrieved data is calculated or compiled responsive to the query. The data from the data chunks is retrieved, in parallel, from a number of storage devices, responsive to a number of statements executed in parallel. In some embodiments, the data chunks are decoded or decompressed.
- At S560, the compiled query results are returned. For example, the results are returned to the data application that issued the query.
-
FIG. 6 is a diagram illustrating the execution of the above example query, according to an embodiment. As can be seen, column 600-B includes, e.g., three chunks (B-0, B-1, B-2), where only chunks B-1 (value range 7-15) and B-2 (value range 0-7) contain values responsive to the WHERE condition in the query. Thus, the data chunk B-0 is not accessed. - The local indices of data chunks B-1 and B-2 are accessed to identify the relevant rows in each data chunk which contain data responsive to the query. The relevant data chunks may be detected. In this example, specified rows have been identified within data chunks B-1 and B-2 (e.g., rows 2 and 95) as containing the values responsive to the WHERE clause of the query. Data chunks A-1, A-2, C-1, and C-2 associated with columns A and C may be decoded. - The following relevant rows are retrieved from each relevant data chunk in columns A and C:
-
COLUMN A | COLUMN C
---|---
A1-12 | C1-12
A1-18 | C1-18
A1-62 | C1-62
A2-02 | C2-02
A2-95 | C2-95
- The disclosed embodiments provide technical improvements over existing solutions. For example, when tested during a demonstration of query retrieval results against Presto®, the data used in the demonstration included rideshare information collected from a database comprising 5 billion records, representing 25 million rideshare trips recorded over 3 months, with a user base of 5 million users and 200,000 drivers.
- The time for processing a query has been reduced from minutes to seconds. For example, the time for processing a query for "rides of specified user" is 0.6 seconds using the disclosed embodiments, in comparison to 1 minute and 36 seconds using the existing Presto® solution. As another example, the time for processing a query for "rides within specified area, age group, and time of the day" is 3.1 seconds using the disclosed embodiments, in comparison to 1 minute using the existing Presto® solution.
-
FIG. 7 is an example schematic block diagram of a hardware layer 700 implemented by each computing machine according to an embodiment. The hardware layer 700 includes a processing circuitry 710 coupled to a memory 720, a storage 730, and a network interface 740. In an embodiment, the components of the hardware layer 700 may be communicatively connected via a bus 750. - The
processing circuitry 710 may be realized as one or more hardware logic components and circuits. For example, and without limitation, illustrative types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information. - The
memory 720 may be volatile (e.g., RAM, etc.), non-volatile (e.g., ROM, flash memory, etc.), or a combination thereof. In one configuration, computer readable instructions to implement one or more embodiments disclosed herein may be stored in the storage 730. - In another embodiment, the
memory 720 is configured to store software. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by the processing circuitry 710, cause the processing circuitry 710 to perform the various processes described herein. - The
storage 730 may be magnetic storage, optical storage, and the like, and may be realized, for example, as flash memory or other memory technology, or any other medium which can be used to store the desired information. - The
network interface 740 allows the hardware layer 700 to communicate with servers running the data applications and with infrastructures hosting data lakes. The network interface 740 may also include means for communicating with storage devices using storage communication protocols. - It should be understood that the embodiments described herein are not limited to the specific architecture illustrated in
FIG. 7, and other architectures may be equally used without departing from the scope of the disclosed embodiments. - The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units ("CPUs"), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform, such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
- As used herein, the phrase "at least one of" followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including "at least one of A, B, and C," the system can include A alone; B alone; C alone; 2A; 2B; 2C; 3A; A and B in combination; B and C in combination; A and C in combination; A, B, and C in combination; 2A and C in combination; A, 3B, and 2C in combination; and the like.
- It should be understood that any reference to an element herein using a designation such as "first," "second," and so forth does not generally limit the quantity or order of those elements. Rather, these designations are generally used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not mean that only two elements may be employed there or that the first element must precede the second element in some manner. Also, unless stated otherwise, a set of elements comprises one or more elements.
- All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Claims (17)
1. A query execution fabric, comprising:
a plurality of computing machines; and
a plurality of storage devices communicatively connected to the plurality of computing machines using a storage communication protocol and configured to maintain a plurality of indexed views comprising index information and a plurality of data chunks, wherein the index information includes at least one local index per data chunk and at least one global index per each of the indexed views of the plurality of indexed views, and
wherein each of the plurality of computing machines is configured to:
verify a global index to locate segments of the plurality of indexed views; and
verify a plurality of local indexes to locate the data chunks containing values related to a plurality of statements.
2. The query execution fabric of claim 1 , wherein each of the plurality of computing machines is further configured to:
execute queries on the plurality of the indexed views.
3. The query execution fabric of claim 2 , wherein each of the plurality of computing machines is configured to:
process an input query by accessing the data chunks located by the plurality of local indexes; and
respond to the input query based on the accessed data chunks.
4. The query execution fabric of claim 1 , wherein each of the indexed views of the plurality of indexed views is a presentation of customer data stored in a data lake.
5. The query execution fabric of claim 2 , wherein each of the plurality of computing machines further comprises:
a query engine configured to parse each of the queries into a plurality of statements and to process the parsed statements on the plurality of indexed views.
6. The query execution fabric of claim 1 , wherein each of the indexed views of the plurality of indexed views is segmented into a plurality of ordered vertical segments, wherein the plurality of data chunks is based on intersections of the vertical segments and each column in the indexed views.
7. The query execution fabric of claim 1 , wherein the index information comprises:
at least one local index per data chunk;
at least one meta-index per column; and
at least one global index per indexed view.
8. The query execution fabric of claim 1 , wherein the plurality of computing machines is configured to:
access the plurality of local indexes associated with the plurality of data chunks of respective columns in parallel, wherein the plurality of local indexes is independent of each other.
9. The query execution fabric of claim 1 , wherein each of the plurality of storage devices comprises: a plurality of non-volatile memory pages, wherein each memory page is configured to maintain at least one local index and at least one data chunk.
10. The query execution fabric of claim 4 , wherein the plurality of computing machines is configured to synchronize data between the plurality of indexed views and the data stored in the data lake.
11. The query execution fabric of claim 10 , wherein the synchronization is performed in real-time or near real-time.
12. The query execution fabric of claim 11 , wherein the synchronization of data between the plurality of indexed views and the data stored in the data lake further comprises: detection of changes in the data stored in the data lake.
13. The query execution fabric of claim 2 , wherein the queries executed on the plurality of the indexed views comprise SQL queries and commands.
14. The query execution fabric of claim 1 , wherein at least one computing machine of the plurality of the computing machines is configured to orchestrate operations of the plurality of the computing machines.
15. The query execution fabric of claim 1 , wherein one or more of the plurality of the computing machines is allocated based on demand.
16. The query execution fabric of claim 1 , wherein the storage communication protocol is any one of:
remote direct memory access (RDMA) over converged ethernet (RoCE);
internet wide area RDMA protocol (iWARP);
non-volatile memory express (NVMe);
NVMe over fabric (NVMeF); and
NVMeF over TCP.
17. The query execution fabric of claim 1 , wherein the plurality of the computing machines and the plurality of storage devices is deployed in a cloud computing environment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/643,965 US20220164345A1 (en) | 2018-10-11 | 2021-12-13 | Managed query execution platform, and methods thereof |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862744179P | 2018-10-11 | 2018-10-11 | |
US16/597,984 US11226963B2 (en) | 2018-10-11 | 2019-10-10 | Method and system for executing queries on indexed views |
US17/643,965 US20220164345A1 (en) | 2018-10-11 | 2021-12-13 | Managed query execution platform, and methods thereof |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/597,984 Continuation US11226963B2 (en) | 2018-10-11 | 2019-10-10 | Method and system for executing queries on indexed views |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220164345A1 true US20220164345A1 (en) | 2022-05-26 |
Family
ID=70159053
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/597,984 Active 2040-02-06 US11226963B2 (en) | 2018-10-11 | 2019-10-10 | Method and system for executing queries on indexed views |
US16/597,974 Active 2040-02-13 US11347740B2 (en) | 2018-10-11 | 2019-10-10 | Managed query execution platform, and methods thereof |
US17/643,965 Abandoned US20220164345A1 (en) | 2018-10-11 | 2021-12-13 | Managed query execution platform, and methods thereof |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/597,984 Active 2040-02-06 US11226963B2 (en) | 2018-10-11 | 2019-10-10 | Method and system for executing queries on indexed views |
US16/597,974 Active 2040-02-13 US11347740B2 (en) | 2018-10-11 | 2019-10-10 | Managed query execution platform, and methods thereof |
Country Status (2)
Country | Link |
---|---|
US (3) | US11226963B2 (en) |
WO (1) | WO2020077027A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190109720A1 (en) | 2016-07-26 | 2019-04-11 | Samsung Electronics Co., Ltd. | Modular system (switch boards and mid-plane) for supporting 50g or 100g ethernet speeds of fpga+ssd |
US10210123B2 (en) | 2016-07-26 | 2019-02-19 | Samsung Electronics Co., Ltd. | System and method for supporting multi-path and/or multi-mode NMVe over fabrics devices |
US11593291B2 (en) | 2018-09-10 | 2023-02-28 | GigaIO Networks, Inc. | Methods and apparatus for high-speed data bus connection and fabric management |
US11403247B2 (en) | 2019-09-10 | 2022-08-02 | GigaIO Networks, Inc. | Methods and apparatus for network interface fabric send/receive operations |
US11593288B2 (en) * | 2019-10-02 | 2023-02-28 | GigalO Networks, Inc. | Methods and apparatus for fabric interface polling |
US11392528B2 (en) | 2019-10-25 | 2022-07-19 | Cigaio Networks, Inc. | Methods and apparatus for DMA engine descriptors for high speed data systems |
US11487766B2 (en) * | 2020-12-18 | 2022-11-01 | Microsoft Technology Licensing, Llc | Operation fragmentation with metadata serialization in query processing pushdowns |
US11797523B2 (en) | 2020-12-18 | 2023-10-24 | Microsoft Technology Licensing, Llc | Schema and data modification concurrency in query processing pushdown |
Citations (70)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6163782A (en) * | 1997-11-19 | 2000-12-19 | At&T Corp. | Efficient and effective distributed information management |
US20010044758A1 (en) * | 2000-03-30 | 2001-11-22 | Iqbal Talib | Methods and systems for enabling efficient search and retrieval of products from an electronic product catalog |
US6366903B1 (en) * | 2000-04-20 | 2002-04-02 | Microsoft Corporation | Index and materialized view selection for a given workload |
US20020095430A1 (en) * | 1999-12-30 | 2002-07-18 | Decode Genetics Ehf | SQL query generator utilizing matrix structures |
US6434520B1 (en) * | 1999-04-16 | 2002-08-13 | International Business Machines Corporation | System and method for indexing and querying audio archives |
US20030093408A1 (en) * | 2001-10-12 | 2003-05-15 | Brown Douglas P. | Index selection in a database system |
US20030093407A1 (en) * | 1999-05-21 | 2003-05-15 | Roberta Jo Cochrane | Incremental maintenance of summary tables with complex grouping expressions |
US20030195862A1 (en) * | 2002-04-10 | 2003-10-16 | Harrell James E. | Method and system for providing SQL or other RDBMS access to native xbase application |
US20030212664A1 (en) * | 2002-05-10 | 2003-11-13 | Martin Breining | Querying markup language data sources using a relational query processor |
US20040002954A1 (en) * | 2002-06-26 | 2004-01-01 | Surajit Chaudhuri | Compressing database workloads |
US6785673B1 (en) * | 2000-02-09 | 2004-08-31 | At&T Corp. | Method for converting relational data into XML |
US6795830B1 (en) * | 2000-09-08 | 2004-09-21 | Oracle International Corporation | Techniques for providing off-host storage for a database application |
US20040205066A1 (en) * | 2003-04-08 | 2004-10-14 | International Business Machines Corporation | System and method for a multi-level locking hierarchy in a database with multi-dimensional clustering |
US20050187917A1 (en) * | 2003-09-06 | 2005-08-25 | Oracle International Corporation | Method for index tuning of a SQL statement, and index merging for a multi-statement SQL workload, using a cost-based relational query optimizer |
US20050203933A1 (en) * | 2004-03-09 | 2005-09-15 | Microsoft Corporation | Transformation tool for mapping XML to relational database |
US20050210023A1 (en) * | 2004-03-18 | 2005-09-22 | Renato Barrera | Query optimizer using implied predicates |
US6993657B1 (en) * | 2000-09-08 | 2006-01-31 | Oracle International Corporation | Techniques for managing database systems with a community server |
US20060085465A1 (en) * | 2004-10-15 | 2006-04-20 | Oracle International Corporation | Method(s) for updating database object metadata |
US20060085484A1 (en) * | 2004-10-15 | 2006-04-20 | Microsoft Corporation | Database tuning advisor |
US20060235823A1 (en) * | 2005-04-18 | 2006-10-19 | Oracle International Corporation | Integrating RDF data into a relational database system |
US20060242563A1 (en) * | 2005-04-22 | 2006-10-26 | Liu Zhen H | Optimizing XSLT based on input XML document structure description and translating XSLT into equivalent XQuery expressions |
US20070162420A1 (en) * | 2004-01-21 | 2007-07-12 | Oracle International Corporation | Techniques for automatically discovering a database device on a network |
US7257597B1 (en) * | 2001-12-18 | 2007-08-14 | Siebel Systems, Inc. | Table substitution |
US20070214107A1 (en) * | 2006-03-13 | 2007-09-13 | Oracle International Corporation | Dynamic materialized view ranging |
US20070219952A1 (en) * | 2006-03-15 | 2007-09-20 | Oracle International Corporation | Null aware anti-join |
US20070226203A1 (en) * | 2006-03-23 | 2007-09-27 | Microsoft Corporation | Generation of query and update views for object relational mapping |
US20070270658A1 (en) * | 2004-09-03 | 2007-11-22 | Frederick Mitchell | Quantum memory fusion |
US20070288526A1 (en) * | 2006-06-08 | 2007-12-13 | Emc Corporation | Method and apparatus for processing a database replica |
US20080195583A1 (en) * | 2007-02-08 | 2008-08-14 | International Business Machines Corporation | System And Method For Verifying The Integrity And Completeness Of Records |
US20080222136A1 (en) * | 2006-09-15 | 2008-09-11 | John Yates | Technique for compressing columns of data |
US20080295092A1 (en) * | 2000-09-08 | 2008-11-27 | Dean Tan | Techniques for automatically installing and configuring database applications |
US20090064160A1 (en) * | 2007-08-31 | 2009-03-05 | Microsoft Corporation | Transparent lazy maintenance of indexes and materialized views |
US20090182724A1 (en) * | 2008-01-11 | 2009-07-16 | Paul Reuben Day | Database Query Optimization Using Index Carryover to Subset an Index |
US20090307275A1 (en) * | 2005-12-02 | 2009-12-10 | International Business Machines Corporation | System for improving access efficiency in database and method thereof |
US20100217759A1 (en) * | 2007-05-11 | 2010-08-26 | Nec Corporation | System, method, and progam product for database restructuring support |
US7827168B2 (en) * | 2007-05-30 | 2010-11-02 | Red Hat, Inc. | Index clustering for full text search engines |
US20100281005A1 (en) * | 2009-05-04 | 2010-11-04 | Microsoft Corporation | Asynchronous Database Index Maintenance |
US7945551B1 (en) * | 2001-06-14 | 2011-05-17 | Oracle International Corporation | Redirection of misses in queryable caches |
US20110276355A1 (en) * | 2007-11-05 | 2011-11-10 | Michael Bruce | Method and system for informed decision making to locate a workforce |
US8200612B2 (en) * | 2003-05-07 | 2012-06-12 | Oracle International Corporation | Efficient SQL access to multidimensional data |
US8266101B1 (en) * | 2009-07-16 | 2012-09-11 | Binpeng Shuai | Share nothing database cluster and real time synchronization by signaling |
US20130124545A1 (en) * | 2011-11-15 | 2013-05-16 | Business Objects Software Limited | System and method implementing a text analysis repository |
US20130173955A1 (en) * | 2012-01-04 | 2013-07-04 | Xtremlo Ltd | Data protection in a random access disk array |
US20140172914A1 (en) * | 2012-12-14 | 2014-06-19 | Microsoft Corporation | Graph query processing using plurality of engines |
US20140244606A1 (en) * | 2013-01-18 | 2014-08-28 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus and system for storing, reading the directory index |
US20140280028A1 (en) * | 2013-03-14 | 2014-09-18 | Oracle International Corporation | Selective materialized view refresh |
US20140279838A1 (en) * | 2013-03-15 | 2014-09-18 | Amiato, Inc. | Scalable Analysis Platform For Semi-Structured Data |
US20140317087A1 (en) * | 2013-04-22 | 2014-10-23 | Salesforce.Com, Inc. | Systems and methods for implementing and maintaining sampled tables in a database system |
US8892569B2 (en) * | 2010-12-23 | 2014-11-18 | Ianywhere Solutions, Inc. | Indexing spatial data with a quadtree index having cost-based query decomposition |
US9141676B2 (en) * | 2013-12-02 | 2015-09-22 | Rakuten Usa, Inc. | Systems and methods of modeling object networks |
US20150278255A1 (en) * | 2014-03-27 | 2015-10-01 | International Business Machines Corporation | System and method for optimizing database definitions for a new database |
US9195936B1 (en) * | 2011-12-30 | 2015-11-24 | Pegasystems Inc. | System and method for updating or modifying an application without manual coding |
US20150356314A1 (en) * | 2014-06-10 | 2015-12-10 | Salesforce.Com, Inc. | Systems and methods for implementing an encrypted search index |
US20170024387A1 (en) * | 2015-07-22 | 2017-01-26 | Oracle International Corporation | Approximate distinct counting in a bounded memory |
US20170116241A1 (en) * | 2015-10-21 | 2017-04-27 | Oracle International Corporation | Non-blocking database table alteration |
US20170116275A1 (en) * | 2015-10-21 | 2017-04-27 | International Business Machines Corporation | Adaptive multi-index access plan for database queries |
US9659058B2 (en) * | 2013-03-22 | 2017-05-23 | X1 Discovery, Inc. | Methods and systems for federation of results from search indexing |
US20170177700A1 (en) * | 2015-12-17 | 2017-06-22 | Sap Se | Updating a partitioning column |
US20170185326A1 (en) * | 2015-12-29 | 2017-06-29 | Emc Corporation | Consistent transition from asynchronous to synchronous replication in hash-based storage systems |
US20180096006A1 (en) * | 2016-09-30 | 2018-04-05 | Microsoft Technology Licensing, Llc | Workload-driven recommendations for columnstore and rowstore indexes in relational databases |
US10037355B2 (en) * | 2015-07-07 | 2018-07-31 | Futurewei Technologies, Inc. | Mechanisms for merging index structures in MOLAP while preserving query consistency |
US20180285418A1 (en) * | 2017-03-31 | 2018-10-04 | Amazon Technologies, Inc. | Executing queries for structured data and not-structured data |
US10152371B1 (en) * | 2016-09-30 | 2018-12-11 | EMC IP Holding Company LLC | End-to-end data protection for distributed storage |
US10324782B1 (en) * | 2016-03-24 | 2019-06-18 | Emc Corporation | Hiccup management in a storage array |
US10528599B1 (en) * | 2016-12-16 | 2020-01-07 | Amazon Technologies, Inc. | Tiered data processing for distributed data |
US10705907B1 (en) * | 2016-03-24 | 2020-07-07 | EMC IP Holding Company LLC | Data protection in a heterogeneous random access storage array |
US10769148B1 (en) * | 2017-09-25 | 2020-09-08 | Amazon Technologies, Inc. | Relocating data sharing operations for query processing |
US10810202B2 (en) * | 2018-06-14 | 2020-10-20 | Microsoft Technology Licensing, Llc | Execution plan stitching |
US11074261B1 (en) * | 2016-12-16 | 2021-07-27 | Amazon Technologies, Inc. | Format independent processing for distributed data |
US11119998B1 (en) * | 2018-11-26 | 2021-09-14 | Amazon Technologies, Inc. | Index and view updates in a ledger-based database |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07160557A (en) * | 1993-12-13 | 1995-06-23 | Hitachi Ltd | Data base access processing method |
US6073134A (en) * | 1997-05-29 | 2000-06-06 | Oracle Corporation | Method article of manufacture, and apparatus for generating a multi-dimensional record management index |
US6356891B1 (en) * | 2000-04-20 | 2002-03-12 | Microsoft Corporation | Identifying indexes on materialized views for database workload |
US7472107B2 (en) * | 2003-06-23 | 2008-12-30 | Microsoft Corporation | Integrating horizontal partitioning into physical database design |
US7376642B2 (en) * | 2004-03-30 | 2008-05-20 | Microsoft Corporation | Integrated full text search system and method |
US7366716B2 (en) * | 2005-05-06 | 2008-04-29 | Microsoft Corporation | Integrating vertical partitioning into physical database design |
US7580941B2 (en) * | 2006-06-13 | 2009-08-25 | Microsoft Corporation | Automated logical database design tuning |
KR20130009754A (en) * | 2010-02-01 | 2013-01-23 | ě íí, ě¸íŹ. | Integrated advertising system |
US20160041999A1 (en) * | 2011-07-22 | 2016-02-11 | Google Inc. | Query analyzer |
US9965497B2 (en) | 2014-05-29 | 2018-05-08 | Oracle International Corporation | Moving data between partitions |
GB201615747D0 (en) * | 2016-09-15 | 2016-11-02 | Gb Gas Holdings Ltd | System for data management in a large scale data repository |
US10216862B1 (en) * | 2016-09-26 | 2019-02-26 | Splunk Inc. | Predictive estimation for ingestion, performance and utilization in a data indexing and query system |
US10437807B1 (en) * | 2017-07-06 | 2019-10-08 | Palantir Technologies Inc. | Selecting backing stores based on data request |
US11639998B2 (en) * | 2018-09-28 | 2023-05-02 | Apple Inc. | Optical communication system for position-finding of a portable electronic device in free space |
-
2019
- 2019-10-10 US US16/597,984 patent/US11226963B2/en active Active
- 2019-10-10 WO PCT/US2019/055503 patent/WO2020077027A1/en active Application Filing
- 2019-10-10 US US16/597,974 patent/US11347740B2/en active Active
-
2021
- 2021-12-13 US US17/643,965 patent/US20220164345A1/en not_active Abandoned
Patent Citations (87)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6163782A (en) * | 1997-11-19 | 2000-12-19 | At&T Corp. | Efficient and effective distributed information management |
US20110173156A1 (en) * | 1999-01-29 | 2011-07-14 | Oracle International Corporation | Techniques for automatically discovering a database device on a network |
US6434520B1 (en) * | 1999-04-16 | 2002-08-13 | International Business Machines Corporation | System and method for indexing and querying audio archives |
US20030093407A1 (en) * | 1999-05-21 | 2003-05-15 | Roberta Jo Cochrane | Incremental maintenance of summary tables with complex grouping expressions |
US6763352B2 (en) * | 1999-05-21 | 2004-07-13 | International Business Machines Corporation | Incremental maintenance of summary tables with complex grouping expressions |
US20020095430A1 (en) * | 1999-12-30 | 2002-07-18 | Decode Genetics Ehf | SQL query generator utilizing matrix structures |
US6785673B1 (en) * | 2000-02-09 | 2004-08-31 | At&T Corp. | Method for converting relational data into XML |
US20050216447A1 (en) * | 2000-03-30 | 2005-09-29 | Iqbal Talib | Methods and systems for enabling efficient retrieval of documents from a document archive |
US20010044758A1 (en) * | 2000-03-30 | 2001-11-22 | Iqbal Talib | Methods and systems for enabling efficient search and retrieval of products from an electronic product catalog |
US20010047353A1 (en) * | 2000-03-30 | 2001-11-29 | Iqbal Talib | Methods and systems for enabling efficient search and retrieval of records from a collection of biological data |
US6366903B1 (en) * | 2000-04-20 | 2002-04-02 | Microsoft Corporation | Index and materialized view selection for a given workload |
US20080295092A1 (en) * | 2000-09-08 | 2008-11-27 | Dean Tan | Techniques for automatically installing and configuring database applications |
US6795830B1 (en) * | 2000-09-08 | 2004-09-21 | Oracle International Corporation | Techniques for providing off-host storage for a database application |
US20120303584A1 (en) * | 2000-09-08 | 2012-11-29 | Oracle International Corporation | Techniques for automatically provisioning a database over a wide area network |
US20080306883A1 (en) * | 2000-09-08 | 2008-12-11 | Jean-Louis Baffier | Techniques for automatically provisioning a database over a wide area network |
US6993657B1 (en) * | 2000-09-08 | 2006-01-31 | Oracle International Corporation | Techniques for managing database systems with a community server |
US7945551B1 (en) * | 2001-06-14 | 2011-05-17 | Oracle International Corporation | Redirection of misses in queryable caches |
US7962521B2 (en) * | 2001-10-12 | 2011-06-14 | Teradata Us, Inc. | Index selection in a database system |
US7499907B2 (en) * | 2001-10-12 | 2009-03-03 | Teradata Us, Inc. | Index selection in a database system |
US20030093408A1 (en) * | 2001-10-12 | 2003-05-15 | Brown Douglas P. | Index selection in a database system |
US7257597B1 (en) * | 2001-12-18 | 2007-08-14 | Siebel Systems, Inc. | Table substitution |
US20030195862A1 (en) * | 2002-04-10 | 2003-10-16 | Harrell James E. | Method and system for providing SQL or other RDBMS access to native xbase application |
US20030212664A1 (en) * | 2002-05-10 | 2003-11-13 | Martin Breining | Querying markup language data sources using a relational query processor |
US20040002954A1 (en) * | 2002-06-26 | 2004-01-01 | Surajit Chaudhuri | Compressing database workloads |
US20040205066A1 (en) * | 2003-04-08 | 2004-10-14 | International Business Machines Corporation | System and method for a multi-level locking hierarchy in a database with multi-dimensional clustering |
US8200612B2 (en) * | 2003-05-07 | 2012-06-12 | Oracle International Corporation | Efficient SQL access to multidimensional data |
US20050187917A1 (en) * | 2003-09-06 | 2005-08-25 | Oracle International Corporation | Method for index tuning of a SQL statement, and index merging for a multi-statement SQL workload, using a cost-based relational query optimizer |
US20070162420A1 (en) * | 2004-01-21 | 2007-07-12 | Oracle International Corporation | Techniques for automatically discovering a database device on a network |
US20050203933A1 (en) * | 2004-03-09 | 2005-09-15 | Microsoft Corporation | Transformation tool for mapping XML to relational database |
US20050210023A1 (en) * | 2004-03-18 | 2005-09-22 | Renato Barrera | Query optimizer using implied predicates |
US20070270658A1 (en) * | 2004-09-03 | 2007-11-22 | Frederick Mitchell | Quantum memory fusion |
US20060085484A1 (en) * | 2004-10-15 | 2006-04-20 | Microsoft Corporation | Database tuning advisor |
US20060085465A1 (en) * | 2004-10-15 | 2006-04-20 | Oracle International Corporation | Method(s) for updating database object metadata |
US20100332526A1 (en) * | 2004-10-15 | 2010-12-30 | Oracle International Corporation | Method(s) For Updating Database Object Metadata |
US7809763B2 (en) * | 2004-10-15 | 2010-10-05 | Oracle International Corporation | Method(s) for updating database object metadata |
US20060235823A1 (en) * | 2005-04-18 | 2006-10-19 | Oracle International Corporation | Integrating RDF data into a relational database system |
US8719250B2 (en) * | 2005-04-18 | 2014-05-06 | Oracle International Corporation | Integrating RDF data into a relational database system |
US20060242563A1 (en) * | 2005-04-22 | 2006-10-26 | Liu Zhen H | Optimizing XSLT based on input XML document structure description and translating XSLT into equivalent XQuery expressions |
US20090307275A1 (en) * | 2005-12-02 | 2009-12-10 | International Business Machines Corporation | System for improving access efficiency in database and method thereof |
US20070214107A1 (en) * | 2006-03-13 | 2007-09-13 | Oracle International Corporation | Dynamic materialized view ranging |
US7676450B2 (en) * | 2006-03-15 | 2010-03-09 | Oracle International Corporation | Null aware anti-join |
US20070219952A1 (en) * | 2006-03-15 | 2007-09-20 | Oracle International Corporation | Null aware anti-join |
US20070226203A1 (en) * | 2006-03-23 | 2007-09-27 | Microsoft Corporation | Generation of query and update views for object relational mapping |
US20070288526A1 (en) * | 2006-06-08 | 2007-12-13 | Emc Corporation | Method and apparatus for processing a database replica |
US20080222136A1 (en) * | 2006-09-15 | 2008-09-11 | John Yates | Technique for compressing columns of data |
US7606795B2 (en) * | 2007-02-08 | 2009-10-20 | International Business Machines Corporation | System and method for verifying the integrity and completeness of records |
US20080195583A1 (en) * | 2007-02-08 | 2008-08-14 | International Business Machines Corporation | System And Method For Verifying The Integrity And Completeness Of Records |
US20080270372A1 (en) * | 2007-02-08 | 2008-10-30 | International Business Machines Corporation | System And Method For Verifying The Integrity And Completeness Of Records |
US20100217759A1 (en) * | 2007-05-11 | 2010-08-26 | Nec Corporation | System, method, and program product for database restructuring support |
US7827168B2 (en) * | 2007-05-30 | 2010-11-02 | Red Hat, Inc. | Index clustering for full text search engines |
US20090064160A1 (en) * | 2007-08-31 | 2009-03-05 | Microsoft Corporation | Transparent lazy maintenance of indexes and materialized views |
US20110276355A1 (en) * | 2007-11-05 | 2011-11-10 | Michael Bruce | Method and system for informed decision making to locate a workforce |
US20090182724A1 (en) * | 2008-01-11 | 2009-07-16 | Paul Reuben Day | Database Query Optimization Using Index Carryover to Subset an Index |
US20100281005A1 (en) * | 2009-05-04 | 2010-11-04 | Microsoft Corporation | Asynchronous Database Index Maintenance |
US8266101B1 (en) * | 2009-07-16 | 2012-09-11 | Binpeng Shuai | Share nothing database cluster and real time synchronization by signaling |
US8892569B2 (en) * | 2010-12-23 | 2014-11-18 | Ianywhere Solutions, Inc. | Indexing spatial data with a quadtree index having cost-based query decomposition |
US20130124545A1 (en) * | 2011-11-15 | 2013-05-16 | Business Objects Software Limited | System and method implementing a text analysis repository |
US9195936B1 (en) * | 2011-12-30 | 2015-11-24 | Pegasystems Inc. | System and method for updating or modifying an application without manual coding |
US20130173955A1 (en) * | 2012-01-04 | 2013-07-04 | Xtremlo Ltd | Data protection in a random access disk array |
US20140172914A1 (en) * | 2012-12-14 | 2014-06-19 | Microsoft Corporation | Graph query processing using plurality of engines |
US20140244606A1 (en) * | 2013-01-18 | 2014-08-28 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus and system for storing, reading the directory index |
US20140280028A1 (en) * | 2013-03-14 | 2014-09-18 | Oracle International Corporation | Selective materialized view refresh |
US20170206256A1 (en) * | 2013-03-15 | 2017-07-20 | Amazon Technologies, Inc. | Scalable analysis platform for semi-structured data |
US20140279838A1 (en) * | 2013-03-15 | 2014-09-18 | Amiato, Inc. | Scalable Analysis Platform For Semi-Structured Data |
US9659058B2 (en) * | 2013-03-22 | 2017-05-23 | X1 Discovery, Inc. | Methods and systems for federation of results from search indexing |
US20140317087A1 (en) * | 2013-04-22 | 2014-10-23 | Salesforce.Com, Inc. | Systems and methods for implementing and maintaining sampled tables in a database system |
US9141676B2 (en) * | 2013-12-02 | 2015-09-22 | Rakuten Usa, Inc. | Systems and methods of modeling object networks |
US20150278255A1 (en) * | 2014-03-27 | 2015-10-01 | International Business Machines Corporation | System and method for optimizing database definitions for a new database |
US20150356314A1 (en) * | 2014-06-10 | 2015-12-10 | Salesforce.Com, Inc. | Systems and methods for implementing an encrypted search index |
US10037355B2 (en) * | 2015-07-07 | 2018-07-31 | Futurewei Technologies, Inc. | Mechanisms for merging index structures in MOLAP while preserving query consistency |
US20170024387A1 (en) * | 2015-07-22 | 2017-01-26 | Oracle International Corporation | Approximate distinct counting in a bounded memory |
US20170116275A1 (en) * | 2015-10-21 | 2017-04-27 | International Business Machines Corporation | Adaptive multi-index access plan for database queries |
US20170116264A1 (en) * | 2015-10-21 | 2017-04-27 | International Business Machines Corporation | Adaptive multi-index access plan for database queries |
US20170116241A1 (en) * | 2015-10-21 | 2017-04-27 | Oracle International Corporation | Non-blocking database table alteration |
US20170177700A1 (en) * | 2015-12-17 | 2017-06-22 | Sap Se | Updating a partitioning column |
US20170185326A1 (en) * | 2015-12-29 | 2017-06-29 | Emc Corporation | Consistent transition from asynchronous to synchronous replication in hash-based storage systems |
US10705907B1 (en) * | 2016-03-24 | 2020-07-07 | EMC IP Holding Company LLC | Data protection in a heterogeneous random access storage array |
US10324782B1 (en) * | 2016-03-24 | 2019-06-18 | Emc Corporation | Hiccup management in a storage array |
US10152371B1 (en) * | 2016-09-30 | 2018-12-11 | EMC IP Holding Company LLC | End-to-end data protection for distributed storage |
US20180096006A1 (en) * | 2016-09-30 | 2018-04-05 | Microsoft Technology Licensing, Llc | Workload-driven recommendations for columnstore and rowstore indexes in relational databases |
US10528599B1 (en) * | 2016-12-16 | 2020-01-07 | Amazon Technologies, Inc. | Tiered data processing for distributed data |
US11074261B1 (en) * | 2016-12-16 | 2021-07-27 | Amazon Technologies, Inc. | Format independent processing for distributed data |
US20180285418A1 (en) * | 2017-03-31 | 2018-10-04 | Amazon Technologies, Inc. | Executing queries for structured data and not-structured data |
US10713247B2 (en) * | 2017-03-31 | 2020-07-14 | Amazon Technologies, Inc. | Executing queries for structured data and not-structured data |
US10769148B1 (en) * | 2017-09-25 | 2020-09-08 | Amazon Technologies, Inc. | Relocating data sharing operations for query processing |
US10810202B2 (en) * | 2018-06-14 | 2020-10-20 | Microsoft Technology Licensing, Llc | Execution plan stitching |
US11119998B1 (en) * | 2018-11-26 | 2021-09-14 | Amazon Technologies, Inc. | Index and view updates in a ledger-based database |
Also Published As
Publication number | Publication date |
---|---|
WO2020077027A1 (en) | 2020-04-16 |
US11347740B2 (en) | 2022-05-31 |
US20200117676A1 (en) | 2020-04-16 |
US11226963B2 (en) | 2022-01-18 |
US20200117663A1 (en) | 2020-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220164345A1 (en) | Managed query execution platform, and methods thereof | |
KR102627690B1 (en) | Dimensional context propagation techniques for optimizing SKB query plans | |
US11537635B2 (en) | Hadoop OLAP engine | |
US10528599B1 (en) | Tiered data processing for distributed data | |
JP6617117B2 (en) | Scalable analysis platform for semi-structured data | |
US10713247B2 (en) | Executing queries for structured data and not-structured data | |
JP6416194B2 (en) | Scalable analytic platform for semi-structured data | |
US9842152B2 (en) | Transparent discovery of semi-structured data schema | |
CN107408114B (en) | Identifying join relationships based on transactional access patterns | |
US10061834B1 (en) | Incremental out-of-place updates for datasets in data stores | |
US11074261B1 (en) | Format independent processing for distributed data | |
US10114846B1 (en) | Balanced distribution of sort order values for a multi-column sort order of a relational database | |
US10860562B1 (en) | Dynamic predicate indexing for data stores | |
US11550787B1 (en) | Dynamic generation of match rules for rewriting queries to use materialized views | |
US10776368B1 (en) | Deriving cardinality values from approximate quantile summaries | |
US11809421B2 (en) | System and method for data analytics | |
US20190340272A1 (en) | Systems and related methods for updating attributes of nodes and links in a hierarchical data structure | |
US11243956B1 (en) | Enforcing foreign key constraints for efficient materialized view updates | |
US11789936B2 (en) | Storage engine for hybrid data processing | |
CN115658680A (en) | Data storage method, data query method and related device | |
Zheng et al. | Timo: In-memory temporal query processing for big temporal data |
US20230334037A1 (en) | System and method for data analytics | |
US20230394043A1 (en) | Systems and methods for optimizing queries in a data lake | |
Saravana et al. | A case study on analyzing Uber datasets using Hadoop framework | |
Nikitopoulos et al. | Distributed Storage of Large Knowledge Graphs with Mobility Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |