US20210357453A1 - Query usage based organization for very large databases - Google Patents

Query usage based organization for very large databases

Info

Publication number
US20210357453A1
US20210357453A1 (application US17/337,880)
Authority
US
United States
Prior art keywords
query
collection
data
usage
collections
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/337,880
Inventor
Ron Ben-Natan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ben Natan Ron
Original Assignee
Ron Ben-Natan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ron Ben-Natan filed Critical Ron Ben-Natan
Priority to US17/337,880 priority Critical patent/US20210357453A1/en
Publication of US20210357453A1 publication Critical patent/US20210357453A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 … of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G06F 16/24534 Query rewriting; Transformation
    • G06F 16/24542 Plan optimisation
    • G06F 16/24544 Join order optimisation
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/258 Data format conversion from or to a database
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification
    • G06F 16/80 … of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F 16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G06F 16/84 Mapping; Conversion
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 … using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45504 Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators

Definitions

  • A “collection” is a body of data stored in one or more files, usually as a string of text characters according to a syntax such as JSON (JavaScript Object Notation) or CSV (Comma-Separated Values). Such a collection is often defined as unstructured or semi-structured data due to relaxed field and type restrictions that set it apart from the conventional, rigid typing of relational database management systems (RDBMS). Collections are often referred to as a database, as one or more collections may be grouped under a designation of a database covering a particular topic, task or body of information.
  • A collection of data may represent a massive number of entries, or documents, that would present substantial computational issues to process as a relational data set.
  • A “document” is the first level of decomposition of a collection, and may often be thought of as analogous to a record in a relational system.
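The collection/document distinction above can be illustrated with a small, hypothetical example (field names and values invented for illustration; not from the patent). Note how documents in one collection need not share fields or types, unlike relational rows:

```python
import json

# A hypothetical "collection" of semi-structured JSON documents.
collection = [
    {"user": "alice", "age": 34, "tags": ["admin", "ops"]},
    {"user": "bob", "city": "Boston"},    # no "age" field: allowed
    {"user": "carol", "age": "unknown"},  # "age" as a string: also allowed
]

# Collections are typically stored as strings of text,
# e.g. newline-delimited JSON in one or more files.
ndjson = "\n".join(json.dumps(doc) for doc in collection)
print(ndjson.count("\n") + 1)  # prints 3: three documents
```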
  • A “network” refers to any suitable communications linkage between computing devices, and includes the Internet and other LAN/WAN and wireless arrangements between the user, data storage and processing. Although the figures depict a “cloud” in a particular position, it should be understood that connectivity between devices is generally afforded by the underlying network infrastructure.
  • An “object store” refers to data that has undergone an intake into the present system as a cataloged collection based on the expected usage.
  • The intake transforms the data according to an expected usage, and the data has therefore been “identified” as available for query by the methods disclosed herein.
  • A “query server” refers to a server or application for directing the intake, transforming the data, receiving user queries, and generating instructions for satisfying the user query by operations performed on the transformed data.
  • An “intake” or ingestion refers to identifying and cataloging a cloud-resident collection for query usage by the disclosed approach.
  • The ingestion involves a transformation that varies based on the expected usage, discussed further below, and positions the data to be responsive to the queries according to the expected usage. Deviations are permitted, but may increase response time and compute intensity because the database has not been preprocessed consistent with the query.
  • Document-based, unstructured, NoSQL databases maintain data as documents, typically as either JSON documents or XML documents.
  • A set of documents is maintained within a collection.
  • A server maintains an interface and array of pipelined processors for receiving an unstructured data set having a large collection of documents.
  • The disclosed approach implements a method that “brings the database to the data” rather than “bringing the data to the database” for fulfilling a query request.
  • The disclosed approach architects the database in a NoDB approach where the database can run directly on the data within the object store. This allows performance of queries on any data without moving the data, and also allows usage of inexpensive and reliable storage for housing the data.
  • The disclosed approach makes use of a cloud object store as a globally distributed SAN.
  • The query server can bring up one VM compute node or 100 compute nodes; some can be powerful machines and some very small machines, and all that differs is how they partition the data to run on. Since they all have access to the same “global SAN,” each can run independently on its piece of the data and prepare its part of the result set; the node that received the original query (and sent off a part of the query to each of the other nodes) then stitches the result set together and performs the later part of the query.
  • The generated query instructions are a query language based on a pipeline, which affords partitioning. Most operations can be pushed down to individual VMs, and stitching becomes a small operation. As a result, the query can be partitioned into subdivisions of the collection, and each subdivision then allocated to a particular VM among a number of VMs operating in parallel.
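The partition-and-stitch flow described above can be sketched in miniature. This is a single-process stand-in for the VM fleet (function names invented for illustration): each "VM" filters its own slice of the collection, and a coordinating step stitches the partial result sets together:

```python
def run_on_vm(partition, predicate):
    """Simulates the work pushed down to one VM: filter its slice of the data."""
    return [doc for doc in partition if predicate(doc)]

def partitioned_query(collection, predicate, num_vms):
    # Split the collection into roughly equal partitions, one per VM.
    size = max(1, -(-len(collection) // num_vms))  # ceiling division
    partitions = [collection[i:i + size] for i in range(0, len(collection), size)]
    # Each "VM" runs independently on its partition...
    partial_results = [run_on_vm(p, predicate) for p in partitions]
    # ...and the node that received the query stitches the result sets together.
    return [doc for part in partial_results for doc in part]

docs = [{"id": i, "value": i * 10} for i in range(10)]
result = partitioned_query(docs, lambda d: d["value"] >= 50, num_vms=4)
print(len(result))  # prints 5: documents with value >= 50
```

In a real deployment the final "stitch" may also perform the later, non-pushed-down part of the query (e.g. a global sort or aggregation), which is why it is kept on the coordinating node.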
  • The disclosed approach employs a number of architectural options, all using a cloud object store but in different ways (allowing for different attributes of cost, performance and usability). These options can be described as a decision tree for determining an intake of collections to position, or preprocess, the collections for subsequent queries, discussed in further detail below with respect to FIGS. 4A-4B.
  • The first and most important branching of the decision tree is whether the data is loaded into the object store and the metadata subsequently built over the data, or whether the data is loaded into a columnar database which prepares the metadata during the ingestion and creates files that are stored on the object store as a column store.
  • A second branch in the decision tree is what type of queries (or workload) will be expected: analytic in nature (complex queries), OLTP (simple key-based queries), or both.
  • The intake will generate either a columnar representation of the data (for analytic queries) or indexes (for OLTP queries) next to the data.
  • If we ingest the data through the database, we build these representations at ingestion time. If we instead “bring the database to the data,” we build these representations when we first query the data. In either case, we do build these representations (e.g. columnar).
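As a rough sketch, the second branch of the decision tree (expected workload → intake strategy) might be expressed as follows. The usage labels and strategy names are illustrative assumptions, not terms from the patent:

```python
def choose_intake(expected_usage):
    """Map an expected workload to an intake strategy, per the decision tree:
    analytic workloads get a columnar representation, key-based (OLTP)
    workloads get indexes, and indefinite usage gets catalog-only treatment."""
    strategies = {
        "indefinite": "catalog-name-and-source",  # least effort: record name/location only
        "oltp": "index-on-fields",                # simple key-based lookup queries
        "analytic": "columnar-store",             # full intake into a column store
        "mixed": "columnar-store+index",          # both representations, next to the data
    }
    if expected_usage not in strategies:
        raise ValueError(f"unknown usage: {expected_usage}")
    return strategies[expected_usage]

print(choose_intake("analytic"))  # prints columnar-store
```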
  • FIG. 1 is a context diagram of a computing environment suitable for use with configurations disclosed herein.
  • A user computing device 110 such as a desktop, laptop or mobile device is coupled to a public access network 120 such as the Internet.
  • The computing device 110 sends query instructions 130 via the network 120.
  • A plurality of virtual machines (VMs) 132-1 . . . 132-N (132, generally) are instantiated for performing the query by accessing collections on a cloud store 140.
  • The cloud store 140 is any suitable remote SAS or SAN (Storage Area Network) storage medium, operable for network access from the computing device 110.
  • The cloud store 140 and the VMs 132 are sourced from the same provider.
  • A number of VMs 132 are instantiated for performing the query, discussed further below, by receiving a transmitted raw data collection 144 from the cloud store.
  • Query results 134 are returned and rendered on the computing device 110.
  • FIG. 2 is a flowchart of a particular configuration operable in the environment of FIG. 1 .
  • The method for defining and querying large databases includes, at step 201, receiving an identification of a collection of data in a cloud store 140, such that the cloud store 140 is accessible via a public access network 120 for storing data in a location independent manner.
  • Step 202 involves determining, based on an expected usage of the collection, whether to intake the collection into an object store, such that the object store has a predefined format under the control of the user.
  • An intake operation identifies a transformation of the data to position it for queries according to the expected usage.
  • The intake selects, if the collection remains in the cloud store 140, a catalog organization for the collection, such that the catalog organization defines a structure for accessing the data in the manner called for by the expected usage, as depicted at step 203.
  • The expected usage defines a computational intensity and storage demand of a query directed to the collection. For example, if a collection is expected to be accessed only infrequently, the catalog operation might need only identify the name and source of the collection.
  • The intake generates the catalog organization by associating elements of the collection with a created entry in the catalog organization, as depicted at step 204.
  • The substance and complexity of the created entry can vary significantly based on the expected usage, discussed further below in FIGS. 3 and 4A-4B.
  • FIG. 3 is a block diagram of data flow in the environment of FIG. 1 .
  • A user computing device such as a desktop 110-1 or mobile device 110-2 (laptop, tablet, smartphone) employs a GUI (graphical user interface) to formulate a query request 114.
  • The query request 114 is received by a query server 135 for generating the query instructions 130.
  • The query server 135 oversees the intake for positioning the data based on the expected use, and maintains a catalog 137 of collections 142-1 . . . 142-N available for query.
  • The catalog 137′ may also be stored in the object store 150.
  • The catalog 137 includes metadata that tells the query server 135 the location and format of the data to be queried.
  • The intake instructions 122, discussed further below in FIGS. 4A and 4B, transform the raw data 144 based on the expected usage.
  • An object store 150 stores the collections 142′ depending on the intake, and provides query data 148 to the VMs 132.
  • The actual data may reside in the object store 150 or the cloud store 140, depending on the transformation performed by the intake.
  • The object store 150 may store the collections 142 in a columnar form, an index or lookup form, or a name and location form.
  • An application on the user computing device 110 may operate as the query server 135.
  • Following intake, when the query server 135 has a cataloged identity of collections 142 available for query, it receives a query and an identifier of a collection as a query request 114. The query server 135 determines the catalog 137 organization of the collection 142, and transmits the query, an indication of the catalog organization, and a set of instructions 130 for performing the query to each of the plurality of VMs 132.
  • The intake instructions 122 therefore determine the catalog 137 entry and transformation of each collection 142.
  • The query instructions 130 tell the VMs 132 how to access and perform the query request 114. Therefore, based on the query instructions 130, the collection 142 may further define a plurality of collections 142-N, and performing the query may include performing a join on at least two of the collections 142.
  • The query instructions 130 also specify the VMs 132 for satisfying the query request 114.
  • Each VM 132 has a computing speed and a bandwidth, and the query instructions 130 also determine the plurality of VMs 132 based on a cost of each VM and a number of VMs for accommodating the partitions of data into which the collections 142 may be segmented.
  • Each VM 132 may perform at least a portion of the query until limited by dependencies between the partitions.
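One way to read the cost/count trade-off above: given a number of partitions and a catalog of VM offerings, pick the fleet that covers the partitions at least cost. This is a toy model; the offering names, capacities, and prices are invented, and real planning would also weigh bandwidth and partition dependencies:

```python
def plan_vms(num_partitions, vm_types):
    """Pick the cheapest VM fleet able to cover all partitions.
    vm_types: list of (name, partitions_per_vm, cost_per_vm) tuples."""
    best = None
    for name, capacity, cost in vm_types:
        count = -(-num_partitions // capacity)  # ceiling: VMs of this type needed
        total = count * cost
        if best is None or total < best[2]:
            best = (name, count, total)
    return best

# Hypothetical offerings: many small machines vs. a few large ones.
offerings = [("small", 1, 1.0), ("large", 8, 6.0)]
print(plan_vms(20, offerings))  # prints ('large', 3, 18.0)
```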
  • FIGS. 4A and 4B are a flowchart depicting a query as in FIG. 3.
  • Query processing is preceded by intake for generating a catalog 137 entry for each collection 142 to be queried.
  • The query server 135 receives an identification of a collection of the data 142 in the cloud store 140, as depicted at step 401.
  • A decision is made, at step 402, to determine, based on an expected usage of the collection, whether to intake the collection into the object store 150.
  • An intake is performed on the collection if the expected usage involves a transformation of the collection 142 from the cloud store 140, such that the intake transfers the identified collection from the cloud store 140 to the object store 150, in which the object store has a columnar format.
  • The intake operation stores the collection in a columnar form in the object store, as disclosed in copending U.S. patent application Ser. No. 14/304,497, filed Jun. 13, 2014, entitled “COLUMNAR STORAGE AND PROCESSING OF UNSTRUCTURED DATA,” incorporated herein by reference.
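The columnar transformation can be illustrated in miniature: documents are "shredded" so that each field's values are stored contiguously, letting an analytic query scan only the columns it touches. This is a simplified sketch under the assumption of flat documents; the referenced application describes the actual mechanism:

```python
def shred_to_columns(documents):
    """Convert a list of documents into a column store: one list per field,
    with None marking documents that lack the field."""
    fields = sorted({f for doc in documents for f in doc})
    return {f: [doc.get(f) for doc in documents] for f in fields}

docs = [
    {"user": "alice", "age": 34},
    {"user": "bob"},               # missing "age" becomes None in the column
    {"user": "carol", "age": 29},
]
columns = shred_to_columns(docs)
# An analytic query over "age" now reads one contiguous column:
ages = [a for a in columns["age"] if a is not None]
print(sum(ages) / len(ages))  # prints 31.5
```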
  • Selecting the catalog 137 organization includes determining an intended query usage, as depicted at step 404.
  • A workload defines the expected usage, such that the workload is based on a likelihood of an imminent query of the collection and whether the query includes a lookup or online analytical processing (OLAP).
  • Three organizational levels are depicted; however, alternate catalog approaches could be employed.
  • The query server 135 generates a name and source if the query usage is not foreseeable. This is the least computationally intensive approach and merely identifies the data for subsequent queries.
  • The intake defines acceptable delays in successive query responses based on the collection 142, such that the catalog 137 organization includes a file name and source location of the collection 142, as shown at step 406. This option is best reserved for data collections 142 that may be sparse or duplicative of other collections, such that a definite query cannot be anticipated.
  • Another workload calls for generating an index on a subset of fields, if a keyword lookup query is expected, as depicted at step 407 .
  • This approach is called for if the expected usage is for individual field value based lookup, such that the catalog organization includes an index creation on the individual fields available for query, as depicted at step 408 .
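The per-field index described for lookup-style workloads can be sketched as a mapping from field values to document positions (a minimal inverted-index illustration; in the disclosed approach real indexes would be created at intake and stored next to the data):

```python
def build_field_index(documents, field):
    """Build an inverted index: field value -> list of document positions."""
    index = {}
    for pos, doc in enumerate(documents):
        if field in doc:  # documents lacking the field are simply skipped
            index.setdefault(doc[field], []).append(pos)
    return index

docs = [{"city": "Boston"}, {"city": "Austin"}, {"city": "Boston"}]
idx = build_field_index(docs, "city")
print(idx["Boston"])  # prints [0, 2]: positions of matching documents
```

A key-based lookup then touches only the index and the matching documents, rather than scanning the whole collection.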
  • A greater expected workload calls for generating a columnar organization of the data, if data analytics are expected, as depicted at step 409.
  • This applies when the expected usage is for analytical queries and the catalog organization includes a columnar arrangement of named fields in the collection 142, as shown at step 410; it is similar to the columnar approach of step 405, depending on whether the data is stored in the object store 150 or remains in the cloud store 140.
  • The catalog organization may further include a defer flag indicative of whether to defer creation of the catalog organization until receipt of a query request 114.
  • The query server 135 receives a query request 114 directed to one or more collections 142, as shown at step 411.
  • The query instructions 130 partition the collection 142 into a plurality of partitions, such that each partition contains a subset of elements or documents from the collection 142, as depicted at step 412.
  • The query instructions 130 allocate a plurality of virtual machines (VMs) 132 in a virtualization environment, such that each VM 132 is assigned to a partition of the plurality of partitions, as disclosed at step 413, and the VMs 132 each perform at least a portion of the query until limited by dependencies between the partitions.
  • The query server 135 coalesces the results from each of the VMs to generate results 134 for rendering on the user computing device 110.
  • Programs and methods defined herein are deliverable to a user processing and rendering device in many forms, including but not limited to a) information permanently stored on non-writeable storage media such as ROM devices, b) information alterably stored on writeable non-transitory storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media, or c) information conveyed to a computer through communication media, as in an electronic network such as the Internet or telephone modem lines.
  • The operations and methods may be implemented in a software executable object or as a set of encoded instructions for execution by a processor responsive to the instructions.
  • Alternatively, the operations and methods disclosed herein may be embodied in whole or in part using hardware components, such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.

Abstract

A query server identifies data collections of interest in a cloud store, and categorizes the collections based on an intended usage. Depending on the intended usage, the categorized data may be cataloged, indexed, or undergo a full intake into a column store. In a database of large data collections, some collections may experience sparse or indefinite usage. Cataloging or indexing positions the collections for subsequent query access, but defers the computational burden. The full intake performs a columnar shredding of the collection for facilitating imminent and regular query access. Upon invocation of query activity, an instantiation of virtual machines provided by the cloud store vendor implements query logic, such that the VMs launch in conjunction with the cloud store having the collections. Collections therefore incur processing based on their expected usage—full intake for high query traffic collections, and reduced cataloging for maintaining accessibility of collections of indefinite query interest.

Description

    RELATED APPLICATION
  • This application is a continuation application of earlier filed U.S. patent application Ser. No. 15/452,072 entitled “Query Usage Based Organization for Very Large Databases” (Attorney Docket No. JNR16-02), filed on Mar. 7, 2017, the entire teachings of which are incorporated herein by this reference.
  • BACKGROUND
  • Cloud object stores are increasing in popularity as an alternative to local storage, and have collectively become the largest data stores in the world. Cloud stores provide ultra-reliable storage at a very low cost and provide the simplest path to store and retrieve data. A complementary technology is found with unstructured databases and corresponding large data stores often referred to as “big data.” Unstructured databases are becoming a popular alternative to conventional relational databases due to the relaxed format for data storage and the wider range of data structures that may be stored. In contrast to conventional relational databases, where strong typing imposes data constraints to adhere to a predetermined row and column format, unstructured databases impose no such restrictions. Analytics and statistical data lend themselves well to unstructured databases, and cloud data services offer a readily available mechanism for access.
  • SUMMARY
  • A query server and application identifies data collections of interest in a cloud store, and categorizes the collections based on an intended usage. Depending on the intended usage, the categorized data may be cataloged, indexed, or undergo a full intake into a column store. In a database of large, unstructured data collections, some collections may experience sparse or indefinite usage. Cataloging or indexing positions the collections for subsequent query access, but defers the computational and intake burden. The full intake performs a columnar storage on the collection for facilitating imminent and regular query access. Upon invocation of query activity, an instantiation of virtual machines (VMs) provided by the cloud store vendor implements query logic, such that the VMs launch in conjunction with the cloud store having the collections. Collections therefore incur processing based on their expected usage—full intake for high query traffic collections, and reduced cataloging for maintaining accessibility of collections of indefinite query interest. Upon query invocation, large collections need not be transported because query activity occurs on VMs of the cloud storage service.
  • Configurations herein are based, in part, on the observation that so-called “cloud” object stores are becoming an increasingly popular storage medium, and have virtually unlimited capacity and unsurpassed reliability over conventional local hard drives or SSDs (solid state devices) on a home or office computing device. Major vendors such as AMAZON®, GOOGLE® and MICROSOFT® offer storage services on a volume basis. Unfortunately, conventional approaches to access of such third party storage services suffer from the shortcoming of network latency and transmission volume when accessing large volumes of data. Data from cloud stores must still be retrieved and brought to an application for consuming and processing the data. In a large database, cloud access can impose substantial constraints. In other words, a conventional application or query operating on a cloud store is required to transport the data to the database or application. Accordingly, configurations herein substantially overcome the above described shortcomings by instantiating a query platform via the service on which the data is stored, in effect bringing the database to the data. Upon identification of a collection of data, a query server catalogs the data for use in subsequent queries, and may perform other processing to best transform the data for the subsequent queries. A subsequent user request performs a query using the cataloged data, and is performed very efficiently if the query is consistent with the previously completed transformation. Queries may still be satisfied if the queried collection was cataloged differently, but may incur additional time and processing.
  • Cloud storage vendors also provide computing facilities in the form of virtual machines, which afford access to their cloud stores based on a CPU type and network access bandwidth. Query logic in the forms of scripts or compiled code is brought into the virtual machine, in contrast to transporting the cloud-stored data to the local machine of the user, thus mitigating the need to transport a large database.
  • The cloud-based query platform described above is particularly beneficial with unstructured and semi-structured data stores, due to the size and often sparse nature of accessed fields. Unstructured databases are becoming a popular alternative to conventional relational databases due to the relaxed format for data storage and the wider range of data structures that may be stored. In contrast to conventional relational databases, where strong typing imposes data constraints to adhere to a predetermined row and column format, unstructured databases impose no such restrictions. Very large volumes of data (on the order of millions of records) may often be stored in such an unstructured form, where transport of the entire database is prohibitive in terms of time and/or cost.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
  • FIG. 1 is a context diagram of a computing environment suitable for use with configurations disclosed herein;
  • FIG. 2 is a flowchart of a particular configuration operable in the environment of FIG. 1;
  • FIG. 3 is a block diagram of data flow in the environment of FIG. 1; and
  • FIGS. 4A and 4B are a flowchart depicting a query as in FIG. 3.
  • DETAILED DESCRIPTION
  • Modern computing trends are turning towards the concept of a cloud data store. This represents a largely transparent arrangement of distributed and networked storage accessible to a user in a manner similar to conventional local drives. While the location and manner of storage may not be known to the user, reliability and accessibility are assured by the cloud service provider, who allocates storage to users, typically on a fee per storage unit basis. Cloud storage therefore effectively implements a kind of storage area network (SAN) of enterprise computing formerly used only by specific businesses or corporations, and offers it as a commodity item available to the general user. Equally important is the availability of virtual computing resources, often offered by the vendors of the cloud storage. Sometimes referred to as VMs or hypervisor-based approaches, these allow users to invoke processing resources defined in terms of processor number and type and the bandwidth of data available from the cloud storage. The computing resources can be matched to the demand by invoking a larger number of smaller processors or a small number of fast, powerful processors, and may also be scaled for an appropriate data access rate from the cloud. Configurations herein leverage this capability to consolidate the computing resources, data storage and query tasks in an optimal manner that mitigates transfer or traversal of sparse or seldom used data.
  • In the configurations described below, some terminology can be generalized to normalize the different aspects of the system and provide tangible definition to the virtual and unrestrictive labels common with a virtualized approach to computing. A “collection” is a body of data stored in one or more files, usually as a string of text characters according to a syntax such as JSON (JavaScript Object Notation) or CSV (Comma Separated Values). Such a collection is often defined as unstructured or semi-structured data due to relaxed field and type restrictions that set it apart from the conventional, rigid typing of relational database management systems (RDBMS). Collections are often referred to as a database, as one or more collections may be grouped under the designation of a database covering a particular topic, task or body of information. A collection of data may represent a massive number of entries, or documents, that would present substantial computational issues to process as a relational data set. A “document” is the first level of decomposition of a collection, and may often be thought of as analogous to a record in a relational system.
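As a concrete illustration of these terms, a tiny hypothetical collection of JSON documents might look like the following; the field names and values are invented for this example and are not taken from the disclosure.

```python
import json

# A hypothetical collection: newline-delimited JSON documents with
# relaxed, non-uniform fields (names here are invented for illustration).
raw_collection = """\
{"_id": 1, "name": "sensor-a", "reading": 21.5}
{"_id": 2, "name": "sensor-b", "reading": 19.0, "unit": "C"}
{"_id": 3, "name": "sensor-c"}
"""

# Each line is one "document"; documents need not share a schema,
# in contrast to the fixed row/column format of an RDBMS.
documents = [json.loads(line) for line in raw_collection.splitlines()]
```

Note that the third document omits fields the others carry, which is exactly the sparseness that makes such collections awkward to process as a relational data set.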
  • A “network” refers to any suitable communications linkage between computing devices, and includes the Internet and other LAN/WAN and wireless arrangements between the user, data storage and processing. Although the figures depict a “cloud” in a particular position, it should be understood that connectivity between devices is generally afforded by the underlying network infrastructure.
  • An “object store” refers to data that has undergone an intake into the present system as a cataloged collection based on the expected usage. The intake transforms the data according to an expected usage, and has therefore been “identified” as available for query by the methods disclosed herein.
  • A “query server” refers to a server or application for directing the intake, transforming the data, receiving user queries, and generating instructions for satisfying the user query by operations performed on the transformed data. The figures depict this as a separate node or computing device, but the same could be realized by an application launched on the user computing device.
  • An “intake” or ingestion refers to identifying and cataloging a cloud resident collection for query usage by the disclosed approach. The ingestion involves a transformation that varies based on the expected usage, discussed further below, and positions the data to be responsive to the queries according to the expected usage. Deviations are permitted, but may increase response time and compute intensity because the database has not been preprocessed consistent with the query.
  • Document-based, unstructured, NoSQL databases maintain data as documents, typically as either JSON documents or XML documents. A set of documents is maintained within a collection. A server maintains an interface and array of pipelined processors for receiving an unstructured data set having a large collection of documents.
  • The disclosed approach implements a method that “brings the database to the data” rather than “bring the data to the database” for fulfilling a query request. Rather than loading the data in the object store into another storage mechanism, the disclosed approach architects the database in a NoDB approach where the database can run directly on the data within the object store. This allows performance of queries on any data without moving the data and also allows usage of inexpensive and reliable storage for housing the data. In effect, the disclosed approach makes use of a cloud object store as a globally distributed SAN.
  • One of the benefits of using a “global SAN” is that it allows the database itself to be made totally elastic and able to run anywhere. The query server can bring up one VM compute node or 100 compute nodes; some can be powerful machines and some very small machines, and all that differs is the partition of the data on which each runs. Since they all have access to the same “global SAN,” each can run independently on its piece of the data and prepare its part of the result set, and then the node that received the original query (and sent off a part of the query to each of the other nodes) stitches the result set together and performs the later part of the query. In the disclosed approach, the generated query instructions are a query language based on a pipeline, which affords partitioning. Most operations can be pushed down to individual VMs, and stitching becomes a small operation. As a result, the query can be partitioned into subdivisions of the collection, and then each subdivision allocated to a particular VM of a number of VMs operating in parallel.
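The scatter/stitch pattern described above can be sketched as follows. This is a minimal single-process illustration in which threads stand in for compute nodes; the function names and the example filter are assumptions for illustration, not the patented implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def run_partition(partition):
    """Per-node step: compute a partial result for one slice of the data."""
    matching = [d for d in partition if d.get("status") == "active"]
    return {"count": len(matching),
            "total": sum(d.get("value", 0) for d in matching)}

def scatter_gather(collection, n_nodes):
    """Partition the collection, run each slice on a 'node', then stitch."""
    partitions = [collection[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(run_partition, partitions))
    # The coordinating node stitches the partial results together;
    # because the per-node work dominates, this is a small operation.
    return {"count": sum(p["count"] for p in partials),
            "total": sum(p["total"] for p in partials)}

docs = [{"status": "active", "value": v} for v in (1, 2, 3)] + [{"status": "idle"}]
result = scatter_gather(docs, n_nodes=2)
```

In a real deployment each partition would be read directly from the shared object store by its VM, so no node ever needs the whole collection.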
  • The disclosed approach employs a number of architectural options, all using a cloud object store, but in different ways (allowing for different attributes of cost, performance and usability). These options can be described as a decision tree for determining an intake of collections to position, or preprocess, the collections for subsequent queries, discussed in further detail below with respect to FIGS. 4A-4B.
  • While the data will generally reside on a cloud store, the first and most important branching of the decision tree is whether the data is loaded into the object store and then metadata subsequently built over the data, or whether the data is loaded into a columnar database which prepares the metadata during the ingestion and creates files that are stored on the object store as a column store.
  • A second branch in the decision tree is what type of queries (or workload) will be expected: will the workload be analytic in nature (complex queries), OLTP (simple key-based queries), or both. Based on the expected usage of the user, the intake will generate either a columnar representation of the data (for analytic queries) or indexes (for OLTP queries) next to the data. In the case where the data is ingested through the database, these structures are built at ingestion time. In the case where the “database is brought to the data,” these representations are built when the data is first queried. When such representations (e.g. columnar) are built, the resulting data sets are also stored on the object store next to the original data. Building columnar and indexed representations is expensive, so a representation is built upon first use and stored for future use. This also implies management of invalidation: the original files are hashed to determine whether they have changed, and if they have changed, the built structures are discarded and rebuilt. It is therefore important to manage metadata on a fairly granular level; for example, on Amazon S3, metadata is gathered per file and per bucket, and the metadata structures are maintained at the same level and within the same bucket structure.
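The hash-based invalidation scheme described above might be sketched as follows; the function names and the choice of SHA-256 are illustrative assumptions, and in practice the recorded hashes would live in the same bucket as the derived structures.

```python
import hashlib

# Maps a source-file key to the hash recorded when its derived
# (columnar or index) representation was last built.
metadata = {}

def content_hash(data: bytes) -> str:
    """Fingerprint of the original file's contents."""
    return hashlib.sha256(data).hexdigest()

def needs_rebuild(key: str, data: bytes) -> bool:
    """True when no derived structure exists or the source has changed,
    in which case the built structure is thrown away and rebuilt."""
    return metadata.get(key) != content_hash(data)

def record_build(key: str, data: bytes) -> None:
    """Record the source hash alongside the freshly built structure."""
    metadata[key] = content_hash(data)
```

The cost of rebuilding is incurred only when a source file actually changes, which is what makes storing the expensive columnar and indexed representations for future use worthwhile.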
  • FIG. 1 is a context diagram of a computing environment suitable for use with configurations disclosed herein. Referring to FIG. 1, in a query environment 100, a user computing device 110, such as a desktop, laptop or mobile device, is coupled to a public access network 120 such as the Internet. In response to a query request, the computing device 110 sends query instructions 130 via the network 120. In response, a plurality of virtual machines (VMs) 132-1 . . . 132-N (132 generally) are instantiated for performing the query by accessing collections on a cloud store 140. The cloud store 140 is any suitable remote SAS or SAN (Storage Area Network) storage medium, operable for network access from the computing device 110. Typically the cloud store 140 and the VMs 132 are sourced from the same provider. A number of VMs 132 are instantiated for performing the query, discussed further below, by receiving a transmitted raw data collection 144 from the cloud store. Upon completion, query results 134 are returned and rendered on the computing device 110.
  • FIG. 2 is a flowchart of a particular configuration operable in the environment of FIG. 1. Referring to FIGS. 1 and 2, the method for defining and querying large databases includes, at step 201, receiving an identification of a collection of data in a cloud store 140, such that the cloud store 140 is accessible via a public access network 120 for storing data in a location independent manner. Step 202 involves determining, based on an expected usage of the collection, whether to intake the collection into an object store, such that the object store has a predefined format under the control of the user. An intake operation identifies a transformation of the data to position it for queries according to the expected usage. The intake selects, if the collection remains in the cloud store 140, a catalog organization for the collection, such that the catalog organization defines a structure for accessing the data in the manner called for by the expected usage, as depicted at step 203. The expected usage defines a computational intensity and storage demand of a query directed to the collection. For example, if a collection is expected to be accessed only infrequently, the catalog operation might need only identify the name and source of the collection. In general, the intake generates the catalog organization by associating elements of the collection with a created entry in the catalog organization, as depicted at step 204. The substance and complexity of the created entry can vary significantly based on the expected usage, discussed further below in FIGS. 3 and 4A-4B.
  • FIG. 3 is a block diagram of data flow in the environment of FIG. 1. Referring to FIGS. 1-3, from the user perspective, a user computing device such as a desktop 110-1 or mobile device 110-2 (laptop, tablet, smartphone) employs a GUI (graphical user interface) to formulate a query request 114. The query request 114 is received by a query server 135 for generating the query instructions 130.
  • Prior to query request 114 fulfillment, the query server 135 oversees the intake for positioning the data based on the expected use, and maintains a catalog 137 of collections 142-1 . . . 142-N available for query. Alternatively, the catalog 137′ may also be stored in the object store 150. Generally, the catalog 137 includes metadata that tells the query server 135 the location and format of the data to be queried. The intake instructions 122, discussed further below in FIGS. 4A and 4B, transform the raw data 144 based on the expected usage. An object store 150 stores the collections 142′ depending on the intake, and provides query data 148 to the VMs 132. The actual data may reside in the object store 150 or the cloud store 140 depending on the transformation performed by the intake. The object store 150 may store the collections 142 in a columnar form, an index or lookup form, or a name and location form. Alternatively, an application on the user computing device 110 may operate as the query server 135.
  • Following intake, where the query server 135 now has a cataloged identity of collections 142 available for query, the query server 135 receives a query and an identifier of a collection as a query request 114. The query server 135 determines the catalog 137 organization of the collection 142, and transmits the query, an indication of the catalog organization, and a set of instructions 130 for performing the query to each of the plurality of VMs 132 for performing the query.
  • The intake instructions 122 therefore determine the catalog 137 entry and transformation of each collection 142. The query instructions 130 tell the VMs 132 how to access and perform the query request 114. Therefore, based on the query instructions 130, the collection 142 may further define a plurality of collections 142-N, and performing the query may include performing a join on at least two of the collections 142.
  • The query instructions 130 also specify the VMs 132 for satisfying the query request 114. Each VM 132 has a computing speed and a bandwidth, and the query instructions 130 also determine the plurality of VMs 132 based on a cost of each VM and a number of VMs for accommodating the partitions of data into which the collections 142 may be segmented. Each VM 132 may perform at least a portion of the query until limited by dependencies between the partitions.
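As a simplified illustration of trading off VM count against cost, consider the following sketch; the function and its cost model are hypothetical, not taken from the disclosure.

```python
def choose_vm_count(n_partitions, cost_per_vm, budget):
    """Allocate as many VMs as the budget allows, capped at one VM per
    partition, and never fewer than one VM overall."""
    affordable = int(budget // cost_per_vm)
    return max(1, min(n_partitions, affordable))
```

A richer model would also weigh per-VM computing speed and bandwidth, since a few powerful machines and many small ones can service the same set of partitions at different cost points.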
  • FIGS. 4A and 4B are a flowchart depicting a query as in FIG. 3. Query processing is preceded by intake for generating a catalog 137 entry for each collection 142 to be queried. The query server 135 receives an identification of a collection of the data 142 in the cloud store 140, as depicted at step 401. A decision is made, at step 402, to determine, based on an expected usage of the collection, whether to intake the collection into the object store 150.
  • At step 403, an intake is performed on the collection if the expected usage involves a transformation of the collection 142 from the cloud store 140, such that the intake transfers the identified collection from the cloud store 140 to the object store 150, in which the object store has a columnar format. This positions the data in a format readily accessible to the user for extended and complex OLAP (Online Analytic Processing) queries using the efficiency afforded by the columnar format. In this scenario, the intake operation stores the collection in a columnar form in the object store, as disclosed in copending U.S. patent application Ser. No. 14/304,497, filed Jun. 13, 2014, entitled “COLUMNAR STORAGE AND PROCESSING OF UNSTRUCTURED DATA,” incorporated herein by reference.
  • Alternatively, selecting the catalog 137 organization includes determining an intended query usage, as depicted at step 404. A workload defines the expected usage, such that the workload is based on a likelihood of an imminent query of the collection and whether the query includes a lookup or online analytical processing (OLAP). Three organizational levels are depicted; however, alternate catalog approaches could be employed. At step 405, the query server 135 generates a name and source, if the query usage is not foreseeable. This is the least computationally intensive approach and merely identifies the data for subsequent queries. It is selected when the expected usage indicates an indefinite need for use of the collection 142 and the intake defines acceptable delays in successive query responses based on the collection 142, such that the catalog 137 organization includes a file name and source location of the collection 142, as shown at step 406. It is best reserved for data collections 142 that may be sparse or duplicative of other collections, such that a definite query cannot be anticipated.
  • Another workload calls for generating an index on a subset of fields, if a keyword lookup query is expected, as depicted at step 407. This approach is called for if the expected usage is for individual field value based lookup, such that the catalog organization includes an index creation on the individual fields available for query, as depicted at step 408.
  • A greater expected workload calls for generating a columnar organization of the data, if data analytics are expected, as depicted at step 409. This applies when the expected usage is for analytical queries and the catalog organization includes a columnar arrangement of named fields in the collection 142, as shown at step 410, and is similar to the columnar intake approach of step 403, depending on whether data is stored in the object store 150 or remains in the cloud store 140.
  • In each of the above approaches, the catalog organization may further include a defer flag indicative of whether to defer creation of the catalog organization until receipt of a query request 114.
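The three organizational levels and the defer flag discussed above might be modeled as follows; the labels and dictionary shape are assumptions made for illustration, not the disclosed implementation.

```python
def select_catalog_organization(expected_usage, defer=False):
    """Map the expected workload to one of the three catalog levels
    described at steps 405-410 (labels here are illustrative)."""
    levels = {
        "unknown": "name_and_source",  # usage not foreseeable: record name/location only
        "lookup": "index",             # key-based OLTP: index the queried fields
        "analytics": "columnar",       # complex OLAP: columnar arrangement of fields
    }
    return {"organization": levels[expected_usage], "defer": defer}
```

Setting the defer flag would postpone building the chosen organization until the first query request arrives, matching the lazy "build on first use" behavior described earlier.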
  • Turning now to the user experience, which normally follows the intake consideration but could occur prior, depending on the defer flag, the query server 135 receives a query request 114 directed to one or more collections 142, as shown at step 411. The query instructions 130 partition the collection 142 into a plurality of partitions, such that each partition contains a subset of elements or documents from the collection 142, as depicted at step 412. The query instructions 130 allocate a plurality of virtual machines (VMs) 132 in a virtualization environment, such that each VM 132 is assigned to a partition of the plurality of partitions, as disclosed at step 413, and the VMs 132 each perform at least a portion of the query until limited by dependencies between the partitions. For example, summations or sorting of a field across all documents or elements may be deferred. Following completion of each of the VMs, the query server 135 coalesces the results from each of the VMs to generate results 134 for rendering on the user computing device 110.
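The deferral of cross-partition operations can be illustrated with a global sort: each VM sorts only its own partition, and the coordinating node performs the final merge once all partitions complete. This is a minimal sketch under those assumptions, not the patented implementation.

```python
import heapq

def sort_partition(partition, field):
    """Per-VM step: a node can sort only the documents in its own slice."""
    return sorted(partition, key=lambda d: d[field])

def merge_sorted(partials, field):
    """Coordinator step: the cross-partition dependency (a total order)
    is deferred to this final merge of the per-partition results."""
    return list(heapq.merge(*partials, key=lambda d: d[field]))

# Two partitions, each sorted independently, then coalesced.
parts = [[{"v": 3}, {"v": 1}], [{"v": 2}, {"v": 4}]]
sorted_parts = [sort_partition(p, "v") for p in parts]
final = merge_sorted(sorted_parts, "v")
```

Because each per-partition sort is independent, it can be pushed down to the VMs, leaving the coordinator only the comparatively cheap merge.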
  • Those skilled in the art should readily appreciate that the programs and methods defined herein are deliverable to a user processing and rendering device in many forms, including but not limited to a) information permanently stored on non-writeable storage media such as ROM devices, b) information alterably stored on writeable non-transitory storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media, or c) information conveyed to a computer through communication media, as in an electronic network such as the Internet or telephone modem lines. The operations and methods may be implemented in a software executable object or as a set of encoded instructions for execution by a processor responsive to the instructions. Alternatively, the operations and methods disclosed herein may be embodied in whole or in part using hardware components, such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.
  • While the system and methods defined herein have been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims (15)

What is claimed is:
1. A method for defining large databases, comprising:
receiving an identification of a collection of data in a cloud store, the cloud store accessible via a public access network for storing data in a location independent manner;
generating a catalog organization by associating elements of the collection with a created entry in the catalog organization, further comprising selecting the catalog organization by:
determining an intended query usage and:
generating a name and source, if the query usage is not foreseeable;
generating an index on a subset of fields, if a keyword lookup query is expected; and
generating a columnar organization of the data, if data analytics are expected.
2. The method of claim 1, further comprising:
determining, based on an expected usage of the collection, whether to intake the collection into an object store, the object store having a predefined format under the control of the user;
selecting, if the collection remains in the cloud store, the catalog organization for the collection, the catalog organization defining a structure for accessing the data in the manner called for by the expected usage, the expected usage based on a computational intensity and storage demand of a query directed to the collection.
3. The method of claim 2 further comprising performing the intake on the collection if the expected usage involves a transformation of the collection from the cloud store, the intake transferring the identified collection from the cloud store to an object store, the object store having a columnar format.
4. The method of claim 2 wherein a workload defines the expected usage, the workload based on a likelihood of an imminent query of the collection and whether the query includes a lookup or online analytical processing (OLAP).
5. The method of claim 2 wherein the expected usage indicates an indefinite need for use of the collection and the intake defines acceptable delays in successive query responses based on the collection, and the catalog organization includes a file name and source location of the collection.
6. The method of claim 1 wherein the intended query usage is for individual field value based lookup and the catalog organization includes an index creation on the individual fields available for query.
7. The method of claim 1 wherein the intended query usage is for analytical queries and the catalog organization includes a columnar arrangement of named fields in the collection.
8. The method of claim 1 wherein the catalog organization further includes a defer flag indicative of whether to defer creation of the catalog organization until receipt of a query request.
9. The method of claim 1 further comprising:
receiving a query request directed to the collection, the collection residing on the cloud store;
partitioning the collection into a plurality of partitions, each partition containing a subset of elements from the collection;
allocating a plurality of virtual machines (VMs) in a virtualization environment, each VM assigned to a partition of the plurality of partitions; and
performing at least a portion of the query on each VM until limited by dependencies between the partitions.
10. The method of claim 9, further comprising:
receiving a query and an identifier of a collection;
determining the catalog organization of the collection; and
transmitting the query, an indication of the catalog organization, and a set of instructions for performing the query to each of the plurality of VMs for performing the query.
11. The method of claim 10 wherein the collection further comprises a plurality of collections, and performing the query includes performing a join on at least two of the collections.
12. The method of claim 9 wherein each VM has a computing speed and a bandwidth, further comprising determining the plurality of VMs based on a cost of each VM and a number of VMs for accommodating the partitions.
13. A computing device for querying a cloud data store, comprising:
a network interface configured for receiving an identification of a cloud store of a collection of data, the cloud store accessible via a public access network for storing data in a location independent manner;
intake logic to generate a catalog organization by associating elements of the collection with a created entry in the catalog organization, the intake logic further configured to select the catalog organization by:
determining an intended query usage and:
generating a name and source, if the query usage is not foreseeable;
generating an index on a subset of fields, if a keyword lookup query is expected; and
generating a columnar organization of the data, if data analytics are expected.
14. The device of claim 13 further comprising a query server to determine, based on a format of the collection, whether to intake the data set into an object store, the object store having a predefined format under the control of the user.
15. A computer program product on a non-transitory computer readable storage medium having instructions that, when executed by a processor, perform a method for defining and querying large databases, the method comprising:
receiving an identification of a collection of data in a cloud store, the cloud store accessible via a public access network for storing data in a location independent manner;
generating a catalog organization by associating elements of the collection with a created entry in the catalog organization, further comprising selecting the catalog organization by:
determining an intended query usage and:
generating a name and source, if the query usage is not foreseeable;
generating an index on a subset of fields, if a keyword lookup query is expected; and
generating a columnar organization of the data, if data analytics are expected.
US17/337,880 2017-03-07 2021-06-03 Query usage based organization for very large databases Pending US20210357453A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/337,880 US20210357453A1 (en) 2017-03-07 2021-06-03 Query usage based organization for very large databases

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/452,072 US11030241B2 (en) 2017-03-07 2017-03-07 Query usage based organization for very large databases
US17/337,880 US20210357453A1 (en) 2017-03-07 2021-06-03 Query usage based organization for very large databases

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/452,072 Continuation US11030241B2 (en) 2017-03-07 2017-03-07 Query usage based organization for very large databases

Publications (1)

Publication Number Publication Date
US20210357453A1 true US20210357453A1 (en) 2021-11-18

Family

ID=63445384

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/452,072 Active 2037-03-14 US11030241B2 (en) 2017-03-07 2017-03-07 Query usage based organization for very large databases
US17/337,880 Pending US20210357453A1 (en) 2017-03-07 2021-06-03 Query usage based organization for very large databases

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/452,072 Active 2037-03-14 US11030241B2 (en) 2017-03-07 2017-03-07 Query usage based organization for very large databases

Country Status (1)

Country Link
US (2) US11030241B2 (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010049732A1 (en) * 2000-06-01 2001-12-06 Raciborski Nathan F. Content exchange apparatus
US6775676B1 (en) * 2000-09-11 2004-08-10 International Business Machines Corporation Defer dataset creation to improve system manageability for a database system
US20130263117A1 (en) * 2012-03-28 2013-10-03 International Business Machines Corporation Allocating resources to virtual machines via a weighted cost ratio
US20140012881A1 (en) * 2012-07-09 2014-01-09 Sap Ag Storage Advisor for Hybrid-Store Databases
US20140108460A1 (en) * 2012-10-11 2014-04-17 Nuance Communications, Inc. Data store organizing data using semantic classification
US8805820B1 (en) * 2008-04-04 2014-08-12 Emc Corporation Systems and methods for facilitating searches involving multiple indexes
US20150058843A1 (en) * 2013-08-23 2015-02-26 Vmware, Inc. Virtual hadoop manager
US20150199408A1 (en) * 2014-01-14 2015-07-16 Dropbox, Inc. Systems and methods for a high speed query infrastructure
US20160182397A1 (en) * 2014-12-18 2016-06-23 Here Global B.V. Method and apparatus for managing provisioning and utilization of resources
US9507762B1 (en) * 2015-11-19 2016-11-29 International Business Machines Corporation Converting portions of documents between structured and unstructured data formats to improve computing efficiency and schema flexibility
US20180089324A1 (en) * 2016-09-26 2018-03-29 Splunk Inc. Dynamic resource allocation for real-time search
US10713272B1 (en) * 2016-06-30 2020-07-14 Amazon Technologies, Inc. Dynamic generation of data catalogs for accessing data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2426439A1 (en) * 2003-04-23 2004-10-23 Ibm Canada Limited - Ibm Canada Limitee Identifying a workload type for a given workload of database requests
US8583611B2 (en) * 2010-10-22 2013-11-12 Hitachi, Ltd. File server for migration of file and method for migrating file
US10296462B2 (en) * 2013-03-15 2019-05-21 Oracle International Corporation Method to accelerate queries using dynamically generated alternate data formats in flash cache
US10769127B2 (en) * 2015-06-12 2020-09-08 Quest Software Inc. Dynamically optimizing data access patterns using predictive crowdsourcing

Also Published As

Publication number Publication date
US11030241B2 (en) 2021-06-08
US20180260468A1 (en) 2018-09-13

Similar Documents

Publication Publication Date Title
US11163739B2 (en) Database table format conversion based on user data access patterns in a networked computing environment
US10572475B2 (en) Leveraging columnar encoding for query operations
US9251183B2 (en) Managing tenant-specific data sets in a multi-tenant environment
Khazaei et al. How do I choose the right NoSQL solution? A comprehensive theoretical and experimental survey
US20190317666A1 (en) Transactional operations in multi-master distributed data management systems
US9772911B2 (en) Pooling work across multiple transactions for reducing contention in operational analytics systems
US9400700B2 (en) Optimized system for analytics (graphs and sparse matrices) operations
US20140324917A1 (en) Reclamation of empty pages in database tables
GB2519779A (en) Triplestore replicator
US10176205B2 (en) Using parallel insert sub-ranges to insert into a column store
US11487727B2 (en) Resolving versions in an append-only large-scale data store in distributed data management systems
US11210174B2 (en) Automated rollback for database objects
US20190034445A1 (en) Cognitive file and object management for distributed storage environments
US20230110057A1 (en) Multi-tenant, metadata-driven recommendation system
US11704327B2 (en) Querying distributed databases
US11960488B2 (en) Join queries in data virtualization-based architecture
US20210357453A1 (en) Query usage based organization for very large databases
US20230153300A1 (en) Building cross table index in relational database
US11727022B2 (en) Generating a global delta in distributed databases
US11907176B2 (en) Container-based virtualization for testing database system
US11048756B2 (en) Inserting datasets into database systems utilizing hierarchical value lists
US10503731B2 (en) Efficient analysis of distinct aggregations
US20240143594A1 (en) Offloading graph components to persistent storage for reducing resident memory in distributed graph processing
US10747512B2 (en) Partial object instantiation for object oriented applications

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general. Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general. Free format text: FINAL REJECTION MAILED
STPP Information on status: patent application and granting procedure in general. Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general. Free format text: NON FINAL ACTION MAILED