EP2082317A2 - System and method for distributing queries to a group of databases and expediting data access - Google Patents
System and method for distributing queries to a group of databases and expediting data access
- Publication number
- EP2082317A2 (Application EP07871310A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- databases
- database
- query
- fractional
- building
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24539—Query rewriting; Transformation using cached or materialised query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Definitions
- the present invention relates generally to database queries and more specifically to expediting database query results by dynamically querying a combination of multidimensional databases, fractional databases, or data warehouses for fast data analysis.
- a common method of obtaining information from a database is to perform a database search.
- Google.com is an example of a well-known website that enables users to perform searches across large databases and retrieve results.
- Searching databases may be performed in a number of ways. For example, a business intelligence company may need to query a specific set of databases that contain large amounts of data for a particular search. In some cases, the amount of information is quite large and search results may be desired quickly. While querying large databases or warehouses can offer accurate results, performing the query can be time consuming and may not be useful when immediate query results are needed. The time necessary to perform a full search of a database or group of databases to produce results of a certain level of accuracy may take longer than the user desires and provide results that are more accurate than what is actually needed.
- Multi-dimensional databases (MDDBs) and sampled databases are common solutions for expediting query results.
- software and/or hardware driven parallel computing technologies are also used to improve database performance.
- MDDBs may be referred to as summary cubes.
- MDDBs are created by capturing common statistics or common business metrics (also referred to as measures) across common business groupings (also referred to as dimensions). Common dimensions or grouping patterns may be identified so that derived member databases (data cubes) can be constructed to optimize at least one parameter, such as query response time.
- a measure is a business metric or a measure attribute. In the case of an MDDB, common statistics are applied to the selected measures.
- a Dimension is a grouping attribute.
- a fractional or sampled or approximate database (also known as a data mart) is built by taking a representative sample of a larger database. As the name suggests, the results from sampled databases will necessarily be approximate.
- a data warehouse (DW) is a full database (FullDB) containing all the available information in a database or group of databases.
- MDDBs address the challenge of immediate access to data by pre-computing a variety of measure attribute statistics across a number of common grouping attributes.
- the grouping attributes are also referred to as dimensions and measure attributes are referred to simply as measures in the context of MDDBs.
- MDDBs are typically most effective for data analysis only within the predefined dimensions and their member groupings. Users often revert to time-consuming querying of large databases whenever database results are required across categories different from the predefined member groupings available in MDDBs. Users sometimes resort to adding more dimensions to address part of these challenges, which in turn results in very large or unmanageable MDDBs, which could also defeat the purpose of immediate data access.
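- As a small illustration of the pre-computation described above (this sketch is not part of the original disclosure), the following Python snippet builds a summary cube by aggregating a measure across a chosen dimension set; the record fields and attribute names are hypothetical examples.

```python
from collections import defaultdict

def build_cube(records, dimensions, measure):
    """Pre-compute SUM and COUNT of one measure across a set of dimensions.

    records    : iterable of dicts, e.g. {"state": "NY", "race": "A", "salary": 52000}
    dimensions : grouping attribute names, e.g. ("state", "race")
    measure    : name of the numeric measure attribute, e.g. "salary"
    """
    cube = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for rec in records:
        key = tuple(rec[d] for d in dimensions)   # one cube cell per dimension-member combination
        cube[key]["sum"] += rec[measure]
        cube[key]["count"] += 1
    return dict(cube)

# A query grouped exactly on ("state", "race") can then be answered from the cube
# (for example, average salary per cell) without touching the full database.
```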
- Approximate or sampled databases are another alternative to expediting data access.
- Sampled databases address the challenge of predefined groupings because the detailed data within them can be grouped exactly like the original databases. Sampled databases can be pre-built to offer approximate results within a desired level of accuracy for most queries, but may not be effective for detailed queries that return few rows compared to the original results.
- sampled databases are built to address different functions in a business. Each of these sampled databases could be built with a focus on the common groupings within that function, for example, marketing may focus on program or campaign tracking while risk management may focus on portfolio wide risk metrics with different groupings or dimensions in each.
- One of skill in the art should appreciate that building one large sampled database with all the possible groupings could result in a sample size that is too large, defeating the purpose of a sampled database.
- Multiple sampled databases built to focus on frequent sets of grouping attributes can result in duplication of database records across sampled databases, which may lead to wasted storage space due to unnecessary duplication or may lead to data inconsistencies.
- Sourcing samples dynamically to meet desired accuracy levels can address the challenge of volatility of approximate answers from pre-built sampled databases but can be more time consuming than querying pre-built sampled databases. Accordingly, what is needed in the art is an improved method for improving database searching such that expedited results are provided at an appropriate level of accuracy.
- An exemplary method embodiment of the invention includes receiving a query for information from one or more databases, and returning a result to the query for information requested in the query from a hybrid database, the hybrid database being constructed by the steps including: (1) identifying one or more source member databases where query acceleration is required, (2) identifying one or more common dimensions or grouping patterns in a plurality of databases, (3) creating one or more sets of commonly occurring common dimensions, (4) building one or more multi-dimensional databases using the one or more sets of commonly occurring common dimensions, (5) building one or more fractional databases, (6) compacting the one or more fractional databases into a master fractional database, (7) building a hybrid database metadata repository combining metadata from the master fractional database and the one or more multi-dimensional databases and source member databases together referred to as derived member databases, and (8) building a query governor to interface between the query, the hybrid database metadata repository, the plurality of databases, the master fractional database, and the one or more multi-dimensional databases.
- FIG. 1 illustrates a basic system or computing device embodiment of the invention
- FIG. 2A illustrates a method embodiment of the invention
- FIG. 2B illustrates an approach for identifying a set of common groupings or dimension sets that is optimal within preset constraints
- FIG. 3 illustrates a sample hybrid database
- FIG. 4 illustrates how an input query is directed to a member database based on a user-defined or preset accuracy level to expedite query results
- FIG. 5 illustrates an exemplary method for constructing a hybrid database
- FIG. 6 illustrates an exemplary method for constructing and compacting fractional databases
- FIG. 7 illustrates a metadata repository creation process
- an exemplary system for implementing the invention includes a general-purpose computing device 100, including a processing unit (CPU) 120 and a system bus 110 that couples various system components including the system memory such as read only memory (ROM) 140 and random access memory (RAM) 150 to the processing unit 120.
- system memory such as read only memory (ROM) 140 and random access memory (RAM) 150
- Other system memory 130 may be available for use as well.
- the system bus 110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
- a basic input/output system (BIOS) containing the basic routines that help to transfer information between elements within the computing device 100, such as during start-up, is typically stored in ROM 140.
- the computing device 100 further includes storage means such as a hard disk drive 160, a magnetic disk drive, an optical disk drive, tape drive or the like.
- the storage device 160 is connected to the system bus 110 by a drive interface.
- the drives and the associated computer readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computing device 100.
- the basic components are known to those of skill in the art and appropriate variations are contemplated depending on the type of device, such as whether the device is a small, handheld computing device, a desktop computer, or a computer server.
- although the exemplary environment described herein employs the hard disk, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), a cable or wireless signal containing a bit stream and the like, may also be used in the exemplary operating environment.
- an input device 190 represents any number of input mechanisms, such as a microphone for speech, a touch sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth.
- the input may be used by the presenter to indicate the beginning of a speech search query.
- the device output 170 can also be one or more of a number of output means.
- multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100.
- the communications interface 180 generally governs and manages the user input and system output. There is no restriction on the invention operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
- the illustrative embodiment of the present invention is presented as comprising individual functional blocks (including functional blocks labeled as a "processor").
- the functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software.
- the functions of one or more processors presented in FIG. 1 may be provided by a single shared processor or multiple processors.
- Illustrative embodiments may comprise microprocessor and/or digital signal processor (DSP) hardware, read-only memory (ROM) for storing software performing the operations discussed below, and random access memory (RAM) for storing results.
- the present invention addresses a system and method for distributing queries to a group of databases and expediting data access.
- the system can study the pattern of incoming queries and build several data cubes and/or data marts that can quickly provide the answers to those questions.
- a data cube typically contains a reduced number of records from the source database according to query histories.
- a data mart is preferably generated in addition to a data cube and represents a sampling of the source database. So if the data cube cannot answer the query, then a data mart, or sampling of the source database may be accessed.
- the data cubes and data marts may be jointly termed an "answer catalog."
- the system can access these data cubes and data marts to present the answer very quickly. For example, consider a database having 300,000,000 records and a query requesting the average salary by state and by race. Assuming there are ten races across 50 states, the expected return is approximately 500 records.
- a data cube may represent a summarized data mart where the groupings are predetermined. For example, a data cube with state as a dimension and race as a dimension will have 500 records.
- When a query arrives, the system first determines whether the data cubes can address the question; if not, the system utilizes the data marts or sampling of the database. The system, if necessary, may then boost a sample of the source database by dynamically sampling, and finally, if sufficient accuracy cannot be obtained, the system may query the entire source database. In this regard, the system may be able to answer 60% of questions by accessing only the data cubes, the fractional databases or data marts may answer an additional 15% of the questions, a dynamic sampling approach may cover another 15%, and finally only 10% of the queries are answered by searching the entire source database.
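- The tiered routing just described can be sketched in Python; this is a minimal illustration only, and the `covers`, `accuracy`, `run`, and `boost` fields are hypothetical placeholders rather than interfaces defined in the disclosure.

```python
def route_query(query, cubes, data_marts, full_db, required_accuracy=0.95):
    """Answer-catalog lookup: try data cubes, then sampled data marts, then a
    dynamically boosted sample, and only as a last resort the full database.

    cubes / data_marts : lists of dicts {"covers": set, "accuracy": float, "run": callable}
    full_db            : dict {"run": callable, "boost": callable returning a callable or None}
    query              : dict with at least a "groupings" set
    """
    for cube in cubes:
        if query["groupings"] <= cube["covers"]:          # groupings match a pre-built cube
            return cube["run"](query)
    for mart in data_marts:
        if mart["accuracy"] >= required_accuracy:         # approximate answer within tolerance
            return mart["run"](query)
    boosted = full_db["boost"](query, required_accuracy)  # dynamically source extra samples
    if boosted is not None:
        return boosted(query)
    return full_db["run"](query)                          # full scan only as the final fallback
```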
- FIG. 2A illustrates a method embodiment of the invention. First, the method receives a query for information from one or more databases (202).
- the one or more databases may be individually visible or accessible to the user, or they may be accessible only as a conglomerate hybrid database.
- the query may also include a desired level of accuracy of the results.
- the initial query may be a formal structured query or converted from a natural language query or form based querying interface.
- a result is returned to the query from a hybrid database (204).
- the hybrid database is constructed according to the following steps: identifying one or more common dimensions in a plurality of databases (204a) and creating one or more sets of commonly occurring common dimensions (204b).
- Example dimensions include state, race, income level, or any other parameter that may be used to describe data in the database.
- the method includes building one or more multi-dimensional databases using the one or more sets of commonly occurring common dimensions (204c), building one or more fractional databases (204d), compacting the one or more fractional databases into a master fractional database (204e), building a hybrid database metadata repository combining metadata from the master fractional database and the one or more multi-dimensional databases and the source databases (204f), and building a query governor to interface between the query, the hybrid database metadata repository, the plurality of databases, the master fractional database, and the one or more multi-dimensional databases (204g).
- Compacting the one or more fractional databases when constructing the hybrid database is achieved by slice numbering each sample in a full database, and building a plurality of fractional databases by selecting samples starting from the same slice number to maximize record overlap across the plurality of fractional databases.
- FIG. 2B illustrates an approach for identifying optimal sets of dimension sets for the purpose of creating an MDDB and/or sample data marts.
- An optimal set of grouping candidates, which may be subject to potential constraints, can be identified based on past or ongoing usage. Usage, a key input, can be a log of historical queries from a user.
- a number of constraints which may be used for identifying an optimal set of grouping candidates can be determined by systems, a client, and possibly users, and may be applied to meet computational, storage, and response time goals in the HyDB system. The following illustrates several possible constraints and demonstrates the use of such constraints.
- the constraints are applied to optimize at least one parameter such as query response time, member database size, and so forth.
- Pyramid 210 is shown in Figure 2B and generally illustrates the process of identifying optimal sets of grouping candidates for the purpose of developing an MDDB and/or sample data mart.
- the output is an optimal set of grouping candidates, expressed as a list of dimension sets, with each dimension set resulting in an MDDB and/or a sample data mart.
- the optimal sets of grouping candidates may be based on an optimization parameter such as to achieve a quickest query response time.
- An example algorithm for performing the process illustrated in FIG. 2B is to build optimal dimension sets from the ground up by combining 212 individual dimensions based on the frequency or popularity of the combinations. The process is stopped when no new cubes can be built within the constraints. Example constraints are noted above.
- the algorithm starts with a list of dimensions and identifies the members in each.
- the system gathers these lists by simply reviewing the group-by clause.
- the system may apply other principles to find potential grouping candidates from a given query.
- queries may be run on a sample data mart as opposed to the full data warehouse.
- the algorithm selects all individual dimensions to begin with. The system iterates the following steps n-1 times, n being the constraint determining the maximum number of dimensions.
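- A possible rendering of this bottom-up combination step in Python is shown below; it is a sketch under stated assumptions (a query log of group-by attribute lists and a frequency threshold), not the literal algorithm of the disclosure.

```python
from collections import Counter
from itertools import combinations

def build_dimension_sets(query_log, max_dims, min_frequency=2):
    """Grow candidate dimension sets from individual dimensions upward.

    query_log : list of group-by attribute lists, e.g. [["state", "race"], ["state"]]
    max_dims  : the constraint n, the maximum number of dimensions per set
    Returns frequent dimension sets, each a candidate for an MDDB or sample data mart.
    """
    selected = []
    for size in range(1, max_dims + 1):
        counts = Counter()
        for grouping in query_log:
            for combo in combinations(sorted(set(grouping)), size):
                counts[combo] += 1
        new_sets = [dims for dims, freq in counts.items() if freq >= min_frequency]
        if not new_sets:                  # stop when no new cube fits the constraints
            break
        selected.extend(new_sets)
    return selected
```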
- a compact FxDB is built by combining multiple FxDBs and removing any common database rows. The following three steps illustrate the method of creating FxDBs and compacting them into a compact FxDB.
- If two or more FxDBs exist, there could be an opportunity to compact them into one FxDB containing fewer database records than the individual FxDBs contain separately. For instance, consider two FxDBs, each with, say, a 2% sample of the full database. In the best case, both FxDBs contain the same 2% sample, so compacting results in half the number of database records compared to the two of them separately. On the other hand, the two FxDBs can contain different 2% samples and, in such a case, compacting would not yield any benefit.
- FIG. 3 illustrates a sample hybrid database system.
- the hybrid database system 302 is composed of full databases 304, fractional databases 306, multi-dimensional databases 308, hybrid database metadata 310, and a query governor 312.
- the multidimensional databases are created by identifying one or more common dimensions in the full databases and combining the common dimensions that frequently occur or are likely to be requested.
- Fractional databases are subsets or representative samples of complete databases.
- the Hybrid Database Metadata is a repository for metadata from the fractional and multi-dimensional databases and source databases.
- a source database or a source member database is preferably a detailed database or record level database or warehouse.
- the metadata repository stores common metadata information of the member databases and also stores selective data elements to determine if a member database can answer a user query by reviewing only the metadata dictionary, without querying all the physical member databases at query time. To do so, the metadata repository also stores the actual members of lookup tables. Therefore, a determination of whether a member database can answer a question like "Sales for Platinum Product" can be made by reviewing metadata only, specifically without querying the actual member databases, which can be more time consuming.
- a metadata repository (MDR) 318 allows for capturing the metadata of all HyDB data assets into the MDR.
- the MDR 318 enables the query governor to find all data assets that can potentially answer a given query by just reviewing the metadata without accessing the actual data assets. To do so, the MDR 318 typically captures two pieces of information about all data assets: (1) metadata of all data source entities like tables and columns (2) actual lookup data of lookup attributes like STATE with values like New York, North Carolina, Georgia, etc. or PRODUCT like Gold, Rewards, etc.
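- The following Python sketch illustrates how such a metadata-only answerability check could look; the dictionary layout is a hypothetical simplification of the MDR, not a structure taken from the disclosure.

```python
def can_answer(asset_metadata, query_columns, query_filters):
    """Decide from metadata alone whether a member database can answer a query.

    asset_metadata : {"columns": set of column names,
                      "members": {lookup_column: set of member values}}
    query_columns  : columns the query selects or groups by
    query_filters  : required lookup values, e.g. {"PRODUCT": "Platinum"}
    """
    if not set(query_columns) <= asset_metadata["columns"]:
        return False
    for column, value in query_filters.items():
        members = asset_metadata["members"].get(column)
        # Missing lookup column or missing member value means the asset cannot answer.
        if members is None or value not in members:
            return False
    return True

# Example: a cube whose PRODUCT members are {"Gold", "Rewards"} cannot answer
# "Sales for Platinum Product", and this is known without querying the cube itself.
```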
- the metadata associated with any derived member database is methodically and/or automatically absorbed into the MDR using one of database entity ID and base entity ID of the MDR.
- the system only observes the metadata of the source database once or continuously, but the metadata of a derived member database such as a data cube that is built off of the source database is automatically integrated into the MDR.
- the base entity ID is an ID associated with a business metric or field in a database. A field such as "sales" has multiple values, and the system may not want to capture the actual values; where it does not, it may refer to them using only a base entity ID. Other database entities like state, country, and products, which have actual member values, and so forth, are examples of entities referenced by a base entity ID.
- metadata will include information about the Database schema, the Tables, and Columns and lookup data will come from any lookup tables.
- Metadata will include information about the database schema that is holding that cube, the cube itself treated as a fact table, and all the dimensions and measures treated as column entities. All members of the dimensions will represent the lookup data.
- the MDR 318 may be envisioned as a relational database with the following tables: MDR Entity ID: All columns or database fields and members of potential lookup tables are referred to as "Entities" within MDR 318.
- Entities are captured uniquely in an MDR Entity Table for each data asset. A unique record exists in this table for each column and each member of potential lookup tables. This table also captures two key columns for wiring together source member databases and derived member databases like data cubes or fractional data marts: (1) Base Entity ID: If the data asset is a source data asset then Base Entity ID will be null. If it is a derived asset then the Base Entity ID will be the same as the MDR Entity ID of its source data asset. (2) Business Term ID: This ID uniquely represents a business term that is mapped to the MDR Entities.
- a derived data asset, or derived member database is preferably derived from source databases and may refer to pre-computed data cubes or sample or non-sampled data marts.
- the metadata of a derived member database may be automatically absorbed into the MDR using one of database entity ID and base entity ID of the MDR.
- MDR DataSource Metadata In addition to a unique Data Source ID, names & descriptions, this table captures information about member databases like the type of data asset (RDBMS, MDDB, etc.), connection information, etc.
- MDR Table Metadata Captures information like unique Table ID, names, descriptions, lookup type if a lookup table, etc.
- MDR Column Metadata Captures information like unique Column ID, names, descriptions, database data types like Integer, Character, Date, etc., MDR Data types like
- Column ID A unique ID representing the actual column name in a table in a given database. Same column names can therefore appear here from different tables.
- This data field captures meta information about the column, such as whether it is a Dimension or Measure column of a Summary Cube table or MDDB, an ID column of a relational table, or a basic relational column. If it is a Dimensional Column, additional information regarding the type of Dimension is captured. Possible types of Dimensions include Simple Lookup (SL), Range Lookup (RL), Simple Hierarchy (SH), Range Hierarchy (RH), etc. Please refer to the glossary for detailed descriptions of these types of Dimensions. It should be obvious to those skilled in this field that such metadata will allow the Query Governor to identify data sources that can potentially answer a user query without actually querying the physical data source.
- the Query Governor may query the answer catalog by first accessing the data cubes and then the data marts to answer a query.
- Column Database Primary-Foreign Key Lastly, information about the primary and foreign keys of the data sources is also captured. This column can be a primary key or part of a composite primary key similar to information present in common Entity Relationship or ER diagrams.
- MDR Member Data Captures information like unique Member ID, Member Data Values, Member Descriptions, etc.
- Member Data Values and Member Descriptions can capture lookup data of lookup tables of relational databases or dimensions of data cubes. For instance, if STATE is a lookup table containing State-Code and State-Name then State-Code will be stored in Member Data Values and State-Name in Member Descriptions. Additional fields like Member Lower Limit, Member Upper Limit, and Member Parent Value can capture metadata of Range or Hierarchical Dimensions.
- MDR-BusinessTerms Captures BusinessTerm ID, names, and Business Term descriptions. By mapping Business Terms to source member databases, queries written using Business Terms can be translated to either the source or derived member databases.
- For example, if a sales company has one database that uses the term total_sales_amount and another database that uses the term sales_amt for essentially the same type of business data, a mapping may be made in the MDR that enables the user to simply use business terminology generally, and the system will query the appropriate database.
- the mapping preferably occurs in the MDR to the underlying source database. So the MDR translates the queries to be member-specific queries or queries against a source database. Additional columns like "Default Value" and "Required Columns" can be added to these tables to meet specific application needs, and doing so is within the scope of this invention.
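- A minimal Python sketch of such a business-term mapping is given below; the term name, database names, and column names are hypothetical examples, not identifiers from the disclosure.

```python
# Hypothetical mapping from a business term to the column each member database uses.
BUSINESS_TERMS = {
    "total sales": {
        "warehouse_db": "total_sales_amount",
        "marketing_db": "sales_amt",
    },
}

def translate(business_term, target_database):
    """Rewrite a business-term reference into the member-specific column name."""
    mapping = BUSINESS_TERMS.get(business_term.lower())
    if mapping is None or target_database not in mapping:
        raise KeyError(f"no mapping for {business_term!r} in {target_database!r}")
    return mapping[target_database]

# translate("total sales", "marketing_db") returns "sales_amt", so a query written
# with business terminology can be rewritten for whichever member database is chosen.
```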
- the MDR may be thought of as a metadata dictionary that is used to determine which data asset can answer a query at hand.
- the MDR may hold the metadata of the source database that was used to derive cube 1, cube 10 and cube 100.
- the query governor receives a response from the various cubes with regard to a query, in which the individual cubes that can answer the query respond. That information may also be automatically stored within the MDR, namely which data assets are responding to queries and other data associated with the query and response.
- FIG. 7 illustrates an example MDR creation process.
- a first step involves gathering a list of member databases.
- the method includes capturing various parameters such as ID, name, description, type, sub-type and priority. Further, the system gathers a list of tables that belong to each respective database. For each database table gathered in the list of tables, the method includes capturing ID, name, look-up type, alias, and parent information. Further, the process gathers a list of columns. For each table column, the method captures ID, name, data type, metadata type, default value, description and primary key info, and further gathers a list of column members. The method then inquires whether the table is a look-up table. If it is, then for each column member the system captures ID, value, description, lower limit, upper limit, parent value and parent description, and then the initial overall step (702) ends.
- step (704) creates a new driver table that captures all the database entities from step 702.
- Each database entry has a database ID, table ID, column ID and member ID. If the database is of a derived type, then the database entity would also have the ID of the source database.
- Step 706 includes gathering a list of business concepts and assigning a unique ID to each of them. Each concept is matched to one or more of the database entities that were created in step 704. In this manner, the metadata repository may include the necessary data to perform the mapping of a concept to one or more of the database entities.
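- As an illustration of steps 702-706, the following Python sketch flattens a nested database description into driver-table rows; the field names and the nested structure are assumptions made for the example, not the repository layout specified in the disclosure.

```python
def build_driver_table(member_databases):
    """Produce one driver-table row per database entity (column or lookup member).

    member_databases : list of nested dicts, e.g.
      {"db_id": 1, "source_db_id": None, "tables": [
          {"table_id": 10, "is_lookup": True, "columns": [
              {"column_id": 100, "members": [{"member_id": 1000, "value": "NY"}]}]}]}
    """
    rows = []
    for db in member_databases:
        for table in db["tables"]:
            for column in table["columns"]:
                base = {"db_id": db["db_id"], "table_id": table["table_id"],
                        "column_id": column["column_id"],
                        # populated only for derived assets such as cubes or data marts
                        "source_db_id": db.get("source_db_id")}
                members = column.get("members") if table.get("is_lookup") else None
                if members:
                    rows.extend({**base, "member_id": m["member_id"]} for m in members)
                else:
                    rows.append({**base, "member_id": None})
    return rows
```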
- the Hybrid database system may be constructed so as to optimize speed or to optimize accuracy, depending on the application, and may be user configurable. The optimization may be accomplished by a statistical approach to quantitatively determine candidates for grouping based on usage and/or database distributions and common querying patterns.
- the average querying time can be improved by ensuring MDDBs and FxDBs cover most common groupings and frequent usage.
- how a query is federated to the member databases also improves the average query times, i.e., a predetermined sequence of choosing MDDBs over FxDBs over on-demand sampling over querying full databases, or a sequence driven by the load or resources associated with each of the member databases.
- Average querying time can further be improved through sourcing additional slices dynamically to meet a user defined accuracy level at an overall or sub-grouping level.
- Slicing means, for instance, that the census database is divided into 100 pieces, 100 random samples, and each sample is assigned a value 1, 2, 3, 4...100. Analysts are then instructed to pull samples from 1 to n. So analyst 1 pulls a 2% sample, and analyst 2 also pulls a 2% sample. But, this time, if they use the slicing concept, they handle it differently: the first analyst pulls slices 1 and 2 and the second analyst also pulls slices 1 and 2. These two samples have 100% commonality because they are the same slices. So when the two are combined under the concept of slicing, the storage requirement for the two sample data marts is reduced by 50%. This concept can be followed irrespective of the number of samples or number of analysts.
- the result from the HyDB will include Query Results, data source information and confidence bounds, especially when the data source is an FxDB. Overall, this system balances the query access times of MDDBs and the flexibility and accuracy of full databases, expediting database results better than any of the individual member databases.
- the query governor 312 takes user queries and/or user defined accuracy level as input 314 and returns query results, name of the data source, and sampling error bounds as output 316.
- the Query Governor, based on metadata in the HyDB metadata repository, directs the user query to a member database, i.e., either full databases, FxDBs (query marts) or MDDBs (data cubes). The logic for determining which member database is to be queried is described in FIG. 4.
- the query governor 312 optimizes query response time by using at least one of directing queries to faster, pre-computed and aggregated cubes, sampling data marts or dynamically sourcing additional samples to meet required accuracy levels, or querying source databases, based on the input query, the required accuracy level and the performance metrics of the member databases.
- the user may input with the query the desired level of accuracy of the results — such as a query for data with 95% accuracy.
- the query governor allows for flexibility to use MDDBs when groupings match, to use FxDBs when user-defined accuracy can be met, to seamlessly get additional slices to meet the user-defined accuracy level, and only as a last resort to query the FullDB.
- the hybrid database can offer more flexibility than any one of the member databases can individually.
- the hybrid database may be optimized based on a statistical approach to quantitatively determine grouping candidates or common dimensions based on historical or ongoing query usage or data distributions.
- the hybrid database may dynamically source additional samples or slices to meet required accuracy level of query results.
- the source databases are incorporated into the hybrid database as a member database by absorbing metadata into the MDR.
- FIG. 4 illustrates an example of how a user-defined accuracy level can be directed to a member database to expedite query results.
- the Query governor can direct the query to the appropriate member database.
- the following steps further illustrate the process of directing a user query (402) to the appropriate member database as shown in FIG. 4: if the groupings in the user query match those of the MDDBs (404), then use the MDDBs to generate SQL relevant to the MDDBs (406), flag the MDDB(s) that can generate valid results (408), query the flagged MDDBs (410), and output the query results (434).
- Accuracy level is instrumental in directing user queries to the appropriate member databases or in sourcing additional samples or slices. Accuracy levels could also be determined on a per query basis. Accuracy levels could be assumed and preprogrammed for different types of queries. For example, an annual sales report could be presumed to be 100% accurate while a weekly sales-to-date report could be 85% accurate.
- additional slices can be sourced. For instance, if the user-desired accuracy level is 99% and it requires an effective sample size of 5% but the pre-built sampled database has only a 3% effective sample, then the query can be diverted to the FullDB, specifically querying 5 slices.
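- A small Python sketch of this slice-sourcing decision follows; note that the function takes the required effective sample size as an input, since the mapping from an accuracy level to a sample size is not specified here, and the usage line simply repeats the 99%/5%/3% example above.

```python
import math

def slices_needed(required_sample_pct, prebuilt_sample_pct, slice_pct=1.0):
    """Return (total_slices, extra_slices) needed to reach a required effective sample.

    With 1% slices, a query needing a 5% sample that is backed by a 3% pre-built
    FxDB would be redirected to the full database, reading 5 slices in total.
    """
    total = math.ceil(required_sample_pct / slice_pct)
    prebuilt = math.ceil(prebuilt_sample_pct / slice_pct)
    return total, max(0, total - prebuilt)

total, extra = slices_needed(5.0, 3.0)   # (5, 2): 5 slices overall, 2 beyond the FxDB
```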
- FIG. 5 illustrates an exemplary method for constructing a hybrid database.
- a HyDB can be set up by an administrator or client by loading a software product and beginning the setup process.
- the software can then navigate the administrator through a step-by-step process of locating the original database around which a HyDB is going to be built.
- the administrator can input the dimensions sets or select an automatic option for the software to choose dimensions (502).
- MDDBs (506) and FxDBs (504) will be created either immediately or as a batch job at the administrator's request. If the client has some pre-existing MDDBs or FxDBs, then the client may choose to use them as member databases in the HyDB instead of building new ones.
- the metadata of each of the member databases is combined as the metadata of the HyDB (512).
- the system may determine if two or more FxDBs have been built (508). If yes, then the system compacts the FxDBs (510). If not (508), then the system proceeds to step 512.
- the last step before users can seamlessly query the HyDB for expedited results is to enable the Query governor (514).
- An example method for enabling or disabling the Query governor could be part of the administrator control panel or a dashboard. Disabling Query Governor turns off the logic in FIG. 4 and routes the user query directly to the full database.
- HyDB i.e. member databases, Query Governor and HyDB metadata may all reside collectively or individually on one server or multiple servers.
- the Query governor is not dependent on any domain, platform and database type.
- the Query governor could be implemented using C, C++, Java, or any other suitable programming language.
- FIG. 6 illustrates an exemplary method for constructing and compacting fractional databases in step (510) of Fig. 5.
- Two illustrative concepts are central to compacting fractional databases: (1) slice numbering each sample in the full database, and (2) selecting samples starting from the same slice number, for example 1 and upwards.
- the full database is sliced into 100 samples with slice numbers 1 to 100. If an FxDB needs a 2% sample, then slices 1 and 2 will be sourced into FxDB1.
- FxDB2 may contain Slices 1 and 2.
- slice numbering and sourcing slices ensures maximum overlap and therefore maximum benefit from compacting multiple FxDBs.
- Compacting multiple FxDBs allows the data storage requirements to decrease and can improve query access speed.
- the method embodiment may include the system assigning random slice numbers on the full DBs (602), identifying source samples for each individual FxDB using slice numbers beginning with the same slice number, preferably slice number 1, for each population or grouping (604), combining all FxDBs (606), capturing meta-information of each record into a lookup table (608), removing duplicate records identified by database key(s), the same keys as those used in assigning slices on the full DBs (610), and removing the individual FxDBs after compacting (612).
- Aspects of compacting databases include tagging original databases, sourcing samples, and compacting the sampled databases.
- Tagging original database(s) with a random-sample number, or Slice is key to both compacting multiple sampled databases and sourcing additional samples.
- the method of assigning a slice must be consistent across all original databases. For instance, in a consumer database with records added to the database when a new account is originated, the account origination date could be the basis for assigning the slice number, i.e., if an account has slice number n, the next account originated will be n+1, up to the number of slices, at which point the numbering resets to 1 again.
- the range of the slice numbers is determined based on the percent of a database that is captured in each slice. For example, if the percent of the original database captured in a slice is 1% then there will be 100 slices, possibly ranging from 1 to 100.
- Source sampling is disclosed next.
- the system builds individual sampled databases choosing any sampling methodology(s) to meet specific usage requirements, as anyone skilled in the art can do. It is preferable that the selection of these samples from the original database be based on the slice numbers already assigned.
- it is also preferable that the first slice begin at a "set slice number", for instance Slice #1, and the remaining slices be consecutive slices. For example, if a 4% sample is required within a sub-grouping then slices 1-4 will be pulled from the same sub-grouping in the original database. Similarly, if another sub-grouping requires a 3% sample then slices 1-3 will be pulled.
- Two or more databases sampled as described above can be compacted by eliminating the records that are non-unique as identified by database key information, not the full record detail.
- the meta information of the sub-grouping of each record should be captured in something like lookup tables. For example, if record 101 appears in program ABC and program XYZ, then that information must be captured in the equivalent of a lookup table and the record is captured only once in the compact sampled database. By querying the combination of the lookup tables and the compact database, one can obtain the same set of records that can be obtained from the individual sampled databases.
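- The compaction just described can be sketched in Python as follows; the key field name, the FxDB names, and the dictionary shapes are hypothetical choices for the example rather than structures defined in the disclosure.

```python
def compact_fxdbs(fxdbs, key_field="account_id"):
    """Compact several fractional databases into one master FxDB.

    fxdbs : {fxdb_name: list of record dicts}, where each FxDB was sourced from
            consecutive slices starting at slice 1 so that overlap is maximal.
    Returns (compact_records, membership), where membership is the lookup table
    recording which FxDBs each retained record came from.
    """
    compact = {}
    membership = {}
    for name, records in fxdbs.items():
        for rec in records:
            key = rec[key_field]
            compact.setdefault(key, rec)              # keep each record once, by database key
            membership.setdefault(key, set()).add(name)
    return list(compact.values()), membership

# Querying the membership lookup table together with the compact records reproduces
# the contents of any individual FxDB, e.g. all records that belonged to "marketing_mart".
```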
- Embodiments within the scope of the present invention may also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon.
- Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer.
- Such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code structures in the form of computer-executable instructions or data structures.
- Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- Computer- executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, etc. that perform particular tasks or implement particular abstract data types.
- Computer-executable instructions, associated data structures, and program modules represent examples of the program code structures for executing steps of the methods disclosed herein.
- the particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
- Those of skill in the art will appreciate that other embodiments of the invention may be practiced in network computing environments with many types of computing devices or computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
- Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network.
- program modules may be located in both local and remote memory storage devices.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US86380106P | 2006-10-31 | 2006-10-31 | |
US11/929,789 US9747349B2 (en) | 2006-10-30 | 2007-10-30 | System and method for distributing queries to a group of databases and expediting data access |
PCT/US2007/083164 WO2008055202A2 (fr) | 2006-10-31 | 2007-10-31 | System and method for distributing queries to a group of databases and expediting data access |
Publications (2)
Publication Number | Publication Date |
---|---|
EP2082317A2 true EP2082317A2 (fr) | 2009-07-29 |
EP2082317A4 EP2082317A4 (fr) | 2010-10-06 |
Family
ID=39345071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP07871310A Withdrawn EP2082317A4 (fr) | 2006-10-31 | 2007-10-31 | System and method for distributing queries to a group of databases and expediting data access |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP2082317A4 (fr) |
CA (1) | CA2701173A1 (fr) |
WO (1) | WO2008055202A2 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9818141B2 (en) | 2014-01-13 | 2017-11-14 | International Business Machines Corporation | Pricing data according to provenance-based use in a query |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8774561B2 (en) * | 2010-09-01 | 2014-07-08 | Apple Inc. | Consolidating information relating to duplicate images |
US10509788B2 (en) * | 2016-12-30 | 2019-12-17 | Dev Gupta | Systems and methods for lossy data compression using key artifacts and dynamically generated cycles |
CN113392130B (zh) * | 2020-03-13 | 2022-04-29 | Alibaba Group Holding Ltd. | Data processing method, apparatus and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5978788A (en) * | 1997-04-14 | 1999-11-02 | International Business Machines Corporation | System and method for generating multi-representations of a data cube |
US20040215626A1 (en) * | 2003-04-09 | 2004-10-28 | International Business Machines Corporation | Method, system, and program for improving performance of database queries |
US20080270363A1 (en) * | 2007-01-26 | 2008-10-30 | Herbert Dennis Hunt | Cluster processing of a core information matrix |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6912524B2 (en) * | 1998-05-20 | 2005-06-28 | Lucent Technologies, Inc. | Join synopsis-based approximate query answering |
US20030033179A1 (en) * | 2001-08-09 | 2003-02-13 | Katz Steven Bruce | Method for generating customized alerts related to the procurement, sourcing, strategic sourcing and/or sale of one or more items by an enterprise |
KR20060126347A (ko) * | 2004-04-02 | 2006-12-07 | Microsoft Corporation | Adapter framework for LOB application integration |
US8688507B2 (en) * | 2005-03-21 | 2014-04-01 | Oversight Technologies, Inc. | Methods and systems for monitoring transaction entity versions for policy compliance |
- 2007
- 2007-10-31 WO PCT/US2007/083164 patent/WO2008055202A2/fr active Application Filing
- 2007-10-31 EP EP07871310A patent/EP2082317A4/fr not_active Withdrawn
- 2007-10-31 CA CA2701173A patent/CA2701173A1/fr not_active Abandoned
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5978788A (en) * | 1997-04-14 | 1999-11-02 | International Business Machines Corporation | System and method for generating multi-representations of a data cube |
US20040215626A1 (en) * | 2003-04-09 | 2004-10-28 | International Business Machines Corporation | Method, system, and program for improving performance of database queries |
US20080270363A1 (en) * | 2007-01-26 | 2008-10-30 | Herbert Dennis Hunt | Cluster processing of a core information matrix |
Non-Patent Citations (1)
Title |
---|
See also references of WO2008055202A2 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9818141B2 (en) | 2014-01-13 | 2017-11-14 | International Business Machines Corporation | Pricing data according to provenance-based use in a query |
Also Published As
Publication number | Publication date |
---|---|
WO2008055202A2 (fr) | 2008-05-08 |
CA2701173A1 (fr) | 2008-05-08 |
EP2082317A4 (fr) | 2010-10-06 |
WO2008055202A3 (fr) | 2008-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9747349B2 (en) | System and method for distributing queries to a group of databases and expediting data access | |
US11755575B2 (en) | Processing database queries using format conversion | |
US10860598B2 (en) | Systems and methods for interest-driven business intelligence systems including event-oriented data | |
Pereira et al. | Efficient matching for web-based publish/subscribe systems | |
JP6144700B2 (ja) | Scalable analytics platform for semi-structured data | |
US8700605B1 (en) | Estimating rows returned by recursive queries using fanout | |
US8924373B2 (en) | Query plans with parameter markers in place of object identifiers | |
US20230177078A1 (en) | Conversational Database Analysis | |
US7974981B2 (en) | Multi-value property storage and query support | |
US10083223B2 (en) | Identification of common attributes in multiple datasets | |
US20080114733A1 (en) | User-structured data table indexing | |
CN113704300B (zh) | Data imprint technique for use by data retrieval methods | |
JP2001350656A (ja) | Method for integrated access to heterogeneous data sources | |
US10776368B1 (en) | Deriving cardinality values from approximate quantile summaries | |
CN115658680A (zh) | Data storage method, data query method and related apparatus | |
US7925617B2 (en) | Efficiency in processing queries directed to static data sets | |
WO2008055202A2 (fr) | System and method for distributing queries to a group of databases and expediting data access | |
US12105710B2 (en) | Preferred storage of aggregate data items | |
CN107430633B (zh) | System and method for data storage and computer-readable medium | |
Wu et al. | POLYTOPE: a flexible sampling system for answering exploratory queries | |
US10311049B2 (en) | Pattern-based query result enhancement | |
Villarroya et al. | Enabling efficient distributed spatial join on large scale vector-raster data lakes | |
EP3889798A1 (fr) | Orientation de données de chronométrie dynamique | |
Jin et al. | AIDB: a Sparsely Materialized Database for Queries using Machine Learning | |
Jia et al. | LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20090430 |
AK | Designated contracting states | Kind code of ref document: A2; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR |
DAX | Request for extension of the european patent (deleted) | |
A4 | Supplementary search report drawn up and despatched | Effective date: 20100903 |
RIC1 | Information provided on ipc code assigned before grant | Ipc: G06F 7/00 20060101AFI20090520BHEP; Ipc: G06F 17/30 20060101ALI20100830BHEP |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20130503 |