US6029163A - Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer - Google Patents
- Publication number
- US6029163A (application US09/164,400)
- Authority
- US
- United States
- Prior art keywords
- column
- columns
- rows
- query
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24542—Plan optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99932—Access augmentation or optimizing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
Definitions
- the present invention relates to the field of computer systems used for storage and retrieval of data. More specifically, the present invention relates to the field of statistics measurement systems for a relational database management system.
- Computer-implemented relational database management systems (RDBMS) commonly employ data tables that contain columns and rows containing data (e.g., data values).
- A typical RDBMS, in addition to maintaining the data of a particular database, also maintains a set of statistics regarding the data. These statistics are useful in efficiently accessing, manipulating, and presenting the stored data.
- When an RDBMS system receives a query, its optimizer analyzes the structure of the query, analyzes the various clauses (e.g., selection and join predicates) specified in the query, and examines existing data access paths (e.g., indexes) to formulate a strategy (e.g., a method) of performing various relational operations (e.g., aggregation, sort, search, join, etc.) to produce the result of the query.
- the optimizer generally explores various strategies to find the best strategy.
- a best strategy is the one with the lowest estimated RDBMS cost; such a strategy generally takes the least amount of RDBMS resources, the least amount of time, or both to produce the result of the query.
- RDBMS cost is generally evaluated based on input/output (I/O) operations that are required to perform necessary relational operations to produce the result of the query. Therefore, the optimizer selection criterion for the best strategy for a query is a minimum execution cost.
- the minimum execution cost is calculated using available statistics such as table and index cardinalities, workload statistics (e.g. statistics on columns and column groups), storage statistics and assumptions of the manner in which any relational operation typically changes the incoming data size and data distribution.
- Prior RDBMS systems collect workload statistics based on single columns of data. As such, these systems make a zero correlation assumption: that the values of two or more columns of data are not related in any way. However, in many cases, this zero correlation assumption is incorrect. For example, in a data table having a first column of employee age and a second column of employee job position, it is possible that, in general, older employees hold higher job positions. As such, the row values in the first column correlate to the values in the second column.
- the prior art systems compute a result cardinality (result_cardinality) based on the inverse of the distinct cardinality of the separate age column (DC1) multiplied by the inverse of the distinct cardinality of the separate job position column (DC2), and this result is multiplied by the number of employees (#EE).
- the distinct cardinality of a column represents the number of distinct values within that column. See the computation of the result cardinality below:
- result_cardinality = (1 / DC1) × (1 / DC2) × #EE
- the estimated result cardinality represents the average number of rows likely to be produced for a query asking for all employees with a certain age and a certain job position.
- the relationship used above by the prior art system assumes that there is no correlation between the first and second columns. This assumption leads to a computation of a result cardinality value that is much smaller than what is returned in reality due to data correlation.
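The effect of the zero correlation assumption can be illustrated with a small sketch. The table contents and the function names below are hypothetical and chosen only for illustration; the independent estimate follows the relationship described above, (1/DC1) × (1/DC2) × #EE.

```python
from collections import Counter

# Hypothetical employee rows: (age, job position). The two columns are
# perfectly correlated here, which is the worst case for the assumption.
rows = [(30, "engineer")] * 4 + [(40, "manager")] * 4 + [(50, "director")] * 4


def independent_estimate(rows):
    """Result cardinality under the zero correlation assumption."""
    dc1 = len({age for age, _ in rows})       # distinct cardinality of the age column
    dc2 = len({pos for _, pos in rows})       # distinct cardinality of the position column
    return len(rows) / (dc1 * dc2)            # (1/DC1) * (1/DC2) * #EE


def actual_average(rows):
    """Average number of rows per (age, position) combination that occurs."""
    groups = Counter(rows)
    return len(rows) / len(groups)


print(independent_estimate(rows))  # 12 / (3 * 3)
print(actual_average(rows))        # 12 / 3
```

With correlated columns the single-column estimate is smaller than the average number of rows actually returned, which is the underestimate described above.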
- when the above result cardinality value is used by an optimizer, its estimated costs for performing certain relational operations on the subject data become inaccurate. Specifically, this inaccuracy can cause the optimizer to select a strategy that results in worse performance in producing the result. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving more than one data column.
- null data values within rows of a data table can lead to inaccuracies in the workload statistics, which will lead to inaccurate cost estimation analysis by the RDBMS optimizer.
- Most relational operations ignore incoming data rows that contain null data values in key columns. Therefore, it is important to also collect statistics about the number of rows with null data in certain columns or column groups.
- Prior art RDBMS systems that compute certain database statistics (e.g., distinct cardinality of a column) without regard to rows with null data produce inaccurate statistics. The inaccurate statistics can cause the RDBMS optimizer to select a sub-par strategy for a given query. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving columns of rows with null data.
- the present invention provides such a system.
- a manual process is performed by users (e.g., system administrators) to determine the data columns of a database on which to collect statistics.
- the user informs integrated statistics gathering procedures of the separate data columns on which workload statistics are to be gathered.
- the system administrators do not realize the specific mechanisms by which the RDBMS optimizer operates. Relying on users to indicate the data columns on which to collect statistics can be unreliable and inefficient. Namely, not knowing what is needed by the optimizer, the system administrator can cause the statistics gathering procedures to inefficiently over-collect statistics or fail to collect statistics on required columns. What is needed is an efficient method for identifying columns or column groups on which statistics collection is required. What is needed is such a method that does not rely on user origination of the above information.
- the present invention provides a method and system for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for workload queries involving more than one data column.
- the present invention also provides a method and system for collecting statistics within an RDBMS system that provides for accurate estimated cost analysis for queries involving columns of rows with null data values.
- the present invention further provides a method and system within an RDBMS optimizer for identifying the columns or column groups on which statistics collection is required, without relying on manual user identification of the columns or column groups.
- Methods are described for collecting query workload based statistics within a relational database management system (RDBMS) and for automatically identifying columns or groups of columns for which statistics collection is to be performed.
- the novel system collects workload based statistics on multi-columns rather than merely on single columns.
- This multi-column statistic gathering technique of the present invention provides more accurate results when the row values within the multi-columns are correlated to each other.
- the multi-column statistics that result from the present invention lead to better estimated cost analysis performed by an RDBMS optimizer which leads to increased likelihood of finding an optimal strategy for each query in a workload.
- the present invention generates column group statistics ("multi-column statistics"), which represent distinct cardinality measured over rows of multiple columns of data, rather than individual statistics on single columns.
- the novel system also collects separate statistics regarding the number of null data values within the rows of a column. Separate null data statistics improve the cost estimation analysis used by the RDBMS optimizer because the cardinality of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
- the present invention utilizes the separate null data statistics in determining a result distinct cardinality of a column, or column group, of a data table.
- the null data statistics are useful in accurate estimation of cardinality of the result of a relational operation (e.g. a data search or a table join) and in accurate estimation of average cost of performing a relational operation (e.g. a data sort, a data search, a table join).
- the novel system also includes an RDBMS optimizer that automatically identifies columns and groups of columns on which workload based statistics should be generated.
- the clauses within a query such as equi-join and equi-selection predicates and projection columns are analyzed by the RDBMS optimizer to identify columns and column groups of interest.
- the identified columns are then registered within a system catalog of the RDBMS.
- the statistics collection utility of the present invention reads the registered columns and column groups from the system catalog to identify the columns and column groups on which to generate the workload statistics.
- the statistics so generated are then used by the optimizer to perform subsequent estimations of the cost of performing certain relational operations in order to select a strategy for a query having the least cost.
- FIG. 1 illustrates a computer system used in accordance with the embodiments of the present invention.
- FIG. 2 is an illustration of a data table used by embodiments of the present invention including columns and rows of data.
- FIG. 3 is a flow diagram of steps performed by the optimizer of the RDBMS system of the present invention when selecting a lowest cost strategy for a query.
- FIG. 4A and FIG. 4B illustrate a flow diagram of steps of a first embodiment of the present invention for automatic identification of columns and column groups by the RDBMS optimizer on which to collect the workload statistics.
- FIG. 5 is a flow diagram of steps performed by statistics generation procedures of the present invention to generate workload statistics.
- FIG. 6 illustrates a flow diagram of steps of a second embodiment of the present invention for generating non-null workload statistics (e.g., the column duplicity factor), which is a multi-column statistic, on identified column groups.
- FIG. 7 illustrates an exemplary multi-column data table that can be used by the first embodiment of the present invention.
- FIG. 8 illustrates a flow diagram of steps of a third embodiment of the present invention for generating null workload statistics (e.g., the column null factors) on identified column groups.
- FIG. 1 illustrates a computer system 112.
- certain processes (e.g., processes 310, 350, 410, 610, 710, 910, 1010) and steps are discussed that are realized, in one implementation, as a series of instructions (e.g., a software program) that reside within computer readable memory units of system 112 and are executed by processors of system 112. When executed, the instructions cause the computer system 112 to perform specific actions and exhibit specific behavior, which is described in detail below.
- computer system 112 used by the present invention comprises an address/data bus 100 for communicating information, one or more central processors 101 coupled with the bus 100 for processing information and instructions, a computer readable volatile memory unit 102 (e.g., random access memory, static RAM, dynamic RAM, etc.) coupled with the bus 100 for storing information and instructions for the central processor(s) 101, and a computer readable non-volatile memory unit (e.g., read only memory, programmable ROM, flash memory, EPROM, EEPROM, etc.) coupled with the bus 100 for storing static information and instructions for the processor(s) 101.
- System 112 also includes a mass storage computer readable data storage device 104 (hard drive or floppy) such as a magnetic or optical disk and disk drive coupled with the bus 100 for storing information and instructions.
- system 112 can include a display device 105 coupled to the bus 100 for displaying information to the computer user, an optional alphanumeric input device 106 including alphanumeric and function keys coupled to the bus 100 for communicating information and command selections to the central processor(s) 101, an optional cursor control device 107 coupled to the bus for communicating user input information and command selections to the central processor(s) 101, and an optional signal generating device 108 coupled to the bus 100 for communicating command selections to the processor(s) 101.
- system 112 is a DEC Alpha computer system by Digital Equipment Corporation, but could equally be of a number of various well known and commercially available platforms.
- FIG. 2 illustrates an exemplary data table 200 as used by embodiments of the present invention. It is appreciated that data table 200, as well as the columns and rows of data which comprise the table 200, are stored in computer readable memory units of system 112. Exemplary table 200 includes x number of separate data columns ("columns") 210, 220, 230, and 240. Within each column, table 200 includes n rows of data 250-256. Data ("data values") are stored within the rows of each column. A data value can be either non-null (e.g., a known value) or null (e.g., an unknown value or an empty set value). A column group is a collection of multiple columns, e.g., columns 210 and 220.
- references to columns, rows, column groups, and data tables refer to analogous data structures as shown in FIG. 2 and as stored in computer readable memory units of system 112.
- Database performance is directly affected by the ability of a database to efficiently access a row or rows stored on disk (e.g., 104) through input/output (I/O) operations.
- the optimizer of the RDBMS system finds a lowest cost strategy (generally the most I/O efficient) for retrieving the subject data and producing the result of a query. Although the optimizer can consider many strategies, each with a different cost, to find the one with the lowest cost, all strategies yield the same result.
- cost is the estimated number of disk I/Os required to execute a strategy and produce the result of a query. This is the metric used by the optimizer to find an optimal strategy, e.g., the one with the minimal cost.
- Main factors affecting the cost of a strategy include the following: 1) query structure, 2) query predicates (e.g. equi-selections, equi-joins, etc.), 3) result ordering or grouping, 4) access paths available (e.g. hash index, B-tree index), and 5) join methods available (e.g. sort-merge join, hash join). Any database statistics, if available, can help the RDBMS optimizer in significantly improving the accuracy of cost estimation analysis to find an optimal strategy.
- database statistics include cardinality statistics, workload statistics and storage statistics.
- Cardinality statistics include the table cardinality and index cardinality.
- the table cardinality represents the total number of rows in a data table.
- the index cardinality represents the total number of distinct keys (index values) within an index. In other words, the index cardinality is a distinct cardinality of index columns.
- the workload statistics include column duplicity factor and column null factor.
- the column duplicity factor is a ratio of the number of table rows with non-null data in a column group to the distinct cardinality of a column group. It represents average number of duplicates per distinct combination of values in a column group.
- the column null factor is a ratio of the number of table rows with null data in any column of a column group to the table cardinality. It represents a fraction of a data table with null data in rows of a column group.
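The two workload statistics defined above can be computed directly from their definitions. The following is a minimal sketch, assuming a table is held in memory as a list of dictionaries with `None` standing in for null; the function name, the column names and the sample rows are hypothetical.

```python
def column_group_factors(rows, group):
    """Return (duplicity_factor, null_factor) for the columns named in `group`.

    rows  -- list of dicts mapping column name to value (None models null)
    group -- tuple of column names forming the column group
    """
    table_cardinality = len(rows)
    # Rows whose values are non-null in every column of the group.
    non_null = [tuple(r[c] for c in group) for r in rows
                if all(r[c] is not None for c in group)]
    null_rows = table_cardinality - len(non_null)
    distinct_cardinality = len(set(non_null))
    duplicity = len(non_null) / distinct_cardinality if distinct_cardinality else 0.0
    null_factor = null_rows / table_cardinality if table_cardinality else 0.0
    return duplicity, null_factor


table = [
    {"age": 30, "job": 5}, {"age": 30, "job": 5},
    {"age": 40, "job": 7}, {"age": 40, "job": 7},
    {"age": None, "job": 7},
]
print(column_group_factors(table, ("age", "job")))
```

In this sample, four rows are non-null over the group and form two distinct combinations, and one of five rows has a null in the group.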
- FIG. 3 illustrates an exemplary process flow 310 performed within an RDBMS system of the present invention which utilizes an optimizer.
- upon receiving a query, the optimizer: 1) identifies different column groups based on equi-join predicates, equi-selection predicates and projections specified in the query; 2) uses stored workload statistics on identified column groups and other stored database statistics to estimate the costs of performing different relational operations required to produce the result of the query; 3) builds different strategies of executing relational operations in a certain order and estimates the cost for each strategy; and 4) selects the lowest cost strategy.
- process 310 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
- the RDBMS system receives a query, e.g., a user request for data.
- the received query contains certain predicates (e.g., joins, filters, and other conditions).
- system 112 identifies column groups based on equi-joins, equi-selections and projections that are specified in the query.
- the optimizer uses stored workload statistics on identified column groups for cost estimation of different relational operations.
- the optimizer examines the query to determine the different relational operations (e.g., data access paths, sorts, filters, projections, aggregations, joins, etc.) that are needed to complete the query.
- the optimizer of the RDBMS system accesses stored database statistics (cardinality/workload/storage) regarding the subject data of the query received at step 315 and uses this information to estimate the costs of performing necessary relational operations to produce the result of the query. Any of a number of well known methods can be used at step 320 for performing cost analysis of different relational operations used to satisfy the query.
- a number of retrieval strategies are available that the optimizer can utilize in order to perform a data search.
- a well known sequential retrieval strategy can be used where the RDBMS accesses the database pages for a table's logical area sequentially and reads all the rows in the table regardless of the selection expression used in the query.
- a number of well known index retrieval strategies can also be used, including those described below.
- in database key (dbkey) retrieval, the RDBMS accesses table data directly through a database key (e.g., logical address) row pointer.
- the RDBMS accesses a specific index structure (sorted or hashed) and retrieves the index keys, which include the dbkeys, of the rows. The dbkeys are then used to fetch data rows and the data is delivered in index order, if a sorted index is used.
- the RDBMS accesses only the index data and the selected index contains all the table columns specified in the query and no further row fetches are necessary. The data is delivered in index order, if a sorted index is used.
- in OR index retrieval, two or more sorted or hashed indexes that are defined on a single data table are used when the predicates on these indexes are combined with a logical OR in the query.
- the optimizer attempts different strategies to implement the relational operations identified in step 320 and then computes the cost to perform each strategy.
- a strategy is a scheme of performing different relational operations in a certain order. By reordering the execution sequence of the relational operations, different cost values result for executing the same query. For single data tables, the optimizer finds the lowest cost retrieval strategy by evaluating the cost for each possible retrieval strategy.
- the estimated cost for a strategy includes the cost of scanning an index and the cost of fetching data rows.
- the optimizer selects the strategy, of the strategies analyzed in step 322, that yields the lowest estimated cost solution.
- a number of well known optimizer processes can be used by step 330 for selecting the lowest estimated cost strategy.
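The selection step described above amounts to taking the minimum over the estimated costs of the candidate strategies. A minimal sketch follows, assuming each strategy's cost is the sum of an index scan cost and a row fetch cost as stated above; the `Strategy` class, the strategy names and the I/O figures are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    index_scan_ios: float  # estimated I/Os to scan an index (0 if no index is used)
    row_fetch_ios: float   # estimated I/Os to fetch data rows

    @property
    def cost(self) -> float:
        # Estimated cost in disk I/Os: index scan plus row fetches.
        return self.index_scan_ios + self.row_fetch_ios


def select_lowest_cost(strategies):
    """Return the strategy with the minimum estimated cost."""
    return min(strategies, key=lambda s: s.cost)


candidates = [
    Strategy("sequential", 0, 1000),
    Strategy("sorted index", 12, 150),
    Strategy("index only", 40, 0),
]
print(select_lowest_cost(candidates).name)
```

All candidates produce the same query result; only the estimated cost differs, so the quality of the choice depends on the accuracy of the statistics behind the estimates.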
- the optimizer initiates the execution of the relational operations indicated by the selected strategy of step 330 to produce the result of the query. It is appreciated that each of the strategies considered at step 322, although having possibly different estimated costs, yields the same result.
- system 112 reports the output of the executed strategy as the result of the query received at step 315.
- the process 410 of a first embodiment of the present invention is described with reference to the flow diagram of FIG. 4A and FIG. 4B; this embodiment of the present invention provides an automatic method for identifying the column groups for which workload statistics are to be collected within an RDBMS system.
- the column groups for which workload statistics are to be collected are identified from predicates specified in the WHERE and HAVING clauses, and projection lists specified in a GROUP BY or DISTINCT clause of each query within a query workload.
- the equi-selection and equi-join predicates within WHERE and HAVING clauses and projection list within a GROUP BY or DISTINCT clause are used in the above process to identify column groups belonging to different data tables of a database.
- This embodiment of the present invention recognizes that maintaining workload statistics about such identified column groups is very important for optimizing a query because the corresponding conditions have significant influence on determining the cardinality of the query result.
- for equi-selection predicates, the optimizer will need to be informed of the combined selectivity so that the cardinality of the result (which is produced after all equi-selections have been applied) can be computed with high accuracy.
- for equi-join predicates, the optimizer will need to be informed of the join fanout factors so that it can determine the cardinality of a joined result with high accuracy.
- for a projection (e.g., in a GROUP BY or DISTINCT clause), the optimizer will need to be informed of how many distinct groups will be formed, which will be equal to the cardinality of the projected result.
- the workload statistics that are collected on the column groups identified by the optimizer in this embodiment of the present invention include a column duplicity factor and a column null factor, which are described further below.
- the column duplicity factor (second embodiment) provides information about the number of distinct values (e.g., the distinct cardinality) as well as the average number of duplicates per distinct combination of column values in an identified column group.
- the column duplicity factor, which is related to the result cardinality discussed above, can be used to determine the combined selectivity of equi-selection predicates, the combined join fanout factor of equi-join predicates, or the number of distinct groups, which will be equal to the cardinality of the projected result.
- the column null factor (third embodiment) indicates the fraction of rows in a data table that have null values in any column of an identified column group.
- the column null factor can be used to identify a portion of the data table that would not participate in a relational operation (e.g., join, selection).
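Given the definitions of the two factors, the quantities mentioned above can be recovered from the stored statistics and the table cardinality. The sketch below only rearranges those definitions; the function names and sample numbers are hypothetical, and it is not presented as the optimizer's actual estimation procedure.

```python
def non_null_rows(table_cardinality, null_factor):
    """Rows that can participate in a join or selection on the column group."""
    return table_cardinality * (1.0 - null_factor)


def distinct_cardinality(table_cardinality, duplicity_factor, null_factor):
    """Distinct non-null combinations in the column group.

    Follows from duplicity_factor = non_null_rows / distinct_cardinality.
    """
    return non_null_rows(table_cardinality, null_factor) / duplicity_factor


def rows_per_equi_selection(duplicity_factor):
    """Average rows matching one combination of values in the column group."""
    return duplicity_factor


# Hypothetical stored statistics for one column group of a 1000-row table.
print(non_null_rows(1000, 0.1))
print(distinct_cardinality(1000, 4.5, 0.1))
print(rows_per_equi_selection(4.5))
```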
- when the optimizer processes a query to find the best execution strategy (e.g., FIG. 3), it also registers (if not already registered) the information about identified column groups into system catalog portions of computer readable memory units.
- the optimizer identifies the column groups based on equi-selections, equi-joins, and projection lists in various queries in a query workload.
- An advantage of the present invention is that the workload statistics are collected only on the identified column groups and nothing else leading to a more efficient statistics collection procedure. This contrasts with the prior art method of relying on the user to identify the interesting columns or column groups which is a difficult task, and therefore, can be unreliable and inefficient.
- FIG. 4A and FIG. 4B illustrate a flow diagram of process 410 of this embodiment of the present invention.
- Process 410 resides within an optimizer implemented in accordance with the present invention and is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
- Embodiments of the present invention described further below illustrate different statistics generation procedures that can be used to produce the workload statistics that are used by process 410.
- the present invention optimizer receives a query.
- the present invention selects one data table ("selected table") from a list of data tables specified within the query received at step 415.
- the present invention interrogates the query to determine if any equi-selection predicates are specified on the selected table.
- an equi-selection can exist in WHERE and HAVING clauses of the query. If not, processing continues to step 435; if so, processing continues to step 430.
- the equi-selection predicate indicates a search criterion wherein the search data is specified as a particular value (or variable) and the match must be exact.
- for example, consider an equi-selection requiring employee_JOBCODE to equal 5 and employee_AGE to equal 40, where both employee_JOBCODE and employee_AGE are column names within the selected data table.
- the above equi-selection indicates that matching data are to be reported only for rows wherein the employee job and age match the above values (e.g., 5 and 40). It is appreciated that an equi-selection can be defined with respect to a single column or more than one column, and that the above equi-selection is exemplary only.
- the present invention constructs a column group based on all equi-selection predicates specified on the selected table and stores it in the system catalog (within computer readable memory units of system 112).
- step 430 identifies employee_JOBCODE and employee_AGE as a column group.
- identified column groups can contain one or more columns. All identified column groups are stored in the system catalog. This is performed for each equi-selection predicate specified on the selected table.
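The construction and registration of an equi-selection column group can be sketched as follows. The predicate representation (tuples of table, column, operator and value), the dictionary used as a stand-in for the system catalog, and the `employee_SALARY` column are assumptions made for illustration only.

```python
def equi_selection_group(predicates, table):
    """Build one column group from all equality predicates on `table`.

    predicates -- list of (table, column, operator, value) tuples
    """
    cols = sorted({col for tbl, col, op, _ in predicates
                   if tbl == table and op == "="})
    return tuple(cols)


def register(catalog, table, group):
    """Store the group in the catalog unless it is empty or already present."""
    if group:
        catalog.setdefault(table, set()).add(group)


catalog = {}
preds = [
    ("employee", "employee_JOBCODE", "=", 5),
    ("employee", "employee_AGE", "=", 40),
    ("employee", "employee_SALARY", ">", 50000),  # not an equi-selection, so ignored
]
register(catalog, "employee", equi_selection_group(preds, "employee"))
register(catalog, "employee", equi_selection_group(preds, "employee"))  # no duplicate entry
print(catalog)
```

Using a set per table keeps a column group from being registered twice when several queries in the workload produce the same group.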
- the present invention interrogates the query to determine if any equi-join predicates are specified on the selected table.
- an equi-join can exist in WHERE clauses of the query. If not, processing continues to step 450 (FIG. 4B); if so, processing continues to step 440.
- in a join operation, data from two tables are joined together based on matching data from one or more columns.
- the present invention constructs different column groups based on different equi-joins connecting selected data table with each of the other data tables and stores the column groups in the system catalog (within computer readable memory units of system 112).
- a column group is defined by step 440 for the two columns that specify the department number and the common regional designation. All identified column groups are stored in the system catalog. This is performed for each equi-join predicate specified on the selected table. Step 450 of FIG. 4B is then entered.
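The construction of equi-join column groups, one per joined table, can be sketched in the same style. The predicate representation and the table and column names (`dept_no`, `region`, `emp_id`, `payroll`) are hypothetical.

```python
from collections import defaultdict


def equi_join_groups(join_predicates, table):
    """Return {other_table: column group of `table`} for each table joined to it.

    join_predicates -- list of ((table_a, col_a), (table_b, col_b)) equalities
    """
    by_other = defaultdict(set)
    for (ta, ca), (tb, cb) in join_predicates:
        if ta == table and tb != table:
            by_other[tb].add(ca)
        elif tb == table and ta != table:
            by_other[ta].add(cb)
    # One column group per joined table, with columns in a stable order.
    return {other: tuple(sorted(cols)) for other, cols in by_other.items()}


joins = [
    (("employee", "dept_no"), ("department", "dept_no")),
    (("employee", "region"), ("department", "region")),
    (("employee", "emp_id"), ("payroll", "emp_id")),
]
print(equi_join_groups(joins, "employee"))
```

Here the two equalities connecting `employee` with `department` yield one two-column group, while the join with `payroll` yields a separate single-column group.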
- the present invention interrogates the query to determine if any projection operations are specified on the selected table. If not, processing continues to step 460 of FIG. 4B; if so, processing continues to step 455.
- A projection operation is used mainly for two purposes: (1) to compute aggregates of rows of data by a projection column or columns and then build a new data table having the aggregated result; and (2) to eliminate duplicate row data by creating a new data table and projecting distinct data from a given data table into the new data table.
- Projections or projection lists are specified in a GROUP BY or DISTINCT clause of a query and are used to perform the above functions on columns of the selected table.
- the present invention constructs a column group based on the projection specified on the selected table and stores the column groups in the system catalog (within computer readable memory units of system 112). Step 460 of FIG. 4B is then entered.
- the present invention examines the query received at step 415 to determine if more data tables are specified within the query that have not yet been processed by process 410. If so, processing returns to step 420 of FIG. 4A to select a next data table; if not, process 410 ends.
- the determined set of column groups identified can include one or more columns on which single column or multi-column workload statistics are to be collected.
- the present invention stores the columns and/or column groups selected at steps 430, 440, and 455 into the system catalog portion of computer readable memory units of system 112 provided they were not previously stored.
- When a statistics collection utility runs, it reads the column groups stored in the system catalog of the RDBMS system, and collects the workload statistics for each column group.
- the present invention advantageously provides an automatic mechanism for determining certain data in the database for which statistics are to be collected.
- the availability of the workload statistics on identified columns is deemed to be highly relevant to the cost estimation analysis performed by the optimizer to find an optimal strategy for each query in the query workload.
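The identification pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Query` structure and its field names, the `identify_column_groups` function, the use of a Python set as a stand-in for the system catalog, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    # Hypothetical stand-in for a parsed query; none of these fields
    # come from the source text.
    tables: list
    equi_selections: dict = field(default_factory=dict)  # table -> columns
    equi_joins: list = field(default_factory=list)       # (table_a, cols_a, table_b, cols_b)
    projections: dict = field(default_factory=dict)      # table -> columns


def identify_column_groups(query, catalog):
    """Register one column group per equi-selection, equi-join and
    projection found on each table of the query.

    `catalog` is a set of (table, columns) pairs, so a group that was
    stored earlier is not stored twice. Columns are sorted on the
    assumption that their order within a group does not matter.
    """
    for table in query.tables:
        # Equi-selection predicates on the selected table.
        cols = query.equi_selections.get(table)
        if cols:
            catalog.add((table, tuple(sorted(cols))))
        # One column group per equi-join touching the selected table.
        for table_a, cols_a, table_b, cols_b in query.equi_joins:
            if table_a == table:
                catalog.add((table, tuple(sorted(cols_a))))
            if table_b == table:
                catalog.add((table, tuple(sorted(cols_b))))
        # Projection list (GROUP BY or DISTINCT) on the selected table.
        cols = query.projections.get(table)
        if cols:
            catalog.add((table, tuple(sorted(cols))))
    return catalog


example_query = Query(
    tables=["emp", "dept"],
    equi_selections={"emp": ["age", "salary"]},
    equi_joins=[("emp", ["dept_no", "region"], "dept", ["dept_no", "region"])],
    projections={"dept": ["region"]},
)
example_catalog = identify_column_groups(example_query, set())
```

Running the pass a second time over the same query leaves the catalog unchanged, which mirrors the "provided they were not previously stored" condition.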
- FIG. 5 illustrates steps of the statistics generation process 560 which operates to construct the statistics used in process flow 310 (FIG. 3). It is appreciated that process 560 typically operates in the background of the RDBMS system in an effort to avoid interference with the RDBMS system's execution of other relational operations. In this respect, process 560 and process 310 are asynchronous. Process 560 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Process 560 commences at step 565 where a particular data table is selected ("selected table").
- process 560 accesses the system catalog within the RDBMS system (e.g. a portion of memory unit 102) to obtain a list of identified column groups (on which to collect workload statistics) associated with a selected data table.
- the workload statistics (column duplicity factor and column null factor) are generated for each of the identified column groups (e.g., identified by step 410 above). This is performed for each identified column group in the system catalog.
- the generated workload statistics are then stored together with respective column groups into the system catalog of the RDBMS system for use by process 310 (FIG. 3) when needed.
- the present invention checks if more data tables are present in the database. If so, process 560 returns to step 565 to select a new data table; if not, process 560 returns (to be executed at a later predetermined time).
- step 575 to provide, respectively, collection of (1) workload-based multi-column non-null statistic (column duplicity factor), and (2) workload-based multi-column null statistic (column null factor) for each of the identified column groups in the database.
- the embodiment of the present invention represented by process 410 provides methods for identifying the data for which workload statistics are to be collected. This embodiment of the present invention places identifiers of this data within the system catalog which is read by step 570 of process 560, although other memory locations can also be used within the scope of the present invention.
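The background statistics pass of process 560 can be pictured with the sketch below. It assumes tables are held as lists of row dictionaries and the catalog as a dictionary from table name to column groups; these representations, the function name `collect_workload_statistics`, and the sample data are illustrative assumptions only.

```python
def collect_workload_statistics(tables, catalog):
    """tables: name -> list of row dicts; catalog: name -> list of column tuples.

    Returns a dict mapping (table, columns) to the two workload statistics.
    """
    stats = {}
    for table_name, rows in tables.items():          # select a table
        for columns in catalog.get(table_name, []):  # read its column groups
            groups = [tuple(row[c] for c in columns) for row in rows]
            # A row group counts as null if any identified column is null.
            non_null = [g for g in groups if None not in g]
            cdf = len(non_null) / len(set(non_null)) if non_null else 0.0
            cnf = (len(groups) - len(non_null)) / len(groups) if groups else 0.0
            # Store the statistics together with their column group.
            stats[(table_name, columns)] = {"cdf": cdf, "cnf": cnf}
    return stats


# Hypothetical data, not taken from the figures.
sample_tables = {
    "emp": [
        {"age": 34, "salary": 30000},
        {"age": 34, "salary": 30000},
        {"age": None, "salary": 32000},
        {"age": 37, "salary": None},
    ]
}
sample_catalog = {"emp": [("age", "salary"), ("age",)]}
sample_stats = collect_workload_statistics(sample_tables, sample_catalog)
```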
- Process 610 of the second embodiment of the present invention operates within the statistics generation procedures of an RDBMS system (process 560).
- This embodiment recognizes that in many instances within a database system, the distribution of values in one column can correlate to the values of another column.
- Conventional RDBMS systems assume that there is no correlation between two or more columns when determining overall estimated result cardinalities of a relational operation that includes two or more columns.
- the result cardinality represents the average number of rows of a data table that result from a particular relational operation (e.g., a search, projection, join, etc.).
- the zero-correlation assumption made by the conventional systems usually causes a large percentage error in the estimation of query solution costs and cardinalities when strong correlation exists between multiple columns on which relational operations are performed, such as selections, projections, and joins. Large errors in solution costs lead the optimizer to select a sub-optimal strategy for a query.
- the first embodiment of the present invention avoids the above problem through automatic identification of column groups based on a query workload, and collecting workload statistics on single columns and multiple columns (multi-column statistics) as a group, rather than collecting statistics merely on individual columns. In effect, the first embodiment of the present invention provides a mechanism for capturing the correlation between columns by collecting statistics on multiple columns as a group.
- FIG. 6 illustrates a process flow 610 of steps of the second embodiment of the present invention.
- Process 610 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
- the present invention reads the system catalog space of a computer readable memory unit of system 112 to receive an indication of certain column groups (column identifiers) on which workload statistics need to be collected.
- this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention.
- FIG. 7 illustrates a data table 500 comprising three columns: 1) an employee identification number column 510; 2) an employee age column 520; and 3) an annual salary column 530.
- the data table 500 contains "n" rows 540-547.
- the present invention receives column identifiers of a column group for which a column duplicity factor needs to be computed, e.g., on the age column 520 and the annual salary column 530 of data table 500.
- the present invention accesses row groups containing identified columns (identified in step 615) from the selected table.
- a row group includes the corresponding rows of each identified column of the data table.
- the identified columns on which workload statistics are to be collected are referred to as "column groups."
- Each of the age and annual salary entries of a particular row, e.g., row 542, of data table 500 is a corresponding row group.
- the rows are corresponding because they belong to a same employee identification number.
- the following are further examples of row groups in accordance with the example of FIG. 7: 1) 34/30,000 of row 541; 2) 37/35,000 of row 544; and 3) 37/40,000 of row 546.
- the result cardinality of the column group is dependent on the data of multiple columns.
- the number of distinct row groups is 5 and they are: 34/30,000; 36/32,000; 37/35,000; 37/40,000; and 57/50,000.
- the present invention determines the number of distinct "pairs" between the age and annual salary (income) columns of data table 500.
- the multi-column distinct cardinality statistic of 5 is reported as the number of distinct row groups in the column group including column 520 and column 530.
- the column duplicity factor (a multi-column statistic) is computed by dividing the number of non-null row groups by the number of distinct row groups in the selected table.
- the column duplicity factor represents the average number of rows returned by a query of the data table that specifies particular selection values for the columns identified in step 615.
- the column duplicity factor (a multi-column statistic) determined at step 630 is stored as workload-based statistic corresponding to the columns that were identified in step 615. These values are stored and indexed in computer readable memory units of system 112. As described with reference to FIG. 3, the optimizer utilizes these workload statistics to perform accurate cost estimations in its cost analysis in determining the lowest cost strategy for performing a query.
- the result cardinality of the two columns is expressed as the overall selectivity times the table cardinality which represents the number of rows of the data table 500.
- using multi-column workload statistics, the average number of rows a query is expected to locate based on a selection of both age and income is 1.6.
- the multi-column workload statistic of the present invention is much more accurate than the single column derived workload statistic because in the example case of FIG. 7, and in many instances, the data within the columns are correlated. This correlation, which is captured by the present invention but ignored in the single column derivation method of conventional systems, leads to a more accurate result cardinality value.
- the second embodiment of the present invention provides an RDBMS optimizer with more accurate estimated costs for optimal strategy selection.
- an optimal strategy is more likely to be selected by the optimizer and a reduced instance of sub-optimal strategy selection is provided by the present invention.
- more accurate cost estimation by the optimizer is possible because more accurate information is available regarding: (1) how many rows on average are selected from a given data table; and/or (2) how many rows on average are going to participate in a relational operation (e.g., a join).
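The column duplicity factor computation can be sketched as below. The eight rows are a hypothetical table chosen only to be consistent with what the text states about FIG. 7 (eight rows and the five distinct age/salary row groups listed); the actual contents of the figure are not reproduced here, and the function name and row representation are assumptions.

```python
def column_duplicity_factor(rows, columns):
    """Number of non-null row groups divided by number of distinct row groups."""
    groups = [tuple(row[c] for c in columns) for row in rows]
    # A row group with null data in any identified column is excluded.
    non_null = [g for g in groups if all(v is not None for v in g)]
    if not non_null:
        return 0.0
    return len(non_null) / len(set(non_null))


# Hypothetical rows: 8 rows, 5 distinct (age, salary) row groups.
table_500 = [
    {"age": 34, "salary": 30000},
    {"age": 34, "salary": 30000},
    {"age": 36, "salary": 32000},
    {"age": 36, "salary": 32000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 40000},
    {"age": 57, "salary": 50000},
]

cdf_age_salary = column_duplicity_factor(table_500, ["age", "salary"])  # 8 / 5 = 1.6
```

With these rows the factor matches the 1.6 rows per age/income selection quoted in the text.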
- the process 710 of a third embodiment of the present invention operates within the statistics generation procedures of an RDBMS system (process 560).
- the process 710 of the third embodiment of the present invention is described with reference to the flow diagram of FIG. 8 and the exemplary data tables of FIG. 7.
- the third embodiment of the present invention maintains separate workload statistics regarding the number of rows with null data values in a column group. Separating rows with null data from rows with non-null data is important because rows with nulls do not participate in most of the relational operations performed within an RDBMS. Therefore, as recognized by the third embodiment of the present invention, the cardinality (e.g., the number of rows) of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
- FIG. 8 illustrates a process 710 performed by an implementation of the third embodiment of the present invention for computing multi-column null statistics for a column group.
- Process 710 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
- FIG. 8 is described with respect to an example data table 500 of FIG. 7 where n is 100.
- the age column 520 contains null data rows for 50 of the employees.
- there are only 20 distinct age entries in age column 520 so it is appreciated that the initial result cardinality (e.g., without consideration of null data values) for an equi-selection on column 520 is 100/20 or 5.0.
- the present invention receives column identifiers of a column group of a selected table.
- system 112 reads the system catalog space of a computer readable memory unit of system 112 to receive the column group identifiers on which to collect workload-based null statistics of the selected table.
- this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention.
- this information can be user originated with respect to the first embodiment of the present invention.
- the present invention receives an indication that workload statistics are to be collected based on the age column 520 of data table 500 (FIG. 7).
- the present invention accesses the corresponding row groups of the data table containing the identified column groups.
- the present invention counts the number of row groups with null data in any of the identified columns.
- system 112 examines the individual entries of the corresponding row groups to determine the number of row groups with null data in any of the identified column groups. For instance, at step 730, the present invention performs a search, or other equivalent operation, to determine the number of null rows within the identified column and records this value as a null data statistic for the identified column. This is done for each of the identified columns for the selected table. With reference to column 520 (FIG. 7), the present invention performs a search and determines that 50 of the rows of identified column 520 contain null data values.
- the present invention utilizes the number of null rows determined in step 730 and the number of table rows to determine the column null factor for the identified column.
- system 112 computes the column null factor as equal to the number of row groups with null data divided by the number of table rows. In one embodiment, this is computed for all identified columns of the selected table. For instance, with respect to column 520, the present invention divides the number of null rows (50) by the total number of table rows (100) of column 520 to compute a column null factor of 0.5.
- the age column 520 contains null data rows for 50 of the employees out of a total of 100. Also, in the example, there are only 20 distinct values in the age column 520. It is appreciated that without the null statistic, the initial result cardinality for an equi-selection on column 520 is 100/20 or 5.0. By collecting the null statistic (column null factor), the present invention gives a much more accurate final result cardinality by multiplying the initial result cardinality by one minus the column null factor. This is shown below: final result cardinality = initial result cardinality * (1 - column null factor) = 5.0 * (1 - 0.5) = 2.5
- the final result cardinality gives a much more accurate estimate of the average number of rows of column 520 that are likely to have any particular age value, considering rows with null values within column 520.
- the third embodiment of the present invention stores the determined column null factor in a computer readable memory unit as a workload-based null statistic for the identified column group of the selected table. For example, the null factor of 0.5 is then recorded for column 520.
- the present invention gives a more accurate final result cardinality which indicates that, on average, 50/20 or 2.5 employees have the same age in our current example. On average, a query of column 520 for a particular age would result in 2.5 rows identified, not 5.0 as indicated by the initial result cardinality that does not take into consideration the presence of null data rows.
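The null-factor arithmetic of this example can be reproduced with a short sketch. The table below is synthetic, built only to match the stated counts (100 rows, 50 null ages, 20 distinct ages); the function name and the row layout are assumptions.

```python
def column_null_factor(rows, columns):
    """Row groups with null data in any identified column, divided by table rows."""
    if not rows:
        return 0.0
    null_groups = sum(1 for row in rows if any(row[c] is None for c in columns))
    return null_groups / len(rows)


# Synthetic table: 50 rows with a null age, 50 rows spread over 20 distinct ages.
age_rows = [{"emp_id": i, "age": None} for i in range(50)]
age_rows += [{"emp_id": 50 + i, "age": 30 + i % 20} for i in range(50)]

cnf_age = column_null_factor(age_rows, ["age"])
distinct_ages = len({row["age"] for row in age_rows if row["age"] is not None})

initial_cardinality = len(age_rows) / distinct_ages     # 100 / 20
final_cardinality = initial_cardinality * (1 - cnf_age)  # null rows discounted
```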
- the following material illustrates the manner in which the workload statistics (e.g., column duplicity factor and column null factor) are used to compute the result cardinality of the following relational operations: 1) equi-selection, 2) equi-join, and 3) projection.
- two data tables exist, each with one column group.
- the column group in the first table is denoted as CG1 and the column group in the second table as CG2.
- the column duplicity factor and column null factor based on CG1 are denoted as CDF1 and CNF1
- the column duplicity factor and column null factor based on CG2 are denoted as CDF2 and CNF2.
- TC1 the cardinality of the first table, i.e., number of rows
- TC2 the cardinality of the second table
- CDF1 represents the average number of duplicates per distinct value of CG1.
- An equi-selection over CG1 is nothing but selecting a distinct value of CG1, and therefore, the result cardinality of such an operation is expected to be equal to CDF1.
- TC1*(1-CNF1)/CDF1 represents the distinct cardinality of CG1.
- a projection over CG1 is nothing but selecting all distinct values in CG1, and therefore, the result cardinality of such an operation is expected to be equal to TC1*(1-CNF1)/CDF1.
- TC1*(1-CNF1)/CDF1 is denoted as DC1
- TC2*(1-CNF2)/CDF2 is denoted as DC2.
- DC1 and DC2 represent, respectively, the distinct cardinality of CG1 and CG2.
- the result of the equi-join operation based on CG1 and CG2 is estimated as equal to MIN (DC1, DC2)*CDF1*CDF2.
- MIN is a function that chooses the minimum value between DC1 and DC2.
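The three estimates above translate directly into code. In this sketch the class and function names are assumptions; the first set of statistics reuses the age column figures from the earlier example (100 rows, null factor 0.5, 2.5 duplicates per value), while the second set is purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnGroupStats:
    tc: float   # table cardinality (TC)
    cdf: float  # column duplicity factor (CDF)
    cnf: float  # column null factor (CNF)

    def distinct_cardinality(self):
        # DC = TC * (1 - CNF) / CDF
        return self.tc * (1 - self.cnf) / self.cdf


def equi_selection_cardinality(stats):
    # Selecting one distinct value returns CDF rows on average.
    return stats.cdf


def projection_cardinality(stats):
    # A projection returns every distinct value once.
    return stats.distinct_cardinality()


def equi_join_cardinality(stats1, stats2):
    # MIN(DC1, DC2) * CDF1 * CDF2
    dc = min(stats1.distinct_cardinality(), stats2.distinct_cardinality())
    return dc * stats1.cdf * stats2.cdf


cg1 = ColumnGroupStats(tc=100, cdf=2.5, cnf=0.5)  # DC1 = 20
cg2 = ColumnGroupStats(tc=40, cdf=4.0, cnf=0.0)   # hypothetical, DC2 = 10
join_rows = equi_join_cardinality(cg1, cg2)       # 10 * 2.5 * 4.0
```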
Abstract
Methods for collecting query workload based statistics within a relational database management system (RDBMS) and for identifying columns for which statistics collection is to be performed. The novel system collects workload statistics that are dependent on multiple columns, rather than merely single columns. Multi-column statistic generation provides more accurate results for columns having correlated data, and therefore leads to better estimated cost analysis by an RDBMS optimizer. In one embodiment, a column duplicity factor is based on an analysis of distinct data rows, e.g., combinations of values within multiple columns, rather than rows of single columns. The novel system also collects separate statistics regarding the presence of null data within the rows of a column group. Separate null data statistics improve the determined result cardinality used by the RDBMS optimizer because the cardinality of a relational operation's result is generally determined by the number of input rows with non-null data. The novel system includes an RDBMS optimizer that automatically identifies columns and column groups on which workload statistics are to be generated. The parameters within a query (e.g., equi-joins, equi-selections, and projections) are analyzed by the optimizer to automatically identify the column groups. The identified columns are then registered within a system catalog. The registered column groups are read by statistics generation procedures to identify those column groups for which workload statistics are to be collected.
Description
This is a continuation of copending application Ser. No. 08/796,779 filed on Feb. 10, 1997 which is hereby incorporated by reference to this application.
(1) Field of the Invention
The present invention relates to the field of computer systems used for storage and retrieval of data. More specifically, the present invention relates to the field of statistics measurement systems for a relational database management system.
(2) Prior Art
Computer implemented relational database management systems (RDBMS) are well known in the art. Such database systems commonly employ data tables that contain columns and rows containing data (e.g., data values). A typical RDBMS, in addition to maintaining the data of a particular database, also maintains a set of statistics regarding the data. These statistics are useful in efficiently accessing, manipulating, and presenting the stored data.
When an RDBMS system receives a query, its optimizer analyzes the structure of the query, analyzes the various clauses (e.g. selection and join predicates) specified in the query, and examines existing data access paths (e.g. indexes) to formulate a strategy (e.g., method) of performing various relational operations (e.g., aggregation, sort, search, join, etc.) to produce the result of the query. The optimizer generally explores various strategies to find the best strategy. The best strategy is the one with the lowest estimated RDBMS cost, and such a strategy generally takes the least amount of RDBMS resources or the least amount of time or both to produce the result of the query.
RDBMS cost is generally evaluated based on input/output (I/O) operations that are required to perform necessary relational operations to produce the result of the query. Therefore, the optimizer selection criterion for the best strategy for a query is a minimum execution cost. The minimum execution cost is calculated using available statistics such as table and index cardinalities, workload statistics (e.g. statistics on columns and column groups), storage statistics and assumptions of the manner in which any relational operation typically changes the incoming data size and data distribution.
Prior RDBMS systems collect workload statistics based on single columns of data. As such, these systems make a zero-correlation assumption, namely that the values of two or more columns of data are not related in any way. However, in many cases, this zero-correlation assumption is incorrect. For example, in an example data table having a first column of employee age and a second column of employee job position, it is possible that in general the older employees hold higher job positions. As such, the row values in the first column correlate to the values in the second column.
In order to estimate the cost of a strategy for a query on an example employees data table asking for all employees with a certain age (e.g., 35) and certain job position (e.g., 4), the prior art systems compute a result cardinality (result_cardinality) based on the inverse of the distinct cardinality of the separate age column (DC1) multiplied by the inverse of the distinct cardinality of the separate job position column (DC2) and this result is multiplied by the number of employees (#EE). The distinct cardinality of a column represents the number of distinct values within that column. See the computation of the result cardinality below:
result_cardinality = (1/DC1 * 1/DC2) * #EE
The estimated result cardinality represents the average number of rows likely to be produced for a query asking for all employees with a certain age and a certain job position. However, the relationship used above by the prior art system assumes that there is no correlation between the first and second columns. This assumption leads to a computation of a result cardinality value that is much smaller than what is returned in reality due to data correlation. When the above result cardinality value is used by an optimizer, its determined estimated costs for performing certain relational operations on the subject data become inaccurate. Specifically, this inaccuracy can cause the optimizer to select a strategy resulting in worse performance in producing the result. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving more than one data column.
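A small numerical sketch can make the underestimate concrete. The employees table below is invented for illustration, with job position deliberately tied to age; the function names are likewise assumptions, and the table is assumed to be non-empty.

```python
def zero_correlation_estimate(rows, col_a, col_b):
    """Prior-art style estimate: (1/DC1 * 1/DC2) * number of rows."""
    dc1 = len({row[col_a] for row in rows})
    dc2 = len({row[col_b] for row in rows})
    return (1 / dc1) * (1 / dc2) * len(rows)


def actual_matches(rows, col_a, val_a, col_b, val_b):
    """Rows that really satisfy both equality predicates."""
    return sum(1 for row in rows if row[col_a] == val_a and row[col_b] == val_b)


# Hypothetical employees table in which job position rises with age.
employees = [
    {"age": 25, "position": 1}, {"age": 25, "position": 1},
    {"age": 35, "position": 4}, {"age": 35, "position": 4},
    {"age": 35, "position": 4}, {"age": 45, "position": 5},
    {"age": 45, "position": 5}, {"age": 55, "position": 6},
]

estimate = zero_correlation_estimate(employees, "age", "position")  # (1/4 * 1/4) * 8
actual = actual_matches(employees, "age", 35, "position", 4)
```

Here the single-column estimate is half a row, while three employees actually match age 35 and position 4, which is the kind of gap the paragraph describes.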
The presence of null data values within rows of a data table can lead to inaccuracies in the workload statistics, which will lead to inaccurate cost estimation analysis by the RDBMS optimizer. Most relational operations ignore incoming data rows that contain null data values in key columns. Therefore, it is important to also collect statistics about the number of rows with null data in certain columns or column groups. Prior art RDBMS systems that compute certain database statistics (e.g., distinct cardinality of a column) without regard to rows with null data produce inaccurate statistics. The inaccurate statistics can cause the RDBMS optimizer to select a sub-par strategy for a given query. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving columns of rows with null data. The present invention provides such a system.
In prior RDBMS systems, a manual process is performed by users (e.g., system administrators) to determine the data columns of a database on which to collect statistics. In effect, in prior RDBMS systems, the user informs integrated statistics gathering procedures of the separate data columns on which workload statistics are to be gathered. However, it is often the case that the system administrators do not realize the specific mechanisms by which the RDBMS optimizer operates. Relying on users to indicate the data columns on which to collect statistics can be unreliable and inefficient. Namely, not knowing what is needed by the optimizer, the system administrator can cause the statistics gathering procedures to inefficiently over-collect statistics or fail to collect statistics on required columns. What is needed is an efficient method for identifying columns or column groups on which statistics collection is required, and one that does not rely on user origination of the above information.
Accordingly, the present invention provides a method and system for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for workload queries involving more than one data column. The present invention also provides a method and system for collecting statistics within an RDBMS system that provides for accurate estimated cost analysis for queries involving columns of rows with null data values. The present invention further provides a method and system within an RDBMS optimizer for identifying the columns or column groups on which statistics collection is required that does not rely on manual user identification of the columns or column groups.
Methods are described for collecting query workload based statistics within a relational database management system (RDBMS) and for automatically identifying columns or groups of columns for which statistics collection is to be performed. The novel system collects workload based statistics on multi-columns rather than merely on single columns. This multi-column statistic gathering technique of the present invention provides more accurate results when the row values within the multi-columns are correlated to each other. The multi-column statistics that result from the present invention lead to better estimated cost analysis performed by an RDBMS optimizer which leads to increased likelihood of finding an optimal strategy for each query in a workload. In one embodiment, the present invention generates column group statistics which represent distinct cardinality measured over rows of multiple columns of data, rather than individual statistics on single columns ("a multi-column statistic"). In cases where the data within a column group is correlated, the multi-column result cardinality of the present invention computed using column group statistics is much more accurate compared to a result cardinality computed from statistics on individual columns using a zero-correlation assumption.
The novel system also collects separate statistics regarding the number of null data values within the rows of a column. Separate null data statistics improve the cost estimation analysis used by the RDBMS optimizer because the cardinality of a result produced by a relational operation is generally determined by the number of input rows with non-null data. The present invention utilizes the separate null data statistics in determining a result distinct cardinality of a column, or column group, of a data table. The null data statistics are useful in accurate estimation of cardinality of the result of a relational operation (e.g. a data search or a table join) and in accurate estimation of average cost of performing a relational operation (e.g. a data sort, a data search, a table join).
The novel system also includes an RDBMS optimizer that automatically identifies columns and groups of columns on which workload based statistics should be generated. Within the present invention, the clauses within a query such as equi-join and equi-selection predicates and projection columns are analyzed by the RDBMS optimizer to identify columns and column groups of interest. The identified columns are then registered within a system catalog of the RDBMS. Later on, the statistics collection utility of the present invention reads the registered columns and column groups from the system catalog to identify the columns and column groups on which to generate the workload statistics. The statistics so generated are then used by the optimizer to perform subsequent estimations of the cost of performing certain relational operations in order to select a strategy for a query having the least cost.
FIG. 1 illustrates a computer system used in accordance with the embodiments of the present invention.
FIG. 2 is an illustration of a data table used by embodiments of the present invention including columns and rows of data.
FIG. 3 is a flow diagram of steps performed by the optimizer of the RDBMS system of the present invention when selecting a lowest cost strategy for a query.
FIG. 4A and FIG. 4B illustrate a flow diagram of steps of a first embodiment of the present invention for automatic identification of columns and column groups by the RDBMS optimizer on which to collect the workload statistics.
FIG. 5 is a flow diagram of steps performed by statistics generation procedures of the present invention to generate workload statistics.
FIG. 6 illustrates a flow diagram of steps of a second embodiment of the present invention for generating non-null workload statistics (e.g., the column duplicity factor), which is a multi-column statistic, on identified column groups.
FIG. 7 illustrates an exemplary multi-column data table that can be used by the first embodiment of the present invention.
FIG. 8 illustrates a flow diagram of steps of a third embodiment of the present invention for generating null workload statistics (e.g., the column null factors) on identified column groups.
In the following detailed description of the embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be obvious to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
Some portions of the detailed descriptions which follow are presented in terms of steps, procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, step, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system (e.g., 112 of FIG. 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Refer to FIG. 1 which illustrates a computer system 112. Within the following discussions of the present invention, certain processes (e.g., processes 310, 350, 410, 610, 710, 910, 1010) and steps are discussed that are realized, in one implementation, as a series of instructions (e.g., software program) that reside within computer readable memory units of system 112 and executed by processors of system 112. When executed, the instructions cause the computer system 112 to perform specific actions and exhibit specific behavior which is described in detail to follow.
In general, computer system 112 used by the present invention comprises an address/data bus 100 for communicating information, one or more central processors 101 coupled with the bus 100 for processing information and instructions, a computer readable volatile memory unit 102 (e.g., random access memory, static RAM, dynamic RAM, etc.) coupled with the bus 100 for storing information and instructions for the central processor(s) 101, and a computer readable non-volatile memory unit (e.g., read only memory, programmable ROM, flash memory, EPROM, EEPROM, etc.) coupled with the bus 100 for storing static information and instructions for the processor(s) 101. System 112 also includes a mass storage computer readable data storage device 104 (hard drive or floppy) such as a magnetic or optical disk and disk drive coupled with the bus 100 for storing information and instructions. Optionally, system 112 can include a display device 105 coupled to the bus 100 for displaying information to the computer user, an optional alphanumeric input device 106 including alphanumeric and function keys coupled to the bus 100 for communicating information and command selections to the central processor(s) 101, an optional cursor control device 107 coupled to the bus for communicating user input information and command selections to the central processor(s) 101, and an optional signal generating device 108 coupled to the bus 100 for communicating command selections to the processor(s) 101. In one exemplary implementation, system 112 is a DEC Alpha computer system by Digital Equipment Corporation, but could equally be any of a number of well known and commercially available platforms.
FIG. 2 illustrates an exemplary data table 200 as used by embodiments of the present invention. It is appreciated that data table 200, as well as columns and rows of data which comprise the table 200, are stored in computer readable memory units of system 112. Exemplary table 200 includes x number of separate data columns (or "columns") 210, 220, 230, and 240. Within each column, table 200 includes n rows of data 250-256. Data ("data values") are stored within the rows of each column. A data value can be either non-null (e.g., a known value) or null (e.g., an unknown value or an empty set value). A column group is a collection of multiple columns, e.g., column 210 and 220. Herein, references to columns, rows, column groups, and data tables refer to analogous data structures as shown in FIG. 2 and as stored in computer readable memory units of system 112.
Database performance is directly affected by the ability of a database to efficiently access a row or rows stored on disk (e.g., 104) through input/output (I/O) operations. The greater the number of I/O operations, the longer it takes to find and retrieve rows that satisfy a query. The optimizer of the RDBMS system finds the lowest cost strategy (generally the most I/O efficient) for retrieving the subject data and producing the result of a query. Although the optimizer can consider many strategies, each with a different cost, to find the one with the lowest cost, all strategies yield the same result.
Herein, cost is the estimated number of disk I/Os required to execute a strategy and produce the result of a query. This is the metric used by the optimizer to find an optimal strategy, e.g., the one with the minimal cost. The main factors affecting the cost of a strategy include the following: 1) query structure, 2) query predicates (e.g., equi-selections, equi-joins, etc.), 3) result ordering or grouping, 4) access paths available (e.g., hash index, B-tree index), and 5) join methods available (e.g., sort-merge join, hash join). Database statistics, if available, can significantly improve the accuracy of the optimizer's cost estimation analysis and thereby help it find an optimal strategy.
In one embodiment, database statistics include cardinality statistics, workload statistics and storage statistics. Cardinality statistics include the table cardinality and index cardinality. The table cardinality represents the total number of rows in a data table. The index cardinality represents the total number of distinct keys (index values) within an index. In other words, the index cardinality is the distinct cardinality of the index columns. The workload statistics include the column duplicity factor and the column null factor. The column duplicity factor is the ratio of the number of table rows with non-null data in a column group to the distinct cardinality of that column group. It represents the average number of duplicates per distinct combination of values in a column group. The column null factor is the ratio of the number of table rows with null data in any column of a column group to the table cardinality. It represents the fraction of a data table with null data in rows of a column group.
FIG. 3 illustrates an exemplary process flow 310 performed within an RDBMS system of the present invention which utilizes an optimizer. Generally, upon receiving a query, the optimizer: 1) identifies different column groups based on equi-join predicates, equi-selection predicates and projections specified in the query; 2) uses stored workload statistics on identified column groups and other stored database statistics to estimate the costs of performing different relational operations required to produce the result of the query; 3) builds different strategies of executing relational operations in a certain order and estimates the cost for each strategy; and 4) selects the lowest cost strategy. It is appreciated that process 310 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
At step 315 of FIG. 3, the RDBMS system receives a query, e.g., a user request for data. The received query contains certain predicates (e.g., joins, filters, and other conditions). At step 317, system 112 identifies column groups based on equi-joins, equi-selections and projections that are specified in the query. At step 320, the optimizer examines the query to determine the different relational operations (e.g., data access paths, sorts, filters, projections, aggregations, joins, etc.) that are needed to complete the query. Also at step 320, the optimizer of the RDBMS system accesses stored database statistics (cardinality/workload/storage), including the stored workload statistics on the identified column groups, regarding the subject data of the query received at step 315 and uses this information to estimate the costs of performing the relational operations necessary to produce the result of the query. Any of a number of well known methods can be used at step 320 for performing cost analysis of different relational operations used to satisfy the query.
With respect to determining the relational operations required to produce a result of a query, there are a number of well known retrieval strategies that the optimizer can utilize in order to perform a data search. A well known sequential retrieval strategy can be used where the RDBMS accesses the database pages for a table's logical area sequentially and reads all the rows in the table regardless of the selection expression used in the query. A number of well known index retrieval strategies can also be used, including those described below. In the database key (dbkey) retrieval, the RDBMS accesses a table data directly through a database key (e.g., logical address) row pointer. In the index retrieval, the RDBMS accesses a specific index structure (sorted or hashed) and retrieves the index keys, which include the dbkeys, of the rows. The dbkeys are then used to fetch data rows and the data is delivered in index order, if a sorted index is used. In the index only retrieval, the RDBMS accesses only the index data and the selected index contains all the table columns specified in the query and no further row fetches are necessary. The data is delivered in index order, if a sorted index is used. In the OR index retrieval, two or more sorted or hashed indexes are used that are defined on a single data table when the predicates on these indexes are combined with the logical OR in the query.
At step 322 of FIG. 3, the optimizer then attempts different strategies to implement the relational operations identified in step 320 and then computes the cost to perform each strategy. A strategy is a scheme of performing different relational operations in a certain order. By reordering the execution sequence of the relational operations, different cost values result for executing the same query. For single data tables, the optimizer finds the lowest cost retrieval strategy by evaluating the cost for each possible retrieval strategy. In one embodiment, the estimated cost for a strategy includes the cost of scanning an index and the cost of fetching data rows.
At step 330 of FIG. 3, the optimizer selects the strategy, of the strategies analyzed in step 322, that yields the lowest estimated cost solution. A number of well known optimizer processes can be used by step 330 for selecting the lowest estimated cost strategy. At step 340, the optimizer initiates the execution of the relational operations indicated by the selected strategy of step 330 to produce the result of the query. It is appreciated that each of the strategies considered at step 322, although having possibly different estimated costs, yields the same result. At step 345, system 112 reports the output of the executed strategy as the result of the query received at step 315.
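Steps 322 and 330 can be sketched as follows. This is a minimal illustration only, not the optimizer of the present invention: the `Strategy` class, its field names, and the I/O figures are hypothetical, and the cost model assumes just the two components named for one embodiment (the cost of scanning an index plus the cost of fetching data rows).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A candidate execution strategy (hypothetical structure)."""
    name: str
    index_scan_ios: float  # estimated disk I/Os to scan the index
    row_fetch_ios: float   # estimated disk I/Os to fetch data rows

    @property
    def cost(self) -> float:
        # Cost is the estimated number of disk I/Os for the strategy.
        return self.index_scan_ios + self.row_fetch_ios


def select_lowest_cost(strategies):
    """Step 330: pick the strategy with the lowest estimated cost."""
    return min(strategies, key=lambda s: s.cost)


# Made-up estimates for three retrieval strategies on one table.
candidates = [
    Strategy("sequential", index_scan_ios=0.0, row_fetch_ios=120.0),
    Strategy("index", index_scan_ios=4.0, row_fetch_ios=30.0),
    Strategy("index_only", index_scan_ios=6.0, row_fetch_ios=0.0),
]
best = select_lowest_cost(candidates)
print(best.name, best.cost)  # index_only 6.0
```

Every candidate would return the same rows; only the estimated cost differs, which is why the accuracy of the underlying statistics matters.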
The process 410 of a first embodiment of the present invention is described with reference to the flow diagram of FIG. 4A and FIG. 4B; this embodiment of the present invention provides an automatic method for identifying the column groups for which workload statistics are to be collected within an RDBMS system. Within the present invention, the column groups for which workload statistics are to be collected are identified from predicates specified in the WHERE and HAVING clauses, and projection lists specified in a GROUP BY or DISTINCT clause of each query within a query workload. In one implementation of the present invention, the equi-selection and equi-join predicates within WHERE and HAVING clauses and projection list within a GROUP BY or DISTINCT clause are used in the above process to identify column groups belonging to different data tables of a database. This embodiment of the present invention recognizes that maintaining workload statistics about such identified column groups is very important for optimizing a query because the corresponding conditions have significant influence on determining the cardinality of the query result.
For instance, if equi-selection predicates are specified in a query, the optimizer will need to be informed of the combined selectivity so that the cardinality of the result (which is produced after all equi-selections have been applied) can be computed with high accuracy. If equi-join predicates are specified in a query, the optimizer will need to be informed of the join fanout factors so that it can determine the cardinality of a joined result with high accuracy. If a projection (e.g., in a GROUP BY or DISTINCT clause) is specified in a query, the optimizer will need to be informed of how many distinct groups will be formed which will be equal to the cardinality of the projected result.
The workload statistics that are collected on the column groups identified by the optimizer in this embodiment of the present invention include a column duplicity factor and a column null factor, which are described further below. The column duplicity factor (second embodiment) provides information about the number of distinct values (e.g., the distinct cardinality) as well as the average number of duplicates per distinct combination of column values in an identified column group. The column duplicity factor, related to the result cardinality discussed above, can be used to determine the combined selectivity of equi-selection predicates, or combined join fanout factor of equi-join predicates, or the number of distinct groups which will be equal to the cardinality of the projected result. The column null factor (third embodiment) indicates the fraction of rows in a data table that have null values in any column of an identified column group. The column null factor can be used to identify a portion of the data table that would not participate in a relational operation (e.g., join, selection).
According to the present invention, when the optimizer processes a query to find the best execution strategy (e.g., FIG. 3), it also registers (if not already registered) the information about identified column groups into system catalog portions of computer readable memory units. Within the present invention, the optimizer identifies the column groups based on equi-selections, equi-joins, and projection lists in various queries in a query workload. An advantage of the present invention is that the workload statistics are collected only on the identified column groups and nothing else leading to a more efficient statistics collection procedure. This contrasts with the prior art method of relying on the user to identify the interesting columns or column groups which is a difficult task, and therefore, can be unreliable and inefficient.
FIG. 4A and FIG. 4B illustrate a flow diagram of process 410 of this embodiment of the present invention. Process 410 resides within an optimizer implemented in accordance with the present invention and is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Embodiments of the present invention described further below illustrate different statistics generation procedures that can be used to produce the workload statistics that are used by process 410. At step 415 of FIG. 4A, the present invention optimizer receives a query. At step 420 the present invention then selects one data table ("selected table") from a list of data tables specified within the query received at step 415. At step 425, the present invention interrogates the query to determine if any equi-selection predicates are specified on the selected table. In one embodiment, an equi-selection can exist in WHERE and HAVING clauses of the query. If not, processing continues to step 435; if so, processing continues to step 430.
The equi-selection predicate indicates a search criterion wherein the search data is specified as a particular value (or variable) and the match must be exact. An example is given below:
where employee_JOBCODE=5 and employee_AGE=40
In the above example, both employee_JOBCODE and employee_AGE are column names within the selected data table. The above equi-selection indicates that matching data are to be reported only for rows wherein the employee job code and age match the above values (e.g., 5 and 40). It is appreciated that an equi-selection can be defined with respect to a single column or more than one column, and that the above equi-selection is exemplary only. At step 430, the present invention constructs a column group based on all equi-selection predicates specified on the selected table and stores it in the system catalog (within computer readable memory units of system 112). In the exemplary equi-selection above, step 430 identifies employee_JOBCODE and employee_AGE as a column group. Within step 430, identified column groups can contain one, two, or more columns. All identified column groups are stored in the system catalog. This is performed for each equi-selection predicate specified on the selected table.
At step 435 of FIG. 4A, the present invention interrogates the query to determine if any equi-join predicates are specified on the selected table. In one embodiment, an equi-join can exist in WHERE clauses of the query. If not, processing continues to step 450 (FIG. 4B); if so, processing continues to step 440. Typically, in a join operation, data from two tables are joined together based on matching data from one or more columns. For instance, assuming one data table is "Departments" and another data table is "Employees," an equi-join might join all rows of the "Departments" table and all rows of the "Employees" table that share a common department number and a common regional designation (both represented by different columns within each table). This equi-join predicate would then result in a data table that specifies all employees that worked in a particular department in a particular regional area and would also contain information about the particular department.
At step 440, the present invention constructs different column groups based on different equi-joins connecting selected data table with each of the other data tables and stores the column groups in the system catalog (within computer readable memory units of system 112). In the example above, a column group is defined by step 440 for the two columns that specify the department number and the common regional designation. All identified column groups are stored in the system catalog. This is performed for each equi-join predicate specified on the selected table. Step 450 of FIG. 4B is then entered.
At step 450 of FIG. 4B, the present invention interrogates the query to determine if any projection operations are specified on the selected table. If not, processing continues to step 460 of FIG. 4B; if so, processing continues to step 455. A projection operation is used mainly for two purposes: (1) to compute aggregates of rows of data by a projection column or columns and then build a new data table having the aggregated result; and (2) to eliminate duplicate row data by creating a new data table and projecting distinct data from a given data table into the new data table. Projections or projection lists are specified in a GROUP BY or DISTINCT clause of a query and are used to perform the above functions on columns of the selected table. At step 455, the present invention constructs a column group based on the projection specified on the selected table and stores the column group in the system catalog (within computer readable memory units of system 112). Step 460 of FIG. 4B is then entered.
At step 460, the present invention examines the query received at step 415 to determine if more data tables are specified within the query that have not yet been processed by process 410. If so, processing returns to step 420 of FIG. 4A to select a next data table; if not, process 410 ends.
It is appreciated that the determined set of column groups identified (from steps 430, 440, and 455) can include one or more columns on which single column or multi-column workload statistics are to be collected. At the completion of process 410, the present invention stores the columns and/or column groups selected at steps 430, 440, and 455 into the system catalog portion of computer readable memory units of system 112 provided they were not previously stored.
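The flow of process 410 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the patented implementation: the query is assumed to be already parsed into a hypothetical dictionary (`tables`, `equi_selections`, `equi_joins`, `projections`), the system catalog is modeled as a plain dictionary, and the table and column names are invented for the example.

```python
from collections import defaultdict


def identify_column_groups(query, catalog):
    """Register column groups per table, loosely following process 410."""
    for table in query["tables"]:                      # steps 420 and 460
        groups = []

        # Steps 425-430: one group from all equi-selection columns.
        sel = {c for t, c in query.get("equi_selections", []) if t == table}
        if sel:
            groups.append(frozenset(sel))

        # Steps 435-440: one group per joined partner table.
        by_partner = defaultdict(set)
        for (lt, lc), (rt, rc) in query.get("equi_joins", []):
            if lt == table:
                by_partner[rt].add(lc)
            elif rt == table:
                by_partner[lt].add(rc)
        groups.extend(frozenset(cols) for cols in by_partner.values())

        # Steps 450-455: one group from the projection list.
        proj = {c for t, c in query.get("projections", []) if t == table}
        if proj:
            groups.append(frozenset(proj))

        # A set keeps each group registered only once per table.
        catalog.setdefault(table, set()).update(groups)
    return catalog


# Hypothetical parsed query: two equi-selections, a two-column
# equi-join, and a GROUP BY on departments.region.
query = {
    "tables": ["employees", "departments"],
    "equi_selections": [("employees", "jobcode"), ("employees", "age")],
    "equi_joins": [
        (("employees", "dept_no"), ("departments", "dept_no")),
        (("employees", "region"), ("departments", "region")),
    ],
    "projections": [("departments", "region")],
}
catalog = identify_column_groups(query, {})
for table, groups in catalog.items():
    print(table, sorted(sorted(g) for g in groups))
```

Running the same query through the function again leaves the catalog unchanged, mirroring the "if not already registered" behavior described above.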
When a statistics collection utility runs, it reads the column groups stored in the system catalog of the RDBMS system, and collects the workload statistics for each column group. By identifying a set of column groups based on the various clauses specified within each query of a query workload, the present invention advantageously provides an automatic mechanism for determining certain data in the database for which statistics are to be collected. The availability of the workload statistics on identified columns is deemed to be highly relevant to the cost estimation analysis performed by the optimizer to find an optimal strategy for each query in the query workload.
FIG. 5 illustrates steps of the statistics generation process 560 which operates to construct the statistics used in process flow 310 (FIG. 3). It is appreciated that process 560 typically operates in the background of the RDBMS system in an effort to avoid interference with the RDBMS system's execution of other relational operations. In this respect, process 560 and process 310 are asynchronous. Process 560 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Process 560 commences at step 565 where a particular data table is selected ("selected table").
At step 570 of FIG. 5, process 560 accesses the system catalog within the RDBMS system (e.g., a portion of memory unit 102) to obtain a list of identified column groups (on which to collect workload statistics) associated with the selected data table. At step 575, the workload statistics (column duplicity factor and column null factor) are generated for each of the identified column groups (e.g., identified by process 410 above). This is performed for each identified column group in the system catalog. At step 580, the generated workload statistics are then stored together with their respective column groups in the system catalog of the RDBMS system for use by process 310 (FIG. 3) when needed.
At step 585, the present invention checks if more data tables are present in the database. If so, process 560 returns to step 565 to select a new data table; if not, process 560 returns (to be executed at a later predetermined time).
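The loop structure of process 560 can be sketched as below. This is an illustrative outline only: the in-memory `tables` and `catalog` dictionaries, the function names, and the stand-in statistic `count_non_null` are all assumptions made for the example, and the actual statistics computed at step 575 are the column duplicity factor and column null factor.

```python
def run_statistics_pass(tables, catalog, compute_stats):
    """One background pass over all tables, loosely following process 560.

    tables:  table name -> list of row dicts
    catalog: table name -> list of column groups (tuples of column names)
    """
    stored = {}
    for table_name, rows in tables.items():          # steps 565 and 585
        for group in catalog.get(table_name, []):    # step 570
            # Steps 575-580: generate and store statistics per group.
            stored[(table_name, group)] = compute_stats(rows, group)
    return stored


def count_non_null(rows, group):
    """Stand-in statistic: rows with no null in any column of the group."""
    return sum(all(row[c] is not None for c in group) for row in rows)


# Tiny made-up table and catalog contents.
tables = {"employees": [{"age": 34, "salary": 30000},
                        {"age": None, "salary": 32000}]}
catalog = {"employees": [("age",), ("age", "salary")]}
stored = run_statistics_pass(tables, catalog, count_non_null)
print(stored)
```

Passing the statistic as a callable keeps the traversal separate from the computation performed for each column group.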
It is appreciated that the following two embodiments of the present invention operate within step 575 to provide, respectively, collection of (1) workload-based multi-column non-null statistic (column duplicity factor), and (2) workload-based multi-column null statistic (column null factor) for each of the identified column groups in the database. It is appreciated that the embodiment of the present invention represented by process 410 provides methods for identifying the data for which workload statistics are to be collected. This embodiment of the present invention places identifiers of this data within the system catalog which is read by step 570 of process 560, although other memory locations can also be used within the scope of the present invention.
The zero-correlation assumption made by the conventional systems usually causes a large percentage error in the estimation of query solution costs and cardinalities when strong correlation exists between multiple columns on which relational operations are performed, such as selections, projections, and joins. Large errors in solution costs lead the optimizer to select a sub-optimal strategy for a query. The first embodiment of the present invention avoids the above problem through automatic identification of column groups based on a query workload, and collecting workload statistics on single columns and multiple columns (multi-column statistics) as a group, rather than collecting statistics merely on individual columns. In effect, the first embodiment of the present invention provides a mechanism for capturing the correlation between columns by collecting statistics on multiple columns as a group.
FIG. 6 illustrates a process flow 610 of steps of the second embodiment of the present invention. Process 610 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). At step 615, the present invention reads the system catalog space of a computer readable memory unit of system 112 to receive an indication of certain column groups (column identifiers) on which workload statistics need to be collected. According to a preferred embodiment of the present invention, this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention.
By way of example, FIG. 7 illustrates a data table 500 comprising three columns: 1) an employee identification number column 510; 2) an employee age column 520; and 3) an annual salary column 530. The data table 500 contains "n" rows 540-547. In step 615 of FIG. 6, the present invention receives column identifiers of a column group for which a column duplicity factor needs to be computed, e.g., on the age column 520 and the annual salary column 530 of data table 500.
At step 620 of FIG. 6, the present invention accesses row groups containing identified columns (identified in step 615) from the selected table. A row group includes the corresponding rows of each identified column of the data table. The identified columns on which workload statistics are to be collected are referred to as "column groups." Each of the age and annual salary entries of a particular row, e.g., row 542, of data table 500 is a corresponding row group. In the example of FIG. 7, the rows are corresponding because they belong to a same employee identification number. The following are further examples of row groups in accordance with the example of FIG. 7: 1) 34/30,000 of row 541; 2) 37/35,000 of row 544; and 3) 37/40,000 of row 546.
At step 625 of FIG. 6, the present invention then determines the number of unique row groups of the row groups accessed in step 620. This statistic is called the distinct cardinality of the columns identified in step 615 (e.g., the distinct cardinality of the column group). Also at step 625, the present invention counts the total number of non-null row groups. In the example of FIG. 7, assuming n=8, the number of non-null row groups is 8.
At step 625, it is appreciated that corresponding data values of multiple columns are considered in determining the number of distinct row groups within the present invention. Therefore, the result cardinality of the column group is dependent on the data of multiple columns. In the example of FIG. 7, assuming n=8, the number of distinct row groups is 5 and they are: 34/30,000; 36/32,000; 37/35,000; 37/40,000; and 57/50,000. With reference to the above example, the present invention determines the number of distinct "pairs" between the age and income columns of data table 500. At step 625, the multi-column distinct cardinality statistic of 5 is reported as the number of distinct row groups in the column group including column 520 and column 530.
At step 630, the column duplicity factor (a multi-column statistic) is computed by dividing the number of non-null row groups by the number of distinct row groups in the selected table. The column duplicity factor represents the average number of rows returned by a query of the data table that includes specific selection values for the columns identified in step 615. In the example of FIG. 7, the column duplicity factor represents the average number of rows matching a query including a specific age value and a specific income value. Assuming n=8, the result cardinality of FIG. 7 reported by step 630 is 8/5 or 1.6. The column duplicity factor determined at step 630 is then stored as a workload-based statistic corresponding to the columns that were identified in step 615. These values are stored and indexed in computer readable memory units of system 112. As described with reference to FIG. 3, the optimizer utilizes these workload statistics to perform accurate cost estimations in its cost analysis in determining the lowest cost strategy for performing a query.
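Steps 620 through 630 can be sketched as follows. The function name and the row representation are assumptions for illustration. The text gives only the five distinct age/salary pairs and three individual rows of FIG. 7, so the full eight-row table below is a made-up arrangement chosen to be consistent with those values.

```python
def column_duplicity_factor(rows, columns):
    """Non-null row groups divided by distinct row groups (steps 620-630)."""
    # Step 620: build one row group (tuple of values) per table row.
    groups = [tuple(row[c] for c in columns) for row in rows]
    # Step 625: keep non-null row groups and count the distinct ones.
    non_null = [g for g in groups if all(v is not None for v in g)]
    distinct = set(non_null)
    if not distinct:
        return None  # undefined when there are no non-null row groups
    # Step 630: average number of rows per distinct combination.
    return len(non_null) / len(distinct)


# Hypothetical contents for rows 540-547 of data table 500 (n = 8).
rows = [
    {"age": 34, "salary": 30000},
    {"age": 34, "salary": 30000},
    {"age": 36, "salary": 32000},
    {"age": 36, "salary": 32000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 40000},
    {"age": 57, "salary": 50000},
]
print(column_duplicity_factor(rows, ("age", "salary")))  # 1.6
```

The same function applied to a single column, e.g. `("age",)`, yields the single-column duplicity factor (here 8/4 = 2.0).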
A Comparison With Single Column Method. To illustrate the increased accuracy of multi-column workload statistics of the present invention, a comparison with workload statistics determination using single columns is presented. With reference to the example of FIG. 7, the age selectivity of column 520 is one over the column's distinct cardinality or 1/4. The income selectivity of column 530 is one over the column's distinct cardinality or 1/5. The overall selectivity is computed below:
overall selectivity = age selectivity * income selectivity
overall selectivity = (1/4) * (1/5) = 1/20
The result cardinality of the two columns is expressed as the overall selectivity times the table cardinality which represents the number of rows of the data table 500. In this example n=8, so the result cardinality computed from single column statistics is expressed below:
result cardinality = (1/20) * (8) = 0.4
Based on single column statistics, the average number of rows a query (e.g., SELECT FROM DATA TABLE 500 WHERE AGE=x AND INCOME=y) is expected to locate based on a selection of both age and income is 0.4. However, under the multi-column workload statistics of the present invention, the average number of rows a query is expected to locate based on a selection of both age and income is 1.6. The multi-column workload statistic of the present invention is much more accurate than the single column derived workload statistic because in the example case of FIG. 7, and in many instances, the data within the columns are correlated. The present invention captures this correlation, which leads to a more accurate result cardinality value, whereas the single column derivation method of conventional systems ignores it.
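The two estimates can be reproduced side by side. The eight (age, income) pairs below are a hypothetical arrangement consistent with the FIG. 7 values quoted in the text (four distinct ages, five distinct incomes, five distinct pairs); the variable names are illustrative only.

```python
# Hypothetical (age, income) pairs for the n = 8 rows of data table 500.
pairs = [
    (34, 30000), (34, 30000),
    (36, 32000), (36, 32000),
    (37, 35000), (37, 35000),
    (37, 40000),
    (57, 50000),
]
n = len(pairs)

# Single-column method: treat the columns as uncorrelated and multiply
# the per-column selectivities (1 / distinct cardinality).
distinct_ages = len({age for age, _ in pairs})           # 4
distinct_incomes = len({income for _, income in pairs})  # 5
single_column_estimate = (1 / distinct_ages) * (1 / distinct_incomes) * n

# Multi-column method: count distinct pairs directly.
distinct_pairs = len(set(pairs))                         # 5
multi_column_estimate = n / distinct_pairs

print(round(single_column_estimate, 2))  # 0.4
print(round(multi_column_estimate, 2))   # 1.6
```

With this data the single-column estimate is a quarter of the multi-column one, because age and income are strongly correlated.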
By having more accurate result cardinalities, the second embodiment of the present invention provides an RDBMS optimizer with more accurate estimated costs for optimal strategy selection. With more accurate costs, an optimal strategy is more likely to be selected by the optimizer, and sub-optimal strategy selection becomes less frequent. With the present invention, more accurate cost estimation by the optimizer is possible because more accurate information is available regarding: (1) how many rows on average are selected from a given data table; and/or (2) how many rows on average are going to participate in a relational operation (e.g., a join).
The process 710 of a third embodiment of the present invention, like the second embodiment (process 610), operates within the statistics generation procedures of an RDBMS system (process 560). The process 710 of the third embodiment of the present invention is described with reference to the flow diagram of FIG. 8 and the exemplary data tables of FIG. 7. The third embodiment of the present invention maintains separate workload statistics regarding the number of rows with null data values in a column group. Separating rows with null data from rows with non-null data is important because rows with nulls do not participate in most of the relational operations performed within an RDBMS. Therefore, as recognized by the third embodiment of the present invention, the cardinality (e.g., the number of rows) of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
FIG. 8 illustrates a process 710 performed by an implementation of the third embodiment of the present invention for computing multi-column null statistics for a column group. Process 710 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). FIG. 8 is described with respect to an example data table 500 of FIG. 7 where n is 100. In this example, the age column 520 contains null data rows for 50 of the employees. Also in this example, there are only 20 distinct age entries in age column 520, so it is appreciated that the initial average number of rows per distinct age value (e.g., the column duplicity factor computed without consideration of null data values) for column 520 is 100/20 or 5.0.
With reference to FIG. 8, at step 715, the present invention receives column identifiers of a column group of a selected table. In effect, system 112 reads the system catalog space of a computer readable memory unit of system 112 to receive the column group identifiers on which to collect workload-based null statistics of the selected table. According to a preferred embodiment of the present invention, this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention. For instance, the present invention receives an indication that workload statistics are to be collected based on the age column 520 of data table 500 (FIG. 7). At step 720, the present invention accesses the corresponding row groups of the data table containing the identified column groups.
At step 730, the present invention counts the number of row groups with null data in any of the identified columns. In one embodiment, at step 730, system 112 examines the individual entries of the corresponding row groups to determine the number of row groups with null data in any of the identified column groups. For instance, at step 730, the present invention performs a search, or other equivalent operation, to determine the number of null rows within the identified column and records this value as a null data statistic for the identified column. This is done for each of the identified columns for the selected table. With reference to column 520 (FIG. 7), the present invention performs a search and determines that 50 of the rows of identified column 520 contain null data values.
At step 740 of FIG. 8, the present invention utilizes the number of null rows determined in step 730 and the number of table rows to determine the column null factor for the identified column. At step 740, system 112 computes the column null factor as equal to the number of row groups with null data divided by the number of table rows. In one embodiment, this is computed for all identified columns of the selected table. For instance, with respect to column 520, the present invention divides the number of null rows (50) by the total number of table rows (100) of column 520 to compute a column null factor of 0.5.
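Steps 730 and 740 can be sketched as follows. This is a minimal, illustrative Python sketch and not the implementation of the embodiment; the function name `column_null_factor`, the representation of a data table as a list of dictionaries, and the use of `None` to stand for null data are all assumptions made for the example.

```python
from typing import Any, Mapping, Sequence


def column_null_factor(rows: Sequence[Mapping[str, Any]],
                       columns: Sequence[str]) -> float:
    """Return the fraction of rows with null data in any identified column.

    Hypothetical helper: None stands in for a null data value.
    """
    if not rows:
        return 0.0  # assumption: an empty table is given a null factor of 0
    # Step 730: count the rows with null data in any of the identified columns.
    null_rows = sum(
        1 for row in rows if any(row.get(col) is None for col in columns)
    )
    # Step 740: divide the null row count by the number of table rows.
    return null_rows / len(rows)


# Example mirroring column 520: 100 employees, 50 of them with a null age.
employees = [{"age": None} for _ in range(50)] + [
    {"age": 21 + (i % 20)} for i in range(50)
]
print(column_null_factor(employees, ["age"]))  # 0.5
```

Passing more than one column name treats the columns as a column group, so a row counts as null when any column of the group is null.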
In the example, the age column 520 contains null data rows for 50 of the employees out of a total of 100. Also, in the example, there are only 20 distinct values in the age column 520. It is appreciated that without the null statistic, the initial result cardinality for an equi-selection on column 520 is 100/20 or 5.0. By collecting the null statistic (column null factor), the present invention gives a much more accurate final result cardinality by multiplying the initial result cardinality by one minus the column null factor. This is shown below:
final result cardinality = initial result cardinality * (1 - null factor)
                         = 5.0 * (1 - 0.5)
                         = 2.5
The final result cardinality gives a much more accurate estimate of the average number of rows of column 520 that are likely to have any particular age value, because it accounts for the rows with null values within column 520.
At step 750 of FIG. 8, the third embodiment of the present invention stores the determined column null factor in a computer readable memory unit as a workload-based null statistic for the identified column group of the selected table. For example, the null factor of 0.5 is then recorded for column 520. By separating out the rows with non-null data, the present invention gives a more accurate final result cardinality which indicates that, on average, 50/20 or 2.5 employees have the same age in our current example. On average, a query of column 520 for a particular age would result in 2.5 rows identified, not 5.0 as indicated by the initial result cardinality that does not take into consideration the presence of null data rows.
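The arithmetic of this example can be checked with a short sketch. The Python below is illustrative only; the function name `final_result_cardinality` and the variable names are assumptions introduced for the example and are not terms used by the embodiments.

```python
def final_result_cardinality(initial_cardinality: float,
                             null_factor: float) -> float:
    """Scale an initial result cardinality by the share of non-null rows."""
    if not 0.0 <= null_factor <= 1.0:
        raise ValueError("null factor must lie between 0 and 1")
    return initial_cardinality * (1.0 - null_factor)


table_rows = 100       # n for data table 500
null_rows = 50         # rows of column 520 with null data
distinct_values = 20   # distinct ages in column 520

initial = table_rows / distinct_values      # 5.0, ignores null rows
null_factor = null_rows / table_rows        # 0.5
final = final_result_cardinality(initial, null_factor)

# The same figure obtained directly from the non-null rows: 50 / 20.
direct = (table_rows - null_rows) / distinct_values
print(final, direct)  # 2.5 2.5
```

Both routes agree because multiplying by one minus the null factor is equivalent to dividing only the non-null rows by the number of distinct values.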
The following material illustrates the manner in which the workload statistics (e.g., column duplicity factor and column null factor) are used to compute the result cardinality of the following relational operations: 1) equi-selection, 2) equi-join, and 3) projection.
In this example, two data tables exist, each with one column group. Refer to the column group in the first table as CG1 and the column group in the second table as CG2. Herein, the column duplicity factor and column null factor based on CG1 are denoted as CDF1 and CNF1, and the column duplicity factor and column null factor based on CG2 are denoted as CDF2 and CNF2. Also, the cardinality of the first table (i.e., its number of rows) is denoted as TC1, and the cardinality of the second table is denoted as TC2.
If equi-selection predicates are specified on all columns in CG1 then the result cardinality of an equi-selection operation on the first table is estimated as equal to CDF1. Note that CDF1 represents the average number of duplicates per distinct value of CG1. An equi-selection over CG1 is nothing but selecting a distinct value of CG1, and therefore, the result cardinality of such an operation is expected to be equal to CDF1.
Similarly, if equi-selection predicates are specified on all columns in CG2 then the result cardinality of an equi-selection operation on the second table is estimated as equal to CDF2.
If a projection is specified on all columns in CG1 then the result cardinality of a projection operation on the first table is estimated as equal to TC1*(1-CNF1)/CDF1. Note that TC1*(1-CNF1)/CDF1 represents the distinct cardinality of CG1. A projection over CG1 is nothing but selecting all distinct values in CG1, and therefore, the result cardinality of such an operation is expected to be equal to TC1*(1-CNF1)/CDF1.
Similarly, if a projection is specified on all columns in CG2 then the result cardinality of a projection operation on the second table is estimated as equal to TC2*(1-CNF2)/CDF2.
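The equi-selection and projection estimates described above can be sketched as follows. This Python is a hedged illustration, not the optimizer's code; the function names are assumptions, and the sample values (TC1 = 100, CNF1 = 0.5, CDF1 = 2.5) are borrowed from the age column example on the assumption that the column duplicity factor is measured over the non-null rows (50/20).

```python
def estimate_equi_selection(cdf: float) -> float:
    """Equi-selection on all columns of a column group returns about CDF rows."""
    return cdf


def estimate_projection(tc: float, cnf: float, cdf: float) -> float:
    """Projection on all columns of a column group returns its distinct
    cardinality, TC * (1 - CNF) / CDF."""
    if cdf <= 0:
        raise ValueError("column duplicity factor must be positive")
    return tc * (1.0 - cnf) / cdf


# Assumed sample statistics for the first table (see lead-in).
TC1, CNF1, CDF1 = 100, 0.5, 2.5

print(estimate_equi_selection(CDF1))         # 2.5
print(estimate_projection(TC1, CNF1, CDF1))  # 20.0
```

With these assumed values the projection estimate recovers the 20 distinct ages of the example.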
If equi-join predicates are specified between all columns in CG1 and CG2 then the result cardinality of an equi-join operation between first and second tables is estimated as follows: The quantity TC1*(1-CNF1)/CDF1 is denoted as DC1, and similarly the quantity TC2*(1-CNF2)/CDF2 is denoted as DC2. Herein DC1 and DC2 represent, respectively, the distinct cardinality of CG1 and CG2. The result of the equi-join operation based on CG1 and CG2 is estimated as equal to MIN (DC1, DC2)*CDF1*CDF2. Herein MIN is a function that chooses the minimum value between DC1 and DC2.
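The equi-join estimate can be sketched in the same way. The Python below is an illustrative sketch under stated assumptions: the function names are hypothetical, and the statistics for both tables are made-up sample values chosen only to exercise the formula MIN(DC1, DC2)*CDF1*CDF2.

```python
def distinct_cardinality(tc: float, cnf: float, cdf: float) -> float:
    """DC = TC * (1 - CNF) / CDF for one column group."""
    if cdf <= 0:
        raise ValueError("column duplicity factor must be positive")
    return tc * (1.0 - cnf) / cdf


def estimate_equi_join(tc1: float, cnf1: float, cdf1: float,
                       tc2: float, cnf2: float, cdf2: float) -> float:
    """Equi-join estimate: MIN(DC1, DC2) * CDF1 * CDF2."""
    dc1 = distinct_cardinality(tc1, cnf1, cdf1)
    dc2 = distinct_cardinality(tc2, cnf2, cdf2)
    return min(dc1, dc2) * cdf1 * cdf2


# Assumed sample statistics (hypothetical values, not from the figures).
# First table:  TC1 = 100, CNF1 = 0.5, CDF1 = 2.5  ->  DC1 = 20
# Second table: TC2 = 200, CNF2 = 0.0, CDF2 = 4.0  ->  DC2 = 50
print(estimate_equi_join(100, 0.5, 2.5, 200, 0.0, 4.0))  # 200.0
```

In this sample the smaller distinct cardinality (20) bounds the number of matching values, and each matching value contributes CDF1*CDF2 = 10 joined rows.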
The embodiments of the present invention, a method for automatic identification of column groups based on a query workload by the RDBMS optimizer, a method for collecting multi-column non-null statistic on identified column groups, and a method for collecting multi-column null statistics on identified column groups, are thus described. While the present invention has been described in particular embodiments, it should be appreciated that the present invention should not be construed as limited by such embodiments, but rather construed according to the below claims.
Claims (19)
1. In a relational database management system having a processor coupled to a bus and a computer readable memory unit coupled to said bus, a computer implemented method for determining a workload statistic, said method comprising the steps of:
a) accessing a predetermined portion of said computer readable memory unit to determine an identified first column of data within a data table;
b) determining the number of rows of said first column that contain null data;
c) determining a first workload statistic associated with said first column, wherein said workload statistic is compensated for contribution of said rows of said first column that contain null data; and
d) storing said first workload statistic determined in step c) in a predetermined portion of said computer readable memory unit for use by an optimizer of said relational database management system.
2. A method as described in claim 1 further comprising the step of e) performing a cost analysis of executing a query according to a number of different strategies to determine a least cost execution strategy, said query involving said first column, wherein said step e) is performed by said optimizer which accesses and uses said workload statistic determined in step c) in performing said cost analysis.
3. A method as described in claim 1 wherein said first workload statistic is a column null factor of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column; and
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column.
4. A method as described in claim 1 wherein said first workload statistic is a result cardinality of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column;
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column; and
c4) determining said result cardinality based on an initial cardinality of said first column and said column null factor.
5. A method as described in claim 1 wherein said predetermined portion of said computer readable memory unit comprises a system catalog of said relational database management system.
6. In a relational database management system having a processor coupled to a bus and a computer readable memory unit coupled to said bus, a computer implemented method for identifying data for which workload statistics are to be collected, said method comprising the steps of:
a) receiving a query to be processed by said relational database management system, said query having an associated data table;
b) responsive to step a), invoking an optimizer within said relational database management system to automatically identify sets of columns and column groups associated with said data table of said query;
c) said optimizer registering said set of columns and column groups within a predetermined portion of said computer readable memory unit for use by statistics generation processes of said relational database management system; and
d) accessing said predetermined portion of said computer readable memory unit and collecting workload statistics on said set of columns and column groups registered by said step c).
7. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection predicates of WHERE clauses in said query.
8. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection and/or equi-join predicates of HAVING clauses in said query.
9. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of GROUP BY clauses in said query.
10. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of DISTINCT clauses in said query.
11. A computer system comprising:
a bus;
a processor coupled to a bus; and
a computer readable memory unit coupled to said bus, said computer readable memory having stored therein instructions that when executed implement a method of determining a workload statistic, said method comprising the steps of:
a) accessing a predetermined portion of said computer readable memory unit to determine an identified first column of data within a data table;
b) determining the number of rows of said first column that contain null data;
c) determining a first workload statistic associated with said first column, wherein said workload statistic is compensated for contribution of said rows of said first column that contain null data;
d) storing said first workload statistic determined in step c) in a predetermined portion of said computer readable memory unit for use by an optimizer of said relational database management system; and
e) performing a cost analysis of executing a query according to a number of different strategies to determine a least cost execution strategy, said query involving said first column, wherein said step e) is performed by said optimizer which accesses and uses said workload statistic determined in step c) in performing said cost analysis.
12. A method as described in claim 11 wherein said first workload statistic is a column null factor of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column; and
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column.
13. A method as described in claim 11 wherein said first workload statistic is a result cardinality of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column;
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column; and
c4) determining said result cardinality based on an initial cardinality of said first column and said column null factor.
14. A method as described in claim 11 wherein said predetermined portion of said computer readable memory unit comprises a system catalog of said relational database management system.
15. A computer system comprising:
a bus;
a processor coupled to said bus; and
a computer readable memory unit coupled to said bus, said computer readable memory unit having stored therein instructions that when executed implement a method of identifying data for which workload statistics are to be collected, said method comprising the steps of:
a) receiving a query to be processed by said relational database management system, said query having an associated data table;
b) responsive to step a), invoking an optimizer within said relational database management system to automatically identify sets of columns and columns groups associated with said data table of said query;
c) said optimizer registering said set of columns and column groups within a predetermined portion of said computer readable memory unit for use by statistics generation processes of said relational database management system; and
d) accessing said predetermined portion of said computer readable memory unit and collecting workload statistics on said set of columns and column groups registered by said step c).
16. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and columns groups from column identifications of said data table, said column identifications located within equi-selection predicates of WHERE clauses said query.
17. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and columns groups from column identifications of said data table, said column identifications located within equi-selection and/or equi-join predicates of HAVING clauses said query.
18. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and columns groups from column identifications of said data table, said column identifications located within projection lists of GROUP BY clauses said query.
19. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and columns groups from column identifications of said data table, said column identifications located within projection lists of DISTINCT clauses said query.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/164,400 US6029163A (en) | 1997-02-10 | 1998-09-30 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/796,779 US5899986A (en) | 1997-02-10 | 1997-02-10 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
US09/164,400 US6029163A (en) | 1997-02-10 | 1998-09-30 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/796,779 Continuation US5899986A (en) | 1997-02-10 | 1997-02-10 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
Publications (1)
Publication Number | Publication Date |
---|---|
US6029163A (en) | 2000-02-22 |
Family
ID=25169044
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US08/796,779 Expired - Lifetime US5899986A (en) | 1997-02-10 | 1997-02-10 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
US09/164,400 Expired - Lifetime US6029163A (en) | 1997-02-10 | 1998-09-30 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US08/796,779 Expired - Lifetime US5899986A (en) | 1997-02-10 | 1997-02-10 | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
Country Status (1)
Country | Link |
---|---|
US (2) | US5899986A (en) |
Cited By (79)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6173278B1 (en) * | 1997-11-03 | 2001-01-09 | Newframe Corporation Otd. | Method of and special purpose computer for utilizing an index of a relational data base table |
US6266658B1 (en) * | 2000-04-20 | 2001-07-24 | Microsoft Corporation | Index tuner for given workload |
US6353826B1 (en) * | 1997-10-23 | 2002-03-05 | Sybase, Inc. | Database system with methodology providing improved cost estimates for query strategies |
US6374235B1 (en) * | 1999-06-25 | 2002-04-16 | International Business Machines Corporation | Method, system, and program for a join operation on a multi-column table and satellite tables including duplicate values |
US20020059213A1 (en) * | 2000-10-25 | 2002-05-16 | Kenji Soga | Minimum cost path search apparatus and minimum cost path search method used by the apparatus |
US6397204B1 (en) * | 1999-06-25 | 2002-05-28 | International Business Machines Corporation | Method, system, and program for determining the join ordering of tables in a join query |
US20020107835A1 (en) * | 2001-02-08 | 2002-08-08 | Coram Michael T. | System and method for adaptive result set caching |
US6438538B1 (en) * | 1999-10-07 | 2002-08-20 | International Business Machines Corporation | Data replication in data warehousing scenarios |
US6446063B1 (en) * | 1999-06-25 | 2002-09-03 | International Business Machines Corporation | Method, system, and program for performing a join operation on a multi column table and satellite tables |
US20020123979A1 (en) * | 2001-01-12 | 2002-09-05 | Microsoft Corporation | Sampling for queries |
US6496877B1 (en) * | 2000-01-28 | 2002-12-17 | International Business Machines Corporation | Method and apparatus for scheduling data accesses for random access storage devices with shortest access chain scheduling algorithm |
US20030083959A1 (en) * | 2001-06-08 | 2003-05-01 | Jinshan Song | System and method for creating a customized electronic catalog |
US20030084025A1 (en) * | 2001-10-18 | 2003-05-01 | Zuzarte Calisto Paul | Method of cardinality estimation using statistical soft constraints |
US20030093436A1 (en) * | 2001-09-28 | 2003-05-15 | International Business Machines Corporation | Invocation of web services from a database |
US20030115183A1 (en) * | 2001-12-13 | 2003-06-19 | International Business Machines Corporation | Estimation and use of access plan statistics |
US20030135485A1 (en) * | 2001-12-19 | 2003-07-17 | Leslie Harry Anthony | Method and system for rowcount estimation with multi-column statistics and histograms |
US6598038B1 (en) * | 1999-09-17 | 2003-07-22 | Oracle International Corporation | Workload reduction mechanism for index tuning |
US6609126B1 (en) | 2000-11-15 | 2003-08-19 | Appfluent Technology, Inc. | System and method for routing database requests to a database and a cache |
US20030191769A1 (en) * | 2001-09-28 | 2003-10-09 | International Business Machines Corporation | Method, system, and program for generating a program capable of invoking a flow of operations |
US20040003004A1 (en) * | 2002-06-28 | 2004-01-01 | Microsoft Corporation | Time-bound database tuning |
US6732110B2 (en) * | 2000-08-28 | 2004-05-04 | International Business Machines Corporation | Estimation of column cardinality in a partitioned relational database |
US20040128299A1 (en) * | 2002-12-26 | 2004-07-01 | Michael Skopec | Low-latency method to replace SQL insert for bulk data transfer to relational database |
US20040199636A1 (en) * | 2001-09-28 | 2004-10-07 | International Business Machines Corporation | Automatic generation of database invocation mechanism for external web services |
US20040210563A1 (en) * | 2003-04-21 | 2004-10-21 | Oracle International Corporation | Method and system of collecting execution statistics of query statements |
US20040236722A1 (en) * | 2003-05-20 | 2004-11-25 | Microsoft Corporation | System and method for cardinality estimation based on query execution feedback |
US20050004907A1 (en) * | 2003-06-27 | 2005-01-06 | Microsoft Corporation | Method and apparatus for using conditional selectivity as foundation for exploiting statistics on query expressions |
US20050021287A1 (en) * | 2003-06-25 | 2005-01-27 | International Business Machines Corporation | Computing frequent value statistics in a partitioned relational database |
US20050065911A1 (en) * | 1998-12-16 | 2005-03-24 | Microsoft Corporation | Automatic database statistics creation |
US20050086242A1 (en) * | 2003-09-04 | 2005-04-21 | Oracle International Corporation | Automatic workload repository battery of performance statistics |
US20050216490A1 (en) * | 2004-03-26 | 2005-09-29 | Oracle International Corporation | Automatic database diagnostic usage models |
US20050216304A1 (en) * | 2001-06-08 | 2005-09-29 | W.W. Grainger, Inc. | System and method for electronically creating a customized catalog |
US20050240556A1 (en) * | 2000-06-30 | 2005-10-27 | Microsoft Corporation | Partial pre-aggregation in relational database queries |
US6985904B1 (en) | 2002-02-28 | 2006-01-10 | Oracle International Corporation | Systems and methods for sharing of execution plans for similar database statements |
US20060100992A1 (en) * | 2004-10-21 | 2006-05-11 | International Business Machines Corporation | Apparatus and method for data ordering for derived columns in a database system |
US20060112093A1 (en) * | 2004-11-22 | 2006-05-25 | International Business Machines Corporation | Method, system, and program for collecting statistics of data stored in a database |
US20060116984A1 (en) * | 2004-11-29 | 2006-06-01 | Thomas Zurek | Materialized samples for a business warehouse query |
US20060161515A1 (en) * | 2005-01-14 | 2006-07-20 | International Business Machines Corporation | Apparatus and method for SQL distinct optimization in a computer database system |
US7092931B1 (en) | 2002-05-10 | 2006-08-15 | Oracle Corporation | Methods and systems for database statement execution plan optimization |
US20060195416A1 (en) * | 2005-02-28 | 2006-08-31 | Ewen Stephan E | Method and system for providing a learning optimizer for federated database systems |
US20070083515A1 (en) * | 2002-12-20 | 2007-04-12 | International Business Machines Corporation | System and method for multicolumn sorting in a single column |
US20070220058A1 (en) * | 2006-03-14 | 2007-09-20 | Mokhtar Kandil | Management of statistical views in a database system |
US20070271218A1 (en) * | 2006-05-16 | 2007-11-22 | International Business Machines Corpoeation | Statistics collection using path-value pairs for relational databases |
US20080033914A1 (en) * | 2006-08-02 | 2008-02-07 | Mitch Cherniack | Query Optimizer |
US20080040348A1 (en) * | 2006-08-02 | 2008-02-14 | Shilpa Lawande | Automatic Vertical-Database Design |
US20080052266A1 (en) * | 2006-08-25 | 2008-02-28 | Microsoft Corporation | Optimizing parameterized queries in a relational database management system |
US20080071818A1 (en) * | 2006-09-18 | 2008-03-20 | Infobright Inc. | Method and system for data compression in a relational database |
US20080133458A1 (en) * | 2006-12-01 | 2008-06-05 | Microsoft Corporation | Statistics adjustment to improve query execution plans |
US20080133454A1 (en) * | 2004-10-29 | 2008-06-05 | International Business Machines Corporation | System and method for updating database statistics according to query feedback |
US7447681B2 (en) | 2005-02-17 | 2008-11-04 | International Business Machines Corporation | Method, system and program for selection of database characteristics |
US20090012977A1 (en) * | 2006-04-03 | 2009-01-08 | International Business Machines Corporation | System for estimating cardinality in a database system |
US20090030875A1 (en) * | 2004-01-07 | 2009-01-29 | International Business Machines Corporation | Statistics management |
US20090077054A1 (en) * | 2007-09-13 | 2009-03-19 | Brian Robert Muras | Cardinality Statistic for Optimizing Database Queries with Aggregation Functions |
US20090106756A1 (en) * | 2007-10-19 | 2009-04-23 | Oracle International Corporation | Automatic Workload Repository Performance Baselines |
US20090106210A1 (en) * | 2006-09-18 | 2009-04-23 | Infobright, Inc. | Methods and systems for database organization |
US20090150413A1 (en) * | 2007-12-06 | 2009-06-11 | Oracle International Corporation | Virtual columns |
US20090150366A1 (en) * | 2007-12-06 | 2009-06-11 | Oracle International Corporation | Expression replacement in virtual columns |
US20090299989A1 (en) * | 2004-07-02 | 2009-12-03 | Oracle International Corporation | Determining predicate selectivity in query costing |
US20100011030A1 (en) * | 2006-05-16 | 2010-01-14 | International Business Machines Corp. | Statistics collection using path-identifiers for relational databases |
US20100030728A1 (en) * | 2008-07-29 | 2010-02-04 | Oracle International Corporation | Computing selectivities for group of columns and expressions |
US20100114976A1 (en) * | 2008-10-21 | 2010-05-06 | Castellanos Maria G | Method For Database Design |
US20100223269A1 (en) * | 2009-02-27 | 2010-09-02 | International Business Machines Corporation | System and method for an efficient query sort of a data stream with duplicate key values |
US20110016157A1 (en) * | 2009-07-14 | 2011-01-20 | Vertica Systems, Inc. | Database Storage Architecture |
US20110022581A1 (en) * | 2009-07-27 | 2011-01-27 | Rama Krishna Korlapati | Derived statistics for query optimization |
US7941332B2 (en) | 2006-01-30 | 2011-05-10 | International Business Machines Corporation | Apparatus, system, and method for modeling, projecting, and optimizing an enterprise application system |
US20110213766A1 (en) * | 2010-02-22 | 2011-09-01 | Vertica Systems, Inc. | Database designer |
US20110218987A1 (en) * | 2006-08-25 | 2011-09-08 | Teradata Us, Inc. | Hardware accelerated reconfigurable processor for accelerating database operations and queries |
US20110258179A1 (en) * | 2010-04-19 | 2011-10-20 | Salesforce.Com | Methods and systems for optimizing queries in a multi-tenant store |
US8086598B1 (en) | 2006-08-02 | 2011-12-27 | Hewlett-Packard Development Company, L.P. | Query optimizer with schema conversion |
US20130018890A1 (en) * | 2011-07-13 | 2013-01-17 | Salesforce.Com, Inc. | Creating a custom index in a multi-tenant database environment |
US8417727B2 (en) | 2010-06-14 | 2013-04-09 | Infobright Inc. | System and method for storing data in a relational database |
US20130166557A1 (en) * | 2011-12-23 | 2013-06-27 | Lars Fricke | Unique value calculation in partitioned tables |
US8521748B2 (en) | 2010-06-14 | 2013-08-27 | Infobright Inc. | System and method for managing metadata in a relational database |
US8620888B2 (en) | 2007-12-06 | 2013-12-31 | Oracle International Corporation | Partitioning in virtual columns |
US20140074819A1 (en) * | 2012-09-12 | 2014-03-13 | Oracle International Corporation | Optimal Data Representation and Auxiliary Structures For In-Memory Database Query Processing |
US9117005B2 (en) | 2006-05-16 | 2015-08-25 | International Business Machines Corporation | Statistics collection using path-value pairs for relational databases |
US10162851B2 (en) | 2010-04-19 | 2018-12-25 | Salesforce.Com, Inc. | Methods and systems for performing cross store joins in a multi-tenant store |
US10417611B2 (en) | 2010-05-18 | 2019-09-17 | Salesforce.Com, Inc. | Methods and systems for providing multiple column custom indexes in a multi-tenant database environment |
US11449481B2 (en) * | 2017-12-08 | 2022-09-20 | Alibaba Group Holding Limited | Data storage and query method and device |
US11947568B1 (en) * | 2021-09-30 | 2024-04-02 | Amazon Technologies, Inc. | Working set ratio estimations of data items in a sliding time window for dynamically allocating computing resources for the data items |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5995957A (en) | 1997-02-28 | 1999-11-30 | International Business Machines Corporation | Query optimization through the use of multi-column statistics to avoid the problems of column correlation |
US6044370A (en) * | 1998-01-26 | 2000-03-28 | Telenor As | Database management system and method for combining meta-data of varying degrees of reliability |
US6553369B1 (en) * | 1999-03-11 | 2003-04-22 | Oracle Corporation | Approach for performing administrative functions in information systems |
US7080062B1 (en) | 1999-05-18 | 2006-07-18 | International Business Machines Corporation | Optimizing database queries using query execution plans derived from automatic summary table determining cost based queries |
US6738755B1 (en) * | 1999-05-19 | 2004-05-18 | International Business Machines Corporation | Query optimization method for incrementally estimating the cardinality of a derived relation when statistically correlated predicates are applied |
US7890491B1 (en) * | 1999-12-22 | 2011-02-15 | International Business Machines Corporation | Query optimization technique for obtaining improved cardinality estimates using statistics on automatic summary tables |
CA2307155A1 (en) * | 2000-04-28 | 2001-10-28 | Ibm Canada Limited-Ibm Canada Limitee | Execution of database queries including filtering |
EP1346289A1 (en) * | 2000-11-30 | 2003-09-24 | Appfluent Technology, Inc. | System and method for delivering dynamic content |
US7483979B1 (en) | 2001-01-16 | 2009-01-27 | International Business Machines Corporation | Method and system for virtualizing metadata between disparate systems |
US6895412B1 (en) | 2001-04-12 | 2005-05-17 | Ncr Corporation | Methods for dynamically configuring the cardinality of keyword attributes |
US7231460B2 (en) * | 2001-06-04 | 2007-06-12 | Gateway Inc. | System and method for leveraging networked computers to view windows based files on Linux platforms |
US7177856B1 (en) | 2001-10-03 | 2007-02-13 | International Business Machines Corporation | Method for correlating data from external databases |
US6801903B2 (en) * | 2001-10-12 | 2004-10-05 | Ncr Corporation | Collecting statistics in a database system |
US6934706B1 (en) | 2002-03-22 | 2005-08-23 | International Business Machines Corporation | Centralized mapping of security credentials for database access operations |
US20040002956A1 (en) * | 2002-06-28 | 2004-01-01 | Microsoft Corporation | Approximate query processing using multiple samples |
US7349949B1 (en) | 2002-12-26 | 2008-03-25 | International Business Machines Corporation | System and method for facilitating development of a customizable portlet |
US7359982B1 (en) | 2002-12-26 | 2008-04-15 | International Business Machines Corporation | System and method for facilitating access to content information |
US20040143581A1 (en) * | 2003-01-15 | 2004-07-22 | Bohannon Philip L. | Cost-based storage of extensible markup language (XML) data |
US7647280B1 (en) * | 2003-12-08 | 2010-01-12 | Teradata Us, Inc. | Closed-loop estimation of request costs |
EP1548612A1 (en) * | 2003-12-22 | 2005-06-29 | Software Engineering GmbH | Method and device for providing column statistics for data within a relational database |
US7376638B2 (en) * | 2003-12-24 | 2008-05-20 | International Business Machines Corporation | System and method for addressing inefficient query processing |
US7373354B2 (en) * | 2004-02-26 | 2008-05-13 | Sap Ag | Automatic elimination of functional dependencies between columns |
US7302422B2 (en) * | 2004-04-14 | 2007-11-27 | International Business Machines Corporation | Query workload statistics collection in a database management system |
US7814072B2 (en) * | 2004-12-30 | 2010-10-12 | International Business Machines Corporation | Management of database statistics |
US7571168B2 (en) * | 2005-07-25 | 2009-08-04 | Parascale, Inc. | Asynchronous file replication and migration in a storage network |
US20070055658A1 (en) * | 2005-09-08 | 2007-03-08 | International Business Machines Corporation | Efficient access control enforcement in a content management environment |
US7840555B2 (en) * | 2005-09-20 | 2010-11-23 | Teradata Us, Inc. | System and a method for identifying a selection of index candidates for a database |
US8200659B2 (en) * | 2005-10-07 | 2012-06-12 | Bez Systems, Inc. | Method of incorporating DBMS wizards with analytical models for DBMS servers performance optimization |
US8732138B2 (en) * | 2005-12-21 | 2014-05-20 | Sap Ag | Determination of database statistics using application logic |
US9805077B2 (en) * | 2008-02-19 | 2017-10-31 | International Business Machines Corporation | Method and system for optimizing data access in a database using multi-class objects |
KR101621490B1 (en) | 2014-08-07 | 2016-05-17 | (주)그루터 | Query execution apparatus and method, and system for processing data employing the same |
US10204135B2 (en) | 2015-07-29 | 2019-02-12 | Oracle International Corporation | Materializing expressions within in-memory virtual column units to accelerate analytic queries |
US10372706B2 (en) | 2015-07-29 | 2019-08-06 | Oracle International Corporation | Tracking and maintaining expression statistics across database queries |
US10713244B2 (en) * | 2016-05-09 | 2020-07-14 | Sap Se | Calculation engine optimizations for join operations utilizing automatic detection of forced constraints |
US10380112B2 (en) | 2017-07-31 | 2019-08-13 | International Business Machines Corporation | Joining two data tables on a join attribute |
US11182437B2 (en) * | 2017-10-26 | 2021-11-23 | International Business Machines Corporation | Hybrid processing of disjunctive and conjunctive conditions of a search query for a similarity search |
CN110109910A (en) * | 2018-01-08 | 2019-08-09 | 广东神马搜索科技有限公司 | Data processing method and system, electronic equipment and computer readable storage medium |
US10990597B2 (en) * | 2018-05-03 | 2021-04-27 | Sap Se | Generic analytical application integration based on an analytic integration remote services plug-in |
US11226955B2 (en) | 2018-06-28 | 2022-01-18 | Oracle International Corporation | Techniques for enabling and integrating in-memory semi-structured data and text document searches with in-memory columnar query processing |
US11868346B2 (en) | 2020-12-30 | 2024-01-09 | Oracle International Corporation | Automated linear clustering recommendation for database zone maps |
US11907263B1 (en) | 2022-10-11 | 2024-02-20 | Oracle International Corporation | Automated interleaved clustering recommendation for database zone maps |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5598559A (en) * | 1994-07-01 | 1997-01-28 | Hewlett-Packard Company | Method and apparatus for optimizing queries having group-by operators |
- 1997-02-10: US application US08/796,779, patent US5899986A (not active, Expired - Lifetime)
- 1998-09-30: US application US09/164,400, patent US6029163A (not active, Expired - Lifetime)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5598559A (en) * | 1994-07-01 | 1997-01-28 | Hewlett-Packard Company | Method and apparatus for optimizing queries having group-by operators |
Non-Patent Citations (2)
Title |
---|
"Performing Group-By before Join," Yan et al., Proceedings of the 10th International Conference on Data Engineering, Houston, Texas, USA, pp. 89-100, Feb. 14-18, 1994. |
Performing Group-By before Join, Yan et al., Proceedings of the 10th International Conference on Data Engineering, Houston, Texas, USA, pp. 89-100, Feb. 14-18, 1994. * |
Cited By (149)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6353826B1 (en) * | 1997-10-23 | 2002-03-05 | Sybase, Inc. | Database system with methodology providing improved cost estimates for query strategies |
US6173278B1 (en) * | 1997-11-03 | 2001-01-09 | Newframe Corporation Ltd. | Method of and special purpose computer for utilizing an index of a relational data base table |
US20050065911A1 (en) * | 1998-12-16 | 2005-03-24 | Microsoft Corporation | Automatic database statistics creation |
US7624094B2 (en) * | 1998-12-16 | 2009-11-24 | Microsoft Corporation | Automatic database statistics creation |
US6397204B1 (en) * | 1999-06-25 | 2002-05-28 | International Business Machines Corporation | Method, system, and program for determining the join ordering of tables in a join query |
US6446063B1 (en) * | 1999-06-25 | 2002-09-03 | International Business Machines Corporation | Method, system, and program for performing a join operation on a multi column table and satellite tables |
US6374235B1 (en) * | 1999-06-25 | 2002-04-16 | International Business Machines Corporation | Method, system, and program for a join operation on a multi-column table and satellite tables including duplicate values |
US6598038B1 (en) * | 1999-09-17 | 2003-07-22 | Oracle International Corporation | Workload reduction mechanism for index tuning |
US6438538B1 (en) * | 1999-10-07 | 2002-08-20 | International Business Machines Corporation | Data replication in data warehousing scenarios |
US6496877B1 (en) * | 2000-01-28 | 2002-12-17 | International Business Machines Corporation | Method and apparatus for scheduling data accesses for random access storage devices with shortest access chain scheduling algorithm |
US6266658B1 (en) * | 2000-04-20 | 2001-07-24 | Microsoft Corporation | Index tuner for given workload |
US20050240556A1 (en) * | 2000-06-30 | 2005-10-27 | Microsoft Corporation | Partial pre-aggregation in relational database queries |
US7555473B2 (en) * | 2000-06-30 | 2009-06-30 | Microsoft Corporation | Partial pre-aggregation in relational database queries |
US7593926B2 (en) * | 2000-06-30 | 2009-09-22 | Microsoft Corporation | Partial pre-aggregation in relational database queries |
US20050240577A1 (en) * | 2000-06-30 | 2005-10-27 | Microsoft Corporation | Partial pre-aggregation in relational database queries |
US6732110B2 (en) * | 2000-08-28 | 2004-05-04 | International Business Machines Corporation | Estimation of column cardinality in a partitioned relational database |
US20020059213A1 (en) * | 2000-10-25 | 2002-05-16 | Kenji Soga | Minimum cost path search apparatus and minimum cost path search method used by the apparatus |
US6609126B1 (en) | 2000-11-15 | 2003-08-19 | Appfluent Technology, Inc. | System and method for routing database requests to a database and a cache |
US7287020B2 (en) * | 2001-01-12 | 2007-10-23 | Microsoft Corporation | Sampling for queries |
US20020123979A1 (en) * | 2001-01-12 | 2002-09-05 | Microsoft Corporation | Sampling for queries |
US20020107835A1 (en) * | 2001-02-08 | 2002-08-08 | Coram Michael T. | System and method for adaptive result set caching |
US20050216304A1 (en) * | 2001-06-08 | 2005-09-29 | W.W. Grainger, Inc. | System and method for electronically creating a customized catalog |
US7266516B2 (en) | 2001-06-08 | 2007-09-04 | W. W. Grainger Inc. | System and method for creating a customized electronic catalog |
US20030083959A1 (en) * | 2001-06-08 | 2003-05-01 | Jinshan Song | System and method for creating a customized electronic catalog |
US9230256B2 (en) | 2001-06-08 | 2016-01-05 | W. W. Grainger, Inc. | System and method for electronically creating a customized catalog |
US7254582B2 (en) | 2001-06-08 | 2007-08-07 | W.W. Grainger, Inc. | System and method for creating a searchable electronic catalog |
US20030191769A1 (en) * | 2001-09-28 | 2003-10-09 | International Business Machines Corporation | Method, system, and program for generating a program capable of invoking a flow of operations |
US8914807B2 (en) | 2001-09-28 | 2014-12-16 | International Business Machines Corporation | Method, system, and program for generating a program capable of invoking a flow of operations |
US8924408B2 (en) | 2001-09-28 | 2014-12-30 | International Business Machines Corporation | Automatic generation of database invocation mechanism for external web services |
US8166006B2 (en) * | 2001-09-28 | 2012-04-24 | International Business Machines Corporation | Invocation of web services from a database |
US20030093436A1 (en) * | 2001-09-28 | 2003-05-15 | International Business Machines Corporation | Invocation of web services from a database |
US20040199636A1 (en) * | 2001-09-28 | 2004-10-07 | International Business Machines Corporation | Automatic generation of database invocation mechanism for external web services |
US20030084025A1 (en) * | 2001-10-18 | 2003-05-01 | Zuzarte Calisto Paul | Method of cardinality estimation using statistical soft constraints |
US7171408B2 (en) | 2001-10-18 | 2007-01-30 | International Business Machines Corporation | Method of cardinality estimation using statistical soft constraints |
US20080052269A1 (en) * | 2001-12-13 | 2008-02-28 | International Business Machines Corporation | Estimation and use of access plan statistics |
US8140568B2 (en) | 2001-12-13 | 2012-03-20 | International Business Machines Corporation | Estimation and use of access plan statistics |
US20030115183A1 (en) * | 2001-12-13 | 2003-06-19 | International Business Machines Corporation | Estimation and use of access plan statistics |
US7010516B2 (en) * | 2001-12-19 | 2006-03-07 | Hewlett-Packard Development Company, L.P. | Method and system for rowcount estimation with multi-column statistics and histograms |
US20030135485A1 (en) * | 2001-12-19 | 2003-07-17 | Leslie Harry Anthony | Method and system for rowcount estimation with multi-column statistics and histograms |
US6985904B1 (en) | 2002-02-28 | 2006-01-10 | Oracle International Corporation | Systems and methods for sharing of execution plans for similar database statements |
US7092931B1 (en) | 2002-05-10 | 2006-08-15 | Oracle Corporation | Methods and systems for database statement execution plan optimization |
US7155459B2 (en) * | 2002-06-28 | 2006-12-26 | Microsoft Corporation | Time-bound database tuning |
US20040003004A1 (en) * | 2002-06-28 | 2004-01-01 | Microsoft Corporation | Time-bound database tuning |
US20070083515A1 (en) * | 2002-12-20 | 2007-04-12 | International Business Machines Corporation | System and method for multicolumn sorting in a single column |
US7552108B2 (en) * | 2002-12-20 | 2009-06-23 | International Business Machines Corporation | System and method for multicolumn sorting in a single column |
US20040128299A1 (en) * | 2002-12-26 | 2004-07-01 | Michael Skopec | Low-latency method to replace SQL insert for bulk data transfer to relational database |
US7305410B2 (en) | 2002-12-26 | 2007-12-04 | Rocket Software, Inc. | Low-latency method to replace SQL insert for bulk data transfer to relational database |
US20040210563A1 (en) * | 2003-04-21 | 2004-10-21 | Oracle International Corporation | Method and system of collecting execution statistics of query statements |
US7447676B2 (en) | 2003-04-21 | 2008-11-04 | Oracle International Corporation | Method and system of collecting execution statistics of query statements |
US7146363B2 (en) * | 2003-05-20 | 2006-12-05 | Microsoft Corporation | System and method for cardinality estimation based on query execution feedback |
US20040236722A1 (en) * | 2003-05-20 | 2004-11-25 | Microsoft Corporation | System and method for cardinality estimation based on query execution feedback |
US20050021287A1 (en) * | 2003-06-25 | 2005-01-27 | International Business Machines Corporation | Computing frequent value statistics in a partitioned relational database |
US7542975B2 (en) * | 2003-06-25 | 2009-06-02 | International Business Machines Corporation | Computing frequent value statistics in a partitioned relational database |
US20050004907A1 (en) * | 2003-06-27 | 2005-01-06 | Microsoft Corporation | Method and apparatus for using conditional selectivity as foundation for exploiting statistics on query expressions |
US7249120B2 (en) * | 2003-06-27 | 2007-07-24 | Microsoft Corporation | Method and apparatus for selecting candidate statistics to estimate the selectivity value of the conditional selectivity expression in optimize queries based on a set of predicates that each reference a set of relational database tables |
US20050086242A1 (en) * | 2003-09-04 | 2005-04-21 | Oracle International Corporation | Automatic workload repository battery of performance statistics |
US20050086195A1 (en) * | 2003-09-04 | 2005-04-21 | Leng Leng Tan | Self-managing database architecture |
US20050086246A1 (en) * | 2003-09-04 | 2005-04-21 | Oracle International Corporation | Database performance baselines |
US7664798B2 (en) | 2003-09-04 | 2010-02-16 | Oracle International Corporation | Database performance baselines |
US7526508B2 (en) | 2003-09-04 | 2009-04-28 | Oracle International Corporation | Self-managing database architecture |
US7603340B2 (en) * | 2003-09-04 | 2009-10-13 | Oracle International Corporation | Automatic workload repository battery of performance statistics |
US7984024B2 (en) | 2004-01-07 | 2011-07-19 | International Business Machines Corporation | Statistics management |
US20090030875A1 (en) * | 2004-01-07 | 2009-01-29 | International Business Machines Corporation | Statistics management |
US20050216490A1 (en) * | 2004-03-26 | 2005-09-29 | Oracle International Corporation | Automatic database diagnostic usage models |
US8024301B2 (en) * | 2004-03-26 | 2011-09-20 | Oracle International Corporation | Automatic database diagnostic usage models |
US20090299989A1 (en) * | 2004-07-02 | 2009-12-03 | Oracle International Corporation | Determining predicate selectivity in query costing |
US9244979B2 (en) | 2004-07-02 | 2016-01-26 | Oracle International Corporation | Determining predicate selectivity in query costing |
US20060100992A1 (en) * | 2004-10-21 | 2006-05-11 | International Business Machines Corporation | Apparatus and method for data ordering for derived columns in a database system |
US20080215537A1 (en) * | 2004-10-21 | 2008-09-04 | International Business Machines Corporation | Data ordering for derived columns in a database system |
US20080215538A1 (en) * | 2004-10-21 | 2008-09-04 | International Business Machines Corporation | Data ordering for derived columns in a database system |
US20080215539A1 (en) * | 2004-10-21 | 2008-09-04 | International Business Machines Corporation | Data ordering for derived columns in a database system |
US20080133454A1 (en) * | 2004-10-29 | 2008-06-05 | International Business Machines Corporation | System and method for updating database statistics according to query feedback |
US7831592B2 (en) | 2004-10-29 | 2010-11-09 | International Business Machines Corporation | System and method for updating database statistics according to query feedback |
US7739293B2 (en) | 2004-11-22 | 2010-06-15 | International Business Machines Corporation | Method, system, and program for collecting statistics of data stored in a database |
US20060112093A1 (en) * | 2004-11-22 | 2006-05-25 | International Business Machines Corporation | Method, system, and program for collecting statistics of data stored in a database |
US20060116984A1 (en) * | 2004-11-29 | 2006-06-01 | Thomas Zurek | Materialized samples for a business warehouse query |
US7610272B2 (en) * | 2004-11-29 | 2009-10-27 | Sap Ag | Materialized samples for a business warehouse query |
US8032514B2 (en) * | 2005-01-14 | 2011-10-04 | International Business Machines Corporation | SQL distinct optimization in a computer database system |
US20060161515A1 (en) * | 2005-01-14 | 2006-07-20 | International Business Machines Corporation | Apparatus and method for SQL distinct optimization in a computer database system |
US7447681B2 (en) | 2005-02-17 | 2008-11-04 | International Business Machines Corporation | Method, system and program for selection of database characteristics |
US7610264B2 (en) | 2005-02-28 | 2009-10-27 | International Business Machines Corporation | Method and system for providing a learning optimizer for federated database systems |
US20060195416A1 (en) * | 2005-02-28 | 2006-08-31 | Ewen Stephan E | Method and system for providing a learning optimizer for federated database systems |
US7941332B2 (en) | 2006-01-30 | 2011-05-10 | International Business Machines Corporation | Apparatus, system, and method for modeling, projecting, and optimizing an enterprise application system |
US7725461B2 (en) * | 2006-03-14 | 2010-05-25 | International Business Machines Corporation | Management of statistical views in a database system |
EP2005332A4 (en) * | 2006-03-14 | 2011-03-09 | Ibm | Management of statistical views in a database system |
JP2009529735A (en) * | 2006-03-14 | 2009-08-20 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Managing statistical views in a database system |
CN101405727B (en) * | 2006-03-14 | 2011-11-30 | 国际商业机器公司 | Management of statistical views in a database system |
EP2005332A1 (en) * | 2006-03-14 | 2008-12-24 | International Business Machines Corporation | Management of statistical views in a database system |
US20070220058A1 (en) * | 2006-03-14 | 2007-09-20 | Mokhtar Kandil | Management of statistical views in a database system |
US8051058B2 (en) * | 2006-04-03 | 2011-11-01 | International Business Machines Corporation | System for estimating cardinality in a database system |
US20090012977A1 (en) * | 2006-04-03 | 2009-01-08 | International Business Machines Corporation | System for estimating cardinality in a database system |
US20070271218A1 (en) * | 2006-05-16 | 2007-11-22 | International Business Machines Corporation | Statistics collection using path-value pairs for relational databases |
US7472108B2 (en) * | 2006-05-16 | 2008-12-30 | International Business Machines Corporation | Statistics collection using path-value pairs for relational databases |
US20100011030A1 (en) * | 2006-05-16 | 2010-01-14 | International Business Machines Corp. | Statistics collection using path-identifiers for relational databases |
US8229924B2 (en) * | 2006-05-16 | 2012-07-24 | International Business Machines Corporation | Statistics collection using path-identifiers for relational databases |
US9117005B2 (en) | 2006-05-16 | 2015-08-25 | International Business Machines Corporation | Statistics collection using path-value pairs for relational databases |
US10007686B2 (en) * | 2006-08-02 | 2018-06-26 | Entit Software Llc | Automatic vertical-database design |
US8086598B1 (en) | 2006-08-02 | 2011-12-27 | Hewlett-Packard Development Company, L.P. | Query optimizer with schema conversion |
US8671091B2 (en) | 2006-08-02 | 2014-03-11 | Hewlett-Packard Development Company, L.P. | Optimizing snowflake schema queries |
US20080033914A1 (en) * | 2006-08-02 | 2008-02-07 | Mitch Cherniack | Query Optimizer |
US20080040348A1 (en) * | 2006-08-02 | 2008-02-14 | Shilpa Lawande | Automatic Vertical-Database Design |
US20080052266A1 (en) * | 2006-08-25 | 2008-02-28 | Microsoft Corporation | Optimizing parameterized queries in a relational database management system |
US8229918B2 (en) * | 2006-08-25 | 2012-07-24 | Teradata Us, Inc. | Hardware accelerated reconfigurable processor for accelerating database operations and queries |
US20110218987A1 (en) * | 2006-08-25 | 2011-09-08 | Teradata Us, Inc. | Hardware accelerated reconfigurable processor for accelerating database operations and queries |
US8032522B2 (en) | 2006-08-25 | 2011-10-04 | Microsoft Corporation | Optimizing parameterized queries in a relational database management system |
US20090106210A1 (en) * | 2006-09-18 | 2009-04-23 | Infobright, Inc. | Methods and systems for database organization |
US8266147B2 (en) | 2006-09-18 | 2012-09-11 | Infobright, Inc. | Methods and systems for database organization |
US8838593B2 (en) * | 2006-09-18 | 2014-09-16 | Infobright Inc. | Method and system for storing, organizing and processing data in a relational database |
US8700579B2 (en) | 2006-09-18 | 2014-04-15 | Infobright Inc. | Method and system for data compression in a relational database |
US20080071818A1 (en) * | 2006-09-18 | 2008-03-20 | Infobright Inc. | Method and system for data compression in a relational database |
US20080071748A1 (en) * | 2006-09-18 | 2008-03-20 | Infobright Inc. | Method and system for storing, organizing and processing data in a relational database |
US7877374B2 (en) | 2006-12-01 | 2011-01-25 | Microsoft Corporation | Statistics adjustment to improve query execution plans |
US20080133458A1 (en) * | 2006-12-01 | 2008-06-05 | Microsoft Corporation | Statistics adjustment to improve query execution plans |
US20090077054A1 (en) * | 2007-09-13 | 2009-03-19 | Brian Robert Muras | Cardinality Statistic for Optimizing Database Queries with Aggregation Functions |
US9710353B2 (en) | 2007-10-19 | 2017-07-18 | Oracle International Corporation | Creating composite baselines based on a plurality of different baselines |
US8990811B2 (en) | 2007-10-19 | 2015-03-24 | Oracle International Corporation | Future-based performance baselines |
US20090106756A1 (en) * | 2007-10-19 | 2009-04-23 | Oracle International Corporation | Automatic Workload Repository Performance Baselines |
US8078652B2 (en) * | 2007-12-06 | 2011-12-13 | Oracle International Corporation | Virtual columns |
US20090150366A1 (en) * | 2007-12-06 | 2009-06-11 | Oracle International Corporation | Expression replacement in virtual columns |
US8046352B2 (en) * | 2007-12-06 | 2011-10-25 | Oracle International Corporation | Expression replacement in virtual columns |
US20090150413A1 (en) * | 2007-12-06 | 2009-06-11 | Oracle International Corporation | Virtual columns |
US8620888B2 (en) | 2007-12-06 | 2013-12-31 | Oracle International Corporation | Partitioning in virtual columns |
US20100030728A1 (en) * | 2008-07-29 | 2010-02-04 | Oracle International Corporation | Computing selectivities for group of columns and expressions |
US20100114976A1 (en) * | 2008-10-21 | 2010-05-06 | Castellanos Maria G | Method For Database Design |
US20100223269A1 (en) * | 2009-02-27 | 2010-09-02 | International Business Machines Corporation | System and method for an efficient query sort of a data stream with duplicate key values |
US9235622B2 (en) * | 2009-02-27 | 2016-01-12 | International Business Machines Corporation | System and method for an efficient query sort of a data stream with duplicate key values |
US20110016157A1 (en) * | 2009-07-14 | 2011-01-20 | Vertica Systems, Inc. | Database Storage Architecture |
US8700674B2 (en) | 2009-07-14 | 2014-04-15 | Hewlett-Packard Development Company, L.P. | Database storage architecture |
US20110022581A1 (en) * | 2009-07-27 | 2011-01-27 | Rama Krishna Korlapati | Derived statistics for query optimization |
US20110213766A1 (en) * | 2010-02-22 | 2011-09-01 | Vertica Systems, Inc. | Database designer |
US8290931B2 (en) | 2010-02-22 | 2012-10-16 | Hewlett-Packard Development Company, L.P. | Database designer |
US20110258179A1 (en) * | 2010-04-19 | 2011-10-20 | Salesforce.Com | Methods and systems for optimizing queries in a multi-tenant store |
US8447754B2 (en) * | 2010-04-19 | 2013-05-21 | Salesforce.Com, Inc. | Methods and systems for optimizing queries in a multi-tenant store |
US9507822B2 (en) | 2010-04-19 | 2016-11-29 | Salesforce.Com, Inc. | Methods and systems for optimizing queries in a database system |
US10162851B2 (en) | 2010-04-19 | 2018-12-25 | Salesforce.Com, Inc. | Methods and systems for performing cross store joins in a multi-tenant store |
US10649995B2 (en) | 2010-04-19 | 2020-05-12 | Salesforce.Com, Inc. | Methods and systems for optimizing queries in a multi-tenant store |
US10417611B2 (en) | 2010-05-18 | 2019-09-17 | Salesforce.Com, Inc. | Methods and systems for providing multiple column custom indexes in a multi-tenant database environment |
US8943100B2 (en) | 2010-06-14 | 2015-01-27 | Infobright Inc. | System and method for storing data in a relational database |
US8521748B2 (en) | 2010-06-14 | 2013-08-27 | Infobright Inc. | System and method for managing metadata in a relational database |
US8417727B2 (en) | 2010-06-14 | 2013-04-09 | Infobright Inc. | System and method for storing data in a relational database |
US20130018890A1 (en) * | 2011-07-13 | 2013-01-17 | Salesforce.Com, Inc. | Creating a custom index in a multi-tenant database environment |
US10108648B2 (en) * | 2011-07-13 | 2018-10-23 | Salesforce.Com, Inc. | Creating a custom index in a multi-tenant database environment |
US20130166557A1 (en) * | 2011-12-23 | 2013-06-27 | Lars Fricke | Unique value calculation in partitioned tables |
US9697273B2 (en) | 2011-12-23 | 2017-07-04 | Sap Se | Unique value calculation in partitioned table |
US8880510B2 (en) * | 2011-12-23 | 2014-11-04 | Sap Se | Unique value calculation in partitioned tables |
US9665572B2 (en) * | 2012-09-12 | 2017-05-30 | Oracle International Corporation | Optimal data representation and auxiliary structures for in-memory database query processing |
US20140074819A1 (en) * | 2012-09-12 | 2014-03-13 | Oracle International Corporation | Optimal Data Representation and Auxiliary Structures For In-Memory Database Query Processing |
US11449481B2 (en) * | 2017-12-08 | 2022-09-20 | Alibaba Group Holding Limited | Data storage and query method and device |
US11947568B1 (en) * | 2021-09-30 | 2024-04-02 | Amazon Technologies, Inc. | Working set ratio estimations of data items in a sliding time window for dynamically allocating computing resources for the data items |
Also Published As
Publication number | Publication date |
---|---|
US5899986A (en) | 1999-05-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6029163A (en) | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer | |
US5761653A (en) | Method for estimating cardinalities for query processing in a relational database management system | |
US7240044B2 (en) | Query optimization by sub-plan memoization | |
US6185557B1 (en) | Merge join process | |
US7213012B2 (en) | Optimizer dynamic sampling | |
US6801903B2 (en) | Collecting statistics in a database system | |
US5875445A (en) | Performance-related estimation using pseudo-ranked trees | |
US7509311B2 (en) | Use of statistics on views in query optimization | |
US7756889B2 (en) | Partitioning of nested tables | |
US8078652B2 (en) | Virtual columns | |
US8122046B2 (en) | Method and apparatus for query rewrite with auxiliary attributes in query processing operations | |
US6138111A (en) | Cardinality-based join ordering | |
EP3014488B1 (en) | Incremental maintenance of range-partitioned statistics for query optimization | |
US7840555B2 (en) | System and a method for identifying a selection of index candidates for a database | |
US7472108B2 (en) | Statistics collection using path-value pairs for relational databases | |
US7542975B2 (en) | Computing frequent value statistics in a partitioned relational database | |
US20030167275A1 (en) | Computation of frequent data values | |
US6778976B2 (en) | Selectivity estimation for processing SQL queries containing having clauses | |
US7617189B2 (en) | Parallel query processing techniques for minus and intersect operators | |
US9117005B2 (en) | Statistics collection using path-value pairs for relational databases | |
US6925463B2 (en) | Method and system for query processing by combining indexes of multilevel granularity or composition | |
US7440936B2 (en) | Method for determining an access mode to a dataset | |
US8229924B2 (en) | Statistics collection using path-identifiers for relational databases | |
Kacharulal | Green Query Optimization: Taming Query Optimization Overheads through Plan Recycling | |
Mittra | Optimization of the External Level of a Database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
 | FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
 | FPAY | Fee payment | Year of fee payment: 4 |
 | AS | Assignment | Owner name: ORACLE INTERNATIONAL CORPORATION, CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: ORACLE CORPORATION; REEL/FRAME: 014852/0946; Effective date: 20031113 |
 | FPAY | Fee payment | Year of fee payment: 8 |
 | FPAY | Fee payment | Year of fee payment: 12 |
 | CC | Certificate of correction | |