US6029163A - Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer

Publication number
US6029163A
Authority
US
United States
Prior art keywords
column
columns
rows
query
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US09/164,400
Inventor
Mohamed Ziauddin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle Corp filed Critical Oracle Corp
Priority to US09/164,400 priority Critical patent/US6029163A/en
Application granted granted Critical
Publication of US6029163A publication Critical patent/US6029163A/en
Assigned to ORACLE INTERNATIONAL CORPORATION reassignment ORACLE INTERNATIONAL CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ORACLE CORPORATION
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99932Access augmentation or optimizing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching

Definitions

  • the present invention relates to the field of computer systems used for storage and retrieval of data. More specifically, the present invention relates to the field of statistics measurement systems for a relational database management system.
  • Computer implemented relational database management systems (RDBMS) commonly employ data tables that contain columns and rows containing data (e.g., data values).
  • a typical RDBMS in addition to maintaining the data of a particular database, also maintains a set of statistics regarding the data. These statistics are useful in efficiently accessing, manipulating, and presenting the stored data.
  • When an RDBMS system receives a query, its optimizer analyzes the structure of the query, analyzes the various clauses (e.g., selection and join predicates) specified in the query, and examines existing data access paths (e.g., indexes) to formulate a strategy (e.g., method) of performing various relational operations (e.g., aggregation, sort, search, join, etc.) to produce the result of the query.
  • the optimizer generally explores various strategies to find the best strategy.
  • the best strategy is the one with the lowest estimated RDBMS cost; such a strategy generally takes the least amount of RDBMS resources, the least amount of time, or both to produce the result of the query.
  • RDBMS cost is generally evaluated based on input/output (I/O) operations that are required to perform necessary relational operations to produce the result of the query. Therefore, the optimizer selection criterion for the best strategy for a query is a minimum execution cost.
  • the minimum execution cost is calculated using available statistics such as table and index cardinalities, workload statistics (e.g. statistics on columns and column groups), storage statistics and assumptions of the manner in which any relational operation typically changes the incoming data size and data distribution.
  • Prior RDBMS systems collect workload statistics based on single columns of data. As such, these systems make a zero-correlation assumption, namely that the values of two or more columns of data are not related in any way. However, in many cases this assumption is incorrect. For example, in a data table having a first column of employee age and a second column of employee job position, older employees may in general hold higher job positions. As such, the row values in the first column correlate with the values in the second column.
  • the prior art systems compute a result cardinality (result_cardinality) based on the inverse of the distinct cardinality of the separate age column (DC1) multiplied by the inverse of the distinct cardinality of the separate job position column (DC2), and this result is multiplied by the number of employees (#EE).
  • the distinct cardinality of a column represents the number of distinct values within that column. See the computation of the result cardinality below:
    result_cardinality = (1/DC1) × (1/DC2) × #EE
  • the estimated result cardinality represents the average number of rows likely to be produced for a query asking for all employees with a certain age and a certain job position.
  • the relationship used above by the prior art system assumes that there is no correlation between the first and second columns. This assumption leads to a computation of a result cardinality value that is much smaller than what is returned in reality due to data correlation.
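  • The effect of the zero-correlation assumption can be illustrated with a small sketch. The employee rows and the function names below are invented for illustration and are not taken from the patent; the single-column estimate follows the relationship described above, and the column-group estimate simply replaces DC1 × DC2 with the distinct cardinality of the combined columns.

```python
# Hypothetical employee table of (age, job_position) rows. The values are made
# up so that age and job position are perfectly correlated.
employees = [(25, "junior")] * 4 + [(40, "senior")] * 4 + [(55, "director")] * 4


def independent_estimate(rows):
    """Single-column estimate: (1/DC1) * (1/DC2) * number of rows."""
    dc1 = len({age for age, _ in rows})  # distinct cardinality of the age column
    dc2 = len({job for _, job in rows})  # distinct cardinality of the job column
    return len(rows) / (dc1 * dc2)


def column_group_estimate(rows):
    """Estimate using the distinct cardinality of the (age, job) column group."""
    dc_group = len(set(rows))
    return len(rows) / dc_group


single_column = independent_estimate(employees)
multi_column = column_group_estimate(employees)
# Rows actually matching one particular age and job position.
actual = sum(1 for row in employees if row == (40, "senior"))
print(single_column, multi_column, actual)
```

  • With correlated data such as this, the single-column estimate falls well below the number of rows actually returned, while the column-group estimate matches it.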
  • When the above result cardinality value is used by an optimizer, its estimated costs for performing certain relational operations on the subject data become inaccurate. Specifically, this inaccuracy can cause the optimizer to select a strategy that performs worse in producing the result. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving more than one data column.
  • null data values within rows of a data table can lead to inaccuracies in the workload statistics, which will lead to inaccurate cost estimation analysis by the RDBMS optimizer.
  • Most relational operations ignore incoming data rows that contain null data values in key columns. Therefore, it is important to also collect statistics about the number of rows with null data in certain columns or column groups.
  • Prior art RDBMS systems that compute certain database statistics (e.g., distinct cardinality of a column) without regard to rows with null data produce inaccurate statistics. The inaccurate statistics can cause the RDBMS optimizer to select a sub-par strategy for a given query. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving columns of rows with null data.
  • the present invention provides such a system.
  • a manual process is performed by users (e.g., system administrators) to determine which data columns of a database on which to collect statistics.
  • the user informs integrated statistics gathering procedures of the separate data columns on which workload statistics are to be gathered.
  • the system administrators do not realize the specific mechanisms by which the RDBMS optimizer operates. Relying on users to indicate the data columns on which to collect statistics can be unreliable and inefficient. Namely, not knowing what is needed by the optimizer, the system administrator can cause the statistics gathering procedures to inefficiently over-collect statistics or fail to collect statistics on required columns. What is needed is an efficient method for identifying columns or column groups on which statistics collection is required. What is needed is such a method that does not rely on user origination of the above information.
  • the present invention provides a method and system for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for workload queries involving more than one data column.
  • the present invention also provides a method and system for collecting statistics within an RDBMS system that provides for accurate estimated cost analysis for queries involving columns of rows with null data values.
  • the present invention further provides a method and system within an RDBMS optimizer for identifying the columns or column groups on which statistics collection is required, without relying on manual user identification of the columns or column groups.
  • Methods are described for collecting query workload based statistics within a relational database management system (RDBMS) and for automatically identifying columns or groups of columns for which statistic collection is to be performed.
  • the novel system collects workload based statistics on multi-columns rather than merely on single columns.
  • This multi-column statistic gathering technique of the present invention provides more accurate results when the row values within the multi-columns are correlated to each other.
  • the multi-column statistics that result from the present invention lead to better estimated cost analysis performed by an RDBMS optimizer which leads to increased likelihood of finding an optimal strategy for each query in a workload.
  • the present invention generates column group statistics which represent distinct cardinality measured over rows of multiple columns of data, rather than individual statistics on single columns ("a multi-column statistic").
  • the novel system also collects separate statistics regarding the number of null data values within the rows of a column. Separate null data statistics improve the cost estimation analysis used by the RDBMS optimizer because the cardinality of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
  • the present invention utilizes the separate null data statistics in determining a result distinct cardinality of a column, or column group, of a data table.
  • the null data statistics are useful in accurate estimation of cardinality of the result of a relational operation (e.g. a data search or a table join) and in accurate estimation of average cost of performing a relational operation (e.g. a data sort, a data search, a table join).
  • the novel system also includes an RDBMS optimizer that automatically identifies columns and groups of columns on which workload based statistics should be generated.
  • the clauses within a query such as equi-join and equi-selection predicates and projection columns are analyzed by the RDBMS optimizer to identify columns and column groups of interest.
  • the identified columns are then registered within a system catalog of the RDBMS.
  • the statistics collection utility of the present invention reads the registered columns and column groups from the system catalog to identify the columns and column groups on which to generate the workload statistics.
  • the statistics so generated are then used by the optimizer to perform subsequent estimations of the cost of performing certain relational operations in order to select a strategy for a query having the least cost.
  • FIG. 1 illustrates a computer system used in accordance with the embodiments of the present invention.
  • FIG. 2 is an illustration of a data table used by embodiments of the present invention including columns and rows of data.
  • FIG. 3 is a flow diagram of steps performed by the optimizer of the RDBMS system of the present invention when selecting a lowest cost strategy for a query.
  • FIG. 4A and FIG. 4B illustrate a flow diagram of steps of a first embodiment of the present invention for automatic identification of columns and column groups by the RDBMS optimizer on which to collect the workload statistics.
  • FIG. 5 is a flow diagram of steps performed by statistics generation procedures of the present invention to generate workload statistics.
  • FIG. 6 illustrates a flow diagram of steps of a second embodiment of the present invention for generating non-null workload statistics (e.g., the column duplicity factor), which is a multi-column statistic, on identified column groups.
  • FIG. 7 illustrates an exemplary multi-column data table that can be used by the first embodiment of the present invention.
  • FIG. 8 illustrates a flow diagram of steps of a third embodiment of the present invention for generating null workload statistics (e.g., the column null factors) on identified column groups.
  • FIG. 1 illustrates a computer system 112.
  • processes (e.g., processes 310, 350, 410, 610, 710, 910, 1010) and steps are discussed that are realized, in one implementation, as a series of instructions (e.g., software program) that reside within computer readable memory units of system 112 and are executed by processors of system 112. When executed, the instructions cause the computer system 112 to perform specific actions and exhibit specific behavior, which is described in detail to follow.
  • computer system 112 used by the present invention comprises an address/data bus 100 for communicating information, one or more central processors 101 coupled with the bus 100 for processing information and instructions, a computer readable volatile memory unit 102 (e.g., random access memory, static RAM, dynamic RAM, etc.) coupled with the bus 100 for storing information and instructions for the central processor(s) 101, and a computer readable non-volatile memory unit (e.g., read only memory, programmable ROM, flash memory, EPROM, EEPROM, etc.) coupled with the bus 100 for storing static information and instructions for the processor(s) 101.
  • System 112 also includes a mass storage computer readable data storage device 104 (hard drive or floppy) such as a magnetic or optical disk and disk drive coupled with the bus 100 for storing information and instructions.
  • system 112 can include a display device 105 coupled to the bus 100 for displaying information to the computer user, an optional alphanumeric input device 106 including alphanumeric and function keys coupled to the bus 100 for communicating information and command selections to the central processor(s) 101, an optional cursor control device 107 coupled to the bus for communicating user input information and command selections to the central processor(s) 101, and an optional signal generating device 108 coupled to the bus 100 for communicating command selections to the processor(s) 101.
  • system 112 is a DEC Alpha computer system by Digital Equipment Corporation, but could equally be of a number of various well known and commercially available platforms.
  • FIG. 2 illustrates an exemplary data table 200 as used by embodiments of the present invention. It is appreciated that data table 200, as well as columns and rows of data which comprise the table 200, are stored in computer readable memory units of system 112. Exemplary table 200 includes x number of separate data columns (or “columns”) 210, 220, 230, and 240. Within each column, table 200 includes n rows of data 250-256. Data (“data values”) are stored within the rows of each column. A data value can be either non-null (e.g., a known value) or null (e.g., an unknown value or an empty set value). A column group is a collection of multiple columns, e.g., columns 210 and 220.
  • references to columns, rows, column groups, and data tables refer to analogous data structures as shown in FIG. 2 and as stored in computer readable memory units of system 112.
  • Database performance is directly affected by the ability of a database to efficiently access a row or rows stored on disk (e.g., 104) through input/output (e.g., I/O) operations.
  • the optimizer of the RDBMS system finds a lowest cost strategy (generally the most I/O efficient) of retrieving subject data and producing the result of a query. Although the optimizer can consider many strategies, each with a different cost, to find the one with the lowest cost, all strategies yield the same result.
  • cost is the estimated number of disk I/Os required to execute a strategy and produce the result of a query. This is the metric used by the optimizer to find an optimal strategy, e.g., the one with the minimal cost.
  • Main factors affecting the cost of a strategy include the following: 1) query structure, 2) query predicates (e.g. equi-selections, equi-joins, etc.), 3) result ordering or grouping, 4) access paths available (e.g. hash index, B-tree index), and 5) join methods available (e.g. sort-merge join, hash join). Any database statistics, if available, can help the RDBMS optimizer in significantly improving the accuracy of cost estimation analysis to find an optimal strategy.
  • database statistics include cardinality statistics, workload statistics and storage statistics.
  • Cardinality statistics include the table cardinality and index cardinality.
  • the table cardinality represents the total number of rows in a data table.
  • the index cardinality represents the total number of distinct keys (index values) within an index. In other words, the index cardinality is a distinct cardinality of index columns.
  • the workload statistics include column duplicity factor and column null factor.
  • the column duplicity factor is a ratio of the number of table rows with non-null data in a column group to the distinct cardinality of a column group. It represents average number of duplicates per distinct combination of values in a column group.
  • the column null factor is a ratio of the number of table rows with null data in any column of a column group to the table cardinality. It represents a fraction of a data table with null data in rows of a column group.
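  • The two workload statistics defined above can be sketched in a few lines. This is a minimal illustration, not the statistics generation procedure of the later embodiments: the row layout (a list of dictionaries with `None` standing for null), the sample data, and the function names are assumptions, and "non-null data in a column group" is read here as non-null in every column of the group.

```python
def column_duplicity_factor(rows, group):
    """Rows with non-null data in the column group / distinct cardinality of the group."""
    non_null = [tuple(r[c] for c in group) for r in rows
                if all(r[c] is not None for c in group)]
    if not non_null:
        return 0.0
    return len(non_null) / len(set(non_null))


def column_null_factor(rows, group):
    """Rows with null data in any column of the group / table cardinality."""
    if not rows:
        return 0.0
    with_null = sum(1 for r in rows if any(r[c] is None for c in group))
    return with_null / len(rows)


# Hypothetical six-row table; None stands for a null data value.
rows = [
    {"age": 40, "job": 5},
    {"age": 40, "job": 5},
    {"age": 30, "job": 3},
    {"age": None, "job": 5},
    {"age": 30, "job": None},
    {"age": 30, "job": 3},
]
duplicity = column_duplicity_factor(rows, ("age", "job"))
null_factor = column_null_factor(rows, ("age", "job"))
print(duplicity, null_factor)
```

  • In this sample, four rows are non-null over the (age, job) group and hold two distinct combinations, so each combination has two duplicates on average; two of the six rows carry a null in the group.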
  • FIG. 3 illustrates an exemplary process flow 310 performed within an RDBMS system of the present invention which utilizes an optimizer.
  • Upon receiving a query, the optimizer: 1) identifies different column groups based on equi-join predicates, equi-selection predicates and projections specified in the query; 2) uses stored workload statistics on identified column groups and other stored database statistics to estimate the costs of performing different relational operations required to produce the result of the query; 3) builds different strategies of executing relational operations in a certain order and estimates the cost of each strategy; and 4) selects the lowest cost strategy.
  • process 310 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
  • the RDBMS system receives a query, e.g., a user request for data.
  • the received query contains certain predicates (e.g., joins, filters, and other conditions).
  • system 112 identifies column groups based on equi-joins, equi-selections and projections that are specified in the query.
  • the optimizer uses stored workload statistics on identified column groups for cost estimation of different relational operations.
  • the optimizer examines the query to determine the different relational operations (e.g., data access paths, sorts, filters, projections, aggregations, joins, etc.) that are needed to complete the query.
  • the optimizer of the RDBMS system accesses stored database statistics (cardinality/workload/storage) regarding the subject data of the query received at step 315 and uses this information to estimate the costs of performing necessary relational operations to produce the result of the query. Any of a number of well known methods can be used at step 320 for performing cost analysis of different relational operations used to satisfy the query.
  • the optimizer can utilize a number of retrieval strategies in order to perform a data search.
  • a well known sequential retrieval strategy can be used where the RDBMS accesses the database pages for a table's logical area sequentially and reads all the rows in the table regardless of the selection expression used in the query.
  • a number of well known index retrieval strategies can also be used, including those described below.
  • in database key (dbkey) retrieval, the RDBMS accesses table data directly through a database key (e.g., logical address) row pointer.
  • the RDBMS accesses a specific index structure (sorted or hashed) and retrieves the index keys, which include the dbkeys, of the rows. The dbkeys are then used to fetch data rows and the data is delivered in index order, if a sorted index is used.
  • the RDBMS accesses only the index data and the selected index contains all the table columns specified in the query and no further row fetches are necessary. The data is delivered in index order, if a sorted index is used.
  • in OR index retrieval, two or more sorted or hashed indexes that are defined on a single data table are used when the predicates on these indexes are combined with a logical OR in the query.
  • the optimizer attempts different strategies to implement the relational operations identified in step 320 and then computes the cost to perform each strategy.
  • a strategy is a scheme of performing different relational operations in a certain order. By reordering the execution sequence of the relational operations, different cost values result for executing the same query. For single data tables, the optimizer finds the lowest cost retrieval strategy by evaluating the cost for each possible retrieval strategy.
  • the estimated cost for a strategy includes the cost of scanning an index and the cost of fetching data rows.
  • the optimizer selects the strategy, of the strategies analyzed in step 322, that yields the lowest estimated cost solution.
  • a number of well known optimizer processes can be used by step 330 for selecting the lowest estimated cost strategy.
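  • The selection of the lowest estimated cost strategy can be sketched as a minimum over candidate strategies. The `Strategy` class, the strategy labels, and all I/O figures below are hypothetical and chosen only for illustration; the sketch assumes, following the text above, that a strategy's estimated cost is the cost of scanning an index plus the cost of fetching data rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    index_scan_ios: float  # estimated I/Os to scan an index (0 if no index is used)
    row_fetch_ios: float   # estimated I/Os to fetch data rows

    @property
    def estimated_cost(self) -> float:
        return self.index_scan_ios + self.row_fetch_ios


def select_lowest_cost(strategies):
    """Return the candidate strategy with the smallest estimated I/O cost."""
    return min(strategies, key=lambda s: s.estimated_cost)


# Made-up cost figures for three candidate retrieval strategies.
candidates = [
    Strategy("sequential retrieval", 0, 1000),
    Strategy("sorted index retrieval", 12, 150),
    Strategy("hashed index retrieval", 3, 400),
]
best = select_lowest_cost(candidates)
print(best.name, best.estimated_cost)
```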
  • the optimizer initiates the execution of the relational operations indicated by the selected strategy of step 330 to produce the result of the query. It is appreciated that each of the strategies considered at step 322, although having possibly different estimated costs, yields the same result.
  • system 112 reports the output of the executed strategy as the result of the query received at step 315.
  • the process 410 of a first embodiment of the present invention is described with reference to the flow diagram of FIG. 4A and FIG. 4B; this embodiment of the present invention provides an automatic method for identifying the column groups for which workload statistics are to be collected within an RDBMS system.
  • the column groups for which workload statistics are to be collected are identified from predicates specified in the WHERE and HAVING clauses, and projection lists specified in a GROUP BY or DISTINCT clause of each query within a query workload.
  • the equi-selection and equi-join predicates within WHERE and HAVING clauses and projection list within a GROUP BY or DISTINCT clause are used in the above process to identify column groups belonging to different data tables of a database.
  • This embodiment of the present invention recognizes that maintaining workload statistics about such identified column groups is very important for optimizing a query because the corresponding conditions have significant influence on determining the cardinality of the query result.
  • For equi-selections, the optimizer will need to be informed of the combined selectivity so that the cardinality of the result (which is produced after all equi-selections have been applied) can be computed with high accuracy.
  • For equi-joins, the optimizer will need to be informed of the join fanout factors so that it can determine the cardinality of a joined result with high accuracy.
  • For a projection (e.g., in a GROUP BY or DISTINCT clause), the optimizer will need to be informed of how many distinct groups will be formed, which will be equal to the cardinality of the projected result.
  • the workload statistics that are collected on the column groups identified by the optimizer in this embodiment of the present invention include a column duplicity factor and a column null factor, which are described further below.
  • the column duplicity factor (second embodiment) provides information about the number of distinct values (e.g., the distinct cardinality) as well as the average number of duplicates per distinct combination of column values in an identified column group.
  • the column duplicity factor, related to the result cardinality discussed above, can be used to determine the combined selectivity of equi-selection predicates, the combined join fanout factor of equi-join predicates, or the number of distinct groups, which will be equal to the cardinality of the projected result.
  • the column null factor (third embodiment) indicates the fraction of rows in a data table that have null values in any column of an identified column group.
  • the column null factor can be used to identify a portion of the data table that would not participate in a relational operation (e.g., join, selection).
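  • One way an optimizer might turn these two factors into estimates is sketched below. The formulas are derived here from the definitions of the factors (duplicity factor = non-null rows / distinct cardinality; null factor = rows with nulls / table cardinality) and are an assumption of this sketch, not formulas quoted from the patent; the function names and sample numbers are likewise hypothetical.

```python
def participating_rows(table_cardinality, null_factor):
    """Rows that can take part in a join or selection on the column group."""
    return table_cardinality * (1.0 - null_factor)


def distinct_groups(table_cardinality, null_factor, duplicity_factor):
    """Distinct value combinations, e.g. the cardinality of a projected result."""
    return participating_rows(table_cardinality, null_factor) / duplicity_factor


def equi_selection_rows(duplicity_factor):
    """Average rows matching one exact combination of values in the group."""
    return duplicity_factor


# Hypothetical statistics for one column group of a 1000-row table.
table_cardinality, null_factor, duplicity_factor = 1000, 0.25, 5.0
print(participating_rows(table_cardinality, null_factor))                 # non-null rows
print(distinct_groups(table_cardinality, null_factor, duplicity_factor))  # distinct groups
print(equi_selection_rows(duplicity_factor))                              # rows per match
```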
  • When the optimizer processes a query to find the best execution strategy (e.g., FIG. 3), it also registers (if not already registered) the information about identified column groups into system catalog portions of computer readable memory units.
  • the optimizer identifies the column groups based on equi-selections, equi-joins, and projection lists in various queries in a query workload.
  • An advantage of the present invention is that the workload statistics are collected only on the identified column groups and nothing else leading to a more efficient statistics collection procedure. This contrasts with the prior art method of relying on the user to identify the interesting columns or column groups which is a difficult task, and therefore, can be unreliable and inefficient.
  • FIG. 4A and FIG. 4B illustrate a flow diagram of process 410 of this embodiment of the present invention.
  • Process 410 resides within an optimizer implemented in accordance with the present invention and is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
  • Embodiments of the present invention described further below illustrate different statistics generation procedures that can be used to produce the workload statistics that are used by process 410.
  • the present invention optimizer receives a query.
  • the present invention selects one data table ("selected table") from a list of data tables specified within the query received at step 415.
  • the present invention interrogates the query to determine if any equi-selection predicates are specified on the selected table.
  • an equi-selection can exist in WHERE and HAVING clauses of the query. If not, processing continues to step 435; if so, processing continues to step 430.
  • the equi-selection predicate indicates a search criterion wherein the search data is specified as a particular value (or variable) and the match must be exact.
  • both employee_JOBCODE and employee_AGE are column names within the selected data table.
  • the above equi-selection indicates that matching data are to be reported only for rows wherein the employee job and age match with the above values (e.g., 5 and 40). It is appreciated that an equi-selection can be defined with respect to a single column or more than one column, and that the above equi-selection is exemplary only.
  • the present invention constructs a column group based on all equi-selection predicates specified on the selected table and stores it in the system catalog (within computer readable memory units of system 112).
  • step 430 identifies employee_JOBCODE and employee_AGE as a column group.
  • identified column groups can contain one or two or more columns. All identified column groups are stored in the system catalog. This is performed for each equi-selection predicate specified on the selected table.
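  • The construction of a column group from equi-selection predicates can be sketched as follows. The predicate tuple layout, the `system_catalog` set, and the employee_SALARY range predicate are assumptions made for illustration; the two equality predicates use the column names and example values (5 and 40) mentioned in the text, written here with underscores.

```python
# Each predicate is (table, column, operator, operand). This layout is an
# assumption of the sketch, not a structure defined by the patent.
predicates = [
    ("employee", "employee_JOBCODE", "=", 5),
    ("employee", "employee_AGE", "=", 40),
    ("employee", "employee_SALARY", ">", 50000),  # hypothetical; not an equi-selection
]


def equi_selection_column_group(table, predicates):
    """Collect the columns of one table that appear in equality selections."""
    columns = {col for tbl, col, op, _ in predicates if tbl == table and op == "="}
    return tuple(sorted(columns))


system_catalog = set()  # stand-in for the RDBMS system catalog
group = equi_selection_column_group("employee", predicates)
if group:
    system_catalog.add(("employee", group))
print(group)
```

  • Only the equality predicates contribute columns, so the range predicate on the hypothetical salary column is left out of the group.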
  • the present invention interrogates the query to determine if any equi-join predicates are specified on the selected table.
  • an equi-join can exist in WHERE clauses of the query. If not, processing continues to step 450 (FIG. 4B); if so, processing continues to step 440.
  • in a join operation, data from two tables are joined together based on matching data from one or more columns.
  • the present invention constructs different column groups based on different equi-joins connecting selected data table with each of the other data tables and stores the column groups in the system catalog (within computer readable memory units of system 112).
  • a column group is defined by step 440 for the two columns that specify the department number and the common regional designation. All identified column groups are stored in the system catalog. This is performed for each equi-join predicate specified on the selected table. Step 450 of FIG. 4B is then entered.
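  • A sketch of how equi-join predicates could be grouped into one column group per joined table is given below. The table names, column names, and the tuple layout of the join list are hypothetical; the department number and regional designation columns echo the example in the text, and the project join is added only to show a second joined table.

```python
from collections import defaultdict

# (left_table, left_column, right_table, right_column) for each equi-join
# predicate. All names here are invented for illustration.
equi_joins = [
    ("employee", "dept_no", "department", "dept_no"),
    ("employee", "region", "department", "region"),
    ("employee", "project_id", "project", "project_id"),
]


def equi_join_column_groups(selected_table, equi_joins):
    """Build one column group on the selected table per other joined table."""
    groups = defaultdict(set)
    for left_table, left_col, right_table, right_col in equi_joins:
        if left_table == selected_table and right_table != selected_table:
            groups[right_table].add(left_col)
        elif right_table == selected_table and left_table != selected_table:
            groups[left_table].add(right_col)
    return {other: tuple(sorted(cols)) for other, cols in groups.items()}


employee_groups = equi_join_column_groups("employee", equi_joins)
print(employee_groups)
```

  • Grouping by the other table keeps the two columns joining employee to department together in one multi-column group, while the join to project yields a separate single-column group.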
  • the present invention interrogates the query to determine if any projection operations are specified on the selected table. If not, processing continues to step 460 of FIG. 4B; if so, processing continues to step 445.
  • A projection operation is used mainly for two purposes: (1) to compute aggregates of rows of data by a projection column or columns and then build a new data table having the aggregated result; and (2) to eliminate duplicate row data by creating a new data table and projecting distinct data from a given data table into the new data table.
  • Projections or projection lists are specified in the GROUP BY or DISTINCT clause of a query and are used to perform the above functions on columns of the selected table.
  • the present invention constructs a column group based on the projection specified on the selected table and stores the column groups in the system catalog (within computer readable memory units of system 112). Step 460 of FIG. 4B is then entered.
  • the present invention examines the query received at step 415 to determine if more data tables are specified within the query that have not yet been processed by process 410. If so, processing returns to step 420 of FIG. 4A to select a next data table; if not, process 410 ends.
  • the determined set of column groups identified can include one or more columns on which single column or multi-column workload statistics are to be collected.
  • the present invention stores the columns and/or column groups selected at steps 430, 440, and 455 into the system catalog portion of computer readable memory units of system 112 provided they were not previously stored.
  • When a statistics collection utility runs, it reads the column groups stored in the system catalog of the RDBMS system and collects the workload statistics for each column group.
  • the present invention advantageously provides an automatic mechanism for determining certain data in the database for which statistics are to be collected.
  • the availability of the workload statistics on identified columns is deemed to be highly relevant to the cost estimation analysis performed by the optimizer to find an optimal strategy for each query in the query workload.
  • FIG. 5 illustrates steps of the statistics generation process 560 which operates to construct the statistics used in process flow 310 (FIG. 3). It is appreciated that process 560 typically operates in the background of the RDBMS system in an effort to avoid interference with the RDBMS system's execution of other relational operations. In this respect, process 560 and process 310 are asynchronous. Process 560 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Process 560 commences at step 565 where a particular data table is selected ("selected table").
  • process 560 accesses the system catalog within the RDBMS system (e.g. a portion of memory unit 102) to obtain a list of identified column groups (on which to collect workload statistics) associated with a selected data table.
  • the workload statistics (column duplicity factor and column null factor) are generated for each of the identified column groups (e.g., identified by step 410 above). This is performed for each identified column group in the system catalog.
  • the generated workload statistics are then stored together with respective column groups into the system catalog of the RDBMS system for use by process 310 (FIG. 3) when needed.
  • the present invention checks if more data tables are present in the database. If so, process 560 returns to step 565 to select a new data table; if not, process 560 returns (to be executed at a later predetermined time).
  • step 575 to provide, respectively, collection of (1) workload-based multi-column non-null statistic (column duplicity factor), and (2) workload-based multi-column null statistic (column null factor) for each of the identified column groups in the database.
  • the embodiment of the present invention represented by process 410 provides methods for identifying the data for which workload statistics are to be collected. This embodiment of the present invention places identifiers of this data within the system catalog which is read by step 570 of process 560, although other memory locations can also be used within the scope of the present invention.
  • Process 610 of the second embodiment of the present invention operates within the statistics generation procedures of an RDBMS system (process 560).
  • This embodiment recognizes that in many instances within a database system, the distribution of values in one column can correlate to the values of another column.
  • Conventional RDBMS systems assume that there is no correlation between two or more columns when determining overall estimated result cardinalities of a relational operation that includes two or more columns.
  • the result cardinality represents the average number of rows of a data table that result from a particular relational operation (e.g., a search, projection, join, etc.).
  • the zero-correlation assumption made by the conventional systems usually causes a large percentage error in the estimation of query solution costs and cardinalities when strong correlation exists between multiple columns on which relational operations are performed, such as selections, projections, and joins. Large errors in solution costs lead the optimizer to select a sub-optimal strategy for a query.
  • the first embodiment of the present invention avoids the above problem through automatic identification of column groups based on a query workload, and collecting workload statistics on single columns and multiple columns (multi-column statistics) as a group, rather than collecting statistics merely on individual columns. In effect, the first embodiment of the present invention provides a mechanism for capturing the correlation between columns by collecting statistics on multiple columns as a group.
  • FIG. 6 illustrates a process flow 610 of steps of the second embodiment of the present invention.
  • Process 610 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
  • the present invention reads the system catalog space of a computer readable memory unit of system 112 to receive an indication of certain column groups (column identifiers) on which workload statistics need to be collected.
  • this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention.
  • FIG. 7 illustrates a data table 500 comprising three columns: 1) an employee identification number column 510; 2) an employee age column 520; and 3) an annual salary column 530.
  • the data table 500 contains "n" rows 540-547.
  • the present invention receives column identifiers of a column group for which a column duplicity factor needs to be computed, e.g., on the age column 520 and the annual salary column 530 of data table 500.
  • the present invention accesses row groups containing identified columns (identified in step 615) from the selected table.
  • a row group includes the corresponding rows of each identified column of the data table.
  • the identified columns on which workload statistics are to be collected are referred to as "column groups."
  • Together, the age and annual salary entries of a particular row, e.g., row 542, of data table 500 form a corresponding row group.
  • the rows are corresponding because they belong to the same employee identification number.
  • the following are further examples of row groups in accordance with the example of FIG. 7: 1) 34/30,000 of row 541; 2) 37/35,000 of row 544; and 3) 37/40,000 of row 546.
  • the result cardinality of the column group is dependent on the data of multiple columns.
  • the number of distinct row groups is 5 and they are: 34/30,000; 36/32,000; 37/35,000; 37/40,000; and 57/50,000.
  • the present invention determines the number of distinct "pairs" between the age and income columns of data table 500.
  • the multi-column distinct cardinality statistic of 5 is reported as the number of distinct row groups in the column group including column 520 and column 530.
  • the column duplicity factor (a multi-column statistic) is computed by dividing the number of non-null row groups by the number of distinct row groups in the selected table.
  • the column duplicity factor represents the average number of rows returned by a query of the data table that includes specific selection values for the columns identified in step 615.
  • the column duplicity factor (a multi-column statistic) determined at step 630 is stored as workload-based statistic corresponding to the columns that were identified in step 615. These values are stored and indexed in computer readable memory units of system 112. As described with reference to FIG. 3, the optimizer utilizes these workload statistics to perform accurate cost estimations in its cost analysis in determining the lowest cost strategy for performing a query.
  • the result cardinality of the two columns is expressed as the overall selectivity times the table cardinality which represents the number of rows of the data table 500.
  • using multi-column workload statistics, the average number of rows a query is expected to locate based on a selection of both age and income is 1.6.
  • the multi-column workload statistic of the present invention is much more accurate than the single column derived workload statistic because in the example case of FIG. 7, and in many instances, the data within the columns are correlated. This correlation is captured in the result cardinality value of the present invention but is ignored in the single column derivation method of conventional systems.
  • the second embodiment of the present invention provides an RDBMS optimizer with more accurate estimated costs for optimal strategy selection.
  • an optimal strategy is more likely to be selected by the optimizer and a reduced instance of sub-optimal strategy selection is provided by the present invention.
  • more accurate cost estimation by the optimizer is possible because more accurate information is available regarding: (1) how many rows on average are selected from a given data table; and/or (2) how many rows on average are going to participate in a relational operation (e.g., a join).
  • the process 710 of a third embodiment of the present invention operates within the statistics generation procedures of an RDBMS system (process 560).
  • the process 710 of the third embodiment of the present invention is described with reference to the flow diagram of FIG. 8 and the exemplary data tables of FIG. 7.
  • the third embodiment of the present invention maintains separate workload statistics regarding the number of rows with null data values in a column group. Separating rows with null data from rows with non-null data is important because rows with nulls do not participate in most of the relational operations performed within an RDBMS. Therefore, as recognized by the third embodiment of the present invention, the cardinality (e.g., the number of rows) of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
  • FIG. 8 illustrates a process 710 performed by an implementation of the third embodiment of the present invention for computing multi-column null statistics for a column group.
  • Process 710 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
  • FIG. 8 is described with respect to an example data table 500 of FIG. 7 where n is 100.
  • the age column 520 contains null data rows for 50 of the employees.
  • there are only 20 distinct age entries in age column 520 so it is appreciated that the initial distinct cardinality (e.g., without consideration of null data values) for column 520 is 100/20 or 5.0.
  • the present invention receives column identifiers of a column group of a selected table.
  • system 112 reads the system catalog space of a computer readable memory unit of system 112 to receive the column group identifiers on which to collect workload-based null statistics of the selected table.
  • this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention.
  • this information can be user originated with respect to the first embodiment of the present invention.
  • the present invention receives an indication that workload statistics are to be collected based on the age column 520 of data table 500 (FIG. 7).
  • the present invention accesses the corresponding row groups of the data table containing the identified column groups.
  • the present invention counts the number of row groups with null data in any of the identified columns.
  • system 112 examines the individual entries of the corresponding row groups to determine the number of row groups with null data in any of the identified column groups. For instance, at step 730, the present invention performs a search, or other equivalent operation, to determine the number of null rows within the identified column and records this value as a null data statistic for the identified column. This is done for each of the identified columns for the selected table. With reference to column 520 (FIG. 7), the present invention performs a search and determines that 50 of the rows of identified column 520 contain null data values.
  • the present invention utilizes the number of null rows determined in step 730 and the number of table rows to determine the column null factor for the identified column.
  • system 112 computes the column null factor as equal to the number of row groups with null data divided by the number of table rows. In one embodiment, this is computed for all identified columns of the selected table. For instance, with respect to column 520, the present invention divides the number of null rows (50) by the total number of table rows (100) of column 520 to compute a column null factor of 0.5.
  • the age column 520 contains null data rows for 50 of the employees out of a total of 100. Also, in the example, there are only 20 distinct values in the age column 520. It is appreciated that without the null statistic, the initial result cardinality for an equi-selection on column 520 is 100/20 or 5.0. By collecting the null statistic (column null factor), the present invention gives a much more accurate final result cardinality by multiplying the initial result cardinality by one minus the column null factor. That is, final result cardinality = initial result cardinality * (1 - column null factor) = 5.0 * (1 - 0.5) = 2.5.
  • the final result cardinality gives a much more accurate estimate of the average number of rows of column 520 that are likely to have any particular age value, considering rows with null values within column 520.
  • the third embodiment of the present invention stores the determined column null factor in a computer readable memory unit as a workload-based null statistic for the identified column group of the selected table. For example, the null factor of 0.5 is then recorded for column 520.
  • the present invention gives a more accurate final result cardinality which indicates that, on average, 50/20 or 2.5 employees have the same age in our current example. On average, a query of column 520 for a particular age would result in 2.5 rows identified, not 5.0 as indicated by the initial result cardinality that does not take into consideration the presence of null data rows.
  • the following material illustrates the manner in which the workload statistics (e.g., column duplicity factor and column null factor) are used to compute the result cardinality of the following relational operations: 1) equi-selection, 2) equi-join, and 3) projection.
  • two data tables exist, each with one column group.
  • the column group in the first table is denoted as CG1 and the column group in the second table as CG2.
  • the column duplicity factor and column null factor based on CG1 are denoted as CDF1 and CNF1, respectively.
  • the column duplicity factor and column null factor based on CG2 are denoted as CDF2 and CNF2, respectively.
  • TC1 denotes the cardinality of the first table (i.e., its number of rows).
  • TC2 denotes the cardinality of the second table.
  • CDF1 represents the average number of duplicates per distinct value of CG1.
  • An equi-selection over CG1 is nothing but selecting a distinct value of CG1, and therefore, the result cardinality of such an operation is expected to be equal to CDF1.
  • TC1*(1-CNF1)/CDF1 represents the distinct cardinality of CG1.
  • a projection over CG1 is nothing but selecting all distinct values in CG1, and therefore, the result cardinality of such an operation is expected to be equal to TC1*(1-CNF1)/CDF1.
  • TC1*(1-CNF1)/CDF1 is denoted as DC1
  • TC2*(1-CNF2)/CDF2 is denoted as DC2.
  • DC1 and DC2 represent, respectively, the distinct cardinality of CG1 and CG2.
  • the result of the equi-join operation based on CG1 and CG2 is estimated as equal to MIN (DC1, DC2)*CDF1*CDF2.
  • MIN is a function that chooses the minimum value between DC1 and DC2.
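
The three estimates above can be written out as a short sketch. The Python below is illustrative only and is not taken from the patent: the function names are hypothetical, and the sample figures in the note that follows are assumptions chosen to echo the FIG. 7 examples.

```python
def distinct_cardinality(tc, cnf, cdf):
    """DC = TC * (1 - CNF) / CDF: distinct values in a column group."""
    return tc * (1.0 - cnf) / cdf


def equi_selection_cardinality(cdf):
    """An equi-selection picks one distinct value, so CDF rows are expected."""
    return cdf


def projection_cardinality(tc, cnf, cdf):
    """A projection returns every distinct value of the column group."""
    return distinct_cardinality(tc, cnf, cdf)


def equi_join_cardinality(tc1, cnf1, cdf1, tc2, cnf2, cdf2):
    """MIN(DC1, DC2) * CDF1 * CDF2 for an equi-join of CG1 with CG2."""
    dc1 = distinct_cardinality(tc1, cnf1, cdf1)
    dc2 = distinct_cardinality(tc2, cnf2, cdf2)
    return min(dc1, dc2) * cdf1 * cdf2
```

As an assumed usage, take TC1 = 100, CNF1 = 0.5, CDF1 = 2.5 and TC2 = 8, CNF2 = 0, CDF2 = 1.6. Then DC1 is 20, DC2 is 5, and the join estimate is 5 * 2.5 * 1.6, or about 20 rows.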

Abstract

Methods for collecting query workload based statistics within a relational database management system (RDBMS) and for identifying columns for which statistics collection is to be performed. The novel system collects workload statistics that are dependent on multiple columns, rather than merely single columns. Multi-column statistic generation provides more accurate results for columns having correlated data, and therefore leads to better estimated cost analysis by an RDBMS optimizer. In one embodiment, a column duplicity factor is based on an analysis of distinct data rows, e.g., combinations of values within multiple columns, rather than rows of single columns. The novel system also collects separate statistics regarding the presence of null data within the rows of a column group. Separate null data statistics improve the determined result cardinality used by the RDBMS optimizer because the cardinality of a relational operation's result is generally determined by the number of input rows with non-null data. The novel system includes an RDBMS optimizer that automatically identifies columns and column groups on which workload statistics are to be generated. The parameters within a query (e.g., equi-joins, equi-selections, and projections) are analyzed by the optimizer to automatically identify the column groups. The identified columns are then registered within a system catalog. The registered column groups are read by statistics generation procedures to identify those column groups for which workload statistics are to be collected.

Description

This is a continuation of copending application Ser. No. 08/796,779 filed on Feb. 10, 1997 which is hereby incorporated by reference to this application.
BACKGROUND OF THE INVENTION
(1) Field of the Invention
The present invention relates to the field of computer systems used for storage and retrieval of data. More specifically, the present invention relates to the field of statistics measurement systems for a relational database management system.
(2) Prior Art
Computer implemented relational database management systems (RDBMS) are well known in the art. Such database systems commonly employ data tables that contain columns and rows containing data (e.g., data values). A typical RDBMS, in addition to maintaining the data of a particular database, also maintains a set of statistics regarding the data. These statistics are useful in efficiently accessing, manipulating, and presenting the stored data.
When an RDBMS system receives a query, its optimizer analyzes the structure of the query, analyzes the various clauses (e.g., selection and join predicates) specified in the query, and examines existing data access paths (e.g., indexes) to formulate a strategy (e.g., method) of performing various relational operations (e.g., aggregation, sort, search, join, etc.) to produce the result of the query. The optimizer generally explores various strategies to find the best strategy. The best strategy is the one with the lowest estimated RDBMS cost, and such a strategy generally takes the least amount of RDBMS resources, the least amount of time, or both to produce the result of the query.
RDBMS cost is generally evaluated based on input/output (I/O) operations that are required to perform necessary relational operations to produce the result of the query. Therefore, the optimizer selection criterion for the best strategy for a query is a minimum execution cost. The minimum execution cost is calculated using available statistics such as table and index cardinalities, workload statistics (e.g. statistics on columns and column groups), storage statistics and assumptions of the manner in which any relational operation typically changes the incoming data size and data distribution.
Prior RDBMS systems collect workload statistics based on single columns of data. As such, these systems make a zero correlation assumption: that the values of two or more columns of data are not related in any way. However, in many cases, this zero correlation assumption is incorrect. For example, in a data table having a first column of employee age and a second column of employee job position, it is possible that, in general, older employees hold higher job positions. As such, the row values in the first column correlate to the values in the second column.
In order to estimate the cost of a strategy for a query on an example employees data table asking for all employees with a certain age (e.g., 35) and certain job position (e.g., 4), the prior art systems compute a result cardinality (result_cardinality) based on the inverse of the distinct cardinality of the separate age column (DC1) multiplied by the inverse of the distinct cardinality of the separate job position column (DC2), and this result is multiplied by the number of employees (#EE). The distinct cardinality of a column represents the number of distinct values within that column. See the computation of the result cardinality below:
result_cardinality = (1/DC1 * 1/DC2) * #EE
The estimated result cardinality represents the average number of rows likely to be produced for a query asking for all employees with a certain age and a certain job position. However, the relationship used above by the prior art system assumes that there is no correlation between the first and second columns. This assumption leads to a computation of a result cardinality value that is much smaller than what is returned in reality due to data correlation. When the above result cardinality value is used by an optimizer, its estimated costs for performing certain relational operations on the subject data become inaccurate. Specifically, this inaccuracy can cause the optimizer to select a strategy resulting in worse performance in producing the result. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving more than one data column.
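To make the underestimate concrete, the sketch below applies the relationship above to a small, invented employee table in which age and job position move together. The rows, the function names, and the choice of Python are assumptions made for illustration; they are not part of the prior art systems described.

```python
# Hypothetical (age, job position) rows; older employees hold higher positions.
EMPLOYEES = [
    (25, 1), (25, 1),
    (35, 4), (35, 4), (35, 4),
    (45, 6), (45, 6),
    (55, 8),
]


def zero_correlation_estimate(rows):
    """(1/DC1 * 1/DC2) * #EE, treating the two columns as independent."""
    dc1 = len({age for age, _ in rows})   # distinct ages
    dc2 = len({job for _, job in rows})   # distinct job positions
    return (1 / dc1) * (1 / dc2) * len(rows)


def observed_average(rows):
    """Average number of rows per (age, job position) pair that occurs."""
    return len(rows) / len(set(rows))
```

For these invented rows, the independence-based estimate is 0.5 rows, while each age and job position pair that actually occurs matches 2 rows on average.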
The presence of null data values within rows of a data table can lead to inaccuracies in the workload statistics, which will lead to inaccurate cost estimation analysis by the RDBMS optimizer. Most relational operations ignore incoming data rows that contain null data values in key columns. Therefore, it is important to also collect statistics about the number of rows with null data in certain columns or column groups. Prior art RDBMS systems that compute certain database statistics (e.g., distinct cardinality of a column) without regard to rows with null data produce inaccurate statistics. The inaccurate statistics can cause the RDBMS optimizer to select a sub-par strategy for a given query. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving columns of rows with null data. The present invention provides such a system.
In prior RDBMS systems, a manual process is performed by users (e.g., system administrators) to determine the data columns of a database on which to collect statistics. In effect, in prior RDBMS systems, the user informs integrated statistics gathering procedures of the separate data columns on which workload statistics are to be gathered. However, it is often the case that the system administrators do not realize the specific mechanisms by which the RDBMS optimizer operates. Relying on users to indicate the data columns on which to collect statistics can be unreliable and inefficient. Namely, not knowing what is needed by the optimizer, the system administrator can cause the statistics gathering procedures to inefficiently over-collect statistics or fail to collect statistics on required columns. What is needed is an efficient method for identifying columns or column groups on which statistics collection is required, and one that does not rely on user origination of the above information.
Accordingly, the present invention provides a method and system for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for workload queries involving more than one data column. The present invention also provides a method and system for collecting statistics within an RDBMS system that provides for accurate estimated cost analysis for queries involving columns of rows with null data values. The present invention further provides a method and system within an RDBMS optimizer for identifying the columns or column groups on which statistics collection is required that does not rely on manual user identification of the columns or column groups.
SUMMARY OF THE INVENTION
Methods are described for collecting query workload based statistics within a relational database management system (RDBMS) and for automatically identifying columns or groups of columns for which statistic collection is to be performed. The novel system collects workload based statistics on multi-columns rather than merely on single columns. This multi-column statistic gathering technique of the present invention provides more accurate results when the row values within the multi-columns are correlated to each other. The multi-column statistics that result from the present invention lead to better estimated cost analysis performed by an RDBMS optimizer which leads to increased likelihood of finding an optimal strategy for each query in a workload. In one embodiment, the present invention generates column group statistics which represent distinct cardinality measured over rows of multiple columns of data, rather than individual statistics on single columns ("a multi-column statistic"). In cases where the data within a column group is correlated, the multi-column result cardinality of the present invention computed using column group statistics is much more accurate compared to a result cardinality computed from statistics on individual columns using zero correlation assumption.
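One way to picture the column group statistic is the column duplicity factor of the second embodiment: the number of non-null row groups divided by the number of distinct row groups. The sketch below is a hedged illustration in Python, not the patented implementation. The function name and the dictionary-based rows are hypothetical, and the table contents are partly invented: only the five distinct age and salary pairs come from the FIG. 7 discussion, while their assignment to eight rows is an assumption.

```python
def column_duplicity_factor(rows, columns):
    """Non-null row groups divided by distinct row groups for a column group."""
    groups = [tuple(row[c] for c in columns) for row in rows]
    # Row groups with a null in any identified column are set aside.
    non_null = [g for g in groups if all(v is not None for v in g)]
    if not non_null:
        return 0.0
    return len(non_null) / len(set(non_null))


# Assumed contents for data table 500 (rows 540-547).
TABLE_500 = [
    {"age": 34, "salary": 30000},
    {"age": 34, "salary": 30000},
    {"age": 36, "salary": 32000},
    {"age": 36, "salary": 32000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 35000},
    {"age": 37, "salary": 40000},
    {"age": 57, "salary": 50000},
]

cdf_age_salary = column_duplicity_factor(TABLE_500, ["age", "salary"])
```

With these assumed rows there are 8 non-null row groups and 5 distinct ones, so the factor for the age and salary column group is 8/5, or 1.6.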
The novel system also collects separate statistics regarding the number of null data values within the rows of a column. Separate null data statistics improve the cost estimation analysis used by the RDBMS optimizer because the cardinality of a result produced by a relational operation is generally determined by the number of input rows with non-null data. The present invention utilizes the separate null data statistics in determining a result distinct cardinality of a column, or column group, of a data table. The null data statistics are useful in accurate estimation of cardinality of the result of a relational operation (e.g. a data search or a table join) and in accurate estimation of average cost of performing a relational operation (e.g. a data sort, a data search, a table join).
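The null statistic can be sketched in the same spirit. The Python below assumes a column null factor defined as the number of row groups with null data divided by the number of table rows, and a final result cardinality equal to the initial result cardinality multiplied by one minus that factor. The function names and the generated rows are hypothetical; the figures (100 rows, 50 null ages, 20 distinct ages) follow the age column example.

```python
def column_null_factor(rows, columns):
    """Row groups with a null in any identified column, over all table rows."""
    if not rows:
        return 0.0
    null_groups = sum(
        1 for row in rows if any(row[c] is None for c in columns)
    )
    return null_groups / len(rows)


def final_result_cardinality(table_rows, distinct_values, null_factor):
    """Initial estimate (rows / distinct values) scaled by (1 - null factor)."""
    initial = table_rows / distinct_values
    return initial * (1 - null_factor)


# Assumed data: 50 null ages plus 50 non-null ages over 20 distinct values.
AGES = [None] * 50 + [20 + (i % 20) for i in range(50)]
ROWS = [{"age": age} for age in AGES]

cnf = column_null_factor(ROWS, ["age"])
distinct_ages = len({a for a in AGES if a is not None})
estimate = final_result_cardinality(len(ROWS), distinct_ages, cnf)
```

Under these assumptions the column null factor is 0.5, and the estimate falls from the initial 5.0 rows to 2.5 rows per age value.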
The novel system also includes an RDBMS optimizer that automatically identifies columns and groups of columns on which workload based statistics should be generated. Within the present invention, the clauses within a query such as equi-join and equi-selection predicates and projection columns are analyzed by the RDBMS optimizer to identify columns and column groups of interest. The identified columns are then registered within a system catalog of the RDBMS. Later on, the statistics collection utility of the present invention reads the registered columns and column groups from the system catalog to identify the columns and column groups on which to generate the workload statistics. The statistics so generated are then used by the optimizer to perform subsequent estimations of the cost of performing certain relational operations in order to select a strategy for a query having the least cost.
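The identification and registration flow can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the optimizer's actual logic: the query is assumed to be already parsed into lists of predicates, the system catalog is modeled as a plain Python set, and every name in the code is hypothetical.

```python
from collections import defaultdict


def identify_column_groups(tables, equi_selections, equi_joins, projections):
    """Return a set of (table, frozenset_of_columns) column groups.

    equi_selections: (table, column) pairs compared for equality with a value
    equi_joins:      ((table_a, column_a), (table_b, column_b)) pairs
    projections:     (table, column) pairs named in GROUP BY or DISTINCT
    """
    groups = set()
    for table in tables:
        # One group from all equi-selection predicates on this table.
        selected = {col for t, col in equi_selections if t == table}
        if selected:
            groups.add((table, frozenset(selected)))

        # One group per other table that this table is equi-joined with.
        by_partner = defaultdict(set)
        for (ta, ca), (tb, cb) in equi_joins:
            if ta == table:
                by_partner[tb].add(ca)
            if tb == table:
                by_partner[ta].add(cb)
        for cols in by_partner.values():
            groups.add((table, frozenset(cols)))

        # One group from the projection columns on this table.
        projected = {col for t, col in projections if t == table}
        if projected:
            groups.add((table, frozenset(projected)))
    return groups


def register_column_groups(catalog, groups):
    """Store groups not previously in the catalog; return the new ones."""
    new_groups = groups - catalog
    catalog.update(new_groups)
    return new_groups
```

A statistics utility could then iterate over the catalog set and compute statistics per group. Registering the same groups twice adds nothing the second time, which mirrors storing a column group only if it was not previously stored.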
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a computer system used in accordance with the embodiments of the present invention.
FIG. 2 is an illustration of a data table used by embodiments of the present invention including columns and rows of data.
FIG. 3 is a flow diagram of steps performed by the optimizer of the RDBMS system of the present invention when selecting a lowest cost strategy for a query.
FIG. 4A and FIG. 4B illustrate a flow diagram of steps of a first embodiment of the present invention for automatic identification of columns and column groups by the RDBMS optimizer on which to collect the workload statistics.
FIG. 5 is a flow diagram of steps performed by statistics generation procedures of the present invention to generate workload statistics.
FIG. 6 illustrates a flow diagram of steps of a second embodiment of the present invention for generating non-null workload statistics (e.g., the column duplicity factor), which is a multi-column statistic, on identified column groups.
FIG. 7 illustrates an exemplary multi-column data table that can be used by the first embodiment of the present invention.
FIG. 8 illustrates a flow diagram of steps of a third embodiment of the present invention for generating null workload statistics (e.g., the column null factors) on identified column groups.
DETAILED DESCRIPTION OF THE INVENTION
In the following detailed description of the embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be obvious to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
NOTATION AND NOMENCLATURE
Some portions of the detailed descriptions which follow are presented in terms of steps, procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, step, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system (e.g., 112 of FIG. 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
COMPUTER SYSTEM ENVIRONMENT
Refer to FIG. 1 which illustrates a computer system 112. Within the following discussions of the present invention, certain processes (e.g., processes 310, 350, 410, 610, 710, 910, 1010) and steps are discussed that are realized, in one implementation, as a series of instructions (e.g., software program) that reside within computer readable memory units of system 112 and are executed by processors of system 112. When executed, the instructions cause the computer system 112 to perform specific actions and exhibit specific behavior which is described in detail to follow.
In general, computer system 112 used by the present invention comprises an address/data bus 100 for communicating information, one or more central processors 101 coupled with the bus 100 for processing information and instructions, a computer readable volatile memory unit 102 (e.g., random access memory, static RAM, dynamic RAM, etc.) coupled with the bus 100 for storing information and instructions for the central processor(s) 101, and a computer readable non-volatile memory unit (e.g., read only memory, programmable ROM, flash memory, EPROM, EEPROM, etc.) coupled with the bus 100 for storing static information and instructions for the processor(s) 101. System 112 also includes a mass storage computer readable data storage device 104 (hard drive or floppy) such as a magnetic or optical disk and disk drive coupled with the bus 100 for storing information and instructions. Optionally, system 112 can include a display device 105 coupled to the bus 100 for displaying information to the computer user, an optional alphanumeric input device 106 including alphanumeric and function keys coupled to the bus 100 for communicating information and command selections to the central processor(s) 101, an optional cursor control device 107 coupled to the bus 100 for communicating user input information and command selections to the central processor(s) 101, and an optional signal generating device 108 coupled to the bus 100 for communicating command selections to the processor(s) 101. In one exemplary implementation, system 112 is a DEC Alpha computer system by Digital Equipment Corporation, but could equally be any of a number of well known and commercially available platforms.
FIG. 2 illustrates an exemplary data table 200 as used by embodiments of the present invention. It is appreciated that data table 200, as well as the columns and rows of data which comprise the table 200, are stored in computer readable memory units of system 112. Exemplary table 200 includes x number of separate data columns (or "columns") 210, 220, 230, and 240. Within each column, table 200 includes n rows of data 250-256. Data ("data values") are stored within the rows of each column. A data value can be either non-null (e.g., a known value) or null (e.g., an unknown value or an empty set value). A column group is a collection of multiple columns, e.g., columns 210 and 220. Herein, references to columns, rows, column groups, and data tables refer to analogous data structures as shown in FIG. 2 and as stored in computer readable memory units of system 112.
OPTIMIZER OPERATION
Database performance is directly affected by the ability of a database to efficiently access a row or rows stored on disk (e.g., 104) through input/output (e.g., I/O) operations. The greater the number of I/O operations, the longer it takes to find and retrieve rows that satisfy a query. The optimizer of the RDBMS system finds a lowest cost strategy (generally the most I/O efficient) for retrieving the subject data and producing the result of a query. Although the optimizer can consider many strategies, each with a different cost, to find the one with the lowest cost, all strategies yield the same result.
Herein, cost is the estimated number of disk I/Os required to execute a strategy and produce the result of a query. This is the metric used by the optimizer to find an optimal strategy, e.g., the one with the minimal cost. Main factors affecting the cost of a strategy include the following: 1) query structure, 2) query predicates (e.g., equi-selections, equi-joins, etc.), 3) result ordering or grouping, 4) access paths available (e.g., hash index, B-tree index), and 5) join methods available (e.g., sort-merge join, hash join). Database statistics, if available, can help the RDBMS optimizer significantly improve the accuracy of the cost estimation analysis used to find an optimal strategy.
In one embodiment, database statistics include cardinality statistics, workload statistics and storage statistics. Cardinality statistics include the table cardinality and index cardinality. The table cardinality represents the total number of rows in a data table. The index cardinality represents the total number of distinct keys (index values) within an index. In other words, the index cardinality is a distinct cardinality of index columns. The workload statistics include column duplicity factor and column null factor. The column duplicity factor is a ratio of the number of table rows with non-null data in a column group to the distinct cardinality of a column group. It represents average number of duplicates per distinct combination of values in a column group. The column null factor is a ratio of the number of table rows with null data in any column of a column group to the table cardinality. It represents a fraction of a data table with null data in rows of a column group.
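The two workload statistics defined above can be illustrated with a minimal sketch. The function names, the use of `None` to stand for a null data value, and the sample rows are assumptions made for illustration only and do not appear in the described embodiments. The sketch also assumes that a row counts as "non-null" for the column group only when every column of the group is non-null.

```python
from typing import Optional, Sequence, Tuple

# One entry per table row, holding only the columns of the column group.
# None is used here (as an assumption) to represent a null data value.
RowGroup = Tuple[Optional[int], ...]


def column_null_factor(rows: Sequence[RowGroup]) -> float:
    """Rows with a null in any column of the group, divided by table cardinality."""
    if not rows:
        return 0.0
    null_rows = sum(1 for r in rows if any(v is None for v in r))
    return null_rows / len(rows)


def column_duplicity_factor(rows: Sequence[RowGroup]) -> float:
    """Non-null rows divided by the distinct cardinality of the column group."""
    non_null = [r for r in rows if all(v is not None for v in r)]
    if not non_null:
        return 0.0
    return len(non_null) / len(set(non_null))


# Hypothetical two-column group over a six-row table.
rows = [(1, 10), (1, 10), (2, 20), (None, 20), (3, None), (2, 20)]
null_factor = column_null_factor(rows)            # 2 of 6 rows contain a null
duplicity_factor = column_duplicity_factor(rows)  # 4 non-null rows / 2 distinct pairs
print(null_factor, duplicity_factor)
```

In this sample, each distinct combination of values appears twice on average among the non-null rows, and one third of the table would not participate in an operation on the column group.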
FIG. 3 illustrates an exemplary process flow 310 performed within an RDBMS system of the present invention which utilizes an optimizer. Generally, upon receiving a query, the optimizer: 1) identifies different column groups based on equi-join predicates, equi-selection predicates and projections specified in the query; 2) uses stored workload statistics on identified column groups and other stored database statistics to estimate the costs of performing different relational operations required to produce the result of the query; 3) builds different strategies of executing relational operations in a certain order and estimates the cost for each strategy; and 4) selects the lowest cost strategy. It is appreciated that process 310 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1).
At step 315 of FIG. 3, the RDBMS system receives a query, e.g., a user request for data. The received query contains certain predicates (joins, filters, and other conditions, etc.). At step 317, system 112 identifies column groups based on equi-joins, equi-selections and projections that are specified in the query. At step 320, the optimizer uses stored workload statistics on identified column groups for cost estimation of different relational operations. Also at step 320, the optimizer examines the query to determine the different relational operations (e.g., data access paths, sorts, filters, projections, aggregations, joins, etc.) that are needed to complete the query. Further at step 320, the optimizer of the RDBMS system accesses stored database statistics (cardinality/workload/storage) regarding the subject data of the query received at step 315 and uses this information to estimate the costs of performing the relational operations necessary to produce the result of the query. Any of a number of well known methods can be used at step 320 for performing cost analysis of different relational operations used to satisfy the query.
With respect to determining the relational operations required to produce a result of a query, there are a number of well known retrieval strategies that the optimizer can utilize in order to perform a data search. A well known sequential retrieval strategy can be used where the RDBMS accesses the database pages for a table's logical area sequentially and reads all the rows in the table regardless of the selection expression used in the query. A number of well known index retrieval strategies can also be used, including those described below. In the database key (dbkey) retrieval, the RDBMS accesses table data directly through a database key (e.g., logical address) row pointer. In the index retrieval, the RDBMS accesses a specific index structure (sorted or hashed) and retrieves the index keys, which include the dbkeys, of the rows. The dbkeys are then used to fetch data rows, and the data is delivered in index order if a sorted index is used. In the index only retrieval, the RDBMS accesses only the index data; the selected index contains all the table columns specified in the query, so no further row fetches are necessary. The data is delivered in index order if a sorted index is used. In the OR index retrieval, two or more sorted or hashed indexes defined on a single data table are used when the predicates on these indexes are combined with the logical OR in the query.
At step 322 of FIG. 3, the optimizer then attempts different strategies to implement the relational operations identified in step 320 and then computes the cost to perform each strategy. A strategy is a scheme of performing different relational operations in a certain order. By reordering the execution sequence of the relational operations, different cost values result for executing the same query. For single data tables, the optimizer finds the lowest cost retrieval strategy by evaluating the cost for each possible retrieval strategy. In one embodiment, the estimated cost for a strategy includes the cost of scanning an index and the cost of fetching data rows.
At step 330 of FIG. 3, the optimizer selects the strategy, of the strategies analyzed in step 322, that yields the lowest estimated cost solution. A number of well known optimizer processes can be used by step 330 for selecting the lowest estimated cost strategy. At step 340, the optimizer initiates the execution of the relational operations indicated by the selected strategy of step 330 to produce the result of the query. It is appreciated that each of the strategies considered at step 322, although having possibly different estimated costs, yields the same result. At step 345, system 112 reports the output of the executed strategy as the result of the query received at step 315.
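The selection performed at steps 322 and 330 can be sketched as follows. This is an illustration under stated assumptions rather than the optimizer of the described embodiments: the `Strategy` class, its field names, and the I/O figures are hypothetical, and the cost model simply adds the cost of scanning an index to the cost of fetching data rows, as described for one embodiment above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Strategy:
    """A candidate strategy with hypothetical estimated disk I/O counts."""
    name: str
    index_scan_ios: float
    row_fetch_ios: float

    @property
    def cost(self) -> float:
        # Estimated cost = cost of scanning an index + cost of fetching rows.
        return self.index_scan_ios + self.row_fetch_ios


def select_lowest_cost(strategies: Iterable[Strategy]) -> Strategy:
    """Return the strategy with the lowest estimated cost (step 330)."""
    return min(strategies, key=lambda s: s.cost)


# Hypothetical estimates for three retrieval strategies on one table.
candidates = [
    Strategy("sequential retrieval", index_scan_ios=0.0, row_fetch_ios=1000.0),
    Strategy("index retrieval", index_scan_ios=12.0, row_fetch_ios=40.0),
    Strategy("index only retrieval", index_scan_ios=15.0, row_fetch_ios=0.0),
]
best = select_lowest_cost(candidates)
print(best.name, best.cost)
```

All three candidates would yield the same query result; only the estimated cost differs, which is why the accuracy of the underlying statistics matters.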
OPTIMIZER IDENTIFICATION OF COLUMN GROUPS ON WHICH TO COLLECT WORKLOAD STATISTICS
The process 410 of a first embodiment of the present invention is described with reference to the flow diagram of FIG. 4A and FIG. 4B; this embodiment of the present invention provides an automatic method for identifying the column groups for which workload statistics are to be collected within an RDBMS system. Within the present invention, the column groups for which workload statistics are to be collected are identified from predicates specified in the WHERE and HAVING clauses, and projection lists specified in a GROUP BY or DISTINCT clause of each query within a query workload. In one implementation of the present invention, the equi-selection and equi-join predicates within WHERE and HAVING clauses and projection list within a GROUP BY or DISTINCT clause are used in the above process to identify column groups belonging to different data tables of a database. This embodiment of the present invention recognizes that maintaining workload statistics about such identified column groups is very important for optimizing a query because the corresponding conditions have significant influence on determining the cardinality of the query result.
For instance, if equi-selection predicates are specified in a query, the optimizer will need to be informed of the combined selectivity so that the cardinality of the result (which is produced after all equi-selections have been applied) can be computed with high accuracy. If equi-join predicates are specified in a query, the optimizer will need to be informed of the join fanout factors so that it can determine the cardinality of a joined result with high accuracy. If a projection (e.g., in a GROUP BY or DISTINCT clause) is specified in a query, the optimizer will need to be informed of how many distinct groups will be formed which will be equal to the cardinality of the projected result.
The workload statistics that are collected on the column groups identified by the optimizer in this embodiment of the present invention include a column duplicity factor and a column null factor, which are described further below. The column duplicity factor (second embodiment) provides information about the number of distinct values (e.g., the distinct cardinality) as well as the average number of duplicates per distinct combination of column values in an identified column group. The column duplicity factor, related to the result cardinality discussed above, can be used to determine the combined selectivity of equi-selection predicates, or combined join fanout factor of equi-join predicates, or the number of distinct groups which will be equal to the cardinality of the projected result. The column null factor (third embodiment) indicates the fraction of rows in a data table that have null values in any column of an identified column group. The column null factor can be used to identify a portion of the data table that would not participate in a relational operation (e.g., join, selection).
According to the present invention, when the optimizer processes a query to find the best execution strategy (e.g., FIG. 3), it also registers (if not already registered) the information about identified column groups into system catalog portions of computer readable memory units. Within the present invention, the optimizer identifies the column groups based on equi-selections, equi-joins, and projection lists in various queries in a query workload. An advantage of the present invention is that the workload statistics are collected only on the identified column groups and nothing else leading to a more efficient statistics collection procedure. This contrasts with the prior art method of relying on the user to identify the interesting columns or column groups which is a difficult task, and therefore, can be unreliable and inefficient.
FIG. 4A and FIG. 4B illustrate a flow diagram of process 410 of this embodiment of the present invention. Process 410 resides within an optimizer implemented in accordance with the present invention and is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Embodiments of the present invention described further below illustrate different statistics generation procedures that can be used to produce the workload statistics that are used by process 410. At step 415 of FIG. 4A, the present invention optimizer receives a query. At step 420 the present invention then selects one data table ("selected table") from a list of data tables specified within the query received at step 415. At step 425, the present invention interrogates the query to determine if any equi-selection predicates are specified on the selected table. In one embodiment, an equi-selection can exist in WHERE and HAVING clauses of the query. If not, processing continues to step 435; if so, processing continues to step 430.
The equi-selection predicate indicates a search criterion wherein the search data is specified as a particular value (or variable) and the match must be exact. An example is given below:
where employee_JOBCODE=5 and employee_AGE=40
In the above example, both employee_JOBCODE and employee_AGE are column names within the selected data table. The above equi-selection indicates that matching data are to be reported only for rows wherein the employee job code and age match the above values (e.g., 5 and 40). It is appreciated that an equi-selection can be defined with respect to a single column or more than one column and that the above equi-selection is exemplary only. At step 430, the present invention constructs a column group based on all equi-selection predicates specified on the selected table and stores it in the system catalog (within computer readable memory units of system 112). In the exemplary equi-selection above, step 430 identifies employee_JOBCODE and employee_AGE as a column group. Within step 430, identified column groups can contain one, two, or more columns. All identified column groups are stored in the system catalog. This is performed for each equi-selection predicate specified on the selected table.
At step 435 of FIG. 4A, the present invention interrogates the query to determine if any equi-join predicates are specified on the selected table. In one embodiment, an equi-join can exist in WHERE clauses of the query. If not, processing continues to step 450 (FIG. 4B); if so, processing continues to step 440. Typically, in a join operation, data from two tables are joined together based on matching data from one or more columns. For instance, assuming one data table is "Departments" and another data table is "Employees," an equi-join might join all rows of the "Departments" table and all rows of the "Employees" table that share a common department number and a common regional designation (both represented by different columns within each table). This equi-join predicate would then result in a data table that specifies all employees that work in a particular department in a particular regional area and would also contain information about the particular department.
At step 440, the present invention constructs different column groups based on different equi-joins connecting selected data table with each of the other data tables and stores the column groups in the system catalog (within computer readable memory units of system 112). In the example above, a column group is defined by step 440 for the two columns that specify the department number and the common regional designation. All identified column groups are stored in the system catalog. This is performed for each equi-join predicate specified on the selected table. Step 450 of FIG. 4B is then entered.
At step 450 of FIG. 4B, the present invention interrogates the query to determine if any projection operations are specified on the selected table. If not, processing continues to step 460 of FIG. 4B; if so, processing continues to step 455. A projection operation is used mainly for two purposes: (1) to compute aggregates of rows of data by a projection column or columns and then build a new data table having the aggregated result; and (2) to eliminate duplicate row data by creating a new data table and projecting distinct data from a given data table into the new data table. Projections or projection lists are specified in a GROUP BY or DISTINCT clause of a query and are used to perform the above functions on columns of the selected table. At step 455, the present invention constructs a column group based on the projection specified on the selected table and stores the column group in the system catalog (within computer readable memory units of system 112). Step 460 of FIG. 4B is then entered.
At step 460, the present invention examines the query received at step 415 to determine if more data tables are specified within the query that have not yet been processed by process 410. If so, processing returns to step 420 of FIG. 4A to select a next data table; if not, process 410 ends.
It is appreciated that the determined set of column groups identified (from steps 430, 440, and 455) can include one or more columns on which single column or multi-column workload statistics are to be collected. At the completion of process 410, the present invention stores the columns and/or column groups selected at steps 430, 440, and 455 into the system catalog portion of computer readable memory units of system 112 provided they were not previously stored.
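The loop of process 410 (steps 420 through 460) can be sketched as follows. The `Query` structure, which presents the equi-selections, equi-joins, and projections as already-parsed lists, is a hypothetical simplification, as are the department and region column names; a Python set stands in for the system catalog, so a column group that is already registered is not stored twice.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple

ColumnGroup = Tuple[str, FrozenSet[str]]  # (table name, columns in the group)


@dataclass
class Query:
    """Hypothetical pre-parsed query; not the catalog format of any real RDBMS."""
    tables: List[str]
    equi_selections: List[Tuple[str, str]] = field(default_factory=list)  # (table, column)
    equi_joins: List[Tuple[str, str, str, str]] = field(default_factory=list)  # (t1, c1, t2, c2)
    projections: List[Tuple[str, str]] = field(default_factory=list)  # GROUP BY / DISTINCT columns


def identify_column_groups(query: Query, catalog: Set[ColumnGroup]) -> Set[ColumnGroup]:
    for table in query.tables:  # step 420 selects a table; step 460 loops back
        # Steps 425-430: one group from all equi-selections on this table.
        selection_cols = frozenset(c for t, c in query.equi_selections if t == table)
        if selection_cols:
            catalog.add((table, selection_cols))
        # Steps 435-440: one group per other table joined to this table.
        join_cols: Dict[str, Set[str]] = {}
        for left_t, left_c, right_t, right_c in query.equi_joins:
            if left_t == table:
                join_cols.setdefault(right_t, set()).add(left_c)
            if right_t == table:
                join_cols.setdefault(left_t, set()).add(right_c)
        for cols in join_cols.values():
            catalog.add((table, frozenset(cols)))
        # Steps 450-455: one group from the projection list on this table.
        projection_cols = frozenset(c for t, c in query.projections if t == table)
        if projection_cols:
            catalog.add((table, projection_cols))
    return catalog


query = Query(
    tables=["Employees", "Departments"],
    equi_selections=[("Employees", "employee_JOBCODE"), ("Employees", "employee_AGE")],
    equi_joins=[
        ("Employees", "DEPT_NO", "Departments", "DEPT_NO"),
        ("Employees", "REGION", "Departments", "REGION"),
    ],
)
catalog = identify_column_groups(query, set())
for table, cols in sorted(catalog, key=lambda g: (g[0], sorted(g[1]))):
    print(table, sorted(cols))
```

For this query the sketch registers three column groups: the job code and age columns of "Employees" from the equi-selections, and the department number and region columns of each joined table from the equi-joins.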
When a statistics collection utility runs, it reads the column groups stored in the system catalog of the RDBMS system, and collects the workload statistics for each column group. By identifying a set of column groups based on the various clauses specified within each query of a query workload, the present invention advantageously provides an automatic mechanism for determining certain data in the database for which statistics are to be collected. The availability of the workload statistics on identified columns is deemed to be highly relevant to the cost estimation analysis performed by the optimizer to find an optimal strategy for each query in the query workload.
STATISTICS GENERATION PROCEDURE
FIG. 5 illustrates steps of the statistics generation process 560 which operates to construct the statistics used in process flow 310 (FIG. 3). It is appreciated that process 560 typically operates in the background of the RDBMS system in an effort to avoid interference with the RDBMS system's execution of other relational operations. In this respect, process 560 and process 310 are asynchronous. Process 560 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). Process 560 commences at step 565 where a particular data table is selected ("selected table").
At step 570 of FIG. 5, process 560 accesses the system catalog within the RDBMS system (e.g., a portion of memory unit 102) to obtain a list of identified column groups (on which to collect workload statistics) associated with a selected data table. At step 575, the workload statistics (column duplicity factor and column null factor) are generated for each of the identified column groups (e.g., identified by process 410 above). This is performed for each identified column group in the system catalog. At step 580, the generated workload statistics are then stored together with the respective column groups into the system catalog of the RDBMS system for use by process 310 (FIG. 3) when needed.
At step 585, the present invention checks if more data tables are present in the database. If so, process 560 returns to step 565 to select a new data table; if not, process 560 returns (to be executed at a later predetermined time).
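The control flow of process 560 can be sketched as a pair of nested loops over data tables and their registered column groups. The dictionary-based database and catalog, the function name, and the sample rows are assumptions for illustration; `None` again stands in for a null data value.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]
Group = Tuple[str, ...]  # column names forming one column group
Stats = Dict[Tuple[str, Group], Dict[str, float]]


def generate_statistics(database: Dict[str, List[Row]],
                        catalog: Dict[str, List[Group]]) -> Stats:
    stored: Stats = {}
    for table, rows in database.items():         # steps 565 and 585
        for group in catalog.get(table, []):     # step 570
            # Step 575: project each row onto the columns of the group.
            row_groups = [tuple(row.get(col) for col in group) for row in rows]
            non_null = [rg for rg in row_groups if None not in rg]
            duplicity = len(non_null) / len(set(non_null)) if non_null else 0.0
            null_factor = ((len(row_groups) - len(non_null)) / len(row_groups)
                           if row_groups else 0.0)
            # Step 580: store the statistics together with the column group.
            stored[(table, group)] = {
                "duplicity_factor": duplicity,
                "null_factor": null_factor,
            }
    return stored


# Hypothetical four-row table with one registered column group.
database = {
    "Employees": [
        {"AGE": 34, "SALARY": 30000},
        {"AGE": 34, "SALARY": 30000},
        {"AGE": None, "SALARY": 32000},
        {"AGE": 37, "SALARY": 35000},
    ]
}
catalog = {"Employees": [("AGE", "SALARY")]}
stats = generate_statistics(database, catalog)
print(stats[("Employees", ("AGE", "SALARY"))])
```

Only column groups present in the catalog are scanned, which reflects the point that statistics are collected on the identified column groups and nothing else.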
It is appreciated that the following two embodiments of the present invention operate within step 575 to provide, respectively, collection of (1) workload-based multi-column non-null statistic (column duplicity factor), and (2) workload-based multi-column null statistic (column null factor) for each of the identified column groups in the database. It is appreciated that the embodiment of the present invention represented by process 410 provides methods for identifying the data for which workload statistics are to be collected. This embodiment of the present invention places identifiers of this data within the system catalog which is read by step 570 of process 560, although other memory locations can also be used within the scope of the present invention.
WORKLOAD-BASED MULTI-COLUMN NON-NULL STATISTIC (COLUMN DUPLICITY FACTOR) COMPUTATION PROCEDURE
Process 610 of the second embodiment of the present invention operates within the statistics generation procedures of an RDBMS system (process 560). This embodiment recognizes that in many instances within a database system, the distribution of values in one column can correlate to the values of another column. Conventional RDBMS systems assume that there is no correlation between two or more columns when determining overall estimated result cardinalities of a relational operation that includes two or more columns. As used herein, the result cardinality represents the average number of rows of a data table that result from a particular relational operation (e.g., a search, projection, join, etc.).
The zero-correlation assumption made by the conventional systems usually causes a large percentage error in the estimation of query solution costs and cardinalities when strong correlation exists between multiple columns on which relational operations are performed, such as selections, projections, and joins. Large errors in solution costs lead the optimizer to select a sub-optimal strategy for a query. The first embodiment of the present invention avoids the above problem through automatic identification of column groups based on a query workload, and collecting workload statistics on single columns and multiple columns (multi-column statistics) as a group, rather than collecting statistics merely on individual columns. In effect, the first embodiment of the present invention provides a mechanism for capturing the correlation between columns by collecting statistics on multiple columns as a group.
FIG. 6 illustrates a process flow 610 of steps of the second embodiment of the present invention. Process 610 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). At step 615, the present invention reads the system catalog space of a computer readable memory unit of system 112 to receive an indication of certain column groups (column identifiers) on which workload statistics need to be collected. According to a preferred embodiment of the present invention, this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention.
By way of example, FIG. 7 illustrates a data table 500 comprising three columns: 1) an employee identification number column 510; 2) an employee age column 520; and 3) an annual salary column 530. The data table 500 contains "n" rows 540-547. In step 615 of FIG. 6, the present invention receives column identifiers of a column group for which a column duplicity factor needs to be computed, e.g., on the age column 520 and the annual salary column 530 of data table 500.
At step 620 of FIG. 6, the present invention accesses row groups containing identified columns (identified in step 615) from the selected table. A row group includes the corresponding rows of each identified column of the data table. The identified columns on which workload statistics are to be collected are referred to as "column groups." Each of the age and annual salary entries of a particular row, e.g., row 542, of data table 500 is a corresponding row group. In the example of FIG. 7, the rows are corresponding because they belong to a same employee identification number. The following are further examples of row groups in accordance with the example of FIG. 7: 1) 34/30,000 of row 541; 2) 37/35,000 of row 544; and 3) 37/40,000 of row 546.
At step 625 of FIG. 6, the present invention then determines the number of unique row groups of the row groups accessed in step 620. This statistic is called the distinct cardinality of the columns identified in step 615 (e.g., the distinct cardinality of the column group). Also at step 625, the present invention counts the total number of non-null row groups. In the example of FIG. 7, assuming n=8, the number of non-null row groups is 8.
At step 625, it is appreciated that corresponding data values of multiple columns are considered in determining the number of distinct row groups within the present invention. Therefore, the result cardinality of the column group is dependent on the data of multiple columns. In the example of FIG. 7, assuming n=8, the number of distinct row groups is 5 and they are: 34/30,000; 36/32,000; 37/35,000; 37/40,000; and 57/50,000. With reference to the above example, the present invention determines the number of distinct "pairs" between the age and income columns of data table 500. At step 625, the multi-column distinct cardinality statistic of 5 is reported as the number of distinct row groups in the column group including column 520 and column 530.
At step 630, the column duplicity factor (a multi-column statistic) is computed by dividing the number of non-null row groups by the number of distinct row groups in the selected table. The column duplicity factor represents the average number of rows returned by a query of the data table that specifies selection values for the columns identified in step 615. In the example of FIG. 7, the column duplicity factor represents the average number of rows matching a query including a specific age value and a specific income value. Assuming n=8, the result cardinality of FIG. 7 reported by step 630 is 8/5 or 1.6. The column duplicity factor (a multi-column statistic) determined at step 630 is then stored as a workload-based statistic corresponding to the columns that were identified in step 615. These values are stored and indexed in computer readable memory units of system 112. As described with reference to FIG. 3, the optimizer utilizes these workload statistics to perform accurate cost estimations in its cost analysis in determining the lowest cost strategy for performing a query.
A Comparison With Single Column Method. To illustrate the increased accuracy of multi-column workload statistics of the present invention, a comparison with workload statistics determination using single columns is presented. With reference to the example of FIG. 7, the age selectivity of column 520 is one over the column's distinct cardinality or 1/4. The income selectivity of column 530 is one over the column's distinct cardinality or 1/5. The overall selectivity is computed below:
overall selectivity = age selectivity * income selectivity
                    = (1/4) * (1/5)
                    = 1/20
The result cardinality of the two columns is expressed as the overall selectivity times the table cardinality which represents the number of rows of the data table 500. In this example n=8, so the result cardinality computed from single column statistics is expressed below:
result cardinality = (1/20) * (8)
                   = 0.4
Based on single column statistics, the average number of rows a query (e.g., SELECT FROM DATA TABLE 500 WHERE AGE=x AND INCOME=y) is expected to locate based on a selection of both age and income is 0.4. However, under the multi-column workload statistics of the present invention, the average number of rows a query is expected to locate based on a selection of both age and income is 1.6. The multi-column workload statistic of the present invention is much more accurate than the single column derived workload statistic because in the example case of FIG. 7, and in many instances, the data within the columns are correlated. The present invention accounts for this correlation, which leads to a more accurate result cardinality value, whereas the single column derivation method of conventional systems ignores it.
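The comparison above can be reproduced with a short sketch. The eight rows listed here are an assumption: FIG. 7 is not reproduced in this text, so the rows were chosen only to be consistent with the stated figures (n=8, four distinct ages, five distinct incomes, and the five distinct age/income pairs listed above). Exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

# Hypothetical (age, income) row groups consistent with the figures stated
# for data table 500; the actual rows of FIG. 7 may differ.
rows = [
    (34, 30000), (34, 30000),
    (36, 32000), (36, 32000),
    (37, 35000), (37, 35000),
    (37, 40000),
    (57, 50000),
]
n = len(rows)  # table cardinality

age_distinct = len({age for age, _ in rows})           # distinct ages
income_distinct = len({income for _, income in rows})  # distinct incomes
pair_distinct = len(set(rows))                         # distinct (age, income) pairs

# Single column method: assumes age and income are uncorrelated.
single_column_estimate = Fraction(1, age_distinct) * Fraction(1, income_distinct) * n

# Multi-column method: column duplicity factor of the (age, income) group.
multi_column_estimate = Fraction(n, pair_distinct)

print(float(single_column_estimate))  # 0.4
print(float(multi_column_estimate))   # 1.6
```

The single column estimate is four times smaller than the multi-column estimate because multiplying the two selectivities treats all 20 age/income combinations as equally likely, while only 5 of them actually occur in the table.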
By having more accurate result cardinalities, the second embodiment of the present invention provides an RDBMS optimizer with more accurate estimated costs for optimal strategy selection. With more accurate costs, the optimizer is more likely to select an optimal strategy and less likely to select a sub-optimal one. With the present invention, more accurate cost estimation by the optimizer is possible because more accurate information is available regarding: (1) how many rows on average are selected from a given data table; and/or (2) how many rows on average are going to participate in a relational operation (e.g., a join).
COMPUTATION PROCEDURE TO PROVIDE WORKLOAD-BASED MULTI-COLUMN NULL STATISTIC (COLUMN NULL FACTOR)
The process 710 of a third embodiment of the present invention, like the second embodiment (process 610), operates within the statistics generation procedures of an RDBMS system (process 560). The process 710 of the third embodiment of the present invention is described with reference to the flow diagram of FIG. 8 and the exemplary data tables of FIG. 7. The third embodiment of the present invention maintains separate workload statistics regarding the number of rows with null data values in a column group. Separating rows with null data from rows with non-null data is important because rows with nulls do not participate in most of the relational operations performed within an RDBMS. Therefore, as recognized by the third embodiment of the present invention, the cardinality (e.g., the number of rows) of a result produced by a relational operation is generally determined by the number of input rows with non-null data.
FIG. 8 illustrates a process 710 performed by an implementation of the third embodiment of the present invention for computing multi-column null statistics for a column group. Process 710 is executed over processor 101 and is stored as program instructions within computer readable memory units of system 112 (FIG. 1). FIG. 8 is described with respect to an example data table 500 of FIG. 7 where n is 100. In this example, the age column 520 contains null data rows for 50 of the employees. Also in this example, there are only 20 distinct age entries in age column 520, so it is appreciated that the initial result cardinality (e.g., without consideration of null data values) for column 520 is 100/20 or 5.0.
With reference to FIG. 8, at step 715, the present invention receives column identifiers of a column group of a selected table. In effect, system 112 reads the system catalog space of a computer readable memory unit of system 112 to receive the column group identifiers on which to collect workload-based null statistics of the selected table. According to a preferred embodiment of the present invention, this information originates from an automatic analysis (e.g., process 410) performed by an optimizer of one embodiment of the present invention. Alternatively, this information can be user originated with respect to the first embodiment of the present invention. For instance, the present invention receives an indication that workload statistics are to be collected based on the age column 520 of data table 500 (FIG. 7). At step 720, the present invention accesses the corresponding row groups of the data table containing the identified column groups.
At step 730, the present invention counts the number of row groups with null data in any of the identified columns. In one embodiment, at step 730, system 112 examines the individual entries of the corresponding row groups to determine the number of row groups with null data in any of the identified column groups. For instance, at step 730, the present invention performs a search, or other equivalent operation, to determine the number of null rows within the identified column and records this value as a null data statistic for the identified column. This is done for each of the identified columns for the selected table. With reference to column 520 (FIG. 7), the present invention performs a search and determines that 50 of the rows of identified column 520 contain null data values.
At step 740 of FIG. 8, the present invention utilizes the number of null rows determined in step 730 and the number of table rows to determine the column null factor for the identified column. At step 740, system 112 computes the column null factor as equal to the number of row groups with null data divided by the number of table rows. In one embodiment, this is computed for all identified columns of the selected table. For instance, with respect to column 520, the present invention divides the number of null rows (50) by the total number of table rows (100) of column 520 to compute a column null factor of 0.5.
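Steps 730 and 740 can be sketched in Python as follows. This is a minimal illustration only: representing the table as a list of dictionaries, using None for null data, and the name column_null_factor are all assumptions made for the sketch and do not describe the actual implementation of process 710.

```python
def column_null_factor(rows, columns):
    """Fraction of table rows having null data in any of the identified columns.

    rows    -- list of dicts mapping column name to value (None stands for null)
    columns -- names of the columns that make up the column group
    """
    if not rows:
        raise ValueError("the table must contain at least one row")
    # Step 730: count the rows with null data in any identified column.
    null_rows = sum(1 for row in rows if any(row[c] is None for c in columns))
    # Step 740: divide by the number of table rows.
    return null_rows / len(rows)


# Hypothetical table mirroring the example: 100 rows, 50 with a null age,
# and 20 distinct non-null ages.
TABLE = [{"age": None} for _ in range(50)] + [{"age": 20 + i % 20} for i in range(50)]

null_factor = column_null_factor(TABLE, ["age"])
print(null_factor)
```

For the hypothetical table above the function returns 0.5, matching the column null factor computed for column 520. For a multi-column group, a row counts as null when any one of the listed columns is null.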
In the example, the age column 520 contains null data rows for 50 of the employees out of a total of 100. Also, in the example, there are only 20 distinct values in the age column 520. It is appreciated that without the null statistic, the initial result cardinality for an equi-selection on column 520 is 100/20 or 5.0. By collecting the null statistic (column null factor), the present invention gives a much more accurate final result cardinality by multiplying the initial result cardinality by one minus the column null factor. This is shown below:

final result cardinality = initial result cardinality * (1 - null factor)
                         = 5.0 * (1 - 0.5)
                         = 2.5
The final result cardinality gives a much more accurate estimate of the average number of rows of column 520 that are likely to have any particular age value, considering rows with null values within column 520.
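The null adjustment can be expressed as a small Python sketch. The function names are hypothetical and chosen only for illustration; the arithmetic follows the formula final result cardinality = initial result cardinality * (1 - null factor) stated in the text.

```python
def initial_result_cardinality(table_rows, distinct_values):
    """Average rows per distinct value, ignoring the presence of null data."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    return table_rows / distinct_values


def final_result_cardinality(initial, null_factor):
    """Scale the initial estimate by the fraction of rows with non-null data."""
    if not 0.0 <= null_factor <= 1.0:
        raise ValueError("null_factor must lie between 0 and 1")
    return initial * (1.0 - null_factor)


initial = initial_result_cardinality(100, 20)    # 100 rows, 20 distinct ages
final = final_result_cardinality(initial, 0.5)   # half of the rows are null
print(initial, final)
```

With the values of the example (100 rows, 20 distinct ages, null factor 0.5) the sketch gives 5.0 before the adjustment and 2.5 after it.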
At step 750 of FIG. 8, the third embodiment of the present invention stores the determined column null factor in a computer readable memory unit as a workload-based null statistic for the identified column group of the selected table. For example, the null factor of 0.5 is then recorded for column 520. By separating out the rows with non-null data, the present invention gives a more accurate final result cardinality which indicates that, on average, 50/20 or 2.5 employees have the same age in our current example. On average, a query of column 520 for a particular age would result in 2.5 rows identified, not 5.0 as indicated by the initial result cardinality that does not take into consideration the presence of null data rows.
EXEMPLARY USE OF WORKLOAD STATISTICS
The following material illustrates the manner in which the workload statistics (e.g., column duplicity factor and column null factor) are used to compute the result cardinality of the following relational operations: 1) equi-selection, 2) equi-join, and 3) projection.
In this example, two data tables exist, each with one column group. The column group in the first table is referred to as CG1 and the column group in the second table as CG2. Herein, the column duplicity factor and column null factor based on CG1 are denoted as CDF1 and CNF1, and the column duplicity factor and column null factor based on CG2 are denoted as CDF2 and CNF2. Also, the cardinality of the first table (i.e., number of rows) is denoted as TC1, and the cardinality of the second table is denoted as TC2.
If equi-selection predicates are specified on all columns in CG1 then the result cardinality of an equi-selection operation on the first table is estimated as equal to CDF1. Note that CDF1 represents the average number of duplicates per distinct value of CG1. An equi-selection over CG1 is nothing but selecting a distinct value of CG1, and therefore, the result cardinality of such an operation is expected to be equal to CDF1.
Similarly, if equi-selection predicates are specified on all columns in CG2 then the result cardinality of an equi-selection operation on the second table is estimated as equal to CDF2.
If a projection is specified on all columns in CG1 then the result cardinality of a projection operation on the first table is estimated as equal to TC1*(1-CNF1)/CDF1. Note that TC1*(1-CNF1)/CDF1 represents the distinct cardinality of CG1. A projection over CG1 is nothing but selecting all distinct values in CG1, and therefore, the result cardinality of such an operation is expected to be equal to TC1*(1-CNF1)/CDF1.
Similarly, if a projection is specified on all columns in CG2 then the result cardinality of a projection operation on the second table is estimated as equal to TC2*(1-CNF2)/CDF2.
If equi-join predicates are specified between all columns in CG1 and CG2 then the result cardinality of an equi-join operation between the first and second tables is estimated as follows: The quantity TC1*(1-CNF1)/CDF1 is denoted as DC1, and similarly the quantity TC2*(1-CNF2)/CDF2 is denoted as DC2. Herein DC1 and DC2 represent, respectively, the distinct cardinality of CG1 and CG2. The result cardinality of the equi-join operation based on CG1 and CG2 is estimated as equal to MIN (DC1, DC2)*CDF1*CDF2. Herein MIN is a function that chooses the minimum value between DC1 and DC2.
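The three estimates described above can be collected into one Python sketch. The class name ColumnGroupStats, its field names, and the sample values are assumptions introduced for illustration; only the formulas (CDF for equi-selection, TC*(1-CNF)/CDF for projection, and MIN(DC1, DC2)*CDF1*CDF2 for equi-join) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnGroupStats:
    """Workload statistics for one column group (hypothetical container)."""
    table_cardinality: int    # TC: number of rows in the table
    duplicity_factor: float   # CDF: average duplicates per distinct value
    null_factor: float        # CNF: fraction of rows with null data

    def distinct_cardinality(self) -> float:
        """DC = TC * (1 - CNF) / CDF."""
        return self.table_cardinality * (1 - self.null_factor) / self.duplicity_factor


def equi_selection_cardinality(cg: ColumnGroupStats) -> float:
    """Selecting one distinct value is expected to return CDF rows."""
    return cg.duplicity_factor


def projection_cardinality(cg: ColumnGroupStats) -> float:
    """Projecting the column group is expected to return its distinct cardinality."""
    return cg.distinct_cardinality()


def equi_join_cardinality(cg1: ColumnGroupStats, cg2: ColumnGroupStats) -> float:
    """MIN(DC1, DC2) * CDF1 * CDF2."""
    matching = min(cg1.distinct_cardinality(), cg2.distinct_cardinality())
    return matching * cg1.duplicity_factor * cg2.duplicity_factor


# Illustrative values only: CG1 reuses the age example (TC=100, CNF=0.5, CDF=2.5);
# CG2 is an invented second table.
cg1 = ColumnGroupStats(table_cardinality=100, duplicity_factor=2.5, null_factor=0.5)
cg2 = ColumnGroupStats(table_cardinality=40, duplicity_factor=2.0, null_factor=0.0)

print(equi_selection_cardinality(cg1))
print(projection_cardinality(cg1))
print(equi_join_cardinality(cg1, cg2))
```

Keeping the distinct cardinality as a derived quantity rather than a stored field mirrors the text, where DC1 and DC2 are computed from TC, CNF and CDF instead of being collected separately.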
CONCLUSION
The embodiments of the present invention, a method for automatic identification of column groups based on a query workload by the RDBMS optimizer, a method for collecting multi-column non-null statistics on identified column groups, and a method for collecting multi-column null statistics on identified column groups, are thus described. While the present invention has been described in particular embodiments, it should be appreciated that the present invention should not be construed as limited by such embodiments, but rather construed according to the below claims.

Claims (19)

What is claimed is:
1. In a relational database management system having a processor coupled to a bus and a computer readable memory unit coupled to said bus, a computer implemented method for determining a workload statistic, said method comprising the steps of:
a) accessing a predetermined portion of said computer readable memory unit to determine an identified first column of data within a data table;
b) determining the number of rows of said first column that contain null data;
c) determining a first workload statistic associated with said first column, wherein said workload statistic is compensated for contribution of said rows of said first column that contain null data; and
d) storing said first workload statistic determined in step c) in a predetermined portion of said computer readable memory unit for use by an optimizer of said relational database management system.
2. A method as described in claim 1 further comprising the step of e) performing a cost analysis of executing a query according to a number of different strategies to determine a least cost execution strategy, said query involving said first column, wherein said step e) is performed by said optimizer which accesses and uses said workload statistic determined in step c) in performing said cost analysis.
3. A method as described in claim 1 wherein said first workload statistic is a column null factor of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column; and
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column.
4. A method as described in claim 1 wherein said first workload statistic is a result cardinality of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column;
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column; and
c4) determining said result cardinality based on an initial cardinality of said first column and said column null factor.
5. A method as described in claim 1 wherein said predetermined portion of said computer readable memory unit comprises a system catalog of said relational database management system.
6. In a relational database management system having a processor coupled to a bus and a computer readable memory unit coupled to said bus, a computer implemented method for identifying data for which workload statistics are to be collected, said method comprising the steps of:
a) receiving a query to be processed by said relational database management system, said query having an associated data table;
b) responsive to step a), invoking an optimizer within said relational database management system to automatically identify sets of columns and column groups associated with said data table of said query;
c) said optimizer registering said set of columns and column groups within a predetermined portion of said computer readable memory unit for use by statistics generation processes of said relational database management system; and
d) accessing said predetermined portion of said computer readable memory unit and collecting workload statistics on said set of columns and column groups registered by said step c).
7. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection predicates of WHERE clauses in said query.
8. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection and/or equi-join predicates of HAVING clauses in said query.
9. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of GROUP BY clauses in said query.
10. A method as described in claim 6 wherein said step b) comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of DISTINCT clauses in said query.
11. A computer system comprising:
a bus;
a processor coupled to a bus; and
a computer readable memory unit coupled to said bus, said computer readable memory having stored therein instructions that when executed implement a method of determining a workload statistic, said method comprising the steps of:
a) accessing a predetermined portion of said computer readable memory unit to determine an identified first column of data within a data table;
b) determining the number of rows of said first column that contain null data;
c) determining a first workload statistic associated with said first column, wherein said workload statistic is compensated for contribution of said rows of said first column that contain null data;
d) storing said first workload statistic determined in step c) in a predetermined portion of said computer readable memory unit for use by an optimizer of said relational database management system; and
e) performing a cost analysis of executing a query according to a number of different strategies to determine a least cost execution strategy, said query involving said first column, wherein said step e) is performed by said optimizer which accesses and uses said workload statistic determined in step c) in performing said cost analysis.
12. A method as described in claim 11 wherein said first workload statistic is a column null factor of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column; and
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column.
13. A method as described in claim 11 wherein said first workload statistic is a result cardinality of said first column and wherein said step c) comprises the steps of:
c1) determining the number of rows of said first column that contain null data;
c2) determining the total number of rows of said first column;
c3) dividing the number of rows of said first column that contain null data by the total number of rows of said first column to generate said column null factor for said first column; and
c4) determining said result cardinality based on an initial cardinality of said first column and said column null factor.
14. A method as described in claim 11 wherein said predetermined portion of said computer readable memory unit comprises a system catalog of said relational database management system.
15. A computer system comprising:
a bus;
a processor coupled to said bus; and
a computer readable memory unit coupled to said bus, said computer readable memory unit having stored therein instructions that when executed implement a method of identifying data for which workload statistics are to be collected, said method comprising the steps of:
a) receiving a query to be processed by said relational database management system, said query having an associated data table;
b) responsive to step a), invoking an optimizer within said relational database management system to automatically identify sets of columns and column groups associated with said data table of said query;
c) said optimizer registering said set of columns and column groups within a predetermined portion of said computer readable memory unit for use by statistics generation processes of said relational database management system; and
d) accessing said predetermined portion of said computer readable memory unit and collecting workload statistics on said set of columns and column groups registered by said step c).
16. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection predicates of WHERE clauses in said query.
17. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within equi-selection and/or equi-join predicates of HAVING clauses in said query.
18. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of GROUP BY clauses in said query.
19. A computer system as described in claim 15 wherein said step b) of said method comprises the step of identifying said set of columns and column groups from column identifications of said data table, said column identifications located within projection lists of DISTINCT clauses in said query.
US09/164,400 1997-02-10 1998-09-30 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer Expired - Lifetime US6029163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/164,400 US6029163A (en) 1997-02-10 1998-09-30 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/796,779 US5899986A (en) 1997-02-10 1997-02-10 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer
US09/164,400 US6029163A (en) 1997-02-10 1998-09-30 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US08/796,779 Continuation US5899986A (en) 1997-02-10 1997-02-10 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer

Publications (1)

Publication Number Publication Date
US6029163A true US6029163A (en) 2000-02-22

Family

ID=25169044

Family Applications (2)

Application Number Title Priority Date Filing Date
US08/796,779 Expired - Lifetime US5899986A (en) 1997-02-10 1997-02-10 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer
US09/164,400 Expired - Lifetime US6029163A (en) 1997-02-10 1998-09-30 Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer

Country Status (1)

Country Link
US (2) US5899986A (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5995957A (en) 1997-02-28 1999-11-30 International Business Machines Corporation Query optimization through the use of multi-column statistics to avoid the problems of column correlation
US6044370A (en) * 1998-01-26 2000-03-28 Telenor As Database management system and method for combining meta-data of varying degrees of reliability
US6553369B1 (en) * 1999-03-11 2003-04-22 Oracle Corporation Approach for performing administrative functions in information systems
US7080062B1 (en) 1999-05-18 2006-07-18 International Business Machines Corporation Optimizing database queries using query execution plans derived from automatic summary table determining cost based queries
US6738755B1 (en) * 1999-05-19 2004-05-18 International Business Machines Corporation Query optimization method for incrementally estimating the cardinality of a derived relation when statistically correlated predicates are applied
US7890491B1 (en) * 1999-12-22 2011-02-15 International Business Machines Corporation Query optimization technique for obtaining improved cardinality estimates using statistics on automatic summary tables
CA2307155A1 (en) * 2000-04-28 2001-10-28 Ibm Canada Limited-Ibm Canada Limitee Execution of database queries including filtering
EP1346289A1 (en) * 2000-11-30 2003-09-24 Appfluent Technology, Inc. System and method for delivering dynamic content
US7483979B1 (en) 2001-01-16 2009-01-27 International Business Machines Corporation Method and system for virtualizing metadata between disparate systems
US6895412B1 (en) 2001-04-12 2005-05-17 Ncr Corporation Methods for dynamically configuring the cardinality of keyword attributes
US7231460B2 (en) * 2001-06-04 2007-06-12 Gateway Inc. System and method for leveraging networked computers to view windows based files on Linux platforms
US7177856B1 (en) 2001-10-03 2007-02-13 International Business Machines Corporation Method for correlating data from external databases
US6801903B2 (en) * 2001-10-12 2004-10-05 Ncr Corporation Collecting statistics in a database system
US6934706B1 (en) 2002-03-22 2005-08-23 International Business Machines Corporation Centralized mapping of security credentials for database access operations
US20040002956A1 (en) * 2002-06-28 2004-01-01 Microsoft Corporation Approximate query processing using multiple samples
US7349949B1 (en) 2002-12-26 2008-03-25 International Business Machines Corporation System and method for facilitating development of a customizable portlet
US7359982B1 (en) 2002-12-26 2008-04-15 International Business Machines Corporation System and method for facilitating access to content information
US20040143581A1 (en) * 2003-01-15 2004-07-22 Bohannon Philip L. Cost-based storage of extensible markup language (XML) data
US7647280B1 (en) * 2003-12-08 2010-01-12 Teradata Us, Inc. Closed-loop estimation of request costs
EP1548612A1 (en) * 2003-12-22 2005-06-29 Software Engineering GmbH Method and device for providing column statistics for data within a relational database
US7376638B2 (en) * 2003-12-24 2008-05-20 International Business Machines Corporation System and method for addressing inefficient query processing
US7373354B2 (en) * 2004-02-26 2008-05-13 Sap Ag Automatic elimination of functional dependencies between columns
US7302422B2 (en) * 2004-04-14 2007-11-27 International Business Machines Corporation Query workload statistics collection in a database management system
US7814072B2 (en) * 2004-12-30 2010-10-12 International Business Machines Corporation Management of database statistics
US7571168B2 (en) * 2005-07-25 2009-08-04 Parascale, Inc. Asynchronous file replication and migration in a storage network
US20070055658A1 (en) * 2005-09-08 2007-03-08 International Business Machines Corporation Efficient access control enforcement in a content management environment
US7840555B2 (en) * 2005-09-20 2010-11-23 Teradata Us, Inc. System and a method for identifying a selection of index candidates for a database
US8200659B2 (en) * 2005-10-07 2012-06-12 Bez Systems, Inc. Method of incorporating DBMS wizards with analytical models for DBMS servers performance optimization
US8732138B2 (en) * 2005-12-21 2014-05-20 Sap Ag Determination of database statistics using application logic
US9805077B2 (en) * 2008-02-19 2017-10-31 International Business Machines Corporation Method and system for optimizing data access in a database using multi-class objects
KR101621490B1 (en) 2014-08-07 2016-05-17 (주)그루터 Query execution apparatus and method, and system for processing data employing the same
US10204135B2 (en) 2015-07-29 2019-02-12 Oracle International Corporation Materializing expressions within in-memory virtual column units to accelerate analytic queries
US10372706B2 (en) 2015-07-29 2019-08-06 Oracle International Corporation Tracking and maintaining expression statistics across database queries
US10713244B2 (en) * 2016-05-09 2020-07-14 Sap Se Calculation engine optimizations for join operations utilizing automatic detection of forced constraints
US10380112B2 (en) 2017-07-31 2019-08-13 International Business Machines Corporation Joining two data tables on a join attribute
US11182437B2 (en) * 2017-10-26 2021-11-23 International Business Machines Corporation Hybrid processing of disjunctive and conjunctive conditions of a search query for a similarity search
CN110109910A (en) * 2018-01-08 2019-08-09 广东神马搜索科技有限公司 Data processing method and system, electronic equipment and computer readable storage medium
US10990597B2 (en) * 2018-05-03 2021-04-27 Sap Se Generic analytical application integration based on an analytic integration remote services plug-in
US11226955B2 (en) 2018-06-28 2022-01-18 Oracle International Corporation Techniques for enabling and integrating in-memory semi-structured data and text document searches with in-memory columnar query processing
US11868346B2 (en) 2020-12-30 2024-01-09 Oracle International Corporation Automated linear clustering recommendation for database zone maps
US11907263B1 (en) 2022-10-11 2024-02-20 Oracle International Corporation Automated interleaved clustering recommendation for database zone maps

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598559A (en) * 1994-07-01 1997-01-28 Hewlett-Packard Company Method and apparatus for optimizing queries having group-by operators

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Performing Group-By before Join," Yan et al., Proceedings of the 10th International Conference on Data Engineering, Houston, Texas, USA, pp. 89-100, Feb. 14-18, 1994.
"Performing Group-By before Join," Yan et al., Proceedings of the 10th International Conference on Data Engineering, Houston, Texas, USA, pp. 89-100, Feb. 14-18, 1994. *

Cited By (149)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6353826B1 (en) * 1997-10-23 2002-03-05 Sybase, Inc. Database system with methodology providing improved cost estimates for query strategies
US6173278B1 (en) * 1997-11-03 2001-01-09 Newframe Corporation Ltd. Method of and special purpose computer for utilizing an index of a relational data base table
US20050065911A1 (en) * 1998-12-16 2005-03-24 Microsoft Corporation Automatic database statistics creation
US7624094B2 (en) * 1998-12-16 2009-11-24 Microsoft Corporation Automatic database statistics creation
US6397204B1 (en) * 1999-06-25 2002-05-28 International Business Machines Corporation Method, system, and program for determining the join ordering of tables in a join query
US6446063B1 (en) * 1999-06-25 2002-09-03 International Business Machines Corporation Method, system, and program for performing a join operation on a multi column table and satellite tables
US6374235B1 (en) * 1999-06-25 2002-04-16 International Business Machines Corporation Method, system, and program for a join operation on a multi-column table and satellite tables including duplicate values
US6598038B1 (en) * 1999-09-17 2003-07-22 Oracle International Corporation Workload reduction mechanism for index tuning
US6438538B1 (en) * 1999-10-07 2002-08-20 International Business Machines Corporation Data replication in data warehousing scenarios
US6496877B1 (en) * 2000-01-28 2002-12-17 International Business Machines Corporation Method and apparatus for scheduling data accesses for random access storage devices with shortest access chain scheduling algorithm
US6266658B1 (en) * 2000-04-20 2001-07-24 Microsoft Corporation Index tuner for given workload
US20050240556A1 (en) * 2000-06-30 2005-10-27 Microsoft Corporation Partial pre-aggregation in relational database queries
US7555473B2 (en) * 2000-06-30 2009-06-30 Microsoft Corporation Partial pre-aggregation in relational database queries
US7593926B2 (en) * 2000-06-30 2009-09-22 Microsoft Corporation Partial pre-aggregation in relational database queries
US20050240577A1 (en) * 2000-06-30 2005-10-27 Microsoft Corporation Partial pre-aggregation in relational database queries
US6732110B2 (en) * 2000-08-28 2004-05-04 International Business Machines Corporation Estimation of column cardinality in a partitioned relational database
US20020059213A1 (en) * 2000-10-25 2002-05-16 Kenji Soga Minimum cost path search apparatus and minimum cost path search method used by the apparatus
US6609126B1 (en) 2000-11-15 2003-08-19 Appfluent Technology, Inc. System and method for routing database requests to a database and a cache
US7287020B2 (en) * 2001-01-12 2007-10-23 Microsoft Corporation Sampling for queries
US20020123979A1 (en) * 2001-01-12 2002-09-05 Microsoft Corporation Sampling for queries
US20020107835A1 (en) * 2001-02-08 2002-08-08 Coram Michael T. System and method for adaptive result set caching
US20050216304A1 (en) * 2001-06-08 2005-09-29 W.W. Grainger, Inc. System and method for electronically creating a customized catalog
US7266516B2 (en) 2001-06-08 2007-09-04 W. W. Grainger Inc. System and method for creating a customized electronic catalog
US20030083959A1 (en) * 2001-06-08 2003-05-01 Jinshan Song System and method for creating a customized electronic catalog
US9230256B2 (en) 2001-06-08 2016-01-05 W. W. Grainger, Inc. System and method for electronically creating a customized catalog
US7254582B2 (en) 2001-06-08 2007-08-07 W.W. Grainger, Inc. System and method for creating a searchable electronic catalog
US20030191769A1 (en) * 2001-09-28 2003-10-09 International Business Machines Corporation Method, system, and program for generating a program capable of invoking a flow of operations
US8914807B2 (en) 2001-09-28 2014-12-16 International Business Machines Corporation Method, system, and program for generating a program capable of invoking a flow of operations
US8924408B2 (en) 2001-09-28 2014-12-30 International Business Machines Corporation Automatic generation of database invocation mechanism for external web services
US8166006B2 (en) * 2001-09-28 2012-04-24 International Business Machines Corporation Invocation of web services from a database
US20030093436A1 (en) * 2001-09-28 2003-05-15 International Business Machines Corporation Invocation of web services from a database
US20040199636A1 (en) * 2001-09-28 2004-10-07 International Business Machines Corporation Automatic generation of database invocation mechanism for external web services
US20030084025A1 (en) * 2001-10-18 2003-05-01 Zuzarte Calisto Paul Method of cardinality estimation using statistical soft constraints
US7171408B2 (en) 2001-10-18 2007-01-30 International Business Machines Corporation Method of cardinality estimation using statistical soft constraints
US20080052269A1 (en) * 2001-12-13 2008-02-28 International Business Machines Corporation Estimation and use of access plan statistics
US8140568B2 (en) 2001-12-13 2012-03-20 International Business Machines Corporation Estimation and use of access plan statistics
US20030115183A1 (en) * 2001-12-13 2003-06-19 International Business Machines Corporation Estimation and use of access plan statistics
US7010516B2 (en) * 2001-12-19 2006-03-07 Hewlett-Packard Development Company, L.P. Method and system for rowcount estimation with multi-column statistics and histograms
US20030135485A1 (en) * 2001-12-19 2003-07-17 Leslie Harry Anthony Method and system for rowcount estimation with multi-column statistics and histograms
US6985904B1 (en) 2002-02-28 2006-01-10 Oracle International Corporation Systems and methods for sharing of execution plans for similar database statements
US7092931B1 (en) 2002-05-10 2006-08-15 Oracle Corporation Methods and systems for database statement execution plan optimization
US7155459B2 (en) * 2002-06-28 2006-12-26 Microsoft Corporation Time-bound database tuning
US20040003004A1 (en) * 2002-06-28 2004-01-01 Microsoft Corporation Time-bound database tuning
US20070083515A1 (en) * 2002-12-20 2007-04-12 International Business Machines Corporation System and method for multicolumn sorting in a single column
US7552108B2 (en) * 2002-12-20 2009-06-23 International Business Machines Corporation System and method for multicolumn sorting in a single column
US20040128299A1 (en) * 2002-12-26 2004-07-01 Michael Skopec Low-latency method to replace SQL insert for bulk data transfer to relational database
US7305410B2 (en) 2002-12-26 2007-12-04 Rocket Software, Inc. Low-latency method to replace SQL insert for bulk data transfer to relational database
US20040210563A1 (en) * 2003-04-21 2004-10-21 Oracle International Corporation Method and system of collecting execution statistics of query statements
US7447676B2 (en) 2003-04-21 2008-11-04 Oracle International Corporation Method and system of collecting execution statistics of query statements
US7146363B2 (en) * 2003-05-20 2006-12-05 Microsoft Corporation System and method for cardinality estimation based on query execution feedback
US20040236722A1 (en) * 2003-05-20 2004-11-25 Microsoft Corporation System and method for cardinality estimation based on query execution feedback
US20050021287A1 (en) * 2003-06-25 2005-01-27 International Business Machines Corporation Computing frequent value statistics in a partitioned relational database
US7542975B2 (en) * 2003-06-25 2009-06-02 International Business Machines Corporation Computing frequent value statistics in a partitioned relational database
US20050004907A1 (en) * 2003-06-27 2005-01-06 Microsoft Corporation Method and apparatus for using conditional selectivity as foundation for exploiting statistics on query expressions
US7249120B2 (en) * 2003-06-27 2007-07-24 Microsoft Corporation Method and apparatus for selecting candidate statistics to estimate the selectivity value of the conditional selectivity expression in optimize queries based on a set of predicates that each reference a set of relational database tables
US20050086242A1 (en) * 2003-09-04 2005-04-21 Oracle International Corporation Automatic workload repository battery of performance statistics
US20050086195A1 (en) * 2003-09-04 2005-04-21 Leng Leng Tan Self-managing database architecture
US20050086246A1 (en) * 2003-09-04 2005-04-21 Oracle International Corporation Database performance baselines
US7664798B2 (en) 2003-09-04 2010-02-16 Oracle International Corporation Database performance baselines
US7526508B2 (en) 2003-09-04 2009-04-28 Oracle International Corporation Self-managing database architecture
US7603340B2 (en) * 2003-09-04 2009-10-13 Oracle International Corporation Automatic workload repository battery of performance statistics
US7984024B2 (en) 2004-01-07 2011-07-19 International Business Machines Corporation Statistics management
US20090030875A1 (en) * 2004-01-07 2009-01-29 International Business Machines Corporation Statistics management
US20050216490A1 (en) * 2004-03-26 2005-09-29 Oracle International Corporation Automatic database diagnostic usage models
US8024301B2 (en) * 2004-03-26 2011-09-20 Oracle International Corporation Automatic database diagnostic usage models
US20090299989A1 (en) * 2004-07-02 2009-12-03 Oracle International Corporation Determining predicate selectivity in query costing
US9244979B2 (en) 2004-07-02 2016-01-26 Oracle International Corporation Determining predicate selectivity in query costing
US20060100992A1 (en) * 2004-10-21 2006-05-11 International Business Machines Corporation Apparatus and method for data ordering for derived columns in a database system
US20080215537A1 (en) * 2004-10-21 2008-09-04 International Business Machines Corporation Data ordering for derived columns in a database system
US20080215538A1 (en) * 2004-10-21 2008-09-04 International Business Machines Corporation Data ordering for derived columns in a database system
US20080215539A1 (en) * 2004-10-21 2008-09-04 International Business Machines Corporation Data ordering for derived columns in a database system
US20080133454A1 (en) * 2004-10-29 2008-06-05 International Business Machines Corporation System and method for updating database statistics according to query feedback
US7831592B2 (en) 2004-10-29 2010-11-09 International Business Machines Corporation System and method for updating database statistics according to query feedback
US7739293B2 (en) 2004-11-22 2010-06-15 International Business Machines Corporation Method, system, and program for collecting statistics of data stored in a database
US20060112093A1 (en) * 2004-11-22 2006-05-25 International Business Machines Corporation Method, system, and program for collecting statistics of data stored in a database
US20060116984A1 (en) * 2004-11-29 2006-06-01 Thomas Zurek Materialized samples for a business warehouse query
US7610272B2 (en) * 2004-11-29 2009-10-27 Sap Ag Materialized samples for a business warehouse query
US8032514B2 (en) * 2005-01-14 2011-10-04 International Business Machines Corporation SQL distinct optimization in a computer database system
US20060161515A1 (en) * 2005-01-14 2006-07-20 International Business Machines Corporation Apparatus and method for SQL distinct optimization in a computer database system
US7447681B2 (en) 2005-02-17 2008-11-04 International Business Machines Corporation Method, system and program for selection of database characteristics
US7610264B2 (en) 2005-02-28 2009-10-27 International Business Machines Corporation Method and system for providing a learning optimizer for federated database systems
US20060195416A1 (en) * 2005-02-28 2006-08-31 Ewen Stephan E Method and system for providing a learning optimizer for federated database systems
US7941332B2 (en) 2006-01-30 2011-05-10 International Business Machines Corporation Apparatus, system, and method for modeling, projecting, and optimizing an enterprise application system
US7725461B2 (en) * 2006-03-14 2010-05-25 International Business Machines Corporation Management of statistical views in a database system
EP2005332A4 (en) * 2006-03-14 2011-03-09 Ibm Management of statistical views in a database system
JP2009529735A (en) * 2006-03-14 2009-08-20 インターナショナル・ビジネス・マシーンズ・コーポレーション Managing statistical views in a database system
CN101405727B (en) * 2006-03-14 2011-11-30 国际商业机器公司 Management of statistical views in a database system
EP2005332A1 (en) * 2006-03-14 2008-12-24 International Business Machines Corporation Management of statistical views in a database system
US20070220058A1 (en) * 2006-03-14 2007-09-20 Mokhtar Kandil Management of statistical views in a database system
US8051058B2 (en) * 2006-04-03 2011-11-01 International Business Machines Corporation System for estimating cardinality in a database system
US20090012977A1 (en) * 2006-04-03 2009-01-08 International Business Machines Corporation System for estimating cardinality in a database system
US20070271218A1 (en) * 2006-05-16 2007-11-22 International Business Machines Corporation Statistics collection using path-value pairs for relational databases
US7472108B2 (en) * 2006-05-16 2008-12-30 International Business Machines Corporation Statistics collection using path-value pairs for relational databases
US20100011030A1 (en) * 2006-05-16 2010-01-14 International Business Machines Corp. Statistics collection using path-identifiers for relational databases
US8229924B2 (en) * 2006-05-16 2012-07-24 International Business Machines Corporation Statistics collection using path-identifiers for relational databases
US9117005B2 (en) 2006-05-16 2015-08-25 International Business Machines Corporation Statistics collection using path-value pairs for relational databases
US10007686B2 (en) * 2006-08-02 2018-06-26 Entit Software Llc Automatic vertical-database design
US8086598B1 (en) 2006-08-02 2011-12-27 Hewlett-Packard Development Company, L.P. Query optimizer with schema conversion
US8671091B2 (en) 2006-08-02 2014-03-11 Hewlett-Packard Development Company, L.P. Optimizing snowflake schema queries
US20080033914A1 (en) * 2006-08-02 2008-02-07 Mitch Cherniack Query Optimizer
US20080040348A1 (en) * 2006-08-02 2008-02-14 Shilpa Lawande Automatic Vertical-Database Design
US20080052266A1 (en) * 2006-08-25 2008-02-28 Microsoft Corporation Optimizing parameterized queries in a relational database management system
US8229918B2 (en) * 2006-08-25 2012-07-24 Teradata Us, Inc. Hardware accelerated reconfigurable processor for accelerating database operations and queries
US20110218987A1 (en) * 2006-08-25 2011-09-08 Teradata Us, Inc. Hardware accelerated reconfigurable processor for accelerating database operations and queries
US8032522B2 (en) 2006-08-25 2011-10-04 Microsoft Corporation Optimizing parameterized queries in a relational database management system
US20090106210A1 (en) * 2006-09-18 2009-04-23 Infobright, Inc. Methods and systems for database organization
US8266147B2 (en) 2006-09-18 2012-09-11 Infobright, Inc. Methods and systems for database organization
US8838593B2 (en) * 2006-09-18 2014-09-16 Infobright Inc. Method and system for storing, organizing and processing data in a relational database
US8700579B2 (en) 2006-09-18 2014-04-15 Infobright Inc. Method and system for data compression in a relational database
US20080071818A1 (en) * 2006-09-18 2008-03-20 Infobright Inc. Method and system for data compression in a relational database
US20080071748A1 (en) * 2006-09-18 2008-03-20 Infobright Inc. Method and system for storing, organizing and processing data in a relational database
US7877374B2 (en) 2006-12-01 2011-01-25 Microsoft Corporation Statistics adjustment to improve query execution plans
US20080133458A1 (en) * 2006-12-01 2008-06-05 Microsoft Corporation Statistics adjustment to improve query execution plans
US20090077054A1 (en) * 2007-09-13 2009-03-19 Brian Robert Muras Cardinality Statistic for Optimizing Database Queries with Aggregation Functions
US9710353B2 (en) 2007-10-19 2017-07-18 Oracle International Corporation Creating composite baselines based on a plurality of different baselines
US8990811B2 (en) 2007-10-19 2015-03-24 Oracle International Corporation Future-based performance baselines
US20090106756A1 (en) * 2007-10-19 2009-04-23 Oracle International Corporation Automatic Workload Repository Performance Baselines
US8078652B2 (en) * 2007-12-06 2011-12-13 Oracle International Corporation Virtual columns
US20090150366A1 (en) * 2007-12-06 2009-06-11 Oracle International Corporation Expression replacement in virtual columns
US8046352B2 (en) * 2007-12-06 2011-10-25 Oracle International Corporation Expression replacement in virtual columns
US20090150413A1 (en) * 2007-12-06 2009-06-11 Oracle International Corporation Virtual columns
US8620888B2 (en) 2007-12-06 2013-12-31 Oracle International Corporation Partitioning in virtual columns
US20100030728A1 (en) * 2008-07-29 2010-02-04 Oracle International Corporation Computing selectivities for group of columns and expressions
US20100114976A1 (en) * 2008-10-21 2010-05-06 Castellanos Maria G Method For Database Design
US20100223269A1 (en) * 2009-02-27 2010-09-02 International Business Machines Corporation System and method for an efficient query sort of a data stream with duplicate key values
US9235622B2 (en) * 2009-02-27 2016-01-12 International Business Machines Corporation System and method for an efficient query sort of a data stream with duplicate key values
US20110016157A1 (en) * 2009-07-14 2011-01-20 Vertica Systems, Inc. Database Storage Architecture
US8700674B2 (en) 2009-07-14 2014-04-15 Hewlett-Packard Development Company, L.P. Database storage architecture
US20110022581A1 (en) * 2009-07-27 2011-01-27 Rama Krishna Korlapati Derived statistics for query optimization
US20110213766A1 (en) * 2010-02-22 2011-09-01 Vertica Systems, Inc. Database designer
US8290931B2 (en) 2010-02-22 2012-10-16 Hewlett-Packard Development Company, L.P. Database designer
US20110258179A1 (en) * 2010-04-19 2011-10-20 Salesforce.Com Methods and systems for optimizing queries in a multi-tenant store
US8447754B2 (en) * 2010-04-19 2013-05-21 Salesforce.Com, Inc. Methods and systems for optimizing queries in a multi-tenant store
US9507822B2 (en) 2010-04-19 2016-11-29 Salesforce.Com, Inc. Methods and systems for optimizing queries in a database system
US10162851B2 (en) 2010-04-19 2018-12-25 Salesforce.Com, Inc. Methods and systems for performing cross store joins in a multi-tenant store
US10649995B2 (en) 2010-04-19 2020-05-12 Salesforce.Com, Inc. Methods and systems for optimizing queries in a multi-tenant store
US10417611B2 (en) 2010-05-18 2019-09-17 Salesforce.Com, Inc. Methods and systems for providing multiple column custom indexes in a multi-tenant database environment
US8943100B2 (en) 2010-06-14 2015-01-27 Infobright Inc. System and method for storing data in a relational database
US8521748B2 (en) 2010-06-14 2013-08-27 Infobright Inc. System and method for managing metadata in a relational database
US8417727B2 (en) 2010-06-14 2013-04-09 Infobright Inc. System and method for storing data in a relational database
US20130018890A1 (en) * 2011-07-13 2013-01-17 Salesforce.Com, Inc. Creating a custom index in a multi-tenant database environment
US10108648B2 (en) * 2011-07-13 2018-10-23 Salesforce.Com, Inc. Creating a custom index in a multi-tenant database environment
US20130166557A1 (en) * 2011-12-23 2013-06-27 Lars Fricke Unique value calculation in partitioned tables
US9697273B2 (en) 2011-12-23 2017-07-04 Sap Se Unique value calculation in partitioned table
US8880510B2 (en) * 2011-12-23 2014-11-04 Sap Se Unique value calculation in partitioned tables
US9665572B2 (en) * 2012-09-12 2017-05-30 Oracle International Corporation Optimal data representation and auxiliary structures for in-memory database query processing
US20140074819A1 (en) * 2012-09-12 2014-03-13 Oracle International Corporation Optimal Data Representation and Auxiliary Structures For In-Memory Database Query Processing
US11449481B2 (en) * 2017-12-08 2022-09-20 Alibaba Group Holding Limited Data storage and query method and device
US11947568B1 (en) * 2021-09-30 2024-04-02 Amazon Technologies, Inc. Working set ratio estimations of data items in a sliding time window for dynamically allocating computing resources for the data items

Also Published As

Publication number Publication date
US5899986A (en) 1999-05-04

Similar Documents

Publication Publication Date Title
US6029163A (en) Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer
US5761653A (en) Method for estimating cardinalities for query processing in a relational database management system
US7240044B2 (en) Query optimization by sub-plan memoization
US6185557B1 (en) Merge join process
US7213012B2 (en) Optimizer dynamic sampling
US6801903B2 (en) Collecting statistics in a database system
US5875445A (en) Performance-related estimation using pseudo-ranked trees
US7509311B2 (en) Use of statistics on views in query optimization
US7756889B2 (en) Partitioning of nested tables
US8078652B2 (en) Virtual columns
US8122046B2 (en) Method and apparatus for query rewrite with auxiliary attributes in query processing operations
US6138111A (en) Cardinality-based join ordering
EP3014488B1 (en) Incremental maintenance of range-partitioned statistics for query optimization
US7840555B2 (en) System and a method for identifying a selection of index candidates for a database
US7472108B2 (en) Statistics collection using path-value pairs for relational databases
US7542975B2 (en) Computing frequent value statistics in a partitioned relational database
US20030167275A1 (en) Computation of frequent data values
US6778976B2 (en) Selectivity estimation for processing SQL queries containing having clauses
US7617189B2 (en) Parallel query processing techniques for minus and intersect operators
US9117005B2 (en) Statistics collection using path-value pairs for relational databases
US6925463B2 (en) Method and system for query processing by combining indexes of multilevel granularity or composition
US7440936B2 (en) Method for determining an access mode to a dataset
US8229924B2 (en) Statistics collection using path-identifiers for relational databases
Kacharulal Green Query Optimization: Taming Query Optimization Overheads through Plan Recycling
Mittra Optimization of the External Level of a Database

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: ORACLE INTERNATIONAL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ORACLE CORPORATION;REEL/FRAME:014852/0946

Effective date: 20031113

FPAY Fee payment

Year of fee payment: 8

FPAY Fee payment

Year of fee payment: 12

CC Certificate of correction