CN108268515B - Selection method and device for dimension of aggregation table - Google Patents

Publication number
CN108268515B
Authority
CN
China
Prior art keywords
dimension
dimensions
current
dimension set
aggregation table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611263719.XA
Other languages
Chinese (zh)
Other versions
CN108268515A (en)
Inventor
洪超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201611263719.XA
Publication of CN108268515A
Application granted
Publication of CN108268515B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof

Abstract

The invention discloses a method and a device for selecting dimensions of an aggregation table. The method comprises: obtaining the dimensions in a query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries; counting the number of queries of each dimension in the query record; establishing a second dimension set, wherein the second dimension set is initially an empty set; and selecting target dimensions from the first dimension set according to the number of queries and adding them to the second dimension set until the number of dimensions in the second dimension set reaches a target number, wherein the expansion rate of the aggregation table is controllable when the number of dimensions in the second dimension set is equal to the target number and uncontrollable when that number exceeds the target number, and wherein the aggregation table is generated according to the dimensions in the second dimension set. The invention solves the technical problem that dimension selection for the aggregation table is not accurate enough.

Description

Selection method and device for dimension of aggregation table
Technical Field
The invention relates to the field of big data, in particular to a method and a device for selecting dimension of an aggregation table.
Background
An aggregation table is a table containing summary information of a fact data table, and is generated by aggregating the fact data table over a series of dimensions. Using aggregation tables is a relatively popular technique for reducing query response time. The technique calculates results in advance and stores them in a table, which reduces computation at query time and returns results to the user faster. Aggregation tables typically have fewer rows than fact data tables, so they can be processed faster. In the prior art, aggregation tables are created on the basis of manual, experience-based judgment, so the dimension selection is not accurate enough.
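As an illustration of the precomputation described above, the following is a minimal sketch of building an aggregation table from a fact data table. The function name `build_aggregation_table`, the sample rows, and the choice of summation as the aggregate are assumptions made for the example, not details taken from the invention.

```python
from collections import defaultdict

def build_aggregation_table(fact_rows, dimensions, measure):
    """Group fact rows by the given dimensions and sum one measure."""
    totals = defaultdict(int)
    for row in fact_rows:
        # The group key is the tuple of dimension values for this row.
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical fact data table: one row per sale.
fact_rows = [
    {"month": "2016-01", "region": "north", "sales": 10},
    {"month": "2016-01", "region": "south", "sales": 5},
    {"month": "2016-01", "region": "north", "sales": 7},
    {"month": "2016-02", "region": "north", "sales": 3},
]

# Precompute monthly sales once; later queries read two rows instead of four.
monthly = build_aggregation_table(fact_rows, ["month"], "sales")
print(monthly)  # {('2016-01',): 22, ('2016-02',): 3}
```

The aggregation table here has fewer rows than the fact data table, which is the property the background relies on.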
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for selecting dimensions of an aggregation table, which are used for at least solving the technical problem that the dimension selection of the aggregation table is not accurate enough.
According to an aspect of the embodiments of the present invention, there is provided a method for selecting dimensions of an aggregation table, including: obtaining dimensions in a query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries; counting the query times of each dimension in the query records; establishing a second dimension set, wherein the second dimension set is initially an empty set; selecting a target dimension from the first dimension set according to the query times, adding the target dimension into the second dimension set until the number of dimensions in the second dimension set reaches a target number, wherein when the number of dimensions in the second dimension set is equal to the target number, the percentage of row numbers of an aggregation table in the number of rows of a fact data table is smaller than or equal to a first threshold, and when the number of dimensions in the second dimension set exceeds the target number, the percentage of row numbers of the aggregation table in the number of rows of the fact data table is larger than the first threshold, and the aggregation table is generated according to the dimensions in the second dimension set.
Further, selecting a target dimension from the first dimension set to add to the second dimension set according to the number of queries comprises: sorting the dimensions by number of queries from high to low; in the order indicated by the sorting result, selecting the dimension with the most queries from the first dimension set as a current dimension and adding it to the second dimension set; after the current dimension is added to the second dimension set, generating a current aggregation table according to the dimensions in the second dimension set; judging, according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result; and after the judgment result is obtained, selecting the next dimension after the current dimension as the current dimension and adding it to the second dimension set, until all the dimensions in the first dimension set have been selected.
Further, judging whether to keep the current dimension in the second dimension set according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table, and obtaining a judgment result, includes: judging whether the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table is less than or equal to the first threshold; if the percentage is judged to be less than or equal to the first threshold, keeping the current dimension in the second dimension set; and if the percentage is judged to be greater than the first threshold, deleting the current dimension from the second dimension set.
Further, after determining that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold, and keeping the current dimension in the second dimension set, the method further includes: acquiring a correlation coefficient between the dimensions in the first dimension set and the current dimension, wherein the correlation coefficient is used for indicating the degree of correlation between the dimensions in the first dimension set and the current dimension; and selecting the dimension with the correlation coefficient larger than or equal to a second threshold value from the first dimension set, and adding the dimension into the second dimension set.
Further, before selecting a target dimension from the first dimension set to add to the second dimension set according to the number of queries, the method further comprises: obtaining unique values of all dimensions in the first dimension set; searching for dimensions of which the unique value is less than or equal to a third threshold value from the first dimension set; adding dimensions for which the unique value is less than or equal to a third threshold value to the second set of dimensions.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for selecting dimensions of an aggregation table, including: the first obtaining unit is used for obtaining dimensions in a query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries; the statistical unit is used for counting the query times of each dimension in the query record; the establishing unit is used for establishing a second dimension set, wherein the second dimension set is initially an empty set; a first selecting unit, configured to select a target dimension from the first dimension set according to the number of queries and add the target dimension to the second dimension set until the number of dimensions in the second dimension set reaches a target number, where when the number of dimensions in the second dimension set is equal to the target number, a percentage of a number of rows of an aggregation table to a number of rows of a fact data table is less than or equal to a first threshold, and when the number of dimensions in the second dimension set exceeds the target number, a percentage of a number of rows of the aggregation table to a number of rows of the fact data table is greater than the first threshold, where the aggregation table is generated according to the dimensions in the second dimension set.
Further, the first selecting unit includes: a sorting module, configured to sort the dimensions by number of queries from high to low; a first selection module, configured to select, in the order indicated by the sorting result, the dimension with the most queries from the first dimension set as a current dimension, and add it to the second dimension set; a generating module, configured to generate a current aggregation table according to the dimensions in the second dimension set after the current dimension is added to the second dimension set; a judging module, configured to judge, according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result; and a second selection module, configured to select, after the judgment result is obtained, the next dimension after the current dimension as the current dimension and add it to the second dimension set, until all the dimensions in the first dimension set have been selected.
Further, the judging module comprises: a first judgment submodule, configured to judge whether the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table is less than or equal to the first threshold; a retention submodule, configured to keep the current dimension in the second dimension set if the percentage is judged to be less than or equal to the first threshold; and a deleting submodule, configured to delete the current dimension from the second dimension set if the percentage is judged to be greater than the first threshold.
Further, the judging module further includes: an obtaining submodule, configured to obtain, after the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table is judged to be less than or equal to the first threshold and the current dimension is kept in the second dimension set, a correlation coefficient between each dimension in the first dimension set and the current dimension, where the correlation coefficient indicates the degree of correlation between that dimension and the current dimension; and a selection submodule, configured to select from the first dimension set the dimensions whose correlation coefficient is greater than or equal to a second threshold and add them to the second dimension set.
Further, the apparatus further comprises: a second obtaining unit, configured to obtain unique values of all dimensions in the first dimension set before a target dimension is selected from the first dimension set according to the number of queries and added to the second dimension set; a searching unit, configured to search for a dimension, in the first dimension set, for which the unique value is less than or equal to a third threshold; a second selecting unit, configured to select a dimension of which the unique value is less than or equal to a third threshold value to be added to the second dimension set.
In the embodiment of the invention, the dimensions in the query record are obtained to form a first dimension set, wherein the query record records the dimensions used by all historical queries; the number of queries of each dimension in the query record is counted; a second dimension set, initially empty, is established; and target dimensions are selected from the first dimension set according to the number of queries and added to the second dimension set. When the number of dimensions in the second dimension set is equal to the target number, the expansion rate of the aggregation table is controllable; when the number of dimensions exceeds the target number, the expansion rate is uncontrollable. By keeping the expansion rate of the generated aggregation table controllable while selecting as many frequently queried dimensions as possible, the aggregation table generated from the selected dimensions can accelerate queries while reducing the cost of generating the table. This achieves the technical effect of accurately selecting the dimensions used to generate the aggregation table, and solves the technical problem that the dimension selection of the aggregation table is not accurate enough.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an alternative method for selecting dimensions of an aggregation table in accordance with embodiments of the present invention;
FIG. 2 is a schematic diagram of an alternative selection apparatus for aggregation table dimensions according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, a method embodiment of a method for selecting dimensions of an aggregation table is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
FIG. 1 shows a method for selecting dimensions of an aggregation table according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:
Step S102: obtain the dimensions in the query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries.
The query record records the dimensions used by all historical queries. For example, a query for monthly sales uses the dimension 'month'; a query for the population of a region uses the dimension 'region'; and a query for the daily visits to a web page uses a combined dimension formed by 'web page' and 'day'. All the dimensions used by the queries in the query record are obtained and combined into a dimension set, namely the first dimension set.
Step S104: count the number of queries of each dimension in the query record.
The dimensions used by all the queries in the query record are counted to obtain the number of queries of each dimension. The more often a dimension has been queried, the more likely it is to be queried again, and the greater the query optimization achieved by generating the aggregation table according to that dimension.
Step S106: establish a second dimension set, wherein the second dimension set is initially an empty set.
A second dimension set is established to hold the selected dimensions. Initially, before any aggregation table dimension has been selected, the second dimension set is an empty set.
Step S108: select target dimensions from the first dimension set according to the number of queries and add them to the second dimension set until the number of dimensions in the second dimension set reaches a target number, wherein when the number of dimensions in the second dimension set is equal to the target number, the percentage of the number of rows of the aggregation table in the number of rows of the fact data table is less than or equal to a first threshold; when the number of dimensions in the second dimension set exceeds the target number, that percentage is greater than the first threshold; and the aggregation table is generated according to the dimensions in the second dimension set.
According to the counted number of queries of each dimension, a target number of dimensions is selected from the first dimension set and added to the second dimension set. The target number of dimensions added to the second dimension set satisfies the following conditions: the percentage of the number of rows of the aggregation table generated from these dimensions in the number of rows of the fact data table is less than or equal to the first threshold; and if the number of dimensions in the second dimension set were greater than the target number, the percentage for the aggregation table generated from the dimensions in the second dimension set would be greater than the first threshold. When the percentage is less than or equal to the first threshold, the expansion rate of the aggregation table is controllable. Selecting the target number of dimensions from the first dimension set therefore keeps the expansion rate of the resulting aggregation table controllable while making the number of dimensions in the second dimension set as large as possible.
In the embodiment of the invention, the dimensions in the query record are obtained to form a first dimension set, wherein the query record records the dimensions used by all historical queries; the number of queries of each dimension in the query record is counted; a second dimension set, initially empty, is established; and target dimensions are selected from the first dimension set according to the number of queries and added to the second dimension set. When the number of dimensions in the second dimension set is equal to the target number, the expansion rate of the aggregation table is controllable; when the number of dimensions exceeds the target number, the expansion rate is uncontrollable. By keeping the expansion rate of the generated aggregation table controllable while selecting as many frequently queried dimensions as possible, the aggregation table generated from the selected dimensions can accelerate queries while reducing the cost of generating the table. This achieves the technical effect of accurately selecting the dimensions used to generate the aggregation table, and solves the technical problem that the dimension selection of the aggregation table is not accurate enough.
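Steps S102 to S106 can be sketched as follows. This is a minimal illustration under stated assumptions: the query record is modeled as a list of dimension lists, a combined dimension such as 'web page' and 'day' is counted as its component dimensions, and the names `collect_dimensions` and `query_records` are hypothetical.

```python
from collections import Counter

def collect_dimensions(query_records):
    """Return the first dimension set and the query count of each dimension."""
    counts = Counter()
    for dims in query_records:
        # Count each dimension at most once per historical query.
        counts.update(set(dims))
    first_dimension_set = set(counts)
    return first_dimension_set, counts

# Hypothetical query record: the dimensions used by each historical query.
query_records = [
    ["month"],
    ["region"],
    ["web_page", "day"],
    ["month", "region"],
]

first_dimension_set, query_counts = collect_dimensions(query_records)  # S102, S104
second_dimension_set = set()                                           # S106

print(sorted(first_dimension_set))  # ['day', 'month', 'region', 'web_page']
print(query_counts["month"])        # 2
```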
Optionally, selecting a target dimension from the first dimension set according to the number of queries and adding the target dimension to the second dimension set includes: sorting the dimensions by number of queries from high to low; in the order indicated by the sorting result, selecting the dimension with the most queries from the first dimension set as a current dimension and adding it to the second dimension set; after the current dimension is added to the second dimension set, generating a current aggregation table according to the dimensions in the second dimension set; judging, according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result; and after the judgment result is obtained, selecting the next dimension after the current dimension as the current dimension and adding it to the second dimension set, until all the dimensions in the first dimension set have been selected.
When target dimensions are selected for the second dimension set, dimensions that appear more often in the query record are preferred. The dimensions in the first dimension set are sorted by number of queries from high to low, and target dimensions are added to the second dimension set one by one in that order, starting with the most queried dimension as the current dimension. To ensure that the expansion rate of the aggregation table generated from the second dimension set stays controllable, a current aggregation table is generated from the dimensions in the second dimension set after the current dimension is added, and whether the current dimension is kept in the second dimension set is judged according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table. After the judgment result is obtained, the next dimension is selected as the current dimension and added to the second dimension set, the current aggregation table is generated again, and the same judgment is made. Dimensions are added to the second dimension set one by one in this way until all the dimensions in the first dimension set have been selected.
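The selection loop described above can be sketched as a single greedy pass. This is a sketch under assumptions, not the patented implementation: the threshold is expressed as a fraction rather than a percentage, and generating the current aggregation table is replaced by a hypothetical estimator `estimated_row_ratio` based on made-up cardinalities, so that the example runs without a real fact data table.

```python
import math

def select_dimensions(query_counts, row_ratio, first_threshold):
    """Greedy pass over dimensions in descending order of query count."""
    # Sort from most to least queried; ties are broken by name for determinism.
    ranked = sorted(query_counts, key=lambda d: (-query_counts[d], d))
    second_dimension_set = []
    for current in ranked:
        second_dimension_set.append(current)  # tentatively add the current dimension
        if row_ratio(second_dimension_set) > first_threshold:
            second_dimension_set.pop()        # expansion too large: delete it again
    return second_dimension_set

# Hypothetical stand-in for generating the current aggregation table:
# assume every combination of dimension values occurs, capped at the fact row count.
FACT_ROWS = 1_000_000
CARDINALITY = {"month": 12, "region": 5, "user_id": 100_000, "channel": 8}

def estimated_row_ratio(dimensions):
    combos = math.prod(CARDINALITY[d] for d in dimensions)
    return min(1.0, combos / FACT_ROWS)

query_counts = {"month": 50, "region": 30, "user_id": 20, "channel": 10}
selected = select_dimensions(query_counts, estimated_row_ratio, first_threshold=0.1)
print(selected)  # ['month', 'region', 'channel']
```

In this run `user_id` is rejected because adding it pushes the estimated ratio above the threshold, while the less frequently queried `channel` is still accepted afterwards.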
Optionally, judging whether to keep the current dimension in the second dimension set according to the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table, and obtaining a judgment result, includes: judging whether the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table is less than or equal to a first threshold; if the percentage is judged to be less than or equal to the first threshold, keeping the current dimension in the second dimension set; and if the percentage is judged to be greater than the first threshold, deleting the current dimension from the second dimension set.
To ensure that the expansion rate of the aggregation table generated from the second dimension set remains controllable as dimensions are added, a current aggregation table is generated from the dimensions in the second dimension set each time a dimension is added, and whether the dimension just added to the second dimension set is kept there is judged. Specifically, it is judged whether the percentage of the number of rows of the generated current aggregation table in the number of rows of the fact data table is less than or equal to the first threshold. If the percentage is less than or equal to the first threshold, the expansion rate of the current aggregation table is controllable, and the dimension just added is kept in the second dimension set. If the percentage is greater than the first threshold, the expansion rate of the current aggregation table is uncontrollable, and the dimension just added is deleted from the second dimension set.
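The judgment described above can be sketched as follows, assuming the number of rows of the current aggregation table equals the number of distinct combinations of dimension values in the fact data table. The function names and the sample rows are hypothetical, and the threshold is written as a fraction rather than a percentage.

```python
def row_ratio(fact_rows, dimensions):
    """Rows of the aggregation table divided by rows of the fact data table."""
    if not fact_rows:
        raise ValueError("fact data table is empty")
    # One aggregation-table row per distinct combination of dimension values.
    groups = {tuple(row[d] for d in dimensions) for row in fact_rows}
    return len(groups) / len(fact_rows)

def keep_current_dimension(fact_rows, second_dimension_set, first_threshold):
    """True if the dimension just added may stay in the second dimension set."""
    return row_ratio(fact_rows, second_dimension_set) <= first_threshold

# Hypothetical fact data table.
fact_rows = [
    {"month": "01", "user": "a"},
    {"month": "01", "user": "b"},
    {"month": "02", "user": "c"},
    {"month": "02", "user": "d"},
]

print(row_ratio(fact_rows, ["month"]))                            # 0.5
print(keep_current_dimension(fact_rows, ["month"], 0.5))          # True
print(keep_current_dimension(fact_rows, ["month", "user"], 0.5))  # False
```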
Optionally, after it is judged that the percentage of the number of rows of the current aggregation table in the number of rows of the fact data table is less than or equal to the first threshold and the current dimension is kept in the second dimension set, the method further includes: obtaining a correlation coefficient between each dimension in the first dimension set and the current dimension, wherein the correlation coefficient indicates the degree of correlation between that dimension and the current dimension; and selecting from the first dimension set the dimensions whose correlation coefficient with the current dimension is greater than or equal to a second threshold, and adding them to the second dimension set.
In the embodiment of the invention, to improve the efficiency of dimension selection, dimensions that are highly correlated with the dimension just added to the second dimension set are preferentially selected and added to the second dimension set. Specifically, after the current dimension has been added to the second dimension set and judged to be kept there, the correlation coefficient between each dimension in the first dimension set and the current dimension is obtained; the larger the correlation coefficient, the higher the degree of correlation between that dimension and the current dimension. The dimensions whose correlation coefficient is greater than or equal to the second threshold are selected and added to the second dimension set. After this addition, the next dimension is selected in order of number of queries and added to the second dimension set, and whether it is kept in the second dimension set is judged by the method described above, until all the dimensions in the first dimension set have been selected.
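The text does not say how the correlation coefficient is computed. The sketch below assumes, purely for illustration, a co-occurrence (Jaccard) coefficient over the query record: the share of queries using either dimension that use both. The function names and sample data are hypothetical.

```python
def cooccurrence_coefficient(query_records, dim_a, dim_b):
    """Assumed correlation measure: Jaccard index of the queries using each dimension."""
    with_a = sum(1 for dims in query_records if dim_a in dims)
    with_b = sum(1 for dims in query_records if dim_b in dims)
    both = sum(1 for dims in query_records if dim_a in dims and dim_b in dims)
    union = with_a + with_b - both
    return both / union if union else 0.0

def correlated_dimensions(query_records, first_dimension_set, current,
                          already_selected, second_threshold):
    """Unselected dimensions whose coefficient with `current` meets the threshold."""
    return sorted(
        d for d in first_dimension_set
        if d != current
        and d not in already_selected
        and cooccurrence_coefficient(query_records, d, current) >= second_threshold
    )

# Hypothetical query record.
query_records = [
    ["web_page", "day"],
    ["web_page", "day"],
    ["web_page", "region"],
    ["month"],
]
first_dimension_set = {"web_page", "day", "region", "month"}

extra = correlated_dimensions(query_records, first_dimension_set,
                              current="web_page",
                              already_selected={"web_page"},
                              second_threshold=0.5)
print(extra)  # ['day']
```

Any other measure of association could be substituted for `cooccurrence_coefficient` without changing the selection step.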
It should be noted that dimensions in the first dimension set that have already been selected and added to the second dimension set do not need to be selected again; when dimensions are selected in order, only those that have not yet been selected and added are considered.
Optionally, before selecting a target dimension from the first dimension set to add to the second dimension set according to the number of queries, the method further includes: obtaining the unique values of all dimensions in the first dimension set; searching the first dimension set for dimensions whose unique value is less than or equal to a third threshold; and adding the dimensions whose unique value is less than or equal to the third threshold to the second dimension set.
In the embodiment of the present invention, another manner of dimension selection may also be adopted: before target dimensions are selected from the first dimension set according to the number of queries and added to the second dimension set, the smaller dimensions in the first dimension set are added to the second dimension set. Specifically, the unique value of each dimension in the first dimension set (that is, the number of distinct values the dimension takes) is obtained; the smaller the unique value, the smaller the dimension. The dimensions whose unique value is less than or equal to a third threshold are found among all the dimensions and added to the second dimension set as the selected small dimensions. After the smaller dimensions have been added, dimensions are selected one by one from the remaining dimensions of the first dimension set in order of number of queries and added to the second dimension set, and whether each is kept in the second dimension set is judged by the method described above, until all the dimensions in the first dimension set have been selected.
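The pre-selection of small dimensions can be sketched as follows, assuming that the unique value of a dimension means its number of distinct values in the fact data table. The function names and the sample rows are hypothetical.

```python
def distinct_value_counts(fact_rows, dimensions):
    """Number of distinct values each dimension takes in the fact data table."""
    return {d: len({row[d] for row in fact_rows}) for d in dimensions}

def preselect_small_dimensions(fact_rows, first_dimension_set, third_threshold):
    """Dimensions whose distinct-value count does not exceed the third threshold."""
    counts = distinct_value_counts(fact_rows, first_dimension_set)
    return {d for d, n in counts.items() if n <= third_threshold}

# Hypothetical fact data table.
fact_rows = [
    {"gender": "f", "user_id": 1},
    {"gender": "m", "user_id": 2},
    {"gender": "f", "user_id": 3},
    {"gender": "m", "user_id": 4},
]
first_dimension_set = {"gender", "user_id"}

# Small dimensions go straight into the second dimension set.
second_dimension_set = preselect_small_dimensions(
    fact_rows, first_dimension_set, third_threshold=2)
remaining = first_dimension_set - second_dimension_set

print(second_dimension_set)  # {'gender'}
print(remaining)             # {'user_id'}
```

The dimensions left in `remaining` would then go through the query-count ordering and row-percentage judgment described earlier.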
According to an embodiment of the present invention, a device for selecting dimensions of an aggregation table is further provided. Fig. 2 is a schematic diagram of an optional device for selecting dimensions of an aggregation table according to an embodiment of the present invention. As shown in Fig. 2, the device includes:
A first obtaining unit 210 is configured to obtain the dimensions in a query record to obtain a first dimension set, where the query record records the dimensions used by all historical queries.
The query record records the dimensions used by all historical queries. For example, when monthly sales are queried, the dimension used is "month"; when the population of a region is queried, the dimension used is "region"; and when the daily visits to a web page are queried, the dimension used is the combined dimension formed by "web page" and "day". The first obtaining unit 210 obtains all dimensions used by the queries in the query record and combines them into one dimension set, namely the first dimension set.
A statistical unit 220 is configured to count the number of queries of each dimension in the query record.
The statistical unit 220 counts the dimensions used by all queries in the query record to obtain the number of queries of each dimension. The greater the number of queries of a dimension, the more likely that dimension is to be queried, and the more pronounced the query optimization effect of an aggregation table generated from that dimension.
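As an illustration of what the first obtaining unit and the statistical unit compute, the following minimal sketch builds the first dimension set and the per-dimension query counts from a query record. The representation of the query record as a list of dimension lists, the function names, and the alphabetical tie-breaking rule are assumptions made for this example only.

```python
from collections import Counter


def count_dimension_queries(query_log):
    """Count, for each dimension, how many historical queries used it."""
    counts = Counter()
    for dims in query_log:
        counts.update(set(dims))  # a query counts once per dimension it uses
    return counts


def rank_dimensions(counts):
    """Sort dimensions by query count, high to low (ties broken by name)."""
    return sorted(counts, key=lambda d: (-counts[d], d))


# Hypothetical query record: each entry lists the dimensions one query used.
query_log = [["month"], ["region"], ["web page", "day"], ["month", "region"]]

counts = count_dimension_queries(query_log)
first_dimension_set = set(counts)   # every dimension seen in the record
ranked = rank_dimensions(counts)    # candidate order for later selection
```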
An establishing unit 230 is configured to establish a second dimension set, where the second dimension set is initially an empty set.
The establishing unit 230 establishes a second dimension set to hold the selected dimensions. Initially, before any aggregation table dimension has been selected, the second dimension set is an empty set.
A first selecting unit 240, configured to select a target dimension from the first dimension set according to the number of queries and add the target dimension to the second dimension set until the number of dimensions in the second dimension set reaches a target number, where when the number of dimensions in the second dimension set is equal to the target number, a percentage of rows of the aggregation table to rows of the fact data table is less than or equal to a first threshold, and when the number of dimensions in the second dimension set exceeds the target number, a percentage of rows of the aggregation table to rows of the fact data table is greater than the first threshold, and the aggregation table is generated according to the dimensions in the second dimension set.
The first selecting unit 240 selects a target number of dimensions from the first dimension set according to the counted number of queries of each dimension and adds them to the second dimension set. The target number of dimensions added to the second dimension set satisfies the following conditions: the percentage of the number of rows of the aggregation table generated from the target number of dimensions to the number of rows of the fact data table is less than or equal to a first threshold; and if the number of dimensions in the second dimension set were greater than the target number, the percentage of the number of rows of the aggregation table generated from the dimensions in the second dimension set to the number of rows of the fact data table would be greater than the first threshold. When this percentage is less than or equal to the first threshold, the expansion rate of the aggregation table is controllable. Selecting the target number of dimensions from the first dimension set and adding them to the second dimension set therefore keeps the expansion rate of the aggregation table generated from the second dimension set controllable while making the number of dimensions in the second dimension set as large as possible.
In the embodiment of the present invention, a first dimension set is obtained by obtaining the dimensions in a query record, where the query record records the dimensions used by all historical queries; the number of queries of each dimension in the query record is counted; a second dimension set is established, where the second dimension set is initially an empty set; and target dimensions are selected from the first dimension set according to the number of queries and added to the second dimension set. When the number of dimensions in the second dimension set is equal to the target number, the expansion rate of the aggregation table is controllable; when the number of dimensions in the second dimension set exceeds the target number, the expansion rate of the aggregation table is uncontrollable. By selecting as many frequently queried dimensions as possible while keeping the expansion rate of the generated aggregation table controllable, the aggregation table generated from the selected dimensions can both accelerate queries and reduce the cost of generating the aggregation table. This achieves the technical effect of accurately selecting the dimensions used to generate the aggregation table, and solves the technical problem that the dimension selection of the aggregation table is not accurate enough.
Optionally, the first selecting unit includes: a sorting module, configured to sort the dimensions by the number of queries from high to low; a first selection module, configured to select, in the order indicated by the sorting result, the dimension with the largest number of queries from the first dimension set as a current dimension and add it to the second dimension set; a generating module, configured to generate a current aggregation table according to the dimensions in the second dimension set after the current dimension is added to the second dimension set; a judging module, configured to judge, according to the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result; and a second selection module, configured to select, after the judgment result is obtained, the next dimension after the current dimension as the current dimension and add it to the second dimension set, until all dimensions in the first dimension set have been selected.
When target dimensions are selected and added to the second dimension set, dimensions with a larger number of queries in the query record are selected preferentially. The sorting module sorts the dimensions in the first dimension set by the number of queries from high to low, and target dimensions are selected one by one according to the sorting result and added to the second dimension set. The first selection module first selects the dimension with the largest number of queries as the current dimension and adds it to the second dimension set. To ensure that the expansion rate of the aggregation table generated from the dimensions in the second dimension set is controllable, after the current dimension is added, the generating module generates a current aggregation table according to the dimensions in the second dimension set, and the judging module judges, according to the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table, whether the current dimension is kept in the second dimension set. After the judgment result is obtained, the second selection module selects the next dimension as the current dimension and adds it to the second dimension set; the current aggregation table is then generated again and it is judged again whether the current dimension is kept in the second dimension set. In this way, dimensions are selected one by one and added to the second dimension set until all dimensions in the first dimension set have been selected.
Optionally, the judging module includes: a first judgment submodule, configured to judge whether the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to a first threshold; a retention submodule, configured to keep the current dimension in the second dimension set if the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is judged to be less than or equal to the first threshold; and a deleting submodule, configured to delete the current dimension from the second dimension set if the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is judged to be greater than the first threshold.
To ensure that the expansion rate of the aggregation table generated from the dimensions in the second dimension set remains controllable after a dimension is added, each time a dimension is added to the second dimension set, a current aggregation table is generated according to the dimensions in the second dimension set, and it is judged whether the dimension just added to the second dimension set is kept in the second dimension set. Specifically, the first judgment submodule judges whether the percentage of the number of rows of the generated current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold. If the percentage is less than or equal to the first threshold, the expansion rate of the current aggregation table is controllable, and the retention submodule keeps the dimension just added in the second dimension set. If the percentage is greater than the first threshold, the expansion rate of the current aggregation table is uncontrollable, and the deleting submodule deletes the dimension just added from the second dimension set.
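The add, generate, judge, and keep-or-delete cycle described above can be sketched as follows. This sketch rests on assumptions not stated in the embodiment: the fact data table is modeled as a list of dictionaries, the number of rows of the aggregation table is taken to be the number of distinct value combinations of the selected dimensions, and the first threshold is expressed as a fraction rather than a percentage. All names and sample data are hypothetical.

```python
def aggregate_row_count(fact_rows, dims):
    """Rows of an aggregation table grouped by dims (distinct combinations)."""
    return len({tuple(row[d] for d in dims) for row in fact_rows})


def select_dimensions(fact_rows, ranked_dims, first_threshold):
    """Try dimensions in ranked order; keep one only if the ratio stays small."""
    if not fact_rows:
        raise ValueError("fact table must contain at least one row")
    second_set = []
    for dim in ranked_dims:
        second_set.append(dim)  # tentatively add the current dimension
        ratio = aggregate_row_count(fact_rows, second_set) / len(fact_rows)
        if ratio > first_threshold:
            second_set.pop()    # expansion not controllable: delete it
    return second_set


# Hypothetical fact data table with 8 rows.
fact_rows = [
    {"month": "Jan", "region": "N", "user": "u1"},
    {"month": "Jan", "region": "N", "user": "u2"},
    {"month": "Jan", "region": "S", "user": "u3"},
    {"month": "Feb", "region": "N", "user": "u4"},
    {"month": "Feb", "region": "S", "user": "u5"},
    {"month": "Feb", "region": "S", "user": "u6"},
    {"month": "Jan", "region": "N", "user": "u7"},
    {"month": "Feb", "region": "S", "user": "u8"},
]

# Dimensions already sorted by number of queries, high to low.
selected = select_dimensions(fact_rows, ["month", "user", "region"], 0.5)
```

In this sample, "user" is tried second but deleted again because grouping by it yields as many rows as the fact data table, while "month" and "region" together stay at the threshold and are kept.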
Optionally, the judging module further includes: an obtaining submodule, configured to obtain, after it is judged that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold and the current dimension is kept in the second dimension set, a correlation coefficient between each dimension in the first dimension set and the current dimension, where the correlation coefficient indicates the degree of correlation between the dimension in the first dimension set and the current dimension; and a selection submodule, configured to select, from the first dimension set, the dimensions whose correlation coefficient is greater than or equal to a second threshold and add them to the second dimension set.
In the embodiment of the present invention, to improve the efficiency of dimension selection, dimensions that are highly correlated with the dimension just added to the second dimension set are preferentially selected and added to the second dimension set. Specifically, after the current dimension has been added to the second dimension set and it has been judged that the current dimension is kept in the second dimension set, the obtaining submodule obtains the correlation coefficient between each dimension in the first dimension set and the current dimension, where the larger the correlation coefficient, the higher the degree of correlation between that dimension and the current dimension. The selection submodule selects the dimensions whose correlation coefficient is greater than or equal to the second threshold and adds them to the second dimension set. After these additions are finished, the next dimension is selected according to the order of the number of queries and added to the second dimension set, and whether it is kept in the second dimension set is judged according to the method described above, until all dimensions in the first dimension set have been selected.
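The embodiment does not specify how the correlation coefficient is computed. Purely as an assumed stand-in, the sketch below uses a co-occurrence ratio: the share of queries using the current dimension that also use the other dimension. The function names, the query record layout, and the sample data are hypothetical.

```python
def cooccurrence_coefficient(query_log, current, other):
    """Share of queries using `current` that also use `other` (assumed measure)."""
    with_current = [set(q) for q in query_log if current in q]
    if not with_current:
        return 0.0
    return sum(other in q for q in with_current) / len(with_current)


def correlated_dimensions(query_log, first_set, current, second_threshold,
                          already_selected=()):
    """Dimensions not yet selected whose coefficient is >= second_threshold."""
    return [
        dim for dim in first_set
        if dim != current
        and dim not in already_selected
        and cooccurrence_coefficient(query_log, current, dim) >= second_threshold
    ]


# Hypothetical query record and first dimension set.
query_log = [
    ["web page", "day"],
    ["web page", "day"],
    ["web page", "region"],
    ["month"],
]
first_set = ["web page", "day", "region", "month"]

# After "web page" is kept, pull in the dimensions strongly tied to it.
extra = correlated_dimensions(query_log, first_set, "web page", 0.5)
```

Any other measure of association between dimensions could be substituted for `cooccurrence_coefficient` without changing the surrounding selection logic.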
It should be noted that dimensions in the first dimension set that have already been selected and added to the second dimension set do not need to be selected again; when dimensions are selected in order, only dimensions that have not yet been selected and added are candidates.
Optionally, the device further includes: a second obtaining unit, configured to obtain the unique values of all dimensions in the first dimension set before a target dimension is selected from the first dimension set according to the number of queries and added to the second dimension set; a searching unit, configured to search the first dimension set for dimensions whose unique values are less than or equal to a third threshold; and a second selecting unit, configured to select the dimensions whose unique values are less than or equal to the third threshold and add them to the second dimension set.
In the embodiment of the present invention, another dimension selection manner may also be adopted: before a target dimension is selected from the first dimension set according to the number of queries and added to the second dimension set, the smaller dimensions in the first dimension set are added to the second dimension set. Specifically, the second obtaining unit obtains the unique values of all dimensions in the first dimension set, where the smaller the unique value of a dimension, the smaller the dimension. The searching unit finds, among all the dimensions, the dimensions whose unique values are less than or equal to the third threshold, and the second selecting unit selects the found dimensions and adds them to the second dimension set. After the smaller dimensions in the first dimension set have been added to the second dimension set, the remaining dimensions of the first dimension set are selected one by one according to the sorting result of the number of queries and added to the second dimension set, and whether each of them is kept in the second dimension set is judged according to the method described above, until all dimensions in the first dimension set have been selected.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other various media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for selecting dimensions of an aggregate table, comprising:
obtaining dimensions in a query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries;
counting the number of queries of each dimension in the query record;
establishing a second dimension set, wherein the second dimension set is initially an empty set;
selecting a target dimension from the first dimension set according to the number of queries and adding the target dimension to the second dimension set until the number of dimensions in the second dimension set reaches a target number, wherein when the number of dimensions in the second dimension set is equal to the target number, the percentage of the number of rows of an aggregation table to the number of rows of a fact data table is less than or equal to a first threshold, and when the number of dimensions in the second dimension set exceeds the target number, the percentage of the number of rows of the aggregation table to the number of rows of the fact data table is greater than the first threshold, the aggregation table being generated according to the dimensions in the second dimension set.
2. The method of claim 1, wherein selecting a target dimension from the first set of dimensions to add to the second set of dimensions based on the number of queries comprises:
sorting the dimensions by the number of queries from high to low;
selecting, in the order indicated by the sorting result, the dimension with the largest number of queries from the first dimension set as a current dimension and adding it to the second dimension set;
after the current dimension is added to the second dimension set, generating a current aggregation table according to the dimensions in the second dimension set;
judging, according to the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result;
and after the judgment result is obtained, selecting the next dimension after the current dimension as the current dimension and adding it to the second dimension set, until all dimensions in the first dimension set have been selected.
3. The method according to claim 2, wherein judging, according to the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result, comprises:
judging whether the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold;
if the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is judged to be less than or equal to the first threshold, keeping the current dimension in the second dimension set;
and if the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is judged to be greater than the first threshold, deleting the current dimension from the second dimension set.
4. The method of claim 3, wherein after it is judged that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold and the current dimension is kept in the second dimension set, the method further comprises:
acquiring a correlation coefficient between the dimensions in the first dimension set and the current dimension, wherein the correlation coefficient is used for indicating the degree of correlation between the dimensions in the first dimension set and the current dimension;
and selecting the dimension with the correlation coefficient larger than or equal to a second threshold value from the first dimension set, and adding the dimension into the second dimension set.
5. The method of claim 1, wherein before selecting a target dimension from the first set of dimensions to add to the second set of dimensions according to the number of queries, the method further comprises:
obtaining unique values of all dimensions in the first dimension set;
searching for dimensions of which the unique value is less than or equal to a third threshold value from the first dimension set;
adding dimensions for which the unique value is less than or equal to a third threshold value to the second set of dimensions.
6. An apparatus for selecting aggregate table dimensions, comprising:
a first obtaining unit, configured to obtain dimensions in a query record to obtain a first dimension set, wherein the query record records the dimensions used by all historical queries;
a statistical unit, configured to count the number of queries of each dimension in the query record;
an establishing unit, configured to establish a second dimension set, wherein the second dimension set is initially an empty set;
a first selecting unit, configured to select a target dimension from the first dimension set according to the number of queries and add the target dimension to the second dimension set until the number of dimensions in the second dimension set reaches a target number, where when the number of dimensions in the second dimension set is equal to the target number, a percentage of a number of rows of an aggregation table to a number of rows of a fact data table is less than or equal to a first threshold, and when the number of dimensions in the second dimension set exceeds the target number, a percentage of a number of rows of the aggregation table to a number of rows of the fact data table is greater than the first threshold, where the aggregation table is generated according to the dimensions in the second dimension set.
7. The apparatus of claim 6, wherein the first selecting unit comprises:
a sorting module, configured to sort the dimensions by the number of queries from high to low;
a first selection module, configured to select, in the order indicated by the sorting result, the dimension with the largest number of queries from the first dimension set as a current dimension and add it to the second dimension set;
a generating module, configured to generate a current aggregation table according to the dimensions in the second dimension set after the current dimension is added to the second dimension set;
a judging module, configured to judge, according to the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table, whether the current dimension is kept in the second dimension set, to obtain a judgment result;
and a second selection module, configured to select, after the judgment result is obtained, the next dimension after the current dimension as the current dimension and add it to the second dimension set, until all dimensions in the first dimension set have been selected.
8. The apparatus of claim 7, wherein the judging module comprises:
a first judgment submodule, configured to judge whether the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold;
a retention submodule, configured to keep the current dimension in the second dimension set if it is judged that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold;
and a deleting submodule, configured to delete the current dimension from the second dimension set if it is judged that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is greater than the first threshold.
9. The apparatus of claim 8, wherein the judging module further comprises:
an obtaining submodule, configured to obtain, after it is judged that the percentage of the number of rows of the current aggregation table to the number of rows of the fact data table is less than or equal to the first threshold and the current dimension is kept in the second dimension set, a correlation coefficient between a dimension in the first dimension set and the current dimension, wherein the correlation coefficient is used to indicate the degree of correlation between the dimension in the first dimension set and the current dimension;
and a selection submodule, configured to select, from the first dimension set, the dimension whose correlation coefficient is greater than or equal to a second threshold and add it to the second dimension set.
10. The apparatus of claim 6, further comprising:
a second obtaining unit, configured to obtain unique values of all dimensions in the first dimension set before a target dimension is selected from the first dimension set according to the number of queries and added to the second dimension set;
a searching unit, configured to search for a dimension, in the first dimension set, for which the unique value is less than or equal to a third threshold;
a second selecting unit, configured to select a dimension of which the unique value is less than or equal to a third threshold value to be added to the second dimension set.
CN201611263719.XA 2016-12-30 2016-12-30 Selection method and device for dimension of aggregation table Active CN108268515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611263719.XA CN108268515B (en) 2016-12-30 2016-12-30 Selection method and device for dimension of aggregation table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611263719.XA CN108268515B (en) 2016-12-30 2016-12-30 Selection method and device for dimension of aggregation table

Publications (2)

Publication Number Publication Date
CN108268515A CN108268515A (en) 2018-07-10
CN108268515B true CN108268515B (en) 2020-07-31

Family

ID=62755144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611263719.XA Active CN108268515B (en) 2016-12-30 2016-12-30 Selection method and device for dimension of aggregation table

Country Status (1)

Country Link
CN (1) CN108268515B (en)

Also Published As

Publication number Publication date
CN108268515A (en) 2018-07-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant