CN107229663B

CN107229663B - Data processing method and device and data table processing method and device

Info

Publication number: CN107229663B
Application number: CN201610180081.7A
Authority: CN
Inventors: 孙伟光; 徐冬; 连杰红; 汪龙重
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2022-05-27
Anticipated expiration: 2036-03-25
Also published as: WO2017162102A1; TWI746517B; TW201737057A; CN107229663A

Abstract

The invention provides a data processing method and device and a data table processing method and device. After the number of data in a data set is counted, the number of functions in a hash function set is determined according to the counted number, and the unique value number of the data set is then calculated using a hash function set with that number of functions. The scale of the hash function set is thereby matched to the scale of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot both be achieved because the scale of the hash function set is fixed. In addition, the scale of a data table is predicted according to the calculated unique value number, the resources that an operation on the data table needs to occupy are evaluated according to the prediction result, and the operation is optimized based on those resource requirements, so that resource occupation is reduced and operation efficiency is improved when operating on the data table.

Description

Data processing method and device and data table processing method and device

Technical Field

The present invention relates to computer technologies, and in particular, to a data processing method and apparatus, and a data table processing method and apparatus.

Background

In practical applications, especially before performing a data table join operation, it is often necessary to count the number of objects or events that do not repeat, that is, the number of independent elements, also called the unique value number, so as to predict the size of the data table. For smaller data volumes, the sequence can first be sorted in memory and then scanned to count the independent elements. However, when a data stream sequence is processed, the sequence is very long, the range of element values may be wide, and a single element may occupy considerable memory, so that memory cannot hold the whole sequence.

For this situation, the Flajolet-Martin (FM) algorithm can be adopted, which is well suited to estimating the unique value number. The algorithm operates on a set of hash functions and estimates the unique value number from the hash values produced by each hash function in the set.

However, in the prior art, in the process of calculating the unique value number of a certain column in the data table by applying the FM algorithm, the same hash function set is adopted for data sets of different scales, so that when the scale of the data set is large, the execution efficiency of the unique value number calculation process is often low, and the execution time is too long; when the size of the data set is small, the accuracy of the unique value number is low.

Disclosure of Invention

The invention provides a data processing method and device and a data table processing method and device, which are used for solving the problem that the accuracy of a unique value number cannot be ensured while the execution efficiency is ensured when an FM algorithm is adopted for calculating the unique value number in the prior art.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

In a first aspect, a data processing method is provided, including:

counting the number of data in the data set;

determining the function number of the hash function set according to the counted data number;

and based on an FM algorithm, performing unique value number calculation on the data set by adopting a hash function set which accords with the function number.

In a second aspect, there is provided a data processing apparatus comprising:

the statistical module is used for counting the data number of the data set;

the determining module is used for determining the function number of the hash function set according to the counted data number;

and the calculation module is used for calculating the unique value number of the data set by adopting a hash function set which accords with the function number based on an FM algorithm.

In a third aspect, a data table processing method for predicting a size of a data table is provided, including:

processing the data table by adopting the data processing method of the first aspect to obtain a unique value number;

and predicting the size of the data table according to the unique value number.

In a fourth aspect, a data table processing method for evaluating data table operations is provided, comprising:

predicting the scale of the data table by adopting the data table processing method in the third aspect;

and according to the predicted scale of the data table, evaluating the operation of the data table to determine the resources occupied by the operation.

In a fifth aspect, a data table processing method for performing a data table operation is provided, including:

evaluating resources occupied by the operation of the data table by adopting the data table processing method of the fourth aspect;

and executing the operation on the data table according to the evaluation result.

In a sixth aspect, there is provided a data table processing apparatus for predicting a size of a data table, comprising:

a unique value module, configured to process the data table by using the data processing apparatus according to the second aspect to obtain a unique value number;

and the prediction module is used for predicting the scale of the data table according to the unique value number.

In a seventh aspect, a data table processing apparatus for evaluating data table operations is provided, comprising:

a prediction module, configured to predict a scale of the data table by using the data table processing apparatus according to the sixth aspect;

and the evaluation module is used for evaluating the operation of the data table according to the predicted scale of the data table so as to determine the resources occupied by the operation.

In an eighth aspect, there is provided a data table processing apparatus for performing a data table operation, comprising:

an evaluation module, configured to evaluate, by using the data table processing apparatus of the seventh aspect, resources occupied by the operation of the data table;

and the operation module is used for executing the operation on the data table according to the evaluation result.

According to the data processing method and device and the data table processing method and device provided by the embodiments of the invention, after the number of data in the data set is counted, the number of functions in the hash function set is determined according to the counted number, and the unique value number of the data set is then calculated using a hash function set with that number of functions. The scale of the hash function set is thereby matched to the scale of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot both be achieved because the scale of the hash function set is fixed. In addition, the scale of the data table is predicted according to the calculated unique value number, the resources occupied by an operation on the data table are evaluated according to the prediction result, and the operation is optimized based on those resource requirements, so that resource occupation is reduced and operation efficiency is improved when operating on the data table.

The above is only an overview of the technical solutions of the present invention. The detailed description below is provided so that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention become more apparent.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a schematic flowchart of a data processing method according to a first embodiment of the present invention;

Fig. 2 is a schematic flowchart of a data processing method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another data processing apparatus according to a fourth embodiment of the present invention;

Fig. 5 is a schematic flowchart of a data table processing method according to a fifth embodiment of the present invention;

Fig. 6 is a schematic flowchart of another data table processing method according to the fifth embodiment of the present invention;

Fig. 7 is a schematic flowchart of a further data table processing method according to the fifth embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a data table processing apparatus 60 according to a sixth embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a data table processing apparatus 70 according to the sixth embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a data table processing apparatus 80 according to the sixth embodiment of the present invention;

Fig. 11 is a schematic structural diagram of another data table processing apparatus 80 according to the sixth embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The following describes in detail a data processing method and apparatus and a data table processing method and apparatus provided by an embodiment of the present invention with reference to the accompanying drawings.

Example one

Fig. 1 is a schematic flow chart of a data processing method according to the first embodiment of the present invention. As shown in Fig. 1, the method includes:

Step 101, counting the number of data in the data set.

As one possible implementation, the data is counted as it is traversed, and the number of functions in the hash function set is determined from the count after the traversal is complete. Specifically, before the unique value number is calculated for the data set, all data in the data set may be traversed, the total number of data in the data set counted, and the number of functions in the hash function set determined according to that count. The data set referred to here may be the set of all data included in the same column of each data table. When the unique value number is calculated for the data set, all data in the data set is traversed again, the hash value of each piece of data is calculated with the determined hash functions, the hash values are processed according to the FM algorithm, and the unique value number is estimated.

As another possible implementation, the number of functions in the hash function set is continuously adjusted according to the current count while the data is being traversed. Specifically, the number of functions in the hash function set may be adjusted during the traversal of the data set, and the hash value of each traversed piece of data calculated with the hash functions in the adjusted set. After all data in the data set has been traversed, the unique value number is calculated with the FM algorithm from the hash values of the hash functions in the set determined by the last adjustment. Specifically, the data in the data set may be read piece by piece while the number of data already read is counted. After a piece of data is read, the number of functions in the hash function set is determined, and the data is substituted into each hash function in the set to calculate its function value.

In an application scenario in which several data tables are connected, the unique value number can be calculated for the data of one or more columns in the data tables to be connected, so that the scale of the connection can be predicted. This is because, in big data processing, each data table usually contains tens of thousands of data records and the data volume is large; predicting the scale of a large data table connection therefore makes it easier to carry out the preparation that such a connection requires.

In the first implementation, the data set must be traversed twice. Because the data tables in a connection scenario contain a large amount of data, the two traversals occupy more computing resources and computing time. Therefore, when the data set contains a large amount of data, the second implementation is preferable, since it improves operation efficiency by reducing the number of times the data set is traversed.

Step 102, determining the number of functions in the hash function set according to the counted number of data.

FM is an efficient algorithm for calculating the unique value number, and the size of the hash function set used in the algorithm strongly affects both the accuracy of the result and the execution efficiency. On the one hand, a hash function set that is too small gives high execution efficiency but low accuracy; on the other hand, a hash function set that is too large gives relatively high accuracy but relatively low execution efficiency. It can be seen that the size of the FM hash function set should match the size of the data set.

In practical applications, the matching can be performed in the following way, for example. Denote the number of data in the data set as N and the number of functions in the hash function set as H:

when N < 100,000, H is 1024;

when 100,000 ≤ N < 1,000,000, H is 512;

when 1,000,000 ≤ N < 10,000,000, H is 256;

when 10,000,000 ≤ N < 100,000,000, H is 128;

when N ≥ 100,000,000, H is 64.
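The example mapping from N to H above can be written as a small lookup. This is only an illustrative sketch: the function name `choose_function_count` is hypothetical, and the thresholds are the example values from the text, not a required configuration.

```python
def choose_function_count(n: int) -> int:
    """Return the example hash-function count H for a data set of n records.

    The thresholds are the example values given in the text; the function
    name is an assumption made for illustration.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 100_000:
        return 1024
    if n < 1_000_000:
        return 512
    if n < 10_000_000:
        return 256
    if n < 100_000_000:
        return 128
    return 64
```

Each tenfold increase in N halves H, so the number of hash evaluations (N × H) grows more slowly than it would with a fixed-size set.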

Step 103, based on the FM algorithm, calculating the unique value number of the data set using a hash function set with the determined number of functions.

In practical applications, to reduce error and improve accuracy, a series of hash functions H1, H2, H3, … is usually applied to all data in the data set to obtain hash values. Then, according to the FM algorithm, for each hash function, the maximum value MAX of the length of the trailing run of 0 bits in the binary representation of its hash values is determined.

For example, given a data set {e1, e2, e3, e2}, let the function TailZero(x) return the number of consecutive trailing 0s in the binary representation of a positive integer x, and let the hash function H(e) hash the data in the data set, giving the hash values:

H(e1) = 2 = (0010)₂, TailZero(H(e1)) = 1

H(e2) = 8 = (1000)₂, TailZero(H(e2)) = 3

H(e3) = 10 = (1010)₂, TailZero(H(e3)) = 1

Then MAX = max(1, 3, 1) = 3.
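The worked example above can be reproduced with a few lines of Python. This is a minimal sketch: the name `tail_zero` mirrors TailZero from the text, and the hash values are hard-coded from the example rather than produced by a real hash function.

```python
def tail_zero(x: int) -> int:
    """Number of consecutive 0 bits at the low end of a positive integer x."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    # x & -x isolates the lowest set bit; its position is the trailing-zero count.
    return (x & -x).bit_length() - 1


# Hash values taken directly from the example (not computed by a real hash).
example_hashes = {"e1": 2, "e2": 8, "e3": 10}
data_set = ["e1", "e2", "e3", "e2"]

# The repeated element e2 does not change the maximum.
max_tail = max(tail_zero(example_hashes[e]) for e in data_set)
print(max_tail)  # 3
```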

Further, let the maximum value for hash function H1 be MAX1; similarly, a series of maximum values MAX1, MAX2, MAX3, … is obtained. Then, based on the formula that the estimated unique value number equals 2 raised to the power of MAX, a series of estimates 2^MAX1, 2^MAX2, 2^MAX3, … is obtained. Finally, these estimates are combined to produce the final estimate. Specifically, A × B different hash functions may first be designed and divided into A groups of B hash functions each; B estimates are then calculated using the B hash functions in each group; the arithmetic mean of the B estimates is taken as the estimate of the group; and finally the median of the group estimates is selected as the final estimate.
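The grouping step described above (mean within each group, median across groups) might be sketched as follows. The function name `combine_estimates` and its parameters are hypothetical; the input is assumed to be the list of per-function MAX values, ordered so that consecutive runs of B values form a group.

```python
from statistics import mean, median


def combine_estimates(max_values, group_count):
    """Combine per-function MAX values into one estimate.

    max_values  -- MAX_i for each of the A x B hash functions (assumed ordering:
                   consecutive runs of B values belong to the same group)
    group_count -- A, the number of groups
    """
    if group_count <= 0 or len(max_values) % group_count != 0:
        raise ValueError("max_values must split evenly into group_count groups")
    group_size = len(max_values) // group_count  # B
    estimates = [2 ** m for m in max_values]      # one estimate 2^MAX per function
    group_means = [
        mean(estimates[i:i + group_size])         # arithmetic mean within a group
        for i in range(0, len(estimates), group_size)
    ]
    return median(group_means)                    # median across the groups
```

For instance, with MAX values [1, 3, 2, 2, 3, 3] and three groups, the group means are 5, 4, and 8, and the median is 5. Averaging within a group smooths the power-of-two granularity of individual estimates, while the median limits the effect of a group with an unusually large value.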

In this embodiment, after the data number of the data set is counted, the function number of the hash function set is determined according to the data number obtained through counting, and then the hash function set corresponding to the function number is adopted to perform unique value calculation on the data set, so that the scale of the hash function set is matched with the scale of the data set, the execution efficiency and the accuracy are balanced, and the problem that the execution efficiency and the accuracy cannot be considered due to the fact that the scale of the hash function set is fixed in the prior art is solved.

Example two

To clearly illustrate the implementation mentioned in the previous embodiment, in which the number of functions in the hash function set is continuously adjusted according to the current count while the data is traversed, this embodiment provides a specific execution flow. Fig. 2 is a schematic flow diagram of a data processing method provided in the second embodiment of the present invention. As shown in Fig. 2, the method includes:

Step 201, presetting an initial hash function set.

Specifically, the number of preset functions of the initial hash function set may be a maximum value, for example: the number of functions may be preset to 1024.

Step 202, reading one data in the data set, and counting the number of the read data.

Steps 202 to 205 are performed each time a piece of data is read from the data set.

Step 203, determining the number of functions according to the number of data read.

For example:

when the number of data read reaches 100,000, the number of functions is determined to be 512;

when the number of data read reaches 1,000,000, the number of functions is determined to be 256;

when the number of data read reaches 10,000,000, the number of functions is determined to be 128;

when the number of data read reaches 100,000,000, the number of functions is determined to be 64.

Therefore, as more data is read from the data set, the number of functions in the hash function set is progressively reduced. A larger hash function set is used when the data set is small, which improves calculation accuracy, and a smaller hash function set is used when the data set is large, which improves calculation efficiency. In this way, the size of the hash function set is matched to the size of the data set.

Step 204, judging whether the determined number of functions is less than the number of functions in the current hash function set; if so, executing step 205, otherwise executing step 206.

Step 205, removing hash functions from the hash function set until the determined number of functions remains.

Step 206, performing hash calculation on the currently read data using the current hash function set, and storing the obtained hash values.

Steps 202 to 206 are repeated until all data in the data set has been read.

Step 207, when the hash values of all data in the data set have been calculated, calculating the unique value number with the FM algorithm according to the hash values of the hash functions in the finally determined hash function set.

While the data in the data set is being read, the hash function set is continuously adjusted, and the unique value number is estimated on the basis of the hash function set determined by the last adjustment. The earlier a piece of data is read, the more hash functions are applied to it. Some of those hash functions no longer exist in the hash function set determined by the last adjustment, so their hash values are invalid; when the unique value number is estimated, only the hash values of the hash functions in the finally determined set are used.

For example, 64 hash functions H1-H64 are retained in the final hash function set.

From all the stored hash values, the hash values of the hash functions H1-H64 are selected for each piece of data. For each hash function, the maximum value MAX of the length of the trailing run of 0 bits in the binary representation of its hash values is determined. According to the predefined grouping H1-H8; H9-H16; H17-H24; H25-H32; H33-H40; H41-H48; H49-H56; H57-H64, the average of the maximum values MAX in each group is calculated, and the median of the group averages is taken as the estimated value R. The unique value number is estimated to be 2 raised to the power of R.
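The single-pass flow of steps 201 to 207 might be sketched as below. Several details are assumptions made for illustration and are not taken from the text: the hash family (`hash_value`, BLAKE2b salted with the function index), the treatment of an all-zero hash, the names `estimate_unique`, `thresholds`, and `group_size`, and the choice to keep only a running MAX per function instead of storing every hash value. The running MAX gives the same result as storing the values and selecting those of the retained functions afterwards, because only the maximum per retained function is used in step 207.

```python
import hashlib
from statistics import mean, median

# Example thresholds from the text: (records read, function count to shrink to).
THRESHOLDS = [(100_000, 512), (1_000_000, 256), (10_000_000, 128), (100_000_000, 64)]


def hash_value(index: int, item: str) -> int:
    """Hypothetical i-th hash function: 64-bit BLAKE2b salted with the index."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=index.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def tail_zero(x: int) -> int:
    """Trailing 0 bits of x; an all-zero 64-bit hash is treated as 64 (assumption)."""
    return (x & -x).bit_length() - 1 if x else 64


def estimate_unique(items, initial_count=1024, thresholds=THRESHOLDS, group_size=8):
    max_tail = [0] * initial_count           # step 201: one running MAX per function
    read = 0
    for item in items:                       # step 202: read one record and count it
        read += 1
        target = len(max_tail)
        for limit, count in thresholds:      # step 203: function count for this size
            if read >= limit:
                target = count
        if target < len(max_tail):           # steps 204-205: drop surplus functions
            del max_tail[target:]
        for i in range(len(max_tail)):       # step 206: hash with the current set
            max_tail[i] = max(max_tail[i], tail_zero(hash_value(i, item)))
    # Step 207: mean of MAX within each group, median of the means as R, result 2^R.
    group_means = [
        mean(max_tail[i:i + group_size]) for i in range(0, len(max_tail), group_size)
    ]
    return 2 ** median(group_means)
```

Because the functions with the lowest indices are always the ones retained, the final estimate depends only on which distinct items were seen and on how many records were read, not on the order in which they arrived.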

Example three

Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention, as shown in fig. 3, including: a statistics module 31, a determination module 32 and a calculation module 33.

The statistics module 31 is configured to count the number of data in the data set.

Specifically, the statistics module 31 is configured to count the total number of data included in the data set.

The determining module 32 is configured to determine the number of functions in the hash function set according to the counted number of data.

The calculating module 33 is configured to calculate, based on the FM algorithm, the unique value number of the data set using a hash function set with the determined number of functions.

The data processing device counts the data as it is traversed and, after the traversal, determines the number of functions in the hash function set according to the count. Specifically, before the unique value number is calculated for the data set, all data in the data set may be traversed, the total number of data in the data set counted, and the number of functions in the hash function set determined according to that count. The data set referred to here may be the set of all data included in the same column of each data table. When the unique value number is calculated for the data set, all data in the data set is traversed again, the hash value of each piece of data is calculated with the determined hash functions, the hash values are processed according to the FM algorithm, and the unique value number is estimated.

In this embodiment, after the number of data in the data set is counted, the number of functions in the hash function set is determined according to the counted number, and the unique value number of the data set is then calculated using a hash function set with that number of functions. The scale of the hash function set is thereby matched to that of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot both be achieved because the scale of the hash function set is fixed.

Example four

Fig. 4 is a schematic structural diagram of another data processing apparatus according to a fourth embodiment of the present invention.

In the apparatus provided in this embodiment, the statistics module 31 is specifically configured to read the data in the data set one by one and count the number of data already read.

The determining module 32 is specifically configured to gradually decrease the number of functions of the hash function set as the counted number of the read data increases.

As shown in fig. 4, on the basis of the above embodiment, the data processing apparatus further includes:

a generating module 34, configured to discard hash functions from the hash function set so that the number of retained hash functions equals the determined number of functions.

The calculation module 33 further includes: a hash value unit 331 and a unique value number unit 332.

A hash value unit 331, configured to calculate a hash value for the read data by using the hash function reserved in the hash function set.

A unique value number unit 332, configured to calculate a unique value number by using the hash value of the hash function reserved in the hash function set based on an FM algorithm when all data in the data set are read.

While the data is being traversed, the number of functions in the hash function set is continuously adjusted according to the current count. Specifically, the number of functions in the hash function set may be adjusted during the traversal of the data set, and the hash value of each traversed piece of data calculated with the hash functions in the adjusted set. After all data in the data set has been traversed, the unique value number is calculated with the FM algorithm from the hash values of the hash functions in the set determined by the last adjustment. Specifically, the data in the data set may be read piece by piece while the number of data already read is counted. After a piece of data is read, the subsequent steps begin for that data: the number of functions in the hash function set is determined, and the data is substituted into each hash function in the determined set to calculate its function value.

It can be seen that, as more data is read from the data set, the number of functions in the hash function set is progressively reduced. A large hash function set is used when the data set is small, which improves calculation accuracy, and a small hash function set is used when the data set is large, which improves calculation efficiency. In this way, the size of the hash function set is matched to the size of the data set.

It should be noted that the apparatuses provided in the third embodiment and the fourth embodiment are respectively used to implement the data processing flows provided in fig. 1 and fig. 2, and the functions of each functional module of the data processing apparatuses in the third embodiment and the fourth embodiment refer to the related description in the foregoing method embodiments, and are not repeated in the third embodiment and the fourth embodiment.

Example five

On the basis of the first or second embodiment, the fifth embodiment provides a data table processing method for optimizing operations of a data table, such as connection operations or grouping operations, so that less resource occupation is realized and the operation efficiency is improved. The resource may be a resource consumed for executing an operation, such as a CPU or a memory.

Fig. 5 is a schematic flow chart of a data table processing method provided in the fifth embodiment of the present invention. The method is used for predicting the size of a data table and includes:

Step 501, processing the data table using the data processing method of the first embodiment or the second embodiment to obtain the unique value number.

Step 502, predicting the scale of the data table according to the unique value number.

In one possible application scenario, the data table processing method provided in fig. 5 may be adopted to implement the size prediction of the data table, and the allocation of the required resources to the data table may be facilitated according to the predicted size.

Fig. 6 is a flowchart illustrating another data table processing method according to the fifth embodiment of the present invention. This data table processing method is used for evaluating data table operations and, after step 502 in the method provided in Fig. 5, further includes:

Step 503, evaluating the operation on the data table according to the predicted scale of the data table to determine the resources occupied by the operation.

In a possible application scenario, a connection operation may be performed on data table A and data table B. Based on the data table processing method provided in Fig. 6, the scale of each data table may be predicted first, and the resources occupied by connecting data tables A and B then evaluated, which facilitates resource allocation.

Fig. 7 is a flowchart illustrating a further data table processing method according to the fifth embodiment of the present invention. This processing method is used for performing a data table operation and, after step 503 in the method provided in Fig. 6, further includes:

Step 504, executing the operation on the data table according to the evaluation result.

Specifically, the execution order of at least two operations to be performed on the data table is determined according to the evaluation result. For example, the at least two operations may be ordered from the least to the most resources that they need to occupy. The at least two operations are then performed in the determined order.

In a possible application scenario, connection operations may be performed on data table A, data table B, and data table C. Based on the data table processing method provided in Fig. 6, the scale of each data table may be predicted first; the resources occupied by each of the three operations of connecting data tables A and B, connecting data tables A and C, and connecting data tables B and C are then evaluated, and the connection operation that occupies the least resources is selected. According to the evaluation result, the two smaller data tables A and B can be connected first, so that this connection occupies fewer resources, and the larger data table C is connected afterwards, so that the total resource occupation is minimized.

Therefore, the scale of the data table is predicted according to the calculated unique value number, the resources required to be occupied by the operation of the data table are evaluated according to the prediction result, and the sequence of the operation of the data table is optimized based on the condition that the resources required to be occupied by the operation, so that the purposes of reducing the occupation of the resources and improving the operation efficiency in the process of operating the data table are achieved.
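The selection of the first connection in the three-table scenario could be sketched as follows. The text does not specify how resources are evaluated from the predicted scales, so the cost model here (the product of the two predicted table sizes) is purely an assumption, as are the function names and the sample sizes.

```python
from itertools import combinations


def estimated_cost(sizes, pair):
    """Hypothetical cost model: product of the two predicted table scales."""
    left, right = pair
    return sizes[left] * sizes[right]


def order_connections(sizes):
    """Return all pairwise connections, ordered from least to most estimated cost."""
    pairs = combinations(sorted(sizes), 2)
    return sorted(pairs, key=lambda pair: estimated_cost(sizes, pair))


# Made-up predicted scales for illustration only.
predicted_sizes = {"A": 1_000, "B": 5_000, "C": 2_000_000}
ordered = order_connections(predicted_sizes)
first = ordered[0]  # the connection evaluated as occupying the least resources
```

With these sample sizes the two smaller tables A and B are connected first, which matches the ordering described in the scenario; a real evaluator would substitute its own resource model for `estimated_cost`.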

EXAMPLE six

Fig. 8 is a schematic structural diagram of a data table processing apparatus 60 according to a sixth embodiment of the present invention, where the data table processing apparatus 60 is used for predicting a data table size, and includes: a unique value module 61 and a prediction module 62.

A unique value module 61, configured to process the data table by using the data processing apparatus shown in fig. 3 or fig. 4 to obtain a unique value number.

And a prediction module 62 for predicting the size of the data table according to the unique value number.
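
As an illustration of how the unique value module 61 might obtain a unique value number, the following sketch applies the Flajolet-Martin algorithm with a hash function set whose size depends on the counted number of data. The thresholds in `choose_function_count`, the seeded BLAKE2b hashing, and all function names are assumptions made for the example. The sketch counts all data before hashing, and it is not the apparatus of fig. 3 or fig. 4 itself.

```python
import hashlib
import statistics

PHI = 0.77351   # correction constant from the Flajolet-Martin analysis
HASH_BITS = 32  # width of each hash value in this sketch


def choose_function_count(data_count: int) -> int:
    """Use fewer hash functions as the data count grows (thresholds are hypothetical)."""
    if data_count < 10_000:
        return 64
    if data_count < 1_000_000:
        return 32
    return 16


def hash_value(item: str, seed: int) -> int:
    """One member of the hash function set: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def lowest_set_bit(value: int) -> int:
    """Index of the least significant 1 bit (HASH_BITS - 1 if the value is 0)."""
    if value == 0:
        return HASH_BITS - 1
    return (value & -value).bit_length() - 1


def lowest_unset_bit(bitmap: int) -> int:
    """Index of the least significant 0 bit of the bitmap."""
    index = 0
    while bitmap & (1 << index):
        index += 1
    return index


def estimate_unique_count(items: list[str]) -> float:
    """Estimate the number of distinct items with one bitmap per hash function."""
    function_count = choose_function_count(len(items))
    bitmaps = [0] * function_count
    for item in items:
        for seed in range(function_count):
            bitmaps[seed] |= 1 << lowest_set_bit(hash_value(item, seed))
    mean_r = statistics.mean(lowest_unset_bit(b) for b in bitmaps)
    return 2 ** mean_r / PHI
```

Because each bitmap records only which bit positions have been seen, repeated values do not change the estimate. Averaging over more hash functions lowers the variance at the cost of more hashing per item, which is the trade-off that the choice of function count is meant to balance.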

Fig. 9 is a schematic structural diagram of a data table processing apparatus 70 according to a sixth embodiment of the present invention, where the data table processing apparatus 70 is used for evaluating a data table operation, and includes: a prediction module 71 and an evaluation module 72.

The prediction module 71 is configured to predict the size of the data table by using the data table processing apparatus 60 shown in fig. 8.

The evaluation module 72 is configured to evaluate the operation on the data table according to the predicted size of the data table, so as to determine the resources occupied by the operation.

The operation comprises a join operation and/or a grouping operation.

Fig. 10 is a schematic structural diagram of a data table processing apparatus 80 according to a sixth embodiment of the present invention, where the data table processing apparatus 80 is configured to perform a data table operation, and includes: an evaluation module 81 and an operation module 82.

An evaluation module 81, configured to evaluate, by using the data table processing apparatus 70 shown in fig. 9, resources occupied by the operation of the data table.

And the operation module 82 is used for executing the operation on the data table according to the evaluation result.

Specifically, as a possible implementation, as shown in fig. 11, the operation module 82 includes: a determining unit 821 and an executing unit 822.

A determining unit 821 for determining an execution order of at least two operations performed with respect to the data table according to the evaluation result.

Specifically, the determining unit 821 is configured to determine the execution order of the at least two operations in order of the resources the operations need to occupy, from least to most.

An executing unit 822, configured to execute the at least two operations in the determined order.
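
A minimal sketch of how the determining unit 821 and the executing unit 822 could cooperate is given below. The `Operation` record, its `estimated_resources` field, and both function names are hypothetical; the resource figures are assumed to come from the evaluation module 81.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Operation:
    name: str
    estimated_resources: float  # assumed output of the evaluation step
    run: Callable[[], None]     # callable that performs the operation


def determine_order(operations: list[Operation]) -> list[Operation]:
    """Determining unit: order the operations from least to most resources."""
    return sorted(operations, key=lambda op: op.estimated_resources)


def execute_in_order(operations: list[Operation]) -> list[str]:
    """Executing unit: run the operations in the determined order."""
    executed = []
    for op in determine_order(operations):
        op.run()
        executed.append(op.name)
    return executed
```

Python's `sorted` is stable, so operations with equal estimates keep their original relative order in this sketch.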

The size of the data table is predicted according to the calculated unique value number, the resources that an operation on the data table needs to occupy are evaluated according to the prediction result, and the operation on the data table is optimized based on those resource requirements, thereby reducing resource occupation and improving operating efficiency when the data table is operated on.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of processing a data table for predicting the size of the data table, comprising:

counting the number of data in a plurality of data sets;

directly determining the function number of the hash function set according to the counted data number;

calculating, based on the Flajolet-Martin algorithm, the unique value number of the data set by using a hash function set that conforms to the function number,

predicting the size of the data table according to the unique value number,

characterized in that the counting of the number of data in the plurality of data sets comprises:

reading the data in the data set one by one, counting the number of the read data in the data set,

the determining the function number of the hash function set according to the counted data number comprises:

as the counted number of read data increases, using, as the function number of the hash function set, a number smaller than the function number determined according to the number of data before the increase.

2. The method of claim 1, wherein the counting the number of data in the plurality of data sets comprises:

and counting all data contained in the data set.

3. The data table processing method according to claim 1, wherein before the unique value number calculation is performed on the data set, based on the Flajolet-Martin algorithm, by using a hash function set conforming to the function number, the method further comprises:

discarding some of the hash functions in the hash function set, such that the number of reserved hash functions equals the determined function number.

4. The data table processing method according to claim 3, wherein the performing of the unique value number calculation on the data set, based on the Flajolet-Martin algorithm, by using a hash function set conforming to the function number comprises:

calculating a hash value of the read data by using the hash function reserved in the hash function set;

and when all the data in the data set are read, calculating the unique value number by adopting the hash value of the hash function reserved in the hash function set based on a Flajolet-Martin algorithm.

5. A method of data table processing for evaluating data table operations, comprising:

predicting the size of the data table by using the data table processing method according to any one of claims 1 to 4;

evaluating the operation on the data table according to the predicted size of the data table, so as to determine the resources occupied by the operation.

6. The method of claim 5, wherein the operation comprises a join operation and/or a group operation.

7. A data table processing method for performing data table operations, comprising:

evaluating the resources occupied by the operation on the data table by using the data table processing method of claim 5 or 6;

performing the operation on the data table according to the evaluation result.

8. The data table processing method of claim 7, wherein the performing of the operation on the data table according to the evaluation result comprises:

determining an execution order of at least two operations performed with respect to the data table according to the evaluation result;

and executing the at least two operations according to the determined sequence.

9. The data table processing method of claim 8, wherein the determining of the execution order of at least two operations performed with respect to the data table according to the evaluation result comprises:

determining the execution order of the at least two operations in order of the resources the operations need to occupy, from least to most.

10. A data table processing apparatus for predicting a size of a data table, comprising:

a statistical module for counting the number of data in a plurality of data sets;

a determining module for directly determining the function number of the hash function set according to the counted data number;

a calculation module for calculating the unique value number of the data set by adopting a hash function set which accords with the function number based on Flajolet-Martin algorithm,

a prediction module for predicting a size of the data table based on the unique number of values,

characterized in that

the statistical module is specifically configured to read the data in the data set one by one and count the number of data read from the data set,

the determining module is specifically configured to use, as the counted number of read data increases, a number smaller than the function number of the hash function set determined according to the number of data before the increase as the function number of the determined hash function set.

11. The data table processing apparatus of claim 10, wherein

the statistical module is specifically configured to count the number of all data included in the data set.

12. The apparatus of claim 10, wherein the apparatus further comprises:

a generation module for discarding some of the hash functions in the hash function set, such that the number of reserved hash functions equals the determined function number.

13. The apparatus of claim 12, wherein the calculation module comprises:

a hash value unit for calculating a hash value for the read data by using the hash function reserved in the hash function set;

and the unique value number unit is used for calculating the unique value number by adopting the hash value of the hash function reserved in the hash function set based on a Flajolet-Martin algorithm when all data in the data set are read.

14. A data table processing apparatus for evaluating data table operations, comprising:

a prediction module for predicting the size of the data table using the data table processing apparatus of any one of claims 10 to 13;

an evaluation module for evaluating the operation on the data table according to the predicted size of the data table, so as to determine the resources occupied by the operation.

15. The apparatus of claim 14, wherein the operation comprises a join operation and/or a group operation.

16. A data table processing apparatus for performing data table operations, comprising:

an evaluation module, configured to evaluate, by using the data table processing apparatus according to claim 14 or 15, resources occupied by the operation of the data table;

and the operation module is used for executing the operation on the data table according to the evaluation result.

17. The data table processing apparatus of claim 16, wherein the operation module comprises:

a determining unit configured to determine an execution order of at least two operations performed with respect to the data table according to the evaluation result;

and the execution unit is used for executing the at least two operations according to the determined sequence.

18. The data table processing apparatus of claim 17, wherein

the determining unit is specifically configured to determine the execution order of the at least two operations in order of the resources the operations need to occupy, from least to most.