WO2017162102A1

WO2017162102A1 - Data processing method and apparatus, and data table processing method and apparatus

Info

Publication number: WO2017162102A1
Application number: PCT/CN2017/077024
Authority: WO
Inventors: 孙伟光; 徐冬; 连杰红; 汪龙重
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2016-03-25
Filing date: 2017-03-17
Publication date: 2017-09-28
Also published as: TW201737057A; CN107229663A; TWI746517B; CN107229663B

Abstract

A data processing method and apparatus, and a data table processing method and apparatus. The data processing method comprises: counting the number of data in a data set (101), determining the number of functions in a hash function set according to the number of data obtained through counting (102), and based on an FM algorithm, using a hash function set conforming to the number of functions to calculate a unique value of the data set (103), so that the scale of the hash function set matches the scale of the data set, thereby balancing the execution efficiency and the accuracy, and solving the problem in the prior art that the execution efficiency and the accuracy cannot be balanced due to the fixed scale of the hash function set. Meanwhile, the scale of a data table is predicted according to the calculated unique value, a resource needing to be occupied by a data table operation is evaluated according to a prediction result, so as to optimize the data table operation based on the condition of the resource needing to be occupied by the operation, thereby achieving the purposes of reducing the occupation of resources and improving the operation efficiency during the process of operating the data table.

Description

Data processing method and device, and data table processing method and device

The present application claims priority to Chinese Patent Application No. 201610180081.7, the entire disclosure of which is incorporated herein by reference.

Technical field

The present invention relates to computer technology, and in particular, to a data processing method and apparatus, and a data table processing method and apparatus.

Background technique

In practical applications, especially before a data table connection (join) operation, it is often necessary to count the number of non-repeated objects or events, that is, the number of distinct elements, also called the unique value, in order to predict the size of the data table. For small amounts of data, the sequence can first be sorted in memory and the ordered sequence then scanned to count the distinct elements. However, when a data stream is processed, the sequence may be very long, the range of elements may be wide, and a single element may occupy considerable memory, so the entire sequence cannot be held in memory.
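The in-memory approach for small data sets mentioned above can be sketched as follows. This is only an illustration; the function name `count_distinct_sorted` is hypothetical and does not come from the application.

```python
def count_distinct_sorted(values):
    """Exact distinct count: sort in memory, then scan adjacent elements."""
    ordered = sorted(values)
    distinct = 0
    previous = object()  # sentinel that compares unequal to every real element
    for item in ordered:
        if item != previous:
            distinct += 1
            previous = item
    return distinct


example_count = count_distinct_sorted(["e1", "e2", "e3", "e2"])
```

The whole sequence must fit in memory for the sort, which is exactly the limitation that motivates an estimation algorithm for long data streams.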

In this case, the Flajolet-Martin (FM) algorithm can be used, which is well suited to estimating the unique value. The algorithm operates with a hash function set and estimates the unique value from the hash values produced by each hash function in the set.

However, in the prior art, when the FM algorithm is applied to calculate the unique value of a column in a data table, the same hash function set is used for data sets of different sizes. As a result, when the data set is large, the unique value calculation is inefficient and takes too long to execute; when the data set is small, the accuracy of the unique value is low.

Summary of the invention

The present invention provides a data processing method and apparatus, and a data table processing method and apparatus, to solve the prior-art problem that execution efficiency and the accuracy of the unique value cannot both be ensured when the FM algorithm is used for unique value calculation.

In order to achieve the above object, embodiments of the present invention adopt the following technical solutions:

In a first aspect, a data processing method is provided, comprising:

counting the number of data items in the data set;

determining the number of functions of the hash function set according to the counted number of data items;

performing, based on the FM algorithm, a unique value calculation on the data set by using a hash function set that conforms to the number of functions.

In a second aspect, a data processing apparatus is provided, comprising:

a statistics module, configured to count the number of data items in the data set;

a determining module, configured to determine the number of functions of the hash function set according to the counted number of data items;

and a calculation module, configured to perform, based on the FM algorithm, a unique value calculation on the data set by using a hash function set that conforms to the number of functions.

In a third aspect, a data table processing method for predicting a data table size is provided, including:

processing the data table by using the data processing method described above to obtain a unique value;

predicting the size of the data table based on the unique value.

In a fourth aspect, a data table processing method for evaluating data table operations is provided, including:

predicting the size of the data table by using the data table processing method described in the third aspect;

evaluating an operation on the data table based on the predicted size of the data table to determine the resources required for the operation.

In a fifth aspect, a data table processing method for performing a data table operation is provided, including:

evaluating the resources required for an operation on the data table by using the data table processing method described in the fourth aspect;

performing the operation on the data table based on the evaluation result.

In a sixth aspect, a data table processing apparatus for predicting a data table size is provided, including:

a unique value module, configured to process the data table by using the data processing apparatus of the second aspect to obtain a unique value;

and a prediction module, configured to predict the size of the data table according to the unique value.

In a seventh aspect, a data table processing apparatus for evaluating an operation on a data table is provided, including:

a prediction module, configured to predict the size of the data table by using the data table processing apparatus described in the sixth aspect;

and an evaluation module, configured to evaluate an operation on the data table according to the predicted data table size to determine the resources required for the operation.

In an eighth aspect, a data table processing apparatus for performing a data table operation is provided, including:

an evaluation module, configured to evaluate the resources required for an operation on the data table by using the data table processing apparatus according to the seventh aspect;

and an operation module, configured to perform the operation on the data table according to the evaluation result.

With the data processing method and apparatus and the data table processing method and apparatus provided by the embodiments of the present invention, the number of data items in the data set is counted, the number of functions of the hash function set is determined according to the counted number, and a hash function set conforming to that number of functions is then used to calculate the unique value of the data set. The scale of the hash function set is thereby matched to the scale of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot be balanced because the size of the hash function set is fixed. In addition, the size of the data table is predicted according to the calculated unique value, the resources required for an operation on the data table are evaluated according to the prediction result, and the operation is then optimized based on the resources it needs to occupy, thereby reducing resource occupation and improving operation efficiency while the data table is operated on.

The above description is only an overview of the technical solutions of the present invention. So that the above and other objects, features and advantages of the present invention can be more clearly understood, specific embodiments of the invention are set forth below.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art from reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention;

FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention;

FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention;

FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention;

FIG. 7 is a schematic flowchart of still another data table processing method according to Embodiment 5 of the present invention;

FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 according to Embodiment 6 of the present invention;

FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 according to Embodiment 6 of the present invention;

FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 according to Embodiment 6 of the present invention;

FIG. 11 is a schematic structural diagram of another data table processing apparatus 80 according to Embodiment 6 of the present invention.

Detailed description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be more thoroughly understood and its scope fully conveyed to those skilled in the art.

The data processing method and apparatus and the data table processing method and apparatus provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Embodiment 1

FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention. As shown in FIG. 1, the method includes:

Step 101: Perform statistics on the number of data in the data set.

As a possible implementation, the data is first traversed to count it, and the number of functions in the hash function set is determined according to the statistical result after the traversal. Specifically, before the unique value of the data set is calculated, all the data in the data set may be traversed and the total number of data items in the data set counted, and the number of functions in the hash function set is then determined according to the statistical result. The data set referred to here may be the set of all the data included in the same column in each data table. When the unique value of the data set is calculated, all the data in the data set is traversed again, the hash values of all the data are calculated with the determined hash functions, and the hash values are processed based on the FM algorithm to estimate the unique value.

As another possible implementation, the number of functions in the hash function set is continuously adjusted according to the current statistical result while the data is being traversed. Specifically, during the traversal of the data set, the number of functions in the hash function set may be adjusted, and the hash values of the traversed data are calculated with the hash functions in the adjusted hash function set. After all the data in the data set has been traversed, the unique value is calculated by the FM algorithm from the hash values of the hash functions in the hash function set determined by the last adjustment. In practice, the data in the data set can be read one item at a time while the number of items already read is counted. Each time an item is read, the steps of determining the number of functions in the hash function set and substituting the read item into the hash functions of the determined set to calculate the function values are performed.

In an application scenario in which a data table is connected to multiple data tables, a unique value may be calculated for the data of one or more columns in the data tables to be connected, thereby predicting the size of the connection. In big data processing, each data table usually contains tens of thousands of data records or more, so the amount of data is large. Predicting the size of a large-scale data table connection is therefore a necessary preparation for performing that connection.

In the first implementation, the data set needs to be traversed twice. Because the data tables in a connection scenario contain a large amount of data, two traversals require more computing resources and computing time. Therefore, when the data set contains a large amount of data, the second implementation is preferred, since it improves operation efficiency by reducing the number of times the data set is traversed.

Step 102: Determine the number of functions of the hash function set according to the number of data obtained by the statistics.

FM is an efficient algorithm for calculating the unique value. The size of the hash function set used by the algorithm has a very important influence on both the accuracy and the execution efficiency of the unique value calculation. On the one hand, a hash function set that is too small yields higher execution efficiency but lower accuracy; on the other hand, a hash function set that is too large yields higher accuracy but lower execution efficiency. It can be seen that the scale of the hash function set used by FM should match the size of the data set.

In practical applications, matching can be performed in the following manner, for example, the number of data in the data set is N, and the number of functions in the hash function set is H;

When N<100,000, H=1024;

When 100,000 ≤ N < 1,000,000, H = 512;

When 1,000,000 ≤ N < 10,000,000, H = 256;

When 10,000,000 ≤ N < 100,000,000, H = 128;

When N≥100,000,000, H=64.
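The mapping from N to H listed above can be sketched as a small lookup. The thresholds and counts are the example values given in the text; the names `THRESHOLDS`, `HASH_COUNTS` and `choose_hash_count` are hypothetical.

```python
from bisect import bisect_right

# Example boundaries for N and the matching H values from the text.
THRESHOLDS = (100_000, 1_000_000, 10_000_000, 100_000_000)
HASH_COUNTS = (1024, 512, 256, 128, 64)


def choose_hash_count(n):
    """Return the number of hash functions H for a data set of n items."""
    # bisect_right counts how many boundaries are <= n, which selects the band.
    return HASH_COUNTS[bisect_right(THRESHOLDS, n)]
```

Keeping the boundaries in a table rather than in nested conditions makes it straightforward to tune the bands for a particular workload.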

Step 103: Perform a unique value calculation on the data set by using a hash function set that meets the number of functions based on an FM algorithm.

In practical applications, in order to reduce the error and improve the accuracy, a series of hash functions H1, H2, H3, ... is usually used to calculate hash values for all the data in the data set. Then, according to the FM algorithm, for each hash function, the maximum value MAX of the length of the all-zero bit sequence at the tail of the binary representation of its hash values is determined.

For example, given a data set {e1, e2, e3, e2}, let the function TailZero(x) return the number of consecutive zeros at the tail of the binary representation of a positive integer x, and let the hash function H(e) be applied to the data in the data set. The repeated element e2 hashes to the same value both times, so the hash values obtained are:

H(e1) = 2 = (0010)_2, TailZero(H(e1)) = 1

H(e2) = 8 = (1000)_2, TailZero(H(e2)) = 3

H(e3) = 10 = (1010)_2, TailZero(H(e3)) = 1

Then, MAX=MAX(1,3,1)=3.
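The worked example above can be reproduced with a short sketch. The hash values are taken directly from the example rather than computed by a real hash function, and the helper name `tail_zero` mirrors TailZero(x) from the text.

```python
def tail_zero(x):
    """Number of consecutive zero bits at the tail of a positive integer x."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    # x & -x isolates the lowest set bit; its position is the zero count.
    return (x & -x).bit_length() - 1


# Hash values from the example: H(e1) = 2, H(e2) = 8, H(e3) = 10.
example_hashes = {"e1": 2, "e2": 8, "e3": 10}
data_set = ["e1", "e2", "e3", "e2"]

max_zeros = max(tail_zero(example_hashes[e]) for e in data_set)
```

Because the duplicate e2 contributes the same hash value twice, it cannot change the maximum, which is why the statistic depends only on the distinct elements.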

Further, if the maximum value for hash function H1 is Max1, a series of maximum values Max1, Max2, Max3, ... can be obtained in the same way for the other hash functions. Then, based on the formula that the estimate of the unique value equals 2 to the power of MAX, a series of estimates 2^Max1, 2^Max2, 2^Max3, ... is obtained, and these estimates are finally combined to obtain the final estimate. Specifically, A×B mutually different hash functions can first be designed and divided into A groups of B hash functions each; the B hash functions in each group are used to calculate B estimates; the arithmetic mean of the B estimates is taken as the estimate of the group; and finally the median of the group estimates is selected as the final estimate.
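The grouping step described above (arithmetic mean inside each group, median across groups) can be sketched as follows. The function name `combine_estimates` and the sample MAX values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, median


def combine_estimates(max_values, group_size):
    """Combine per-function MAX values into one estimate of the unique value."""
    if not max_values or len(max_values) % group_size:
        raise ValueError("need a non-empty list that splits into whole groups")
    estimates = [2 ** m for m in max_values]           # one estimate per function
    group_means = [
        mean(estimates[i:i + group_size])              # arithmetic mean per group
        for i in range(0, len(estimates), group_size)
    ]
    return median(group_means)                         # median across the groups


# A = 3 groups of B = 2 functions: estimates are [8, 4], [16, 8], [32, 2].
final_estimate = combine_estimates([3, 2, 4, 3, 5, 1], group_size=2)
```

Here the group means are 6, 12 and 17, so the median, 12, is the final estimate. Averaging inside a group smooths the power-of-two granularity, while the median across groups limits the influence of a group dominated by one unusually large MAX.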

In this embodiment, after the number of data items in the data set is counted, the number of functions of the hash function set is determined according to the counted number, and a hash function set conforming to that number of functions is then used to calculate the unique value of the data set. The size of the hash function set is thereby matched to the size of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot be balanced because the size of the hash function set is fixed.
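As a rough end-to-end illustration of steps 101 to 103 in the two-traversal form, the following sketch may help. Several parts are assumptions not stated in the application: the hash family (BLAKE2b salted with the function index), the 64-bit hash width, the group size of 8, and all function names. No bias correction is applied, so the result is only the raw estimate described above.

```python
import hashlib
from bisect import bisect_right
from statistics import mean, median

THRESHOLDS = (100_000, 1_000_000, 10_000_000, 100_000_000)
HASH_COUNTS = (1024, 512, 256, 128, 64)


def hash_value(item, index):
    """Hypothetical i-th hash function: BLAKE2b salted with the index."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8,
                             salt=index.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def tail_zero(x):
    """Trailing zero bits of x; a zero hash is treated as 64 zeros."""
    return (x & -x).bit_length() - 1 if x else 64


def two_pass_estimate(data, group_size=8):
    """Return (H, estimate) for a re-iterable data set."""
    n = sum(1 for _ in data)                            # step 101: first traversal
    h = HASH_COUNTS[bisect_right(THRESHOLDS, n)]        # step 102: pick H from N
    max_zeros = [0] * h
    for item in data:                                   # step 103: second traversal
        for i in range(h):
            max_zeros[i] = max(max_zeros[i], tail_zero(hash_value(item, i)))
    estimates = [2 ** m for m in max_zeros]
    group_means = [mean(estimates[i:i + group_size])
                   for i in range(0, h, group_size)]
    return h, median(group_means)


sample = [f"e{i % 20}" for i in range(100)]             # 20 distinct values
hash_count, estimate = two_pass_estimate(sample)
```

Only the running maximum per hash function is kept here, which is sufficient for the FM estimate and avoids storing every hash value.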

Embodiment 2

In order to clarify how, as mentioned in the previous embodiment, the hash function set is adjusted according to the current statistical result while the data is being traversed, this embodiment provides a specific execution flow. FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention. As shown in FIG. 2, the method includes:

Step 201: Set an initial hash function set in advance.

Specifically, the number of functions of the preset initial hash function set may be a maximum value, for example, the number of functions may be preset to 1024.

Step 202: Read one data item in the data set, and count the number of data items that have been read.

The data items in the data set are read in sequence, and steps 202-206 are performed each time one data item is read.

Step 203: Determine the number of functions according to the number of data that has been read.

For example:

When the number of data that has been read reaches 100,000, the number of functions is determined to be 512;

When the number of data that has been read reaches 1,000,000, the number of functions is determined to be 256;

When the number of data that has been read reaches 10,000,000, the number of functions is determined to be 128;

When the number of data that has been read reaches 100,000,000, the number of functions is determined to be 64.

It can be seen that, while the data in the data set is being read, the number of functions in the hash function set is continuously reduced as more data is read. When the data set is small, a larger hash function set is used, which improves calculation accuracy; when the data set is large, a smaller hash function set is used, which improves calculation efficiency. In this way, the size of the hash function set is matched to the size of the data set.

Step 204: Determine whether the determined number of functions is less than the number of functions in the current hash function set. If yes, execute step 205, otherwise perform step 206.

Step 205: Remove hash functions from the hash function set until the set contains the determined number of functions.

Step 206: Perform hash calculation on the currently read data by using the current hash function set and store the obtained hash value.

Steps 202-206 are repeated until all data in the data set has been read.
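Steps 201 to 206 can be sketched as a single-traversal loop. The sketch rests on assumptions that are not part of the application: the hash family is BLAKE2b salted with the function index, the functions retained after a reduction are the lowest-indexed ones, and, as a simplification, only the running maximum tail-zero count per function is kept instead of every stored hash value. The names `SCHEDULE`, `hash_value`, `tail_zero` and `stream_max_zeros` are hypothetical.

```python
import hashlib

# (items read, number of functions) pairs from the example in step 203.
SCHEDULE = ((100_000, 512), (1_000_000, 256),
            (10_000_000, 128), (100_000_000, 64))


def hash_value(item, index):
    """Hypothetical i-th hash function: BLAKE2b salted with the index."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8,
                             salt=index.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def tail_zero(x):
    """Trailing zero bits of x; a zero hash is treated as 64 zeros."""
    return (x & -x).bit_length() - 1 if x else 64


def stream_max_zeros(data, initial_count=1024, schedule=SCHEDULE):
    """Return the per-function MAX values for the functions finally retained."""
    max_zeros = [0] * initial_count            # step 201: initial set of functions
    read = 0
    for item in data:
        read += 1                              # step 202: read and count
        target = len(max_zeros)
        for threshold, count in schedule:      # step 203: determine the number
            if read >= threshold:
                target = min(target, count)
        if target < len(max_zeros):            # step 204: fewer than current?
            del max_zeros[target:]             # step 205: drop surplus functions
        for i in range(len(max_zeros)):        # step 206: hash with current set
            zeros = tail_zero(hash_value(item, i))
            if zeros > max_zeros[i]:
                max_zeros[i] = zeros
    return max_zeros
```

For instance, `stream_max_zeros(range(50), initial_count=8, schedule=((10, 4), (30, 2)))` ends with two retained functions, whose MAX values still reflect all 50 items because those two functions were applied from the first item onward.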

Step 207: When hash values have been calculated for all data in the data set, calculate the unique value by using the FM algorithm according to the hash values of the hash functions in the finally determined hash function set.

Because the hash function set is continuously adjusted while the data in the data set is being read, the hash function set determined by the last adjustment is the one used for the estimate when the unique value is calculated. Data read earlier was hashed with more hash functions, and some of those hash functions no longer exist in the hash function set determined by the last adjustment, so their hash values are invalid. Therefore, only the hash values of the hash functions in the hash function set determined by the last adjustment are used when the unique value is estimated.

For example, the final hash function set retains 64 hash functions H1-H64.

For each data item, the hash values of hash functions H1-H64 are selected from all the stored hash values. Further, for each hash function, the maximum value MAX of the length of the all-zero bit sequence at the tail of the binary representation of its hash values is determined from the hash values of that hash function. According to the preset grouping H1-H8; H9-H16; H17-H24; H25-H32; H33-H40; H41-H48; H49-H56; H57-H64, the average of the maximum values MAX in each group is calculated, and the median of the group averages is taken as the estimated value R. The unique value is then estimated as 2 to the power of R.
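The final estimation described for this embodiment (average the MAX values within each group of eight functions, take the median of the group averages as R, and report 2 to the power of R) can be sketched as follows. The function name `estimate_from_max_zeros` and the sample inputs are hypothetical.

```python
from statistics import mean, median


def estimate_from_max_zeros(max_zeros, group_size=8):
    """Estimate the unique value as 2**R, R being the median group average."""
    if not max_zeros or len(max_zeros) % group_size:
        raise ValueError("need a non-empty list that splits into whole groups")
    group_averages = [
        mean(max_zeros[i:i + group_size])      # average MAX within each group
        for i in range(0, len(max_zeros), group_size)
    ]
    r = median(group_averages)                 # median of the group averages
    return 2 ** r


# 64 retained functions in 8 groups; half the groups saw MAX 2, half saw MAX 4.
example_estimate = estimate_from_max_zeros([2] * 32 + [4] * 32)
```

In this example the group averages are 2, 2, 2, 2, 4, 4, 4, 4, so R is 3 and the estimate is 8. Note that this flow averages the exponents before exponentiating, whereas the grouping described in Embodiment 1 averages the estimates 2^MAX themselves.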

Embodiment 3

FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention. As shown in FIG. 3, the apparatus includes: a statistics module 31, a determining module 32, and a calculating module 33.

The statistics module 31 is configured to perform statistics on the number of data in the data set.

Specifically, the statistics module 31 is configured to count the total number of data items included in the data set.

The determining module 32 is configured to determine the number of functions of the hash function set according to the number of data obtained by the statistics.

The calculating module 33 is configured to perform a unique value calculation on the data set by using a hash function set that matches the number of functions based on an FM algorithm.

The data processing apparatus first traverses the data to count it, and then determines the number of functions in the hash function set according to the statistical result. Specifically, before the unique value of the data set is calculated, all the data in the data set may be traversed and the total number of data items in the data set counted, and the number of functions in the hash function set is then determined according to the statistical result. The data set referred to here may be the set of all the data included in the same column in each data table. When the unique value of the data set is calculated, all the data in the data set is traversed again, the hash values of all the data are calculated with the determined hash functions, and the hash values are processed based on the FM algorithm to estimate the unique value.

After the number of data items in the data set is counted, the number of functions of the hash function set is determined according to the counted number, and a hash function set conforming to that number of functions is then used to calculate the unique value of the data set. The size of the hash function set is thereby matched to the size of the data set, which balances execution efficiency and accuracy and solves the prior-art problem that execution efficiency and accuracy cannot be balanced because the size of the hash function set is fixed.

Embodiment 4

FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention.

In the apparatus provided in this embodiment, the statistic module 31 is specifically configured to read data in a data set one by one, and count the number of data that has been read in the data set.

The determining module 32 is specifically configured to gradually reduce the number of functions of the hash function set according to the counted number of read data.

As shown in FIG. 4, on the basis of the previous embodiment, the data processing apparatus further includes:

The generating module 34 is configured to discard the hash function in the hash function set, and the number of reserved hash functions is the determined number of functions.

The calculation module 33 further includes: a hash value unit 331 and a unique value unit 332.

The hash value unit 331 is configured to calculate a hash value for the read data by using a hash function retained in the hash function set.

The unique value unit 332 is configured to calculate the unique value by using the hash value of the hash function retained in the hash function set based on the FM algorithm when reading all the data in the data set.

While the data is being traversed, the number of functions in the hash function set is continuously adjusted according to the current statistical result. Specifically, during the traversal of the data set, the number of functions in the hash function set may be adjusted, and the hash values of the traversed data are calculated with the hash functions in the adjusted hash function set. After all the data in the data set has been traversed, the unique value is calculated by the FM algorithm from the hash values of the hash functions in the hash function set determined by the last adjustment. In practice, the data in the data set can be read one item at a time while the number of items already read is counted. Each time an item is read, the steps of determining the number of functions in the hash function set and substituting the read item into the hash functions of the determined set to calculate the function values are performed.

It can be seen that, while the data in the data set is being read, the number of functions in the hash function set is continuously reduced as more data is read. When the data set is small, a larger hash function set is used, which improves calculation accuracy; when the data set is large, a smaller hash function set is used, which improves calculation efficiency. In this way, the size of the hash function set is matched to the size of the data set.

It should be noted that the apparatuses provided in Embodiment 3 and Embodiment 4 are used to implement the data processing flows shown in FIG. 1 and FIG. 2, respectively. For the functions of the functional modules of the data processing apparatuses in Embodiment 3 and Embodiment 4, refer to the related descriptions in the foregoing method embodiments; they are not repeated here.

Embodiment 5

On the basis of Embodiment 1 or Embodiment 2, Embodiment 5 provides a data table processing method for optimizing operations on a data table, such as a connection operation or a grouping operation, thereby reducing resource occupation and improving operation efficiency. The resources mentioned here may be resources consumed in performing the operations, such as CPU or memory.

FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention, which is used to predict a data table size, including:

Step 501: The data processing method of Embodiment 1 or Embodiment 2 is used to process the data table to obtain a unique value.

Step 502: Predict the size of the data table according to the unique number of values.

In a possible application scenario, the data table processing method provided in FIG. 5 can be used to implement the size prediction of the data table, and the required resources can be conveniently allocated to the data table according to the predicted size.

FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention. The data table processing method is used to evaluate data table operations. After step 501 in the method provided in FIG. 5, the method further includes:

Step 503: Evaluate the operation of the data table according to the predicted data table size to determine resources required for the operation.

In a possible application scenario, data table A and data table B are to be connected. Based on the data table processing method provided in FIG. 6, the size of each data table may first be predicted, and the resources required to connect data tables A and B may then be evaluated to facilitate the allocation of resources.

FIG. 7 is a schematic flowchart of still another method for processing a data table according to Embodiment 5 of the present invention. The processing method is used to perform a data table operation. After step 503 in the method provided in FIG. 6, the method further includes:

Step 504: Perform an operation on the data table according to the evaluation result.

Specifically, the execution order of at least two operations to be performed on the data table is determined according to the evaluation result. For example, the at least two operations may be ordered so that operations requiring fewer resources are performed first. The at least two operations are then performed in the determined order.

In a possible application scenario, data table A, data table B, and data table C are to be connected. Based on the data table processing method provided in FIG. 6, the size of each data table may first be predicted, and the resources required to connect data tables A and B, data tables A and C, and data tables B and C are then evaluated, so as to select the connection operation that occupies the least resources. According to the evaluation result, the two smaller data tables A and B can be connected first, so that this connection occupies fewer resources, and the larger data table C is connected afterwards, so that the total amount of resources occupied is minimized.
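The choice of which pair of tables to connect first can be sketched as follows. The application does not define how resources are estimated from the predicted sizes, so the default cost function here (the sum of the two predicted sizes) is purely a placeholder assumption, as are the table sizes and the name `cheapest_first_connection`.

```python
from itertools import combinations


def cheapest_first_connection(sizes, cost=lambda a, b: a + b):
    """Return the pair of tables whose connection has the lowest estimated cost.

    sizes maps table names to predicted sizes; cost is a hypothetical
    resource model that takes the two predicted sizes of a candidate pair.
    """
    return min(
        combinations(sorted(sizes), 2),
        key=lambda pair: cost(sizes[pair[0]], sizes[pair[1]]),
    )


# Hypothetical predicted sizes for tables A, B and C.
predicted = {"A": 1_000, "B": 5_000, "C": 200_000}
first_pair = cheapest_first_connection(predicted)
remaining = [name for name in predicted if name not in first_pair]
```

With these sizes the pair (A, B) is selected first and C is connected afterwards, matching the scenario described above. A different resource model can be supplied through the `cost` parameter without changing the selection logic.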

It can be seen that the size of the data table is predicted according to the calculated unique value, the resources required for operations on the data table are evaluated according to the prediction result, and the order of the data table operations is then optimized based on the resources those operations need to occupy. This reduces resource occupation and improves operation efficiency while the data table is operated on.

Embodiment 6

FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 according to Embodiment 6 of the present invention. The data table processing apparatus 60 is configured to predict a data table size, and includes: a unique value module 61 and a prediction module 62.

The unique value module 61 is configured to process the data table using the data processing apparatus shown in FIG. 3 or FIG. 4 to obtain a unique number of values.

The prediction module 62 is configured to predict a size of the data table according to the unique number of values.

FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 according to Embodiment 6 of the present invention. The data table processing apparatus 70 is configured to evaluate a data table operation, and includes: a prediction module 71 and an evaluation module 72.

The prediction module 71 is configured to predict the size of the data table by using the data table processing apparatus 60 shown in FIG. 8.

The evaluation module 72 is configured to evaluate the operation of the data table according to the predicted data table size to determine resources required for the operation.

The operations include connection operations and/or grouping operations.

FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 according to Embodiment 6 of the present invention. The data table processing apparatus 80 is configured to perform data table operations, and includes: an evaluation module 81 and an operation module 82.

The evaluation module 81 is configured to evaluate the resources required for an operation on the data table by using the data table processing apparatus 70 shown in FIG. 9.

The operation module 82 is configured to perform the operation on the data table according to the evaluation result.

Specifically, as a possible implementation manner, as shown in FIG. 11, the operation module 82 includes: a determining unit 821 and an executing unit 822.

The determining unit 821 is configured to determine an execution order of at least two operations performed on the data table according to the evaluation result.

Specifically, the determining unit 821 is configured to determine the execution order of the at least two operations in ascending order of the resources that the operations occupy.

The executing unit 822 is configured to perform the at least two operations in the determined order.
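
The cooperation of the determining unit 821 and the executing unit 822 can be sketched as follows. The class `PlannedOperation`, the resource figures, and the placeholder callables are hypothetical and serve only to show operations being ordered by ascending estimated resources and then executed in that order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlannedOperation:
    name: str
    estimated_resources: float  # result of the evaluation step; units are hypothetical
    run: Callable[[], str]      # placeholder for the actual data table operation


def determine_order(operations: List[PlannedOperation]) -> List[PlannedOperation]:
    """Role of the determining unit: ascending order of required resources."""
    return sorted(operations, key=lambda op: op.estimated_resources)


def execute_in_order(operations: List[PlannedOperation]) -> List[str]:
    """Role of the executing unit: run the operations in the determined order."""
    return [op.run() for op in operations]


operations = [
    PlannedOperation("connect_AB_with_C", 502_000, lambda: "connected AB with C"),
    PlannedOperation("group_A", 1_000, lambda: "grouped A"),
    PlannedOperation("connect_A_with_B", 3_000, lambda: "connected A with B"),
]
ordered = determine_order(operations)
results = execute_in_order(ordered)
```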

By predicting the size of the data table according to the calculated number of unique values, evaluating the resources required for an operation on the data table according to the prediction result, and then optimizing the operation based on the resources it requires, resource occupation is reduced and operation efficiency is improved when operating on the data table.

One of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments described above may be accomplished by hardware controlled by program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A data processing method, comprising:

counting the number of data items in a data set;

determining the number of functions in a hash function set according to the counted number of data items; and

calculating, based on an FM algorithm, the number of unique values of the data set by using a hash function set having the determined number of functions.
The data processing method according to claim 1, wherein the counting the number of data items in the data set comprises:

counting the total number of data items contained in the data set.
The data processing method according to claim 1, wherein the counting the number of data items in the data set comprises:

reading the data items in the data set one by one, and counting the number of data items in the data set that have been read.
The data processing method according to claim 3, wherein the determining the number of functions in the hash function set according to the counted number of data items comprises:

gradually reducing the number of functions in the hash function set as the number of data items that have been read increases.
The data processing method according to claim 4, wherein the calculating, based on the FM algorithm, the number of unique values of the data set by using a hash function set having the determined number of functions comprises:

discarding hash functions from the hash function set, such that the number of retained hash functions equals the determined number of functions.
The data processing method according to claim 5, wherein the calculating, based on the FM algorithm, the number of unique values of the data set by using a hash function set having the determined number of functions comprises:

calculating hash values for the data items that have been read by using the hash functions retained in the hash function set; and

when all the data items in the data set have been read, calculating, based on the FM algorithm, the number of unique values by using the hash values of the hash functions retained in the hash function set.
A data processing apparatus, comprising:

a statistics module, configured to count the number of data items in a data set;

a determining module, configured to determine the number of functions in a hash function set according to the counted number of data items; and

a calculation module, configured to calculate, based on an FM algorithm, the number of unique values of the data set by using a hash function set having the determined number of functions.
The data processing apparatus according to claim 7, wherein

the statistics module is specifically configured to count the total number of data items contained in the data set.
The data processing apparatus according to claim 7, wherein

the statistics module is specifically configured to read the data items in the data set one by one, and to count the number of data items in the data set that have been read.
The data processing apparatus according to claim 9, wherein

the determining module is specifically configured to gradually reduce the number of functions in the hash function set as the counted number of data items that have been read increases.
The data processing apparatus according to claim 10, wherein the apparatus further comprises:

a generating module, configured to discard hash functions from the hash function set, such that the number of retained hash functions equals the determined number of functions.
The data processing apparatus according to claim 11, wherein the calculation module comprises:

a hash value unit, configured to calculate hash values for the data items that have been read by using the hash functions retained in the hash function set; and

a unique value unit, configured to calculate, based on the FM algorithm, the number of unique values by using the hash values of the hash functions retained in the hash function set when all the data items in the data set have been read.
A data table processing method for predicting a data table size, comprising:

processing a data table by using the data processing method according to any one of claims 1 to 6 to obtain the number of unique values; and

predicting the size of the data table according to the number of unique values.
A data table processing method for evaluating data table operations, comprising:

predicting the size of a data table by using the data table processing method according to claim 13; and

evaluating an operation on the data table according to the predicted size of the data table to determine the resources required for the operation.
The data table processing method according to claim 14, wherein the operation comprises a connection operation and/or a grouping operation.
A data table processing method for performing data table operations, comprising:

evaluating the resources required for an operation on a data table by using the data table processing method according to claim 14 or 15; and

performing the operation on the data table according to the evaluation result.
The data table processing method according to claim 16, wherein the performing the operation on the data table according to the evaluation result comprises:

determining, according to the evaluation result, an execution order of at least two operations to be performed on the data table; and

performing the at least two operations in the determined order.
The data table processing method according to claim 17, wherein the determining, according to the evaluation result, an execution order of at least two operations to be performed on the data table comprises:

determining the execution order of the at least two operations in ascending order of the resources required for the operations.
A data table processing apparatus for predicting a data table size, comprising:

a unique value module, configured to process a data table by using the data processing apparatus according to any one of claims 7 to 12 to obtain the number of unique values; and

a prediction module, configured to predict the size of the data table according to the number of unique values.
A data table processing apparatus for evaluating data table operations, comprising:

a prediction module, configured to predict the size of a data table by using the data table processing apparatus according to claim 19; and

an evaluation module, configured to evaluate an operation on the data table according to the predicted size of the data table to determine the resources required for the operation.
The data table processing apparatus according to claim 20, wherein said operation comprises a connection operation and/or a grouping operation.
A data table processing apparatus for performing data table operations, comprising:

an evaluation module, configured to evaluate the resources required for an operation on a data table by using the data table processing apparatus according to claim 20 or 21; and

an operation module, configured to perform the operation on the data table according to the evaluation result.
The data table processing apparatus according to claim 22, wherein the operation module comprises:

a determining unit, configured to determine, according to the evaluation result, an execution order of at least two operations to be performed on the data table; and

an execution unit, configured to perform the at least two operations in the determined order.
The data table processing apparatus according to claim 23, wherein

the determining unit is specifically configured to determine the execution order of the at least two operations in ascending order of the resources occupied by the operations.