CN107229663B - Data processing method and device and data table processing method and device - Google Patents


Info

Publication number
CN107229663B
CN107229663B CN201610180081.7A CN201610180081A CN107229663B CN 107229663 B CN107229663 B CN 107229663B CN 201610180081 A CN201610180081 A CN 201610180081A CN 107229663 B CN107229663 B CN 107229663B
Authority
CN
China
Prior art keywords
data
data table
hash function
hash
functions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610180081.7A
Other languages
Chinese (zh)
Other versions
CN107229663A (en)
Inventor
孙伟光
徐冬
连杰红
汪龙重
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610180081.7A (patent CN107229663B)
Priority to TW106105362A (patent TWI746517B)
Priority to PCT/CN2017/077024 (patent WO2017162102A1)
Publication of CN107229663A
Application granted
Publication of CN107229663B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device and a data table processing method and device. After the number of data in a data set is counted, the number of functions in a hash function set is determined from that count, and the unique value number of the data set is then calculated using a hash function set having that number of functions. The scale of the hash function set is thereby matched to the scale of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that the two cannot both be achieved when the scale of the hash function set is fixed. In addition, the scale of a data table is predicted from the calculated unique value number, the resources that an operation on the data table will occupy are evaluated from the prediction result, and the operation is optimized on the basis of that evaluation, so as to reduce resource occupation and improve efficiency when operating on the data table.

Description

Data processing method and device and data table processing method and device
Technical Field
The present invention relates to computer technologies, and in particular, to a data processing method and apparatus, and a data table processing method and apparatus.
Background
In practical applications, especially before performing a data table join operation, it is often necessary to count the number of distinct objects or events, that is, the number of independent elements, also called the unique value number, so as to predict the size of the data table. For smaller data volumes, the sequence can first be sorted in memory and then scanned to count the independent elements. However, when a data stream is processed, the sequence is very long, the range of element values may be wide, and a single element may occupy considerable memory, so the memory cannot hold the whole sequence.
In this situation, the Flajolet-Martin (FM) algorithm can be adopted, which is well suited to estimating the unique value number. The algorithm operates with a set of hash functions and estimates the unique value number from the hash values produced by each hash function in the set.
However, in the prior art, when the FM algorithm is applied to calculate the unique value number of a column in a data table, the same hash function set is used for data sets of different scales. As a result, when the data set is large, the calculation is often inefficient and takes too long; when the data set is small, the accuracy of the unique value number is low.
Disclosure of Invention
The invention provides a data processing method and device and a data table processing method and device, which are used for solving the problem that the accuracy of a unique value number cannot be ensured while the execution efficiency is ensured when an FM algorithm is adopted for calculating the unique value number in the prior art.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
In a first aspect, a data processing method is provided, including:
counting the number of data in the data set;
determining the number of functions of the hash function set according to the counted number of data;
and, based on an FM algorithm, calculating the unique value number of the data set by using a hash function set having that number of functions.
In a second aspect, a data processing apparatus is provided, including:
a statistics module, configured to count the number of data in the data set;
a determining module, configured to determine the number of functions of the hash function set according to the counted number of data;
and a calculation module, configured to calculate, based on an FM algorithm, the unique value number of the data set by using a hash function set having that number of functions.
In a third aspect, a data table processing method for predicting the size of a data table is provided, including:
processing the data table by using the data processing method of the first aspect to obtain a unique value number;
and predicting the size of the data table according to the unique value number.
In a fourth aspect, a data table processing method for evaluating data table operations is provided, including:
predicting the scale of the data table by using the data table processing method of the third aspect;
and evaluating the operation on the data table according to the predicted scale of the data table to determine the resources occupied by the operation.
In a fifth aspect, a data table processing method for performing a data table operation is provided, including:
evaluating the resources occupied by the operation on the data table by using the data table processing method of the fourth aspect;
and executing the operation on the data table according to the evaluation result.
In a sixth aspect, a data table processing apparatus for predicting the size of a data table is provided, including:
a unique value module, configured to process the data table by using the data processing apparatus of the second aspect to obtain a unique value number;
and a prediction module, configured to predict the scale of the data table according to the unique value number.
In a seventh aspect, a data table processing apparatus for evaluating data table operations is provided, including:
a prediction module, configured to predict the scale of the data table by using the data table processing apparatus of the sixth aspect;
and an evaluation module, configured to evaluate the operation on the data table according to the predicted scale of the data table so as to determine the resources occupied by the operation.
In an eighth aspect, a data table processing apparatus for performing a data table operation is provided, including:
an evaluation module, configured to evaluate, by using the data table processing apparatus of the seventh aspect, the resources occupied by the operation on the data table;
and an operation module, configured to execute the operation on the data table according to the evaluation result.
According to the data processing method and device and the data table processing method and device provided by the embodiments of the invention, after the number of data in the data set is counted, the number of functions of the hash function set is determined from that count, and a hash function set having that number of functions is then used to calculate the unique value number of the data set. The scale of the hash function set is thereby matched to the scale of the data set, balancing execution efficiency and accuracy, and solving the prior-art problem that the two cannot both be achieved when the scale of the hash function set is fixed. In addition, the scale of the data table is predicted from the calculated unique value number, the resources occupied by an operation on the data table are evaluated from the prediction result, and the operation is optimized on the basis of that evaluation, so as to reduce resource occupation and improve efficiency when operating on the data table.
The above description is only an overview of the technical solutions of the present invention. The detailed description that follows is provided so that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the invention become more apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a data processing method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another data processing apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic flowchart of a data table processing method according to a fifth embodiment of the present invention;
Fig. 6 is a schematic flowchart of another data table processing method according to a fifth embodiment of the present invention;
Fig. 7 is a schematic flowchart of a further data table processing method according to a fifth embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a data table processing apparatus 60 according to a sixth embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a data table processing apparatus 70 according to a sixth embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a data table processing apparatus 80 according to a sixth embodiment of the present invention;
Fig. 11 is a schematic structural diagram of another data table processing apparatus 80 according to a sixth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The following describes in detail a data processing method and apparatus and a data table processing method and apparatus provided by an embodiment of the present invention with reference to the accompanying drawings.
Example one
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention, as shown in fig. 1, including:
step 101, counting the number of data in the data set.
As one possible implementation, the data are counted while being traversed, and the number of functions in the hash function set is determined from the count once the traversal is complete. Specifically, before the unique value number is calculated for the data set, all data in the data set are traversed to count the total number of data, and the number of functions in the hash function set is determined from that count. The data set referred to here may be the set of all data included in the same column of each data table. When the unique value number is calculated, all data in the data set are traversed again, the hash values of all data are calculated with the determined hash functions, and the hash values are processed according to the FM algorithm to estimate the unique value number.
As another possible implementation, the number of functions in the hash function set is adjusted continuously during the traversal according to the count so far. Specifically, the data in the data set are read one by one, and the number of data read so far is counted. After each piece of data is read, the number of functions in the hash function set is determined, and the data is substituted into each hash function in the adjusted set to calculate its hash values. After all data in the data set have been traversed, the unique value number is calculated with the FM algorithm from the hash values of the hash functions in the set determined by the last adjustment.
In an application scenario in which several data tables are joined, the unique value number can be calculated for one or more columns of the tables to be joined, so that the scale of the join can be predicted. In big data processing, each data table usually contains tens of thousands of data records, so the data volume is large; predicting the scale of a large join makes it possible to carry out the preparation that such a join requires.
The first implementation traverses the data set twice. Because the data tables in a join scenario are large, two traversals occupy more computing resources and computing time. Therefore, when the data set is large, the second implementation is preferable: it improves efficiency by reducing the number of traversals of the data set.
Step 102, determining the number of functions of the hash function set according to the counted number of data.
FM is an efficient algorithm for calculating the unique value number, and the size of its hash function set strongly influences both the accuracy of the result and the execution efficiency. A hash function set that is too small gives high execution efficiency but low accuracy; one that is too large gives high accuracy but low execution efficiency. The size of the FM hash function set should therefore match the size of the data set.
In practical applications, the matching can be performed, for example, as follows, where N denotes the number of data in the data set and H denotes the number of functions in the hash function set:
when N is less than 100,000, H is 1024;
when N is more than or equal to 100,000 and less than 1,000,000, H is 512;
when N is more than or equal to 1,000,000 and less than 10,000,000, H is 256;
when N is more than or equal to 10,000,000 and less than 100,000,000, H is 128;
when N is more than or equal to 100,000,000, H is 64.
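The example mapping from N to H above can be written as a small lookup. The following Python sketch is illustrative only: the names `THRESHOLDS`, `MIN_FUNCTIONS`, and `function_count` are assumptions introduced here, and only the threshold values come from the text.

```python
# (exclusive upper bound on N, number of hash functions H), as listed above.
THRESHOLDS = [
    (100_000, 1024),
    (1_000_000, 512),
    (10_000_000, 256),
    (100_000_000, 128),
]
MIN_FUNCTIONS = 64  # used when N >= 100,000,000


def function_count(n: int) -> int:
    """Return H, the number of hash functions, for a data set of n items."""
    if n < 0:
        raise ValueError("n must be non-negative")
    for upper_bound, h in THRESHOLDS:
        if n < upper_bound:
            return h
    return MIN_FUNCTIONS
```

For instance, `function_count(250_000)` falls in the second range and returns 512.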
Step 103, based on an FM algorithm, calculating the unique value number of the data set by using a hash function set having the determined number of functions.
In practical applications, to reduce error and improve accuracy, a series of hash functions H1, H2, H3, … is usually applied to all data in the data set to obtain hash values. Then, according to the FM algorithm, for each hash function the maximum value MAX is determined, over all hash values of that function, of the length of the run of 0 bits at the tail of the binary representation of the hash value.
For example, given a data set {e1, e2, e3, e2}, let the function TailZero(x) return the number of consecutive 0s at the end of the binary representation of a positive integer x, and let the hash function H(e) hash the data in the data set to give the following hash values:
H(e1) = 2 = (0010)_2, TailZero(H(e1)) = 1
H(e2) = 8 = (1000)_2, TailZero(H(e2)) = 3
H(e3) = 10 = (1010)_2, TailZero(H(e3)) = 1
Then MAX = max(1, 3, 1) = 3.
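The worked example can be reproduced with a short Python sketch. The function name `tail_zero` and the dictionary `example_hashes` are illustrative names introduced here; the hash values are simply the ones given in the example, not the output of any real hash function.

```python
def tail_zero(x: int) -> int:
    """Count the trailing 0 bits in the binary representation of a positive integer."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    # x & -x isolates the lowest set bit; its bit length minus one is the
    # number of zero bits below it.
    return (x & -x).bit_length() - 1


# Hash values copied from the worked example.
example_hashes = {"e1": 2, "e2": 8, "e3": 10}
data_set = ["e1", "e2", "e3", "e2"]

# MAX is the largest trailing-zero count seen over the whole data set;
# the repeated element e2 does not change it.
max_tail = max(tail_zero(example_hashes[e]) for e in data_set)
estimate = 2 ** max_tail
```

With these values `max_tail` is 3, so this single hash function would give an estimate of 2 to the power 3.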
Further, denoting the maximum value for hash function H1 as MAX1, a series of values MAX1, MAX2, MAX3, … is obtained in the same way. Since the estimated unique value number for a hash function equals 2 raised to the power of its MAX, this yields a series of estimates 2^MAX1, 2^MAX2, 2^MAX3, …, which are finally combined to obtain a final estimate. Specifically, A × B different hash functions may first be designed and divided into A groups of B hash functions each; B estimates are then calculated from the B hash functions in each group; the arithmetic mean of the B estimates is taken as the estimate of the group; and finally the median of the estimates of all groups is selected as the final estimate.
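The grouping step described above (mean within each group, median across groups) can be sketched as follows. This is a minimal illustration under the assumption that consecutive hash functions form a group; the function name `combine_estimates` and its parameters are hypothetical.

```python
from statistics import mean, median
from typing import Sequence


def combine_estimates(max_values: Sequence[int], group_count: int) -> float:
    """Combine per-function MAX values into one estimate.

    max_values holds MAX1, MAX2, ... for A * B hash functions and
    group_count is A; each consecutive run of B values forms one group.
    """
    if group_count <= 0 or not max_values or len(max_values) % group_count != 0:
        raise ValueError("max_values must split evenly into group_count groups")
    group_size = len(max_values) // group_count
    group_estimates = []
    for start in range(0, len(max_values), group_size):
        group = max_values[start:start + group_size]
        # Each hash function contributes the estimate 2 ** MAX; the group
        # estimate is the arithmetic mean of those B estimates.
        group_estimates.append(mean(2 ** m for m in group))
    # The final estimate is the median of the A group estimates.
    return median(group_estimates)
```

For example, with MAX values [1, 3, 2, 2, 4, 0] split into three groups of two, the group estimates are 5, 4 and 8.5, and the median is 5.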
In this embodiment, after the number of data in the data set is counted, the number of functions of the hash function set is determined from that count, and a hash function set having that number of functions is then used to calculate the unique value number of the data set. The scale of the hash function set is thereby matched to the scale of the data set, balancing execution efficiency and accuracy, and solving the prior-art problem that the two cannot both be achieved when the scale of the hash function set is fixed.
Example two
To illustrate clearly the implementation mentioned in the previous embodiment, in which the number of functions in the hash function set is adjusted continuously during traversal according to the count so far, this embodiment provides a specific execution flow. Fig. 2 is a schematic flowchart of a data processing method provided in the second embodiment of the present invention. As shown in Fig. 2, the method includes:
Step 201, presetting an initial hash function set.
Specifically, the number of preset functions of the initial hash function set may be a maximum value, for example: the number of functions may be preset to 1024.
Step 202, reading one piece of data from the data set and counting the number of data read so far.
Steps 202 to 206 are performed for each piece of data read from the data set.
Step 203, determining the number of functions according to the number of data read so far.
For example:
when the number of the read data reaches 100,000, determining the number of the functions to be 512;
when the number of the read data reaches 1,000,000, determining the number of the functions to be 256;
when the number of the read data reaches 10,000,000, determining the number of the functions to be 128;
when the number of the read data reaches 100,000,000, determining the number of the functions to be 64.
Thus, as more data are read from the data set, the number of functions in the hash function set keeps decreasing. A larger hash function set is used when the data set is small, improving accuracy, and a smaller hash function set is used when the data set is large, improving efficiency. In this way, the size of the hash function set is matched to the size of the data set.
Step 204, judging whether the determined number of functions is less than the number of functions in the current hash function set; if so, executing step 205; otherwise, executing step 206.
Step 205, removing hash functions from the hash function set until it contains the determined number of functions.
Step 206, performing hash calculation on the currently read data with the current hash function set, and storing the resulting hash values.
Steps 202 to 206 are repeated until all data in the data set have been read.
Step 207, when the hash values of all data in the data set have been calculated, calculating the unique value number with the FM algorithm from the hash values of the hash functions in the finally determined hash function set.
The hash function set is adjusted continuously while the data set is read, and the unique value number is estimated on the basis of the hash function set determined by the last adjustment. Data read earlier are hashed with more hash functions, so some of the stored hash values belong to hash functions that are no longer in the finally determined set. These hash values are invalid; only the hash values of the hash functions in the finally determined set are used when the unique value number is estimated.
For example: 64 hash functions H1-H64 are retained in the final set of hash functions.
From all the stored hash values, the hash values of hash functions H1-H64 are selected for each piece of data. For each hash function, the maximum value MAX of the length of the run of 0 bits at the tail of the binary representation is determined over all hash values of that function. With reference to the preset grouping H1-H8; H9-H16; H17-H24; H25-H32; H33-H40; H41-H48; H49-H56; H57-H64, the average of the MAX values in each group is calculated, and the median of the group averages is taken as the estimated value R. The unique value number is estimated as 2 raised to the power of R.
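The single-pass flow of steps 201 to 207 can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the hash function family (BLAKE2b with a per-function salt), the names `hash_value`, `estimate_unique_count`, `DEFAULT_SCHEDULE`, and the parameters are all introduced here for illustration. As a simplification, the sketch keeps a running MAX per hash function instead of storing every hash value, which yields the same MAX values for the retained functions.

```python
import hashlib
from statistics import mean, median

# (number of data read, number of functions to keep), from step 203.
DEFAULT_SCHEDULE = [(100_000, 512), (1_000_000, 256),
                    (10_000_000, 128), (100_000_000, 64)]


def tail_zero(x: int) -> int:
    """Trailing 0 bits of x; a zero hash is treated as 0 (simplifying assumption)."""
    return (x & -x).bit_length() - 1 if x else 0


def hash_value(index: int, data: bytes) -> int:
    """Hypothetical hash family: BLAKE2b salted with the function index."""
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def estimate_unique_count(items, initial_functions=1024,
                          schedule=DEFAULT_SCHEDULE, group_size=8):
    active = initial_functions              # step 201: initial set size
    max_tail = [0] * initial_functions      # running MAX per hash function
    read = 0
    for item in items:
        read += 1                           # step 202: count data read
        for reached, count in schedule:     # steps 203-205: shrink the set
            if read >= reached and count < active:
                active = count
        data = repr(item).encode("utf-8")
        for i in range(active):             # step 206: hash with current set
            max_tail[i] = max(max_tail[i], tail_zero(hash_value(i, data)))
    if read == 0:
        return 0
    # Step 207: use only the functions retained after the last adjustment.
    kept = max_tail[:active]
    group_means = [mean(kept[s:s + group_size])
                   for s in range(0, active, group_size)]
    r = median(group_means)
    return 2 ** r
```

A call such as `estimate_unique_count(column_values)` would return the estimate 2 to the power R for an iterable `column_values`. Smaller values of `initial_functions` and `schedule` can be passed to try the flow on small inputs.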
Example three
Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention, as shown in fig. 3, including: a statistics module 31, a determination module 32 and a calculation module 33.
The statistics module 31 is configured to count the number of data in the data set.
Specifically, the statistics module 31 is configured to count the total number of data included in the data set.
The determining module 32 is configured to determine the number of functions of the hash function set according to the counted number of data.
The calculation module 33 is configured to calculate, based on an FM algorithm, the unique value number of the data set by using a hash function set having that number of functions.
The data processing apparatus counts the data while traversing them and then determines the number of functions in the hash function set from the count once the traversal is complete. Specifically, before the unique value number is calculated for the data set, all data in the data set are traversed to count the total number of data, and the number of functions in the hash function set is determined from that count. The data set referred to here may be the set of all data included in the same column of each data table. When the unique value number is calculated, all data in the data set are traversed again, the hash values of all data are calculated with the determined hash functions, and the hash values are processed according to the FM algorithm to estimate the unique value number. Because the number of functions is determined from the counted number of data, the scale of the hash function set is matched to the scale of the data set, balancing execution efficiency and accuracy, and solving the prior-art problem that the two cannot both be achieved when the scale of the hash function set is fixed.
Example four
Fig. 4 is a schematic structural diagram of another data processing apparatus according to a fourth embodiment of the present invention.
In the apparatus provided in this embodiment, the counting module 31 is specifically configured to read data in the data set one by one, and count the number of the data that has been read in the data set.
The determining module 32 is specifically configured to gradually decrease the number of functions of the hash function set as the counted number of the read data increases.
As shown in fig. 4, on the basis of the above embodiment, the data processing apparatus further includes:
a generating module 34, configured to discard hash functions from the hash function set so that the number of retained hash functions equals the determined number of functions.
The calculation module 33 further includes: a hash value unit 331 and a unique value number unit 332.
A hash value unit 331, configured to calculate a hash value for the read data by using the hash function reserved in the hash function set.
A unique value number unit 332, configured to calculate a unique value number by using the hash value of the hash function reserved in the hash function set based on an FM algorithm when all data in the data set are read.
And continuously adjusting the number of functions in the hash function set according to the current statistical result in the process of traversing the data. Specifically, the number of functions in the hash function set may be adjusted in the process of traversing the data set, and the hash value of the traversed data may be calculated according to the hash function in the adjusted hash function set. And after traversing all the data in the data set, calculating the unique value number by adopting an FM algorithm according to the hash value of the hash function in the hash function set determined by the last adjustment. Specifically, the data in the data set may be read piece by piece, so as to count the number of data that have been read in the data set. After a piece of data is read, the subsequent steps of determining the number of functions in the hash function set for the read data, substituting the read data, and calculating the function value of each hash function in the determined hash function set are started.
It can be seen that, in the process of reading data in the data set, along with the increase of the read data, the number of functions in the hash function set is continuously reduced, so that when the scale of the data set is small, a large-scale hash function set is adopted, the calculation accuracy is improved, and when the scale of the data set is large, a small-scale hash function set is adopted, and the calculation efficiency is improved. In this way, the size of the hash function set is matched to the size of the data set.
It should be noted that the apparatuses provided in the third embodiment and the fourth embodiment are respectively used to implement the data processing flows provided in fig. 1 and fig. 2, and the functions of each functional module of the data processing apparatuses in the third embodiment and the fourth embodiment refer to the related description in the foregoing method embodiments, and are not repeated in the third embodiment and the fourth embodiment.
In this embodiment, after the number of data in the data set is counted, the number of functions of the hash function set is determined from that count, and a hash function set having that number of functions is then used to calculate the unique value number of the data set. The scale of the hash function set is thereby matched to the scale of the data set, balancing execution efficiency and accuracy, and solving the prior-art problem that the two cannot both be achieved when the scale of the hash function set is fixed.
Example five
On the basis of the first or second embodiment, the fifth embodiment provides a data table processing method for optimizing operations on a data table, such as join operations or grouping operations, so as to reduce resource occupation and improve operation efficiency. The resources may be those consumed in executing an operation, such as CPU or memory.
Fig. 5 is a schematic flow chart of a data table processing method provided in the fifth embodiment of the present invention, for predicting the size of a data table, including:
Step 501, processing the data table by using the data processing method of the first or second embodiment to obtain the unique value number.
Step 502, predicting the scale of the data table according to the unique value number.
In one possible application scenario, the data table processing method provided in fig. 5 may be adopted to implement the size prediction of the data table, and the allocation of the required resources to the data table may be facilitated according to the predicted size.
Fig. 6 is a flowchart of another data table processing method according to the fifth embodiment of the present invention. This method is used for evaluating data table operations; after step 502 in the method of fig. 5, it further includes:
Step 503, evaluating the operation of the data table according to the predicted scale of the data table to determine the resources occupied by the operation.
In a possible application scenario, a connection operation is to be performed on data table A and data table B. Based on the data table processing method of fig. 6, the scale of each data table is predicted first, and the resources occupied by connecting data tables A and B are then evaluated, which facilitates resource allocation.
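The text does not say how the resources of a connection operation are evaluated. As an illustration only, the sketch below uses the common textbook estimate for an equi-join, in which the output row count is approximated by the product of the two row counts divided by the larger of the two unique value numbers on the join column. The function name and parameters are hypothetical, and the formula is a swapped-in standard technique rather than the method of this embodiment.

```python
def estimate_join_rows(
    rows_a: int, unique_a: float, rows_b: int, unique_b: float
) -> float:
    """Textbook equi-join size estimate: |A| * |B| / max(V(A), V(B))."""
    if rows_a == 0 or rows_b == 0:
        return 0.0
    # Guard against a zero or fractional estimate below one unique value.
    return rows_a * rows_b / max(unique_a, unique_b, 1.0)
```

For instance, `estimate_join_rows(1000, 100, 500, 50)` evaluates to `1000 * 500 / 100`; the estimated row count can then be multiplied by an assumed row width to approximate memory use.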
Fig. 7 is a flowchart of a further data table processing method according to the fifth embodiment of the present invention. This method is used to perform a data table operation; after step 503 in the method of fig. 6, it further includes:
Step 504, executing the operation on the data table according to the evaluation result.
Specifically, the execution order of at least two operations performed on the data table is determined according to the evaluation result; for example, the operations may be ordered by the resources they need to occupy, from least to most. The at least two operations are then performed in the determined order.
In a possible application scenario, connection operations are to be performed on data table A, data table B, and data table C. Based on the data table processing method of fig. 6, the scale of each data table is predicted first; the resources occupied by each of the three candidate operations (connecting A and B, connecting A and C, and connecting B and C) are then evaluated, and the connection operation occupying the least resources is selected from among them. According to the evaluation result, the two smaller data tables A and B can be connected first, so that this connection occupies fewer resources, and the larger data table C is connected afterwards, so that the total resources occupied are minimized.
Therefore, the scale of the data table is predicted from the calculated unique value number, the resources that an operation on the data table needs to occupy are evaluated from the prediction result, and the order of the operations on the data table is optimized based on those resource requirements, thereby reducing resource occupation and improving operation efficiency when operating on the data table.
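The three-table scenario can be sketched as a greedy ordering: evaluate every candidate pair, execute the cheapest connection first, and attach the remaining tables afterwards. This is an illustration under assumptions, not the embodiment's procedure: `join_cost` reuses the textbook equi-join estimate as a stand-in for "resources occupied", the `(row_count, unique_values)` table description is hypothetical, and remaining tables are simply appended from smallest to largest.

```python
from itertools import combinations
from typing import List, Mapping, Tuple

Table = Tuple[int, float]  # (row_count, estimated unique values on the join key)


def join_cost(a: Table, b: Table) -> float:
    """Stand-in resource estimate: |A| * |B| / max(V(A), V(B))."""
    rows_a, unique_a = a
    rows_b, unique_b = b
    return rows_a * rows_b / max(unique_a, unique_b, 1.0)


def plan_join_order(tables: Mapping[str, Table]) -> List[str]:
    """Return table names in execution order, cheapest connection first."""
    names = sorted(tables)
    if len(names) < 2:
        return names
    # Evaluate every candidate pair and pick the one occupying the least.
    first = min(
        combinations(names, 2),
        key=lambda pair: join_cost(tables[pair[0]], tables[pair[1]]),
    )
    # Attach the remaining tables from smallest to largest.
    rest = sorted((n for n in names if n not in first), key=lambda n: tables[n][0])
    return list(first) + rest
```

With two small tables and one large table, the two small tables are paired first and the large table is connected last, which matches the ordering described in the scenario above.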
EXAMPLE six
Fig. 8 is a schematic structural diagram of a data table processing apparatus 60 according to a sixth embodiment of the present invention, where the data table processing apparatus 60 is used for predicting a data table size, and includes: a unique value module 61 and a prediction module 62.
A unique value module 61, configured to process the data table by using the data processing apparatus shown in fig. 3 or fig. 4 to obtain a unique value number.
A prediction module 62, configured to predict the size of the data table according to the unique value number.
Fig. 9 is a schematic structural diagram of a data table processing apparatus 70 according to a sixth embodiment of the present invention, where the data table processing apparatus 70 is used for evaluating a data table operation, and includes: a prediction module 71 and an evaluation module 72.
The prediction module 71 is configured to predict the size of the data table by using the data table processing apparatus 60 shown in fig. 8.
The evaluation module 72 is configured to evaluate the operation of the data table according to the predicted scale of the data table, so as to determine the resources occupied by the operation.
Wherein the operation comprises a connection operation and/or a grouping operation.
Fig. 10 is a schematic structural diagram of a data table processing apparatus 80 according to a sixth embodiment of the present invention, where the data table processing apparatus 80 is configured to perform a data table operation, and includes: an evaluation module 81 and an operation module 82.
An evaluation module 81, configured to evaluate, by using the data table processing apparatus 70 shown in fig. 9, resources occupied by the operation of the data table.
An operation module 82, configured to execute the operation on the data table according to the evaluation result.
Specifically, as a possible implementation manner, as shown in fig. 11, the operation module 82 includes: a determination unit 821 and an execution unit 822.
A determining unit 821 for determining an execution order of at least two operations performed with respect to the data table according to the evaluation result.
The determining unit 821 is specifically configured to determine the execution order of the at least two operations by the resources the operations need to occupy, from least to most.
An executing unit 822, configured to execute the at least two operations in the determined order.
The scale of the data table is predicted from the calculated unique value number, the resources occupied by an operation on the data table are evaluated from the prediction result, and the operation on the data table is optimized based on the resources the operation occupies, thereby reducing resource occupation and improving operation efficiency when operating on the data table.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A method of processing a data table for predicting the size of the data table, comprising:
counting the number of data in a plurality of data sets;
directly determining the function number of the hash function set according to the counted data number;
calculating the unique value number of the data set based on the Flajolet-Martin algorithm by using a hash function set that conforms to the function number; and
predicting the size of the data table according to the unique value number,
characterized in that counting the number of data in the plurality of data sets comprises:
reading the data in the data set one by one, and counting the number of data read from the data set; and
determining the function number of the hash function set according to the counted data number comprises:
as the counted number of read data increases, using, as the determined function number of the hash function set, a number smaller than the function number of the hash function set determined from the data number before the increase.
2. The method of claim 1, wherein the counting the number of data in the plurality of data sets comprises:
counting all data contained in the data set.
3. The data table processing method according to claim 1, wherein before the unique value number of the data set is calculated based on the Flajolet-Martin algorithm by using a hash function set conforming to the function number, the method further comprises:
discarding hash functions from the hash function set, so that the number of hash functions retained is the determined function number.
4. The data table processing method according to claim 3, wherein calculating the unique value number of the data set based on the Flajolet-Martin algorithm by using a hash function set conforming to the function number comprises:
calculating a hash value of the read data by using the hash functions retained in the hash function set; and
when all the data in the data set have been read, calculating the unique value number based on the Flajolet-Martin algorithm from the hash values of the hash functions retained in the hash function set.
5. A data table processing method for evaluating data table operations, comprising:
predicting the size of the data table by using the data table processing method according to any one of claims 1 to 4; and
evaluating the operation of the data table according to the predicted scale of the data table, to determine the resources occupied by the operation.
6. The method of claim 5, wherein the operation comprises a join operation and/or a group operation.
7. A data table processing method for performing data table operations, comprising:
evaluating the resources occupied by the operation of the data table by using the data table processing method of claim 5 or 6; and
executing the operation on the data table according to the evaluation result.
8. The data table processing method of claim 7, wherein executing the operation on the data table according to the evaluation result comprises:
determining an execution order of at least two operations performed on the data table according to the evaluation result; and
executing the at least two operations in the determined order.
9. The data table processing method of claim 8, wherein determining the execution order of the at least two operations performed on the data table according to the evaluation result comprises:
determining the execution order of the at least two operations by the resources the operations need to occupy, from least to most.
10. A data table processing apparatus for predicting a size of a data table, comprising:
a statistical module, configured to count the data number of the data sets;
a determining module, configured to directly determine the function number of the hash function set according to the counted data number;
a calculation module, configured to calculate the unique value number of the data set based on the Flajolet-Martin algorithm by using a hash function set that conforms to the function number; and
a prediction module, configured to predict the size of the data table according to the unique value number,
characterized in that
the statistical module is specifically configured to read the data in the data set one by one and count the number of data read from the data set; and
the determining module is specifically configured to, as the counted number of read data increases, use, as the determined function number of the hash function set, a number smaller than the function number of the hash function set determined according to the data number before the increase.
11. The data table processing apparatus of claim 10, wherein
the statistical module is specifically configured to count the number of all data included in the data set.
12. The data table processing apparatus of claim 10, further comprising:
a generation module, configured to discard hash functions from the hash function set, so that the number of hash functions retained is the determined function number.
13. The data table processing apparatus of claim 12, wherein the calculation module comprises:
a hash value unit, configured to calculate a hash value of the read data by using the hash functions retained in the hash function set; and
a unique value number unit, configured to calculate, when all the data in the data set have been read, the unique value number based on the Flajolet-Martin algorithm from the hash values of the hash functions retained in the hash function set.
14. A data table processing apparatus for evaluating data table operations, comprising:
a prediction module, configured to predict the size of the data table by using the data table processing apparatus of any one of claims 10 to 13; and
an evaluation module, configured to evaluate the operation of the data table according to the predicted scale of the data table, so as to determine the resources occupied by the operation.
15. The apparatus of claim 14, wherein the operation comprises a join operation and/or a group operation.
16. A data table processing apparatus for performing data table operations, comprising:
an evaluation module, configured to evaluate the resources occupied by the operation of the data table by using the data table processing apparatus according to claim 14 or 15; and
an operation module, configured to execute the operation on the data table according to the evaluation result.
17. The data table processing apparatus of claim 16, wherein the operation module comprises:
a determining unit, configured to determine an execution order of at least two operations performed on the data table according to the evaluation result; and
an execution unit, configured to execute the at least two operations in the determined order.
18. The data table processing apparatus of claim 17, wherein
the determining unit is specifically configured to determine the execution order of the at least two operations by the resources the operations need to occupy, from least to most.
CN201610180081.7A 2016-03-25 2016-03-25 Data processing method and device and data table processing method and device Active CN107229663B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201610180081.7A CN107229663B (en) 2016-03-25 2016-03-25 Data processing method and device and data table processing method and device
TW106105362A TWI746517B (en) 2016-03-25 2017-02-17 Data processing method and device and data table processing method and device
PCT/CN2017/077024 WO2017162102A1 (en) 2016-03-25 2017-03-17 Data processing method and apparatus, and data table processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610180081.7A CN107229663B (en) 2016-03-25 2016-03-25 Data processing method and device and data table processing method and device

Publications (2)

Publication Number Publication Date
CN107229663A CN107229663A (en) 2017-10-03
CN107229663B true CN107229663B (en) 2022-05-27

Family

ID=59899283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610180081.7A Active CN107229663B (en) 2016-03-25 2016-03-25 Data processing method and device and data table processing method and device

Country Status (3)

Country Link
CN (1) CN107229663B (en)
TW (1) TWI746517B (en)
WO (1) WO2017162102A1 (en)

Also Published As

Publication number Publication date
WO2017162102A1 (en) 2017-09-28
TWI746517B (en) 2021-11-21
TW201737057A (en) 2017-10-16
CN107229663A (en) 2017-10-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant