WO2017162102A1 - Data processing method and apparatus, and data table processing method and apparatus - Google Patents

Data processing method and apparatus, and data table processing method and apparatus

Info

Publication number
WO2017162102A1
WO2017162102A1 PCT/CN2017/077024 CN2017077024W WO2017162102A1 WO 2017162102 A1 WO2017162102 A1 WO 2017162102A1 CN 2017077024 W CN2017077024 W CN 2017077024W WO 2017162102 A1 WO2017162102 A1 WO 2017162102A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data table
hash function
processing method
functions
Prior art date
Application number
PCT/CN2017/077024
Other languages
French (fr)
Chinese (zh)
Inventor
孙伟光
徐冬
连杰红
汪龙重
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017162102A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Definitions

  • The present invention relates to computer technology, and in particular to a data processing method and apparatus, and a data table processing method and apparatus.
  • The number of independent elements is also called the number of unique values; it is used to predict the size of a data table.
  • The elements may span a wide range of values, and a single element may occupy a lot of memory, so the entire sequence may not fit in memory.
  • In this case the Flajolet-Martin (FM) algorithm can be used; it is well suited to estimating the number of unique values.
  • The algorithm operates on a set of hash functions and estimates the number of unique values from the hash values produced by each hash function in the set.
  • The present invention provides a data processing method and apparatus, and a data table processing method and apparatus, to solve the problem in the prior art that, when the FM algorithm is used to calculate the number of unique values, execution efficiency and the accuracy of the result cannot both be ensured.
  • A data processing method, comprising:
  • Based on the FM algorithm, the number of unique values of the data set is calculated using a hash function set that has the determined number of functions.
  • A data processing apparatus, comprising:
  • A statistics module, configured to count the number of data items in the data set.
  • A determining module, configured to determine the number of functions in the hash function set according to the counted number of data items.
  • A calculation module, configured to calculate, based on the FM algorithm, the number of unique values of the data set using a hash function set that has the determined number of functions.
  • A data table processing method for predicting a data table size, comprising:
  • The size of the data table is predicted from the number of unique values.
  • A data table processing method for evaluating data table operations, comprising:
  • The operation on the data table is evaluated according to the predicted size of the data table, to determine the resources the operation requires.
  • A data table processing method for performing a data table operation, comprising:
  • The operation is performed on the data table according to the evaluation result.
  • A data table processing apparatus for predicting a data table size, comprising:
  • A unique value module, configured to process the data table using the data processing apparatus of the second aspect to obtain the number of unique values.
  • A prediction module, configured to predict the size of the data table from the number of unique values.
  • A data table processing apparatus for evaluating a data table operation, comprising:
  • A prediction module, configured to predict the size of the data table using the data table processing apparatus of the sixth aspect.
  • An evaluation module, configured to evaluate the operation on the data table according to the predicted size, to determine the resources the operation requires.
  • A data table processing apparatus for performing a data table operation, comprising:
  • An evaluation module, configured to evaluate the resources required by the operation on the data table using the data table processing apparatus of the seventh aspect.
  • An operation module, configured to perform the operation on the data table according to the evaluation result.
  • In the data processing method and apparatus and the data table processing method and apparatus provided by the embodiments of the present invention, the number of data items in the data set is counted, the number of functions in the hash function set is determined from that count, and a hash function set with that number of functions is then used to calculate the number of unique values of the data set. The scale of the hash function set is thereby matched to the scale of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that the two cannot be balanced because the hash function set has a fixed size.
  • In addition, the size of the data table is predicted from the calculated number of unique values, the resources required by an operation on the data table are evaluated from the prediction, and the operation is then optimized according to its resource requirements, which reduces resource usage and improves operation efficiency.
  • FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention.
  • FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention.
  • FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention.
  • FIG. 7 is a schematic flowchart of still another data table processing method according to Embodiment 5 of the present invention.
  • FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 according to Embodiment 6 of the present invention.
  • FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 according to Embodiment 6 of the present invention.
  • FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 according to Embodiment 6 of the present invention.
  • FIG. 11 is a schematic structural diagram of another data table processing apparatus 80 according to Embodiment 6 of the present invention.
  • FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention. As shown in FIG. 1, the method includes:
  • Step 101: Count the number of data items in the data set.
  • In a first implementation, the data is traversed to obtain the count, and the number of functions in the hash function set is determined from the result once the traversal is complete.
  • Specifically, before the number of unique values of the data set is calculated, all the data in the data set may be traversed to count the total number of data items, and the number of functions in the hash function set is then determined from that total.
  • The data set referred to here may be the set of all data contained in one column of a data table.
  • In a second implementation, the number of functions in the hash function set is adjusted continuously, during the traversal, according to the current count.
  • Specifically, while the data set is traversed, the number of functions in the hash function set may be adjusted, and the hash values of each traversed item are calculated with the hash functions in the adjusted set.
  • After all the data in the data set has been traversed, the number of unique values is calculated by the FM algorithm from the hash values of the hash functions in the set determined by the last adjustment.
  • Specifically, the data in the data set can be read one item at a time while counting the items read so far. Each time an item is read, the number of functions in the hash function set is determined, and the item is hashed with each hash function in the determined set.
  • For example, the number of unique values may be calculated for one or more columns of the data tables to be joined (connected), in order to predict the size of the join. In big data processing each data table usually contains tens of thousands of records, so predicting the size of large table joins is a necessary preparation step before the join is performed.
  • In the first implementation, the data set has to be traversed twice. Because the data tables involved in a join contain a large amount of data, two traversals consume more computing resources and time. Therefore, when the data set is large, the second implementation is preferred: it improves efficiency by reducing the number of traversals of the data set.
  • Step 102: Determine the number of functions in the hash function set according to the counted number of data items.
  • FM is an efficient algorithm for calculating the number of unique values.
  • In this algorithm the size of the hash function set strongly affects both the accuracy of the result and the execution efficiency.
  • A hash function set that is too small gives lower accuracy but higher execution efficiency; a set that is too large gives higher accuracy but lower execution efficiency. The size of the FM hash function set should therefore match the size of the data set.
  • For example, the matching can be done as follows, where N is the number of data items in the data set and H is the number of functions in the hash function set;
  • Step 103: Based on the FM algorithm, calculate the number of unique values of the data set using a hash function set that has the determined number of functions.
  • The function TailZero(x) returns the number of consecutive trailing zeros in the binary representation of a positive integer x. The hash function H(e) hashes each data item e in the data set, and the hash value obtained is:
  • For hash function H1, the maximum TailZero value over all of its hash values is Max1; Max2, Max3, ... are obtained in the same way for the other hash functions. Each hash function then gives an estimate of the number of unique values equal to 2 raised to the power of its Max, which yields a series of estimates 2^Max1, 2^Max2, 2^Max3, ...; these estimates are finally combined to obtain the final estimate.
  • In this embodiment, after the number of data items in the data set is counted, the number of functions in the hash function set is determined from the count, and a hash function set with that number of functions is used to calculate the number of unique values of the data set. The size of the hash function set thus matches the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that the two cannot be balanced because the hash function set has a fixed size.
  • FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention. As shown in FIG. 2, the method includes:
  • Step 201: Set an initial hash function set in advance.
  • The number of functions in the preset initial hash function set may be the maximum value; for example, it may be preset to 1024.
  • Step 202: Read one data item from the data set and count the number of items read so far.
  • The data items in the data set are read in sequence, and steps 202-206 are performed each time an item is read.
  • Step 203: Determine the number of functions according to the number of items read so far.
  • the number of functions is determined to be 512;
  • the number of functions is determined to be 256;
  • the number of functions is determined to be 128;
  • the number of functions is determined to be 64.
  • Step 204: Determine whether the determined number of functions is less than the number of functions in the current hash function set. If so, perform step 205; otherwise, perform step 206.
  • Step 205: Remove hash functions from the hash function set until it contains the determined number of functions.
  • Step 206: Hash the currently read data item with the current hash function set and store the resulting hash values.
  • Steps 202-206 are repeated until all data in the data set has been read.
  • Step 207: When the hash values of all data in the data set have been calculated, calculate the number of unique values with the FM algorithm from the hash values of the hash functions in the finally determined hash function set.
  • While the data in the data set is being read, the hash function set is adjusted continuously; when the number of unique values is calculated, the hash function set determined by the last adjustment is the one used for the estimate.
  • Data read earlier was hashed with more hash functions, so more hash values were calculated for it.
  • Some of those hash functions no longer exist in the hash function set determined by the last adjustment.
  • Their hash values are invalid; the number of unique values is estimated using only the hash values of the hash functions in the set determined by the last adjustment.
  • For example, the final hash function set retains 64 hash functions H1-H64.
  • For each data item, the hash values of hash functions H1-H64 are selected from all stored hash values. Then, for each hash function, the maximum length MAX of the run of trailing zero bits in the binary representation of its hash values is determined. Using the predefined groups H1-H8; H9-H16; H17-H24; H25-H32; H33-H40; H41-H48; H49-H56; H57-H64, the average of MAX is calculated within each group, and the median of the group averages is taken as the estimate R. The estimated number of unique values is 2 to the power of R.
  • FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention. As shown in FIG. 3, the apparatus includes: a statistics module 31, a determining module 32, and a calculating module 33.
  • The statistics module 31 is configured to count the number of data items in the data set.
  • The statistics module 31 is specifically configured to count the total number of data items contained in the data set.
  • The determining module 32 is configured to determine the number of functions in the hash function set according to the counted number of data items.
  • The calculating module 33 is configured to calculate, based on the FM algorithm, the number of unique values of the data set using a hash function set that has the determined number of functions.
  • In one implementation, the data processing apparatus traverses the data to obtain the count and then determines the number of functions in the hash function set from the result. Specifically, before the number of unique values of the data set is calculated, all the data in the data set may be traversed to count the total number of data items, and the number of functions in the hash function set is then determined from that total.
  • The data set referred to here may be the set of all data contained in one column of a data table.
  • In this embodiment, the number of functions in the hash function set is determined from the counted number of data items, and a hash function set with that number of functions is used to calculate the number of unique values of the data set. The size of the hash function set thus matches the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that the two cannot be balanced because the hash function set has a fixed size.
  • FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention.
  • The statistics module 31 is specifically configured to read the data in the data set one item at a time and count the number of items read so far.
  • The determining module 32 is specifically configured to reduce the number of functions in the hash function set step by step as the count of items read grows.
  • The data processing apparatus further includes:
  • The generating module 34, configured to discard hash functions from the hash function set so that the number of retained hash functions equals the determined number of functions.
  • The calculation module 33 further includes: a hash value unit 331 and a unique value unit 332.
  • The hash value unit 331 is configured to calculate hash values for the read data using the hash functions retained in the hash function set.
  • The unique value unit 332 is configured to calculate, once all the data in the data set has been read, the number of unique values based on the FM algorithm from the hash values of the hash functions retained in the hash function set.
  • In another implementation, the number of functions in the hash function set is adjusted continuously according to the current count. Specifically, while the data set is traversed, the number of functions in the hash function set may be adjusted, and the hash values of each traversed item are calculated with the hash functions in the adjusted set. After all the data in the data set has been traversed, the number of unique values is calculated by the FM algorithm from the hash values of the hash functions in the set determined by the last adjustment. The data can be read one item at a time while counting the items read so far; each time an item is read, the number of functions in the hash function set is determined and the item is hashed with each hash function in the determined set.
  • The number of functions in the hash function set is chosen so that, when the data set is small, a larger hash function set is used to improve calculation accuracy; when the data set is large, a smaller hash function set is used.
  • The apparatuses provided in Embodiment 3 and Embodiment 4 implement the data processing flows of FIG. 1 and FIG. 2, respectively. For the functions of their functional modules, refer to the descriptions in the foregoing method embodiments; they are not repeated here.
  • In this embodiment, after the number of data items in the data set is counted, the number of functions in the hash function set is determined from the count, and a hash function set with that number of functions is used to calculate the number of unique values of the data set. The size of the hash function set thus matches the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that the two cannot be balanced because the hash function set has a fixed size.
  • Embodiment 5 provides a data table processing method for optimizing operations on data tables, such as join (connection) operations or grouping operations, thereby reducing resource usage and improving operation efficiency.
  • The resources mentioned here are resources consumed in performing the operation, such as CPU or memory.
  • FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention, which is used to predict a data table size, including:
  • Step 501: Process the data table with the data processing method of Embodiment 1 or Embodiment 2 to obtain the number of unique values.
  • Step 502: Predict the size of the data table from the number of unique values.
  • The data table processing method of FIG. 5 predicts the size of the data table, so that the required resources can be allocated to the data table according to the predicted size.
  • FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention.
  • the data table processing method is used to evaluate data table operations. After step 501 in the method provided in FIG. 5, the method further includes:
  • Step 503: Evaluate the operation on the data table according to the predicted data table size, to determine the resources the operation requires.
  • For example, data table A and data table B may be joined. With the data table processing method of FIG. 6, the size of each data table is predicted first, and the resources required to join data tables A and B are then evaluated to facilitate resource allocation.
  • FIG. 7 is a schematic flowchart of still another method for processing a data table according to Embodiment 5 of the present invention.
  • The processing method is used to perform a data table operation.
  • The method further includes:
  • Step 504: Perform the operation on the data table according to the evaluation result.
  • For example, the execution order of at least two operations to be performed on the data table is determined; the order may be chosen so that the operations occupy as few resources as possible. The at least two operations are then performed in the determined order.
  • For example, data table A, data table B, and data table C may be joined.
  • The size of each data table is predicted first, and the resources required to join data tables A and B, data tables A and C, and data tables B and C are then evaluated, in order to select the join that occupies the fewest resources.
  • For instance, the two smaller data tables A and B can be joined first, so that this join occupies fewer resources, and the larger data table C is joined afterwards, so that the total amount of resources occupied is minimized.
  • The size of the data table is predicted from the calculated number of unique values, the resources required by operations on the data table are evaluated from the prediction, and the order of the operations is then optimized according to their resource requirements. This reduces resource usage and improves efficiency while the data table is being operated on.
  • FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 according to Embodiment 6 of the present invention.
  • the data table processing apparatus 60 is configured to predict a data table size, and includes: a unique value module 61 and a prediction module 62.
  • The unique value module 61 is configured to process the data table using the data processing apparatus shown in FIG. 3 or FIG. 4 to obtain the number of unique values.
  • The prediction module 62 is configured to predict the size of the data table from the number of unique values.
  • FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 according to Embodiment 6 of the present invention.
  • the data table processing apparatus 70 is configured to evaluate a data table operation, and includes: a prediction module 71 and an evaluation module 72.
  • the prediction module 71 is configured to predict the size of the data table by using the data table processing device 60 shown in FIG.
  • the evaluation module 72 is configured to evaluate the operation of the data table according to the predicted data table size to determine resources required for the operation.
  • The operations include join (connection) operations and/or grouping operations.
  • FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 according to Embodiment 6 of the present invention. The data table processing apparatus 80 is configured to perform data table operations and includes: an evaluation module 81 and an operation module 82.
  • the evaluation module 81 is configured to evaluate the resources required for the operation of the data table by using the data table processing device 70 shown in FIG.
  • the operation module 82 is configured to perform the operation on the data table according to the evaluation result.
  • the operation module 82 includes: a determining unit 821 and an executing unit 822.
  • the determining unit 821 is configured to determine an execution order of at least two operations performed on the data table according to the evaluation result.
  • The determining unit 821 is specifically configured to determine the execution order of the at least two operations so that the operations occupy as few resources as possible.
  • the executing unit 822 is configured to perform the at least two operations in the determined order.
  • Resource usage is thereby reduced and operation efficiency is improved.
  • The aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments; the storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
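The single-pass flow of Embodiment 2 (steps 201-207) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the read-count thresholds in `SCHEDULE` are hypothetical (the text names the set sizes 512, 256, 128 and 64 but not the counts at which they apply), `hash64` is a hypothetical hash family built on BLAKE2b, and the sketch keeps a running MAX per hash function instead of storing every hash value. The result is the same because only the retained functions' maxima enter the final estimate.

```python
import hashlib
from statistics import mean, median

GROUP_SIZE = 8  # the text groups functions as H1-H8, H9-H16, ...

# HYPOTHETICAL schedule of (items read, number of functions to keep).
# The source names the set sizes 512/256/128/64 but not the thresholds.
SCHEDULE = [(1_000, 512), (10_000, 256), (100_000, 128), (1_000_000, 64)]


def tail_zeros(x: int, width: int = 64) -> int:
    """Length of the run of zero bits at the low end of x (width if x == 0)."""
    return width if x == 0 else (x & -x).bit_length() - 1


def hash64(value: str, index: int) -> int:
    """Hypothetical hash function H_index: BLAKE2b salted with the index."""
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "little")


def functions_for(count: int, initial: int, schedule=SCHEDULE) -> int:
    """Step 203: number of functions for the number of items read so far."""
    target = initial
    for threshold, size in schedule:
        if count >= threshold:
            target = min(target, size)
    return target


def estimate_unique(values, initial: int = 1024, schedule=SCHEDULE) -> float:
    max_r = [0] * initial            # step 201: running MAX for H1..H_initial
    count = 0
    for value in values:             # step 202: read one item and count it
        count += 1
        target = functions_for(count, initial, schedule)
        if target < len(max_r):      # step 204: is the set larger than needed?
            del max_r[target:]       # step 205: discard the surplus functions
        for i in range(len(max_r)):  # step 206: hash with the current set
            max_r[i] = max(max_r[i], tail_zeros(hash64(value, i)))
    if count == 0:
        return 0.0
    # Step 207: mean of MAX per group of 8, median of the means, then 2^R.
    groups = [max_r[i:i + GROUP_SIZE] for i in range(0, len(max_r), GROUP_SIZE)]
    r = median(mean(group) for group in groups)
    return 2.0 ** r
```

Calling `estimate_unique(column_values)` on an iterable of strings returns the estimate 2^R. Because only per-function maxima are kept, repeated values do not change the result as long as the same functions survive to the end.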

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method and apparatus, and a data table processing method and apparatus. The data processing method comprises: counting the number of data in a data set (101), determining the number of functions in a hash function set according to the number of data obtained through counting (102), and based on an FM algorithm, using a hash function set conforming to the number of functions to calculate a unique value of the data set (103), so that the scale of the hash function set matches the scale of the data set, thereby balancing the execution efficiency and the accuracy, and solving the problem in the prior art that the execution efficiency and the accuracy cannot be balanced due to the fixed scale of the hash function set. Meanwhile, the scale of a data table is predicted according to the calculated unique value, a resource needing to be occupied by a data table operation is evaluated according to a prediction result, so as to optimize the data table operation based on the condition of the resource needing to be occupied by the operation, thereby achieving the purposes of reducing the occupation of resources and improving the operation efficiency during the process of operating the data table.

Description

Data processing method and apparatus, and data table processing method and apparatus
This application claims priority to Chinese Patent Application No. 201610180081.7, filed on March 25, 2016 and entitled "Data processing method and apparatus, and data table processing method and apparatus", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to computer technology, and in particular to a data processing method and apparatus, and a data table processing method and apparatus.
Background
In practical applications, especially before a data table join (connection) operation, it is often necessary to count the number of distinct objects or events, that is, the number of independent elements, also called the number of unique values, in order to predict the size of the data table. For a small amount of data, the sequence can first be sorted in memory and the ordered sequence then scanned to count the independent elements. When a data stream is processed, however, the sequence is very long, the elements may span a wide range of values, and a single element may occupy a lot of memory, so the entire sequence cannot be held in memory.
In this case, the Flajolet-Martin (FM) algorithm can be used; it is well suited to estimating the number of unique values. The algorithm operates on a set of hash functions and estimates the number of unique values from the hash values of each hash function in the set.
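As a rough illustration of the idea behind the FM estimate, the Python sketch below uses a single hash function: it tracks the longest run of trailing zero bits seen in any hash value and returns 2 raised to that length. The helper names and the BLAKE2b-based `hash64` are assumptions made for this example; the text does not prescribe a particular hash function.

```python
import hashlib


def tail_zeros(x: int, width: int = 64) -> int:
    """Number of consecutive zero bits at the low end of x (width if x == 0)."""
    return width if x == 0 else (x & -x).bit_length() - 1


def hash64(value: str, seed: int = 0) -> int:
    """Hypothetical seeded 64-bit hash; any well-mixed hash could be used."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "little")


def fm_estimate_single(values, seed: int = 0) -> int:
    """Estimate the number of unique values with one hash function."""
    max_r = -1  # stays -1 when the input is empty
    for value in values:
        max_r = max(max_r, tail_zeros(hash64(value, seed)))
    return 0 if max_r < 0 else 2 ** max_r
```

A single hash function only ever yields a power of two, so the estimate is coarse. That is why the algorithm combines the results of a whole set of hash functions, and why the size of that set trades accuracy against running time.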
However, when the prior art applies the FM algorithm to calculate the number of unique values of a column in a data table, it uses the same hash function set for data sets of different sizes. As a result, when the data set is large, the calculation is inefficient and takes too long; when the data set is small, the accuracy of the result is low.
Summary of the invention
The present invention provides a data processing method and apparatus, and a data table processing method and apparatus, to solve the problem in the prior art that, when the FM algorithm is used to calculate the number of unique values, execution efficiency and the accuracy of the result cannot both be ensured.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a data processing method is provided, comprising:
counting the number of data items in a data set;
determining the number of functions in a hash function set according to the counted number of data items;
based on the FM algorithm, calculating the number of unique values of the data set using a hash function set that has the determined number of functions.
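The second step, choosing the number of functions from the count, could be sketched in Python as a simple lookup. The set sizes 1024, 512, 256, 128 and 64 appear in the embodiments; the count thresholds below are purely hypothetical and would need tuning, since the text does not state them.

```python
from bisect import bisect_right

# HYPOTHETICAL count thresholds; only the set sizes come from the text.
THRESHOLDS = [10_000, 100_000, 1_000_000, 10_000_000]
SET_SIZES = [1024, 512, 256, 128, 64]  # one more entry than THRESHOLDS


def choose_num_hashes(n: int) -> int:
    """Map a data-set size n to a hash-function-set size (larger n, fewer functions)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return SET_SIZES[bisect_right(THRESHOLDS, n)]
```

Small data sets get the largest set (favoring accuracy) and large data sets get the smallest (favoring speed), which is the trade-off the method is meant to balance.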
第二方面,提供了一种数据处理装置,包括:In a second aspect, a data processing apparatus is provided, comprising:
统计模块,用于对数据集的数据个数进行统计;a statistics module for counting the number of data in the data set;
确定模块,用于根据统计所获得的数据个数,确定哈希函数集的函数个数;a determining module, configured to determine the number of functions of the hash function set according to the number of data obtained by the statistics;
计算模块,用于基于FM算法,采用符合所述函数个数的哈希函数集对所述数据集进行唯一值数计算。And a calculation module, configured to perform a unique value calculation on the data set by using a hash function set that matches the number of functions based on an FM algorithm.
In a third aspect, a data table processing method for predicting the size of a data table is provided, including:
processing a data table using the data processing method of the first aspect to obtain a number of unique values; and
predicting the size of the data table according to the number of unique values.
In a fourth aspect, a data table processing method for evaluating a data table operation is provided, including:
predicting the size of a data table using the data table processing method of the third aspect; and
evaluating an operation on the data table according to the predicted size, to determine the resources the operation requires.
In a fifth aspect, a data table processing method for performing a data table operation is provided, including:
evaluating the resources required by an operation on a data table using the data table processing method of the fourth aspect; and
performing the operation on the data table according to the evaluation result.
In a sixth aspect, a data table processing apparatus for predicting the size of a data table is provided, including:
a unique value module, configured to process a data table using the data processing apparatus of the second aspect to obtain a number of unique values; and
a prediction module, configured to predict the size of the data table according to the number of unique values.
In a seventh aspect, a data table processing apparatus for evaluating a data table operation is provided, including:
a prediction module, configured to predict the size of a data table using the data table processing apparatus of the sixth aspect; and
an evaluation module, configured to evaluate an operation on the data table according to the predicted size, to determine the resources the operation requires.
In an eighth aspect, a data table processing apparatus for performing a data table operation is provided, including:
an evaluation module, configured to evaluate the resources required by an operation on a data table using the data table processing apparatus of the seventh aspect; and
an operation module, configured to perform the operation on the data table according to the evaluation result.
With the data processing method and apparatus and the data table processing method and apparatus provided by embodiments of the present invention, the number of data items in a data set is counted, the number of functions in the hash function set is determined according to the count, and the number of unique values of the data set is then computed using a hash function set that has the determined number of functions. The size of the hash function set is thereby matched to the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that a fixed-size hash function set cannot deliver both. In addition, the size of a data table is predicted from the computed number of unique values, the resources required by an operation on the data table are evaluated from the prediction, and the operation is then optimized according to its resource requirements, so that operations on the data table occupy fewer resources and run more efficiently.
The above is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention;
FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention;
FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention;
FIG. 7 is a schematic flowchart of still another data table processing method according to Embodiment 5 of the present invention;
FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 according to Embodiment 6 of the present invention;
FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 according to Embodiment 6 of the present invention;
FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 according to Embodiment 6 of the present invention;
FIG. 11 is a schematic structural diagram of another data table processing apparatus 80 according to Embodiment 6 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
The data processing method and apparatus and the data table processing method and apparatus provided by embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present invention. As shown in FIG. 1, the method includes:
Step 101: Count the number of data items in the data set.
In one possible implementation, the data is traversed once to count it, and the number of functions in the hash function set is determined from the count after the traversal. Specifically, before the number of unique values is computed, all data in the data set is traversed and the total number of data items is counted; the number of functions in the hash function set is then determined from that count. The data set here may be the set of all data contained in the same column of the data tables. When the number of unique values is computed, all data in the data set is traversed again: the hash values of all data are computed with the determined hash functions and processed according to the FM algorithm to estimate the number of unique values.
In another possible implementation, the number of functions in the hash function set is adjusted continually during the traversal, according to the running count. Specifically, while the data set is being traversed, the number of functions in the hash function set is adjusted, and the hash values of the traversed data are computed with the hash functions in the adjusted set. After all data has been traversed, the number of unique values is computed with the FM algorithm from the hash values of the hash functions in the set as determined by the last adjustment. In practice, the data can be read item by item while counting the number of items read so far. As soon as an item has been read, the subsequent steps are carried out for it: determining the number of functions in the hash function set, and substituting the item into each hash function in the determined set to compute its function value.
In an application scenario where multiple data tables are joined, the number of unique values can be computed for one or more columns of the tables to be joined, in order to predict the size of the join. In big data processing, each data table usually contains tens of thousands of records, so the data volume is large; the size of a large join therefore needs to be predicted so that the preparations it requires can be made.
The first implementation traverses the data set twice. Because the data tables in a join scenario contain a large amount of data, two traversals consume considerable computing resources and time. When the data set is large, the second implementation is therefore preferred: it improves efficiency by reducing the number of traversals.
Step 102: Determine the number of functions in the hash function set according to the counted number of data items.
FM is an efficient algorithm for computing the number of unique values, and the size of its hash function set strongly affects both the accuracy of the result and the execution efficiency. A hash function set that is too small yields lower accuracy but higher execution efficiency; a set that is too large yields higher accuracy but lower execution efficiency. The size of the FM hash function set should therefore match the size of the data set.
In practice, the matching can be done as follows, for example. Let N be the number of data items in the data set and H the number of functions in the hash function set:
when N < 100,000, H = 1024;
when 100,000 ≤ N < 1,000,000, H = 512;
when 1,000,000 ≤ N < 10,000,000, H = 256;
when 10,000,000 ≤ N < 100,000,000, H = 128;
when N ≥ 100,000,000, H = 64.
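The example mapping from N to H above can be written as a small lookup. The following is a minimal sketch only; the function name `choose_hash_count` is hypothetical, and the thresholds are simply the example values listed above rather than required values.

```python
def choose_hash_count(n: int) -> int:
    """Return the example hash-function count H for a data set of n items."""
    if n < 100_000:
        return 1024
    if n < 1_000_000:
        return 512
    if n < 10_000_000:
        return 256
    if n < 100_000_000:
        return 128
    return 64
```

Smaller data sets get more hash functions (favoring accuracy) and larger ones get fewer (favoring speed), which is the trade-off described in this step.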
Step 103: Based on the FM algorithm, compute the number of unique values of the data set using a hash function set that has the determined number of functions.
In practice, to reduce error and improve precision, a series of hash functions H1, H2, H3, ... is usually used, each applied to all data in the data set to obtain hash values. Under the FM algorithm, for each hash function, the maximum length MAX of the run of trailing 0 bits in the binary representation of its hash values is then recorded.
For example, given the data set {e1, e2, e3, e2}, let the function TailZero(x) return the number of consecutive 0s at the end of the binary representation of a positive integer x. Applying the hash function H(e) to the data in the data set gives the hash values:
H(e1) = 2 = (0010)_2, TailZero(H(e1)) = 1
H(e2) = 8 = (1000)_2, TailZero(H(e2)) = 3
H(e3) = 10 = (1010)_2, TailZero(H(e3)) = 1
Then MAX = max(1, 3, 1) = 3.
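The worked example can be reproduced in a few lines. This is an illustrative sketch: `tail_zero` is a hypothetical Python counterpart of TailZero(x), and the hash values 2, 8, and 10 are taken directly from the example rather than produced by a real hash function.

```python
def tail_zero(x: int) -> int:
    """Count the consecutive 0 bits at the low end of a positive integer x."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    # x & -x isolates the lowest set bit; its position is the trailing-zero count.
    return (x & -x).bit_length() - 1

# Hash values from the example; the duplicate e2 hashes to the same value,
# so it cannot change the maximum.
example_hashes = {"e1": 2, "e2": 8, "e3": 10}
max_tail = max(tail_zero(h) for h in example_hashes.values())
```

Because a repeated element always hashes to the same value, duplicates never raise MAX, which is why the statistic tracks unique values rather than total items.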
Denote the maximum for hash function H1 as MAX1; a series of maxima MAX1, MAX2, MAX3, ... is obtained in the same way. Then, since the estimate of the number of unique values equals 2 raised to the power MAX, a series of estimates 2^MAX1, 2^MAX2, 2^MAX3, ... is obtained, and these are finally aggregated into a single final estimate. Specifically, A×B distinct hash functions can be designed and divided into A groups of B hash functions each; the B hash functions in each group produce B estimates; the arithmetic mean of those B estimates is taken as the group's estimate; and the median of the group estimates is selected as the final estimate.
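The grouping step can be sketched as follows, assuming the per-function maxima have already been computed. The function name `combine_estimates` and its parameters are hypothetical; the list length is assumed to be A×B, with consecutive runs of B values forming one group.

```python
from statistics import mean, median

def combine_estimates(max_values, group_size):
    """Mean of the 2**MAX estimates within each group, then median across groups."""
    estimates = [2 ** m for m in max_values]
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return median(mean(group) for group in groups)
```

For instance, with maxima [1, 3, 2, 2, 3, 3] and B = 2, the per-function estimates are [2, 8, 4, 4, 8, 8], the group means are 5, 4, and 8, and the final estimate is their median, 5. Averaging within a group smooths the power-of-two granularity, while the median across groups limits the influence of a single outlying group.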
In this embodiment, the number of data items in the data set is counted, the number of functions in the hash function set is determined according to the count, and the number of unique values of the data set is then computed using a hash function set that has the determined number of functions. The size of the hash function set is thereby matched to the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that a fixed-size hash function set cannot deliver both.
Embodiment 2
To clearly illustrate the implementation mentioned in the previous embodiment, in which the number of functions in the hash function set is adjusted continually during the traversal according to the running count, this embodiment provides a specific execution flow. FIG. 2 is a schematic flowchart of a data processing method according to Embodiment 2 of the present invention. As shown in FIG. 2, the method includes:
Step 201: Preset an initial hash function set.
Specifically, the number of functions in the preset initial hash function set may be the maximum value; for example, it may be preset to 1024.
Step 202: Read one data item from the data set and count the number of items read so far.
The data items in the data set are read one at a time, and steps 202-206 are performed for each item read.
Step 203: Determine the number of functions according to the number of items read so far.
For example:
when the number of items read reaches 100,000, the number of functions is determined to be 512;
when the number of items read reaches 1,000,000, the number of functions is determined to be 256;
when the number of items read reaches 10,000,000, the number of functions is determined to be 128;
when the number of items read reaches 100,000,000, the number of functions is determined to be 64.
Thus, as more data is read from the data set, the number of functions in the hash function set is reduced step by step. A larger hash function set is used while the data set is small, which improves accuracy; a smaller hash function set is used once the data set is large, which improves efficiency. In this way the size of the hash function set is matched to the size of the data set.
Step 204: Determine whether the determined number of functions is less than the number of functions in the current hash function set. If so, perform step 205; otherwise, perform step 206.
Step 205: Remove hash functions from the hash function set until it contains the determined number of functions.
Step 206: Hash the currently read data item with the current hash function set and store the resulting hash values.
Steps 202-206 are repeated until all data in the data set has been read.
Step 207: Once hash values have been computed for all data in the data set, compute the number of unique values with the FM algorithm from the hash values of the hash functions in the finally determined hash function set.
Because the hash function set is adjusted repeatedly while the data is being read, the estimate is based on the set as determined by the last adjustment. The earlier a data item was read, the more hash values were computed for it. Some of those hash values belong to hash functions that are no longer in the final set; they are invalid, and only the hash values of the hash functions in the final set are used to estimate the number of unique values.
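Steps 201 to 206 can be sketched as a single pass over the data. This is a hedged illustration, not the patented implementation: the salted BLAKE2b hash family, the helper names (`hash_i`, `target_count`, `stream_max_tails`), and the decision to keep only a running trailing-zero maximum per hash function (instead of storing every hash value, as step 206 describes) are all assumptions made for brevity. Dropping the tail of the list plays the role of discarding hash functions whose values become invalid.

```python
import hashlib

def tail_zero(x: int) -> int:
    """Trailing 0 bits of x; a zero hash is treated as 0 (an assumption)."""
    return (x & -x).bit_length() - 1 if x else 0

def hash_i(i: int, item: str) -> int:
    """Hypothetical i-th hash function: BLAKE2b salted with the index i."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

# (items read, function count) pairs from the step 203 example, largest first.
THRESHOLDS = [(100_000_000, 64), (10_000_000, 128),
              (1_000_000, 256), (100_000, 512)]

def target_count(items_read, initial=1024, thresholds=THRESHOLDS):
    """Step 203: function count for the number of items read so far."""
    for limit, count in thresholds:
        if items_read >= limit:
            return count
    return initial

def stream_max_tails(items, initial=1024, thresholds=THRESHOLDS):
    """Single pass returning the running MAX for each surviving hash function."""
    max_tails = [0] * initial                 # step 201: initial function set
    read = 0
    for item in items:                        # step 202: read and count
        read += 1
        wanted = target_count(read, initial, thresholds)
        if wanted < len(max_tails):           # step 204: need fewer functions?
            del max_tails[wanted:]            # step 205: discard the surplus
        for i in range(len(max_tails)):       # step 206: hash the current item
            t = tail_zero(hash_i(i, item))
            if t > max_tails[i]:
                max_tails[i] = t
    return max_tails
```

The `initial` and `thresholds` parameters exist only so the sketch can be exercised on small inputs; the defaults mirror the example values in this embodiment.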
For example, the final hash function set retains 64 hash functions, H1-H64.
For each data item, the hash values of H1-H64 are selected from all stored hash values. Then, for each hash function, the maximum length MAX of the run of trailing 0 bits in the binary representation of its hash values is determined. Following the preset grouping H1-H8; H9-H16; H17-H24; H25-H32; H33-H40; H41-H48; H49-H56; H57-H64, the average of the MAX values in each group is computed, and the median of the group averages is taken as the estimate R. The estimated number of unique values is 2 to the power R.
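The final estimate in this example can be sketched as below, assuming the 64 per-function MAX values are already available in order H1 to H64. The name `estimate_from_max` is hypothetical. Note that this example averages the MAX values themselves and exponentiates once at the end, so R may be fractional.

```python
from statistics import mean, median

def estimate_from_max(max_tails, group_size=8):
    """Average MAX within each group, take the median as R, return 2**R."""
    groups = [max_tails[i:i + group_size]
              for i in range(0, len(max_tails), group_size)]
    r = median(mean(group) for group in groups)
    return 2 ** r
```

If every one of the 64 functions reports MAX = 3, each group average is 3, R is 3, and the estimate is 8.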
Embodiment 3
FIG. 3 is a schematic structural diagram of a data processing apparatus according to Embodiment 3 of the present invention. As shown in FIG. 3, the apparatus includes a statistics module 31, a determining module 32, and a calculation module 33.
The statistics module 31 is configured to count the number of data items in the data set.
Specifically, the statistics module 31 is configured to count the total number of data items contained in the data set.
The determining module 32 is configured to determine the number of functions in the hash function set according to the counted number of data items.
The calculation module 33 is configured to compute, based on the FM algorithm, the number of unique values of the data set using a hash function set that has the determined number of functions.
The data processing apparatus traverses the data once to count it and determines the number of functions in the hash function set from the count after the traversal. Specifically, before the number of unique values is computed, all data in the data set is traversed and the total number of data items is counted; the number of functions in the hash function set is then determined from that count. The data set here may be the set of all data contained in the same column of the data tables. When the number of unique values is computed, all data in the data set is traversed again: the hash values of all data are computed with the determined hash functions and processed according to the FM algorithm to estimate the number of unique values. Because the number of functions in the hash function set is determined from the counted number of data items, and the number of unique values is computed using a hash function set of that size, the size of the hash function set is matched to the size of the data set. This balances execution efficiency against accuracy and solves the prior-art problem that a fixed-size hash function set cannot deliver both.
Embodiment 4
FIG. 4 is a schematic structural diagram of another data processing apparatus according to Embodiment 4 of the present invention.
In the apparatus provided by this embodiment, the statistics module 31 is configured to read the data in the data set item by item and count the number of items read so far.
The determining module 32 is configured to reduce the number of functions in the hash function set step by step as the counted number of items read grows.
As shown in FIG. 4, on the basis of the previous embodiment, the data processing apparatus further includes:
a generating module 34, configured to discard hash functions from the hash function set so that the number of retained hash functions equals the determined number of functions.
The calculation module 33 further includes a hash value unit 331 and a unique value unit 332.
The hash value unit 331 is configured to compute hash values for the data read, using the hash functions retained in the hash function set.
The unique value unit 332 is configured to compute, once all data in the data set has been read, the number of unique values based on the FM algorithm, using the hash values of the hash functions retained in the hash function set.
During the traversal, the number of functions in the hash function set is adjusted continually according to the running count. Specifically, while the data set is being traversed, the number of functions in the hash function set is adjusted, and the hash values of the traversed data are computed with the hash functions in the adjusted set. After all data has been traversed, the number of unique values is computed with the FM algorithm from the hash values of the hash functions in the set as determined by the last adjustment. In practice, the data can be read item by item while counting the number of items read so far. As soon as an item has been read, the subsequent steps are carried out for it: determining the number of functions in the hash function set, and substituting the item into each hash function in the determined set to compute its function value.
Thus, as more data is read from the data set, the number of functions in the hash function set is reduced step by step. A larger hash function set is used while the data set is small, which improves accuracy; a smaller hash function set is used once the data set is large, which improves efficiency. In this way the size of the hash function set is matched to the size of the data set.
It should be noted that the apparatuses provided in Embodiment 3 and Embodiment 4 implement the data processing flows of FIG. 1 and FIG. 2, respectively. For the functions of the functional modules of the data processing apparatuses in Embodiment 3 and Embodiment 4, refer to the descriptions in the foregoing method embodiments; they are not repeated here.
In this embodiment, the number of data items in the data set is counted, the number of functions in the hash function set is determined according to the count, and the number of unique values of the data set is then computed using a hash function set that has the determined number of functions. The size of the hash function set is thereby matched to the size of the data set, which balances execution efficiency against accuracy and solves the prior-art problem that a fixed-size hash function set cannot deliver both.
Embodiment 5
On the basis of Embodiment 1 or 2, Embodiment 5 provides data table processing methods for optimizing operations on data tables, such as join operations or grouping operations, so that they occupy fewer resources and run more efficiently. The resources referred to here may be CPU, memory, or other resources consumed in performing an operation.
FIG. 5 is a schematic flowchart of a data table processing method according to Embodiment 5 of the present invention, used to predict the size of a data table. The method includes:
Step 501: Process the data table using the data processing method of Embodiment 1 or Embodiment 2 to obtain the number of unique values.
Step 502: Predict the size of the data table according to the number of unique values.
In one possible application scenario, the data table processing method of FIG. 5 can be used to predict the size of a data table, and the predicted size makes it easier to allocate the resources the data table requires.
图6为本发明实施例五提供的另一种数据表处理方法的流程示意图,该数据表处理方法用于评估数据表操作,在图5所提供的方法中的步骤501之后,进一步包括:FIG. 6 is a schematic flowchart of another data table processing method according to Embodiment 5 of the present invention. The data table processing method is used to evaluate data table operations. After step 501 in the method provided in FIG. 5, the method further includes:
步骤503、根据所预测出的数据表规模,对数据表的操作进行评估,以确定操作所需占用的资源。Step 503: Evaluate the operation of the data table according to the predicted data table size to determine resources required for the operation.
在一种可能的应用场景中,可以对数据表A和数据表B进行连接操作,基于图6所提供的数据表处理方法,可以首先对各数据表预测规模,进而对连接数据表A和B所需占用的资源进行评估,从而便于进行资源的分配。 In a possible application scenario, the data table A and the data table B may be connected. Based on the data table processing method provided in FIG. 6, the scale may be first predicted for each data table, and then the connection data tables A and B may be The resources required are evaluated to facilitate the allocation of resources.
FIG. 7 is a schematic flowchart of yet another data table processing method provided in Embodiment 5 of the present invention. This method is used to perform data table operations and, after step 503 of the method shown in FIG. 6, further includes:
Step 504: Perform the operation on the data table according to the evaluation result.
Specifically, the execution order of at least two operations to be performed on the data table is determined according to the evaluation result; for example, the execution order may be determined in ascending order of the resources that the operations require. The at least two operations are then performed in the determined order.
In one possible application scenario, data table A, data table B, and data table C are to be joined. Based on the data table processing method shown in FIG. 6, the size of each data table can first be predicted, and the resources required by three operations (joining data tables A and B, joining data tables A and C, and joining data tables B and C) can then be evaluated so that the join operation occupying the fewest resources is selected. According to the evaluation result, the two smaller data tables A and B can be joined first, so that this join occupies fewer resources, and the larger data table C can be joined afterwards, which minimizes the total resources occupied.
It can be seen that the size of the data table is predicted according to the calculated number of unique values, the resources required by data table operations are evaluated according to the prediction result, and the order of the data table operations is then optimized based on the resources that the operations require. This reduces resource occupation and improves operation efficiency when operating on data tables.
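The three-table scenario can be sketched as follows. The code evaluates every pairwise join, picks the pair with the lowest estimated cost to run first, and then appends the remaining tables in ascending order of predicted size. The function name `plan_join_order` and the cost measure (sum of the two predicted sizes) are assumptions made for illustration; the document only requires that operations be ordered from least to most resources.

```python
from itertools import combinations


def plan_join_order(predicted_sizes):
    """Return table names in the order they should be joined.

    predicted_sizes maps a table name to its predicted size.
    The pairwise cost (sum of both sizes) is an assumed stand-in
    for the resource evaluation of step 503.
    """
    names = sorted(predicted_sizes)
    pair_cost = {
        pair: predicted_sizes[pair[0]] + predicted_sizes[pair[1]]
        for pair in combinations(names, 2)
    }
    # Join the cheapest pair first.
    first_pair = min(pair_cost, key=pair_cost.get)
    # Join the remaining tables afterwards, smallest first.
    remaining = sorted(
        (n for n in names if n not in first_pair), key=predicted_sizes.get
    )
    return list(first_pair) + remaining


if __name__ == "__main__":
    print(plan_join_order({"A": 100, "B": 200, "C": 5000}))
```

With the sizes shown, tables A and B are joined first and the larger table C last, which matches the order described in the scenario above.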
Embodiment 6
FIG. 8 is a schematic structural diagram of a data table processing apparatus 60 provided in Embodiment 6 of the present invention. The data table processing apparatus 60 is used to predict the size of a data table and includes a unique value module 61 and a prediction module 62.
The unique value module 61 is configured to process the data table using the data processing apparatus shown in FIG. 3 or FIG. 4 to obtain the number of unique values.
The prediction module 62 is configured to predict the size of the data table according to the number of unique values.
FIG. 9 is a schematic structural diagram of a data table processing apparatus 70 provided in Embodiment 6 of the present invention. The data table processing apparatus 70 is used to evaluate data table operations and includes a prediction module 71 and an evaluation module 72.
The prediction module 71 is configured to predict the size of the data table using the data table processing apparatus 60 shown in FIG. 8.
The evaluation module 72 is configured to evaluate the operation on the data table according to the predicted data table size, to determine the resources that the operation requires.
The operation includes a join operation and/or a grouping operation.
FIG. 10 is a schematic structural diagram of a data table processing apparatus 80 provided in Embodiment 6 of the present invention. The data table processing apparatus 80 is used to perform data table operations and includes an evaluation module 81 and an operation module 82.
The evaluation module 81 is configured to evaluate the resources required by the operation on the data table using the data table processing apparatus 70 shown in FIG. 9.
The operation module 82 is configured to perform the operation on the data table according to the evaluation result.
Specifically, in one possible implementation, as shown in FIG. 11, the operation module 82 includes a determining unit 821 and an execution unit 822.
The determining unit 821 is configured to determine, according to the evaluation result, the execution order of at least two operations to be performed on the data table.
Specifically, the determining unit 821 is configured to determine the execution order of the at least two operations in ascending order of the resources that the operations require.
The execution unit 822 is configured to perform the at least two operations in the determined order.
The size of the data table is predicted according to the calculated number of unique values, the resources required by data table operations are evaluated according to the prediction result, and the data table operations are then optimized based on the resources that they require. This reduces resource occupation and improves operation efficiency when operating on data tables.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (24)

  1. A data processing method, characterized by comprising:
    counting the number of data items in a data set;
    determining, according to the counted number of data items, the number of functions in a hash function set;
    performing, based on the FM algorithm, a unique value count calculation on the data set using a hash function set having the determined number of functions.
  2. The data processing method according to claim 1, characterized in that counting the number of data items in the data set comprises:
    counting the total number of data items contained in the data set.
  3. The data processing method according to claim 1, characterized in that counting the number of data items in the data set comprises:
    reading the data in the data set item by item, and counting the number of data items in the data set that have been read.
  4. The data processing method according to claim 3, characterized in that determining, according to the counted number of data items, the number of functions in the hash function set comprises:
    gradually reducing the number of functions in the hash function set as the counted number of data items that have been read increases.
  5. The data processing method according to claim 4, characterized in that, before performing, based on the FM algorithm, the unique value count calculation on the data set using the hash function set having the determined number of functions, the method comprises:
    discarding hash functions from the hash function set, such that the number of retained hash functions is the determined number of functions.
  6. The data processing method according to claim 5, characterized in that performing, based on the FM algorithm, the unique value count calculation on the data set using the hash function set having the determined number of functions comprises:
    calculating hash values for the read data using the hash functions retained in the hash function set;
    when all the data in the data set has been read, calculating the number of unique values, based on the FM algorithm, using the hash values of the hash functions retained in the hash function set.
  7. A data processing apparatus, characterized by comprising:
    a counting module, configured to count the number of data items in a data set;
    a determining module, configured to determine, according to the counted number of data items, the number of functions in a hash function set;
    a calculation module, configured to perform, based on the FM algorithm, a unique value count calculation on the data set using a hash function set having the determined number of functions.
  8. The data processing apparatus according to claim 7, characterized in that
    the counting module is specifically configured to count the total number of data items contained in the data set.
  9. The data processing apparatus according to claim 7, characterized in that
    the counting module is specifically configured to read the data in the data set item by item and count the number of data items in the data set that have been read.
  10. The data processing apparatus according to claim 9, characterized in that
    the determining module is specifically configured to gradually reduce the number of functions in the hash function set as the counted number of data items that have been read increases.
  11. The data processing apparatus according to claim 10, characterized in that the apparatus comprises:
    a generating module, configured to discard hash functions from the hash function set, such that the number of retained hash functions is the determined number of functions.
  12. The data processing apparatus according to claim 11, characterized in that the calculation module comprises:
    a hash value unit, configured to calculate hash values for the read data using the hash functions retained in the hash function set;
    a unique value count unit, configured to, when all the data in the data set has been read, calculate the number of unique values, based on the FM algorithm, using the hash values of the hash functions retained in the hash function set.
  13. A data table processing method for predicting the size of a data table, characterized by comprising:
    processing the data table using the data processing method according to any one of claims 1 to 6 to obtain the number of unique values;
    predicting the size of the data table according to the number of unique values.
  14. A data table processing method for evaluating data table operations, characterized by comprising:
    predicting the size of a data table using the data table processing method according to claim 13;
    evaluating an operation on the data table according to the predicted data table size, to determine the resources that the operation requires.
  15. The data table processing method according to claim 14, characterized in that the operation comprises a join operation and/or a grouping operation.
  16. A data table processing method for performing data table operations, characterized by comprising:
    evaluating the resources required by an operation on a data table using the data table processing method according to claim 14 or 15;
    performing the operation on the data table according to the evaluation result.
  17. The data table processing method according to claim 16, wherein performing the operation on the data table according to the evaluation result comprises:
    determining, according to the evaluation result, the execution order of at least two operations to be performed on the data table;
    performing the at least two operations in the determined order.
  18. The data table processing method according to claim 17, wherein determining, according to the evaluation result, the execution order of the at least two operations to be performed on the data table comprises:
    determining the execution order of the at least two operations in ascending order of the resources that the operations require.
  19. A data table processing apparatus for predicting the size of a data table, characterized by comprising:
    a unique value module, configured to process the data table using the data processing apparatus according to any one of claims 7 to 12 to obtain the number of unique values;
    a prediction module, configured to predict the size of the data table according to the number of unique values.
  20. A data table processing apparatus for evaluating data table operations, characterized by comprising:
    a prediction module, configured to predict the size of a data table using the data table processing apparatus according to claim 19;
    an evaluation module, configured to evaluate an operation on the data table according to the predicted data table size, to determine the resources that the operation requires.
  21. The data table processing apparatus according to claim 20, characterized in that the operation comprises a join operation and/or a grouping operation.
  22. A data table processing apparatus for performing data table operations, characterized by comprising:
    an evaluation module, configured to evaluate the resources required by an operation on a data table using the data table processing apparatus according to claim 20 or 21;
    an operation module, configured to perform the operation on the data table according to the evaluation result.
  23. The data table processing apparatus according to claim 22, wherein the operation module comprises:
    a determining unit, configured to determine, according to the evaluation result, the execution order of at least two operations to be performed on the data table;
    an execution unit, configured to perform the at least two operations in the determined order.
  24. The data table processing apparatus according to claim 23, wherein
    the determining unit is specifically configured to determine the execution order of the at least two operations in ascending order of the resources that the operations require.
PCT/CN2017/077024 2016-03-25 2017-03-17 Data processing method and apparatus, and data table processing method and apparatus WO2017162102A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610180081.7 2016-03-25
CN201610180081.7A CN107229663B (en) 2016-03-25 2016-03-25 Data processing method and device and data table processing method and device

Publications (1)

Publication Number Publication Date
WO2017162102A1 true WO2017162102A1 (en) 2017-09-28

Family

ID=59899283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077024 WO2017162102A1 (en) 2016-03-25 2017-03-17 Data processing method and apparatus, and data table processing method and apparatus

Country Status (3)

Country Link
CN (1) CN107229663B (en)
TW (1) TWI746517B (en)
WO (1) WO2017162102A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060074826A1 (en) * 2004-09-14 2006-04-06 Heumann John M Methods and apparatus for detecting temporal process variation and for managing and predicting performance of automatic classifiers
CN101901248A (en) * 2010-04-07 2010-12-01 北京星网锐捷网络技术有限公司 Method and device for creating and updating Bloom filter and searching elements
CN102546293A (en) * 2011-12-20 2012-07-04 东南大学 High speed network flow network address measuring method based on Hash bit string multiplexing
CN102968467A (en) * 2012-11-10 2013-03-13 华中科技大学 Optimization method and query method for multiple layers of Bloom Filters

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU3002000A (en) 1999-06-08 2000-12-28 Brio Technology, Inc. Method and apparatus for data access to heterogeneous data sources
US8165221B2 (en) * 2006-04-28 2012-04-24 Netapp, Inc. System and method for sampling based elimination of duplicate data
CN102609441B (en) * 2011-12-27 2014-06-25 中国科学院计算技术研究所 Local-sensitive hash high-dimensional indexing method based on distribution entropy
JP6028567B2 (en) * 2012-12-28 2016-11-16 富士通株式会社 Data storage program, data search program, data storage device, data search device, data storage method, and data search method
CN104424220B (en) * 2013-08-23 2018-07-13 阿里巴巴集团控股有限公司 A kind of data processing method and device
US9256549B2 (en) * 2014-01-17 2016-02-09 Netapp, Inc. Set-associative hash table organization for efficient storage and retrieval of data in a storage system
CN105205052B (en) * 2014-05-30 2019-01-25 华为技术有限公司 A kind of data digging method and device
US10459886B2 (en) * 2014-08-06 2019-10-29 Quest Software Inc. Client-side deduplication with local chunk caching

Also Published As

Publication number Publication date
TW201737057A (en) 2017-10-16
CN107229663A (en) 2017-10-03
TWI746517B (en) 2021-11-21
CN107229663B (en) 2022-05-27

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17769379

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17769379

Country of ref document: EP

Kind code of ref document: A1