CN113672619A

CN113672619A - Method for segmenting data more uniformly according to hash rule

Info

Publication number: CN113672619A
Application number: CN202110942746.4A
Authority: CN
Inventors: 赵伟; 李南锋
Original assignee: Tianjin Nankai University General Data Technologies Co ltd
Current assignee: Tianjin Nankai University General Data Technologies Co ltd
Priority date: 2021-08-17
Filing date: 2021-08-17
Publication date: 2021-11-19
Anticipated expiration: 2041-08-17
Also published as: CN113672619B

Abstract

The invention provides a method for segmenting data more uniformly according to a hash rule. The number of hash buckets is first calculated according to the configured memory size. The data set to be segmented is then sampled, and the number of occurrences of each identical data value is recorded during sampling. The recorded values are sorted by number of occurrences, and the most frequent values are recorded as topN data information; these values are then split off separately to form independent hash data blocks. Because the data blocks are segmented more uniformly, multiple threads can finish their work at about the same time, which solves the problem of a single thread taking too long because the amount of data assigned to it is too large.

Description

Method for segmenting data more uniformly according to hash rule

Technical Field

The invention belongs to the field of databases, and particularly relates to a method for segmenting data more uniformly according to a hash rule.

Background

A join operation in a database associates two tables during a query to form the set of rows of their Cartesian product, usually together with a where condition that filters out unwanted rows, so that only the combinations of rows from the two tables that are actually needed remain.

When two tables are joined in a query, a join condition is usually specified, and in many cases it is an equality condition on related columns of the two tables, for example select * from t1, t2 where t1.a = t2.a. In the database kernel, such a query is processed by multi-threaded parallel computation. Before multi-threaded processing starts, the data of the two tables must be split so that rows with the same value are handled by the same thread. The split usually uses a hash algorithm, so that data with the same hash value is placed in the same data block.

The problem is that the amount of data sharing one hash value can be very large, so the resulting data blocks may be uneven. The threads that process the large blocks then take longer than the other threads, and the overall efficiency of the system is limited by these slowest threads (the barrel effect). Dividing the data more uniformly therefore improves system efficiency.

In addition, the number of hash buckets affects the efficiency of the system. If there are too few hash buckets, the data cannot all be loaded into memory and must be read from disk repeatedly. The number of hash buckets therefore needs to be evaluated, and it is estimated from the memory size and information about the data.

Disclosure of Invention

In view of this, the present invention aims to provide a method for segmenting data more uniformly according to a hash rule, so as to solve the problem that a single thread takes too long because the amount of data assigned to it is too large.

To achieve this purpose, the technical scheme of the invention is realized as follows:

A method for segmenting data more uniformly according to a hash rule comprises the following steps:

S1, sampling the data to be segmented, and recording the number of occurrences of each identical data value during sampling;

S2, sorting the sampled data by number of occurrences to form topN data information;

S3, evaluating the number of hash buckets from the configured memory size and the data volume;

S4, segmenting the data into data block files through a hash algorithm according to the number of hash buckets and the topN data information, and computing the average number of data pieces per data block file;

S5, judging, according to the set condition, whether the number of data pieces in each data block file meets the condition; if so, repeating steps S2-S4 for that data block file; otherwise, finishing the segmentation.

Further, in step S1, sampling is performed in proportion, and the sampling process specifically includes the following steps:

first, determining the number of records to sample: 10% of the total number of data records is taken as the total number to be sampled;

second, calculating the sampling points: the data records are divided into 100 parts, and the starting position of each part is taken as a sampling start point;

third, calculating the number of records to sample at each sampling point: the total number to be sampled is divided by 100 to obtain the number of records to sample at each sampling point.
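The three sampling steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `sampling_plan` and its parameters are hypothetical, and integer division is assumed wherever the text does not say how fractions are handled.

```python
def sampling_plan(total_rows, sample_percent=10, parts=100):
    """Return (start_offset, rows_to_sample) pairs for proportional sampling.

    Names and rounding rules are assumptions made for illustration.
    """
    # Step 1: total number of rows to sample (10% of all rows by default).
    total_to_sample = total_rows * sample_percent // 100
    # Step 2: split the rows into `parts` parts; each part's first row is a start point.
    part_size = total_rows // parts
    # Step 3: rows to read at each sampling point.
    per_point = total_to_sample // parts
    return [(i * part_size, per_point) for i in range(parts)]


plan = sampling_plan(100_000)
print(plan[:3])
```

For a table of 100000 rows, this plan reads 100 rows at each of 100 evenly spaced start points, which adds up to the 10% sample.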

Further, the hash bucket number evaluation performed in step S3 is obtained by the following evaluation formula:

number of hash buckets = (total number of data pieces × (1 - data repetition rate)) / number of data pieces that can be stored in the memory.
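The evaluation formula can be sketched in a few lines. The function name `estimate_bucket_count` is hypothetical, and rounding the result up to a whole number of buckets (with a minimum of one) is an assumption; the source does not state how a fractional result is handled.

```python
import math


def estimate_bucket_count(total_rows, repetition_rate, rows_in_memory):
    """Estimate the number of hash buckets from the evaluation formula.

    Rounding up is an assumption; the source gives only the raw formula.
    """
    raw = total_rows * (1 - repetition_rate) / rows_in_memory
    return max(1, math.ceil(raw))


print(estimate_bucket_count(100_000, 0.05, 10_000))
```

Rounding up is chosen here so that no bucket is expected to hold more distinct data than fits in memory.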

Further, the process of segmenting the data into data block files through the hash algorithm in step S4 is as follows:

a piece of data is taken out from the topN data, and its hash value is calculated through a hash algorithm, the crc32 algorithm being used to obtain an integer value; the remainder of the integer value divided by the number of hash buckets gives the serial number of the bucket, and the data is put into the corresponding bucket according to that serial number.
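A minimal sketch of the bucket-number calculation, using the crc32 function from Python's standard `zlib` module. The function name `bucket_index` is hypothetical, and serializing the value as a UTF-8 string before hashing is an assumption; the source does not describe how a value is turned into bytes.

```python
import zlib


def bucket_index(value, bucket_count):
    """Map a value to a bucket number via crc32 and a remainder.

    The str()/UTF-8 serialization is an assumption for illustration.
    """
    digest = zlib.crc32(str(value).encode("utf-8"))  # unsigned 32-bit integer
    return digest % bucket_count


for value in (9, 7, 6):
    print(value, bucket_index(value, 20))
```

The remainder always lies between 0 and the number of buckets minus one, and equal values always land in the same bucket.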

Further, the condition set in step S5 is: the number of data pieces in a data block exceeds a set multiple of the average number of data pieces.
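The condition can be checked as sketched below. The function name `oversized_blocks` is hypothetical, and treating a block that reaches exactly the set multiple as oversized is an assumption. The sizes are illustrative: 20 blocks holding 100000 records in total, one of them with 50000 records.

```python
def oversized_blocks(block_sizes, multiple):
    """Return ids of blocks whose size reaches `multiple` times the average size."""
    average = sum(block_sizes.values()) / len(block_sizes)
    return [block_id for block_id, size in block_sizes.items()
            if size >= multiple * average]


# 20 blocks with 100000 records in total, so the average is 5000 records.
block_sizes = {0: 50000}
block_sizes.update({i: 2632 if i <= 11 else 2631 for i in range(1, 20)})

print(oversized_blocks(block_sizes, multiple=10))
```

Only the blocks returned by this check would go through another round of statistics and segmentation.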

Compared with the prior art, the method for segmenting the data more uniformly according to the hash rule has the following beneficial effects:

(1) With the method for segmenting data more uniformly according to the hash rule, the data blocks are segmented more uniformly, so that multiple threads can finish their work at about the same time, which solves the problem of a single thread taking too long because the amount of data assigned to it is too large.

(2) The method determines the number of hash buckets according to the memory size so that the data can be loaded into memory in full, which avoids the multi-pass processing caused by insufficient memory during operation and saves considerable time.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

Fig. 1 is a schematic flow chart of a method for segmenting data more uniformly according to a hash rule according to an embodiment of the present invention;

Fig. 2 is an operation diagram of the method for segmenting data more uniformly according to a hash rule according to the embodiment of the present invention;

Fig. 3 is a schematic diagram illustrating a process of segmenting data according to the embodiment of the present invention;

Fig. 4 is a data processing flow chart of the method for segmenting data more uniformly according to a hash rule according to the embodiment of the present invention;

Fig. 5 is a schematic flow chart of segmenting data into data block files through the hash algorithm according to the embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in Figs. 1 to 5, in the method for segmenting data more uniformly according to a hash rule, the data to be segmented is sampled to obtain the topN data information, i.e. the values with the largest numbers of occurrences. During hash segmentation, each piece of data is compared with the topN information; if the data appears in the topN, it is placed directly into a separate data block without hash-based segmentation.

For the resulting data blocks, the average number of rows per data block is computed. Data blocks whose row count exceeds the average by a certain multiple are analyzed further: the values that occur most often in them are found and added to the existing topN data information. This makes the topN statistics more accurate, and the topN information can be used directly the next time the same data is split.
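The comparison against the topN information during segmentation can be sketched as follows. The function name `split_rows`, the in-memory lists standing in for data block files, and the string serialization before crc32 are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def split_rows(values, top_values, bucket_count):
    """Split values into single-value blocks (topN) and hash blocks (everything else)."""
    single_value_blocks = defaultdict(list)  # one block per topN value, no hashing
    hash_blocks = defaultdict(list)          # bucket number -> values
    for value in values:
        if value in top_values:
            single_value_blocks[value].append(value)
        else:
            index = zlib.crc32(str(value).encode("utf-8")) % bucket_count
            hash_blocks[index].append(value)
    return single_value_blocks, hash_blocks


singles, hashed = split_rows([2, 9, 8, 7, 5, 6, 2], top_values={2, 8, 5}, bucket_count=20)
print(dict(singles))
print(dict(hashed))
```

Because frequent values bypass the hash step, they never share a block with other values that happen to hash to the same bucket.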

As shown in fig. 1 to 4, the specific method includes the following steps:

In step S1, sampling is performed in proportion, and the sampling process specifically includes the following steps:

calculating the sampling points: the data records are divided into 100 parts, and the starting position of each part is taken as a sampling start point;

calculating the number of records to sample at each sampling point: the total number to be sampled is divided by 100 to obtain the number of records to sample at each sampling point.

In step S1, the number of occurrences of each identical value is recorded during sampling using a map container. The key of the map is the data record and the value is its number of occurrences. For each piece of data processed, the container is searched first: if the same key is found, the count stored in the value is incremented; if not, the key is inserted into the map as a new element with its value set to 1.
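The map-based counting described above can be sketched with a Python dict standing in for the map container. The function name `count_occurrences` is hypothetical.

```python
def count_occurrences(sampled_values):
    """Count how often each value occurs, using a dict as the map container."""
    counts = {}
    for value in sampled_values:
        if value in counts:
            counts[value] += 1   # key found: accumulate the occurrence count
        else:
            counts[value] = 1    # key not found: insert it with a count of 1
    return counts


print(count_occurrences([2, 8, 2, 5, 2, 8]))
```

The explicit lookup mirrors the description; in practice `collections.Counter` from the standard library gives the same result.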

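The hash bucket number evaluation performed in step S3 is obtained by the following evaluation formula:

number of hash buckets = (total number of data pieces × (1 - data repetition rate)) / number of data pieces that can be stored in the memory.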

As shown in Fig. 5, the process of segmenting the data into data block files through the hash algorithm in step S4 is as follows:

The condition set in step S5 is: the number of data pieces in a data block exceeds a set multiple of the average number of data pieces.

Such data blocks are analyzed in a way similar to steps S2, S3 and S4. The analysis yields the topN information of each larger data block, which is merged into the overall topN data information, so that in the next segmentation more of the data in the large data blocks is split off separately and those data blocks shrink.

If the same data is hash-segmented again later, the historical topN data information is used; reusing this information saves time and overhead.

In the method of this patent, the number of hash buckets is first calculated according to the configured memory size; the purpose of this step is that the data falling into each hash bucket after hash segmentation can, as far as possible, be loaded into memory in full. The data set to be segmented is then sampled, and the number of occurrences of each identical value is recorded during sampling. The recorded values are sorted by number of occurrences to find the values that occur most often, and the values at the top are recorded as topN data information. During segmentation, these values are split off directly into independent hash data blocks without their hash values being calculated, which avoids the situation in which different data with the same hash value is assigned to the same hash data block. Each frequently occurring value thus ends up in its own data block, so the hash segmentation is more uniform.

In addition, the average number of data pieces per data block is calculated from the segmentation result, and the data blocks that exceed the average by a certain proportion go through statistics and segmentation once more. Because this analysis is carried out on the large data blocks actually produced, the statistics become more accurate. The result is finally merged into the topN data record, so that the next time the same data is segmented, the information can be used directly without sampling and recomputing it.

The specific embodiment is as follows:

This patent takes as an example the tables t1(a int, b varchar(20)) and t2(a int, c varchar(20)) and the query select t1.a, b, c from t1, t2 where t1.a = t2.a;

Here varchar() is a variable-length character type and int is the integer type. select initiates a query in the database and is followed by the column names to be projected; from is followed by the tables to be queried; where introduces the conditional constraints. The statement means: join tables t1 and t2 and query the values of columns a, b and c for the rows satisfying t1.a = t2.a.

1. The number of hash buckets is evaluated for table t1. The table holds 100000 records, 10000 records can be stored in memory, and the data repetition rate is 5%, so

number of hash buckets = 100000 × (1 - 0.05) / 10000 = 9.5

2. Data sampling

The data of t1 and t2 is sampled according to the hash column, i.e. column a of the table, and the values with a large number of occurrences are counted.

Data	Number of occurrences
5	10000
2	30000
8	20000

3. The statistical data is sorted by number of occurrences, and the top N entries are taken.

Data	Number of occurrences
2	30000
8	20000
5	10000

4. When the data is split, each value is checked against the statistical data.

Values that appear in the statistical data are split off separately, without a hash operation, to form single-value data blocks;

that is, when the value is 2, 8 or 5, it is placed in the independent data block for that value;

values that do not appear in the statistical data are hashed and assigned to the corresponding hash buckets to form hash data blocks;

for example, if the remaining values are 9, 7 and 6, a hash value is computed for each and the value is placed into the corresponding hash bucket.

5. After the split is complete, the average number of rows per hash data block is computed; for example, with 100000 rows in total and 20 hash buckets, the average is 5000 rows.

6. The data blocks whose row count exceeds the average by a certain multiple are analyzed again. For example, if one hash bucket holds 50000 rows, which is 10 times the average, that data block is selected and its statistics are computed again in the manner of steps 2 and 3, giving the topN of the large data block.

7. The statistics are completed: the topN of the large data block is merged with the overall topN to form a new overall statistical record that can be reused.
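Steps 3 and 7 of the example can be sketched as follows. The function names `top_n` and `merge_top_n` are hypothetical, the counts for the large data block are invented for illustration, and keeping the larger count when a value appears in both records is an assumed merge rule that the source does not specify.

```python
def top_n(counts, n):
    """Sort (value, occurrences) pairs by occurrences, descending, and keep the first n."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def merge_top_n(overall, block_top):
    """Fold a large block's topN into the overall topN (assumed rule: keep the larger count)."""
    merged = dict(overall)
    for value, count in block_top.items():
        merged[value] = max(merged.get(value, 0), count)
    return merged


# Occurrence counts from step 2 of the example.
sampled = {5: 10000, 2: 30000, 8: 20000}
overall = top_n(sampled, 3)

# Hypothetical statistics for the oversized block of step 6 (not from the source).
block_top = top_n({9: 40000, 7: 6000, 6: 4000}, 1)

merged = merge_top_n(overall, block_top)
print(list(overall))
print(merged)
```

The merged record is what would be stored and reused the next time the same data is split.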

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for splitting data more uniformly according to a hash rule is characterized by comprising the following steps:

2. The method for splitting data according to the hash rule to make the data more uniform as claimed in claim 1, wherein the sampling in step S1 is performed in proportion, and the sampling process specifically includes the following steps:

3. The method for splitting data more uniformly according to the hash rule as claimed in claim 1, wherein the evaluation of the number of hash buckets in step S3 is obtained by the following evaluation formula:

4. The method for splitting data into more uniform data blocks according to the hash rule as claimed in claim 3, wherein the hash algorithm is split into the data block files in step S4 as follows:

5. The method for splitting data according to the hash rule to be more uniform as claimed in claim 1, wherein the conditions set in the step S5 are: data that exceed multiples of the average number of data pieces.