CN113868230B

CN113868230B - Large-scale connection optimization method based on Spark computing framework

Info

Publication number: CN113868230B
Application number: CN202111220042.2A
Authority: CN
Inventors: 付蔚; 宾茂梨; 张棚; 李正; 刘庆
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2024-06-04
Anticipated expiration: 2041-10-20
Also published as: CN113868230A

Abstract

The invention relates to a large-table connection optimization method based on Spark computing frames, and belongs to the field of large data computing. The method comprises the following steps: s1: performing data cleaning by combining predicate pushing with a compressed bloom filter, filtering out a large amount of invalid data in a large table, and avoiding a large amount of useless data from entering a shuffle stage; s2: building a Spark-based data inclination detection model, and counting the Key value distribution in the global Map stage through a reservoir sampling algorithm; s3: and cutting the inclined data clusters according to the average load rated capacity by adopting an intermediate data cluster segmentation strategy, so that keys with high occurrence frequency enter other partitions with rapid processing, and the keys are in a uniform distribution state. The invention can filter out a large amount of useless data, improve the data inclination condition, shorten the connection inquiry time, solve the problem of memory overflow of Spark cluster nodes and improve the user satisfaction.

Description

Large-scale connection optimization method based on Spark computing framework

Technical Field

The invention belongs to the field of big data calculation, relates to a large-scale connection optimization method based on Spark computing frames, and can be used for solving the problem of structured data inclination in the fields of smart grids, the Internet, industry and the like, and rapidly and accurately carrying out large-scale connection.

Background

Along with the continuous integration of the Internet, the Internet of things, the social network and the like into the life of people, the data explosion age has been entered, various data (such as data in the fields of smart grids, the Internet, industry and the like) are exponentially increased, and a huge data scale and a complex data relationship provide new challenges for the data analysis technology. The mass data is basically stored and calculated in a table form in a Spark big data calculation frame or a big database, and the connection is the most frequent and basic operation in data processing. In a big data environment, the data table is very large in size, and in a traditional relational database, such as: the processing join operations of Mysql, oracle, DB2, etc. are time consuming and therefore optimizing the large table join operations is necessary.

Spark can be fully compatible with a Hadoop distributed storage access interface, and the performance of a big data computing system is greatly improved by utilizing a distributed memory to process data. Spark SQL is a module for processing structured data in Spark, which can greatly reduce the difficulty of data analysis, but once the data table is very large and connection operation is performed, a large amount of invalid data enters the shuffle process. In addition, in the process of the buffer of the Spark SQL framework, all intermediate data are regarded as key/value structures, the Spark utilizes a hash algorithm to pull the same key into the same Task, if the data quantity difference between the keys is large, and a plurality of keys occupy a particularly large data proportion, data inclination occurs, so that CPU and memory resources of a system cannot be fully utilized in the process of Spark application processing, the operation time is prolonged by the large Task, even memory overflow abnormality is caused, and operation cannot be operated.

Disclosure of Invention

In view of the above, the present invention aims to provide a large table connection optimization method based on Spark computing frames, which is suitable for structured data such as smart grids, the internet or industry, and the like, and to provide two large tables to improve the data inclination condition during join operation, so as to solve the problem of high connection time consumption, further solve the problem of node memory overflow, shorten the connection query time, and improve user satisfaction.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A large-table connection optimization method based on Spark computing framework includes the steps of firstly, carrying out data cleaning on two tables through predicate push down strategy and a compression bloom filter, and filtering out a large amount of useless data in the two tables. After data are cleaned, a data inclination detection model is built, after data inclination is detected, the middle data cluster is divided, the data quantity in each reduce node is balanced, the data inclination condition is improved, and the problem that memory overflow occurs in the large-scale connection process of the Spark node and the time consumption is high is solved.

The method specifically comprises the following steps:

s1: performing data cleaning by combining predicate pushing with a compressed bloom filter, filtering out a large amount of invalid data in a large table, and avoiding a large amount of useless data from entering a shuffle stage;

s2: building a Spark-based data inclination detection model, and counting the Key value distribution in the global Map stage through a reservoir sampling algorithm;

s3: and cutting the inclined data clusters according to the average load rated capacity by adopting an intermediate data cluster segmentation strategy, so that keys with high occurrence frequency enter other partitions with rapid processing, and the keys are in a uniform distribution state.

Further, the step S1 specifically includes: firstly, a filtering expression of the SQL expression is pushed down to a storage layer to directly filter data, so that the data quantity transmitted to calculation is reduced, and the data IO of other columns in the same row of scanning invalid data columns is reduced; and then carrying out Hash mapping by using the compressed Bloom Filter, finding out the attribute connection value commonly owned in the two tables and storing the attribute connection value into a new bit array A and a new bit array B, and carrying out network broadcasting bit array A and network broadcasting bit array B by using the compressed Bloom Filter to remove other invalid data which do not participate in the connection stage.

Further, the step S1 specifically includes the following steps:

S11: after the Spark calculation engine analyzes filters which can be pushed down, the Spark transmits the filters to Parquet for merging operation, and the merging operation is put on leaf nodes to enable the filters to be executed on a data source, so that filtering of invalid data is truly executed in the data source;

S12: putting the connection attribute of the left table and the right table after the execution of the step S11 into two new RDDs, and marking the connection attribute as RDDA and RDDB;

S13: in each node of the Spark cluster, respectively and sequentially reading the attribute connection values of RDDA and RDDB, adopting n groups of hash functions to calculate the attribute connection values in RDDA and RDDB, and then putting the calculated values into a bloom compression filter to generate a bit array until the connection attributes in all the nodes are calculated;

S14: performing OR operation on the RDDA and RDDB bit arrays in each node processed in the step S13 until the bit arrays of each node are processed to obtain CBFA and CBFB, and broadcasting the CBFA and the CBFB into the Spark cluster, wherein the compressed bloom filter is a data structure which greatly saves space, so that the network consumption is in an acceptance range;

S15: after each node of the Spark cluster receives the broadcast CBFA and CBFB, filtering RDDA by using CBFB, mapping the connection attribute in RDDA to the CBFB bit array by using n sets of hash functions of CBFB, which indicates that the connection attribute is common to RDDA and RDDB, and similarly, filtering RDDB by using CBFA, and performing the next connection operation on the filtered RDDA and RDDB.

Further, the step S2 specifically includes: by adopting a Master-Slaves mode, extracting Key value distribution and data by each Slave node through a reservoir sampling algorithm, wherein the probability of each sample being extracted is K/N, and the probability is equal, wherein K represents the number of extracted samples, and N represents the total number of samples; according to the frequency distribution of each Key in the sample, calculating the distribution situation approximate to the whole, and judging whether the large-table data Key value is inclined.

Further, the step S2 specifically includes the following steps:

S21: before performing the map task, a uniform random sample of fixed size k is first selected from the input data without substitution, the goal of this step being to form a base reservoir;

s22: from the (k+1) th sample, the probability of the sample in the reservoir being replaced by the (k+1) th data=the probability of the (k+1) th sample being selected ×the probability of the (i) th sample being replaced, namely The probability of being retained is

S23: for the jth data, where j > k, the probability that the jth data is selected is k/j; the probability of not being replaced by the j+1th number isWhen running to the nth data, the retained probability=the selected probability =the probability of not being replaced, i.e., the continuous multiplication of conditional probabilities: /(I)The probability of retention for each data is/>

S24: and finally, summarizing the data to the Master node by each Slave node to generate reservoir sampling data, wherein the reservoir sampling data can ensure that the key in the original data is closer to the whole condition.

Further, the step S3 specifically includes: after detecting data inclination in the step S2, calculating the average load rated capacity of a data cluster, setting inclination tolerance, and not cutting data when the transmission time of the data cluster data cutting network is longer than the node service processing time; the rest inclined data clusters are cut according to the average load rated capacity, and the size of each data cluster is ensured to be the same as much as possible.

Further, the step S3 specifically includes the following steps:

s31: the data set sampled in step S2 is sc= { SC _i},SC_i representing the key-value key value pair number of the sampled data;

S32: by calculating the standard rated capacity Havg in each bucket, havg is expressed as: Where m is the number of data clusters, h is the number of buckets, and the current remaining capacity of the bucket is represented as { DB ₁,DB₂,…,DB_h };

S33: reverse ordering of SC _i, if SC _i≥DB₁, then a new segment will be split from Havg-sized SC _i and loaded into DB ₁, with the remainder of SC _i-DB₁ and the remaining clusters going to the next iterator;

S34: when SC _i<DB_i, put SC _i into DB _i, recheck the current second largest SC _i-1 for the remaining space to see if it can fill DB _i, if SC _i+SC_i-1≥DB_i, SC _i-1 would be split and traverse the remaining key pairs forward through all remaining DBs _i to see if the remaining key pairs can be fit down;

S35: after each iteration, SC _i will be reordered while setting the skew tolerance, and when Havg < SC _i is less than or equal to Havg 1.1, SC _i is not cut, and because the network overhead time caused by cutting SCi is longer than the data processing time of the current bucket, SCi within the skew tolerance is not processed.

The invention has the beneficial effects that: the invention is beneficial to filtering out a large amount of useless data when two large tables in the Spark computing framework are connected and inquired, improving the data inclination condition, shortening the connection and inquiry time, solving the problem of memory overflow of Spark cluster nodes and improving the user satisfaction.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.

Drawings

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in the following preferred detail with reference to the accompanying drawings, in which:

FIG. 1 is an overall flow chart of a large table connection optimization method based on Spark computing framework of the present invention;

FIG. 2 is a diagram of a predicate push strategy in combination with a compressed bloom filter for data filtering according to the present invention;

fig. 3 is a flowchart of a construction data inclination detection model.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may be practiced or carried out in other embodiments that depart from the specific details, and the details of the present description may be modified or varied from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention by way of illustration, and the following embodiments and features in the embodiments may be combined with each other without conflict.

Referring to fig. 1 to 3, the present invention provides a large table connection optimization method based on Spark computing framework, which specifically includes the following steps:

Step 1: the data is cleaned by using an extended Bloom Filter, and a large amount of useless data is filtered; the data here includes structured data such as smart grids, the internet, or industry.

As shown in FIG. 2, the predicate-push strategy is adopted first to perform primary filtering on the data of the tables A and B which do not meet the connection condition, two new RDDs are generated, the connection attributes of the tables A and B are respectively stored, then the compressed bloom filter is utilized to perform Hash mapping operation on RDDA and RDDB to obtain a plurality of node bit arrays, and finally the OR operation is performed on each bit array to generate a new table A and a new table B after data filtering.

Step 2: constructing a data inclination detection model, and counting the Key value distribution of a global Map stage in a reservoir sampling mode;

as shown in FIG. 3, the data detection model adopts a Master-Slave mode, a Master is deployed on a Driver node of Spark, and a Slave is deployed on a workbench node. And carrying out reservoir sampling according to a set ratio by utilizing RDD SAMPLE operators, increasing a Key counter continuously along with the sampling, and judging whether a Key distribution histogram reaches a steady state or not by utilizing a Simon model in a long tail theory, wherein the sampling is finished after the steady state is reached. And finally, returning the Key distribution histogram to the Master node by each Slave node to generate a global Key data inclination detection model.

Step 3: load balancing strategy of Reduce node aiming at data inclination;

1) Recording the sampled data set as SC= { SC _i},SC_i to represent the key-value key value pair quantity of the sampled data;

2) By calculating the standard rated capacity Havg in each bucket, havg is expressed as: Where m is the number of data clusters, h is the number of buckets, and the current remaining capacity of the bucket is represented as { DB ₁,DB₂,…,DB_h };

3) Reverse ordering of SC _i, if SC _i≥DB₁, then a new segment will be split from Havg-sized SC _i and loaded into DB ₁, with the remainder of SC _i-DB₁ and the remaining clusters going to the next iterator;

4) When SC _i<DB_i, put SC _i into DB _i, recheck the current second largest SC _i-1 for the remaining space to see if it can fill DB _i, if SC _i+SC_i-1≥DB_i, SC _i-1 would be split and traverse the remaining key pairs forward through all remaining DBs _i to see if the remaining key pairs can be fit down;

5) After each iteration, SC _i will be reordered while setting the skew tolerance, and when Havg < SC _i is less than or equal to Havg 1.1, SC _i is not cut, and because the network overhead time caused by cutting SCi is longer than the data processing time of the current bucket, SCi within the skew tolerance is not processed.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims

1. A large-table connection optimization method based on Spark computing framework is characterized by comprising the following steps:

S1: the predicate pushing is utilized to combine with the compressed bloom filter to clean data, a large amount of invalid data in a large table is filtered, and a large amount of useless data is prevented from entering a shuffle stage, and the method specifically comprises the following steps: firstly, pushing down a filtering expression of the SQL expression to a storage layer to directly filter data; then, hash mapping is carried out by utilizing the compressed Bloom Filter, the attribute connection value commonly owned in the two tables is found and stored into a new bit array A and a new bit array B, and the compressed Bloom Filter is utilized to carry out network broadcasting on the bit array A and the bit array B so as to remove other invalid data which do not participate in the connection stage;

s2: building a Spark-based data inclination detection model, and counting the Key value distribution of a global Map stage through a reservoir sampling algorithm, wherein the method specifically comprises the following steps: by adopting a Master-Slaves mode, extracting Key value distribution and data by each Slave node through a reservoir sampling algorithm, wherein the probability of each sample being extracted is K/N, K represents the number of extracted samples, and N represents the total number of samples; calculating the distribution condition approximate to the whole according to the frequency distribution of each Key in the sample, and judging whether the large-table data Key value is inclined or not;

S3: cutting the inclined data cluster according to the average load rated capacity by adopting an intermediate data cluster segmentation strategy, so that keys with high occurrence frequency enter other partitions with rapid processing, and the keys are in a uniform distribution state, and the method specifically comprises the following steps: after detecting data inclination in the step S2, calculating the average load rated capacity of a data cluster, setting inclination tolerance, and not cutting data when the transmission time of the data cluster data cutting network is longer than the node service processing time; cutting the rest inclined data clusters according to the average load rated capacity;

the step S3 specifically comprises the following steps:

s35: after each iteration, SC _i will be reordered while setting the tilt tolerance, and SC _i will not be cut when Havg < SC _i is less than Havg 1.1.

2. The large table connection optimization method according to claim 1, wherein the step S1 specifically includes the steps of:

S14: performing OR operation on the bit arrays RDDA and RDDB in each node processed in the step S13 until the bit arrays of each node are processed to obtain CBFA and CBFB, and broadcasting the CBFA and the CBFB into a Spark cluster;

3. The large table connection optimization method according to claim 1, wherein the step S2 specifically includes the steps of:

S21: before performing a map task, first selecting a uniform random sample with a fixed size k from the input data without substitution to form a base reservoir;

s22: from the (k+1) th sample, the probability of the sample in the reservoir being replaced by the (k+1) th data=the probability of the (k+1) th sample being selected ×the probability of the (i) th sample being replaced, namely The probability of being retained is/>

S24: and finally, summarizing the data to the Master node by each Slave node to generate reservoir sampling data.