CN114296911A - Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster - Google Patents

Info

Publication number
CN114296911A
CN114296911A (application CN202111509160.5A)
Authority
CN
China
Prior art keywords: rest, profiling, data set, cluster, block size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111509160.5A
Other languages
Chinese (zh)
Inventor
万烂军
张根
赵昊鑫
李长云
王志兵
张潇云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202111509160.5A
Publication of CN114296911A
Legal status: Pending

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression. The method divides a large-scale data set to be processed into sub-data sets used for optimizing the block size and remaining sub-data sets to be processed; based on the block size and the time consumed recorded for each already-processed sub-data set, a local weighted linear regression algorithm estimates online, and more accurately, the block size corresponding to each remaining sub-data set to be processed. The invention solves the problems of heavy dependence on manual experience and of time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves to a certain extent the efficiency of processing large-scale data sets in a Dask cluster.

Description

Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster
Technical Field
The invention relates to the technical field of performance optimization of big data parallel processing, in particular to a dynamic data blocking method based on local weighted linear regression and oriented to a Dask cluster.
Background
When a parallel application is executed in a Dask cluster to process a large-scale data set, the data set must first be partitioned. CN201410836567.2, "Big data parallel computing method and device", discloses partitioning a data set according to the data-set size, the cluster memory size, and the degree of parallelism, yielding a blocked data set composed of multiple data partitions; the blocked data set is then used as the training data set of a logistic regression classification algorithm, and the optimal weight vector of the logistic regression function is solved to obtain the logistic regression classifier. In that embodiment, the data set is partitioned according to the data-set size, the cluster memory size, and the parallelism; that is, computing resources are fully considered when partitioning big data, and the partition size is determined by the computing resources, so that each computing node fully utilizes its computing resources when processing a data partition, improving the processing efficiency of the partitions and thus the parallel computing performance on big data. Since the block-size setting has a great influence on the efficiency of processing a large-scale data set, partitioning the data set reasonably so as to improve that efficiency to a certain extent is of important significance.
When a large-scale data set is partitioned, the block size is mainly set by one of the following two methods:
One is data blocking based on human experience. For a specific large-scale data set, a specific parallel application, and a specific cluster environment, different block sizes are selected manually, based on experience, across multiple experiments; each experiment partitions the large-scale data set with one fixed block size and records the total time consumed to process it, and finally the block size corresponding to the shortest total time across the experiments is selected as the block size for dividing the large-scale data set.
The other is data blocking based on machine learning. Suitable fixed block sizes adopted when different parallel applications are executed in different cluster environments to process data sets of different scales are first collected as training samples; a block-size prediction model is then trained on a large number of such samples with a machine-learning algorithm; finally, the model is used to predict the block size required when a specific parallel application is executed in a specific cluster environment to process a data set of a specific scale.
Although both the experience-based and the machine-learning-based methods can select a relatively suitable fixed block size for partitioning a specific large-scale data set for a specific parallel application in a specific cluster environment, the experience-based method depends heavily on programmers' expertise and requires a large number of trials, while the machine-learning method must collect a large number of training samples for costly offline training. Both approaches are time-consuming and labor-intensive, and both adapt poorly to changes in the data set, the parallel application, and the cluster environment.
Disclosure of Invention
The invention aims to solve the technical problem that existing data blocking methods based on manual experience and machine learning are time-consuming and labor-intensive and adapt poorly to changes in the data set, the parallel application, and the cluster environment, and provides a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression.
The purpose of the invention is realized by the following technical scheme:
A dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression comprises the following steps:
S1. Divide a large-scale data set X into N sub-data sets, and split them sequentially into a set X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n) of sub-data sets used for block-size optimization and a set X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m) of remaining sub-data sets to be processed, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, and N = n + m;
S2. Set the block size corresponding to each sub-data set in the optimization set X_profiling;
S3. Process the sub-data sets in X_profiling
Read the i-th sub-data set X_profiling.i from X_profiling, partition X_profiling.i according to the specified block size M_profiling.i, then execute the parallel application in the Dask cluster to process X_profiling.i and record the time T_profiling.i consumed; finally, add M_profiling.i and T_profiling.i to the observation matrix O_{N×2}, where 1 ≤ i ≤ n;
S4. Process the first sub-data set of X_rest
S41. Obtain from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, compute the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1, record the time T_rest.1 consumed, and add M_rest.1 and T_rest.1 to the observation matrix, i.e., replace the two values in row n + 1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeat step S4 to compute the block size and the time consumed for the next sub-data set of X_rest and add them to the observation matrix;
S6. After all sub-data sets of X_rest have been processed, obtain the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j.
Further, the ratio of the set X_profiling of sub-data sets used for block-size optimization to the set X_rest of remaining sub-data sets to be processed is 2:8.
Further, the relation among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest to be processed is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
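For illustration, the splitting rules above (sequential assignment, 2:8 ratio, disjoint union) can be sketched in Python as follows; the function name and the representation of the sub-data sets as an ordered list are assumptions for the sketch, not part of the patent text.

```python
# A minimal sketch of the step-S1 split, assuming the N sub-data sets are
# held in an ordered list; split_dataset and profiling_ratio are
# illustrative names, not from the patent.
def split_dataset(sub_sets, profiling_ratio=0.2):
    """Split N sub-data sets sequentially into X_profiling and X_rest (2:8)."""
    n = max(1, int(len(sub_sets) * profiling_ratio))  # n = |X_profiling|
    x_profiling = sub_sets[:n]  # first n sets: used to seed the observations
    x_rest = sub_sets[n:]       # remaining m sets: block sizes estimated online
    return x_profiling, x_rest  # disjoint, and their union is X
```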
Further, setting the block size in step S2 comprises:
S21. Determine an initial block size M_init and a block-size variation d;
S22. From the initial block size M_init and the block-size variation d, compute in turn the block size corresponding to each sub-data set in X_profiling.
Further, the block size M in S2profiling.iComprises the following steps:
Figure RE-GDA0003532601450000033
Figure RE-GDA0003532601450000034
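Because the defining formulas are supplied only as images, the sketch below uses an assumed symmetric arithmetic schedule around M_init with step d; it respects the constraints stated next (all sizes positive, spanning a range around M_init), but it is an assumption, not the patented formula.

```python
# Hypothetical profiling schedule: alternate above and below M_init in
# steps of d, clamped to stay positive. An assumed stand-in for the
# image-only formulas in the original filing.
def profiling_block_sizes(m_init, d, n):
    sizes = []
    for i in range(n):
        offset = ((i + 1) // 2) * d      # 0, d, d, 2d, 2d, ...
        sign = 1 if i % 2 == 1 else -1   # alternate above/below m_init
        sizes.append(max(1, m_init + sign * offset))
    return sizes
```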
Further, the initial block size M_init must not be too large: the intermediate results generated when the parallel application processes multiple blocks in the Dask cluster must not exceed the main-memory size or the GPU-memory size. The initial block size M_init must also not be too small: the time the parallel application spends processing each block in the Dask cluster must be significantly longer than the time Dask spends scheduling tasks for that block.
Further, the block-size variation d must ensure that the block sizes corresponding to all sub-data sets in X_profiling are greater than 0, and that a more suitable block size can be found between the minimum and maximum block sizes assigned across X_profiling.
Further, the observation matrix O_{N×2} is an N × 2 matrix whose k-th row holds the block size M_k used for the k-th processed sub-data set (column 1) and the time T_k consumed (column 2), where N is the number of rows and N = n + m.
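A sketch of this structure, assuming NumPy (the variable names are illustrative):

```python
import numpy as np

# Observation matrix O with N = n + m rows and two columns: row k holds
# (M_k, T_k) for the k-th processed sub-data set. Rows for sub-data sets
# not yet processed stay zero until they are overwritten.
N = 100                  # example value; N = n + m
O = np.zeros((N, 2))     # column 0: block size M_k, column 1: time T_k

def record(O, k, block_size, elapsed):
    """Overwrite row k (0-indexed) with the newly observed (M_k, T_k)."""
    O[k, 0] = block_size
    O[k, 1] = elapsed
```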
Further, adding M_rest.1 and T_rest.1 to the observation matrix means replacing, each time, the two values of the corresponding row with the newly obtained values; in general, the M_i and T_i obtained at the i-th processing step replace the two values in row i of the matrix.
Further, the weight w_k of the local weighted linear regression algorithm is computed as:
w_k = exp(−(T_min − O_{k,2})² / (2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_{1,2}, O_{2,2}, ..., O_{n+j−1,2}), k is a loop variable, σ is the Gaussian kernel parameter, and O_{k,2} denotes the value in row k, column 2 of the matrix.
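The weight formula translates directly into code; a sketch, with `rows` = n + j − 1 denoting the number of filled rows:

```python
import numpy as np

# Gaussian kernel weights centred on T_min: observations whose time is
# close to the best time seen so far dominate the regression.
def lwlr_weights(O, rows, sigma):
    times = O[:rows, 1]                      # O_{k,2}: observed times
    t_min = times.min()                      # T_min over processed rows
    return np.exp(-(t_min - times) ** 2 / (2.0 * sigma ** 2))
```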
Further, the regression coefficient θ is:
θ = θ1/θ2
where θ1 = Σ_{k=1}^{n+j−1} w_k·O_{k,1}·O_{k,2} and θ2 = Σ_{k=1}^{n+j−1} w_k·O_{k,2}².
Further, the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ · T_min.
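Combining the weights, the regression coefficient, and the estimate gives the following sketch. The closed form used here, θ = Σ w·M·T / Σ w·T², is the standard no-intercept weighted least-squares fit consistent with M_rest.j = θ·T_min; the original filing gives the θ expressions only as an image, so treat this as an illustration of the idea rather than the patented expression.

```python
# Estimate the block size for the next sub-data set in X_rest from the
# rows of O filled so far (rows = n + j - 1).
def estimate_block_size(O, rows, sigma):
    w = lwlr_weights(O, rows, sigma)     # Gaussian weights, as above
    M, T = O[:rows, 0], O[:rows, 1]      # observed (block size, time) pairs
    theta = (w * M * T).sum() / (w * T * T).sum()   # theta = theta1 / theta2
    return theta * T.min()               # M_rest.j = theta * T_min
```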
Further, step S6 comprises judging whether all sub-data sets of X_rest have been processed; if so, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeating step S5.
Compared with the prior art, the beneficial effects are:
The invention provides a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression. A large-scale data set is divided into two groups of sub-data sets; the block sizes and times consumed of all processed sub-data sets are collected, and a local weighted linear regression algorithm is used to dynamically estimate online, and more accurately, the block size corresponding to each remaining sub-data set to be processed. This avoids heavy dependence on manual experience and time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves to a certain extent the efficiency of processing large-scale data sets in a Dask cluster.
Drawings
FIG. 1 is a flow chart of the dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression;
FIG. 2 is a diagram of dynamic data partitioning oriented to a Dask cluster and based on local weighted linear regression;
FIG. 3 compares the performance of processing a large data set in a Dask cluster using two different data blocking methods and three different parallel applications.
Detailed Description
The present invention will be further described with reference to the following detailed description. The drawings are provided for illustrative purposes only and are not intended to be limiting; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
Example 1
In the dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression, the large-scale data set is denoted X; the set of sub-data sets used for block-size optimization is X_profiling, and the set of remaining sub-data sets to be processed is X_rest; the number of sub-data sets in X is N, in X_profiling is n, and in X_rest is m; the i-th sub-data set of X_profiling is X_profiling.i, and the j-th sub-data set of X_rest is X_rest.j; the initial block size is M_init; the block size corresponding to X_profiling.i is M_profiling.i, and the block size corresponding to X_rest.j is M_rest.j; the time consumed processing X_profiling.i is T_profiling.i, and the time consumed processing X_rest.j is T_rest.j; the total time consumed processing X is T_total.
The method comprises the following steps:
S1. Divide the large-scale data set X into N sub-data sets, and split them sequentially into a set X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n) of n sub-data sets used for block-size optimization and a set X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m) of m remaining sub-data sets to be processed, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, N = n + m, X = X_profiling ∪ X_rest, and X_profiling ∩ X_rest = ∅.
S2. Set the block size corresponding to each sub-data set in the optimization set X_profiling
S21. Determine an initial block size M_init;
S22. From the initial block size M_init and the block-size variation d, compute in turn the block size corresponding to each sub-data set in X_profiling.
S3, processing XprofilingSub data set of (1)
From XprofilingRead the ith sub-dataset Xprofiling.iAnd according to a specified block size Mprofiling.iTo Xprofiling.iPartitioning and then executing parallel application program pairs X in the Dask clusterprofiling.iProcessing and obtaining the consumption thereofTime Tprofiling.iFinally, M is addedprofiling.iAnd Tprofiling.iAdding the observation matrix ON×2In (1).
S4. Process the first sub-data set of X_rest
S41. Obtain from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, compute the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1, record the time T_rest.1 consumed, and add M_rest.1 and T_rest.1 to the observation matrix;
S5. Repeat step S4 to compute the block size M_rest.j and the time T_rest.j consumed for the next sub-data set;
S6. Judge whether all sub-data sets in X_rest have been processed. If so, obtain the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeat step S5.
Example 2
This embodiment provides a specific algorithm according to the method described in embodiment 1; the specific steps are:
[algorithm pseudocode, supplied as images in the original filing]
Here, "←" denotes assignment in the pseudocode, i.e., it corresponds to "=".
Example 3
In this embodiment, static and dynamic computations are performed on a given data set according to the algorithm described in embodiment 2. The hardware environment adopted is:
[hardware environment table, supplied as an image in the original filing]
The software environment is as follows:
[software environment table, supplied as an image in the original filing]
(1) Static computation
A data set of size 8.7 GB is divided into n blocks, and the n blocks are then processed simultaneously by the parallel application.
(2) Dynamic computation
A data set of size 8.7 GB is divided into m sub-data sets, which are split sequentially into X_profiling and X_rest at a ratio of 2:8. Each sub-data set in X_profiling is partitioned using the initial block size M_init and the block-size variation d, with Gaussian kernel parameter σ; the block counts for the remaining sub-data sets in X_rest are selected dynamically by the program. The m sub-data sets are then processed in sequence by the parallel application.
The parameter settings used for dynamic blocking are as follows:

                      m    M_init      d      σ
K-Means              100    10000   1000   0.26
DBSCAN               100     2000     50   0.25
SpectralClustering   100     1000     50   2.00
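For reference, the table translates into plain configuration dictionaries for the sketch functions above (values copied from the table; the key names are assumed):

```python
PARAMS = {
    "K-Means":            {"m": 100, "m_init": 10000, "d": 1000, "sigma": 0.26},
    "DBSCAN":             {"m": 100, "m_init": 2000,  "d": 50,   "sigma": 0.25},
    "SpectralClustering": {"m": 100, "m_init": 1000,  "d": 50,   "sigma": 2.00},
}
```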
Static and dynamic computations were carried out with K-Means, DBSCAN, and SpectralClustering respectively; the resulting run times are shown in the following table:

                        Static        Dynamic
K-Means              2103.95 s     2015.64 s
DBSCAN               4480.41 s     4217.33 s
SpectralClustering  30746.53 s    27518.99 s
As can be seen from the table above and FIG. 3, the method of the invention improves the efficiency of processing large-scale data sets when data are processed in parallel.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression, characterized by comprising the following steps:
S1. Dividing a large-scale data set X into N sub-data sets, and splitting them sequentially into a set X_profiling of sub-data sets used for block-size optimization and a set X_rest of remaining sub-data sets to be processed;
S2. Setting the block size corresponding to each sub-data set in the optimization set X_profiling;
S3. Processing the sub-data sets in X_profiling
Reading the i-th sub-data set X_profiling.i from X_profiling, partitioning X_profiling.i according to the specified block size M_profiling.i, then executing the parallel application in the Dask cluster to process X_profiling.i and recording the time T_profiling.i consumed; finally, adding M_profiling.i and T_profiling.i to the observation matrix O_{N×2}, where 1 ≤ i ≤ n;
S4. Processing the first sub-data set of X_rest
S41. Obtaining from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, computing the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, using T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partitioning X_rest.1 by M_rest.1;
S42. Executing the parallel application in the Dask cluster to process X_rest.1, recording the time T_rest.1 consumed, and adding M_rest.1 and T_rest.1 to the observation matrix, i.e., replacing the two values in row n + 1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeating step S4 to compute the block size and the time consumed for the next sub-data set of X_rest and adding them to the observation matrix;
S6. After all sub-data sets of X_rest have been processed, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j.
2. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the ratio of the number n of sub-data sets in the block-size-optimization set X_profiling to the number m in the remaining set X_rest to be processed is 2:8.
3. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the relation among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest to be processed is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
4. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that setting the block size in step S2 comprises:
S21. Determining an initial block size M_init;
S22. From the initial block size M_init and the block-size variation d, computing in turn the block size corresponding to each sub-data set in X_profiling.
5. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 4, characterized in that the block size M_profiling.i in S2 is given by:
[formulas for M_profiling.i, supplied as images in the original filing]
6. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the observation matrix O_{N×2} is an N × 2 matrix whose k-th row holds the block size M_k used for the k-th processed sub-data set (column 1) and the time T_k consumed (column 2), where N is the number of rows and N = n + m.
7. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the weight w_k of the local weighted linear regression algorithm is computed as:
w_k = exp(−(T_min − O_{k,2})² / (2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_{1,2}, O_{2,2}, ..., O_{n+j−1,2}), k is a loop variable, and O_{k,2} denotes the value in row k, column 2 of the matrix.
8. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the regression coefficient θ is:
θ = θ1/θ2
where θ1 = Σ_{k=1}^{n+j−1} w_k·O_{k,1}·O_{k,2} and θ2 = Σ_{k=1}^{n+j−1} w_k·O_{k,2}².
9. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ · T_min.
10. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that step S6 comprises judging whether all sub-data sets of X_rest have been processed; if so, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeating step S5.
CN202111509160.5A 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster Pending CN114296911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111509160.5A CN114296911A (en) 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster


Publications (1)

Publication Number Publication Date
CN114296911A (en) 2022-04-08

Family

ID=80968359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111509160.5A Pending CN114296911A (en) 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster

Country Status (1)

Country Link
CN (1) CN114296911A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination