CN114296911A - Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster - Google Patents

Info

Publication number
CN114296911A
CN114296911A (application CN202111509160.5A)
Authority
CN
China
Prior art keywords: rest, profiling, data set, cluster, block size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111509160.5A
Other languages
Chinese (zh)
Inventor
万烂军
张根
赵昊鑫
李长云
王志兵
张潇云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202111509160.5A
Publication of CN114296911A
Legal status: Pending

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression. The method divides a large-scale data set to be processed into sub-data sets used for optimizing the block size and remaining sub-data sets to be processed; based on the block size and the time consumed recorded for each already-processed sub-data set, a local weighted linear regression algorithm estimates online, and more accurately, the block size corresponding to each remaining sub-data set to be processed. The invention solves the problems of heavy dependence on manual experience and of time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves to a certain extent the efficiency of processing large-scale data sets in a Dask cluster.

Description

Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster
Technical Field
The invention relates to the technical field of performance optimization of big data parallel processing, in particular to a dynamic data blocking method based on local weighted linear regression and oriented to a Dask cluster.
Background
When a parallel application is executed in a Dask cluster to process a large-scale data set, the data set must first be partitioned. CN201410836567.2, "Big data parallel computing method and device", discloses partitioning a data set according to the data-set size, the cluster memory size, and the degree of parallelism, yielding a blocked data set composed of multiple data partitions; the blocked data set is then used as the training data set of a logistic regression classification algorithm, and the optimal weight vector of the logistic regression function is solved to obtain the logistic regression classifier. In that embodiment, the data set is partitioned according to the data-set size, the cluster memory size, and the parallelism; that is, computing resources are fully considered when partitioning big data, and the partition size is determined by the computing resources, so that each computing node fully utilizes its computing resources when processing a data partition, improving the processing efficiency of the partitions and thus the parallel computing performance on big data. Since the block-size setting has a great influence on the efficiency of processing a large-scale data set, partitioning the data set reasonably so as to improve that efficiency to a certain extent is of important significance.
When a large-scale data set is partitioned, the block size is mainly set by one of the following two methods:
One is data blocking based on human experience. For a specific large-scale data set, a specific parallel application, and a specific cluster environment, different block sizes are selected manually, based on experience, across multiple experiments; each experiment partitions the large-scale data set with one fixed block size and records the total time consumed to process it, and finally the block size corresponding to the shortest total time across the experiments is selected as the block size for dividing the large-scale data set.
The other is data blocking based on machine learning. Suitable fixed block sizes adopted when different parallel applications are executed in different cluster environments to process data sets of different scales are first collected as training samples; a block-size prediction model is then trained on a large number of such samples with a machine-learning algorithm; finally, the model is used to predict the block size required when a specific parallel application is executed in a specific cluster environment to process a data set of a specific scale.
Although both the experience-based and the machine-learning-based methods can select a relatively suitable fixed block size for partitioning a specific large-scale data set for a specific parallel application in a specific cluster environment, the experience-based method depends heavily on programmers' expertise and requires a large number of trials, while the machine-learning method must collect a large number of training samples for costly offline training. Both approaches are time-consuming and labor-intensive, and both adapt poorly to changes in the data set, the parallel application, and the cluster environment.
Disclosure of Invention
The invention aims to solve the technical problem that existing data blocking methods based on manual experience and machine learning are time-consuming and labor-intensive and adapt poorly to changes in the data set, the parallel application, and the cluster environment, and provides a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression.
The purpose of the invention is realized by the following technical scheme:
A dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression comprises the following steps:
S1. Divide a large-scale data set X into N sub-data sets, and split them sequentially into a set X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n) of sub-data sets used for block-size optimization and a set X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m) of remaining sub-data sets to be processed, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, and N = n + m;
S2. Set the block size corresponding to each sub-data set in the optimization set X_profiling;
S3. Process the sub-data sets in X_profiling
Read the i-th sub-data set X_profiling.i from X_profiling, partition X_profiling.i according to the specified block size M_profiling.i, then execute the parallel application in the Dask cluster to process X_profiling.i and record the time T_profiling.i consumed; finally, add M_profiling.i and T_profiling.i to the observation matrix O_{N×2}, where 1 ≤ i ≤ n;
S4. Process the first sub-data set of X_rest
S41. Obtain from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, compute the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1, record the time T_rest.1 consumed, and add M_rest.1 and T_rest.1 to the observation matrix, i.e., replace the two values in row n + 1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeat step S4 to compute the block size and the time consumed for the next sub-data set of X_rest and add them to the observation matrix;
S6. After all sub-data sets of X_rest have been processed, obtain the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j.
Further, the ratio of the set X_profiling of sub-data sets used for block-size optimization to the set X_rest of remaining sub-data sets to be processed is 2:8.
Further, the relation among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest to be processed is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
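For illustration, the splitting rules above (sequential assignment, 2:8 ratio, disjoint union) can be sketched in Python as follows; the function name and the representation of the sub-data sets as an ordered list are assumptions for the sketch, not part of the patent text.

```python
# A minimal sketch of the step-S1 split, assuming the N sub-data sets are
# held in an ordered list; split_dataset and profiling_ratio are
# illustrative names, not from the patent.
def split_dataset(sub_sets, profiling_ratio=0.2):
    """Split N sub-data sets sequentially into X_profiling and X_rest (2:8)."""
    n = max(1, int(len(sub_sets) * profiling_ratio))  # n = |X_profiling|
    x_profiling = sub_sets[:n]  # first n sets: used to seed the observations
    x_rest = sub_sets[n:]       # remaining m sets: block sizes estimated online
    return x_profiling, x_rest  # disjoint, and their union is X
```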
Further, setting the block size in step S2 comprises:
S21. Determine an initial block size M_init and a block-size variation d;
S22. From the initial block size M_init and the block-size variation d, compute in turn the block size corresponding to each sub-data set in X_profiling.
Further, the block size M in S2profiling.iComprises the following steps:
Figure RE-GDA0003532601450000033
Figure RE-GDA0003532601450000034
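Because the defining formulas are supplied only as images, the sketch below uses an assumed symmetric arithmetic schedule around M_init with step d; it respects the constraints stated next (all sizes positive, spanning a range around M_init), but it is an assumption, not the patented formula.

```python
# Hypothetical profiling schedule: alternate above and below M_init in
# steps of d, clamped to stay positive. An assumed stand-in for the
# image-only formulas in the original filing.
def profiling_block_sizes(m_init, d, n):
    sizes = []
    for i in range(n):
        offset = ((i + 1) // 2) * d      # 0, d, d, 2d, 2d, ...
        sign = 1 if i % 2 == 1 else -1   # alternate above/below m_init
        sizes.append(max(1, m_init + sign * offset))
    return sizes
```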
Further, the initial block size M_init must not be too large: the intermediate results generated when the parallel application processes multiple blocks in the Dask cluster must not exceed the main-memory size or the GPU-memory size. The initial block size M_init must also not be too small: the time the parallel application spends processing each block in the Dask cluster must be significantly longer than the time Dask spends scheduling tasks for that block.
Further, the block-size variation d must ensure that the block sizes corresponding to all sub-data sets in X_profiling are greater than 0, and that a more suitable block size can be found between the minimum and maximum block sizes assigned across X_profiling.
Further, the observation matrix O_{N×2} is an N × 2 matrix whose k-th row holds the block size M_k used for the k-th processed sub-data set (column 1) and the time T_k consumed (column 2), where N is the number of rows and N = n + m.
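A sketch of this structure, assuming NumPy (the variable names are illustrative):

```python
import numpy as np

# Observation matrix O with N = n + m rows and two columns: row k holds
# (M_k, T_k) for the k-th processed sub-data set. Rows for sub-data sets
# not yet processed stay zero until they are overwritten.
N = 100                  # example value; N = n + m
O = np.zeros((N, 2))     # column 0: block size M_k, column 1: time T_k

def record(O, k, block_size, elapsed):
    """Overwrite row k (0-indexed) with the newly observed (M_k, T_k)."""
    O[k, 0] = block_size
    O[k, 1] = elapsed
```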
Further, adding M_rest.1 and T_rest.1 to the observation matrix means replacing, each time, the two values of the corresponding row with the newly obtained values; in general, the M_i and T_i obtained at the i-th processing step replace the two values in row i of the matrix.
Further, the weight w_k of the local weighted linear regression algorithm is computed as:
w_k = exp(−(T_min − O_{k,2})² / (2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_{1,2}, O_{2,2}, ..., O_{n+j−1,2}), k is a loop variable, σ is the Gaussian kernel parameter, and O_{k,2} denotes the value in row k, column 2 of the matrix.
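The weight formula translates directly into code; a sketch, with `rows` = n + j − 1 denoting the number of filled rows:

```python
import numpy as np

# Gaussian kernel weights centred on T_min: observations whose time is
# close to the best time seen so far dominate the regression.
def lwlr_weights(O, rows, sigma):
    times = O[:rows, 1]                      # O_{k,2}: observed times
    t_min = times.min()                      # T_min over processed rows
    return np.exp(-(t_min - times) ** 2 / (2.0 * sigma ** 2))
```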
Further, the regression coefficient θ is:
θ = θ1/θ2
where θ1 = Σ_{k=1}^{n+j−1} w_k·O_{k,1}·O_{k,2} and θ2 = Σ_{k=1}^{n+j−1} w_k·O_{k,2}².
Further, the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ · T_min.
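Combining the weights, the regression coefficient, and the estimate gives the following sketch. The closed form used here, θ = Σ w·M·T / Σ w·T², is the standard no-intercept weighted least-squares fit consistent with M_rest.j = θ·T_min; the original filing gives the θ expressions only as an image, so treat this as an illustration of the idea rather than the patented expression.

```python
# Estimate the block size for the next sub-data set in X_rest from the
# rows of O filled so far (rows = n + j - 1).
def estimate_block_size(O, rows, sigma):
    w = lwlr_weights(O, rows, sigma)     # Gaussian weights, as above
    M, T = O[:rows, 0], O[:rows, 1]      # observed (block size, time) pairs
    theta = (w * M * T).sum() / (w * T * T).sum()   # theta = theta1 / theta2
    return theta * T.min()               # M_rest.j = theta * T_min
```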
Further, step S6 comprises judging whether all sub-data sets of X_rest have been processed; if so, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeating step S5.
Compared with the prior art, the beneficial effects are:
The invention provides a dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression. A large-scale data set is divided into two groups of sub-data sets; the block sizes and times consumed of all processed sub-data sets are collected, and a local weighted linear regression algorithm is used to dynamically estimate online, and more accurately, the block size corresponding to each remaining sub-data set to be processed. This avoids heavy dependence on manual experience and time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves to a certain extent the efficiency of processing large-scale data sets in a Dask cluster.
Drawings
FIG. 1 is a flow chart of the dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression;
FIG. 2 is a diagram of dynamic data partitioning oriented to a Dask cluster and based on local weighted linear regression;
FIG. 3 compares the performance of processing a large data set in a Dask cluster using two different data blocking methods and three different parallel applications.
Detailed Description
The present invention will be further described with reference to the following detailed description. The drawings are provided for illustrative purposes only and are not intended to be limiting; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
Example 1
In the dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression, the large-scale data set is denoted X; the set of sub-data sets used for block-size optimization is X_profiling, and the set of remaining sub-data sets to be processed is X_rest; the number of sub-data sets in X is N, in X_profiling is n, and in X_rest is m; the i-th sub-data set of X_profiling is X_profiling.i, and the j-th sub-data set of X_rest is X_rest.j; the initial block size is M_init; the block size corresponding to X_profiling.i is M_profiling.i, and the block size corresponding to X_rest.j is M_rest.j; the time consumed processing X_profiling.i is T_profiling.i, and the time consumed processing X_rest.j is T_rest.j; the total time consumed processing X is T_total.
The method comprises the following steps:
S1. Divide the large-scale data set X into N sub-data sets, and split them sequentially into a set X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n) of n sub-data sets used for block-size optimization and a set X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m) of m remaining sub-data sets to be processed, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, N = n + m, X = X_profiling ∪ X_rest, and X_profiling ∩ X_rest = ∅.
S2. Set the block size corresponding to each sub-data set in the optimization set X_profiling
S21. Determine an initial block size M_init;
S22. From the initial block size M_init and the block-size variation d, compute in turn the block size corresponding to each sub-data set in X_profiling.
S3, processing XprofilingSub data set of (1)
From XprofilingRead the ith sub-dataset Xprofiling.iAnd according to a specified block size Mprofiling.iTo Xprofiling.iPartitioning and then executing parallel application program pairs X in the Dask clusterprofiling.iProcessing and obtaining the consumption thereofTime Tprofiling.iFinally, M is addedprofiling.iAnd Tprofiling.iAdding the observation matrix ON×2In (1).
S4. Process the first sub-data set of X_rest
S41. Obtain from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, compute the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1, record the time T_rest.1 consumed, and add M_rest.1 and T_rest.1 to the observation matrix;
S5. Repeat step S4 to compute the block size M_rest.j and the time T_rest.j consumed for the next sub-data set;
S6. Judge whether all sub-data sets in X_rest have been processed. If so, obtain the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeat step S5.
Example 2
This embodiment provides a specific algorithm according to the method described in embodiment 1; the specific steps are:
[algorithm pseudocode, supplied as images in the original filing]
Here, "←" denotes assignment in the pseudocode, i.e., it corresponds to "=".
Example 3
In this embodiment, static and dynamic computations are performed on a given data set according to the algorithm described in embodiment 2. The hardware environment adopted is:
[hardware environment table, supplied as an image in the original filing]
The software environment is as follows:
[software environment table, supplied as an image in the original filing]
(1) Static computation
A data set of size 8.7 GB is divided into n blocks, and the n blocks are then processed simultaneously by the parallel application.
(2) Dynamic computation
A data set of size 8.7 GB is divided into m sub-data sets, which are split sequentially into X_profiling and X_rest at a ratio of 2:8. Each sub-data set in X_profiling is partitioned using the initial block size M_init and the block-size variation d, with Gaussian kernel parameter σ; the block counts for the remaining sub-data sets in X_rest are selected dynamically by the program. The m sub-data sets are then processed in sequence by the parallel application.
The parameter settings used for dynamic blocking are as follows:

                      m    M_init      d      σ
K-Means              100    10000   1000   0.26
DBSCAN               100     2000     50   0.25
SpectralClustering   100     1000     50   2.00
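For reference, the table translates into plain configuration dictionaries for the sketch functions above (values copied from the table; the key names are assumed):

```python
PARAMS = {
    "K-Means":            {"m": 100, "m_init": 10000, "d": 1000, "sigma": 0.26},
    "DBSCAN":             {"m": 100, "m_init": 2000,  "d": 50,   "sigma": 0.25},
    "SpectralClustering": {"m": 100, "m_init": 1000,  "d": 50,   "sigma": 2.00},
}
```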
Static and dynamic computations were carried out with K-Means, DBSCAN, and SpectralClustering respectively; the resulting run times are shown in the following table:

                        Static        Dynamic
K-Means              2103.95 s     2015.64 s
DBSCAN               4480.41 s     4217.33 s
SpectralClustering  30746.53 s    27518.99 s
As can be seen from the table above and FIG. 3, the method of the invention improves the efficiency of processing large-scale data sets when data are processed in parallel.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression, characterized by comprising the following steps:
S1. Dividing a large-scale data set X into N sub-data sets, and splitting them sequentially into a set X_profiling of sub-data sets used for block-size optimization and a set X_rest of remaining sub-data sets to be processed;
S2. Setting the block size corresponding to each sub-data set in the optimization set X_profiling;
S3. Processing the sub-data sets in X_profiling
Reading the i-th sub-data set X_profiling.i from X_profiling, partitioning X_profiling.i according to the specified block size M_profiling.i, then executing the parallel application in the Dask cluster to process X_profiling.i and recording the time T_profiling.i consumed; finally, adding M_profiling.i and T_profiling.i to the observation matrix O_{N×2}, where 1 ≤ i ≤ n;
S4. Processing the first sub-data set of X_rest
S41. Obtaining from the observation matrix the minimum T_min of the times consumed by all processed sub-data sets, computing the regression coefficient θ of the local weighted linear regression algorithm from the observation matrix, using T_min and θ to estimate the block size M_rest.1 corresponding to X_rest.1, and partitioning X_rest.1 by M_rest.1;
S42. Executing the parallel application in the Dask cluster to process X_rest.1, recording the time T_rest.1 consumed, and adding M_rest.1 and T_rest.1 to the observation matrix, i.e., replacing the two values in row n + 1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeating step S4 to compute the block size and the time consumed for the next sub-data set of X_rest and adding them to the observation matrix;
S6. After all sub-data sets of X_rest have been processed, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j.
2. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the ratio of the number n of sub-data sets in the block-size-optimization set X_profiling to the number m in the remaining set X_rest to be processed is 2:8.
3. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the relation among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest to be processed is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
4. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that setting the block size in step S2 comprises:
S21. Determining an initial block size M_init;
S22. From the initial block size M_init and the block-size variation d, computing in turn the block size corresponding to each sub-data set in X_profiling.
5. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 4, characterized in that the block size M_profiling.i in S2 is given by:
[formulas for M_profiling.i, supplied as images in the original filing]
6. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the observation matrix O_{N×2} is an N × 2 matrix whose k-th row holds the block size M_k used for the k-th processed sub-data set (column 1) and the time T_k consumed (column 2), where N is the number of rows and N = n + m.
7. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the weight w_k of the local weighted linear regression algorithm is computed as:
w_k = exp(−(T_min − O_{k,2})² / (2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_{1,2}, O_{2,2}, ..., O_{n+j−1,2}), k is a loop variable, and O_{k,2} denotes the value in row k, column 2 of the matrix.
8. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the regression coefficient θ is:
θ = θ1/θ2
where θ1 = Σ_{k=1}^{n+j−1} w_k·O_{k,1}·O_{k,2} and θ2 = Σ_{k=1}^{n+j−1} w_k·O_{k,2}².
9. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ · T_min.
10. The dynamic data blocking method oriented to a Dask cluster and based on local weighted linear regression according to claim 1, characterized in that step S6 comprises judging whether all sub-data sets of X_rest have been processed; if so, obtaining the total time consumed for processing the large-scale data set X:
T_total = Σ_{i=1}^{n} T_profiling.i + Σ_{j=1}^{m} T_rest.j;
if not, repeating step S5.
CN202111509160.5A 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster Pending CN114296911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111509160.5A CN114296911A (en) 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster


Publications (1)

Publication Number Publication Date
CN114296911A (en) 2022-04-08

Family

ID=80968359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111509160.5A Pending CN114296911A (en) 2021-12-10 2021-12-10 Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster

Country Status (1)

Country Link
CN (1) CN114296911A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination