CN114296911A - Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster
- Publication number
- CN114296911A CN114296911A CN202111509160.5A CN202111509160A CN114296911A CN 114296911 A CN114296911 A CN 114296911A CN 202111509160 A CN202111509160 A CN 202111509160A CN 114296911 A CN114296911 A CN 114296911A
- Authority
- CN
- China
- Prior art keywords
- rest
- profiling
- data set
- cluster
- block size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a dynamic data blocking method for a Dask cluster based on locally weighted linear regression. The method divides a large-scale data set to be processed into sub-data sets used to optimize the block size and remaining sub-data sets to be processed. After processing the block-size-optimization sub-data sets, it uses a locally weighted linear regression algorithm to estimate online, from the block size and elapsed time of each processed sub-data set, a more accurate block size for each remaining sub-data set to be processed. The invention avoids a heavy dependence on manual experience and time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves, to a certain extent, the efficiency of processing large-scale data sets in a Dask cluster.
Description
Technical Field
The invention relates to the technical field of performance optimization of big data parallel processing, in particular to a dynamic data blocking method based on local weighted linear regression and oriented to a Dask cluster.
Background
When a parallel application is executed in a Dask cluster to process a large-scale data set, the data set must first be partitioned into blocks. Patent CN201410836567.2, "Big data parallel computing method and device," discloses partitioning a data set according to its size, the cluster memory size, and the degree of parallelism, yielding a blocked data set consisting of several data blocks; the blocked data set is then used as the training data set of a logistic regression classification algorithm, and the optimal weight vector of the logistic regression function is solved to obtain a logistic regression classifier. In that scheme, computing resources are fully considered when partitioning the big data, and the block size is determined by those resources, so that each computing node makes full use of its resources when processing a block, improving both block-processing efficiency and overall parallel computing performance. Since the block size setting strongly affects the efficiency of processing a large-scale data set, partitioning the data set reasonably so as to improve that efficiency is of significant importance.
When a large-scale data set is partitioned, the block size setting mainly comprises the following two methods:
One is a data blocking method based on manual experience. For a specific large-scale data set, parallel application, and cluster environment, different block sizes are selected manually, based on experience, and multiple experiments are run. Each experiment partitions the large-scale data set with one fixed block size and records the total processing time, and the block size with the shortest total time across all experiments is finally chosen for partitioning the data set.
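The trial-and-error procedure described above can be sketched as follows; the candidate block sizes and the `process` workload are illustrative stand-ins, not the patent's actual experiments.

```python
import time

def process(data, block_size):
    # Illustrative stand-in for running the parallel application on `data`
    # partitioned into chunks of `block_size` elements.
    return [sum(data[i:i + block_size]) for i in range(0, len(data), block_size)]

def pick_block_size(data, candidates):
    """Run one timed experiment per candidate block size and keep the
    size whose experiment had the shortest total elapsed time."""
    best_size, best_time = None, float("inf")
    for size in candidates:
        start = time.perf_counter()
        process(data, size)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    return best_size

chosen = pick_block_size(list(range(100_000)), candidates=[1_000, 5_000, 20_000])
```

Each added candidate multiplies the total experiment cost, which is exactly the drawback the Background attributes to this method.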
The other is a data blocking method based on machine learning. Suitable fixed block sizes observed when different parallel applications process data sets of different scales in different cluster environments are first collected as training samples; a block-size prediction model is then trained on a large number of such samples with a machine learning algorithm; finally, the model predicts the block size needed when a specific parallel application processes a data set of a specific scale in a specific cluster environment.
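A minimal sketch of this offline-training approach, assuming the block size is predicted from data-set size and parallelism with an ordinary least-squares fit (the patent does not name a specific learning algorithm); the training samples are fabricated purely for illustration.

```python
import numpy as np

# Fabricated (data_size_GB, parallelism, best_block_size_MB) samples;
# in practice these come from many costly offline profiling runs.
samples = np.array([
    [1.0, 4,  64.0],
    [2.0, 4, 128.0],
    [4.0, 8, 128.0],
    [8.0, 8, 256.0],
])
X = np.c_[np.ones(len(samples)), samples[:, :2]]   # intercept + features
y = samples[:, 2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)       # ordinary least squares

def predict_block_size(data_size_gb, parallelism):
    """Predict a block size (MB) for a new workload from the fitted model."""
    return float(coef @ [1.0, data_size_gb, parallelism])
```

The model is only as good as its sample coverage, which is why the Background notes that this method adapts poorly when the data set, application, or cluster changes.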
Although both approaches can select a relatively suitable fixed block size for a specific large-scale data set, parallel application, and cluster environment, the manual-experience method depends heavily on the programmer's expertise and requires many trials, while the machine learning method must collect a large number of training samples for costly offline training. Both approaches are time-consuming and labor-intensive, and neither adapts well to changes in the data set, the parallel application, or the cluster environment.
Disclosure of Invention
The invention aims to solve the technical problem that existing data blocking methods based on manual experience and machine learning are time-consuming, labor-intensive, and poorly adapted to changes in the data set, the parallel application, and the cluster environment, and to this end provides a dynamic data blocking method for a Dask cluster based on locally weighted linear regression.
The purpose of the invention is realized by the following technical scheme:
A dynamic data blocking method for a Dask cluster based on locally weighted linear regression comprises the following steps:
S1. Divide a large-scale data set X into N sub-data sets, and split these in order into a set of sub-data sets used to optimize the block size, X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n), and a set of remaining sub-data sets to be processed, X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m), where 1 ≤ i ≤ n, 1 ≤ j ≤ m, and N = n + m;
S2. Set the block size for each sub-data set in the optimization set X_profiling;
S3. Process the sub-data sets in X_profiling:
Read the i-th sub-data set X_profiling.i from X_profiling and partition X_profiling.i according to the specified block size M_profiling.i; then execute the parallel application in the Dask cluster to process X_profiling.i and obtain the elapsed time T_profiling.i; finally, add M_profiling.i and T_profiling.i to the observation matrix O_N×2, where 1 ≤ i ≤ n;
S4. Process the first sub-data set of X_rest:
S41. Obtain from the observation matrix the minimum elapsed time T_min over all processed sub-data sets, compute the regression coefficient θ of the locally weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 for X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1 and obtain the elapsed time T_rest.1; add M_rest.1 and T_rest.1 to the observation matrix, i.e., replace the two values in row n+1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeat step S4 to compute the block size and elapsed time of the next sub-data set of X_rest and add them to the observation matrix;
S6. After all sub-data sets in X_rest have been processed, obtain the total time T_total consumed in processing the large-scale data set X.
Further, the ratio of the set of block-size-optimization sub-data sets X_profiling to the set of remaining sub-data sets to be processed X_rest is 2:8.
Further, the relationship among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
Further, setting the block size in step S2 comprises:
S21. Determine an initial block size M_init and a block size variation d;
S22. From the initial block size M_init and the block size variation d, compute in turn the block size for each sub-data set in X_profiling.
Further, the block size M_profiling.i in S2 is:
M_profiling.i = M_init + (i − 1)·d, (1 ≤ i ≤ n)
Further, the initial block size M_init must not be too large: the intermediate results produced when the parallel application in the Dask cluster processes several blocks must not exceed the main memory size or the GPU memory size. It must also not be too small: the time the parallel application spends processing each block must be significantly longer than the task-scheduling overhead Dask incurs for that block.
Further, the block size variation d must ensure that the block size of every sub-data set in X_profiling is greater than 0, and that a more suitable block size can be found between the minimum and maximum block sizes over all sub-data sets in X_profiling.
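Under the assumption that the profiling block sizes form an arithmetic progression starting at M_init with step d (as implied by step S22), they can be generated as:

```python
def profiling_block_sizes(m_init, d, n):
    """Block sizes for the n profiling sub-data sets, assuming they form an
    arithmetic progression starting at M_init with step d (implied by S22)."""
    sizes = [m_init + i * d for i in range(n)]
    # The variation d must keep every profiling block size positive
    # (relevant when d < 0, i.e. a decreasing progression).
    assert all(s > 0 for s in sizes), "block size variation d too aggressive"
    return sizes

# Values from the K-Means row of Example 3: m = 100 sub-data sets at a
# 2:8 split gives n = 20 profiling sub-data sets.
sizes = profiling_block_sizes(m_init=10_000, d=1_000, n=20)
```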
Further, the observation matrix O_N×2 is:
O_N×2 = [[M_1, T_1], [M_2, T_2], ..., [M_N, T_N]]
where N is the number of rows, N = n + m; the first column holds block sizes and the second column the corresponding elapsed times.
Further, adding M_rest.1 and T_rest.1 to the observation matrix means replacing, each time, the two values of the corresponding row of the matrix with the newly obtained values: the M_i and T_i obtained the i-th time replace the two values of the i-th row of the matrix.
Further, the weight w_k of the locally weighted linear regression algorithm is computed as:
w_k = exp((T_min − O_k,2)² / (−2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_1,2, O_2,2, ..., O_n+j−1,2), k is a loop variable, σ is the Gaussian kernel parameter, and O_k,2 denotes the value in row k, column 2 of the matrix.
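A direct transcription of the weight formula above, with T_min taken over the elapsed times in column 2 of the observation matrix:

```python
import math

def lwlr_weights(times, sigma):
    """Gaussian kernel weights w_k = exp((T_min - O_k,2)^2 / (-2 sigma^2))
    over the elapsed times recorded in column 2 of the observation matrix."""
    t_min = min(times)
    return [math.exp((t_min - t) ** 2 / (-2.0 * sigma ** 2)) for t in times]

# Illustrative elapsed times; the fastest observation gets weight 1 and
# observations far from T_min are down-weighted by the Gaussian kernel.
w = lwlr_weights([12.0, 10.0, 15.0], sigma=2.0)
```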
Further, the regression coefficient θ is:
θ = θ1/θ2
where θ1 = Σ_{k=1..n+j−1} w_k·O_k,1·O_k,2 and θ2 = Σ_{k=1..n+j−1} w_k·(O_k,2)².
Further, the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ·T_min.
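Putting the weight and regression-coefficient formulas together gives one online block-size estimate. The through-the-origin model M = θ·T behind θ1 and θ2 is an assumption of this sketch, since the patent text here states only θ = θ1/θ2 and M_rest.j = θ·T_min:

```python
import math

def estimate_block_size(observations, sigma):
    """One online block-size estimate from past (block_size, elapsed_time)
    observations, the rows of the observation matrix O_N×2. Assumes the
    through-the-origin model M = theta * T behind theta1/theta2."""
    times = [t for _, t in observations]
    t_min = min(times)
    weights = [math.exp((t_min - t) ** 2 / (-2.0 * sigma ** 2)) for t in times]
    theta1 = sum(w * t * m for w, (m, t) in zip(weights, observations))
    theta2 = sum(w * t * t for w, (m, t) in zip(weights, observations))
    return theta1 / theta2 * t_min   # M_rest.j = theta * T_min

# Illustrative observations: (block size, elapsed seconds).
obs = [(10_000, 12.0), (11_000, 10.0), (12_000, 15.0)]
m_next = estimate_block_size(obs, sigma=2.0)
```

The Gaussian weights concentrate the fit on observations whose elapsed times are close to the best seen so far, so the estimate tracks the fastest-performing block sizes.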
Further, step S6 includes judging whether all sub-data sets in X_rest have been processed; if so, the total time T_total consumed in processing the large-scale data set X is obtained; if not, step S5 is repeated.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a dynamic data blocking method based on local weighted linear regression facing to a Dask cluster, which is characterized in that a large data set is divided into two sub data sets to obtain all processed sub data sets, and the local weighted linear regression algorithm is adopted to more accurately estimate the block size corresponding to each remaining sub data set to be processed on line dynamically, so that high dependence on manual experience and time-consuming and labor-consuming off-line training are avoided, the method can better adapt to the changes of the data set, a parallel application program and a cluster environment, and the efficiency of processing the large-scale data set in the Dask cluster is improved to a certain extent.
Drawings
FIG. 1 is a flow chart of the dynamic data blocking method for a Dask cluster based on locally weighted linear regression;
FIG. 2 is a diagram of dynamic data blocking for a Dask cluster based on locally weighted linear regression;
FIG. 3 compares the performance of processing a large data set in a Dask cluster with two different data blocking methods and three different parallel applications.
Detailed Description
The present invention will be further described with reference to the following detailed description, wherein the drawings are provided for illustrative purposes only and are not intended to be limiting; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
In the dynamic data blocking method for a Dask cluster based on locally weighted linear regression, define:
- X: the large-scale data set;
- X_profiling: the set of sub-data sets used to optimize the block size; X_rest: the set of remaining sub-data sets to be processed;
- N: the number of sub-data sets in X; n: the number of sub-data sets in X_profiling; m: the number of sub-data sets in X_rest;
- X_profiling.i: the i-th sub-data set of X_profiling; X_rest.j: the j-th sub-data set of X_rest;
- M_init: the initial block size;
- M_profiling.i: the block size of the i-th sub-data set of X_profiling; M_rest.j: the block size of the j-th sub-data set of X_rest;
- T_profiling.i: the time consumed processing the i-th sub-data set of X_profiling; T_rest.j: the time consumed processing the j-th sub-data set of X_rest;
- T_total: the total time consumed processing X.
The method comprises the following steps:
S1. Divide a large-scale data set X into N sub-data sets, and split these in order into the set of n block-size-optimization sub-data sets X_profiling = (X_profiling.1, X_profiling.2, ..., X_profiling.i, ..., X_profiling.n) and the set of m remaining sub-data sets to be processed X_rest = (X_rest.1, X_rest.2, ..., X_rest.j, ..., X_rest.m), where 1 ≤ i ≤ n, 1 ≤ j ≤ m, N = n + m, X = X_profiling ∪ X_rest, and X_profiling ∩ X_rest = ∅.
S2. Set the block size for each sub-data set in the optimization set X_profiling:
S21. Determine an initial block size M_init;
S22. From the initial block size M_init and the block size variation d, compute in turn the block size for each sub-data set in X_profiling.
S3. Process the sub-data sets in X_profiling:
Read the i-th sub-data set X_profiling.i from X_profiling and partition X_profiling.i according to the specified block size M_profiling.i; then execute the parallel application in the Dask cluster to process X_profiling.i and obtain its elapsed time T_profiling.i; finally, add M_profiling.i and T_profiling.i to the observation matrix O_N×2.
S4. Process the first sub-data set of X_rest:
S41. Obtain from the observation matrix the minimum elapsed time T_min over all processed sub-data sets, compute the regression coefficient θ of the locally weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 for X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1, obtain the elapsed time T_rest.1, and add M_rest.1 and T_rest.1 to the observation matrix;
S5. Repeat step S4 to compute the block size M_rest.j and elapsed time T_rest.j of the next sub-data set of X_rest;
S6. Judge whether all sub-data sets in X_rest have been processed. If so, obtain the total time T_total consumed in processing the large-scale data set X; if not, repeat step S5.
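Steps S1 through S6 can be sketched end to end as follows. Here `process(subset, block_size)` is a hypothetical stand-in for executing the parallel application on one sub-data set in the Dask cluster and returning its elapsed time; the 2:8 profiling split, the arithmetic progression of profiling block sizes, and the through-the-origin regression model M = θ·T are assumptions of this sketch.

```python
import math

def run_dynamic_blocking(subsets, m_init, d, sigma, process):
    """End-to-end sketch of steps S1-S6 under the stated assumptions."""
    n = max(1, len(subsets) * 2 // 10)            # S1: 2:8 profiling split
    observations = []                             # O_N×2: rows of (M, T)
    for i, subset in enumerate(subsets[:n]):      # S2-S3: profiling phase
        m = m_init + i * d                        # assumed progression of sizes
        observations.append((m, process(subset, m)))
    for subset in subsets[n:]:                    # S4-S5: online phase
        times = [t for _, t in observations]
        t_min = min(times)                        # fastest observation so far
        w = [math.exp((t_min - t) ** 2 / (-2.0 * sigma ** 2)) for t in times]
        theta1 = sum(wk * t * m for wk, (m, t) in zip(w, observations))
        theta2 = sum(wk * t * t for wk, (m, t) in zip(w, observations))
        m = theta1 / theta2 * t_min               # M_rest.j = theta * T_min
        observations.append((m, process(subset, m)))
    return sum(t for _, t in observations)        # S6: T_total

# Synthetic demo: a timing model that is fastest near block size 120
# (purely illustrative, not measured data).
total = run_dynamic_blocking(
    subsets=[list(range(1_000))] * 10,
    m_init=100, d=10, sigma=1.0,
    process=lambda s, m: 1.0 + abs(m - 120) / 1000)
```

Because each estimate reuses every observation gathered so far, the block size adapts online instead of being fixed up front.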
Example 2
This embodiment provides a concrete algorithm for the method described in Embodiment 1; its steps are as follows:
Here, "←" denotes assignment in the algorithm listing, i.e., it corresponds to "=".
Example 3
In this embodiment, static and dynamic computations are performed on a data set according to the algorithm described in Embodiment 2. The hardware environment used is:
the software environment is as follows:
(1) static calculation
A data set of size 8.7 GB is divided into n blocks, and the n blocks are then processed simultaneously by the parallel application.
(2) Dynamic computing
The 8.7 GB data set is divided into m sub-data sets, which are split in order into X_profiling and X_rest at a ratio of 2:8. Each sub-data set in X_profiling is partitioned using the initial block size M_init, the block size variation d, and the Gaussian kernel parameter σ; the number of blocks for each remaining sub-data set in X_rest is selected dynamically by the program. The m sub-data sets are then processed in sequence by the parallel application.
The parameters used for block setting in the dynamic case are as follows:

| | m | M_init | d | σ |
|---|---|---|---|---|
| K-Means | 100 | 10000 | 1000 | 0.26 |
| DBSCAN | 100 | 2000 | 50 | 0.25 |
| SpectralClustering | 100 | 1000 | 50 | 2.00 |
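For context on how a chosen block size is applied in practice, the sketch below splits a synthetic data set into fixed-size row chunks with plain NumPy so it stays self-contained; with Dask proper this corresponds to `da.from_array(x, chunks=(block_size, -1))`.

```python
import numpy as np

# Synthetic 100,000 x 10 data set; `block_size` rows per chunk. With Dask
# this partitioning would be expressed as da.from_array(x, chunks=(2_000, -1)).
x = np.arange(1_000_000, dtype=np.float64).reshape(-1, 10)
block_size = 2_000
chunks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
```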
Static and dynamic computations were run with K-Means, DBSCAN, and SpectralClustering respectively; the resulting run times are shown in the following table:

| | Static | Dynamic |
|---|---|---|
| K-Means | 2103.95 s | 2015.64 s |
| DBSCAN | 4480.41 s | 4217.33 s |
| SpectralClustering | 30746.53 s | 27518.99 s |
As can be seen from the table above and FIG. 3, the method of the invention improves the efficiency of processing large-scale data sets when data are processed in parallel.
It should be understood that the above embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression, characterized by comprising the following steps:
S1. Divide a large-scale data set X into N sub-data sets, and split these in order into a set of sub-data sets used to optimize the block size, X_profiling, and a set of remaining sub-data sets to be processed, X_rest;
S2. Set the block size for each sub-data set in the optimization set X_profiling;
S3. Process the sub-data sets in X_profiling:
Read the i-th sub-data set X_profiling.i from X_profiling and partition X_profiling.i according to the specified block size M_profiling.i; then execute the parallel application in the Dask cluster to process X_profiling.i and obtain the elapsed time T_profiling.i; finally, add M_profiling.i and T_profiling.i to the observation matrix O_N×2, where 1 ≤ i ≤ n;
S4. Process the first sub-data set of X_rest:
S41. Obtain from the observation matrix the minimum elapsed time T_min over all processed sub-data sets, compute the regression coefficient θ of the locally weighted linear regression algorithm from the observation matrix, use T_min and θ to estimate the block size M_rest.1 for X_rest.1, and partition X_rest.1 by M_rest.1;
S42. Execute the parallel application in the Dask cluster to process X_rest.1 and obtain the elapsed time T_rest.1; add M_rest.1 and T_rest.1 to the observation matrix, i.e., replace the two values in row n+1 of the matrix with M_rest.1 and T_rest.1;
S5. Repeat step S4 to compute the block size and elapsed time of the next sub-data set of X_rest and add them to the observation matrix.
2. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein the ratio of the number n of sub-data sets in the block-size-optimization set X_profiling to the number m of sub-data sets in the remaining set X_rest is 2:8.
3. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein the relationship among the large-scale data set X, the block-size-optimization set X_profiling, and the remaining set X_rest is X = X_profiling ∪ X_rest and X_profiling ∩ X_rest = ∅.
4. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein setting the block size in step S2 comprises:
S21. determining an initial block size M_init;
S22. computing in turn, from the initial block size M_init and the block size variation d, the block size for each sub-data set in X_profiling.
7. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein the weight w_k of the locally weighted linear regression algorithm is computed as:
w_k = exp((T_min − O_k,2)² / (−2σ²)), (1 ≤ k ≤ n + j − 1)
where T_min = min(O_1,2, O_2,2, ..., O_n+j−1,2), k is a loop variable, and O_k,2 denotes the value in row k, column 2 of the matrix.
9. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein the block size M_rest.j corresponding to X_rest.j is:
M_rest.j = θ·T_min.
10. The Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression of claim 1, wherein step S6 comprises judging whether all sub-data sets in X_rest have been processed; if so, obtaining the total time T_total consumed in processing the large-scale data set X; if not, repeating step S5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111509160.5A CN114296911A (en) | 2021-12-10 | 2021-12-10 | Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111509160.5A CN114296911A (en) | 2021-12-10 | 2021-12-10 | Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114296911A true CN114296911A (en) | 2022-04-08 |
Family
ID=80968359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111509160.5A Pending CN114296911A (en) | 2021-12-10 | 2021-12-10 | Dynamic data blocking method based on local weighted linear regression and oriented to Dask cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114296911A (en) |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |