WO2021073083A1

WO2021073083A1 - Node load-based dynamic data partitioning system

Info

Publication number: WO2021073083A1
Application number: PCT/CN2020/090554
Authority: WO
Inventors: 孟令伍; 贺成龙; 吴嘉逸; 丁灿; 刘蛰; 李惠柯; 顾学海; 姜吉宁; 陈铮
Original assignee: 南京莱斯网信技术研究院有限公司
Priority date: 2019-10-15
Filing date: 2020-05-15
Publication date: 2021-04-22
Also published as: CN110704542A

Abstract

A node load-based dynamic data partitioning system, comprising load monitoring and collection, prediction, data pre-partitioning, data migration and like modules; a secondary smoothing method is used to predict node load, and AHP and an entropy index weight method are combined so that corresponding partitioning policies may be obtained according to different data analysis applications, thus the load balance of the system may be dynamically adjusted, and the response speed of applications may be improved. For the application scenario of a Spark and Memsql distributed integrated framework, since heterogeneity of node resources exists in a distributed environment, the computing resources of each node are fully used in order to reduce the consumption of data transmission between nodes, and the parallel computing efficiency of application analysis is improved by means of load balance. Therefore, such a node load-based data dynamic partitioning mechanism and policy are proposed so as to improve the load balance of a system and increase the response speed of applications, thus assisting relevant personnel in completing complete decision-making.

Description

A data dynamic partition system based on node load

Technical field

The invention relates to the field of big data distributed computing and storage, in particular to a data dynamic partition system based on node load.

Background technique

The development of big data has directly promoted the development of various distributed computing frameworks. Excellent distributed storage frameworks such as HBASE, HDFS, and MemSql have appeared one after another. However, many storage frameworks have the phenomenon of unbalanced cluster load caused by data skew caused by unreasonable partitioning. In order to improve the real-time performance of cluster data analysis and processing, it is necessary to study the cluster data partition strategy. Data partitioning refers to the distribution of data in a distributed system environment. It is necessary to follow a certain method strategy and adopt a partitioning strategy to store the entire data reasonably on each physical data node in the cluster. Simple data partitioning is easy to do, but to make the system run efficiently and steadily, it is necessary to study and design a corresponding partitioning strategy. An improperly designed data partition strategy can lead to inefficient calculations, high access costs, and heavy network load. In the design of distributed system partitioning strategy, the basic principles of data partitioning should be: improve node load balance, improve the response efficiency of data analysis applications, provide timely decision-making for enterprises, and increase benefits.

Summary of the invention

Objective of the invention: The technical problem to be solved by the present invention is to provide a system of Memsql partition strategy for Spark, which dynamically adjusts the load balance of distributed computing and improves the response speed of data analysis.

Technical Solution: The present invention provides a data dynamic partition system based on node load, which is a system based on a data dynamic partition mechanism and strategy of node load. The system includes a load monitoring module, an acquisition module, a data pre-partitioning module, and a data migration module;

The load monitoring module is used to select load information indicators and monitor the load information indicator values on each node in the distributed cluster in real time;

The collection module is used to periodically collect the load information index value on each node in the distributed cluster;

The data pre-partitioning module is used to predict the load information index value on each node in the distributed cluster, and then obtain the processing capacity of each node according to the index weight method, and finally distribute different amounts of data according to the processing capacity of each node , Complete data pre-partitioning;

The data migration module is used to trigger data migration between nodes to improve load balance when a load imbalance problem occurs in the distributed cluster.

The load monitoring module selects CPU utilization, memory utilization, and bandwidth utilization as load information index values, and monitors the load on each node in the distributed cluster in real time by deploying Memsql (distributed memory database) resource monitoring service Information index value.

The collection module periodically obtains the load information index value on each node in the distributed cluster through the API (program interface) provided by the distributed Yarn resource management component, and saves it in the database.

The data pre-partitioning module is used to predict the load information index value on each node in the distributed cluster, and then obtain the processing capacity of each node according to the AHP (Analytic Hierarchy Process) and entropy subjective and objective index weight integration method, and finally According to the processing capacity of each node, different data volumes are distributed to complete the data pre-partitioning, which specifically includes the following steps:

Step 1. Use the quadratic exponential smoothing method to predict the load information index value:

The formula of the one-time exponential smoothing method is as follows:

The formula of the quadratic exponential smoothing method is as follows:

Combining the first and second exponential smoothing formulas, the load forecast value of the T-th cycle is obtained, and the formula is as follows:

Among them, Y _j is the actual value of the load information index value of the jth cycle,

with

They are the predicted value of the load information index value of the j-1 cycle and the predicted value of the load information index value of the j cycle,

with

Are the second exponential smoothing value of the j-1 period and the second exponential smoothing value of the j period,

Is the predicted value of the load information index value of the j+T cycle; a _j and b _j are intermediate parameters; α is the smoothing coefficient;

The collection module sends the load information index value on each node in the distributed cluster collected in the first n-1 cycles in the database to the data pre-partitioning module, and it is combined with the load information index value on each node in the current cycle. Is the load data of n, take the actual value of the load information index value measured in the first cycle as the initial value Y _j , the initial value of the first prediction and the initial value of the second prediction, and use the obtained n load data to predict the future d cycles The load information index value on each node, calculate the average value P of the load information index value of a node in the future d cycles, and finally determine the load information index value of each node in the cluster;

Step 2. Calculate the processing capacity of each node;

Step 3: Distribute different amounts of data according to the processing capacity of each node.

In step 1, the value of the smoothing coefficient is obtained by calculating the standard deviation S:

Among them, n represents the number of cycles taken, the partial variance S is calculated by adjusting the value of the smoothing coefficient α, and the value of the corresponding smoothing coefficient α is taken when S is the smallest.

Step 2 includes the following steps:

Step 2-1, use the AHP subjective weight method to calculate: In multi-attribute decision-making, the decision maker compares all evaluation indicators pairwise to obtain the judgment matrix U=(A _ij ) _n×n , where A _ij is the evaluation index A _{The value obtained by comparing i} with A _j is an odd number between 1 and 9, that is, when the value is 1, 3, 5, 7, and 9, respectively, it means that the former index is equally important, more important, and very important than the latter index. Important, very important, and extremely important; when the value is an even number between 1 and 9, it means that the importance of the pairwise comparison is between the importance degrees of two adjacent odd numbers, that is, the value is 2. When indicates that the importance of the pairwise comparison is between the importance of two adjacent

odd numbers

1 and 3, and

Comparing CPU utilization, memory utilization and bandwidth utilization in pairs, the judgment matrix A is obtained:

Among them, A ₁ , A ₂ , and A ₃ represent the weight value of the impact of a node's CPU utilization on the overall load of the node, the weight value of the impact of memory utilization on the overall load of the node, and the weight of the impact of bandwidth utilization on the overall load of the node. . Perform normalization operations on each column of the judgment matrix A, obtain the column eigenvectors, and then normalize each row to obtain the row eigenvectors, and finally obtain the weight ratio of each indicator, and compare the judgment matrix A Perform consistency check, and finally get the subjective weights of CPU, memory, and bandwidth of a node as WS ₁ , WS ₂ , WS ₃ , and WS ₁ +WS ₂ +WS ₃ =1;

Step 2-2, calculate the eigenvectors and index weights of the matrix:

Sum the columns of the matrix, the vector of the column sum is: SUM _j ;

To normalize each column of the matrix, the formula is as follows:

The value of ∑A _ij is the sum of each column SUM _j , and B _ij represents the _{normalized data of A ij} . According to B _ij, a new matrix B is obtained. The sum of the values of each column in the B matrix is 1;

Sum each row of matrix B to obtain the eigenvector SUM _i ;

Calculate the index weight and normalize the feature vector, the formula is as follows:

According to the above formula, the three index weights are finally obtained as W ₁ , W ₂ , W ₃ ;

Step 2-3, check the consistency of the matrix:

In order to test whether the index weight is correct, it is necessary to compare the indexes. For example, if A>B, B>C, then A>C must be obtained, otherwise, the consistency is not established. Therefore, it is necessary to check the consistency of the matrix to ensure that the above errors do not occur.

Calculate the largest characteristic root of the matrix, the formula is as follows:

Among them, λ _max is the maximum eigenvalue, AW represents the matrix A and the weight vector W are multiplied to obtain a column vector, n represents the order of the matrix, and W represents the weight vector;

To calculate the constant index of the judgment matrix, the formula is as follows:

Among them, C.I. represents the consistency index, and n represents the order of the matrix;

Calculate the random consistency ratio C.R., the calculation formula is as follows:

Among them, RI stands for the average random consistency index, which is a constant, which can be queried in the scale according to the order; the third-order RI=0.89, if CR<0.1, it means that the comparison matrix is consistent; if CR>0.1, it means comparison The matrix is not consistent and needs to be adjusted;

Steps 2-4, carry out the calculation of the objective weight method by entropy method: the entropy method is a mathematical method that reflects the degree of influence of an indicator on the comprehensive evaluation by judging the dispersion of an indicator, and can objectively pass the variance of the indicator value To determine the weights. The weight of the index is positively correlated with the degree of variability, that is, the greater the degree of variation of the index value, the greater its weight; conversely, the smaller the degree of variation of the index value, the smaller its weight.

Construct load information decision matrix M:

Among them, CUR _n , MUR _n , and BUR _n respectively represent the CPU utilization, memory utilization, and bandwidth utilization predicted in the nth cycle of a node;

Standardize each column of the decision matrix M to obtain the decision matrix R:

among them

R _i1 represents the element in the ith row and the first column of the decision matrix R, and each column of the decision matrix R satisfies the normalization, namely

That is, the sum of each column value is 1, j = 1, 2, 3;

Calculate the entropy of the load information index according to the following formula:

E _j represents the entropy value of the load information index, and the constant K = 1/ln(n), then 0≤E _j ≤1, that is, the _{maximum E j} is 1, and when j is 1, E _j represents the entropy value of the CPU utilization; When j is 2, E _j represents the entropy value of memory utilization; when j is 3, E _j represents the entropy value of bandwidth utilization;

Define D _j as the contribution degree of the j-th load information index E _j _{: D j} =1-E _j ;

Step 2-5, calculate the objective weight value WO _{j of} each load information index:

WO ₁ , WO ₂ , WO ₃ respectively represent the objective weight value of CPU's impact on node load, the objective weight value of memory on node load, and the objective weight of bandwidth on node load, and WO ₁ +WO ₂ +WO ₃ = 1;

_{Step 2-6, calculate the weight w i} of the final load information index of the node:

w _i =β×WS _i +(1-β)×WO _i , (1-12)

Where β is the subjective and objective weight adjustment coefficient, w _i is the weight of the final node load, where i = 1, 2, 3, and w ₁ + w ₂ + w ₃ = 1, w ₁ represents the final CPU utilization weight, w ₂ represents the weight of the final memory utilization, w ₃ represents the weight of the final bandwidth utilization;

Step 3-4, compute the processing capacity of the node:

CA _i = w ₁ ×(1-CAU _i )+w ₂ ×(1-MAU _i )+w ₃ ×(1-BAU _i ), (1-13)

Among them, CAU _i , MAU _i , and BAU _i respectively represent the predicted CPU utilization, memory utilization, and bandwidth utilization of the i-th node in the current cycle, and CA _i represents the processing capacity of the i-th node.

Step 3 includes:

Calculate the proportion of the amount of data to be distributed by each node:

Among them, DP _i represents the proportion of the amount of data that should be allocated to the i-th node, and m represents the total number of nodes.

The data migration module constructs a selection queue of source and target machines by setting high and low load thresholds as conditions for triggering data migration. When a load imbalance problem occurs, select the source and target machines for data migration, The source machine serves as the node of the data to be migrated, and the target machine serves as the node that accepts the data to be migrated, and obtains the amount of data to be migrated.

The data migration module constructs a selection queue of source and target machines by setting high and low load thresholds as conditions for triggering data migration. When a load imbalance problem occurs, select the source and target machines for data migration, The source machine acts as the node of the data to be migrated, and the target machine acts as the node that accepts the data to be migrated, and obtains the amount of data that should be migrated, including the following steps:

Step a1, select the source machine:

Calculate the overall load value of each node:

Load _i = w ₁ ×CUR _i +w ₂ ×MUR _i +w ₃ ×BUR _i , (1-15)

Wherein, Load _i represents the overall load value of the i-th node. The overall load value of each node is compared with the set threshold H _th . If the overall load value of a node exceeds the threshold H _th , the node is added to high load queue node according to the descending integral source load value selection unit queues _{_{S y = {s 1, s}} 2, ......, s m}, s m S _y represents a queue of the m-th node, i.e. a whole The node with the smallest load value;

For _{each node in the Sy} queue, select the source machine according to the overall load value in descending order;

Step a2, select the target machine: compare the overall load value of each node with the set threshold L _th , if the load value of a node is lower than the threshold L _th , add the node to the low-load node queue, according to The overall load value from small to large constitutes the target machine selection queue D _m ={d ₁ , d ₂ ,..., d _z }, d _z represents the z-th node in the queue D _m , that is, the node with the largest overall load value;

For _{each node in the D m} queue, select the target machine in the order of the overall load value from small to large;

Step a3, perform data migration:

If the number of nodes in the high-load and low-load queues is the same, that is, m=z, the nodes in the high-load and low-load queues will be matched and migrated in parallel in sequence. The formula for the number of migrated partitions is as follows:

Where N _q represents the number of partitions to be migrated, N _y represents the number of partitions in the source machine, and N _m represents the number of partitions in the target machine;

If the number of high-load queue nodes is greater than the number of low-load nodes, that is, _Sy > D _m , adjust the low-load threshold appropriately so that the number of nodes in the low-load node queue is equal to or nearly greater than the number of nodes in the high-load node queue, and then follow Formula 1-16 sets the number of partitions to be migrated. In order to reduce unnecessary transmission of data between nodes and achieve load balancing, the low load threshold needs to be adjusted. For example, when the high load node is greater than 0.9, the number of nodes is 20, and the low load is less than There are only 10 nodes with 0.2. In this scenario, the low load threshold needs to be about 0.35, so that the high load nodes can reduce the load pressure as much as possible, and at the same time, they can perform migration in parallel and perform high and low load one-to-one node matching data transmission. ；

If the number of high-load queue nodes is much smaller than the number of low-load nodes, that is, _Sy <D _m , lower the high-load threshold appropriately. For example, the high-load threshold was set to 0.9 before, which can be appropriately reduced to about 0.75 to make the high-load node queue The number of nodes in is equal to or nearly less than the number of nodes in the low-load node queue, and then the number of migrated partitions is set according to formula 1-16;

After obtaining the number of partitions that the source machine should migrate, the data can be migrated.

The system of the present invention involves the following core contents:

(1) Load monitoring module

Resource monitoring can be performed by deploying cluster servers. The main monitoring indicators are the utilization of CPU, memory and bandwidth. The real-time cluster resource monitoring interface is used to pave the way for the acquisition module; it can be determined by combining load prediction and index weight determination methods. Whether it is a high or low load node, pave the way for the data migration module.

(2) Acquisition module

1) Selection of load information indicators

There are many key resources in the node that can describe the load of the node, such as CPU utilization, CPU context switching rate, free hard disk size, memory utilization, bandwidth utilization, and I/O resources. The load-based partition strategy of the present invention mainly uses CPU, memory, and bandwidth utilization to represent the overall load value of the node.

2) Collection cycle

The collection module collects the load information of all nodes at regular intervals. If the collection cycle time is too short, it will increase the load of the central node and consume a certain amount of bandwidth, which will affect the performance of the distributed system. If the collection cycle time is too long, outdated data will be used. , Does not have real-time effects, and may cause wrong partitioning decisions during data partitioning. At the same time, when encountering emergencies, nodes that need to be balanced are not processed in time, but nodes that do not need to be balanced are processed. In order to collect the load information of nodes in a timely and more accurate manner, the method of deploying cluster resource monitoring can be used to collect resource information and save it in the cache array, and save the historical resource information to the database persistently. At present, the time interval used in most papers is between 5s and 15s, which can be collected according to the period set by the user.

(3) Data pre-partitioning module

1) Load forecast

Use the prediction module to predict the load situation of the node in the future, so as to determine the distribution of the amount of data. After the research of relevant personnel, it is concluded that the change of host load has self-similarity and long-term dependence. For the load with such characteristics, the prediction mechanism can be used to predict the load situation of the real overall trend of the node at the time of data distribution, thereby Partition the data more effectively to prevent the occurrence of incorrect data partitioning decisions.

2) Judgment method of index weight

The present invention selects CPU utilization rate CUR, memory utilization rate MUR, and bandwidth utilization rate BUR to judge the load size of the node. Because there are CPU-intensive, memory-intensive, transmission-intensive and mixed types in Spark-MemSql applications, it is aimed at different In application scenarios, the weight of each indicator is likely to be different, so the weight ratio of each indicator is required. There is a problem with this load model formula. The greater the weight given to which indicator, the more the indicator affects the total value of the load. For example, the CUR, MUR, and BUR of the two nodes in the cluster are respectively <0.9,0.2,0.2> and <0.4,0.6,0.5>. Obviously, the CPU load of the previous node is very large, and it has reached the bottleneck, and the load of the latter node is relatively even. According to common sense, load balancing should be performed on the previous nodes first, and data should not be distributed as much as possible or part of the data should be migrated to other nodes with lower load. For example, the weight distribution of the three indicators is w ₁ ＝0.1, w ₂ ＝0.5, and w ₃ ＝0.4. According to the formula, the load of the first node is 0.27, and the load of the second node is 0.54. By comparing the load The value will select the following nodes for load balancing, which can confirm that the difference in the weight value will affect the comprehensive judgment result of the load. Therefore, it is necessary to adopt a reasonable index judgment method to determine the weight of each index, so as to determine the overall load of each node.

3) Data distribution of nodes

Due to the large amount of data analyzed, in order to reduce network transmission, the method of localized reading data analysis is adopted. In the data pre-partitioning phase, MemSql's default partitioning method is that the number of partitions for each node is the same. This will cause data skew between nodes due to the heterogeneity of cluster nodes resulting in different processing capabilities of the nodes, leading to unbalanced cluster load In order to use localized resources as much as possible, improve the parallel computing efficiency of distributed systems, and reduce network transmission, it is necessary to consider the overall load of each node, that is, to effectively partition the data according to the ability of each node to process tasks.

(4) Data migration module

Because some applications are executed or unreasonable pre-partitioning conditions lead to unbalanced cluster load, the following problems need to be solved: 1) Under what conditions data migration is required to achieve load balance, that is, the trigger condition of data migration; 2) Which node to migrate The above data refers to the selection of the source machine; 3) The node to which the data is migrated, namely the selection of the target machine; 4) The amount of data to be migrated, that is, the number of partitions to be migrated.

Beneficial effects: The present invention provides a subjective and objective weight integration method that can subtly quantify the computing power of each node in the cluster, fully utilize the computing resources of each node, and improve the overall response of big data analysis applications Speed; when the cluster has a load imbalance problem, the present invention provides a dynamic load balancing strategy, which can more flexibly ensure the stability of the resource utilization of the distributed cluster; in relatively independent memory and iterative parallel execution applications, such as Association analysis, clustering, neural network and other machine learning algorithms have been widely used in the cluster analysis of event comments in the company’s public opinion analysis system and related person association analysis modules, which ultimately speeds up the response speed of the application.

Description of the drawings

In the following, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments, and the advantages of the present invention and other aspects will become clearer.

Figure 1 is a flow chart of the dynamic data partition mechanism based on node load;

Figure 2 is a diagram of the cluster resource monitoring interface;

Figure 3 is a flow chart of the prediction mechanism;

Figure 4 is a flowchart of calculating index weights by AHP analytic hierarchy process;

Figure 5 is a flowchart of calculating index weights by entropy method;

Figure 6 is the integration diagram of Spark and MemSql.

Figure 7 is a comparison diagram of the CPU utilization prediction of the correlation analysis application.

Figure 8 is a comparison diagram of the CPU utilization prediction of the Kmeans clustering analysis application.

Figure 9 is a performance comparison chart of the correlation analysis application pre-partitioning strategy.

Figure 10 is a comparison diagram of the performance of the Kmeans clustering analysis application pre-partitioning strategy.

Figure 11 is a comparison diagram of node load utilization of different pre-partitioning strategies applied by correlation analysis.

Figure 12 is a comparison diagram of node load utilization of different pre-partitioning strategies applied by Kmeans clustering analysis.

Figure 13 is a performance comparison diagram of the correlation analysis migration strategy.

Figure 14 is a comparison chart of Kmean clustering analysis migration strategy performance.

Figure 15 is a comparison diagram of the average node load utilization before and after the correlation analysis data is migrated.

Figure 16 is a comparison chart of average node load utilization before and after data migration of Kmeans cluster analysis.

Detailed ways

In order to achieve the dynamic load balance of the system and improve the response speed of the application, a data dynamic partition system based on node load is proposed. As shown in Figure 1, the system includes a load monitoring module, a collection module, a data pre-partitioning module and data Migrate the module. The entire Spark-MemSql integrated cluster has been in application use. The master node in the load monitoring module regularly reads the load information of each indicator in the slave node, and dynamically displays the utilization of CPU, memory, and bandwidth in the monitoring interface; then through The collection module saves the load information in the cache array, and periodically persists it in the Mysql database to provide indicator load information for load prediction; then when a large amount of new data is imported, the prediction module in the data pre-partitioning module is required for each Each indicator of the node is predicted, and then the weight of each indicator is obtained through the indicator weight determination method. The processing capacity of each node is obtained according to the indicator information after the load prediction and the weight value of each indicator, and then the processing capacity of each node is obtained according to the Processing capacity is used to distribute data and complete data pre-partitioning; if the cluster has load imbalance during the application process and reaches the set load threshold, add high and low load nodes to the source and target machine queues, and proceed according to the migration strategy Migration in blocks. After the migration, if you encounter load imbalance again, the above process is also used for dynamic migration of blocks.

(1) Monitoring module

Resource monitoring can be performed by deploying cluster servers. As shown in Figure 2, the main monitoring indicators are the utilization of CPU, memory, and bandwidth. The real-time cluster resource monitoring interface is used to pave the way for the acquisition module; combined with load forecasting and The index weight determination method can determine whether it is a high or low load node, paving the way for the data migration module.

(2) Acquisition module

1) Selection of load information indicators

2) Collection cycle

The collection module collects the load information of all nodes at regular intervals. If the collection cycle time is too short, it will increase the load of the central node and consume a certain amount of bandwidth, which will affect the performance of the distributed system. If the collection cycle time is too long, outdated data will be used. , Does not have real-time effects, and may cause wrong partitioning decisions during data partitioning. At the same time, when encountering emergencies, nodes that need to be balanced are not processed in time, but nodes that do not need to be balanced are processed. In order to collect node load information in a timely and more accurate manner, the API provided by the Yarn resource management component can be used to collect resource information and save it in the cache array, and save historical resource information to the database persistently. At present, the time interval used in most papers is between 5s and 15s, which can be collected according to the period set by the user.

(3) Data pre-partitioning module

1) Load forecast

The traditional data distribution strategy is only based on the current real-time load information of the node as the basis for judging the data partition. Suppose there is an instantaneous high and low peak in the node load, and then it returns to normal. If the traditional data partitioning strategy is used, this peak value will inevitably affect the final data distribution decision making, which can easily lead to unbalanced data distribution and cause unnecessary system overhead. Therefore, it is necessary to prevent incorrect data pre-partitioning decisions caused by instantaneous load peaks. The situation happened. If the data has been allocated, but some unexpected situations occur, such as deleting nodes due to node downtime, adding nodes for horizontal expansion, and extremely unbalanced load problems, etc., it is necessary to perform block migration to balance the load. Need to refer to the load prediction module to determine the amount of migration.

1. Quadratic exponential smoothing method

The second exponential smoothing method is a method of exponential smoothing on the basis of the first exponential smoothing method. It cannot be predicted separately. The mathematical model established by combining with the first exponential smoothing method can use this model to determine the predicted value at the next moment. At present, most forecasting models choose the quadratic exponential smoothing method. Because the one-time exponential smoothing method and the average load method are more suitable for a time series analysis of a horizontal development trend, if the actual value is rising or falling, the deviation between the predicted value and the actual value will be relatively large, and there will be an obvious lag. The analysis application for the Spark-MemSql integrated framework will produce load rise or fall, and the quadratic exponential smoothing method can better solve the problem in this application scenario. It can use the law of lag deviation to find out the value change development trend. Therefore, the present invention adopts a quadratic exponential smoothing method model for load prediction.

The formula of the one-time exponential smoothing method is as follows:

The formula of the quadratic exponential smoothing method is as follows:

Combining the first and second exponential smoothing formulas, the load forecast value of the T-th period can be obtained, the formula is as follows:

Among them, Y _j is the actual value of the j-th period,

with

Are the predicted value of the j-1 cycle and the predicted value of the j cycle,

with

Is the predicted value of the j+T period; a _j and b _j are intermediate parameters; α is the smoothing coefficient, α∈[0,1]. The predicted value is greatly affected by the smoothing coefficient α. The smaller the value of α, the greater the influence of historical data; the larger the value of α, the greater the influence of recent data. Generally speaking, for small data fluctuations, it is necessary to reduce the impact of recent data on the prediction results, and the α value should be smaller; for large data fluctuations, the impact of recent data on the prediction results needs to be increased, and the α value should be Take the bigger one.

Generally, when the data fluctuates little, a smaller value should be selected for α, such as 0.05-0.15; when the data fluctuates but the long-term fluctuation is not large, α should be selected a slightly larger value, such as 0.1-0.5; the data fluctuates greatly and long-term Also large, a larger value should be selected for α, such as 0.6-0.8; when the data is obviously rising or falling, a larger value should be selected for α, such as 0.6-1.

2. Save the historical load information to the Mysql database through the acquisition module. When the analyzed data is pre-partitioned to the MemSql cluster, the data collected in the first n-1 cycles of all nodes in the cluster must be sent to the prediction module as the load data parameter, and the current load forms a load data set of size n. , Take the actual value of the first measurement as the initial value Y _j , the initial value of the first prediction and the initial value of the second prediction. Use these n period data to predict the node load value in the future d periods, and then take the average value of the node load in the future d periods, and finally determine the load information of each node in the cluster, and provide the data partition strategy module with the future load information of the cluster nodes , So as to provide decision-making basis for the data partition strategy module. In the same way, if an unexpected situation causes cluster load imbalance, the average value of the overall load of the d period nodes is compared with the threshold, and if the average value is greater than the threshold, the data migration operation is triggered. In this strategy, if the average load of a node in the future d cycles is higher than the high threshold or lower than the low threshold, the high and low load queues are updated. The smoothing coefficient is based on the standard deviation S of the j predicted data and the real data, and the corresponding value of the smoothing coefficient α when the S value is the smallest is the final smoothing coefficient standard. The formula for standard deviation S is as follows:

Among them, n represents the number of cycles taken, and j represents the j-th cycle. The flow of the prediction mechanism is shown in Figure 3. The partial variance S is calculated by adjusting the value of the smoothing coefficient α, and the value of the corresponding smoothing coefficient α is taken when S is the smallest. The values of n and d are set by the user.

2) Judgment method of index weight

Because of the application scenarios in the Spark-MemSql integrated framework environment, the CPU and bandwidth fluctuate greatly, and the memory fluctuates less. If only the subjective AHP weighting method is considered, the importance of certain indicators will be ignored, and only the objective entropy method will be affected. The weight of memory is judged. Therefore, the present invention calculates the overall load value of each node through the index weight determination method based on the combination of the secondary smooth load prediction method + subjective and objective AHP and the entropy index weight integration method, and finally allocates the corresponding data amount according to the overall load value .

1. AHP

The main idea of the AHP subjective weight method: In multi-attribute decision-making, the decision maker compares all evaluation indicators pairwise to obtain the judgment matrix U=(A _ij ) _n×n , where A _ij is the comparison of the evaluation index A _i and A _j The value obtained is an odd number between 1 and 9, indicating that the former index is equally important, more important, very important, very important, and extremely important than the latter index; when the value is an even number between 1 and 9 When, respectively, the importance of the pairwise comparison is between the importance degrees of two adjacent odd numbers, and

The process of AHP subjective calculation index weight method is shown in Figure 4.

1) Compare the CPU utilization rate, memory utilization rate and bandwidth utilization rate in pairs to obtain the judgment matrix A:

Among them, A ₁ , A ₂ , and A ₃ represent the weight value of the impact of a node's CPU utilization on the overall load of the node, the weight value of the impact of memory utilization on the overall load of the node, and the weight of the impact of bandwidth utilization on the overall load of the node. , Perform normalization operation on each column of the judgment matrix A, obtain the column eigenvectors, and then normalize each row, obtain the row eigenvectors, and finally obtain the weight ratio of each indicator, and compare the judgment matrix A performs a consistency check, and finally obtains the subjective weights of a node's CPU, memory, and bandwidth as WS ₁ , WS ₂ , WS ₃ , and WS ₁ +WS ₂ +WS ₃ =1;

2) Calculate the eigenvectors and index weights of the matrix

① Sum the columns of the matrix, the vector of the column sum is: SUM _j .

② Normalize each column, the formula is as follows:

The value of ΣA _ij _{is the sum of each column SUM j} , and a new matrix B is obtained. The sum of the values of each column in the B matrix is 1.

③ Sum up each row to get the feature vector SUM _i .

④ Calculate the index weight and normalize the feature vector, the formula is as follows:

The index weights of the three query modes are W ₁ , W ₂ , W ₃ .

3) Matrix consistency test

①Calculate the largest eigenvalue of the matrix, the formula is as follows:

Among them, λ _max is the maximum characteristic root, AW represents the matrix A and the weight vector W are multiplied to obtain a column vector, n represents the order of the matrix, and W represents the weight vector.

②Calculate the constant index of the judgment matrix, the formula is as follows:

Among them, C.I. represents the consistency index, and n represents the order of the matrix.

③Calculate the random consistency ratio, the calculation formula is as follows:

Among them, R.I. represents the average random consistency index, which is a constant, which can be queried in the scale according to the order. The fourth-order R.I.=0.89, if C.R.<0.1, it means that the contrast matrix remains consistent. If C.R.>0.1, it means that the contrast matrix is not consistent and needs to be adjusted.

2. Entropy method

Main idea: Entropy method is a mathematical method that reflects the degree of influence of an index on comprehensive evaluation by judging the dispersion of an index, and can objectively determine the weight through the degree of variation of the index value. The weight of the index is positively correlated with the degree of variability, that is, the greater the degree of variation of the index value, the greater its weight; conversely, the smaller the degree of variation of the index value, the smaller its weight. The process of calculating the index weight by the entropy method is shown in Figure 5.

Specific steps are as follows:

(1) Construct load information decision matrix M:

Among them, n represents the number of cycles, and CUR, MUR, and BUR represent the utilization of CPU, memory, and bandwidth, respectively.

(2) Standardize each column of the decision matrix M to obtain the decision R:

among them

Each column of the matrix R satisfies the normalization, namely

j = 1, 2, 3, that is, the sum of the values in each column is 1.

(3) Use the entropy formula to calculate the uncertainty of the index:

Use E to represent the entropy of any load information index, and the formula is as follows:

E _j represents the entropy value of the index, and the constant K=1/ln(n), so that 0≤E≤1 can be guaranteed, that is, the maximum E is 1.

It can be seen from the formula that when the contribution degree of each value under a certain attribute tends to be the same, E tends to 1; for example, when they are all equal, the role of the attribute of the target in decision-making can be ignored, that is, this The weight of the time attribute is 0. In this way, it can be seen that the attribute value is affected by the difference in the value of a certain attribute column. For this reason, D _j can be defined as the contribution degree of a certain index, D _j =1-E _j .

(4) Calculate the objective weight value of each indicator, the formula is as follows:

WO ₁ , WO ₂ , WO ₃ respectively represent the objective weight value of CPU's impact on node load, the objective weight value of memory on node load, and the objective weight of bandwidth on node load, and WO ₁ +WO ₂ +WO ₃ = 1. Calculate the objective weight value of each indicator, the algorithm inputs the matrix of load values of each indicator in different periods, and calculates the objective weight value of each indicator through the entropy method.

3. The weight integration method of subjective and objective AHP and entropy method

For real application situations, there may be disadvantages of subjective and objective indicator weight design, that is, a certain indicator occupies a large proportion in objective applications, but the subjective person is not clear; another example is that the utilization rate of certain indicators has been high. When the load is stable, it may be in a stable state or not in a stable state in use all the time. According to the objective method, it is easy to calculate that the weight of this indicator is relatively small and deviates from the subjective reality. Therefore, the present invention designs a subjective and objective integrated method to solve such problems and balance the weight deviation between the two. The integrated weight formula is as follows:

w _i =β×WS _i +(1-β)×WO _i , (1-12)

Where β is the subjective and objective weight adjustment coefficient, w _i is the weight of the final node load, where i=1, 2, 3, and w ₁ + w ₂ + w ₃ =1.

Node data distribution: First, the subjective and objective integration weights of the three indicators of CPU, memory, and bandwidth in the load are obtained from the previous module, which are w ₁ , w ₂ , and w _{3 respectively} .

Then, the processing capacity of each node is obtained by the weight of each indicator, the formula is as follows:

CA _i = w ₁ ×(1-CAU _i )+w ₂ ×(1-MAU _i )+w ₃ ×(1-BAU _i ), (1-13)

Among them, CAU _i , MAU _i , and BAU _i represent the predicted utilization of CPU, memory, and bandwidth respectively, and i represents the i-th node.

Finally, get the proportion of the amount of data to be allocated by each node, the formula is as follows:

After the above steps, we can know the amount of data allocated to each node in the cluster, that is, the corresponding number of partitions.

(4) Data migration module

By setting high and low load thresholds as a condition for triggering data migration, a selection queue of source and target machines is constructed. After the data is pre-partitioned, load imbalance or addition or deletion of nodes occurs, you need to select the source and target machines for data migration. The source machine is the node of the data to be migrated, and the target machine is the node that accepts the data to be migrated. The number of partitions.

1) Source machine selection

First, read the CPU utilization, memory utilization, and bandwidth utilization load information from the load cache array to predict, and predict the average load value of each indicator after T cycles.

Then, the load utilization prediction value of each indicator is combined with the subjective and objective weight integration method to obtain the load weight value of each indicator, and then the overall load value Load _{i of} each node is obtained. The load value formula is as follows:

Load _i = w ₁ ×CUR _i +w ₂ ×MUR _i +w ₃ ×BUR _i , (1-15)

Among them, CUR _i , MUR _i , BUR _i and w ₁ , w ₂ , w ₃ are the predicted CPU utilization, memory utilization, bandwidth utilization and weight values, respectively.

Then, the load value Load _{i of} each node is compared with the set threshold, and if the load value of a certain node exceeds the H _th threshold, the node is added to the high-load node queue.

_{Then, the source machine selection queue Sy} = {s ₁ , s ₂ ,..., s _m } is formed according to the overall load value from large to small.

Finally, select the source machine _{from the Sy queue.} _{The load value of each node in the Sy} queue is sorted in descending order, and the source machine is selected in descending order of the overall load value.

2) Target machine selection

First, read the CPU utilization, memory utilization, and bandwidth utilization load information from the load cache array to predict, and respectively predict the average load value of each indicator after T cycles.

Then, combine the load utilization prediction value of each indicator with the subjective and objective weight integration method to obtain the load weight value of each indicator, and substitute it into formula 1-15 to calculate, and then obtain the overall load value Load _{i of} each node.

Then, the load value Load _{i of} each node is compared with the set threshold, and if the load value of a certain node is lower than the L _th threshold, the node is added to the low-load node queue.

Then, the target machine selection queue D _m ={d ₁ , d ₂ ,..., d _z } is _{formed according to the value of Load i from small to large.}

Finally, select the target machine from the D _{m queue.} Arrange the Load values in the D _m queue in ascending order, and select the target machine in the descending order of _{Load i.}

3) Number of migrated partitions

1. If the number of high and low load queue nodes is the same, that is, _Sy = D _m . Then the nodes in the high and low load queues are matched and migrated in sequence in parallel, and the formula for the number of migrated partitions is as follows:

Where N _q represents the number of partitions to be migrated, N _y represents the number of partitions in the source machine, and N _m represents the number of partitions in the target machine.

2. If the number of high-load queue nodes is greater than the number of low-load nodes, that is, _Sy > D _m . Then adjust the low-load threshold appropriately so that the number of nodes in the low-load node queue is equal to or nearly greater than the number of nodes in the high-load node queue, and then the number of migrated partitions is set according to Formula 1-16.

3. If the number of high-load queue nodes is much smaller than the number of low-load nodes, that is, _Sy <D _m . Then adjust the high load threshold appropriately so that the number of nodes in the queue of high load nodes is equal to or nearly less than the number of nodes in the queue of low load nodes, and then the number of migration partitions is set according to formula 1-16.

4. After obtaining the matched source and target machines, and knowing the number of partitions that the source machine should migrate in each group, the migration can be performed in parallel to reduce the migration overhead.

Through the above steps, the system can achieve load balancing. For unexpected situations where nodes are added or deleted, this migration strategy can also be adopted.

The distributed memory database MemSql adopts a master-slave structure, uses Hash as a storage method, and uses a data partition as the smallest storage unit block. Spark also uses a master-slave structure. The Master node (master node) manages the resources of the entire cluster, and the Worker node (slave node) manages the resources of each computing node, regularly reports the node resource status to the Master node, and starts the Executor to perform calculations.

Currently, Spark and MemSql have two combined application scenarios: one is Spark and MemSql, which are two relatively independent frameworks, and the other is the integration framework of Spark and MemSql.

For the application scenario under the Spark-MemSql integration framework, as shown in Figure 6, the method of localized data reading and analysis is adopted, and the two are integrated through the MemSql Spark Connector component. The component is started in the background as a daemon process to connect the Master and Spark in Spark. The main aggregator in MemSql is connected, and then the Worker node of Spark can obtain the metadata information of the main aggregator in MemSql through the Master node. The metadata includes which nodes the data exists and which partitions on the nodes, thereby ensuring the actual program In the process of data analysis, the Worker node of Spark uses the MemSqlRDD interface to localize and perform data reading and writing, calculation and analysis from the MemSql storage Leaf node in parallel. The smallest storage granularity in MemSql is Partition. At present, each node is assigned the same number of Partitions by default. This will cause data skew between nodes due to the heterogeneity of cluster nodes resulting in different processing capabilities of nodes. Since Spark in this framework adopts the method of localized data analysis, that is, on which node the data is located, it is analyzed and processed on the corresponding node. The number of Partitions in MemSql directly reflects the number of RDD tasks in Spark, that is, the number of tasks and the number of partitions are positive. Related relationship, if the default partitioning method is adopted, it will cause serious load imbalance. For example, if there are many sub-blocks in a high-load data node that need to be processed and analyzed, then the execution time of the entire job will become longer because of the execution of Spark job scheduling. The time is up to the time when all jobs are completed. In real applications, the problem of data skew is widespread, and the unbalanced load of processing nodes caused by it is an inevitable problem in the application of the Spark-MemSql framework.

Therefore, in the application scenario of the parallel computing framework Spark, it is necessary to propose a MemSql partition strategy to improve load balance and increase the response speed of the application.

Examples:

For the Spark-MemSql integration framework, the Spark-MemSql integrated cluster environment is deployed under the local area network. There are 5 nodes in the experiment, and the total number of partitions is set to 32 partitions. Using a data set in a manufacturing company, the verification is based on load forecasting and AHP The effectiveness of the data dynamic partition strategy combined with the entropy integrated weight method.

In this embodiment, the FIS_PRODUCT table in a certain manufacturing company is used as the test data set. As shown in Table 1, there are approximately more than 50 million rows of data. Each piece of data includes time ID, plant category, product category, product length, product stretch length, product weight, etc. Among them, the LENGTH and WEIGHT columns are used as the data set for the correlation analysis application test. The LENGTH, DRAWLENGTH and WEIGHT columns can be used as the data set for the Kmeans application test. Different applications use different data sets for testing.

Table 1

(1) Test and verify the prediction module. Run related applications to imitate the actual application environment under the Spark-MemSql integration framework, predict the load utilization rate in the application environment, every 5s is a cycle, and then calculate the deviation between the forecast and the actual value to adjust the smoothing coefficient. Pave the way for the comparison test of the partition strategy, and verify the effectiveness of the prediction algorithm in this application scenario. The test process of the prediction module is: read the collected historical load information, then use the secondary smoothing prediction algorithm to predict the load, calculate the partial variance S between the predicted value and the true value, and reduce the partial variance S by adjusting the smoothing coefficient α. Use the same method to adjust the smoothing coefficient for different application scenarios.

(2) Perform performance comparison tests on different pre-partitioning strategies. The application in this experiment is to perform correlation analysis on the two columns of LENGTH and WEIGHT. The attributes of these two columns represent the length of the product and the corresponding weight respectively, and the correlation between the length and weight of the product is analyzed; for the three columns of LENGTH, DRAWLENGTH, and WEIGHT Do Kmeans cluster analysis. These three columns of attributes represent product length, product stretch length, and product weight. Product classification is performed through cluster analysis. By comparing the default pre-partitioning strategy, load forecasting + AHP weighting method, load forecasting + entropy weighting method pre-partitioning strategy and load forecasting + AHP and entropy integrated weighting method of four different pre-partitioning strategies, and then statistically execute the same application. Time to verify the effectiveness of the program.

(3) If cluster load imbalance occurs in the Spark-MemSql framework, use the migration strategy to migrate the data in blocks between the source machine and the target machine, run the same application, compare the performance before and after the migration, and verify the plan. Effectiveness.

Implementation step 1: Load prediction algorithm. Test different applications separately, collect and predict the load of a certain node, verify the effectiveness of this prediction algorithm in different application scenarios, and obtain the smoothing coefficient α of different load indicators in different application scenarios. As shown in Figure 7 and Figure 8, there are fluctuations in the CPU utilization of the two different applications. The quadratic smoothing index method can more accurately predict the CPU utilization and avoid the impact of instantaneous peaks. The same method is used to predict and compare different indicators, and finally the smoothing coefficient α of different indicators in different application scenarios is obtained, as shown in Tables 2 and 3.

Table 2

指标index	CPUCPU	内存RAM	带宽bandwidth
平滑系数αSmoothing coefficient α	0.70.7	0.350.35	0.750.75

table 3

指标index	CPUCPU	内存RAM	带宽bandwidth
平滑系数αSmoothing coefficient α	0.750.75	0.400.40	0.650.65

Implementation step two: pre-partitioning strategy. Divided into two groups of experiments through different pre-partitioning strategies. Each group of experiments runs the same application. The first group of experiments is used for the application of association analysis; the second group of experiments is used for the application of Kmeans cluster analysis. Compare the execution time of applications under different partitioning strategies to verify the effectiveness of the scheme.

(1) Use AHP to get the weight of each indicator

First, enter the index decision matrix A:

The evaluation uses column and row comparisons in pairs, where A ₁ , A ₂ , and A ₃ represent CPU, memory, and bandwidth respectively; then, calculate the random consistency ratio CR=CI/RI=0.00103<0.1, indicating that the comparison matrix maintains consistency , The design of the decision matrix is reasonable; then, the AHP is used to obtain the weight value of each indicator; then, the utilization rate of each indicator is periodically collected during the application process, and then the entropy method is used to obtain the weight value of each indicator; In this experiment, the weight coefficient β was adjusted to 0.8, and the integrated weight value was obtained. The results of different application scenarios are shown in Tables 4 and 5, respectively.

Table 4

指标index	CPUCPU	内存RAM	带宽bandwidth
AHP权重值AHP weight value	61.523％61.523%	31.872％31.872%	6.604％6.604%
熵值法权重值Entropy method weight value	38.231％38.231%	19.076％19.076%	42.693％42.693%
AHP+熵值法权重值AHP+entropy method weight value	57.762％57.762%	29.62％29.62%	13.518％13.518%

table 5

指标index	CPUCPU	内存RAM	带宽bandwidth
AHP权重值AHP weight value	61.523％61.523%	31.872％31.872%	6.604％6.604%
熵值法权重值Entropy method weight value	43.231％43.231%	24.076％24.076%	32.693％32.693%
AHP+熵值法权重值AHP+entropy method weight value	58.762％58.762%	30.52％30.52%	11.518％11.518%

(2) According to the predicted load value of each index and different weighting methods in the specific application, combined with formula 1-13 to obtain the processing capacity of each node under different partitioning strategies, the number of partitions for each node is obtained by combining formula 1-14 Percentage, you can get the number of partitions of each node, as shown in Table 6:

Table 6

As shown in Figure 9 and Figure 10, the association analysis and Kmeans clustering application are performed respectively. On the whole, the default partition strategy has the worst effect. The partition strategy of the prediction + AHP and entropy weight integration method designed in this paper has the best effect, and the effect is more significant as the amount of data increases. The AHP weighting method is a subjective weighting method, which does not match the weights according to the actual application scenarios, which is unobjective; the entropy weighting method is obtained by using the difference of the index value, and the memory utilization rate changes slowly, but it has been frequently used. , The data calculation of the Spark-MemSql framework is carried out in the memory, so the memory has been used relatively steadily, and the bandwidth utilization rate varies greatly, but the utilization rate is very low. If only the objective method is used, the memory weight will be small, Error results with larger bandwidth weights. Therefore, integrating subjective and objective weights will bring better results. Executing different applications has achieved the same effect, which shows that the pre-partitioning strategy studied in this paper has generalization in the application of processing relatively independent tasks.

As shown in Figures 11 and 12, the same application is executed for different pre-partitioning strategies, and the overall average load utilization rate of each node in the entire application process is calculated. On the whole, the default partitioning strategy has serious load imbalance. The pre-partitioning strategy combining prediction + AHP, prediction + entropy method, prediction + AHP and entropy weight integration method can solve the cluster load problem well. The balance of cluster load.

Implementation step three: migration strategy. When encountering load imbalance in the Spark-MemSql framework, through the data migration strategy, and then run the same application, periodically record the load status of different nodes through the monitoring interface, compare the execution time of the application before and after the migration, and consider The time cost of the migration is reduced, and the effectiveness of the scheme is verified.

Use the migration strategy to construct high and low load queues, and obtain the number of sub-blocks that different nodes should receive or send. After performing the migration operation, the number of partitions for each node is shown in Table 7.

Table 7

Figure 13 and Figure 14 show the effectiveness of the migration strategy, which can improve the load balance of the cluster and improve the response speed of the application to a certain extent. In related applications, when the amount of data is small, that is, when the amount of data in the correlation analysis application is less than 30 million, and when the amount of data in the Kmeans analysis application is less than 20 million, the load does not reach the set threshold, and migration is not triggered, but when it is executed When the amount of data is relatively large, that is, when the amount of data in the correlation analysis application reaches 30 million and the data volume in the Kmeans analysis application reaches 20 million, the load reaches the threshold and the migration is triggered. Although the load balance of the cluster is improved, the migration takes time Cost, resulting in a longer total time. When the amount of data further increases, load imbalance intensifies, resulting in relatively small migration overhead and improving the response speed of the application.

As shown in Figure 15 and Figure 16, the migration test is performed on different applications, and the overall average load utilization rate of each node in the entire application process before and after the migration is compared. It can be seen that the migration can improve the balance of the cluster load.

In the data pre-partitioning stage, the partitioning strategy based on load prediction + AHP index weight judgment has the best effect, which can solve the load balance of the cluster and improve the response speed of the application; after the data has been distributed, the load is unbalanced In the case of migration, the load balance of the cluster can be solved and the response speed of the application can be improved.

The present invention provides a data dynamic partition system based on node load. There are many specific methods and ways to implement this technical solution. The above are only the preferred embodiments of the present invention. It should be pointed out that for those of ordinary skill in the art, In other words, without departing from the principle of the present invention, several improvements and modifications can be made, and these improvements and modifications should also be regarded as the protection scope of the present invention. All the components that are not clear in this embodiment can be implemented using existing technology.

Claims

A data dynamic partition system based on node load, which is characterized by comprising a load monitoring module, an acquisition module, a data pre-partitioning module and a data migration module;

The load monitoring module is used to select load information indicators and monitor the load information indicator values on each node in the distributed cluster in real time;

The collection module is used to periodically collect the load information index value on each node in the distributed cluster;

The data pre-partitioning module is used to predict the load information index value on each node in the distributed cluster, and then obtain the processing capacity of each node according to the index weight method, and finally distribute different amounts of data according to the processing capacity of each node , Complete data pre-partitioning;

The data migration module is used to trigger data migration between nodes to improve load balance when a load imbalance problem occurs in the distributed cluster.
The system according to claim 1, wherein the load monitoring module selects CPU utilization, memory utilization, and bandwidth utilization as load information index values, and monitors distributed clusters in real time by deploying Memsql resource monitoring services The load information index value on each node in the.
The system according to claim 2, wherein the collection module periodically obtains the load information index value on each node in the distributed cluster through the API provided by the distributed Yarn resource management component, and saves it in the database.
The system according to claim 3, wherein the data pre-partitioning module is used to predict the load information index value on each node in the distributed cluster, and then obtain it according to the AHP and entropy subjective and objective index weight integration method The processing capacity of each node, and finally, according to the processing capacity of each node, different data volumes are distributed to complete the data pre-partitioning, which specifically includes the following steps:

Step 1. Use the quadratic exponential smoothing method to predict the load information index value:

The formula of the one-time exponential smoothing method is as follows:

The formula of the quadratic exponential smoothing method is as follows:

Combining the first and second exponential smoothing formulas, the load forecast value of the T-th cycle is obtained, and the formula is as follows:

Among them, Y j is the actual value of the load information index value of the jth cycle,
with
They are the predicted value of the load information index value of the j-1 cycle and the predicted value of the load information index value of the j cycle,
with
Are the second exponential smoothing value of the j-1 period and the second exponential smoothing value of the j period,
Is the predicted value of the load information index value of the j+T cycle; a j and b j are intermediate parameters; α is the smoothing coefficient;

The collection module sends the load information index value on each node in the distributed cluster collected in the first n-1 cycles in the database to the data pre-partitioning module, and it is combined with the load information index value on each node in the current cycle. Is the load data of n, take the actual value of the load information index value measured in the first cycle as the initial value Y j , the initial value of the first prediction and the initial value of the second prediction, and use the obtained n load data to predict the future d cycles The load information index value on each node, calculate the average value P of the load information index value of a node in the future d cycles, and finally determine the load information index value of each node in the cluster;

Step 2. Calculate the processing capacity of each node;

Step 3: Distribute different amounts of data according to the processing capacity of each node.
The system according to claim 4, wherein in step 1, the value of the smoothing coefficient is obtained by calculating the standard deviation S:

Among them, n represents the number of cycles taken, the partial variance S is calculated by adjusting the value of the smoothing coefficient α, and the value of the corresponding smoothing coefficient α is taken when S is the smallest.
The system according to claim 5, wherein step 2 includes the following steps:

Step 2-1, use the AHP subjective weight method to calculate: In multi-attribute decision-making, the decision maker compares all evaluation indicators pairwise to obtain the judgment matrix U=(A ij ) n×n , where A ij is the evaluation index A The value obtained by comparing i with A j is an odd number between 1 and 9, that is, when the value is 1, 3, 5, 7, and 9, respectively, it means that the former index is equally important, more important, and very important than the latter index. Important, very important, and extremely important; when the value is an even number between 1 and 9, it means that the importance of the pairwise comparison is between the importance degrees of two adjacent odd numbers, that is, the value is 2. When indicates that the importance of the pairwise comparison is between the importance of two adjacent odd numbers 1 and 3, and

Comparing CPU utilization, memory utilization and bandwidth utilization in pairs, the judgment matrix A is obtained:

Among them, A 1 , A 2 , and A 3 represent the weight value of the impact of a node's CPU utilization on the overall load of the node, the weight value of the impact of memory utilization on the overall load of the node, and the weight of the impact of bandwidth utilization on the overall load of the node. . Perform a normalization operation on each column of the judgment matrix A, obtain the column eigenvectors, and then perform a normalization operation on each row, obtain the row eigenvectors, and finally obtain the weight ratio of each indicator, and compare the judgment matrix A Perform consistency check, and finally get the subjective weights of a node's CPU, memory, and bandwidth as WS 1 , WS 2 , WS 3 , and WS 1 +WS 2 +WS 3 =1;

Step 2-2, calculate the eigenvectors and index weights of the matrix:

Sum the columns of the matrix, the vector of the column sum is: SUM j ;

To normalize each column of the matrix, the formula is as follows:

The value of ∑A ij is the sum of each column SUM j, and B ij represents the normalized data of A ij. A new matrix B is obtained according to B ij , and the sum of the values of each column in the B matrix is 1;

Sum each row of matrix B to obtain the eigenvector SUM i ;

Calculate the index weight and normalize the feature vector, the formula is as follows:

According to the above formula, the three index weights are finally obtained as W 1 , W 2 , W 3 ;

Step 2-3, check the consistency of the matrix:

Calculate the largest characteristic root of the matrix, the formula is as follows:

Among them, λ max is the maximum eigenvalue, AW represents the matrix A and the weight vector W are multiplied to obtain a column vector, n represents the order of the matrix, and W represents the weight vector;

To calculate the consistency index of the judgment matrix, the formula is as follows:

Among them, C.I. represents the consistency index, and n represents the order of the matrix;

Calculate the random consistency ratio C.R., the calculation formula is as follows:

Among them, RI stands for the average random consistency index, which is a constant, which can be queried in the scale according to the order; the third-order RI=0.89, if CR<0.1, it means that the comparison matrix is consistent; if CR>0.1, it means comparison The matrix is not consistent and needs to be adjusted;

Steps 2-4, calculate the objective weight method by entropy method:

Construct load information decision matrix M:

Among them, CUR n , MUR n , and BUR n respectively represent the CPU utilization, memory utilization, and bandwidth utilization predicted in the nth cycle of a node;

Standardize each column of the decision matrix M to obtain the decision matrix R:

among them
R i1 represents the element in the ith row and the first column of the decision matrix R, and each column of the decision matrix R satisfies the normalization, namely
That is, the sum of the values in each column is 1, j = 1, 2, 3;

Calculate the entropy of the load information index according to the following formula:

E j represents the entropy value of the load information index, and the constant K = 1/ln(n), then 0≤E j ≤1, that is, the maximum E j is 1, and when j is 1, E j represents the entropy value of the CPU utilization; When j is 2, E j represents the entropy value of memory utilization; when j is 3, E j represents the entropy value of bandwidth utilization;

Define D j as the contribution degree of the j-th load information index E j : D j =1-E j ;

Step 2-5, calculate the objective weight value WO j of each load information index:

WO 1 , WO 2 , WO 3 respectively represent the objective weight value of CPU's impact on node load, the objective weight value of memory on node load, and the objective weight of bandwidth on node load, and WO 1 +WO 2 +WO 3 = 1;

Step 2-6, calculate the weight w i of the final load information index of the node:

w i =β×WS i +(1-β)×WO i , (1-12)

Where β is the subjective and objective weight adjustment coefficient, w i is the weight of the final node load, where i = 1, 2, 3, and w 1 + w 2 + w 3 = 1, w 1 represents the final CPU utilization weight, w 2 represents the weight of the final memory utilization, w 3 represents the weight of the final bandwidth utilization;

Step 3-4, compute the processing capacity of the node:

CA i = w 1 ×(1-CAU i )+w 2 ×(1-MAU i )+w 3 ×(1-BAU i ), (1-13)

Among them, CAU i , MAU i , and BAU i respectively represent the predicted CPU utilization, memory utilization, and bandwidth utilization of the i-th node in the current cycle, and CA i represents the processing capacity of the i-th node.
The system according to claim 6, wherein step 3 comprises:

Calculate the proportion of the amount of data to be distributed by each node:

Among them, DP i represents the proportion of the amount of data that should be allocated to the i-th node, and m represents the total number of nodes.
The system according to claim 7, wherein the data migration module sets high and low load thresholds as conditions for triggering data migration, and constructs a selection queue for the source and target machines. When the load is unbalanced, When selecting the source machine and the target machine for data migration, the source machine is the node of the data to be migrated, the target machine is the node that accepts the data to be migrated, and the amount of data that should be migrated is obtained.
The system according to claim 8, wherein the data migration module sets high and low load thresholds as conditions for triggering data migration, and constructs a selection queue for the source machine and the target machine. In the event of load imbalance, When selecting the source machine and the target machine for data migration, the source machine is the node of the data to be migrated, the target machine is the node that accepts the data to be migrated, and the amount of data to be migrated is obtained, which specifically includes the following steps:

Step a1, select the source machine:

Calculate the overall load value of each node:

Load i = w 1 ×CUR i +w 2 ×MUR i +w 3 ×BUR i , (1-15)

Wherein, Load i represents the overall load value of the i-th node. The overall load value of each node is compared with the set threshold H th . If the overall load value of a node exceeds the threshold H th , the node is added to high load queue node according to the descending integral source load value selection unit queues S y = {s 1, s 2, ......, s m}, s m S y represents a queue of the m-th node, i.e. a whole The node with the smallest load value;

For each node in the Sy queue, select the source machine according to the overall load value in descending order;

Step a2, select the target machine: compare the overall load value of each node with the set threshold L th , if the load value of a node is lower than the threshold L th , add the node to the low-load node queue, according to The overall load value from small to large constitutes the target machine selection queue D m ={d 1 , d 2 ,..., d z }, d z represents the z-th node in the queue D m , that is, the node with the largest overall load value;

For each node in the D m queue, select the target machine in the order of the overall load value from small to large;

Step a3, perform data migration:

If the number of nodes in the high-load and low-load queues is the same, that is, m=z, the nodes in the high-load and low-load queues will be matched and migrated in parallel in sequence. The formula for the number of migrated partitions is as follows:

Where N q represents the number of partitions to be migrated, N y represents the number of partitions in the source machine, and N m represents the number of partitions in the target machine;

If the number of high-load queue nodes is greater than the number of low-load nodes, that is, Sy >D m , adjust the low-load threshold appropriately so that the number of nodes in the low-load node queue is equal to or nearly greater than the number of nodes in the high-load node queue, and then follow Formula 1-16 sets the number of partitions to be migrated;

If the number of high-load queue nodes is much smaller than the number of low-load nodes, that is, Sy <D m , then appropriately lower the high-load threshold so that the number of nodes in the queue of high-load nodes is equal to or nearly less than the number of nodes in the queue of low-load nodes, and then Set the number of partitions to be migrated according to formula 1-16;

After obtaining the number of partitions that the source machine should migrate, the data can be migrated.