CN111144701A

CN111144701A - ETL job scheduling resource classification evaluation method under distributed environment

Info

Publication number: CN111144701A
Application number: CN201911225107.5A
Authority: CN
Inventors: 杜海; 唐伟力; 苗青鹏; 吴迪
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2020-05-12
Anticipated expiration: 2039-12-04
Also published as: CN111144701B

Abstract

The invention discloses a method for classifying and evaluating ETL job scheduling resources in a distributed environment, comprising the following steps: step one, determining an index system for evaluating the performance of an ETL server; step two, determining the ETL job type; step three, performing cluster analysis on ETL server index data based on the index system to obtain ETL server candidate sets classified in correspondence with the ETL job types; step four, establishing an index evaluation matrix and calculating the information entropy of each index and its information entropy weight; step five, ranking the servers within each ETL server candidate set; and step six, for an ETL job whose type has been determined according to step two, selecting the top-ranked ETL server from the candidate set of the corresponding class. The invention uses cluster analysis and an evaluation method based on information entropy to evaluate ETL server performance, and automatically matches ETL jobs that have not yet entered the execution state to servers, so that idle ETL server resources are fully utilized and computing resources are allocated dynamically and in quasi-real time.

Description

ETL job scheduling resource classification evaluation method under distributed environment

Technical Field

The invention belongs to the technical field of networks, and particularly relates to a method for classifying and evaluating ETL job scheduling resources in a distributed environment.

Background

The rapid development of information technology has brought explosive growth in data, and data warehouses have become larger in scale and more complex in architecture. The most important link in data warehouse construction is ETL data access: the extraction, transformation and loading of data accounts for 80% of the workload of building a data warehouse. Distributed data access is currently the mainstream technology, since data access jobs can be spread over a cluster of relatively cheap computers. This necessarily raises the problem of how to distribute ETL jobs reasonably and make full use of the computing resources of the ETL servers. In a medium-scale data warehouse in the field of network operation and maintenance, more than one hundred ETL (extract-transform-load) access tasks are generally kept in the ready state in order to meet statistical requirements such as data analysis, daily reports, weekly reports and monthly reports, while the ETL server cluster generally consists of about twelve ordinary computers or virtual machines. If a simple "whoever is idle does the processing" strategy is adopted, the resources of the ETL server cluster are wasted, as shown in FIG. 1.

Generally, data access tasks comprise three types, namely data extraction, data transformation and data loading, which weigh on CPU, memory and I/O resources to different degrees. If an ETL server is running an ETL job that occupies a large amount of I/O resources for a long time, its CPU computing resources remain idle, which is a great waste with respect to ETL jobs that need CPU computing resources. Similarly, on a computer performing a large amount of memory-intensive work, idle I/O capacity is a waste with respect to ETL jobs that only need simple table-to-table data replication. If ETL jobs and the idle resources of ETL servers can be matched accurately, the efficiency of executing ETL tasks can be greatly improved. This requires first determining the type of each ETL job, then accurately evaluating the service capability of the idle resources of each ETL server, and finally matching the ETL jobs to be executed with the ETL servers precisely, as shown in FIG. 2.

Disclosure of Invention

The object of the invention is to provide, in view of the above technical problem, a method for classifying and evaluating ETL job scheduling resources in a distributed environment.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for classifying and evaluating ETL job scheduling resources in a distributed environment comprises the following steps:

step one, determining an index system for evaluating the performance of an ETL server;

step two, determining the type of each ETL job by calculating a comprehensive evaluation value of the ETL job;

step three, performing cluster analysis on ETL server index data based on the index system to obtain ETL server candidate sets classified in correspondence with the ETL job types;

step four, for each ETL server candidate set, establishing an index evaluation matrix and calculating the information entropy of each index and its information entropy weight;

step five, calculating the distance between the index evaluation of each ETL server and the ideal point according to the calculation result of step four, and sorting the servers in each ETL server candidate set by distance value;

and step six, forming queues of ETL jobs according to the ETL job types determined in step two, and selecting the top-ranked target server from the ETL server queue of the corresponding class.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The invention uses cluster analysis and an evaluation method based on information entropy to achieve a fine-grained classification of ETL servers and ETL jobs. It evaluates ETL server performance on the basis of information entropy theory and automatically matches and executes the classified ETL jobs, so that idle ETL server resources are fully utilized, computing resources are allocated dynamically and in quasi-real time, the efficiency of data access is improved, and the growing demands of ever larger data warehouses are met.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of simple policy allocation scheduling resources;

FIG. 2 is a schematic diagram of a classification evaluation policy scheduling resource;

FIG. 3 is a flow diagram of a method for evaluating and scheduling ETL jobs by ETL server classification in a distributed environment according to the present invention.

Detailed Description

As shown in FIG. 3, the method for classifying and evaluating ETL job scheduling resources in a distributed environment according to the present invention comprises the following steps:

Step one, determining an index system for evaluating the performance of an ETL server;

The indexes for evaluating ETL server performance are divided into a CPU class, a memory class and an I/O class. The CPU class comprises the CPU utilization rate and the CPU process queue occupancy rate; the memory class comprises the memory occupancy rate, the buffer no-wait rate and the Redo buffer no-wait rate; the I/O class comprises the I/O busy rate. The index system for evaluating ETL server performance is shown in Table 1.

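Table 1: Index system for evaluating ETL server performance

Index class | Indexes
CPU class | CPU utilization rate; CPU process queue occupancy rate
Memory class | memory occupancy rate; buffer no-wait rate; Redo buffer no-wait rate
I/O class | I/O busy rate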

Step two, determining the type of each ETL job by calculating a comprehensive evaluation value of the ETL job;

According to the index system for evaluating ETL server performance, ETL job types are divided into a CPU type, a memory type and an I/O type, corresponding to the index classes used to evaluate the servers. For each ETL job, weights are determined for its data volume, its transformation workload (the number of transformation components) and its loading workload (the number of loading components); these weights are the influence factors of the CPU utilization rate, the memory occupancy rate and the I/O busy rate, respectively. The three quantities and their weights are substituted into the comprehensive evaluation value formula, and the ETL job type is then determined from the resulting value.

The specific method comprises the following steps:

(1) Denote the ETL job type by ETLS and define its comprehensive evaluation value as:

$$\mathrm{ETLS} = DS \cdot \alpha + TS \cdot \beta + LD \cdot \gamma$$

where DS is the data volume, TS is the number of transformation components, LD is the number of loading components, and α, β and γ are the influence factors of CU (CPU utilization rate), MA (memory occupancy rate) and IU (I/O busy rate), respectively. Each factor takes a value in (0, 1) and can be adjusted according to expert knowledge and test results.

(2) Determining the ETL job type according to the following formula by using the calculated comprehensive evaluation value:
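As an illustration of item (1), the comprehensive evaluation value can be sketched in Python as follows. The class name `EtlJob`, the function `composite_score` and all numeric values are hypothetical and chosen only for the example; the rule of item (2) that maps the value to a job type is not part of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EtlJob:
    """Hypothetical container for the three quantities used by ETLS."""
    name: str
    data_volume: float          # DS
    transform_components: int   # TS
    load_components: int        # LD


def composite_score(job: EtlJob, alpha: float, beta: float, gamma: float) -> float:
    """Return ETLS = DS*alpha + TS*beta + LD*gamma.

    Each influence factor must lie in the open interval (0, 1).
    """
    for factor in (alpha, beta, gamma):
        if not 0.0 < factor < 1.0:
            raise ValueError("influence factors must lie in (0, 1)")
    return (job.data_volume * alpha
            + job.transform_components * beta
            + job.load_components * gamma)


# Hypothetical job and factors, for illustration only.
job = EtlJob("daily_report", data_volume=120.0,
             transform_components=8, load_components=2)
score = composite_score(job, alpha=0.5, beta=0.25, gamma=0.25)
print(score)
```

In practice the three factors would be tuned from expert knowledge and test results, as stated above.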

Step three, performing cluster analysis on ETL server index data based on the index system to obtain ETL server candidate sets classified in correspondence with the ETL job types;

The cluster analysis borrows the idea of fuzzy classification from fuzzy mathematics. Its basic idea is as follows: a set of index vectors is established in an m-dimensional space; the number of classes and an initial classification are given first, yielding the initial membership of each vector; the cluster center of each initial class is calculated; and the procedure is then iterated until each vector belongs to some class with a definite degree of membership.

Specifically, according to the actual number n of ETL servers, a six-dimensional vector space is established based on the index system for evaluating ETL server performance, and the following steps are then executed:

Step 3.1, determining the number of initial classes and the initial classification of the ETL servers to obtain the initial membership degrees;

Let the ETL servers be initially divided into S classes $C_j$ ($j \le S$), and let $X_i$ be a vector in the six-dimensional vector space; the initial membership degree $u_{ij}$ is expressed as follows:

Further, the number S of initial classes of the ETL servers equals the number of ETL job types; that is, the ETL servers are initially divided into 3 classes, corresponding to the CPU class, the memory class and the I/O class, respectively.

Step 3.2, calculating the distance from each vector to the center of each class as a Euclidean distance;

At initialization, let $v_j$ be the cluster center of class $C_j$; the distance from each vector to the cluster center is expressed as follows:

where l is the iteration number, m is the number of evaluation indexes, and $v_{jk}^{(l)}$ denotes the value of the k-th component of $v_j$ at the l-th iteration;

Step 3.3, calculating the membership degree of each vector, expressed as follows:

where α is an empirical constant that governs the convergence rate and is generally set to α > 1;

Step 3.4, judging whether the membership degrees have converged; if the following condition is satisfied, the iteration that began at step 3.2 ends;

where ε is an empirical constant, generally set to 0.05;

Step 3.5, calculating a new cluster center for each ETL server class;

where $V_j^{(l+1)}$ is the cluster center of class $C_j$;

Step 3.6, assigning each ETL server to one specific ETL server candidate set (CPU class, memory class or I/O class) according to the cluster centers and the class subscript of the server's vector.
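Steps 3.1 to 3.6 can be illustrated with a minimal Python sketch. It assumes the standard fuzzy c-means update rules (membership inversely related to relative distance raised to the power 2/(α-1), centers as membership-weighted means), which may differ in detail from the expressions of this method. The function names and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_centers(points: List[Vector], u: List[List[float]],
                   s: int, alpha: float) -> List[List[float]]:
    """Cluster centers as means weighted by membership**alpha (assumed rule)."""
    dims = len(points[0])
    centers = []
    for j in range(s):
        w = [row[j] ** alpha for row in u]
        total = sum(w)  # assumes every class has at least one member
        centers.append([sum(wi * p[k] for wi, p in zip(w, points)) / total
                        for k in range(dims)])
    return centers


def update_memberships(points: List[Vector], centers: List[List[float]],
                       alpha: float) -> List[List[float]]:
    exponent = 2.0 / (alpha - 1.0)
    u = []
    for p in points:
        d = [euclidean(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # The vector coincides with a center: full membership there.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((dj / dc) ** exponent for dc in d) for dj in d])
    return u


def fuzzy_c_means(points: List[Vector], initial_labels: List[int], s: int = 3,
                  alpha: float = 2.0, eps: float = 0.05,
                  max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    # Step 3.1: hard initial classification gives 0/1 initial memberships.
    u = [[1.0 if lab == j else 0.0 for j in range(s)] for lab in initial_labels]
    centers: List[List[float]] = []
    for _ in range(max_iter):
        centers = update_centers(points, u, s, alpha)      # steps 3.2 / 3.5
        new_u = update_memberships(points, centers, alpha)  # step 3.3
        change = max(abs(a - b) for r_new, r_old in zip(new_u, u)
                     for a, b in zip(r_new, r_old))
        u = new_u
        if change < eps:                                    # step 3.4
            break
    # Step 3.6: each server joins the class with its largest membership.
    labels = [max(range(s), key=row.__getitem__) for row in u]
    return labels, centers


# Hypothetical six-index vectors for six servers (values are illustrative).
servers = [
    [0.9, 0.9, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.9],
    [0.8, 0.9, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.8, 0.9, 0.9, 0.1],
    [0.1, 0.2, 0.1, 0.1, 0.1, 0.8],
]
# The last server is deliberately placed in the wrong initial class.
labels, centers = fuzzy_c_means(servers, initial_labels=[0, 1, 2, 0, 1, 1])
print(labels)
```

The defaults `alpha=2.0` and `eps=0.05` follow the ranges given above (α > 1, ε = 0.05); `max_iter` is an added safeguard that the source does not mention.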

Step four, establishing an index evaluation matrix aiming at various ETL server candidate sets, and calculating the information entropy of the index and the weight of the information entropy;

The index that most affects the ETL server is analyzed on the principle of information entropy, and the entropy of each index and its weight in the evaluation model are determined quantitatively.

The method specifically comprises the following steps:

Step 4.1, establishing an index evaluation matrix X;

$$X = (x_{ij})_{n \times m}$$

where n is the number of ETL servers in the candidate set, m is the number of indexes for evaluating ETL server performance (six), and i and j are the row and column subscripts of the matrix;

Step 4.2, normalizing the indexes;

(1) processing of positively correlated indexes:

where $a_{ij}$ is the raw data of a given index;

(2) processing of negatively correlated indexes:

where $a_{ij}$ is the raw data of a given index;

Step 4.3, calculating the information entropy of each index;

$$E_j = -k \sum_{i=1}^{n} r_{ij} \ln r_{ij}, \quad j = 1, \ldots, m$$

where $r_{ij}$ is the normalized index value and k is a constant, generally set to k = 1;

The information entropy represents the degree of disorder of the information: the larger its value, the less information the index carries and the smaller its contribution to the overall evaluation.

Step 4.4, calculating the information deviation degree $d_j$;

$$d_j = 1 - E_j$$

Step 4.5, calculating the weight of the information entropy of the index;

According to the weight $w_j$ of each index of the ETL server, a composite performance score for the indexes can be calculated.
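Step four can be illustrated with a minimal Python sketch of the entropy weight calculation. Several points are assumptions rather than statements of this method: min-max normalization is used for step 4.2, each normalized column is converted to proportions before the entropy is taken, the weight is taken as $w_j = d_j / \sum_j d_j$, and k defaults to 1/ln(n), the common choice that keeps $E_j$ in [0, 1], whereas the text above states k = 1. Function names and data are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def min_max(column: Sequence[float], positive: bool) -> List[float]:
    """Assumed step 4.2: min-max normalization, reversed for negative indexes."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [1.0] * len(column)  # a constant index carries no information
    if positive:
        return [(a - lo) / (hi - lo) for a in column]
    return [(hi - a) / (hi - lo) for a in column]


def entropy_weights(matrix: List[List[float]], positive: Sequence[bool],
                    k: Optional[float] = None
                    ) -> Tuple[List[List[float]], List[float]]:
    """Return the normalized matrix R and the entropy weights w_j."""
    n, m = len(matrix), len(matrix[0])
    if k is None:
        k = 1.0 / math.log(n)  # assumption; requires n >= 2
    columns = [min_max([row[j] for row in matrix], positive[j]) for j in range(m)]
    deviations = []
    for col in columns:
        total = sum(col)
        props = [x / total for x in col]
        # Step 4.3: E_j = -k * sum(p * ln p), with 0 * ln 0 taken as 0.
        entropy = -k * sum(p * math.log(p) for p in props if p > 0.0)
        deviations.append(1.0 - entropy)  # step 4.4: d_j = 1 - E_j
    d_sum = sum(deviations)  # assumes at least one index is not constant
    weights = [d / d_sum for d in deviations]  # assumed step 4.5
    r = [[columns[j][i] for j in range(m)] for i in range(n)]
    return r, weights


# Hypothetical data: 4 servers x 3 indexes (utilization, no-wait rate, constant).
matrix = [
    [0.2, 0.9, 0.5],
    [0.4, 0.7, 0.5],
    [0.6, 0.8, 0.5],
    [0.8, 0.6, 0.5],
]
r, weights = entropy_weights(matrix, positive=[False, True, True])
print(weights)
```

In this sketch an index whose values are identical across servers receives a weight of approximately zero, matching the statement that an index with larger entropy contributes less to the overall evaluation.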

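Step five, calculating the distance between the index evaluation of each ETL server and the ideal point according to the calculation result of step four, and sorting the servers in each ETL server candidate set by distance value;

The method specifically comprises the following steps: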

Step 5.1, taking the index evaluation matrix obtained in step 4.1

$$X = (x_{ij})_{n \times m}$$

and multiplying it row by row by the information entropy weights $w_j$, under the normalization constraint, to obtain the attribute matrix

$$B = (b_{ij})$$

Step 5.2, calculating an ideal point

Step 5.3, calculating the distance between each ETL server and the ideal point by substituting into the formula:

Step 5.4, sorting the ETL servers in each ETL server candidate set by distance value, wherein the smaller the distance value, the better the server meets the selection requirement.
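Steps 5.1 to 5.4 can be illustrated with a minimal Python sketch. It assumes that the attribute matrix is $b_{ij} = w_j r_{ij}$ with $r_{ij}$ already normalized so that larger is better, that the ideal point is the column-wise maximum of B, and that the distance is Euclidean; the expressions of this method may differ. Names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rank_by_ideal_point(names: Sequence[str], r: List[List[float]],
                        weights: Sequence[float]) -> List[Tuple[str, float]]:
    """Rank servers by distance to the ideal point, nearest first."""
    # Step 5.1: weight every row of the normalized matrix.
    b = [[w * x for w, x in zip(weights, row)] for row in r]
    # Step 5.2 (assumed): the ideal point takes the best value of each index.
    ideal = [max(col) for col in zip(*b)]
    # Step 5.3 (assumed): Euclidean distance from each server to the ideal point.
    distances = [math.sqrt(sum((x - p) ** 2 for x, p in zip(row, ideal)))
                 for row in b]
    # Step 5.4: the smaller the distance, the higher the rank.
    return sorted(zip(names, distances), key=lambda item: item[1])


# Hypothetical normalized values and weights for three servers.
names = ["etl-a", "etl-b", "etl-c"]
r = [
    [1.0, 1.0, 1.0],
    [0.5, 0.2, 0.0],
    [0.0, 0.6, 0.4],
]
ranking = rank_by_ideal_point(names, r, weights=[0.5, 0.3, 0.2])
for name, distance in ranking:
    print(name, round(distance, 4))
```

A server that is best on every index coincides with the ideal point and therefore has distance zero.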

Step six, determining the type of each ETL job according to step two, and forming three ETL job queues to be allocated (CPU type, memory type and I/O type) according to the scheduling times in the job plan. After step five is completed, the ETL servers serving as allocation targets form three ETL server queues (CPU class, memory class and I/O class), and each queue is sorted by distance from the ideal point, wherein the smaller the distance, the higher the priority. ETL jobs are allocated on a "first come, first served" basis: the server with the highest priority in the ETL server queue of the corresponding type is selected first, and that ETL server is removed from the queue once the allocation is completed. Computing resources are thereby allocated reasonably.
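The allocation of step six can be illustrated with a minimal Python sketch. The type keys `"cpu"`, `"memory"` and `"io"`, the function `allocate` and the sample names and distances are hypothetical; the sketch only shows first-come-first-served matching against server queues sorted by distance.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

JOB_TYPES = ("cpu", "memory", "io")


def allocate(job_queues: Dict[str, Deque[str]],
             server_queues: Dict[str, List[Tuple[str, float]]]
             ) -> List[Tuple[str, str]]:
    """Match waiting jobs to servers of the same type.

    job_queues holds job names in scheduling order; server_queues holds
    (server name, distance to ideal point) pairs. Both are updated in place:
    assigned jobs and servers are removed from their queues.
    """
    assignments = []
    for job_type in JOB_TYPES:
        jobs = job_queues.get(job_type, deque())
        # Smaller distance to the ideal point means higher priority.
        ranked = deque(sorted(server_queues.get(job_type, []),
                              key=lambda item: item[1]))
        while jobs and ranked:
            job = jobs.popleft()          # first come, first served
            server, _ = ranked.popleft()  # best remaining server
            assignments.append((job, server))
        server_queues[job_type] = list(ranked)
    return assignments


# Hypothetical queues for illustration.
job_queues = {"cpu": deque(["job-1", "job-2"]), "io": deque(["job-3"])}
server_queues = {
    "cpu": [("etl-b", 0.40), ("etl-a", 0.05)],
    "memory": [("etl-d", 0.20)],
    "io": [("etl-c", 0.10)],
}
assignments = allocate(job_queues, server_queues)
print(assignments)
```

Jobs that find no free server of their type stay in their queue and can be matched on a later pass, once the server ranking has been refreshed.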

As can be seen from the above, the present invention has the following beneficial effects:

The invention uses cluster analysis and an evaluation method based on information entropy to achieve a fine-grained classification of ETL servers and ETL jobs. It evaluates ETL server performance on the basis of information entropy theory and automatically matches ETL jobs that have not yet entered the execution state to servers, so that idle ETL server resources are fully utilized, computing resources are allocated dynamically, in quasi-real time and with relatively high accuracy, the efficiency of data access is improved, and the growing demands of ever larger data warehouses are met.

Claims

1. A method for classified evaluation of ETL job scheduling resources in a distributed environment is characterized by comprising the following steps:

step one, determining an index system for evaluating the performance of an ETL server;

2. The method for classifying and evaluating ETL job scheduling resources in a distributed environment according to claim 1, wherein step one comprises: dividing the indexes for evaluating ETL server performance into a CPU class, a memory class and an I/O class; the CPU class comprises the CPU utilization rate and the CPU process queue occupancy rate; the memory class comprises the memory occupancy rate, the buffer no-wait rate and the Redo buffer no-wait rate; and the I/O class comprises the I/O busy rate.

3. The method for classifying and evaluating ETL job scheduling resources in a distributed environment according to claim 2, wherein the method of the second step comprises:

$$\mathrm{ETLS} = DS \cdot \alpha + TS \cdot \beta + LD \cdot \gamma$$

wherein DS is the data volume, TS is the number of transformation components, LD is the number of loading components, and α, β and γ are the influence factors of CU, MA and IU, respectively, each taking a value in (0, 1);

4. The method according to claim 2, wherein step three comprises establishing, according to the actual number n of ETL servers, a six-dimensional vector space based on the index system for evaluating ETL server performance, and then performing the following steps:

letting the ETL servers be initially divided into S classes $C_j$ ($j \le S$) and $X_i$ be a vector in the six-dimensional vector space, the initial membership degree $u_{ij}$ being expressed as follows:

wherein $v_{jk}^{(l)}$ denotes the value of the k-th component of $v_j$ at the l-th iteration;

wherein α is an empirical constant for the convergence rate;

wherein ε is an empirical constant;

step 3.5, calculating a new cluster center for each ETL server class;

wherein $V_j^{(l+1)}$ is the cluster center of class $C_j$;

step 3.6, assigning each ETL server to one specific ETL server candidate set according to the cluster centers and the class subscript of the server's vector.

5. The method of claim 4, wherein the initial classification S of the ETL server corresponds to the number of ETL job types.

6. The method for classifying and evaluating ETL job scheduling resources in a distributed environment according to claim 4, wherein α > 1.

7. The method of claim 4, wherein ε is 0.05.

8. The method for classification and evaluation of ETL job scheduling resources in a distributed environment according to claim 1, wherein the method of step four is:

step 4.1, establishing an index evaluation matrix X;

$$X = (x_{ij})_{n \times m}$$

wherein n is the number of ETL servers in the candidate set, m is the number of indexes for evaluating ETL server performance (six), and i and j are the row and column subscripts of the matrix;

step 4.2, normalizing the indexes;

(1) processing of positively correlated indexes:

wherein $a_{ij}$ is the raw data of a given index;

(2) processing of negatively correlated indexes:

wherein $a_{ij}$ is the raw data of a given index;

step 4.3, calculating the information entropy of each index;

$$E_j = -k \sum_{i=1}^{n} r_{ij} \ln r_{ij}, \quad j = 1, \ldots, m$$

wherein k is a constant;

step 4.4, calculating the information deviation degree $d_j$;

$$d_j = 1 - E_j$$

Step 4.5, calculating the weight of the information entropy of the index;

9. The method of claim 8, wherein k = 1.

10. The method for classifying and evaluating ETL job scheduling resources in a distributed environment according to claim 1, wherein the method of the fifth step is:

(1) obtaining the index evaluation matrix according to the calculation result of step four

$$X = (x_{ij})_{n \times m}$$

and obtaining from it the attribute matrix

$$B = (b_{ij})$$

(2) calculating the ideal point

(3) calculating the distance between each ETL server and the ideal point by substituting into the formula:

(4) sorting the ETL servers in each ETL server candidate set by distance value, wherein the smaller the distance value, the better the server meets the selection requirement.