CN106888237B - Data scheduling method and system - Google Patents

Data scheduling method and system

Info

Publication number
CN106888237B
CN106888237B (application CN201510937896.0A)
Authority
CN
China
Prior art keywords
server
attribute
data
performance data
processing performance
Prior art date
Legal status
Active
Application number
CN201510937896.0A
Other languages
Chinese (zh)
Other versions
CN106888237A (en)
Inventor
张宝海
鲍媛媛
Current Assignee
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority claimed from CN201510937896.0A
Publication of CN106888237A
Application granted
Publication of CN106888237B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/101 Server selection for load balancing based on network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1029 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer

Abstract

The invention discloses a data scheduling method and system. The method comprises: determining, based on historical processing performance data of at least one server, at least one attribute included in that data and at least one category corresponding to each attribute; establishing a server evaluation model based on the historical processing performance data of the at least one server, the model comprising at least one branch path composed of the at least one attribute and the at least one category, with the leaf node of each branch path being an evaluation result; evaluating at least one server in the server cluster with the server evaluation model to obtain an evaluation result for each of the at least one server; and scheduling data according to the evaluation result of each of the at least one server.

Description

Data scheduling method and system
Technical Field
The present invention relates to a server cluster management technology in the field of communications, and in particular, to a data scheduling method and system.
Background
With the arrival of the big-data era, big-data development has become a national strategy, and with the continuous development of hardware, the software and hardware performance of data centers keeps improving. Network bandwidth bottlenecks are continually being broken through, and ten-gigabit networks have become standard in data centers. The storage and computing power of servers also keeps being upgraded and optimized in step with Moore's law. However, most scheduling strategies based on traditional data distribution cannot meet the demands of large data volumes and real-time data transmission in the current big-data environment.
These scheduling strategies solve the connection and scheduling problems of data transmission to a certain extent, but they do not keep pace with the rapid growth of transmission demands and hardware configurations in today's big-data environment. Typical examples:
    • Round Robin: a weaker server still receives a request on the next round even when it can no longer process the current one, which may overload it.
    • Weighted Round Robin: the administrator simply assigns each server a weight according to its processing power.
    • Least Connection: incoming requests are assigned according to the number of connections currently open on each server, i.e. the server with the fewest active connections automatically receives the next request. If all servers are otherwise equal, however, the first server is essentially always preferred.
    • Least Connection Slow Start Time: requests are handled according to a transition time configured by the administrator.
    • Weighted Least Connection: active-connection counts are combined with weights that the administrator customizes per server.
    • Fixed Weighted: the weight of each real server must be configured according to server priority.
    • Weighted Response: assumes that server heartbeat response speed reflects machine speed, which is not always true.
    • Source IP Hash: the same host always maps to the same server, which can unbalance server load.
As can be seen, none of the scheduling approaches provided in the prior art can guarantee performance analysis and scheduling according to the attributes of the servers, and therefore they cannot guarantee timely processing during data scheduling.
Disclosure of Invention
In view of the above, the present invention provides a data scheduling method and system, which can at least solve the above problems in the prior art.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data scheduling method, which comprises the following steps:
determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
establishing a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and scheduling data according to the evaluation result of each server in the at least one server.
An embodiment of the present invention provides a data scheduling system, including:
the data preprocessing unit is used for determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
the model establishing unit is used for establishing a server evaluation model based on the historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit is used for evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and the scheduling unit is used for scheduling data according to the evaluation result of each server in the at least one server.
Embodiments of the invention provide a data scheduling method and system. A server decision model is established from at least one attribute, and at least one category per attribute, of the historical processing performance data of each server in a server cluster; the performance data of each machine is then evaluated in real time to judge whether the machine is idle or busy, and data is distributed to the more idle machines based on the result. The various attributes of machine performance are thus fully considered and performance is analyzed on that basis, improving the accuracy of data scheduling and the timeliness of data processing.
Drawings
FIG. 1 is a flow chart of a data scheduling method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a model building method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a data scheduling system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The first embodiment,
The present embodiment provides a data scheduling method, as shown in fig. 1, the method includes:
step 101: determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
step 102: establishing a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
step 103: evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
step 104: and scheduling data according to the evaluation result of each server in the at least one server.
The method provided by the embodiment can be applied to a server cluster.
The evaluation result in the server evaluation model, i.e. the leaf node, may be the idleness of the server; for example, a server may be evaluated as idle, busy, or normal.
Further, in this embodiment, the determining at least one attribute included in the historical processing performance data based on the historical processing performance data of at least one server and the at least one category corresponding to each attribute may be a preprocessing operation.
Wherein the historical processing performance data may further include: identification information of the server, the idleness of the server, and the like.
The at least one attribute may be CPU idleness, memory occupation, network congestion, disk read/write status, client connection count, and the like. The corresponding categories may include: CPU idle, busy, or normal; memory occupation high, normal, or low; network congestion severe, normal, or none; disk read/write normal or abnormal; client connection count large, small, or normal; and so on.
Specifically, assume a data set D with attributes A1, A2, …, Ak, labeled with n categories c1, c2, …, cn. For example, the attributes of the machines in the cluster (network I/O, disk I/O, CPU, memory, and so on) form the attribute set, and the degree of busyness serves as the category, e.g. idle, busy, or moderate.
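For illustration only, such a training set could be represented as follows; the attribute names, category values, and record layout are assumptions of this sketch, not prescribed by the patent:

```python
# Hypothetical layout for the historical processing performance data set D.
# Each record holds one server sample's attribute categories plus its label
# (the overall evaluation result, used as the class c_j).

ATTRIBUTES = ["cpu", "memory", "network_io", "disk_io", "connections"]

dataset = [
    {"cpu": "idle",   "memory": "low",    "network_io": "none",
     "disk_io": "normal",   "connections": "few",    "label": "idle"},
    {"cpu": "busy",   "memory": "high",   "network_io": "severe",
     "disk_io": "abnormal", "connections": "many",   "label": "busy"},
    {"cpu": "normal", "memory": "normal", "network_io": "normal",
     "disk_io": "normal",   "connections": "normal", "label": "normal"},
]

# The n categories c1..cn are the distinct labels observed in the data set.
categories = {row["label"] for row in dataset}
```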
Establishing a server evaluation model based on the historical processing performance data of the at least one server, including:
calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively;
determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute;
and establishing at least one branch path formed by the at least one attribute and the at least one category and leaf nodes of each branch path formed by the evaluation result based on the sorting of the at least one attribute.
In this step a classifier is constructed from the preprocessed data set, using the ID3 decision-tree algorithm. The decision-tree algorithm partitions the training data recursively; each recursion selects the optimal classification attribute for splitting the current data set, and this selection is realized through an impurity function. The ID3 algorithm uses information gain as its impurity function. Information gain is based on entropy from information theory: entropy measures the uncertainty of an object, and the larger the entropy, the higher the uncertainty.
The entropy of the historical processing performance data is calculated from the categories corresponding to its attributes, and the entropy corresponding to each attribute is calculated, as follows. Assuming a data set D having attributes A1, A2, …, Ak and containing n categories c1, c2, …, cn, the entropy of D in its original state can be expressed as:
E(D) = -\sum_{j=1}^{n} p(c_j)\log_2 p(c_j)
where p () is used to represent the calculated probability, that is to say p (cj) to calculate the probability that the jth class appears in the whole training tuple, the number of elements belonging to this class can be divided by the total number of elements of the training tuple as an estimate. The actual meaning of entropy represents the average amount of information needed for class labels of tuples in D.
The information gain of each attribute is determined based on the entropy of each attribute and the entropy of the historical processing performance data, and the at least one attribute is sequenced based on the information gain of each attribute to obtain at least one sequenced attribute; the calculation method for calculating the information gain of each attribute may be as follows:
If D can be partitioned into v disjoint subsets D1, D2, …, Dv using attribute Ai, the entropy of data set D partitioned by Ai is:
E_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, E(D_j)
the information gain of attribute Ai is:
\mathrm{Gain}(A_i) = E(D) - E_{A_i}(D)
the larger the information gain caused by data separation using a certain attribute is, the better the data separation effect of the attribute is.
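The gain computation follows directly from the two formulas above; a self-contained sketch (record layout and function names are assumptions):

```python
import math
from collections import Counter, defaultdict

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, attribute):
    """Gain(Ai) = E(D) - sum_j |Dj|/|D| * E(Dj), where the Dj are the
    subsets of D sharing one category value of the given attribute."""
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attribute]].append(r)
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - weighted
```

An attribute that perfectly predicts the label yields a gain equal to the full entropy of D; an attribute whose value is constant yields a gain of 0.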
The at least one attribute is sorted based on the information gain of each attribute to obtain at least one sorted attribute, and the information gain of each attribute may be sorted from large to small.
The establishing at least one branch path composed of at least one attribute and at least one category and the leaf node of each branch path composed of the evaluation result based on the ranking of the at least one attribute may be:
selecting the attribute with the largest information gain as the optimal classification attribute, i.e., taking it as the root node;
then, following the ranking, taking the other attributes as nodes for the different categories on different branch paths, and finally taking the evaluation result as the leaf node of each branch path.
The recursion terminates when each finally separated data subset is as pure as possible. The classifier produced by the algorithm has the form of a tree: each branch path represents one possible attribute value, and each leaf node corresponds to a category.
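Assembled, the recursion described above can be sketched as follows; the nested-dict tree representation and helper names are assumptions of this sketch, not the patent's implementation:

```python
import math
from collections import Counter, defaultdict

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r)
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g)
                               for g in groups.values())

def build_tree(rows, attributes):
    """ID3 sketch: the attribute with the largest information gain becomes
    the node, with one branch per category value, recursing until pure."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:   # pure subset: leaf holds the evaluation result
        return labels[0]
    if not attributes:          # attributes exhausted: take the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    groups = defaultdict(list)
    for r in rows:
        groups[r[best]].append(r)
    return {"attribute": best,
            "branches": {v: build_tree(sub, remaining)
                         for v, sub in groups.items()}}
```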
Preferably, the method further comprises: when a new server is added to the server cluster, acquiring the processing performance data of the new server; processing that data to obtain at least one attribute and at least one category; and determining the evaluation result of the new server from the attributes, the categories, and the server evaluation model. The attribute information of the system's servers is classified with the decision tree, and an idle machine is selected from the classification result so that the task can be scheduled to it. If the classification result of the decision tree is uniform, that is, all machines in the cluster are in the same state, a random selection can be used to pick some of the machines as data-transfer targets.
The data scheduling according to the evaluation result of each of the at least one server includes: when target data needs to be distributed, selecting a target server that satisfies a first preset condition based on the evaluation results of the processing capacity of the servers in the cluster, the first preset condition being that the evaluation result of the target server's processing capacity is idle; and scheduling the target data to the target server.
A relatively idle machine is selected as the target machine for data distribution according to the execution result of the decision-tree algorithm.
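Classifying a server against the model and picking idle targets might look like this; it assumes the tree is stored as nested dicts of the form {"attribute": ..., "branches": {...}} with leaf strings as evaluation results, which is an illustrative representation only:

```python
import random

def classify(tree, server_attrs):
    """Walk the decision tree using the server's current attribute
    categories until a leaf (the evaluation result) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][server_attrs[tree["attribute"]]]
    return tree  # e.g. "idle", "normal", or "busy"

def pick_targets(tree, servers):
    """Prefer servers evaluated as idle; if every server lands in the same
    state, fall back to a random subset, as the text describes."""
    results = {name: classify(tree, attrs) for name, attrs in servers.items()}
    idle = [name for name, state in results.items() if state == "idle"]
    if idle:
        return idle
    if len(set(results.values())) == 1:  # all machines in the same state
        names = sorted(servers)
        return random.sample(names, max(1, len(names) // 2))
    return [name for name, state in results.items() if state == "normal"]
```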
Further, after the target data is scheduled to the target server, the method further includes: acquiring the time the target server takes to process the target data; and judging whether this processing time exceeds a preset threshold, and if so, triggering re-establishment of the server evaluation model. The task execution time predicted from the decision-tree result is compared with the actual execution time. If the actual execution time deviates from the estimate by too much, for example
|T_perform - T_estimate| > ΔT (where ΔT is an acceptable time difference set as the threshold),
the corresponding attribute record is poor sample data and is removed from the test set. Invalid data in the test set can thus be eliminated through this comparison, realizing dynamic adjustment of the test set.
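The threshold check and the pruning of bad samples can be sketched as follows (the function names and the "actual"/"estimate" record fields are assumptions of the sketch):

```python
def needs_rebuild(actual, estimate, delta_t):
    """True when |T_perform - T_estimate| > DeltaT, i.e. the model's time
    estimate has drifted past the acceptable threshold."""
    return abs(actual - estimate) > delta_t

def prune_samples(samples, delta_t):
    """Drop records whose measured duration deviates from the estimate by
    more than DeltaT, dynamically adjusting the sample set."""
    return [s for s in samples
            if abs(s["actual"] - s["estimate"]) <= delta_t]
```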
The reestablishing of the server evaluation model comprises the following steps:
deleting the attribute corresponding to the root node in the current server evaluation model;
and establishing a server evaluation model based on the historical processing performance data of the at least one server.
The server evaluation model may also be rebuilt periodically: after the system has run for a certain time (a duration that can be adjusted to the specific situation), the data is refreshed with the currently collected historical processing performance data and the above steps are repeated.
Traditional schemes set their scheduling and distribution strategies from historical experience values, without considering the hardware environment or the specific application scenario. Even the algorithms that account for connection latency do not adequately incorporate software and hardware environment factors into the scheduling decision.
According to the scheme provided by the embodiment, a server decision model can be established according to at least one attribute and at least one category of historical processing performance data corresponding to each server in the server cluster, then real-time calculation is carried out on the performance data of the machine so as to judge whether the machine is in an idle or busy state, and the data are distributed to the machine which is idle relatively based on the judgment result. Therefore, various attributes of the machine performance are fully considered, performance analysis is carried out based on the attributes, the accuracy of data scheduling is improved, and the timeliness of data processing is improved.
In addition, the embodiment also provides a method for reestablishing the server evaluation model, so that a flexible mechanism for dynamically adjusting the strategy can be ensured.
Example II,
The specific implementation steps of the ID3 decision tree algorithm are shown in fig. 2:
firstly, initializing parameters to obtain a data set D, an attribute set A and a category cj, and establishing a decision tree T;
judging whether the current data set D only has one category cj, if so, directly adding cj to leaf nodes of T to be used as decision nodes;
if not, judging whether the attribute set A is empty; if it is, taking the category cj with the highest proportion in the data set D as the leaf node;
if not, calculating the entropy of the data set D, and calculating the entropy of each attribute;
selecting an attribute with the largest information gain as an optimal classification attribute Ag;
judging whether the information gain of the optimal classification attribute Ag is smaller than a threshold value, and if so, taking cj with the highest ratio in the data set D as a leaf node;
if not, taking the Ag as a decision node Ag in a decision tree, and dividing a data set D by utilizing the Ag;
incrementing the loop counter, i.e. setting j = j + 1;
judging whether the j-th data subset is empty; if it is, incrementing the counter and checking the next subset;
if it is not, adding a branch Tj to the decision tree and again checking whether the subset contains only one category cj, until all attributes and categories in the data set have been traversed.
Example III,
The present embodiment provides a data scheduling system, as shown in fig. 3, including:
the data preprocessing unit 31 is configured to determine, based on historical processing performance data of at least one server, at least one attribute included in the historical processing performance data and at least one category corresponding to each attribute;
a model establishing unit 32, configured to establish a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit 33 is configured to evaluate at least one server in the server cluster based on the server evaluation model to obtain an evaluation result for each server in the at least one server;
and the scheduling unit 34 is configured to schedule data according to the evaluation result of each server of the at least one server.
The system provided in this embodiment may be applied to a management server in a server cluster, and specifically, the four modules may be all disposed in one management server, or may be disposed in different servers with management functions respectively.
The evaluation result in the server evaluation model, i.e. the leaf node, may be the idleness of the server; for example, a server may be evaluated as idle, busy, or normal.
Further, in this embodiment, the determining at least one attribute included in the historical processing performance data based on the historical processing performance data of at least one server and the at least one category corresponding to each attribute may be a preprocessing operation.
Wherein the historical processing performance data may further include: identification information of the server, the idleness of the server, and the like.
The at least one attribute may be CPU idleness, memory occupation, network congestion, disk read/write status, client connection count, and the like. The corresponding categories may include: CPU idle, busy, or normal; memory occupation high, normal, or low; network congestion severe, normal, or none; disk read/write normal or abnormal; client connection count large, small, or normal; and so on.
Specifically, assume a data set D with attributes A1, A2, …, Ak, labeled with n categories c1, c2, …, cn. For example, the attributes of the machines in the cluster (network I/O, disk I/O, CPU, memory, and so on) form the attribute set, and the degree of busyness serves as the category, e.g. idle, busy, or moderate.
The model establishing unit is used for calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively; determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute; and establishing at least one branch path formed by the at least one attribute and the at least one category and leaf nodes of each branch path formed by the evaluation result based on the sorting of the at least one attribute.
In this step a classifier is constructed from the preprocessed data set, using the ID3 decision-tree algorithm. The decision-tree algorithm partitions the training data recursively; each recursion selects the optimal classification attribute for splitting the current data set, and this selection is realized through an impurity function. The ID3 algorithm uses information gain as its impurity function. Information gain is based on entropy from information theory: entropy measures the uncertainty of an object, and the larger the entropy, the higher the uncertainty.
The entropy of the historical processing performance data is calculated from the categories corresponding to its attributes, and the entropy corresponding to each attribute is calculated, as follows. Assuming a data set D having attributes A1, A2, …, Ak and containing n categories c1, c2, …, cn, the entropy of D in its original state can be expressed as:
E(D) = -\sum_{j=1}^{n} p(c_j)\log_2 p(c_j)
where p () is used to represent the probability of computation, that is to say p (c)j) To calculate the probability that the jth class appears in the entire training tuple, the number of elements belonging to this class can be divided by the total number of training tuple elements as an estimate. The actual meaning of entropy represents the average amount of information needed for class labels of tuples in D.
The information gain of each attribute is determined based on the entropy of each attribute and the entropy of the historical processing performance data, and the at least one attribute is sequenced based on the information gain of each attribute to obtain at least one sequenced attribute; the calculation method for calculating the information gain of each attribute may be as follows:
If D can be partitioned into v disjoint subsets D1, D2, …, Dv using attribute Ai, the entropy of data set D partitioned by Ai is:
E_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, E(D_j)
the information gain of attribute Ai is:
\mathrm{Gain}(A_i) = E(D) - E_{A_i}(D)
The larger the information gain caused by separating the data on a certain attribute, the better that attribute separates the data.
The at least one attribute is sorted based on the information gain of each attribute to obtain at least one sorted attribute, and the information gain of each attribute may be sorted from large to small.
The establishing at least one branch path composed of at least one attribute and at least one category and the leaf node of each branch path composed of the evaluation result based on the ranking of the at least one attribute may be:
selecting an attribute with the largest information gain as an optimal classification attribute, namely, taking the optimal distribution attribute as a root node;
and then sequentially sequencing, taking other attributes as nodes corresponding to different classes in different branch paths, and finally taking the evaluation result as a leaf node of each branch path.
The condition of the recursion termination is that each data subset obtained by final separation is as pure as possible, the classifier obtained by the algorithm is in the form of a tree, each branch path represents a certain possible attribute value, and each leaf node corresponds to a category.
Preferably, the server evaluation unit is configured to: when a new server is added to the server cluster, acquire the processing performance data of the new server; process that data to obtain at least one attribute and at least one category; and determine the evaluation result of the new server from the attributes, the categories, and the server evaluation model. The attribute information of the system's servers is classified with the decision tree, and an idle machine is selected from the classification result so that the task can be scheduled to it. If the classification result of the decision tree is uniform, that is, all machines in the cluster are in the same state, a random selection can be used to pick some of the machines as data-transfer targets.
The scheduling unit is configured to, when target data needs to be distributed, select a target server meeting a first preset condition based on the evaluation result of the processing capacity corresponding to the at least one server in the server cluster, wherein the first preset condition represents that the evaluation result of the processing capacity of the target server is idle; and dispatch the target data to the target server.
That is, a machine in the idle state is selected as the target machine for data distribution according to the execution result of the decision tree algorithm.
Further, the scheduling unit is configured to acquire the processing duration for the target server to process the target data, judge whether the processing duration is greater than a preset threshold value, and if so, control the model establishing unit to reestablish the server evaluation model. The task execution time is predicted according to the decision tree execution result and compared with the actual task execution time. If the actual execution time of the task deviates from the estimate by more than an acceptable amount, for example:
|T_actual - T_estimate| > ΔT (where ΔT is an acceptable time difference set as the threshold value),
then the corresponding attribute information is regarded as poor sample data and is removed from the test set. Invalid data in the test set can be removed through this test comparison, realizing dynamic adjustment of the test set.
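The threshold comparison of actual versus estimated execution time, and the removal of poor sample data it drives, can be sketched as follows. The sample structure, field names, and the ΔT value are illustrative assumptions.

```python
# Acceptable difference between actual and estimated execution time
# (seconds); the concrete value is an illustrative assumption.
DELTA_T = 5.0

def prune_bad_samples(samples):
    """Keep only samples whose actual execution time matches the
    estimate within the threshold: |t_actual - t_estimate| <= DELTA_T.
    Each sample is a dict carrying the timings and attribute data."""
    return [s for s in samples
            if abs(s["t_actual"] - s["t_estimate"]) <= DELTA_T]

samples = [
    {"t_actual": 12.0, "t_estimate": 10.0, "cpu_load": "low"},   # kept: diff 2
    {"t_actual": 30.0, "t_estimate": 10.0, "cpu_load": "high"},  # removed: diff 20 > 5
]
print(len(prune_bad_samples(samples)))  # 1
```

Running this filter over the test set after each scheduling round realizes the dynamic adjustment described above.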
The model establishing unit is used for deleting the attribute corresponding to the root node in the current server evaluation model; and establishing a server evaluation model based on the historical processing performance data of the at least one server.
The server evaluation model may also be reestablished by periodic adjustment: after the system has run for a certain period of time (the duration may be adjusted according to specific conditions), the data information is updated with the currently acquired historical processing performance data, and the above steps are repeated.
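The periodic adjustment can be sketched as a simple loop; the function names, the default interval, and the `rounds` parameter (added so the loop can terminate in a test) are illustrative assumptions.

```python
import time

def periodic_rebuild(build_model, fetch_history, interval_s=3600.0, rounds=None):
    """Periodically rebuild the server evaluation model from freshly
    collected historical processing performance data. `build_model` and
    `fetch_history` stand in for the model-establishing steps above;
    `rounds` bounds the loop (None means run indefinitely)."""
    model = build_model(fetch_history())
    n = 0
    while rounds is None or n < rounds:
        time.sleep(interval_s)
        # Update the data information and repeat the above steps.
        model = build_model(fetch_history())
        n += 1
    return model

m = periodic_rebuild(lambda h: sum(h), lambda: [1, 2, 3],
                     interval_s=0.0, rounds=1)
print(m)  # 6
```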
Traditional schemes set the scheduling distribution strategy based on historical empirical values and lack consideration of the hardware environment and the specific application scenario. Some algorithms consider only connection latency and do not adequately incorporate software and hardware environmental factors into the data scheduling factors.
According to the scheme provided by this embodiment, a server evaluation model can be established from the at least one attribute and the at least one category of the historical processing performance data corresponding to each server in the server cluster; real-time calculation is then performed on the performance data of each machine to judge whether the machine is in an idle or busy state, and the data is distributed to relatively idle machines based on the judgment result. In this way, the various attributes of machine performance are fully considered, performance analysis is carried out based on these attributes, and the accuracy of data scheduling is thereby improved.
In addition, this embodiment also provides a method for updating the server evaluation model, thereby ensuring a flexible mechanism for dynamically adjusting the strategy.
The integrated module according to the embodiment of the present invention may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as an independent product. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a base station, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method for scheduling data, the method comprising:
determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively;
determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute;
selecting an attribute with the largest information gain as a root node, sequentially sequencing at least one attribute, taking other attributes as nodes corresponding to different categories in different branch paths, and finally taking an evaluation result as a leaf node of each branch path to obtain a server evaluation model; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
based on the server evaluation model, at least one server in a server cluster is evaluated to obtain an evaluation result aiming at each server in the at least one server;
and scheduling data according to the evaluation result of each server in the at least one server.
2. The method of claim 1, further comprising:
when a new server is added in the server cluster, acquiring processing performance data of the new server;
processing the processing performance data of the new server to obtain at least one attribute and at least one type;
and determining the evaluation result of the new server by using at least one attribute, at least one type and the server evaluation model.
3. The method of claim 1, wherein the scheduling data according to the evaluation result of each of the at least one server comprises:
when target data are determined to be distributed, selecting a target server meeting a first preset condition based on an evaluation result of processing capacity corresponding to at least one server in a server cluster, wherein the evaluation result of the processing capacity of the target server represented by the first preset condition is idle;
and dispatching the target data to the target server.
4. The method of claim 3, wherein after the scheduling the target data to the target server, the method further comprises:
acquiring the processing time of the target server for processing the target data;
and judging whether the processing time length is greater than a preset threshold value, and if so, controlling to reestablish the server evaluation model.
5. The method of claim 4, wherein the re-establishing the server evaluation model comprises:
deleting the attribute corresponding to the root node in the current server evaluation model;
and establishing a server evaluation model based on the historical processing performance data of the at least one server.
6. A data scheduling system, comprising:
the data preprocessing unit is used for determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
the model establishing unit is used for calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data and calculating the entropy corresponding to each attribute respectively; determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute; selecting an attribute with the largest information gain as a root node, sequentially sequencing at least one attribute, taking other attributes as nodes corresponding to different categories in different branch paths, and finally taking an evaluation result as a leaf node of each branch path to obtain a server evaluation model; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit is used for evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and the scheduling unit is used for scheduling data according to the evaluation result of each server in the at least one server.
7. The system of claim 6,
the server evaluation unit is used for acquiring processing performance data of a new server when the new server is added in the server cluster; processing the processing performance data of the new server to obtain at least one attribute and at least one type; and determining the evaluation result of the new server by using at least one attribute, at least one type and the server evaluation model.
8. The system of claim 6,
the scheduling unit is used for selecting and obtaining a target server meeting a first preset condition based on an evaluation result of the processing capacity corresponding to at least one server in the server cluster when target data needs to be distributed, wherein the first preset condition represents that the evaluation result of the processing capacity of the target server is idle; and dispatching the target data to the target server.
9. The system of claim 8,
the scheduling unit is used for acquiring the processing time of the target server for processing the target data; and judging whether the processing time length is greater than a preset threshold value, and if so, controlling the model establishing unit to reestablish the server evaluation model.
10. The system according to claim 9, wherein the model building unit is configured to delete an attribute corresponding to a root node in a current server evaluation model; and establishing a server evaluation model based on the historical processing performance data of the at least one server.
CN201510937896.0A 2015-12-15 2015-12-15 Data scheduling method and system Active CN106888237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510937896.0A CN106888237B (en) 2015-12-15 2015-12-15 Data scheduling method and system

Publications (2)

Publication Number Publication Date
CN106888237A CN106888237A (en) 2017-06-23
CN106888237B true CN106888237B (en) 2020-01-07

Family

ID=59174721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510937896.0A Active CN106888237B (en) 2015-12-15 2015-12-15 Data scheduling method and system

Country Status (1)

Country Link
CN (1) CN106888237B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766269A (en) * 2019-09-02 2020-02-07 平安科技(深圳)有限公司 Task allocation method and device, readable storage medium and terminal equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102195890A (en) * 2011-06-03 2011-09-21 北京大学 Internet application dispatching method based on cloud computing
CN102214213A (en) * 2011-05-31 2011-10-12 中国科学院计算技术研究所 Method and system for classifying data by adopting decision tree
CN103279392A (en) * 2013-06-14 2013-09-04 浙江大学 Method for classifying operated load in virtual machine under cloud computing environment
CN104346214A (en) * 2013-07-30 2015-02-11 中国银联股份有限公司 Device and method for managing asynchronous tasks in distributed environments
CN104618406A (en) * 2013-11-05 2015-05-13 镇江华扬信息科技有限公司 Load balancing algorithm based on naive Bayesian classification
CN104978236A (en) * 2015-07-07 2015-10-14 四川大学 HDFS load source and sink node selection method based on multiple measurement indexes

Also Published As

Publication number Publication date
CN106888237A (en) 2017-06-23

Similar Documents

Publication Publication Date Title
CN109617826B (en) Storm dynamic load balancing method based on cuckoo search
CN110417903B (en) Information processing method and system based on cloud computing
US9513806B2 (en) Dimension based load balancing
WO2019134274A1 (en) Interest exploration method, storage medium, electronic device and system
CN107220108B (en) Method and system for realizing load balance of cloud data center
CN110809060B (en) Monitoring system and monitoring method for application server cluster
WO2021169294A1 (en) Application recognition model updating method and apparatus, and storage medium
CN113347267B (en) MEC server deployment method in mobile edge cloud computing network
CN114513470B (en) Network flow control method, device, equipment and computer readable storage medium
CN112835698A (en) Heterogeneous cluster-based dynamic load balancing method for request classification processing
CN107566535B (en) Self-adaptive load balancing method based on concurrent access timing sequence rule of Web map service
CN114356545A (en) Task unloading method for privacy protection and energy consumption optimization
CN105872082B (en) Fine granularity resource response system based on container cluster load-balancing algorithm
CN112559078B (en) Method and system for hierarchically unloading tasks of mobile edge computing server
CN114500578A (en) Load balancing scheduling method and device for distributed storage system and storage medium
KR20230032754A (en) Apparatus and Method for Task Offloading of MEC-Based Wireless Network
CN106888237B (en) Data scheduling method and system
CN111124439B (en) Intelligent dynamic unloading algorithm with cloud edge cooperation
CN113055423B (en) Policy pushing method, policy execution method, device, equipment and medium
US11374869B2 (en) Managing bandwidth based on user behavior
CN113596146B (en) Resource scheduling method and device based on big data
CN115842828A (en) Gateway load balancing control method, device, equipment and readable storage medium
CN113298115A (en) User grouping method, device, equipment and storage medium based on clustering
CN112637904B (en) Load balancing method and device and computing equipment
CN110134575B (en) Method and device for calculating service capacity of server cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant