CN106888237B - Data scheduling method and system - Google Patents

Data scheduling method and system

Info

Publication number
CN106888237B
CN106888237B (application CN201510937896.0A)
Authority
CN
China
Prior art keywords
server
attribute
data
performance data
processing performance
Prior art date
Legal status
Active
Application number
CN201510937896.0A
Other languages
Chinese (zh)
Other versions
CN106888237A (en)
Inventor
张宝海
鲍媛媛
Current Assignee
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority claimed from CN201510937896.0A
Publication of CN106888237A
Application granted
Publication of CN106888237B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/101 Server selection for load balancing based on network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1029 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer

Abstract

The invention discloses a data scheduling method and system. The method comprises: determining, based on historical processing performance data of at least one server, at least one attribute included in that data and at least one category corresponding to each attribute; establishing a server evaluation model based on the historical processing performance data of the at least one server, the model comprising at least one branch path composed of the at least one attribute and the at least one category, with the leaf node of each branch path being an evaluation result; evaluating at least one server in the server cluster with the server evaluation model to obtain an evaluation result for each of the at least one server; and scheduling data according to the evaluation result of each of the at least one server.

Description

Data scheduling method and system
Technical Field
The present invention relates to a server cluster management technology in the field of communications, and in particular, to a data scheduling method and system.
Background
With the arrival of the big-data era, big-data development has become a national strategy, and with the continuous development of hardware, the software and hardware performance of data centers keeps improving. Network bandwidth bottlenecks are continually being broken through, and ten-gigabit networks have become standard in data centers. The storage and computing power of servers also keeps being upgraded and optimized in step with Moore's law. However, most scheduling strategies based on traditional data distribution cannot meet the demands of large data volumes and real-time data transmission in the current big-data environment.
These scheduling strategies solve the connection and scheduling problems of data transmission to a certain extent, but they do not keep pace with the rapid growth of transmission demands and hardware configurations in today's big-data environment. Typical examples:
    • Round Robin: a weaker server still receives a request on the next round even when it can no longer process the current one, which may overload it.
    • Weighted Round Robin: the administrator simply assigns each server a weight according to its processing power.
    • Least Connection: incoming requests are assigned according to the number of connections currently open on each server, i.e. the server with the fewest active connections automatically receives the next request. If all servers are otherwise equal, however, the first server is essentially always preferred.
    • Least Connection Slow Start Time: requests are handled according to a transition time configured by the administrator.
    • Weighted Least Connection: active-connection counts are combined with weights that the administrator customizes per server.
    • Fixed Weighted: the weight of each real server must be configured according to server priority.
    • Weighted Response: assumes that server heartbeat response speed reflects machine speed, which is not always true.
    • Source IP Hash: the same host always maps to the same server, which can unbalance server load.
As can be seen, none of the scheduling approaches provided in the prior art can guarantee performance analysis and scheduling according to the attributes of the servers, and therefore they cannot guarantee timely processing during data scheduling.
Disclosure of Invention
In view of the above, the present invention provides a data scheduling method and system, which can at least solve the above problems in the prior art.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data scheduling method, which comprises the following steps:
determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
establishing a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and scheduling data according to the evaluation result of each server in the at least one server.
An embodiment of the present invention provides a data scheduling system, including:
the data preprocessing unit is used for determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
the model establishing unit is used for establishing a server evaluation model based on the historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit is used for evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and the scheduling unit is used for scheduling data according to the evaluation result of each server in the at least one server.
Embodiments of the invention provide a data scheduling method and system. A server decision model is established from at least one attribute, and at least one category per attribute, of the historical processing performance data of each server in a server cluster; the performance data of each machine is then evaluated in real time to judge whether the machine is idle or busy, and data is distributed to the more idle machines based on the result. The various attributes of machine performance are thus fully considered and performance is analyzed on that basis, improving the accuracy of data scheduling and the timeliness of data processing.
Drawings
FIG. 1 is a flow chart of a data scheduling method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a model building method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a data scheduling system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The first embodiment,
The present embodiment provides a data scheduling method, as shown in fig. 1, the method includes:
step 101: determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
step 102: establishing a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
step 103: evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
step 104: and scheduling data according to the evaluation result of each server in the at least one server.
The method provided by the embodiment can be applied to a server cluster.
The evaluation result in the server evaluation model, i.e. the leaf node, may be the idleness of the server; for example, a server may be evaluated as idle, busy, or normal.
Further, in this embodiment, the determining at least one attribute included in the historical processing performance data based on the historical processing performance data of at least one server and the at least one category corresponding to each attribute may be a preprocessing operation.
Wherein the historical processing performance data may further include: identification information of the server, the idleness of the server, and the like.
The at least one attribute may be CPU idleness, memory occupation, network congestion, disk read/write status, client connection count, and the like. The corresponding categories may include: CPU idle, busy, or normal; memory occupation high, normal, or low; network congestion severe, normal, or none; disk read/write normal or abnormal; client connection count large, small, or normal; and so on.
Specifically, assume a data set D with attributes A1, A2, …, Ak, labeled with n categories c1, c2, …, cn. For example, the attributes of the machines in the cluster (network I/O, disk I/O, CPU, memory, and so on) form the attribute set, and the degree of busyness serves as the category, e.g. idle, busy, or moderate.
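For illustration only, such a training set could be represented as follows; the attribute names, category values, and record layout are assumptions of this sketch, not prescribed by the patent:

```python
# Hypothetical layout for the historical processing performance data set D.
# Each record holds one server sample's attribute categories plus its label
# (the overall evaluation result, used as the class c_j).

ATTRIBUTES = ["cpu", "memory", "network_io", "disk_io", "connections"]

dataset = [
    {"cpu": "idle",   "memory": "low",    "network_io": "none",
     "disk_io": "normal",   "connections": "few",    "label": "idle"},
    {"cpu": "busy",   "memory": "high",   "network_io": "severe",
     "disk_io": "abnormal", "connections": "many",   "label": "busy"},
    {"cpu": "normal", "memory": "normal", "network_io": "normal",
     "disk_io": "normal",   "connections": "normal", "label": "normal"},
]

# The n categories c1..cn are the distinct labels observed in the data set.
categories = {row["label"] for row in dataset}
```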
Establishing a server evaluation model based on the historical processing performance data of the at least one server, including:
calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively;
determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute;
and establishing at least one branch path formed by the at least one attribute and the at least one category and leaf nodes of each branch path formed by the evaluation result based on the sorting of the at least one attribute.
In this step a classifier is constructed from the preprocessed data set, using the ID3 decision-tree algorithm. The decision-tree algorithm partitions the training data recursively; each recursion selects the optimal classification attribute for splitting the current data set, and this selection is realized through an impurity function. The ID3 algorithm uses information gain as its impurity function. Information gain is based on entropy from information theory: entropy measures the uncertainty of an object, and the larger the entropy, the higher the uncertainty.
The entropy of the historical processing performance data is calculated from the categories corresponding to its attributes, and the entropy corresponding to each attribute is calculated, as follows. Assuming a data set D having attributes A1, A2, …, Ak and containing n categories c1, c2, …, cn, the entropy of D in its original state can be expressed as:
E(D) = -\sum_{j=1}^{n} p(c_j)\log_2 p(c_j)
where p () is used to represent the calculated probability, that is to say p (cj) to calculate the probability that the jth class appears in the whole training tuple, the number of elements belonging to this class can be divided by the total number of elements of the training tuple as an estimate. The actual meaning of entropy represents the average amount of information needed for class labels of tuples in D.
The information gain of each attribute is determined based on the entropy of each attribute and the entropy of the historical processing performance data, and the at least one attribute is sequenced based on the information gain of each attribute to obtain at least one sequenced attribute; the calculation method for calculating the information gain of each attribute may be as follows:
If D can be partitioned into v disjoint subsets D1, D2, …, Dv using attribute Ai, the entropy of data set D partitioned by Ai is:
E_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, E(D_j)
the information gain of attribute Ai is:
\mathrm{Gain}(A_i) = E(D) - E_{A_i}(D)
the larger the information gain caused by data separation using a certain attribute is, the better the data separation effect of the attribute is.
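The gain computation follows directly from the two formulas above; a self-contained sketch (record layout and function names are assumptions):

```python
import math
from collections import Counter, defaultdict

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, attribute):
    """Gain(Ai) = E(D) - sum_j |Dj|/|D| * E(Dj), where the Dj are the
    subsets of D sharing one category value of the given attribute."""
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attribute]].append(r)
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - weighted
```

An attribute that perfectly predicts the label yields a gain equal to the full entropy of D; an attribute whose value is constant yields a gain of 0.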
The at least one attribute is sorted based on the information gain of each attribute to obtain at least one sorted attribute, and the information gain of each attribute may be sorted from large to small.
The establishing at least one branch path composed of at least one attribute and at least one category and the leaf node of each branch path composed of the evaluation result based on the ranking of the at least one attribute may be:
selecting the attribute with the largest information gain as the optimal classification attribute, i.e., taking it as the root node;
then, following the ranking, taking the other attributes as nodes for the different categories on different branch paths, and finally taking the evaluation result as the leaf node of each branch path.
The recursion terminates when each finally separated data subset is as pure as possible. The classifier produced by the algorithm has the form of a tree: each branch path represents one possible attribute value, and each leaf node corresponds to a category.
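Assembled, the recursion described above can be sketched as follows; the nested-dict tree representation and helper names are assumptions of this sketch, not the patent's implementation:

```python
import math
from collections import Counter, defaultdict

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r)
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g)
                               for g in groups.values())

def build_tree(rows, attributes):
    """ID3 sketch: the attribute with the largest information gain becomes
    the node, with one branch per category value, recursing until pure."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:   # pure subset: leaf holds the evaluation result
        return labels[0]
    if not attributes:          # attributes exhausted: take the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    groups = defaultdict(list)
    for r in rows:
        groups[r[best]].append(r)
    return {"attribute": best,
            "branches": {v: build_tree(sub, remaining)
                         for v, sub in groups.items()}}
```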
Preferably, the method further comprises: when a new server is added to the server cluster, acquiring the processing performance data of the new server; processing that data to obtain at least one attribute and at least one category; and determining the evaluation result of the new server from the attributes, the categories, and the server evaluation model. The attribute information of the system's servers is classified with the decision tree, and an idle machine is selected from the classification result so that the task can be scheduled to it. If the classification result of the decision tree is uniform, that is, all machines in the cluster are in the same state, a random selection can be used to pick some of the machines as data-transfer targets.
The data scheduling according to the evaluation result of each of the at least one server includes: when target data needs to be distributed, selecting a target server that satisfies a first preset condition based on the evaluation results of the processing capacity of the servers in the cluster, the first preset condition being that the evaluation result of the target server's processing capacity is idle; and scheduling the target data to the target server.
A relatively idle machine is selected as the target machine for data distribution according to the execution result of the decision-tree algorithm.
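Classifying a server against the model and picking idle targets might look like this; it assumes the tree is stored as nested dicts of the form {"attribute": ..., "branches": {...}} with leaf strings as evaluation results, which is an illustrative representation only:

```python
import random

def classify(tree, server_attrs):
    """Walk the decision tree using the server's current attribute
    categories until a leaf (the evaluation result) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][server_attrs[tree["attribute"]]]
    return tree  # e.g. "idle", "normal", or "busy"

def pick_targets(tree, servers):
    """Prefer servers evaluated as idle; if every server lands in the same
    state, fall back to a random subset, as the text describes."""
    results = {name: classify(tree, attrs) for name, attrs in servers.items()}
    idle = [name for name, state in results.items() if state == "idle"]
    if idle:
        return idle
    if len(set(results.values())) == 1:  # all machines in the same state
        names = sorted(servers)
        return random.sample(names, max(1, len(names) // 2))
    return [name for name, state in results.items() if state == "normal"]
```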
Further, after the target data is scheduled to the target server, the method further includes: acquiring the time the target server takes to process the target data; and judging whether this processing time exceeds a preset threshold, and if so, triggering re-establishment of the server evaluation model. The task execution time predicted from the decision-tree result is compared with the actual execution time. If the actual execution time deviates from the estimate by too much, for example
|T_perform - T_estimate| > ΔT (where ΔT is an acceptable time difference set as the threshold),
the corresponding attribute record is poor sample data and is removed from the test set. Invalid data in the test set can thus be eliminated through this comparison, realizing dynamic adjustment of the test set.
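The threshold check and the pruning of bad samples can be sketched as follows (the function names and the "actual"/"estimate" record fields are assumptions of the sketch):

```python
def needs_rebuild(actual, estimate, delta_t):
    """True when |T_perform - T_estimate| > DeltaT, i.e. the model's time
    estimate has drifted past the acceptable threshold."""
    return abs(actual - estimate) > delta_t

def prune_samples(samples, delta_t):
    """Drop records whose measured duration deviates from the estimate by
    more than DeltaT, dynamically adjusting the sample set."""
    return [s for s in samples
            if abs(s["actual"] - s["estimate"]) <= delta_t]
```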
The reestablishing of the server evaluation model comprises the following steps:
deleting the attribute corresponding to the root node in the current server evaluation model;
and establishing a server evaluation model based on the historical processing performance data of the at least one server.
The server evaluation model may also be rebuilt periodically: after the system has run for a certain time (a duration that can be adjusted to the specific situation), the data is refreshed with the currently collected historical processing performance data and the above steps are repeated.
Traditional schemes set their scheduling and distribution strategies from historical experience values, without considering the hardware environment or the specific application scenario. Even the algorithms that account for connection latency do not adequately incorporate software and hardware environment factors into the scheduling decision.
According to the scheme provided by the embodiment, a server decision model can be established according to at least one attribute and at least one category of historical processing performance data corresponding to each server in the server cluster, then real-time calculation is carried out on the performance data of the machine so as to judge whether the machine is in an idle or busy state, and the data are distributed to the machine which is idle relatively based on the judgment result. Therefore, various attributes of the machine performance are fully considered, performance analysis is carried out based on the attributes, the accuracy of data scheduling is improved, and the timeliness of data processing is improved.
In addition, the embodiment also provides a method for reestablishing the server evaluation model, so that a flexible mechanism for dynamically adjusting the strategy can be ensured.
Example II,
The specific implementation steps of the ID3 decision tree algorithm are shown in fig. 2:
firstly, initializing parameters to obtain a data set D, an attribute set A and a category cj, and establishing a decision tree T;
judging whether the current data set D only has one category cj, if so, directly adding cj to leaf nodes of T to be used as decision nodes;
if not, judging whether the attribute set A is empty; if it is, taking the category cj with the highest proportion in the data set D as the leaf node;
if not, calculating the entropy of the data set D, and calculating the entropy of each attribute;
selecting an attribute with the largest information gain as an optimal classification attribute Ag;
judging whether the information gain of the optimal classification attribute Ag is smaller than a threshold value, and if so, taking cj with the highest ratio in the data set D as a leaf node;
if not, taking the Ag as a decision node Ag in a decision tree, and dividing a data set D by utilizing the Ag;
incrementing the loop counter, i.e. setting j = j + 1;
judging whether the j-th data subset is empty; if it is, incrementing the counter and checking the next subset;
if it is not, adding a branch Tj to the decision tree and again checking whether the subset contains only one category cj, until all attributes and categories in the data set have been traversed.
Example III,
The present embodiment provides a data scheduling system, as shown in fig. 3, including:
the data preprocessing unit 31 is configured to determine, based on historical processing performance data of at least one server, at least one attribute included in the historical processing performance data and at least one category corresponding to each attribute;
a model establishing unit 32, configured to establish a server evaluation model based on historical processing performance data of the at least one server; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit 33 is configured to evaluate at least one server in the server cluster based on the server evaluation model to obtain an evaluation result for each server in the at least one server;
and the scheduling unit 34 is configured to schedule data according to the evaluation result of each server of the at least one server.
The system provided in this embodiment may be applied to a management server in a server cluster, and specifically, the four modules may be all disposed in one management server, or may be disposed in different servers with management functions respectively.
The evaluation result in the server evaluation model, i.e. the leaf node, may be the idleness of the server; for example, a server may be evaluated as idle, busy, or normal.
Further, in this embodiment, the determining at least one attribute included in the historical processing performance data based on the historical processing performance data of at least one server and the at least one category corresponding to each attribute may be a preprocessing operation.
Wherein the historical processing performance data may further include: identification information of the server, the idleness of the server, and the like.
The at least one attribute may be CPU idleness, memory occupation, network congestion, disk read/write status, client connection count, and the like. The corresponding categories may include: CPU idle, busy, or normal; memory occupation high, normal, or low; network congestion severe, normal, or none; disk read/write normal or abnormal; client connection count large, small, or normal; and so on.
Specifically, assume a data set D with attributes A1, A2, …, Ak, labeled with n categories c1, c2, …, cn. For example, the attributes of the machines in the cluster (network I/O, disk I/O, CPU, memory, and so on) form the attribute set, and the degree of busyness serves as the category, e.g. idle, busy, or moderate.
The model establishing unit is used for calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively; determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute; and establishing at least one branch path formed by the at least one attribute and the at least one category and leaf nodes of each branch path formed by the evaluation result based on the sorting of the at least one attribute.
In this step a classifier is constructed from the preprocessed data set, using the ID3 decision-tree algorithm. The decision-tree algorithm partitions the training data recursively; each recursion selects the optimal classification attribute for splitting the current data set, and this selection is realized through an impurity function. The ID3 algorithm uses information gain as its impurity function. Information gain is based on entropy from information theory: entropy measures the uncertainty of an object, and the larger the entropy, the higher the uncertainty.
The entropy of the historical processing performance data is calculated from the categories corresponding to its attributes, and the entropy corresponding to each attribute is calculated, as follows. Assuming a data set D having attributes A1, A2, …, Ak and containing n categories c1, c2, …, cn, the entropy of D in its original state can be expressed as:
E(D) = -\sum_{j=1}^{n} p(c_j)\log_2 p(c_j)
where p () is used to represent the probability of computation, that is to say p (c)j) To calculate the probability that the jth class appears in the entire training tuple, the number of elements belonging to this class can be divided by the total number of training tuple elements as an estimate. The actual meaning of entropy represents the average amount of information needed for class labels of tuples in D.
The information gain of each attribute is determined based on the entropy of each attribute and the entropy of the historical processing performance data, and the at least one attribute is sequenced based on the information gain of each attribute to obtain at least one sequenced attribute; the calculation method for calculating the information gain of each attribute may be as follows:
If D can be partitioned into v disjoint subsets D1, D2, …, Dv using attribute Ai, the entropy of data set D partitioned by Ai is:
E_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, E(D_j)
the information gain of attribute Ai is:
\mathrm{Gain}(A_i) = E(D) - E_{A_i}(D)
The larger the information gain caused by separating the data on a certain attribute, the better that attribute separates the data.
The at least one attribute is sorted based on the information gain of each attribute to obtain at least one sorted attribute, and the information gain of each attribute may be sorted from large to small.
The establishing at least one branch path composed of at least one attribute and at least one category and the leaf node of each branch path composed of the evaluation result based on the ranking of the at least one attribute may be:
selecting an attribute with the largest information gain as an optimal classification attribute, namely, taking the optimal distribution attribute as a root node;
and then sequentially sequencing, taking other attributes as nodes corresponding to different classes in different branch paths, and finally taking the evaluation result as a leaf node of each branch path.
The condition of the recursion termination is that each data subset obtained by final separation is as pure as possible, the classifier obtained by the algorithm is in the form of a tree, each branch path represents a certain possible attribute value, and each leaf node corresponds to a category.
Preferably, the server evaluation unit is configured to: when a new server is added to the server cluster, acquire the processing performance data of the new server; process that data to obtain at least one attribute and at least one category; and determine the evaluation result of the new server from the attributes, the categories, and the server evaluation model. The attribute information of the system's servers is classified with the decision tree, and an idle machine is selected from the classification result so that the task can be scheduled to it. If the classification result of the decision tree is uniform, that is, all machines in the cluster are in the same state, a random selection can be used to pick some of the machines as data-transfer targets.
The scheduling unit is configured to, when target data needs to be distributed, select a target server meeting a first preset condition based on the evaluation result of the processing capacity corresponding to the at least one server in the server cluster, wherein the first preset condition represents that the evaluation result of the processing capacity of the target server is idle; and dispatch the target data to the target server.
That is, a machine in the idle state is selected as the target machine for data distribution according to the execution result of the decision tree algorithm.
Further, the scheduling unit is configured to acquire the processing duration for the target server to process the target data, judge whether the processing duration is greater than a preset threshold value, and if so, control the model establishing unit to reestablish the server evaluation model. The task execution time is predicted according to the decision tree execution result and compared with the actual task execution time. If the actual execution time of the task deviates from the estimate by more than an acceptable amount, for example:
|T_actual - T_estimate| > ΔT (where ΔT is an acceptable time difference set as the threshold value),
then the corresponding attribute information is regarded as poor sample data and is removed from the test set. Invalid data in the test set can be removed through this test comparison, realizing dynamic adjustment of the test set.
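The threshold comparison of actual versus estimated execution time, and the removal of poor sample data it drives, can be sketched as follows. The sample structure, field names, and the ΔT value are illustrative assumptions.

```python
# Acceptable difference between actual and estimated execution time
# (seconds); the concrete value is an illustrative assumption.
DELTA_T = 5.0

def prune_bad_samples(samples):
    """Keep only samples whose actual execution time matches the
    estimate within the threshold: |t_actual - t_estimate| <= DELTA_T.
    Each sample is a dict carrying the timings and attribute data."""
    return [s for s in samples
            if abs(s["t_actual"] - s["t_estimate"]) <= DELTA_T]

samples = [
    {"t_actual": 12.0, "t_estimate": 10.0, "cpu_load": "low"},   # kept: diff 2
    {"t_actual": 30.0, "t_estimate": 10.0, "cpu_load": "high"},  # removed: diff 20 > 5
]
print(len(prune_bad_samples(samples)))  # 1
```

Running this filter over the test set after each scheduling round realizes the dynamic adjustment described above.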
The model establishing unit is used for deleting the attribute corresponding to the root node in the current server evaluation model; and establishing a server evaluation model based on the historical processing performance data of the at least one server.
The server evaluation model may also be reestablished by periodic adjustment: after the system has run for a certain period of time (the duration may be adjusted according to specific conditions), the data information is updated with the currently acquired historical processing performance data, and the above steps are repeated.
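The periodic adjustment can be sketched as a simple loop; the function names, the default interval, and the `rounds` parameter (added so the loop can terminate in a test) are illustrative assumptions.

```python
import time

def periodic_rebuild(build_model, fetch_history, interval_s=3600.0, rounds=None):
    """Periodically rebuild the server evaluation model from freshly
    collected historical processing performance data. `build_model` and
    `fetch_history` stand in for the model-establishing steps above;
    `rounds` bounds the loop (None means run indefinitely)."""
    model = build_model(fetch_history())
    n = 0
    while rounds is None or n < rounds:
        time.sleep(interval_s)
        # Update the data information and repeat the above steps.
        model = build_model(fetch_history())
        n += 1
    return model

m = periodic_rebuild(lambda h: sum(h), lambda: [1, 2, 3],
                     interval_s=0.0, rounds=1)
print(m)  # 6
```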
Traditional schemes set the scheduling distribution strategy based on historical empirical values and lack consideration of the hardware environment and the specific application scenario. Some algorithms consider only connection latency and do not adequately incorporate software and hardware environmental factors into the data scheduling factors.
According to the scheme provided by this embodiment, a server evaluation model can be established from the at least one attribute and the at least one category of the historical processing performance data corresponding to each server in the server cluster; real-time calculation is then performed on the performance data of each machine to judge whether the machine is in an idle or busy state, and the data is distributed to relatively idle machines based on the judgment result. In this way, the various attributes of machine performance are fully considered, performance analysis is carried out based on these attributes, and the accuracy of data scheduling is thereby improved.
In addition, this embodiment also provides a method for updating the server evaluation model, thereby ensuring a flexible mechanism for dynamically adjusting the strategy.
The integrated module according to the embodiment of the present invention may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as an independent product. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a base station, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method for scheduling data, the method comprising:
determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data, and calculating the entropy corresponding to each attribute respectively;
determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute;
selecting an attribute with the largest information gain as a root node, sequentially sequencing at least one attribute, taking other attributes as nodes corresponding to different categories in different branch paths, and finally taking an evaluation result as a leaf node of each branch path to obtain a server evaluation model; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
based on the server evaluation model, at least one server in a server cluster is evaluated to obtain an evaluation result aiming at each server in the at least one server;
and scheduling data according to the evaluation result of each server in the at least one server.
2. The method of claim 1, further comprising:
when a new server is added in the server cluster, acquiring processing performance data of the new server;
processing the processing performance data of the new server to obtain at least one attribute and at least one type;
and determining the evaluation result of the new server by using at least one attribute, at least one type and the server evaluation model.
3. The method of claim 1, wherein the scheduling data according to the evaluation result of each of the at least one server comprises:
when target data are determined to be distributed, selecting a target server meeting a first preset condition based on an evaluation result of processing capacity corresponding to at least one server in a server cluster, wherein the evaluation result of the processing capacity of the target server represented by the first preset condition is idle;
and dispatching the target data to the target server.
4. The method of claim 3, wherein after the scheduling the target data to the target server, the method further comprises:
acquiring the processing time of the target server for processing the target data;
and judging whether the processing time length is greater than a preset threshold value, and if so, controlling to reestablish the server evaluation model.
5. The method of claim 4, wherein the re-establishing the server evaluation model comprises:
deleting the attribute corresponding to the root node in the current server evaluation model;
and establishing a server evaluation model based on the historical processing performance data of the at least one server.
6. A data scheduling system, comprising:
the data preprocessing unit is used for determining at least one attribute included in historical processing performance data and at least one category corresponding to each attribute based on the historical processing performance data of at least one server;
the model establishing unit is used for calculating the entropy of the historical processing performance data based on the category corresponding to at least one attribute in the historical processing performance data and calculating the entropy corresponding to each attribute respectively; determining the information gain of each attribute based on the entropy of each attribute and the entropy of historical processing performance data, and sequencing the at least one attribute based on the information gain of each attribute to obtain at least one sequenced attribute; selecting an attribute with the largest information gain as a root node, sequentially sequencing at least one attribute, taking other attributes as nodes corresponding to different categories in different branch paths, and finally taking an evaluation result as a leaf node of each branch path to obtain a server evaluation model; wherein, the server evaluation model comprises: at least one branch path composed of at least one attribute and at least one category, and leaf nodes of each branch path composed of the evaluation result;
the server evaluation unit is used for evaluating at least one server in the server cluster based on the server evaluation model to obtain an evaluation result aiming at each server in the at least one server;
and the scheduling unit is used for scheduling data according to the evaluation result of each server in the at least one server.
7. The system of claim 6,
the server evaluation unit is used for acquiring processing performance data of a new server when the new server is added in the server cluster; processing the processing performance data of the new server to obtain at least one attribute and at least one type; and determining the evaluation result of the new server by using at least one attribute, at least one type and the server evaluation model.
8. The system of claim 6,
the scheduling unit is used for selecting and obtaining a target server meeting a first preset condition based on an evaluation result of the processing capacity corresponding to at least one server in the server cluster when target data needs to be distributed, wherein the first preset condition represents that the evaluation result of the processing capacity of the target server is idle; and dispatching the target data to the target server.
9. The system of claim 8,
the scheduling unit is used for acquiring the processing time of the target server for processing the target data; and judging whether the processing time length is greater than a preset threshold value, and if so, controlling the model establishing unit to reestablish the server evaluation model.
10. The system according to claim 9, wherein the model building unit is configured to delete an attribute corresponding to a root node in a current server evaluation model; and establishing a server evaluation model based on the historical processing performance data of the at least one server.
CN201510937896.0A 2015-12-15 2015-12-15 Data scheduling method and system Active CN106888237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510937896.0A CN106888237B (en) 2015-12-15 2015-12-15 Data scheduling method and system

Publications (2)

Publication Number Publication Date
CN106888237A CN106888237A (en) 2017-06-23
CN106888237B true CN106888237B (en) 2020-01-07

Family

ID=59174721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510937896.0A Active CN106888237B (en) 2015-12-15 2015-12-15 Data scheduling method and system

Country Status (1)

Country Link
CN (1) CN106888237B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766269A (en) * 2019-09-02 2020-02-07 平安科技(深圳)有限公司 Task allocation method and device, readable storage medium and terminal equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102195890A (en) * 2011-06-03 2011-09-21 北京大学 Internet application dispatching method based on cloud computing
CN102214213A (en) * 2011-05-31 2011-10-12 中国科学院计算技术研究所 Method and system for classifying data by adopting decision tree
CN103279392A (en) * 2013-06-14 2013-09-04 浙江大学 Method for classifying operated load in virtual machine under cloud computing environment
CN104346214A (en) * 2013-07-30 2015-02-11 中国银联股份有限公司 Device and method for managing asynchronous tasks in distributed environments
CN104618406A (en) * 2013-11-05 2015-05-13 镇江华扬信息科技有限公司 Load balancing algorithm based on naive Bayesian classification
CN104978236A (en) * 2015-07-07 2015-10-14 四川大学 HDFS load source and sink node selection method based on multiple measurement indexes

Also Published As

Publication number Publication date
CN106888237A (en) 2017-06-23

Similar Documents

Publication Publication Date Title
CN109617826B (en) Storm dynamic load balancing method based on cuckoo search
CN110417903B (en) Information processing method and system based on cloud computing
US9513806B2 (en) Dimension based load balancing
WO2019134274A1 (en) Interest exploration method, storage medium, electronic device and system
CN107220108B (en) Method and system for realizing load balance of cloud data center
CN110809060B (en) Monitoring system and monitoring method for application server cluster
WO2021169294A1 (en) Application recognition model updating method and apparatus, and storage medium
CN113347267B (en) MEC server deployment method in mobile edge cloud computing network
CN114513470B (en) Network flow control method, device, equipment and computer readable storage medium
CN112835698A (en) Heterogeneous cluster-based dynamic load balancing method for request classification processing
CN107566535B (en) Self-adaptive load balancing method based on concurrent access timing sequence rule of Web map service
CN114356545A (en) Task unloading method for privacy protection and energy consumption optimization
CN105872082B (en) Fine granularity resource response system based on container cluster load-balancing algorithm
CN112559078B (en) Method and system for hierarchically unloading tasks of mobile edge computing server
CN114500578A (en) Load balancing scheduling method and device for distributed storage system and storage medium
KR20230032754A (en) Apparatus and Method for Task Offloading of MEC-Based Wireless Network
CN106888237B (en) Data scheduling method and system
CN111124439B (en) Intelligent dynamic unloading algorithm with cloud edge cooperation
CN113055423B (en) Policy pushing method, policy execution method, device, equipment and medium
US11374869B2 (en) Managing bandwidth based on user behavior
CN113596146B (en) Resource scheduling method and device based on big data
CN115842828A (en) Gateway load balancing control method, device, equipment and readable storage medium
CN113298115A (en) User grouping method, device, equipment and storage medium based on clustering
CN112637904B (en) Load balancing method and device and computing equipment
CN110134575B (en) Method and device for calculating service capacity of server cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant