CN109165093B - System and method for flexibly distributing computing node cluster - Google Patents
- Publication number
- CN109165093B (application number CN201810857293.3A)
- Authority
- CN
- China
- Prior art keywords
- task
- computing node
- resources
- resource
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention relates to a system and a method for flexibly allocating a computing node cluster. It adopts a flexible computing node allocation mechanism: computing node resource allocation is predicted from the way historical tasks used computing node resources and from the resource requirements of the task, and allocation during the computing stage is controlled dynamically while those requirements are met. This improves job response speed and computing node resource utilization. The results of historical predictions are fed back into the next prediction, which balances resource configuration in a cloud computing environment and improves the overall efficiency of the system.
Description
Technical Field
The invention relates to the field of dynamic resource allocation management of cloud computing cluster servers, in particular to a system and a method for flexibly allocating a computing node cluster.
Background
As the computer field has developed, cloud computing has advanced particularly rapidly. Cloud computing provides users with convenient, fast and secure data storage and network services through distributed computing, parallel computing, virtualization, load balancing, and computer and network technologies.
The operations involved in deep learning are mostly vectorized matrix operations. As a graphics accelerator, the GPU provides a large number of operation cores for rendering, and these operation cores can also be used to accelerate vectorized matrix operations, so in recent years, deep learning has largely adopted the GPU for model training. As demand increases, more and more cloud platforms provide GPUs as a computing node resource to users.
However, because of the particular nature of computing node hardware, cloud computing node resources are usually provided to users exclusively, and the allocation is one-way and static, which easily leads to overloaded computing node resources and a poor user experience.
Under exclusive allocation it is difficult for each computing node resource to reach its full performance, and a fixed allocation mode cannot efficiently match the requirements of different tasks requested by different users. Moreover, when computing node resources are allocated only once, the initially allocated resources will not necessarily meet the user's computing requirements once the user actually submits the task and it starts to run. To solve these problems, a system and a method for flexibly allocating a computing node cluster are needed.
Disclosure of Invention
The invention aims to overcome these defects by providing a system and a method for flexibly allocating a computing node cluster.
The invention achieves this aim through the following technical scheme. A system for flexibly allocating a computing node cluster comprises a user module, a computing node management module, a computing node resource module and a storage server. The user module provides a user login port and an entry for user task request information. The computing node resource module holds modular computing node cluster resources for executing users' computing tasks. The storage server stores operation data and operation logs. The computing node management module comprises a verification module, a task resource estimation module, a computing node control module and a computing node state monitoring module. The verification module obtains user login information and task request information from the user module and, after verification, sends them to the task resource estimation module. The task resource estimation module receives the user login information and task request information sent by the verification module and estimates computing node resource usage from the task description and selected parameters submitted by the user. The computing node control module regulates and allocates the computing node resource module according to the estimate sent by the task resource estimation module, and receives and processes the computing node state early warning information sent directly by the computing node monitoring module. The computing node monitoring module stores a node information table; it collects state information of the computing node resources at regular intervals, generates a new node information table and monitors the computing node resources. The node information table comprises the computing node ID, CPU occupancy rate, memory utilization rate, disk utilization rate, IO utilization rate, network bandwidth and the like.
Preferably, the verification module obtains and stores user information from the user module and, when a user logs in or submits task request information, verifies whether the user identity and the task request are legal. If either is illegal, the request is rejected; if both are verified as legal, the user login and task request information are sent to the task resource estimation module.
Preferably, the task resource estimation module stores a computing node historical resource allocation information table covering each task resource allocation request, and derives the resources needed to run a task from this table. A new task has no execution history to refer to, so its estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
Preferably, the computing node monitoring module also stores a computing node state early warning information table, which holds the early warning values for the computing state. When the state information of a computing node reaches an early warning value, computing node state early warning information is sent directly to the computing node control module. The computing node control module receives this information, judges whether the task is abnormal and, if so, issues an abnormality prompt and actively ends the abnormal task; for a non-abnormal task, it analyzes the current resource occupation of each computing node executing the task and reallocates the computing node resources.
The invention also provides a method for flexibly distributing the computing node cluster, which comprises the following steps:
(1) verifying the validity of the user identity; if verification passes, accepting the task, otherwise ending the process directly;
(2) estimating the computing node resource allocation according to the task description in the task request information;
(3) allocating computing node resources according to the estimated value;
(4) after the task starts running, collecting state information of the computing node resources at regular intervals and judging whether the computing node resource utilization exceeds a preset early warning threshold; if it does, further judging whether the task running state is abnormal, and if it is not abnormal, dynamically reallocating the computing node resources;
(5) releasing the occupied computing node resources after the task is finished.
Preferably, in step (2), the estimate is made from a historical node information table built from the accumulated experience of previously run tasks. The computing node resource allocation is estimated as follows. Parameters related to resource allocation are extracted from the task description to form a task vector X, and the corresponding allocated computing node resources are taken as a resource vector Y. After a new task description is submitted, a configuration file containing the task description parameters is generated, and the parameters related to resource configuration are extracted to form a task vector X_new. A clustering algorithm is used to find the d historical resource records closest to the new task vector as samples. If one of the d historical samples is identical to the new task vector, the resources recorded for that historical sample are allocated directly to the new task request. If no identical historical record exists, a linear regression over the d historical samples is fitted to the new task vector to obtain the weight of each sample's task description parameter vector; the historically allocated computing resources are weighted by the fitted parameters and given a certain margin, which yields the computing resources required by the new task request. Each dimension of the task vector X represents one attribute of the task. Each dimension of the resource vector Y represents one aspect of the running state of the computing node executing the task, such as CPU utilization, memory utilization, disk utilization, I/O utilization and network utilization.
Preferably, if no estimate can be made from the historical node information table in step (2), the estimate is based on the resources actively requested by the user or on the maximum resources the system can provide.
Preferably, step (4) specifically comprises the following. State information of the computing node resources is collected at regular intervals, a new node information table is generated, and the computing node resources are monitored. The new node information is compared with the data pre-stored in the computing node state early warning information table of the computing node monitoring module to judge whether the utilization of any resource exceeds its early warning threshold. If a threshold is exceeded, it is further judged whether the task running state is abnormal. If the task is judged abnormal, the abnormal task is actively ended, the occupied resources are released, and a prompt reports that the task failed because of insufficient resources; the user may then choose whether to re-estimate the required resources and schedule a suitable node for a retry. If the state is not abnormal, the computing node state early warning information is sent to the computing node control module, which dynamically schedules computing node resources: it reallocates computing node resources with a certain margin according to the latest state of the computing node and records the computing node's state information.
Preferably, while a task is running, the user can choose to go offline temporarily or to end the process. If temporary offline is chosen, the running process is suspended; at the next login the previous computing node resources are matched automatically and the previous process continues. If ending the process is chosen, the occupied computing node resources are released and the task's computing node historical resource allocation information is stored in the computing node historical resource allocation information table of the task resource estimation module; at the next login, computing node resources are allocated afresh according to the newly submitted task request.
Preferably, after the task ends in step (5), the computing node historical resource allocation information for the task is stored in the computing node historical resource allocation information table of the task resource estimation module.
The invention has the following beneficial effects. Because the same task is often executed repeatedly on this kind of computing platform, computing node resource allocation can be estimated from the historical computing node information table and the task's resource requirements, and allocation during the computing stage can be controlled dynamically while the requirements are met, which improves job response speed and computing node resource utilization. During prediction, the result of the previous prediction is fed back into the next one, which balances resource configuration in the cloud computing environment and improves the overall efficiency of the system.
Drawings
FIG. 1 is a schematic structural diagram of the allocation system of the present invention;
FIG. 2 is a schematic flow chart of the allocation method of the present invention;
FIG. 3 is a flowchart illustrating the node resource estimation method for a task according to the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
The embodiment is as follows: as shown in fig. 1, a system for flexibly allocating a compute node cluster includes a user module, a compute node management module, a compute node resource module, and a storage server. The computing node management module comprises a verification module, a task resource pre-estimation module, a computing node control module and a computing node state monitoring module.
The user module provides a user login port and provides an entrance for a user to submit a task.
The verification module obtains user login information from the user module and, when a user logs in or submits task request information, verifies whether the user identity and the task request are legal. If they are illegal, the request is rejected; if they are verified as legal, the user login and task request information are sent to the task resource estimation module.
The task resource estimation module receives the user login information and the user task request sent by the verification module. It stores a computing node historical resource allocation information table with an entry for each historical task. After each task finishes, the computing node resource allocation information for that task is used to update the table.
Task resource estimation is triggered every time a user submits a new task, and the estimate is made from the task description and selected parameters submitted by the user. In general, the same task may be executed repeatedly. The computing node resources required by each execution, such as CPU, memory, GPU, IO, network bandwidth and time, may be the same or may differ because of configuration or data differences. The task estimation module derives the resources required to run the task from the computing node historical resource allocation information table, querying the expected GPU occupancy, memory, IO and network capacity from a historical node information table database built from the accumulated experience of previously run tasks. A new task has no execution history to refer to, so its estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
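The fallback order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate_for_task`, the dictionary layout and the choice to prefer the user's request over the system maximum are hypothetical, since the text allows either.

```python
def estimate_for_task(task_key, history, user_request=None, system_max=None):
    """Pick a resource estimate for a task.

    history maps a task key to the resources recorded for earlier runs
    (an assumed, simplified stand-in for the historical allocation table).
    """
    if task_key in history:
        return dict(history[task_key])   # repeated task: reuse what earlier runs needed
    if user_request is not None:
        return dict(user_request)        # new task: use the resources the user applied for
    if system_max is None:
        raise ValueError("no history, no user request and no system maximum")
    return dict(system_max)              # otherwise: the most the system can provide
```

A repeated task therefore never consults the user's request, while a task seen for the first time falls through to it.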
The computing node control module regulates and allocates the computing node resource module according to the estimate of computing node resource usage sent by the task resource estimation module. It can also receive the computing node state early warning information sent directly by the computing node monitoring module, judge whether the task is abnormal and, if so, issue an abnormality prompt and actively end the abnormal task. For a non-abnormal task, it analyzes the current resource occupation of each computing node executing the task and reallocates the computing node resources.
The computing node monitoring module stores a node information table and a computing node state early warning information table. The node information table contains information such as computing node ID, CPU occupancy rate, memory utilization rate, disk utilization rate, IO utilization rate and network bandwidth. The computing node state early warning information table holds the early warning values for the computing state.
The computing node monitoring module collects state information of the computing node resources at regular intervals, generates a new node information table and monitors the computing node resources. The new node information is compared with the data pre-stored in the computing node state early warning information table of the computing node monitoring module to judge whether the utilization of any resource exceeds its early warning threshold. If no early warning value is exceeded, the task's computing node state information is recorded and the task continues to run. When the state information of a computing node reaches an early warning value, computing node state early warning information is sent directly to the computing node control module, which reallocates computing node resources or manages the task.
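One way to picture the two tables is a record per node plus a dictionary of warning values. The sketch below is illustrative only: the field names, the 0.0 to 1.0 utilization scale and the numeric warning values are assumptions, not figures from the patent.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    """One row of the node information table (field names are illustrative)."""
    node_id: str
    cpu: float       # CPU occupancy, as a fraction between 0.0 and 1.0
    memory: float    # memory utilization
    disk: float      # disk utilization
    io: float        # IO utilization
    network: float   # share of network bandwidth in use


# Hypothetical early warning values; the patent only says they are preset.
WARNING_VALUES = {"cpu": 0.90, "memory": 0.85, "disk": 0.90, "io": 0.80, "network": 0.80}


def resources_over_warning(info, warning_values=WARNING_VALUES):
    """Return the names of the resources that have reached their warning value."""
    return [name for name, limit in warning_values.items()
            if getattr(info, name) >= limit]
```

An empty result means the sample is simply recorded and the task keeps running; a non-empty result is what would be forwarded to the control module as early warning information.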
The computing node resource module is provided with modular computing node cluster resources used for executing computing tasks of users. The storage server is used for storing the operation data and the operation log.
In terms of software environment, each node is required to run the Ubuntu 16.04 operating system with development tools such as Python 2.7, sklearn 0.19.1 and PyTorch 0.1.2 installed; in terms of network environment, all nodes are required to be configured in the same network segment.
A method for flexibly distributing a computing node cluster comprises the following steps:
(1) verifying the validity of the user identity; if verification passes, accepting the task, otherwise ending the process directly;
(2) estimating the computing node resource allocation according to the task description in the task request information;
(3) allocating computing node resources according to the estimated value;
(4) after the task starts running, collecting state information of the computing node resources at regular intervals and judging whether the computing node resource utilization exceeds a preset early warning threshold; if it does, further judging whether the task running state is abnormal, and if it is not abnormal, dynamically reallocating the computing node resources;
(5) releasing the occupied computing node resources after the task is finished.
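Steps (1) to (5) can be read as a small pipeline. The following Python sketch shows only the control flow; the function `run_task` and its five callable parameters are hypothetical stand-ins for the verification, estimation, control and monitoring modules, which the patent does not specify at this level.

```python
def run_task(request, verify, estimate, allocate, monitor, release):
    """Walk one task request through steps (1)-(5) and return a status string."""
    if not verify(request):              # step (1): reject invalid users or requests
        return "rejected"
    estimated = estimate(request)        # step (2): estimate the resources needed
    allocation = allocate(estimated)     # step (3): allocate according to the estimate
    try:
        monitor(allocation)              # step (4): periodic checks while the task runs
    finally:
        release(allocation)              # step (5): always release occupied resources
    return "finished"
```

Putting the release in a `finally` clause is a design choice of this sketch: it frees the resources even when monitoring ends the task early by raising an exception.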
A specific example follows. As shown in FIG. 2, the method of the present invention proceeds as follows:
The user logs in from the client. The verification module verifies the validity of the user identity and rejects the user request if verification fails. If verification passes and the task request the user then sends is legal, a user resource request event is triggered, the user login information and task request information are sent on to the computing node control module, and the next step is carried out.
The task resources are then analyzed, and the estimate is made from the task description and selected parameters submitted by the user. In general, the same task may be executed repeatedly. The computing node resources required by each execution, such as CPU, memory, GPU, IO, network bandwidth and time, may be the same or may differ because of configuration or data differences. The task estimation module derives the resources required to run the task from the task execution history, querying the estimated GPU occupancy, memory, IO and network capacity from a historical node information table database built from the accumulated experience of previously run tasks. A new task has no execution history to refer to; when no estimate can be made from history, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
After the user submits a task description, each task request generates a configuration file containing the task description parameters. The M parameters related to resource configuration are extracted to form an M-dimensional task vector X, in which each dimension represents one attribute of the task, such as task type, task amount or network resources. The computing node resources allocated for each task request can be regarded as an E-dimensional allocated-resource vector Y, in which each dimension represents one resource, that is, one aspect of the running state of the computing node executing the task, such as CPU utilization, memory utilization, disk utilization, I/O utilization or network utilization. Suppose that N historical allocation records are stored in the computing node historical resource allocation information table of the task resource estimation module; the number of records grows with use.
Each historical allocation record includes the M-dimensional task description parameters X, and each record is denoted $X_i$, $i = 1, 2, 3, \dots, N$. The task description parameter table in the existing computing node historical resource allocation information table can be regarded as an $N \times M$ matrix. Correspondingly, each record also has an E-dimensional allocated computing node resource vector Y, denoted $Y_i$, $i = 1, 2, 3, \dots, N$. The computing node allocation table in the existing computing node historical resource allocation information table can be regarded as an $N \times E$ matrix.
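The two matrices can be held as parallel lists of rows. The class below is a minimal Python sketch; the name `HistoryTable` and its methods are hypothetical and only illustrate the N x M and N x E shapes described here.

```python
class HistoryTable:
    """Historical allocation records: N task vectors (N x M) and N resource vectors (N x E)."""

    def __init__(self, m, e):
        self.m, self.e = m, e
        self.x = []   # rows X_1 .. X_N, each of length M
        self.y = []   # rows Y_1 .. Y_N, each of length E

    def add(self, task_vector, resource_vector):
        """Append one record; N grows by one each time a task finishes."""
        if len(task_vector) != self.m or len(resource_vector) != self.e:
            raise ValueError("record does not match the table dimensions")
        self.x.append(list(task_vector))
        self.y.append(list(resource_vector))

    @property
    def shapes(self):
        """Return ((N, M), (N, E)) for the two matrices."""
        n = len(self.x)
        return (n, self.m), (n, self.e)
```

Checking the dimensions on every insert keeps row i of the task matrix aligned with row i of the resource matrix, which the nearest-record lookup relies on.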
The estimation process is shown in FIG. 3. After a new task request enters the task estimation module, a new configuration file containing the task description parameters is first generated from the new task description submitted by the user, and the M parameters related to resource allocation are selected to form a task vector $X_{new}$. The KNN (k-nearest neighbor) algorithm is used to find the d historical resource records closest to the new task vector as samples; their task description parameter vectors are $X_{knn_j}$, $j = 1, 2, 3, \dots, d$, and the corresponding allocated computing node resource vectors are $Y_{knn_j}$, $j = 1, 2, 3, \dots, d$. If one of the d historical samples is identical to the new task vector, for example $X_{knn_{fit}} = X_{new}$, the resource vector $Y_{knn_{fit}}$ allocated for that historical sample is allocated directly to the new task request. If no identical historical record exists, a linear regression over the d historical samples is fitted to the new task vector to obtain

$$X_{new} \approx \sum_{t=1}^{d} k_t X_{knn_t}$$

where $k_t$ is the weight of the t-th task description parameter vector $X_{knn_t}$ obtained by the linear regression fit. Finally, the historically allocated computing resources are weighted by the fitted parameters, $\sum_{t=1}^{d} k_t Y_{knn_t}$, and given a certain margin; the result $Y_{new\_pred}$ is used as the estimate of the computing resources needed by the new task request.
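The estimation flow of FIG. 3 can be sketched in plain Python as below. Several details are assumptions made for illustration and are not stated in the patent: Euclidean distance for the nearest-record search, a least-squares fit solved through the normal equations with a tiny ridge term for numerical safety, and a multiplicative margin. All function names are hypothetical.

```python
import math


def nearest(history_x, x_new, d):
    """Indices of the d historical task vectors closest to x_new (Euclidean distance)."""
    order = sorted(range(len(history_x)), key=lambda i: math.dist(history_x[i], x_new))
    return order[:d]


def solve(a, b):
    """Solve the square system a * k = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    k = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * k[j] for j in range(i + 1, n))
        k[i] = (m[i][n] - s) / m[i][i]
    return k


def fit_weights(samples_x, x_new, ridge=1e-9):
    """Least-squares weights k such that sum_t k[t] * samples_x[t] approximates x_new."""
    d = len(samples_x)
    gram = [[sum(p * q for p, q in zip(samples_x[i], samples_x[j])) for j in range(d)]
            for i in range(d)]
    for i in range(d):
        gram[i][i] += ridge   # assumption: keeps the system solvable for collinear samples
    rhs = [sum(p * q for p, q in zip(samples_x[i], x_new)) for i in range(d)]
    return solve(gram, rhs)


def estimate(history_x, history_y, x_new, d=3, margin=0.1):
    """Estimate the resource vector for x_new from the d nearest historical records."""
    idx = nearest(history_x, x_new, d)
    for i in idx:
        if list(history_x[i]) == list(x_new):   # identical task: reuse its allocation
            return list(history_y[i])
    k = fit_weights([history_x[i] for i in idx], x_new)
    e = len(history_y[0])
    weighted = [sum(k[t] * history_y[i][r] for t, i in enumerate(idx)) for r in range(e)]
    return [(1.0 + margin) * v for v in weighted]   # assumed multiplicative margin
```

With `history_x = [[1, 0], [0, 1], [5, 5]]` and `history_y = [[2, 4], [4, 2], [10, 10]]`, a request for `[1, 0]` reuses the first record unchanged, while `[1, 1]` with `d=2` is fitted from the first two records with weights close to 1 each. In practice a library such as scikit-learn could replace the hand-written search and solver.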
After the computing node resources for the user's task are determined, computing node resources are allocated according to the estimated value.
The computing node runs the task. The computing node monitoring module collects state information of the computing node resources at regular intervals, generates a new node information table and monitors the computing node resources. The new node information is compared with the data pre-stored in the computing node state early warning information table of the computing node monitoring module to judge whether the utilization of any resource exceeds its early warning threshold. If no early warning value is exceeded, the task's computing node state information is recorded and the task continues to run. If an early warning threshold is exceeded, it is further judged whether the task running state is abnormal.
If the task running state is not abnormal, the computing node state early warning information is sent to the computing node control module, which dynamically schedules the computing node resources; the dynamic scheduling is then written to the operation log.
The dynamic scheduling method is to reallocate computing resources with a certain margin according to the latest state $X_{new}$ of the computing node and to record the node's state information.
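A minimal Python sketch of reallocation with a margin follows. The multiplicative margin, the cap at node capacity and the name `reallocate` are assumptions introduced for illustration; the patent only says that resources are reallocated "with a certain margin" and that the state is recorded.

```python
def reallocate(latest_usage, capacity, margin=0.2, state_log=None):
    """Return a new allocation sized from the latest observed usage plus a margin.

    latest_usage and capacity map resource names to amounts in the same units.
    """
    new_allocation = {
        name: min(capacity[name], used * (1.0 + margin))   # never exceed what the node has
        for name, used in latest_usage.items()
    }
    if state_log is not None:
        state_log.append(dict(latest_usage))   # record the node state that triggered the change
    return new_allocation
```

For example, observed usage of 4 CPU cores and 30 GB of memory on a node with 8 cores and 32 GB, with a margin of 0.25, gives 5 cores and a memory allocation capped at 32 GB.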
If the task is abnormal, the abnormal task is actively ended, the occupied resources are released, and the user is prompted that the task failed to run because of insufficient resources; the user then chooses whether to re-estimate the required amount of resources and schedule a suitable node for a retry. Finally, the exception handling operation is written to the operation log.
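The three outcomes of a monitoring sample (continue, reallocate, terminate) can be sketched as one decision function. The patent does not say how a task is judged abnormal, so the sketch takes that test as a caller-supplied function; `decide`, its action strings and the `is_abnormal` signature are all hypothetical.

```python
def decide(usage, thresholds, is_abnormal):
    """Classify one monitoring sample as 'continue', 'reallocate' or 'terminate'.

    is_abnormal(usage, over) is supplied by the caller and returns True or False.
    """
    over = [name for name, limit in thresholds.items()
            if usage.get(name, 0.0) > limit]
    if not over:
        return "continue", over      # below every threshold: record state, keep running
    if is_abnormal(usage, over):
        return "terminate", over     # end the task, release resources, prompt the user
    return "reallocate", over        # healthy but constrained: adjust the allocation
```

The returned list of exceeded resources can be written to the operation log together with the action taken.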
If the user exits while a task is running, the user can choose to go offline temporarily or to end the process. If temporary offline is chosen, the running process is suspended, and the login information, the computing node occupation information and the running state are saved temporarily; after the user logs in again, the previous computing node resources are matched automatically and the previous process continues. If the user chooses to end the process, the task process is ended, the occupied computing node resources are released, and the task's computing node historical resource allocation information is stored in the computing node historical resource allocation information table of the task resource estimation module as a reference for later computing node resource allocation. At the next login, computing node resources must be allocated afresh according to the newly submitted task request.
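The two exit choices can be sketched as a small state holder in Python. The class `SessionStore`, its method names and the in-memory dictionaries are hypothetical; a real system would persist this state on the storage server.

```python
class SessionStore:
    """Keeps suspended sessions and finished-task records (illustrative only)."""

    def __init__(self):
        self.suspended = {}   # user -> (allocation, run_state) saved while offline
        self.history = []     # (task, allocation) records of ended tasks

    def go_offline(self, user, allocation, run_state):
        """Temporary offline: keep the allocation and the running state."""
        self.suspended[user] = (allocation, run_state)

    def end_process(self, user, task, allocation, release):
        """End the process: release resources and store the allocation as history."""
        release(allocation)
        self.history.append((task, allocation))
        self.suspended.pop(user, None)

    def login(self, user):
        """Return the saved (allocation, run_state), or None if a new request is needed."""
        return self.suspended.pop(user, None)
```

A `None` result from `login` corresponds to the case where resources must be allocated again from a newly submitted task request.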
After each task finishes, the computing node historical resource allocation information for the task is stored in the computing node historical resource allocation information table of the task resource estimation module as a reference for future estimates. The user can choose whether to log out. If the user logs out, the login information is deleted from the verification module database, and the user must log in again and pass identity verification before submitting the next task request. If the user does not log out, the user keeps a legal identity on the system side and can continue to submit new task requests and have node resources allocated for computation.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (15)
1. A system for elastic distribution of a computing node cluster, comprising: a user module, a computing node management module, a computing node resource module and a storage server; the computing node management module comprises a task resource estimation module, a computing node control module and a computing node state monitoring module; the task resource estimation module is used for estimating computing node resource usage according to the task description and selection parameters submitted by a user; the computing node control module is used for regulating and allocating the computing node resource module according to the estimation result sent by the task resource estimation module, and for receiving and processing computing node state early warning information sent directly by the computing node monitoring module; if the utilization rate of the node resources exceeds an early warning threshold, it is further judged whether the task running state is abnormal, and if it is not abnormal, the computing node resources are dynamically reallocated; the early warning threshold is preset; the dynamic reallocation reallocates computing resources with a certain margin according to the latest state of the computing node and records the state information of the node;
if the task is abnormal, the abnormal task is actively ended, the occupied resources are released and the user is prompted; the computing node monitoring module is used for storing a node information table, and periodically acquires state information of the computing node resources, generates a new node information table and monitors the computing node resources;
the task resource estimation module is used for storing a computing node historical resource allocation information table corresponding to each task resource allocation request; the estimation extracts a task vector formed from the relevant parameters of the task description and finds the d historical resource records closest to the new task vector as samples for calculation.
2. The system of claim 1, wherein: the user module provides a user login port and an entrance of user task request information.
3. The system according to any of claims 1-2, wherein: the computing node management module further comprises a verification module, which is used for acquiring and storing user login information and task request information from the user module and for verifying, when the user logs in and submits a task request, whether the user identity and the task request are legal; if they are illegal the request is rejected, and if verification passes the user login and task request information is sent to the task resource estimation module.
4. The system of claim 1, wherein: the resources necessary for running a task are deduced from the computing node historical resource allocation information table; for a new task with no execution history available for reference, the estimate is made according to the resources actively requested by the user or according to the maximum resources the system can provide.
5. The system of claim 1, wherein: the computing node monitoring module also stores a computing node state early warning information table, in which computing node state early warning values are stored; when the state information of a computing node reaches an early warning value, computing node state early warning information is sent directly to the computing node control module; the computing node control module receives the computing node state early warning information sent directly by the computing node monitoring module, judges whether the task is abnormal, sends an abnormality prompt and actively ends an abnormal task; for a non-abnormal task, it analyzes the current resource occupation of each computing node executing the task and reallocates the computing node resources.
6. The system according to claim 1, wherein: the computing node resource module is provided with modular computing node cluster resources for executing computing tasks of users; the storage server is used for storing operation data and operation logs; the node information table comprises a computing node ID, a CPU occupancy rate, a memory utilization rate, a disk utilization rate, an IO utilization rate and a network bandwidth.
7. A method for flexibly distributing a computing node cluster is characterized by comprising the following steps:
(1) performing computing node resource allocation prediction judgment according to task description in the task request information;
(2) performing computing node resource allocation according to the pre-estimated computation value;
(3) after the task starts running, acquiring state information of the computing node resources at regular intervals, and judging whether the utilization rate of the computing node resources exceeds an early warning threshold; if the early warning threshold is exceeded, further judging whether the task running state is abnormal, and if it is not abnormal, dynamically reallocating the computing node resources; the early warning threshold is preset; the dynamic reallocation reallocates computing resources with a certain margin according to the latest state of the computing node and records the state information of the node;
if the task is abnormal, actively ending the abnormal task, releasing occupied resources and prompting a user;
(4) releasing occupied computing node resources after the task is finished;
the estimation method comprises the steps of extracting task vectors formed by task description related parameters, and finding d historical resource records closest to the new task vectors as samples for calculation.
8. The method for elastic distribution of a cluster of compute nodes of claim 7 wherein: before the task request information is submitted, the validity of the user identity is verified; if verification passes, the task is accepted; otherwise the process ends directly.
9. The method of claim 8, wherein: in the step (1), the estimation is carried out according to a historical node information table created by experience accumulated by previous running tasks.
10. The method for flexibly distributing a cluster of computing nodes as claimed in any one of claims 7 to 9, wherein: the computing node resource allocation estimation is performed as follows: parameters related to resource allocation are extracted from the task description to form a task vector X, and the correspondingly allocated computing node resources are taken as a resource vector Y; after a new task description is submitted, a configuration file containing the task description parameters is generated, and the parameters related to resource configuration are extracted to form a task vector X_new; the d historical resource records closest to the new task vector are found by a clustering algorithm and used as samples; if one of the d historical samples is identical to the new task vector, resources are allocated to the new task request directly according to the resources allocated in that historical resource record; if no identical historical resource record exists, a linear regression is fitted over the d historical samples to fit the new task vector, yielding a weight for the task description parameter vector of each of the d samples; the computing resources allocated in the historical records are weighted by the fitted parameters and a certain margin is added, giving the computing resources required by the newly requested task.
11. The method of claim 10, wherein: each dimension in the task vector X represents one attribute of the task; each dimension of the resource vector Y represents the running state of a computing node corresponding to the executed task, wherein the running state of the computing node refers to the CPU utilization rate, the memory utilization rate, the disk utilization rate, the I/O utilization rate and the network utilization rate.
12. The method of claim 7, wherein: if, during step (1), the estimate cannot be made according to the historical node information table, it is made according to the resources actively requested by the user or the maximum resources that the system can provide.
13. The method for elastic distribution of a cluster of compute nodes of claim 9 wherein: step (3) is specifically as follows: state information of the computing node resources is acquired at regular intervals, a new node information table is generated, and the computing node resources are monitored; the new node information is compared with the data stored in advance in the computing node state early warning information table of the computing node monitoring module to judge whether the utilization rate of any resource exceeds its early warning threshold; if a threshold is exceeded, it is further judged whether the task running state is abnormal; if the task is judged to be abnormal, the abnormal task is actively ended, the occupied resources are released, a prompt is given that the task failed to run due to insufficient resources, and a choice is offered of whether to re-estimate the required resource amount and arrange a suitable node for a retry; if the task running state is not abnormal, the computing node state early warning information is sent to the computing node control module, which dynamically reallocates computing node resources, the dynamic reallocation being to reallocate computing node resources with a certain margin according to the latest state of the computing node and to record the state information of the computing node.
14. The method for elastic distribution of a cluster of compute nodes of claim 7 wherein: during the task running process, the temporary offline or the process ending can be selected; if the temporary offline is selected, the running process is suspended, and after the next login, the previous computing node resources are automatically matched, and the previous process is continued; if the process is selected to be ended, occupied computing node resources are released, and computing node historical resource allocation information corresponding to the task is stored in a computing node historical resource allocation information table of a task resource estimation module; and the next time of login, computing node resources are redistributed according to the newly submitted task request.
15. The method for elastic distribution of a cluster of compute nodes of claim 7 wherein: in step (4), after each task is finished, the computing node historical resource allocation information corresponding to the task is stored in the computing node historical resource allocation information table of the task resource estimation module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810857293.3A CN109165093B (en) | 2018-07-31 | 2018-07-31 | System and method for flexibly distributing computing node cluster |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165093A (en) | 2019-01-08 |
CN109165093B (en) | 2022-07-19 |
Family
ID=64898439
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810857293.3A Active CN109165093B (en) | 2018-07-31 | 2018-07-31 | System and method for flexibly distributing computing node cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165093B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096349B (en) * | 2019-04-10 | 2020-03-06 | 山东科技大学 | Job scheduling method based on cluster node load state prediction |
CN115525438A (en) * | 2019-08-23 | 2022-12-27 | 第四范式(北京)技术有限公司 | Method, device and system for allocating resources and tasks in distributed system |
CN110705893B (en) * | 2019-10-11 | 2021-06-15 | 腾讯科技(深圳)有限公司 | Service node management method, device, equipment and storage medium |
CN113094243A (en) * | 2020-01-08 | 2021-07-09 | 北京小米移动软件有限公司 | Node performance detection method and device |
CN111399976A (en) * | 2020-03-02 | 2020-07-10 | 上海交通大学 | GPU virtualization implementation system and method based on API redirection technology |
CN111381969B (en) * | 2020-03-16 | 2021-10-26 | 北京康吉森技术有限公司 | Management method and system of distributed software |
CN111813545A (en) * | 2020-06-29 | 2020-10-23 | 北京字节跳动网络技术有限公司 | Resource allocation method, device, medium and equipment |
CN111885158B (en) * | 2020-07-22 | 2023-05-02 | 曙光信息产业(北京)有限公司 | Cluster task processing method and device, electronic equipment and storage medium |
CN114780225B (en) * | 2022-06-14 | 2022-09-23 | 支付宝(杭州)信息技术有限公司 | Distributed model training system, method and device |
CN115495231B (en) * | 2022-08-09 | 2023-09-19 | 徐州医科大学 | Dynamic resource scheduling method and system under high concurrency task complex scene |
CN115297018B (en) * | 2022-10-10 | 2022-12-20 | 北京广通优云科技股份有限公司 | Operation and maintenance system load prediction method based on active detection |
CN115756822B (en) * | 2022-10-18 | 2024-03-19 | 超聚变数字技术有限公司 | Method and system for optimizing high-performance computing application performance |
CN115794420B (en) * | 2023-02-07 | 2023-04-18 | 飞天诚信科技股份有限公司 | Dynamic management method, device and medium for service node resource allocation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104951372A (en) * | 2015-06-16 | 2015-09-30 | 北京工业大学 | Method for dynamic allocation of Map/Reduce data processing platform memory resources based on prediction |
CN105897616A (en) * | 2016-05-17 | 2016-08-24 | 腾讯科技(深圳)有限公司 | Resource allocation method and server |
CN106790636A (en) * | 2017-01-09 | 2017-05-31 | 上海承蓝科技股份有限公司 | A kind of equally loaded system and method for cloud computing server cluster |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8261277B2 (en) * | 2006-04-10 | 2012-09-04 | General Electric Company | System and method for dynamic allocation of resources in a computing grid |
US20130263117A1 (en) * | 2012-03-28 | 2013-10-03 | International Business Machines Corporation | Allocating resources to virtual machines via a weighted cost ratio |
CN102759984A (en) * | 2012-06-13 | 2012-10-31 | 上海交通大学 | Power supply and performance management system for virtualization server cluster |
CN103699447B (en) * | 2014-01-08 | 2017-02-08 | 北京航空航天大学 | Cloud computing-based transcoding and distribution system for video conference |
US9485197B2 (en) * | 2014-01-15 | 2016-11-01 | Cisco Technology, Inc. | Task scheduling using virtual clusters |
TWI546681B (en) * | 2014-12-09 | 2016-08-21 | 英業達股份有限公司 | Method of resource allocation in a server system |
CN104407912B (en) * | 2014-12-25 | 2018-08-17 | 无锡清华信息科学与技术国家实验室物联网技术中心 | A kind of virtual machine configuration method and device |
CN105487930B (en) * | 2015-12-01 | 2018-10-16 | 中国电子科技集团公司第二十八研究所 | A kind of optimizing and scheduling task method based on Hadoop |
Non-Patent Citations (1)
Title |
---|
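Resource allocation method with fairness optimization in cloud environments; Xue Shengjun (薛胜军) et al.; Journal of Computer Applications (《计算机应用》); 2016-10-10 (No. 10); 46-51 * |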
Also Published As
Publication number | Publication date |
---|---|
CN109165093A (en) | 2019-01-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||