CN109165093A - Elastic allocation system and method for a compute node cluster

Publication number
CN109165093A
Authority
CN
China
Prior art keywords
compute node
resource
task
module
user
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810857293.3A
Other languages
Chinese (zh)
Other versions
CN109165093B (en)
Inventor
郑达韡
梅胜涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Integral Power Information Technology Co Ltd
Original Assignee
Ningbo Integral Power Information Technology Co Ltd
Application filed by Ningbo Integral Power Information Technology Co Ltd
Priority to CN201810857293.3A
Publication of CN109165093A
Application granted
Publication of CN109165093B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resources being hardware resources other than CPUs, Servers and Terminals, the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

The present invention relates to an elastic allocation system and method for a compute node cluster. Using an elastic compute node allocation mechanism, it estimates the compute node resource allocation from the way historical tasks used compute node resources and from the resource demand of the current task. Provided the demand is met, allocation is controlled dynamically during the computation stage, which improves job response speed and compute node resource utilization. The results of earlier predictions and runs can be fed back into the next prediction, achieving balanced allocation of resources in a cloud computing environment and improving the overall efficiency of the system.

Description

Elastic allocation system and method for a compute node cluster
Technical field
The present invention relates to the field of dynamic resource allocation management for cloud computing cluster servers, and in particular to an elastic allocation system and method for a compute node cluster.
Background
With the development of the computer field, cloud computing has grown especially rapidly. Through computer and network technologies such as distributed computing, parallel computing, virtualization and load balancing, cloud computing provides users with convenient, fast and secure data storage and network services.
Most operations in deep learning are vectorized matrix operations. As a graphics accelerator, a GPU provides a large number of arithmetic cores for rendering, and these cores can equally be used to accelerate vectorized matrix operations, so in recent years deep learning has relied heavily on GPUs for model training. As demand grows, more and more cloud platforms offer GPUs to users as a compute node resource.
However, because of the hardware characteristics of compute node resources, cloud compute node resources are usually provided to users in exclusive mode, and this allocation is one-way and static. It easily leads to overloaded compute node resources and a poor user experience.
In exclusive mode, it is difficult for each compute node resource to deliver its full performance. A fixed resource allocation scheme can hardly match the different task requests of different users efficiently. Moreover, when compute node resources are allocated only once, the initially allocated resources may not satisfy the user's computing demand once the user actually submits a task and it starts to run. To solve these problems, it is necessary to design an elastic allocation system and method for a compute node cluster.
Summary of the invention
The object of the present invention is to overcome the above shortcomings by providing an elastic allocation system and method for a compute node cluster. The invention uses an elastic compute node allocation mechanism: it analyzes compute node state information, estimates the compute node state from the task's resource demand and, provided the demand is met, dynamically controls allocation during the computation stage, thereby improving job response speed and compute node resource utilization.
The present invention achieves the above object through the following technical solution. An elastic allocation system for a compute node cluster comprises a user module, a compute node management module, a compute node resource module and a storage server.
The user module provides the user login port and the entry for user task request information. The compute node resource module holds modular compute node cluster resources for executing users' computing tasks. The storage server stores operation data and operation logs.
The compute node management module comprises an authentication module, a task resource estimation module, a compute node control module and a compute node state monitoring module, wherein:
the authentication module obtains user login information and task request information from the user module and, after verification, sends them to the task resource estimation module;
the task resource estimation module receives the user login information and task request information sent by the authentication module, and estimates compute resource node usage from the task description and selected parameters submitted by the user;
the compute node control module regulates and allocates the compute node resource module according to the compute resource node usage estimate sent by the task resource estimation module; it also receives the compute node state early-warning information sent directly by the compute node state monitoring module and processes it;
the compute node state monitoring module stores a node information table and periodically collects state information from the compute node resources, generating a new node information table in order to monitor the compute node resources. The node information table includes the compute node ID, CPU usage, memory usage, disk utilization, I/O utilization, network bandwidth and so on.
Preferably, the authentication module obtains and stores user information from the user module, and verifies whether the user's identity and task request are legitimate when the user logs in and submits task request information. If they are not, the request is rejected; if verification passes, the user login information and task request information are sent to the task resource estimation module.
Preferably, the task resource estimation module stores a compute node historical resource allocation information table recording the resource allocation for each requested task. The resources required to run a task are inferred from this table. For a new task with no execution history to refer to, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
Preferably, the compute node state monitoring module also stores a compute node state early-warning information table, which holds the compute state early-warning values. When compute node state information reaches an early-warning value, the compute node state early-warning information is sent directly to the compute node control module. On receiving it, the compute node control module judges whether the task is abnormal; if so, it issues an abnormality alert and actively terminates the abnormal task. For a non-abnormal task, it analyzes the resource occupancy of each compute node executing the task and reallocates resources to the compute nodes.
The present invention also provides an elastic allocation method for a compute node cluster, comprising the following steps:
(1) verify the legitimacy of the user's identity; if verification passes, accept the task, otherwise end the process directly;
(2) estimate the compute node resource allocation from the task description in the task request information;
(3) allocate compute node resources according to the estimated values;
(4) once the task is running, periodically collect state information from the compute node resources and judge whether compute node resource utilization exceeds the early-warning threshold; if it does, further judge whether the task's running state is abnormal, and if it is not abnormal, dynamically reallocate the compute node resources; the early-warning threshold is preset;
(5) after the task ends, release the occupied compute node resources.
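Steps (1) to (5) can be read as a simple control loop. The following Python sketch is illustrative only: the function name `run_task`, the callback-based structure and the resource keys are assumptions made for this example and do not come from the patent text.

```python
from typing import Callable, Dict, Iterable, List

Resources = Dict[str, float]  # e.g. {"cpu": 0.5}; the key names are hypothetical


def run_task(
    user_ok: bool,
    estimate: Callable[[], Resources],
    samples: Iterable[Resources],
    thresholds: Resources,
    is_abnormal: Callable[[Resources], bool],
    reallocate: Callable[[Resources, Resources], Resources],
) -> List[str]:
    """Walk through steps (1)-(5) and return a log of what happened."""
    # (1) Verify the user; end the process directly on failure.
    if not user_ok:
        return ["rejected"]
    log: List[str] = []
    # (2) + (3) Estimate the required resources and allocate them.
    allocation = estimate()
    log.append("allocated")
    # (4) Periodic sampling: compare each utilization sample with the thresholds.
    for usage in samples:
        exceeded = any(usage.get(k, 0.0) > v for k, v in thresholds.items())
        if not exceeded:
            continue
        if is_abnormal(usage):
            log.append("terminated")
            break
        allocation = reallocate(allocation, usage)
        log.append("reallocated")
    # (5) Release the occupied resources when the task ends.
    log.append("released")
    return log
```

Passing the estimation, abnormality check and reallocation in as callbacks keeps the loop independent of how each module is actually implemented, which the text leaves open.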
Preferably, in step (2), the estimate is made from a historical node information table built from the experience accumulated by previously run tasks. The compute node resource allocation is estimated as follows. The resource-related parameters in a task description are extracted to form a task vector X, and the compute node resources allocated to that task form the resource vector Y. When a new task description is submitted, a configuration file containing the task description parameters is generated, and the resource-related parameters in it are extracted to form the task vector Xnew. A clustering algorithm is used to find the d historical resource records nearest to the new task vector as samples. If one of these d historical samples is identical to the new task vector, the resources allocated to that historical sample are assigned directly to the new task request. If no identical historical record exists, a linear regression is fitted over the d historical samples to the new task vector, which yields a weight for each task description parameter vector in the d samples; the historically allocated compute resources are weighted with the fitted parameters and a certain margin is added, giving the compute resources required by the new task request. Each dimension of the task vector X represents one attribute of the task. Each dimension of the resource vector Y represents the running state of the compute node executing the task, where the running state refers to CPU utilization, memory usage, disk utilization, I/O utilization, network utilization and so on.
Preferably, if an estimate cannot be made from the historical node information table while executing step (2), the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
Preferably, step (4) is specifically as follows. State information is collected periodically from the compute node resources and a new node information table is generated in order to monitor the compute node resources. The new node information is compared with the data in the compute node state early-warning information table stored in advance in the compute node state monitoring module, to check whether any resource utilization exceeds its early-warning threshold. If a threshold is exceeded, it is further judged whether the task's running state is abnormal. If the task is abnormal, it is actively terminated, the resources it occupies are released, and the user is notified that the task failed to run because of insufficient resources and may choose whether to re-estimate the required resources and schedule a suitable node for a retry. If the task is not abnormal, the compute node state early-warning information is sent to the compute node control module, which dynamically reallocates the compute node resources: based on the latest compute node state, it reallocates compute node resources with a certain margin according to the dynamic allocation method, and records the node state information.
Preferably, while the task is running, the user may choose to go temporarily offline or to end the process. If the user goes temporarily offline, the running process is suspended; at the next login, the previous compute node resources are matched automatically and the previous process continues. If the user ends the process, the occupied compute node resources are released and the task's compute node historical resource allocation information is stored in the compute node historical resource allocation information table of the task resource estimation module; at the next login, compute node resources must be reallocated according to the newly submitted task request.
Preferably, in step (5), after each task ends, the task's compute node historical resource allocation information is stored in the compute node historical resource allocation information table of the task resource estimation module.
The beneficial effects of the present invention are as follows. Given the particular nature of the field this computing platform targets, the same task may be executed repeatedly. The compute node resource allocation is therefore estimated from the historical compute node information table and the task's resource demand and, provided the demand is met, allocation is controlled dynamically during the computation stage, improving job response speed and compute node resource utilization. During prediction, the results of previous predictions and runs are fed back into the next prediction, achieving balanced allocation of resources in the cloud computing environment and improving the overall efficiency of the system.
Brief description of the drawings
Fig. 1 is a structural diagram of the allocation system of the present invention;
Fig. 2 is a flow diagram of the allocation method of the present invention;
Fig. 3 is a flow diagram of the node resource estimation method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment, but the scope of protection of the invention is not limited to it:
Embodiment: as shown in Fig. 1, an elastic allocation system for a compute node cluster comprises a user module, a compute node management module, a compute node resource module and a storage server. The compute node management module comprises an authentication module, a task resource estimation module, a compute node control module and a compute node state monitoring module.
The user module provides the user login port and the entry through which users submit tasks.
The authentication module obtains user login information from the user module and, when the user logs in and submits task request information, verifies whether the user's identity and task request are legitimate. If they are not, the request is rejected; if verification passes, the user login information and task request information are sent to the task resource estimation module.
The task resource estimation module receives the user login information and user task request sent by the authentication module. It stores a compute node historical resource allocation information table covering each historical task. After each task ends, the compute node resource allocation information of that task is used to update the table.
Every new task a user submits triggers a task resource estimate, based on the task description and selected parameters the user submits. In general, the same task may be executed repeatedly. The compute node resources needed for each execution, such as CPU, memory, GPU, I/O, network bandwidth and time, may be the same, or may differ because of differences in configuration or data. The task estimation module infers the resources required to run the task from the compute node historical resource allocation information table: it queries the historical node information database, built from the accumulated experience of previously run tasks, for the estimated GPU usage, memory, I/O and network capacity. For a new task with no execution history to refer to, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
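The fallback order described above (history first, then the user's own request, then the system maximum) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `estimate_resources`, the dictionary-based history keyed by a task identifier, and the capping of user requests at the system maximum are all choices made for this example, not details given in the patent.

```python
from typing import Dict, Optional

Resources = Dict[str, float]  # resource name -> amount; the names are hypothetical


def estimate_resources(
    task_key: str,
    history: Dict[str, Resources],
    user_request: Optional[Resources],
    system_max: Resources,
) -> Resources:
    """Pick an estimate: history first, then the user's request, then the system maximum."""
    if task_key in history:
        # The task has run before: reuse its recorded allocation.
        return dict(history[task_key])
    if user_request is not None:
        # No history: use what the user asked for, capped at what the system
        # can provide (the cap is an assumption of this sketch).
        return {k: min(v, system_max.get(k, v)) for k, v in user_request.items()}
    # Neither history nor an explicit request: fall back to the system maximum.
    return dict(system_max)
```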
Compute node control module: regulates and allocates the compute node resource module according to the compute resource node usage estimate sent by the task resource estimation module. It can also receive the compute node state early-warning information sent directly by the compute node state monitoring module, judge whether the task is abnormal, issue an abnormality alert and actively terminate an abnormal task. For a non-abnormal task, it analyzes the resource occupancy of each compute node executing the task and reallocates resources to the compute nodes.
The compute node state monitoring module stores a node information table and a compute node state early-warning information table. The node information table includes information such as the compute node ID, CPU usage, memory usage, disk utilization, I/O utilization and network bandwidth. The compute node state early-warning information table holds the compute state early-warning values.
The compute node state monitoring module periodically collects state information from the compute node resources and generates a new node information table in order to monitor them. The new node information is compared with the data in the compute node state early-warning information table stored in advance in the module, to check whether any resource utilization exceeds its early-warning threshold. If no early-warning value is exceeded, the task's compute node state information is recorded and the task continues to run. When compute node state information reaches an early-warning value, the compute node state early-warning information is sent directly to the compute node control module, which allocates or manages compute node resources for the task.
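The comparison between a freshly collected node information row and the early-warning table can be sketched as below. The `NodeInfo` field names, the use of fractions in [0, 1] for utilization, and the function name `exceeded_metrics` are assumptions for illustration; the patent only lists the kinds of information the tables contain.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class NodeInfo:
    """One row of the node information table (field names are illustrative)."""
    node_id: str
    cpu: float      # utilization as a fraction in [0, 1]
    memory: float
    disk: float
    io: float
    network: float


def exceeded_metrics(row: NodeInfo, warning: Dict[str, float]) -> List[str]:
    """Return the names of the metrics whose value is above its early-warning value."""
    values = asdict(row)
    return sorted(name for name, limit in warning.items() if values[name] > limit)
```

An empty result means the task simply continues; a non-empty one is what would be forwarded to the compute node control module as early-warning information.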
The compute node resource module holds modular compute node cluster resources for executing users' computing tasks. The storage server stores operation data and operation logs.
In terms of software environment, the invention requires each node to run the Ubuntu 16.04 operating system with development tools such as Python 2.7, sklearn 0.19.1 and pytorch 0.1.2 installed; in terms of environment, each node must be configured on the same network segment.
An elastic allocation method for a compute node cluster comprises the following steps:
(1) verify the legitimacy of the user's identity; if verification passes, accept the task, otherwise end the process directly;
(2) estimate the compute node resource allocation from the task description in the task request information;
(3) allocate compute node resources according to the estimated values;
(4) once the task is running, periodically collect state information from the compute node resources and judge whether compute node resource utilization exceeds the early-warning threshold; if it does, further judge whether the task's running state is abnormal, and if it is not abnormal, dynamically reallocate the compute node resources; the early-warning threshold is preset;
(5) after the task ends, release the occupied compute node resources.
A specific example follows. As shown in Fig. 2, the method of the invention proceeds as follows:
The user logs in from the client. The authentication module verifies the legitimacy of the user's identity; if verification fails, the user's request is rejected. If verification passes, the system waits for the user to send a task request. If the task request is legitimate, a user resource request event is triggered, and the user login information and task request information are sent on to the compute node control module for the next step.
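The two checks in this step (user identity, then task request legitimacy) might look like the sketch below. Everything here is hypothetical: the patent does not say how credentials are stored or what makes a task request legitimate, so the token table, the set of legal task types and the name `verify_request` are assumptions for illustration only.

```python
import hmac
from typing import Dict, Optional, Set


def verify_request(
    user: str,
    token: str,
    task_type: str,
    tokens: Dict[str, str],
    legal_task_types: Set[str],
) -> Optional[Dict[str, str]]:
    """Return the message to forward to the next module, or None if the request is rejected."""
    expected = tokens.get(user)
    # Identity check; compare_digest avoids leaking information through timing.
    if expected is None or not hmac.compare_digest(expected.encode(), token.encode()):
        return None
    # Task request check (the notion of a "legal" task type is an assumption).
    if task_type not in legal_task_types:
        return None
    return {"user": user, "task_type": task_type}
```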
Task resource analysis: an estimate is made from the task description and selected parameters submitted by the user. In general, the same task may be executed repeatedly. The compute node resources needed for each execution, such as CPU, memory, GPU, I/O, network bandwidth and time, may be the same, or may differ because of differences in configuration or data. The task estimation module infers the resources required to run the task from the task's execution history, querying the historical node information database built from the accumulated experience of previously run tasks for the estimated GPU usage, memory, I/O and network capacity. For a new task with no execution history to refer to, where no estimate can be made, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
The compute node state values are determined from the information in the historical compute node information table. Specifically, for each task request, after the user submits the task description, a configuration file containing the task description parameters is generated. The M parameters related to resource configuration are extracted from it to form an M-dimensional task vector X, where each dimension represents one attribute of the task, such as task type, task size or network resources. The compute node resources allocated to each task request can be regarded as an E-dimensional compute node resource allocation vector Y, where each dimension represents one resource, namely a running-state measure of the compute node executing the task, such as CPU utilization, memory usage, disk utilization, I/O utilization or network utilization. Assume the compute node historical resource allocation information table in the task resource estimation module holds N historical allocation records; this table grows as the system is used.
Each historical allocation record contains an M-dimensional task description parameter vector X; the i-th is denoted Xi, i = 1, 2, 3, ..., N. The task description parameter table in the existing compute node historical resource allocation information table can be regarded as an N × M matrix. Correspondingly, each record also has an E-dimensional compute node resource allocation vector Y; the i-th is denoted Yi, i = 1, 2, 3, ..., N. The compute node allocation table in the existing compute node historical resource allocation information table can be regarded as an N × E matrix.
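The N × M and N × E tables described above can be built from raw history records as in the sketch below. The record layout (a pair of dictionaries), the parameter names and the choice to fill missing parameters with 0.0 are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# One history record: (task description parameters, allocated resources).
Record = Tuple[Dict[str, float], Dict[str, float]]


def to_vector(values: Dict[str, float], keys: Sequence[str]) -> List[float]:
    """Order the selected entries into a fixed-length vector (missing entries become 0.0)."""
    return [float(values.get(key, 0.0)) for key in keys]


def build_tables(
    records: Sequence[Record],
    task_keys: Sequence[str],      # the M resource-related task parameters
    resource_keys: Sequence[str],  # the E resource dimensions
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return X (N x M) and Y (N x E) as lists of rows."""
    X = [to_vector(params, task_keys) for params, _ in records]
    Y = [to_vector(resources, resource_keys) for _, resources in records]
    return X, Y
```

Non-numeric attributes such as the task type would need a numeric encoding before they can enter X; the text does not specify one.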
The estimation flow is shown in Fig. 3. When a new task request enters the task estimation module, a new configuration file containing the task description parameters is first generated from the new task description submitted by the user, and the M parameters relevant to resource allocation are selected to form the task vector Xnew. A clustering algorithm such as KNN is used to find the d historical resource records nearest to the new task vector as samples, where the parameter vectors describing the tasks are X_knnj, j = 1, 2, 3, ..., d, and the corresponding compute node resource allocation vectors are Y_knnj, j = 1, 2, 3, ..., d. If one of these d historical samples is identical to the new task vector, say X_knnfit = Xnew, the resource vector Y_knnfit allocated to that sample's task vector X_knnfit is assigned directly to the new task request. If no identical historical record exists, a linear regression is fitted over the d historical samples to the new task vector, giving Xnew ≈ Σ_{t=1..d} kt · X_knnt, where kt is the weight of the t-th task description parameter vector X_knnt obtained by the linear regression fit. Finally, the historically allocated compute resources are weighted with the fitted parameters, Σ_{t=1..d} kt · Y_knnt, and given a certain margin; the result Y_(new_pred) is taken as the estimated compute resources required by the new task request.
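The estimation flow above can be sketched in plain Python as a nearest-neighbour search followed by a least-squares fit. Several points are assumptions of this sketch rather than details from the patent: Euclidean distance is used to pick the d neighbours, the margin is applied multiplicatively as (1 + margin), a tiny ridge term keeps the normal equations solvable, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest(X: Sequence[Vector], x_new: Vector, d: int) -> List[int]:
    """Indices of the d history vectors closest to x_new (Euclidean distance)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_new))
    return order[:d]


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A k = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate(
    X: Sequence[Vector],
    Y: Sequence[Vector],
    x_new: Vector,
    d: int = 3,
    margin: float = 0.1,   # assumed multiplicative margin
    ridge: float = 1e-9,   # assumed regularizer, only for numerical stability
) -> List[float]:
    """Estimate the resource vector for x_new from history tables X and Y."""
    idx = nearest(X, x_new, d)
    # An identical task exists in the samples: reuse its allocation directly.
    for i in idx:
        if list(X[i]) == list(x_new):
            return list(Y[i])
    # Least squares for weights k such that x_new ~ sum_t k[t] * X[idx[t]],
    # solved through the normal equations G k = rhs.
    G = [
        [sum(a * b for a, b in zip(X[i], X[j])) + (ridge if i == j else 0.0) for j in idx]
        for i in idx
    ]
    rhs = [sum(a * b for a, b in zip(X[i], x_new)) for i in idx]
    k = solve(G, rhs)
    # Weight the historical allocations with k and add the margin.
    return [
        (1.0 + margin) * sum(k[t] * Y[i][e] for t, i in enumerate(idx))
        for e in range(len(Y[0]))
    ]
```

Since the text lists sklearn among the installed tools, a production version would more plausibly use a library nearest-neighbour search and regression; the hand-written solver here only keeps the example free of dependencies.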
Once the compute node resources for the user's task have been determined, compute node resources are allocated according to the estimated values.
The compute nodes run the task. The compute node state monitoring module periodically collects state information from the compute node resources and generates a new node information table in order to monitor them. The new node information is compared with the data in the compute node state early-warning information table stored in advance in the compute node state monitoring module, to check whether any resource utilization exceeds its early-warning threshold. If no early-warning value is exceeded, the task's compute node state information is recorded and the task continues to run. If a threshold is exceeded, it is further judged whether the task's running state is abnormal.
If the task's running state is not abnormal, the compute node state early-warning information is sent to the compute node control module, which dynamically reallocates the compute node resources; the dynamic reallocation is finally written to the operation log.
Based on the latest compute node state X_new, compute resources are reallocated with a certain margin according to the dynamic resource allocation method, and the node state information is recorded.
If the task is abnormal, it is actively terminated and the resources it occupies are released. The user is notified that the task failed to run because of insufficient resources, and chooses whether to re-estimate the required resources and schedule a suitable node for a retry. The exception handling operation is finally written to the operation log.
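The two branches taken after a warning (reallocate for a non-abnormal task, terminate and release for an abnormal one) can be sketched as follows. The function name `handle_warning`, the multiplicative margin, the rule of never shrinking an allocation and the log strings are all assumptions for illustration; the patent does not define how the margin is computed.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, float]  # resource name -> amount; the names are hypothetical


def handle_warning(
    allocation: Resources,
    latest_usage: Resources,
    abnormal: bool,
    margin: float = 0.2,
) -> Tuple[Resources, List[str]]:
    """Return the new allocation and the entries to append to the operation log."""
    if abnormal:
        # Terminate the task and release everything it held.
        return {}, ["task terminated: insufficient resources", "resources released"]
    # Re-size each resource from its latest observed usage plus a margin,
    # never shrinking below the current allocation (a choice made for this sketch).
    new_allocation = {
        name: max(amount, latest_usage.get(name, 0.0) * (1.0 + margin))
        for name, amount in allocation.items()
    }
    return new_allocation, ["dynamic reallocation"]
```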
While a task is running, a user who exits can choose to go temporarily offline or to end the process. If the user goes temporarily offline, the running process is suspended and the login information, compute node occupancy information and running state are retained; at the user's next login, the previous compute node resources are matched automatically and the previous process continues. If the user ends the process, the task process is terminated, the occupied compute node resources are released, and the task's compute node historical resource allocation information is stored in the compute node historical resource allocation information table of the task resource estimation module as a reference for later compute node resource allocation. At the next login, compute node resources must be reallocated according to the newly submitted task request.
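The offline and end-process choices amount to a small state machine over a task session. The sketch below is one possible reading: the `Session` class, its three state names and the dictionary used as the history table are hypothetical and are not named in the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional

Resources = Dict[str, float]


@dataclass
class Session:
    task_id: str
    allocation: Resources
    state: str = "running"  # "running", "suspended" or "ended"


def on_user_exit(session: Session, choice: str, history: Dict[str, Resources]) -> None:
    """Apply the user's choice when exiting while a task is running."""
    if choice == "offline":
        # Suspend the process but keep the allocation for the next login.
        session.state = "suspended"
    elif choice == "end":
        # Record the allocation as history, then release the resources.
        history[session.task_id] = dict(session.allocation)
        session.allocation = {}
        session.state = "ended"
    else:
        raise ValueError(f"unknown choice: {choice!r}")


def on_login(session: Session) -> Optional[Resources]:
    """Resume a suspended session with its earlier resources; otherwise return None."""
    if session.state == "suspended":
        session.state = "running"
        return session.allocation
    return None
```

A `None` result from `on_login` corresponds to the case where resources have to be allocated afresh from a newly submitted task request.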
After every subtask, the corresponding calculate node history resource allocation information of task will be stored in task resource and estimate The calculate node history resource allocation information table of module, as the reference estimated from now on.User can choose whether to log off, If selection is logged off, log-on message will be deleted from authentication module database, and user submits task requests next time, will be weighed New login carries out authentication.If do not logged off, still possess legal identity in system end subscriber, user can be after Continuous to submit new task requests, distribution node resource is calculated.
The above describes specific embodiments of the present invention and the technical principles employed. Any change made according to the concept of the present invention whose resulting function does not go beyond the spirit covered by the specification and drawings shall fall within the protection scope of the present invention.

Claims (10)

1. A computing node cluster elastic allocation system, characterized by comprising: a user module, a computing node management module, a computing node resource module and a storage server; the user module provides a user login port and an entry for user task request information; the computing node resource module contains modular computing node cluster resources for executing users' computing tasks; the storage server is used to store operation data and operation logs; the computing node management module comprises a verification module, a task resource estimation module, a computing node control module and a computing node monitoring module; wherein the verification module is used to obtain user login information and task request information from the user module and, after verification, to send the user login information and task request information to the task resource estimation module; the task resource estimation module is used to receive the user login information and task request information sent from the verification module and to make an estimation judgment on computing resource node usage according to the task description and selected parameters submitted by the user; the computing node control module regulates and allocates the computing node resource module according to the computing resource node usage estimation result sent from the task resource estimation module, and can also receive computing node status early-warning information sent directly by the computing node monitoring module and process that information; the computing node monitoring module stores a node information table, and can periodically collect status information from the computing node resources, generate a new node information table, and thereby monitor the computing node resources; wherein the node information table includes the computing node ID, CPU utilization, memory utilization, disk utilization, I/O utilization, network bandwidth, and the like.
2. The computing node cluster elastic allocation system according to claim 1, characterized in that: the verification module is used to obtain and save user information from the user module and, when a user logs in and submits task request information, to verify whether the user identity and the task request are legal; if they are not legal, the request is refused; if the verification passes, the user login information and task request information are sent to the task resource estimation module.
3. The computing node cluster elastic allocation system according to claim 1 or 2, characterized in that: the task resource estimation module stores a computing node history resource allocation information table corresponding to the resource allocation of each requested task; wherein the resources necessary to run a task are inferred from the computing node history resource allocation information table; for a new task with no execution history to refer to, the estimate is made according to the resources actively requested by the user or according to the maximum resources the system can provide.
4. The computing node cluster elastic allocation system according to any one of claims 1 to 3, characterized in that: the computing node monitoring module also stores a computing node status early-warning information table, which contains computing status early-warning values; when computing node status information reaches an early-warning value, computing node status early-warning information can be sent directly to the computing node control module; the computing node control module can also receive the computing node status early-warning information sent directly by the computing node monitoring module, judge whether the task is abnormal, issue an abnormality prompt, and actively terminate the abnormal task; for a non-abnormal task, it analyzes the resource occupancy of each computing node executing the task and reallocates resources among the computing nodes.
5. A computing node cluster elastic allocation method, characterized by comprising the following steps:
(1) verifying the legitimacy of the user identity; if the verification passes, accepting the task; otherwise, ending the process directly;
(2) making a computing node resource allocation estimation judgment according to the task description in the task request information;
(3) allocating computing node resources according to the estimated value;
(4) after the task starts running, periodically collecting status information from the computing node resources and judging whether computing node resource utilization exceeds an early-warning threshold; if the early-warning threshold is exceeded, further judging whether the task running state is abnormal; if it is not abnormal, dynamically reallocating the computing node resources; the early-warning threshold is preset;
(5) after the task ends, releasing the occupied computing node resources.
6. The computing node cluster elastic allocation method according to claim 5, characterized in that: in step (2), the estimate is made according to a history node information table created from the experience accumulated in previously run tasks, wherein the computing node resource allocation estimation method is: the parameters related to resource configuration in a task description are extracted to form a task vector X, and the computing node resources allocated accordingly form a resource vector Y; after a new task description is submitted, a configuration file containing the task description parameters is generated, and the parameters related to resource configuration are extracted from it to form a task vector X_new; a clustering algorithm is used to find the d history resource records nearest to the new task vector as samples; if these d historical samples contain a sample identical to the new task vector, resources are allocated to the new task request directly according to the resource allocation of that history resource record; if no identical history resource record exists, a linear regression is fitted to the new task vector from the d historical samples to obtain the weight of each task description parameter vector among the d samples, the historically allocated computing resources are weighted with the fitted parameters, and a certain margin is added to obtain the computing resources required by the newly requested task; each dimension of the task vector X represents one attribute of the task; each dimension of the resource vector Y represents the corresponding computing node running state when the task is executed, wherein the computing node running state refers to CPU utilization, memory utilization, disk utilization, I/O utilization, network utilization, and the like.
7. The computing node cluster elastic allocation method according to claim 6, characterized in that: if, during execution of step (2), no estimate can be made from the history node information table, the estimate is made according to the resources actively requested by the user or according to the maximum resources the system can provide.
8. The computing node cluster elastic allocation method according to claim 5, characterized in that step (4) is specifically: status information is periodically collected from the computing node resources, a new node information table is generated, and the computing node resources are monitored; the new node information is compared with the data of the computing node status early-warning information table stored in advance in the computing node monitoring module to determine whether any resource utilization exceeds the early-warning threshold; if the early-warning threshold is exceeded, it is further judged whether the task running state is abnormal; if the task running state is judged to be abnormal, the abnormal task is actively terminated, the occupied resources are released, a prompt is given that the task failed because of insufficient resources, and a choice is made whether to re-estimate the amount of resources required and schedule a suitable node for a retry; if it is not abnormal, computing node status early-warning information is sent to the computing node control module, which dynamically invokes computing node resources, wherein, according to the dynamic invocation method, the computing node resources are reallocated with a certain margin based on the computing node's latest state, and the node status information is recorded.
9. The computing node cluster elastic allocation method according to claim 5, characterized in that: while the task is running, the user may choose to go offline temporarily or to end the process; if temporary offline is chosen, the running process is suspended, and after the next login the previously used computing node resources are matched automatically and the earlier process continues; if ending the process is chosen, the occupied computing node resources are released, and the computing node history resource allocation information for the task is stored in the computing node history resource allocation information table of the task resource estimation module; at the next login, computing node resources must be reallocated according to the newly submitted task request.
10. The computing node cluster elastic allocation method according to claim 5, characterized in that: in step (5), after each task ends, the computing node history resource allocation information for the task is stored in the computing node history resource allocation information table of the task resource estimation module.
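As an illustration only, the estimation procedure recited in claim 6 might be sketched as below. Several choices here are assumptions that the text does not fix: Euclidean nearest-neighbour search stands in for the unspecified clustering algorithm, the regression is read as fitting X_new as a weighted combination of the d sample task vectors, a small ridge term is added to keep the normal equations solvable, and the names `estimate`, `nearest_records`, `solve`, `margin` and `ridge` are hypothetical.

```python
import math


def nearest_records(history, x_new, d):
    """Return the d (task_vector, resource_vector) records closest to x_new."""
    return sorted(history, key=lambda rec: math.dist(rec[0], x_new))[:d]


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def estimate(history, x_new, d=3, margin=0.1, ridge=1e-6):
    """Estimate the resource vector for x_new from historical allocations."""
    samples = nearest_records(history, x_new, d)
    for x, y in samples:
        if list(x) == list(x_new):
            return list(y)  # identical historical task: reuse its allocation
    # Least-squares weights w with x_new ~ sum_i w_i * x_i (normal equations).
    gram = [[sum(p * q for p, q in zip(xi, xj)) + (ridge if i == j else 0.0)
             for j, (xj, _) in enumerate(samples)]
            for i, (xi, _) in enumerate(samples)]
    rhs = [sum(p * q for p, q in zip(xi, x_new)) for xi, _ in samples]
    w = solve(gram, rhs)
    dims = len(samples[0][1])
    # Weight the historical resource vectors and add the safety margin.
    return [(1 + margin) * sum(wi * y[k] for wi, (_, y) in zip(w, samples))
            for k in range(dims)]
```

When no usable history exists, the fallback of claim 7 (the user's requested resources or the system maximum) would apply instead; that branch is left out of the sketch.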
CN201810857293.3A 2018-07-31 2018-07-31 System and method for flexibly distributing computing node cluster Active CN109165093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810857293.3A CN109165093B (en) 2018-07-31 2018-07-31 System and method for flexibly distributing computing node cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810857293.3A CN109165093B (en) 2018-07-31 2018-07-31 System and method for flexibly distributing computing node cluster

Publications (2)

Publication Number Publication Date
CN109165093A (en) 2019-01-08
CN109165093B (en) 2022-07-19

Family

ID=64898439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810857293.3A Active CN109165093B (en) 2018-07-31 2018-07-31 System and method for flexibly distributing computing node cluster

Country Status (1)

Country Link
CN (1) CN109165093B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070240161A1 (en) * 2006-04-10 2007-10-11 General Electric Company System and method for dynamic allocation of resources in a computing grid
US20130263117A1 (en) * 2012-03-28 2013-10-03 International Business Machines Corporation Allocating resources to virtual machines via a weighted cost ratio
CN102759984A (en) * 2012-06-13 2012-10-31 上海交通大学 Power supply and performance management system for virtualization server cluster
CN103699447A (en) * 2014-01-08 2014-04-02 北京航空航天大学 Cloud computing-based transcoding and distribution system for video conference
US20150200867A1 (en) * 2014-01-15 2015-07-16 Cisco Technology, Inc. Task scheduling using virtual clusters
TW201621698A (en) * 2014-12-09 2016-06-16 英業達股份有限公司 Method of resource allocation in a server system
CN104407912A (en) * 2014-12-25 2015-03-11 无锡清华信息科学与技术国家实验室物联网技术中心 Virtual machine configuration method and device
CN104951372A (en) * 2015-06-16 2015-09-30 北京工业大学 Method for dynamic allocation of Map/Reduce data processing platform memory resources based on prediction
CN105487930A (en) * 2015-12-01 2016-04-13 中国电子科技集团公司第二十八研究所 Task optimization scheduling method based on Hadoop
CN105897616A (en) * 2016-05-17 2016-08-24 腾讯科技(深圳)有限公司 Resource allocation method and server
CN106790636A (en) * 2017-01-09 2017-05-31 上海承蓝科技股份有限公司 A kind of equally loaded system and method for cloud computing server cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YONGZHONG ZHANG 等: "The Revenues Driven Resource Allocation Algorithm for Cluster-based Web Server", 《2006 6TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION》 *
荀亚玲 等: "MapReduce集群环境下的数据放置策略", 《软件学报》 *
薛胜军等: "云环境下公平性优化的资源分配方法", 《计算机应用》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096349A (en) * 2019-04-10 2019-08-06 山东科技大学 A kind of job scheduling method based on the prediction of clustered node load condition
CN110597626A (en) * 2019-08-23 2019-12-20 第四范式(北京)技术有限公司 Method, device and system for allocating resources and tasks in distributed system
CN110597626B (en) * 2019-08-23 2022-09-06 第四范式(北京)技术有限公司 Method, device and system for allocating resources and tasks in distributed system
CN110705893A (en) * 2019-10-11 2020-01-17 腾讯科技(深圳)有限公司 Service node management method, device, equipment and storage medium
CN113094243A (en) * 2020-01-08 2021-07-09 北京小米移动软件有限公司 Node performance detection method and device
CN111399976A (en) * 2020-03-02 2020-07-10 上海交通大学 GPU virtualization implementation system and method based on API redirection technology
CN111381969B (en) * 2020-03-16 2021-10-26 北京康吉森技术有限公司 Management method and system of distributed software
CN111381969A (en) * 2020-03-16 2020-07-07 北京隆普智能科技有限公司 Management method and system of distributed software
CN111813545A (en) * 2020-06-29 2020-10-23 北京字节跳动网络技术有限公司 Resource allocation method, device, medium and equipment
CN111885158A (en) * 2020-07-22 2020-11-03 曙光信息产业(北京)有限公司 Cluster task processing method and device, electronic equipment and storage medium
CN111885158B (en) * 2020-07-22 2023-05-02 曙光信息产业(北京)有限公司 Cluster task processing method and device, electronic equipment and storage medium
CN114780225A (en) * 2022-06-14 2022-07-22 支付宝(杭州)信息技术有限公司 Distributed model training system, method and device
CN114780225B (en) * 2022-06-14 2022-09-23 支付宝(杭州)信息技术有限公司 Distributed model training system, method and device
CN115495231A (en) * 2022-08-09 2022-12-20 徐州医科大学 Dynamic resource scheduling method and system under complex scene of high concurrent tasks
CN115495231B (en) * 2022-08-09 2023-09-19 徐州医科大学 Dynamic resource scheduling method and system under high concurrency task complex scene
CN115297018A (en) * 2022-10-10 2022-11-04 北京广通优云科技股份有限公司 Operation and maintenance system load prediction method based on active detection
CN115297018B (en) * 2022-10-10 2022-12-20 北京广通优云科技股份有限公司 Operation and maintenance system load prediction method based on active detection
CN115756822A (en) * 2022-10-18 2023-03-07 超聚变数字技术有限公司 Method and system for optimizing performance of high-performance computing application
CN115756822B (en) * 2022-10-18 2024-03-19 超聚变数字技术有限公司 Method and system for optimizing high-performance computing application performance
CN115794420A (en) * 2023-02-07 2023-03-14 飞天诚信科技股份有限公司 Dynamic management method, device and medium for service node resource allocation

Also Published As

Publication number Publication date
CN109165093B (en) 2022-07-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant