CN109165093A - Elastic allocation system and method for a compute node cluster - Google Patents
- Publication number
- CN109165093A (CN201810857293.3A)
- Authority
- CN
- China
- Prior art keywords
- compute node
- resource
- task
- module
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Abstract
The present invention relates to an elastic allocation system and method for a compute node cluster. Using an elastic compute node allocation mechanism, the invention estimates compute node resource allocation from the task's resource requirements and from how historical tasks used compute node resources. Provided the requirements are met, allocation is controlled dynamically during the computation stage, which improves job response speed and compute node resource utilization. The results of past estimates and runs are fed back into the next estimate, achieving balanced resource allocation in a cloud computing environment and improving the overall efficiency of the system.
Description
Technical field
The present invention relates to the field of dynamic resource allocation and management for cloud computing cluster servers, and in particular to an elastic allocation system and method for a compute node cluster.
Background art
With the development of the computer field, cloud computing has advanced especially rapidly. Through computer and network technologies such as distributed computing, parallel computing, virtualization, and load balancing, cloud computing provides users with convenient, fast, and secure data storage and network services.
Most operations in deep learning are vectorized matrix operations. As a graphics accelerator, a GPU provides a large number of arithmetic cores for rendering, and these cores can equally be used to accelerate vectorized matrix operations, so in recent years deep learning has relied heavily on GPUs for model training. As demand grows, more and more cloud platforms offer GPUs to users as a compute node resource.
However, because of the hardware particularities of compute node resources, cloud compute node resources are usually provided to users in exclusive mode, and this allocation is one-way and static, which easily leads to overloaded compute node resources and a poor user experience.
In exclusive mode, each compute node resource can hardly deliver its full performance. A fixed resource allocation scheme can hardly match the demands of different users' different task requests efficiently. Moreover, when compute node resources are allocated only once, the initially allocated resources may not satisfy the user's computing needs once the user actually submits a task and it starts running. To solve these problems, it is necessary to design an elastic allocation system and method for a compute node cluster.
Summary of the invention
To overcome the above shortcomings, the object of the present invention is to provide an elastic allocation system and method for a compute node cluster. The invention uses an elastic compute node allocation mechanism: it analyzes compute node status information and estimates the compute node state from the task's resource requirements, and, provided the requirements are met, controls allocation dynamically during the computation stage, thereby improving job response speed and compute node resource utilization.
The present invention achieves the above object through the following technical solution: an elastic allocation system for a compute node cluster, comprising a user module, a compute node management module, a compute node resource module, and a storage server. The user module provides the user login port and the entry point for user task request information. The compute node resource module contains modular compute node cluster resources for executing users' computing tasks. The storage server stores operational data and operation logs. The compute node management module comprises an authentication module, a task resource estimation module, a compute node control module, and a compute node status monitoring module. The authentication module obtains user login information and task request information from the user module and, after verification, sends them to the task resource estimation module. The task resource estimation module receives the user login information and task request information sent by the authentication module and estimates compute resource node usage from the task description and selected parameters submitted by the user. The compute node control module regulates and allocates the compute node resource module according to the usage estimate sent by the task resource estimation module; it also receives compute node status warning information sent directly by the compute node monitoring module and handles it. The compute node monitoring module stores a node information table; it periodically collects status information from the compute node resources and generates a new node information table in order to monitor them. The node information table includes the compute node ID, CPU usage, memory usage, disk usage, I/O usage, network bandwidth, and so on.
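As an illustration of the node information table described above, the following is a minimal sketch in Python 3 (an assumption; the embodiment itself names Python 2.7). The class name `NodeRecord`, the function `build_node_table`, the field names, the value ranges, and the bandwidth unit are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """One row of the node information table (field names are illustrative)."""
    node_id: str
    cpu_usage: float          # assumed to be a fraction in [0, 1]
    memory_usage: float
    disk_usage: float
    io_usage: float
    network_bandwidth: float  # assumed unit: Mbit/s


def build_node_table(samples):
    """Build a fresh node information table, keyed by node ID, from raw samples."""
    return {s["node_id"]: NodeRecord(**s) for s in samples}


# Two hypothetical status samples, as the monitoring module might collect them.
samples = [
    {"node_id": "node-1", "cpu_usage": 0.42, "memory_usage": 0.61,
     "disk_usage": 0.30, "io_usage": 0.12, "network_bandwidth": 850.0},
    {"node_id": "node-2", "cpu_usage": 0.91, "memory_usage": 0.77,
     "disk_usage": 0.55, "io_usage": 0.40, "network_bandwidth": 120.0},
]
table = build_node_table(samples)
```

Each collection cycle would produce a new table of this kind, which can then be compared against stored warning values.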
Preferably, the authentication module obtains and stores user information from the user module and, when a user logs in and submits task request information, verifies whether the user identity and the task request are legitimate; if not, the request is rejected; if verification passes, the user login information and task request information are sent to the task resource estimation module.
Preferably, the task resource estimation module stores a compute node historical resource allocation table recording the resource allocation of each requested task; the resources necessary to run a task are inferred from this table. For a new task with no execution history to refer to, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
Preferably, the compute node monitoring module also stores a compute node status warning table containing warning values for compute node status. When compute node status information reaches a warning value, compute node status warning information is sent directly to the compute node control module. The compute node control module receives the warning information sent directly by the compute node monitoring module, judges whether the task is abnormal, issues an abnormality alert, and actively terminates abnormal tasks; for non-abnormal tasks, it analyzes the resource occupancy of each compute node executing the task and reallocates resources to the compute nodes.
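The comparison of collected status information against stored warning values could look like the following minimal Python sketch. The function name `exceeded_metrics`, the metric names, and the numeric warning values are hypothetical, and treating a value equal to the warning value as a hit (`>=`) is an assumption about what "reaches a warning value" means.

```python
def exceeded_metrics(status, warning_values):
    """Return the sorted names of metrics whose value reaches its warning value."""
    return sorted(
        name for name, limit in warning_values.items()
        if status.get(name, 0.0) >= limit  # assumption: equality also triggers
    )


# Hypothetical warning table and one collected status sample.
warning_values = {"cpu_usage": 0.90, "memory_usage": 0.85, "disk_usage": 0.95}
status = {"cpu_usage": 0.93, "memory_usage": 0.40, "disk_usage": 0.95}

alerts = exceeded_metrics(status, warning_values)
```

A non-empty result would correspond to the case in which warning information is forwarded to the compute node control module.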
The present invention also provides an elastic allocation method for a compute node cluster, comprising the following steps:
(1) verify the legitimacy of the user identity; if verification passes, accept the task; otherwise terminate the process directly;
(2) estimate the compute node resource allocation from the task description in the task request information;
(3) allocate compute node resources according to the estimated values;
(4) after the task starts running, periodically collect status information from the compute node resources and judge whether compute node resource utilization exceeds the warning threshold; if it does, further judge whether the task's running state is abnormal, and if it is not abnormal, dynamically reallocate compute node resources; the warning threshold is preset;
(5) after the task ends, release the compute node resources it occupied.
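Steps (1) to (5) can be read as a simple control flow. The sketch below, in Python, shows one way to wire them together; the function `run_task` and every callable passed to it are hypothetical stand-ins for the modules described in this document, not an interface defined by the patent.

```python
def run_task(request, authenticate, estimate, allocate, monitor, release):
    """Drive one task through steps (1)-(5); all callables are hypothetical."""
    if not authenticate(request):                     # step (1): verify identity
        return "rejected"
    estimated = estimate(request["task_description"])  # step (2): estimate
    allocation = allocate(estimated)                  # step (3): allocate
    try:
        outcome = monitor(allocation)                 # step (4): run and monitor
    finally:
        release(allocation)                           # step (5): always release
    return outcome


# Toy stand-ins that only record the order of events.
events = []
result = run_task(
    {"user": "alice", "task_description": {"task_type": 1}},
    authenticate=lambda req: req["user"] == "alice",
    estimate=lambda desc: {"cpu": 0.5},
    allocate=lambda est: dict(est),
    monitor=lambda alloc: events.append("monitored") or "finished",
    release=lambda alloc: events.append("released"),
)
```

Placing the release in a `finally` clause is a design choice of this sketch: it ensures occupied resources are freed even if monitoring raises an error.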
Preferably, in step (2), the estimate is made from a historical node information table built from the experience accumulated in previously run tasks. The compute node resource allocation is estimated as follows: the parameters in the task description that relate to resource configuration are extracted to form a task vector X, and the compute node resources allocated in response form the resource vector Y. When a new task description is submitted, a configuration file containing the task description parameters is generated, and the parameters relating to resource configuration are extracted from it to form the task vector Xnew. A clustering algorithm is used to find the d historical resource records nearest to the new task vector as samples. If one of these d historical samples duplicates the new task, the resources allocated to that historical sample's task vector are assigned directly to the new task request. If no identical historical resource record exists, a linear regression is fitted to the new task vector over the d historical samples, yielding the weight of each task description parameter vector among the d samples; the historical resource allocations are weighted by the fitted parameters and a certain margin is added, giving the computing resources needed by the new request task. Each dimension of the task vector X represents one attribute of the task; each dimension of the resource vector Y represents the running state of the compute node executing the task, where the running state refers to CPU utilization, memory utilization, disk utilization, I/O utilization, network utilization, and so on.
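The first part of this estimate, selecting the d nearest historical records and reusing an exact match, might be sketched as follows in Python 3.8+. The patent only says a clustering algorithm is used; the plain Euclidean nearest-neighbour search here is an assumption, and the names `nearest_records` and `reuse_exact_match` and the sample data are hypothetical.

```python
import math


def nearest_records(x_new, history, d):
    """Return the d (X, Y) records whose task vector is closest to x_new."""
    return sorted(history, key=lambda rec: math.dist(x_new, rec[0]))[:d]


def reuse_exact_match(x_new, neighbours):
    """Return the stored resource vector if a neighbour equals x_new, else None."""
    for x, y in neighbours:
        if list(x) == list(x_new):
            return list(y)
    return None


# Hypothetical history: (task vector X, allocated resource vector Y) pairs.
history = [
    ([1, 10], [0.20, 0.30]),
    ([2, 20], [0.40, 0.50]),
    ([8, 80], [0.90, 0.90]),
    ([2, 22], [0.45, 0.50]),
]

neighbours = nearest_records([2, 20], history, d=2)
reused = reuse_exact_match([2, 20], neighbours)      # exact match is found
no_match = reuse_exact_match([3, 30], nearest_records([3, 30], history, d=2))
```

When `reuse_exact_match` returns `None`, the estimate would fall through to the regression-based weighting described in the text.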
Preferably, if during step (2) the estimate cannot be made from the historical node information table, it is based on the resources the user actively requests or on the maximum resources the system can provide.
Preferably, step (4) is specifically as follows: status information is collected periodically from the compute node resources and a new node information table is generated to monitor them. The new node information is compared with the data in the compute node status warning table stored in advance in the compute node monitoring module to determine whether any resource utilization exceeds its warning threshold. If a warning threshold is exceeded, it is further judged whether the task's running state is abnormal. If the running state is judged abnormal, the abnormal task is actively terminated, the resources it occupies are released, and a prompt reports that the task failed because of insufficient resources, offering the choice of re-estimating the required resources and arranging a suitable node for a retry. If it is not abnormal, the compute node status warning information is sent to the compute node control module, which dynamically schedules compute node resources: based on the latest compute node state, the dynamic scheduling method reallocates compute node resources with a certain margin and records the node status information.
Preferably, while a task is running, the user may choose to go offline temporarily or to end the process. If going offline temporarily is chosen, the running process is suspended, and after the next login the previous compute node resources are matched automatically and the previous process continues. If ending the process is chosen, the occupied compute node resources are released, and the compute node historical resource allocation information for the task is stored in the compute node historical resource allocation table of the task resource estimation module; at the next login, compute node resources must be reallocated according to the newly submitted task request.
Preferably, in step (5), after each task ends, the compute node historical resource allocation information for the task is stored in the compute node historical resource allocation table of the task resource estimation module.
The beneficial effects of the present invention are as follows: because this computing platform targets a specialized field, the same task may be executed repeatedly. The invention therefore estimates compute node resource allocation from the historical compute node information table and the task's resource requirements and, provided the requirements are met, controls allocation dynamically during the computation stage, improving job response speed and compute node resource utilization. During estimation, the results of previous estimates and runs are fed back into the next estimate, achieving balanced resource allocation in a cloud computing environment and improving the overall efficiency of the system.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the allocation system of the present invention;
Fig. 2 is a flow diagram of the allocation method of the present invention;
Fig. 3 is a flow diagram of the node resource estimation method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment, but the scope of protection of the invention is not limited thereto:
Embodiment: as shown in Fig. 1, an elastic allocation system for a compute node cluster comprises a user module, a compute node management module, a compute node resource module, and a storage server. The compute node management module comprises an authentication module, a task resource estimation module, a compute node control module, and a compute node monitoring module.
The user module provides the user login port and the entry point through which users submit tasks.
The authentication module obtains user login information from the user module and, when a user logs in and submits task request information, verifies whether the user identity and the task request are legitimate. If not, the request is rejected; if verification passes, the user login information and task request information are sent to the task resource estimation module.
The task resource estimation module receives the user login information and user task request sent by the authentication module. It stores a compute node historical resource allocation table covering each historical task. After each task ends, the compute node resource allocation information for that task is used to update the table.
Each time a user submits a new task, a task resource estimate is triggered, based on the task description and selected parameters submitted by the user. In general, the same task may be executed repeatedly. The compute node resources needed for each execution, such as CPU, memory, GPU, I/O, network bandwidth, and time, may be the same or may differ because of differences in configuration or data. The task estimation module infers the resources necessary to run the task from the compute node historical resource allocation table, querying the historical node information table database, built from the experience accumulated in previously run tasks, for the expected GPU occupancy, memory, I/O, and network capacity. For a new task with no execution history to refer to, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
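The fallback order described above (history first, then the user's own request, then the system maximum) can be sketched in Python as follows. The function `estimate_or_fallback`, the task key, and the resource names are hypothetical, and reusing the most recent historical record is a simplifying assumption; the patent's own estimate from history is more elaborate.

```python
def estimate_or_fallback(task_key, history, system_max, user_request=None):
    """Pick an estimate: history, else the user's request, else the system maximum."""
    records = history.get(task_key)
    if records:
        return dict(records[-1])   # assumption: reuse the most recent allocation
    if user_request:
        return dict(user_request)  # new task: resources the user asked for
    return dict(system_max)        # new task, no request: maximum the system offers


# Hypothetical history table and system limits.
history = {"train-model": [{"gpu": 0.60, "memory": 0.50},
                           {"gpu": 0.70, "memory": 0.55}]}
system_max = {"gpu": 1.0, "memory": 1.0}

known = estimate_or_fallback("train-model", history, system_max)
requested = estimate_or_fallback("new-task", history, system_max,
                                 user_request={"gpu": 0.3, "memory": 0.2})
unknown = estimate_or_fallback("new-task", history, system_max)
```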
Compute node control module: it regulates and allocates the compute node resource module according to the compute resource node usage estimate sent by the task resource estimation module. It also receives compute node status warning information sent directly by the compute node monitoring module, judges whether the task is abnormal, issues an abnormality alert, and actively terminates abnormal tasks. For non-abnormal tasks, it analyzes the resource occupancy of each compute node executing the task and reallocates resources to the compute nodes.
The compute node monitoring module stores a node information table and a compute node status warning table. The node information table includes information such as compute node ID, CPU usage, memory usage, disk usage, I/O usage, and network bandwidth. The compute node status warning table contains warning values for compute node status.
The compute node monitoring module periodically collects status information from the compute node resources and generates a new node information table to monitor them. The new node information is compared with the data in the compute node status warning table stored in advance in the monitoring module to determine whether any resource utilization exceeds its warning threshold. If no warning value is exceeded, the compute node status information for the task is recorded and the task continues to run. When compute node status information reaches a warning value, compute node status warning information is sent directly to the compute node control module, which allocates compute node resources to the task or manages the task.
The compute node resource module contains modular compute node cluster resources for executing users' computing tasks.
The storage server stores operational data and operation logs.
In terms of software environment, the present invention requires each node to run the Ubuntu 16.04 operating system with development tools such as Python 2.7, sklearn 0.19.1, and pytorch 0.1.2 installed; in terms of environment, each node must be configured on the same network segment.
An elastic allocation method for a compute node cluster comprises the following steps:
(1) verify the legitimacy of the user identity; if verification passes, accept the task; otherwise terminate the process directly;
(2) estimate the compute node resource allocation from the task description in the task request information;
(3) allocate compute node resources according to the estimated values;
(4) after the task starts running, periodically collect status information from the compute node resources and judge whether compute node resource utilization exceeds the warning threshold; if it does, further judge whether the task's running state is abnormal, and if it is not abnormal, dynamically reallocate compute node resources; the warning threshold is preset;
(5) after the task ends, release the compute node resources it occupied.
The method is illustrated below with a specific example; as shown in Fig. 2, the method of the invention is as follows:
The user logs in from the client: the authentication module verifies the legitimacy of the user identity. If verification fails, the user's request is rejected. If verification passes, the system waits for the user to send a task request; if the task request is legitimate, a user resource request event is triggered, and the user login information and task request information are sent on to the compute node control module for the next step.
Task resource analysis: an estimate is made from the task description and selected parameters submitted by the user. In general, the same task may be executed repeatedly. The compute node resources needed for each execution, such as CPU, memory, GPU, I/O, network bandwidth, and time, may be the same or may differ because of differences in configuration or data. The task estimation module infers the resources necessary to run the task from the task's execution history, querying the historical node information table database, built from the experience accumulated in previously run tasks, for the expected GPU occupancy, memory, I/O, and network capacity. For a new task with no execution history to refer to, for which no estimate can be made, the estimate is based on the resources the user actively requests or on the maximum resources the system can provide.
Determining the compute node state values: the specific method relies on the information in the historical compute node information table. For each task request, after the user submits a task description, a configuration file containing the task description parameters is generated, and the M parameters relating to resource configuration are extracted from it to form an M-dimensional task vector X, each dimension of which represents one attribute of the task, such as task type, task size, or network resources. The compute node resources allocated for each task request can be regarded as an E-dimensional compute node resource allocation vector Y, each dimension of which represents one kind of resource, that is, the running state of the compute node executing the task, such as CPU utilization, memory utilization, disk utilization, I/O utilization, or network utilization. Suppose the compute node historical resource allocation table in the task resource estimation module holds N historical allocation records; this number grows as the system is used. Each historical allocation record contains an M-dimensional task description parameter vector X, denoted Xi, i = 1, 2, 3, ..., N. The task description parameters in the existing compute node historical resource allocation table can be regarded as an N x M matrix. Correspondingly, each record also has an E-dimensional compute node resource allocation vector Y; the computing resource states are denoted Yi, i = 1, 2, 3, ..., N. The compute node allocations in the existing compute node historical resource allocation table can be regarded as an N x E matrix.
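To make the N x M and N x E matrices concrete, the following Python sketch turns a list of history records into the two matrices. The record layout, the field lists `TASK_FIELDS` and `RESOURCE_FIELDS`, and the function `to_matrices` are hypothetical illustrations; here M = 3 and E = 2.

```python
# Hypothetical ordering of the M task attributes and E resource kinds.
TASK_FIELDS = ["task_type", "task_size", "network"]  # M = 3
RESOURCE_FIELDS = ["cpu", "memory"]                  # E = 2


def to_matrices(records):
    """Return (X, Y): an N x M task matrix and an N x E resource matrix."""
    X = [[rec["task"][f] for f in TASK_FIELDS] for rec in records]
    Y = [[rec["alloc"][f] for f in RESOURCE_FIELDS] for rec in records]
    return X, Y


# Two hypothetical history records (N = 2).
records = [
    {"task": {"task_type": 1, "task_size": 200, "network": 10},
     "alloc": {"cpu": 0.50, "memory": 0.40}},
    {"task": {"task_type": 2, "task_size": 800, "network": 40},
     "alloc": {"cpu": 0.85, "memory": 0.70}},
]
X, Y = to_matrices(records)
```

Row i of `X` corresponds to Xi and row i of `Y` to Yi in the notation of the text.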
The estimation process is shown in Fig. 3. After a new task request enters the task estimation module, a new configuration file containing the task description parameters is first generated from the new task description submitted by the user, and the M parameters relating to resource allocation are selected from it to form the task vector Xnew. A clustering algorithm such as KNN is used to find the d historical resource records nearest to the new task vector as samples, where the parameter state vectors describing the tasks are X_knnj, j = 1, 2, 3, ..., d, and the corresponding compute node resource allocation vectors are Y_knnj, j = 1, 2, 3, ..., d. If one of these d historical samples duplicates the new task, for example X_knnfit = Xnew, the resource vector Y_knnfit allocated to the historical record's task vector X_knnfit is assigned directly to the new task request. If no identical historical resource record exists, a linear regression is fitted to the new task vector over the d historical samples, giving $X_{new} \approx \sum_{t=1}^{d} k_t \, X\_knn_t$, where kt is the weight of the t-th task description parameter vector X_knnt obtained by the linear regression fitting algorithm. Finally, the historical resource allocations are weighted by the fitted parameters and a certain margin is added, $Y\_(new\_pred) = \sum_{t=1}^{d} k_t \, Y\_knn_t + \text{margin}$, and Y_(new_pred) is taken as the estimate of the computing resources needed by the new request task.
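The regression and weighting step can be sketched in pure Python as below. This is one possible reading, not the patent's implementation: the weights are fitted by ordinary least squares without an intercept, solved through the normal equations with a tiny ridge term for numerical stability, and the margin is applied as a multiplicative factor. The function names, the `ridge` and `margin` parameters, and the sample data are all assumptions.

```python
def solve(a, b):
    """Solve the square linear system a*k = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]           # partial pivoting
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    k = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        k[i] = (m[i][n] - sum(m[i][j] * k[j] for j in range(i + 1, n))) / m[i][i]
    return k


def fit_weights(x_new, xs, ridge=1e-9):
    """Least-squares weights k such that sum_t k[t] * xs[t] approximates x_new."""
    d = len(xs)
    gram = [[sum(p * q for p, q in zip(xs[i], xs[j])) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(p * q for p, q in zip(xs[i], x_new)) for i in range(d)]
    return solve(gram, rhs)


def predict_resources(x_new, xs, ys, margin=0.1):
    """Weight the neighbours' resource vectors by k and add a margin (assumed x(1+margin))."""
    k = fit_weights(x_new, xs)
    return [(1 + margin) * sum(k[t] * ys[t][j] for t in range(len(ys)))
            for j in range(len(ys[0]))]


# Hypothetical neighbours: d = 2 task vectors and their resource vectors.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [[0.2, 0.1], [0.4, 0.3]]
weights = fit_weights([0.5, 0.5], xs)
y_pred = predict_resources([0.5, 0.5], xs, ys, margin=0.1)
```

In this toy case the new task vector lies halfway between the two neighbours, so both weights come out close to 0.5 and the prediction is the average of the two resource vectors scaled up by the margin.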
Once the compute node resources for the user's task have been determined, compute node resources are allocated according to the estimated values.
The compute nodes run the task: the compute node monitoring module periodically collects status information from the compute node resources and generates a new node information table to monitor them. The new node information is compared with the data in the compute node status warning table stored in advance in the monitoring module to determine whether any resource utilization exceeds its warning threshold. If no warning value is exceeded, the compute node status information for the task is recorded and the task continues to run. If the warning threshold is exceeded, it is further judged whether the task's running state is abnormal.
If the task's running state is not abnormal, the compute node status warning information is sent to the compute node control module, which dynamically schedules compute node resources; finally the dynamic scheduling is written to the operation log.
Based on the latest compute node state X_new, computing resources are reallocated with a certain margin according to the dynamic resource scheduling method, and the node status information is recorded.
If the task is abnormal, the abnormal task is actively terminated, the resources it occupies are released, and the user is notified that the task failed because of insufficient resources; the user chooses whether to re-estimate the required resources and arrange a suitable node for a retry. Finally the exception handling operation is written to the operation log.
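The two branches above, reallocation for a non-abnormal task and termination for an abnormal one, can be sketched in Python as follows. The function `handle_warning`, the task dictionary layout, the return strings, and the multiplicative margin are hypothetical; how a task is judged abnormal is not specified here and is passed in as a flag.

```python
def handle_warning(task, observed_usage, is_abnormal, margin=0.2, log=None):
    """React to a status warning: terminate an abnormal task or reallocate."""
    log = [] if log is None else log
    if is_abnormal:
        task["state"] = "terminated"
        task["allocation"] = {}                       # release occupied resources
        log.append(("abnormal_terminated", task["id"]))
        return "retry_prompted"                       # user decides whether to retry
    # Non-abnormal: reallocate from the latest observed state plus a margin.
    task["allocation"] = {name: round(value * (1 + margin), 6)
                          for name, value in observed_usage.items()}
    log.append(("reallocated", task["id"]))
    return "reallocated"


# Hypothetical task whose CPU usage has exceeded its warning threshold.
task = {"id": "t1", "state": "running", "allocation": {"cpu": 4.0}}
operation_log = []
first = handle_warning(task, {"cpu": 5.0}, is_abnormal=False, log=operation_log)
allocation_after_first = dict(task["allocation"])
second = handle_warning(task, {"cpu": 5.0}, is_abnormal=True, log=operation_log)
```

Both branches append to the operation log, mirroring the requirement that dynamic scheduling and exception handling are written to the log.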
While a task is running, if the user exits, the user can choose to go offline temporarily or to end the process. If the user chooses to go offline temporarily, the running process is suspended, and the login information, compute node occupancy information, and running state are retained; after the user's next login, the previous compute node resources are matched automatically and the previous process continues. If the user chooses to end the process, the task process terminates, the occupied compute node resources are released, and the compute node historical resource allocation information for the task is stored in the compute node historical resource allocation table of the task resource estimation module as a reference for later compute node resource allocation. At the next login, compute node resources must be reallocated according to the newly submitted task request.
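The exit and login behaviour described above amounts to a small state machine. The following Python sketch shows one way to express it; the functions `on_user_exit` and `on_login`, the session fields, and the state names are hypothetical.

```python
def on_user_exit(session, choice, history_table):
    """Handle a user exit during a running task ('offline' or 'terminate')."""
    if choice == "offline":
        # Login info, node occupancy, and running state are kept as they are.
        session["state"] = "suspended"
    elif choice == "terminate":
        # Store the allocation as history, then release the resources.
        history_table.append({"task": session["task_vector"],
                              "alloc": session["allocation"]})
        session["allocation"] = None
        session["state"] = "terminated"
    else:
        raise ValueError("choice must be 'offline' or 'terminate'")
    return session


def on_login(session):
    """Resume a suspended task, or signal that a fresh allocation is needed."""
    if session["state"] == "suspended":
        session["state"] = "running"   # previous node resources are rematched
        return "resumed"
    return "needs_new_allocation"


history_table = []
session = {"user": "alice", "state": "running",
           "task_vector": [1, 200], "allocation": {"cpu": 0.5}}

on_user_exit(session, "offline", history_table)
after_offline = on_login(session)
on_user_exit(session, "terminate", history_table)
after_terminate = on_login(session)
```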
After each task ends, the compute node historical resource allocation information for the task is stored in the compute node historical resource allocation table of the task resource estimation module as a reference for future estimates. The user can choose whether to log out. If the user logs out, the login information is deleted from the authentication module database, and the next time the user submits a task request, the user must log in again for identity verification. If the user does not log out, the user still holds a legitimate identity on the system side and can continue to submit new task requests, for which node resources are allocated and computation is carried out.
The above describes specific embodiments of the present invention and the technical principles applied. Any change made according to the conception of the present invention, where the resulting function does not go beyond the spirit covered by the description and drawings, shall fall within the scope of protection of the invention.
Claims (10)
1. A calculate node cluster elastic distribution system, characterized by comprising: a user module, a calculate node management
module, a calculate node resource module, and a storage server; the user module provides a user login port and an entry for
user task request information; the calculate node resource module contains modular calculate node cluster resources for
executing the user's computing tasks; the storage server is used to store operational data and operation logs; the calculate
node management module comprises an authentication module, a task resource estimation module, a calculate node control module,
and a calculate node monitoring module; wherein the authentication module is used to obtain user login information and task
request information from the user module and, after verification, to send the user login information and task request
information to the task resource estimation module; the task resource estimation module is used to receive the user login
information and task request information sent from the authentication module, and to make a computing resource node usage
estimate according to the task description and selected parameters submitted by the user; the calculate node control module is
used to regulate and allocate the calculate node resource module according to the computing resource node usage estimate sent
from the task resource estimation module, and can also accept and process calculate node status early-warning information sent
directly by the calculate node monitoring module; the calculate node monitoring module stores a node information table and
periodically collects status information from the calculate node resources to generate a new node information table, thereby
monitoring the calculate node resources; wherein the node information table includes calculate node ID, CPU utilization, memory
utilization, disk utilization, I/O utilization, network bandwidth, and the like.
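As an illustration of the node information table listed in claim 1, the following minimal Python sketch models one row per calculate node and rebuilds the table from one round of collected samples. The names `NodeInfo` and `build_node_table` and the units (utilization as a fraction between 0 and 1, bandwidth in Mbit/s) are assumptions made for this example; the claim only enumerates the fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """One row of the node information table (field names are illustrative)."""
    node_id: str
    cpu_utilization: float     # fraction in use, 0.0 to 1.0 (assumed unit)
    memory_utilization: float  # fraction in use, 0.0 to 1.0 (assumed unit)
    disk_utilization: float    # fraction in use, 0.0 to 1.0 (assumed unit)
    io_utilization: float      # fraction in use, 0.0 to 1.0 (assumed unit)
    network_bandwidth: float   # Mbit/s (assumed unit)


def build_node_table(samples):
    """Generate a new node information table keyed by calculate node ID.

    `samples` is an iterable of NodeInfo rows from one collection round;
    a later sample for the same node replaces an earlier one.
    """
    return {row.node_id: row for row in samples}
```

Keying the table by node ID is one possible design choice; it makes the periodic refresh a simple replacement of each node's row.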
2. The calculate node cluster elastic distribution system according to claim 1, characterized in that: the authentication module
is used to obtain and save user information from the user module and, when the user logs in and submits task request
information, to verify whether the user identity and the task request are legal; if they are illegal, the request is refused;
if verification passes, the user login information and task request information are sent to the task resource estimation
module.
3. The calculate node cluster elastic distribution system according to claim 1 or 2, characterized in that: the task resource
estimation module stores a calculate node history resource allocation information table corresponding to the resource
allocation of each requested task; wherein the resources necessary to run a task are inferred from the calculate node history
resource allocation information table; for a new task that has no execution history available for reference, the estimate is
made according to the resources actively applied for by the user or according to the maximum resources the system can provide.
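The fallback for a task without usable history, as described in claim 3, can be sketched as follows. This is a minimal example under stated assumptions: the function name `fallback_estimate`, the dictionary representation of resources, and the capping of a user request at the system maximum are not taken from the patent text.

```python
def fallback_estimate(user_requested, system_max):
    """Estimate resources for a task that has no usable execution history.

    Both arguments map a resource name (e.g. "cpu_cores") to an amount.
    `user_requested` may be None or empty when the user applied for nothing.
    """
    if not user_requested:
        # No explicit application: fall back to the maximum the system offers.
        return dict(system_max)
    # Assumption: a request above the system maximum is capped, and a resource
    # the user did not mention defaults to the system maximum.
    return {name: min(user_requested.get(name, limit), limit)
            for name, limit in system_max.items()}
```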
4. The calculate node cluster elastic distribution system according to any one of claims 1 to 3, characterized in that: the
calculate node monitoring module also stores a calculate node status early-warning information table, which contains calculate
node status early-warning values; when calculate node status information reaches an early-warning value, calculate node status
early-warning information is sent directly to the calculate node control module; the calculate node control module accepts the
calculate node status early-warning information sent directly by the calculate node monitoring module, judges whether the task
is abnormal, issues an abnormality prompt, and actively terminates an abnormal task; for a non-abnormal task, it analyzes the
resource occupation status of each calculate node executing the task and reallocates resources to the calculate nodes.
5. A calculate node cluster elastic distribution method, characterized by comprising the following steps:
(1) verifying the legitimacy of the user identity; if verification passes, accepting the task; otherwise terminating the process directly;
(2) making a calculate node resource allocation estimate according to the task description in the task request information;
(3) allocating calculate node resources according to the estimated value;
(4) after the task starts running, periodically collecting status information from the calculate node resources and judging
whether the calculate node resource utilization exceeds an early-warning threshold; if the threshold is exceeded, further
judging whether the task running state is abnormal; if it is non-abnormal, dynamically reallocating the calculate node
resources; the early-warning threshold is preset;
(5) after the task ends, releasing the occupied calculate node resources.
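The five steps of claim 5 can be read as a simple control flow. The sketch below wires them together in Python; the function `run_task` and its callable parameters are hypothetical placeholders for the modules described in the system claims, not an interface defined by the patent.

```python
def run_task(request, is_legal, estimate, allocate, run_and_monitor, release):
    """Drive one task through steps (1) to (5).

    Every argument after `request` is a caller-supplied callable standing in
    for a module of the system (all names here are illustrative).
    """
    # Step (1): verify identity; stop immediately if verification fails.
    if not is_legal(request):
        return "rejected"
    # Step (2): estimate the resources the task needs.
    needed = estimate(request)
    # Step (3): allocate calculate node resources according to the estimate.
    allocation = allocate(needed)
    try:
        # Step (4): run the task while monitoring node status.
        return run_and_monitor(request, allocation)
    finally:
        # Step (5): release the occupied resources once the task has ended.
        release(allocation)
```

Placing the release in a `finally` clause is a design choice of this sketch: it ensures the resources are returned even if the task raises an error while running.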
6. The calculate node cluster elastic distribution method according to claim 5, characterized in that: in step (2), the estimate
is made from a history node information table created from the experience accumulated by previously run tasks, wherein the
method of estimating the calculate node resource allocation is: the parameters related to resource configuration in a task
description are extracted to form a task vector X, and the calculate node resources allocated for that task form a resource
vector Y; after a new task description is submitted, a configuration file containing the task description parameters is
generated, and the parameters related to resource configuration are extracted from it to form a task vector Xnew; a clustering
algorithm is used to find the d history resource records nearest to the new task vector as samples; if one of these d
historical samples duplicates the new task vector, the resources recorded for that history resource record are allocated
directly to the new task request; if no identical history resource record exists, a linear regression is fitted to the new
task vector from the d historical samples to obtain the weight of the task description parameter vector of each of the d
samples, the historical resource allocations are weighted by the fitted parameters, and a certain margin is added, giving the
computing resources required by the newly requested task; each dimension of the task vector X represents one attribute of the
task; each dimension of the resource vector Y represents the calculate node running state corresponding to executing the
task, wherein the calculate node running state refers to CPU utilization, memory utilization, disk utilization, I/O
utilization, network utilization, and the like.
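One possible reading of the estimation procedure in claim 6 is sketched below in Python. It is an interpretation under several assumptions: the nearest records are found with a plain Euclidean nearest-neighbour search standing in for the unspecified clustering algorithm, the "linear regression" is taken to mean a least-squares fit of Xnew as a weighted sum of the d sample task vectors, and a small `ridge` term is added purely for numerical stability. The function names and the default values of `d`, `margin`, and `ridge` are illustrative.

```python
import math


def nearest_records(history, x_new, d):
    """Return the d (X, Y) history records whose task vector X is closest to x_new."""
    return sorted(history, key=lambda rec: math.dist(rec[0], x_new))[:d]


def solve(a, b):
    """Solve the square system a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def estimate_resources(history, x_new, d=3, margin=0.2, ridge=1e-6):
    """Estimate the resource vector for x_new from (X, Y) history records."""
    samples = nearest_records(history, x_new, d)
    for x, y in samples:
        if list(x) == list(x_new):
            return list(y)  # identical task seen before: reuse its allocation
    k = len(samples)
    # Normal equations of min_w || sum_i w_i * X_i - x_new ||^2 (+ ridge * ||w||^2).
    gram = [[sum(p * q for p, q in zip(samples[i][0], samples[j][0]))
             + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    rhs = [sum(p * q for p, q in zip(samples[i][0], x_new)) for i in range(k)]
    w = solve(gram, rhs)
    dims = len(samples[0][1])
    # Weight the historical allocations and add the margin.
    return [(1 + margin) * sum(w[i] * samples[i][1][j] for i in range(k))
            for j in range(dims)]
```

For example, with history records `((1, 0), (10, 100))` and `((0, 1), (20, 50))`, a new task vector `(2, 1)` is fitted with weights close to 2 and 1, so the estimate before the margin is close to `(40, 250)`.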
7. The calculate node cluster elastic distribution method according to claim 6, characterized in that: if, during the execution
of step (2), an estimate cannot be made from the history node information table, the estimate is made according to the
resources actively applied for by the user or according to the maximum resources the system can provide.
8. The calculate node cluster elastic distribution method according to claim 5, characterized in that step (4) is specifically:
status information is collected periodically from the calculate node resources to generate a new node information table, and
the calculate node resources are monitored; the new node information is compared with the data stored in advance in the
calculate node status early-warning information table of the calculate node monitoring module to determine whether any
resource utilization exceeds the early-warning threshold; if the threshold is exceeded, it is further judged whether the task
running state is abnormal; if the task running state is judged to be abnormal, the abnormal task is actively terminated, the
occupied resources are released, the user is notified that the task failed because of insufficient resources, and the user
chooses whether to re-estimate the amount of required resources and arrange suitable nodes for a retry; if the state is
non-abnormal, calculate node status early-warning information is sent to the calculate node control module, and the calculate
node control module dynamically adjusts the calculate node resources, wherein the dynamic adjustment method is: according to
the latest state of the calculate node, calculate node resources with a certain margin are reallocated, and the node status
information is recorded.
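The decision logic of claim 8 can be illustrated with a short Python sketch. The function names, the dictionary layout of the status and threshold tables, and the interpretation of "a certain margin" as a multiplicative factor on the latest observed usage are assumptions for this example; how the task is judged abnormal is left to the caller.

```python
def exceeded_thresholds(node_status, warning_thresholds):
    """Return the names of resources whose utilization is above its warning threshold."""
    return [name for name, limit in warning_thresholds.items()
            if node_status.get(name, 0.0) > limit]


def handle_sample(node_status, warning_thresholds, task_is_abnormal,
                  current_usage, margin=0.2):
    """Decide what to do with one monitoring sample.

    Returns ("ok", None), ("terminate", None) or ("reallocate", new_allocation).
    """
    if not exceeded_thresholds(node_status, warning_thresholds):
        return ("ok", None)
    if task_is_abnormal:
        # Abnormal task: terminate it; the caller releases the resources and
        # offers the user a retry with a fresh estimate.
        return ("terminate", None)
    # Non-abnormal task: reallocate from the latest usage plus a margin
    # (the multiplicative form of the margin is an assumption).
    new_allocation = {name: used * (1 + margin)
                      for name, used in current_usage.items()}
    return ("reallocate", new_allocation)
```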
9. The calculate node cluster elastic distribution method according to claim 5, characterized in that: while a task is running,
the user may choose to go temporarily offline or to end the process; if the user chooses to go temporarily offline, the process
is suspended, and after the next login the previously used calculate node resources are matched automatically and the previous
process continues; if the user chooses to end the process, the occupied calculate node resources are released, and the
calculate node history resource allocation information corresponding to the task is stored in the calculate node history
resource allocation information table of the task resource estimation module; at the next login, calculate node resources must
be redistributed according to the newly submitted task request.
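The two user choices in claim 9 amount to a small state machine, sketched here in Python. The class `TaskSession`, its state names, and the use of a plain list as the history table are hypothetical and serve only to illustrate the difference between suspending and ending a task.

```python
class TaskSession:
    """Tracks one running task and the calculate node resources it occupies."""

    def __init__(self, task_id, allocation, history_table):
        self.task_id = task_id
        self.allocation = allocation
        self.history_table = history_table  # a list standing in for the history table
        self.state = "running"

    def go_offline(self):
        """Temporarily offline: suspend the task but keep its allocation."""
        self.state = "suspended"

    def resume(self):
        """Next login: continue with the allocation the task had before."""
        if self.state != "suspended":
            raise RuntimeError("only a suspended task can be resumed")
        self.state = "running"
        return self.allocation

    def end(self):
        """End the process: record the allocation as history, then release it."""
        self.history_table.append((self.task_id, self.allocation))
        self.allocation = None
        self.state = "ended"
```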
10. The calculate node cluster elastic distribution method according to claim 5, characterized in that: in step (5), after each
task ends, the calculate node history resource allocation information corresponding to the task is stored in the calculate
node history resource allocation information table of the task resource estimation module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810857293.3A CN109165093B (en) | 2018-07-31 | 2018-07-31 | System and method for flexibly distributing computing node cluster |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165093A (en) | 2019-01-08 |
CN109165093B (en) | 2022-07-19 |
Family
ID=64898439
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810857293.3A Active CN109165093B (en) | 2018-07-31 | 2018-07-31 | System and method for flexibly distributing computing node cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165093B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070240161A1 (en) * | 2006-04-10 | 2007-10-11 | General Electric Company | System and method for dynamic allocation of resources in a computing grid |
US20130263117A1 (en) * | 2012-03-28 | 2013-10-03 | International Business Machines Corporation | Allocating resources to virtual machines via a weighted cost ratio |
CN102759984A (en) * | 2012-06-13 | 2012-10-31 | 上海交通大学 | Power supply and performance management system for virtualization server cluster |
CN103699447A (en) * | 2014-01-08 | 2014-04-02 | 北京航空航天大学 | Cloud computing-based transcoding and distribution system for video conference |
US20150200867A1 (en) * | 2014-01-15 | 2015-07-16 | Cisco Technology, Inc. | Task scheduling using virtual clusters |
TW201621698A (en) * | 2014-12-09 | 2016-06-16 | 英業達股份有限公司 | Method of resource allocation in a server system |
CN104407912A (en) * | 2014-12-25 | 2015-03-11 | 无锡清华信息科学与技术国家实验室物联网技术中心 | Virtual machine configuration method and device |
CN104951372A (en) * | 2015-06-16 | 2015-09-30 | 北京工业大学 | Method for dynamic allocation of Map/Reduce data processing platform memory resources based on prediction |
CN105487930A (en) * | 2015-12-01 | 2016-04-13 | 中国电子科技集团公司第二十八研究所 | Task optimization scheduling method based on Hadoop |
CN105897616A (en) * | 2016-05-17 | 2016-08-24 | 腾讯科技(深圳)有限公司 | Resource allocation method and server |
CN106790636A (en) * | 2017-01-09 | 2017-05-31 | 上海承蓝科技股份有限公司 | A kind of equally loaded system and method for cloud computing server cluster |
Non-Patent Citations (3)
Title |
---|
YONGZHONG ZHANG 等: "The Revenues Driven Resource Allocation Algorithm for Cluster-based Web Server", 《2006 6TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION》 * |
荀亚玲 等: "MapReduce集群环境下的数据放置策略", 《软件学报》 * |
薛胜军等: "云环境下公平性优化的资源分配方法", 《计算机应用》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096349A (en) * | 2019-04-10 | 2019-08-06 | 山东科技大学 | A kind of job scheduling method based on the prediction of clustered node load condition |
CN110597626A (en) * | 2019-08-23 | 2019-12-20 | 第四范式(北京)技术有限公司 | Method, device and system for allocating resources and tasks in distributed system |
CN110597626B (en) * | 2019-08-23 | 2022-09-06 | 第四范式(北京)技术有限公司 | Method, device and system for allocating resources and tasks in distributed system |
CN110705893A (en) * | 2019-10-11 | 2020-01-17 | 腾讯科技(深圳)有限公司 | Service node management method, device, equipment and storage medium |
CN113094243A (en) * | 2020-01-08 | 2021-07-09 | 北京小米移动软件有限公司 | Node performance detection method and device |
CN111399976A (en) * | 2020-03-02 | 2020-07-10 | 上海交通大学 | GPU virtualization implementation system and method based on API redirection technology |
CN111381969B (en) * | 2020-03-16 | 2021-10-26 | 北京康吉森技术有限公司 | Management method and system of distributed software |
CN111381969A (en) * | 2020-03-16 | 2020-07-07 | 北京隆普智能科技有限公司 | Management method and system of distributed software |
CN111813545A (en) * | 2020-06-29 | 2020-10-23 | 北京字节跳动网络技术有限公司 | Resource allocation method, device, medium and equipment |
CN111885158A (en) * | 2020-07-22 | 2020-11-03 | 曙光信息产业(北京)有限公司 | Cluster task processing method and device, electronic equipment and storage medium |
CN111885158B (en) * | 2020-07-22 | 2023-05-02 | 曙光信息产业(北京)有限公司 | Cluster task processing method and device, electronic equipment and storage medium |
CN114780225A (en) * | 2022-06-14 | 2022-07-22 | 支付宝(杭州)信息技术有限公司 | Distributed model training system, method and device |
CN114780225B (en) * | 2022-06-14 | 2022-09-23 | 支付宝(杭州)信息技术有限公司 | Distributed model training system, method and device |
CN115495231A (en) * | 2022-08-09 | 2022-12-20 | 徐州医科大学 | Dynamic resource scheduling method and system under complex scene of high concurrent tasks |
CN115495231B (en) * | 2022-08-09 | 2023-09-19 | 徐州医科大学 | Dynamic resource scheduling method and system under high concurrency task complex scene |
CN115297018A (en) * | 2022-10-10 | 2022-11-04 | 北京广通优云科技股份有限公司 | Operation and maintenance system load prediction method based on active detection |
CN115297018B (en) * | 2022-10-10 | 2022-12-20 | 北京广通优云科技股份有限公司 | Operation and maintenance system load prediction method based on active detection |
CN115756822A (en) * | 2022-10-18 | 2023-03-07 | 超聚变数字技术有限公司 | Method and system for optimizing performance of high-performance computing application |
CN115756822B (en) * | 2022-10-18 | 2024-03-19 | 超聚变数字技术有限公司 | Method and system for optimizing high-performance computing application performance |
CN115794420A (en) * | 2023-02-07 | 2023-03-14 | 飞天诚信科技股份有限公司 | Dynamic management method, device and medium for service node resource allocation |
Also Published As
Publication number | Publication date |
---|---|
CN109165093B (en) | 2022-07-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||