CN118138591A - Node management method, device, equipment and storage medium - Google Patents

Node management method, device, equipment and storage medium Download PDF

Info

Publication number
CN118138591A
Authority
CN
China
Prior art keywords
node
preset
current
working
node management
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410553670.XA
Other languages
Chinese (zh)
Inventor
张瑞
李健
郭运起
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OP Retail Suzhou Technology Co Ltd
Original Assignee
OP Retail Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OP Retail Suzhou Technology Co Ltd filed Critical OP Retail Suzhou Technology Co Ltd
Priority to CN202410553670.XA priority Critical patent/CN118138591A/en
Publication of CN118138591A publication Critical patent/CN118138591A/en
Pending legal-status Critical Current

Classifications

    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04L  TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00  Network arrangements or protocols for supporting network services or applications
    • H04L 67/01  Protocols
    • H04L 67/10  Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001  Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004  Server selection for load balancing
    • H04L 67/1008  Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04L  TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00  Network arrangements or protocols for supporting network services or applications
    • H04L 67/50  Network services
    • H04L 67/56  Provisioning of proxy services
    • H04L 67/566  Grouping or aggregating service requests, e.g. for unified processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a node management method, device, equipment and storage medium, which relate to the field of node management and comprise the following steps: a scheduling center in a preset scheduling group acquires the running states of all working nodes currently in the preset scheduling group, and determines the current node utilization rate based on the running states of all the working nodes; a node management policy is determined based on the current node usage; if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, a node management request is sent to the cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request; the cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface. The application realizes dynamic expansion and contraction of the working nodes based on the node utilization rate without manual intervention, ensures the timeliness of working tasks and achieves reasonable utilization of resources.

Description

Node management method, device, equipment and storage medium
Technical Field
The present invention relates to the field of node management, and in particular, to a method, an apparatus, a device, and a storage medium for node management.
Background
A cloud server may include one or more GPUs (Graphics Processing Units). Each graphics processor corresponds to its own video memory, i.e., to one working node, and the video memory needed to deploy a single large language model is relatively large, so a working node can typically host only one model. Moreover, when a single working node runs a single model, executing two tasks concurrently generally takes longer in total than executing the two tasks in series, so in general one working node performs inference for only one working task at a time. For a fixed number of cloud servers, i.e., a fixed number of working nodes, task inference takes longer when facing instantaneous highly concurrent working tasks, and resources are easily wasted when facing fewer working tasks.
Disclosure of Invention
Accordingly, the present invention aims to provide a node management method, device, equipment and storage medium, which can realize dynamic expansion and contraction of working nodes based on the node utilization rate without manual intervention, ensure the timeliness of working tasks and achieve reasonable utilization of resources. The specific scheme is as follows:
In a first aspect, the present application provides a node management method, applied to a scheduling center in a preset scheduling group, where the preset scheduling group further includes a plurality of working nodes; wherein the method comprises the following steps:
Acquiring the running states of all current working nodes, and determining the current node utilization rate based on the running states of all the working nodes;
determining a node management policy based on the current node usage;
if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, sending a node management request to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request;
The cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
Optionally, acquiring the running states of all current working nodes and determining the current node utilization rate based on the running states of all the working nodes includes:
Acquiring the running states of all current working nodes, and determining a target working node with the busy running state from all the working nodes;
and determining the current node utilization rate based on the number of target working nodes and the total number of all working nodes.
Optionally, the obtaining the running states of all current working nodes and determining the target working node with the busy running state from all the working nodes includes:
Acquiring the running states of all current working nodes sent by the current working nodes based on a preset heartbeat rate, and updating a node index table in a preset cache based on the running states of all the working nodes; the node index table is an index table which is constructed in advance based on each working node and the corresponding running state;
and carrying out index traversal on the updated node index table to determine a target working node with a busy running state from all the working nodes.
Optionally, the determining a node management policy based on the current node usage includes:
Judging whether the current node utilization rate is within a preset utilization rate range;
if the current node utilization rate is larger than the upper limit value of the preset utilization rate range, determining a node capacity expansion strategy as a node management strategy;
And if the current node utilization rate is smaller than the lower limit value of the preset utilization rate range, determining a node contraction strategy as the node management strategy.
Optionally, the determining a node management policy based on the current node usage includes:
Acquiring node utilization rates of any two different moments in a first preset historical time period which is adjacent to the current moment and contains the current moment;
determining a node utilization rate trend value based on the node utilization rates at any two different moments, and judging whether the node utilization rate trend value is in a preset trend value range or not;
if the node utilization rate trend value is larger than the upper limit value of the preset trend value range, determining a node capacity expansion strategy as a node management strategy;
If the node utilization rate trend value is smaller than the lower limit value of the preset trend value range, determining a node contraction strategy as the node management strategy;
The node utilization rates at any two different moments comprise the current node utilization rate corresponding to the current moment.
Optionally, the determining a node management policy based on the current node usage includes:
Acquiring historical node utilization rates of at least two different moments in a second preset historical time period which is adjacent to the current moment and does not contain the current moment;
Determining a node utilization trend value based on each historical node utilization and the current node utilization respectively;
If the number of the trend values which are larger than the upper limit of the preset trend values in all the node utilization trend values is larger than the preset number, determining a node capacity expansion strategy as a node management strategy;
and if the number of the trend values smaller than the lower limit of the preset trend value among all the node utilization trend values is larger than the preset number, determining a node contraction strategy as the node management strategy.
Optionally, if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, sending a node management request to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request, includes:
if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded, sending a node addition request to the cloud provider abstraction layer, so that the cloud provider abstraction layer calls the cloud provider corresponding to the node addition request to create a corresponding newly added cloud server in the preset scheduling group through the unified interface; after the newly added cloud server is started based on a preset boot image containing a scheduling agent program, the scheduling agent program in the newly added cloud server determines each newly added working node based on each local graphics processor, and sends a node registration request determined based on each newly added working node to the scheduling center, so that the scheduling center registers information of each newly added working node locally based on the node registration request;
If the node management policy indicates that the working nodes in the preset scheduling group need to be contracted, sending a server offline request to a corresponding target cloud server, and, after a scheduling agent program in the target cloud server closes each working node in the target cloud server based on the server offline request, sending a server destruction instruction to the cloud provider abstraction layer, so that the cloud provider abstraction layer destroys the target cloud server in the preset scheduling group.
In a second aspect, the present application provides a node management apparatus, which is applied to a scheduling center in a preset scheduling group, where the preset scheduling group further includes a plurality of working nodes; wherein the device comprises:
The utilization rate determining module is used for obtaining the running states of all the current working nodes and determining the utilization rate of the current node based on the running states of all the working nodes;
A policy determining module, configured to determine a node management policy based on the current node usage;
The node management module is used for sending a node management request to a cloud provider abstraction layer if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request;
The cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program;
And a processor for executing the computer program to implement the aforementioned node management method.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the aforementioned node management method.
In the application, a scheduling center in a preset scheduling group acquires the running states of all current working nodes, and determines the current node utilization rate based on the running states of all the working nodes; determines a node management policy based on the current node usage; and, if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, sends a node management request to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request; the cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface. In this way, the cloud provider abstraction layer is connected with a plurality of cloud providers so as to support multiple different cloud providers, and is connected with the scheduling center through a unified interface, thereby shielding the differences among the cloud providers and improving the compatibility and applicability of node management; furthermore, the application realizes dynamic expansion and contraction of the working nodes based on the node utilization rate without manual intervention, which not only copes with instantaneous highly concurrent working tasks and ensures the timeliness of working tasks, but also achieves reasonable utilization of resources while ensuring node working efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of node management disclosed in the present application;
FIG. 2 is a diagram of a scheduling platform architecture in accordance with the present disclosure;
FIG. 3 is a schematic diagram of a node management device according to the present disclosure;
Fig. 4 is a block diagram of an electronic device according to the present disclosure.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a fixed number of cloud servers, i.e. a fixed number of working nodes, task inference takes longer when facing instantaneous highly concurrent working tasks, and resources are easily wasted when facing fewer working tasks. Therefore, the application provides a node management method, which can realize dynamic expansion and contraction of working nodes based on the node utilization rate without manual intervention, ensure the timeliness of working tasks and achieve reasonable utilization of resources.
Referring to fig. 1, the embodiment of the invention discloses a node management method which is applied to a scheduling center in a preset scheduling group, wherein the preset scheduling group also comprises a plurality of working nodes; wherein the method comprises the following steps:
Step S11, acquiring the running states of all the current working nodes, and determining the current node utilization rate based on the running states of all the working nodes.
In this embodiment, one working node corresponds to one GPU, and one cloud server may include one or more GPUs, that is, one cloud server may include one or more working nodes. The running state of a working node includes a busy state, an idle state, an alarm state and the like; the busy state indicates that the working node is processing a working task, the idle state indicates that the working node is not processing a working task, and the alarm state indicates that the working node is abnormal while processing a working task.
Specifically, a scheduling center in a preset scheduling group acquires the running states of all current working nodes in the scheduling group, and determines the target working nodes whose running state is busy from all the working nodes; the current node usage is then determined based on the number of target working nodes and the total number of all working nodes, i.e., current node usage = number of target working nodes / total number of all working nodes.
Further, the scheduling center can acquire the running states of all current working nodes, sent by the working nodes based on a preset heartbeat rate; the preset heartbeat rate may be set according to the actual requirement of the user, for example, to sending the running state once per second. The scheduling center then updates a node index table in a preset cache based on the running states of all the working nodes; the node index table is constructed in advance based on each working node and its corresponding running state. Further, the scheduling center performs an index traversal of the updated node index table to determine the target working nodes whose running state is busy from all the working nodes. In this way, the working nodes in a busy state can be quickly looked up through the node index table, so that the current node utilization rate can be quickly calculated.
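As an illustration of the heartbeat-driven bookkeeping described above, the following minimal Python sketch keeps the node index table as an in-memory mapping and computes the current node utilization rate as the ratio of busy nodes to all nodes. The class and method names (SchedulingCenter, on_heartbeat, current_node_usage) are assumptions for the example and are not prescribed by this application.

from enum import Enum

class NodeState(Enum):
    BUSY = "busy"     # node is processing a working task
    IDLE = "idle"     # node is not processing a working task
    ALARM = "alarm"   # node reported an abnormality while processing

class SchedulingCenter:
    def __init__(self):
        # Node index table kept in the preset cache: node_id -> running state.
        self.node_index = {}

    def on_heartbeat(self, node_id: str, state: NodeState) -> None:
        # Each working node reports its running state at the preset heartbeat
        # rate (for example once per second); the index table is updated in place.
        self.node_index[node_id] = state

    def current_node_usage(self) -> float:
        # Traverse the index table, count the nodes whose state is busy, and
        # divide by the total number of working nodes.
        total = len(self.node_index)
        if total == 0:
            return 0.0
        busy = sum(1 for s in self.node_index.values() if s is NodeState.BUSY)
        return busy / total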
Besides sending its running state to the scheduling center, a working node can also send GPU performance parameters, its network information, its node type, its memory usage and the like. Node types include fixed nodes and non-fixed nodes; a fixed node is a resident node that ensures the normal operation of the system and cannot be released or destroyed, while a non-fixed node can be released and destroyed. Capacity expansion of the working nodes in the preset scheduling group may involve either fixed or non-fixed nodes; contraction involves non-fixed nodes.
And step S12, determining a node management strategy based on the current node utilization rate.
In one embodiment, determining a node management policy based on the current node usage may include: judging whether the current node utilization rate is within a preset utilization rate range; the preset utilization rate range may be set by the user according to actual needs, for example, to 30%-80%. If the current node utilization rate is greater than the upper limit value of the preset utilization rate range, a node capacity expansion strategy is determined as the node management strategy; and if the current node utilization rate is smaller than the lower limit value of the preset utilization rate range, a node contraction strategy is determined as the node management strategy. It should be noted that the number of nodes to expand and the number of nodes to contract may be preset, or may be determined according to the actually received working tasks.
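A minimal sketch of the range-based decision above, assuming Python; the 30%-80% bounds are only the example values mentioned in this embodiment, and the returned strategy labels are illustrative.

def decide_policy(usage: float, lower: float = 0.30, upper: float = 0.80):
    # Returns the node management strategy implied by the current node usage.
    if usage > upper:
        return "expand"      # node capacity expansion strategy
    if usage < lower:
        return "contract"    # node contraction strategy
    return None              # usage within the preset range: no scaling action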
In another embodiment, determining a node management policy based on the current node usage may include: acquiring the node utilization rates at any two different moments within a first preset historical time period that is adjacent to and contains the current moment; the node utilization rates at the two moments may or may not include the current node utilization rate corresponding to the current moment. A node utilization trend value is then determined based on the node utilization rates at the two moments, i.e., node utilization trend value = difference between the node utilization rates at the two moments / time difference between the two moments; it should be noted that the same pair of moments is used for both the utilization difference and the time difference, i.e., if the utilization difference is taken between moment a and moment b, the time difference is also taken between moment a and moment b. After the node utilization trend value is obtained, it is judged whether the trend value lies within a preset trend value range; if the trend value is greater than the upper limit of the preset trend value range, a node capacity expansion strategy is determined as the node management strategy; if the trend value is smaller than the lower limit of the preset trend value range, a node contraction strategy is determined as the node management strategy.
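The two-moment trend computation can be sketched as follows (Python); the sketch assumes a signed ratio of the utilization difference to the time difference, and the function names and trend bounds are illustrative assumptions.

def usage_trend(usage_a: float, t_a: float, usage_b: float, t_b: float) -> float:
    # Trend value = utilization difference / time difference, taken over the
    # same pair of distinct moments inside the first preset historical time period.
    return (usage_a - usage_b) / (t_a - t_b)

def decide_by_trend(trend: float, lower: float, upper: float):
    # Compare the trend value with the preset trend value range.
    if trend > upper:
        return "expand"      # node capacity expansion strategy
    if trend < lower:
        return "contract"    # node contraction strategy
    return None              # trend within the preset range: keep the current scale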
In yet another embodiment, determining a node management policy based on the current node usage may include: acquiring the historical node utilization rates of at least two different moments within a second preset historical time period that is adjacent to the current moment and does not contain the current moment; determining a corresponding node utilization trend value based on each historical node utilization rate and the current node utilization rate respectively; if the number of trend values greater than the upper limit of the preset trend value among all the node utilization trend values is larger than the preset number, a node capacity expansion strategy is determined as the node management strategy; if the number of trend values smaller than the lower limit of the preset trend value among all the node utilization trend values is larger than the preset number, a node contraction strategy is determined as the node management strategy. It should be noted that the preset number is determined based on the total number of node utilization trend values, for example, half of that total.
For example, suppose the current node utilization rate is a1, the historical node utilization rate 10 seconds before the current moment is b1, 30 seconds before is c1, 100 seconds before is d1, and 300 seconds before is e1. The node utilization trend value E1 over 300 seconds is calculated as |a1-e1|/300, the trend value D1 over 100 seconds as |a1-d1|/100, the trend value C1 over 30 seconds as |a1-c1|/30, and the trend value B1 over 10 seconds as |a1-b1|/10. If B1, C1, D1 and E1 are all greater than the upper limit of the preset trend value, the node capacity expansion strategy is determined as the node management strategy.
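A hedged sketch of this multi-moment variant, assuming Python; the sampling offsets (10/30/100/300 seconds) follow the example above, and the choice of half of all trend values as the preset number is only one possible setting.

def decide_by_history(current: float,
                      history: list[tuple[float, float]],
                      upper: float, lower: float):
    # history holds (seconds_before_now, historical_usage) pairs, e.g.
    # [(10, b1), (30, c1), (100, d1), (300, e1)] as in the example above.
    trends = [abs(current - past) / dt for dt, past in history if dt > 0]
    preset_count = len(trends) // 2   # assumed: half of all trend values
    if sum(t > upper for t in trends) > preset_count:
        return "expand"               # node capacity expansion strategy
    if sum(t < lower for t in trends) > preset_count:
        return "contract"             # node contraction strategy
    return None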
Step S13, if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, a node management request is sent to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request; the cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
In this embodiment, if the node management policy is a node capacity expansion policy, that is, the node management policy indicates that the working nodes in the preset scheduling group need to be expanded, the scheduling center sends a node addition request to the cloud provider abstraction layer. The cloud provider abstraction layer then calls the cloud provider corresponding to the node addition request to create a corresponding new cloud server (ECS, Elastic Compute Service) in the preset scheduling group through the unified interface, where the node addition request includes the GPU specification, the required memory capacity, the CPU (Central Processing Unit) specification and the like; the cloud provider abstraction layer then starts the newly added cloud server using a preset boot image containing a scheduling agent program; the preset boot image carries a Docker container, and the Docker container contains the scheduling agent program and other default programs. After the newly added cloud server is started, the scheduling agent program in the newly added cloud server determines each newly added working node based on each local GPU, and sends a node registration request determined based on each newly added working node to the scheduling center. After receiving the node registration request, the scheduling center locally registers the information of each newly added working node based on the node registration request, and loads a corresponding large language model for each newly added working node after registration is completed; considering that the first use of a model takes longer, after the large language model is loaded onto the corresponding newly added working node, the model performs one simulated inference to warm up, after which normal working tasks can be processed.
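The role of the cloud provider abstraction layer in the expansion path can be sketched as follows (Python). The unified interface, the CloudProvider base class and the create_server/destroy_server methods are assumptions for illustration; real provider SDK calls would replace the abstract methods.

from abc import ABC, abstractmethod

class CloudProvider(ABC):
    # One concrete subclass per supported cloud provider; the abstraction layer
    # hides their API differences behind these two operations.
    @abstractmethod
    def create_server(self, spec: dict) -> str: ...   # returns a server id
    @abstractmethod
    def destroy_server(self, server_id: str) -> None: ...

class CloudProviderAbstractionLayer:
    def __init__(self, providers: dict[str, CloudProvider]):
        self.providers = providers

    def handle_node_addition(self, provider_name: str, spec: dict) -> str:
        # Unified interface used by the scheduling center. spec would carry the
        # GPU specification, required memory capacity and CPU specification from
        # the node addition request; the new server is assumed to boot from the
        # preset image that already contains the scheduling agent program.
        return self.providers[provider_name].create_server(spec)

    def handle_server_destruction(self, provider_name: str, server_id: str) -> None:
        self.providers[provider_name].destroy_server(server_id)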
In this embodiment, if the node management policy is a node contraction policy, that is, the node management policy indicates that the working nodes in the preset scheduling group need to be contracted, the scheduling center sends a server offline request to the corresponding target cloud server. The scheduling agent program in the target cloud server closes each working node in the target cloud server based on the server offline request, i.e. the closed working nodes no longer receive or process working tasks, and responds to the scheduling center after the working nodes are closed. The scheduling center then sends a server destruction instruction to the cloud provider abstraction layer, so that the cloud provider abstraction layer destroys the target cloud server in the preset scheduling group, i.e. destroys all working nodes in the target cloud server, thereby realizing contraction of the working nodes in the preset scheduling group.
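A corresponding sketch of the contraction path, reusing the hypothetical abstraction layer from the previous sketch; the method names (shutdown_workers, deregister_nodes, handle_server_destruction) are assumptions that only mirror the order of operations described above, not a definitive implementation.

def contract_scheduling_group(scheduling_center, abstraction_layer, agent,
                              provider_name: str, server_id: str) -> None:
    # 1. Server offline request: the scheduling agent program stops every working
    #    node on the target cloud server so no new working tasks are accepted,
    #    then acknowledges the scheduling center.
    agent.shutdown_workers(server_id)
    # 2. The scheduling center removes those nodes from its local registry /
    #    node index table.
    scheduling_center.deregister_nodes(server_id)
    # 3. Server destruction instruction to the cloud provider abstraction layer,
    #    which destroys the target cloud server and with it all its working nodes.
    abstraction_layer.handle_server_destruction(provider_name, server_id)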
In this way, the cloud provider abstraction layer is connected with a plurality of cloud providers so as to support multiple different cloud providers, and is connected with the scheduling center through a unified interface, thereby shielding the differences among the cloud providers and improving the compatibility and applicability of node management; furthermore, the application realizes dynamic expansion and contraction of the working nodes based on the node utilization rate without manual intervention, which not only copes with instantaneous highly concurrent working tasks and ensures the timeliness of working tasks, but also achieves reasonable utilization of resources while ensuring node working efficiency.
Referring to fig. 2, the embodiment of the invention discloses a node management method, which comprises the following steps:
The scheduling platform comprises a cloud provider abstraction layer and a plurality of scheduling groups; each scheduling group comprises a scheduling center and a plurality of working nodes (workers); the cloud provider abstraction layer is connected with a plurality of different cloud providers and is connected with the scheduling centers through a unified interface, thereby shielding the differences among the cloud providers.
The scheduling platform receives the working tasks sent by the service side and distributes the working tasks to the corresponding scheduling groups based on a DNS (Domain Name System) load balancing policy. The scheduling center in each scheduling group then assigns corresponding working nodes to the working tasks so as to process them.
Meanwhile, the scheduling center in any scheduling group acquires the running states of all current working nodes in the scheduling group, which are sent based on high-frequency heartbeats, determines the target working nodes whose running state is busy from all the working nodes in the scheduling group, and then determines the current node utilization rate based on the ratio between the number of target working nodes and the total number of all working nodes. A node management policy is determined based on the current node usage: if the node utilization rate is high, a node capacity expansion strategy is determined as the node management strategy; if the node utilization rate is low, a node contraction strategy is determined as the node management strategy.
If the node management policy is a node capacity expansion policy, the scheduling center sends a node addition request to the cloud provider abstraction layer; the cloud provider abstraction layer calls the cloud provider corresponding to the node addition request to create a corresponding new cloud server in the scheduling group through the unified interface; after the new cloud server is started based on a preset boot image containing a scheduling agent program, the scheduling agent program in the new cloud server determines each newly added working node based on each local GPU and sends a node registration request determined based on each newly added working node to the scheduling center, so that the scheduling center registers information of each newly added working node locally based on the node registration request, thereby realizing capacity expansion of the working nodes in the scheduling group.
If the node management policy is a node contraction policy, the scheduling center sends a server offline request to the corresponding target cloud server; after the scheduling agent program in the target cloud server closes each working node in the target cloud server based on the server offline request, the scheduling center sends a server destruction instruction to the cloud provider abstraction layer so that the cloud provider abstraction layer destroys the target cloud server in the scheduling group, thereby realizing contraction of the working nodes in the scheduling group.
It can be seen that, when facing instantaneous highly concurrent working tasks, this embodiment expands the working nodes in the scheduling group so that more working nodes process the working tasks, avoiding task backlog and extra delay; and when facing fewer working tasks, it contracts the working nodes in the scheduling group so that some working nodes are released to save resources and cost, thereby realizing reasonable utilization of resources while ensuring the processing speed of working tasks.
Referring to fig. 3, the embodiment of the invention discloses a node management device which is applied to a scheduling center in a preset scheduling group, wherein the preset scheduling group also comprises a plurality of working nodes; wherein the device comprises:
The utilization rate determining module 11 is configured to obtain the running states of all current working nodes, and determine the current node utilization rate based on the running states of all the working nodes;
a policy determination module 12 for determining a node management policy based on the current node usage;
The node management module 13 is configured to send a node management request to a cloud provider abstraction layer if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request;
The cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
In this way, the cloud provider abstraction layer is connected with a plurality of cloud providers so as to support multiple different cloud providers, and is connected with the scheduling center through a unified interface, thereby shielding the differences among the cloud providers and improving the compatibility and applicability of node management; furthermore, the application realizes dynamic expansion and contraction of the working nodes based on the node utilization rate without manual intervention, which not only copes with instantaneous highly concurrent working tasks and ensures the timeliness of working tasks, but also achieves reasonable utilization of resources while ensuring node working efficiency.
In some embodiments, the usage determining module 11 includes:
The node determining submodule is used for acquiring the running states of all current working nodes and determining a target working node with a busy running state from all the working nodes;
And the utilization rate determining unit is used for determining the current node utilization rate based on the number of the target working nodes and the total number of all the working nodes.
In some embodiments, the node determination submodule includes:
The index table updating unit is used for acquiring the running states of all current working nodes sent by the current working nodes based on the preset heartbeat rate and updating the node index table in the preset cache based on the running states of all the working nodes; the node index table is an index table which is constructed in advance based on each working node and the corresponding running state;
and the index traversing unit is used for carrying out index traversal on the updated node index table so as to determine a target working node with a busy running state from all the working nodes.
In some embodiments, the policy determination module 12 includes:
the utilization rate judging unit is used for judging whether the current node utilization rate is within a preset utilization rate range;
The first capacity expansion strategy determining unit is used for determining the capacity expansion strategy of the node as a node management strategy if the current node utilization rate is larger than the upper limit value of the preset utilization rate range;
and the first contraction strategy determining unit is used for determining a node contraction strategy as the node management strategy if the current node utilization rate is smaller than the lower limit value of the preset utilization rate range.
In some embodiments, the policy determination module 12 includes:
A first usage rate obtaining unit, configured to obtain node usage rates of any two different times in a first preset historical time period adjacent to the current time and including the current time;
The trend value judging unit is used for determining a trend value of the node utilization rate based on the node utilization rates at any two different moments and judging whether the trend value of the node utilization rate is in a preset trend value range or not;
The second capacity expansion strategy determining unit is used for determining the capacity expansion strategy of the node as a node management strategy if the node utilization rate trend value is larger than the upper limit value of the preset trend value range;
the second contraction strategy determining unit is used for determining a node contraction strategy as the node management strategy if the node utilization rate trend value is smaller than the lower limit value of the preset trend value range;
The node utilization rates at any two different moments comprise the current node utilization rate corresponding to the current moment.
In some embodiments, the policy determination module 12 includes:
A second usage rate obtaining unit, configured to obtain historical node usage rates of at least two different times in a second preset historical time period that is adjacent to the current time and does not include the current time;
A trend value determining unit, configured to determine a node usage trend value based on each of the historical node usage rates and the current node usage rate, respectively;
The third capacity expansion strategy determining unit is used for determining the capacity expansion strategy of the node as a node management strategy if the number of trend values which are larger than the upper limit of the preset trend values in all the node utilization trend values is larger than the preset number;
and the third contraction strategy determining unit is used for determining a node contraction strategy as the node management strategy if the number of trend values smaller than the lower limit of the preset trend value among all the node utilization rate trend values is larger than the preset number.
In some embodiments, the node management module 13 includes:
The node capacity expansion unit is used for sending a node addition request to the cloud provider abstraction layer if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded, so that the cloud provider abstraction layer calls the cloud provider corresponding to the node addition request to create a corresponding newly added cloud server in the preset scheduling group through the unified interface; after the newly added cloud server is started based on a preset boot image containing a scheduling agent program, the scheduling agent program in the newly added cloud server determines each newly added working node based on each local graphics processor and sends a node registration request determined based on each newly added working node to the scheduling center, so that the scheduling center registers information of each newly added working node locally based on the node registration request;
And the node contraction unit is used for sending a server offline request to the corresponding target cloud server if the node management policy indicates that the working nodes in the preset scheduling group need to be contracted, and, after the scheduling agent program in the target cloud server closes each working node in the target cloud server based on the server offline request, sending a server destruction instruction to the cloud provider abstraction layer, so that the cloud provider abstraction layer destroys the target cloud server in the preset scheduling group.
Further, the embodiment of the present application further discloses an electronic device, and fig. 4 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the diagram is not to be considered as any limitation on the scope of use of the present application.
Fig. 4 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement the relevant steps in the node management method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling various hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux, etc. The computer program 222 may further include, in addition to the computer program that can be used to perform the node management method performed by the electronic device 20 as disclosed in any of the previous embodiments, computer programs that can be used to perform other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the node management method disclosed previously. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The principles and embodiments of the present application have been described above with reference to specific examples, which are intended only to help understand the method and core idea of the present application; meanwhile, for those skilled in the art, there will be variations in the specific embodiments and application scope in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A node management method, characterized in that it is applied to a scheduling center in a preset scheduling group, wherein the preset scheduling group further comprises a plurality of working nodes; wherein the method comprises the following steps:
Acquiring the running states of all current working nodes, and determining the current node utilization rate based on the running states of all the working nodes;
determining a node management policy based on the current node usage;
if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, sending a node management request to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request;
The cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
2. The node management method according to claim 1, wherein the obtaining the running states of all current working nodes and determining the current node usage based on the running states of the working nodes includes:
Acquiring the running states of all current working nodes, and determining a target working node with the busy running state from all the working nodes;
and determining the current node utilization rate based on the number of target working nodes and the total number of all working nodes.
3. The node management method according to claim 2, wherein the obtaining the running states of all current working nodes and determining the target working node whose running state is a busy state from all the working nodes includes:
Acquiring the running states of all current working nodes sent by the current working nodes based on a preset heartbeat rate, and updating a node index table in a preset cache based on the running states of all the working nodes; the node index table is an index table which is constructed in advance based on each working node and the corresponding running state;
and carrying out index traversal on the updated node index table to determine a target working node with a busy running state from all the working nodes.
4. The node management method according to claim 1, wherein the determining a node management policy based on the current node usage comprises:
Judging whether the current node utilization rate is within a preset utilization rate range;
if the current node utilization rate is larger than the upper limit value of the preset utilization rate range, determining a node capacity expansion strategy as a node management strategy;
And if the current node utilization rate is smaller than the lower limit value of the preset utilization rate range, determining a node contraction strategy as the node management strategy.
5. The node management method according to claim 1, wherein the determining a node management policy based on the current node usage comprises:
Acquiring node utilization rates of any two different moments in a first preset historical time period which is adjacent to the current moment and contains the current moment;
determining a node utilization rate trend value based on the node utilization rates at any two different moments, and judging whether the node utilization rate trend value is in a preset trend value range or not;
if the node utilization rate trend value is larger than the upper limit value of the preset trend value range, determining a node capacity expansion strategy as a node management strategy;
If the node utilization rate trend value is smaller than the lower limit value of the preset trend value range, determining a node contraction strategy as the node management strategy;
The node utilization rates at any two different moments comprise the current node utilization rate corresponding to the current moment.
6. The node management method according to claim 1, wherein the determining a node management policy based on the current node usage comprises:
Acquiring historical node utilization rates of at least two different moments in a second preset historical time period which is adjacent to the current moment and does not contain the current moment;
Determining a node utilization trend value based on each historical node utilization and the current node utilization respectively;
If the number of the trend values which are larger than the upper limit of the preset trend values in all the node utilization trend values is larger than the preset number, determining a node capacity expansion strategy as a node management strategy;
and if the number of the trend values smaller than the lower limit of the preset trend value among all the node utilization trend values is larger than the preset number, determining a node contraction strategy as the node management strategy.
7. The node management method according to any one of claims 1 to 6, wherein, if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, sending a node management request to a cloud provider abstraction layer, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request, includes:
if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded, sending a node addition request to the cloud provider abstraction layer, so that the cloud provider abstraction layer calls the cloud provider corresponding to the node addition request to create a corresponding newly added cloud server in the preset scheduling group through the unified interface; after the newly added cloud server is started based on a preset boot image containing a scheduling agent program, the scheduling agent program in the newly added cloud server determines each newly added working node based on each local graphics processor, and sends a node registration request determined based on each newly added working node to the scheduling center, so that the scheduling center registers information of each newly added working node locally based on the node registration request;
If the node management policy indicates that the working nodes in the preset scheduling group need to be contracted, sending a server offline request to a corresponding target cloud server, and, after a scheduling agent program in the target cloud server closes each working node in the target cloud server based on the server offline request, sending a server destruction instruction to the cloud provider abstraction layer, so that the cloud provider abstraction layer destroys the target cloud server in the preset scheduling group.
8. A node management apparatus, characterized in that it is applied to a scheduling center in a preset scheduling group, wherein the preset scheduling group further comprises a plurality of working nodes; wherein the apparatus comprises:
The utilization rate determining module is used for obtaining the running states of all the current working nodes and determining the utilization rate of the current node based on the running states of all the working nodes;
A policy determining module, configured to determine a node management policy based on the current node usage;
The node management module is used for sending a node management request to a cloud provider abstraction layer if the node management policy indicates that the working nodes in the preset scheduling group need to be expanded or contracted, so that the cloud provider abstraction layer manages the working nodes in the preset scheduling group based on the node management request;
The cloud provider abstraction layer is connected with a plurality of cloud providers and is connected with the scheduling center through a unified interface.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the node management method of any of claims 1 to 7.
10. A computer readable storage medium for storing a computer program which when executed by a processor implements the node management method of any of claims 1 to 7.
CN202410553670.XA 2024-05-07 2024-05-07 Node management method, device, equipment and storage medium Pending CN118138591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410553670.XA CN118138591A (en) 2024-05-07 2024-05-07 Node management method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410553670.XA CN118138591A (en) 2024-05-07 2024-05-07 Node management method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118138591A true CN118138591A (en) 2024-06-04

Family

ID=91236098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410553670.XA Pending CN118138591A (en) 2024-05-07 2024-05-07 Node management method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118138591A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110545197A (en) * 2018-05-29 2019-12-06 杭州海康威视系统技术有限公司 node state monitoring method and device
CN110753112A (en) * 2019-10-23 2020-02-04 北京百度网讯科技有限公司 Elastic expansion method and device of cloud service
WO2021129862A1 (en) * 2019-12-26 2021-07-01 华为技术有限公司 Method and apparatus for managing container cluster node resource pool
US20210263780A1 (en) * 2020-02-25 2021-08-26 Hewlett Packard Enterprise Development Lp Autoscaling nodes of a stateful application based on role-based autoscaling policies
CN111782147A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for cluster scale-up
CN117632897A (en) * 2023-11-06 2024-03-01 交控科技股份有限公司 Dynamic capacity expansion and contraction method and device

Similar Documents

Publication Publication Date Title
CN108737270B (en) Resource management method and device for server cluster
US9071609B2 (en) Methods and apparatus for performing dynamic load balancing of processing resources
CN111880936B (en) Resource scheduling method, device, container cluster, computer equipment and storage medium
US8095935B2 (en) Adapting message delivery assignments with hashing and mapping techniques
CN105159775A (en) Load balancer based management system and management method for cloud computing data center
CN112929408A (en) Dynamic load balancing method and device
WO2018121334A1 (en) Web application service providing method, apparatus, electronic device and system
CN116860461B (en) Resource scheduling method, equipment and storage medium of K8s cluster
CN110868323A (en) Bandwidth control method, device, equipment and medium
CN111522664A (en) Service resource management and control method and device based on distributed service
CN118138591A (en) Node management method, device, equipment and storage medium
CN115766737B (en) Load balancing method and device and electronic equipment
CN115225645B (en) Service updating method, device, system and storage medium
CN114710488B (en) Method, device, equipment and medium for realizing elastic expansion and contraction across available areas
CN112822062A (en) Management method for desktop cloud service platform
US8495634B2 (en) Method for the management of tasks in a decentralized data network
CN113485828A (en) Distributed task scheduling system and method based on quartz
CN114301922B (en) Reverse proxy method with delay perception load balance and storage device
CN112131142B (en) Method, system, equipment and medium for quickly caching data set
US20160188306A1 (en) Methods Circuits Devices Systems and Associated Computer Executable Code for Providing Application Data Services to a Mobile Communication Device
CN110955579A (en) Ambari-based large data platform monitoring method
CN107317880B (en) Method and device for realizing load balance
CN113873052B (en) Domain name resolution method, device and equipment of Kubernetes cluster
CN114466011B (en) Metadata service request method, device, equipment and medium
CN110753120B (en) Stateful load creating method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination