CN110958311A - YARN-based shared cluster elastic scaling system and method - Google Patents

YARN-based shared cluster elastic scaling system and method

Info

Publication number: CN110958311A
Application number: CN201911179701.5A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: node, cluster, manager, elastic, resource
Inventors: 曹东刚, 马俊明, 邵嘉伦
Original and current assignee: Peking University
Application filed by: Peking University
Priority and filing date: 2019-11-27
Publication date: 2020-04-03
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • H04L67/1031 — Controlling of the operation of servers by a load balancer, e.g. adding or removing servers that serve requests
    • H04L41/0893 — Assignment of logical groups to network elements (configuration management of networks or network elements)
    • H04L43/0817 — Monitoring or testing based on specific metrics, e.g. QoS, by checking availability and functioning
    • H04L43/16 — Threshold monitoring
    • H04L67/10 — Protocols in which an application is distributed across nodes in the network
    • H04L67/1029 — Accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a YARN-based shared cluster elastic scaling system and method. The system comprises: fixed nodes; elastic nodes, which join or leave the cluster as the cluster load changes; an application manager; a resource manager running on a fixed node; and node managers running on the fixed or elastic nodes. The resource manager schedules resources, monitors the resource utilization of the cluster, and interacts with the public cloud platform according to that utilization to add elastic nodes to the cluster or release elastic nodes from it. The application manager communicates with the node managers according to the resource manager's allocation, so that the fixed or elastic node where a node manager is located starts job tasks, and it manages and monitors those tasks; the node manager makes its fixed or elastic node execute tasks. The invention allows the scale of a shared cluster to grow and shrink elastically as the load changes.

Description

YARN-based shared cluster elastic scaling system and method
Technical Field
The invention relates to the technical field of computer clusters, and in particular to a YARN-based shared cluster elastic scaling system and method.
Background
Cloud computing is an information-technology service model that lets users consume hardware resources such as computing, network, and storage on demand. A public cloud is cloud infrastructure operated and maintained by a third-party enterprise to provide services to individual and enterprise users. Representative public cloud providers currently include Alibaba Cloud, Amazon AWS, and Microsoft Azure. Public cloud operators provide programmable application programming interfaces (APIs) that enable consumers to use public cloud resources more efficiently.
YARN is software that manages large computer clusters. YARN supports different application frameworks running simultaneously on shared cluster hardware; for example, a YARN cluster can run Hadoop MapReduce and Spark jobs at the same time. YARN uses a two-level scheduling mechanism: each job has a centralized management program, the Application Master (AM), and each AM applies for resources from YARN's Resource Manager (RM). Having obtained the allocated resources, the AM further distributes them to the different tasks within the job. The AM communicates with a Node Manager (NM) to run tasks on each computer node.
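For orientation only (this sketch is not part of the patent; it uses the stock Hadoop YARN client API, and the container size, launch command, and empty registration parameters are illustrative), an AM's interaction with the RM and NM looks roughly like this:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalAppMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Register this Application Master with the Resource Manager.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // Ask the RM for one container (1 GiB, 1 vcore); the RM's scheduler
        // decides which node it lands on.
        rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // Heartbeat the RM until the container is allocated, then ask the NM
        // on the chosen node to launch a task inside it.
        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();
        boolean launched = false;
        while (!launched) {
            AllocateResponse resp = rm.allocate(0.0f);
            for (Container c : resp.getAllocatedContainers()) {
                ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                        null, null, Collections.singletonList("sleep 60"),
                        null, null, null);
                nm.startContainer(c, ctx);
                launched = true;
            }
            Thread.sleep(1000);
        }
    }
}
```

The system described in this patent extends exactly this protocol: its RM additionally monitors utilization and drives node creation, node deletion, and container migration.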
YARN is commonly used in large data centers to manage computer clusters. With the development of cloud computing, more and more enterprises choose public cloud services and deploy their data centers on the public cloud. However, YARN, as cluster-management software, cannot effectively support job migration, and therefore cannot fully exploit the elasticity that cloud computing offers.
Disclosure of Invention
The invention aims to provide a YARN-based shared cluster elastic scaling system and method for cloud environments.
In order to achieve the purpose, the invention provides the following scheme:
the invention provides a YARN-based shared cluster elastic expansion system in a first aspect, which comprises: a fixed node, a resilient node, an application manager, a resource manager running on the fixed node, and a node manager running on the fixed node or the resilient node.
The fixed nodes are permanent, unchanging nodes of the shared cluster; the elastic nodes are nodes that join or leave the cluster as the cluster load changes. The system is deployed on a virtual-machine cluster in a public cloud environment and is an extension of native YARN: fixed nodes exist throughout the entire lifetime of the cluster, while elastic nodes are dynamically added to or released from the cluster as the load changes.
The resource manager schedules resources, monitors the resource utilization of the cluster, and interacts with the public cloud platform according to that utilization to add elastic nodes to the cluster or release elastic nodes from it. The application manager communicates with the node managers according to the resource manager's allocation, so that the fixed or elastic node where a node manager is located starts job tasks, and it manages and monitors those tasks. The node manager makes its fixed or elastic node run the tasks.
Optionally, the resource manager is the core module of the entire system. It runs on a fixed node and may include the following four modules:
the resource monitoring module, which periodically monitors the resource utilization of the cluster;
the resource scheduling module, which receives task allocation requests submitted by the application manager and performs resource scheduling;
the elastic management module, which decides whether to expand or shrink the cluster according to the resource utilization provided by the resource monitoring module;
and the cloud platform interaction module, which interacts with the public cloud platform according to that decision, creating new virtual machines as elastic nodes or releasing the virtual machines corresponding to elastic nodes in the cluster, thereby dynamically adjusting the scale of the whole cluster.
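As a structural illustration only (the module names mirror the patent's description, but every type and signature below is an assumption rather than a published API), the four modules could be expressed as:

```java
import java.util.List;

/** Periodically samples resource utilization (0.0 - 1.0). */
interface ResourceMonitor {
    double clusterUtilization();
    double nodeUtilization(String nodeId);
}

/** Receives task allocation requests submitted by application managers. */
interface ResourceScheduler {
    void schedule(String applicationId, int containers, int memoryMb, int vcores);
}

/** Decides expansion or contraction from the monitored utilization. */
interface ElasticManager {
    enum Decision { EXPAND, SHRINK, HOLD }
    Decision decide(double clusterUtilization);
}

/** Talks to the public cloud platform to create or release elastic-node VMs. */
interface CloudPlatformClient {
    List<String> createVms(int count);  // returns the new VMs' IP addresses
    void releaseVm(String nodeId);
}
```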
Optionally, the node manager includes a container manager module, which monitors the state of its node, manages the tasks on the node in the form of containers, and migrates tasks during cluster scaling.
The second aspect of the invention provides a YARN-based shared cluster elastic scaling method, applied to the YARN-based shared cluster elastic scaling system provided by the invention. The method includes:
the resource manager creates an application manager to manage a job submitted by a user;
the application manager applies to the resource manager for the resources the job needs;
the resource manager allocates resources on a number of nodes to the application manager according to the resource-usage information of each node;
after obtaining the resource allocation, the application manager communicates with the node managers of the corresponding nodes and starts the job's subtasks on those nodes;
the resource manager monitors the resource utilization of the cluster in real time;
the resource manager judges whether the resource utilization of the cluster is greater than a first threshold;
if the cluster's resource utilization exceeds the first threshold for several consecutive monitoring periods, the resource manager interacts with the public cloud platform and adds elastic nodes to the cluster;
the resource manager judges whether the resource utilization of an elastic node in the cluster is less than a second threshold;
and if the elastic node's resource utilization stays below the second threshold for several consecutive monitoring periods, the resource manager deletes the elastic node from the cluster.
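A minimal sketch of this two-threshold rule with consecutive-period hysteresis (the 80%/20% thresholds and the three-period window are assumed example values; the patent leaves the monitoring period and thresholds user-configurable):

```java
/** Scale-out/scale-in decision with consecutive-period hysteresis. A sketch:
 *  the thresholds and window below are illustrative, not from the patent. */
class ScalingDecision {
    static final double FIRST_THRESHOLD = 0.80;   // cluster-wide, scale out
    static final double SECOND_THRESHOLD = 0.20;  // per elastic node, scale in
    static final int CONSECUTIVE_PERIODS = 3;

    private int highPeriods = 0;
    private int lowPeriods = 0;

    /** Called once per monitoring period with fresh utilization samples. */
    String onSample(double clusterUtil, double elasticNodeUtil) {
        highPeriods = clusterUtil > FIRST_THRESHOLD ? highPeriods + 1 : 0;
        lowPeriods  = elasticNodeUtil < SECOND_THRESHOLD ? lowPeriods + 1 : 0;
        if (highPeriods >= CONSECUTIVE_PERIODS) { highPeriods = 0; return "EXPAND"; }
        if (lowPeriods  >= CONSECUTIVE_PERIODS) { lowPeriods = 0;  return "SHRINK"; }
        return "HOLD";
    }
}
```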
Optionally, before adding an elastic node to the cluster, the method further includes:
the resource manager determining, from the cluster's resource utilization, the number of elastic nodes that need to join the cluster.
Optionally, the resource manager interacting with the public cloud platform and adding elastic nodes to the cluster specifically includes:
calling the public cloud API to interact with the public cloud platform, creating a set number of virtual machines, and determining the IP addresses of the newly added virtual machines;
and the resource manager remotely logging in to each newly added virtual machine by its IP address, starting a node manager on it, and adding it to the cluster.
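As an illustration of this step (everything here is an assumption: `CloudPlatformClient` is the hypothetical interface sketched earlier, and the SSH user and the standard Hadoop 3 launch command `yarn --daemon start nodemanager` stand in for whatever the deployment actually uses):

```java
import java.util.List;

class ClusterExpander {
    private final CloudPlatformClient cloud;  // hypothetical, sketched earlier

    ClusterExpander(CloudPlatformClient cloud) { this.cloud = cloud; }

    /** Create `count` VMs, then remotely start a Node Manager on each one so
     *  it registers with the RM and joins the cluster as an elastic node. */
    void expand(int count) throws Exception {
        List<String> ips = cloud.createVms(count);
        for (String ip : ips) {
            // Remote login plus NM startup; the user and command are illustrative.
            new ProcessBuilder("ssh", "yarn@" + ip,
                    "yarn --daemon start nodemanager")
                    .inheritIO().start().waitFor();
        }
    }
}
```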
Optionally, before the resource manager deletes an elastic node from the cluster, the method further includes:
judging whether an application manager is running on the elastic node;
and if not, deleting the elastic node from the cluster.
Optionally, before the resource manager deletes an elastic node from the cluster, the method further includes:
migrating the containers running on the elastic node to be deleted.
Optionally, migrating the containers running on the elastic node to be deleted specifically includes:
the resource manager determines, for each container, a target node identification number and the identification number of the application manager managing the container, and transmits the target node identification number to that application manager and to the node manager of the elastic node to which the container belongs;
the node manager migrates the container to the target node and, once migration is complete, reports the completed migration to the resource manager;
the resource manager notifies the node manager on the target node to restore the migrated container;
after restoring the container, that node manager notifies the resource manager that the container has been fully restored;
the resource manager calls the public cloud API to interact with the public cloud platform and deletes the virtual machine corresponding to the elastic node;
the resource manager informs the application manager that the container has been migrated to the target node;
and the application manager establishes an RPC association with the node manager on the target node.
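The handshake above can be read as the following ordered message trace (a sketch only; the message names are assumptions, and the patent specifies RPC heartbeat return values as the carrier):

```java
/** The migration handshake as an ordered message trace (names are assumed). */
enum MigrationMessage {
    RM_TO_AM_MIGRATE_NOTICE,   // RM -> AM: your container X moves to node T
    RM_TO_NM_MIGRATE_ORDER,    // RM -> source NM: migrate container X to T
    NM_TO_RM_MIGRATION_DONE,   // source NM -> RM: container X has migrated
    RM_TO_NM_RESTORE_ORDER,    // RM -> target NM: restore container X
    NM_TO_RM_RESTORE_DONE,     // target NM -> RM: container X restored
    RM_TO_CLOUD_DELETE_VM,     // RM -> cloud API: delete the source node's VM
    RM_TO_AM_MIGRATED,         // RM -> AM: container X now lives on T
    AM_TO_NM_NEW_RPC           // AM establishes RPC with the target NM
}
```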
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The YARN-based shared cluster elastic scaling system for cloud environments comprises fixed nodes, which are permanent in the shared cluster, and elastic nodes, which join or leave the cluster as the shared cluster's load changes. The resource manager monitors the cluster's resource utilization; when it exceeds a first threshold, the system interacts with the public cloud platform and adds elastic nodes to the cluster; when an elastic node's resource utilization falls below a second threshold, the system interacts with the public cloud platform, deletes the elastic node from the cluster, and migrates the containers on it. The scale of the shared cluster can thus stretch and shrink elastically with the load, realizing the elasticity characteristic of cloud computing.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of the YARN-based shared cluster elastic scaling system in an embodiment of the present invention;
FIG. 2 is a flowchart of the YARN-based shared cluster elastic scaling method in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a YARN-based shared cluster elastic scaling system and method for cloud environments.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the YARN-based shared cluster elastic scaling system provided by the invention comprises:
the fixed nodes, which are permanent, unchanging nodes of the shared cluster;
the elastic nodes, which join or leave the cluster as the cluster load changes;
the resource manager, which runs on a fixed node and schedules resources, monitors the resource utilization of the cluster, and interacts with the public cloud platform according to that utilization to add elastic nodes to the cluster or release elastic nodes from it;
the application manager, which communicates with the node managers according to the resource manager's allocation, so that the fixed or elastic node where a node manager is located starts job tasks, and which manages and monitors those tasks;
and the node managers, which run on the fixed and elastic nodes and make those nodes run the tasks.
The resource manager comprises a resource monitoring module, a resource scheduling module, an elastic management module, and a cloud platform interaction module. The resource monitoring module periodically monitors the resource utilization of the cluster; the resource scheduling module receives task allocation requests submitted by the application manager and performs resource scheduling; the elastic management module decides whether to expand or shrink the cluster according to the resource utilization provided by the resource monitoring module; and the cloud platform interaction module interacts with the public cloud platform according to that decision, creating new virtual machines as elastic nodes or releasing the virtual machines corresponding to elastic nodes in the cluster.
The node manager is the per-node manager of the system; one node manager runs on every fixed and elastic node. The node manager contains a core submodule, the container manager module, which monitors the state of the node, manages the tasks on the node in the form of containers, and migrates tasks during cluster scaling.
The application manager is the per-job manager of the system: after each job is submitted, a corresponding application manager process is created to manage it. Each job consists of a number of subtasks, which typically execute in containers on the various nodes and are managed and monitored by the application manager. Each application manager can be viewed as an application framework, and developers use the APIs provided by the system to build applications that meet specific requirements. The system provides developers with an interface for perceiving task migration, so that users can build application managers with fault tolerance for task migration; migrating tasks during scaling then does not crash the job.
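A hedged sketch of what such a migration-awareness interface might look like (entirely an assumption; the patent does not publish this API):

```java
/** Hypothetical callbacks an AM implements to tolerate container migration. */
interface MigrationAware {
    /** Called before the container is checkpointed on its source node. */
    void onMigrationStart(String containerId, String sourceNode, String targetNode);
    /** Called after the container is restored and running on the target node. */
    void onMigrationComplete(String containerId, String targetNode);
}
```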
The resource manager, node managers, and application managers communicate with one another via RPC. The application manager requests resources for running tasks from the resource manager through RPC heartbeats, and the resource manager reports the allocated resources and the tasks' running states through the RPC heartbeat return values. The application manager sends the node manager requests related to running tasks, such as starting and releasing them, through RPC heartbeats, and the node manager reports the states of the tasks running on its node through the RPC heartbeat return values. The node manager reports each node's resource usage and the states of the tasks running on it to the resource manager through RPC heartbeats, and the resource manager orders operations such as task migration and task restoration through the RPC heartbeat return values.
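To make the three channels concrete, here is a sketch of what each heartbeat could carry (plain Java records; every field and type name is an assumption, not part of the patent or of YARN's wire protocol):

```java
import java.util.List;

// AM -> RM heartbeat and its return value.
record AmHeartbeat(String appId, List<String> resourceRequests) {}
record AmHeartbeatReply(List<String> allocatedContainers, List<String> taskStates) {}

// AM -> NM heartbeat: start/release requests; the reply carries task states.
record NmRequest(String containerId, String action) {}  // "start" | "release"
record NmRequestReply(List<String> taskStates) {}

// NM -> RM heartbeat: node utilization plus task states; the reply carries
// orders such as "migrate container X to node T" or "restore container X".
record NodeHeartbeat(String nodeId, double utilization, List<String> taskStates) {}
record NodeHeartbeatReply(List<String> migrationOrders) {}
```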
Based on the above system, the invention further provides a YARN-based shared cluster elastic scaling method, as shown in FIG. 2:
a user submits a job (e.g., a Hadoop MapReduce job) to the RM (resource manager) through a client, and the RM first creates an AM (application manager) for the job to manage the job. The AM will make a resource application required by the operation to the RM, and the RM allocates resources on a plurality of nodes to the AM through the node resource use information reported by the NM. After the AM obtains the allocated resources, it communicates with NM (node manager) of the corresponding node to start the subtask within the operation. The system is initially started on a cluster consisting of a number of virtual machines, which are all fixed virtual machines, and the life cycle of the virtual machines is maintained until the whole system is terminated. The RM process runs on one machine of the cluster and the NM runs on each machine of the cluster.
The RM periodically monitors the cluster load. When the RM observes that overall cluster resource utilization is too high for several consecutive periods, it decides to expand the cluster. The RM calculates how many nodes must be added to bring cluster resource utilization back into a reasonable range, then calls the public cloud platform API to create the specified number of virtual machines, remotely logs in to the newly created machines, starts the NM service on them, and adds them to the cluster. These newly added machines are the elastic virtual machines.
When the RM observes that the resource utilization of some elastic node stays too low for several consecutive periods, it decides to delete elastic virtual machines whose resource utilization is below a certain value. Before deleting a virtual machine, the system migrates the tasks (in the form of containers) on that node, computing the destination node for each container migration. If the RM finds the migration cost too high, it kills those tasks directly so that they are rescheduled to run again. During migration, the RM tells the AM through an RPC heartbeat return value that some of its containers must move to target nodes, and tells the NM through an RPC heartbeat return value to migrate the containers running on its node to those targets. The NM receives the heartbeat return value, migrates the containers accordingly, and reports through heartbeats that a migration is in progress or has completed. When the RM learns from an NM heartbeat that a container has finished migrating, it tells the target node's NM through an RPC heartbeat return value to restore the container. Once the RM learns from NM heartbeats that all containers on a node have finished migrating, it calls the public cloud platform API to delete that node. After the RM learns from the target node's NM that a migrated container has been restored, it informs the AM, which can then establish a new RPC connection with the target node's NM.
The method comprises the following specific steps:
step 1:
and the resource scheduling module of the RM receives the job resource request submitted by the AM and performs resource scheduling. The resource scheduling module uses a fixed priority scheduling strategy to attempt to schedule the task on the fixed node, and if the fixed node resources are insufficient, the resource scheduling module can attempt to schedule the task on the elastic node.
And the RM periodically monitors the cluster load through the resource monitoring module at the same time, and if the cluster resource utilization rate is monitored to be higher than a certain threshold value in a plurality of continuous periods, the RM makes a decision to expand the cluster scale, outputs the resource utilization rate, the node number and the threshold value of the current cluster, and executes the step 2. If the resource utilization rate of a certain elastic node in the cluster is monitored to be lower than a certain threshold value for a plurality of continuous periods, the RM makes a decision to delete the elastic node which runs without the AM and has the resource utilization rate lower than a certain value to shrink the cluster. And outputting the node identification number and executing the step 5. The monitoring period and the threshold value are set by the user of the system.
Step 2:
and the flexible management module of the RM takes the resource utilization rate of the current cluster, the node number and the threshold value output in the step 1 as input, and calculates the number of the nodes needing to be increased so that the cluster resource utilization rate can be lower than the threshold value. Step 3 is performed with the increased number of nodes calculated as output.
Step 3:
The cloud platform interaction module of the RM takes the number of nodes to add output in Step 2 as input, calls the public cloud API to interact with the public cloud platform, and creates that number of virtual machines. It outputs the IP addresses of the newly added virtual machines and executes Step 4.
Step 4:
The elastic management module of the RM takes the IP addresses of the newly added virtual machines output in Step 3 as input, remotely logs in to each newly created virtual machine, starts the NM service, and adds the machine to the cluster. Return to Step 1.
Step 5:
The elastic management module of the RM, taking the node identification number output in Step 1 as input, queries the containers running on that node and calculates the target node identification number to which each container should migrate. It outputs each container's AM identification number, the NM identification number of the node to which the container belongs, and the target node identification number.
Step 6:
Based on each container's AM identification number and target node identification number output in Step 5, the elastic management module of the RM tells the AM, through the RM-AM RPC heartbeat return value, that the containers belonging to it must be migrated to the target nodes.
Step 7:
Based on the NM identification number of the node to which each container belongs and the target node identification number output in Step 5, the elastic management module of the RM tells the NM of that node, through the RM-NM RPC heartbeat return value, to migrate the container to the target node.
Step 8:
The NM receives the RPC heartbeat information sent in Step 7 and begins migrating the container to the target node through its container manager module. During the migration, the NM tells the RM through the RM-NM RPC heartbeat that the container is being migrated; when the container manager module completes the migration, the NM tells the RM through the RM-NM RPC heartbeat that the migration is complete.
Step 9:
The elastic management module of the RM receives the RPC heartbeat information sent by the NM in Step 8, learns that the container has completed its migration, and tells the NM on the target node, through the RM-NM RPC heartbeat, to restore the migrated container.
Step 10:
The elastic management module of the RM receives the RPC heartbeat information sent by the NM in Step 8 and learns that all containers on that NM have completed their migration. It then notifies the cloud platform interaction module, which calls the public cloud API to interact with the public cloud platform and deletes the virtual machine on which that NM runs.
Step 11:
The NM on the target node receives the RPC heartbeat message sent by the RM in Step 9, and its container manager module begins restoring the containers migrated to that node. When a container's restoration is complete, the NM tells the RM through the RM-NM RPC heartbeat that the container has been restored.
Step 12:
The RM receives the RPC heartbeat message from the NM in Step 11 and learns that the container on that NM has been restored. The RM tells the AM, through the RM-AM RPC heartbeat, that its container has been successfully migrated to the target node.
Step 13:
The AM receives the RPC heartbeat message sent by the RM in Step 12 and learns that its container has been migrated to the target node. The AM then establishes a new RPC connection with the NM on the target node. Return to Step 1.
The existing shared-cluster management system YARN cannot scale nodes down through job migration. When cluster load is low and nodes should be deleted to save cost, YARN can only wait for all tasks on a node to finish, or kill all tasks running on the node and reschedule them, before releasing the node. Both schemes are inefficient: the former may keep nodes running for long periods at low resource utilization, and the latter discards computation that long-running tasks have already performed. Existing YARN lacks a mechanism that scales the system elastically through task migration while letting running tasks continue to execute correctly.
Building on YARN, the invention gives the new system the ability, in a public cloud environment, to scale the cluster elastically through task migration. The system can create and delete virtual machines in real time by interacting with the public cloud platform according to current cluster resource utilization, dynamically adjusting the size of the virtual-machine cluster.
At the same time, the system provides developers with an interface for perceiving task migration, so that users can build AMs with fault tolerance for task migration; the system then does not crash jobs when it migrates tasks during scaling. With the system provided by the invention, a shared cluster running on a public cloud can use public cloud resources more flexibly, on demand.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (9)

1. A YARN-based shared cluster elastic scaling system, characterized by comprising:
the fixed nodes, which are permanent, unchanging nodes of the shared cluster;
the elastic nodes, which join or leave the cluster as the cluster load changes;
the resource manager, which runs on a fixed node and schedules resources, monitors the resource utilization of the cluster, and interacts with the public cloud platform according to that utilization to add elastic nodes to the cluster or release elastic nodes from it;
the application manager, which communicates with the node managers according to the resource manager's allocation, so that the fixed or elastic node where a node manager is located starts job tasks, and which manages and monitors those tasks;
and the node managers, which run on the fixed and elastic nodes and make those nodes run the tasks.
2. The YARN-based shared cluster elastic scaling system of claim 1, wherein the resource manager comprises:
the resource monitoring module, which periodically monitors the resource utilization of the cluster;
the resource scheduling module, which receives task allocation requests submitted by the application manager and performs resource scheduling;
the elastic management module, which decides whether to expand or shrink the cluster according to the resource utilization provided by the resource monitoring module;
and the cloud platform interaction module, which interacts with the public cloud platform according to that decision, creating new virtual machines as elastic nodes or releasing the virtual machines corresponding to elastic nodes in the cluster.
3. The YARN-based shared cluster elastic scaling system of claim 1, wherein the node manager comprises a container manager module that monitors the state of its node, manages the tasks on the node in the form of containers, and migrates tasks during cluster scaling.
4. A YARN-based shared cluster elastic scaling method, applied to the YARN-based shared cluster elastic scaling system of any one of claims 1-3, the method comprising:
the resource manager creates an application manager to manage a job submitted by a user;
the application manager applies to the resource manager for the resources the job needs;
the resource manager allocates resources on a number of nodes to the application manager according to the resource-usage information of each node;
after obtaining the resource allocation, the application manager communicates with the node managers of the corresponding nodes and starts the job's subtasks on those nodes;
the resource manager monitors the resource utilization of the cluster in real time;
the resource manager judges whether the resource utilization of the cluster is greater than a first threshold;
if the cluster's resource utilization exceeds the first threshold for several consecutive monitoring periods, the resource manager interacts with the public cloud platform and adds elastic nodes to the cluster;
the resource manager judges whether the resource utilization of an elastic node in the cluster is less than a second threshold;
and if the elastic node's resource utilization stays below the second threshold for several consecutive monitoring periods, the resource manager deletes the elastic node from the cluster.
5. The YARN-based shared cluster elastic scaling method of claim 4, further comprising, before adding an elastic node to the cluster:
the resource manager determining, from the cluster's resource utilization, the number of elastic nodes that need to join the cluster.
6. The YARN-based shared cluster elastic scaling method of claim 4 or 5, wherein the resource manager interacting with the public cloud platform and adding elastic nodes to the cluster specifically comprises:
calling the public cloud API to interact with the public cloud platform, creating a set number of virtual machines, and determining the IP addresses of the newly added virtual machines;
and the resource manager remotely logging in to each newly added virtual machine by its IP address, starting a node manager on it, and adding it to the cluster.
7. The YARN-based shared cluster elastic scaling method of claim 4, further comprising, before the resource manager deletes an elastic node from the cluster:
judging whether an application manager is running on the elastic node;
and if not, deleting the elastic node from the cluster.
8. The YARN-based shared cluster elastic scaling method of claim 4, further comprising, before the resource manager deletes an elastic node from the cluster:
migrating the containers running on the elastic node to be deleted.
9. The YARN-based shared cluster elastic scaling method of claim 8, wherein migrating the containers running on the elastic node to be deleted specifically comprises:
the resource manager determines, for each container, a target node identification number and the identification number of the application manager managing the container, and transmits the target node identification number to that application manager and to the node manager of the elastic node to which the container belongs;
the node manager migrates the container to the target node and, once migration is complete, reports the completed migration to the resource manager;
the resource manager notifies the node manager on the target node to restore the migrated container;
after restoring the container, that node manager notifies the resource manager that the container has been fully restored;
the resource manager calls the public cloud API to interact with the public cloud platform and deletes the virtual machine corresponding to the elastic node;
the resource manager informs the application manager that the container has been migrated to the target node;
and the application manager establishes an RPC association with the node manager on the target node.
CN201911179701.5A 2019-11-27 2019-11-27 YARN-based shared cluster elastic scaling system and method — Pending — CN110958311A

Priority Applications (1)

Application Number: CN201911179701.5A · Priority Date: 2019-11-27 · Filing Date: 2019-11-27 · Title: YARN-based shared cluster elastic scaling system and method

Applications Claiming Priority (1)

Application Number: CN201911179701.5A · Priority Date: 2019-11-27 · Filing Date: 2019-11-27 · Title: YARN-based shared cluster elastic scaling system and method

Publications (1)

Publication Number: CN110958311A · Publication Date: 2020-04-03

Family

ID=69978568

Family Applications (1)

Application Number: CN201911179701.5A · Title: YARN-based shared cluster elastic scaling system and method · Priority Date: 2019-11-27 · Filing Date: 2019-11-27 · Status: Pending

Country Status (1)

Country Link
CN (1) CN110958311A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111865714A (en) * 2020-06-24 2020-10-30 上海上实龙创智能科技股份有限公司 Cluster management method based on multi-cloud environment
CN112291326A (en) * 2020-10-23 2021-01-29 深圳市欢太科技有限公司 Load balancing method, load balancing device, storage medium and electronic equipment
CN113037856A (en) * 2021-03-23 2021-06-25 苏州云霄电子科技有限公司 Public cloud-based computing system, method, computer device, and storage medium
CN115086335A (en) * 2022-07-27 2022-09-20 北京思和科创软件有限公司 Container cloud node dynamic adding method and device, electronic equipment and storage medium
CN115934299A (en) * 2023-02-22 2023-04-07 智者四海(北京)技术有限公司 Migration system and method for YARN operation
CN117234721A (en) * 2023-09-18 2023-12-15 安徽继远软件有限公司 Cloud native system automatic operation research system and application based on operation layout self-adaption technology

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106201661A (en) * 2016-07-20 2016-12-07 北京百度网讯科技有限公司 Method and apparatus for elastic telescopic cluster virtual machine
CN107977252A (en) * 2016-10-21 2018-05-01 中兴通讯股份有限公司 A kind of capacity reduction method, device and the cloud platform of cloud platform business
CN108156212A (en) * 2017-06-29 2018-06-12 广东网金控股股份有限公司 A kind of elastic telescopic method and system perceived based on user
CN109343965A (en) * 2018-10-31 2019-02-15 北京金山云网络技术有限公司 Resource adjusting method, device, cloud platform and server

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106201661A (en) * 2016-07-20 2016-12-07 北京百度网讯科技有限公司 Method and apparatus for elastic telescopic cluster virtual machine
CN107977252A (en) * 2016-10-21 2018-05-01 中兴通讯股份有限公司 A kind of capacity reduction method, device and the cloud platform of cloud platform business
CN108156212A (en) * 2017-06-29 2018-06-12 广东网金控股股份有限公司 A kind of elastic telescopic method and system perceived based on user
CN109343965A (en) * 2018-10-31 2019-02-15 北京金山云网络技术有限公司 Resource adjusting method, device, cloud platform and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邹姣姣, "Docker swarm集群增加节点和删除节点" [Adding and removing nodes in a Docker Swarm cluster], https://www.cnblogs.com/zoujiaojiao/p/10886262.html *
陈良章, "任务感知YARN资源调度器的研究与实现" [Research and Implementation of a Task-Aware YARN Resource Scheduler], China Master's Theses Full-text Database, Information Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111865714A (en) * 2020-06-24 2020-10-30 上海上实龙创智能科技股份有限公司 Cluster management method based on multi-cloud environment
CN111865714B (en) * 2020-06-24 2022-08-02 上海上实龙创智能科技股份有限公司 Cluster management method based on multi-cloud environment
CN112291326A (en) * 2020-10-23 2021-01-29 深圳市欢太科技有限公司 Load balancing method, load balancing device, storage medium and electronic equipment
CN112291326B (en) * 2020-10-23 2023-04-18 深圳市欢太科技有限公司 Load balancing method, load balancing device, storage medium and electronic equipment
CN113037856A (en) * 2021-03-23 2021-06-25 苏州云霄电子科技有限公司 Public cloud-based computing system, method, computer device, and storage medium
CN113037856B (en) * 2021-03-23 2022-07-08 苏州云霄电子科技有限公司 Public cloud-based computing system, method, computer device and storage medium
CN115086335A (en) * 2022-07-27 2022-09-20 北京思和科创软件有限公司 Container cloud node dynamic adding method and device, electronic equipment and storage medium
CN115934299A (en) * 2023-02-22 2023-04-07 智者四海(北京)技术有限公司 Migration system and method for YARN operation
CN117234721A (en) * 2023-09-18 2023-12-15 安徽继远软件有限公司 Cloud native system automatic operation research system and application based on operation layout self-adaption technology

Similar Documents

Publication Publication Date Title
CN110958311A (en) YARN-based shared cluster elastic scaling system and method
US10764125B2 (en) Method and device for training model in distributed system
EP3522013B1 (en) Method and system for migration of containers in a container orchestration platform between compute nodes
US8219997B2 (en) Execution the job that is divided into job tasks based on the estimated completion time
Wang et al. A three-phases scheduling in a hierarchical cloud computing network
CN109117252B (en) Method and system for task processing based on container and container cluster management system
KR101474872B1 (en) Method for elastic virtual cluster management for efficient construction of virtual clusters on cloud, apparatus for elastic virtual cluster management and cloud system using the same
CN106790092B (en) Remote procedure call server control system and method
CN112437129B (en) Cluster management method and cluster management device
CN109117244B (en) Method for implementing virtual machine resource application queuing mechanism
CN112905297A (en) Container cluster resource scheduling method and device
Guo et al. Energy-efficient fault-tolerant scheduling algorithm for real-time tasks in cloud-based 5G networks
CN109960579B (en) Method and device for adjusting service container
Duran-Limon et al. Using Lightweight virtual machines to run high performance computing applications: the case of the weather research and forecasting model
CN113515356B (en) Lightweight distributed resource management and task scheduler and method
Megharaj et al. Two level hierarchical model of load balancing in cloud
Bozyigit History-driven dynamic load balancing for recurring applications on networks of workstations
Htet et al. An implementation of job running backup function in user-PC computing system
Min et al. Issues on supporting public cloud virtual machine provisioning and orchestration
Valencia et al. Combining vm preemption schemes to improve vertical memory elasticity scheduling in clouds
Sun et al. Towards a scalable paas for service oriented software
Ismail Dynamic resource allocation mechanisms for grid computing environment
CN114745377B (en) Edge cloud cluster service system and implementation method
CN112395079B (en) Operation core job migration method under heterogeneous many-core architecture
Cogorno et al. Fault tolerance in Hadoop MapReduce implementation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200403)