CN112463535A - Multi-cluster exception handling method and device - Google Patents

Multi-cluster exception handling method and device

Info

Publication number
CN112463535A
Authority
CN
China
Prior art keywords
cluster
application container
deployment
task
scheduling
Legal status
Granted
Application number
CN202011356181.3A
Other languages
Chinese (zh)
Other versions
CN112463535B (en)
Inventor
康凤筠
李彤
沈一帆
白佳乐
Current Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Original Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202011356181.3A
Publication of CN112463535A
Application granted
Publication of CN112463535B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the application provide a multi-cluster exception handling method and device, applicable to the field of cloud computing. The method comprises two parts. First, it receives an application container deployment instruction sent by a federated cluster management platform and executes the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. Second, it monitors the resource state of each member cluster and the execution of the deployment task, and, once the task is detected to have failed, performs a corresponding task scheduling operation on it according to the resource state of each member cluster. The method and device effectively ensure service stability and high availability across multiple clusters.

Description

Multi-cluster exception handling method and device
Technical Field
This application relates to the field of cloud computing, and in particular to a multi-cluster exception handling method and device.
Background
As cloud computing has spread, the number of applications on the cloud has grown rapidly, and container deployment clusters have become larger and more numerous. A cluster often has hundreds of compute nodes, each running multiple containers. When a cluster fails, for example when its master node fails, the whole cluster can no longer orchestrate, schedule, or deploy containers. Meanwhile, as more applications are deployed on a cluster, the load on it grows until its remaining resources run short, and some application containers fail to start for lack of resources.
The inventors found that a common prior-art practice is to deploy each service container on several clusters, with several replicas on each cluster. This has two drawbacks. First, the application must specify which clusters to deploy to, so the service side and the platform side are not loosely coupled. Second, multi-cluster, multi-replica deployment wastes resources to some extent. Moreover, if one cluster fails, all access traffic is switched to another cluster, which can overload the containers on that cluster.
Disclosure of Invention
To address these problems in the prior art, this application provides a multi-cluster exception handling method and device that effectively ensure service stability and high availability across multiple clusters.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
In a first aspect, the present application provides a multi-cluster exception handling method, comprising:
receiving an application container deployment instruction sent by a federated cluster management platform, and executing the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction;
and monitoring the resource state of each member cluster and the execution of the application container deployment task, and, after detecting that the task has failed, performing a corresponding task scheduling operation on it according to the resource state of each member cluster.
Further, performing the corresponding task scheduling operation on the application container deployment task according to the resource state of each member cluster after detecting that the task has failed comprises:
after detecting that the application container deployment task has failed, isolating the corresponding member cluster;
and determining a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and scheduling the failed application container deployment task to the target member cluster.
Further, determining a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and scheduling the failed application container deployment task to the target member cluster, comprises:
determining, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as a target member cluster;
and evenly scheduling the failed application container deployment tasks to the target member clusters according to a balanced scheduling rule.
Further, executing the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction comprises:
determining, by the master cluster, the number of application container replicas to deploy on each of its member clusters according to the total number of application deployment replicas and the cluster scheduling policy in the instruction;
and issuing, by the master cluster, that replica count to each corresponding member cluster, each member cluster then performing the corresponding application container deployment operation according to the count.
In a second aspect, the present application provides a multi-cluster exception handling apparatus, comprising:
an application container deployment task determining module, configured to receive an application container deployment instruction sent by a federated cluster management platform and to execute the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction;
and a cluster exception task scheduling module, configured to monitor the resource state of each member cluster and the execution of the application container deployment task, and, after detecting that the task has failed, to perform a corresponding task scheduling operation on it according to the resource state of each member cluster.
Further, the cluster exception task scheduling module includes:
an abnormal cluster isolation unit, configured to isolate the corresponding member cluster after detecting that the application container deployment task has failed;
and a failed task scheduling unit, configured to determine a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and to schedule the failed application container deployment task to the target member cluster.
Further, the failed task scheduling unit includes:
a healthy cluster determining subunit, configured to determine, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as a target member cluster;
and a failed task balanced scheduling subunit, configured to evenly schedule the failed application container deployment tasks to the target member clusters according to a balanced scheduling rule.
Further, the application container deployment task determination module includes:
a master cluster decision unit, configured to determine, by the master cluster, the number of application container replicas to deploy on each of its member clusters according to the total number of application deployment replicas and the cluster scheduling policy in the instruction;
and a master cluster issuing unit, configured to issue, by the master cluster, that replica count to each corresponding member cluster, each member cluster then performing the corresponding application container deployment operation according to the count.
In a third aspect, the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-cluster exception handling method.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the multi-cluster exception handling method described above.
According to the technical solution above, an application container deployment instruction sent by the federated cluster management platform is received, and the corresponding application container deployment task is executed according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. The resource states of all member clusters and the execution of the deployment task are monitored to check that the current state matches expectations. Once a task failure is detected, the tasks that failed to deploy on the abnormal cluster can be rescheduled to another, normal cluster. Services therefore continue to be provided reliably in multi-cluster scenarios, and service stability and high availability across the clusters are effectively ensured.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show some embodiments of the present application; a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a multi-cluster exception handling method according to an embodiment of the present application;
FIG. 2 is a second flowchart illustrating a multi-cluster exception handling method according to an embodiment of the present application;
FIG. 3 is a third flowchart illustrating a multi-cluster exception handling method according to an embodiment of the present application;
FIG. 4 is a fourth flowchart illustrating a multi-cluster exception handling method according to an embodiment of the present application;
FIG. 5 is a block diagram of a multi-cluster exception handling apparatus according to an embodiment of the present application;
FIG. 6 is a second block diagram of a multi-cluster exception handling apparatus according to an embodiment of the present application;
FIG. 7 is a third block diagram of a multi-cluster exception handling apparatus according to an embodiment of the present application;
FIG. 8 is a fourth block diagram of a multi-cluster exception handling apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a multi-cluster architecture in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in these embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
A common prior-art approach deploys each service container on several clusters, with several replicas on each cluster. This approach has two drawbacks. First, the application must specify which clusters to deploy to, so the service side and the platform side are not loosely coupled. Second, multi-cluster, multi-replica deployment wastes resources to some extent. Moreover, if one cluster fails and all access traffic is switched to another cluster, the containers on that cluster become overloaded.

In view of this, the present application provides a multi-cluster exception handling method and device. An application container deployment instruction sent by a federated cluster management platform is received, and the corresponding application container deployment task is executed according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. The resource states of all member clusters and the execution of the deployment tasks are monitored to confirm that they match expectations. When a task failure is detected, the failed deployment tasks on an abnormal cluster can be rescheduled to another, normal cluster. This keeps services available reliably in multi-cluster scenarios and effectively guarantees service stability and high availability across the clusters.
To effectively ensure service stability and high availability across multiple clusters, the present application provides an embodiment of a multi-cluster exception handling method. Referring to FIG. 1, the method specifically comprises the following steps:
Step S101: receiving an application container deployment instruction sent by a federated cluster management platform, and executing the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction.
FIG. 9 shows the overall architecture of the present application. The core component that provides multi-cluster high availability is the federated cluster module. The application configures a template through the PAAS management platform. A template contains one or more containers, and its configuration includes the number of replicas to start (the container start count). Once configuration is complete, the template is started, and the PAAS management platform assembles a deployment message and sends the task to the federated cluster. After the federated cluster has started the containers, the application can view their startup status through the PAAS management platform.
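The template-to-message flow described above can be sketched as follows. This is a minimal illustration under stated assumptions. The class names `ContainerSpec` and `AppTemplate`, the function `build_deployment_message`, and the JSON field names are hypothetical and are not defined by the patent or by any PAAS product. The sketch only shows that the message carries a total replica count and a scheduling policy, never a target cluster.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class ContainerSpec:
    """One container inside a template (hypothetical fields)."""
    name: str
    image: str


@dataclass
class AppTemplate:
    """Template configured on the PAAS platform (hypothetical model)."""
    app_name: str
    containers: list[ContainerSpec]
    replicas: int  # total copies to start; no target cluster is named

    def __post_init__(self) -> None:
        if not self.containers:
            raise ValueError("a template must contain at least one container")
        if self.replicas < 1:
            raise ValueError("replicas must be a positive integer")


def build_deployment_message(template: AppTemplate, policy: str = "balanced") -> str:
    """Serialize a started template into the task sent to the federated cluster."""
    message = {
        "app": template.app_name,
        "containers": [asdict(c) for c in template.containers],
        "total_replicas": template.replicas,
        "scheduling_policy": policy,  # the federation decides placement, not the user
    }
    return json.dumps(message, sort_keys=True)
```

Keeping the cluster out of the message is what makes the clusters transparent to the user. All placement decisions are deferred to the federation's master cluster.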
The user therefore only needs to define the total number of template deployments on the PAAS management platform, without specifying a cluster. The PAAS management platform distributes the user's instances evenly across the member clusters, so the clusters are transparent to the user and availability is higher. The deployment synchronizes the orchestration policy to the designated clusters, which spreads the individual working instances across them.
The executing subject of the technical solution of the present application may be a federated cluster. Specifically, the cluster federation management and control component kubefed is installed and deployed on one cluster to manage the cluster federation. A cluster is added to the federation with the join command of the federation control plane and, when it exits, is removed from the federation with the unjoin command. The purpose of cluster federation is to let a single cluster manage multiple clusters uniformly, so that multiple clusters can be deployed and operated simultaneously through the federation.
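In open-source KubeFed the command-line client is `kubefedctl`, whose `join` and `unjoin` subcommands add and remove member clusters. The patent does not give the exact invocation. The sketch below only builds the argument lists and assumes kubeconfig contexts named after the clusters; the names `member-1` and `host` are hypothetical. Check the flags against the KubeFed version actually installed before passing a list to `subprocess.run(args, check=True)`.

```python
from __future__ import annotations

import shlex


def kubefedctl_args(action: str, cluster: str, cluster_context: str,
                    host_context: str) -> list[str]:
    """Build a kubefedctl join/unjoin command line (it is not executed here)."""
    if action not in ("join", "unjoin"):
        raise ValueError("action must be 'join' or 'unjoin'")
    return [
        "kubefedctl", action, cluster,
        "--cluster-context", cluster_context,    # kubeconfig context of the member cluster
        "--host-cluster-context", host_context,  # context of the cluster running kubefed
    ]


def as_shell(args: list[str]) -> str:
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

Building argument lists instead of shell strings avoids quoting problems when cluster names come from configuration.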
The clusters that need container deployment are added to the federated cluster. One of them is designated as the master cluster of the federation, and the rest are member clusters. Preferably, a federated cluster has exactly one master cluster, and the master role can be switched to another cluster as needed.
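The "exactly one master, freely switchable" rule can be modeled as a small registry. This is a minimal sketch under assumptions: the class `Federation` and its methods are hypothetical bookkeeping, not a KubeFed API. It enforces that the master is never also listed as a member and that the master cannot be removed before its role is handed over.

```python
from __future__ import annotations


class Federation:
    """Hypothetical registry: exactly one master cluster plus any number of members."""

    def __init__(self, master: str) -> None:
        self.master = master
        self.members: set[str] = set()

    def join(self, cluster: str) -> None:
        """Add a member cluster; the master cannot also be registered as a member."""
        if cluster == self.master:
            raise ValueError(f"{cluster} is already the master cluster")
        self.members.add(cluster)

    def unjoin(self, cluster: str) -> None:
        """Remove a member; the master role must be switched away first."""
        if cluster == self.master:
            raise ValueError("switch the master role before removing this cluster")
        self.members.discard(cluster)

    def switch_master(self, new_master: str) -> None:
        """Hand the master role to an existing member; the old master becomes a member."""
        if new_master not in self.members:
            raise KeyError(f"{new_master} is not a member cluster")
        self.members.remove(new_master)
        self.members.add(self.master)
        self.master = new_master

    def clusters(self) -> list[str]:
        """Master first, then members in name order."""
        return [self.master, *sorted(self.members)]
```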
Optionally, when an application container deployment instruction sent by the federated cluster management platform is received and the deployment is executed, the deployment policy is submitted to the master cluster. The master cluster then issues it to each member cluster of the federation.
Step S102: monitoring the resource state of each member cluster and the execution of the application container deployment task, and, after detecting that the task has failed, performing a corresponding task scheduling operation on it according to the resource state of each member cluster.
Optionally, the health of each member cluster and the status of the deployment tasks may be monitored. When a cluster is found to be unavailable, it is isolated, and the tasks that failed on it are redeployed to other healthy clusters in the federation. In addition, the state of the resources deployed on each cluster is monitored to locate the corresponding deployment resources. When a deployment task is found to have failed because a cluster lacks sufficient resources, the failed task is rescheduled to other clusters on which it can start normally.
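The two failure cases above can be expressed as one check over a snapshot of cluster and task state. The first case is an unavailable cluster. The second is a task that failed because its cluster lacks resources. This is a sketch under assumptions: the types `ClusterStatus` and `DeployTask`, their CPU and memory fields, and the function `check_federation` are hypothetical, and a real monitor would fill them from each cluster's API server.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class ClusterStatus:
    name: str
    available: bool       # e.g. master node / API server reachable
    free_cpu: float       # cores still allocatable
    free_mem_gib: float   # memory still allocatable


@dataclass
class DeployTask:
    task_id: str
    cluster: str
    cpu: float
    mem_gib: float
    state: TaskState


def check_federation(clusters: list[ClusterStatus],
                     tasks: list[DeployTask]) -> tuple[list[str], list[str]]:
    """Return (clusters to isolate, ids of tasks to reschedule)."""
    by_name = {c.name: c for c in clusters}
    to_isolate = sorted(c.name for c in clusters if not c.available)
    to_reschedule: list[str] = []
    for t in tasks:
        host = by_name[t.cluster]
        if not host.available:
            # The whole cluster is down, so every task on it is treated as failed.
            to_reschedule.append(t.task_id)
        elif t.state is TaskState.FAILED and (
                host.free_cpu < t.cpu or host.free_mem_gib < t.mem_gib):
            # Failed because this cluster lacks resources, so it must move elsewhere.
            to_reschedule.append(t.task_id)
    return to_isolate, to_reschedule
```

A task that failed on a healthy cluster with enough free resources is deliberately left alone here. Its failure has some other cause, which rescheduling would not fix.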
As the description above shows, the multi-cluster exception handling method provided in the embodiments of the present application receives an application container deployment instruction sent by a federated cluster management platform. It executes the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. By monitoring the resource states of all member clusters and the execution of the deployment task, it checks whether the current state matches expectations. When a task failure is detected, it can reschedule the tasks that failed to deploy on the abnormal cluster to another, normal cluster. This keeps services available reliably in multi-cluster scenarios and effectively guarantees service stability and high availability across the clusters.
In an embodiment of the multi-cluster exception handling method of the present application, exception handling is performed promptly and effectively after an application container deployment task on a member cluster fails, so that the task can still run smoothly. Referring to FIG. 2, step S102 may further comprise the following steps:
Step S201: after detecting that the application container deployment task has failed, isolating the corresponding member cluster.
Step S202: determining a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and scheduling the failed application container deployment task to the target member cluster.
Optionally, as in step S102, the health of each member cluster and the status of the deployment tasks are monitored. A cluster found to be unavailable is isolated (step S201), and the tasks that failed on it are redeployed to other healthy clusters in the federation (step S202). Likewise, when a deployment task fails because a cluster lacks sufficient resources, the failed task is rescheduled to other clusters on which it can start normally.
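Steps S201 and S202 can be sketched as "isolate, then place each failed task on a non-isolated cluster that still has room". This is a sketch under assumptions. The function `isolate_and_reschedule`, the single CPU dimension, and the first-fit placement rule are illustrative choices, not the patent's preset scheduling rule; the balanced rule of step S302 is a separate refinement.

```python
from __future__ import annotations


def isolate_and_reschedule(failed_cluster: str,
                           failed_tasks: dict[str, float],
                           free_cpu: dict[str, float],
                           isolated: set[str]) -> tuple[dict[str, str], list[str]]:
    """Isolate a cluster (S201) and move its failed tasks elsewhere (S202).

    failed_tasks maps task id -> CPU request; free_cpu maps cluster -> free cores
    and is updated in place as tasks are placed.
    Returns (task id -> target cluster, ids of tasks that fit nowhere).
    """
    isolated.add(failed_cluster)  # S201: no new work goes to this cluster
    placement: dict[str, str] = {}
    unscheduled: list[str] = []
    for task_id, cpu in failed_tasks.items():
        # S202 (first-fit stand-in): first non-isolated cluster, by name, with room.
        target = next((c for c in sorted(free_cpu)
                       if c not in isolated and free_cpu[c] >= cpu), None)
        if target is None:
            unscheduled.append(task_id)
            continue
        free_cpu[target] -= cpu
        placement[task_id] = target
    return placement, unscheduled
```

Returning the tasks that fit nowhere, rather than dropping them, lets a caller retry once isolated clusters recover.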
In an embodiment of the multi-cluster exception handling method of the present application, a failed task is scheduled accurately to other member clusters that can execute it, so that the task still runs smoothly. Referring to FIG. 3, step S202 may further comprise the following steps:
Step S301: determining, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as a target member cluster.
Step S302: evenly scheduling the failed application container deployment tasks to the target member clusters according to a balanced scheduling rule.
Optionally, the failed application container deployment tasks are scheduled evenly across the target member clusters according to the balanced scheduling rule, which maximizes cluster availability. Typically, the user configures a template and specifies the number of replicas to start on the PAAS management platform. The PAAS platform then automatically assembles the template into a federated deployment ("fed deployment") resource whose deployment policy is balanced deployment.
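Steps S301 and S302 can be sketched as a health filter followed by a least-loaded assignment. The patent does not define the "preset health condition", so this sketch assumes one: the cluster is available and both CPU and memory usage are at or below a hypothetical threshold of 0.8. It also assumes that "balanced" means each failed replica goes to the healthy cluster currently running the fewest replicas. `MemberCluster`, `healthy_targets`, and `balance_failed_replicas` are illustrative names.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass


@dataclass
class MemberCluster:
    name: str
    available: bool
    cpu_usage: float  # fraction of allocatable CPU in use, 0..1
    mem_usage: float  # fraction of allocatable memory in use, 0..1
    replicas: int     # replicas of this application already running there


def healthy_targets(clusters: list[MemberCluster],
                    max_usage: float = 0.8) -> list[MemberCluster]:
    """S301: keep clusters meeting the (assumed) health condition."""
    return [c for c in clusters
            if c.available and c.cpu_usage <= max_usage and c.mem_usage <= max_usage]


def balance_failed_replicas(targets: list[MemberCluster], failed: int) -> dict[str, int]:
    """S302: give each failed replica to the target with the fewest replicas so far."""
    if failed and not targets:
        raise RuntimeError("no healthy member cluster can take the failed replicas")
    heap = [(c.replicas, c.name) for c in targets]  # ties broken by cluster name
    heapq.heapify(heap)
    added = {c.name: 0 for c in targets}
    for _ in range(failed):
        count, name = heapq.heappop(heap)
        added[name] += 1
        heapq.heappush(heap, (count + 1, name))
    return added
```

Counting replicas already running, rather than splitting the failed ones equally, keeps the final spread even. Without this, a lightly loaded cluster would receive the same share as a heavily loaded one.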
In an embodiment of the multi-cluster exception handling method of the present application, application containers are deployed accurately and efficiently in the federated cluster. Referring to FIG. 4, step S101 may further comprise the following steps:
Step S401: determining, by the master cluster, the number of application container replicas to deploy on each of its member clusters according to the total number of application deployment replicas and the cluster scheduling policy in the instruction.
Step S402: issuing, by the master cluster, that replica count to each corresponding member cluster, each member cluster then performing the corresponding application container deployment operation according to the count.
Specifically, when an application container deployment instruction sent by the federated cluster management platform is received and the deployment is executed, the deployment policy is submitted to the master cluster. The master cluster then issues it to each member cluster of the federation. When containers are deployed through the federated cluster management platform (for example, a PAAS management platform), no target cluster needs to be specified. Only the number of replicas to deploy under the template is given, and the balanced scheduling policy is selected as the deployment policy. After completing its calculation, the master cluster issues the result to each member cluster of the federation. The containers required by the application are thus deployed evenly across the whole federated cluster, and their total equals the number of replicas specified by the application.
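The master cluster's calculation in step S401 can be sketched as an even split of the total replica count that preserves the total exactly. This is a sketch under assumptions. The function `split_replicas` is hypothetical, and it assumes that "balanced" means equal shares with any remainder given one apiece to clusters in name order. KubeFed's own `ReplicaSchedulingPreference` resource supports weighted distribution if unequal shares are needed.

```python
from __future__ import annotations


def split_replicas(total: int, clusters: list[str]) -> dict[str, int]:
    """Evenly divide `total` replicas among member clusters (step S401 sketch).

    The per-cluster counts always sum to `total`; the remainder goes
    one extra replica each to the first clusters in name order.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if not clusters:
        raise ValueError("at least one member cluster is required")
    base, remainder = divmod(total, len(clusters))
    return {name: base + (1 if i < remainder else 0)
            for i, name in enumerate(sorted(clusters))}
```

For example, splitting 10 replicas over three clusters yields 4, 3, and 3. The master would then issue each count to its member cluster (step S402).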
To effectively ensure service stability and high availability across multiple clusters, the present application provides an embodiment of a multi-cluster exception handling apparatus that implements all or part of the multi-cluster exception handling method. Referring to FIG. 5, the apparatus specifically comprises:
the application container deployment task determining module 10, configured to receive an application container deployment instruction sent by a federated cluster management platform and to execute the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction; and
the cluster exception task scheduling module 20, configured to monitor the resource state of each member cluster and the execution of the application container deployment task, and, after detecting that the task has failed, to perform a corresponding task scheduling operation on it according to the resource state of each member cluster.
As the description above shows, the multi-cluster exception handling apparatus provided in the embodiments of the present application receives an application container deployment instruction sent by a federated cluster management platform. It executes the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. By monitoring the resource states of all member clusters and the execution of the deployment task, it checks whether the current state matches expectations. When a task failure is detected, it can reschedule the tasks that failed to deploy on the abnormal cluster to another, normal cluster. This keeps services available reliably in multi-cluster scenarios and effectively guarantees service stability and high availability across the clusters.
In an embodiment of the multi-cluster exception handling apparatus of the present application, exception handling is performed promptly after an application container deployment task on a member cluster fails, so that the task can still run smoothly. Referring to FIG. 6, the cluster exception task scheduling module 20 comprises:
an abnormal cluster isolation unit 21, configured to isolate the corresponding member cluster after detecting that the application container deployment task has failed; and
a failed task scheduling unit 22, configured to determine a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and to schedule the failed application container deployment task to the target member cluster.
To accurately schedule a failed task to other member clusters that can execute it successfully, in an embodiment of the multi-cluster exception handling apparatus of the present application, referring to Fig. 7, the failed task scheduling unit 22 includes:
a healthy cluster determining subunit 221, configured to determine, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as the target member cluster; and
a failed task balanced scheduling subunit 222, configured to evenly schedule the failed application container deployment tasks to the target member clusters according to a balanced scheduling rule.
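The two subunits can be sketched as a filter followed by an even split. The application only states that a preset health condition and a balanced scheduling rule are used. The following are therefore assumptions for illustration:

* the `ClusterResources` type;
* the health condition: cluster Ready, not isolated, and free CPU and memory above fixed thresholds;
* the tie-break that gives leftover replicas to the clusters with the most free CPU.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClusterResources:
    """Hypothetical resource state reported by one member cluster."""
    name: str
    ready: bool          # cluster reachable and reporting Ready
    cpu_free: float      # free CPU cores
    mem_free_gib: float  # free memory in GiB
    isolated: bool = False


def healthy_targets(clusters: list[ClusterResources],
                    min_cpu: float = 1.0,
                    min_mem_gib: float = 2.0) -> list[ClusterResources]:
    """Keep clusters that satisfy an assumed health condition:
    Ready, not isolated, and above fixed free-CPU and free-memory thresholds."""
    return [c for c in clusters
            if c.ready and not c.isolated
            and c.cpu_free >= min_cpu and c.mem_free_gib >= min_mem_gib]


def balance_failed_replicas(failed: int,
                            targets: list[ClusterResources]) -> dict[str, int]:
    """Spread `failed` replicas as evenly as possible over the targets.

    Every target gets failed // len(targets); the remainder goes one each
    to the clusters with the most free CPU.
    """
    if not targets:
        raise RuntimeError("no healthy member cluster available")
    base, extra = divmod(failed, len(targets))
    ordered = sorted(targets, key=lambda c: c.cpu_free, reverse=True)
    return {c.name: base + (1 if i < extra else 0)
            for i, c in enumerate(ordered)}
```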
To deploy application containers in the federated cluster accurately and efficiently, in an embodiment of the multi-cluster exception handling apparatus of the present application, referring to Fig. 8, the application container deployment task determining module 10 includes:
a master cluster decision unit 11, configured to have the master cluster determine the number of application container replicas to deploy on each member cluster corresponding to it, according to the total number of application deployment replicas and the cluster scheduling policy in the application container deployment instruction; and
a master cluster issuing unit 12, configured to have the master cluster issue that replica count to each corresponding member cluster, each member cluster then performing the corresponding application container deployment operation according to its replica count.
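One possible reading of the master cluster decision unit is a weighted split of the total replica count. This sketch assumes the cluster scheduling policy is a set of non-negative integer weights, one per member cluster. The function name `split_replicas` is hypothetical. The largest-remainder method guarantees that the per-cluster counts add up exactly to the total.

```python
from __future__ import annotations


def split_replicas(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Divide `total` replicas among member clusters in proportion to
    integer `weights` (assumed form of the cluster scheduling policy).

    Uses the largest-remainder method with integer arithmetic, so the
    per-cluster counts always sum exactly to `total`.
    """
    weight_sum = sum(weights.values())
    if total < 0 or weight_sum <= 0:
        raise ValueError("need total >= 0 and a positive weight sum")
    counts = {n: total * w // weight_sum for n, w in weights.items()}
    remainders = {n: total * w % weight_sum for n, w in weights.items()}
    leftover = total - sum(counts.values())
    # Largest fractional share first; ties broken by name for determinism.
    for name in sorted(weights, key=lambda n: (-remainders[n], n))[:leftover]:
        counts[name] += 1
    return counts
```

The master cluster issuing unit would then send each count to its member cluster, for example as that cluster's Deployment replica count. The application does not describe how the counts are delivered.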
To further illustrate this solution, the present application also provides a specific application example in which the above multi-cluster exception handling apparatus implements the multi-cluster exception handling method, as follows:
Step 1): Kubernetes Cluster Federation v2 (KubeFed) is installed on the master node of a cluster and is used to manage the federated cluster. Cluster federation provides a mechanism through which a single cluster centrally manages multiple Kubernetes clusters. Through the cluster federation, workloads can be deployed to and run on multiple clusters simultaneously.
Step 2): The clusters are joined to the federated cluster through KubeFed. One of them is designated as the master cluster of the federated cluster, and the remaining clusters are member clusters. A federated cluster has exactly one master cluster. The master role can be reassigned to any cluster, so if the master cluster fails, another cluster can quickly take over as the new master.
Step 3): The orchestration strategy of the cluster federation is implemented by creating a ReplicaSchedulingPreference (RSP). The controller watches the RSP and uses its content to place the workload on the specified clusters. Suppose a cluster has insufficient resources and the pods of a Deployment cannot start. The KubeFed RSP controller then detects the change and reads the RSP content. It finds the corresponding Deployments and recalculates the number of replicas for each cluster according to the RSP definition. Finally, it synchronizes the new replica counts to the federated cluster. In this way, replicas that cannot start because of insufficient resources are moved to clusters with sufficient resources, and the federation continues to provide stable service.
Step 4): The federation-controller collects the status and resource information of each member cluster. It synchronizes and schedules the federated resources by watching custom resources (CRDs). When a cluster exception is detected, the Deployment's replicas are redistributed according to the total replica count and the cluster policy. The new replica counts are then synchronized to the federated cluster, so that instances are rescheduled onto normal clusters.
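Steps 3) and 4) can be approximated by the following simplified model. It is not KubeFed's actual RSP planner. It makes three assumptions:

* integer weights per cluster (an RSP also supports per-cluster minimum and maximum replica limits, which are omitted here);
* a Ready flag collected from each member cluster;
* a hypothetical `capacity` value giving how many replicas each cluster can actually start.

The function names are illustrative only.

```python
from __future__ import annotations


def weighted_split(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Split `total` by positive integer weights (largest remainder)."""
    s = sum(weights.values())
    counts = {n: total * w // s for n, w in weights.items()}
    rem = {n: total * w % s for n, w in weights.items()}
    leftover = total - sum(counts.values())
    for n in sorted(weights, key=lambda n: (-rem[n], n))[:leftover]:
        counts[n] += 1
    return counts


def replan(total: int, weights: dict[str, int], ready: dict[str, bool],
           capacity: dict[str, int]) -> tuple[dict[str, int], int]:
    """Assumed, simplified RSP-style replan.

    Only Ready clusters with positive weight are eligible (cluster exception
    handling, Step 4); no cluster gets more replicas than it can start
    (insufficient resources, Step 3). Overflow is re-split by weight among
    clusters that still have room. Returns (placement, unplaced_replicas).
    """
    placement = {n: 0 for n in weights}
    remaining = total
    while remaining > 0:
        open_w = {n: w for n, w in weights.items()
                  if w > 0 and ready.get(n, False)
                  and placement[n] < capacity.get(n, 0)}
        if not open_w:
            break  # nowhere left to place replicas
        for n, k in weighted_split(remaining, open_w).items():
            take = min(k, capacity[n] - placement[n])
            placement[n] += take
            remaining -= take
    return placement, remaining


def sync_patch(current: dict[str, int],
               placement: dict[str, int]) -> dict[str, int]:
    """Clusters whose replica count changes, i.e. what would be synchronized."""
    return {n: k for n, k in placement.items() if current.get(n, 0) != k}
```

`sync_patch` returns only the clusters whose counts change. This mirrors the idea of synchronizing new replica counts rather than redeploying everything. A non-zero unplaced count means the healthy clusters lack capacity for the full total.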
As can be seen from the above, the present application can achieve at least the following technical effects:
Multiple clusters are organized into a cluster federation for unified management, and tasks are distributed evenly across the clusters. The states of the clusters and of the tasks deployed on them are monitored. When a task deployed on a cluster fails because that cluster has failed, the failed task is automatically rescheduled to a normal cluster. The clusters therefore continue to provide stable service and their high availability is maximized. The specific advantages are as follows:
1. The application template is deployed on a federation formed by multiple clusters. Compared with the traditional single-cluster, multi-replica redundant deployment, this reduces resource waste.
2. Automatic multi-cluster load balancing reduces the pressure on any single cluster and improves service performance.
3. Task failures are detected automatically and failed tasks are rescheduled to normal clusters, ensuring the high availability of multiple clusters.
4. The cluster configuration information is transparent to the application.
In terms of hardware, to effectively ensure the service stability and high availability of multiple clusters, the present application provides an embodiment of an electronic device for implementing all or part of the multi-cluster exception handling method. The electronic device specifically includes the following:
a processor, a memory, a communications interface, and a bus. The processor, the memory, and the communications interface communicate with one another through the bus. The communications interface transmits information between the multi-cluster exception handling apparatus and related equipment, such as a core service system, user terminals, and related databases. The logic controller may be a desktop computer, a tablet computer, a mobile terminal, or the like, but this embodiment is not limited thereto. In this embodiment, the logic controller may be implemented with reference to the embodiments of the multi-cluster exception handling method and of the multi-cluster exception handling apparatus described above. Those embodiments are incorporated herein, and repeated details are omitted.
It is understood that the user terminal may include a smartphone, a tablet, a network set-top box, a portable computer, a desktop computer, a personal digital assistant (PDA), an in-vehicle device, a smart wearable device, and the like. The smart wearable device may include smart glasses, a smart watch, a smart band, and the like.
In practical applications, part of the multi-cluster exception handling method may be executed on the electronic device side as described above, or all operations may be completed in the client device. The choice depends on factors such as the processing capability of the client device and the constraints of the user's usage scenario; the present application does not limit this. If all operations are performed in the client device, the client device may further include a processor.
The client device may have a communication module (i.e., a communication unit), and may be communicatively connected to a remote server to implement data transmission with the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 10 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 10, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 10 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the functions of the multi-cluster exception handling method may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:
Step S101: Receive an application container deployment instruction sent by the federated cluster management platform. Execute the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction.
Step S102: Monitor the resource state of each member cluster and the execution status of the application container deployment task. After detecting that the task has failed, perform the corresponding task scheduling operation on it according to the resource state of each member cluster.
As can be seen from the above description, the electronic device provided in the embodiments of the present application receives an application container deployment instruction sent by the federated cluster management platform. It executes the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction. By monitoring the resource states of all member clusters and the execution of the task, it checks whether the current state matches expectations. After a task failure is detected, the task that failed to deploy on an abnormal cluster can be rescheduled to another, normal cluster. This keeps services continuously and reliably available in multi-cluster scenarios and effectively ensures the service stability and high availability of multiple clusters.
In another embodiment, the multi-cluster exception handling apparatus may be configured separately from the central processor 9100. For example, it may be configured as a chip connected to the central processor 9100, with the functions of the multi-cluster exception handling method implemented under the control of the central processor.
As shown in Fig. 10, the electronic device 9600 may further include a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. Note that the electronic device 9600 does not necessarily include all of the components shown in Fig. 10. It may also include components not shown in Fig. 10, for which reference may be made to the prior art.
As shown in Fig. 10, the central processor 9100, sometimes referred to as a controller or operation controller, may include a microprocessor or another processor device and/or logic device. It receives input and controls the operation of the components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It can store failure-related information as well as programs for processing that information. The central processor 9100 can execute the programs stored in the memory 9140 to store or process information.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid-state memory, such as read-only memory (ROM), random access memory (RAM), a SIM card, or the like. It may also be a memory that retains information when power is off and that can be selectively erased and supplied with more data; such a memory is sometimes called an EPROM. The memory 9140 could also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). It may also include an application/function storage portion 9142. This portion stores application programs and function programs, or the programs through which the central processor 9100 executes the operation flow of the electronic device 9600.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps of the multi-cluster exception handling method executed by the server or the client in the foregoing embodiments. The medium stores a computer program that, when executed by a processor, implements all of those steps. For example, the processor implements the following steps when executing the computer program:
Step S101: Receive an application container deployment instruction sent by the federated cluster management platform. Execute the corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the instruction.
Step S102: Monitor the resource state of each member cluster and the execution status of the application container deployment task. After detecting that the task has failed, perform the corresponding task scheduling operation on it according to the resource state of each member cluster.
As can be seen from the above description, the computer-readable storage medium provided in the embodiments of the present application causes the following to happen:

* An application container deployment instruction sent by the federated cluster management platform is received.
* The corresponding application container deployment task is executed according to the total number of application deployment replicas and the cluster scheduling policy in the instruction.
* The resource states of all member clusters and the execution of the task are monitored to check whether the current state matches expectations.
* When a task failure is detected, the task that failed to deploy on an abnormal cluster can be rescheduled to another, normal cluster.

This keeps services continuously and reliably available in multi-cluster scenarios and effectively ensures the stability and high availability of multi-cluster services.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and implementations of the present invention are explained herein through specific embodiments. The description of the embodiments is only intended to help understand the method and core idea of the present invention. A person skilled in the art may, based on the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for multi-cluster exception handling, the method comprising:
receiving an application container deployment instruction sent by a federated cluster management platform, and executing a corresponding application container deployment task according to the total number of application deployment replicas and a cluster scheduling policy in the application container deployment instruction; and
monitoring the resource state of each member cluster and the execution status of the application container deployment task, and, after detecting that the application container deployment task has failed, performing a corresponding task scheduling operation on the application container deployment task according to the resource state of each member cluster.
2. The multi-cluster exception handling method according to claim 1, wherein performing, after detecting that the application container deployment task has failed, a corresponding task scheduling operation on the application container deployment task according to the resource state of each member cluster comprises:
isolating the corresponding member cluster after detecting that the application container deployment task has failed; and
determining a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and scheduling the failed application container deployment task to the target member cluster.
3. The multi-cluster exception handling method according to claim 2, wherein determining a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and scheduling the failed application container deployment task to the target member cluster, comprises:
determining, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as the target member cluster; and
evenly scheduling the failed application container deployment tasks to the target member cluster according to a balanced scheduling rule.
4. The multi-cluster exception handling method according to claim 1, wherein executing a corresponding application container deployment task according to the total number of application deployment replicas and the cluster scheduling policy in the application container deployment instruction comprises:
determining, by the master cluster, the number of application container replicas to deploy on each member cluster corresponding to the master cluster, according to the total number of application deployment replicas and the cluster scheduling policy in the application container deployment instruction; and
issuing, by the master cluster, the replica count to each corresponding member cluster, and performing, by each member cluster, the corresponding application container deployment operation according to that replica count.
5. A multi-cluster exception handling apparatus comprising:
an application container deployment task determining module, configured to receive an application container deployment instruction sent by a federated cluster management platform and to execute a corresponding application container deployment task according to the total number of application deployment replicas and a cluster scheduling policy in the application container deployment instruction; and
a cluster exception task scheduling module, configured to monitor the resource state of each member cluster and the execution status of the application container deployment task, and, after detecting that the application container deployment task has failed, to perform a corresponding task scheduling operation on the application container deployment task according to the resource state of each member cluster.
6. The multi-cluster exception handling apparatus of claim 5 wherein the cluster exception task scheduling module comprises:
an abnormal cluster isolation unit, configured to isolate the corresponding member cluster after detecting that the application container deployment task has failed; and
a failed task scheduling unit, configured to determine a target member cluster according to a preset scheduling rule and the resource state of each member cluster, and to schedule the failed application container deployment task to the target member cluster.
7. The multi-cluster exception handling apparatus of claim 6 wherein the failed task scheduling unit comprises:
a healthy cluster determining subunit, configured to determine, according to the resource state of each member cluster, a member cluster whose resource state meets a preset health condition as the target member cluster; and
a failed task balanced scheduling subunit, configured to evenly schedule the failed application container deployment tasks to the target member cluster according to a balanced scheduling rule.
8. The multi-cluster exception handling apparatus of claim 5, wherein the application container deployment task determination module comprises:
a master cluster decision unit, configured to determine, by the master cluster, the number of application container replicas to deploy on each member cluster corresponding to the master cluster, according to the total number of application deployment replicas and the cluster scheduling policy in the application container deployment instruction; and
a master cluster issuing unit, configured to issue, by the master cluster, the replica count to each corresponding member cluster, the member clusters performing the corresponding application container deployment operation according to that replica count.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the multi-cluster exception handling method of any one of claims 1 to 4 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the multi-cluster exception handling method of any one of claims 1 to 4.
CN202011356181.3A 2020-11-27 2020-11-27 Multi-cluster exception handling method and device Active CN112463535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011356181.3A CN112463535B (en) 2020-11-27 2020-11-27 Multi-cluster exception handling method and device

Publications (2)

Publication Number Publication Date
CN112463535A 2021-03-09
CN112463535B 2024-05-10

Family

ID=74809736

Country Status (1)

Country Link
CN (1) CN112463535B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905306A (en) * 2021-03-29 2021-06-04 建信金融科技有限责任公司 Multi-cluster container management method and device, electronic equipment and storage medium
CN113179331A (en) * 2021-06-11 2021-07-27 苏州大学 Distributed special protection service scheduling method facing mobile edge calculation
CN113190364A (en) * 2021-04-30 2021-07-30 平安壹钱包电子商务有限公司 Remote call management method and device, computer equipment and readable storage medium
CN113342552A (en) * 2021-07-05 2021-09-03 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device, storage medium and electronic equipment
CN113391902A (en) * 2021-06-22 2021-09-14 未鲲(上海)科技服务有限公司 Task scheduling method and device and storage medium
CN113590256A (en) * 2021-06-03 2021-11-02 新浪网技术(中国)有限公司 Application deployment method and device for multiple Kubernetes clusters
CN113626280A (en) * 2021-06-30 2021-11-09 广东浪潮智慧计算技术有限公司 Cluster state control method and device, electronic equipment and readable storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN105515812A (en) * 2014-10-15 2016-04-20 中兴通讯股份有限公司 Fault processing method of resources and device
CN106713056A (en) * 2017-03-17 2017-05-24 郑州云海信息技术有限公司 Method for selecting and switching standbys under distributed cluster
WO2020097814A1 (en) * 2018-11-14 2020-05-22 深圳市互盟科技股份有限公司 Method and apparatus for installing container orchestration engine, and electronic device
CN111290834A (en) * 2020-01-21 2020-06-16 苏州浪潮智能科技有限公司 Method, device and equipment for realizing high availability of service based on cloud management platform
CN111385114A (en) * 2018-12-28 2020-07-07 华为技术有限公司 VNF service instantiation method and device
CN111800303A (en) * 2020-09-09 2020-10-20 杭州朗澈科技有限公司 Method, device and system for guaranteeing number of available clusters in mixed cloud scene

Non-Patent Citations (1)

Title
KAIREN BAI: "KubeFed Kubernetes Federation v2 详解" (KubeFed Kubernetes Federation v2 explained), Kubernetes Chinese Community (《KUBERNETES 中文社区》), HTTPS://WWW.KUBERNETES.ORG.CN/5702.HTML, 12 August 2019 *


Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant