CN114880150A - Fault isolation and field protection method and system - Google Patents


Info

Publication number
CN114880150A
Authority
CN
China
Prior art keywords
fault
isolation
service
service instance
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110166122.8A
Other languages
Chinese (zh)
Inventor
梁奂
姚文胜
乔宏明
陈靖翔
陈春华
胡军军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202110166122.8A
Publication of CN114880150A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a fault isolation and field protection method and system. The method comprises the following steps: a Kubelet monitors a service cluster on its Node and sends a fault notification to the Kubernetes Master management node when it detects that a service instance in the service cluster has failed; after receiving the fault notification, the Kubernetes Master management node modifies the label of the failed service instance online into an isolation label; the relabeled failed service instance automatically leaves the service cluster and is kept in a fault isolation zone, where a virtual fault isolation zone set up under the namespace of the service cluster holds the service instances that carry the isolation label; and fault reproduction and analysis are carried out on the failed service instances in the fault isolation zone based on the instances themselves and their logs. By isolating the failed service instance and preserving the fault site, the fault can be reproduced and its root cause analyzed in time, and normal operation of the service is guaranteed.

Description

Fault isolation and field protection method and system
Technical Field
The present invention relates to the field of container cluster management, and more particularly to fault handling in container cluster management of a telecommunication system.
Background
With the popularization of cloud-native applications, telecommunication systems increasingly use Kubernetes for container cluster management. Kubernetes is a container management system open-sourced by Google that provides automatic deployment, automatic scaling, and maintenance of container clusters. In a cloud-native operating environment, service instances in Kubernetes are typically deployed in a multi-instance cluster mode. When a single service instance fails, the failed instance is usually deleted and a new instance is created to replace it, so that the normal operating scale of the business is maintained. From an operation and maintenance perspective, however, this makes it difficult to reproduce the fault and to analyze the root cause of the service instance failure. There is therefore a need to improve the handling of service instance failures in a cloud-native environment.
Disclosure of Invention
To overcome the above-mentioned drawbacks, the present disclosure proposes an innovative fault isolation and field protection method and system in a cloud-native environment.
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
According to one aspect of the present disclosure, there is provided a fault isolation and field protection method in a Kubernetes system, comprising the following steps: a Kubelet monitors a service cluster on its Node and sends a fault notification to the Kubernetes Master management node when it detects that a service instance in the service cluster has failed; after receiving the fault notification, the Kubernetes Master management node modifies the label of the failed service instance online into an isolation label; the relabeled failed service instance automatically leaves the service cluster and is kept in a fault isolation zone, where a virtual fault isolation zone set up under the namespace of the service cluster holds the service instances that carry the isolation label; and fault reproduction and analysis are carried out on the failed service instances in the fault isolation zone based on the instances themselves and their logs.
According to another aspect of the present disclosure, there is provided a fault isolation and field protection system in a Kubernetes container management system that includes a Master management node and Node nodes in communication with each other, the fault isolation and field protection system comprising: a Kubelet process on the Node, configured to monitor a service cluster running on the Node and to send a fault notification to the Kubernetes Master management node when it detects that a service instance in the service cluster has failed; a label modification module on the Master management node, configured to modify the label of the failed service instance online into an isolation label after receiving the fault notification from the Kubelet process; and a fault isolation zone on the Node, configured to hold failed service instances that carry the isolation label and have automatically left the service cluster, so that faults can be reproduced and analyzed.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform a fault isolation and field protection method according to the above-described aspect of the present disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of the core components of a Kubernetes container management system;
FIG. 2 is a flow diagram illustrating the processing of a failed service instance in the event of a failure of a service instance in a Kubernetes system in the prior art;
FIG. 3 shows a block diagram of a fault isolation and field protection system implemented by improving the Deployment handling of a Kubernetes system according to an embodiment of the present disclosure; and
FIG. 4 shows a flowchart of a fault isolation and field protection method implemented by improving the Deployment handling of the Kubernetes system according to an embodiment of the present disclosure.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various exemplary embodiments of the disclosure. The following description includes various details to assist understanding, but these details are to be regarded as examples only and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are used only to provide a clear and consistent understanding of the disclosure. In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the disclosure.
FIG. 1 shows a schematic diagram of the core components of a Kubernetes container management system. A brief introduction to the Kubernetes container management system and some of its technical terminology follows.
In its physical architecture, the Kubernetes container management system comprises a Master management node and Node nodes that communicate with each other.
The Master management node is responsible for managing the entire container cluster and provides the access point to the cluster's resource data. It typically runs an etcd storage service, an API Server process, a Controller Manager process, and a Scheduler process.
A Node is a service node in the Kubernetes cluster architecture on which Pods run, i.e., a host for Pods. Each Node runs a Kubelet process that handles the tasks issued to the Node by the Master management node and manages the Pods and the containers within them. The Kubelet registers itself with the Master management node and reports the Node's status to it periodically, so that the Master management node knows the resource usage of each Node and can implement an efficient and balanced resource scheduling strategy.
A Pod (also referred to herein as a "service instance") is the smallest unit created, scheduled, and managed by Kubernetes. It provides a higher level of abstraction than individual containers, making deployment and management more flexible. A group of closely related containers is usually placed in the same Pod, and the containers in a Pod are scheduled as a whole by the Master management node onto a Node to run.
The main role of a ReplicaSet (RS for short) in Kubernetes is to maintain a set of Pod replicas, ensuring that a given number of Pods run properly in the cluster. The ReplicaSet continuously monitors the running state of the Pods and starts a new Pod replica whenever a Pod fails and the number of running replicas drops.
A Deployment is implemented on top of a ReplicaSet. After a Deployment is created, the Kubernetes Master management node schedules the service instances contained in the Deployment onto Node nodes to run.
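For illustration, a minimal sketch of a three-replica Deployment follows; all names (demo-app, demo-ns) and label values in it are assumed examples and are not taken from this disclosure. A Pod belongs to the ReplicaSet, and hence to the service cluster, exactly when its labels match the Deployment's selector.

# Illustrative sketch only: every name and label value below is an assumption.
kubectl create namespace demo-ns
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: demo-ns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
      zone: service-zone
  template:
    metadata:
      labels:
        app: demo-app
        zone: service-zone
    spec:
      containers:
      - name: demo
        image: nginx:1.21
EOF

The zone label is deliberately included in the selector of this sketch, so that overwriting it later is enough to release a Pod from the ReplicaSet; this mirrors the behaviour described in the embodiments below.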
FIG. 2 shows a flowchart of the prior-art processing of a failed service instance when a service instance (Pod) fails in a Kubernetes system.
In step S201, the Kubelet monitors the service cluster on the Node, which includes, for example, 3 service instances (Pods) A, B, and C. When it detects that service instance C has failed, the Kubelet sends a fault notification to the Kubernetes Master management node.
In step S202, the Master management node of Kubernetes deletes the failed service instance C and starts a new service instance D.
In step S203, operation and maintenance personnel handle the fault based on the log of service instance C.
Here, normal service operation is maintained by starting the new service instance D so that the original number of service instances stays unchanged, i.e., service instances A, B, and D are as many as the original service instances A, B, and C (still 3).
FIG. 3 shows a schematic diagram of a fault isolation and field protection system implemented by improving the Deployment handling of a Kubernetes system according to an embodiment of the present disclosure.
The fault isolation and field protection system according to an embodiment of the disclosure comprises a Kubelet process on a Node, a label modification module on the Master management node, and a fault isolation zone on the Node.
The Kubelet process monitors the service cluster running on its Node, which includes, for example, 3 service instances (Pods) A, B, and C. When it detects that service instance C (Pod C) in the service cluster has failed, it sends a fault notification to the Kubernetes Master management node.
The label modification module on the Master management node modifies the label of the failed service instance C online into an isolation label after receiving the fault notification from the Kubelet process.
The fault isolation zone on the Node is used to hold failed service instances, such as failed service instance C, that carry the isolation label and have automatically left the service cluster.
The virtual fault isolation zone is set up by operation and maintenance personnel under the namespace of the service cluster and is used to hold the service instances that carry the isolation label (for example, in their Pod metadata).
In the Kubernetes system, namespaces are used to organize the various kinds of Kubernetes objects. By "assigning" the objects in the system to different namespaces, logically separated projects, groups, or user groups are formed, which can be managed separately while sharing the resources of the entire cluster.
Any API object in Kubernetes can be identified by labels (Label), and labels can be attached to various resource objects (e.g., Node nodes, Pods, etc.).
It is emphasized that the isolation label is named differently from the other resource labels.
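For illustration, the virtual fault isolation zone can be thought of simply as the set of Pods in the service cluster's namespace that carry the isolation label, and it can be listed with an ordinary label selector; the namespace demo-ns and the label zone=Pending-zone below are assumed example values, not values prescribed by this disclosure.

# List the Pods currently held in the virtual fault isolation zone, i.e. the
# Pods in the (assumed) namespace that carry the (assumed) isolation label.
kubectl get pods -n demo-ns -l zone=Pending-zone --show-labels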
Operation and maintenance personnel can then reproduce and analyze the fault of the failed service instance C in the fault isolation zone based on the instance itself and its log.
After the reproduction and analysis of the fault of the failed service instance C in the fault isolation zone are completed, the failed service instance is deleted to clean up the environment.
FIG. 4 shows a flowchart of a fault isolation and field protection method implemented by improving the Deployment handling of the Kubernetes system according to an embodiment of the present disclosure.
In step S401, as in step S201 described above with respect to FIG. 2, the Kubelet monitors the service cluster on the Node, which includes, for example, 3 service instances (Pods) A, B, and C. When it detects that service instance C has failed, the Kubelet sends a fault notification to the Kubernetes Master management node.
In step S402, unlike the prior-art processing of step S202 described with respect to FIG. 2, the Kubernetes Master management node does not delete the failed service instance C after receiving the fault notification. Instead, it executes a label modification command through the label modification module, for example "kubectl label pod xxx zone=Pending-zone --overwrite", to modify the label of the failed service instance C online into the isolation label, thereby changing how Kubernetes handles the failed service instance C.
In step S403, the relabeled failed service instance C automatically leaves the service cluster and rejects new access requests, and the failed service instance C is kept in the fault isolation zone.
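This effect can be sketched with ordinary kubectl queries, continuing the assumed names from the sketches above: once the Pod's label no longer matches the selectors of its ReplicaSet and of the (assumed) Service fronting the cluster, it is no longer counted as a cluster member and is removed from the Service's endpoint list, so new requests are no longer routed to it.

# After relabeling, the Pod no longer matches the service-cluster selector...
kubectl get pods -n demo-ns -l zone=service-zone
# ...and it is dropped from the endpoints of the (assumed) Service in front of
# the cluster, so it stops receiving new access requests.
kubectl get endpoints demo-app -n demo-ns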
The virtual fault isolation zone is set up by operation and maintenance personnel under the namespace of the service cluster and is used to hold the service instances that carry the isolation label (for example, in their Pod metadata).
It is emphasized that the isolation label is named differently from the other resource labels.
In step S404, the fault of the failed service instance C in the fault isolation zone is reproduced and analyzed based on the failed service instance C and its log.
Preferably, upon completion of the reproduction and analysis of the fault of the failed service instance C in the fault isolation zone, the failed service instance is deleted to clean up the environment.
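For illustration only, the following sketch shows the kind of commands that might be used for such reproduction, analysis, and cleanup; the Pod name is a hypothetical example, and the kubectl exec line assumes the container image ships a shell.

# Inspect the quarantined Pod's state, events, and logs for root-cause analysis
# (the Pod name demo-app-7f9c6b5d4-abcde is a hypothetical example).
kubectl describe pod demo-app-7f9c6b5d4-abcde -n demo-ns
kubectl logs demo-app-7f9c6b5d4-abcde -n demo-ns
# Optionally enter the still-running container to reproduce the fault in place
# (assumes the image provides /bin/sh).
kubectl exec -it demo-app-7f9c6b5d4-abcde -n demo-ns -- /bin/sh
# When reproduction and analysis are finished, clean up the isolation zone.
kubectl delete pod demo-app-7f9c6b5d4-abcde -n demo-ns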
Preferably, after the failed service instance C has been kept in the fault isolation zone, the Kubernetes Master management node may start a new service instance D according to the ReplicaSet to replace the failed service instance C. The Master management node maintains normal service operation by keeping the number of service instances unchanged, here keeping the number of service instances A, B, and D the same as the number of the original service instances A, B, and C (still 3).
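Continuing the sketch above (all names are assumed), this self-healing behaviour can be confirmed by checking that the Deployment again reports its full replica count and that three Pods carry the service label.

# The Deployment should again report 3 ready replicas, with a new Pod replacing
# the quarantined one in the set selected by the service label.
kubectl get deployment demo-app -n demo-ns
kubectl get pods -n demo-ns -l zone=service-zone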
Therefore, according to the failed-service-instance processing method of the present disclosure, by improving the Deployment handling of Kubernetes, a failed service instance can be isolated in time and its fault site preserved, which facilitates fault reproduction and root-cause analysis. Moreover, normal operation of the service is guaranteed and user experience is not affected.
In addition, the way Kubernetes processes failed service instances is changed simply by modifying the labels of the failed service instances online; no hardware modification is involved, and the method is easy to implement.
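As a further illustration, the same online label change could also be issued directly through the Kubernetes API rather than the CLI; the merge patch below is a sketch of one possible realization of the label modification module under the assumed names used above, not the prescribed implementation.

# API-level equivalent of the label change: merge-patch the Pod's metadata.labels
# (Pod name, namespace, and label value are assumptions).
kubectl patch pod demo-app-7f9c6b5d4-abcde -n demo-ns --type merge \
  -p '{"metadata":{"labels":{"zone":"Pending-zone"}}}'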
The present disclosure may be implemented as any combination of devices, systems, integrated circuits, and computer programs on non-transitory computer readable media. One or more processors may be implemented as an Integrated Circuit (IC), an Application Specific Integrated Circuit (ASIC), or a large scale integrated circuit (LSI), a system LSI, or a super LSI, or as an ultra LSI package that performs some or all of the functions described in this disclosure.
The present disclosure includes the use of software, applications, computer programs, or algorithms. The software, applications, computer programs, or algorithms may be stored on a non-transitory computer-readable medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the figures. For example, one or more memories may store the software or algorithm as executable instructions, and one or more processors may execute that set of instructions to provide various functionality in accordance with the embodiments described in this disclosure.
Software and computer programs (which may also be referred to as programs, software applications, components, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural, object-oriented, functional, logical, or assembly or machine language. The term "computer-readable medium" refers to any computer program product, apparatus or device, such as magnetic disks, optical disks, solid state storage devices, memories, and Programmable Logic Devices (PLDs), used to provide machine instructions or data to a programmable data processor, including a computer-readable medium that receives machine instructions as a computer-readable signal.
By way of example, computer-readable media can comprise dynamic random access memory (DRAM), random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The subject matter of the present disclosure is provided as examples of apparatus, systems, methods, and programs for performing the features described in the present disclosure. However, other features or variations are contemplated in addition to the features described above. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace the technology of any of the implementations described above.
Additionally, the above description provides examples, and does not limit the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (8)

1. A method of fault isolation and field protection in a Kubernetes container management system, comprising the steps of:
a Kubelet monitors a service cluster on a Node, and sends a fault notification to a Master management node of Kubernetes when it detects that a service instance in the service cluster has failed;
after receiving the fault notification, the Master management node of Kubernetes modifies the label of the failed service instance online into an isolation label;
the relabeled failed service instance automatically leaves the service cluster and is kept in a fault isolation zone, wherein a virtual fault isolation zone is set up under the namespace of the service cluster for holding service instances that carry the isolation label; and
fault reproduction and analysis are carried out on the failed service instance in the fault isolation zone according to the failed service instance and its log.
2. The method of fault isolation and field protection according to claim 1, further comprising the steps of:
upon completion of the fault reproduction and analysis, the failed service instance is deleted.
3. The fault isolation and field protection method of claim 1, further comprising the steps of:
after the failed service instance is kept in the fault isolation zone, the Master management node of Kubernetes starts a new service instance according to the ReplicaSet to replace the failed service instance.
4. The fault isolation and field protection method according to any one of claims 1 to 3, wherein the isolation label is named differently from the other resource labels.
5. The fault isolation and field protection method according to any one of claims 1 to 3, wherein the Master management node modifies the label of the failed service instance online into the isolation label by executing a label modification command.
6. A fault isolation and field protection system in a Kubernetes container management system, the Kubernetes container management system including a Master management Node and a Node in communication with each other, the fault isolation and field protection system comprising:
a Kubelet process on the Node, configured to monitor a service cluster running on the Node and to send a fault notification to the Master management node of Kubernetes when it detects that a service instance in the service cluster has failed;
a label modification module on the Master management node, configured to modify the label of the failed service instance online into an isolation label after receiving the fault notification from the Kubelet process; and
a fault isolation zone on the Node, configured to hold failed service instances that carry the isolation label and have automatically left the service cluster, so that faults can be reproduced and analyzed.
7. The fault isolation and field protection system of claim 6, wherein the isolation label is named differently from the other resource labels.
8. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-5.
CN202110166122.8A 2021-02-05 2021-02-05 Fault isolation and field protection method and system Pending CN114880150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110166122.8A CN114880150A (en) 2021-02-05 2021-02-05 Fault isolation and field protection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110166122.8A CN114880150A (en) 2021-02-05 2021-02-05 Fault isolation and field protection method and system

Publications (1)

Publication Number Publication Date
CN114880150A true CN114880150A (en) 2022-08-09

Family

ID=82666715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110166122.8A Pending CN114880150A (en) 2021-02-05 2021-02-05 Fault isolation and field protection method and system

Country Status (1)

Country Link
CN (1) CN114880150A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089154A (en) * 2023-04-10 2023-05-09 苏州浪潮智能科技有限公司 Cloud native application fault processing method, system, equipment and computer medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination