WO2023275984A1

WO2023275984A1 - Virtualization system restoration device and virtualization system restoration method

Info

Publication number: WO2023275984A1
Application number: PCT/JP2021/024528
Authority: WO
Inventors: 健太篠原; 紀貴堀米; 真生上野
Original assignee: 日本電信電話株式会社
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2023-01-05
Also published as: JPWO2023275984A1

Abstract

The present invention includes: a plurality of clusters (12A, 12B) in which containers related to applications (15a, 15b) virtually created on a physical machine by container virtualization software are clustered and arranged; and internal abnormality detection units (17A, 17B) and an external abnormality detection unit (23) that are virtually created outside the clusters 12A, 12B. The external abnormality detection unit (23) is configured to detect, when the internal abnormality detection units (17A, 17B) detect abnormality related to the applications (15a, 15b), a cluster (cluster 12A or 12B) where containers related to the applications (15a, 15b) with abnormality are arranged as being abnormal.

Description

Virtualization system recovery device and virtualization system recovery method

The present invention relates to a virtualization system recovery device and a virtualization system recovery method for realizing abnormality detection and failure recovery of containers and applications operating on containers in a computing infrastructure based on virtual machines and containers.

The virtual machine mentioned above is a computer that realizes the same functions as a physical computer in software. A container is a virtualization technology in which an application is packaged into an environment called a "container" and run on a container engine. In conventional container-based technology, anomaly detection and failure recovery of containers and of applications running on containers are realized mainly by the Liveness/Readiness Probe functions (also called probe functions) of Kubernetes, which is described below.

Kubernetes is container virtualization software that creates and clusters containers such as Docker, and is open source software. The Liveness Probe function performs control such as restarting the container, and the Readiness Probe function performs control such as whether or not the container accepts requests. There is a technique described in Non-Patent Document 1 as this type of conventional technique.

In virtualization systems in general, not only in the container technology described above, recovery work is performed manually based on an alert issued when a failure in the virtualization system is detected. Because the recovery work is performed manually after the alert is issued, it is difficult to shorten the time from failure occurrence to normalization.

If a failure is detected and recovered by the probe function of Kubernetes, the failure monitoring cycle can only be set to a predetermined slow cycle such as 1 second. Therefore, when anomalies must be detected and failures recovered as quickly as possible, there is a problem that detection and recovery cannot be performed faster than the anomaly detection and recovery functions of Kubernetes in its default state.

The present invention has been made in view of such circumstances, and an object of the present invention is to detect anomalies and recover from failures occurring in a virtualization system faster than the anomaly detection and recovery functions of container virtualization software.

In order to solve the above-mentioned problems, the virtualization system recovery device of the present invention comprises: a plurality of clusters, each including a computing resource cluster that is virtually created on a physical machine by container virtualization software and in which the virtually created containers are clustered and arranged, and a cluster management unit that manages the placement and operation of the virtually created and clustered containers; an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computing resource cluster and cluster management unit, and detects an anomaly in the container; and an external anomaly detection unit that is virtually created outside the plurality of clusters and, when the internal anomaly detection unit detects an anomaly in the container, detects the cluster in which the abnormal container is arranged as abnormal.

According to the present invention, anomaly detection and failure recovery can be performed faster than the anomaly detection and recovery function of container virtualization software when a failure occurs in a virtualization system.

FIG. 1 is a block diagram showing the configuration of a virtualization system recovery device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration in which an endpoint setting unit and a Pod are deployed in a 1:1 configuration by a failure handling deployment instruction unit in the virtualization system recovery device of the present embodiment.
FIG. 3 is a block diagram for explaining the first container abnormality detection processing by the Pods of the virtualization system recovery device of the present embodiment.
FIG. 4 is a block diagram for explaining the second abnormality detection processing using a routing table provided for each worker node of the virtualization system recovery device of the present embodiment.
FIG. 5 is a block diagram for explaining the third abnormality detection processing by monitoring the daemon of the virtual switch provided for each worker node of the virtualization system recovery device of the present embodiment.
FIG. 6 is a block diagram for explaining the fourth abnormality detection processing by monitoring the daemon of the container runtime provided for each worker node of the virtualization system recovery device of the present embodiment.
FIG. 7 is a block diagram for explaining the fifth abnormality detection processing by monitoring each worker node of the virtualization system recovery device of the present embodiment.
FIG. 8 is a block diagram for explaining the sixth abnormality detection processing by monitoring a DB externally attached to a cluster of the container system of the virtualization system recovery device of the present embodiment.
FIG. 9 is a block diagram showing a configuration for explaining abnormality detection processing by the external abnormality detection unit when failures occur in a plurality of clusters.
FIG. 10 is a block diagram for explaining the abnormality handling processing of the virtualization system recovery device of the present embodiment.
FIG. 11 is a diagram showing the correspondence between domain names and resolution destination IP addresses in a DNS record table.
FIG. 12 is a diagram showing how the IP address of a faulty cluster is deleted from the DNS record table.
FIG. 13 is a flowchart for explaining the operation of the abnormality handling processing of the virtualization system recovery device of the present embodiment.
FIG. 14 is a block diagram showing the configuration of a virtualization system recovery device according to Modification 1 of the embodiment of the present invention.
FIG. 15 is a block diagram showing the configuration of a virtualization system recovery device according to Modification 2 of the embodiment of the present invention.
FIG. 16 is a hardware configuration diagram showing an example of a computer that implements the functions of the virtualization system recovery device according to the present embodiment.

BEST MODE FOR CARRYING OUT THE INVENTION Hereinafter, embodiments of the present invention will be described with reference to the drawings. However, in all the drawings of this specification, the same reference numerals are given to components having corresponding functions, and descriptions thereof will be omitted as appropriate.
<Configuration of Embodiment>
FIG. 1 is a block diagram showing the configuration of a virtualization system restoration device according to an embodiment of the present invention.

The container system 20 shown in FIG. 1 is a virtualization system configured by a plurality of clusters (in this example, a first cluster 12A and a second cluster 12B) in which containers are clustered. The first cluster 12A is composed of a cluster management unit 14A and a computational resource cluster 15A. The second cluster 12B is composed of a cluster management unit 14B and a computational resource cluster 15B.

The cluster management units 14A and 14B each include a communication distribution unit 14a, a computational resource operation unit 14b, a computational resource management unit 14c, a container configuration reception unit 14d, a container placement destination determination unit 14e, and a container management unit 14f. The computational resource clusters 15A and 15B each include a plurality of applications 15a and 15b.

The cluster management units 14A and 14B are also referred to as the cluster management unit 14, and the computational resource clusters 15A and 15B are also referred to as the computational resource cluster 15.

The virtualization system recovery device (also referred to as the recovery device) 10 shown in FIG. 1 includes the cluster management units 14A and 14B, the computational resource clusters 15A and 15B, internal anomaly detection units 17A and 17B, anomaly recovery handling units 18A and 18B, failure handling deployment instruction units 19A and 19B, a distribution destination switching unit 21, and an external anomaly detection unit 23.

The internal anomaly detection units 17A and 17B are also referred to as the internal anomaly detection unit 17, the anomaly recovery handling units 18A and 18B as the anomaly recovery handling unit 18, and the failure handling deployment instruction units 19A and 19B as the failure handling deployment instruction unit 19.

Inside each of the clusters 12A and 12B, an internal anomaly detection unit 17, an anomaly recovery handling unit 18, and a failure handling deployment instruction unit 19 are deployed. A distribution destination switching unit 21 and an external anomaly detection unit 23 are provided outside the clusters 12A and 12B. However, the internal anomaly detection unit 17, the anomaly recovery handling unit 18, the failure handling deployment instruction unit 19, the distribution destination switching unit 21, and the external anomaly detection unit 23 are all arranged outside the cluster management unit 14 and the computational resource cluster 15 that are virtually created by the container virtualization software. The internal anomaly detection unit 17, the anomaly recovery handling unit 18, and the failure handling deployment instruction unit 19 may also be deployed outside the respective clusters 12A and 12B, in the same way as the distribution destination switching unit 21 and the external anomaly detection unit 23.

Since the first and second clusters 12A and 12B have substantially the same configuration, the functional configuration is described below using the first cluster 12A as a representative.

The computational resource cluster 15 is configured with a plurality of applications 15a and 15b. The applications 15a and 15b are, in other words, Pods (see Pods 15a and 15b shown in FIG. 3), each a management unit for a collection of one or more containers. A Pod is the smallest unit of an application that can be executed on Kubernetes (container virtualization software). That is, the applications 15a and 15b, as Pods, create containers and cluster them, and the clusters are operated on the container engine. The computational resource cluster 15 is virtually created on a physical machine by the container virtualization software, and the virtually created containers are clustered and arranged in it.

The cluster management unit 14 manages the placement and operation of the virtually created and clustered containers. The cluster management unit 14 includes the communication distribution unit 14a, the computational resource operation unit 14b, the computational resource management unit 14c, the container configuration reception unit 14d, the container placement destination determination unit 14e, and the container management unit 14f, which are configured as described below.

In the recovery device 10 having such a configuration, the failure handling deployment instruction unit (also referred to as the deployment instruction unit) 19 deploys (arranges) the endpoint setting units 14j and 14k and the Pods 15a and 15b shown in FIG. 2 in a 1:1 configuration. The endpoint setting units 14j and 14k are each associated with one of the Pods 15a and 15b, set the distribution ratio (%) of traffic to each of the Pods 15a and 15b, and serve as the endpoints of communication data.

The internal anomaly detection unit 17 shown in FIG. 1 detects an anomaly in Pods (applications) 15a and 15b, which are one or more containers in the container system 20.

The anomaly recovery handling unit 18 changes to 0% the weight value of the deployment instruction unit 19 associated with the Pod (for example, Pod 15a) in which an anomaly has been detected by the internal anomaly detection unit 17, and transmits this change command, which isolates the abnormal Pod 15a, to the communication distribution unit 14a. When restoring the isolated Pod 15a, the anomaly recovery handling unit 18 transmits to the communication distribution unit 14a a restoration command that gradually increases the traffic to the Pod 15a up to a predetermined traffic value.
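
The isolation and gradual restoration described above can be sketched in Python as follows. This is a minimal illustration only: the per-Pod weight table, the function names isolate_pod and restoration_steps, and the step size are assumptions introduced here and do not appear in the embodiment.

```python
from typing import Dict, List


def isolate_pod(weights: Dict[str, int], pod: str) -> Dict[str, int]:
    """Return a copy of the weight table with the abnormal Pod's weight set to 0%."""
    updated = dict(weights)
    updated[pod] = 0
    return updated


def restoration_steps(target: int, step: int) -> List[int]:
    """Weights (in %) to apply one after another while ramping a Pod back up."""
    if target < 0 or step <= 0:
        raise ValueError("target must be >= 0 and step must be > 0")
    steps = list(range(step, target, step))
    steps.append(target)  # always finish exactly on the predetermined value
    return steps
```

For example, under these assumptions a Pod restored to a 50% share in increments of 20 would receive the weights 20, 40, and 50 in turn.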

In the cluster management unit 14, the communication distribution unit 14a is a router, and distributes the change command or restoration command from the anomaly recovery handling unit 18 to the corresponding units 14b to 14f. The communication distribution unit 14a also determines the destination endpoint setting units 14j and 14k (described later).

The container configuration reception unit (also referred to as reception unit) 14d receives configuration information for deploying containers to the computational resource cluster 15 from an external server or the like.

The container placement destination determination unit (also referred to as placement destination determination unit) 14e determines which container to place on which worker node (computation resource cluster 15) based on the configuration information received by the reception unit 14d.

The container management unit 14f checks whether the container is operating normally.

The computational resource management unit 14c grasps and manages whether the worker node is operable, the usage amount of computational resources of the server that constitutes the worker node, the remaining amount of CPU (Central Processing Unit), and the like.

The computational resource operation unit 14b performs an operation of allocating a predetermined amount of computational resources, such as a certain amount of CPU, to a given container; in other words, it allocates the storage capacity, CPU time, memory capacity, and the like usable by the container.

Next, the various abnormality detection processes (first to sixth abnormality detection processes) performed on the containers of the container system 20 by the internal abnormality detection unit 17 of the recovery device 10 will be described with reference to FIGS. 3 to 8.

<First abnormality detection process>
FIG. 3 is a block diagram for explaining the first container abnormality detection processing by the Pods (applications) 15a and 15b of the virtualization system recovery device 10 of the present embodiment. Note that each of the Pods 15a and 15b comprises one or more containers.

In FIG. 3, a master node 14A, an infrastructure node 14B, and worker nodes 15A and 15B are implemented as virtual machines in the container system 20 and are connected by a virtual switch 30 (OVS: Open vSwitch). The virtual switch may be a virtual switch other than OVS. The master node 14A and the infrastructure node 14B correspond to the cluster management units 14A and 14B (FIG. 1), and the worker nodes 15A and 15B correspond to the computational resource clusters 15A and 15B (FIG. 1).

Further, the master node 14A and the worker node 15A constitute the first cluster 12, and the infrastructure node 14B and the worker node 15B constitute the second cluster 12. The container system 20 is assumed to be composed of these clusters 12.

An internal abnormality detection unit 17 is arranged outside the container system 20, in the same manner as in the configuration of FIG. 1. Although two internal abnormality detection units 17 are shown in FIG. 3, one for each of the worker nodes 15A and 15B, a single unit may be used. The master node 14A, the infrastructure node 14B, the worker nodes 15A and 15B, and the internal abnormality detection unit 17 are connected to the opposing device 24 via the network 22. The opposing device 24 is a communication device, such as an external server, that transmits request signals and the like to the container system 20.

The internal anomaly detection unit 17 transmits a predetermined command (for example, "sudo crictl ps") to the Pods 15a and 15b of the worker nodes 15A and 15B by polling, as indicated by the two-way arrows Y1 and Y2, and determines whether the state is normal or abnormal based on the response returned for the command. In an actual test of this polling, the average round-trip time over 10 polls was 0.06 seconds.

The internal abnormality detection unit 17 makes the determination by reading the character string indicating normality or abnormality in the command response returned from the Pods 15a and 15b by polling. For example, the character string "Running" indicates that the container (Pods 15a, 15b) is operating normally, and any other character string indicates that it is abnormal. The internal abnormality detection unit 17 therefore determines that the operation of the container (Pods 15a, 15b) is normal when "Running" appears in the command response, and abnormal when any other character string appears.
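
As an illustrative sketch only, the "Running" determination could be expressed in Python as below. The function name containers_normal, the assumed response layout (one header line followed by one row per container), the treatment of an empty listing as abnormal, and the sample text are all assumptions made for this example; running the command itself is outside the sketch.

```python
def containers_normal(listing: str) -> bool:
    """Return True only if every listed container row contains the token "Running"."""
    lines = [line for line in listing.splitlines() if line.strip()]
    rows = lines[1:]  # assumption: the first non-empty line is a column header
    if not rows:
        return False  # assumption: an empty listing is treated as abnormal
    return all("Running" in row.split() for row in rows)


# Hypothetical sample responses (not captured from a real worker node).
SAMPLE_OK = (
    "CONTAINER IMAGE CREATED STATE NAME\n"
    "abc123 img-a 1h Running app-15a\n"
    "def456 img-b 1h Running app-15b\n"
)
SAMPLE_NG = SAMPLE_OK.replace("Running app-15b", "Exited app-15b")
```

Matching on whole tokens rather than substrings avoids treating a container whose name merely contains "Running" as healthy.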

<Second abnormality detection process>
Next, FIG. 4 is a block diagram for explaining the second abnormality detection processing, which uses the routing table 15c provided for each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of this embodiment.

The routing table (also referred to as the table) 15c uses route information indicating destinations to manage the destination containers of packets transmitted from the opposing device 24 via the network 22 to the Pods 15a and 15b of the worker nodes 15A and 15B. If the destination management of the table 15c is incorrect, packets will not reach the appropriate container. For this reason, the internal abnormality detection unit 17 detects whether the destination management of the table 15c is normal or abnormal.

The routing table 15c consists of a pair of tables, "iptables" and "nftables". Alternatively, the routing table 15c may consist of only "iptables" or only "nftables".

The internal abnormality detection unit 17 transmits a predetermined command to each table 15c of the worker nodes 15A and 15B by polling, as indicated by the two-way arrows Y3 and Y4, and determines whether the state is normal or abnormal based on the response returned from each table 15c.

The predetermined command is the pair "sudo iptables -L | wc -l" and "sudo nft list ruleset". The command "sudo iptables -L | wc -l" is issued to "iptables" of the table 15c, and the command "sudo nft list ruleset" is issued to "nftables". Each of the "iptables" and "nftables" tables then returns a response to the command to the internal abnormality detection unit 17.

In actual polling tests using this pair of commands, the average round-trip time over 10 polls was 0.03 seconds for the command "sudo iptables -L | wc -l" and 0.08 seconds for the command "sudo nft list ruleset".

The internal abnormality detection unit 17 determines the state to be normal if destination route information is described in the command response returned from each table 15c, and abnormal if nothing is described.
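
A minimal Python sketch of this determination is given below, assuming the iptables response is a line count and the nftables response is the ruleset text. The function name routing_table_normal and the min_lines threshold are hypothetical; the embodiment states only that a response with nothing described is abnormal, so the exact threshold would need to be chosen for the actual environment.

```python
from typing import Optional


def routing_table_normal(iptables_line_count: Optional[str],
                         nft_ruleset: Optional[str],
                         min_lines: int = 1) -> bool:
    """Judge the responses for the iptables and/or nftables checks.

    Pass None for a table type that is not used on the worker node.
    """
    if iptables_line_count is None and nft_ruleset is None:
        return False  # nothing was checked, so normality cannot be confirmed
    if iptables_line_count is not None:
        try:
            if int(iptables_line_count.strip()) < min_lines:
                return False
        except ValueError:
            return False  # the response was not a line count
    if nft_ruleset is not None and not nft_ruleset.strip():
        return False  # empty ruleset: no route information described
    return True
```

Accepting None for either argument mirrors the case in which the routing table consists of only one of the two table types.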

<Third anomaly detection process>
Next, FIG. 5 is a block diagram for explaining the third abnormality detection processing, which monitors the daemon of the virtual switch 30 provided for each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of this embodiment. The daemon of the virtual switch 30 is also called the OVS daemon.

A daemon is a program that manages the destination of packets in the virtual switch 30. The internal anomaly detection unit 17 monitors the OVS daemon, and detects that it is normal if the packet is properly transmitted, and that it is abnormal if it is not transmitted.

The internal abnormality detection unit 17 transmits a predetermined command (for example, "ps aux | grep ovs-vswitchd | grep "db.sock" | wc -l") to the virtual switch 30 of each of the worker nodes 15A and 15B by polling, as indicated by the two-way arrows Y5 and Y6, and determines whether the state is normal or abnormal based on the response returned from the virtual switch 30.

In this actual polling test, the average round-trip time when polling was performed 10 times was 0.03 seconds.

The internal abnormality detection unit 17 determines the state to be normal if, for example, a "db.sock" process related to the transmission destination is described in the command response returned from each virtual switch 30, and abnormal if it is not.
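
Because the example command ends in a line count, the response would be a number of matching processes. A hedged Python sketch of the determination under that assumption follows; the function name ovs_daemon_normal is hypothetical, as is the choice to treat a non-numeric response as abnormal.

```python
def ovs_daemon_normal(process_count_output: str) -> bool:
    """Normal when at least one matching ovs-vswitchd process was counted."""
    try:
        return int(process_count_output.strip()) >= 1
    except ValueError:
        return False  # unexpected (non-numeric) response is treated as abnormal
```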

<Fourth abnormality detection process>
Next, FIG. 6 is a block diagram for explaining the fourth anomaly detection processing, which monitors the daemon of the container runtime 15d provided for each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of this embodiment. This daemon is also called the crio daemon and is an example of the container runtime 15d. crio (CRI-O) is an open-source, community-driven container engine used in container virtualization technology.

The container runtime 15d is responsible for starting the containers of the Pods 15a and 15b, so monitoring the container runtime 15d makes it possible to detect whether the containers are starting normally. The internal anomaly detection unit 17 therefore monitors the crio daemon, and detects the state as normal if the container has started and abnormal if it has not.

The internal abnormality detection unit 17 transmits a predetermined command (for example, "systemctl status crio | grep Active") to the container runtime 15d of each of the worker nodes 15A and 15B by polling, as indicated by the two-way arrows Y7 and Y8, and determines whether the state is normal or abnormal based on the response returned from each container runtime 15d.

The internal abnormality detection unit 17 determines the state to be normal if "active (running)", which indicates that the crio daemon is running, is described in the command response returned from each container runtime 15d, and abnormal for any description other than "active (running)".
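
The "active (running)" determination can be sketched in Python as below. This is an illustration under assumptions: the function name crio_daemon_normal and the regular expression are introduced here, and the input is assumed to be the single "Active:" line of the status response.

```python
import re

# Matches e.g. "Active: active (running) since ..." and captures both parts.
_ACTIVE_RE = re.compile(r"Active:\s*(\S+)\s*\(([^)]*)\)")


def crio_daemon_normal(active_line: str) -> bool:
    """Normal only when the state is exactly "active (running)"."""
    match = _ACTIVE_RE.search(active_line)
    if match is None:
        return False  # no recognizable state: treated as abnormal
    state, sub_state = match.groups()
    return state == "active" and sub_state == "running"
```

Parsing the two fields separately, rather than searching for the substring "active", avoids misreading a state such as "inactive (dead)" as healthy.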

<Fifth anomaly detection process>
Next, FIG. 7 is a block diagram for explaining the fifth anomaly detection processing, which monitors each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of this embodiment.

It is assumed that the worker nodes 15A and 15B are created by virtualization technology (as virtual machines) on physical machines 32. In this configuration, the internal anomaly detection unit 17 resides on the physical machine 32, outside the virtual machine, and detects the container as normal if the virtual machine is running and as abnormal if it is not.

The internal anomaly detection unit 17 transmits a predetermined command (for example, "sudo virsh list") to each of the worker nodes 15A and 15B by polling, as indicated by the two-way arrows Y9 and Y10, and determines whether the state is normal or abnormal based on the response returned for the command.

The internal abnormality detection unit 17 determines the state to be normal if "running", which indicates that the target worker node 15A or 15B is up, is described in the command response returned from each worker node 15A, 15B, and abnormal if anything other than "running" is described.
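
As a rough Python sketch, the "running" determination for each worker node could be made by parsing a tabular listing of virtual machines. The function names, the assumed three-column layout (Id, Name, State), the virtual machine names, and the sample text are assumptions for illustration and are not taken from the embodiment.

```python
from typing import Dict


def worker_states(listing: str) -> Dict[str, str]:
    """Map each virtual machine name to its reported state."""
    states: Dict[str, str] = {}
    for line in listing.splitlines():
        parts = line.split()
        # Data rows look like "<id> <name> <state...>"; header and separator are skipped.
        if len(parts) >= 3 and (parts[0].isdigit() or parts[0] == "-"):
            states[parts[1]] = " ".join(parts[2:])
    return states


def worker_normal(listing: str, name: str) -> bool:
    """Normal only if the named worker node is listed with state "running"."""
    return worker_states(listing).get(name) == "running"


# Hypothetical sample response (not captured from a real physical machine).
SAMPLE = (
    " Id   Name         State\n"
    "----------------------------\n"
    " 1    worker-15A   running\n"
    " 2    worker-15B   paused\n"
)
```

A worker node that does not appear in the listing at all is also judged abnormal, since its state cannot be confirmed as "running".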

<Sixth anomaly detection process>
Next, FIG. 8 is a block diagram for explaining the sixth anomaly detection processing, which monitors DBs (databases) 26a and 26b externally attached to the clusters 12 of the container system 20 of the virtualization system recovery device 10 of this embodiment.

In some configurations, DBs (also referred to as external DBs) 26a and 26b that store data related to the containers are connected, as devices external to the clusters 12A and 12B (FIG. 1), to the worker nodes 15A and 15B via the network 22. In this case, the internal abnormality detection unit 17 is also connected to the worker nodes 15A and 15B via the network 22.

Here, since the clusters 12A and 12B may be connected to each other via the network 22, the internal abnormality detection units 17 shown in FIG. 8 are positioned as the internal abnormality detection units 17 in the respective clusters 12A and 12B, in the same manner as shown in FIG. 1.

The internal abnormality detection unit 17 transmits predetermined commands to the external DBs 26a and 26b via the network 22 by polling, as indicated by the two-way arrows Y11 and Y12, and determines whether the state is normal or abnormal based on the responses returned from the external DBs 26a and 26b. The commands in this case depend on the types of the external DBs 26a and 26b.

The response results include results of response/alive monitoring and results concerning whether the upper limit on the number of connections has been exceeded. Response/alive monitoring checks whether the external DBs 26a and 26b are operating normally. In other words, the internal abnormality detection unit 17 determines that there is an abnormality if the response result indicates that the external DBs 26a and 26b are not running normally.

"Exceeding the upper limit of the number of connections" means that the number of containers connected to the external DBs 26a and 26b exceeds a predetermined threshold. In other words, the internal abnormality detection unit 17 determines that there is an abnormality if the response result indicates that the number of containers connected to the external DBs 26a and 26b exceeds the threshold.
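
The two criteria above can be combined in a short Python sketch. Since the actual commands depend on the type of external DB, the sketch starts from an already parsed status; the DbStatus type, its fields, and the function name db_abnormal are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DbStatus:
    alive: bool       # result of response/alive monitoring
    connections: int  # number of containers currently connected


def db_abnormal(status: DbStatus, max_connections: int) -> bool:
    """Abnormal if the DB is not running or its connection count exceeds the threshold."""
    return (not status.alive) or status.connections > max_connections
```

Note that a connection count equal to the threshold is still judged normal here, because the text describes the abnormal case as exceeding the threshold.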

In actual polling tests, the polling round-trip time depends on the types of the external DBs 26a and 26b.

<Multi-cluster anomaly detection 1>
Next, the anomaly detection process performed by the external anomaly detection unit 23 shown in FIG. 9 when a failure occurs in the plurality of clusters 12A and 12B will be described. The abnormality detection 1 for each of the clusters 12A and 12B is assumed to be based on one of the first to sixth abnormality detections described above.

As shown in FIG. 9, the external anomaly detection unit 23 is connected to the internal anomaly detection unit 17A of the first cluster 12A and the internal anomaly detection unit 17B of the second cluster 12B. When the internal anomaly detection unit 17A or 17B detects an abnormality through any one of the first to sixth abnormality detections and notifies the external anomaly detection unit 23 as indicated by the arrow Y31a or Y31b, the external anomaly detection unit 23 detects as abnormal the cluster 12A or 12B in which the containers related to the abnormal applications 15a and 15b are arranged.

<Anomaly detection 2 for multiple clusters>
The external anomaly detection unit 23 shown in FIG. 9 is connected to the communication distribution unit 14a of the cluster management unit 14A in the first cluster 12A and to the communication distribution unit 14a of the cluster management unit 14B in the second cluster 12B. The communication distribution unit 14a is arranged at the signal input stage of the cluster management unit 14, distributes input signals to the subsequent stages, and returns a response when operating normally.

As indicated by the two-way arrows Y33a and Y33b, the external abnormality detection unit 23 performs confirmation communication with the communication distribution unit 14a of each of the clusters 12A and 12B at regular intervals, and detects whether a response is returned. If no response is returned, the corresponding cluster 12A or 12B is detected as abnormal.

In multi-cluster anomaly detection 2, an anomaly in each of the clusters 12A and 12B can be detected without going through the internal anomaly detection units 17A and 17B.
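
One round of the confirmation communication can be sketched in Python as follows. The sketch is illustrative only: the function name find_unresponsive_clusters and the use of a callable per cluster as the probe are assumptions, and the real transport and the polling interval are left to the caller.

```python
from typing import Callable, Dict, List


def find_unresponsive_clusters(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run one confirmation round and return the clusters that did not respond."""
    unresponsive: List[str] = []
    for cluster, probe in probes.items():
        try:
            responded = probe()
        except Exception:
            responded = False  # errors and timeouts count as "no response"
        if not responded:
            unresponsive.append(cluster)
    return unresponsive
```

In use, a caller would invoke this function at regular intervals, passing one probe per cluster that contacts the corresponding communication distribution unit.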

<Anomaly detection for multiple clusters 3>
Multi-cluster anomaly detection 3 is a process in which the external anomaly detection unit 23 performs anomaly detection for each of the clusters 12A and 12B based on both anomaly detections 1 and 2 described above. This process allows anomalies in each of the clusters 12A and 12B to be detected more appropriately.
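
The text does not state how the two results are combined, so the following Python sketch is an assumption: by default a cluster is treated as abnormal if either detection flags it, and an optional require_both flag shows the stricter alternative. The function and parameter names are hypothetical.

```python
from typing import Set


def combine_detections(reported_by_internal: Set[str],
                       unresponsive: Set[str],
                       require_both: bool = False) -> Set[str]:
    """Merge the results of anomaly detection 1 (reports) and 2 (missing responses)."""
    if require_both:
        return reported_by_internal & unresponsive  # flagged by both detections
    return reported_by_internal | unresponsive      # flagged by at least one
```

The union favors fast detection, while the intersection reduces false positives; which is more appropriate depends on the deployment.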

FIG. 10 is a block diagram for explaining the anomaly handling processing of the virtualization system recovery device 10 of this embodiment. The abnormality detection for which the abnormality handling process is performed is any one of the abnormality detections 1 to 3 of the plurality of clusters.

The distribution destination switching unit 21 is connected to a DNS (Domain Name System) 25 outside the recovery device 10. The DNS 25 is a server that associates and manages domains (or domain names) indicating the names of the applications 15a and 15b of the respective clusters 12A and 12B with resolution destination IP (Internet Protocol) addresses corresponding to the addresses of the communication distribution units 14a of the respective clusters 12A and 12B. The DNS 25 converts between domains and IP addresses, and includes a DNS record table 25a.

As shown in FIG. 11, the DNS record table (also referred to as the table) 25a stores domain names and resolution destination IP addresses in association with each other. In this example, "Svc1.net", the domain name of the application 15a for each of the clusters 12A and 12B, is associated in the table 25a with "IP address of the first cluster 12A" and "IP address of the second cluster 12B" as the resolution destination IP addresses of the communication distribution units 14a of the clusters 12A and 12B.

This correspondence relationship indicates that the "Svc1.net" application 15a operates in the first cluster 12A or the second cluster 12B.

Furthermore, in the table 25a, "Svc2.net", the domain name of the application 15b for each of the clusters 12A and 12B, is associated with "IP address of the first cluster 12A" and "IP address of the second cluster 12B" as the resolution destination IP addresses of the communication distribution units 14a of the clusters 12A and 12B.

This correspondence relationship indicates that the "Svc2.net" application 15b operates in the first cluster 12A or the second cluster 12B.

When an external server (not shown) queries the DNS 25 having such a table 25a for the resolution destination IP address, the DNS 25 returns the IP addresses of both clusters 12A and 12B. Therefore, the external server can transmit data to both clusters 12A and 12B.
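The lookup behavior of the table 25a can be illustrated with a minimal Python sketch. The dictionary layout, the function name `resolve`, and the IP addresses (placeholder documentation addresses) are assumptions made for illustration only; the specification does not define how the DNS 25 stores its records.

```python
from typing import Dict, List

# Placeholder addresses standing in for the communication distribution
# units 14a of each cluster (assumed values, not taken from the text).
CLUSTER_12A_IP = "192.0.2.10"
CLUSTER_12B_IP = "192.0.2.20"

# Hypothetical in-memory model of the DNS record table 25a:
# domain name -> list of resolution destination IP addresses.
dns_record_table: Dict[str, List[str]] = {
    "Svc1.net": [CLUSTER_12A_IP, CLUSTER_12B_IP],
    "Svc2.net": [CLUSTER_12A_IP, CLUSTER_12B_IP],
}


def resolve(domain: str) -> List[str]:
    """Return every resolution destination IP address registered for a domain."""
    return list(dns_record_table.get(domain, []))
```

In this sketch a query for "Svc1.net" yields the addresses of both clusters, which corresponds to the external server being able to transmit data to either cluster.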

Here, it is assumed that the external anomaly detection unit 23 shown in FIG. 10 detects an anomaly by any one of the anomaly detections 1 to 3 of the plurality of clusters 12A and 12B (for example, an anomaly of the second cluster 12B). The external anomaly detection unit 23 notifies the distribution destination switching unit 21 of the anomaly detection of the second cluster 12B as indicated by an arrow Y34.

The distribution destination switching unit 21 notifies the DNS 25 of an instruction to stop the communication distribution to the second cluster 12B (communication distribution stop instruction) as indicated by an arrow Y35. In response to the communication distribution stop instruction, the DNS 25 performs a process of deleting the IP address of the second cluster 12B from the resolution destination IP addresses associated with both of the domain names "Svc1.net" and "Svc2.net" in the table 25a.

<Operation of Embodiment>
Next, the operation of the abnormality handling process will be described with reference to the flowchart shown in FIG.

Assume that in step S1 shown in FIG. 13, a failure (x mark) occurs in the applications 15a and 15b of the second cluster 12B, and this abnormality is detected by the internal abnormality detection unit 17B. In this case, the internal abnormality detection unit 17B notifies the external abnormality detection unit 23 of the abnormality of the second cluster 12B as indicated by an arrow Y31b.

In step S2, the external anomaly detection unit 23 detects an anomaly in the second cluster 12B from the above notification, and notifies the distribution destination switching unit 21 as indicated by an arrow Y34.

In step S3, the distribution destination switching unit 21 notifies the DNS 25 of an instruction to stop communication distribution to the second cluster 12B, as indicated by an arrow Y35.

In step S4, the DNS 25 deletes the IP address of the second cluster 12B from the resolution destination IP addresses associated with both of the domain names "Svc1.net" and "Svc2.net" in the table 25a. As a result, the resolution destination IP address associated with both of the domain names "Svc1.net" and "Svc2.net" in the table 25a is only the IP address of the first cluster 12A.

Therefore, in step S5, when the external server inquires of the DNS 25 about the resolution destination IP address, the DNS 25 returns only the IP address of the first cluster 12A. In other words, access to the failed second cluster 12B becomes impossible, and communication to the second cluster 12B is stopped.
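Steps S1 to S5 can be sketched as a chain of notifications. The following Python sketch is only an illustration under stated assumptions: the class names `Dns`, `DistributionSwitcher`, and `ExternalDetector`, their methods, and the placeholder IP addresses are hypothetical and do not appear in the specification.

```python
from typing import Dict, List, Set


class Dns:
    """Hypothetical stand-in for the DNS 25 and its record table 25a."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self.records = {domain: list(ips) for domain, ips in records.items()}

    def stop_distribution(self, cluster_ip: str) -> None:
        # Step S4: delete the abnormal cluster's IP address from every record.
        for ips in self.records.values():
            if cluster_ip in ips:
                ips.remove(cluster_ip)

    def resolve(self, domain: str) -> List[str]:
        # Step S5: answer a query from an external server.
        return list(self.records.get(domain, []))


class DistributionSwitcher:
    """Hypothetical stand-in for the distribution destination switching unit 21."""

    def __init__(self, dns: Dns, cluster_ips: Dict[str, str]) -> None:
        self.dns = dns
        self.cluster_ips = cluster_ips

    def on_cluster_anomaly(self, cluster_id: str) -> None:
        # Step S3 (arrow Y35): send the communication distribution stop instruction.
        self.dns.stop_distribution(self.cluster_ips[cluster_id])


class ExternalDetector:
    """Hypothetical stand-in for the external anomaly detection unit 23."""

    def __init__(self, switcher: DistributionSwitcher) -> None:
        self.switcher = switcher
        self.abnormal: Set[str] = set()

    def notify_from_internal(self, cluster_id: str) -> None:
        # Steps S1-S2: an internal unit reports an anomaly (arrow Y31b);
        # the cluster is marked abnormal and the switcher is notified (arrow Y34).
        self.abnormal.add(cluster_id)
        self.switcher.on_cluster_anomaly(cluster_id)


ips = {"12A": "192.0.2.10", "12B": "192.0.2.20"}  # assumed placeholder addresses
dns = Dns({"Svc1.net": list(ips.values()), "Svc2.net": list(ips.values())})
detector = ExternalDetector(DistributionSwitcher(dns, ips))
detector.notify_from_internal("12B")
```

After the notification for the second cluster 12B, both domain names resolve only to the address of the first cluster 12A in this sketch.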

<Effects of Embodiment>
Effects of the virtualization system recovery device 10 according to the embodiment of the present invention will be described.

(1a) The restoration device 10 includes a computational resource cluster 15, a cluster management unit 14, a plurality of clusters 12A and 12B, an internal anomaly detection unit 17, and an external anomaly detection unit 23. The computational resource cluster 15 is virtually created on a physical machine by container virtualization software, and clusters and arranges the virtually created containers. The cluster management unit 14 manages the placement and operation of virtually created and clustered containers.

Each cluster 12A, 12B is configured with a computational resource cluster 15 and a cluster management unit 14. The internal anomaly detection unit 17 is arranged for each of the clusters 12A and 12B, is virtually created outside the virtually created computational resource cluster 15 and cluster management unit 14, and detects an anomaly of the container. The external anomaly detection unit 23 is virtually created outside each of the clusters 12A and 12B, and is configured to detect an anomaly in the cluster in which the abnormal container is arranged when the internal anomaly detection unit 17 detects an anomaly in the container.

According to this configuration, when the internal abnormality detection unit 17 of each of the clusters 12A and 12B detects an abnormality in the container, the external abnormality detection unit 23 detects that the cluster in which the abnormal container is arranged is abnormal. The internal anomaly detection unit 17 and the external anomaly detection unit 23 are not involved in the container virtualization software that virtually creates the cluster management unit 14 and the computational resource cluster 15. Therefore, failures occurring in the respective clusters 12A and 12B can be detected earlier than by the abnormality detection and recovery function of the container virtualization software. This early detection of anomalies enables quick recovery of containers and the like related to cluster failures.

(2a) The cluster management unit 14 includes a communication distribution unit 14a that is arranged in the signal input part of the cluster management unit 14, distributes the input signal to the subsequent stage and outputs it, and returns a response to cluster confirmation communication when the cluster is normal. The external anomaly detection unit 23 is configured to perform cluster confirmation communication to the communication distribution unit 14a at predetermined intervals, and to detect an anomaly in the cluster when no response is returned.

According to this configuration, the abnormality of each cluster can be detected without going through the internal abnormality detection unit 17 of each cluster 12A, 12B.
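The periodic cluster confirmation communication of (2a) can be sketched as a polling loop. This is a minimal illustration, assuming that each cluster is represented by a probe callable that returns True when the communication distribution unit 14a responds; the function name `poll_clusters`, its parameters, and the treatment of an `OSError` (such as a timeout) as "no response" are assumptions, not details given in the specification.

```python
import time
from typing import Callable, Dict, List


def poll_clusters(
    probes: Dict[str, Callable[[], bool]],
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Probe each cluster once per round and return the clusters judged abnormal."""
    abnormal: List[str] = []
    for round_index in range(rounds):
        for cluster_id, probe in probes.items():
            if cluster_id in abnormal:
                continue  # already judged abnormal; no need to probe again
            try:
                responded = probe()
            except OSError:
                # Timeouts and refused connections are treated as "no response".
                responded = False
            if not responded:
                abnormal.append(cluster_id)
        if round_index < rounds - 1:
            sleep(interval_s)  # wait for the predetermined interval
    return abnormal
```

A real probe would perform some network request to the communication distribution unit 14a; the callable is injected here so that the loop itself stays independent of any particular protocol.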

(3a) The external anomaly detection unit 23 is configured to detect an anomaly by both of the following processes: a process of detecting an anomaly in the cluster in which the container is placed when the internal anomaly detection unit 17 detects an anomaly in the container, and a process of performing cluster confirmation communication to the communication distribution unit 14a at predetermined intervals and detecting an anomaly in the cluster when no response is returned.

According to this configuration, anomaly detection of each cluster can be performed more appropriately.

(4a) The DNS 25 that manages the domain name indicating the name of the application related to the container in association with the IP address of each cluster is provided outside the servers that constitute each cluster 12A and 12B. A distribution destination switching unit 21, which is virtually created outside each cluster 12A, 12B and notifies the DNS 25 of a communication distribution stop instruction related to an abnormal cluster detected by the external abnormality detection unit 23, is provided for each cluster. The DNS 25 is configured to delete the IP address of the abnormal cluster indicated by the communication distribution stop instruction.

According to this configuration, the IP address of the cluster detected as abnormal by the external abnormality detection unit 23 is deleted from the cluster IP addresses managed by the DNS 25. Therefore, when the external server queries the DNS 25 for the IP address of the cluster, it does not obtain, and so cannot access, the IP address of the failed cluster. In other words, communication to the abnormal cluster can be stopped. The external abnormality detection unit 23, the distribution destination switching unit 21, and the DNS 25 are not involved in the container virtualization software described above. For this reason, a failure occurring in a cluster can be detected earlier than by the failure detection and recovery function of the container virtualization software, so that the container or the like related to the failure of the cluster in which the failure has been detected can be quickly restored.

<Modification 1 of Embodiment>
FIG. 14 is a block diagram showing the configuration of a virtualization system restoration device 10A according to Modification 1 of the embodiment of the present invention.

The recovery device 10A shown in FIG. 14 differs from the recovery device 10 (FIG. 10) in that the communication distribution stop instruction indicated by the arrow Y35 from the distribution destination switching unit 21 is also sent to the communication distribution units 14a of the clusters 12A and 12B.

The communication distribution unit 14a stops the communication of the first cluster 12A or the second cluster 12B indicated by the notified communication distribution stop instruction. That is, since communication to each cluster 12A, 12B is always performed via the communication distribution unit 14a on the input side, the communication function of the communication distribution unit 14a is stopped in response to the communication distribution stop instruction.

According to this configuration, the communication distribution stop instruction for the abnormal cluster (for example, the second cluster 12B) can be sent to the communication distribution unit 14a of the abnormal cluster 12B, and the communication function of the communication distribution unit 14a can be stopped. This stop makes it impossible to access the abnormal cluster 12B. Therefore, it is possible to omit the inquiry to the DNS 25 of the external server.
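The behavior of the communication distribution unit 14a in Modification 1 can be sketched as follows. The class name, the method names, the round-robin distribution policy, and the use of `ConnectionRefusedError` to model a stopped communication function are all assumptions made for illustration; the specification only states that the communication function is stopped in response to the instruction.

```python
from typing import List


class CommunicationDistributionUnit:
    """Hypothetical model of a communication distribution unit 14a."""

    def __init__(self, cluster_id: str, backends: List[str]) -> None:
        self.cluster_id = cluster_id
        self.backends = list(backends)
        self.accepting = True
        self._next = 0

    def handle_stop_instruction(self, abnormal_cluster_id: str) -> None:
        # Stop the communication function only if the instruction names this cluster.
        if abnormal_cluster_id == self.cluster_id:
            self.accepting = False

    def distribute(self, request: str) -> str:
        """Return the backend chosen for the request (round-robin is an assumption)."""
        if not self.accepting:
            raise ConnectionRefusedError(f"cluster {self.cluster_id} is stopped")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend


unit_a = CommunicationDistributionUnit("12A", ["app-15a", "app-15b"])
unit_b = CommunicationDistributionUnit("12B", ["app-15a", "app-15b"])
# The stop instruction is sent to the units of both clusters.
for unit in (unit_a, unit_b):
    unit.handle_stop_instruction("12B")
```

In this sketch the unit of the first cluster 12A keeps distributing requests, while the unit of the second cluster 12B refuses them regardless of what the DNS returns.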

<Modification 2 of Embodiment>
FIG. 15 is a block diagram showing the configuration of a virtualization system recovery device 10B according to Modification 2 of this embodiment.

The recovery device 10B shown in FIG. 15 differs from the recovery device 10 (FIG. 10) in that the first cluster 12A is equipped with an internal DNS 25A, the second cluster 12B is equipped with an internal DNS 25B, and the communication distribution stop instruction indicated by the arrow Y35 from the distribution destination switching unit 21 is notified to the internal DNS 25A and 25B in addition to the DNS 25.

The internal DNS 25A, 25B have a DNS record table 25a like the DNS 25, but differ in that the table 25a is provided in cache memory. Therefore, in the internal DNS 25A, 25B, the information in the table 25a is deleted after a predetermined period of time. However, the internal DNS 25A, 25B can acquire necessary information from the DNS 25 as needed after the erasure.
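The cache behavior of the internal DNS 25A, 25B can be sketched as a time-limited cache in front of the DNS 25. The class name `InternalDnsCache`, the `ttl_s` parameter, the injected clock, and the placeholder addresses are assumptions for illustration; the specification says only that cached information is deleted after a predetermined period and can be reacquired from the DNS 25.

```python
import time
from typing import Callable, Dict, List, Tuple


class InternalDnsCache:
    """Hypothetical model of an internal DNS 25A/25B backed by the DNS 25."""

    def __init__(
        self,
        upstream: Callable[[str], List[str]],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream  # stands in for a query to the DNS 25
        self._ttl_s = ttl_s        # the "predetermined period" (assumed parameter)
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, domain: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(domain)
        if entry is not None and now < entry[0]:
            return list(entry[1])  # still cached
        # Entry expired or missing: acquire the information from the upstream DNS.
        ips = list(self._upstream(domain))
        self._entries[domain] = (now + self._ttl_s, ips)
        return list(ips)

    def stop_distribution(self, cluster_ip: str) -> None:
        # Delete the abnormal cluster's IP address from every cached record.
        for _expires_at, ips in self._entries.values():
            if cluster_ip in ips:
                ips.remove(cluster_ip)
```

Note that a stop instruction affects only the cached copy here; once an entry expires, the answer again depends on what the upstream DNS returns, which is why the instruction is also sent to the DNS 25.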

In response to the communication distribution stop instruction (see arrow Y35) issued when the anomaly of the second cluster 12B is detected, the internal anomaly detection unit 17A (or the internal anomaly detection unit 17B) performs a process of deleting the IP address of the second cluster 12B.

According to this configuration, the external server can query the internal DNS 25A, 25B of each cluster for the IP address of each cluster 12A, 12B, so the load on the external DNS 25 can be reduced.

<Hardware configuration>
Any one of the virtualization system recovery devices 10, 10A, and 10B according to the above-described embodiments is implemented by a computer 100 configured as shown in FIG. 16, for example. The computer 100 includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an HDD (Hard Disk Drive) 104, an input/output I/F (Interface) 105, a communication I/F 106, and a media I/F 107.

The CPU 101 operates based on programs stored in the ROM 102 or HDD 104, and controls each functional unit. The ROM 102 stores a boot program executed by the CPU 101 when the computer 100 is started, a program related to the hardware of the computer 100, and the like.

The CPU 101 controls an output device 111 such as a printer or display and an input device 110 such as a mouse or keyboard via the input/output I/F 105 . The CPU 101 acquires data from the input device 110 or outputs generated data to the output device 111 via the input/output I/F 105 .

The HDD 104 stores programs executed by the CPU 101 and data used by the programs. The communication I/F 106 receives data from another device (not shown) via the communication network 112 and outputs the data to the CPU 101, and also transmits data generated by the CPU 101 to another device via the communication network 112.

The media I/F 107 reads programs or data stored in the recording medium 113 and outputs them to the CPU 101 via the RAM 103. The CPU 101 loads a program related to target processing from the recording medium 113 onto the RAM 103 via the media I/F 107, and executes the loaded program. The recording medium 113 is an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a magnetic recording medium, a conductor memory tape medium, a semiconductor memory, or the like.

For example, when the computer 100 functions as one of the virtualization system recovery devices 10, 10A, and 10B according to the embodiment, the CPU 101 of the computer 100 executes a program loaded on the RAM 103 to realize the functions of the virtualization system recovery device 10. Data in the RAM 103 is also stored in the HDD 104. The CPU 101 reads a program related to target processing from the recording medium 113 and executes it. In addition, the CPU 101 may read a program related to target processing from another device via the communication network 112.

<Effects>
(1) A virtualization system recovery device comprising: a computational resource cluster that is virtually created on a physical machine by container virtualization software and that clusters and arranges the virtually created containers; a cluster management unit that manages control related to the placement and operation of the virtually created and clustered containers; a plurality of clusters each configured to include the computational resource cluster and the cluster management unit; an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computational resource cluster and cluster management unit, and detects an anomaly in the container; and an external anomaly detection unit that is virtually created outside the plurality of clusters and detects an anomaly in the cluster in which the abnormal container is arranged when the internal anomaly detection unit detects an anomaly in the container.

According to this configuration, when an internal anomaly detection unit for each cluster detects an anomaly in a container, the external anomaly detection unit detects an anomaly in the cluster in which the abnormal container is located. The internal anomaly detector and the external anomaly detector do not participate in the container virtualization software that virtually creates the cluster manager and computational resource cluster. Therefore, failures that occur in multiple clusters can be detected more quickly than the failure detection and recovery function of container virtualization software. This early detection of anomalies enables quick recovery of containers and the like related to cluster failures.

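(2) The virtualization system recovery device according to (1) above, wherein the cluster management unit includes a communication distribution unit that is arranged in the signal input part of the cluster management unit, distributes the input signal to the subsequent stage and outputs it, and returns a response to cluster confirmation communication when the cluster is normal, and the external anomaly detection unit performs cluster confirmation communication to the communication distribution unit at predetermined intervals and detects an anomaly in the cluster when no response is returned.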

According to this configuration, an abnormality in each cluster can be detected without going through the internal abnormality detection units for each of the clusters.

(3) The virtualization system recovery device according to (2) above, wherein the external anomaly detection unit detects an anomaly by both a process of detecting an anomaly in the cluster in which the container is placed when the internal anomaly detection unit detects an anomaly in the container, and a process of performing cluster confirmation communication to the communication distribution unit at a predetermined cycle and detecting an anomaly in the cluster when the response is not returned.

(4) The virtualization system recovery device according to any one of (1) to (3) above, wherein a DNS (Domain Name System) that manages the domain name indicating the name of the application related to the container in association with the IP (Internet Protocol) address for each cluster is provided outside the server that constitutes the cluster, the device further comprises a distribution destination switching unit that is virtually created outside the plurality of clusters and notifies the DNS of a communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit for each of the clusters, and the DNS deletes the IP address of the abnormal cluster indicated by the communication distribution stop instruction.

According to this configuration, the IP addresses of clusters detected as abnormal by the external abnormality detection unit are deleted from the IP addresses of clusters managed by the DNS. Therefore, when the external server queries the DNS for the IP address of the cluster, it does not obtain, and so cannot access, the IP address of the failed cluster. In other words, communication to the abnormal cluster can be stopped. The external anomaly detection unit, the distribution destination switching unit, and the DNS are not involved in the container virtualization software described above. For this reason, a failure occurring in a cluster can be detected earlier than by the failure detection and recovery function of the container virtualization software, so that the container or the like related to the failure of the cluster in which the failure has been detected can be quickly restored.

(5) The virtualization system recovery device according to (4) above, wherein the distribution destination switching unit notifies the communication distribution units in the plurality of clusters of a communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit, and the communication distribution unit performs a process of stopping its communication function when notified of the communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit.

According to this configuration, it is possible to notify the communication distribution unit of the abnormal cluster of the communication distribution stop instruction related to the abnormal cluster, and stop the communication function of the communication distribution unit. This outage prevents access to the abnormal cluster. Therefore, it is possible to omit the inquiry to the DNS of the external server.

(6) The virtualization system recovery device according to (4) above, wherein each of the plurality of clusters is provided with an internal DNS that, like the DNS, manages a domain name indicating the name of an application related to the container in association with an IP address for each cluster, and the internal DNS is notified of the communication distribution stop instruction from the distribution destination switching unit.

According to this configuration, the external server can query the internal DNS of each cluster for the IP address of the cluster, so the load on the external DNS can be reduced.

In addition, the specific configuration can be changed as appropriate without departing from the gist of the present invention.

10, 10A, 10B Virtualization system recovery device
12A First cluster (cluster)
12B Second cluster (cluster)
14A, 14B Cluster management unit
14a Communication distribution unit
14b Computational resource operation unit
14c Computational resource management unit
14d Container configuration reception unit
14e Container placement destination determination unit
14f Container management unit
15A, 15B Computational resource cluster
15a, 15b Application
17A, 17B Internal abnormality detection unit
21 Distribution destination switching unit
23 External abnormality detection unit
25 DNS
25a DNS record table
25A, 25B Internal DNS

Claims

1. A virtualization system recovery device comprising:
a computational resource cluster that is virtually created on a physical machine by container virtualization software and that clusters and arranges the virtually created containers;
a cluster management unit that manages control related to the placement and operation of the virtually created and clustered containers;
a plurality of clusters, each configured with the computational resource cluster and the cluster management unit;
an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computational resource cluster and cluster management unit, and detects an anomaly in the container; and
an external anomaly detection unit that is virtually created outside the plurality of clusters and detects an anomaly in the cluster in which the abnormal container is arranged when the internal anomaly detection unit detects an anomaly in the container.
2. The virtualization system recovery device according to claim 1, wherein
the cluster management unit includes a communication distribution unit that is arranged in the signal input part of the cluster management unit, distributes the input signal to the subsequent stage and outputs it, and returns a response to cluster confirmation communication when the cluster is normal, and
the external anomaly detection unit performs cluster confirmation communication to the communication distribution unit at predetermined intervals, and detects an anomaly in the cluster when the response is not returned.
3. The virtualization system recovery device according to claim 2, wherein the external anomaly detection unit detects an anomaly by both a process of detecting an anomaly in the cluster in which the container is placed when the internal anomaly detection unit detects an anomaly in the container, and a process of performing cluster confirmation communication with the communication distribution unit at a predetermined cycle and detecting an anomaly in the cluster when the response is not returned.
4. The virtualization system recovery device according to any one of claims 1 to 3, wherein
a DNS (Domain Name System) for managing a domain name indicating the name of an application related to the container and an IP (Internet Protocol) address for each cluster in association with each other is provided outside the server that constitutes the cluster,
the device further comprises a distribution destination switching unit that is virtually created outside the plurality of clusters and that notifies the DNS of a communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit for each of the clusters, and
the DNS deletes the IP address of the abnormal cluster indicated by the communication distribution stop instruction.
5. The virtualization system recovery device according to claim 4, wherein
the distribution destination switching unit notifies the communication distribution units in the plurality of clusters of a communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit, and
the communication distribution unit performs a process of stopping its communication function when notified of the communication distribution stop instruction related to the abnormal cluster detected by the external abnormality detection unit.
6. The virtualization system restoration device according to claim 4, wherein
each of the plurality of clusters is provided with an internal DNS that, like the DNS, associates and manages a domain name indicating the name of the application related to the container and the IP address of each cluster, and
the communication distribution stop instruction from the distribution destination switching unit is notified to the internal DNS.
7. A virtualization system restoration method performed by a virtualization system restoration device, wherein
the virtualization system restoration device comprises:
a plurality of clusters in which containers virtually created on physical machines by container virtualization software are clustered and arranged; and
an internal anomaly detection unit and an external anomaly detection unit that are virtually created outside the plurality of clusters,
the method comprising:
a step in which the internal anomaly detection unit detects an anomaly in the container for each of the plurality of clusters; and
a step in which the external anomaly detection unit detects an anomaly in the cluster in which the abnormal container is arranged when the internal anomaly detection unit detects an anomaly in the container.
8. The virtualization system restoration method according to claim 7, wherein
a DNS for managing the domain name indicating the name of the application related to the container and the IP address for each cluster in association with each other is provided outside the server constituting the cluster,
the virtualization system restoration device executes a step of notifying the DNS of a communication distribution stop instruction related to the abnormal cluster detected by the external anomaly detection unit, and
the DNS executes a step of deleting the IP address of the abnormal cluster indicated by the communication distribution stop instruction.