WO2016104829A1

WO2016104829A1 - Modular data center system and method for managing equipment thereof

Info

Publication number: WO2016104829A1
Application number: PCT/KR2014/012811
Authority: WO
Inventors: 김영환; 박창원; 김현우
Original assignee: 전자부품연구원
Priority date: 2014-12-24
Filing date: 2014-12-24
Publication date: 2016-06-30

Abstract

Provided are a POD-based modular data center and a monitoring method therefor. The data center system according to embodiments of the present invention comprises virtual machines divided so as to monitor different types of equipment in a POD. Thereby, it is possible to perform monitoring optimized for relevant equipment by separately operating virtual machines for monitoring in the POD-based modular data center according to the types of equipment constituting the POD.

Description

Modular data center system and method for managing equipment thereof

The present invention relates to a data center, and more particularly to a method for monitoring a data center and the various equipment constituting the same.

A data center is a facility where computer systems, communication equipment, and storage devices are installed. Data centers are the core infrastructure for storing and distributing big data and require large amounts of power.

In addition, servers in the data center are sensitive to temperature and humidity, so they must be monitored and managed in real time to maintain the correct temperature (16-24 degrees) and the appropriate humidity (40-55%).

In addition, since a power loss causes information loss and service interruption, the power state is also monitored and managed.

Currently, the many devices constituting a data center are collectively monitored and managed through one management server. However, because of the resulting slow processing speed, timely and adequate measures may not be taken.

The present invention has been made to solve the above problems, and an object of the present invention is to provide an effective monitoring method for a modular data center based on a POD (Portable Optimized Datacenter).

According to an embodiment of the present invention, a data center system includes: a first virtual machine that monitors first devices in a Portable Optimized Datacenter (POD); and a second virtual machine that monitors second devices of a different type from the first devices in the POD.

The first virtual machine and the second virtual machine may operate independently of each other.

In addition, the data center system may further include: a third virtual machine that replicates the first virtual machine and checks whether the first virtual machine is operating normally while exchanging heartbeats with the first virtual machine; and a fourth virtual machine that replicates the second virtual machine and checks whether the second virtual machine is operating normally while exchanging heartbeats with the second virtual machine.

When a failure occurs in the first virtual machine, the third virtual machine that detects the failure may monitor the first devices.

The data center system may further include a fifth virtual machine that is newly created to check whether the third virtual machine is operating normally while exchanging heartbeats with the third virtual machine.

The first virtual machine and the second virtual machine may interwork with one dashboard system that receives monitoring data from a plurality of PODs.

The first devices may be any one of a CRAC, a UPS & PDU, and an IT Rack, and the second devices may be another one of the CRAC, the UPS & PDU, and the IT Rack.

Meanwhile, a data center monitoring method according to another embodiment of the present invention includes: monitoring, by a first virtual machine, first devices in a Portable Optimized Datacenter (POD); and monitoring, by a second virtual machine, second devices of a different type from the first devices in the POD.

As described above, according to the embodiments of the present invention, a monitoring operation optimized for the corresponding equipment is possible by separately operating a virtual machine for monitoring according to the type of the equipment configuring the POD in the POD-based modular data center.

In addition, since the virtual machines are divided, a failure in one virtual machine can be prevented from affecting the other virtual machines. Furthermore, operating standby replica virtual machines enables quick recovery in the event of a failure.

FIG. 1 shows the overall system of a data center to which the present invention is applicable;

FIG. 2 is an enlarged view of one of the PODs shown in FIG. 1;

FIGS. 3 to 7 are diagrams for explaining the processing procedure when a failure occurs in a VM;

FIG. 8 is a diagram showing the detailed structure of a VM;

FIG. 9 is a diagram for explaining the agent system provided in the equipment to be monitored;

FIG. 10 is a diagram illustrating a process in which a VM and an agent system sense (gather) and monitor equipment data;

FIG. 11 is a diagram illustrating a process of handling an abnormal state of equipment (equipment failure) and the message structures used therein; and

FIGS. 12 to 18 are diagrams showing in detail the process of handling an abnormal state of equipment.

Hereinafter, the present invention will be described in more detail with reference to the drawings.

FIG. 1 shows the overall system of a data center to which the present invention is applicable. As shown in FIG. 1, the data center includes a plurality of Portable Optimized Datacenters (POD #1 to POD #n) and one dashboard system.

The data center is built and operated in POD units. Faults are monitored and managed on a per-POD basis, while the administrator can monitor and manage all PODs through the dashboard system.

FIG. 2 is an enlarged view of one of the PODs shown in FIG. 1. As shown in FIG. 2, the POD consists of CRACs, UPS & PDUs, and IT Racks (200-0 to 200-9, ...), and there is no limitation on the number of components.

Each POD also has its own independent Data Center Monitor Middleware (DCMM) system; that is, there is one DCMM system per POD. The DCMM system of each POD interworks with the dashboard system.

The DCMM system monitors and manages the status of the equipment (CRAC, UPS & PDU, IT Rack) (200-0 to 200-9, ...) that makes up the POD. It has virtual machines specialized for each type of equipment.

In detail, the DCMM system includes virtual machines (VMs) 100-0, 100-1, and 100-2, a virtual layer (VL) 100-3, and a multi-core embedded platform (MEP) 100-4.

VM #0 (100-0) is a virtual machine that monitors/manages the CRACs (200-0, 200-4, 200-5, 200-6, ...) installed in the POD, VM #1 (100-1) is a virtual machine that monitors/manages the UPS & PDUs (200-1, 200-7, ...) installed in the POD, and VM #2 (100-2) is a virtual machine that monitors/manages the IT Racks (200-2, 200-3, 200-8, 200-9, ...) installed in the POD.

Since the VMs that monitor/manage the devices of the POD are divided and operate independently, the other VMs continue to operate normally even if one VM fails.
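
The division of monitoring work by equipment type can be pictured with a short sketch. It is only an illustration: the dictionary `VM_FOR_TYPE`, the function `assign_devices`, and the string labels are hypothetical names, and the device list is a small subset of the reference numerals given above.

```python
from collections import defaultdict

# Equipment type -> monitoring VM, following the assignment described in the text.
# The string labels are illustrative, not identifiers defined by the source.
VM_FOR_TYPE = {
    "CRAC": "VM#0",
    "UPS&PDU": "VM#1",
    "IT Rack": "VM#2",
}

def assign_devices(devices):
    """Group (device_id, device_type) pairs by the VM responsible for that type."""
    assignment = defaultdict(list)
    for device_id, device_type in devices:
        assignment[VM_FOR_TYPE[device_type]].append(device_id)
    return dict(assignment)

# A subset of the devices of one POD, using the reference numerals of FIG. 2.
pod_devices = [
    ("200-0", "CRAC"),
    ("200-1", "UPS&PDU"),
    ("200-2", "IT Rack"),
    ("200-3", "IT Rack"),
    ("200-4", "CRAC"),
]

print(assign_devices(pod_devices))
```

Because each VM only ever receives devices of its own type, a fault in one VM leaves the device lists of the other VMs untouched.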

Meanwhile, in preparation for a failure of a VM, spare VMs 100-5, 100-6, and 100-7 are operated. As shown in FIG. 4, the VMs 100-0, 100-1, and 100-2 of the Active Base collect data from the devices, store it in DBs, and monitor/manage the devices, while the VMs 100-5, 100-6, and 100-7 of the Passive Base replicate (back up) them, respectively.

As shown in FIG. 4, the VMs 100-5, 100-6, and 100-7 of the Passive Base exchange heartbeats with the VMs 100-0, 100-1, and 100-2 of the Active Base through the FT manager (FT_Manager) to check whether the VMs of the Active Base are operating normally.
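
A heartbeat check of this kind is commonly implemented as a timeout on the last received heartbeat. The following is a minimal sketch under that assumption; the class name `HeartbeatMonitor` and the 3-second timeout are hypothetical, since the source does not specify how the FT manager decides that a peer has failed.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of a peer VM and flags a timeout."""

    def __init__(self, timeout_s):
        # Assumed value: the source gives no heartbeat interval or timeout.
        self.timeout_s = timeout_s
        self.last_seen = None

    def record(self, now):
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = now

    def peer_alive(self, now):
        """The peer counts as alive if a heartbeat arrived within the timeout."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_s

# Timestamps are passed in explicitly so the example is deterministic.
monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record(now=100.0)
print(monitor.peer_alive(now=102.0))  # True: 2.0 s since the last heartbeat
print(monitor.peer_alive(now=104.5))  # False: 4.5 s exceeds the 3.0 s timeout
```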

To explain the process when a VM of the Active Base fails, assume, as shown in FIG. 5, that a failure occurs in VM #0 (100-0). When VM #0 (100-0) fails, its clone VM (100-5) detects the failure via heartbeat.

Thereafter, as shown in FIG. 6, the system memory of the failed VM #0 (100-0) is reclaimed, and the clone VM (100-5) is switched to the Active Base to monitor/manage the CRACs of the POD.

Next, as illustrated in FIG. 7, a new clone VM 100-8 is created in the Passive Base; it replicates (backs up) the VM 100-5 that now serves as VM #0 and checks whether that VM is operating normally.
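
The failover steps of FIGS. 5 to 7 (detect the failure, promote the clone, create a new clone) can be sketched as follows. This is an illustrative model only: the classes `VM` and `Pod`, the method `fail_over`, and the sequential ID scheme starting at 8 are assumptions chosen to mirror the reference numerals, not the implementation described in the source.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class VM:
    vm_id: str
    role: str      # "active" or "passive"
    target: str    # equipment type this VM monitors
    alive: bool = True

class Pod:
    def __init__(self, active, passive, next_id=8):
        self.active = active        # target -> VM in the Active Base
        self.passive = passive      # target -> clone VM in the Passive Base
        self._ids = count(next_id)  # hypothetical numbering for new clones

    def fail_over(self, target):
        """Promote the clone of a failed active VM and create a fresh clone."""
        failed = self.active[target]
        if failed.alive:
            return failed                      # nothing to do
        clone = self.passive.pop(target)       # FIG. 6: the clone takes over
        clone.role = "active"
        self.active[target] = clone
        # FIG. 7: a new clone is created in the Passive Base.
        self.passive[target] = VM(f"100-{next(self._ids)}", "passive", target)
        return clone

pod = Pod(
    active={"CRAC": VM("100-0", "active", "CRAC")},
    passive={"CRAC": VM("100-5", "passive", "CRAC")},
)
pod.active["CRAC"].alive = False               # FIG. 5: VM #0 fails
promoted = pod.fail_over("CRAC")
print(promoted.vm_id, pod.passive["CRAC"].vm_id)  # 100-5 100-8
```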

Hereinafter, the structure of the VMs will be described in detail with reference to FIG. 8. Since the VMs differ only in the object of monitoring/management and can be implemented with the same structure, FIG. 8 shows one VM as a representative.

As shown in FIG. 8, the VM includes an SNMP module, a check_snmp module, a DCM daemon, a DCMM, a DB, a DB manager, an FT manager, and an Overstate Control Module (OCM).

DCMM creates configuration files (cfg files) for each device (host) used to monitor the target device, and the DCM daemon manages periodic monitoring. The configuration file acts as a data collection object that contains commands related to the monitoring of the device.

The check_snmp module transfers the configuration file created by DCMM to the monitored device (host) through SNMP to obtain data for monitoring. The SNMP module is a module that performs networking with the monitored device through Ethernet.
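
As a rough illustration of how a per-host configuration can drive data collection, the sketch below builds a configuration as a plain dictionary and polls each configured item through a pluggable getter. The configuration format, the function names `make_host_config` and `poll`, the host name, and the stand-in `fake_snmp_get` are all assumptions; no real SNMP library is used, and the actual cfg file format is not given in the source.

```python
def make_host_config(host, device_type, items):
    """Build a per-host configuration as a plain dict (the format is assumed)."""
    return {"host": host, "device_type": device_type, "items": list(items)}

def poll(config, snmp_get):
    """Query each configured item through the supplied snmp_get callable."""
    return {item: snmp_get(config["host"], item) for item in config["items"]}

# Stand-in for a real SNMP GET; it returns canned readings for illustration.
FAKE_READINGS = {"temperature": 22.5, "humidity": 45.0, "power": 3.2}

def fake_snmp_get(host, item):
    return FAKE_READINGS[item]

cfg = make_host_config("it-rack-01", "IT Rack", ["temperature", "humidity", "power"])
print(poll(cfg, fake_snmp_get))
```

Passing the getter as a parameter keeps the polling logic independent of the transport, so the same loop could be exercised in tests without a network.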

The DB manager stores the data acquired by the check_snmp module in the DB. In addition, the DB manager provides data stored in the DB to the dashboard system so that the administrator can directly check the status of the devices through the dashboard system.

The FT manager is a module for exchanging heartbeats with other VMs, and the OCM performs fault management and control, which will be described later in detail.

FIG. 9 is a diagram provided to explain the agent system provided in the equipment to be monitored. Like the VMs, the agent system is specialized for each type of equipment, but its structure is the same.

All equipment has an agent system. The agent system collects data about the equipment and passes it to the VM's DCMM. The data collected includes temperature, humidity, power usage, etc. Of course, other data may be further included.

The agent system includes an SNMP agent, a subagent, and a management information base (MIB), as shown in the same figure.

The SNMP agent establishes and maintains a communication connection with the VM's SNMP module, and the subagent's handler senses (gathers) the data required by the configuration file received from the VM. The MIB stores information that is referred to for data collection/management.
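
One way to picture the subagent's handler is as a registry that maps each requested item to a function reading the equipment. The sketch below is hypothetical: the class `SubAgent`, its methods, and the constant readings are illustrative, and the MIB is not modeled.

```python
class SubAgent:
    """Maps requested item names to handler functions that read the equipment."""

    def __init__(self):
        self.handlers = {}

    def register(self, item, handler):
        self.handlers[item] = handler

    def gather(self, items):
        # Items without a registered handler are reported as None.
        return {
            item: self.handlers[item]() if item in self.handlers else None
            for item in items
        }

agent = SubAgent()
agent.register("temperature", lambda: 23.0)   # constant stand-ins for sensors
agent.register("humidity", lambda: 48.0)
print(agent.gather(["temperature", "humidity", "power"]))
```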

FIG. 10 illustrates a process of sensing and collecting data of a device by the VM and the agent system.

As shown in FIG. 10, data necessary for monitoring is collected from an object using a configuration file keti_host generated by DCMM. It has been described above that the data collected includes temperature, humidity, power consumption, and the like.

Thereafter, the check_snmp module of the VM requests/collects the above data from the agent system (SNMP_GET, SNMP_RESPONSE) and stores it in the DB (insert data). Then, all or part of the data stored in the DB (e.g., data requested by the administrator) is reported to the dashboard system (select data). The data passed to the dashboard system is shown to the administrator in various forms.
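
The insert and select steps can be illustrated with an in-memory SQLite database from the Python standard library. The table layout, the function names `insert_readings` and `select_latest`, and the sample values are assumptions; the source does not specify which database or schema the DB manager uses.

```python
import sqlite3

# In-memory database standing in for the VM's DB (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id TEXT, item TEXT, value REAL, ts INTEGER)"
)

def insert_readings(device_id, readings, ts):
    """Store one polling result (the 'insert data' step)."""
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        [(device_id, item, value, ts) for item, value in readings.items()],
    )
    conn.commit()

def select_latest(device_id):
    """Return the most recent readings of a device (the 'select data' step)."""
    rows = conn.execute(
        "SELECT item, value FROM readings WHERE device_id = ? AND ts = "
        "(SELECT MAX(ts) FROM readings WHERE device_id = ?)",
        (device_id, device_id),
    ).fetchall()
    return dict(rows)

insert_readings("200-2", {"temperature": 22.0, "humidity": 44.0}, ts=1)
insert_readings("200-2", {"temperature": 27.5, "humidity": 43.0}, ts=2)
print(select_latest("200-2"))
```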

On the left side of FIG. 11, a process of processing an abnormal condition (equipment failure) occurring in the equipment is illustrated. As shown in FIG. 11, when an abnormal state is detected in a device, which is a monitoring object, DCM of the VM first analyzes it and delivers necessary messages for resolution.

The format of the messages used in this process is shown on the right side of the same figure. As shown, the messages used for state abnormality control include an Alert message, a Handle message, a Control message, and a Check message.

Each message begins with a "Msg Type" field indicating the type of message. "Device Type" is a field indicating the type of equipment: CRAC, IT Rack, or UPS & PDU. "Device ID" is an ID assigned to each device to identify it.

"Error State" is a field indicating the type of state abnormality (fault) occurred in the equipment, and "Error Information" is additional data necessary to deal with the state abnormality and contains detailed state information of the current equipment.

"Handle State" is a field indicating whether a state abnormality can be solved, and a "Handle Command" is a field containing an operation to be performed to resolve a state abnormality. "Control Command" contains commands that are sent to the equipment for remedy.

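The fields described above can be collected into a single record type, as in the following sketch. The field names follow the text, but the Python types, the choice to make all fields after "Device ID" optional, and the example values are assumptions, because the text does not state which fields each message type carries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MsgType(Enum):
    ALERT = "Alert"
    HANDLE = "Handle"
    CONTROL = "Control"
    CHECK = "Check"

class DeviceType(Enum):
    CRAC = "CRAC"
    IT_RACK = "IT Rack"
    UPS_PDU = "UPS & PDU"

@dataclass
class Message:
    msg_type: MsgType
    device_type: DeviceType
    device_id: str
    # Which of the following fields a given message type uses is assumed.
    error_state: Optional[str] = None
    error_information: Optional[dict] = None
    handle_state: Optional[bool] = None
    handle_command: Optional[str] = None
    control_command: Optional[str] = None

alert = Message(
    MsgType.ALERT,
    DeviceType.IT_RACK,
    "200-2",
    error_state="temperature",
    error_information={"temperature": 31.0},
)
print(alert.msg_type.value, alert.device_id)  # Alert 200-2
```
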
FIG. 12 illustrates a process of handling a state abnormality (fault), and the message types and delivery paths are shown in detail in FIGS. 13 to 18. In FIG. 12, it is assumed that a state abnormality (temperature abnormality) occurs in a specific IT Rack.

The OCM of the IT Rack VM (VM #2), having detected the state abnormality of the specific IT Rack (FIG. 13), analyzes the abnormality (FIG. 14) and sends an Alert message containing detailed state information to the OCM of the CRAC VM (VM #0) (FIG. 15).

The OCM of the CRAC VM transmits to the OCM of the IT Rack VM a Handle message that indicates whether the abnormality can be resolved and, when it can, the operation for resolving it (FIG. 16). In addition, the OCM of the CRAC VM transmits a Control message for resolving the abnormality to the corresponding device (CRAC #n) (FIG. 17).

Next, the OCM of the CRAC VM transmits a check message to the OCM of the IT Rack VM to confirm whether the IT Rack is in a normal state (FIG. 18). The OCM of the IT Rack VM then sends an Alert message to the OCM of the CRAC VM, indicating the current status of the device.

In the example above, the OCM of the IT Rack VM sent the Alert message to the OCM of the CRAC VM because the IT Rack had a temperature abnormality. If a power abnormality occurs in the IT Rack instead, the OCM of the IT Rack VM sends an Alert message to the UPS & PDU VM to initiate handling of the abnormality.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can be made by those skilled in the art to which the invention belongs without departing from the spirit of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.
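
The message exchange of FIGS. 13 to 18 can be traced with a small sketch that routes an abnormality to the responsible VM and lists the messages in order. Several points are assumptions: the names `ROUTES` and `handle_abnormality`, the idea that a power abnormality follows the same sequence as a temperature abnormality, and the behavior of stopping after the Handle message when resolution is not possible.

```python
# Error state -> VM that resolves it, as described in the text.
ROUTES = {"temperature": "CRAC VM", "power": "UPS & PDU VM"}

def handle_abnormality(error_state, resolvable=True):
    """Return the (sender, receiver, message type) sequence for one abnormality."""
    resolver = ROUTES[error_state]
    trace = [
        ("IT Rack VM", resolver, "Alert"),    # report the abnormality
        (resolver, "IT Rack VM", "Handle"),   # whether and how it can be resolved
    ]
    if resolvable:  # assumed: nothing further is sent when it cannot be resolved
        trace.append((resolver, "equipment", "Control"))  # command to the device
        trace.append((resolver, "IT Rack VM", "Check"))   # ask for current state
        trace.append(("IT Rack VM", resolver, "Alert"))   # report current state
    return trace

for step in handle_abnormality("temperature"):
    print(step)
```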

Claims

A data center system comprising:

a first virtual machine that monitors first devices in a Portable Optimized Datacenter (POD); and

a second virtual machine that monitors second devices of a different type from the first devices in the POD.
The data center system of claim 1,

wherein the first virtual machine and the second virtual machine

operate independently.
The data center system of claim 2, further comprising:

a third virtual machine that replicates the first virtual machine and checks whether the first virtual machine is operating normally while exchanging heartbeats with the first virtual machine; and

a fourth virtual machine that replicates the second virtual machine and checks whether the second virtual machine is operating normally while exchanging heartbeats with the second virtual machine.
The data center system of claim 3,

wherein, when a failure occurs in the first virtual machine, the third virtual machine that detects the failure monitors the first devices.
The data center system of claim 4, further comprising

a fifth virtual machine that is newly created to check whether the third virtual machine is operating normally while exchanging heartbeats with the third virtual machine.
The data center system of claim 1,

wherein the first virtual machine and the second virtual machine

interwork with one dashboard system that receives monitoring data from a plurality of PODs.
The data center system of claim 1,

wherein the first devices are any one of a CRAC, a UPS & PDU, and an IT Rack, and

the second devices are another one of the CRAC, the UPS & PDU, and the IT Rack.
A data center monitoring method comprising:

monitoring, by a first virtual machine, first devices in a Portable Optimized Datacenter (POD); and

monitoring, by a second virtual machine, second devices of a different type from the first devices in the POD.