KR102415027B1

KR102415027B1 - Backup recovery method for large scale cloud data center autonomous operation

Info

Publication number: KR102415027B1
Application number: KR1020200146806A
Authority: KR
Inventors: 김성윤; 박성호
Original assignee: 주식회사 테라텍
Priority date: 2019-11-05
Filing date: 2020-11-05
Publication date: 2022-07-01
Also published as: KR20210054480A

Abstract

The present invention relates to a backup recovery structure and method for autonomous operation of a large-scale cloud data center. The method includes: a node registration step of recognizing each of a plurality of nodes constituting the large-scale cloud data center and collecting and storing hardware and software information for each node; an event monitoring step of collecting and analyzing the hardware event logs and operating system event logs of each node in real time to determine whether to proceed with backup or recovery of each node, along with its target and method; a backup step of, when a node requiring backup is detected, performing a primary backup of that node's backup target to persistent memory and then a secondary backup to backup media; and a recovery step of, when a node requiring recovery is detected, recovering that node from the persistent memory, or from the backup media if the persistent memory is unavailable.

Description

Backup recovery method for large scale cloud data center autonomous operation

The present invention relates to a backup recovery method for autonomous operation of a large-scale cloud data center, which connects the many computing devices constituting a cloud over a network and provides an autonomous backup and recovery environment driven by integrated events (hardware, operating system, application).

The hardest task in operating Linux services is system recovery. Rather than building a new system, the engineer must restore one that someone else built, so recovery demands more than twice the effort: beyond analyzing the system, the engineer must uncover the previous engineer's know-how and mistakes hidden inside it.

Vendors have given data backup ample consideration, and many products solve it well. For system backup, however, no adequate solution has been offered.

For a Linux system, installation takes roughly three to four hours when counted from OS installation through application installation.

Reinstallation becomes harder still because of the complexity of a cloud data center (operating system, cloud middleware, applications).

In particular, the demand for high availability, as expressed in Equation 1, keeps growing, and reducing recovery time is precisely what raises availability.

[Equation 1]
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \times 100\ (\%)$$

Here, A is availability (%), MTBF is the mean time between failures, and MTTR is the mean time to recover (the time taken to recover). The smaller the MTTR, the closer availability gets to 100%.
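The effect of MTTR on availability can be checked numerically. The following is a minimal sketch that assumes Equation 1 is the standard steady-state form A = MTBF / (MTBF + MTTR) × 100; the function name and the sample figures (a 1000-hour MTBF, a 4-hour manual reinstall, a 15-minute automated restore) are illustrative assumptions, not values from the patent.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability A (%) = MTBF / (MTBF + MTTR) * 100 (assumed form of Equation 1)."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF + MTTR must be positive")
    return mtbf_hours / total * 100.0


# Illustrative figures only: a node that fails about once every 1000 hours.
manual = availability_percent(1000.0, 4.0)      # hypothetical 4-hour manual reinstall
automated = availability_percent(1000.0, 0.25)  # hypothetical 15-minute automated restore
print(f"manual: {manual:.4f}%  automated: {automated:.4f}%")
```

With MTBF held fixed, shrinking MTTR is the only way the ratio approaches 100%, which is the argument the text makes for faster recovery.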

Many variables enter into the definition of downtime, and it is normal for planned downtime to make up a large share. Software-caused downtime also accounts for a large percentage, because bugs in server software (including the operating system), client software, and network software are the hardest factor to control when it comes to system stability.

As hardware becomes more stable, the share of downtime caused by software will grow, and as software becomes more complex, failures caused by the software itself will occur more often. This is why recovery technology for software is required.

Korean Patent Application Publication No. 10-2019-0106488 (published 2019-09-18)

To solve the above problems, the present invention aims to provide a backup recovery structure and method for autonomous operation of a large-scale cloud data center that allows backup and recovery of the data center to be performed autonomously, and more quickly, accurately, and safely.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

As a means of solving the above problem, one embodiment of the present invention provides a backup recovery method for autonomous operation of a large-scale cloud data center, including: a node registration step of recognizing each of a plurality of nodes constituting the large-scale cloud data center and collecting and storing hardware and software information for each node; an event monitoring step of collecting and analyzing the hardware event logs and operating system event logs of each node in real time to determine whether to proceed with backup or recovery of each node, along with its target and method; a backup step of, when a node requiring backup is detected, performing a primary backup of that node's backup target to persistent memory and then a secondary backup to backup media; and a recovery step of, when a node requiring recovery is detected, recovering that node from the persistent memory, or from the backup media if the persistent memory is unavailable.

The event monitoring step is characterized in that it determines whether a hardware failure has occurred by selecting and using only the temperature, fan, power, and hardware error logs from the hardware event logs; determines whether an OS failure has occurred by selecting and using the operating system failure log and the failure count from the OS logs; and automatically determines whether to proceed with backup or recovery, and its target and method, according to the type, count, or severity of the failure.

The event monitoring step is characterized in that it divides the target of backup or recovery into system and data, and divides the method of backup or recovery into full backup and snapshot backup.

The node registration step is characterized in that it acquires and stores, as hardware information, at least one of CPU configuration, user configuration options, boot mode, boot device, RAID (Redundant Array of Independent Disks), iSCSI (Internet Small Computer System Interface), and PXE (Pre-boot eXecution Environment), and acquires and provides, as software information, at least one of OS type, boot manager, partitions, and file system.

The method is characterized in that it further includes a backup and recovery environment construction step of building either a live-media-based or a network-based backup and recovery environment, taking into account at least one of the boot mode in the hardware information and the boot manager in the software information.

The present invention is applicable to large-scale cloud data centers and makes it possible to build a system architecture with a lower system construction cost (total cost, including equipment and labor) for the same performance.

FIG. 1 is a diagram for explaining a backup recovery method for autonomous operation of a large-scale cloud data center according to an embodiment of the present invention.
FIGS. 2 to 4 are diagrams for explaining the node registration step in more detail according to an embodiment of the present invention.
FIGS. 5 to 7 are diagrams for explaining the event collection step in more detail according to an embodiment of the present invention.
FIGS. 8 and 9 are diagrams for explaining the backup step in more detail according to an embodiment of the present invention.
FIGS. 10 to 12 are diagrams for explaining the recovery step in more detail according to an embodiment of the present invention.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various devices that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should not be understood as limiting it to the specifically listed embodiments and conditions.

It should also be understood that all detailed descriptions reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, are intended to include their structural and functional equivalents. Such equivalents include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Thus, for example, the block diagrams herein should be understood as representing conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood as representing various processes that can be substantially represented on a computer-readable medium and executed by a computer or processor, whether or not that computer or processor is explicitly shown.

The functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided by dedicated hardware as well as by hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.

Moreover, explicit use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM), and non-volatile memory. Other well-known, conventional hardware may also be included.

In the claims of this specification, an element expressed as a means for performing a function described in the detailed description is intended to cover any way of performing that function, including, for example, a combination of circuit elements that performs the function, or software in any form, including firmware, microcode, and the like, combined with appropriate circuitry for executing that software to perform the function. Because the invention defined by such claims combines the functions provided by the variously recited means in the manner the claims require, any means capable of providing those functions should be understood as equivalent to those identified from this specification.

The above objects, features, and advantages will become clearer through the following detailed description taken with the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains can readily practice its technical idea. In describing the invention, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the gist of the invention.

FIG. 1 is a diagram for explaining a backup recovery method for autonomous operation of a large-scale cloud data center according to an embodiment of the present invention.

Referring to FIG. 1, the method of the present invention may include: a node registration step (S1) of recognizing each of a plurality of nodes constituting a large-scale cloud data center and collecting and storing hardware and software information for each node; an event monitoring step (S2) of collecting and analyzing the hardware event logs and operating system event logs of each node in real time to determine whether to proceed with backup or recovery of each node, along with its target and method; a backup step (S3) of, when a node requiring backup is detected, performing a primary backup of that node's backup target to persistent memory and then a secondary backup to backup media; and a recovery step (S4) of, when a node requiring recovery is detected, recovering that node from the persistent memory, or from the backup media if the persistent memory is unavailable.

The present invention automatically tracks and recognizes node-level failures and newly added nodes, and provides a manager that handles their backup and recovery.

It also includes a heterogeneous node event collector and integrated manager. In a large-scale cloud, equipment differs from vendor to vendor; this component manages the events of such equipment in a unified way so that heterogeneous nodes can also be managed together.

For node-level failure analysis, system event logs and OS logs are collected and analyzed in real time, and events are generated automatically from the results.

Backup and recovery schedules and procedures are created from the collected events and then applied, so that backup and recovery operations are performed more efficiently and consistently.

In addition, a backup/recovery environment is built using live media and network boot. That is, unlike conventional methods, the present invention provides live booting through network booting (PXE booting), supports backup and recovery at the user's choice, and can also carry out automatic recovery according to predefined rules.

FIGS. 2 to 4 are diagrams for explaining the node registration step in more detail according to an embodiment of the present invention.

As shown in FIG. 2, the node registration step (S1) of the present invention may include a node recognition step (S11), a node hardware information registration step (S12), and a node software information registration step (S13).

In the node recognition step (S11), the backup manager automatically recognizes each node that constitutes the cloud data center or is newly added to it.

In the node hardware information registration step (S12), as shown in FIG. 3, the backup manager requests hardware information from each node through IPMI (Intelligent Platform Management Interface), a standardized message-based hardware management interface, and each node collects and provides its own hardware information using its BMC (Baseboard Management Controller), which monitors the state of the system hardware. The backup manager then stores the hardware information collected and provided by each node in a database as baseline data for backup targets.

Here, the hardware information may cover CPU configuration, user configuration options, boot mode, boot device, RAID (Redundant Array of Independent Disks), iSCSI (Internet Small Computer System Interface), PXE (Pre-boot eXecution Environment), and the like.

In the node software information registration step (S13), as shown in FIG. 4, the backup manager additionally obtains the software information of each node using IPMI and the BMC, and stores it in the database as further baseline data for backup targets.

Here, the software information may cover OS type, boot manager, partitions, file system, and the like.
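The three registration sub-steps (S11 to S13) can be pictured as filling in one record per node. The sketch below is only an illustration under assumptions: the class and field names are hypothetical, the in-memory dictionary stands in for the backup manager's database, and the actual IPMI/BMC exchange is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HardwareInfo:
    cpu_config: Optional[str] = None
    user_options: Dict[str, str] = field(default_factory=dict)
    boot_mode: Optional[str] = None      # e.g. "UEFI" or "Legacy"
    boot_device: Optional[str] = None
    raid: Optional[str] = None
    iscsi: Optional[str] = None
    pxe_enabled: Optional[bool] = None


@dataclass
class SoftwareInfo:
    os_type: Optional[str] = None
    boot_manager: Optional[str] = None   # e.g. "GRUB"
    partitions: List[str] = field(default_factory=list)
    file_system: Optional[str] = None


@dataclass
class NodeRecord:
    node_id: str
    hardware: Optional[HardwareInfo] = None
    software: Optional[SoftwareInfo] = None


class NodeRegistry:
    """In-memory stand-in (assumption) for the backup manager's node database."""

    def __init__(self) -> None:
        self._nodes: Dict[str, NodeRecord] = {}

    def recognize(self, node_id: str) -> NodeRecord:
        # S11: create the record the first time a node is seen.
        return self._nodes.setdefault(node_id, NodeRecord(node_id))

    def register_hardware(self, node_id: str, info: HardwareInfo) -> None:
        # S12: store the hardware information reported by the node.
        self.recognize(node_id).hardware = info

    def register_software(self, node_id: str, info: SoftwareInfo) -> None:
        # S13: store the software information reported by the node.
        self.recognize(node_id).software = info

    def get(self, node_id: str) -> NodeRecord:
        return self._nodes[node_id]


registry = NodeRegistry()
registry.recognize("node-01")
registry.register_hardware("node-01", HardwareInfo(boot_mode="UEFI", pxe_enabled=True))
registry.register_software("node-01", SoftwareInfo(os_type="Linux", boot_manager="GRUB"))
```

Every field is optional because the text only requires "at least one" item of each kind to be captured.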

FIGS. 5 to 7 are diagrams for explaining the event collection step in more detail according to an embodiment of the present invention.

As shown in FIG. 5, the event monitoring step (S2) of the present invention may include an event log collection step (S21) for each node and an event log analysis step (S22).

In the event log collection step (S21), as shown in FIG. 6, the BMC of each node monitors the node's current state through the various hardware sensors installed in the node and through the operating system, and collects and provides a plurality of event logs in real time.

The backup manager receives these logs through IPMI, converts them into a standard data format, and stores them in a database. The present invention converts the event logs sent by each node into a standard data format before storing them because at least one of the nodes may be implemented with heterogeneous equipment; the conversion allows even heterogeneous node events to be collected and analyzed in a unified way.
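The conversion to a standard data format can be sketched as one adapter per vendor feeding a shared event type. The patent does not define the standard format or any vendor format, so everything below is a hypothetical illustration: the `StandardEvent` fields, the two raw layouts ("vendor A" as a dictionary, "vendor B" as a pipe-delimited string), and the category names are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StandardEvent:
    node_id: str
    timestamp: datetime
    source: str     # "hardware" or "os"
    category: str   # e.g. "temperature", "fan", "power", "hw_error"
    message: str


# Hypothetical vendor-specific labels mapped onto shared category names.
_VENDOR_A_SENSORS = {"Temp": "temperature", "Fan": "fan", "PSU": "power"}
_VENDOR_B_CODES = {"TEMP": "temperature", "FAN": "fan", "PWR": "power", "ERR": "hw_error"}


def from_vendor_a(node_id: str, raw: dict) -> StandardEvent:
    """Assumed layout: {"ts": <epoch seconds>, "sensor": "Temp", "msg": "..."}."""
    return StandardEvent(
        node_id=node_id,
        timestamp=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        source="hardware",
        category=_VENDOR_A_SENSORS.get(raw["sensor"], "user_defined"),
        message=raw["msg"],
    )


def from_vendor_b(node_id: str, raw: str) -> StandardEvent:
    """Assumed layout: "<ISO 8601 timestamp>|<CODE>|<message>"."""
    stamp, code, message = raw.split("|", 2)
    return StandardEvent(
        node_id=node_id,
        timestamp=datetime.fromisoformat(stamp),
        source="hardware",
        category=_VENDOR_B_CODES.get(code, "user_defined"),
        message=message,
    )


events = [
    from_vendor_a("node-01", {"ts": 1604570400, "sensor": "Temp", "msg": "CPU over threshold"}),
    from_vendor_b("node-02", "2020-11-05T10:00:00+00:00|FAN|Fan 2 below threshold"),
]
```

Once both adapters emit the same type, the analysis step can treat events from different vendors identically.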

The event logs are divided into hardware event logs and OS logs. Hardware event logs may include temperature, fan, power, and hardware error event logs as well as user-defined event logs; OS logs may include the operating system failure log, the failure count, and the like.

In the event log analysis step (S22), as shown in FIG. 7, the backup manager selects and analyzes each node's event logs according to preset event analysis criteria, and automatically determines from the results whether to proceed with backup or recovery of each node, along with its target and method.

For example, it can determine whether a hardware failure has occurred by selecting and using only the temperature, fan, power, and hardware error logs from the many hardware event logs, and determine whether an OS failure has occurred by selecting and using the operating system failure log and the failure count from the OS logs.
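The selection described above amounts to filtering events by category and counting what remains. A minimal sketch follows, assuming events arrive as (source, category) pairs; the category names and the OS failure threshold of 3 are hypothetical, since the patent leaves the "preset event analysis criteria" unspecified.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical category names for the hardware logs the text says are selected.
HW_CATEGORIES = {"temperature", "fan", "power", "hw_error"}


def detect_failures(
    events: Iterable[Tuple[str, str]],
    os_failure_threshold: int = 3,  # assumed criterion, not from the patent
) -> Dict[str, object]:
    """Summarize hardware and OS failures from (source, category) pairs."""
    hw_counts: Counter = Counter()
    os_failures = 0
    for source, category in events:
        if source == "hardware" and category in HW_CATEGORIES:
            hw_counts[category] += 1
        elif source == "os" and category == "os_failure":
            os_failures += 1
        # Anything else (e.g. user-defined events) is ignored here.
    return {
        "hw_failure": sum(hw_counts.values()) > 0,
        "hw_counts": dict(hw_counts),
        "os_failure": os_failures >= os_failure_threshold,
        "os_failure_count": os_failures,
    }


summary = detect_failures([
    ("hardware", "fan"),
    ("hardware", "user_defined"),
    ("os", "os_failure"),
])
print(summary)
```

Keeping the raw counts alongside the boolean flags lets a later decision step weigh how often a failure occurred, not just whether it did.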

Then, based on the type and count of failures, it decides whether the node requires a system backup or a data backup (the target of backup or recovery), and whether to perform a full backup or a snapshot backup (the method of backup or recovery).
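The patent does not publish the rules that map failures to a target and method, so the following sketch only shows the shape such a rule table could take. The specific policy (OS failures point at the system, hardware-only failures point at data, and hardware failures or repeated OS failures trigger a full backup) and the threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Target(Enum):
    SYSTEM = "system"
    DATA = "data"


class Method(Enum):
    FULL = "full"
    SNAPSHOT = "snapshot"


@dataclass(frozen=True)
class Decision:
    proceed: bool
    target: Optional[Target] = None
    method: Optional[Method] = None


def decide_backup(hw_failure: bool, os_failure_count: int, full_threshold: int = 3) -> Decision:
    """Hypothetical policy mapping failure type and count to a backup decision."""
    if not hw_failure and os_failure_count == 0:
        return Decision(proceed=False)  # nothing observed, no backup needed
    # Assumed rule: OS failures put the system image at risk, hardware-only failures the data.
    target = Target.SYSTEM if os_failure_count > 0 else Target.DATA
    # Assumed rule: hardware trouble or repeated OS failures justify a full backup.
    severe = hw_failure or os_failure_count >= full_threshold
    method = Method.FULL if severe else Method.SNAPSHOT
    return Decision(True, target, method)


print(decide_backup(hw_failure=False, os_failure_count=1))
```

A user-driven request could bypass this function and construct a `Decision` directly.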

The user can also decide directly, on request, whether to proceed with backup or recovery, along with its target and method, so that varied user requirements are handled more flexibly.

FIGS. 8 and 9 are diagrams for explaining the backup step in more detail according to an embodiment of the present invention.

As shown in FIG. 8, the present invention builds a backup scheduler based on the event log analysis results. Through it, the operator can choose which target to back up (system or data), which method to use (full backup or snapshot backup), the type of backup environment, and so on.

In addition to the backup media (SSD, disk) and existing storage media, the present invention adds persistent memory for high-performance backup. Node information is first backed up to the persistent memory and then backed up a second time to the backup media (SSD, disk), which makes it possible to build and provide a real-time backup environment.
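The ordering of the two backups is the point of this step: the fast tier receives the copy first, and the slower media is filled from it afterward. The sketch below illustrates that ordering only. It uses ordinary directories as stand-ins for persistent memory and backup media, which is an assumption; real persistent memory access and the patent's actual backup format are not modeled.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def two_stage_backup(source: Path, pmem_dir: Path, media_dir: Path) -> Tuple[Path, Path]:
    """Copy `source` to the fast tier first, then from the fast tier to backup media."""
    pmem_dir.mkdir(parents=True, exist_ok=True)
    media_dir.mkdir(parents=True, exist_ok=True)
    # Primary backup: a restorable copy exists as soon as this copy finishes.
    primary = Path(shutil.copy2(source, pmem_dir / source.name))
    # Secondary backup: taken from the primary copy, not from the live node.
    secondary = Path(shutil.copy2(primary, media_dir / source.name))
    return primary, secondary


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    image = root / "node-01.img"
    image.write_bytes(b"system image placeholder")
    primary, secondary = two_stage_backup(image, root / "pmem", root / "media")
    print(primary.parent.name, secondary.parent.name)
```

Reading the secondary copy from the primary one means the node itself is only touched once per backup, which is one plausible reason for staging through the faster tier.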

FIGS. 10 to 12 are diagrams for explaining the recovery step in more detail according to an embodiment of the present invention.

As shown in FIG. 10, the present invention configures a recovery manager according to the event log analysis results. Like the backup scheduler, the recovery manager allows a choice of which target to restore (system or data), which method to use (full restore or snapshot restore), the type of restore environment, and so on.

To guarantee fast recovery, the recovery manager restores the node system from persistent memory first; if the persistent memory is unavailable, it falls back to restoring the node system from the backup media.
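The recovery order can be sketched as a list of sources tried in priority order. As before, directories stand in for persistent memory and backup media, and "unavailable" is modeled simply as a missing directory or file; both are assumptions for illustration, as are the function name and the returned labels.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def restore_with_fallback(
    name: str, pmem_dir: Optional[Path], media_dir: Path, dest_dir: Path
) -> str:
    """Restore `name` from the fast tier if possible, otherwise from backup media.

    Returns a label naming the tier that served the restore.
    """
    candidates = []
    if pmem_dir is not None:
        candidates.append(("persistent_memory", pmem_dir / name))
    candidates.append(("backup_media", media_dir / name))

    for label, path in candidates:
        if path.is_file():
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest_dir / name)
            return label
    raise FileNotFoundError(f"no backup of {name!r} found in any tier")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "media").mkdir()
    (root / "media" / "node-01.img").write_bytes(b"image")
    # The fast tier directory does not exist here, so the restore falls back.
    used = restore_with_fallback("node-01.img", root / "pmem", root / "media", root / "restored")
    print(used)
```

Returning the tier label makes it easy to log which path a recovery took, which matters when estimating recovery time.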

By building the backup/recovery environment with live media and network boot (netboot), the present invention allows the backup or recovery procedure to run automatically through network booting, without the separate software that conventional approaches require.

The choice between the live media backup/recovery environment and the network boot backup/recovery environment may be made by considering at least one of the boot mode in the hardware information and the boot manager in the software information, but is not limited to this.
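How the boot mode and boot manager drive this choice is not spelled out in the text. As one hedged illustration, the sketch below picks network boot whenever either value mentions PXE and falls back to live media otherwise; the rule, the function name, and the returned labels are all hypothetical.

```python
def choose_environment(boot_mode: str, boot_manager: str = "") -> str:
    """Return "network_boot" or "live_media" (hypothetical labels and rule).

    Assumption: a node whose registered boot mode or boot manager mentions
    PXE can be network booted; any other node uses live media.
    """
    hints = f"{boot_mode} {boot_manager}".lower()
    return "network_boot" if "pxe" in hints else "live_media"


print(choose_environment("PXE"))
print(choose_environment("UEFI", "GRUB"))
```

Because the text says the choice is "not limited to this", a real policy might also consider operator preference or network reachability.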

In addition, the live media is an image that carries a standard hardware detector, network drivers, the X Window System, and the like, and is supported by a self-developed backup recovery application or web page, so that a backup/recovery environment can be built without a separate operating system or backup application.

Network boot is an environment that allows a computer to boot through its network interface; it likewise makes it possible to build a backup/recovery environment without a separate operating system or backup application.

FIG. 13 is a diagram illustrating an example computing environment in which one or more embodiments disclosed herein may be implemented, and shows an example of a system 1000 including a computing device 1100 configured to implement one or more of the embodiments described above. For example, the computing device 1100 may be, but is not limited to, a personal computer, a server computer, a handheld or laptop device, a mobile device (mobile phone, PDA, media player, etc.), a multiprocessor system, consumer electronics, a minicomputer, a mainframe computer, or a distributed computing environment including any of the foregoing systems or devices.

The computing device 1100 may include at least one processing unit 1110 and memory 1120. The processing unit 1110 may include, for example, a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA), and may have a plurality of cores. The memory 1120 may be volatile memory (e.g., RAM), non-volatile memory (e.g., ROM, flash memory), or a combination of these.

The computing device 1100 may also include additional storage 1130. The storage 1130 includes, but is not limited to, magnetic storage, optical storage, and the like. The storage 1130 may hold computer-readable instructions for implementing one or more embodiments disclosed herein, as well as other computer-readable instructions for implementing an operating system, application programs, and the like. Computer-readable instructions stored in the storage 1130 may be loaded into the memory 1120 for execution by the processing unit 1110.

The computing device 1100 may also include input device(s) 1140 and output device(s) 1150. The input device(s) 1140 may include, for example, a keyboard, a mouse, a pen, a voice input device, a touch input device, an infrared camera, a video input device, or any other input device. The output device(s) 1150 may include, for example, one or more displays, speakers, printers, or any other output device. The computing device 1100 may also use an input device or an output device provided in another computing device as the input device(s) 1140 or the output device(s) 1150.

The computing device 1100 may also include communication connection(s) 1160 that enable the computing device 1100 to communicate with another device (e.g., a computing device 1300). The communication connection(s) 1160 may include a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or another interface for connecting the computing device 1100 to another computing device. The communication connection(s) 1160 may be wired or wireless.

The components of the computing device 1100 described above may be connected by various interconnects such as a bus (e.g., Peripheral Component Interconnect (PCI), USB, FireWire (IEEE 1394), or an optical bus structure), or may be interconnected by a network 1200.

As used herein, terms such as "component," "module," "system," and "interface" generally refer to a computer-related entity that is hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For example, both an application running on a controller and the controller itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer or distributed between two or more computers.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present invention.

Claims

1. A backup recovery method for autonomous operation of a large-scale cloud data center, the method comprising:
a node registration step of recognizing each of a plurality of nodes constituting the large-scale cloud data center, and collecting and storing hardware and software information of each of the nodes;
an event monitoring step of collecting and analyzing hardware event logs and operating system event logs of each of the nodes in real time, and determining whether to proceed with backup or recovery of each of the nodes, the target, and the method;
a backup step of, when a node requiring backup is detected, performing a primary backup of a backup target of the node requiring backup to a permanent memory and then performing a secondary backup to backup media; and
a recovery step of, when a node requiring recovery is detected, recovering the node requiring recovery based on the permanent memory, and additionally recovering the node based on the backup media if the permanent memory is unavailable,
wherein, in the event monitoring step, each of the nodes collects and provides its hardware event logs and operating system event logs in real time through a baseboard management controller (BMC), and a backup manager receives the hardware event logs and the operating system event logs through the Intelligent Platform Management Interface (IPMI), converts them into a standard data format, and stores them, thereby enabling integrated collection and analysis of events from heterogeneous nodes.
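
The flow of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the `StandardEvent` record, the raw field names (`sensor_type`, `facility`, `description`, `msg`), and the in-memory `Tier` class standing in for the permanent memory and the backup media are all hypothetical names introduced for illustration. The sketch shows only the two ideas named in the claim: mapping heterogeneous event records onto one standard format, and backing up to two tiers with recovery falling back to the second tier.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StandardEvent:
    """Hypothetical common record for events from heterogeneous nodes."""
    node_id: str
    source: str      # "hw" or "os"
    category: str
    message: str


def normalize(node_id: str, source: str, raw: Dict[str, str]) -> StandardEvent:
    # The raw field names are assumptions; each one is mapped onto one schema.
    category = raw.get("sensor_type") or raw.get("facility") or "unknown"
    message = raw.get("description") or raw.get("msg") or ""
    return StandardEvent(node_id, source, category.lower(), message)


@dataclass
class Tier:
    """In-memory stand-in for the permanent memory or the backup media."""
    available: bool = True
    images: Dict[str, bytes] = field(default_factory=dict)


def backup(node_id: str, image: bytes, pmem: Tier, media: Tier) -> None:
    # Primary backup goes to permanent memory, then a secondary copy to media.
    pmem.images[node_id] = image
    media.images[node_id] = pmem.images[node_id]


def recover(node_id: str, pmem: Tier, media: Tier) -> Optional[bytes]:
    # Prefer permanent memory; fall back to the backup media when permanent
    # memory is unavailable or holds no image for this node.
    if pmem.available and node_id in pmem.images:
        return pmem.images[node_id]
    return media.images.get(node_id)
```

In this sketch, a failure of the permanent memory during the backup itself is not modeled; only the recovery-side fallback described in the claim is shown.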

2. The method of claim 1, wherein the event monitoring step
checks whether a hardware failure has occurred by selecting and using only the temperature, fan, power, and hardware error related logs among the hardware event logs, checks whether an operating system failure has occurred by selecting and using the operating system failure logs and the number of failures from the operating system event logs, and automatically determines whether to proceed with backup or recovery, the target, and the method according to the type, number, or degree of the failures.

3. The method of claim 2, wherein, in the event monitoring step,
the target of backup or recovery is divided into system and data, and the method of backup or recovery is divided into full backup and snapshot backup.
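
The decision described in claims 2 and 3 can be sketched as follows. The log categories, the target types (system, data), and the method types (full, snapshot) come from the claims; the category strings, the threshold `os_failure_limit`, and the specific mapping from failures to decisions are assumptions made only for illustration, since the claims do not state a concrete policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional

# Hardware log categories named in claim 2 (string encoding is an assumption).
HW_CATEGORIES = {"temperature", "fan", "power", "hardware_error"}


class Action(Enum):
    BACKUP = "backup"
    RECOVERY = "recovery"


class Target(Enum):
    SYSTEM = "system"
    DATA = "data"


class Method(Enum):
    FULL = "full"
    SNAPSHOT = "snapshot"


@dataclass
class Decision:
    action: Action
    target: Target
    method: Method


def decide(hw_categories: Iterable[str], os_failure_count: int,
           os_failure_limit: int = 3) -> Optional[Decision]:
    # Keep only the hardware log categories named in the claim.
    hw_failures = [c for c in hw_categories if c in HW_CATEGORIES]
    if os_failure_count >= os_failure_limit:
        # Assumed policy: repeated OS failures trigger a full system recovery.
        return Decision(Action.RECOVERY, Target.SYSTEM, Method.FULL)
    if hw_failures:
        # Assumed policy: a hardware warning triggers a full system backup.
        return Decision(Action.BACKUP, Target.SYSTEM, Method.FULL)
    if os_failure_count > 0:
        # Assumed policy: isolated OS failures trigger a data snapshot.
        return Decision(Action.BACKUP, Target.DATA, Method.SNAPSHOT)
    return None  # no relevant failure, so nothing to do
```

For example, under these assumed rules `decide(["fan", "audit"], 0)` ignores the unrelated `audit` category and yields a full system backup, while `decide(["audit"], 0)` yields no action.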

4. The method of claim 1, wherein, in the node registration step,
at least one of CPU configuration, user configuration options, boot mode, boot device, Redundant Array of Independent Disks (RAID), Internet Small Computer System Interface (iSCSI), and Pre-boot eXecution Environment (PXE) is acquired and stored as the hardware information, and at least one of an OS type, a boot manager, a partition, and a file system is acquired and provided as the software information.

5. The method of claim 4,
further comprising a backup and recovery environment construction step of constructing a live media-based backup and recovery environment or a network-based backup and recovery environment in consideration of at least one of the boot mode of the hardware information and the boot manager of the software information.
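
The environment construction step of claim 5 can be sketched as a simple selection over the registered node information of claim 4. The `NodeInfo` fields mirror items listed in claim 4, but the string encodings, the sets `NETWORK_BOOT_MODES` and `KNOWN_BOOT_MANAGERS`, and the selection rule itself are assumptions for illustration; the claims state only that the boot mode and the boot manager are taken into consideration.

```python
from dataclasses import dataclass

# Assumed encodings; the claims do not define concrete values.
NETWORK_BOOT_MODES = {"pxe", "iscsi"}
KNOWN_BOOT_MANAGERS = {"grub", "systemd-boot", "windows boot manager"}


@dataclass
class NodeInfo:
    node_id: str
    boot_mode: str      # hardware information
    boot_device: str    # hardware information
    os_type: str        # software information
    boot_manager: str   # software information


def choose_environment(node: NodeInfo) -> str:
    """Return "network" or "live_media" under the assumed rule below."""
    if node.boot_mode.lower() in NETWORK_BOOT_MODES:
        # A node that already boots over the network uses the network-based
        # backup and recovery environment.
        return "network"
    if node.boot_manager.lower() in KNOWN_BOOT_MANAGERS:
        # A recognized local boot manager is assumed able to start live media.
        return "live_media"
    # Otherwise fall back to the network-based environment.
    return "network"
```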