KR20040047209A

KR20040047209A - Method for automatically recovering computer system in network and recovering system for realizing the same

Info

Publication number: KR20040047209A
Application number: KR1020020075336A
Authority: KR
Inventors: 김태현
Original assignee: (주)소프트위드솔루션
Priority date: 2002-11-29
Filing date: 2002-11-29
Publication date: 2004-06-05

Abstract

PURPOSE: A method and a system for automatically recovering a computer system on a network are provided, so that the computer system is recovered automatically and immediately when a failure is detected in it. CONSTITUTION: Each user computer system is equipped with a failure detection module (23), which detects the occurrence of a failure in the computer system and reports the failure information to a backup/recovery server, and an automatic recovery OS (operating system) (21), which is booted on restart after the failure detection module detects a failure in the regular OS. The automatic recovery OS contains a network driver (22) for communication with the backup/recovery server and a recovery performing agent (30) for recovering the failure detected by the failure detection module. The backup/recovery server (40), on receiving a failure message from the failure detection module, extracts the backup data from a data storage server and, based on the extracted data, provides a recovery scenario and the recovery data to the recovery performing agent.

Description

Method for automatically recovering a computer system on a network and an automatic recovery system for a computer system for implementing the same {Method for automatically recovering computer system in network and recovering system for realizing the same}

The present invention relates to technology for recovering a failed computer system. More specifically, it relates to a method for automatically recovering a computer system, and to a recovery system implementing that method, using a failure detection agent and an automatic recovery operating system installed on a user computer system on a network, together with a backup/recovery server on the network to which the user computer is connected.

At the hardware level, a computer system consists of input devices, output devices, storage devices, an arithmetic unit, and a control unit. It also has an operating system installed: a set of programs, in firmware or software form, that gives the system user an interface for using this hardware efficiently.

Operating systems include Unix, DOS, and Windows (Win98, WinMe, WinNT, Win2000). In addition to managing the hardware described above, the operating system manages data and various software programs. When it detects a failure while the computer system is running, it either issues a failure message or terminates the failed program without resolving the underlying problem. If the failed program is not properly recovered, however, the failure can affect the entire computer system, and in the worst case the hard disk must be reformatted or replaced.

Failures are conventionally recovered in one of two ways: a system administrator or specialist visits the site to find and repair the failure, or a recovery hardware card is installed to keep a duplicate image of the disk in use. The former raises time and cost because of the on-site visit, leaves unrecovered any part whose cause of failure was not identified, and increases the administrator's workload. The latter incurs the cost of buying and installing the hardware card, and the card itself becomes one more component to manage.

There is also technology for remotely detecting and recovering failures of computer systems on a network, but remote recovery requires the administrator to monitor the computer systems on the network at all times. If the administrator fails to notice a failure of a particular computer system, no recovery takes place at all.

Accordingly, an object of the present invention is to solve the problems described above by providing a method and a recovery system that enable a computer system to recover itself automatically as soon as a failure of the system is detected.

FIG. 1 is a conceptual diagram showing the method for automatically recovering a computer system on a network according to the present invention.

FIG. 2 shows an automatic recovery system for a computer system according to an example of the present invention.

FIG. 3 is a flowchart showing the automatic recovery process using the automatic recovery system of FIG. 2.

FIG. 4 shows the structure of the automatic recovery operating system used in the automatic recovery system of the present invention.

To achieve this object, the invention applies to a system consisting of a plurality of user computer systems, a backup/recovery server that backs up the operating system images, application programs, and user data of those user computer systems over a network at regular intervals, and a data storage server that receives and stores data from the backup/recovery server. Each user computer system is provided with: a failure detection module that detects the occurrence of a failure in the computer system and reports failure information to the backup/recovery server; an automatic recovery operating system that is booted on restart after the failure detection module detects an abnormality in the user system's operating system; a network driver, built into the automatic recovery operating system, for communicating with the backup/recovery server; and a recovery performing agent, also built into the automatic recovery operating system, for recovering the failure detected by the failure detection module. The backup/recovery server is provided with program modules that, on receiving a failure message from the failure detection module, extract backup data from the data storage server and, based on the extracted data, provide a recovery scenario and recovery data to the recovery performing agent.

The present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram showing the method for automatically recovering a computer system on a network according to the present invention, and FIG. 2 shows an example of an automatic recovery system embodying the idea of the invention. FIG. 3 shows the automatic recovery process using the automatic recovery system of FIG. 2, and FIG. 4 shows the structure of the automatic recovery operating system used in the automatic recovery system of the present invention.

As shown in FIG. 1, the automatic recovery system according to the present invention is used in an environment in which a user's computer system (11) is connected, over the Internet or an intranet, to a backup/recovery server (17), an administrator console (18), and a number of other computer systems (12, 13, 14, 15). These computer systems run Windows or Linux, and they need not all run the same operating system. Under the control of the administrator console (18), the backup/recovery server (17) periodically receives each system's operating system image, application program images, and user data from the computer systems (11, 12, 13, 14, 15) and stores (backs up) them on the data storage server (16). The backup itself can be carried out using any of various well-known methods.

When a failure occurs while a computer system is running, the conventional approach delivers the failure message to the administrator console (18) over the Internet or an intranet (hereinafter the "network"). The administrator checks on the administrator console (18) which computer system has failed and then either recovers that system remotely or dispatches a repair technician. In the present invention, by contrast, when a failure occurs in a particular computer system, the failure is detected on that system and a failure message is sent to the backup/recovery server (17). The backup/recovery server (17) searches the data storage server (16), where the data are backed up, and extracts the backup data related to the failed data. The user system (11, 12, 13, 14, 15) then recovers from the failure on the basis of the extracted data. The failure message is still delivered to the administrator console (18), but the administrator no longer takes part in the recovery and merely monitors its progress. As a result, even when the administrator is away, a failed system can be repaired and restored as soon as the failure occurs.

To implement this concept, as shown in FIG. 2, the user system must first be provided with a failure detection module (23) that detects the occurrence of a failure and a recovery performing agent (30) that recovers the system on the basis of the extracted backup data. In addition, in case the operating system of the computer system itself fails, a recovery operating system (21) and a network driver (22) built into the recovery operating system (21) must be provided. Besides the network driver (22), the recovery operating system (21) also contains the recovery performing agent (30). The recovery performing agent (30), the recovery operating system (21), and the network driver (22) may be stored on the hard disk or on an auxiliary storage device in the system.

FIG. 4 shows the structure of the hard disk of a computer system on which the recovery operating system according to the present invention is installed.

The system's hard disk is divided into several partitions, each of which behaves like a separate hard disk. At the start of the disk is the MBR (Master Boot Record), the boot sector that the BIOS reads and runs when the computer boots. Next comes the primary partition, which holds the operating system that runs the computer, the application programs, and the user data. A typical computer system has the MBR and a primary partition on its hard disk.

Behind the primary partition, separate logical partitions can be set up as the user requires, in which case the region containing them is called the extended partition. A logical partition in the extended partition is commonly used to hold an operating system different from the system's main one, together with user data, so that two operating systems can be used on one machine. In addition, the present invention provides a logical partition at the end of the extended partition, where the network driver and recovery performing agent associated with the automatic recovery operating system are installed.
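The disk layout described above can be modeled as a small data structure. The following is only an illustrative sketch: the class names, field names, and the `example_disk` helper are hypothetical and do not come from the patent, which specifies the layout but no on-disk format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Partition:
    name: str                      # hypothetical label, for illustration only
    kind: str                      # "primary" or "logical"
    contents: List[str] = field(default_factory=list)


@dataclass
class Disk:
    boot_record: str               # boot sector at the start of the disk
    primary: Partition
    # Logical partitions inside the extended partition, in disk order.
    extended: List[Partition] = field(default_factory=list)

    def recovery_partition(self) -> Optional[Partition]:
        """Return the last logical partition, where the text places the recovery components."""
        return self.extended[-1] if self.extended else None


def example_disk() -> Disk:
    """Build one layout matching the description: primary, optional second OS, recovery last."""
    return Disk(
        boot_record="boot sector",
        primary=Partition("primary", "primary",
                          ["operating system", "applications", "user data"]),
        extended=[
            Partition("logical-1", "logical", ["second operating system", "user data"]),
            Partition("recovery", "logical",
                      ["recovery operating system", "network driver", "recovery agent"]),
        ],
    )
```

Placing the recovery components in the final logical partition keeps them apart from the partitions that the regular operating system and the user normally modify.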

The automatic recovery operating system is separate from the operating system used by the computer system's user and contains the kernel, device drivers, boot-related files, and system commands needed to start the system.

Meanwhile, the backup/recovery server is equipped with program modules (40; 41, 42, 43, 44, 45) that, after a failure message is received, work together with the database (50) of the data storage server and the user system (20) to provide a recovery scenario and recovery data.

More specifically, the recovery performing agent (30) of the user system includes: a scenario analysis module (31) that analyzes the failure recovery scenario received from the backup/recovery server; a recovery data receiving module (32) that receives the recovery data extracted from the data storage server (50); a recovery execution module (33) that performs recovery of the system under the control of the recovery data receiving module (32); a notification module (34) that reports completion of recovery; and a reboot execution module (35) that reboots the system once recovery is complete.

The program modules (40) provided on the backup/recovery server, which work together with the recovery performing agent (30), include: a recovery progress module (41) that, on being notified by the failure detection module (23) of a failure message containing the location, type, and other details of the failed data, searches the database (50) of the storage server and extracts the backup data; a scenario creation module (42) that creates a recovery scenario specifying, among other things, the order in which the data found by the recovery progress module (41) are to be restored; a recovery data transmission module (43) that sends the extracted data sequentially, in recovery order, according to the data request order received from the scenario analysis module (31); a detection module (44) that detects completion of the recovery operation on receiving the completion notice from the recovery performing agent (30); and a completion notification and reboot instruction module (45) that instructs the user system to reboot after receiving the completion notice from the detection module (44).

The failure detection and system recovery processes are now described with reference to FIG. 3. Assume that the operating system of the user system is running normally and that the backup/recovery server (17) has been periodically receiving each system's operating system image, application program images, and user data from the computer systems (11, 12, 13, 14, 15) and storing them on the data storage server (16). During this time, the failure detection module (23) on the user system monitors the operating state of the user system. If a failure occurs in the user system through user carelessness or an external shock, the failure detection module (23) detects it (step S1). A failure is detected when the current disk information differs from the disk information recorded beforehand. A warning message window is then displayed on the user's screen, and recovery begins once the user confirms the failure. If the user does not respond before the confirmation interval defined in the user's settings has elapsed, the condition is treated as a failure and recovery begins.

The failure detection module (23) also communicates with the backup/recovery server (17) over the network at regular intervals to report whether the user system is healthy (step S7). If the failure arose in an application program or in user data, the failure detection module (23) notifies the backup/recovery server (17) of the failure, the system's IP address, and information on the failed data. If, on the other hand, the periodic status report fails to reach the backup/recovery server (17) for a predetermined time, the operating system itself has failed, leaving the failure detection module unable to send its periodic reports. In that case the backup/recovery server (17) classifies the system failure as an "operating system failure" and reports it to the designated administrator console (18).

On the user system, when the failure detection module (23) detects an abnormality in the operating system, the control unit of the user system works with the failure detection module (23) to reboot the user system (step S3). On reboot, the automatic recovery operating system starts instead of the normal operating system (step S4). Once running, the automatic recovery operating system takes control of and starts the hardware installed in the user system (step S5) and starts the network driver (22) for network communication (step S6).
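The detection rules in this passage can be sketched as three small functions. This is a minimal illustration under stated assumptions, not the patent's implementation: representing "disk information" as a mapping from path to fingerprint, all function and parameter names, and the choice not to start recovery when the user explicitly declines are assumptions introduced here.

```python
from typing import Dict, Optional, Set


def changed_files(baseline: Dict[str, str], current: Dict[str, str]) -> Set[str]:
    """Paths whose fingerprint differs from the recorded baseline, including missing or new ones."""
    paths = set(baseline) | set(current)
    return {p for p in paths if baseline.get(p) != current.get(p)}


def should_start_recovery(confirmed: Optional[bool], waited_s: float,
                          confirm_timeout_s: float) -> bool:
    """Start on user confirmation, or when the confirmation window expires unanswered.

    confirmed is None while the user has not responded. Treating an explicit
    "no" (False) as "do not recover" is an assumption; the text does not cover it.
    """
    if confirmed:
        return True
    return confirmed is None and waited_s >= confirm_timeout_s


def classify_on_server(last_report_s: float, now_s: float, report_timeout_s: float,
                       reported_failure: bool) -> str:
    """Server-side view: a client that stops reporting is treated as an operating system failure."""
    if now_s - last_report_s > report_timeout_s:
        return "operating system failure"
    return "application/data failure" if reported_failure else "healthy"
```

Using the absence of a periodic report as the signal for an operating system failure lets the server classify the failure even when the client can no longer send anything.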

Whether the fault arose in the operating system or in something else such as an application program, once the failure has been reported to the backup/recovery server (17), the user system communicates with the backup/recovery server (17) and receives the extracted data and the recovery scenario on which the recovery is based.

That is, once preparation for automatic recovery is complete, the recovery progress module (41) of the backup/recovery server (17) searches the data storage server (16), where the backup data are kept, for backup data corresponding to the failed data on the user system, and extracts them (step S8). Based on the extracted backup data received from the recovery progress module (41), the recovery scenario creation module (42) creates a recovery scenario that schedules, among other things, the order in which the failed data are to be restored (step S9) and sends it to the scenario analysis module (31) of the affected system. The scenario analysis module (31) analyzes the received data for the recovery order, the data areas to be recovered, and the restoration of registry information, and requests recovery data in accordance with the failure recovery scenario (step S10). In response to the data request, the recovery data transmission module (43) transmits the extracted data to the recovery data receiving module (32) of the affected system in the order given by the recovery scenario, from the full image through to the partially changed images (step S11).

Having received the recovery data, the user system starts the recovery execution module (33) and performs the recovery (step S12). The recovery completion notification module (34) of the system monitors the recovery execution module (33) and, when recovery is complete, reports this to the recovery completion detection module (44) of the backup/recovery server (17) (step S13). The recovery completion detection module (44) sends a recovery-complete signal to the reboot instruction module (45), which then activates the reboot execution module (35) of the affected system to reboot it (step S14).
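Steps S9 to S12 can be illustrated with a short sketch of scenario creation and ordered restoration. The patent does not define a scenario format, so everything here is an assumption made for illustration: the `BackupItem` fields, the rule "latest full image first, then later partial images from oldest to newest", and the in-memory `store` that stands in for requests to the backup/recovery server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BackupItem:
    item_id: str
    taken_at: int            # backup time; a larger value is newer
    is_full: bool            # True for a full image, False for a partial (changed-data) image
    data: Dict[str, str]     # path -> content captured by this backup


def build_scenario(items: List[BackupItem]) -> List[str]:
    """Server side: order the restore as newest full image, then later partial images by age."""
    fulls = [i for i in items if i.is_full]
    if not fulls:
        raise ValueError("no full image available")
    base = max(fulls, key=lambda i: i.taken_at)
    partials = sorted(
        (i for i in items if not i.is_full and i.taken_at > base.taken_at),
        key=lambda i: i.taken_at,
    )
    return [base.item_id] + [i.item_id for i in partials]


def restore(scenario: List[str], store: Dict[str, BackupItem]) -> Dict[str, str]:
    """Agent side: fetch each item in scenario order and apply it over the previous state."""
    state: Dict[str, str] = {}
    for item_id in scenario:
        state.update(store[item_id].data)   # the lookup stands in for a request to the server
    return state
```

Applying partial images in time order over the full image means a later change always overwrites an earlier one, which is why the order fixed by the scenario matters.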

According to the present invention, recovery from damage to the operating system images of multiple systems on a network, or from damage to, loss of, or deletion of applications and user data in use, can proceed automatically without intervention by the console administrator. This considerably reduces IT management costs and also minimizes the time needed for recovery, so that system users can resume using their systems quickly.

Because recovery is performed automatically and does not depend on the administrator's level of technical skill, the invention has the further advantage, particularly when recovering systems that demand a high level of expertise, of avoiding recovery failures caused by insufficient skill and of cutting the time spent waiting for technical support staff to arrive.

Claims

A system for recovering computer systems on a network, the system comprising a plurality of user computer systems, a backup/recovery server that backs up operating system images, application programs, and user data of the plurality of user computer systems over a network at regular intervals, and a data storage server that receives and stores data from the backup/recovery server, characterized in that:

each of the plurality of user computer systems is provided with a failure detection module that detects the occurrence of a failure in the computer system and reports failure information to the backup/recovery server; an automatic recovery operating system that is booted on restart after an abnormality in the operating system of the user system is detected through the failure detection module; a network driver, built into the automatic recovery operating system, for communication with the backup/recovery server; and

a recovery performing agent, built into the automatic recovery operating system, for recovering the failure detected by the failure detection module; and

the backup/recovery server is provided with a program module that, after receiving a failure message from the failure detection module, extracts backup data from the data storage server and, based on the extracted data, provides a recovery scenario and recovery data to the recovery performing agent.

The system of claim 1, wherein the recovery performing agent of the user system comprises: a scenario analysis module that analyzes the recovery scenario received from the backup/recovery server; a recovery data receiving module that receives the recovery data extracted from the data storage server; a recovery execution module that performs recovery of the user system under the control of the recovery data receiving module; a notification module that reports completion of recovery; and a reboot execution module that reboots the user system after recovery is complete; and wherein the program module provided in the backup/recovery server comprises: a recovery progress module that, after being notified of a failure message by the failure detection module, searches the data storage server and extracts backup data; a scenario creation module that creates a recovery scenario for the data found by the recovery progress module; a recovery data transmission module that sequentially transmits the extracted data according to the data request order received from the scenario analysis module; a detection module that detects completion of the recovery operation on receiving a recovery completion notice from the recovery performing agent; and a completion notification and reboot instruction module that instructs the user system to reboot after receiving the recovery completion notice from the detection module.

In a system comprising a plurality of user computer systems, a backup/recovery server that backs up operating system images, application programs, and user data of the plurality of user computer systems over a network at regular intervals, and a data storage server that receives and stores data from the backup/recovery server, a method for recovering from a failure of a user computer system, the method comprising:

Detecting a failure of the user system in a failure detection module provided in the user system;

Determining whether the failure of the user system is an abnormality of the operating system;

If it is determined that the operating system is not abnormal, transmitting failure occurrence information to the backup / recovery server;

If it is determined that the operating system is abnormal, rebooting the user system, starting a recovery operating system provided on the hard disk of the user system, then starting the hardware of the user system and a network driver for communication with the backup/recovery server, and transmitting the failure information to the backup/recovery server;

The backup / recovery server, after receiving the failure information, examining the data storage server to extract backup data related to the failure data;

Creating a recovery scenario using the extracted data and transmitting it to a recovery performing agent provided in the user system;

Analyzing the recovery scenario in the recovery performing agent and requesting and receiving recovery data from the backup / recovery server according to the result;

The recovery performing agent recovering the user system based on the recovery scenario and the recovery data; and

Rebooting the user system after completion of the recovery of the user system.