KR20030006661A

KR20030006661A - Method and system for managing server failure

Info

Publication number: KR20030006661A
Application number: KR1020010042508A
Authority: KR
Inventors: 박동혁
Original assignee: 인터컴 소프트웨어(주)
Priority date: 2001-07-13
Filing date: 2001-07-13
Publication date: 2003-01-23
Also published as: KR100415830B1

Abstract

PURPOSE: A method and system for managing server failures are provided that automatically detect a failure state of a server, notify the server administrator of the failure, perform emergency recovery, and suggest how to deal with the failure. CONSTITUTION: A CMS (Connection Management System, 110) is installed in a user server, manages internal system resources, error events, and the like, checks for failures, and transmits the results. A CAS (Connection Accept Server, 210) receives the data transmitted from the CMS (110) and converts it into normal data. A DPS (Data Passing Server, 220) stores the data output from the CAS (210) in a database (230). An NDS (Network Diagnostic Server, 240) communicates with an NDC (Network Diagnostic Client) in the user server, checks for network failures, records any failure in the database (230), and outputs information on a preset emergency state. A DAS (Data Analyze Server, 250) processes and outputs various statistics for analyzing the data stored in the database (230). An ACS (Automatic error Calling Server) is connected to a public network or the Internet and transmits the emergency-state information output by the NDS (240) and the DAS (250) to a wired/wireless telephone or e-mail address of a preset server administrator.

Description

METHOD AND SYSTEM FOR MANAGING SERVER FAILURE

The present invention relates to user server management technology, and more particularly to a server management method and system for preventing server failures and responding quickly when a failure occurs.

In recent years, as computers have grown in capacity and speed, failures caused by system errors, viruses, and the like have become frequent. Large-capacity servers in particular can fail frequently because of many factors, such as the operation of various application programs and the storage, reading, and transmission of data. Companies therefore keep a dedicated server administrator on site to manage these servers and to handle failures when they occur.

Server management, however, requires specialized skills, and hiring such specialists is costly. Small companies in particular therefore tend not to hire a specialist as server administrator but instead appoint a suitable person from their existing staff. In that case the server is unlikely to be managed well, and responding properly to a server failure is almost impossible.

Even when a skilled server administrator has been hired, if the administrator is away from the server, for example on a business trip, it is difficult to notify the administrator promptly when the server fails, and therefore difficult to respond properly. Moreover, even when the administrator is notified of the failure, being at a remote location makes an immediate response difficult, which can eventually lead to heavy losses such as the server going down.

Accordingly, a first object of the present invention is to provide a server management method and system capable of automatically detecting a failure state of a server.

A second object of the present invention is to provide a server management method and system capable of efficiently notifying a server administrator when a server failure occurs.

A third object of the present invention is to provide a server management method and system capable of resolving a server failure by performing emergency recovery when the failure occurs.

A fourth object of the present invention is to provide a server management method and system capable of analyzing the state of a server, predicting future changes in that state, and suggesting countermeasures.

To achieve the above objects, in the present invention a system monitoring module in the user server monitors the server status and, when a failure occurs, analyzes its cause and classifies it as a warning, danger, or emergency situation. Warning-level and danger-level failures are automatically reported to the server administrator by e-mail, landline telephone, mobile phone, or the like. For emergency cases, emergency measures are carried out automatically, because leaving them unattended leads to serious situations such as the server going down. In addition, the components that represent the state of the user server are stored in a database and subjected to time-series analysis, which makes it possible to grasp the current state of the server system and provides a diagnostic consulting function that identifies matters to watch and measures to take in advance.

FIG. 1 is a schematic overall block diagram of a server management system according to an embodiment of the present invention.

FIG. 2 is an exemplary view of a user output screen for managing a storage medium such as a hard disk according to an embodiment of the present invention.

FIG. 3 is an exemplary view of a user output screen for CPU state management according to an embodiment of the present invention.

FIG. 4 is an exemplary view of a user output screen for memory usage management according to an embodiment of the present invention.

FIG. 5 is an exemplary view of a user output screen for managing login authentication attempts according to an embodiment of the present invention.

FIG. 6 is an exemplary view of a user output screen for swap usage management according to an embodiment of the present invention.

FIG. 7 is an exemplary view of a user output screen for process state management according to an embodiment of the present invention.

FIG. 8 is an exemplary view of a user output screen for managing server user information according to an embodiment of the present invention.

FIG. 9 is an exemplary view of a user output screen for managing host information according to an embodiment of the present invention.

FIG. 10 is an exemplary view of a user output screen for managing configuration files according to an embodiment of the present invention.

FIG. 11 is an exemplary view of a user output screen for network failure management according to an embodiment of the present invention.

FIG. 12 is an exemplary view of a user output screen for network management according to an embodiment of the present invention.

FIG. 13 is an exemplary view of a user output screen for managing important files according to an embodiment of the present invention.

FIG. 14 is an exemplary view of a user output screen showing the overall server system status according to an embodiment of the present invention.

FIG. 15 is an exemplary view of a user output screen for setting server status grades according to an embodiment of the present invention.

FIG. 16 is a schematic flowchart of automatic server emergency recovery control according to an embodiment of the present invention.

FIG. 17 is a schematic flowchart of a server diagnostic consulting operation according to an embodiment of the present invention.

FIG. 18 is an overall flowchart of a server management operation according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The following description includes specific details such as particular components. These are provided only to aid overall understanding of the invention, and it will be apparent to those of ordinary skill in the art that such details may be modified or changed within the scope of the invention.

FIG. 1 is a schematic overall block diagram of a server management system according to an embodiment of the present invention. Referring to FIG. 1, in the server management system the user server 100 is provided with a CMS (Connection Management System) 110 that manages internal system resources, error events, and the like, checks for abnormalities, and transmits the results to the call center 200. The CMS 110 and the call center 200 may be connected through the Internet, and the CMS 110 transmits data packets using the IP (Internet Protocol) address assigned to the call center 200 as the destination address. The transmitted data may be encrypted according to a preset encryption scheme.

The call center 200 comprises: a CAS (Connection Accept Server) 210 that decrypts the data transmitted from the CMS 110 and converts it into normal data; a DPS (Data Passing Server) 220 that enters the data output from the CAS 210 into a DB (Database) 230; an NDS (Network Diagnostic Server) 240 that communicates with an NDC (Network Diagnostic Client, not shown) provided in the user server 100 to check for network abnormalities, records the result in the DB 230, and outputs information on a preset emergency situation when one occurs; a DAS (Data Analyze Server) 250 that compiles statistics on the various data in order to analyze the data stored in the DB 230, presents the information in the DB 230 in real time, and outputs information on a preset emergency situation when one occurs; an ACS (Automatic error Calling Server) 260 that is connected to the public network and the Internet and transmits the emergency information output from the NDS 240 and the DAS 250 to the wired/wireless telephone and e-mail of a preset server administrator; and a CWS (CMS Web Server) 270 that is connected to the Internet and displays the contents of the DB 230 so that servers can be monitored over the web.

In addition, the ACS 260 can process and output a preset number that the server administrator enters from a mobile phone or landline telephone in order to operate the user server 100. The information output from the ACS 260 is provided to an RPS (Remote Process Server, not shown) in the call center 200, and the RPS, based on this information, communicates with an RPC (Remote Process Client) in the user server 100 to operate the user server 100 accordingly. The server administrator can thus operate the user server 100 to some extent even from a remote location.

With this configuration, the various server information transmitted from the CMS 110 of the user server 100 is stored in the DB 230 via the CAS 210 and the DPS 220 of the call center 200, and the various information on the network status is stored in the DB 230 by the NDS 240. The information stored in the DB 230 is analyzed by the DAS 250 and output on a monitor screen in real time. If analysis of the stored information indicates that a preset emergency situation has occurred, the server administrator is notified of it via the ACS 260.

The server management items in this operation include management of storage media such as hard disks, CPU process performance, memory usage, login authentication, swap (SWAP) usage, process status, server user information, host information, configuration files, the network, and important files. These management operations are described in detail below with reference to the accompanying drawings.

FIG. 2 is an exemplary view of a user output screen for managing a storage medium such as a hard disk drive (HDD). Referring to FIG. 2, storage medium management checks the ratio of used space to total space for each mounted file system, and also checks used space against total space overall. The total, used, and remaining capacity of the HDD or storage medium are checked so that no file system exceeds a given threshold, and usage relative to the period of use is read so that the wear rate and replacement time can be reported. The information on used space shown in FIG. 2 is transmitted from the CMS 110 of the user server 100 shown in FIG. 1 to the call center 200 and stored as a table in the DB 230 of the call center 200. The state of the storage medium is checked, and the information transmitted, at preset time intervals. The stored data is output to the user on a monitor screen by the CWS 270.
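The per-file-system check of used space against total space can be sketched as follows. This is only an illustration and not the patent's implementation: the function names usage_percent and check_mounts are hypothetical, and the standard-library call shutil.disk_usage stands in for whatever mechanism the CMS actually uses.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_mounts(mount_points):
    """Return {mount_point: percent_used} for each given mount point.

    The list of mount points is supplied by the caller (an assumption);
    the patent does not say how mounted file systems are enumerated.
    """
    report = {}
    for mount in mount_points:
        du = shutil.disk_usage(mount)  # named tuple: total, used, free
        report[mount] = usage_percent(du.total, du.used)
    return report
```

A caller could run check_mounts(["/", "/var"]) at the preset interval and forward the resulting dictionary to the call center.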

FIG. 3 is an exemplary view of a user output screen for CPU state management. Referring to FIG. 3, to check the CPU process status, the CPU type is first identified and the CPU usage is extracted. The CPU clock speed, user CPU usage, system CPU usage, I/O wait usage, CPU idle time, and the like are checked. In the example of FIG. 3, the user process ratio is 1%, the system process ratio is 3%, and the I/O wait ratio is 0%, so the CPU idle ratio is 96%. This CPU status information is likewise transmitted from the CMS 110 of the user server 100 shown in FIG. 1 to the call center 200 and stored as a table in the DB 230 of the call center 200.

In this way, CPU usage is checked separately for user, system, and I/O wait, and the reset process amount is also checked. This function can serve as a basis for judging scalability (for example, installing additional CPUs in parallel or in series) and current availability, and as useful data on process occupancy across the whole system. It also helps prevent system slowdowns and unexplained downtime.
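One common way to obtain such a user/system/I/O wait/idle breakdown is to take two samples of cumulative CPU time counters and convert the differences into percentages. The sketch below assumes that approach. The function name cpu_breakdown and the dictionary layout are hypothetical, and the text does not state how the CMS derives its figures.

```python
CATEGORIES = ("user", "system", "iowait", "idle")


def cpu_breakdown(prev, curr):
    """Percentage of CPU time per category between two counter samples.

    prev and curr map each category to a cumulative tick count
    (assumed input format, for illustration only).
    """
    deltas = {k: curr[k] - prev[k] for k in CATEGORIES}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU ticks elapsed between samples")
    return {k: round(100.0 * v / total) for k, v in deltas.items()}
```

With deltas of 1, 3, 0, and 96 ticks this yields the 1% / 3% / 0% / 96% split used as the example for FIG. 3.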

FIG. 4 is an exemplary view of a user output screen for memory usage management. Referring to FIG. 4, memory usage is checked by checking the total memory size and the memory size currently in use.

Memory usage management performs all checks on the memory resources inside the system. It makes it possible to deal effectively with swap disk usage caused by memory shortage, to choose a memory type suited to the usage pattern, and to arrive at a complete system configuration when memory is later added or changed. Because increases and decreases in memory can be detected, the data can also serve as evidence of hardware changes.

FIG. 5 is an exemplary view of a user output screen for managing login authentication attempts. Referring to FIG. 5, the login authentication attempt check records the ID, password, attempt time, and IP address of the person attempting to log in, the terminal node device type, and the like, so that the record can be used as evidence for tracing that person. It thus provides a factual basis for deciding whether a security system needs to be installed, and it can serve as a simple form of intrusion detection.
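The "simple intrusion detection" role can be illustrated by counting failed attempts per source IP address. This is a minimal sketch under stated assumptions: the record fields ip and success, the threshold max_failures, and the function name suspicious_sources are all hypothetical, since the text lists the recorded fields but not any detection rule.

```python
from collections import Counter


def suspicious_sources(attempts, max_failures=5):
    """Return the IP addresses with at least max_failures failed logins.

    attempts is an iterable of dicts with the assumed keys
    "ip" (str) and "success" (bool).
    """
    failures = Counter(a["ip"] for a in attempts if not a["success"])
    return sorted(ip for ip, count in failures.items() if count >= max_failures)
```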

FIG. 6 is an exemplary view of a user output screen for swap usage management. Referring to FIG. 6, the swap usage check records and manages the total size of the swap device and the size in use. Statistical analysis of the usage is used to diagnose how efficiently the swap device is being used, which shows when more swap space will need to be secured.

This information can be used as a basis for deciding whether to add memory, and it makes it possible to resolve processing delays caused by swap usage.
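As one possible reading of "statistical analysis of the usage", a least-squares line fitted to past usage samples can be extrapolated to estimate when the swap device would fill up. The patent does not specify the statistical method, so the linear trend below is an assumption, as are the function name days_until_full and the (day, used) sample format.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    samples is a list of (day, used) pairs. A least-squares line is fitted
    (an assumed model). Returns None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # flat or shrinking usage: no projected fill date
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])
```

The same extrapolation could be applied to disk or memory usage stored in the DB 230.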

FIG. 7 is an exemplary view of a user output screen for process state management according to an embodiment of the present invention. Referring to FIG. 7, the process status check records and manages the name of each running process, its process ID, user ID, CPU share, total memory size, memory used, current state, CPU time used, and the like, so that resources can be managed efficiently for each process daemon.

This function provides useful data for deciding whether to introduce a distributed system, and the information is also useful for determining whether the system is overloaded during software testing.

FIG. 8 is an exemplary view of a user output screen for managing server user information. Referring to FIG. 8, the server user information check records and manages all registered server users, the number of users currently connected, and the like, so that the number of concurrent users over a given period can be determined.

This information is useful for judging user load relative to system performance.

FIG. 9 is an exemplary view of a user output screen for managing host information. Referring to FIG. 9, the host information check records and manages information unique to the host (IP, OS, Domain, Name, and the like), and so provides data useful for detecting changes.

FIG. 10 is an exemplary view of a user output screen for managing configuration (config) files. Referring to FIG. 10, the configuration file check stores information about each change to a major configuration file, so that when a failure such as a system crash occurs later, the record can help identify its cause.

FIG. 11 is an exemplary view of a user output screen for network failure management. Referring to FIG. 11, network failure management is a function performed by the NDS 240 shown in FIG. 1. At preset intervals (about 30 seconds), it checks and records whether service has been interrupted because the customer's server system is down or the network is disconnected, and it sends a notification when a failure occurs, so that the server administrator can learn the system status even while traveling or off site.

FIG. 12 is an exemplary view of a user output screen for network management. Referring to FIG. 12, the network check records the network adapter and IP address of the user server 100, the gateway address, flags, "Refs", usage, device name, and the like, to make network management easier. Because usage can be seen individually for each IP address, this data is useful for identifying the cause of a network outage.

FIG. 13 is an exemplary view of a user output screen for managing important files. Referring to FIG. 13, the important file check detects whether the permissions or contents of important files have been changed arbitrarily, which is convenient for administrators who must keep watch over important configuration files.
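Detecting a change in a file's permissions or contents can be sketched by comparing snapshots taken at successive checks. The text does not say how the comparison is made. The SHA-256 digest, the snapshot layout, and the function names snapshot and changes below are assumptions chosen for illustration.

```python
import hashlib
import os
import stat


def snapshot(path):
    """Record the permission bits and a content digest of one file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return {"mode": mode, "sha256": digest}


def changes(old, new):
    """Return which attributes differ between two snapshots of a file."""
    return [key for key in ("mode", "sha256") if old[key] != new[key]]
```

A result of ["sha256"] indicates a content change, ["mode"] a permission change, and an empty list no change.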

Each of the server management items described with reference to FIGS. 2 to 13 can be viewed individually by the administrator, and they can also be displayed together so that normal and failure states are easy to check at a glance.

FIG. 14 is an exemplary view of a user output screen showing the overall server system status. Referring to FIG. 14, the screen includes a system status monitoring area 141 and a network status monitoring area 142. The system status monitoring area 141 has items for check time, disk status, CPU status, memory status, swap status, important file permission change status, authentication attempt status, reboot status, and the like, and the network status monitoring area 142 may have items for check time, network status, and the like. The status of each item can be graded, for example, as normal, warning, failure, or emergency, and each grade is shown with a marker of a suitable color so that it can be recognized easily.

The grades (normal, warning, failure, emergency) can be configured by the administrator. Referring to FIG. 15, the user can set the grade thresholds for each server management item as desired. For example, for the hard disk management item, the disk threshold setting item 151 can be set to 90% for a warning, 95% for a failure, and 98% for an emergency.
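The threshold-based grading can be expressed as a small function. The default values are the 90% / 95% / 98% disk example given for FIG. 15. The function name grade and the rule that a value equal to a threshold already counts as that grade are assumptions.

```python
def grade(percent_used, warning=90.0, failure=95.0, emergency=98.0):
    """Map a usage percentage to normal / warning / failure / emergency."""
    if not warning <= failure <= emergency:
        raise ValueError("thresholds must be in ascending order")
    if percent_used >= emergency:
        return "emergency"
    if percent_used >= failure:
        return "failure"
    if percent_used >= warning:
        return "warning"
    return "normal"
```

Because the thresholds are parameters, each management item can carry its own administrator-defined values.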

When an item reaches the warning, failure, or emergency grade under these settings, the condition is extracted as text and sent to the administrator's e-mail address, and the text is also converted to speech and delivered by a call to the administrator's wired or wireless telephone. The notification is repeated until it reaches the administrator, so that delivery is assured. The notification history can also be kept and stored as data on error occurrences for future use.
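The repeat-until-delivered behavior, together with the notification history, can be sketched as a retry loop. Everything here is illustrative: the channel callables, the function name notify_until_acknowledged, and the max_attempts bound (added so the sketch always terminates) are assumptions not found in the text.

```python
def notify_until_acknowledged(message, channels, max_attempts=5):
    """Try each channel in turn, repeating until one confirms delivery.

    channels is a list of (name, send) pairs, where send(message) returns
    True once the administrator has been reached (assumed interface).
    Returns the history as a list of (attempt, channel_name, delivered).
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        for name, send in channels:
            delivered = bool(send(message))
            history.append((attempt, name, delivered))
            if delivered:
                return history
    return history
```

The returned history corresponds to the stored notification record mentioned above.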

With this configuration and operation, the server management system according to the present invention has various functions that distinguish it from conventional systems. First, it can distinguish a network outage from a system crash. In a conventional server management system, when the server system performing the management function went down, it was impossible to tell whether the network or the system was down. In the server management system of the present invention, by contrast, the CMS 110 and the NDS 240 are provided separately, in the user server 100 and the call center 200 respectively, so the two cases can be distinguished.

The notification content is the same whether the network or the system is down. However, the CMS 110 itself keeps running while the network is down and stores its data on its own server, so the data can be checked once the network is reconnected. If the user server 100 system goes down, the CMS 110 module goes down with it, so no stored data will be found for that period, and this makes it possible to determine that the system was down.
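The distinction described above can be sketched as a check performed after reconnection: if the agent recorded any local samples during the outage window, the system was running and only the network was down. The function name classify_outage, the timestamp format, and the "any sample in the window" rule are assumptions made for illustration.

```python
def classify_outage(outage_start, outage_end, local_sample_times):
    """Classify an outage using the agent's locally stored sample times.

    All times are numeric timestamps (assumed format). Samples recorded
    inside the window mean the agent kept running, so only the network
    was down; no samples mean the system itself was down.
    """
    if outage_end < outage_start:
        raise ValueError("outage_end precedes outage_start")
    recorded = [t for t in local_sample_times if outage_start <= t <= outage_end]
    return "network_down" if recorded else "system_down"
```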

In a conventional server management system, failure notification also becomes impossible once the server management system itself goes down. In the server management system of the present invention, by contrast, only the CMS 110 is installed and run on the user server 100, while the receiving call center servers (CAS, NDS, DPS, DB, ACS, CWS, DAS) are located remotely, so failures can be reported regardless of whether the CMS 110 module is down. This function is made more accurate by the NDS 240, which continually monitors the status of the system's network.

The ACS 260, which handles voice notification by telephone, is equipped with a TTS (Text to Speech) reader product and reads out the error condition by voice. Because the text generated when an error occurs is converted directly to speech by the TTS function, the content of any text message can be delivered by voice.

In addition, the server management system of the present invention not only provides its service through multiple servers, but also allows the CMS 110 module itself to be configured with several IP addresses of call centers to which it should switch in round-robin fashion when the service system goes down, so that transmission and reception can be moved to a normally operating service system in real time. Accordingly, when a particular call center 200 goes down, the CMS 110 connected to it, upon confirming that the call center 200 is down, transmits the management information of the user server to the preset round-robin call center, so that the latter call center can manage the user server.
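The round-robin switching described above might look like the following sketch. It assumes a caller-supplied `send` function that returns `True` on success and raises `OSError` when a call center is unreachable; that function, the parameter names, and the example addresses are hypothetical and do not come from the patent.

```python
from typing import Callable, Sequence


def send_with_failover(endpoints: Sequence[str],
                       payload: bytes,
                       send: Callable[[str, bytes], bool],
                       start: int = 0) -> int:
    """Try each configured call center in round-robin order, beginning at `start`.

    Returns the index of the endpoint that accepted the payload, so the caller
    can keep using that call center for later transmissions.
    """
    count = len(endpoints)
    for offset in range(count):
        index = (start + offset) % count
        try:
            if send(endpoints[index], payload):
                return index
        except OSError:
            # This call center is treated as down; move on to the next one.
            pass
    raise ConnectionError("no call center reachable")
```

A CMS built this way would remember the returned index and pass it as `start` on the next transmission, so that it stays with the call center that last answered.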

In addition, when the CMS 110 module transmits the management information of the user server to the CAS 210, the CMS 110 module itself compares the previously transmitted data with the data about to be transmitted, packs only the parts that have changed, and transmits them to the CAS 210, thereby reducing the amount of data transmitted. Accordingly, even when the CMSs 110 of several user servers send data to the CAS 210 at the same time, the CAS 210 can properly process all of the data.
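The change-only transmission described above can be illustrated with a small sketch. It assumes the management information is a flat mapping of item names to values; the patent does not specify the data layout or the packing format, so `pack_changes`, `apply_changes`, and the `changed`/`removed` keys are hypothetical.

```python
_MISSING = object()  # sentinel for keys absent from the previous snapshot


def pack_changes(previous: dict, current: dict) -> dict:
    """Return only the entries that differ from the previously sent snapshot."""
    changed = {key: value for key, value in current.items()
               if previous.get(key, _MISSING) != value}
    removed = [key for key in previous if key not in current]
    return {"changed": changed, "removed": removed}


def apply_changes(previous: dict, delta: dict) -> dict:
    """Rebuild the current snapshot on the receiving side from a packed delta."""
    merged = dict(previous)
    merged.update(delta["changed"])
    for key in delta["removed"]:
        merged.pop(key, None)
    return merged
```

In this sketch the sender keeps its last transmitted snapshot, and the receiver keeps the last reconstructed one; unchanged items never cross the network.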

In this case, data communication between the CMS 110 and the CAS 210 uses a specific port designated by the service provider and transmits data according to a preset encryption protocol, so as to prevent hacking of the user server and leakage of information.

In addition, when an administrator accesses the call center 200 via the Internet to monitor the status of the user server 100, the CWS 270 refreshes the screen at intervals of, for example, 10 seconds to support near-real-time presentation of the information. In this case, the screen monitored on the web shows only what has been entered in the DB 230, so monitoring generates no network traffic between the CMS 110 and the CAS 210.

Meanwhile, as seen in FIG. 15, in the server management of the present invention the emergency situation, which is the most severe server failure level for each management item, is set at about 98 to 99%. As this setting shows, the present invention does not wait until a failure has occurred and the system has gone down before notifying the user and taking other measures. Instead, it notifies the administrator before the system goes down so that the administrator can take action, thereby preventing system downtime and the like.

However, even if the server administrator is notified that such a server failure has reached the emergency state, it may be difficult to take immediate action for various reasons, such as the administrator being at a remote location, or difficult to take effective action because the administrator lacks the necessary technical expertise. In such cases, an automatic emergency recovery operation is performed according to a feature of the present invention to prevent serious situations such as the user server eventually going down. This is described in more detail with reference to the accompanying drawings.

FIG. 16 is a schematic flowchart of automatic recovery control for a server emergency according to an embodiment of the present invention. Referring to FIG. 16, first, in step 161, the CMS 110 on the user server 100 side extracts various management information of the server and transmits it to the call center 200. In step 162, the call center 200 analyzes the transmitted information and determines whether a failure has occurred by checking whether it corresponds to a failure level set as shown in FIG. 15. When a failure has occurred, an emergency recovery procedure preset for that server failure is derived in step 163 and then delivered to the user server 100 in step 164. Accordingly, in step 165, the user server 100 performs the emergency recovery operation. The call center 200 then notifies the server administrator of the result of the emergency recovery in step 166.

The items covered by the automatic emergency recovery operation can include all of the server failure management items shown in FIGS. 2 to 13, such as disk, CPU, memory, swap, network, daemon, port, and changes to important files. As an example, consider the case where the management item is the hard disk shown in FIG. 2 and, as shown in FIG. 15, a usage of 98% is set as the emergency situation for the hard disk. When step 162 confirms that hard disk usage has reached 98%, a preset emergency recovery procedure is derived in step 163. This procedure may be set to delete files on the hard disk that have been designated in advance as not needing to be kept (for example, "Temporary File"). Accordingly, in step 165, the user server 100 deletes the designated files to free disk space. Such an emergency recovery operation may be no more than a stopgap that temporarily relieves the disk emergency. However, performing it can to some extent prevent the system from going down and gives the server administrator time to take appropriate follow-up action.
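The hard disk example above can be sketched in two parts: the call center side derives a preset recovery action once the threshold is reached, and the user server side deletes the designated files. This is only a sketch under stated assumptions. The 98% threshold comes from the description, but the table names, the action name, and the choice of the `.tmp` suffix to identify deletable files are hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence

# Emergency threshold per management item (98% for the disk, per the description).
EMERGENCY_THRESHOLD = {"disk": 98.0}
# Hypothetical mapping from management item to a preset recovery action name.
RECOVERY_PLAN = {"disk": "delete_temporary_files"}


def derive_recovery(item: str, usage_percent: float) -> Optional[str]:
    """Call center side: return the preset action if the emergency level is reached."""
    threshold = EMERGENCY_THRESHOLD.get(item)
    if threshold is None or usage_percent < threshold:
        return None
    return RECOVERY_PLAN.get(item)


def delete_temporary_files(directory: Path,
                           suffixes: Sequence[str] = (".tmp",)) -> int:
    """User server side: delete designated files and return the bytes freed."""
    freed = 0
    for path in list(directory.iterdir()):  # snapshot before deleting entries
        if path.is_file() and path.suffix in suffixes:
            freed += path.stat().st_size
            path.unlink()
    return freed
```

Keeping the decision (`derive_recovery`) separate from the action (`delete_temporary_files`) mirrors the split between the call center and the user server in FIG. 16.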

Meanwhile, the present invention can analyze information about the server's state that has been accumulated and stored over a certain period, predict future changes in the server state, and suggest countermeasures.

FIG. 17 is a schematic flowchart of a server diagnostic consulting operation according to an embodiment of the present invention. Referring to FIG. 17, first, in step 171, the call center 200 periodically extracts information on the state of the user server 100 through the CMS 110 of the user server 100, and in step 172 stores the extracted information in the DB 230. In step 173, the stored data for a preset period is statistically processed, and in step 174 a time series analysis is performed using the resulting statistical data. In step 175, the state of the user server 100 up to the present is determined from the result of the time series analysis, and information from which the future server state can be predicted is extracted using a data mining technique. In step 176, diagnostic consulting on the server state is performed using a propensity analysis algorithm based on the various pieces of extracted information.

The state information items extracted from the user server 100 for this server diagnostic consulting operation may be the same as the items checked for emergency failure recovery. Taking the hard disk as an example of such an item, suppose that periodic extraction yields hard disk usage of 40%, 50%, 60%, and so on. A time series analysis of these data makes it possible to predict that, after a certain period, hard disk usage will reach the configured failure level. Based on this prediction, appropriately configured consulting information can be sent to the administrator, for example a notice that hard disk capacity is running short together with a suggestion to expand the capacity or delete unnecessary files.
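The description does not say which time series method is used, so the following sketch substitutes an ordinary least-squares straight line fitted to evenly spaced samples and extrapolates it to the failure threshold. The function name, the default threshold of 98%, and the assumption of equal sampling intervals are all illustrative choices.

```python
from typing import Optional, Sequence


def periods_until_threshold(samples: Sequence[float],
                            threshold: float = 98.0) -> Optional[float]:
    """Estimate how many further sampling periods remain before `threshold`.

    Fits y = intercept + slope * x by least squares, with x = 0, 1, 2, ...
    Returns None when usage is flat or falling (no crossing is predicted).
    """
    count = len(samples)
    if count < 2:
        raise ValueError("at least two samples are required")
    mean_x = (count - 1) / 2
    mean_y = sum(samples) / count
    sxx = sum((i - mean_x) ** 2 for i in range(count))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # x at which the line hits threshold
    return max(0.0, crossing - (count - 1))     # measured from the latest sample
```

For the usage series 40%, 50%, 60% from the example above, the fitted line rises 10 points per period, so the sketch predicts the 98% level roughly 3.8 sampling periods after the last measurement.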

FIG. 18 is an overall flowchart of the server management operation according to an embodiment of the present invention. Referring to FIG. 18, first, in step 18a, the CMS 110 on the user server 100 extracts data on the state of the server. In step 18b, the CMS 110 transmits the extracted state data to the call center 200. In step 18c, the call center 200 analyzes the state data for failures according to the settings shown in FIG. 15. If the analysis determines in step 18d that a failure has occurred, the process proceeds to step 18e; if no failure has occurred, the process proceeds to step 18j, where the state data is stored in the DB 230.

In step 18e, which is reached when step 18d determines that a failure has occurred, it is determined whether the automatic emergency recovery operation is set to be performed. If it is set not to be performed, the process proceeds to step 18i; if it is set to be performed, the process proceeds to step 18f. In step 18f, the details of the failure are analyzed, and in step 18g a preset emergency recovery procedure is derived. In step 18h, emergency recovery is carried out according to that procedure. In step 18i, the server administrator is notified of the result, and the process then proceeds to step 18j, where information on all of these events is stored.

In step 18k, it is determined whether enough data has been stored in the DB 230 to permit time series analysis. If not, the process returns to step 18a and repeats the above steps. If enough data is available, the process proceeds to step 18l, where statistical data is extracted from the DB 230. In step 18m, the time series analysis is performed, and in step 18n future changes are predicted. In step 18o, the server diagnostic consulting operation is performed according to the predicted future changes.
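The branching of FIG. 18 can be traced with a small sketch that returns the sequence of steps visited in one pass. It reduces the state data to a single usage figure and the "enough data" test to a sample count; both simplifications, along with the function and parameter names, are assumptions made for illustration.

```python
from typing import List


def run_cycle(usage: float, threshold: float, auto_recovery: bool,
              history: List[float], min_samples: int) -> List[str]:
    """Trace one pass through the FIG. 18 flow and return the steps visited."""
    steps = ["18a", "18b", "18c", "18d"]        # extract, transmit, analyze, decide
    if usage >= threshold:                      # failure detected in step 18d
        steps.append("18e")                     # is automatic recovery enabled?
        if auto_recovery:
            steps += ["18f", "18g", "18h"]      # analyze, derive plan, recover
        steps.append("18i")                     # notify the server administrator
    history.append(usage)
    steps.append("18j")                         # store in the DB
    steps.append("18k")                         # enough data for time series?
    if len(history) >= min_samples:
        steps += ["18l", "18m", "18n", "18o"]   # statistics, analysis, forecast, consulting
    return steps
```

When step 18k finds too little data, the function simply returns and the caller starts the next pass at step 18a, which corresponds to the loop back described above.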

In this way, automatic recovery from user server failures and the user server consulting operation can be performed by the process shown in FIG. 18.

Although specific embodiments have been described above, various modifications may be made without departing from the scope of the present invention. The scope of the present invention should therefore be determined not by the described embodiments but by the claims and their equivalents.

As described above, the present invention automatically detects failure situations on a user server and efficiently notifies the server administrator when a server failure occurs. It can also resolve a server failure by performing emergency recovery when the failure occurs, and it can further perform a consulting operation that periodically analyzes the server state, predicts future changes in that state, and suggests countermeasures.

Claims

A server failure management system comprising:

A CMS (Connection Management System) installed in a user server to manage internal system resources and error events, and to check for and transmit any abnormality; and

A call center comprising: a CAS (Connection Accept Server) for receiving the data transmitted from the CMS and converting it into normal data; a DPS (Data Passing Server) for storing the data output from the CAS in a database; an NDS (Network Diagnostic Server) which communicates with an NDC (Network Diagnostic Client) of the user server, checks the network for abnormalities, enters any abnormality in the database, and outputs information on a preset emergency situation; a DAS (Data Analyze Server) which processes and outputs various data for analyzing the data stored in the database; and an ACS (Automatic error Calling Server) which is connected to a public network or the Internet and has a function of transmitting, by wired/wireless telephone and e-mail, the information on the occurrence of a preset emergency situation output from the NDS and the DAS.

The server failure management system of claim 1, wherein the call center further includes a CWS (CMS Web Server) having a function of displaying the contents of the database so that the server administrator can monitor the server on the web when connected via the Internet.

A server failure management method comprising:

Extracting a plurality of types of management information of the server at the user server and transmitting the extracted information;

Analyzing the extracted information at a call center to determine whether it corresponds to a preset emergency failure situation;

In the case of the emergency failure situation, deriving an emergency recovery processing method preset for each failure and transmitting it to the user server; and

Performing an operation according to the emergency recovery processing method at the user server.

The method of claim 3, wherein the items of the emergency recovery operation for the emergency failure situation include at least one of storage medium management, CPU process performance management, memory usage management, login authentication management, swap usage management, process state management, server user information management, host information management, configuration file management, network management, and main file management.

The method of claim 3, wherein, when the emergency failure situation is a hard disk failure, the emergency recovery processing method is to delete files stored on the hard disk that have been preset as not needing to be kept.

A server failure management method comprising:

Analyzing and storing the extracted information in a call center;

Statistically processing the stored information for a preset period of time;

Performing time series analysis using the statistically processed information;

Determining the state of the user server as a result of the time series analysis and extracting information for predicting the future of the server state using a data mining technique;

And performing diagnostic consulting on the server state using a propensity analysis algorithm.

A server failure management method comprising:

Extracting data about the state of the server from the user server and transmitting it to the call center;

Analyzing the state data at the call center to determine whether a failure has occurred;

Storing information on the analysis result;

If the analysis determines that a failure has occurred, determining whether the automatic emergency recovery operation for the failure is set to be performed;

If it is set to perform the automatic emergency recovery operation, deriving a predetermined emergency recovery method according to the failure and performing an operation according to the emergency recovery method;

Notifying the server administrator of the result of the operation and storing information about the result of the processing;

Determining whether enough data has been stored in the database to permit time series analysis;

If enough data has been secured, extracting the data and performing a time series analysis operation;

Predicting future changes according to the time series analysis;

And performing a preset server diagnosis consulting operation according to the predicted future change.