KR20130123007A

KR20130123007A - Method for controlling trouble and server thereof

Info

Publication number: KR20130123007A
Application number: KR1020120046080A
Authority: KR
Inventors: 이아영; 남한산; 이현림
Original assignee: (주)네오위즈게임즈
Priority date: 2012-05-02
Filing date: 2012-05-02
Publication date: 2013-11-12

Abstract

The present invention relates to technology for issuing a trouble ticket and managing server or service trouble based on the issued ticket. A server manages trouble as follows. The server: generates virtual role groups and stores role-specific processing staff information in which each virtual role group is matched to at least one piece of processing staff information; obtains trouble occurrence information from a monitored subject and selects, from among the virtual role groups, a virtual role group to process it; identifies the processing staff information matched to the selected virtual role group and generates trouble ticket information that includes part or all of the trouble occurrence information; issues the trouble ticket using the trouble ticket information and the identified processing staff information; and completes the trouble ticket processing. According to the present invention, trouble occurrence information is conveyed accurately to the person in charge, and a trouble ticket is delivered correctly even when the person in charge changes. In addition, the trouble recovery time needed from a managerial point of view can be calculated accurately. [Reference numerals] (110) Server; (120) Terminal; (130) Network

Description

METHOD FOR CONTROLLING TROUBLE AND SERVER THEREOF

The present invention relates to a technique for issuing trouble tickets and managing system or service failures on the basis of those tickets.

When a failure occurs, its details are confirmed and the failure manager issues a trouble ticket. The person assigned to the failure handles it and then closes or terminates the trouble ticket, which completes the basic cycle of failure management. Failure management methods built on this basic cycle are in commercial use across many fields. However, the methods currently in use have several problems, as follows.

First, although a trouble ticket is issued, it is hard to confirm that it actually reaches the assigned person, so the manager who issues the ticket often has to relay the message individually, for example by landline telephone. This makes the work error-prone and wastes human resources.

Another problem is that when the person in charge changes because of resignation, a job change, a transfer, or the like, the trouble tickets assigned to that person are not handed over accurately to the successor. Current failure management methods link the handler directly to the trouble ticket number generated when the ticket is issued, so when the person in charge changes, every trouble ticket linked to that person must be searched out and reassigned to the new person one by one.

Yet another problem is that these methods include no accurate way of calculating the failure recovery time after a failure has been repaired. Failure handling ends once the broken part is fixed, and no method or device is included for calculating an indicator of whether the actual service is back on its normal track, so only a low level of management service is offered from a business and management standpoint.

Against this background, the present invention has three objects. The first is to deliver failure-related alarm messages through the various channels available for reaching the person in charge, such as SMS, SNS, messenger, and e-mail, so that the relevant information reaches that person accurately once failure occurrence information is obtained. The second is to let trouble ticket information linked to a person in charge carry over automatically to a new person when the person changes. To that end, a role group is formed as an intermediary between failure occurrence information and staff information: failure occurrence information is linked to the role group responsible for a particular role, and actual staff information is in turn mapped to that role group. The third is to calculate the failure recovery time accurately: even after failure handling has been closed, the managed service is monitored to see whether it is being used normally, and the failure recovery time is taken to be the time until service usage returns to its normal state, rather than the time taken to resolve the system or hardware problem.

To achieve the above objects, in one aspect, the present invention provides a method by which a server manages failures, the method comprising: storing in a database role-specific processing staff information, in which one or more pieces of processing staff information are mapped to each virtual role group in a virtual role group list; obtaining failure occurrence information from a failure monitoring device; and selecting, from the virtual role group list, a virtual role group to process the failure occurrence information, and issuing a trouble ticket that includes some or all of the failure occurrence information in association with the selected virtual role group.

In another aspect, the present invention provides a server for managing failures, the server comprising: a role-specific staff information storage unit that stores in a database role-specific processing staff information, in which one or more pieces of processing staff information are mapped to each virtual role group in a virtual role group list; a failure occurrence information acquisition unit that obtains failure occurrence information from a failure monitoring device; and a trouble ticket issuing unit that selects, from the virtual role group list, a virtual role group to process the failure occurrence information and issues a trouble ticket that includes some or all of the failure occurrence information.

In another aspect, the present invention provides a method by which a server manages failures, the method comprising: obtaining failure occurrence information from a failure monitoring device; issuing a trouble ticket that includes some or all of the failure occurrence information; closing the trouble ticket according to a predetermined method; after the closing, obtaining service operation information for an application service provided by the monitoring target, or for an application service provided partly by means of the monitoring target, and comparing it with pre-calculated operation information expected upon service recovery to determine whether the service has recovered; and, when the service is determined to have recovered, determining the time from the closing to the point at which the service was determined to have recovered as the failure recovery time.

In another aspect, the present invention provides a server for managing failures, the server comprising: a failure occurrence information acquisition unit that obtains failure occurrence information from a failure monitoring device; a trouble ticket issuing unit that issues a trouble ticket that includes some or all of the failure occurrence information; a trouble ticket closing unit that closes the trouble ticket according to a predetermined method; a service recovery determination unit that, after the closing, obtains service operation information for an application service provided by the monitoring target, or for an application service provided partly by means of the monitoring target, and compares it with pre-calculated operation information expected upon service recovery to determine whether the service has recovered; and a failure recovery time calculation unit that, when the service is determined to have recovered, determines the time from the closing to the point at which the service was determined to have recovered as the failure recovery time.
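The recovery-time determination described in the two preceding aspects can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Sample` shape, the `expected` callback, and the `tolerance` ratio used as the comparison criterion are all assumptions introduced here, since the text only says that observed service operation information is compared with pre-calculated expected operation information.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """One service operation measurement (hypothetical shape)."""
    at: datetime
    value: float  # e.g. concurrent users; the metric itself is an assumption


def recovery_time(
    closed_at: datetime,
    samples: Iterable[Sample],
    expected: Callable[[datetime], float],
    tolerance: float = 0.9,  # assumed criterion: within 90% of the expected value
) -> Optional[timedelta]:
    """Return the time from ticket closing until the service looks recovered.

    Only samples taken at or after the closing time are considered. The
    service counts as recovered at the first sample whose value reaches
    `tolerance` times the pre-calculated expected value for that moment.
    Returns None if no such sample exists yet.
    """
    for sample in sorted(samples, key=lambda s: s.at):
        if sample.at < closed_at:
            continue  # measurements before closing are ignored
        if sample.value >= tolerance * expected(sample.at):
            return sample.at - closed_at
    return None
```

For example, with a ticket closed at 10:00 and a flat expected value of 100, usage readings of 40, 70, and 95 at 10:00, 10:10, and 10:20 would give a recovery time of 20 minutes under the assumed 90% criterion, even though the repair itself was finished at 10:00.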

As described above, according to the present invention, failure occurrence information can be delivered accurately to the person in charge, trouble tickets are carried over accurately even when the person in charge changes, and the failure recovery time needed from a management point of view can be calculated accurately.

FIG. 1 is a network configuration diagram including a server and a terminal according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a failure management system according to an embodiment of the present invention, including peripheral devices that are connected to the server and the terminal over a network and manage a monitoring object.
FIG. 3 is an internal block diagram of a server according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram showing, as a table, role-specific processing staff information in which processing staff information is mapped to virtual role groups.
FIG. 5 is an internal block diagram of a monitoring object.
FIG. 6 is a flowchart of a method for managing failures according to an embodiment of the present invention.
FIG. 7 is an internal block diagram of a server according to another embodiment of the present invention.
FIG. 8 is a graphic diagram displaying, on one screen, a service operation status pattern and the service operation pattern expected upon service recovery.
FIG. 9 is a flowchart of a method for managing failures according to another embodiment of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they would obscure the gist of the invention.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that a further component may also be "connected", "coupled", or "joined" between them.

FIG. 1 is a network configuration diagram including a server 110 and a terminal 120 according to an embodiment of the present invention.

The server 110 and the terminal 120 are connected through a network 130. The failure management service provided by the server 110 may be displayed on the screen of the terminal 120, and an administrator may generate specific inputs to the server 110 by operating the terminal 120. Some characteristics that may apply to the server 110 and the terminal 120 are described below.

The terminal 120 may be an ordinary PC such as a desktop or notebook computer, or a mobile terminal such as a smartphone, a tablet PC, a PDA (Personal Digital Assistant), or a mobile communication terminal. It is not limited to these and should be interpreted broadly as any electronic device capable of communicating with the server 110.

In terms of hardware, the server 110 may have the same configuration as a conventional web server, web application server, or WAP server. In terms of software, however, as described in detail below with reference to FIG. 3, it may include program modules that are implemented in any language, such as C, C++, Java, PHP, .Net, Python, or Ruby, and that perform various functions.

The server 110 may also be connected through the network 130 to an unspecified number of clients (including the terminal 120) and/or other servers. Accordingly, the server 110 may refer to a computer system that receives work requests from clients or other servers and derives and provides the results, or to the computer software (server program) installed for such a computer system.

In addition to the server program described above, the server 110 should be understood as a broad concept that includes the series of application programs running on the server 110 and, in some cases, various databases built inside or outside it.

The server 110 may also store and manage content and various information and data in a database. The database may be implemented inside or outside the server 110.

The server 110 may also be implemented on general server hardware using the server programs that are offered for operating systems such as DOS, Windows, Linux, UNIX, and Macintosh. Representative examples include Website and IIS (Internet Information Server) in a Windows environment, and Apache, Nginx, and Light HTTP in a UNIX environment.

Meanwhile, the network 130 is the network that connects the server 110 and the terminal 120. It may be a closed network such as a LAN (Local Area Network) or a WAN (Wide Area Network), or an open network such as the Internet. Here, the Internet refers to the worldwide open computer network architecture that provides the TCP/IP protocol and the various services that exist in its upper layers, namely HTTP (HyperText Transfer Protocol), Telnet, FTP (File Transfer Protocol), DNS (Domain Name System), SMTP (Simple Mail Transfer Protocol), SNMP (Simple Network Management Protocol), NFS (Network File Service), and NIS (Network Information Service).

In addition, when the terminal 120 is a mobile terminal such as a smartphone, a tablet PC, a PDA (Personal Digital Assistant), or a mobile communication terminal, the network 130 may further include a wireless access network such as a mobile communication network or a Wi-Fi network.

FIG. 2 is a configuration diagram of a failure management system according to an embodiment of the present invention, including peripheral devices that are connected to the server and the terminal over a network and manage a monitoring object.

Referring to FIG. 2, a failure management system according to an embodiment of the present invention may include not only the server 110, which provides the failure management service, and the terminal 120, which is connected to it over the network and can receive failure management information or display it on a screen, but also a monitoring object 140, a daemon server 150, a service monitoring terminal 160, a system monitoring terminal 170, and the like. The server 110 is described in detail below with reference to FIG. 3; the other devices are described first.

The monitoring object 140 is whatever is monitored by the failure management service, such as a device or an application. Typically, hardware equipment such as a network system or a DB may be one of the monitoring objects, and a service provided through another server (for example, a specific function of a piece of software) may also be a component of the monitoring object 140. For example, when a game service is the target of the failure management service, the components of the monitoring object 140 may include the game server that provides the game service, the database system that works with the game server and stores user account information and the like, the network equipment that connects the game server to users' terminals, and the service content of the game (for example, game execution screens and software blocks). In short, the monitoring object 140 should be interpreted as including every component that can be monitored through the failure management service and reported to the server 110 as a failed device or service.

The daemon server 150 is a device on which software that watches the monitoring object 140 is running. Whether a particular component of the monitoring object 140 has failed can be determined by monitoring the attributes of that component. For example, if a network component does not respond to a ping, the network function can be judged to have failed. Beyond this example, the daemon server 150 applies a variety of recently developed failure diagnosis techniques and, while watching the monitoring object 140 in real time, collects information on whether a failure has occurred in a particular component. The failure occurrence information collected in this way is transmitted to the server 110 over the network. The daemon server 150 may automatically transmit the failure occurrence information to the server 110 electronically as soon as it confirms a failure; it may store the confirmed failure occurrence information in a database shared with the server 110 so that the two exchange information through it; or it may send the collected failure occurrence information in response to periodic requests from the server 110. The foregoing explains that the server 110 can obtain failure occurrence information through the daemon server 150 and the like; the present invention is not limited to the information delivery paths described above.
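The delivery paths described for the daemon server can be sketched roughly as follows. This is an illustrative Python sketch under assumptions, not the patent's design: `FaultInfo`, `DaemonMonitor`, the probe callbacks, and the `push` hook are hypothetical names, and the probes are injected functions rather than real ping calls so that the example stays self-contained. Immediate transmission is modelled by the `push` callback; the `pending` list stands in for either the shared database or the buffer that the server drains when it polls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FaultInfo:
    """Failure occurrence information for one component (hypothetical shape)."""
    target: str
    detail: str


@dataclass
class DaemonMonitor:
    # target name -> health probe; a probe returns True when the target is healthy
    probes: Dict[str, Callable[[], bool]]
    # optional hook that sends a fault to the server as soon as it is detected
    push: Optional[Callable[[FaultInfo], None]] = None
    # faults held for later pickup (shared store or periodic server request)
    pending: List[FaultInfo] = field(default_factory=list)

    def check_once(self) -> List[FaultInfo]:
        """Run every probe once and route any detected faults."""
        found = []
        for target, probe in self.probes.items():
            if probe():
                continue
            fault = FaultInfo(target, "health probe failed")
            found.append(fault)
            if self.push is not None:
                self.push(fault)            # immediate transmission
            else:
                self.pending.append(fault)  # held until the server asks
        return found

    def drain(self) -> List[FaultInfo]:
        """Hand over and clear the held faults, as a polling server would request."""
        held, self.pending = self.pending, []
        return held
```

In a real deployment the probe for a network component might wrap an ICMP echo or a TCP connection attempt; that choice is outside what the text specifies.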

Because the daemon server 150 may be any device on which the real-time monitoring software runs, it is not limited to being an independent device or server. When the monitoring software is installed on the server 110, the daemon server 150 may, in terms of hardware, be the same unit as the server 110.

The service monitoring terminal 160 is a device that allows an administrator who monitors a service, which is one component of the monitoring object 140, to send failure occurrence information to the server 110 upon discovering a failure in the monitored service. Typically, a service monitoring administrator judges whether a failure has occurred in the service by watching the failure reports received through the customer center or the screens that the service outputs to terminals. When this judgment produces failure occurrence information, it is transmitted to the server 110. For this purpose, the server 110 may include functions for controlling what is displayed on the service monitoring terminal 160, and administrators can send failure occurrence information to the server 110 by entering it on the screen displayed on the service monitoring terminal 160.

The system monitoring terminal 170 is a device that allows an administrator who monitors a system, which is one component of the monitoring object 140 (for example, the networks, DBs, and shared servers commonly referred to as infrastructure), to send failure occurrence information to the server 110 upon discovering a failure in the monitored system. An administrator who monitors the system watches its attributes, usage status, and so on. When there are signs of an anomaly, the administrator either generates failure occurrence information and sends it to the server 110 right away, or first carries out a detailed inspection and sends the failure occurrence information to the server 110 once an actual failure is confirmed. For this purpose, the server 110 may include functions for controlling what is displayed on the system monitoring terminal 170, and administrators can send failure occurrence information to the server 110 by entering it on the screen displayed on the system monitoring terminal 170.

The daemon server 150, the service monitoring terminal 160, the system monitoring terminal 170, and the like described above may be collectively called failure monitoring devices, and the server 110 can obtain failure occurrence information from these failure monitoring devices.

The server 110, which works with these peripheral devices and carries out the main features of an embodiment of the present invention, is described in detail below.

FIG. 3 is an internal block diagram of the server 110 according to an embodiment of the present invention.

Referring to FIG. 3, the server 110 may include a role-specific staff information storage unit 310, a failure occurrence information acquisition unit 320, a staff information identification unit 330, a trouble ticket information generation unit 340, a trouble ticket issuing unit 350, a trouble ticket issue notification unit 360, a trouble ticket linking unit 370, and the like.

The role-specific staff information storage unit 310 stores in a database the role-specific processing staff information, in which one or more pieces of processing staff information are mapped to each virtual role group in the virtual role group list. To make the concept clear, virtual role groups are described in detail with reference to FIG. 4.

FIG. 4 is an exemplary diagram showing, as a table, role-specific processing staff information in which processing staff information is mapped to virtual role groups. In FIG. 4, the virtual role groups are divided into a group responsible for bulletin board failures in a service, a group responsible for login failures in a service, a group responsible for network failures in common, and a group responsible for e-mail failures in common. As their names suggest, these groups are defined by their role in handling failures.

In conventional techniques, once failure occurrence information is obtained, the failure manager checks it and directly designates the person who will handle it. That person may be an individual or a particular department. For example, suppose a bulletin board installed in the community for users of a particular game is malfunctioning. Conventional techniques would designate as the handler either the person who developed the bulletin board or the development department of that game. The problem arises when the designated person changes because of resignation, a transfer, or the like. Suppose the person in charge of the bulletin board problem is Kim Samdam, and that Kim Samdam manages not only the bulletin board but also the e-mail related to that game. If Kim Samdam leaves and Kim Sadam takes over, the administrator operating the server 110 must change the person in charge one item at a time for all failure occurrence information that had Kim Samdam as the person in charge, and for all trouble tickets issued from that failure occurrence information. Work errors or delays can occur in this process, which can delay the prompt handling of failures.

In contrast, an embodiment of the present invention uses virtual role groups, which act as a kind of intermediate mapping code. That is, the failure occurrence information, or a trouble ticket issued from it, designates a virtual role group as its handler, and the actual processing staff information is in turn linked to the virtual role group. The intermediate code sits between the failure occurrence information or trouble ticket and the actual processing staff information, so the two are never connected directly. With this arrangement, in the example above it is enough to change the person in charge of the virtual role groups to which Kim Samdam is mapped. The failure occurrence information and the trouble tickets issued from it need not be modified, and they are in effect reassigned to the new person automatically.

In the example above only one or two trouble tickets are assigned to Kim Samdam, so the benefit may look small. In practice, however, tens to hundreds of trouble tickets may be issued to the same person. In that case, with a virtual role group as an intermediate mapping code as in this embodiment, modifying only the information mapped to the virtual role group has the effect of changing the person in charge of all those trouble tickets.
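As a non-limiting illustration of the intermediate mapping code, the following Python sketch stores only a role group key on each trouble ticket and resolves the person in charge at lookup time. The class name, function names, and role group keys are hypothetical and are not taken from the specification; only the two staff names come from the example above.

```python
from dataclasses import dataclass


@dataclass
class TroubleTicket:
    ticket_id: int
    summary: str
    role_group: str  # intermediate mapping code; never a person's name


# Role-specific processing staff information (cf. FIG. 4).
# Group keys are hypothetical identifiers chosen for this sketch.
role_staff = {
    "service/bulletin-board": ["Kim Samdam"],
    "common/e-mail": ["Kim Samdam"],
}


def staff_for(ticket, mapping):
    """Resolve the current handlers of a ticket through its role group."""
    return mapping[ticket.role_group]


def replace_staff(mapping, old, new):
    """Swap one person for another in every role group that lists them."""
    for group in list(mapping):
        mapping[group] = [new if name == old else name for name in mapping[group]]


# A hundred open tickets, none of which stores a person's name.
tickets = [TroubleTicket(i, f"board error #{i}", "service/bulletin-board")
           for i in range(1, 101)]

# One change to the mapping reassigns every ticket; no ticket is edited.
replace_staff(role_staff, "Kim Samdam", "Kim Sadam")
print(staff_for(tickets[0], role_staff))
```

Because the tickets hold only the group key, the cost of a staff change is proportional to the number of role groups, not to the number of tickets.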

The failure occurrence information acquisition unit 320 obtains failure occurrence information from a failure monitoring apparatus. For the acquisition process, refer to the description above of the various methods (using the daemon server 150, the service monitoring terminal 160, the system monitoring terminal 170, and so on).

The staff information identification unit 330 identifies, from the role-specific processing staff information, the processing staff information mapped to the virtual role group.

The trouble ticket information generation unit 340 generates trouble ticket information that includes some or all of the failure occurrence information. The obtained failure occurrence information may contain raw, unprocessed data, some of which is unsuitable for trouble ticket information that will circulate, as a trouble ticket, among processing staff and approvers. The trouble ticket information may therefore include only part of the obtained failure occurrence information, or a processed form of it.

The trouble ticket information may further include, in association with it, information on the selected virtual role group. In this way, instead of linking processing staff information directly to the trouble ticket information, the virtual role group information can be linked as an intermediate mapping code for the processing staff information. Of course, the processing staff information mapped to the virtual role group may also be included directly in the trouble ticket information.

The trouble ticket issuing unit 350 selects, from the virtual role group list, a virtual role group to handle the failure occurrence information, and issues a trouble ticket that includes some or all of the failure occurrence information in association with the selected virtual role group. Issuing a trouble ticket usually means summarizing the failure, designating the person who will handle it, and generating a management number for it. The issuing process includes sub-steps such as finalizing the failure details to be shared with processing staff, approvers, and circulation recipients; storing the ticket issue record under the processing staff member's account; and notifying the processing staff member's contact address that the ticket has been issued.

The selection may be made on the basis of selection information received from the terminal 120 as a result of the administrator's operation of the terminal. Typically, the administrator reviews the failure occurrence information displayed on the terminal 120 and assigns it to an appropriate role group.

After a trouble ticket is issued, the trouble ticket issue notification unit 360 may transmit part of the content of the trouble ticket information to the contact medium included in the identified processing staff information. Such contact media may include a portable personal communication terminal (via SMS), an online messenger, an SNS, and e-mail. SMS can be used so that the processing staff member sees the notice quickly. To this end, the server 110 may further include an SMS transmission unit (not shown) that communicates with an SMS server to send the information automatically by SMS.

If, after a trouble ticket is issued, the information on the virtual role group included in the ticket changes, that is, if the processing staff information mapped to the virtual role group changes, the content of the trouble ticket may be retransmitted to the contact medium included in the changed processing staff information. Previously, an administrator had to edit each trouble ticket and resend the information to the new handler. Because the virtual role group is set up as an intermediate mapping code, the block that manages virtual role group information can notify the trouble ticket issuing unit 350 or the trouble ticket issue notification unit 360 that the information has changed, or the trouble ticket issuing unit 350 or the trouble ticket issue notification unit 360 can periodically check the virtual role group information for changes. In either case, the trouble ticket can be reissued, or its content retransmitted, to the changed contact medium without modifying the ticket itself.
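The change-notification variant described above might be sketched as follows, purely by way of example. The class names `RoleGroupRegistry` and `TicketNotifier`, the `send` callback, and the e-mail addresses are assumptions made for illustration; a real deployment would replace `send` with an SMS, messenger, or e-mail gateway.

```python
from collections import defaultdict


class RoleGroupRegistry:
    """Hypothetical block that manages virtual role group information."""

    def __init__(self):
        self._contacts = {}
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def contacts(self, group):
        return self._contacts.get(group, [])

    def set_contacts(self, group, contacts):
        changed = self._contacts.get(group) != list(contacts)
        self._contacts[group] = list(contacts)
        if changed:  # tell the notification side that the mapping changed
            for callback in self._listeners:
                callback(group)


class TicketNotifier:
    """Hypothetical stand-in for the issue notification unit."""

    def __init__(self, registry, send):
        self.registry = registry
        self.send = send  # assumed transport: send(contact, message)
        self.open_tickets = defaultdict(list)
        registry.subscribe(self.on_group_changed)

    def issue(self, ticket_id, group, text):
        self.open_tickets[group].append((ticket_id, text))
        self._notify(group, ticket_id, text)

    def on_group_changed(self, group):
        # Resend every open ticket of the group; the tickets are not edited.
        for ticket_id, text in self.open_tickets[group]:
            self._notify(group, ticket_id, text)

    def _notify(self, group, ticket_id, text):
        for contact in self.registry.contacts(group):
            self.send(contact, f"[ticket {ticket_id}] {text}")


outbox = []
registry = RoleGroupRegistry()
notifier = TicketNotifier(registry, lambda to, msg: outbox.append((to, msg)))
registry.set_contacts("common/e-mail", ["samdam@example.com"])
notifier.issue(1, "common/e-mail", "mail delivery failing")
registry.set_contacts("common/e-mail", ["sadam@example.com"])  # staff change
```

The periodic-check variant would replace the callback with a timer that compares the current mapping against the one last used for each open ticket.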

The trouble ticket termination processing unit (not shown) performs termination processing on issued trouble tickets. Termination is usually performed manually by an administrator: when the administrator judges that the failure has been repaired and the processing staff have no further role, the administrator selects a termination button or menu, causing the server 110 either to delete the trouble ticket or to add termination information to it so that it is distinguished from trouble tickets still in progress. Alternatively, termination may be performed automatically by the server 110. If a means of monitoring the failure is available, the server can determine whether the failure persists or the target is operating normally and terminate the trouble ticket accordingly, or it can terminate, according to predetermined criteria, trouble tickets that have been left unattended so long that they are no longer meaningful.
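Automatic termination could, for example, take the following form. This is only a sketch: the `is_healthy` probe, the field names, and the 30-day staleness threshold are assumptions, since the specification leaves the "predetermined criteria" open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    ticket_id: int
    component: str
    last_update: datetime
    closed_at: Optional[datetime] = None
    close_reason: Optional[str] = None  # termination information kept on the ticket


def auto_close(tickets, is_healthy, now, max_idle=timedelta(days=30)):
    """Close tickets whose target has recovered or that have gone stale.

    is_healthy(component) is an assumed monitoring probe; max_idle is an
    assumed stand-in for the predetermined staleness criterion.
    Returns the ids of the tickets closed by this call.
    """
    closed = []
    for ticket in tickets:
        if ticket.closed_at is not None:
            continue  # already terminated
        if is_healthy(ticket.component):
            reason = "recovered"
        elif now - ticket.last_update > max_idle:
            reason = "stale"
        else:
            continue  # failure persists and the ticket is still active
        ticket.closed_at, ticket.close_reason = now, reason
        closed.append(ticket.ticket_id)
    return closed


now = datetime(2012, 5, 2, 12, 0)  # illustrative timestamp
tickets = [
    Ticket(1, "login", now - timedelta(hours=2)),
    Ticket(2, "bulletin-board", now - timedelta(days=45)),
    Ticket(3, "e-mail", now - timedelta(hours=1)),
]
closed_ids = auto_close(tickets, lambda component: component == "login", now)
```

The sketch marks tickets with termination information instead of deleting them, which is one of the two options the description allows.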

The trouble ticket linking unit 370 links a newly generated trouble ticket with a specific trouble ticket that already exists. Linking may be done by adding link-related information to the trouble ticket, or by entering the number or ID of the specific linked trouble ticket in a link information field provided in the ticket's basic format. When trouble tickets are linked in this way, an administrator can monitor the information of all linked trouble tickets on a single screen and can also terminate several trouble tickets through a single termination process.

As noted in the description of acquiring failure occurrence information, that information can arrive by various routes: through the daemon server 150, through the service monitoring terminal 160, or through the system monitoring terminal 170. There may also be several daemon servers 150 rather than one. In such cases, multiple pieces of failure occurrence information may be generated at different places for a single failure and transmitted to the server 110. This creates the risk of handling the same failure several times over, which is inefficient. Trouble tickets issued in duplicate for the same failure should therefore be linked so that they can be consolidated, or so that the link information can be used to handle the failure efficiently. The trouble ticket linking unit 370 performs this function.

FIG. 5 is referred to in describing the function of the trouble ticket linking unit 370. FIG. 5 is an internal block diagram of the monitoring object 140. Referring to FIG. 5, the components of the monitoring object 140 can be represented hierarchically: the network system 550 and the DB system 540, which form the system infrastructure; the application server 530 and the common server 520, which provide server functions on top of that infrastructure; and finally the application service 510, which is presented directly to users on top of all of these. Providing a single top-level application service can thus require several lower-layer components. Each component shown as an internal block of the monitoring object 140 may be recognized by the failure monitoring apparatus as one monitoring target. Accordingly, in the description above and below, a component may be understood as one monitoring target.

Consider a failure in one part of this hierarchically structured monitoring object 140. If the lowest-level network system 550 fails, none of the DB system 540, the application server 530, the common server 520, and the application service 510 above it can function normally. Each terminal or device monitoring one of these components will then report that its own component (for example, the application server 530 or the common server 520) has failed and will send failure occurrence information naming that component to the server 110. As a result, several trouble tickets may be issued for what is in fact a single failure with a single cause. Because such duplicate trouble tickets make work inefficient, it is better to link them and manage them as a single failure-handling process.

Several specific embodiments for linking trouble tickets that are identical, or that are best handled together, are introduced below.

First, the server 110 obtains inter-component failure correlation information, which results from analyzing the failure correlations among the elements that make up the monitoring object 140. One example of such information is the hierarchy information of the components of the monitoring object 140 described above. The hierarchy information shows how each component affects the others. The server 110 applies the failure-related component included in a piece of trouble ticket information to the inter-component failure correlation information in order to identify other components that are likely to fail, or another component that is the cause of the failure. In the example above, take the trouble ticket information that names the network system 550, the actual cause, as the failed component. Applying the network system 550 to the inter-component failure correlation information (for example, the hierarchy information) identifies the DB system 540, the application server 530, the common server 520, and the application service 510 as other components likely to fail. These components will match the failure-related components named in the trouble ticket information produced while monitoring the other components, and the matching trouble tickets can then be linked to one another.
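By way of illustration only, the hierarchy information may be held as a dependency map and traversed in both directions. The exact edges between the blocks of FIG. 5 are not stated in the description, so the `DEPENDS_ON` table below is an assumed layering, and the function names are hypothetical.

```python
# component -> components directly beneath it (assumed edges, cf. FIG. 5)
DEPENDS_ON = {
    "application_service_510": ["application_server_530", "common_server_520"],
    "application_server_530": ["db_system_540"],
    "common_server_520": ["db_system_540"],
    "db_system_540": ["network_system_550"],
    "network_system_550": [],
}


def possible_causes(component, depends_on=DEPENDS_ON):
    """All lower-layer components whose failure could explain this one."""
    seen, stack = set(), list(depends_on.get(component, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def possibly_affected(component, depends_on=DEPENDS_ON):
    """All upper-layer components likely to fail when this one fails."""
    return {c for c in depends_on if component in possible_causes(c, depends_on)}


affected = possibly_affected("network_system_550")
causes = possible_causes("application_service_510")
```

With this assumed layering, a failure of the network system marks the four components above it as likely to fail, which matches the example in the text.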

If the failure-related component included in one trouble ticket matches any of the components likely to fail, or any component that is a cause of failure, obtained by applying another specific trouble ticket's information to the inter-component failure correlation information, the two trouble tickets can be linked. A service monitoring team, which watches the outward behavior of a service, typically views a failure from the service perspective when issuing a trouble ticket, whereas a team that manages the system infrastructure often views it from the infrastructure perspective. Applying a service-perspective trouble ticket to the inter-component failure correlation information identifies the infrastructure-level components in the underlying layers that may be the cause. Likewise, applying an infrastructure-perspective trouble ticket identifies the service-level components in the upper layers that are likely to fail. The components identified in this way can be compared to decide whether to link the two trouble tickets.
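The comparison described above can be sketched as a pairwise check, under the assumption that the correlation information is available as a table of the lower-layer components each component relies on. The `UPSTREAM` table, the ticket numbers, and the `billing_gateway` component are hypothetical.

```python
# component -> every lower-layer component it relies on (assumed data)
UPSTREAM = {
    "application_service": {"application_server", "common_server",
                            "db_system", "network_system"},
    "application_server": {"db_system", "network_system"},
    "common_server": {"db_system", "network_system"},
    "db_system": {"network_system"},
    "network_system": set(),
}


def should_link(comp_a, comp_b, upstream=UPSTREAM):
    """True if the components match or one lies beneath the other."""
    return (comp_a == comp_b
            or comp_b in upstream.get(comp_a, set())
            or comp_a in upstream.get(comp_b, set()))


def link_tickets(tickets, upstream=UPSTREAM):
    """tickets: {ticket_id: failed component}.

    Returns {ticket_id: [linked ticket ids]}, i.e. the content that would
    go into each ticket's link information field.
    """
    links = {ticket_id: [] for ticket_id in tickets}
    ids = sorted(tickets)
    for index, first in enumerate(ids):
        for second in ids[index + 1:]:
            if should_link(tickets[first], tickets[second], upstream):
                links[first].append(second)
                links[second].append(first)
    return links


links = link_tickets({
    101: "application_service",  # issued by the service monitoring team
    102: "network_system",       # issued by the infrastructure team
    103: "billing_gateway",      # unrelated, hypothetical component
})
```

Here the service-perspective ticket 101 and the infrastructure-perspective ticket 102 end up linked, while ticket 103 stays on its own.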

Another way to link issued trouble tickets is to link two tickets when the virtual role group information in their trouble ticket information matches. The two tickets may not always stem from the same failure, but because virtual role groups are classified by role, handling such tickets together can still improve efficiency.

The server 110 that manages failures according to an embodiment of the present invention has been described above. The method by which the server 110 manages failures according to an embodiment of the present invention is described below. The method described below may be performed in its entirety by the server 110 according to the embodiment shown in FIG. 3.

FIG. 6 is a flowchart of a method for managing failures according to an embodiment of the present invention.

The server 110 first generates virtual role groups responsible for failure handling and stores, in a database, role-specific processing staff information in which one or more pieces of processing staff information are mapped to each virtual role group (S600). Next, the server 110 obtains failure occurrence information from a monitoring object (S602) and selects, from among the virtual role groups, a virtual role group to handle the failure occurrence information (S604). The server 110 identifies, from the role-specific processing staff information, the processing staff information mapped to the selected virtual role group (S606) and generates trouble ticket information including some or all of the failure occurrence information (S608). Using the trouble ticket information and the identified processing staff information, the server 110 issues a trouble ticket (S610) and thereafter performs termination processing on the trouble ticket (S612).
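For orientation, steps S600 to S612 might be arranged as in the following minimal sketch. The class `TroubleServer`, the event fields, and the `select_group` callback (standing in for the administrator's selection) are assumptions made for illustration and do not limit how the steps are implemented.

```python
import itertools


class TroubleServer:
    """Hypothetical in-memory stand-in for the server 110."""

    def __init__(self, role_staff):
        # S600: store role-specific processing staff information.
        self.role_staff = {group: list(staff) for group, staff in role_staff.items()}
        self.tickets = {}
        self._ids = itertools.count(1)

    def handle(self, event, select_group):
        # S602: `event` is the failure occurrence information already obtained.
        group = select_group(event)            # S604: choose a virtual role group
        staff = self.role_staff[group]         # S606: identify mapped staff
        info = {                               # S608: keep only part of the event
            "component": event["component"],
            "message": event["message"],
            "role_group": group,
        }
        ticket_id = next(self._ids)            # S610: issue under a management number
        self.tickets[ticket_id] = {
            "info": info,
            "notified": list(staff),
            "status": "open",
        }
        return ticket_id

    def close(self, ticket_id):
        # S612: termination processing marks the ticket instead of deleting it.
        self.tickets[ticket_id]["status"] = "closed"


server = TroubleServer({"common/network": ["network-oncall"]})
event = {"component": "network_system", "message": "link down", "raw": "pkt loss=100%"}
ticket_id = server.handle(event, lambda e: "common/network")
server.close(ticket_id)
```

The raw field of the event is deliberately left out of the ticket information, reflecting the option of including only part of the failure occurrence information.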

The method for managing failures according to an embodiment of the present invention has been described above as following the procedure of FIG. 6, but this is only for convenience of description. Without departing from the essential concept of the present invention, the order of the steps may be changed, two or more steps may be combined, or one step may be split into two or more steps, depending on the implementation.

A server 700 and a method for managing failures according to another embodiment of the present invention are described below.

FIG. 7 is an internal block diagram of the server 700 according to another embodiment of the present invention. Referring to FIG. 7, the server 700 for managing failures according to this embodiment may include a failure occurrence information acquisition unit 710, a trouble ticket issuing unit 720, a trouble ticket termination processing unit 730, a service recovery determination unit 740, a failure recovery time calculation unit 750, and the like.

The server 700 of FIG. 7 may be understood as including the functions of the server 110 of FIG. 3 and, in addition, a series of functional blocks for calculating the failure recovery time. However, the server 700 according to this other embodiment is not limited to that interpretation. The server 700 of FIG. 7 is therefore described below independently of the server 110 of FIG. 3.

The failure occurrence information acquisition unit 710 obtains failure occurrence information from a failure monitoring apparatus. The trouble ticket issuing unit 720 issues a trouble ticket including some or all of the obtained failure occurrence information. The trouble ticket termination processing unit 730 terminates the issued trouble ticket according to a predetermined method. For the detailed functions of the failure occurrence information acquisition unit 710, the trouble ticket issuing unit 720, and the trouble ticket termination processing unit 730, refer to the corresponding functions of the server 110 in the embodiment described above. They are omitted here to avoid repetition.

After an issued trouble ticket has been terminated, the service recovery determination unit 740 obtains service operation information for an application service provided by the monitoring target, or provided partly by means of the monitoring target, and compares it with precomputed expected operation information for a recovered service to determine whether the service has recovered. The detailed function is described with reference to FIG. 8.

FIG. 8 is a graph showing the actual service operation pattern and the expected service operation pattern upon recovery on a single screen. The Y axis represents the number of concurrent users of the service, and the X axis represents time. The solid line represents the actual operation information of the service, and the dotted line above it represents the expected operation information upon service recovery.

Referring to FIG. 8, a failure occurs in the service at time (A), and the number of concurrent users drops sharply. The failure is repaired at time (B), after which the number of concurrent users climbs back toward its original level. Conventionally, the failure repair time means the time from (A) to (B), that is, the time from when the failure occurs until the service is restored. This is a rather classical notion, used mainly as an indicator of how quickly a failure in a physical system such as hardware can be repaired. From a management perspective, however, the important quantity is not the time from (A) to (B) but the time from (A) to (C), where (C) is the time at which, after the repair, the service level reaches the normal level it would have had without the failure. Because the time from (A) to (C) is the period during which the failure causes losses, minimizing it is what matters to management. Calculating the time from (A) to (C) requires a way to calculate the time from (B) to (C). Because no such method had been proposed, in practice only the time from (A) to (B) was measured. Calculating the time from (A) to (B) is simple. The time at which the failure is received, that is, in the example above, the time at which the failure occurrence information is obtained, is (A). The time at which the failure has been repaired and the issued trouble ticket is terminated is (B). The difference between (A) and (B) gives the result. The difficulty lies in calculating the time from (B) to (C). The time from (B) to (C), that is, the time from repair of the failure until the service is operating normally, is defined here as the failure recovery time.
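The three intervals can be written out as a short worked example. The timestamps are invented for illustration, and the function name and dictionary keys are hypothetical labels for the quantities defined above.

```python
from datetime import datetime, timedelta


def failure_intervals(t_a, t_b, t_c):
    """Split the loss period into its two parts.

    t_a: failure occurrence information obtained   (A)
    t_b: trouble ticket terminated after repair     (B)
    t_c: service back at its expected normal level  (C)
    """
    if not (t_a <= t_b <= t_c):
        raise ValueError("expected A <= B <= C")
    return {
        "repair": t_b - t_a,      # conventional failure repair time
        "recovery": t_c - t_b,    # failure recovery time as defined here
        "total_loss": t_c - t_a,  # the period that matters to management
    }


a = datetime(2012, 5, 2, 20, 0)   # illustrative values only
b = a + timedelta(minutes=40)
c = b + timedelta(minutes=45)
intervals = failure_intervals(a, b, c)
```

In this example the conventional metric would report 40 minutes, although the service was below its normal level for 85 minutes.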

Consider how to calculate the time from (B) to (C). The simplest way is to fix (C) from an administrator's input. A step may be added in which time (C) is received from the administrator's terminal, based on the administrator's experience or on guidelines in some management process, and the failure recovery time defined above is calculated as the difference between (B) and (C). Because this requires administrator intervention, it can lead to inaccurate data or lost work time. The failure recovery time can instead be calculated automatically by using precomputed expected operation information for a recovered service. If the expected operation information for a recovered service, for example the service utilization rate, can be calculated in advance, the server can track the utilization rate from the time (B) at which repair of the failure was completed, judge that the service has recovered when the rate comes close to the precomputed value, and fix that time as (C).

The method of calculating the operation information at service recovery in advance is now described in more detail. The usage or operation amount of a service often follows a regular pattern. In a game service, for example, the number of concurrent users may follow different patterns by time of day or by day of the week, and the rate at which the number of concurrent users grows with the time elapsed since the service opened also often follows a regular pattern. These patterns apply when the service recovery time is long. When the service recovery time is short, for example from 10 minutes to within 1 hour, the expected operation information at service recovery may be taken to be the information from immediately before the failure occurred. Such patterns of service operation information can easily be calculated and derived from the service history during earlier normal service time, from the most recent service operation information, and so on. For example, the hourly pattern may use the previous day's information, and the day-of-week pattern may use the previous week's information.
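
One way to derive the expected value described above is sketched below. The function name, the `history` mapping, and the use of time elapsed since the failure as a stand-in for "short recovery time" are assumptions made for this illustration. The text itself only gives the 10-minute-to-1-hour range and the previous-day and previous-week lookups.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Assumed cut-off, taken from the "10 minutes to within 1 hour" example.
SHORT_OUTAGE = timedelta(hours=1)


def expected_operation_value(
    history: Dict[datetime, float],
    failure_at: datetime,
    now: datetime,
    pattern: str = "hourly",
) -> Optional[float]:
    """Return the operation value expected at `now` if the service were normal.

    `history` maps sample times to an operation amount such as concurrent users.
    """
    if now - failure_at <= SHORT_OUTAGE:
        # Short outage: assume the level just before the failure still holds.
        earlier = [t for t in history if t < failure_at]
        return history[max(earlier)] if earlier else None
    # Long outage: hourly pattern uses the previous day, day-of-week pattern
    # uses the previous week, as in the examples in the text.
    lookback = timedelta(days=1) if pattern == "hourly" else timedelta(weeks=1)
    return history.get(now - lookback)


# Made-up samples for illustration only.
history = {
    datetime(2012, 5, 1, 14, 0): 1000.0,
    datetime(2012, 5, 2, 9, 55): 1200.0,
}
failure_at = datetime(2012, 5, 2, 10, 0)
short_case = expected_operation_value(history, failure_at, datetime(2012, 5, 2, 10, 30))
long_case = expected_operation_value(history, failure_at, datetime(2012, 5, 2, 14, 0))
```

A real deployment would likely interpolate between samples rather than require an exact timestamp match. The exact-key lookup is kept here only to keep the sketch short.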

Using the expected operation information at service recovery, which can be predicted as described above, the service recovery determination unit 750 determines whether the service has recovered. The service may be judged to have recovered when the value derived from the expected operation information and the value of the current operation information being tracked are equal, or close to within a certain range, over a certain time interval.
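
The comparison described above can be sketched as follows. This assumes that "within a certain range" means a relative tolerance and that "a certain time interval" means a run of consecutive samples. The function name, the 5% tolerance, and the three-sample window are illustrative choices, not values taken from the disclosure.

```python
from typing import Optional, Sequence


def recovery_sample_index(
    observed: Sequence[float],
    expected: Sequence[float],
    tolerance: float = 0.05,
    window: int = 3,
) -> Optional[int]:
    """Return the index of the sample at which recovery is judged, or None.

    Recovery is judged once `window` consecutive observed values each lie
    within `tolerance` (relative) of the corresponding expected value.
    """
    run = 0
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        close = abs(obs - exp) <= tolerance * abs(exp)
        run = run + 1 if close else 0
        if run == window:
            return i  # the point at which the recovered state is determined
    return None


# Made-up concurrent-user counts sampled after the ticket was closed.
observed = [200.0, 600.0, 960.0, 990.0, 1010.0, 1000.0]
expected = [1000.0] * len(observed)
index = recovery_sample_index(observed, expected)
```

Requiring a sustained run rather than a single close sample avoids declaring recovery on a momentary spike. Whether (C) should be the start or the end of that run is a design choice, and the sketch uses the end.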

When the service is judged to have recovered in this way, the failure recovery time calculation unit 760 determines the time from the point at which the trouble ticket was terminated to the point at which the service was judged to be in the recovered state as the failure recovery time.

The server 700 that manages failures according to another embodiment of the present invention has been described above. A method by which the server 700 according to another embodiment of the present invention manages failures is described below. The method described below may be performed entirely by the server 700 according to another embodiment of the present invention shown in FIG. 7.

FIG. 9 is a flowchart of a method for managing a failure according to another embodiment of the present invention.

The server 700 acquires failure occurrence information from a monitoring target (S900) and generates trouble ticket information including part or all of the failure occurrence information (S902). The server 700 then issues a trouble ticket using the trouble ticket information and the processing staff information for handling the failure occurrence information (S904), and terminates the trouble ticket according to a predetermined method (S906). After the termination, the server 700 acquires service operation information on an application service provided by the monitoring target, or on an application service provided using the monitoring target as a part, and compares it with the precomputed expected operation information at service recovery to determine whether the service has recovered (S908). When the service is judged to be in the recovered state, the server 700 determines the time from the termination point to the point at which the recovered state was determined as the failure recovery time (S910).
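
Steps S900 to S910 can be read as a single pass over one failure, as in the sketch below. The `TroubleTicket` class, the `run_failure_cycle` function, the dictionary keys, and the threshold-based `is_recovered` callback are all hypothetical names introduced for this illustration. The disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class TroubleTicket:
    info: dict                      # trouble ticket information (S902)
    staff: str                      # processing staff the ticket is issued to
    issued_at: datetime
    closed_at: Optional[datetime] = None


def run_failure_cycle(
    failure_info: dict,
    staff: str,
    issued_at: datetime,
    closed_at: datetime,
    samples: Iterable[Tuple[datetime, float]],
    is_recovered: Callable[[datetime, float], bool],
) -> Tuple[TroubleTicket, Optional[timedelta]]:
    # S900: failure_info is what was acquired from the monitoring target.
    # S902: build ticket information from part of the failure information.
    ticket_info = {k: failure_info[k] for k in ("target", "detail") if k in failure_info}
    # S904: issue the ticket to the processing staff.
    ticket = TroubleTicket(ticket_info, staff, issued_at)
    # S906: terminate the ticket (here simply by recording the closing time).
    ticket.closed_at = closed_at
    # S908: compare post-termination operation samples with the expectation.
    for sampled_at, value in samples:
        if sampled_at >= closed_at and is_recovered(sampled_at, value):
            # S910: failure recovery time runs from termination to recovery.
            return ticket, sampled_at - closed_at
    return ticket, None


# Made-up inputs for illustration only.
closed = datetime(2012, 5, 2, 10, 40)
ticket, recovery_time = run_failure_cycle(
    failure_info={"target": "game-db", "detail": "connection timeout", "severity": "critical"},
    staff="db-oncall",
    issued_at=datetime(2012, 5, 2, 10, 0),
    closed_at=closed,
    samples=[
        (closed + timedelta(minutes=5), 500.0),
        (closed + timedelta(minutes=10), 900.0),
        (closed + timedelta(minutes=25), 990.0),
    ],
    is_recovered=lambda _, value: value >= 950.0,
)
```

Passing the recovery test in as a callback keeps the flow independent of how the expected operation information is computed, so any comparison rule can be substituted.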

Although the method for managing a failure according to another embodiment of the present invention has been described as being performed in the order shown in FIG. 9, this is only for convenience of description. Without departing from the essential concept of the present invention, and depending on the implementation, the order of the steps may be changed, two or more steps may be merged, or one step may be split into two or more steps.

Although all the components constituting the embodiments of the present invention have been described above as being combined into one or as operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, the components may be selectively combined in one or more groups and operate. In addition, although each of the components may be implemented as an independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The codes and code segments constituting the computer program can be readily inferred by those skilled in the art. Such a computer program may be stored in computer-readable media and read and executed by a computer, thereby implementing the embodiments of the present invention. Storage media for the computer program may include magnetic recording media, optical recording media, and the like.

In addition, terms such as "include", "comprise", or "have" used above mean that the corresponding component may be present, unless specifically stated otherwise, and should therefore be construed as allowing further components rather than excluding them. All terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless otherwise defined. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly defined in the present invention.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs can make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

Claims

A method by which a server manages failures, the method comprising:
storing, in a database, per-role processing staff information in which at least one piece of processing staff information is mapped to each virtual role group belonging to a virtual role group list;
obtaining failure occurrence information from a failure monitoring apparatus; and
selecting, from the virtual role group list, a virtual role group to handle the failure occurrence information, and issuing a trouble ticket that includes part or all of the failure occurrence information in association with the selected virtual role group.

The method of claim 1, further comprising:
determining, from the per-role processing staff information, the processing staff information mapped to the selected virtual role group; and
after the trouble ticket is issued, transmitting part or all of the contents of the trouble ticket to a contact medium included in the determined processing staff information.

The method of claim 2,
wherein the contact medium is at least one of a portable personal communication terminal, an online messenger, an SNS, and e-mail, and part or all of the contents of the trouble ticket is transmitted to the contact medium automatically.

The method of claim 1,
wherein, when the processing staff information mapped to the selected virtual role group is changed after the trouble ticket is issued, part or all of the contents of the trouble ticket is transmitted to a contact medium included in the changed processing staff information.

The method of claim 1,
further comprising a trouble ticket linking step of linking the trouble ticket with a specific trouble ticket that has already been generated.

The method of claim 5, further comprising:
obtaining failure correlation information produced by analyzing failure correlations between monitoring targets; and
identifying, from the failure correlation information, a cause monitoring target identified as the cause of the failure corresponding to the trouble ticket and a predicted-failure monitoring target in which a chained failure is predicted to occur as a result of the failure corresponding to the trouble ticket, and identifying in the same manner a cause monitoring target and a predicted-failure monitoring target corresponding to the specific trouble ticket,
wherein, in the trouble ticket linking step, information on at least one of the monitoring target, the cause monitoring target, and the predicted-failure monitoring target corresponding to the trouble ticket is compared with information on at least one of the monitoring target, the cause monitoring target, and the predicted-failure monitoring target corresponding to the specific trouble ticket, to link the trouble ticket with the specific trouble ticket.

The method of claim 6,
wherein, in the trouble ticket linking step,
the trouble ticket and the specific trouble ticket are linked when the monitoring target or the cause monitoring target corresponding to the trouble ticket matches the monitoring target or the predicted-failure monitoring target corresponding to the specific trouble ticket, or when the monitoring target or the predicted-failure monitoring target corresponding to the trouble ticket matches the monitoring target or the cause monitoring target corresponding to the specific trouble ticket.

The method of claim 5,
wherein the monitoring target corresponding to the trouble ticket is an application service, and the monitoring target corresponding to the specific trouble ticket is infrastructure that provides the application service or is used in part to provide the application service.

The method of claim 5,
wherein the trouble ticket and the specific trouble ticket are linked when the virtual role group selected for the trouble ticket and the virtual role group selected for the specific trouble ticket are the same.

A server for managing failures, comprising:
a per-role processing staff information storage unit that stores, in a database, per-role processing staff information for each virtual role group belonging to a virtual role group list;
a failure occurrence information acquisition unit that obtains failure occurrence information from a failure monitoring apparatus; and
a trouble ticket issuing unit that selects, from the virtual role group list, a virtual role group to handle the failure occurrence information and issues a trouble ticket including part or all of the failure occurrence information.

A method by which a server manages failures, the method comprising:
obtaining failure occurrence information from a failure monitoring apparatus;
issuing a trouble ticket including part or all of the failure occurrence information;
terminating the trouble ticket according to a predetermined method;
after the termination, obtaining service operation information on an application service provided by a monitoring target, or on an application service provided using the monitoring target, and comparing it with precomputed expected operation information at service recovery to determine whether the service has recovered; and
when the service is determined to be in a recovered state, determining the time from the termination to the time at which the recovered state was determined as a failure recovery time.

The method of claim 11,
wherein, in determining whether the service has recovered,
the service operation information is an hourly service operation status pattern derived by tracking a service operation amount, which is a value quantifying the degree of operation of the service per hour; the expected operation information at service recovery is an hourly expected service operation pattern anticipated when the service operates normally; and the service operation status pattern and the expected service operation pattern are compared, the service being determined to be in the recovered state when the pattern values over a predetermined time interval contained in the two patterns fall within a predetermined criterion.

The method of claim 12,
wherein the service operation amount is the number of concurrent users of the service.

The method of claim 11,
wherein the expected operation information at service recovery is calculated from service operation information for normal service time that is stored in advance.

A server for managing failures, comprising:
a failure occurrence information acquisition unit that obtains failure occurrence information from a failure monitoring apparatus;
a trouble ticket issuing unit that issues a trouble ticket including part or all of the failure occurrence information;
a trouble ticket termination processing unit that terminates the trouble ticket according to a predetermined method;
a service recovery determination unit that, after the termination, obtains service operation information on an application service provided by a monitoring target, or on an application service provided using the monitoring target, and compares it with precomputed expected operation information at service recovery to determine whether the service has recovered; and
a failure recovery time calculation unit that, when the service is determined to be in a recovered state, determines the time from the termination to the time at which the recovered state was determined as a failure recovery time.