KR100753565B1

KR100753565B1 - High availability system and task division takeover method thereof

Info

Publication number: KR100753565B1
Application number: KR1020010087429A
Authority: KR
Inventors: 김성진
Original assignee: 엘지엔시스(주)
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2007-08-30
Also published as: KR20030057057A

Abstract

The present invention relates to a high availability system and a task division takeover method thereof. When a task event occurs on a node, the task is divided and handed over according to the resource utilization rates of the nodes, so that high system availability is maintained and degradation of service performance for client requests is minimized. To this end, in a high availability system that consists of 2 to N nodes (server systems) and that monitors and recovers from failures of service processes, each node comprises: an HA master that activates the managers for HA operation; a node manager that collects its own state information and multicasts it to the other nodes, determines the necessary work based on the information collected by the information manager and sends it to the task manager, and monitors the task manager and restarts it on failure; a task manager that manages tasks based on the information sent from the node manager, and that monitors the node manager, the load manager, and the information manager and reactivates them on failure; an information manager that collects and manages the information multicast from other nodes via the heartbeat and sends that information to the node manager and the load manager; and a load manager that calculates its own resource utilization rate and ranks it against the resource utilization rates of the other nodes received from the information manager.

Description

HIGH AVAILABILITY SYSTEM AND TASK DIVISION TAKEOVER METHOD THEREOF

FIG. 1 is a schematic diagram showing the configuration of a node in a conventional high availability system.

FIG. 2 is a schematic diagram showing the configuration of a general high availability system.

FIG. 3 is a schematic diagram showing the configuration of a node in the high availability system of the present invention.

FIG. 4 is a schematic diagram showing the configuration of the load manager in FIG. 3.

FIG. 5 is a schematic diagram showing the configuration of the SRC and SC in FIG. 4.

FIG. 6 is a diagram showing the overall operation flow of the load manager in FIG. 4.

*** Explanation of symbols for main parts of the drawings ***

NM: Node Manager TM: Task Manager

IM: Information Manager LM: Load Manager

MEM: Local memory

The present invention relates to a high availability system, and more particularly to a high availability system and a task division takeover method thereof in which, when a task event occurs on a node, the task is divided and handed over according to the resource utilization rates of the nodes.

In general, an HA system is built from two or more (up to N) server systems. The systems continuously check each other's operating status to determine whether a problem has occurred in a running task or in a running system. When a problem is found, service ownership of the failed node, or of the task running on that node, is passed to a designated node, so that external client service requests continue to be served without interruption.

To exchange information between nodes, the system nodes use three main managers: a node manager, a task manager, and an information manager.

The operation of each manager on each node is described with reference to FIG. 1. The HA master starts HA and activates the node manager (NM), the task manager (TM), and the information manager (IM) (S1).

Next, the task state information is passed to the local memory (MEM) (S2), and the information manager (IM) collects the node information sent from the other connected nodes (S3).

Next, the information manager (IM) transfers that state information to the local memory (MEM) (S4), and the node manager (NM) refers to its own task state information stored in the local memory (MEM).

Next, the state information of the other nodes stored in the local memory (MEM) is read, and the complete information of the node itself, compiled on the basis of what was read, is multicast to the external nodes (S7).

Next, tasks are managed (run, stopped, monitored) according to the information sent from the node manager (NM), and the node manager (NM) refers to the information of the information manager (IM) to make its decisions (S9). If a failure occurs, the node manager (NM) restarts the task manager (TM) (S10).

FIG. 2 is a schematic diagram showing the configuration of a general high availability system. The internal HA configuration of each node is the same as in FIG. 1: HA is started and the three managers run. To prepare for takeover when a problem occurs, the takeover order is assigned manually in the system configuration file before HA is started.

That is, if a problem occurs on the A_node, the C_node takes over; if a problem occurs on the C_node, the B_node takes over; if a problem occurs on the B_node, the N_node takes over; and if a problem occurs on the N_node, the N-1_node takes over. The order in which every node can be taken over is configured manually and saved in the configuration file before HA starts, and each running node reads this file to prepare for failures.
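The fixed takeover order described above can be pictured as a static lookup table. The following is a minimal sketch, assuming a hypothetical in-memory mapping named STATIC_TAKEOVER_ORDER in place of the configuration file, whose actual format the text does not specify:

```python
# Hypothetical stand-in for the manually written configuration file.
# Keys are failed nodes; values are the nodes preassigned to take over.
STATIC_TAKEOVER_ORDER = {
    "A_node": "C_node",
    "C_node": "B_node",
    "B_node": "N_node",
    "N_node": "N-1_node",
}


def successor(failed_node: str) -> str:
    """Return the preassigned takeover node, ignoring its current load."""
    return STATIC_TAKEOVER_ORDER[failed_node]
```

Because the mapping is fixed before HA starts, the lookup for the A_node always yields the C_node, however busy the C_node happens to be at that moment.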

As shown in FIG. 2, while each node is running its assigned tasks a through n and no problem occurs, no takeover takes place between nodes.

For example, when an external client sends a request to task a to receive an a-related service, the A_node recognizes that it must handle the request and responds appropriately.

However, if there is no response to the service request, the tasks (IP, FS, AP) of the A_node are mounted from the shared disk and taken over in the order specified in the configuration file.

Before a takeover situation arises, the TM of the A_node detects that its task has been killed and restarts it.

If the restart succeeds, no takeover occurs. If an unrecoverable situation such as a network failure arises, the takeover proceeds in the configured order.

The task manager (TM) monitors the tasks at all times, even before any service request arrives from outside.

If the problem lies with a node rather than a task (regardless of whether there is a service request), the heartbeat comes into play: the nodes use it at all times as the first check of each other's active state. If the heartbeat signal is received, the peer node is active and no takeover occurs.

However, if no heartbeat signal can be received (the node has been killed, the HB cable is short-circuited, and so on), the node that is preset in the configuration file to take over performs a second check using ping.

If a normal response is received, the peer node is active and no takeover is performed.

In that case the internal network has failed, but the service to external clients is in a normal state.

If ping also fails, a final file system check is performed. If activity is occurring on the file system, the peer node is judged to be serving normally; it merely cannot be confirmed between the nodes through HB and ping.

However, if no activity is found even in the file system check, the node has been killed, and the target node takes over in the order specified in the configuration file.
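The three-stage liveness check described above (heartbeat, then ping, then file system activity) can be sketched as a short decision function. This is only an illustration: the function name and the three probe callables are hypothetical, and how each probe is actually implemented is not stated in the text.

```python
from typing import Callable


def should_take_over(heartbeat_ok: Callable[[], bool],
                     ping_ok: Callable[[], bool],
                     filesystem_active: Callable[[], bool]) -> bool:
    """Return True only if every check fails, evaluated in order."""
    if heartbeat_ok():
        return False  # peer answered on the heartbeat link
    if ping_ok():
        return False  # internal link is down, but the peer still serves clients
    if filesystem_active():
        return False  # peer is still writing to the shared file system
    return True       # no sign of life: the peer is treated as killed
```

Each later probe runs only when the earlier ones fail, so the more expensive checks are skipped while the heartbeat is healthy.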

In an HA system environment running N nodes, whenever a problem on a node or a problem in a running task is taken over, every takeover refers to the configuration file, in which the takeover order has been entered in advance for each node.

When a task takeover proceeds, the node performing it takes over unconditionally, regardless of how heavy or light the load already is from the tasks it is currently running.

In the conventional high availability system described above, when a takeover event occurs in a multi-node HA system, the tasks of a node are handed over to a specific node designated in advance. With this approach, tasks can become concentrated on one node and slow that node's service, and it is also inefficient in terms of distributing the use of system resources evenly.

The present invention was devised to solve the above problems. Its object is to provide a high availability system and a task division takeover method thereof in which, when a task event occurs on a node, the task is divided and handed over according to the resource utilization rates of the nodes, so that high system availability is maintained and degradation of service performance for client requests is minimized.

To achieve the above object, the present invention provides a high availability system that consists of 2 to N nodes (server systems) and that monitors and recovers from failures of service processes, wherein each node comprises: an HA master that activates the managers for HA operation; a node manager that collects its own state information and multicasts it to the other nodes, determines the necessary work based on the information collected by the information manager and sends it to the task manager, and monitors the task manager and restarts it on failure; a task manager that manages tasks based on the information sent from the node manager, and that monitors the node manager, the load manager, and the information manager and reactivates them on failure; an information manager that collects and manages the information multicast from other nodes via the heartbeat and sends that information to the node manager and the load manager; and a load manager that calculates its own resource utilization rate and ranks it against the resource utilization rates of the other nodes received from the information manager.

To achieve the above object, the present invention also provides a method comprising: a first step in which the HA master starts the load manager, the node manager, the information manager, and the task manager; a second step of sending task state information to the local memory and then collecting node information from the other nodes via the heartbeat; a third step of sending the state information of the other nodes to the load manager, which compares its own resource utilization rate with those of the other nodes and stores the takeover order information and the state information of the nodes; a fourth step of multicasting the node's own state information and the takeover order to the external nodes; a fifth step in which the node manager determines tasks according to the information of the information manager and the task manager manages the tasks; and a sixth step of detecting failures in task operation on each node and, when a failure occurs on a given node, handing that node's tasks over to other nodes according to the takeover order information.

Hereinafter, the operation and effects of the high availability system and its task division takeover method according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 3 is a schematic diagram showing the configuration of a node in the high availability system of the present invention. As shown, each node comprises: an HA master that activates the managers for HA operation; a node manager (NM) that collects its own state information and multicasts it to the other nodes, determines the necessary work based on the information collected by the information manager (IM) and sends it to the task manager (TM), and monitors the task manager (TM) and restarts it on failure; a task manager (TM) that manages tasks based on the information sent from the node manager (NM), and that monitors the node manager (NM), the load manager (LM), and the information manager (IM) and reactivates them on failure; an information manager (IM) that collects and manages the information multicast from other nodes via the heartbeat and sends that information to the node manager (NM) and the load manager (LM); and a load manager (LM) that calculates its own resource utilization rate and ranks it against the resource utilization rates of the other nodes received from the information manager (IM).

As shown in FIG. 4, the load manager (LM) consists of an SRC (Self Resource Checker, hereinafter SRC), which calculates the node's own resource utilization rate, and an SC (Sequence Comparator, hereinafter SC), which compares the resource utilization rates of the other nodes received from the information manager (IM) with the node's own rate to determine the takeover order, and then stores the takeover order information and the state information of each node in the local memory (MEM).

As shown in FIG. 5, the SRC consists of a CUC that calculates the node's own CPU resource utilization rate, an MUC that calculates the memory (MEM) resource utilization rate, a DUC that calculates the disk I/O resource utilization rate, and an NUC that calculates the network I/O resource utilization rate.

As shown in FIG. 5, the SC consists of a UC, which compares the resource utilization rate of each node with the node's own rate, and a DTS, which determines the takeover order from the UC's comparison result and stores it in the local memory (MEM). The operation of the present invention configured in this way is described below.

First, the operation of an arbitrary node is described.

The HA master activates the four managers that handle internal operation, and each manager begins sending and receiving the information needed to run HA.

That is, for HA operation the HA master activates the node manager (NM), the task manager (TM), the information manager (IM), and the load manager (LM).

Accordingly, the node manager (NM) collects its own state information and multicasts it to the other nodes, determines the necessary work based on the information collected by the information manager (IM) and sends it to the task manager (TM), and monitors the task manager (TM) and restarts it on failure.

The task manager (TM) manages tasks based on the information sent from the node manager (NM), and monitors the node manager (NM), the load manager (LM), and the information manager (IM), reactivating them on failure.

The information manager (IM) collects and manages the information multicast from other nodes via the heartbeat, and sends that information to the node manager (NM) and the load manager (LM).

The load manager (LM) calculates its own resource utilization rate, compares it with the resource utilization rates of the other nodes received from the information manager (IM) to determine the task takeover priority, and stores the result in the local memory (MEM).

That is, the SRC of the load manager (LM) calculates the node's own resource utilization rate and sends it to the SC. The SC then compares the resource utilization rates of the other nodes received from the information manager (IM) with the node's own rate to determine the takeover order, and stores the takeover order information and the state information of each node in the local memory (MEM).

The SRC performs four functions: CPU Usage (CUC) calculates the node's own CPU resource utilization rate, MEM Usage (MUC) calculates the memory (MEM) resource utilization rate, Disk I/O calculates the disk I/O resource utilization rate, and N/W I/O calculates the network I/O resource utilization rate.
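The four SRC measurements can be grouped into one record per node. The sketch below is illustrative only: the class name ResourceUsage and its field names are hypothetical, and averaging the four rates into a single figure is an assumption made for the example, since the text does not say how the four values are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUsage:
    """Utilization rates in percent, one per SRC checker."""
    cpu: float         # from the CUC
    memory: float      # from the MUC
    disk_io: float     # from the DUC
    network_io: float  # from the NUC

    def combined(self) -> float:
        # Assumption: an unweighted mean. The source does not define
        # how the four rates are merged into one comparable value.
        return (self.cpu + self.memory + self.disk_io + self.network_io) / 4
```

A weighted mean or a "worst of the four" rule would fit the description equally well; the point is only that the SC needs one comparable value per node.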

The UC of the SC compares the resource utilization rates of the other nodes with the node's own rate, and the DTS determines the takeover order from the UC's comparison result and stores it in the local memory (MEM).
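The comparison and ordering done by the UC and DTS amounts to sorting nodes by utilization. The following is a minimal sketch, assuming each node's utilization has already been reduced to a single number; the function name takeover_order and the tie-break by node name are assumptions, not details from the text.

```python
from typing import Dict, List


def takeover_order(own_id: str, own_usage: float,
                   peer_usage: Dict[str, float]) -> List[str]:
    """Rank all nodes, least utilized first (first in line to take over)."""
    usage = {**peer_usage, own_id: own_usage}
    # Ties are broken by node name so every node computes the same order.
    return sorted(usage, key=lambda node: (usage[node], node))
```

For example, a node "A" at 70% with peers "B" at 30% and "C" at 50% would compute the order B, C, A.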

FIG. 6 shows the internal operation of the load manager (LM). When the internal SRC starts, the SR master runs four routines to determine the node's own resource usage, and the values they output are sent to the Cul calculator of the SC using the Trans function.

The SC, which has been waiting, sends the information on the other nodes collected from the IM to the Cul calculator, and the calculation proceeds.

The result of the calculation is the takeover priority, which is stored in the node's own local memory (MEM) by the W2M function. Once the nodes have performed a takeover with reference to the stored content, the memory (MEM) area holding the priority is invalidated so that it can be reused for future events.
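The store-then-invalidate behavior of the priority area in local memory can be sketched as a small holder object. The class and method names here are hypothetical; only the write-once, consume-and-clear pattern comes from the text.

```python
from typing import List, Optional


class TakeoverOrderStore:
    """Holds the latest takeover priority until a takeover uses it."""

    def __init__(self) -> None:
        self._order: Optional[List[str]] = None

    def write(self, order: List[str]) -> None:
        # Plays the role of the W2M function: save the computed priority.
        self._order = list(order)

    def consume(self) -> Optional[List[str]]:
        # Return the stored priority and invalidate the slot for reuse.
        order, self._order = self._order, None
        return order
```

After one takeover has consumed the stored order, a second call returns None until the load manager writes a fresh priority.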

The task division method of the high availability system of the present invention is now described with reference to FIG. 3. The HA master starts the load manager (LM), the node manager (NM), the information manager (IM), and the task manager (TM) (S1); task state information is sent to the local memory (MEM) (S2); and node information is collected from the other nodes via the heartbeat (S3).

Next, the state information of the other nodes is sent to the load manager (LM) (S4), and the load manager (LM) compares its own resource utilization rate with those of the other nodes and stores the takeover order information and the state information of the nodes (S5, S6).

Next, the node multicasts its own state information and the takeover order to the external nodes (S7 to S9).

Next, the node manager (NM) determines tasks according to the information of the information manager (IM), and the task manager (TM) manages the tasks (S10, S11).

Next, failures in task operation on each node are detected. When a failure occurs on a given node, that node's tasks are handed over according to the takeover order information (S12): the takeover order is checked, and the tasks are either handed over to the single node with the lowest resource utilization rate or distributed among several nodes.

In other words, as in FIG. 2, tasks a through m are running on the respective nodes. When a request for service arrives at the A_node over the service network, service continues as long as task a is operating normally; if a problem occurs, a task takeover is performed.

Regardless of whether there is an external service request, a takeover is triggered when a task has been killed and does not become active after the task manager (TM) restarts it, or when a network problem prevents the service from being handled. In that case the takeover order determined by the load manager (LM) is checked, and the task is handed over to the node with the lowest resource utilization rate.

If the node itself has failed, all of the tasks running on that node are killed. The resource utilization rates of each node, stored in the local memory (MEM) of every node, are then checked, and nodes with low resource utilization run the tasks. The node with the lowest resource utilization rate may take over all of the tasks, or several nodes may divide the tasks among themselves.

The split takeover of multiple tasks works as follows. The average resource utilization rate over the nodes is calculated. If the sum of a given node's own resource utilization rate and the resource utilization rates of the tasks to be taken over is greater than the average resource utilization rate, the tasks are handed over to that node; if it is smaller, the tasks are divided and taken over in cooperation with one or more other nodes.

That is, the resource usage values of the N nodes are summed, and the sum is divided by N to obtain the average.
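Written as a formula, with U_i denoting the resource usage value of node i (the symbol names are introduced here for illustration), the average is:

```latex
\bar{U} = \frac{1}{N} \sum_{i=1}^{N} U_i
```

As a purely hypothetical illustration, four nodes at 20%, 40%, 60%, and 80% give:

```latex
\bar{U} = \frac{20 + 40 + 60 + 80}{4} = 50
```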

If there are three tasks to take over and five top-ranked nodes whose utilization is below the average, the three highest-ranked nodes each take over one task according to the takeover sequence.

If the first-ranked node were to take over all three tasks and the sum of its own resource utilization rate and that of the tasks would be higher than the average, only the largest task is handed to the first-ranked node, and the remaining two tasks are checked by the second-ranked node.

In the same way, if the sum of the second-ranked node's own resource utilization rate and that of the two tasks is lower than the average, both tasks are handed to the second-ranked node. If it is higher than the average, the larger of the two tasks is handed to the second-ranked node, and the remaining task is handed to the third-ranked node.
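The cascading example above (first-ranked node, then second, then third) can be sketched as a loop over the takeover order. This is a sketch under stated assumptions, not the patented implementation: the function and parameter names are hypothetical, a projected load exactly equal to the average is treated as acceptable, and the last node in the order takes whatever remains, since the text does not cover either case.

```python
from typing import Dict, List


def split_takeover(tasks: Dict[str, float],
                   node_usage: Dict[str, float],
                   order: List[str],
                   average: float) -> Dict[str, List[str]]:
    """Assign failed-node tasks to nodes, visiting nodes in takeover order."""
    # Largest task first, so an overloaded node keeps only the biggest one.
    remaining = sorted(tasks, key=lambda t: tasks[t], reverse=True)
    assignment: Dict[str, List[str]] = {}
    for index, node in enumerate(order):
        if not remaining:
            break
        projected = node_usage[node] + sum(tasks[t] for t in remaining)
        is_last = index == len(order) - 1
        if projected <= average or is_last:
            # The node can absorb everything left (or nobody else is available).
            assignment[node] = remaining
            remaining = []
        else:
            # Take only the largest task and pass the rest down the order.
            assignment[node] = [remaining[0]]
            remaining = remaining[1:]
    return assignment
```

With hypothetical tasks t1=20, t2=10, t3=5, nodes n1=30, n2=35, n3=40, and an average of 50, n1 keeps only t1 (30+35 exceeds 50) and n2 takes t2 and t3 (35+15 does not). Lowering the average to 45 pushes t3 on to n3, which mirrors the three-node case in the text.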

Operating in this way allows node resources to be used fully and efficiently, and an improvement in service performance for external clients can be expected.

The specific embodiments described in the detailed description above are intended only to clarify the technical content of the present invention. The invention should not be interpreted narrowly as limited to these specific examples, and various modifications can be made within the spirit of the invention and the scope of the claims set out below.

As described in detail above, when a task event occurs on a node, the present invention divides the task and hands it over according to the resource utilization rates of the nodes. This maintains high system availability and minimizes degradation of service performance for client requests. By distributing and fully using the resources of the high availability system, it also secures stable system operation, improves processing capacity, and improves the efficiency of system resource use.

Claims

A high availability system consisting of 2 to N nodes (server systems) that monitors and recovers from failures of service processes,

wherein each node comprises: an HA master that activates the managers for HA operation;

a node manager that collects its own state information and multicasts it to the other nodes, determines the necessary work based on the information collected by the information manager and sends it to the task manager, and monitors the task manager and restarts it on failure;

a task manager that manages tasks based on the information sent from the node manager, and that monitors the node manager, the load manager, and the information manager and reactivates them on failure;

an information manager that collects and manages the information multicast from other nodes via the heartbeat and sends that information to the node manager and the load manager; and

a load manager that calculates its own resource utilization rate and ranks it against the resource utilization rates of the other nodes received from the information manager.

The system of claim 1, wherein the load manager comprises:

an SRC that calculates the node's own resource utilization rate; and

an SC that determines the takeover order by comparing the resource utilization rates of the other nodes received from the information manager with the node's own resource utilization rate, and then stores the takeover order information and the state information of each node in the local memory (MEM).

The system of claim 2, wherein the SRC comprises:

a CUC that calculates the node's own CPU resource utilization rate;

an MUC that calculates the memory (MEM) resource utilization rate;

a DUC that calculates the disk I/O resource utilization rate; and

an NUC that calculates the network I/O resource utilization rate.

The system of claim 2, wherein the SC comprises:

a UC that compares the resource utilization rate of each node with the node's own resource utilization rate; and

a DTS that determines the takeover order using the comparison result of the UC and stores it in the local memory (MEM).

A task division takeover method of a high availability system, comprising: a first step in which the HA master starts the load manager, the node manager, the information manager, and the task manager;

a second step of sending task state information to the local memory (MEM) and then collecting node information from the other nodes via the heartbeat;

a third step of sending the state information of the other nodes to the load manager, which compares its own resource utilization rate with those of the other nodes and stores the takeover order information and the state information of the nodes;

a fourth step of multicasting the node's own state information and the takeover order to the external nodes;

a fifth step in which the node manager determines tasks according to the information of the information manager and the task manager manages the tasks; and

a sixth step of detecting failures in task operation on each node and, when a failure occurs on a given node, handing that node's tasks over to other nodes according to the takeover order information.

The method of claim 5, wherein, in the sixth step,

the takeover order is checked, and the tasks are either handed over to the single node with the lowest resource utilization rate or distributed among a plurality of nodes.

The method of claim 6, wherein distributing the tasks among the plurality of nodes comprises:

calculating an average resource utilization rate over the nodes; and

handing the tasks to be taken over to a given node if the sum of that node's own resource utilization rate and the resource utilization rates of the tasks to be taken over is greater than the average resource utilization rate, and otherwise dividing the tasks and taking them over in cooperation with one or more other nodes.