KR20050097015A

KR20050097015A - Method of resilience for fault tolerant function

Info

Publication number: KR20050097015A
Application number: KR1020040021728A
Authority: KR
Inventors: 허성길
Original assignee: 삼성탈레스 주식회사
Priority date: 2004-03-30
Filing date: 2004-03-30
Publication date: 2005-10-07

Abstract

The present invention provides a software algorithm for redundancy in which processors performing the same function are configured in duplicate in a large system requiring high reliability, so that when one processor fails, the other processor takes over its function, maintains data consistency, and operates without interruption. Accordingly, one processor in the large system operates as the master and performs the functions required by the system, while the other operates as a slave in a standby state and switches to master when the master fails, so that the function continues without interruption. The invention thereby provides fault tolerance using software techniques alone, maintains data consistency by keeping the function running without interruption, and achieves fast reconfiguration and recovery when a failure occurs.

Description

Redundancy Method for Implementing Fault Tolerance in Large Systems {METHOD OF RESILIENCE FOR FAULT TOLERANT FUNCTION}

The present invention relates to a software algorithm for redundancy in a large system, and more particularly to a software algorithm for redundancy in a large system that makes it possible to implement a high level of fault tolerance in pursuit of stable operational service.

Every system carries an inherent possibility of failure caused by designer error, failure of electronic components, or other causes. If such a failure prevents normal operation in a large system that cannot tolerate failures, such as a weapon system, a fully electronic switching system, medical equipment, a flight control system, or a satellite, it can cause serious problems.

Therefore, unlike ordinary computer systems, large systems have two key requirements: fault tolerance, which demands a high degree of precision, and high-speed real-time processing. Here, a fault tolerance technique refers to a function that improves the reliability of the output values provided to the system by quickly detecting and handling a fault when one occurs in the system.

In particular, the design goal for a large system requiring high reliability is to implement a high level of fault tolerance while minimizing the performance degradation this implementation causes, and at the same time to maintain data consistency and achieve fast reconfiguration and recovery when a failure occurs. To meet these requirements, large systems adopt an asynchronous redundancy scheme as a basic premise. In an asynchronous redundancy scheme, one of two processors does essentially no work, that is, it remains in a standby state, and it takes over the operation when the active processor fails.

Conventional techniques for giving a system fault tolerance fall broadly into two categories: one provides fault tolerance by adding extra hardware, and the other provides it through software techniques. In other words, the approaches are either hardware-based or software-based.

However, fault tolerance achieved by adding hardware makes the system difficult to implement and incurs additional hardware cost. Software-based asynchronous redundancy, for its part, has difficulty maintaining data consistency and achieving fast reconfiguration and recovery when a failure occurs. Commercial middleware currently supports fault tolerance techniques, but its fault detection time and take-over time do not meet the performance required by large systems that demand high reliability, so vendors in this field develop and use their own products.

Accordingly, the field needs an approach that makes fault-tolerant systems easy to implement, efficiently improves fault tolerance when the system is applied, and reduces the cost increase incurred in implementing a fault-tolerant system.

As described above, commercial middleware products exist that detect a failure of one piece of hardware in a redundant system and allow its function to continue without interruption. However, no commercial product guarantees both a fault detection time of less than 1 second and continuity of data.

Accordingly, an object of the present invention is to provide a software algorithm for redundancy that implements a high level of fault tolerance in a large system and, by performing functions without interruption on a software basis, maintains data consistency and achieves fast reconfiguration and recovery when a failure occurs.

To achieve the above objects, the present invention provides a redundancy method for implementing fault tolerance in a large system having a plurality of application processors and a monitoring processor connected to the plurality of application processors to monitor their operation, the method comprising: determining a master processor and a slave processor among at least two application processors that perform the same function; synchronizing data by having the master processor transmit changed data to the slave processor each time there is a change in the data that must be kept identical between the master processor and the slave processor; having the monitoring processor, when a failure occurs in the master processor, transmit to the slave processor a message indicating that the master processor is not operating; and having the slave processor, after analyzing the message, perform the same function as the master processor.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals wherever possible throughout the drawings. Detailed descriptions of well-known functions and configurations that could unnecessarily obscure the subject matter of the invention are omitted.

The present invention provides a software algorithm for redundancy in a large system that has a plurality of application processors and a monitoring processor connected to them to monitor their operation: application processors performing the same function are configured in duplicate so that, when one fails, the other takes over its function, maintains data consistency, and operates without interruption. Accordingly, in a large system requiring high reliability, one application processor operates as the master and performs the functions required by the system, while the other operates as a slave in a standby state and, when the master fails, switches to master and continues the function without interruption.

For fault tolerance of the application processors, the present invention uses active-standby processor redundancy to increase the reliability of the application processors in a large system where stability is the top priority.

This structure passes through three phases: a Ready phase, in which the two application processors are given storage with the same structure in preparation for a switchover; a Take-over phase, in which the actual switchover of the active role occurs; and a Re-ready phase, in which the standby processor is recovered so that the system can re-enter the Ready phase after the switchover.
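The three phases can be read as a small state machine. The following is a minimal sketch under that reading; the event names (`active_failed`, `switch_done`, `standby_recovered`) are hypothetical labels chosen for illustration and do not appear in the patent.

```python
from enum import Enum


class Phase(Enum):
    READY = "ready"          # both processors hold identically structured storage
    TAKE_OVER = "take-over"  # the switchover of the active role is in progress
    RE_READY = "re-ready"    # the standby processor is being recovered


# Assumed transition table; the event names are illustrative only.
TRANSITIONS = {
    (Phase.READY, "active_failed"): Phase.TAKE_OVER,
    (Phase.TAKE_OVER, "switch_done"): Phase.RE_READY,
    (Phase.RE_READY, "standby_recovered"): Phase.READY,
}


def next_phase(phase, event):
    """Return the phase that follows `phase` on `event`.

    Events that are not valid in the current phase leave it unchanged.
    """
    return TRANSITIONS.get((phase, event), phase)
```

Ignoring events that are invalid in the current phase is a design choice of this sketch; a real system might log or reject them instead.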

FIG. 1 shows the configuration of a system for implementing redundancy between application processors in a large system according to an embodiment of the present invention. The system according to this embodiment consists broadly of a monitoring processor and a plurality of application processors.

First, the monitoring processor 100 connects communication among the application processors 110, 120, and 130, acts as the manager of each application processor, and also enables communication with another monitoring processor 140. The communication 160, 170, 180 among the application processors 110, 120, and 130 and the communication 150 between the monitoring processors 100 and 140 could use a commercial TCP/IP or UDP/IP protocol; here UDP is used in order to minimize fault detection time and manage communication efficiently. Because the communication between the monitoring processors 100 and 140 uses UDP, it should be reinforced with a function that guarantees reliable delivery of the messages sent and received. The monitoring processor 100 also manages the state of the activated application processors connected to its node, that is, the connection state of each application processor, the processor ID that identifies each application processor, and so on, and it provides the means of communication among the application processors 110, 120, and 130.
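The patent does not say how message delivery over UDP is to be reinforced. One common option, shown here purely as an assumption, is a stop-and-wait scheme with sequence numbers, acknowledgements, and retransmission. The sketch simulates the channel in memory instead of opening sockets; `ReliableSender`, `Receiver`, and `lossy_channel` are hypothetical names.

```python
class Receiver:
    """Delivers each sequence number once, even if a datagram is retransmitted."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def on_datagram(self, seq, payload):
        if seq == self.expected:      # new datagram: deliver it
            self.delivered.append(payload)
            self.expected += 1
        # duplicates are ignored here; the channel still acknowledges them


class ReliableSender:
    """Stop-and-wait sender: retransmit until an acknowledgement arrives."""

    def __init__(self, send, max_retries=3):
        self.send = send              # callable(seq, payload) -> True if ACKed
        self.max_retries = max_retries
        self.seq = 0

    def deliver(self, payload):
        for attempt in range(1 + self.max_retries):
            if self.send(self.seq, payload):
                self.seq += 1
                return attempt + 1    # number of transmissions used
        raise TimeoutError("no acknowledgement received")


def lossy_channel(receiver, outcomes):
    """Build a send() that loses the datagram or its ACK as scripted in `outcomes`."""
    script = iter(outcomes)

    def send(seq, payload):
        outcome = next(script, "ok")
        if outcome == "drop_data":    # datagram lost before reaching the receiver
            return False
        receiver.on_datagram(seq, payload)
        return outcome != "drop_ack"  # ACK lost on the way back

    return send
```

With the script `["drop_data", "drop_ack", "ok"]`, the first message needs three transmissions but is delivered to the application only once, which is the property a reliability layer on top of UDP has to provide.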

Next, the structure of the application processors 110, 120, and 130 according to the present invention is described in detail with reference to FIG. 2. The structure is described for the application processor 110; the remaining application processors 120 and 130 have the same structure, so their description is omitted.

As shown in FIG. 2, the user framework of the application processor 110 consists of an application area 200, a task control module 250 for communicating with the monitoring processor 100, and a redundancy (resilience) control module 220 for supporting redundancy effectively. The task control module 250 and the redundancy control module 220 are implemented as classes, and the application developer inherits from these two classes so that all processor components share the same structure.

Inside the task control module 250, two tasks run internally: a receive task 260, which receives messages from the monitoring processor 100 through a message queue 270, and an alive task 280, which reports that each component of the processor 110 is operating. The redundancy control module 220 performs the redundancy operation using an input buffer 230, which temporarily stores data received through the monitoring processor 100, and an update buffer 240, which stores data for redundancy.

The application area 200 in the application processor 110 defines the recovery data 210 that must be kept identical between the master processor and the slave processor. When the processor object is created, it notifies the redundancy control module 220 of this data so that the redundancy control module 220 manages it.
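The class structure described for FIG. 2 might look roughly like the following. This is an illustrative sketch only: the class, method, and attribute names are assumptions, and the tasks are reduced to plain methods instead of concurrently running tasks.

```python
class TaskControl:
    """Stand-in for the task control module (250)."""

    def __init__(self):
        self.message_queue = []          # message queue (270)

    def receive(self, message):
        """Receive task (260): queue a message from the monitoring processor."""
        self.message_queue.append(message)

    def alive(self):
        """Alive task (280): report that this component is operating."""
        return True


class ResilienceControl:
    """Stand-in for the redundancy control module (220)."""

    def __init__(self):
        self.input_buffer = []           # input buffer (230)
        self.update_buffer = []          # update buffer (240)
        self.recovery_data = None

    def register_recovery_data(self, data):
        """Called at object creation so this module manages the recovery data."""
        self.recovery_data = data


class Application(TaskControl, ResilienceControl):
    """A developer-written processor that inherits both framework classes."""

    def __init__(self):
        TaskControl.__init__(self)
        ResilienceControl.__init__(self)
        self.state = {"count": 0}        # recovery data (210), hypothetical content
        self.register_recovery_data(self.state)
```

Because every application inherits the same two classes, master and slave instances end up with identical buffers and registration behavior, which is the uniformity the description asks for.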

The redundancy method in the master processor according to an embodiment of the present invention is described below with reference to FIG. 3, which is a flowchart of that method.

As shown in FIG. 3, the master processor 110 receives an external message from the monitoring processor 100 in step 300. It then proceeds to step 310, where the corresponding application block of the master processor 110, for example the application area 200, processes the message. Next, in step 320, the master processor 110 determines whether the recovery data in the application area 200 has changed. If it has, the master processor 110 proceeds to step 330 and transmits the changed data to the slave processor. If it has not, the master processor 110 returns to step 300 to receive the next external message from the monitoring processor 100.

In other words, whenever the recovery data changes, the master processor 110 transmits the changed data to the slave processor. This data synchronization keeps the data in the master and slave processors identical and provides the redundancy needed for fault tolerance.
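The flow of FIG. 3 can be sketched as a loop. This is a minimal illustration, assuming the recovery data is a dictionary and that the loop continues with the next message after step 330 (the flowchart text only states the return path for the no-change case). `run_master`, `handle`, and `send_to_slave` are hypothetical names.

```python
import copy


def run_master(messages, recovery_data, handle, send_to_slave):
    """Process messages and forward recovery-data changes to the slave.

    `handle(recovery_data, message)` may modify `recovery_data` in place.
    Only added or modified keys are detected in this sketch.
    """
    for message in messages:                           # step 300: receive
        before = copy.deepcopy(recovery_data)
        handle(recovery_data, message)                 # step 310: process
        changed = {
            key: value
            for key, value in recovery_data.items()
            if key not in before or before[key] != value
        }                                              # step 320: any change?
        if changed:
            send_to_slave(changed)                     # step 330: transmit
```

Sending only the changed entries, not the full recovery data, is a choice of this sketch; the description says only that "the changed data" is transmitted.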

As described above, of each pair of application processors, one operates as the master processor and performs the function assigned to it in the large system, and the other operates as the slave processor. If the master processor fails, the slave processor carries on the function taken over from the master processor without interruption, so that the large system is unaffected.

For the slave processor to take over the master processor's function and operate without interruption, three things are required. First, when the master processor fails, the decision to switch the slave processor's operating mode so that it behaves as the master must be made quickly. Second, the slave processor must hold the same data as the master processor regardless of when the master fails. Finally, for more accurate and easier synchronization, a method is also needed to detect the master processor's failure quickly and to shorten the take-over time.

The process of determining the roles of the master and slave processors, which allows the slave processor to take over the master processor's function and operate without interruption, is described below according to an embodiment of the present invention.

According to an embodiment of the present invention, processors that are connected to a monitoring processor and perform the same function are paired: the processor activated first operates as the master processor, and the processor activated afterwards operates as the slave processor.

First, a master processor and a slave processor are determined among at least two application processors that perform the same function. When one of the application processors starts performing its function, for example after it completes initialization, it connects to a monitoring processor in order to communicate with other application processors and perform the redundancy function, and it sends information about itself to that monitoring processor. The monitoring processor checks the received application processor information and forwards it over the communication link to the other monitoring processor. The other monitoring processor then checks whether it has an application processor with the same application processor ID, the ID that identifies an application processor. If such an application processor exists, that other application processor determines whether an application processor with the same ID is already operating and reports the result to the application processor that originally requested the connection.

In other words, when an application processor requests registration with a monitoring processor, the monitoring processor accepts the request and determines, over the communication channel to the other monitoring processor, whether another application processor with the same ID is operating. Depending on the result, the application processor takes the role of either master or slave.

The master-or-slave decision is made in the redundancy control module of the application processor. The redundancy control module of an application processor that has just completed initialization checks the application processor information sent by the other monitoring processor. If a processor with the same application processor ID is already operating through the other monitoring processor, the redundancy control module makes the newly initialized application processor operate as the slave processor. If no application processor with the same ID is operating, the newly initialized processor operates as the master processor through its redundancy control module.

As described above, in the initial state the processors check each other's state through the monitoring processors: if the peer application processor is active, the processor operates as a standby slave processor, and if the peer application processor is in standby, it operates as the active master processor. Once the master and slave roles are determined, the two processors perform the fault tolerance function through the redundancy control module according to the present invention. The redundancy control module synchronizes data between the master and slave processors and carries out the core functions that allow the slave processor to take over the master's role when the master fails.
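The role decision described above amounts to a lookup by application processor ID across the monitoring processors. The sketch below is a simplified illustration: the description places the final decision in the application processor's redundancy control module, whereas here the lookup and the decision are folded into one hypothetical `Monitor.register` call, and the peer query is a direct method call, not a network message.

```python
class Monitor:
    """Stand-in for a monitoring processor that tracks active processor IDs."""

    def __init__(self):
        self.active_ids = set()
        self.peers = []                  # other monitoring processors

    def is_running(self, processor_id):
        return processor_id in self.active_ids

    def register(self, processor_id):
        """Register a newly initialized processor and return its role.

        The first processor with a given ID becomes "master";
        a later one with the same ID becomes "slave".
        """
        already_running = self.is_running(processor_id) or any(
            peer.is_running(processor_id) for peer in self.peers
        )
        self.active_ids.add(processor_id)
        return "slave" if already_running else "master"
```

A usage example with two monitoring processors, mirroring the reference numerals 100 and 140 (the ID `"radar"` is made up):

```python
m100, m140 = Monitor(), Monitor()
m100.peers, m140.peers = [m140], [m100]
first = m100.register("radar")    # no other "radar" is running yet
second = m140.register("radar")   # m100 already has an active "radar"
```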

With the master and slave processors determined as described above, the process by which data is synchronized without interruption through redundancy in each processor according to an embodiment of the present invention is described with reference to FIGS. 4 and 5.

FIG. 4 is a configuration diagram for redundancy between the master processor and the slave processor according to an embodiment of the present invention, and FIG. 5 is a message flow diagram for redundancy between the master processor and the slave processor according to the embodiment.

First, the behavior of the master and slave processors for data synchronization is determined mainly in the redundancy control module. To make processors easy to develop, the application area contains the same data in both, so that it behaves no differently whether the processor is master or slave.

As shown in FIG. 4, processors performing the same function are configured in duplicate in a large system requiring high reliability. When the master processor 110 fails, the other processor, that is, the slave processor 120, takes over its function, maintains data consistency, and operates without interruption. To this end, the large system implementing the software algorithm for redundancy consists of the master processor 110, the slave processor 120, and the monitoring processors 100 and 140 that connect them.

Referring to FIGS. 4 and 5, data synchronization between the master processor 110 and the slave processor 120 proceeds as follows. In step 500, the redundancy control module 220 of the master processor 110 receives a message from the monitoring processor 100 through the task control module 250. Meanwhile, the slave processor 120 receives the message from the monitoring processor 100 in step 505 and, in step 510, stores it in the input buffer of its redundancy control module 410.

On receiving the message from the monitoring processor 100, the redundancy control module 220 of the master processor 110 notifies the slave processor 120 in step 515 that a message has been received. When the redundancy control module 410 of the slave processor 120 receives this message information from the master processor 110, it compares the received information with the messages stored in its input buffer in step 525 and, through this comparison, tracks in step 530 which input the master processor 110 is processing.

In step 520, the redundancy control module 220 of the master processor 110 passes the message to the application area 200, which processes it to perform the corresponding function. During message processing, in step 535, the application area 200 determines whether the recovery data has changed.

If the application area 200 detects a change in the recovery data, it transmits the change information in step 540 to the slave processor 120 through the task control module 420, which is connected to the monitoring processor 140. When the redundancy control module 410 of the slave processor 120 receives the change information, it stores it temporarily in the update buffer.

The master processor 110 then determines in step 550 whether message processing is complete and, if so, sends completion information to the slave processor 120. The slave processor 120 then uses the information stored in the update buffer to update the recovery data of its application area 400 in step 560. The application area 400 on the slave processor 120 side does not itself process the received message.

Data synchronization between the master processor 110 and the slave processor 120 is thus performed through the redundancy control modules. In particular, the data of the slave processor 120 is updated every time the data in the master processor 110 changes. As a result, even if the master processor 110 fails, the slave processor 120 holds the same data and can continue to perform the same function as the master processor 110 without interruption. In other words, the slave processor 120 can take over the function of the master processor 110 without interruption because it holds the same data as the master processor 110 regardless of when the master processor 110 fails.
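The message flow of FIG. 5 can be sketched from the slave's point of view as a pair of buffers plus a commit on completion. This is a minimal in-memory illustration, not the patented implementation: message IDs, the dictionary-shaped recovery data, and the removal of a finished message from the input buffer are all assumptions, and direct method calls stand in for messages relayed by the monitoring processors.

```python
class Slave:
    """Slave-side redundancy control: stage changes, apply them on completion."""

    def __init__(self, recovery_data):
        self.recovery_data = dict(recovery_data)
        self.input_buffer = []    # messages received from the monitoring processor
        self.update_buffer = {}   # staged recovery-data changes
        self.in_progress = None   # ID of the input the master is processing

    def on_monitor_message(self, msg_id, body):    # steps 505 and 510
        self.input_buffer.append((msg_id, body))

    def on_master_notice(self, msg_id):            # steps 525 and 530
        self.in_progress = msg_id

    def on_change(self, changes):                  # change sent in step 540
        self.update_buffer.update(changes)

    def on_complete(self, msg_id):                 # step 560
        self.recovery_data.update(self.update_buffer)
        self.update_buffer.clear()
        # Assumption: a completed message is dropped from the input buffer.
        self.input_buffer = [(i, b) for i, b in self.input_buffer if i != msg_id]
        self.in_progress = None


def master_process(msg_id, body, recovery_data, handle, slave):
    """Master side; `handle` returns a dict of recovery-data changes."""
    slave.on_master_notice(msg_id)                 # step 515
    changes = handle(recovery_data, body)          # steps 520 and 535
    if changes:
        recovery_data.update(changes)
        slave.on_change(changes)                   # step 540
    slave.on_complete(msg_id)                      # step 550
```

Staging changes in the update buffer and applying them only on completion means that a master failure in the middle of a message leaves the slave's recovery data at the last completed state, while the input buffer still holds the unfinished message.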

In addition to the foregoing, another way to let the slave processor continue operating without interruption when the master processor fails is to detect the failure of the master processor quickly and to shorten the take-over time. This is described with reference to FIG. 6, which is a flowchart of a fault tolerance method according to an embodiment of the present invention.

Hereinafter, a case in which a failure occurs in an entire monitoring processor connected to each processor is described with reference to FIG. 6. The failure may occur in the monitoring processor as a whole, but it may also occur in any module or area of the master processor.

First, the operation between monitoring processors in the fault tolerance method according to the present invention is described with reference to FIG. 6. To detect whether a monitoring processor connected to each processor has failed as a whole, one monitoring processor exchanges heartbeats with the other monitoring processor in step 600. The monitoring processor then proceeds to step 610 and determines whether a heartbeat is received within a predetermined time. If a heartbeat is received within the predetermined time, the monitoring processor returns to step 600 and continues exchanging heartbeats with the other monitoring processor, repeating this process periodically.

In contrast, if the monitoring processor does not receive a heartbeat in step 610, it determines that the other monitoring processor has failed. It therefore proceeds to step 620 and switches the slave processor to the master processor. That is, the monitoring processor causes the slave processor connected to it to perform the same function as the master processor without interruption.
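The loop of steps 600 to 620 can be illustrated with a small Python sketch. It is a simplified model under assumptions: the class `HeartbeatMonitor`, the injectable `clock` parameter, the role strings, and the 3-second timeout are all hypothetical, since the specification only speaks of a "predetermined time".

```python
import time


class HeartbeatMonitor:
    """Hypothetical model of one monitoring processor watching its peer."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # the "predetermined time" (assumed value)
        self.clock = clock              # injectable so the sketch is testable
        self.last_seen = clock()
        self.local_role = "slave"       # role of the locally attached processor

    def on_heartbeat(self):
        # Step 600: a heartbeat from the peer monitoring processor arrived.
        self.last_seen = self.clock()

    def check(self):
        # Step 610: was a heartbeat received within the predetermined time?
        if self.clock() - self.last_seen > self.timeout_s:
            # Step 620: the peer is considered failed; promote the local slave.
            self.local_role = "master"
        return self.local_role


now = [0.0]                              # fake clock for a deterministic demo
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: now[0])
now[0] = 2.0
monitor.on_heartbeat()
now[0] = 4.0
role_in_time = monitor.check()           # 2 s since last heartbeat
now[0] = 10.0
role_after_timeout = monitor.check()     # 8 s since last heartbeat
```

A shorter timeout detects failure sooner but raises the chance of a false take-over when heartbeats are merely delayed; the specification does not state how the value is chosen.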

Meanwhile, in the fault tolerance method according to the present invention, a failure may be detected when the master processor terminates abnormally, that is, when no alive message is received through the task control module of the master processor within the time defined for an active task. A failure may also be detected when the master processor terminates normally, that is, when the master processor sends a disconnect request message to the monitoring processor and closes the connection itself.
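The two detection paths in this paragraph can be summarised in a short Python sketch. The enum `Termination`, the function `detect_termination`, and its parameters are hypothetical names chosen for illustration; the specification does not define such an interface.

```python
from enum import Enum


class Termination(Enum):
    NONE = "none"            # processor still considered alive
    ABNORMAL = "abnormal"    # alive message missed within the defined time
    NORMAL = "normal"        # processor sent a disconnect request itself


def detect_termination(seconds_since_alive, alive_timeout_s, disconnect_requested):
    """Classify the master's state from the two signals named in the text."""
    if disconnect_requested:
        return Termination.NORMAL
    if seconds_since_alive > alive_timeout_s:
        return Termination.ABNORMAL
    return Termination.NONE


healthy = detect_termination(1.0, alive_timeout_s=5.0, disconnect_requested=False)
crashed = detect_termination(9.0, alive_timeout_s=5.0, disconnect_requested=False)
closed = detect_termination(1.0, alive_timeout_s=5.0, disconnect_requested=True)
```

In this sketch both outcomes other than `NONE` would be treated as a failure that triggers take-over, as the paragraph describes.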

Thereafter, when a specific processor fails, the monitoring processor connected to the processors that have not failed sends a message announcing the non-operation of the specific processor to all processors connected to it, so that the slave processor takes over the function of the specific processor and performs it without interruption. If one master processor fails, the monitoring processor notifies only the slave processor having the same processor ID as that master processor, so that only that slave processor takes over the function of the master processor and operates as the master.
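The ID-based notification for a single master failure might look like the following Python sketch. The `AppProcessor` dataclass, the function `notify_master_failure`, and the example names and IDs are hypothetical; only the rule "notify the slave with the same processor ID" comes from the text.

```python
from dataclasses import dataclass


@dataclass
class AppProcessor:
    name: str
    processor_id: int
    role: str          # "master" or "slave" (illustrative labels)


def notify_master_failure(processors, failed):
    """Promote only the slave whose processor ID matches the failed master."""
    promoted = []
    for p in processors:
        if p is failed:
            continue
        if p.role == "slave" and p.processor_id == failed.processor_id:
            p.role = "master"          # this slave takes over as master
            promoted.append(p.name)
    return promoted


master_a = AppProcessor("A-active", 1, "master")
slave_a = AppProcessor("A-standby", 1, "slave")
slave_b = AppProcessor("B-standby", 2, "slave")
promoted = notify_master_failure([master_a, slave_a, slave_b], failed=master_a)
```

Matching on processor ID keeps unrelated standby processors, such as `B-standby` here, in the slave role.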

As described above, when the slave processor is notified that the master processor has failed, the redundancy control module in the slave processor checks the input buffer and passes any messages that the master processor had not processed to the slave processor so that their processing is completed. The slave processor then takes over the operation of the master processor and performs the function without interruption.
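The take-over sequence in this paragraph, draining the input buffer before assuming the master role, can be sketched in Python as below. The function `take_over`, the `handle` callback, and the sample messages are hypothetical; the specification does not describe the buffer's data structure.

```python
from collections import deque


def take_over(input_buffer, handle):
    """Process messages the failed master left unprocessed, then become master."""
    results = []
    while input_buffer:
        message = input_buffer.popleft()    # oldest unprocessed message first
        results.append(handle(message))     # slave completes the processing
    return "master", results                # role after the buffer is drained


pending = deque(["msg-7", "msg-8"])         # messages the master never processed
new_role, processed = take_over(pending, handle=str.upper)
```

Processing in arrival order is an assumption of this sketch; it keeps the slave's data consistent with what the master would have produced had it not failed.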

As described above, the present invention provides a software algorithm for redundancy that can be applied in common to systems requiring a redundancy function, so that development efficiency and consistency can be maintained. In addition, because the redundancy method proposed for implementing the fault tolerance function is software-based, it requires neither modification of existing hardware nor additional hardware. A fault tolerance function can thus be provided by software techniques alone, and by keeping the function running without interruption, data consistency is maintained and fast reconfiguration and restart upon failure are achieved. Furthermore, the present invention reduces development risk and, by developing in-house core software that only a few leading companies possess, lays a foundation for technological self-reliance.

FIG. 1 is a block diagram of inter-processor redundancy in a large system according to an embodiment of the present invention;

FIG. 2 is a structural diagram of a processor according to an embodiment of the present invention;

FIG. 3 is a flowchart of a redundancy method in a master processor according to an embodiment of the present invention;

FIG. 4 is a block diagram of redundancy between a master processor and a slave processor according to an embodiment of the present invention;

FIG. 5 is a message transmission flowchart for redundancy between a master processor and a slave processor according to an embodiment of the present invention;

FIG. 6 is a flowchart of a fault tolerance method according to an embodiment of the present invention.

Claims

A redundancy method for implementing a fault tolerance function in a large system having a plurality of application processors and a monitoring processor connected to the plurality of application processors to monitor the operation of the application processors, the method comprising:

Determining a master processor and a slave processor for at least two application processors performing the same function among the plurality of application processors;

Synchronizing data by having the master processor transmit changed data to the slave processor whenever there is a change in data that must be kept identical between the master processor and the slave processor;

Transmitting, by the monitoring processor, a message notifying the non-operation of the master processor to a slave processor when a failure occurs in the master processor;

And performing, by the slave processor, the same function as the master processor after analyzing the message.

The method of claim 1, wherein the determining of the master processor and the slave processor comprises:

Requesting registration through the monitoring processor when a function of any one of the plurality of application processors is performed;

In response to the registration request, determining, by the monitoring processor, whether another processor having the same processor-identifying ID is operating, through a communication channel used for communication with another monitoring processor;

And operating the one processor as one of a master processor and a slave processor in response to the determination result.

The method of claim 1, wherein synchronizing the data comprises:

Transmitting, by the master processor, change message information according to a change of data to the slave processor;

Temporarily storing, by the slave processor, the received change message;

And, when the slave processor receives completion message information indicating completion of message processing from the master processor, updating the data of the slave processor to be identical to the data of the master processor by using the temporarily stored message.