KR0176085B1 - Error detecting method of processor node and node network of parallel computer system

Info

Publication number: KR0176085B1
Application number: KR1019950006002A
Authority: KR
Inventors: 김중배; 안대영; 박윤옥; 이상민
Original assignee: 양승택; 재단법인한국전자통신연구원
Priority date: 1995-03-21
Filing date: 1995-03-21
Publication date: 1999-05-15
Also published as: KR960035258A

Abstract

The present invention relates to a system configuration for parallel data processing in a computer system. It provides a parallel processing computer system, and an error detection method operated in that system, the system comprising: a predetermined number of bus groups, each composed of a predetermined number of input/output buses; a predetermined number of data storage means, connected one-to-one to the input/output buses of the bus groups so that data can be stored and accessed through any bus group; a predetermined number of data input/output control means, each connected one-to-one to a bus group and controlling data transmission to and reception from the data storage means through the input/output buses of that bus group; a predetermined number of processing node groups, each composed of first and second processing nodes that perform the same function and are connected by a separate serial communication line for exchanging status-diagnosis information; and a predetermined number of node interconnection networks for data connection between the processing node groups and the data input/output control means. The invention simplifies the system configuration to reduce cost, while allowing a fault in any resource of the system to be detected and recovered from without stopping the system and without loss of data.

Description

Error Detection Method of Processor Node and Node Connection Network in Parallel Processing Computer System

FIG. 1 is a simplified block diagram of a conventional parallel processing computer system employing processor duplication.

FIG. 2 is a simplified block diagram of a conventional parallel processing computer system employing processor triplication.

FIG. 3 is a partial simplified block diagram of a parallel processing computer system according to the present invention.

* Explanation of symbols for main parts of the drawings

PN1 to PNN, PNa to PNc, 10A, 10B, 20A, 20B: processing nodes
IN: network
BF: shared buffer
M1,2: memories
IOP1,2: input/output processors
IOB1,2: input/output buses
DB1,2: bus groups
D1,2: disks
ION1,2: input/output processor nodes
INN1,2: node interconnection networks
SL: serial communication line
SB1,2: system buses
Ba, Bb: buffer memories
NIF: interfacing circuit (Network Interface)
SIF: serial communication connection circuit (Serial Interface)

The present invention relates to a system configuration for parallel data processing in a computer system, and more particularly to a method of detecting errors in the processor nodes and the node interconnection network of a parallel processing computer system, in which a processor node that has suffered an operational error or fault is detected and isolated from the rest of the system, preventing malfunction of the system as a whole while keeping the system running.

In general, a parallel processing computer system has a plurality of processors, each of which performs its own function or a share of a single partitioned function, so that data are processed in parallel. Such a system executes quickly and can implement very complex functions, for example artificial intelligence.

However, because many processors are interconnected, a malfunction or failure of any one processor can cause the entire system to lose its function.

Two representative conventional remedies have been proposed for this drawback: a first scheme that duplicates, and a second scheme that triplicates, the processors performing each function in the parallel processing computer system.

In the first of these conventional schemes, as shown in FIG. 1, a plurality of processing nodes (PN1 to PNN) and third and fourth processing nodes (ON3, 4) each form one processing node group, and each processing node group is duplicated in that its two processing nodes are configured to perform the same function.

In addition, a distributed or shared buffer (BF) for storing the operation results and status information of the processing nodes is connected to the network (IN) that links the processing nodes (PN1 to PNN).

A parallel processing computer system according to the first scheme is simple in configuration and therefore inexpensive to install. However, when the amount of status information stored in the shared buffer (BF) increases, or when the interval between the checkpoints (Check Point) at which the state of each processing node is checked becomes shorter, the load on the network (IN) connecting the processing nodes increases and reduces the efficiency of the system.

Conversely, when the checkpoint interval becomes longer, the reliability of the whole system decreases.

Unlike the first scheme, the second scheme triplicates the processing nodes of each processing node group. As shown in FIG. 2, a plurality of processing node groups are connected to an auxiliary storage device (D) through two input/output buses (IOB1,2).

Each processing node group consists of three processing nodes (PNa to PNc) that perform the same function, first and second memories (M1,2) connected to the processing nodes (PNa to PNc), and first and second input/output processors (IOP1,2) connected to the memories (M1,2) through the input/output buses (IOB1,2).

The first input/output processor (IOP1) is connected to the first input/output bus (IOB1), and the second input/output processor (IOP2) is connected to the second input/output bus (IOB2). A parallel processing computer system according to the second scheme can easily detect a fault in a processing node group made up of the processing nodes (PNa to PNc), but it raises the cost of the system.

To solve these problems, an object of the present invention is to provide an error detection method for a parallel processing computer system that simplifies the system configuration so as to reduce its cost, and that detects a processing node group in which an operational error or fault has occurred and isolates it from the rest of the system, so that the user obtains normal output without the system being stopped for maintenance and malfunction of the system as a whole is prevented.

To achieve this object, a feature of the present invention is a parallel processing computer system having a plurality of processor nodes, comprising: a predetermined number of bus groups, each composed of a predetermined number of input/output buses; a predetermined number of data storage means, connected one-to-one to the input/output buses of the bus groups so that data can be stored and accessed through any bus group; a predetermined number of data input/output control means, each connected one-to-one to a bus group and controlling data transmission to and reception from the data storage means through the input/output buses of that bus group; a predetermined number of processing node groups, each composed of first and second processing nodes that perform the same function and are connected by a separate serial communication line for exchanging status-diagnosis information; and a predetermined number of node interconnection networks for data connection between the processing node groups and the data input/output control means.

An additional feature for achieving the above object is that each processing node of a processing node group comprises: a predetermined number of processors, each performing an individual function; a shared memory, connected to the system bus linking the processors, that stores the operating and status-diagnosis programs and the data generated by the processors; an interfacing circuit that has a predetermined number of channels, each connected to one node interconnection network, and that provides data connection over the system bus; a serial communication connection circuit for transmitting and receiving data over the serial communication line; and a buffer memory for storing the information transmitted and received through the serial communication connection circuit.

Another feature of the present invention for achieving the above object is a method of detecting errors in processor nodes in a parallel processing computer system having a plurality of processing node groups, each consisting of two processing nodes, the method comprising: a first step in which, when a specific event such as an interrupt occurs or at an arbitrary predetermined time, each processing node of a node group transmits the operation result and processor register values produced by the operation that both nodes are currently performing identically to the buffer memory of its partner processing node through the serial communication connection circuit; a second step of reading the operation result and register values that the partner processing node has sent to the buffer memory and comparing them with the node's own operation result and register values transmitted in the first step; a third step in which, if the data compared in the second step differ, an error is determined to have occurred in the node group and each processing node in that group stops its operation; a fourth step of causing any other processing node group to re-execute the operation performed in the first step, that is, the operation from the last comparison point judged normal up to the point at which the error occurred, and supplying the result to each processing node in the node group stopped in the third step; a fifth step in which each processing node in the stopped node group compares the normal operation result received in the fourth step with the operation result it produced itself; and a sixth step in which a node found in the fifth step to have produced a different result is regarded as faulty and the user is warned of the error.

Still another feature of the present invention for achieving the above object is a method of detecting errors in the processor node interconnection network in a parallel processing computer system that has a plurality of processing node groups, each consisting of two processing nodes, and a duplicated interconnection network linking the processor node groups, the method comprising: a first step in which, if errors occur persistently during data transmission on the network currently in use, that network is judged to be faulty, data transmission is carried out over the other network, and the user is warned of the error on the faulty network; a second step of selecting an arbitrary node group when it is difficult to determine in the first step which of the node interconnection networks is in error; a third step in which the two processing nodes of the selected node group each transmit the same data over a different network; a fourth step in which each of the two processing nodes compares the data received over the network other than the one on which it transmitted in the third step with the data it generated itself; and a fifth step in which, if the data compared in the fourth step differ, the network over which the differing data were received is judged to be faulty and the user is warned of the error.

A preferred embodiment of the present invention will now be described with reference to the accompanying drawings.

FIG. 3 is a partial simplified block diagram of a parallel processing computer system according to the present invention. The system comprises: two bus groups (DB1,2), each composed of two input/output buses; two auxiliary storage devices, or disks (D1,2), connected to the bus groups (DB1,2); two input/output processor nodes (ION1,2), each of which controls data transmission to and reception from the disks (D1,2) through the input/output buses of its bus group; two processing node groups, each made up of two processing nodes (10A,10B) (20A,20B) that perform the same function and are connected by a serial communication line (SL) for the necessary information exchange and diagnosis, so that each can diagnose the other; and two node interconnection networks (INN1,2) for data connection between the processing nodes (10A, 10B, 20A, 20B) and the input/output processor nodes (ION1,2).

Two such processing node groups are referred to in the present invention as a cluster (Cluster), and the parallel processing computer system according to the invention is made up of a plurality of clusters. Both interconnection networks (INN1,2) are connected in common to every cluster.

The paired processing nodes are given a serial communication line (SL) for the necessary information exchange and diagnosis, which reduces the load on the node interconnection networks and allows efficient diagnosis. The input/output nodes and input/output buses are duplicated so that each disk can be reached through different input/output nodes and input/output buses.

Each processing node has a predetermined number of processors (Pa1,Pa2) (Pb1,Pb2) and a shared memory, and is organized as a symmetric multiprocessing system (Symmetric Multi-Processor) using a common system bus (SB1,2).

As shown in FIG. 3, the processors (Pa1,2) (Pb1,2) in the processing nodes (10A,10B) communicate with the shared memory (Ma,Mb) over the system bus (SB1,2) to which they are connected. The purpose of this communication is to store the system operating and status-diagnosis programs, and the data generated by the processors (Pa1,2) (Pb1,2), in the shared memory (Ma,Mb).

Each processing node further comprises: an interfacing circuit (Network Interface: NIF) that provides data connection over the common system bus (SB1,2), with its first channel (CH0) connected to the first node interconnection network (INN1) and its second channel (CH1) connected to the second node interconnection network (INN2); a serial communication connection circuit (Serial Interface: SIF) for communication with the paired processing node; and a buffer memory (Ba,Bb) for storing the information transmitted and received through the serial communication connection circuit (SIF).

The purpose of communication between the processors (Pa1,Pa2) (Pb1,Pb2) and the buffer memories (Ba,Bb) is to transfer the operation results and processor register values produced by the processors (Pa1,2) of processing node 10A, through the serial communication connection circuit (SIF) and the serial communication line (SL), to the buffer memory (Bb) of the partner processing node 10B.

A preferred example of the operation of the parallel processing computer system configured as described above will now be described.

The general, that is, normal operation of the system is basically the same as that of a conventional system and is therefore omitted. The following description details the error detection process carried out for each component.

The two processing nodes of a processing node group (for example, 10A and 10B) execute the same program in duplicate. When a specific event such as a memory write/read, an interrupt, or an input/output request occurs, or at suitable intervals, each of the two nodes (10A,10B) sends the result of its execution to its partner over the serial communication line (SL); for example, the result is stored in the buffer memory (Bb) in processing node 10B.

After this exchange each processing node carries on with its next task. If, however, the result produced by a processing node differs from the result received through the serial communication connection circuit (SIF) and stored in its buffer memory, either that node or its partner has malfunctioned, and the processing node group stops operating.

Another processing node in the cluster is then made to re-execute the function in question, and the re-executed result is supplied to the stopped nodes. A node whose own result matches the re-executed result is recognized as normal; a node that produced a different result is regarded as faulty.

Note that the result stored in a node's own buffer memory was not produced by that node but by the partner node connected to it over the serial communication line (SL). From the standpoint of the buffered data, therefore, the normal node is the one whose buffer contents differ from the result received from the other processing node.

Accordingly, a stopped node compares the data in its own buffer memory with the result received from the other processing node, and resumes operation if they differ.
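The pairwise comparison and the arbitration by a re-executed result can be illustrated with a short Python sketch. It is only a minimal model under stated assumptions: the `Node` class, the `exchange` and `diagnose` functions, and the representation of results as byte strings are hypothetical names introduced here and do not appear in the patent; the serial communication line (SL) is modeled as a plain assignment into the partner's buffer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical model of one processing node of a pair."""
    name: str
    own_result: bytes      # result and register values this node computed itself
    buffer: bytes = b""    # partner's result, as received over the serial line SL
    halted: bool = False


def exchange(a: Node, b: Node) -> None:
    """Each node stores its result in the partner's buffer memory."""
    a.buffer, b.buffer = b.own_result, a.own_result


def diagnose(a: Node, b: Node, reference: bytes) -> List[str]:
    """Return the names of the nodes judged faulty.

    `reference` stands for the result re-executed by another processing
    node group; it is consulted only when the pair disagrees.
    """
    exchange(a, b)
    if a.own_result == a.buffer:   # pair agrees, nothing to do
        return []
    a.halted = b.halted = True     # mismatch: the whole node group stops
    faulty = []
    for node in (a, b):
        if node.own_result == reference:
            node.halted = False    # matches the re-executed result: resume
        else:
            faulty.append(node.name)
    return faulty
```

For example, `diagnose(Node("10A", b"\x01"), Node("10B", b"\x02"), b"\x01")` reports only 10B and lets 10A resume. When exactly one node of the pair is faulty, testing whether the buffered partner result differs from the reference gives the same verdict as testing whether the node's own result matches it, which corresponds to the buffer-based comparison described above.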

The duplicated node interconnection networks (INN1,2) detect errors arising during data transmission using an error detection scheme such as parity, checksum, or ECC. If such errors occur persistently on one network (say, the first interconnection network INN1), the first interconnection network (INN1) is judged to be faulty.

Use of the first interconnection network (INN1) so judged is discontinued, data transmission is carried out using only the second interconnection network (INN2), and the user is warned of the network error so that maintenance can be performed.
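A minimal Python sketch of this failover rule follows. The patent does not say how many errors count as persistent, so the `threshold` of consecutive failed transfers is an assumption, as are the class name `DualNetwork`, the `record_transfer` method, and the simple 8-bit checksum used in place of whichever parity, checksum, or ECC scheme a real system would use.

```python
from typing import List, Optional


def checksum_ok(payload: bytes, checksum: int) -> bool:
    """Hypothetical 8-bit additive checksum, standing in for parity or ECC."""
    return sum(payload) % 256 == checksum


class DualNetwork:
    """Tracks consecutive transfer errors on the active network."""

    def __init__(self, threshold: int = 3) -> None:
        self.active: str = "INN1"
        self.standby: Optional[str] = "INN2"
        self.threshold = threshold          # assumed meaning of "persistent"
        self.consecutive_errors = 0
        self.alerts: List[str] = []         # warnings raised to the user

    def record_transfer(self, ok: bool) -> str:
        """Record one transfer outcome and return the network to use next."""
        if ok:
            self.consecutive_errors = 0
            return self.active
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.threshold and self.standby is not None:
            self.alerts.append(f"{self.active} judged faulty")
            self.active, self.standby = self.standby, None
            self.consecutive_errors = 0
        return self.active
```

A successful transfer resets the counter, so isolated transmission errors do not trigger a switch; only an unbroken run of `threshold` failures does.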

However, if it is difficult to determine which of the node interconnection networks (INN1,2) is in error, the first processing node (10A) and the second processing node (10B), which are linked by the serial communication line (SL), are used: both nodes send the same data, each compares the data it receives with the data it generated itself, and the network over which differing data were received is judged to be faulty.

That is, the first processing node (10A) transmits data to the first interconnection network (INN1) through the first channel (CH0) of its interfacing circuit (NIF), receives the data arriving from the second interconnection network (INN2) through the second channel (CH1), and checks whether the received data are identical to the data it is transmitting through the first channel (CH0).

If the compared data are identical, the second interconnection network (INN2) is sound; if they differ, the second interconnection network (INN2) is in an error state.

The state of the first interconnection network (INN1), in turn, is checked by the second processing node (10B), by the same procedure as that described above for the second interconnection network (INN2).
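The cross-check of the two networks can be sketched in Python as follows. This is an illustrative model only: each network is represented by a hypothetical function that takes the transmitted bytes and returns what arrives at the other end, and the name `cross_check` is not from the patent.

```python
from typing import Callable, Dict

# A network is modeled as a function from transmitted bytes to received bytes.
Network = Callable[[bytes], bytes]


def cross_check(data: bytes, inn1: Network, inn2: Network) -> Dict[str, bool]:
    """Return, for each network, whether it delivered the data intact.

    Node 10A transmits `data` on INN1 (channel CH0) and node 10B transmits
    the same `data` on INN2 (channel CH1). Each node then compares what it
    receives on the other network with the data it generated itself.
    """
    received_by_10b = inn1(data)   # 10B listens on INN1 and judges INN1
    received_by_10a = inn2(data)   # 10A listens on INN2 and judges INN2
    return {
        "INN1": received_by_10b == data,
        "INN2": received_by_10a == data,
    }
```

With a healthy INN1 and an INN2 that flips a bit, the sketch reports INN1 as sound and INN2 as faulty. Because both nodes of the pair generate identical data, each node can act as the reference for the network on which it did not transmit.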

The input/output system has two input/output processor nodes (ION1,2) connected through the node interconnection networks (INN1,2), and is of the shared-disk type, in which all processing nodes share the input/output system.

Each input/output processor node basically provides two input/output buses (IOB1,2), through which it connects to input/output devices such as magnetic disks or tape devices.

The disks (D1,2) used as input/output devices are dual-ported and, as shown in FIG. 3, each is attached to input/output buses provided by the two input/output nodes, so that if an error occurs in one input/output node or bus the disk can still be reached through the other input/output node or data path.

The data on the disks are also duplicated (Disk mirroring), so that even if one disk fails, loss of data is prevented and recovery is easy.
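The combination of dual-ported disks and mirroring amounts to having several independent routes to the same data. The following Python sketch shows one way such a read with fallback might look; the `PathError` exception, the `read_block` function, and the path labels are hypothetical and serve only to illustrate the idea.

```python
from typing import Callable, List, Sequence, Tuple


class PathError(Exception):
    """Raised when an I/O node, bus, or disk on a path does not respond."""


# A reader takes a block number and returns the block's contents.
Reader = Callable[[int], bytes]


def read_block(block_id: int,
               paths: Sequence[Tuple[str, Reader]]) -> Tuple[bytes, List[str]]:
    """Try each (label, reader) path in order until one succeeds.

    Returns the data together with the labels of the paths that failed,
    so that the failures can be reported to the user for maintenance.
    """
    failed: List[str] = []
    for label, reader in paths:
        try:
            return reader(block_id), failed
        except PathError:
            failed.append(label)
    raise PathError(f"block {block_id} unreachable via {failed}")
```

A caller would list the paths in order of preference, for instance the primary disk through each input/output node and then the mirror disk through each input/output node.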

When a node in the system fails, the user is notified and the failed node is electrically isolated from the system so that the user can replace the board. Once a new, working board is installed, full recovery is achieved by reconfiguring the system.

To this end, the system of the present invention is designed so that boards can be removed and installed while power is supplied to the system.

By providing the parallel processing computer system operating as described above and the error detection method operated in that system, the present invention simplifies the system configuration so as to reduce cost, while allowing a fault in any resource of the system to be detected and recovered from without stopping the system and without loss of data.

Claims

An error detection method for a processor node in a parallel processing computer system having a plurality of processing node groups each consisting of two processing nodes, the method comprising: a first step in which, when a specific event such as an interrupt occurs or at an arbitrary predetermined time, each processing node of a node group transmits the operation result and processor register values produced by the operation currently being performed identically in both nodes to the buffer memory of the partner processing node of the same node group over a serial communication line; a second step of receiving the operation result and register values produced by the partner processing node and comparing them with the node's own data transmitted in the first step; a third step in which, if the data compared in the second step differ, an error is determined to have occurred in the node group and each processing node in that node group stops its operation; a fourth step of causing another processing node group to re-execute the operation performed in the first step and then supplying the result to each processing node in the node group stopped in the third step; a fifth step in which each processing node in the node group stopped in the third step compares the result of the normally performed operation received in the fourth step with its own result; and a sixth step in which a node found in the fifth step to have produced a different result is regarded as faulty and the user is warned of the error.

An error detection method for a processor node interconnection network in a parallel processing computer system that has a plurality of processing node groups, each consisting of two processing nodes, and a duplicated interconnection network linking the processor node groups, the method comprising: a first step in which, if errors occur persistently during data transmission on the network currently in use, that network is judged to be faulty, data transmission is carried out over the other network, and the user is warned of the error on the faulty network; a second step of selecting an arbitrary node group when it is difficult to determine in the first step which of the node interconnection networks is in error; a third step in which the two processing nodes of the selected node group each transmit the same data over a different network; a fourth step in which each of the two processing nodes compares the data received over the network other than the one on which it transmitted in the third step with the data it generated itself; and a fifth step in which, if the data compared in the fourth step differ, the network over which the differing data were received is judged to be faulty and the user is warned of the error.