JP2006260007A

JP2006260007A - Collation processor

Info

Publication number: JP2006260007A
Application number: JP2005075004A
Authority: JP
Inventors: Yoichi Mizuko; 陽一水子
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-03-16
Filing date: 2005-03-16
Publication date: 2006-09-28

Abstract

PROBLEM TO BE SOLVED: To continue operation of a collation processor when it fails, without stopping tasks and with minimal performance degradation.

SOLUTION: When a control system CPU 9 detects a failure of the collation system 'CPU #0' 1 of 'collation system #0' 11, the collation system 'CPU #2' 3 of 'collation system #1' 12, which is a different node from the faulty 'collation system #0' 11, takes over the processing. In this case, the control system CPU 9 switches the memory map so that Memory 5 of the faulty collation system 'CPU #0' 1 can be accessed directly from the collation system 'CPU #2' 3 that takes over the processing. Once the memory map is switched, the collation system 'CPU #2' 3 directly references the data in Memory 5 and takes over the collation processing. While the collation system 'CPU #2' 3 performs the collation in its place, the control system CPU 9 resets the faulty collation system 'CPU #0' 1 to attempt its restoration.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a collation processing apparatus, and more particularly to a collation processing apparatus that performs failure handling in data collation.

In a conventional collation apparatus, when an intermittent failure occurs, for example because of a soft error, the collation in progress is interrupted and the node in which the failure occurred is disconnected so that the apparatus runs in degraded operation. However, when a failure occurs in one CPU of a collation system with a multi-CPU configuration, the healthy CPUs of that system are disconnected along with it, which is inefficient and causes a large drop in performance.

In a conventional collation processing apparatus, when a processor suffers an intermittent failure during collation, for example because of a soft error, the collation is interrupted, a register dump and a memory dump are output, and recovery is then attempted by resetting the processor. However, the collation in progress when the failure occurred is interrupted and its data is discarded, and after the CPU has been reset and restarted, the collation is redone from the data at which the failure occurred. With this approach, once the CPU is reset, processing is suspended until the CPU comes back up, and even after restart the collation must go back to the data where the error occurred and be performed again, so recovery takes a long time.

The present invention provides a failure processing apparatus that resets only the CPU in which a failure has occurred while keeping the healthy CPUs of the multi-CPU configuration in service, and that has another node take over and continue the interrupted processing without discarding or retrying it. This keeps performance degradation to a minimum and recovers from the failure autonomously, without the application needing to be aware of it.

The first collation processing apparatus of the present invention includes a mechanism that, when a failure occurs in a collation processor during collation processing, switches the memory map so that a processor of another node can snoop the faulty processor.

The second collation processing apparatus of the present invention detects a failed CPU, automatically switches the memory map, and has a processor of another node directly reference the failed CPU's local memory so that it can take over and continue the collation processing.

The first effect is that, in this system consisting of many processors, another processor takes over and continues the processing that the faulty processor was performing, so operation can continue with minimal performance degradation and without stopping tasks when a failure occurs.

The second effect is that availability is improved, because recovery is performed automatically without the upper-level application needing to be aware of it.

Next, the best mode for carrying out the present invention will be described in detail with reference to the drawings.

Referring to FIG. 1, the apparatus comprises 'collation system #0' 11, 'collation system #1' 12, ..., 'collation system #n' 1n, which perform the collation, and a control system 21 that controls the entire apparatus. Each of 'collation system #0' 11, 'collation system #1' 12, ..., 'collation system #n' 1n has a multi-CPU configuration connected through a HOST-PCI bridge 7, 8, ..., and each is connected to the control system 21 by a PCI bus.

The control system 21 holds the memory maps of 'collation system #0' 11, 'collation system #1' 12, ..., 'collation system #n' 1n as a table. The HOST-PCI bridges 7, 8, ... of the collation systems have internal registers that can be set by PCI configuration access, and the address map can be translated through these settings. When an intermittent error such as a soft error occurs in one of the collation system CPUs 'CPU #0' 1, 'CPU #1' 2, 'CPU #2' 3, 'CPU #3' 4, ..., that CPU writes a register dump to the log area of Memory 5 and stops processing. The control system CPU 9 monitors the state of each collation system CPU with a watchdog timer.
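The watchdog monitoring described above can be illustrated with a minimal sketch. The source does not say how the watchdog timer is implemented, so the `Watchdog` class, its `kick` and `expired` methods, the heartbeat model, and the 1-second timeout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Hypothetical model: tracks the last heartbeat time of each collation CPU."""
    timeout: float                                  # seconds of silence before a CPU counts as failed
    last_seen: dict = field(default_factory=dict)   # CPU name -> time of last heartbeat

    def kick(self, cpu: str, now: float) -> None:
        # A healthy collation CPU calls this periodically.
        self.last_seen[cpu] = now

    def expired(self, now: float) -> list:
        # Names of CPUs whose last heartbeat is older than the timeout, sorted.
        return sorted(cpu for cpu, t in self.last_seen.items() if now - t > self.timeout)


wd = Watchdog(timeout=1.0)
for cpu in ("CPU#0", "CPU#1", "CPU#2", "CPU#3"):
    wd.kick(cpu, now=0.0)

# CPU#0 has dumped its registers and halted, so only the others keep kicking.
for cpu in ("CPU#1", "CPU#2", "CPU#3"):
    wd.kick(cpu, now=1.5)

failed = wd.expired(now=2.0)
print(failed)
```

In this model the control system CPU would poll `expired` and start the takeover procedure for each CPU it returns.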

When processing is interrupted by a failure or the like, the control system processor 9 collects the dump written to the log area of the local Memory 5, local Memory 6, .... After collecting the dump, the control system processor 9 changes the settings so that the Memory of the failed collation system CPU is remapped into PCI memory space. Once that Memory has been mapped into PCI memory space, the control system processor 9 instructs another collation system CPU to take over the collation processing that had been in progress.

The other collation system CPU that receives the switching instruction directly references the remapped Memory of the failed collation system CPU through PCI memory space and continues the collation from the point of interruption on its behalf. Once the collation processing has been handed over to the other collation system and is continuing, the control system CPU 9 issues a reset to the failed collation system CPU and attempts recovery. When the reset CPU starts up normally, degraded operation is released from the next collation onward and normal operation resumes.

Next, the operation of the best mode for carrying out the present invention will be described with reference to the drawings.

The operation of FIG. 1 will now be described using the switching diagram of FIG. 2 and the switching flow of FIG. 3. In Step 1, when a failure occurs in the collation system 'CPU #0' 1 in the node 'collation system #0' 11, the collation system 'CPU #0' 1 dumps its registers to the log area of Memory 5 and stops.

When the control system CPU 9 detects the failure through the watchdog timer, it accesses the memory map setting register of the HOST-PCI bridge 7 of 'collation system #0' 11 by PCI configuration access and changes the memory map so that the local Memory of the failed collation system 'CPU #0' 1 is mapped into PCI memory space (Step 2). In Step 3, the control system CPU 9 sends a system call interrupt to 'CPU #2' 3 of 'collation system #1' 12, which is a different node, to switch the Memory address that it references from local Memory to PCI memory space, and instructs it to take over the collation processing.
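The effect of the memory map change in Step 2 can be sketched with a toy model. This is not the actual bridge interface: the `HostPciBridge` class, the `config_write_map` and `pci_read` methods, the window base address `0x8000_0000`, and the memory contents are all hypothetical, chosen only to show a local memory becoming readable through a PCI address window.

```python
class HostPciBridge:
    """Toy model of a HOST-PCI bridge that can expose local memory in PCI memory space."""

    def __init__(self, local_memory: bytearray):
        self.local_memory = local_memory
        self.pci_base = None    # None: local memory is not visible from the PCI side

    def config_write_map(self, pci_base):
        # Stands in for the PCI configuration access that sets the memory-map register.
        self.pci_base = pci_base

    def pci_read(self, pci_addr: int, length: int) -> bytes:
        # Translate a PCI address into an offset in the local memory and read it.
        if self.pci_base is None:
            raise PermissionError("local memory is not mapped into PCI memory space")
        offset = pci_addr - self.pci_base
        if offset < 0 or offset + length > len(self.local_memory):
            raise IndexError("address outside the mapped window")
        return bytes(self.local_memory[offset:offset + length])


memory5 = bytearray(b"collation-state")       # stand-in for CPU#0's local Memory 5
bridge7 = HostPciBridge(memory5)

bridge7.config_write_map(0x8000_0000)         # Step 2: control CPU remaps Memory 5
data = bridge7.pci_read(0x8000_0000 + 10, 5)  # another node reads through PCI space
print(data)
```

Before the register is written, a read from the PCI side is rejected, which mirrors the normal case in which each collation CPU sees only its own local memory.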

Having received the takeover instruction from the control system CPU 9 in Step 3, the collation system 'CPU #2' 3 directly accesses the local Memory of 'CPU #0' 1 mapped into PCI memory space and uses the data in this Memory to resume the interrupted collation processing. Once the collation processing interrupted by the failure has been handed over to 'CPU #2' 3, the control system CPU 9 issues a reset to the failed collation system 'CPU #0' 1 (Step 4).
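A small sketch can show why resuming from the failed CPU's memory avoids redoing work. The source does not describe the layout of the collation data, so the `collate` function, the `state` dictionary with its `next` and `hits` fields, and the `fail_at` fault injection are assumptions made for illustration.

```python
def collate(query, candidates, state, fail_at=None):
    """Compare `query` with each candidate, recording progress in `state` after every item.

    `state` stands in for the failed CPU's local memory; `fail_at` injects a fault
    at the given candidate index.
    """
    while state["next"] < len(candidates):
        i = state["next"]
        if fail_at is not None and i == fail_at:
            raise RuntimeError("intermittent fault")
        if candidates[i] == query:
            state["hits"].append(i)
        state["next"] = i + 1
    return state["hits"]


candidates = ["a", "b", "c", "b", "d", "b"]
memory5 = {"next": 0, "hits": []}     # progress kept in CPU#0's local memory

try:
    collate("b", candidates, memory5, fail_at=4)   # CPU#0 faults at item 4
except RuntimeError:
    pass

resumed_from = memory5["next"]
hits = collate("b", candidates, memory5)           # CPU#2 continues from the same memory
print(resumed_from, hits)
```

Here the second call picks up at item 4 and keeps the matches already found, instead of discarding them and starting over as in the conventional scheme.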

When the restart succeeds and 'CPU #0' 1 becomes Ready, the control system CPU 9 stops the collation on 'CPU #2' 3 and hands subsequent collation back to 'CPU #0' 1. When 'CPU #0' 1 completes the takeover, the control system CPU 9 again accesses the memory map setting register of the HOST-PCI bridge 7 of 'collation system #0' 11 by PCI configuration access and cancels the mapping of the local Memory of 'CPU #0' 1 into PCI memory space.

When the handover is complete, the control system CPU 9 again issues a system call interrupt to 'CPU #2' 3 and changes the Memory address that it references back to local Memory. With this scheme, collation that would previously have been interrupted when a failure occurred continues without interruption, and the impact is limited to the overhead of switching the processing and the performance loss of degraded operation.
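The order of the control system CPU's actions across the whole sequence can be summarized as a small state model. This is only a sketch of the ordering described in the text; the `Controller` class, its attribute names, and the log messages are hypothetical.

```python
class Controller:
    """Hypothetical model of the control system CPU's view of one failover cycle."""

    def __init__(self):
        self.log = []
        self.memory5_on_pci = False    # is CPU#0's local memory mapped into PCI space?
        self.owner = "CPU#0"           # CPU currently running the collation
        self.cpu2_reference = "local"  # memory that CPU#2 references

    def on_fault(self):
        self.log.append("watchdog: fault detected on CPU#0")
        self.memory5_on_pci = True
        self.log.append("Step 2: map CPU#0 local memory into PCI memory space")
        self.cpu2_reference = "pci"
        self.owner = "CPU#2"
        self.log.append("Step 3: CPU#2 switches to PCI memory space and takes over")
        self.log.append("Step 4: reset CPU#0")

    def on_ready(self):
        self.owner = "CPU#0"
        self.log.append("CPU#0 Ready: subsequent collation handed back to CPU#0")
        self.memory5_on_pci = False
        self.log.append("unmap CPU#0 local memory from PCI memory space")
        self.cpu2_reference = "local"
        self.log.append("CPU#2 switches back to its local memory")


ctl = Controller()
ctl.on_fault()
during = (ctl.owner, ctl.memory5_on_pci, ctl.cpu2_reference)
ctl.on_ready()
after = (ctl.owner, ctl.memory5_on_pci, ctl.cpu2_reference)
print(during, after)
```

The model makes the key ordering visible: the memory is remapped before the takeover instruction, the reset is issued only after the takeover, and the mapping is cancelled only after the collation has been handed back.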

Block diagram showing the configuration of the best mode for carrying out the present invention.
Explanatory diagram showing the recovery procedure.
Sequence diagram showing the operation of the best mode for carrying out the present invention.
Explanatory diagram of an example of changing the memory map.

Explanation of symbols

1 'CPU #0'
2 'CPU #1'
3 'CPU #2'
4 'CPU #3'
5 Memory
6 Memory
7, 8, ... HOST-PCI bridge
9 Control system CPU
11 'Collation system #0'
12 'Collation system #1'
1n 'Collation system #n'
21 Control system

Claims

A collation processing apparatus comprising means for switching a memory map so that a processor of another node can snoop a faulty processor when a fault occurs in the collation processor during the collation processing.

A collation processing apparatus that detects a failed CPU, automatically switches a memory map, and directly refers to a local memory from a processor of another node to continue the collation processing.