JP2006260007A - Collation processor - Google Patents

Info

Publication number
JP2006260007A
JP2006260007A JP2005075004A JP2005075004A JP2006260007A JP 2006260007 A JP2006260007 A JP 2006260007A JP 2005075004 A JP2005075004 A JP 2005075004A JP 2005075004 A JP2005075004 A JP 2005075004A JP 2006260007 A JP2006260007 A JP 2006260007A
Authority
JP
Japan
Prior art keywords
cpu
collation
memory
processing
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
JP2005075004A
Other languages
Japanese (ja)
Inventor
Yoichi Mizuko
陽一 水子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP2005075004A
Publication of JP2006260007A
Legal status: Withdrawn

Abstract

PROBLEM TO BE SOLVED: To keep a collation processor operating when it fails, without stopping tasks and with minimal performance degradation.

SOLUTION: When a control system CPU 9 detects a failure of collation 'CPU #0' 1 of 'collation system #0' 11, collation 'CPU #2' 3 of 'collation system #1' 12, a different system from the failed 'collation system #0' 11, takes over the processing. In this case, the control system CPU 9 switches the memory map so that Memory 5 of the failed 'CPU #0' 1 is directly visible to 'CPU #2' 3, which has taken over the processing. Once the memory map is switched, 'CPU #2' 3 directly references the data in Memory 5 and takes over the collation. While 'CPU #2' 3 performs collation in its place, the control system CPU 9 resets the failed 'CPU #0' 1 to attempt its restoration.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a collation processing apparatus, and more particularly to a collation processing apparatus that performs failure handling during data collation.

In a conventional collation apparatus, when an intermittent failure such as a soft error occurs, the collation in progress is interrupted and the node in which the failure occurred is disconnected, so that the apparatus continues in degraded operation. However, when a failure occurs in one CPU of a collation system with a multi-CPU configuration, the healthy CPUs are disconnected along with it, which is inefficient and causes a large drop in performance.

In a conventional collation processing apparatus, when a processor suffers an intermittent failure during collation, for example because of a soft error, the collation is interrupted, a register dump and a memory dump are output, and a reset is then applied in an attempt to recover. However, the collation in progress when the failure occurred is aborted and its data discarded, and once the CPU has finished restarting after the reset, collation is redone starting from the data on which the failure occurred. With this method, processing is suspended from the reset until the CPU has restarted, and even after the restart the CPU must go back to the data on which the error occurred and collate it again, so recovery takes a long time.

The present invention provides a failure processing apparatus that resets only the CPU in which a failure has occurred while keeping the healthy CPUs of a multi-CPU configuration in service, and that has another node take over and continue the interrupted processing instead of discarding and retrying it. This keeps performance degradation to a minimum and lets the apparatus recover from the failure autonomously, without the application having to be aware of it.

A first collation processing apparatus of the present invention includes a mechanism that, when a failure occurs in a collation processor during collation, switches the memory map so that a processor of another node can snoop the failed processor.

A second collation processing apparatus of the present invention detects a failed CPU, automatically switches the memory map, and has a processor of another node directly reference the local memory to take over and continue the collation.

The first effect is that, in this system made up of many processors, another processor takes over and continues the processing that the failed processor was performing, so operation can continue through a failure with minimal performance degradation and without stopping service.

The second effect is that availability is improved, because recovery is performed automatically without the host application having to be aware of it.

Next, the best mode for carrying out the present invention will be described in detail with reference to the drawings.

Referring to FIG. 1, the apparatus comprises 'collation system #0' 11, 'collation system #1' 12, ..., 'collation system #n' 1n, which perform collation, and a control system 21, which controls the entire apparatus. Each of the collation systems 11, 12, ..., 1n has a multi-CPU configuration behind a HOST-PCI bridge 7, 8, ..., and each is connected to the control system 21 by a PCI bus.

The control system 21 holds the memory maps of 'collation system #0' 11, 'collation system #1' 12, ..., 'collation system #n' 1n as a table. The internal registers of the HOST-PCI bridges 7, 8, ... of the collation systems can be set by PCI configuration access, and these settings change the address map. When an intermittent error such as a soft error occurs in one of the collation CPUs 'CPU #0' 1, 'CPU #1' 2, 'CPU #2' 3, 'CPU #3' 4, ..., that CPU writes a register dump to the log area of Memory 5 and stops processing. The control system CPU 9 monitors the state of each collation CPU with a watchdog timer.
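The watchdog monitoring described above can be illustrated with a minimal sketch. The patent does not specify how the watchdog timer is implemented, so the `WatchdogMonitor` class, its `kick` and `failed_cpus` methods, and the timeout value are all hypothetical; time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class WatchdogMonitor:
    """Hypothetical model of the control CPU's watchdog over the collation CPUs."""
    timeout: float                                   # seconds of silence tolerated
    last_kick: dict = field(default_factory=dict)    # cpu id -> time of last heartbeat

    def kick(self, cpu_id: str, now: float) -> None:
        # A healthy collation CPU reports in periodically.
        self.last_kick[cpu_id] = now

    def failed_cpus(self, now: float) -> list:
        # A CPU that has stopped (e.g. after dumping its registers) stops kicking,
        # so its last heartbeat becomes older than the timeout.
        return sorted(cpu for cpu, t in self.last_kick.items()
                      if now - t > self.timeout)


wd = WatchdogMonitor(timeout=1.0)
wd.kick("CPU#0", now=0.0)
wd.kick("CPU#2", now=0.0)
wd.kick("CPU#2", now=1.5)          # CPU#2 keeps running, CPU#0 has gone silent
suspects = wd.failed_cpus(now=2.0)
```

In this sketch only the silent CPU is reported, which matches the idea of isolating a single failed CPU rather than its whole node.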

When processing is interrupted by a failure or the like, the control system CPU 9 collects the dump written to the log area of local Memory 5, local Memory 6, .... After collecting the dump, the control system CPU 9 changes the settings so that the Memory of the failed collation CPU is remapped into PCI memory space. Once that Memory has been mapped into PCI memory space, the control system CPU 9 instructs another collation CPU to take over the collation that was in progress.

The collation CPU that receives the switching instruction directly references the remapped Memory of the failed collation CPU through PCI memory space, and takes over and continues the collation from the point where it was interrupted. Once the collation has been handed over to another collation system, the control system CPU 9 issues a reset to the failed collation CPU and attempts to restore it. If the reset CPU starts up normally, degraded operation ends and the CPU returns to service from the next collation.

Next, the operation of the best mode for carrying out the present invention will be described with reference to the drawings.

The operation of FIG. 1 will now be described using the switching diagram of FIG. 2 and the switching flow of FIG. 3. In Step 1, when a failure occurs in collation 'CPU #0' 1 in the node 'collation system #0' 11, 'CPU #0' 1 dumps its registers to the log area of Memory 5 and stops.

When the control system CPU 9 detects the failure through the watchdog timer, it accesses the memory map setting register of the HOST-PCI bridge 7 of 'collation system #0' 11 by PCI configuration access and changes the memory map so that the local Memory of the failed 'CPU #0' 1 is mapped into PCI memory space (Step 2). In Step 3, the control system CPU 9 sends a system call interrupt to 'CPU #2' 3 of 'collation system #1' 12, which is a different node, switches the Memory address that 'CPU #2' 3 references from its local Memory to PCI memory space, and instructs it to take over the collation.
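To make the memory map switch in Step 2 more concrete, here is a toy model of a bridge with a single remap setting. The patent gives no register layout or addresses, so `HostPciBridge`, `config_write_remap`, `translate`, and the address `0x8000_0000` are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostPciBridge:
    """Toy HOST-PCI bridge with one remap setting (hypothetical layout)."""
    local_size: int                    # size of the node's local Memory in bytes
    pci_base: Optional[int] = None     # None: local Memory is not visible on PCI

    def config_write_remap(self, pci_base: Optional[int]) -> None:
        # Stands in for the PCI configuration access made by the control CPU.
        self.pci_base = pci_base

    def translate(self, pci_addr: int) -> Optional[int]:
        """Return the local Memory offset for a PCI address, or None if unmapped."""
        if self.pci_base is None:
            return None
        offset = pci_addr - self.pci_base
        return offset if 0 <= offset < self.local_size else None


bridge_7 = HostPciBridge(local_size=0x1000)
before = bridge_7.translate(0x8000_0010)      # not exposed yet
bridge_7.config_write_remap(0x8000_0000)      # Step 2: expose local Memory on PCI
during = bridge_7.translate(0x8000_0010)      # now resolves to offset 0x10
bridge_7.config_write_remap(None)             # later: release the mapping
after = bridge_7.translate(0x8000_0010)
```

The same setting is written twice in the described flow: once to expose the failed CPU's Memory and once to release the mapping after the restored CPU is back in service.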

Having received the takeover instruction from the control system CPU 9 in Step 3, collation 'CPU #2' 3 directly accesses the local Memory of 'CPU #0' 1 mapped into PCI memory space and uses the data in that Memory to resume the interrupted collation. Once the collation interrupted by the failure has been handed over to 'CPU #2' 3, the control system CPU 9 issues a reset to the failed 'CPU #0' 1 (Step 4).
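The benefit of resuming from the failed CPU's Memory, rather than discarding the work and starting over, can be sketched as follows. The patent does not say how collation progress is laid out in Memory, so the `memory` dictionary with its `next` and `matches` fields, the `run_collation` function, and the `stop_after` fault injection are all assumptions made for this example.

```python
def run_collation(memory, candidates, probe, stop_after=None):
    """Continue the collation recorded in `memory`.

    `memory` stands in for the failed CPU's local Memory: it holds the index of
    the next candidate to compare and the matches found so far. `stop_after`
    simulates a fault after that many comparisons. Returns True when finished.
    """
    done = 0
    while memory["next"] < len(candidates):
        if stop_after is not None and done == stop_after:
            return False                      # simulated fault: progress stays in memory
        idx = memory["next"]
        if candidates[idx] == probe:
            memory["matches"].append(idx)
        memory["next"] = idx + 1
        done += 1
    return True


candidates = ["a", "b", "c", "b", "d"]
memory_5 = {"next": 0, "matches": []}         # stands in for Memory 5 of 'CPU #0'

# 'CPU #0' faults after two comparisons; its progress remains in memory_5.
finished_by_cpu0 = run_collation(memory_5, candidates, "b", stop_after=2)

# 'CPU #2' references the same Memory (through PCI memory space) and continues.
finished_by_cpu2 = run_collation(memory_5, candidates, "b")
```

Because the second call starts at the recorded index, the first two comparisons are not repeated, which is the contrast with the conventional reset-and-redo approach.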

When the restart succeeds and 'CPU #0' 1 becomes Ready, the control system CPU 9 stops collation on 'CPU #2' 3 and hands subsequent collation over to 'CPU #0' 1. When 'CPU #0' 1 has completed the handover, the control system CPU 9 again accesses the memory map setting register of the HOST-PCI bridge 7 of 'collation system #0' 11 by PCI configuration access and releases the mapping of the local Memory of 'CPU #0' 1 into PCI memory space.

When the handover is complete, the control system CPU 9 again issues a system call interrupt to 'CPU #2' 3 and changes the Memory address it references back to its local Memory. With this method, a collation that would previously have been aborted when a failure occurred continues without interruption, and the impact is limited to the overhead of switching the processing and the performance loss from degraded operation.
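The whole cycle, from fault detection to handing collation back, can be summarised as a small state machine. This is only one way to read the sequence; the `Phase` names, the event names, and the `step` function are hypothetical, and what happens if the reset does not succeed is not modelled because the description does not cover it.

```python
from enum import Enum, auto


class Phase(Enum):
    NORMAL = auto()
    REMAPPING = auto()
    TAKING_OVER = auto()
    RESETTING = auto()
    HANDING_BACK = auto()


# (current phase, event) -> (next phase, action taken by the control system CPU)
TRANSITIONS = {
    (Phase.NORMAL, "watchdog_timeout"):
        (Phase.REMAPPING, "collect dump; map failed CPU's local Memory into PCI memory space"),
    (Phase.REMAPPING, "remap_done"):
        (Phase.TAKING_OVER, "interrupt standby CPU: reference PCI memory space and take over"),
    (Phase.TAKING_OVER, "takeover_done"):
        (Phase.RESETTING, "reset failed CPU"),
    (Phase.RESETTING, "cpu_ready"):
        (Phase.HANDING_BACK, "stop standby CPU's collation; restored CPU takes later collations"),
    (Phase.HANDING_BACK, "handback_done"):
        (Phase.NORMAL, "release PCI mapping; interrupt standby CPU: reference local Memory again"),
}


def step(phase, event):
    """Return (next_phase, action); events that do not apply leave the phase unchanged."""
    return TRANSITIONS.get((phase, event), (phase, None))


phase, actions = Phase.NORMAL, []
for event in ["watchdog_timeout", "remap_done", "takeover_done", "cpu_ready", "handback_done"]:
    phase, action = step(phase, event)
    actions.append(action)
```

Keeping the ordering in a table makes the key constraint visible: the reset is issued only after the standby CPU has taken over, so collation never stops while the failed CPU restarts.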

A block diagram showing the configuration of the best mode for carrying out the present invention.
An explanatory diagram showing the recovery procedure.
A sequence diagram showing the operation of the best mode for carrying out the present invention.
An explanatory diagram of an example of changing the memory map.

Explanation of symbols

1 'CPU #0'
2 'CPU #1'
3 'CPU #2'
4 'CPU #3'
5 Memory
6 Memory
7, 8, ... HOST-PCI bridge
9 Control system CPU
11 'Collation system #0'
12 'Collation system #1'
1n 'Collation system #n'
21 Control system

Claims (2)

A collation processing apparatus comprising means for switching a memory map so that, when a failure occurs in a collation processor during collation, a processor of another node can snoop the failed processor.

A collation processing apparatus that detects a failed CPU, automatically switches a memory map, and has a processor of another node directly reference the local memory to take over and continue the collation.
Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2005075004A JP2006260007A (en) 2005-03-16 2005-03-16 Collation processor

Publications (1)

Publication Number Publication Date
JP2006260007A true JP2006260007A (en) 2006-09-28

Family

ID=37099212

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2005075004A Withdrawn JP2006260007A (en) 2005-03-16 2005-03-16 Collation processor

Country Status (1)

Country Link
JP (1) JP2006260007A (en)

Legal Events

Date Code Title Description
A300 Withdrawal of application because of no request for examination

Free format text: JAPANESE INTERMEDIATE CODE: A300

Effective date: 20080603