JP2011170755A

JP2011170755A - Device, method and program for processing memory failure

Info

Publication number: JP2011170755A
Application number: JP2010035991A
Authority: JP
Inventors: Hideyuki Wada; 英之和田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-02-22
Filing date: 2010-02-22
Publication date: 2011-09-01
Anticipated expiration: 2030-02-22
Also published as: JP5464347B2

Abstract

PROBLEM TO BE SOLVED: To prevent a system from stopping even when a failure occurs in a RAID controller unit.

SOLUTION: Upon receiving a memory write request from the arithmetic processor, a node controller transfers the request to each of the plurality of memory controllers connected to it and to the node controllers of the other memory failure processing devices connected to it. When a memory write request is transferred from the node controller of another memory failure processing device connected to it, the node controller transfers the request to each of the plurality of memory controllers connected to it. Each of the memory controllers, when the memory write request is transferred to it and the request is directed to a memory DIMM under its control, stores the data according to the write request.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to continuing system operation even when a multi-bit error occurs in a memory DIMM (Dual Inline Memory Module) or a memory controller fails.

In recent years, there has been demand for systems that do not stop operating when a failure occurs and that allow maintenance intervention. The same demand applies to memory, and one countermeasure is redundancy through memory RAID.

A technique for memory RAID redundancy is described in, for example, Patent Document 1. The technique described in Patent Document 1 proposes the following configuration.

In the memory system described in Patent Document 1, a first data memory is coupled to a first memory controller, a second data memory is coupled to a second memory controller, and a parity memory is coupled to a parity controller. The parity controller is directly coupled to the first and second memory controllers. Parity data control logic stores and retrieves parity information associated with the data in the first and second data memories, and interleaves, within the parity memory, the parity data associated with the data in the first data memory with the parity data associated with the data in the second data memory. According to Patent Document 1, this configuration avoids the system complexity otherwise required for memory-controller-level RAID.

Patent Document 1: JP 2005-108224 A

When a memory DIMM failure such as a multi-bit error, or a memory controller failure, occurs, the affected DIMM or memory controller can no longer guarantee the data under its control, which has caused system stoppages and similar problems. To avoid this, memory duplication and memory RAID are used, as in the technique described in Patent Document 1. However, because duplication, data striping, and parity generation and checking are centrally managed, the load concentrates on the parity generation unit and the data decoding unit, and performance degrades. In addition, because a single RAID memory controller controls multiple pieces of striping data, a system failure cannot be avoided when that controller fails. For example, in the technique described in Patent Document 1, parity data control logic 460 handles writes to memory, while the parity control logic in each D controller performs the parity check and data decoding of data read from memory. The memory control logic in each D controller also handles all accesses to the memory under it. (See FIG. 6, which corresponds to FIG. 4 of Patent Document 1.)

In other words, in typical memory RAID the RAID architecture is centrally managed, so system performance depends on the performance of the RAID controller. There is also a concern that a failure in the RAID controller unit will stop the system.

Accordingly, an object of the present invention is to provide a memory failure processing apparatus, a memory failure processing method, and a memory failure processing program that do not cause the system to stop when a failure occurs in a RAID controller unit.

According to a first aspect of the present invention, there is provided a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein: when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it; when the node controller receives a memory write request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it; and each of the plurality of memory controllers, when the memory write request is transferred to it and the write request is directed to the memory DIMM under its control, stores data in accordance with the write request.

According to a second aspect of the present invention, there is provided a memory failure processing system having a plurality of memory failure processing apparatuses, wherein each of the plurality of memory failure processing apparatuses is the memory failure processing apparatus described above.

According to a third aspect of the present invention, there is provided a memory failure processing method performed by a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein: when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it; when the node controller receives a memory write request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it; and each of the plurality of memory controllers, when the memory write request is transferred to it and the write request is directed to the memory DIMM under its control, stores data in accordance with the write request.

According to a fourth aspect of the present invention, there is provided a memory failure processing program that causes a computer to function as a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein: when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it; when the node controller receives a memory write request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it; and each of the plurality of memory controllers, when the memory write request is transferred to it and the write request is directed to the memory DIMM under its control, stores data in accordance with the write request.

According to the present invention, distributing the RAID function distributes and levels the RAID construction processing, so data can be reconstructed even if one of the memory controllers having the RAID function stops due to a failure or a data inconsistency occurs.

A diagram showing the basic configuration of an embodiment of the present invention.
A diagram schematically showing the exchange of data during write access.
A flowchart showing the basic operation of the embodiment of the present invention.
A diagram schematically showing the exchange of data during read access.
A flowchart showing the operation during write access.
A diagram for explaining a related technique (corresponding to FIG. 4 of Patent Document 1).

First, an outline of an embodiment of the present invention is given. In outline, the embodiment is as follows.

When the CPU writes data to memory, the node controller first receives the memory access and then transfers the transaction to all memory controllers. The memory controllers that receive the transaction share a common RAID architecture. Each one determines from the address whether the access targets memory under its control and, if so, writes the corresponding striping data.

When the CPU reads data from memory, as with a write, the node controller first receives the memory access and then transfers the transaction to all memory controllers. Each memory controller determines from the address whether the access targets memory under its control and, if so, returns the corresponding striping data. The node controller has a RAID architecture equivalent to that of the memory controllers; it collects the data returned from the memory controllers and returns the data to the CPU. When an uncorrectable error occurs in a memory DIMM or a memory controller fails, the node controller corrects the data and transfers it to the CPU.
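The routing rule in this outline can be illustrated with a minimal Python sketch. It is not taken from the patent: the class names, the `base`/`size` address-range model, and the method names are all assumptions made for illustration, and striping is left out so that only the forwarding and the per-controller address check are shown. Several controllers may be configured with the same range, which stands in for controllers that belong to the same stripe.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryController:
    name: str
    base: int   # first address this controller serves (assumed layout)
    size: int   # length of the served range
    dimm: dict = field(default_factory=dict)  # address -> stored bytes

    def handle_write(self, address: int, data: bytes) -> bool:
        # Every controller sees every request but acts only on its own range.
        if not (self.base <= address < self.base + self.size):
            return False
        self.dimm[address] = data
        return True


@dataclass
class NodeController:
    name: str
    controllers: list
    peers: list = field(default_factory=list)

    def write_from_peer(self, address: int, data: bytes) -> list:
        # Request forwarded by another CELL: pass it to local controllers only.
        return [mc.name for mc in self.controllers if mc.handle_write(address, data)]

    def write_from_cpu(self, address: int, data: bytes) -> list:
        # Request from a local CPU: pass it to local controllers and to every peer.
        accepted = self.write_from_peer(address, data)
        for peer in self.peers:
            accepted += peer.write_from_peer(address, data)
        return accepted
```

Because a forwarded request is never forwarded again, each request reaches every memory controller exactly once, and no single component decides where the data goes.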

Next, an embodiment of the present invention is described in detail with reference to the drawings.

FIG. 1 shows a system according to this embodiment that consists of a plurality of CELLs. The system has a first CELL 100, a second CELL 200, a third CELL 300, and a fourth CELL 400. Four CELLs are used here for convenience of explanation, but this does not limit the number of CELLs in this embodiment; the system may be implemented with a number of CELLs other than four.

The first CELL 100 has CPUs 111 to 114, a node controller 121, and memory controllers 131 and 132. Similarly, the second CELL 200 has CPUs 211 to 214, a node controller 221, and memory controllers 231 and 232. The third CELL 300 has CPUs 311 to 314, a node controller 321, and memory controllers 331 and 332. The fourth CELL 400 has CPUs 411 to 414, a node controller 421, and memory controllers 431 and 432. A memory DIMM is connected to each memory controller mounted on each CELL. Each CELL corresponds to the "memory failure processing apparatus" of the present invention, and a system combining a plurality of CELLs corresponds to the "memory failure processing system" of the present invention.

Next, the function of each of the units described above is explained.

CPUs 111 to 114, 211 to 214, 311 to 314, and 411 to 414 are arithmetic processing units. Each CPU issues access requests to memory via the node controller of its CELL. There are two kinds of access request: write and read.

Node controllers 121, 221, 321, and 421 receive access requests from the CPUs and transfer them to the other node controllers. The striping data and parity returned from the memory controllers are collected at the node controller serving the CPU that issued the read request, and that node controller assembles the data. If part of the data is missing because of a memory controller or memory DIMM failure, that node controller reconstructs the data and returns it to the requesting CPU.

When there is a write request, memory controllers 131, 132, 231, 232, 331, 332, 431, and 432 each determine from the address whether the access targets the memory DIMM under its control and, using the common RAID architecture, store only the striping data or parity for which that memory DIMM is responsible. When there is a read request, each memory controller likewise determines whether the access targets the memory DIMM under its control and, using the common RAID architecture, reads the striping data or parity for which it is responsible from that memory DIMM and transfers it to the node controller above it.

Next, the operation when a write access from a CPU to memory occurs is described with reference to FIG. 2, which schematically shows the exchange of data, and FIG. 3, a flowchart of the operation. A write access from CPU 212 is used as a concrete example, but this does not limit the operation of this embodiment; any CPU can perform write access to memory.

First, in a system consisting of the first CELL 100, the second CELL 200, the third CELL 300, and the fourth CELL 400, each having CPUs, a node controller, and memory controllers, a write access to memory is issued by CPU 212 on the second CELL 200 (step S11).

When this write access occurs, node controller 221 receives the write request from CPU 212 and then transfers it unchanged to the node controllers 121, 321, and 421 of the other CELLs (step S12).

Next, each node controller transfers the access unchanged to the memory controllers under it (step S13).

Each memory controller then determines, based on preset information, whether the access targets the memory DIMM under its control. If it does, the memory controller divides the data based on the preset information to generate striping data or parity, and stores the applicable data or parity in the memory DIMM (step S14). In this example, assume that CPU 212 has issued a memory write request to an address handled by five memory controllers: 131, 132, 231, 232, and 331. In that case the request is processed by only these five memory controllers, and striping data and, in some cases, parity are stored in the memory DIMM under each of them. Which part of the data, or the parity, is stored is determined from the address and a configured offset.
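Step S14 can be sketched as follows. The source does not state the parity function or how the address and offset select a piece, so this sketch assumes XOR parity and a RAID-5-style rotation by line index; `split_with_parity`, `piece_for`, and the `line_size` parameter are hypothetical names introduced only for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, members: int) -> list:
    """Split data into (members - 1) equal chunks and append one XOR parity chunk."""
    if members < 2:
        raise ValueError("need at least one data chunk and one parity chunk")
    n_data = members - 1
    if len(data) % n_data:
        raise ValueError("payload length must be a multiple of the data chunk count")
    size = len(data) // n_data
    chunks = [data[i * size:(i + 1) * size] for i in range(n_data)]
    return chunks + [reduce(xor_bytes, chunks)]


def piece_for(address: int, offset: int, data: bytes, members: int,
              line_size: int = 64) -> tuple:
    """Return (slot, piece) for the controller configured with this offset.

    Slot (members - 1) is the parity chunk. The slot is rotated by the line
    index so that parity moves across controllers (an assumed policy).
    """
    pieces = split_with_parity(data, members)
    slot = (offset + address // line_size) % members
    return slot, pieces[slot]
```

With five controllers configured with offsets 0 to 4, each one derives a different slot for the same request, so the five pieces are stored without any controller coordinating the others. A real controller would compute only its own piece; the sketch builds all of them for brevity.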

Next, the operation when a read access from a CPU to memory occurs is described with reference to FIG. 4, which schematically shows the exchange of data, and FIG. 5, a flowchart of the operation. As with write access, any CPU can perform read access to memory in this embodiment.

A read access to memory is issued by CPU 413 on the fourth CELL 400 (step S21).

When this read access occurs, node controller 421 receives the read request from CPU 413 and then transfers it unchanged to the node controllers 121, 221, and 321 of the other CELLs (step S22).

Next, each node controller transfers the access unchanged to the memory controllers under it (step S23).

Each memory controller then determines, based on preset information, whether the access targets the memory DIMM under its control and, if so, reads the data from the memory DIMM (step S24).

The data read from the memory DIMM is returned to the node controller with its lower address or a parity flag attached (Yes in step S25).

The data received by the node controllers is collected at node controller 421, which serves the issuing CPU 413. That node controller assembles the data and returns it to the requesting CPU 413 (step S26).

If the data cannot be returned because of a failure (No in step S25), the memory controller returns a response with an error flag attached. On receiving it, node controller 421 reconstructs the data from the remaining striping data and parity and returns it to the requesting CPU 413 (step S27).
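The assembly and reconstruction in steps S26 and S27 can be sketched in Python as below. The source does not specify the parity function, so single XOR parity is assumed here; the `assemble` function, the slot numbering (parity in the last slot), and the use of `None` to represent a response carrying the error flag are hypothetical choices for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def assemble(responses: dict, members: int) -> bytes:
    """Rebuild the requested data at the requesting node controller.

    responses maps slot -> bytes, or None when that controller returned an
    error flag (or never answered). Slot (members - 1) holds the parity.
    """
    missing = [s for s in range(members) if responses.get(s) is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost piece")
    pieces = dict(responses)
    if missing:
        # XOR of all surviving pieces (data and parity) yields the lost piece.
        survivors = [pieces[s] for s in range(members) if s != missing[0]]
        pieces[missing[0]] = reduce(xor_bytes, survivors)
    # Only the data slots are returned to the CPU; parity is dropped.
    return b"".join(pieces[s] for s in range(members - 1))
```

Because reconstruction runs in the node controller rather than in any memory controller, losing one memory controller or its DIMM removes one piece but not the ability to rebuild it.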

The embodiment of the present invention described above provides the following effects.

The first effect is that the time required for data striping and parity generation can be made shorter than with a centralized management scheme.

This is because distributing the RAID function distributes and levels the RAID construction processing.

The second effect is that data can be reconstructed even if one of the memory controllers having the RAID function stops due to a failure or a data inconsistency occurs.

This is because the unit that generates striping data and parity is separated from the unit that reconstructs data.

The memory failure processing apparatus according to the embodiment of the present invention can be implemented in hardware, but it can also be implemented by having a computer read, from a computer-readable recording medium, a program that causes the computer to function as the memory failure processing apparatus, and execute that program.

Likewise, the memory failure processing method according to the embodiment of the present invention can be implemented in hardware, but it can also be implemented by having a computer read, from a computer-readable recording medium, a program that causes the computer to execute the method, and execute that program.

Although the embodiment described above is a preferred embodiment of the present invention, the scope of the present invention is not limited to that embodiment alone, and the invention can be practiced with various modifications that do not depart from its gist.

Part or all of the above embodiment can also be described as in the following supplementary notes, but is not limited to them.

(Supplementary Note 1) A memory failure processing program that causes a computer to function as a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein
when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it,
when the node controller receives a memory write request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it, and
each of the plurality of memory controllers, when the memory write request is transferred to it and the write request is directed to the memory DIMM under its control, stores data in accordance with the write request.

(Supplementary Note 2) The memory failure processing program according to Supplementary Note 1, wherein
when the node controller receives a memory read request from the arithmetic processing unit, the node controller transfers the memory read request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it,
when the node controller receives a memory read request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory read request to each of the plurality of memory controllers connected to it, and
each of the plurality of memory controllers, when the memory read request is transferred to it and the read request is directed to the memory DIMM under its control, attempts to read data in accordance with the read request, returns the read data to the node controller if the read succeeds, and returns a failure notification to the node controller if the read fails.

(Supplementary Note 3) The memory failure processing program according to Supplementary Note 2, wherein
when the node controller receives a memory read request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the read data or the failure notification returned from the memory DIMM under its control to the node controller of the other memory failure processing apparatus connected to it, and
when the node controller itself receives a memory read request from the arithmetic processing unit, the node controller assembles or reconstructs data using the read data or failure notification transferred from the node controller of another memory failure processing apparatus connected to it and the read data or failure notification returned from the memory DIMM under its control, and returns the assembled or reconstructed data to the arithmetic processing unit that issued the memory read request.

(Supplementary Note 4) The memory failure processing program according to any one of Supplementary Notes 1 to 3, wherein
the data is stored by dividing the data based on preset information to generate striping data or parity, and storing the generated striping data or parity.

100 First CELL
111, 112, 113, 114, 211, 212, 213, 214, 311, 312, 313, 314, 411, 412, 413, 414 CPU
121, 221, 321, 421 Node controller
131, 132, 231, 232, 331, 332, 431, 432 Memory controller
200 Second CELL
300 Third CELL
400 Fourth CELL

Claims

A memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein
when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it and to the node controller of another memory failure processing apparatus connected to it,
when the node controller receives a memory write request transferred from the node controller of another memory failure processing apparatus connected to it, the node controller transfers the memory write request to each of the plurality of memory controllers connected to it, and
each of the plurality of memory controllers, when the memory write request is transferred to it and the write request is directed to the memory DIMM under its control, stores data in accordance with the write request.

2. The memory failure processing apparatus according to claim 1, wherein:
when the node controller receives a memory read request from the arithmetic processing unit, the node controller transfers the memory read request to each of the plurality of memory controllers connected to itself and to the node controller of another memory failure processing apparatus connected to itself;
when the node controller receives a memory read request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the memory read request to each of the plurality of memory controllers connected to itself; and
each of the plurality of memory controllers, when the memory read request has been transferred to it and the read request is for a memory DIMM under its control, attempts to read data in accordance with the read request, returns the read data to the node controller when the read succeeds, and returns a failure notification to the node controller when the read fails.

3. The memory failure processing apparatus according to claim 2, wherein:
when the node controller receives a memory read request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the read data or the failure notification returned from the memory DIMM under its control to the node controller of that other memory failure processing apparatus; and
when the node controller receives a memory read request from the arithmetic processing unit, the node controller assembles or reconstructs the data using the read data or failure notification transferred from the node controller of the other memory failure processing apparatus connected to itself and the read data or failure notification returned from the memory DIMM under its control, and returns the assembled or reconstructed data to the arithmetic processing unit that issued the memory read request.

4. The memory failure processing apparatus according to any one of claims 1 to 3, wherein the data is stored by dividing the data based on preset information to generate striping data or parity, and storing the generated striping data or parity.

5. A memory failure processing system comprising a plurality of memory failure processing apparatuses, wherein each of the plurality of memory failure processing apparatuses is the memory failure processing apparatus according to any one of claims 1 to 4.

6. A memory failure processing method performed by a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, wherein:
when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to itself and to the node controller of another memory failure processing apparatus connected to itself;
when the node controller receives a memory write request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the memory write request to each of the plurality of memory controllers connected to itself; and
each of the plurality of memory controllers, when the memory write request has been transferred to it and the write request is for a memory DIMM under its control, stores data in accordance with the write request.

7. The memory failure processing method according to claim 6, wherein:
when the node controller receives a memory read request from the arithmetic processing unit, the node controller transfers the memory read request to each of the plurality of memory controllers connected to itself and to the node controller of another memory failure processing apparatus connected to itself;
when the node controller receives a memory read request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the memory read request to each of the plurality of memory controllers connected to itself; and
each of the plurality of memory controllers, when the memory read request has been transferred to it and the read request is for a memory DIMM under its control, attempts to read data in accordance with the read request, returns the read data to the node controller when the read succeeds, and returns a failure notification to the node controller when the read fails.

8. The memory failure processing method according to claim 7, wherein:
when the node controller receives a memory read request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the read data or the failure notification returned from the memory DIMM under its control to the node controller of that other memory failure processing apparatus; and
when the node controller receives a memory read request from the arithmetic processing unit, the node controller assembles or reconstructs the data using the read data or failure notification transferred from the node controller of the other memory failure processing apparatus connected to itself and the read data or failure notification returned from the memory DIMM under its control, and returns the assembled or reconstructed data to the arithmetic processing unit that issued the memory read request.

9. The memory failure processing method according to any one of claims 6 to 8, wherein the data is stored by dividing the data based on preset information to generate striping data or parity, and storing the generated striping data or parity.

10. A memory failure processing program for a memory failure processing apparatus comprising an arithmetic processing unit, a node controller connected to the arithmetic processing unit, a plurality of memory controllers connected to the node controller, and a memory DIMM (Dual Inline Memory Module) under each of the plurality of memory controllers, the program causing a computer to function as a memory failure processing apparatus in which:
when the node controller receives a memory write request from the arithmetic processing unit, the node controller transfers the memory write request to each of the plurality of memory controllers connected to itself and to the node controller of another memory failure processing apparatus connected to itself;
when the node controller receives a memory write request from the node controller of another memory failure processing apparatus connected to itself, the node controller transfers the memory write request to each of the plurality of memory controllers connected to itself; and
each of the plurality of memory controllers, when the memory write request has been transferred to it and the write request is for a memory DIMM under its control, stores data in accordance with the write request.