JP2015041175A

JP2015041175A - Memory management device, control method, program, and recording medium

Info

Publication number: JP2015041175A
Application number: JP2013171050A
Authority: JP
Inventors: 泰彦田邉; Yasuhiko Tanabe
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2015-03-02
Anticipated expiration: 2033-08-21
Also published as: JP6306300B2

Abstract

PROBLEM TO BE SOLVED: To provide a memory management device which can efficiently use a physical memory even after a restart due to occurrence of memory failure. SOLUTION: A memory management device comprises: a memory module including a plurality of storage elements; memory determination means which, on the basis of pieces of identification information identifying the respective storage elements included in the memory module, determines a physical address assigned to the storage element identified by the identification information; and control means which performs control for using the storage element of specific physical address among the plurality of storage elements. When a failure occurs in at least one of the plurality of storage elements, the memory determination means holds the identification information of the storage element in which the failure occurs, and after a restart, determines a physical address of the storage element in which the failure has occurred on the basis of the held identification information. The control means performs control for using the storage element other than the storage element of the physical address determined by the memory determination means among the plurality of storage elements.

Description

The present invention relates to a memory management device, a control method, a program, and a recording medium.

A known way of handling a memory failure in a system is to detect the memory in which the failure occurred and to stop using that memory.

For example, Patent Document 1 describes a device that detects, during operation, a data storage mechanism in which a failure has occurred and replaces it with replacement data and usable data.

Patent Document 2 describes a cache memory device that, during operation, degenerates a block in which a failure has occurred, writes into the block degeneration information indicating that the block is degenerated, and determines the degeneration status of the block on the basis of the result of reading that degeneration information.

Patent Document 3 describes a display device that, when an abnormality is detected in a memory to which a drawing area is assigned, reassigns the drawing area to the part of the entire memory area that excludes the abnormal area.

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2011-521397
Patent Document 2: International Publication No. 2007/097019
Patent Document 3: JP 2007-219096 A

High-end servers operated in mission-critical fields are required to minimize the impact on the system when a failure occurs. In addition, when a failure on a high-end server brings the system down, the failed part may have to be replaced through maintenance.

However, when the system is, for example, a core business system, the operating conditions may make rapid recovery from the system down more important than performing maintenance. It is therefore necessary to keep the impact of the failure to a minimum and to operate the system stably even after the failure has occurred.

The techniques of Patent Documents 1 and 2 detect a block or the like in which a failure occurred during operation and then replace data or operate with the block degenerated. After the system has restarted, for example following a system down, however, the failed memory itself may be detected and the system may operate with that whole memory degenerated.

In recent years, the capacity of memory such as DIMMs (Dual Inline Memory Modules) has grown, and some single DIMMs now provide 64 GB. Consequently, when memory is degenerated as in the technique of Patent Document 2, one such DIMM is degenerated as a whole. As DIMM capacity increases, degenerating even a single DIMM may lower the performance of the entire system and make it impossible to operate the system stably.

The technique of Patent Document 3 reassigns the drawing area, within a specific area such as a drawing area or a display area, to the area excluding the abnormal area, but it does not disclose a method for making efficient use of the capacity of physical memory (for example, a DIMM).

The present invention has been made in view of the above problems, and its object is to provide a memory management device that uses physical memory efficiently even after a restart that follows a memory failure.

A memory management device according to one aspect of the present invention includes: a memory module including a plurality of storage elements; memory specifying means for specifying, from specific information that identifies each of the plurality of storage elements included in the memory module, the physical address assigned to the storage element identified by that specific information; and control means for controlling which physical addresses' storage elements, among the plurality of storage elements, are used. When a failure occurs in at least one of the plurality of storage elements, the memory specifying means holds the specific information of the storage element in which the failure occurred and, after a restart, specifies the physical address of that storage element from the held specific information. The control means performs control so that, among the plurality of storage elements, storage elements other than the storage element at the physical address specified by the memory specifying means are used.

A control method according to one aspect of the present invention is a control method for a memory management device provided with a memory module including a plurality of storage elements. When a failure occurs in at least one of the plurality of storage elements, the method holds specific information identifying the storage element in which the failure occurred; after a restart, it specifies, from the held specific information, the physical address assigned to the storage element identified by that specific information, that is, the physical address of the storage element in which the failure occurred; and it performs control so that, among the plurality of storage elements, storage elements other than the storage element at the specified physical address are used.

A program according to one aspect of the present invention causes a computer to execute: a process of holding, when a failure occurs in at least one of a plurality of storage elements included in a memory module, specific information identifying the storage element in which the failure occurred and, after a restart, specifying from the held specific information the physical address assigned to the storage element identified by that specific information, that is, the physical address of the storage element in which the failure occurred; and a process of performing control so that, among the plurality of storage elements, storage elements other than the storage element at the specified physical address are used.

According to the memory management device of the present invention, physical memory can be used efficiently even after a restart that follows a memory failure.

FIG. 1 is a block diagram showing the configuration of a memory management device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the functional configuration of the memory management device according to the embodiment.
FIG. 3 is a diagram showing an example of information indicating the DRAM number of each DRAM, the physical address assigned to each DRAM, and whether each DRAM can be used by the OS.
FIG. 4 is a flowchart showing the flow of processing of the memory management device according to the embodiment.
FIG. 5 is a diagram showing an example of information indicating, after a failure has occurred, the DRAM number of each DRAM, the physical address assigned to each DRAM, and whether each DRAM can be used by the OS.
FIG. 6 is a diagram showing an example of information indicating, for the comparative form after a failure has occurred, the DRAM number of each DRAM, the physical address assigned to each DRAM, and whether each DRAM can be used by the OS.

<Embodiment>
An embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a memory management device according to an embodiment of the present invention. As shown in FIG. 1, the memory management device 1 according to this embodiment includes a CPU 10 and a plurality of DIMMs 11 to 18. Although this embodiment is described using DIMMs as the memory modules, the present invention is not limited to this. Likewise, although the description uses eight DIMMs as an example, the present invention is not limited to this number.

The CPU 10 controls the entire memory management device 1. The DIMMs 11 to 18 are connected to the CPU 10 as shown in FIG. 1. Note that FIG. 1 shows a typical example of the connection between the CPU 10 and the DIMMs 11 to 18.

Each of the DIMMs 11 to 18 includes six storage elements (DRAMs 110 to 115, 120 to 125, 130 to 135, 140 to 145, 150 to 155, 160 to 165, 170 to 175, and 180 to 185, respectively). Although this embodiment is described using DRAM (Dynamic Random Access Memory) as the storage elements, the present invention is not limited to this. The number of DRAMs included in each DIMM is also not limited to six.

A physical address is assigned to each DRAM. In this embodiment, the term physical address refers to a physical address for real memory as assigned in the hardware of a typical personal computer or similar machine.

Next, the functional configuration of the memory management device 1 according to this embodiment will be described with reference to FIG. 2. FIG. 2 is a functional block diagram showing the functional configuration of the memory management device 1 according to this embodiment.

As shown in FIG. 2, the memory management device 1 includes a memory specifying unit 101 and a memory control unit 102. The memory specifying unit 101 and the memory control unit 102 are realized by the CPU 10. In FIG. 2, the DIMMs 11 to 18 of FIG. 1 are shown collectively as a DIMM group.

When a failure occurs in at least one of the DRAMs, the memory specifying unit 101 identifies the DRAM in which the failure occurred (the failed DRAM) and holds the location information of the identified failed DRAM. Here, the location information of a DRAM is information for identifying each DRAM, for example, information (specific information) indicating which DRAM on which DIMM the failed DRAM is. Although this embodiment is described using the DRAM number as the information identifying the location of each DRAM, the present invention is not limited to this.

The memory specifying unit 101 specifies the physical address of the failed DRAM from the held location information of that DRAM and notifies the memory control unit 102 of the physical address.

The operation of the memory specifying unit 101 may be implemented by the failure identification function of typical management firmware such as BMC FW (Baseboard Management Controller Firmware).

The memory control unit 102 controls which physical addresses' DRAMs, among the plurality of DRAMs, are used. That is, on the basis of the physical address of the failed DRAM notified by the memory specifying unit 101, the memory control unit 102 performs control so that the DRAM identified by that physical address is excluded from the memory management device 1. Specifically, the memory control unit 102 determines that the failed DRAM is not a DRAM usable by the memory management device 1 and starts the memory management device 1 using the DRAMs that are usable by its OS.

The memory control unit 102 may be a function included in the OS of the memory management device 1.

FIG. 3 is a diagram showing an example of information indicating the DRAM number of each DRAM, the physical address assigned to each DRAM, and whether each DRAM can be used by the OS.

The rightmost column of FIG. 3 shows information (availability information) indicating whether each DRAM can be used by the OS (Operating System) of the memory management device 1. In FIG. 3, a DRAM that the OS can use is marked "○" and a DRAM that it cannot use is marked "×". As FIG. 3 shows, the OS can use every DRAM.

As shown in FIG. 3, the physical address of each DRAM and its availability information are associated with the DRAM number that identifies the DRAM. The information shown in FIG. 3 may be recorded in a memory (not shown) or in a memory built into the CPU 10.
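As an illustration only, the association of FIG. 3 can be sketched as a small Python table. The names `DramEntry` and `build_table` and the sequential address assignment are assumptions made for this sketch; the description only states, in its later example, that DRAM 113 corresponds to physical address 0x00000003.

```python
from dataclasses import dataclass


@dataclass
class DramEntry:
    dram_number: int        # specific information, e.g. 113
    physical_address: int   # physical address assigned to the DRAM
    usable: bool = True     # availability information (the "○" / "×" column)


def build_table(dimm_numbers=range(11, 19), drams_per_dimm=6):
    """Build a FIG. 3 style table keyed by DRAM number.

    Assumption: addresses are handed out sequentially, so that
    DRAM 113 receives 0x00000003 as in the description's example.
    """
    table = {}
    address = 0
    for dimm in dimm_numbers:
        for position in range(drams_per_dimm):
            number = dimm * 10 + position   # DIMM 11 holds DRAMs 110 to 115
            table[number] = DramEntry(number, address)
            address += 1
    return table


table = build_table()
```

Keying the table by DRAM number mirrors the text: the held identifier is looked up first, and the physical address and availability follow from it.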

Next, the operation of the memory management device 1 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of processing of the memory management device 1 when a memory failure occurs. Each process in FIG. 4 is executed under program control of the CPU 10.

The memory failures considered here are as follows. This embodiment targets the failures that can occur when typical DIMMs are used: a DRAM single-bit error (a failure after which operation can continue) and a multi-bit error (a failure after which operation cannot continue).

The memory management device 1 of this embodiment performs the operation shown in FIG. 4 when such a memory failure occurs. In the following description, steps S1 to S7 of FIG. 4 are referred to simply as S1 to S7.

As shown in FIG. 4, when a failure occurs, the memory specifying unit 101 first identifies the failed DRAM (S1). The memory specifying unit 101 then holds the location information of the identified failed DRAM in a non-volatile memory or the like (not shown) (S2).

Thereafter, when the memory management device 1 restarts (S3) for any of various reasons, including the memory failure itself (for example, an OS update or the occurrence of another failure that prevents continued operation), the memory specifying unit 101 refers to the information shown in FIG. 3 and specifies the physical address of the failed DRAM from the location information held in S2 (S4).

The memory specifying unit 101 notifies the memory control unit 102 of the physical address of the failed DRAM specified in S4 (S5). The memory control unit 102 then excludes the notified physical address from use by the memory management device 1 (S6) and starts the memory management device 1 using the DRAMs that are usable at that point (S7).
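The flow S1 to S7 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: a JSON file stands in for the non-volatile memory, the function names (`record_failure`, `load_failures`, `boot`) are hypothetical, and the address table covers only DIMM 11 with assumed addresses (only 113 → 0x00000003 comes from the description).

```python
import json
import os
import tempfile

# Hypothetical stand-in for the FIG. 3 table: DRAM number -> physical address.
# Only the pair 113 -> 0x00000003 is taken from the description.
ADDRESS_OF = {110: 0x0, 111: 0x1, 112: 0x2, 113: 0x3, 114: 0x4, 115: 0x5}


def load_failures(store_path):
    """Read the held DRAM numbers; an absent file means no failure so far."""
    if not os.path.exists(store_path):
        return []
    with open(store_path) as f:
        return json.load(f)


def record_failure(store_path, dram_number):
    """S1-S2: hold the failed DRAM's number so that it survives a restart."""
    failed = load_failures(store_path)
    if dram_number not in failed:
        failed.append(dram_number)
    with open(store_path, "w") as f:
        json.dump(failed, f)


def boot(store_path):
    """S4-S7: resolve held numbers to addresses and start without them."""
    excluded = {ADDRESS_OF[n] for n in load_failures(store_path)}  # S4, S5
    usable = sorted(a for a in ADDRESS_OF.values() if a not in excluded)  # S6
    return usable  # S7: the addresses the system starts with


# Usage: a failure on DRAM 113, followed by a restart (S3).
store = os.path.join(tempfile.mkdtemp(), "failed_drams.json")
usable_before_failure = boot(store)
record_failure(store, 113)
usable_after_restart = boot(store)
```

Holding the DRAM number rather than the address keeps the stored record tied to the physical part; the address is resolved again at every start.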

The operation of the memory management device 1 will now be described further, taking as an example a case in which a multi-bit error occurs in the DRAM 113 among the DRAMs shown in FIG. 1.

When the memory failure occurs, the memory specifying unit 101 first identifies the failed DIMM (DIMM 11 in this example) and the failed DRAM (DRAM 113). The memory specifying unit 101 then holds the location information (DRAM number) of the identified failed DRAM (DRAM 113).

Thereafter, when the device is restarted, either because the failure of the DIMM 11 has made continued operation impossible or for some other reason (such as an OS update), the memory specifying unit 101 specifies, at restart, the physical address "0x00000003" from the held location of the failed DRAM 113.

The memory specifying unit 101 notifies the memory control unit 102 of the specified physical address "0x00000003". The memory control unit 102 then performs control so that the memory management device 1 does not use the physical address "0x00000003" notified by the memory specifying unit 101. That is, the memory control unit 102 excludes the DRAM 113 at physical address "0x00000003" from the DRAMs used by the memory management device 1.

The memory control unit 102 then starts the memory management device 1 without performing any memory access to the DRAM 113.

As a result, the memory management device 1 can start and enter the operating state with only the DRAM 113 degenerated.

FIG. 5 shows an example of the information at this time: the DRAM number of each DRAM, the physical address assigned to each DRAM, and whether each DRAM can be used by the OS. As FIG. 5 shows, the column indicating whether the OS can use the DRAM is "×" in the row for the DRAM 113.
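A rough rendering of the FIG. 5 state for DIMM 11 might look like the following sketch. The function `render_table` and the row layout are hypothetical, and the addresses other than 0x00000003 for DRAM 113 are assumed for illustration.

```python
def render_table(address_of, failed):
    """Return one text row per DRAM: number, physical address, ○ or ×."""
    rows = []
    for number in sorted(address_of):
        mark = "×" if number in failed else "○"   # × means not usable by the OS
        rows.append(f"DRAM {number}  0x{address_of[number]:08X}  {mark}")
    return rows


# Assumed addresses for DIMM 11; only 113 -> 0x00000003 is from the text.
address_of = {110 + i: i for i in range(6)}
rows = render_table(address_of, failed={113})
```

Only the row for DRAM 113 carries the "×" mark; the other five DRAMs on the same DIMM stay usable.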

Because only the DRAM 113 is degenerated in this way, the only physical address the OS cannot use is the one corresponding to the failed DRAM 113, as shown in FIG. 5. In operation after the above processing, the memory capacity of the memory management device 1 is therefore smaller than before the memory failure by the capacity of one DRAM.

<Comparative form>
Next, a comparative form for comparison with the above embodiment of the present invention will be described. The memory management device 2 according to the comparative form is a memory management device of the prior art. Its hardware configuration is the same as that of the memory management device 1 of FIG. 1, so its description is omitted. The physical address of each DRAM of the memory management device 2 before the failure and the information indicating whether the OS can use each DRAM are assumed to be the same as in FIG. 3.

The operation of the memory management device 2 according to the comparative form when a memory failure occurs during operation is described below.

When, for example, a single-bit error occurs in the memory management device 2, degenerated operation is carried out by the preventive degeneration function, a typical high-end server function that degenerates memory before the error develops into a multi-bit error that would bring the system down. The preventive degeneration function is, for example, a function that reserves a DIMM for degeneration when single-bit errors occur frequently within a certain period (for example, 20 or more single-bit errors from the same DIMM within 24 hours) and degenerates that DIMM at the next restart (caused, for example, by an error due to another factor).
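The reservation rule in the example above (20 or more single-bit errors from the same DIMM within 24 hours) can be sketched as a sliding-window counter. The class and method names are hypothetical, timestamps are assumed to be seconds that never decrease, and real management firmware may count errors differently.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # 24 hours, from the example in the text
THRESHOLD = 20                  # 20 or more errors, from the example in the text


class PreventiveDegeneration:
    def __init__(self):
        self._events = defaultdict(deque)   # DIMM number -> error timestamps
        self.reserved = set()               # DIMMs to degenerate at next restart

    def report_single_bit_error(self, dimm, timestamp):
        """Record one error; return True if the DIMM is reserved for degeneration."""
        events = self._events[dimm]
        events.append(timestamp)
        # Drop errors that have fallen out of the 24-hour window.
        while events and timestamp - events[0] >= WINDOW_SECONDS:
            events.popleft()
        if len(events) >= THRESHOLD:
            self.reserved.add(dimm)
        return dimm in self.reserved


# Usage: one error per minute on DIMM 11 reaches the threshold on the 20th error.
monitor = PreventiveDegeneration()
results = [monitor.report_single_bit_error(11, minute * 60) for minute in range(20)]
```

A reservation is never cleared in this sketch, matching the text's statement that the DIMM is degenerated at the next restart.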

When, for example, a multi-bit error occurs in the memory management device 2, operation cannot continue, so the DIMM is degenerated after, for example, the restart that follows the system down.

In the memory management device 2 according to the comparative form, when a memory failure occurs in the DRAM 113 of the DIMM 11, the failed DIMM (DIMM 11) is identified by the failure identification function of management firmware (such as a BMC) that is common in high-end servers.

Thereafter, when the device is restarted, either because the failure of the DIMM 11 has made continued operation impossible or for some other reason (such as an OS update), the management firmware automatically degenerates the failed DIMM 11 at restart. The memory management device 2 then starts with the failed DIMM 11 degenerated and enters the operating state.

FIG. 6 shows an example of the information at this time: the physical address assigned to each DRAM and whether each DRAM can be used by the OS. As FIG. 6 shows, because the DIMM 11 is degenerated, the OS cannot use the physical addresses of the DRAMs 110, 111, 112, 113, 114, and 115, which include the failed DRAM 113. In operation after the failure, the memory capacity of the memory management device 2 according to the comparative form is therefore smaller than before the memory failure by the capacity of one DIMM.
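The difference between the two forms can be checked with a short sketch. The function `degenerate` is hypothetical, and deriving the DIMM from the DRAM number by integer division relies only on the reference numerals of FIG. 1 (DRAMs 110 to 115 belong to DIMM 11), which is an assumption about numbering rather than about real hardware.

```python
def degenerate(dram_numbers, failed_dram, per_dimm):
    """Return the DRAM numbers still usable after degeneration.

    per_dimm=True removes the whole DIMM (comparative form, FIG. 6);
    per_dimm=False removes only the failed DRAM (embodiment, FIG. 5).
    """
    if per_dimm:
        failed_dimm = failed_dram // 10   # DRAM 113 is on DIMM 11
        return [n for n in dram_numbers if n // 10 != failed_dimm]
    return [n for n in dram_numbers if n != failed_dram]


# Eight DIMMs (11 to 18) with six DRAMs each, as in FIG. 1.
all_drams = [dimm * 10 + pos for dimm in range(11, 19) for pos in range(6)]
embodiment = degenerate(all_drams, 113, per_dimm=False)
comparative = degenerate(all_drams, 113, per_dimm=True)
```

With the configuration of FIG. 1, the embodiment keeps 47 of 48 DRAMs in use, while the comparative form keeps 42.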

(Effect of the memory management device 1)
As described above, the memory management device 1 according to the embodiment of the present invention can use physical memory efficiently even after a restart that follows a memory failure.

This is because the memory specifying unit 101 identifies and holds the location of the failed DRAM and notifies the memory control unit 102 of that DRAM's physical address, so that the memory control unit 102 starts the memory management device 1 without accessing that physical address.

Degeneration in units of DRAMs is therefore possible even after the system restarts following a memory failure. This keeps the amount of degenerated memory to a minimum and allows physical memory to be used efficiently. In addition, because the OS of the memory management device 1 does not access the failed DRAM, the system can be operated stably.

The information shown in FIGS. 3 and 5, in which the physical address of a DRAM and its availability information are associated with the DRAM number identifying that DRAM, may be recorded in a memory (not shown) or may be recorded and managed in a memory built into the CPU 10.

When a failure occurs in at least one of the plurality of DIMMs, the memory control unit 102 may update the availability information of the DIMM in which the failure occurred to unusable. The memory control unit 102 may also refer to the availability information and perform control so that, among the plurality of DIMMs, the DIMMs that are usable by the OS of the memory management device 1 are used.

Even when such information is used, the memory management device 1 according to the embodiment can use physical memory efficiently after a restart that follows a memory failure.

Although the present invention has been described with reference to the embodiment, the present invention is not limited to the above embodiment. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

1 Memory management device
10 CPU
11 to 18 DIMM
101 Memory specifying unit
102 Memory control unit

Claims

1. A memory management device comprising:
a memory module including a plurality of storage elements;
memory specifying means for specifying, from specific information for identifying each of the plurality of storage elements included in the memory module, a physical address assigned to the storage element identified by the specific information; and
control means for controlling which physical addresses' storage elements, among the plurality of storage elements, are used,
wherein, when a failure occurs in at least one of the plurality of storage elements, the memory specifying means holds the specific information of the storage element in which the failure has occurred and, after a restart, specifies the physical address of the storage element in which the failure has occurred from the held specific information, and
the control means performs control so that, among the plurality of storage elements, a storage element other than the storage element of the physical address specified by the memory specifying means is used.

2. The memory management device according to claim 1, wherein the specific information of the storage element is associated with availability information indicating whether the storage element is usable in the memory management device, and
the control means controls which of the plurality of storage elements is used with reference to the availability information.

3. The memory management device according to claim 2, wherein, when a failure occurs in at least one of the plurality of storage elements, the control means updates the availability information of the storage element in which the failure has occurred to unusable.

4. A control method for a memory management device provided with a memory module including a plurality of storage elements, the method comprising:
holding, when a failure occurs in at least one of the plurality of storage elements, specific information for identifying the storage element in which the failure has occurred and, after a restart, specifying from the held specific information the physical address assigned to the storage element identified by the specific information, the physical address being that of the storage element in which the failure has occurred; and
performing control so that, among the plurality of storage elements, a storage element other than the storage element of the specified physical address is used.

5. A program for causing a computer to execute:
a process of holding, when a failure occurs in at least one of a plurality of storage elements included in a memory module, specific information for identifying the storage element in which the failure has occurred and, after a restart, specifying from the held specific information the physical address assigned to the storage element identified by the specific information, the physical address being that of the storage element in which the failure has occurred; and
a process of performing control so that, among the plurality of storage elements, a storage element other than the storage element of the specified physical address is used.

6. A recording medium on which the program according to claim 5 is recorded.