JP2009205212A

JP2009205212A - Memory fault processing system, memory fault processing method, and memory fault processing program

Info

Publication number: JP2009205212A
Application number: JP2008043924A
Authority: JP
Inventors: Eiichiro Kawaguchi; 英一郎川口
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-02-26
Filing date: 2008-02-26
Publication date: 2009-09-10
Anticipated expiration: 2028-02-26
Also published as: JP5272443B2

Abstract

PROBLEM TO BE SOLVED: To determine whether to continue or stop a process using a memory region according to the use of the process when a correctable defect of the memory occurs.

SOLUTION: "A defect frequency" and "a defect threshold" are added to an entry registered in a TLB. The defect frequency holds the frequency of the correctable defect occurring in a corresponding physical address space. The defect threshold holds the maximum number of correctable defects permitted by the corresponding process. Also, a defect frequency monitoring mechanism connected to the TLB is provided with a function for comparing the defect frequency with the defect threshold and, when the defect frequency exceeds the defect threshold, issuing a signal to stop the process.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a memory fault processing system, and more particularly to a memory fault processing system for keeping processes in operation.

Usually, processes with various purposes run in the same computer system. Some of these processes must keep running no matter how many correctable failures occur, while others may be stopped immediately to avoid escalation of a correctable failure into an uncorrectable failure.

Here, a correctable memory failure means a failure whose correction has been completed by ECC (Error Correcting Code). Examples of processes that must keep running include the kernel and processes of application software operating in mission-critical areas. Conversely, examples of processes that may be stopped immediately include processes of application software that simply runs on processor idle time and can be re-executed if it fails.

Suppose a correctable failure occurs in the memory space used by a process that may be stopped immediately. If that process is left running and the correctable failure escalates into an uncorrectable failure, the system goes down merely because a process that could have been stopped was allowed to continue. As a result, even important processes are stopped.

As a related technique, Japanese Patent Laid-Open No. 04-343148 (Patent Document 1) discloses an exception handling method for an embedded multitask operating system.
In this related technique, when the same task issues an invalid system call a predetermined number of times in succession, the task is terminated.

Japanese Patent Laid-Open No. 11-175409 (Patent Document 2) discloses a memory control method.
In this related technique, when the cumulative number of failures in a physical page exceeds a predetermined value, substitute processing is started, and the operation completes after that processing has finished.

Patent Document 1: Japanese Patent Laid-Open No. 04-343148
Patent Document 2: Japanese Patent Laid-Open No. 11-175409

Usually, processes with various purposes run in the same computer system. Some of them are extremely important, while others are low-importance processes that merely use processor idle time. However, existing failure handling methods handle failures uniformly, so they cannot apply handling appropriate to the importance of each process.

The memory fault processing system of the present invention comprises a processor and a memory unit with a failure counter. The processor starts a process that uses a physical address space; registers, in an entry of a TLB (Translation Look-aside Buffer), a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures tolerated by the process; compares the failure count with the failure threshold when the failure count is updated; and stops the process if the failure count exceeds the failure threshold. Here, a correctable failure is a failure whose correction has been completed by ECC (Error Correcting Code). The memory unit with a failure counter has the physical address space, counts the failure count each time a correctable failure occurs, and notifies the processor of the failure count.

The memory fault processing method of the present invention includes the steps of: starting a process that uses the physical address space of a memory unit with a failure counter; registering, in an entry of a TLB (Translation Look-aside Buffer) held by a processor, a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures tolerated by the process; counting the failure count on the memory unit with a failure counter each time a correctable failure occurs and providing the failure count to the TLB; and comparing the failure count with the failure threshold when the failure count is updated on the TLB and stopping the process if the failure count exceeds the failure threshold. Here, a correctable failure is a failure whose correction has been completed by ECC (Error Correcting Code).

The memory fault processing program of the present invention is a program for causing a processor to execute the steps of: starting a process that uses the physical address space of a memory unit with a failure counter; registering, in an entry of a TLB (Translation Look-aside Buffer) held by the processor, a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures tolerated by the process; counting the failure count on the memory unit with a failure counter each time a correctable failure occurs and providing the failure count to the TLB; and comparing the failure count with the failure threshold when the failure count is updated on the TLB and stopping the process if the failure count exceeds the failure threshold. Here, a correctable failure is a failure whose correction has been completed by ECC (Error Correcting Code).

When a correctable memory failure occurs, whether to continue or stop the process using the affected memory area can be decided according to the purpose of that process.

In the memory fault processing system of the present invention, a "failure count" and a "failure threshold" are added to each entry registered in a TLB (Translation Look-aside Buffer). The failure count holds the number of correctable failures that have occurred in the corresponding physical address space. The physical address space is the memory space accessible by memory addresses. The failure threshold holds the maximum number of correctable failures tolerated by the corresponding process. A failure count monitoring mechanism connected to the TLB compares the two and, if the failure count exceeds the failure threshold, issues a signal to stop the process.

A first embodiment of the present invention is described below with reference to the accompanying drawings.
Referring to FIG. 1, the memory fault processing system of the present invention includes a processor 1 and a memory unit 2 with a failure counter.

The processor 1 includes a TLB 11 with a failure counter, a failure count monitoring unit 12, and an operating system (OS) 13.

In addition to the general TLB functions, the TLB 11 with a failure counter has the following three entries.
(1) The first entry is a process identifier entry, which identifies the process using a physical page.
(2) The second entry is a failure count entry, which holds the number of failures that have occurred in the physical page.
(3) The third entry is a failure threshold entry, which holds the number of failures that the process using the physical page can tolerate.

The process identifier entry is a unique value set by the operating system (OS) 13. The failure count entry is a value indicating, when a correctable memory failure occurs, how many such failures have occurred in the physical page so far. Here, a correctable memory failure means a failure whose correction has been completed by ECC (Error Correcting Code). Each time a correctable memory failure occurs, the memory unit 2 with a failure counter corrects the failure by ECC and updates the failure count. The failure threshold entry is a value specified when the process is started, according to the importance of the process. The specified value becomes the failure threshold. Here, these three entries are called TLB entries.
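As a rough illustration of the three added entries, the following Python sketch models one TLB entry as a record. The class name, field names, and the use of `math.inf` for an unlimited threshold are assumptions made for this sketch; the patent does not prescribe any particular encoding.

```python
import math
from dataclasses import dataclass


@dataclass
class TlbEntry:
    """One TLB entry extended with the three fields described above.

    All names are illustrative; the patent specifies no layout.
    """
    virtual_page_number: int              # ordinary TLB field
    physical_page_number: int             # ordinary TLB field
    process_id: int                       # unique value set by the OS
    failure_count: int = 0                # correctable failures seen on this page
    failure_threshold: float = math.inf   # tolerated failures, given at process start


# Example: a process that tolerates up to five correctable failures.
entry = TlbEntry(virtual_page_number=0x10, physical_page_number=0x2A,
                 process_id=7, failure_threshold=5)
```

Representing "no limit" as infinity keeps the later comparison uniform, since no finite failure count exceeds it.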

The failure count monitoring unit 12 compares the failure count in a TLB entry with the failure threshold. If the comparison shows that "failure count > failure threshold", it sends an interrupt signal to the processor 1 together with the process identifier of the process using the physical page.

The operating system (OS) 13 runs on the processor 1 and, on receiving an interrupt signal and a process identifier from the failure count monitoring unit 12, stops the corresponding process. In practice, however, the operating system (OS) 13 may also run outside the processor 1. For example, it may run on another processor that cooperates with the processor 1. The operating system (OS) 13 may also be read as firmware or application software.

The memory unit 2 with a failure counter includes a failure notification unit 21 and a memory main body 22.

Each time a correctable failure occurs in the memory main body 22, the failure notification unit 21 notifies the processor 1 of the event together with the failure address and the failure count. Here, the failure address is the address at which the failure occurred. The failure notification unit 21 may also, each time a correctable memory failure occurs, correct the failure by ECC and update the failure count before notifying the processor 1 of the failure address and the failure count.

The memory main body 22 has its physical address space divided into units of the minimum physical page size and has, for each unit, a counter that counts and holds the number of correctable failures that have occurred. Here, the memory main body 22 is divided into MM-0 through MM-N (N is arbitrary). The counter increments the failure count each time a correctable failure occurs.

Examples of the processor 1 include a processing device such as a CPU (Central Processing Unit) or a microprocessor, or a semiconductor integrated circuit (IC) having a similar function. Examples of the memory unit 2 with a failure counter include semiconductor storage devices such as RAM (Random Access Memory), EEPROM (Electrically Erasable and Programmable Read Only Memory), and flash memory. In practice, however, the invention is not limited to these examples.

A configuration example of the processor 1 is described in detail with reference to FIG. 2.
The TLB 11 with a failure counter stores, as TLB entries and in table form, the "virtual page number", "physical page number", "process identifier", "failure count", and "failure threshold".

The failure count monitoring unit 12 includes a comparison unit 121 and an interrupt notification unit 122.

The comparison unit 121 obtains the "process identifier", "failure count", and "failure threshold" from a TLB entry of the TLB 11 with a failure counter and compares the "failure count" with the "failure threshold". If the comparison shows that "failure count > failure threshold", it notifies the interrupt notification unit 122 of the "process identifier".

For example, suppose there is a process A that may be stopped immediately when a correctable failure occurs, such as a process that merely makes use of processor idle time. For such a process, the failure threshold of the TLB entry corresponding to the memory space in which the process runs is set to 0. If a correctable failure then occurs in the corresponding physical page PPN0, the failure count of the corresponding TLB entry is updated. When the failure count is subsequently compared with the failure threshold, the failure count exceeds the threshold, so a signal to stop process A is issued. As a result, the memory space in which the correctable failure occurred is no longer used, and escalation of the failure is avoided. For process B, the failure count is 4 and the failure threshold is 5; because the count is below the maximum number of correctable failures tolerated, processing continues. For process C, the failure threshold is ∞, which indicates a process that must keep running no matter how many correctable failures occur.
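The three cases above can be worked through with a small Python sketch of the "failure count > failure threshold" comparison. The function name is hypothetical, and the failure count of 100 used for process C is an arbitrary illustrative value that does not come from the text.

```python
import math


def should_stop(failure_count, failure_threshold):
    """Return True when the count strictly exceeds the threshold."""
    return failure_count > failure_threshold


# (failure count, failure threshold) per process.
processes = {
    "A": (1, 0),           # first correctable failure on PPN0, threshold 0
    "B": (4, 5),           # values given in the text
    "C": (100, math.inf),  # count is an assumed value; threshold is infinite
}

decisions = {name: should_stop(count, threshold)
             for name, (count, threshold) in processes.items()}
```

Only process A is flagged for stopping; B stays below its limit, and C can never exceed an infinite threshold.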

In response to the notification of the "process identifier" from the comparison unit 121, the interrupt notification unit 122 sends the "process identifier" and an interrupt signal to the operating system (OS) 13.

Next, the operation of this embodiment is described in detail.
The processing performed when a process is started is described with reference to the flowchart of FIG. 3.
(1) Step S101
In the process startup processing, a process start request is issued. This is basically the same as ordinary process startup, but in this embodiment either the importance of the process or the failure threshold itself corresponding to that importance is specified when the process is started. When the importance is specified, a predefined failure threshold corresponding to that importance is used. For simplicity, the following description assumes that the failure threshold itself is specified. The operating system (OS) 13 of the processor 1 starts the process with a failure threshold, based on the process start request and the specified failure threshold.
(2) Step S102
The process started with a failure threshold is registered in the TLB 11 with a failure counter by the operating system (OS) 13, and the identifier of the process and the failure threshold specified at startup are registered in the TLB entry at the same time. That is, the operating system (OS) 13 of the processor 1 registers at least the "process identifier" and the "failure threshold" of the process in a TLB entry of the TLB 11 with a failure counter. This completes the startup processing.
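Steps S101 and S102 might be sketched as follows in Python. The importance-to-threshold table, the function name, and the dictionary-based TLB are all assumptions for illustration; the text only states that a predefined threshold is used when an importance level is given.

```python
import math

# Hypothetical mapping from importance level to a predefined threshold.
IMPORTANCE_TO_THRESHOLD = {"low": 0, "normal": 5, "critical": math.inf}

tlb = []  # stand-in for the TLB with a failure counter


def start_process(process_id, physical_page, importance=None, threshold=None):
    """S101: resolve the failure threshold. S102: register the TLB entry."""
    if threshold is None:
        # Only an importance level was given, so use the predefined value.
        threshold = IMPORTANCE_TO_THRESHOLD[importance]
    entry = {
        "process_id": process_id,
        "physical_page": physical_page,
        "failure_count": 0,
        "failure_threshold": threshold,
    }
    tlb.append(entry)
    return entry


idle_job = start_process(1, physical_page=0, importance="low")
service = start_process(2, physical_page=1, threshold=5)
```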

Next, the series of processing performed when a failure actually occurs is described.
The processing on the memory unit is described with reference to the flowchart of FIG. 4.
(1) Step S201
First, when a correctable failure occurs, the memory failure is corrected on the memory unit 2 with a failure counter. The memory unit 2 with a failure counter corrects the memory failure based on ECC.
(2) Step S202
Next, the failure counter provided for each minimum physical page unit on the memory unit 2 with a failure counter is updated. Specifically, the memory unit 2 with a failure counter increments by one the counter corresponding to the minimum physical page that contains the address at which the correctable failure occurred. That is, among the counters provided for MM-0 through MM-N of the memory main body 22, the counter of the affected unit is incremented by one.
(3) Step S203
When the counter update is complete, the memory unit 2 with a failure counter notifies the processor 1 that a correctable failure has occurred. The failure notification unit 21 of the memory unit 2 with a failure counter notifies the TLB 11 with a failure counter of both the address at which the correctable failure occurred and the value of the failure counter. Processing then moves to the processor 1 side. The processing on the processor 1 is shown in FIG. 5.
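The memory-unit side of the flow (S201 to S203) can be sketched in Python as below. ECC correction itself is a hardware function and is not modeled; the 4096-byte page size, the class name, and the callback used as a stand-in for the failure notification unit 21 are assumptions of this sketch.

```python
PAGE_SIZE = 4096  # assumed minimum physical page size in bytes


class FailureCountingMemoryUnit:
    """Toy model of the memory unit with one counter per MM-0 .. MM-N."""

    def __init__(self, num_units, notify):
        self.counters = [0] * num_units
        self.notify = notify  # callable(address, count)

    def on_correctable_failure(self, address):
        # S201: the failure is corrected by ECC (not modeled here).
        # S202: increment only the counter of the unit holding the address.
        unit = address // PAGE_SIZE
        self.counters[unit] += 1
        # S203: report the failure address and the updated count.
        self.notify(address, self.counters[unit])


received = []
memory = FailureCountingMemoryUnit(
    num_units=8, notify=lambda addr, count: received.append((addr, count)))
memory.on_correctable_failure(0x2010)  # falls in unit 2
memory.on_correctable_failure(0x2F00)  # also unit 2
```

Both addresses fall in the same unit, so the second notification carries a count of 2 while all other counters stay at zero.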

The processing on the processor is described with reference to the flowchart of FIG. 5.
(1) Step S301
The processor 1 receives a failure occurrence notification from the memory unit 2 with a failure counter and accepts the failure address and the failure count. That is, the TLB 11 with a failure counter receives the failure address and the failure count from the failure notification unit 21 together with the failure occurrence notification. In this processing on the processor, the processing attributed to the TLB 11 with a failure counter may in practice be performed on the TLB 11 by the operating system (OS) 13.
(2) Step S302
Next, the processor 1 searches the TLB for the entry that contains the failure address and updates the failure count held in that entry. That is, the TLB 11 with a failure counter searches its TLB entries for the entry containing the failure address and updates the failure count held in the entry found.
(3) Step S303
When the failure count is updated, the processor 1 compares the failure count with the failure threshold. That is, when the failure count is updated in the TLB 11 with a failure counter, the comparison unit 121 of the failure count monitoring unit 12 compares the failure count with the failure threshold. This failure threshold is the one set in the flow of FIG. 3 and, like the failure count, is held in the TLB. If the comparison shows that the failure count is equal to or less than the failure threshold, the processor 1 ends the processing without doing anything further. Conversely, if the failure count exceeds the failure threshold, the processor 1 proceeds to the next step.
(4) Step S304
The processor 1 notifies the operating system (OS) 13 that the failure count has exceeded the failure threshold. That is, the comparison unit 121 of the processor 1 notifies the operating system (OS) 13, via the interrupt notification unit 122, that the failure count has exceeded the failure threshold. Because the identifier of the process using the physical page is registered in the TLB entry, the interrupt notification unit 122 also passes that process identifier to the operating system (OS) 13 when it sends the interrupt signal. When the notification from the processor 1 to the operating system (OS) 13 is issued, processing moves to the operating system (OS) 13 side. The processing on the operating system (OS) 13 is shown in FIG. 6.
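The processor-side flow (S301 to S304) could look roughly like this in Python. The `Entry` class, the 4096-byte page size used to map an address to a physical page number, and the `raise_interrupt` callback are hypothetical stand-ins for the TLB 11, the comparison unit 121, and the interrupt notification unit 122.

```python
import math
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for mapping an address to a page number


@dataclass
class Entry:
    physical_page_number: int
    process_id: int
    failure_count: int = 0
    failure_threshold: float = math.inf


def handle_failure_notification(tlb, address, count, raise_interrupt):
    """S301: accept address and count. Returns True if an interrupt was raised."""
    ppn = address // PAGE_SIZE
    for entry in tlb:
        if entry.physical_page_number == ppn:
            entry.failure_count = count                       # S302: update
            if entry.failure_count > entry.failure_threshold:  # S303: compare
                raise_interrupt(entry.process_id)              # S304: notify OS
                return True
            return False
    return False  # no entry maps this page; nothing to do in this sketch


interrupts = []
tlb = [Entry(0, process_id=1, failure_threshold=0),
       Entry(1, process_id=2, failure_count=3, failure_threshold=5)]
handle_failure_notification(tlb, 0x0040, 1, interrupts.append)
handle_failure_notification(tlb, 0x1040, 4, interrupts.append)
```

Only the first notification raises an interrupt, because process 1 tolerates no failures while process 2 is still within its limit of five.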

The processing on the operating system (OS) is described with reference to the flowchart of FIG. 6.
(1) Step S401
The operating system (OS) 13 receives the process identifier together with the notification from the processor 1. Here, the operating system (OS) 13 receives the process identifier and the interrupt signal from the interrupt notification unit 122.
(2) Step S402
The operating system (OS) 13 forcibly terminates the process corresponding to the received process identifier and completes the processing.
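A toy Python model of steps S401 and S402 is shown below. It only simulates a process table; the class and method names are invented for this sketch, and a real operating system would terminate the process through its own kernel mechanisms.

```python
class OperatingSystem:
    """Minimal stand-in for the OS 13 that tracks running processes."""

    def __init__(self):
        self.running = {}  # process identifier -> description

    def start(self, process_id, description):
        self.running[process_id] = description

    def on_threshold_interrupt(self, process_id):
        """S401: receive the identifier. S402: forcibly end that process."""
        return self.running.pop(process_id, None)


os13 = OperatingSystem()
os13.start(1, "idle-time batch job")
os13.start(2, "kernel service")
stopped = os13.on_threshold_interrupt(1)
```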

The above is the operation of this embodiment.

In this embodiment, holding a counter and a threshold for correctable memory failures in each TLB entry makes it possible to handle failures according to the importance of each process.

A second embodiment of the present invention is described below.
In this embodiment, the failure count is also held on the real memory side, and memory is allocated according to the process at page-in and page-out. That is, when real memory is allocated at page-in, the failure count of the real memory is compared with the threshold of the process to which it is to be allocated. For an important process, such as one whose threshold is ∞, memory in which a failure has occurred in the past is not used.

The memory fault processing system in this embodiment includes a processor 1 and a memory unit 2 with a failure counter. The processor 1 and the memory unit 2 with a failure counter are basically the same as in the first embodiment.

具体的に、本実施形態の動作について説明する。
図７のフローチャートを参照して、ページインで実メモリを割り当てる処理について説明する。
（１）ステップＳ５０１
プロセッサ１は、ページインが開始された場合、まず、障害カウンタ付きメモリユニット２上のページイン対象になる物理メモリ空間を検索する。このとき、オペレーティングシステム（ＯＳ）１３は、ページインが開始された場合、メモリ本体２２から、ページイン対象になる物理メモリ空間を検索する。
（２）ステップＳ５０２
プロセッサ１は、検出された物理メモリ空間の障害が規定値以下であった場合には、ページインが実施可能であると判断する。このとき、オペレーティングシステム（ＯＳ）１３は、検出された物理メモリ空間の障害が規定値以下であった場合、検出された物理メモリ空間に対して、ページインが実施可能であると判断する。ここでは、規定値は、予め設定された値であり、例えば、閾値∞のような重要プロセスでは、”０”と設定される。この場合、プロセッサ１は、今までに一度も障害が発生していないメモリ空間を利用するようにする。
（３）ステップＳ５０３
プロセッサ１は、障害回数が規定値以下でなかった場合には、最大検索回数を調査し、最大検索回数以下の場合には、ページイン対象になる物理メモリ空間を検索する処理に戻り、再び、規定値が小さい物理メモリ空間の検出に努める。このとき、オペレーティングシステム（ＯＳ）１３は、障害回数が規定値以下でなかった場合には、最大検索回数を調査し、最大検索回数以下の場合には、再度、規定値が小さい物理メモリ空間を検索する。但し、仮に、一回も障害が発生したことのないページが存在しない場合等は、期待されたページが存在しないことになるので、ここでは、最大検索回数を設けており、プロセッサ１は、最大検索回数を超えた場合には、処理を継続し、やむを得ず障害が発生したことのある物理メモリ空間でもページインを実施することとする。このとき、オペレーティングシステム（ＯＳ）１３は、最大検索回数を超えた場合、障害が発生したことのある物理メモリ空間でもページインが実施可能であると判断する。
（４）ステップＳ５０４
プロセッサ１は、ページイン対象になる物理メモリ空間を決定する。このとき、オペレーティングシステム（ＯＳ）１３は、ページインが実施可能であると判断した物理メモリ空間を、ページイン対象になる物理メモリ空間として決定する。 Specifically, the operation of this embodiment will be described.
With reference to the flowchart of FIG. 7, a process of allocating real memory by page-in will be described.
(1) Step S501
When page-in is started, the processor 1 first searches for a physical memory space to be paged-in on the memory unit 2 with a fault counter. At this time, when page-in is started, the operating system (OS) 13 searches the memory main body 22 for a physical memory space to be paged-in.
(2) Step S502
When the detected failure in the physical memory space is equal to or less than the specified value, the processor 1 determines that page-in can be performed. At this time, the operating system (OS) 13 determines that the page-in can be performed on the detected physical memory space when the detected failure of the physical memory space is equal to or less than the specified value. Here, the specified value is a preset value, and is set to “0” in an important process such as a threshold ∞, for example. In this case, the processor 1 uses a memory space that has never failed.
(3) Step S503
If the number of failures is not less than or equal to the specified value, the processor 1 checks the maximum number of searches. If the number of failures is less than or equal to the maximum number of searches, the processor 1 returns to the process of searching for a physical memory space to be paged in. Try to detect physical memory space with a small specified value. At this time, the operating system (OS) 13 investigates the maximum number of searches if the number of failures is not less than the specified value. If the number of failures is not more than the maximum number of searches, the operating system (OS) 13 again sets a physical memory space having a smaller specified value. Search for. However, if there is no page in which a failure has never occurred, the expected page does not exist. Therefore, the maximum number of searches is provided here, and the processor 1 If the number of searches is exceeded, the processing is continued, and page-in is performed even in a physical memory space where a failure has inevitably occurred. At this time, when the maximum number of searches is exceeded, the operating system (OS) 13 determines that page-in can be performed even in a physical memory space where a failure has occurred.
(4) Step S504
The processor 1 decides the physical memory space to page in. Specifically, the operating system (OS) 13 selects the physical memory space that was judged to allow page-in as the physical memory space to be paged in.
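
Steps S501 to S504 can be sketched as a short search loop. This is only an illustration under stated assumptions, not the implementation described in the patent: the function name choose_page_in_frame, the representation of physical memory spaces as integer frame numbers, the dictionary of failure counts, and the choice to raise LookupError when the candidates run out are all hypothetical. The sketch also assumes that "exceeding the maximum number of searches" means that the number of frames examined so far is greater than max_searches.

```python
from typing import Iterable, Mapping


def choose_page_in_frame(
    candidates: Iterable[int],
    failure_counts: Mapping[int, int],
    specified_value: int,
    max_searches: int,
) -> int:
    """Pick a physical frame for page-in, following steps S501 to S504.

    All names and data structures here are hypothetical illustrations.
    """
    searches = 0
    for frame in candidates:  # S501: search for a frame to page in
        searches += 1
        # S502: accept the frame if its failure count is within the limit.
        if failure_counts.get(frame, 0) <= specified_value:
            return frame  # S504: this frame is used for page-in
        # S503: the frame has too many failures; once the search budget is
        # exceeded, accept it anyway rather than searching forever.
        if searches > max_searches:
            return frame  # S504: page-in on a frame that has failed
    # Assumption: the source does not say what happens when no candidates
    # remain before the search budget is exceeded.
    raise LookupError("no candidate physical frame available")
```

With specified_value set to 0, the loop returns the first frame that has never failed, and falls back to a failed frame only after the search budget is used up.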

By repeating the above operations, an important process can always be allocated a safe physical memory space.

As described above, the present invention is realized by adding a "failure count" and a "failure threshold" to each entry registered in the TLB. The failure count holds the number of correctable failures that have occurred in the corresponding physical address space. The failure threshold holds the maximum number of correctable failures that the corresponding process allows. The failure count monitoring mechanism connected to the TLB compares the two and, when the failure count exceeds the failure threshold, issues a signal that stops the process.
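
The extended TLB entry and the comparison performed by the monitoring mechanism can be illustrated with a small software model. This is a hedged sketch, not the hardware described in the patent: the class names TlbEntry and FailureCountMonitor, the function report_correctable_failure, the field names, and the use of a Python callback in place of the interrupt signal are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TlbEntry:
    """Hypothetical TLB entry extended with the two added fields."""
    process_id: int
    virtual_page: int
    physical_page: int
    failure_count: int = 0
    failure_threshold: float = math.inf  # inf: the process must never be stopped


class FailureCountMonitor:
    """Compares the failure count with the threshold and signals a stop."""

    def __init__(self, on_stop: Callable[[int], None]) -> None:
        self.on_stop = on_stop  # stands in for the interrupt to the OS

    def check(self, entry: TlbEntry) -> bool:
        if entry.failure_count > entry.failure_threshold:
            self.on_stop(entry.process_id)
            return True
        return False


def report_correctable_failure(
    tlb: List[TlbEntry],
    monitor: FailureCountMonitor,
    physical_page: int,
    failure_count: int,
) -> List[int]:
    """Update matching entries and return the process ids that were stopped."""
    stopped = []
    for entry in tlb:
        if entry.physical_page == physical_page:
            entry.failure_count = failure_count  # value reported by memory
            if monitor.check(entry):
                stopped.append(entry.process_id)
    return stopped
```

In this model a process registered with a threshold of infinity is never stopped, while a process with a finite threshold is stopped on the first update that pushes its failure count above that threshold.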

Because the failure count is also held on the real memory side, memory can also be allocated according to the process at page-in / page-out. That is, when real memory is allocated at page-in, the failure count of the real memory is compared with the threshold of the process to which it is allocated, so that an important process, such as one with a threshold of ∞, does not use memory in which a failure has occurred in the past.
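
The process-dependent allocation described above can be sketched as a filter over the per-frame failure counts. The text only states that a process with a threshold of ∞ avoids memory that has failed before; the rule used below for finite thresholds (accept a frame whose failure count does not exceed the process threshold) is an assumption, and the function names allowed_prior_failures and eligible_frames are hypothetical.

```python
import math
from typing import List, Mapping


def allowed_prior_failures(failure_threshold: float) -> float:
    """Map a process's failure threshold to a per-frame failure limit.

    From the text: a threshold of infinity marks an important process,
    which must only receive frames that have never failed (limit 0).
    Assumption: for a finite threshold the limit is the threshold itself.
    """
    if math.isinf(failure_threshold):
        return 0
    return failure_threshold


def eligible_frames(
    frame_failures: Mapping[int, int], failure_threshold: float
) -> List[int]:
    """Return the frame numbers this process may be paged into, in order."""
    limit = allowed_prior_failures(failure_threshold)
    return [
        frame
        for frame, count in sorted(frame_failures.items())
        if count <= limit
    ]
```

For example, with failure counts {0: 0, 1: 3, 2: 1}, a process with an infinite threshold is offered only frame 0, while a process with a threshold of 1 is offered frames 0 and 2.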

The present invention is applicable to computers of any scale, from small to large. Because it provides a flexible failure handling scheme, it is particularly effective for computers that operate in mission-critical areas.

FIG. 1 is a diagram showing a configuration example of the memory fault processing system according to the present invention.
FIG. 2 is a diagram showing a configuration example of the processor.
FIG. 3 is a flowchart showing the processing performed when a process is started.
FIG. 4 is a flowchart showing the processing on the memory unit.
FIG. 5 is a flowchart showing the processing on the processor.
FIG. 6 is a flowchart showing the processing on the operating system (OS).
FIG. 7 is a flowchart showing the processing for allocating real memory at page-in.

Explanation of symbols

1 ... Processor
11 ... TLB with fault counter
12 ... Failure count monitoring unit
121 ... Comparison unit
122 ... Interrupt notification unit
13 ... Operating system (OS)
2 ... Memory unit with fault counter
21 ... Fault notification unit
22 ... Memory main body

Claims

A memory fault processing system comprising:
a processor that starts a process that uses a physical address space, registers in a TLB (Translation Look-aside Buffer) entry a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures allowed by the process, a correctable failure being a failure that has been corrected by ECC (Error Correcting Code), compares the failure count with the failure threshold when the failure count is updated, and stops the process if the failure count exceeds the failure threshold; and
a memory unit with a fault counter that has the physical address space, counts the failure count each time a correctable failure occurs, and notifies the processor of the failure count.

The memory fault processing system according to claim 1,
wherein the processor comprises:
a TLB with a fault counter that has the TLB and stores, in the TLB entry, a process identifier for identifying the process, the failure count, and the failure threshold; and
failure count monitoring means that compares the failure count with the failure threshold and, when the failure count exceeds the failure threshold, transmits an interrupt signal and the process identifier to the operating system (OS) that started the process.

The memory fault processing system according to claim 2,
wherein the failure count monitoring means comprises:
comparison means that, when the failure count is updated on the TLB, refers to the process identifier, the failure count, and the failure threshold, compares the failure count with the failure threshold, and issues a notification of the process identifier when the failure count exceeds the failure threshold; and
interrupt notification means that transmits the process identifier and the interrupt signal to the operating system (OS) in response to the notification of the process identifier.

The memory fault processing system according to any one of claims 1 to 3,
wherein the memory unit with the fault counter comprises:
a memory main body in which the physical address space is divided into units of the minimum physical page and which has a counter that counts and holds the failure count for each unit; and
fault notification means that notifies the processor of a failure address and the failure count each time a correctable failure occurs in the memory main body.

The memory fault processing system according to claim 4,
wherein, when a correctable failure occurs, the memory unit with the fault counter corrects the correctable failure based on the ECC and counts the failure count corresponding to the minimum physical page that includes the address at which the correctable failure occurred, and
the processor acquires the failure address and the failure count from the memory unit with the fault counter, searches the TLB for an entry corresponding to the failure address, updates the failure count stored in the entry found, compares the updated failure count stored in the entry with the failure threshold, and stops the process when the failure count exceeds the failure threshold.

The memory fault processing system according to any one of claims 1 to 5,
wherein, when page-in is started, the processor searches the memory unit with the fault counter for a physical memory space to be paged in and, when the failure count of the detected physical memory space is equal to or less than a specified value, determines that page-in can be performed on the detected physical memory space.

The memory fault processing system according to claim 6,
wherein, when the failure count of the detected physical memory space is not equal to or less than the specified value, the processor checks the maximum number of searches, searches again for a physical memory space whose failure count is within the specified value if the number of searches is equal to or less than the maximum number of searches, and determines that page-in can be performed even on a physical memory space in which a failure has occurred if the maximum number of searches is exceeded.

A memory fault processing method comprising the steps of:
starting a process that uses the physical address space of a memory unit with a fault counter;
registering, in a TLB (Translation Look-aside Buffer) entry held by a processor, a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures allowed by the process, a correctable failure being a failure that has been corrected by ECC (Error Correcting Code);
counting the failure count each time a correctable failure occurs on the memory unit with the fault counter, and providing the failure count to the TLB; and
comparing the failure count with the failure threshold when the failure count is updated on the TLB, and stopping the process if the failure count exceeds the failure threshold.

The memory fault processing method according to claim 8, further comprising the steps of:
storing, in the TLB entry, a process identifier for identifying the process, the failure count, and the failure threshold; and
comparing the failure count with the failure threshold and, when the failure count exceeds the failure threshold, transmitting an interrupt signal and the process identifier to the operating system (OS) that started the process.

The memory fault processing method according to claim 9, further comprising the steps of:
when the failure count is updated on the TLB, referring to the process identifier, the failure count, and the failure threshold, comparing the failure count with the failure threshold, and issuing a notification of the process identifier when the failure count exceeds the failure threshold; and
transmitting the process identifier and the interrupt signal to the operating system (OS) in response to the notification of the process identifier.

The memory fault processing method according to any one of claims 8 to 10, further comprising the steps of:
dividing the physical address space into units of the minimum physical page, and counting and holding the failure count for each unit; and
notifying a failure address and the failure count each time a correctable failure occurs.

The memory fault processing method according to claim 11, further comprising the steps of:
when a correctable failure occurs, correcting the correctable failure based on the ECC on the memory unit with the fault counter, and counting the failure count corresponding to the minimum physical page that includes the address at which the correctable failure occurred;
acquiring the failure address and the failure count from the memory unit with the fault counter, searching the TLB for an entry corresponding to the failure address, and updating the failure count stored in the entry found; and
when the failure count is updated, comparing the failure count stored in the entry with the failure threshold, and stopping the process if the failure count exceeds the failure threshold.

The memory fault processing method according to any one of claims 8 to 12, further comprising the steps of:
when page-in is started, searching the memory unit with the fault counter for a physical memory space to be paged in; and
determining that page-in can be performed on the detected physical memory space when the failure count of the detected physical memory space is equal to or less than a specified value.

The memory fault processing method according to claim 13, further comprising the steps of:
checking the maximum number of searches when the failure count of the detected physical memory space is not equal to or less than the specified value;
searching again for a physical memory space whose failure count is within the specified value if the number of searches is equal to or less than the maximum number of searches; and
determining that page-in can be performed even on a physical memory space in which a failure has occurred if the maximum number of searches is exceeded.

A memory fault processing program for causing a processor to execute the steps of:
starting a process that uses the physical address space of a memory unit with a fault counter;
registering, in a TLB (Translation Look-aside Buffer) entry held by the processor, a failure count indicating the number of correctable failures that have occurred in the physical address space and a failure threshold indicating the maximum number of correctable failures allowed by the process, a correctable failure being a failure that has been corrected by ECC (Error Correcting Code);
counting the failure count each time a correctable failure occurs on the memory unit with the fault counter, and providing the failure count to the TLB; and
comparing the failure count with the failure threshold when the failure count is updated on the TLB, and stopping the process if the failure count exceeds the failure threshold.

The memory fault processing program according to claim 15, for causing the processor to further execute the steps of:
storing, in the TLB entry, a process identifier for identifying the process, the failure count, and the failure threshold; and
comparing the failure count with the failure threshold and, when the failure count exceeds the failure threshold, transmitting an interrupt signal and the process identifier to the operating system (OS) that started the process.

The memory fault processing program according to claim 16, for causing the processor to further execute the steps of:
when the failure count is updated on the TLB, referring to the process identifier, the failure count, and the failure threshold, comparing the failure count with the failure threshold, and issuing a notification of the process identifier when the failure count exceeds the failure threshold; and
transmitting the process identifier and the interrupt signal to the operating system (OS) in response to the notification of the process identifier.

The memory fault processing program according to any one of claims 15 to 17, for causing the processor to further execute the steps of:
dividing the physical address space into units of the minimum physical page, and counting and holding the failure count for each unit; and
notifying a failure address and the failure count each time a correctable failure occurs.

The memory fault processing program according to claim 18, for causing the processor to further execute the steps of:
when a correctable failure occurs, correcting the correctable failure based on the ECC on the memory unit with the fault counter, and counting the failure count corresponding to the minimum physical page that includes the address at which the correctable failure occurred;
acquiring the failure address and the failure count from the memory unit with the fault counter, searching the TLB for an entry corresponding to the failure address, and updating the failure count stored in the entry found; and
when the failure count is updated, comparing the failure count stored in the entry with the failure threshold, and stopping the process if the failure count exceeds the failure threshold.

The memory fault processing program according to any one of claims 15 to 19, for causing the processor to further execute the steps of:
when page-in is started, searching the memory unit with the fault counter for a physical memory space to be paged in; and
determining that page-in can be performed on the detected physical memory space when the failure count of the detected physical memory space is equal to or less than a specified value.

The memory fault processing program according to claim 20, for causing the processor to further execute the steps of:
checking the maximum number of searches when the failure count of the detected physical memory space is not equal to or less than the specified value;
searching again for a physical memory space whose failure count is within the specified value if the number of searches is equal to or less than the maximum number of searches; and
determining that page-in can be performed even on a physical memory space in which a failure has occurred if the maximum number of searches is exceeded.