JP2017010421A - Information processing apparatus, processor management method, and program

Info

Publication number: JP2017010421A
Application number: JP2015127443A
Authority: JP
Inventors: 崇奥野; Takashi Okuno; 智央及川; Tomohisa Oikawa
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-06-25
Filing date: 2015-06-25
Publication date: 2017-01-12
Anticipated expiration: 2035-06-25
Also published as: JP6558098B2

Abstract

PROBLEM TO BE SOLVED: To reliably acquire information useful for evaluating the reliability of a processor. SOLUTION: A storage unit 15 of an information processing apparatus 10 stores operation information indicating the operating time and error occurrence state of each of a plurality of processors 11-14. At the start of execution of a program, a control unit 16 of the information processing apparatus 10 selects, on the basis of the operation information, a predetermined number of processors with the shortest operating times out of the plurality of processors 11-14 as first processors to operate. The control unit 16 causes the first processors to execute the program while suspending the second processors, which were not selected, and acquires the operating time and error occurrence state of the first processors. The control unit 16 stores the acquired operating time and error occurrence state in the storage unit 15. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus, a processor management method, and a program.

Recent computers may have a plurality of processors mounted in a single computer. A processor is also called a CPU (Central Processing Unit). A single processor may also contain a plurality of processor cores, in which case each processor core functions as an independent processor in the computer. Hereinafter, the terms processor and CPU are taken to include processor cores.

With the spread of multi-core processors, the number of processors in a computer is increasing. Conventionally, even as the number of processors grew, all of the processors were used without regard to the load on the system as a whole. In that case, if an uncorrectable error occurs in any one of the processors, the OS (Operating System) can no longer continue operating and the system goes down.

It is also possible to identify a processor that is likely to produce an uncorrectable error from its number of correctable errors, and to take measures to prevent a system failure during operation. For example, when a processor whose number of correctable errors has reached a certain value is detected, use of that processor can be stopped. This suppresses the occurrence of uncorrectable errors.

Techniques for handling system failures include, for example, a technique that performs processor relief and instruction retry processing according to the system load and the error frequency when an error that permits instruction retry occurs. There is also a technique for preventing the system from failing to start even if an abnormality occurs during initialization at system reset. There is furthermore a technique for more reliably preventing recovery time from growing when a failure occurs in a processing device such as a server.

JP-A-6-324897
JP 2010-61419 A
JP 2013-164762 A

A computer system having a plurality of processors may not use its full performance. In such a case, even if no processor has reached the threshold number of correctable errors, some of the processors can be kept stopped, for example to save power. If this practice of stopping some processors whenever there is spare performance continues, large differences can arise in the operating times of the processors.

When the operating-time records of the processors differ greatly, reliability cannot be judged accurately from the number of correctable errors per processor alone. For example, a processor that has run longer than the others will naturally have more correctable errors detected, so a higher number of correctable errors than the other processors does not by itself mean that its reliability is low. It is therefore conceivable to evaluate the reliability of each processor in light of its operation record.

However, a processor that is hardly ever used has no operation record, so no information on its operation record can be acquired. In a reliability evaluation based on operation records, if there is a processor for which such information cannot be acquired, the relative reliability of the processors cannot be judged correctly. As a result, the processor that is likely to produce an uncorrectable error may be misidentified.

In one aspect, the object is to ensure that information useful for evaluating processor reliability can be reliably acquired.

In one proposal, an information processing apparatus having a storage unit and a control unit is provided. The storage unit stores operation information indicating the usage time and error occurrence status of each of a plurality of processors. At the start of execution of a program, the control unit selects, based on the operation information, a predetermined number of processors in ascending order of usage time from among the plurality of processors as first processors to be operated. Next, with the second processors, which were not selected, stopped, the control unit causes the first processors to execute the program and acquires the usage time and error occurrence status of the first processors. The control unit then stores the acquired usage time and error occurrence status in the storage unit.

According to one aspect, information useful for evaluating processor reliability can be reliably acquired.

FIG. 1 is a diagram showing an example of an information processing apparatus according to the first embodiment.
FIG. 2 is a diagram showing a configuration example of the hardware of a server used in the second embodiment.
FIG. 3 is a diagram showing details of the system board and an example of information held in the server.
FIG. 4 is a diagram showing an example of the operating time/error management book.
FIG. 5 is a diagram showing an example of the error log.
FIG. 6 is a block diagram showing the CPU operation management function.
FIG. 7 is a flowchart showing an example of the procedure of the CPU operation management process at system startup.
FIG. 8 is a flowchart showing an example of the procedure of the CPU count calculation process.
FIG. 9 is a flowchart showing an example of the procedure of the CPU selection process.
FIG. 10 is a flowchart showing an example of the procedure of the CPU operation management process at system shutdown.
FIG. 11 is a flowchart showing an example of the procedure of the usage rate collection process.
FIG. 12 is a flowchart showing an example of the procedure of the error information collection process.
FIG. 13 is a diagram showing a first example of selecting the CPUs to use at the first startup.
FIG. 14 is a diagram showing a second example of selecting the CPUs to use at the first startup.
FIG. 15 is a diagram showing an example of selecting the CPUs to use at the second startup.
FIG. 16 is a diagram showing an example of selecting the CPUs to use at the third startup.
FIG. 17 is a diagram showing an example of selecting the CPUs to use at the n-th startup.
FIG. 18 is a diagram showing an example of selecting the CPUs to use at the (n+m)-th startup.

Hereinafter, the present embodiments will be described with reference to the drawings. The embodiments can be combined with one another to the extent that no contradiction arises.
[First Embodiment]
FIG. 1 is a diagram illustrating an example of an information processing apparatus according to the first embodiment. The information processing apparatus 10 includes a plurality of processors (CPUs) 11 to 14, a storage unit 15, and a control unit 16. Within the information processing apparatus 10, an identification number is assigned to each of the CPUs 11 to 14. The identification number of the CPU 11 is “1”, that of the CPU 12 is “2”, that of the CPU 13 is “3”, and that of the CPU 14 is “4”.

The storage unit 15 stores operation information indicating the usage time and error occurrence status of each of the CPUs 11 to 14. The error occurrence status indicates, for example, the number of correctable errors that have occurred. For example, an operation information management table 15a is provided in the storage unit 15, and the operation information is registered in it. The operation information management table 15a holds, for example, the usage time and the number of errors for each of the CPUs 11 to 14. The number of errors shown for each CPU in the operation information management table 15a is, for example, the cumulative total of the error counts in the error occurrence statuses repeatedly collected from that CPU.

At the start of execution of a program, the control unit 16 selects, based on the operation information in the storage unit 15, a predetermined number of CPUs in ascending order of usage time from among the CPUs 11 to 14 as the CPUs to be operated. The control unit 16 then causes the selected CPUs to execute the program while the unselected CPUs are stopped. In other words, the predetermined number of CPUs with the longest usage times among the CPUs 11 to 14 become the CPUs to be stopped. The control unit 16 further acquires the usage time and error occurrence status of each selected CPU and stores them in the storage unit 15.
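The selection step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_cpus_by_usage`, the dictionary layout, the sample usage times, and the tie-break on CPU ID are all assumptions made for the example.

```python
def select_cpus_by_usage(usage_time, count):
    """Return (selected, stopped) CPU IDs, preferring the shortest usage time.

    usage_time maps a CPU identification number to its accumulated usage time.
    """
    # Sort by usage time; ties fall back to the CPU ID (an assumed tie-break).
    ordered = sorted(usage_time, key=lambda cpu_id: (usage_time[cpu_id], cpu_id))
    selected = sorted(ordered[:count])
    stopped = sorted(ordered[count:])
    return selected, stopped


# Hypothetical usage times (hours) after a first run in which CPU 4 was stopped.
table = {1: 12.0, 2: 9.5, 3: 10.0, 4: 0.0}
selected, stopped = select_cpus_by_usage(table, count=3)
print(selected, stopped)
```

With these sample values, the CPU with the longest usage time is the one left stopped, which mirrors the second-startup behavior described for FIG. 1.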

The control unit 16 can change the criterion for selecting the CPUs to be operated according to the error occurrence status. For example, when the reliabilities of the CPUs 11 to 14, as judged from the error occurrence status, do not differ, the control unit 16 can select the CPUs to be operated in ascending order of usage time. When the reliabilities of the CPUs do differ, the control unit 16 selects the CPUs to be operated in descending order of reliability.

In judging reliability, the control unit 16 can, for example, judge that a CPU with fewer errors per unit usage time is more reliable. The control unit 16 can also, for example, use the value obtained by multiplying a CPU's operating time by its average usage rate as the usage time of that CPU.
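The two quantities just described can be written down as a short sketch. The function names, the 0.0 to 1.0 range for the usage rate, and the handling of a CPU with no recorded usage are assumptions for illustration; the text only states that usage time may be operating time multiplied by average usage rate, and that fewer errors per unit usage time means higher reliability.

```python
def usage_time(operating_time, average_usage_rate):
    """Usage time = operating time x average usage rate (rate given as 0.0 to 1.0)."""
    if not 0.0 <= average_usage_rate <= 1.0:
        raise ValueError("average_usage_rate must be between 0.0 and 1.0")
    return operating_time * average_usage_rate


def errors_per_usage(error_count, used):
    """Errors per unit usage time; a lower value means a more reliable CPU."""
    if used <= 0:
        # Assumption: a CPU with no recorded usage has no measurable rate yet.
        return 0.0
    return error_count / used


# Two hypothetical CPUs that ran for the same 100 hours with one error each.
busy = usage_time(100.0, 0.9)  # mostly busy
idle = usage_time(100.0, 0.1)  # mostly idle
print(errors_per_usage(1, busy), errors_per_usage(1, idle))
```

For the same operating time and error count, the mostly idle CPU ends up with the higher error rate, because its single error occurred over far less actual work.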

In such an information processing apparatus 10, suppose, for example, that the program executed during operation of the information processing apparatus 10 can be run on three CPUs. In that case, one CPU can be kept stopped during operation. Which CPUs are used and which CPU is stopped is determined, for example, at the start of execution of the program. In the example of FIG. 1, the CPUs to be used are determined when the information processing apparatus 10 is started, and the program is executed on those CPUs.

At the first startup of the information processing apparatus 10, the usage time and the number of errors of every CPU are both “0”. In this case, the control unit 16 stops any one CPU. In the example of FIG. 1, the CPUs 11 to 13 are used and the CPU 14 is stopped. The control unit 16 acquires operation information from the CPUs 11 to 13 in use, for example when execution of the program stops, and stores it in the storage unit 15. As a result, the usage time and the number of errors of the CPUs 11 to 13 are registered in the operation information management table 15a.

Because a value that takes the CPU usage rate into account is used as the usage time, CPUs that have been running for the same period can still have different usage times. In the example of FIG. 1, the CPU 11 with the identification number “1” has the longest usage time.

At the second startup of the information processing apparatus 10, the control unit 16 determines the CPUs to use and the CPU to stop based on the operation information collected during the first execution of the program. In the example of FIG. 1, no error has occurred in any CPU at the time of the second startup. The control unit 16 therefore selects the three CPUs 12 to 14 with the shortest usage times as the CPUs to use, and stops the CPU 11, which was not selected. The control unit 16 then acquires the operation information of the CPUs 12 to 14 in use and stores it in the storage unit 15. In this way, operation information can also be acquired for the CPU 14, which was stopped during the first run.

Thereafter, each time the information processing apparatus 10 is started, the control unit 16 preferentially uses the CPUs with short usage times and stops the CPU with a long usage time. This equalizes the usage times of the CPUs 11 to 14. If the usage rates of the CPUs in use do not differ greatly and the operation period per startup is the same every time, the CPUs end up being used in rotation.
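The rotation effect can be checked with a small simulation. This is only a sketch under simplifying assumptions: every active CPU accrues the same usage time per run, no errors occur, and ties are broken by CPU ID. The function name `simulate_rotation` and its parameters are hypothetical.

```python
def simulate_rotation(cpu_ids, active_count, runs, hours_per_run=1.0):
    """Return, for each startup, the list of CPU IDs that are left stopped."""
    usage = {cpu_id: 0.0 for cpu_id in cpu_ids}
    stopped_history = []
    for _ in range(runs):
        # Prefer the CPUs with the shortest usage time (ties broken by ID).
        ordered = sorted(usage, key=lambda cpu_id: (usage[cpu_id], cpu_id))
        active, stopped = ordered[:active_count], ordered[active_count:]
        stopped_history.append(sorted(stopped))
        # Only the active CPUs accumulate usage time during this run.
        for cpu_id in active:
            usage[cpu_id] += hours_per_run
    return stopped_history


history = simulate_rotation([1, 2, 3, 4], active_count=3, runs=5)
print(history)  # [[4], [3], [2], [1], [4]]
```

Under these assumptions a different CPU is stopped on each of the first four startups, and the cycle then repeats, so every CPU accumulates an operation record.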

While the information processing apparatus 10 is in operation, a correctable error may occur in one of the CPUs. In the example of FIG. 1, at the k-th startup of the information processing apparatus 10 (k is an integer of 1 or more), one error has been detected in each of the CPUs 11 to 14. In this case the CPUs 11 to 14 have the same number of errors, but when reliability is evaluated with usage time taken into account, their reliabilities are not the same. That is, the more errors per unit time, the lower the reliability is considered to be. For the same number of errors, a shorter usage time means lower reliability. In the example of FIG. 1, the CPU 12 with the identification number “2” has the shortest usage time. The control unit 16 therefore selects the CPUs 11, 13, and 14 as the CPUs to be operated, excluding the CPU 12, which has the highest number of errors per unit time, and stops the CPU 12.

As described above, in the first embodiment, while no error has been detected in any CPU, the usage times are equalized by preferentially using the CPUs with short usage times. As a result, operation information can be reliably collected from all of the CPUs 11 to 14, and the reliability of each of the CPUs 11 to 14 can be evaluated under equal conditions and compared appropriately.

In other words, if the CPUs to be operated are selected by a rule that ignores usage time and the number of errors, the same CPU may be stopped every time. If a CPU is always stopped, no operation information can be collected from it and its reliability cannot be evaluated. In the first embodiment, by contrast, all CPUs are used equally as long as their reliabilities do not differ, so sufficient operation information can be collected from all of them.

Moreover, the operation information is collected by having all of the CPUs take turns executing a program that is useful for the operation of the information processing apparatus 10. This avoids extra processing, such as running a test program on unused CPUs just to collect operation information, and allows the information to be collected efficiently.

In the first embodiment, once an error has been detected in any CPU, CPUs with fewer errors per unit usage time are used preferentially, so that reliability is evaluated correctly with usage time taken into account. If reliability were evaluated without considering usage time, a CPU with few errors would simply be rated as highly reliable. However, a CPU that has been used longer than the others cannot be judged less reliable than them merely because it has more errors. Judging reliability by the number of errors per unit usage time cancels out the differences in usage time between CPUs and allows reliability to be judged correctly.

By accurately stopping the CPUs with low reliability, a system failure caused by an uncorrectable error occurring in a CPU in use can be prevented.
Furthermore, using the value obtained by multiplying a CPU's operating time by its usage rate as the usage time allows the usage time to be calculated accurately. CPU errors occur while processing is being executed. If idle periods, during which no processing is executed, were included in the usage time, reliability could not be judged accurately. In the first embodiment, calculating the usage time with the usage rate taken into account allows reliability to be judged while reflecting the difference between a CPU that is almost never idle and a CPU that is usually idle, so that the CPU likely to produce an uncorrectable error can be identified accurately.

The control unit 16 can be realized, for example, by any one of the processors (CPUs) of the information processing apparatus 10. The storage unit 15 can be realized, for example, by a memory of the information processing apparatus 10.

[Second Embodiment]
Next, a second embodiment will be described. In the second embodiment, the reliability of each processor (CPU) is evaluated at server startup with the operating time of the CPU taken into account. The effective operating time, which reflects the CPU usage rate, is used as the operating time of a CPU. The effective operating time is obtained, for example, as “CPU operating time × CPU usage rate”. The effective operating time is an example of the usage time in the first embodiment. Selecting the CPUs to use based on the effective operating time obtained in this way makes it possible to judge the reliability of each CPU accurately.

In the second embodiment, when no CPU has had a correctable error, a predetermined number of CPUs are stopped in descending order of effective operating time. When there is a CPU that has had a correctable error, a predetermined number of CPUs are stopped in descending order of the number of correctable errors per effective operating time.
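The stop rule of the second embodiment can be sketched as below. The `CpuRecord` class, the function name `cpus_to_stop`, the tie-break on CPU ID, and the treatment of a CPU with errors but no recorded operating time are assumptions introduced for this illustration and do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class CpuRecord:
    cpu_id: int
    effective_hours: float    # CPU operating time x CPU usage rate
    correctable_errors: int


def cpus_to_stop(records, stop_count):
    """Return the IDs of the CPUs to stop, following the two-case rule."""
    if all(r.correctable_errors == 0 for r in records):
        # No correctable errors anywhere: stop the longest-running CPUs first.
        ranked = sorted(records, key=lambda r: (-r.effective_hours, r.cpu_id))
    else:
        def rate(r):
            if r.effective_hours > 0:
                return r.correctable_errors / r.effective_hours
            # Assumption: errors with no recorded time rank worst, none rank best.
            return float("inf") if r.correctable_errors else 0.0

        # Some CPU has errors: stop the highest error rates first.
        ranked = sorted(records, key=lambda r: (-rate(r), r.cpu_id))
    return [r.cpu_id for r in ranked[:stop_count]]


# Hypothetical records: one correctable error each, different effective times.
records = [CpuRecord(1, 40.0, 1), CpuRecord(2, 10.0, 1),
           CpuRecord(3, 30.0, 1), CpuRecord(4, 35.0, 1)]
print(cpus_to_stop(records, stop_count=1))  # [2]
```

With equal error counts, the CPU with the shortest effective operating time has the highest error rate and is therefore the one stopped.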

FIG. 2 is a diagram illustrating a configuration example of the hardware of a server used in the second embodiment. The server 100 as a whole is controlled by the CPUs in the system board 101. The system board 101 has a plurality of CPUs (for example, a multi-core processor) and a memory. A plurality of peripheral devices are connected to the system board 101 via a bus 109. In the system board 101, at least some of the functions realized by the CPUs executing programs may instead be realized by an electronic circuit such as an ASIC (Application Specific Integrated Circuit) or a PLD (Programmable Logic Device).

The peripheral devices connected to the bus 109 include an HDD (Hard Disk Drive) 102, a monitoring unit 103, a graphic processing device 104, an input interface 105, an optical drive device 106, a device connection interface 107, and FC (Fibre Channel) cards 108a and 108b.

The HDD 102 magnetically writes data to and reads data from its built-in disk. The HDD 102 is used as an auxiliary storage device of the server 100. The HDD 102 stores the OS program, application programs, and various data. A non-volatile semiconductor storage device such as a flash memory (SSD: Solid State Drive) can also be used as the auxiliary storage device.

The monitoring unit 103 monitors the operation of the server 100. For example, the monitoring unit 103 collects and accumulates information on correctable errors in the CPUs in the system board 101.

A monitor 21 is connected to the graphic processing device 104. The graphic processing device 104 displays an image on the screen of the monitor 21 in accordance with commands from the CPUs in the system board 101. Examples of the monitor 21 include a display device using a CRT (Cathode Ray Tube) and a liquid crystal display device.

A keyboard 22 and a mouse 23 are connected to the input interface 105. The input interface 105 transmits signals sent from the keyboard 22 and the mouse 23 to the CPUs in the system board 101. The mouse 23 is an example of a pointing device, and other pointing devices can also be used. Other pointing devices include a touch panel, a tablet, a touch pad, and a trackball.

The optical drive device 106 reads data recorded on an optical disc 24 using laser light or the like. The optical disc 24 is a portable recording medium on which data is recorded so as to be readable by reflection of light. Examples of the optical disc 24 include a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), and a CD-R (Recordable)/RW (ReWritable).

The device connection interface 107 is a communication interface for connecting peripheral devices to the server 100. For example, a memory device 25 and a memory reader/writer 26 can be connected to the device connection interface 107. The memory device 25 is a recording medium equipped with a function for communicating with the device connection interface 107. The memory reader/writer 26 is a device that writes data to a memory card 27 or reads data from the memory card 27. The memory card 27 is a card-type recording medium.

The FC cards 108a and 108b are connected to a network 20. The FC cards 108a and 108b transmit data to and receive data from other computers or communication devices via the network 20.

FIG. 3 is a diagram illustrating details of the system board and an example of information held in the server. The system board 101 carries a processor 101-1, a memory 101-2, a memory bridge 101-3, an I/O (Input/Output) bridge 101-4, and the like.

The processor 101-1 has a plurality of CPUs 101a to 101e (processor cores). The CPUs 101a to 101e are assigned the identification numbers “0” to “4”, respectively. The CPU 101a with the identification number “0” executes the OS and a management program 101f that manages the operation of the CPUs. The four CPUs with the identification numbers “1” to “4” execute an application program 101h. The CPU 101a, which executes the management program 101f, is an example of the control unit 16 of the first embodiment illustrated in FIG. 1. The CPUs 101b to 101e, which execute the application program 101h, are examples of the CPUs 11 to 14 of the first embodiment illustrated in FIG. 1.

The memory 101-2 is used as the main storage device of the server 100. The memory 101-2 stores the management program 101f, an operating time/error management book 101g, the application program 101h, and an OS 101i. A volatile semiconductor storage device such as a RAM is used as the memory 101-2, for example. The memory 101-2 is an example of the storage unit 15 of the first embodiment illustrated in FIG. 1.

The management program 101f is a program for managing which CPUs execute the application program 101h. The management program 101f includes a usage rate collection module, an error information collection module, a CPU number calculation module, a CPU selection module, and the like. The operating time/error management book 101g is a data table in which the operating time of each CPU and the errors that have occurred are registered. The application program 101h is a program that causes the CPUs 101b to 101e to perform information processing related to the services provided by the server 100. The OS 101i is a program for controlling the operation of the entire server 100.

The memory bridge 101-3 is a control circuit that controls access from the processor 101-1 to the memory 101-2. The I/O bridge 101-4 is a control circuit that controls access from the processor 101-1 to peripheral devices such as the HDD 102.

Like the memory 101-2, the HDD 102 stores a management program 102a, an operating time/error management book 102b, an application program 102c, and an OS 102d. The various information stored in the HDD 102 is read into the memory 101-2, and the programs are executed by one of the CPUs in the processor 101-1.

The monitoring unit 103 has a control unit 103-1 and a memory 103-2. The control unit 103-1 stores error information sent from the system board 101 in the memory 103-2 as an error log 103a. The error information sent from the system board 101 concerns correctable errors and includes the identification number of the CPU that caused each error. In response to a request from the CPU 101a in the system board 101, the control unit 103-1 also transmits the error log 103a in the memory 103-2 to the CPU 101a.

With the hardware configuration and data described above, the processing functions of the second embodiment can be realized. The apparatus described in the first embodiment can also be realized by hardware similar to that of the server 100 shown in FIG. 2.

The processor 101-1 loads at least part of a program in the HDD 102 into the memory 101-2 and executes it. A program to be executed by the server 100 can also be recorded on a portable recording medium such as the optical disc 24, the memory device 25, or the memory card 27. A program stored on a portable recording medium becomes executable after being installed on the HDD 102, for example, under the control of the processor 101-1. The processor 101-1 can also read and execute a program directly from the portable recording medium.

Next, the information used for CPU operation management will be described in detail.

FIG. 4 is a diagram illustrating an example of the operating time/error management book. The operating time/error management book 101g has columns for Unit, actual operating time, number of correctable errors, and previous average usage rate. The Unit column holds the identification number of the managed CPU. The actual operating time column holds the actual operating time of the corresponding CPU, that is, the time during which the CPU was online multiplied by its average usage rate. The number of correctable errors column holds the number of correctable errors that have occurred in the corresponding CPU. The previous average usage rate column holds the average usage rate of the corresponding CPU during the most recent system operation.
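
As an illustration only, the four columns of the operating time/error management book could be modeled as follows. This is a minimal Python sketch; the class name `CpuRecord`, the field names, the choice of hours as the time unit, and the helper `new_table` are all assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class CpuRecord:
    """One row of the management book (field names are hypothetical)."""
    unit: int                      # "Unit": CPU identification number
    actual_operating_hours: float  # online time multiplied by average usage rate
    correctable_errors: int        # cumulative number of correctable errors
    prev_avg_usage: float          # average usage rate in the most recent run (0.0 to 1.0)


def new_table(units):
    """Create an empty table, as before the first startup (all values zero)."""
    return [CpuRecord(u, 0.0, 0, 0.0) for u in units]


# The four application CPUs of FIG. 3 carry identification numbers 1 to 4.
table = new_table([1, 2, 3, 4])
```

Keeping the cumulative values (operating time, error count) separate from the per-run value (previous average usage rate) mirrors how the columns are described above.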

FIG. 5 is a diagram illustrating an example of the error log. The error log 103a has columns for number, time, and Unit. The number column holds the identification number of the error that occurred. The time column holds the date and time at which the error occurred. The Unit column holds the identification number of the CPU that caused the error.

The CPU 101a executing the management program 101f manages the operation of the other CPUs 101b to 101e using the information shown in FIGS. 4 and 5.

FIG. 6 is a block diagram showing the CPU operation management function. FIG. 6 represents, as functional blocks, the functions of the server 100 for CPU operation management. For example, the server 100 has an OS 110, a usage rate collection unit 120, an error information collection unit 130, a CPU number calculation unit 140, and a CPU selection unit 150. The OS 110 is a function realized by the CPU 101a executing the OS program (OS 101i) stored in the memory 101-2. The usage rate collection unit 120, the error information collection unit 130, the CPU number calculation unit 140, and the CPU selection unit 150 are functions realized by the CPU 101a executing the management program 101f.

The OS 110 monitors the operating status of the CPUs 101b to 101e and calculates their usage rates. The usage rate collection unit 120 collects the usage rates of the CPUs 101b to 101e from the OS 110, for example, when the system is stopped, and sets the collected usage rates in the operating time/error management book 101g. The error information collection unit 130 collects the error log 103a from the monitoring unit 103, for example, when the system is stopped, and totals, for each CPU, the number of correctable errors that occurred between system startup and stop. The error information collection unit 130 then sets the totaled error counts in the operating time/error management book 101g.

The CPU number calculation unit 140 calculates, for example at system startup, the number of CPUs to be used for executing the application program 101h, based on the operating time/error management book 101g. The CPU selection unit 150 selects, for example at system startup, as many CPUs as the number calculated by the CPU number calculation unit 140 as the CPUs to be used, based on the operating time/error management book 101g. The CPU selection unit 150 then stops the operation of the CPUs that were not selected.

Note that the lines connecting the elements shown in FIG. 6 represent only some of the communication paths; communication paths other than those illustrated can also be set.

Next, the CPU operation management process will be described in detail. The CPU operation management process is performed at system startup and at system stop. Below, the processing at system startup is described with reference to FIGS. 7 to 9, and the processing at system stop is described with reference to FIGS. 10 to 12.

FIG. 7 is a flowchart illustrating an example of the procedure of the CPU operation management process at system startup.

[Step S101] The CPU 101a starts the OS 110.

[Step S102] The OS 110 starts the CPU number calculation unit 140 based on the CPU number calculation module of the management program 101f. The CPU number calculation unit 140 executes the process of calculating the number of CPUs to be used for executing the application program 101h. Details of the CPU number calculation process are described later (see FIG. 8).

[Step S103] The OS 110 starts the CPU selection unit 150 based on the CPU selection module of the management program 101f. The CPU selection unit 150 performs the CPU selection process, selecting from among the CPUs 101b to 101e as many CPUs as the number calculated in step S102. Details of the CPU selection process are described later (see FIG. 9).

[Step S104] The CPU 101a instructs the CPUs selected in step S103 to execute the application program 101h. The CPUs that receive the instruction start the application. Thereafter, the server 100 provides services based on the application.

FIG. 8 is a flowchart illustrating an example of the procedure of the CPU number calculation process.

[Step S111] The CPU number calculation unit 140 determines whether the number of CPUs used by the application is fixed. For example, the CPU number calculation unit 140 checks whether the number of CPUs to use is specified in management information such as the properties of the application program 101h. If the number of CPUs to use is specified, the CPU number calculation unit 140 determines that the number is fixed. If the number of CPUs to use is fixed, the process ends. If it is not fixed, the process proceeds to step S112.

[Step S112] The CPU number calculation unit 140 initializes the total of the previous average usage rates to “0”.

[Step S113] The CPU number calculation unit 140 repeats the processing of steps S114 and S115 as many times as the number of CPUs prepared for executing the application program 101h (“4” in the example of FIG. 3). For example, the CPU number calculation unit 140 takes the CPUs registered in the operating time/error management book 101g as processing targets in order from the top.

[Step S114] The CPU number calculation unit 140 acquires the previous average usage rate of the CPU being processed from the operating time/error management book 101g. The CPU number calculation unit 140 then adds the acquired value to the total of the previous average usage rates.

[Step S115] The CPU number calculation unit 140 moves the processing target to the next CPU in the operating time/error management book 101g.

[Step S116] When processing has been completed for all the CPUs 101b to 101e prepared for executing the application program 101h, the CPU number calculation unit 140 advances the process to step S117.

[Step S117] The CPU number calculation unit 140 divides the total of the previous average usage rates by 60% and takes the result as the number of CPUs to use. Any fractional part of the quotient is rounded up. This yields the number of CPUs needed to keep the average usage rate at or below 60%.
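
The calculation in steps S112 to S117 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `required_cpu_count`, the use of integer percentages, and the sample usage values are hypothetical. Only the 60% divisor and the rounding up come from the description above.

```python
import math


def required_cpu_count(prev_avg_usage_percent, target_percent=60):
    """Sum the previous average usage rates and divide by the target rate.

    prev_avg_usage_percent: one value per application CPU, in percent.
    The fractional part of the quotient is rounded up (step S117).
    """
    total = 0                      # S112: initialize the total to 0
    for usage in prev_avg_usage_percent:
        total += usage             # S113 to S116: accumulate over all CPUs
    return math.ceil(total / target_percent)


# Hypothetical previous run: three CPUs were busy, one was offline.
needed = required_cpu_count([50, 40, 45, 0])  # 135 / 60 = 2.25, rounded up
```

The description does not say what happens when the result is zero or exceeds the number of CPUs prepared for the application, so this sketch leaves the value unclamped.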

Next, the CPU selection process will be described in detail.

FIG. 9 is a flowchart illustrating an example of the procedure of the CPU selection process.

[Step S121] The CPU selection unit 150 sorts the entries of the CPUs 101b to 101e registered in the operating time/error management book 101g in ascending order of the number of correctable errors per unit operating time. For example, for each CPU in the operating time/error management book 101g, the CPU selection unit 150 divides the number of correctable errors by the actual operating time to calculate the number of correctable errors per unit operating time. The CPU selection unit 150 then arranges the entries of the CPUs 101b to 101e in the operating time/error management book 101g starting from the CPU with the fewest correctable errors per unit operating time.

[Step S122] The CPU selection unit 150 sorts CPUs that have the same number of correctable errors per unit operating time in ascending order of actual operating time. For example, the CPU selection unit 150 detects, from the operating time/error management book 101g, a group of CPUs with the same number of correctable errors per unit operating time. If such a group exists, the CPU selection unit 150 arranges the entries of that group in the operating time/error management book 101g starting from the CPU with the shortest actual operating time.

[Step S123] The CPU selection unit 150 repeats the processing of steps S124 to S126 as many times as the number of CPUs prepared for executing the application program 101h (“4” in the example of FIG. 3). For example, the CPU selection unit 150 takes the CPUs registered in the operating time/error management book 101g as processing targets in order from the top. That is, CPUs are processed in order starting from the CPU with the fewest correctable errors per unit time. CPUs with the same number of correctable errors per unit time are processed in order starting from the CPU with the shortest actual operating time.

[Step S124] The CPU selection unit 150 determines whether the number of iterations of the loop of steps S124 to S126 is within the number of CPUs to be used. If the iteration count is less than or equal to the number of CPUs to be used, the process proceeds to step S126. If the iteration count exceeds the number of CPUs to be used, the process proceeds to step S125.

[Step S125] The CPU selection unit 150 takes the CPU being processed offline. A CPU that has been taken offline is excluded from the execution targets of the application program.

[Step S126] The CPU selection unit 150 moves the processing target to the next CPU in the operating time/error management book 101g.

[Step S127] When processing has been completed for all the CPUs 101b to 101e prepared for executing the application program 101h, the CPU selection unit 150 ends the CPU selection process.
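
Steps S121 to S127 amount to a two-key sort followed by a cut-off, which can be sketched as below. This Python sketch is illustrative only: the function name `select_cpus`, the dictionary keys, and all sample values are assumptions. Treating a CPU with zero operating time as having an error rate of zero is also an assumption, since the flowchart description does not cover division by zero.

```python
def select_cpus(table, cpus_to_use):
    """Return (online_units, offline_units) following steps S121 to S127."""

    def errors_per_hour(row):
        # Assumption: a CPU with no recorded operating time counts as rate 0.
        hours = row["hours"]
        return row["errors"] / hours if hours > 0 else 0.0

    # S121/S122: ascending error rate, then ascending actual operating time.
    # The unit number is added as a final key so that full ties resolve to
    # the lowest identification number.
    ordered = sorted(table, key=lambda r: (errors_per_hour(r), r["hours"], r["unit"]))
    # S123 to S126: the first cpus_to_use entries stay online, the rest go offline.
    online = [r["unit"] for r in ordered[:cpus_to_use]]
    offline = [r["unit"] for r in ordered[cpus_to_use:]]
    return online, offline


# Hypothetical table: no errors yet, and CPU 4 was idle during the first run.
no_errors = [
    {"unit": 1, "hours": 10.0, "errors": 0},
    {"unit": 2, "hours": 8.0, "errors": 0},
    {"unit": 3, "hours": 6.0, "errors": 0},
    {"unit": 4, "hours": 0.0, "errors": 0},
]
online, offline = select_cpus(no_errors, cpus_to_use=3)
```

With no errors recorded, the sort degenerates to ordering by actual operating time, so the CPU that has run longest (unit 1 in these sample values) is the one taken offline.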

In this way, at system startup, the use of CPUs with many errors per unit operating time is suppressed, so that highly reliable CPUs are used preferentially. If the number of correctable errors is “0” for all CPUs, the number of errors per unit operating time is also “0” for every CPU. In that case, CPUs with short actual operating times are used, and the use of CPUs with long actual operating times is suppressed. Thus, when the relative reliability of the CPUs is unknown, repeated system startups cause all CPUs to be used evenly. When the CPUs are used evenly, reliability judgments based on the calculated number of errors per unit operating time become statistically more accurate.

When the system is stopped, the information to be used at the next system startup is collected.

FIG. 10 is a flowchart illustrating an example of the procedure of the CPU operation management process at system stop.

[Step S201] The OS 110 causes the CPUs executing the application program 101h to stop execution.

[Step S202] The OS 110 starts the usage rate collection unit 120 based on the usage rate collection module of the management program 101f. The usage rate collection unit 120 executes the process of collecting the usage rate of each CPU from system startup to stop. Details of the usage rate collection process are described later (see FIG. 11).

[Step S203] The OS 110 starts the error information collection unit 130 based on the error information collection module of the management program 101f. The error information collection unit 130 collects the error information of each CPU from system startup to stop. Details of the error information collection process are described later (see FIG. 12).

[Step S204] The OS 110 stops operating.

Next, the usage rate collection process will be described in detail.

FIG. 11 is a flowchart illustrating an example of the procedure of the usage rate collection process.

[Step S211] The usage rate collection unit 120 repeats the processing of steps S212 and S213 as many times as the number of CPUs prepared for executing the application program 101h (“4” in the example of FIG. 3). For example, the usage rate collection unit 120 takes the CPUs registered in the operating time/error management book 101g as processing targets in order from the top.

[Step S212] The usage rate collection unit 120 sets the previous average usage rate. For example, the usage rate collection unit 120 acquires the average usage rate of the CPU being processed from the OS 110. The usage rate collection unit 120 then sets the acquired average usage rate in the operating time/error management book 101g as the previous average usage rate of that CPU.

[Step S213] The usage rate collection unit 120 updates the actual operating time of the CPU being processed. For example, the usage rate collection unit 120 acquires from the OS 110 the time elapsed from the most recent system startup to the present (the operating time). The usage rate collection unit 120 then multiplies the average usage rate of the CPU being processed by the acquired operating time and adds the product to the actual operating time of that CPU in the operating time/error management book 101g.

[Step S214] When processing has been completed for all the CPUs 101b to 101e prepared for executing the application program 101h, the usage rate collection unit 120 ends the usage rate collection process.
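
The update performed in steps S211 to S214 can be sketched as follows. This Python sketch is an illustration under assumptions: the function name `record_run`, the dictionary keys, and the convention that a CPU without a reported usage rate (for example, one that was offline) counts as 0.0 are all hypothetical. The 20-hour run at a usage rate of 0.5 is the worked figure the specification gives for FIG. 15.

```python
def record_run(table, run_hours, avg_usage_by_unit):
    """Update each row after a run, following steps S211 to S214.

    run_hours: time from the most recent startup to now, as reported by the OS.
    avg_usage_by_unit: average usage rate (0.0 to 1.0) per CPU unit number.
    """
    for row in table:
        # Assumption: a CPU with no reported usage (e.g. offline) counts as 0.0.
        usage = avg_usage_by_unit.get(row["unit"], 0.0)
        row["prev_avg_usage"] = usage        # S212: store previous average usage rate
        row["hours"] += usage * run_hours    # S213: add usage-weighted operating time


table = [
    {"unit": 1, "hours": 0.0, "prev_avg_usage": 0.0},
    {"unit": 4, "hours": 0.0, "prev_avg_usage": 0.0},
]
# CPU 1 ran for 20 hours at 50% average usage; CPU 4 was offline.
record_run(table, run_hours=20.0, avg_usage_by_unit={1: 0.5})
```

Because the product is added to the stored value rather than replacing it, the actual operating time accumulates across runs while the previous average usage rate always reflects only the latest run.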

Next, the error information collection process will be described.

FIG. 12 is a flowchart illustrating an example of the procedure of the error information collection process.

[Step S221] The error information collection unit 130 acquires the number of correctable errors of all the CPUs 101b to 101e prepared for executing the application program 101h. For example, the error information collection unit 130 acquires from the monitoring unit 103 the correctable error information recorded from the most recent system startup to the present.

[Step S222] The error information collection unit 130 repeats the processing of step S223 as many times as the number of CPUs prepared for executing the application program 101h (“4” in the example of FIG. 3). For example, the error information collection unit 130 takes the CPUs registered in the operating time/error management book 101g as processing targets in order from the top.

[Step S223] The error information collection unit 130 updates the number of correctable errors. For example, based on the correctable error information acquired in step S221, the error information collection unit 130 counts the errors attributed to the CPU being processed. The error information collection unit 130 then adds the counted value to the number of correctable errors of that CPU in the operating time/error management book 101g.

[Step S224] When processing has been completed for all the CPUs 101b to 101e prepared for executing the application program 101h, the error information collection unit 130 ends the error information collection process.
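
The counting in steps S221 to S224 can be sketched as follows. This is an illustrative Python sketch: the function name `add_error_counts`, the dictionary keys, and the sample log entries (including their timestamps) are hypothetical. The log fields follow the number, time, and Unit columns described for FIG. 5.

```python
from collections import Counter


def add_error_counts(table, error_log):
    """Add this run's correctable errors to each CPU row (steps S221 to S224)."""
    # S221: the log covers the period from the most recent startup to now.
    per_unit = Counter(entry["unit"] for entry in error_log)
    for row in table:
        # S223: a Counter returns 0 for units that do not appear in the log.
        row["errors"] += per_unit[row["unit"]]


table = [{"unit": u, "errors": 0} for u in (1, 2, 3, 4)]
# Hypothetical error log entries in the shape of FIG. 5.
error_log = [
    {"number": 1, "time": "2015-06-25T10:00:00", "unit": 2},
    {"number": 2, "time": "2015-06-25T11:30:00", "unit": 4},
    {"number": 3, "time": "2015-06-25T12:45:00", "unit": 2},
]
add_error_counts(table, error_log)
```

Counting once per log and then updating each row avoids rescanning the log for every CPU, although the flowchart itself only specifies the per-CPU loop.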

Through the above processing, the second embodiment rotates the CPUs to be used at startup and collects, for each CPU, the operating time record weighted by usage rate and the number of correctable errors. This makes it possible to properly identify a CPU with low reliability and stop its operation.

Next, a specific example is given of collecting the number of correctable errors per unit time and of identifying a CPU that is prone to uncorrectable errors, assuming that the number of CPUs used in accordance with the overall system load is 3 and that the load does not fluctuate.

FIG. 13 is a diagram illustrating a first example of selecting the CPUs to be used at the first startup. In the example of FIG. 13, it is assumed that the number of CPUs to be used is predefined as “3” for the application program 101h. In this case, at system startup, three of the four CPUs 101b to 101e prepared for executing the application program 101h are used.

At the first startup, the actual operating time and the number of correctable errors are “0” for all of the CPUs 101b to 101e. The CPU selection unit 150 therefore selects CPUs for use in order starting, for example, from the lowest identification number. As a result, the CPU 101e, which has the largest identification number “4”, is excluded from use.

The CPU selection unit 150 takes the CPU 101e, which has been excluded from use, offline. The OS 110 then causes the online CPUs 101b to 101d to execute the application program 101h.

Note that the number of CPUs to be used by the application program 101h may not be defined in advance.

FIG. 14 is a diagram illustrating a second example of selecting the CPUs to be used at the first startup. In the example of FIG. 14, it is assumed that the load of the application program 101h is unknown at the first startup. In this case, in order to measure the load of the entire system, the CPU selection unit 150 makes all the CPUs 101b to 101e available for use at the first startup. The OS 110 causes the four CPUs 101b to 101e to execute the application program 101h.

Through the first-startup processing shown in FIGS. 13 and 14, the actual operating time of each CPU that was used can be acquired. If correctable errors occur, it is also possible to acquire which CPU caused them and how many times. The operating time/error management book 101g is then updated when the system is stopped. At subsequent system startups, the CPUs to be used are selected according to the contents of the operating time/error management book 101g.

FIG. 15 is a diagram illustrating an example of selecting the CPUs to be used at the second startup. In the example of FIG. 15, no correctable error was detected from any of the CPUs 101b to 101e during the first system operation. The actual operating time is calculated as “actual operating time accumulated up to the run before last” + “operating time of the previous run” × “average usage rate of the previous run”. At the second startup, the “actual operating time accumulated up to the run before last” is “0”. For example, suppose that for “CPU1” the operating time during the previous system operation was 20 hours and the average usage rate was 0.5 (50%). In this case, the actual operating time of “CPU1” is 10 hours (0 h + 20 h × 0.5 = 10 h).

When the number of correctable errors is 0 for all the CPUs 101b to 101e, the CPU selection unit 150 does not use the CPU with the longest actual operating time, in order to equalize the actual operating times of all the CPUs 101b to 101e. In the example of FIG. 15, the CPU 101b with identification number “CPU1” is taken offline and its use is suppressed. If actual operating times are equal, the CPU selection unit 150 selects CPUs for use starting, for example, from the lowest identification number.

By preferentially using CPUs with short actual operating times in this way, the CPU 101e, which was stopped at the first startup, is now used, and operation information for the CPU 101e can be collected. In addition, stopping the CPU 101b, which has the longest actual operating time, narrows the differences in actual operating time. That is, the actual operating times are equalized.

FIG. 16 is a diagram illustrating an example of selecting the CPUs to be used at the third startup. In the example of FIG. 16, no correctable error was detected from any of the CPUs 101b to 101e during the first and second system operations. Because use of the CPU 101b with identification number “CPU1” was suppressed during the second system operation, the actual operating time of the CPU 101b has not increased. The actual operating times of the other CPUs 101c to 101e have increased. As a result, the CPU 101c with identification number “CPU2” now has the longest actual operating time. Therefore, in order to equalize the actual operating times of all the CPUs 101b to 101e, the CPU selection unit 150 excludes the CPU 101c, which has the longest actual operating time, from use and takes it offline.

Thereafter, as long as no correctable error has occurred in any of the CPUs, the CPU with the longest effective operating time is excluded from use at each system startup so that the effective operating times become equal. Assume that, at the n-th startup of the system (n is an integer of 4 or more), a correctable error has occurred in one or more CPUs.

FIG. 17 is a diagram illustrating an example of selecting the CPUs to be used at the n-th startup. In the example of FIG. 17, one correctable error has been detected in each of the CPUs 101b to 101e. In this case, the CPU selection unit 150 excludes from use the CPU with the largest number of correctable errors per unit time. Because the number of correctable errors is the same for each of the CPUs 101b to 101e in FIG. 17, the CPU 101c with the identification number "2", which has the shortest effective operating time, has the largest number of correctable errors per unit time. The CPU selection unit 150 therefore excludes the CPU 101c from use and takes it offline.

When the system is stopped and started a further m times (m is an integer of 1 or more), the number of correctable errors also begins to vary among the CPUs 101b to 101e.
FIG. 18 is a diagram illustrating an example of selecting the CPUs to be used at the (n+m)-th startup. In the example of FIG. 18, the CPU 101e with the identification number "4" has the largest number of correctable errors. However, the CPU 101c with the identification number "2" has the largest number of correctable errors per unit time. In this case, the CPU selection unit 150 excludes from use the CPU 101c, which has the largest number of correctable errors per unit time.
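The error-rate rule used in FIGS. 17 and 18 can be sketched as follows. This is only an illustration: the function name `pick_cpu_to_exclude`, the data layout, and all numeric values are hypothetical. The handling of a CPU with zero effective operating time and the tie-break toward the lowest identification number are assumptions that the description does not specify.

```python
def pick_cpu_to_exclude(stats):
    """Return the ID of the CPU with the most correctable errors per unit time.

    stats maps a CPU ID to (correctable_errors, effective_hours).
    """
    def error_rate(cpu):
        errors, hours = stats[cpu]
        # Assumption: with no recorded operating time, treat the rate as 0.
        return errors / hours if hours > 0 else 0.0

    # max() keeps the first maximal item, so iterating in ID order makes
    # ties resolve to the lowest ID (an assumption, not from the source).
    return max(sorted(stats), key=error_rate)


# Same error count everywhere: the CPU with the shortest time has the top rate.
same_count = {1: (1, 400.0), 2: (1, 350.0), 3: (1, 380.0), 4: (1, 390.0)}
# CPU 4 has the most errors, but CPU 2 has the most errors per hour.
varied = {1: (2, 800.0), 2: (4, 500.0), 3: (3, 750.0), 4: (5, 1000.0)}
print(pick_cpu_to_exclude(same_count), pick_cpu_to_exclude(varied))
```

Ranking by rate rather than by raw count is what keeps a CPU that has simply run longer from being penalized for errors accumulated over that longer time.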

As a result, even when the CPUs differ in their accumulated operating time, a CPU that is likely to generate an uncorrectable error can be identified appropriately, and not using that CPU prevents the system from going down. The second embodiment also uses an operating-time record that takes the usage rate into account. This is because operating time alone would treat a CPU that is almost never idle the same as a CPU that is usually idle. Factoring the usage rate into the calculation of the operating-time record makes it possible to determine accurately how much each CPU has actually been used.
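The usage-weighted record can be illustrated with a short sketch. It follows the definition in the claims, where the usage time is the operating time multiplied by the average usage rate. The function name `effective_operating_time`, the representation of the rate as a fraction between 0 and 1, and the sample values are assumptions made for illustration.

```python
def effective_operating_time(operating_hours, average_usage_rate):
    """Weight raw operating time by the average usage rate.

    average_usage_rate is assumed to be a fraction in [0, 1].
    """
    if not 0.0 <= average_usage_rate <= 1.0:
        raise ValueError("average_usage_rate must be between 0 and 1")
    return operating_hours * average_usage_rate


# Two CPUs powered on for the same 100 hours (hypothetical values).
mostly_busy = effective_operating_time(100.0, 0.75)
mostly_idle = effective_operating_time(100.0, 0.25)
print(mostly_busy, mostly_idle)
```

Although both CPUs were powered on for the same period, the weighted values differ, which is the distinction that operating time alone would miss.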

Although embodiments have been illustrated above, the configuration of each unit described in the embodiments can be replaced with another configuration having a similar function. Other arbitrary components or steps may also be added. Furthermore, any two or more configurations (features) of the embodiments described above may be combined.

Description of Symbols
10 Information processing apparatus
11 to 14 Processor (CPU)
15 Storage unit
15a Operation information management table
16 Control unit

Claims

An information processing apparatus comprising:
a storage unit that stores operation information indicating the usage time and error occurrence status of each of a plurality of processors; and
a control unit that, at the start of execution of a program, selects, based on the operation information, a predetermined number of processors from among the plurality of processors in ascending order of usage time as first processors to be operated, causes the first processors to execute the program while second processors that were not selected are stopped, acquires the usage time and error occurrence status of the first processors, and stores the acquired usage time and error occurrence status in the storage unit.

The information processing apparatus according to claim 1, wherein, in selecting the first processors, the control unit selects the first processors in ascending order of usage time when there is no difference in reliability among the plurality of processors based on the error occurrence status, and selects the first processors in descending order of reliability when there is a difference in reliability among the plurality of processors.

The information processing apparatus according to claim 2, wherein, in selecting the first processors, the control unit determines that a processor with fewer errors per unit usage time is more reliable.

The information processing apparatus according to claim 1, wherein the control unit uses, as the usage time of a processor, a value obtained by multiplying the operating time of the processor by the average usage rate of the processor.

A processor management method in which a computer:
selects, at the start of execution of a program, based on operation information indicating the usage time and error occurrence status of each of a plurality of processors, a predetermined number of processors from among the plurality of processors in ascending order of usage time as first processors to be operated;
causes the first processors to execute the program while second processors that were not selected are stopped;
acquires the usage time and error occurrence status of the first processors; and
stores the acquired usage time and error occurrence status in a storage unit.

A program that causes a computer to execute processing comprising:
selecting, at the start of execution of a program, based on operation information indicating the usage time and error occurrence status of each of a plurality of processors, a predetermined number of processors from among the plurality of processors in ascending order of usage time as first processors to be operated;
causing the first processors to execute the program while second processors that were not selected are stopped;
acquiring the usage time and error occurrence status of the first processors; and
storing the acquired usage time and error occurrence status in a storage unit.