JP4945774B2

JP4945774B2 - Failure information data collection method for disk array device and transport control processor core

Info

Publication number: JP4945774B2
Application number: JP2007105489A
Authority: JP
Inventors: 大川田; 修木村; 浩二山口; 一雄中嶋; 親志前田; 祐司野田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2007-04-13
Filing date: 2007-04-13
Publication date: 2012-06-06
Anticipated expiration: 2027-04-13
Also published as: JP2008262438A

Description

The present invention relates to a technique for collecting memory dump data when a failure occurs in a processor mounted on a disk array device. In particular, it relates to a disk array device having a multi-core processor in which one processor core is a processor core for transport control, and to a method for collecting failure information data of the transport control processor core, which make it possible to collect memory dump data that includes the failure information of the transport control processor core when that core fails.

In recent years, as information infrastructure has developed and the amount of data handled keeps growing daily, the information society demands highly reliable, highly available information systems. To realize such systems, disk array devices that allow large volumes of data to be accessed and backed up at all times are spreading rapidly.

With this rapid spread, disk array devices, whose performance has improved markedly, now carry a large number of components that are related in complex ways. As a result, when a problem occurs, identifying its cause and recognizing the scope of its impact consumes considerable resources, time, and labor. It is therefore necessary to collect useful failure information data about the cause of the problem (such as memory dump data of the CPU memory) within limited resources and time.

FIG. 7 is a diagram for explaining the collection of failure information data when a failure occurs. In the disk array device 50, the CMs (Controller Modules) 500 (a, b) are components that manage the entire storage system, including host I/O control and device maintenance control. The CPUs 510 (a, b) are processors that control the CMs 500 (a, b). The expanders 700 (a, b) are components that monitor and control the DEs (Drive Enclosures) in which the disks 600 (a, b) are mounted. Of the illustrated disks 600 (a, b), the disk 600b is assumed to be set in advance as the system disk.

In the disk array device 50 of FIG. 7, only two of each component, such as the CM 500 and the disk 600, are shown to keep the explanation simple. In practice, various components are made redundant and are related in complex ways.

For example, when a failure occurs in the CPU 510b of the CM 500b, the CPU 510b transitions from the normal state to the failure information storage state. In that state, the failure information storage function automatically stores the memory dump target data 525 in the memory 520 of the CPU 510b on the system disk (disk 600b) as failure information data 610 that is useful for determining the cause of the problem.

If the failure has a firmware cause (a software cause), then after the failure information data 610 has been stored, control is exercised to reset the failed CM 500b and automatically reincorporate it into the device. Through this control, the failed CM 500b recovers and returns to the normal, operable state.

The failure information data 610 stored on the system disk (disk 600b) can be collected with, for example, a maintenance personal computer (maintenance PC 800) connected to the disk array device 50. For example, when a failure occurs in the CM 500b at the site where the disk array device 50 is installed, a CE (Customer Engineer) or SE (System Engineer) at the site connects the maintenance PC 800 to the disk array device 50 and, through a maintenance CGI screen, collects the failure information data 610 stored on the system disk (disk 600b) of the disk array device 50 onto the disk 801 of the maintenance PC 800. The collected failure information data 610 is sent to the developer, where failure analysis is performed.

Documents describing techniques for collecting failure information data include, for example, Patent Document 1 and Patent Document 2.

Patent Document 1 describes a technique for speeding up failure analysis. A memory dump routine, which is a module separate from the operating system, is prepared on the processing device. When a dump switch is pressed, the processing device is restarted with the data left in memory, the dump routine is executed, and a memory dump is collected.

Patent Document 2 describes a technique for shortening the downtime of a computer system when a failure occurs. The dump data of the processor to be dumped is temporarily saved on the storage device of a save processor, the dumped processor is restarted without waiting for the dump data to be output, and the dump data on the storage device of the save processor is then output to an external storage device.

Patent Document 1: JP 2000-137630 A
Patent Document 2: JP 2001-34508 A

FIG. 8 is a diagram illustrating the problem addressed by the present invention. In recent years, multi-core processors, in which a plurality of processor cores are integrated in one package, have become widespread. In a multi-core processor, each processor core functions independently, without being affected by the other processor cores. In FIG. 8, the CPU 510b of the CM 500b is assumed to be a dual-core processor having two processor cores: an application core 511 and a transport core 512.

In the CPU 510b, the application core 511 is the processor core that runs the application firmware, which manages the entire storage system, including the RAID control and copy control functions related to host I/O control and device maintenance control. The transport core 512 is the processor core that runs the transport firmware, which governs the transport layer protocols of SAS/SATA and FC (Fibre Channel) on the host interface and the disk interface.

When a failure occurs in the application core 511, the application core 511 transitions from the normal state to the failure information storage state, as in the case described with reference to FIG. 7, and the failure information storage function stores the memory dump target data 525 on the system disk (disk 600b) as failure information data 610. The data transfer at this time is controlled by the transport core 512.

When a failure occurs in the transport core 512, the application core 511 transitions from the normal state to the failure information storage state. The failure information storage function extracts the failure information from the failed transport core 512 into the memory 520 and then attempts to store the memory dump target data 525, which includes the failure information of the transport core 512, on the system disk (disk 600b) as failure information data 610.

In this case, however, the transport core 512, which controls data transfer, has itself failed. The application core 511, which has the failure information storage function, therefore cannot access the system disk (disk 600b), and it is highly likely that the memory dump target data 525 including the failure information of the transport core 512 cannot be transferred to the system disk (disk 600b).

Thus, in a multi-core processor configuration in which one processor core is a processor core for transport control, a failure of that transport control processor core may make it impossible to transfer the memory dump target data in the memory to the system disk that stores failure information.

The technique described in Patent Document 1 collects a memory dump in a single-processor, single-core configuration. The technique described in Patent Document 2 aims to shorten the downtime of a computer system and assumes a multiprocessor configuration in which any processor can become the system management processor, the dump data save processor, the failed processor, or a related processor. Its purpose and device configuration therefore differ from those of the present invention.

In other words, the techniques described in Patent Documents 1 and 2 have no notion of a failure occurring in a specific processor core dedicated to transport control in a multi-core processor configuration, and so they cannot solve the problem described above.

An object of the present invention is to solve the above problem by providing a technique that, in a disk array device having a multi-core processor in which one processor core is a processor core for transport control, makes it possible to automatically store memory dump data including the failure information of the transport control processor core on the system disk, as failure information data useful for determining the cause of the problem, even when the failure occurs in the transport control processor core itself.

To solve the above problem, the present invention is characterized in that, in a multi-core processor configuration, when a failure occurs in the transport control processor core, the failure information of that core is saved to an area of the memory that is both a non-volatile target area and a storage target area. After a restart, the data in the storage target areas of the memory, including the area to which the failure information of the transport control processor core was saved, is automatically stored on the system disk via the transport control processor core as failure information data useful for determining the cause of the problem.

Specifically, the present invention is a disk array device comprising: a multi-core processor in which one processor core is a transport control processor core and at least one processor core other than the transport control processor core has a failure information data collection function; a memory of the multi-core processor; and a system disk that stores memory dump data collected from the memory as failure information data. A non-volatile target area of the memory, whose data is not initialized when the multi-core processor is restarted, stores (i) memory management information, which manages the memory area by area and holds at least information indicating whether each area is a non-volatile target area and information indicating whether it is a storage target area from which data is collected when a failure occurs, and (ii) information indicating whether a failure of the transport control processor core has occurred. The processor core having the failure information data collection function comprises: means for saving, when a failure occurs in the transport control processor core, the failure information of that core to an area of the memory that is both a non-volatile target area and a storage target area; means for setting the information indicating whether a failure of the transport control processor core has occurred to indicate such a failure, and for restarting the multi-core processor; and means for, at restart, when that information indicates a failure of the transport control processor core, collecting the data recorded in the memory areas set as storage target areas in the memory management information and storing it on the system disk via the transport control processor core.

As a result, in a disk array device having a multi-core processor in which one processor core is a processor core for transport control, memory dump data including the failure information of the transport control processor core can be automatically stored on the system disk as failure information data useful for determining the cause of the problem, even when the failure occurs in the transport control processor core.

Further, in the above disk array device, the present invention is characterized in that the means for saving the failure information of the transport control processor core dynamically secures an area of the memory in which to save that failure information, registers the secured area in the memory management information as both a non-volatile target area and a storage target area, and saves the failure information of the transport control processor core to the secured area.

As a result, no area for saving the failure information of the transport control processor core needs to be set aside in the memory in advance, so the memory can be used effectively during normal operation.

According to the present invention, in a disk array device having a multi-core processor in which one processor core is a processor core for transport control, memory dump data including the failure information of the transport control processor core can be automatically stored on the system disk, via the transport control processor core, as failure information data useful for determining the cause of the problem, even when the failure occurs in the transport control processor core.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram showing a configuration example of a disk array device according to an embodiment of the present invention. The disk array device 10 shown in FIG. 1 is drawn with the focus on one CM in particular, the CM 100. In the disk array device 10, the CM 100 and the CM 100' are components that manage the entire storage system, including host I/O control and device maintenance control. The CPU 110 is a processor that controls the CM 100. The expander 300 is a component that monitors and controls a DE (not shown) in which disks (not shown) are mounted. In the example of FIG. 1, only some components, such as the CM 100, are shown to keep the explanation simple. In practice, various components are made redundant and are related in complex ways.

The system disk 200 is a disk that stores, inside the disk array device 10, the failure information data collected when a failure occurs. A dedicated disk may be provided as the system disk 200, or part of a disk that stores data from user hosts may be set in advance as the system disk 200 area.

In the CM 100, the CPU 110 is a dual-core processor having two processor cores, an application core 111 and a transport core 112. The application core 111 is the processor core that runs the application firmware 130, which manages the entire storage system, including the RAID control and copy control functions related to host I/O control and device maintenance control. The transport core 112 is the processor core that runs the transport firmware 170, which governs the transport layer protocols of SAS/SATA and FC on the host interface and the disk interface.

The application firmware 130 has a normal routine 140, a failure information storage routine 150, and a power-on routine 160. The normal routine 140 is the program executed during normal operation of the CM 100. The failure information storage routine 150 is the program executed when a failure occurs in the CPU 110. The power-on routine 160 is the program executed when the CM 100 is started or restarted.

The memory 120 of the CPU 110 stores a transport firmware failure determination flag 121 and a memory management table 122. The transport firmware failure determination flag 121 indicates, when the CM 100 starts, whether the start is a restart caused by a transport firmware failure. Here, "1" indicates a restart caused by a transport firmware failure and "0" indicates any other start. The memory management table 122 is a table in which the management information of the memory 120 is recorded.

FIG. 2 is a diagram showing an example of the memory management table. The memory management table 122 is a table for managing the memory 120 area by area and is built from the memory descriptors when the CM 100 starts. The memory descriptors specify, among other things, the sizes of the areas that need to be allocated in the memory 120.

The memory management table 122 holds information such as a table number, a pool name, an allocate address, an allocate size, a storage flag, and a non-volatile flag.

The table number is an identification number assigned to each record of the memory management table 122. The pool name is the name of the memory area. The allocate address is the address of the memory area. The allocate size is the size of the memory area.

The storage flag indicates whether the memory area is a storage target area, that is, an area whose data is to be stored on the system disk 200. Here, "1" indicates that the area is a storage target area and "0" indicates that it is not. When a failure occurs, the data in the memory areas designated as storage target areas is transferred to the system disk 200 as failure information data.

The non-volatile flag indicates whether the memory area is a non-volatile target area. Here, "1" indicates that the area is a non-volatile target area and "0" indicates that it is not. A memory area designated as a non-volatile target area is not initialized when the CM 100 is restarted because of a transport firmware failure, so its data is retained. Conversely, a memory area not designated as a non-volatile target area is initialized even when the CM 100 is restarted because of a transport firmware failure.

In the memory management table shown in FIG. 2, the memory area with the pool name "SYS-MEM-DESC" is the area holding the memory management table 122 itself. As FIG. 2 shows, the non-volatile flag of "SYS-MEM-DESC" is "1", so this area is not initialized when the CM 100 is restarted because of a transport firmware failure. In other words, on such a restart the memory management table 122 is not rebuilt from the memory descriptors, and the table as it was before the restart remains in place. Although not shown in FIG. 2, the area in which the transport firmware failure determination flag 121 is recorded is also designated as a non-volatile target area.
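The record layout described above can be sketched as a small data structure. This is only an illustration: the field names follow the columns listed for FIG. 2, but the addresses, sizes, the "WORK-BUF" entry, and the storage flag value of "SYS-MEM-DESC" are assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    table_number: int
    pool_name: str
    allocate_address: int
    allocate_size: int
    storage_flag: int      # 1 = data is stored on the system disk after a failure
    nonvolatile_flag: int  # 1 = area is not initialized on a failure-induced restart


# Hypothetical table contents (addresses and sizes are invented).
memory_management_table = [
    # The storage flag of SYS-MEM-DESC is an assumption; the text only gives
    # its non-volatile flag.
    MemoryRegion(0, "SYS-MEM-DESC", 0x0000_0000, 0x1000, 0, 1),
    # TFW-INFO carries both flags set, as described for the save area.
    MemoryRegion(1, "TFW-INFO", 0x0000_1000, 0x4000, 1, 1),
    # WORK-BUF is a made-up ordinary area.
    MemoryRegion(2, "WORK-BUF", 0x0000_5000, 0x8000, 0, 0),
]


def storage_targets(table):
    """Pool names whose data would be sent to the system disk."""
    return [r.pool_name for r in table if r.storage_flag == 1]


def preserved_on_restart(table):
    """Pool names whose data survives a failure-induced restart."""
    return [r.pool_name for r in table if r.nonvolatile_flag == 1]
```

Keeping the two flags independent lets an area be preserved across the restart without being dumped, which is the role the table's own area plays in this sketch.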

FIG. 3 is a diagram showing an example of the functional configuration of the routines of the application firmware. The failure information storage routine 150 includes a failure information storage state notification processing unit 151, a transport firmware failure information save processing unit 152, a CM restart processing unit 153, and a failure information data storage processing unit 154.

The failure information storage state notification processing unit 151 notifies the other CM 100' and the expander 300 that the application core 111 of its own CM 100 has transitioned from the normal state to the failure information storage state. The transport firmware failure information save processing unit 152 saves the failure information to the memory 120 when a failure occurs in the transport firmware 170. The CM restart processing unit 153 performs the processing for restarting the CM 100. The failure information data storage processing unit 154 stores the data in the storage target areas of the memory 120 on the system disk 200 as failure information data.

The power-on routine 160 includes a transport firmware failure determination processing unit 161. The transport firmware failure determination processing unit 161 determines whether the start of the CM 100 is a restart caused by a failure of the transport firmware 170.
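The start-up decision can be sketched as follows: check the failure determination flag and, only when it is "1", copy every storage target area to the system disk. The dictionary layout, the function name `power_on_routine`, and the `write_to_system_disk` callback are hypothetical. Clearing the flag afterwards is also an assumption, since this section does not say when the flag is reset.

```python
def power_on_routine(memory, table, write_to_system_disk):
    """Return the pool names stored on the system disk during this start.

    memory: {"tfw_failure_flag": int, "regions": {pool_name: bytes}}
    table:  list of records with "pool_name" and "storage_flag" keys
    write_to_system_disk: callable(pool_name, data), standing in for the
        transfer that the restarted transport core would carry out
    """
    if memory.get("tfw_failure_flag") != 1:
        # Ordinary start: nothing to collect.
        return []

    stored = []
    for record in table:
        if record["storage_flag"] == 1:
            name = record["pool_name"]
            write_to_system_disk(name, memory["regions"][name])
            stored.append(name)

    # Assumption: the flag is cleared so that the next start is treated as normal.
    memory["tfw_failure_flag"] = 0
    return stored
```

The sketch relies on the flag and the table both living in non-volatile target areas. Otherwise neither would survive the restart that precedes this check.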

An example of the series of operations performed when a transport firmware failure occurs in this embodiment is now described with reference to FIGS. 1 to 3.

On detecting a failure of the transport firmware 170 in the transport core 112, the application core 111 transitions from the normal state to the failure information storage state. That is, the application core 111 stops the processing of the normal routine 140 and starts the failure information storage routine 150. The failure information storage state notification processing unit 151 of the failure information storage routine 150 notifies the other CM 100', the expander 300, and so on that the application core 111 of its own CM 100 has entered the failure information storage state.

The other CM 100', the expander 300, and so on are notified that the application core 111 of the CM 100 has entered the failure information storage state for the following reason. When responses from the CM 100 stop, the other CM 100' and the expander 300 judge that a hardware failure may have occurred in the CM 100 and, to avoid that risk, disconnect the unresponsive CM 100. Once the CM 100 with the software failure is known to be in the failure information storage state, it is not disconnected by the other CM 100' or the expander 300 while its failure information data is being stored.
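The peer-side reasoning above amounts to a small decision rule, sketched here. The state names and the function `peer_should_disconnect` are hypothetical; the patent describes the behavior but not an interface.

```python
from enum import Enum


class CmState(Enum):
    NORMAL = "normal"
    FAILURE_INFO_STORAGE = "failure_info_storage"


def peer_should_disconnect(reported_state, responding):
    """Decide whether a peer CM or expander disconnects a CM.

    reported_state: the last state the CM announced to its peers
    responding: whether the CM is currently answering
    """
    if responding:
        return False
    # A silent CM is treated as a possible hardware failure unless it has
    # announced that it is busy storing failure information.
    return reported_state is not CmState.FAILURE_INFO_STORAGE
```

For example, `peer_should_disconnect(CmState.NORMAL, False)` yields `True`, while the same silence after a failure information storage notification yields `False`.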

The application core 111 of the failed CM 100 collects the transport firmware failure information from the failed transport core 112 through control over the internal bus and saves it in the memory 120. That is, the transport firmware failure information save processing unit 152 of the failure information storage routine 150 sends the transport core 112 an instruction to collect the transport firmware failure information, which includes information specifying the save area in the memory 120.

In the memory management table 122 shown in FIG. 2, the pool name "TFW-INFO" denotes the save area for the transport firmware failure information. As FIG. 2 shows, the non-volatile flag of "TFW-INFO" is "1", so the save area is not initialized when the CM 100 is restarted because of a transport firmware failure. The storage flag of "TFW-INFO" is also "1", so the transport firmware failure information saved in that memory area is stored on the system disk 200 as failure information data.

The area of the memory 120 in which the transport firmware failure information is saved may be set in advance or may be secured dynamically. To set it in advance, it is sufficient to specify the area in the memory descriptors.

When the area for saving the transport firmware failure information is reserved dynamically, the transport firmware failure information save processing unit 152 of the failure information storage routine 150 refers to the memory management table 122 and reserves an area in the memory 120 that does not affect control such as fast boot (Fastboot) and that is not a storage target area for the system disk 200 (storage flag is "0"). This area is used as the area for saving the transport firmware failure information. At this time, a record for the save area is created in the memory management table 122, and both its storage flag and its nonvolatile flag are set to "1".
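
The dynamic reservation described above can be sketched as follows, purely as an illustration. The `Area` class, the `needed_by_control` marker (standing in for "affects control such as fast boot"), and the pool names other than "TFW-INFO" are assumptions; the description does not state how such areas are identified.

```python
from dataclasses import dataclass


@dataclass
class Area:
    pool_name: str
    nonvolatile: int
    storage: int
    needed_by_control: bool  # assumed marker for areas that fast boot or other control relies on


def reserve_save_area(table):
    """Reserve an area for transport firmware failure information.

    A candidate must have storage flag 0 and must not be needed by control.
    The chosen area is re-registered as "TFW-INFO" with both flags set to 1.
    Returns the new record, or None if no candidate exists.
    """
    for index, area in enumerate(table):
        if area.storage == 0 and not area.needed_by_control:
            record = Area("TFW-INFO", nonvolatile=1, storage=1, needed_by_control=False)
            table[index] = record
            return record
    return None
```

In this sketch an area with storage flag "0" that fast boot depends on is skipped, and an area that is already a storage target is never reused, so existing failure information data is not overwritten.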

The CM restart processing unit 153 of the failure information storage routine 150 sets the transport firmware failure determination flag 121, which is located in a nonvolatile target area of the memory 120, to "1", and requests another CM 100' or the expander 300 to reset its own CM 100. The other CM 100' or the expander 300 that receives the reset request resets the failed CM 100 that issued the request.

In the failed CM 100 that has been reset, the application core 111 and the transport core 112 each restart. At this time, the application core 111 performs a fast boot. With a fast boot, the CM 100 can be started while the data in the areas of the memory 120 designated as nonvolatile target areas (nonvolatile flag is "1") in the memory management table 122 is retained without being initialized.
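
The effect of a fast boot on the memory 120 can be modeled, as a simplified sketch only, by clearing every area except those whose nonvolatile flag is "1". The function name `fast_boot` and the dictionary-based memory model are assumptions; zero-filling is used here merely to represent initialization.

```python
def fast_boot(memory, nonvolatile_flags):
    """Model a fast boot: keep data in nonvolatile target areas, clear the rest.

    memory: dict mapping pool name to the bytes held in that area.
    nonvolatile_flags: dict mapping pool name to 0 or 1.
    """
    return {
        # Areas with nonvolatile flag 1 keep their data; others are zero-filled.
        name: data if nonvolatile_flags.get(name, 0) == 1 else bytes(len(data))
        for name, data in memory.items()
    }
```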

The transport firmware failure determination processing unit 161 of the power-on routine 160 checks the transport firmware failure determination flag 121 at an early stage of startup, before the failure information that is useful for identifying the cause of the problem is touched. If the transport firmware failure determination flag 121 is "1", the unit resets it to "0" and then calls the failure information storage routine 150 with an indication that a transport firmware failure occurred. If the transport firmware failure determination flag is "0", the normal routine 140 is called after normal power-on processing.
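
The determination made at power-on can be sketched as below. This is an illustrative model, not the firmware itself: the function name `power_on`, the key `tfw_failure_flag`, and the returned routine names are hypothetical labels for the flag 121, the failure information storage routine 150, and the normal routine 140.

```python
def power_on(nonvolatile_state):
    """Check the failure determination flag early and choose the next routine.

    nonvolatile_state: dict that survives the restart and holds the flag.
    Returns the name of the routine to run next.
    """
    if nonvolatile_state.get("tfw_failure_flag", 0) == 1:
        # Clear the flag first so that a later restart follows the normal path.
        nonvolatile_state["tfw_failure_flag"] = 0
        return "failure_information_storage_routine"
    return "normal_routine"
```

Clearing the flag before calling the storage routine means that a second restart does not repeat the dump.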

Because the transport core 112 has been reset and is operable, the application core 111 can access the system disk 200. When the failure information data storage processing unit 154 of the failure information storage routine 150 confirms that the CM 100 was restarted because of a transport firmware failure, it refers to the memory management table 122 and stores the data held in the areas of the memory 120 whose storage flag is "1" in the system disk 200 as failure information data. The data stored in the system disk 200 at this time includes the transport firmware failure information.
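
The storage step can be illustrated with the following sketch, in which the system disk 200 is modeled as a plain dictionary. The function name `store_failure_information` and the dictionary-based model are assumptions made for illustration.

```python
def store_failure_information(memory, storage_flags, system_disk):
    """Copy every area whose storage flag is 1 from memory to the system disk.

    memory: dict mapping pool name to the bytes held in that area.
    storage_flags: dict mapping pool name to 0 or 1.
    system_disk: dict standing in for the system disk; updated in place.
    """
    for name, flag in storage_flags.items():
        if flag == 1 and name in memory:
            system_disk[name] = memory[name]
    return system_disk
```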

If the failure occurs in the normal routine 140 of the application firmware, the failure information storage routine 150 neither saves transport firmware failure information nor restarts the CM 100. Instead, the failure information data storage processing unit 154 stores the data held in the areas of the memory 120 whose storage flag is "1" in the memory management table 122 in the system disk 200 as failure information data.

The flow of the series of processes performed when a transport firmware failure occurs in the present embodiment is described below with reference to the flowcharts of FIGS. 4 to 6.

FIG. 4 is flowchart (1) of the failure information storage processing performed by the application core when a transport firmware failure occurs. The processing shown in the flowchart of FIG. 4 is the preparatory stage for storing failure information data in the system disk 200.

When the application core 111 detects a failure of the transport firmware 170 (step S10), it transitions from the normal state to the failure information storage state (step S11). At this time, it notifies the other CM 100', the expander 300, and the like that it is in the failure information storage state (step S12).

The application core 111 refers to the memory management table 122, reserves an area in the memory 120 that does not affect other control and whose storage flag is "0" as the transport firmware failure information save area (step S13), and registers the reserved area in the memory management table 122 with the storage flag "1" and the nonvolatile flag "1" (step S14). It then saves the transport firmware failure information from the transport core 112 to the transport firmware failure information save area (step S15).

The application core 111 then sets the transport firmware failure determination flag 121 to "1" (step S16) and restarts its own CM 100 by fast boot (step S17).
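
The order of steps S11 to S17 in FIG. 4 can be summarized in the following sketch, which operates on a dictionary-based model of the CM. All key names and the function name `handle_transport_firmware_failure` are hypothetical, and the selection of the save area in step S13 is reduced to choosing a fixed pool name.

```python
def handle_transport_firmware_failure(cm):
    """Run the preparatory steps S11 to S17 on a dict-based model of the CM.

    Detection of the failure (S10) is assumed to have happened before the call.
    Returns the list of step labels in the order they were executed.
    """
    steps = []
    cm["state"] = "failure_information_storage"                # S11: leave the normal state
    steps.append("S11")
    cm["peers_notified"] = True                                # S12: tell other CM / expander
    steps.append("S12")
    pool = "TFW-INFO"                                          # S13: reserve a save area
    steps.append("S13")
    cm["table"][pool] = {"storage": 1, "nonvolatile": 1}       # S14: register it, both flags 1
    steps.append("S14")
    cm["memory"][pool] = cm["transport_failure_info"]          # S15: save the failure information
    steps.append("S15")
    cm["tfw_failure_flag"] = 1                                 # S16: set the determination flag
    steps.append("S16")
    cm["restart"] = "fast_boot"                                # S17: restart by fast boot
    steps.append("S17")
    return steps
```

The flag is set only after the failure information has been saved, so that a restart never finds the flag at "1" with an empty save area.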

FIG. 5 is flowchart (2) of the failure information storage processing performed by the application core when a transport firmware failure occurs. The processing shown in the flowchart of FIG. 5 is the CM 100 restart stage. In practice, various initialization processes are performed, but only the transport firmware failure determination processing is described here.

When a fast boot is performed, the application core 111 checks the transport firmware failure determination flag 121 at a relatively early stage of the initialization processing (step S20). If the transport firmware failure determination flag 121 is not "1" (step S21), normal initialization processing is performed and processing moves to the normal routine 140. If the transport firmware failure determination flag 121 is "1" (step S21), the flag is set to "0" (step S22), other necessary initialization processing is performed, and processing moves to the failure information storage routine 150.

FIG. 6 is flowchart (3) of the failure information storage processing performed by the application core when a transport firmware failure occurs. The processing shown in the flowchart of FIG. 6 is the stage of storing failure information data, including the transport firmware failure information, in the system disk 200.

When the application core 111 moves to the failure information storage routine 150 after the CM 100 restarts, it checks the memory management table 122 (step S30) and stores the data of the areas in the memory 120 whose storage flag is set to "1" in the system disk 200 (step S31).

An embodiment of the present invention has been described above, but the present invention is not limited to it. For example, the present embodiment describes a dual-core processor configuration in which one processor core is the processor core for transport control; however, a multi-core processor configuration having three or more processor cores, one of which is the processor core for transport control, may also be used.

FIG. 1 is a diagram showing a configuration example of a disk array device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a memory management table.
FIG. 3 is a diagram showing a functional configuration example of each routine of the application firmware.
FIG. 4 is flowchart (1) of the failure information storage processing performed by the application core when a transport firmware failure occurs.
FIG. 5 is flowchart (2) of the failure information storage processing performed by the application core when a transport firmware failure occurs.
FIG. 6 is flowchart (3) of the failure information storage processing performed by the application core when a transport firmware failure occurs.
FIG. 7 is a diagram for explaining the collection of failure information data when a failure occurs.
FIG. 8 is a diagram explaining the problem addressed by the present invention.

Explanation of symbols

10 Disk array device
100, 100' CM
110 CPU
111 Application core
112 Transport core
120 Memory
121 Transport firmware failure determination flag
122 Memory management table
130 Application firmware
140 Normal routine
150 Failure information storage routine
151 Failure information storage state notification processing unit
152 Transport firmware failure information save processing unit
153 CM restart processing unit
154 Failure information data storage processing unit
160 Power-on routine
161 Transport firmware failure determination processing unit
170 Transport firmware
200 System disk
300 Expander

Claims

A disk array device comprising: a multi-core processor in which one processor core is a processor core for transport control and at least one processor core other than the transport control processor core is a processor core having a failure information data collection function; a memory of the multi-core processor; and a system disk that stores memory dump data collected from the memory as failure information data,
wherein a nonvolatile target area of the memory, whose data is not initialized when the multi-core processor is restarted, holds memory management information, which is information for managing the memory for each area and has at least information indicating whether an area is a nonvolatile target area and information indicating whether an area is a storage target area from which data is collected when a failure occurs, and information indicating whether or not the transport control processor core is faulty, and
wherein the processor core having the failure information data collection function comprises:
means for saving failure information of the transport control processor core in an area of the memory that is a nonvolatile target area and a storage target area when a failure occurs in the transport control processor core;
means for setting the information indicating whether or not the transport control processor core is faulty to indicate a failure of the transport control processor core, and restarting the multi-core processor; and
means for, when the information indicating whether or not the transport control processor core is faulty indicates a failure of the transport control processor core at the time of restart, collecting the data recorded in the memory areas that the memory management information designates as storage target areas and storing the data in the system disk via the transport control processor core.

The disk array device according to claim 1,
wherein the means for saving failure information of the transport control processor core dynamically reserves an area in the memory for saving the failure information of the transport control processor core, registers the reserved area in the memory management information, and saves the failure information of the transport control processor core in the reserved area.

A method for collecting failure information data of a transport control processor core in a disk array device, the disk array device comprising: a multi-core processor in which one processor core is a processor core for transport control and at least one processor core other than the transport control processor core is a processor core having a failure information data collection function; a memory of the multi-core processor; and a system disk that stores memory dump data collected from the memory as failure information data, wherein a nonvolatile target area of the memory, whose data is not initialized when the multi-core processor is restarted, holds memory management information having at least information indicating whether an area is a nonvolatile target area and information indicating whether an area is a storage target area from which data is collected when a failure occurs, and information indicating whether or not the transport control processor core is faulty,
the method comprising, by the processor core having the failure information data collection function:
saving failure information of the transport control processor core in an area of the memory that is a nonvolatile target area and a storage target area when a failure occurs in the transport control processor core;
setting the information indicating whether or not the transport control processor core is faulty to indicate a failure of the transport control processor core, and restarting the multi-core processor; and
when the information indicating whether or not the transport control processor core is faulty indicates a failure of the transport control processor core at the time of restart, collecting the data recorded in the memory areas that the memory management information designates as storage target areas and storing the data in the system disk via the transport control processor core.

The method for collecting failure information data of a transport control processor core according to claim 3,
wherein, in the saving of the failure information of the transport control processor core, an area in the memory for saving the failure information of the transport control processor core is dynamically reserved, the reserved area is registered in the memory management information, and the failure information of the transport control processor core is saved in the reserved area.