CN102467438A

CN102467438A - Method for obtaining fault signal of storage device by baseboard management controller

Info

Publication number: CN102467438A
Application number: CN2010105467572A
Authority: CN
Inventors: 陈志伟; 卢晓芬
Original assignee: Inventec Corp
Current assignee: Inventec Corp
Priority date: 2010-11-12
Filing date: 2010-11-12
Publication date: 2012-05-23

Abstract

The invention discloses a method for obtaining a fault signal of a storage device by using a baseboard management controller (BMC), comprising the following steps: establishing a corresponding storage device sensor data record (SDR) and a storage device platform event filter (PEF) in a memory block of the BMC; starting a self-monitoring system of the storage device, and causing the self-monitoring system to detect health detection data of the storage device; writing a health threshold corresponding to the health detection data into the memory block of the BMC; and providing an update instruction, so that the self-monitoring system writes the health detection data into the storage device SDR through the update instruction.

Description

Method for obtaining a storage device fault signal by using a baseboard management controller

Technical field

The present invention relates to a method for obtaining a storage device fault signal, and in particular to a method for obtaining a storage device fault signal by using a baseboard management controller (BMC).

Background art

With the spread of computers and the rapid development of network technology, the services that an ordinary computer or device can provide are no longer sufficient, and server technology has been developed as a result. A server is a computer platform specialized in handling network tasks; it can be linked to various network systems and provide various application services to the computers connected through them. Most servers have large-capacity storage devices in order to provide services such as multimedia, network hard disks or enterprise databases. The storage device is therefore a very important component of a server, and a failure of the storage device seriously affects the server and even the services offered to clients.

To manage servers, the Intelligent Platform Management Interface (IPMI) technology came into being. An administrator can monitor a server through IPMI and a baseboard management controller (BMC) disposed in the server. In present servers, however, a fault signal is sent by independently operating hardware to light an indicator on the server, and the administrator is then notified after the storage device has failed. That is, the existing fault signal is controlled directly by hard-coded hardware. As a result, existing servers cannot integrate the fault signal and the management mechanism, which run in parallel, and cannot notify the administrator of a fault event efficiently.

Summary of the invention

To solve the above problem, an object of the present invention is to provide a method for obtaining a storage device fault signal by using a baseboard management controller (BMC). The method is applicable to a server having a BMC and a storage device. The method comprises: establishing, in a memory block of the BMC, a corresponding storage device sensor data record (SDR) and a storage device platform event filter (PEF); starting a self-monitoring system of the storage device (Self-Monitoring Analysis and Reporting Technology, S.M.A.R.T.), and causing the S.M.A.R.T. to periodically detect health detection data of the storage device; writing at least one health threshold corresponding to the health detection data into the memory block of the BMC; and providing an update instruction, so that the S.M.A.R.T. periodically writes the health detection data into the storage device SDR through the update instruction.

The health detection data may comprise at least one health item, and each health item corresponds to a respective health threshold. A health item may be the bad sector count (uncorrectable sector count), the current temperature (temperature) or the current rotational speed (speed). A health item may also be the read error rate (read error rate), the spin retry count (spin retry count) or the current pending sector count (current pending sector count).

The S.M.A.R.T. may write the health threshold corresponding to the health detection data into the BMC through a set sensor threshold command of the Intelligent Platform Management Interface (IPMI).

According to an embodiment, the method may further comprise: executing a storage device management procedure according to the storage device SDR, the health threshold and the storage device PEF.

The storage device management procedure may comprise: notifying a remote manager connected to the BMC through an Intelligent Platform Management Bus (IPMB). The storage device management procedure may also comprise: suspending at least one storage unit of the storage device according to the storage device SDR and the health threshold. The storage device management procedure may also comprise: lighting a light emitting diode (LED) group corresponding to the storage device according to the storage device SDR and the health threshold.

The storage device may comprise a plurality of storage units, in which case the LED group comprises a plurality of LED indicators corresponding respectively to the storage units. The storage device management procedure may light at least one LED indicator according to the health detection data and the health threshold.

In summary, the method uses the storage device SDR and the S.M.A.R.T. to obtain the current health status of the storage device. The BMC can then light the corresponding LED group and at the same time notify a remote administrator. The disk fault indicator mechanism formerly controlled by hardware is thus integrated into the events managed by the BMC, which unifies the management interface and improves management efficiency.

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the present invention.

Brief description of the drawings

Fig. 1 is a schematic diagram of a server according to an embodiment;

Fig. 2 is a flowchart of a method for obtaining a storage device fault signal by using a baseboard management controller according to an embodiment;

Fig. 3 is a flowchart of a method for obtaining a storage device fault signal by using a baseboard management controller according to another embodiment;

Fig. 4 is a schematic diagram of a server according to another embodiment.

Reference numerals:

20 server

21 baseboard management controller

210 memory block

212 storage device SDR

214 storage device PEF

22 storage device

222, 222a, 222b, 222c storage unit

23 self-monitoring system

24 central processing unit

25 LED group

252, 252a, 252b, 252c LED indicator

30 remote computer

32 remote manager

Detailed description of the embodiments

The detailed features and advantages of the present invention are described in the embodiments below. The content is sufficient to enable any person skilled in the art to understand the technical content of the present invention and to implement it accordingly; and from the content disclosed in this specification, the claims and the drawings, any person skilled in the art can readily understand the objects and advantages of the present invention.

The present invention relates to a method for obtaining a storage device fault signal by using a baseboard management controller (BMC), which is applicable to a server having a BMC and a storage device.

Please refer to Fig. 1, which is a schematic diagram of a server according to an embodiment. The server 20 comprises a BMC 21, a storage device 22, a self-monitoring system (Self-Monitoring Analysis and Reporting Technology, S.M.A.R.T.) 23 and a central processing unit (CPU) 24, wherein the CPU 24 is electrically connected to the storage device 22 and the S.M.A.R.T. 23. The storage device 22 may be, for example, any of various large-capacity hard disks, or a disk array (redundant array of inexpensive disks, RAID) system. The server 20 may also be connected to a remote computer 30 through a network, and the remote computer 30 may manage the server 20 through a remote manager 32 and the BMC 21.

The server 20 may support the Intelligent Platform Management Interface (IPMI) and run an operating system on the above hardware. The server 20 may use an operating system such as Linux or FreeBSD of the Unix family, or Microsoft Windows Server 2003; it may also use a disk operating system (DOS) or an Extensible Firmware Interface (EFI) system. The server 20 may be a server product of any brand, and the present invention is not limited in this respect.

In more detail, the Intelligent Platform Management Interface is a standard architecture for server management platforms. It comprises five components: the BMC 21, a system interface, non-volatile storage, the Intelligent Platform Management Bus (IPMB) and the Intelligent Chassis Management Bus (ICMB). The most important of these is the BMC 21. The BMC 21 resembles an independent computer with its own resources, such as a processor and memory. The BMC 21 operates entirely on its own resources and does not occupy the resources of the other hardware modules of the server 20. For example, the remote computer 30 may use the iLO system of Hewlett-Packard (HP), the iDRAC system of Dell, or the ESB2 system of Intel.

The S.M.A.R.T. 23 is a hard disk self-diagnosis technology developed by IBM, and it has been adopted by the major computer hardware manufacturers and hard disk manufacturers. Most present hard disks and disk arrays therefore support the S.M.A.R.T. 23. In short, the S.M.A.R.T. 23 is a system for monitoring the storage device 22; it can detect and report the health status of the storage device 22. The S.M.A.R.T. 23 can periodically check various items of the storage device 22, such as its current temperature or current rotational speed.

Please refer to Fig. 2 together with Fig. 1. Fig. 2 is a flowchart of a method for obtaining a storage device fault signal by using a BMC according to an embodiment. First, a corresponding storage device sensor data record (SDR) 212 and a storage device platform event filter (PEF) 214 are established in the memory block 210 of the BMC 21 (step S100). The storage device PEF 214 may contain at least one storage device management procedure, which provides the basis on which the BMC 21 manages the storage device 22.

After the storage device SDR 212 and the storage device PEF 214 corresponding to the storage device 22 have been established, the S.M.A.R.T. 23 of the storage device 22 is started and caused to periodically detect health detection data of the storage device 22 (step S110). At least one health threshold corresponding to the health detection data is then written into the memory block 210 of the BMC 21 (step S120). The memory block 210 of the BMC 21 thus holds at least one mutually corresponding set of a storage device SDR 212, a storage device PEF 214 and a health threshold.

The health detection data may comprise at least one health item, and each health item corresponds to a respective health threshold. A health item may be, for example, the bad sector count (uncorrectable sector count), the current temperature (temperature) or the current rotational speed (speed). A health item may also be the read error rate (read error rate), the spin retry count (spin retry count) or the current pending sector count (current pending sector count). The S.M.A.R.T. 23 may write the health threshold corresponding to the health detection data into the BMC 21 through the set sensor threshold command of IPMI.
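
As an illustration only, the request data of such a threshold-setting command might be assembled as in the following Python sketch. The network function 0x04, the command code 0x26 and the byte layout (sensor number, mask byte, then six threshold bytes) are recalled from the IPMI specification and should be checked against it; the sensor number 0x30, the function name and the choice of an upper critical threshold are assumptions that do not come from this specification. Thresholds are treated as raw one-byte sensor values.

```python
# Sketch under stated assumptions; verify the constants against the IPMI specification.
NETFN_SENSOR_EVENT = 0x04          # network function for sensor/event commands
CMD_SET_SENSOR_THRESHOLDS = 0x26   # "Set Sensor Thresholds" command code

# Order of the six threshold bytes; bit i of the mask byte selects entry i.
THRESHOLD_ORDER = (
    "lower_non_critical",
    "lower_critical",
    "lower_non_recoverable",
    "upper_non_critical",
    "upper_critical",
    "upper_non_recoverable",
)


def build_set_sensor_thresholds(sensor_number, **raw_thresholds):
    """Return the request data bytes: sensor number, mask, six threshold bytes."""
    mask = 0
    values = [0] * len(THRESHOLD_ORDER)
    for name, raw in raw_thresholds.items():
        if name not in THRESHOLD_ORDER:
            raise ValueError(f"unknown threshold: {name}")
        if not 0 <= raw <= 0xFF:
            raise ValueError(f"raw threshold out of range: {raw}")
        index = THRESHOLD_ORDER.index(name)
        mask |= 1 << index          # mark this threshold as one to be set
        values[index] = raw
    return bytes([sensor_number, mask, *values])


# Hypothetical temperature sensor 0x30 with an upper critical threshold of raw value 55.
request = build_set_sensor_thresholds(0x30, upper_critical=55)
```

Thresholds that are not named in the call keep a zero byte and a cleared mask bit, so only the selected thresholds would be changed.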

The BMC 21 may provide an update instruction, so that the S.M.A.R.T. 23 periodically writes the latest health detection data into the storage device SDR 212 through the update instruction (step S130). That is, the BMC 21 can obtain the current health status of the storage device 22 through the storage device SDR 212 and the S.M.A.R.T. 23, without adding an extra detector for the storage device 22.
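
Steps S100 to S130 can be pictured with the following minimal Python sketch. It is a simulation under assumptions, not the implementation of the embodiment: the class `Bmc`, its method names, the item names and the callable `read_health_data` (standing in for one periodic S.M.A.R.T. reading) are all hypothetical.

```python
class Bmc:
    """Hypothetical stand-in for BMC 21 and its memory block 210."""

    def __init__(self):
        self.sdr = {}         # storage device SDR 212: health item -> latest value
        self.pef = {}         # storage device PEF 214: health item -> procedures
        self.thresholds = {}  # health thresholds: health item -> limit

    def create_records(self, items):
        # Step S100: establish the SDR and PEF entries for each health item.
        for item in items:
            self.sdr[item] = None
            self.pef[item] = []

    def set_threshold(self, item, limit):
        # Step S120: write a health threshold for an existing health item.
        if item not in self.sdr:
            raise KeyError(item)
        self.thresholds[item] = limit

    def update(self, item, value):
        # The "update instruction" offered to the self-monitoring system (step S130).
        if item not in self.sdr:
            raise KeyError(item)
        self.sdr[item] = value


def poll_once(bmc, read_health_data):
    """One periodic cycle (steps S110 and S130): read the data, write it to the SDR."""
    for item, value in read_health_data().items():
        bmc.update(item, value)


bmc = Bmc()
bmc.create_records(["temperature", "bad_sector_count"])
bmc.set_threshold("temperature", 55)
poll_once(bmc, lambda: {"temperature": 41, "bad_sector_count": 0})
```

In this sketch the BMC never queries the disk itself; it only exposes `update`, which mirrors the statement that no extra detector is needed for the storage device.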

Please refer to Fig. 3, which is a flowchart of a method for obtaining a storage device fault signal by using a BMC according to another embodiment. The method may further execute a storage device management procedure according to the storage device SDR 212, the health threshold and the storage device PEF 214 (step S140). The BMC 21 periodically reads at least one health item and compares the current value of that health item with the corresponding health threshold to determine whether the storage device 22 is abnormal. For example, when the current temperature or the bad sector count of the storage device 22 is higher than the corresponding health threshold, the BMC 21 may determine that the storage device 22 has failed. The BMC 21 may write the fault event into a system event log (SEL), then look up a suitable storage device management procedure in the storage device PEF 214 according to the content of the SEL and execute it.
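
The comparison in step S140 can be sketched as follows. This is a minimal illustration, assuming that a fault means a reading strictly above its threshold and that the SEL is a plain list of dictionaries; the function name and the record fields are hypothetical.

```python
def check_health(sdr, thresholds, sel):
    """Compare each reading with its threshold and log every fault to the SEL.

    Returns the list of health items found to be above their thresholds.
    """
    faulty = []
    for item, limit in thresholds.items():
        value = sdr.get(item)
        # Items that have no reading yet are skipped rather than treated as faults.
        if value is not None and value > limit:
            sel.append({"item": item, "value": value, "threshold": limit})
            faulty.append(item)
    return faulty


sel = []
faulty = check_health(
    {"temperature": 61, "bad_sector_count": 3},
    {"temperature": 55, "bad_sector_count": 10},
    sel,
)
```

Only the temperature exceeds its threshold here, so only one fault event is appended to `sel`.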

The storage device management procedure may comprise: notifying the remote manager 32 connected to the BMC 21 through the IPMB. When the failure of the storage device 22 is serious, the storage device management procedure may also comprise: suspending at least one storage unit of the storage device 22 according to the storage device SDR 212 and the health threshold. In addition, the storage device management procedure may also comprise: lighting a light emitting diode (LED) group 25 corresponding to the storage device 22 according to the storage device SDR 212 and the health threshold. That is, the functions of lighting the LED group 25 and notifying the remote manager 32 are both integrated into the BMC 21.
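
How logged events might be matched to management procedures can be sketched as below. The table layout of the PEF (health item mapped to a list of procedure names), the procedure names and the function `run_management_procedures` are assumptions made for illustration; the actual notification, suspension and LED operations are replaced by callables supplied by the caller.

```python
def run_management_procedures(sel_entries, pef, actions):
    """Execute, for each logged event, the procedures the PEF lists for its item.

    Returns the (item, procedure name) pairs in the order they were executed.
    """
    performed = []
    for entry in sel_entries:
        for name in pef.get(entry["item"], []):
            actions[name](entry)   # e.g. notify, suspend a unit, light the LED group
            performed.append((entry["item"], name))
    return performed


notified, lit = [], []
actions = {
    "notify_remote_manager": lambda event: notified.append(event["item"]),
    "light_led_group": lambda event: lit.append(event["item"]),
}
pef = {"temperature": ["light_led_group", "notify_remote_manager"]}
sel = [{"item": "temperature", "value": 61, "threshold": 55}]
performed = run_management_procedures(sel, pef, actions)
```

Keeping the procedures in one table is what lets the LED control and the remote notification be triggered by the same event.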

Please also refer to Fig. 4, which is a schematic diagram of a server according to another embodiment. The storage device 22 may comprise a plurality of storage units 222, for example a storage unit 222a, a storage unit 222b and a storage unit 222c. The LED group 25 may then comprise LED indicators 252 equal in number to the storage units 222, for example an LED indicator 252a, an LED indicator 252b and an LED indicator 252c. From the storage device SDR 212 and the health threshold, the BMC 21 can determine which storage unit 222 in the storage device 22 has failed, and then light the LED indicator 252 corresponding to the failed storage unit 222. An administrator who comes to inspect the server 20 can thus easily see the failure condition of the storage device 22.
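
The per-unit mapping of Fig. 4 can be sketched as follows. The reference numerals are reused as dictionary keys purely for readability; the function name, the data layout and the rule that a unit fails when any of its readings exceeds the threshold are assumptions.

```python
def indicators_to_light(unit_readings, thresholds, indicator_for_unit):
    """Return the LED indicators of every storage unit with a reading above its threshold."""
    lit = []
    for unit, readings in unit_readings.items():
        failed = any(
            item in readings and readings[item] > limit
            for item, limit in thresholds.items()
        )
        if failed:
            lit.append(indicator_for_unit[unit])
    return lit


unit_readings = {
    "222a": {"temperature": 40},
    "222b": {"temperature": 63},
    "222c": {"temperature": 42, "bad_sector_count": 0},
}
thresholds = {"temperature": 55, "bad_sector_count": 10}
indicator_for_unit = {"222a": "252a", "222b": "252b", "222c": "252c"}
lit = indicators_to_light(unit_readings, thresholds, indicator_for_unit)
```

With these sample readings only storage unit 222b is above its temperature threshold, so only its indicator is selected.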

In summary, the method provides the S.M.A.R.T. with an update instruction for updating the storage device SDR, and thereby obtains the current health status of the storage device. After an abnormality is detected, the storage device management procedure can not only light the corresponding LED group but also notify a remote administrator. That is, the disk fault indicator mechanism that was originally controlled independently by hardware is integrated into the events managed by the BMC, so that the management interface is unified. This resolves the disorderly prior-art approach in which several parallel mechanisms pull in different directions, allows the server to be managed in a simpler and more efficient way, and allows the administrator to be notified efficiently when a fault event occurs.

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims

1. A method for obtaining a storage device fault signal by using a baseboard management controller (BMC), applicable to a server having a BMC and a storage device, characterized in that the method comprises:

establishing, in a memory block of the BMC, a corresponding storage device sensor data record (SDR) and a storage device platform event filter (PEF);

starting a self-monitoring system of the storage device, and causing the self-monitoring system to periodically detect health detection data of the storage device;

writing at least one health threshold corresponding to the health detection data into the memory block of the BMC; and

providing an update instruction, so that the self-monitoring system periodically writes the health detection data into the storage device SDR through the update instruction.

2. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 1, characterized in that the health detection data comprises at least one health item, and each health item corresponds to a respective health threshold.

3. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 2, characterized in that the health item is a bad sector count, a current temperature or a current rotational speed.

4. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 3, characterized in that the health item is a read error rate, a spin retry count or a current pending sector count.

5. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 1, characterized in that the self-monitoring system writes the health threshold corresponding to the health detection data into the BMC through a set sensor threshold command of an Intelligent Platform Management Interface.

6. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 1, characterized in that the method further comprises:

executing a storage device management procedure according to the storage device SDR, the health threshold and the storage device PEF.

7. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 6, characterized in that the storage device management procedure comprises:

notifying a remote manager connected to the BMC through an Intelligent Platform Management Bus.

8. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 6, characterized in that the storage device management procedure comprises:

suspending at least one storage unit of the storage device according to the storage device SDR and the health threshold.

9. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 6, characterized in that the storage device management procedure comprises:

lighting a light emitting diode group corresponding to the storage device according to the storage device SDR and the health threshold.

10. The method for obtaining a storage device fault signal by using a baseboard management controller according to claim 9, characterized in that the storage device comprises a plurality of storage units, the light emitting diode group comprises a plurality of light emitting diode indicators corresponding respectively to the storage units, and the storage device management procedure lights at least one of the light emitting diode indicators according to the health detection data and the health threshold.