TWI820814B - Storage system and drive recovery method thereof - Google Patents

Storage system and drive recovery method thereof

Info

Publication number
TWI820814B
Authority
TW
Taiwan
Prior art keywords
storage device
list
devices
storage
determining
Prior art date
Application number
TW111127518A
Other languages
Chinese (zh)
Other versions
TW202405655A (en)
Inventor
柯乃元
Original Assignee
威聯通科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 威聯通科技股份有限公司
Priority to TW111127518A (TWI820814B)
Priority to CN202211227693.9A (CN117472619A)
Application granted
Publication of TWI820814B
Publication of TW202405655A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Abstract

A storage system and a drive recovery method thereof are provided. The method includes the following steps. In response to a failure event of a storage device, whether the storage device exists is determined. In response to determining that the storage device exists, the storage device is added to a recovery attempt list. Whether the number of devices in a recovery ongoing list is less than or equal to a threshold value is determined. In response to determining that the number of devices in the recovery ongoing list is less than or equal to the threshold value, a power reset operation is immediately performed on the storage device.

Description

Storage system and drive recovery method thereof

The present invention relates to a storage system, and in particular to a storage system and a drive recovery method thereof.

In a Network Attached Storage (NAS) device with multiple hard drives, a drive occasionally becomes disconnected from the system host. There are many possible causes: the drive itself may be damaged, or the drive backplane or the drive controller chip may have failed. Maintenance personnel often need to obtain the physical device and perform a copy operation on the drive before they can identify why the drive disconnected from the system, which is time-consuming and labor-intensive.

Generally, if the disconnection is not caused by damage to the drive itself, the user can unplug and re-insert the drive so that it reconnects to the system and its read and write operations return to normal. However, if the user cannot promptly reach the NAS device to restart it or to re-insert the drive, the disconnected drive remains out of service for a long time. If the disconnected drive belongs to an established redundant array of independent disks (RAID), the RAID becomes degraded and is at high risk of data loss.

In view of this, embodiments of the present invention provide a storage system and a drive recovery method thereof that can solve the above technical problem.

The drive recovery method of the storage system according to an embodiment of the present invention includes (but is not limited to) the following steps. In response to a failure event of a storage device, whether the storage device exists is determined. In response to determining that the storage device exists, the storage device is added to a recovery attempt list. Whether the number of devices in a recovery ongoing list is less than or equal to a threshold value is determined. In response to determining that the number of devices in the recovery ongoing list is less than or equal to the threshold value, a power reset operation is immediately performed on the storage device.

The storage system according to an embodiment of the present invention includes at least one storage device and a processor. The processor is connected to the storage device and is configured to perform the following steps. In response to a failure event of a storage device, whether the storage device exists is determined. In response to determining that the storage device exists, the storage device is added to a recovery attempt list. Whether the number of devices in a recovery ongoing list is less than or equal to a threshold value is determined. In response to determining that the number of devices in the recovery ongoing list is less than or equal to the threshold value, a power reset operation is immediately performed on the storage device.

Based on the above, in embodiments of the present invention, when a storage device fails and cannot operate normally, it is added to the recovery attempt list. When the number of devices recorded in the recovery ongoing list is less than or equal to the threshold value, a power reset operation is automatically and immediately performed on the storage device in an attempt to restore it to normal operation. In this way, a disconnected storage device can return to normal operation as soon as possible, and the risk of data loss is reduced.

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where the same reference numerals appear in different drawings, they denote the same or similar elements. These embodiments are only a part of the present invention and do not disclose all of its possible implementations. Rather, they are merely examples of the methods and systems within the scope of the claims of the present invention.

FIG. 1 is a block diagram of a storage system according to an embodiment of the present invention. Referring to FIG. 1, the storage system 10 includes at least one storage device 110_1 to 110_N, a processor 120, and a memory 130. In some embodiments, the storage system 10 may be implemented as a Network Attached Storage (NAS) device or another type of network server.

The storage devices 110_1 to 110_N are, for example, solid state drives (SSDs) or hard disk drives (HDDs), and the present invention is not limited in this respect. The present invention also does not limit the number of storage devices 110_1 to 110_N. In some embodiments, some or all of the storage devices 110_1 to 110_N may form a redundant array of independent disks (RAID). The storage system 10 further includes multiple device bays, each provided with an electrical slot. These electrical slots are, for example, SATA slots or U.2 PCIe slots, but the present invention is not limited thereto. The storage devices 110_1 to 110_N are placed in the device bays so that each can be inserted into the corresponding electrical slot and thereby connected to the processor 120.

The processor 120 may be a central processing unit (CPU), a general-purpose processor, another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), another similar component, or a combination of the above.

The memory 130 stores data such as instructions, program code, and software modules. It may be, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, another similar device, an integrated circuit, or a combination thereof.

The processor 120 can access and execute the software modules recorded in the memory 130 to implement the drive recovery method of the embodiments of the present invention. The term software module is to be construed broadly to mean instructions, instruction sets, code, program code, programs, applications, software packages, threads, procedures, functions, and the like, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.

FIG. 2 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 2, the method of this embodiment can be executed by the storage system 10 of FIG. 1, and the details of each step in FIG. 2 are described below with reference to the components shown in FIG. 1. To make the concept of the present invention easier to understand, the following description takes a failure of the storage device 110_1 as an example.

In step S202, the processor 120 detects a failure event of the storage device 110_1. Specifically, in some embodiments, a failure event of the storage device 110_1 has occurred when the processor 120 cannot detect or cannot identify the storage device 110_1, and the processor 120 detects the failure event accordingly. Viewed another way, in some embodiments, a failure event of the storage device 110_1 has occurred when the connection between the processor 120 and the storage device 110_1 is lost, and the processor 120 detects the failure event accordingly.

In step S204, in response to the failure event of the storage device 110_1, the processor 120 determines whether the storage device 110_1 exists. In other words, the processor 120 determines whether the storage device 110_1 is still inserted in its electrical slot. If the storage device 110_1 has been pulled out of the electrical slot, the processor 120 determines that the storage device 110_1 does not exist. Conversely, if the storage device 110_1 is inserted in the electrical slot, the processor 120 determines that the storage device 110_1 exists.

If the determination in step S204 is yes, in step S206, in response to determining that the storage device 110_1 exists, the processor 120 adds the storage device 110_1 to a recovery attempt list. That is, when the storage device 110_1 that experienced the failure event is still inserted in its electrical slot, the processor 120 adds it to the recovery attempt list. Every storage device recorded in the recovery attempt list has therefore experienced a failure event and is still inserted in its electrical slot.

Next, in step S208, the processor 120 determines whether the number of devices in the recovery ongoing list is less than or equal to a threshold value. Specifically, the storage devices recorded in the recovery ongoing list come from the recovery attempt list, and all of them are currently undergoing a recovery operation, which may include a power reset operation and a data rebuild operation. The processor 120 counts the storage devices recorded in the recovery ongoing list and determines whether that count is less than or equal to the threshold value.

In some embodiments, the storage system 10 may include M device bays, with the storage device 110_1 inserted in one of them. The threshold value compared against the number of devices recorded in the recovery ongoing list may be the number of device bays M multiplied by a preset ratio. The preset ratio may be one half or another ratio, so the threshold value is less than or equal to M. For example, if the preset ratio is one half, the threshold value is M/2.

If the determination in step S208 is yes, in step S210, in response to determining that the number of devices in the recovery ongoing list is less than or equal to the threshold value, the processor 120 immediately performs a power reset operation on the storage device 110_1. If the determination in step S208 is no, in step S212, in response to determining that the number of devices in the recovery ongoing list is not less than or equal to the threshold value, the processor 120 performs the power reset operation on the storage device 110_1 after waiting for an elapsed time. For example, if the storage system 10 includes 4 device bays and the preset ratio is one half, the threshold value is 4/2 = 2. With a threshold value of 2, the processor 120 can perform power reset operations on at most 2 storage devices at the same time. As another example, if the storage system 10 includes 5 device bays and the preset ratio is one half, the threshold value is 5/2 = 2.5. With a threshold value of 2.5, the processor 120 can likewise perform power reset operations on at most 2 storage devices at the same time.
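
The threshold arithmetic in the two examples above can be reproduced with a minimal sketch. The function names `recovery_threshold` and `may_reset_immediately` are hypothetical and are not taken from the patent; only the formula (number of bays M multiplied by a preset ratio) and the less-than-or-equal comparison come from the text.

```python
def recovery_threshold(num_bays: int, ratio: float = 0.5) -> float:
    """Threshold = number of device bays M multiplied by a preset ratio."""
    if num_bays < 0 or not 0.0 <= ratio <= 1.0:
        raise ValueError("num_bays must be >= 0 and ratio must be within [0, 1]")
    return num_bays * ratio


def may_reset_immediately(ongoing_count: int, threshold: float) -> bool:
    """Step S208: compare the size of the recovery ongoing list with the threshold."""
    return ongoing_count <= threshold


print(recovery_threshold(4))  # 2.0
print(recovery_threshold(5))  # 2.5
```

Restricting the ratio to the range 0 to 1 is an assumption that follows from the statement that the threshold value is less than or equal to M.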

In other words, when the number of storage devices undergoing recovery in the recovery ongoing list is less than or equal to the threshold value, the processor 120 immediately performs the power reset operation on the storage device 110_1 so that it starts its recovery operation. In a power reset operation, the storage device 110_1 is first powered off and then powered on again. Conversely, when that number is not less than or equal to the threshold value, the processor 120 postpones the power reset operation so that the storage device 110_1 starts its recovery operation only after waiting for an elapsed time. This prevents too many storage devices from performing power reset operations at the same time and overloading the power supply of the storage system 10.
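
The decision flow of FIG. 2 (steps S204 to S212) might be sketched as follows. This is an illustration only: `RecoveryState`, `handle_failure`, and the returned action strings are hypothetical names, and returning "ignore" for a device that is no longer inserted is an assumption, since the text does not say what happens on that branch.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryState:
    attempt_list: list = field(default_factory=list)  # failed but still inserted
    ongoing_list: list = field(default_factory=list)  # currently being recovered
    threshold: float = 2.0


def handle_failure(state: RecoveryState, device: str, is_present: bool) -> str:
    """Return the action chosen for a failed device."""
    if not is_present:
        # S204 "no": the device was pulled out (assumed: nothing to recover).
        return "ignore"
    state.attempt_list.append(device)  # S206
    if len(state.ongoing_list) <= state.threshold:  # S208
        return "reset_now"  # S210
    return "reset_after_wait"  # S212


state = RecoveryState(threshold=2.0)
print(handle_failure(state, "110_1", is_present=True))  # reset_now
```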

Other embodiments are described below to illustrate further aspects of the present invention. For clarity, these embodiments continue to use a failure event of the storage device 110_1 as the example.

FIG. 3 is a block diagram of a storage system according to an embodiment of the present invention. Referring to FIG. 3, the storage system 10 further includes a GPIO (General Purpose Input/Output) interface 140, a power supply device 150, and a switch device 160. The storage device 110_1 in the storage system 10 can be connected to the processor 120 through the GPIO interface 140, and the processor 120 can detect through a GPIO pin of the GPIO interface 140 whether the storage device 110_1 is inserted in its electrical slot. In some embodiments, when the GPIO pin used to detect the connection state is at a high level, the processor 120 determines that the storage device 110_1 is inserted in the electrical slot. When that GPIO pin is at a low level, the processor 120 determines that the storage device 110_1 has been pulled out of the electrical slot.
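
The presence check described above reduces to reading one pin level. A minimal sketch, assuming a hypothetical `read_pin` callable that returns 1 for a high level and 0 for a low level (the pin numbers and the dictionary standing in for hardware are also illustrative):

```python
HIGH, LOW = 1, 0


def device_present(read_pin, presence_pin: int) -> bool:
    """High level on the presence pin means the device is inserted in the slot."""
    return read_pin(presence_pin) == HIGH


# A dictionary stands in for real GPIO hardware in this illustration.
pin_levels = {3: HIGH, 4: LOW}
print(device_present(pin_levels.get, 3))  # True
print(device_present(pin_levels.get, 4))  # False
```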

In addition, the switch device 160 is connected between the storage device 110_1 and the power supply device 150, and is connected to the processor 120 through the GPIO interface 140. The processor 120 can turn the switch device 160 on or off to control whether the power output by the power supply device 150 is supplied to the storage device 110_1. The switch device 160 is, for example, an electronic fuse (eFuse IC). In some embodiments, the processor 120 provides a switch control signal to the switch device 160 through a GPIO pin of the GPIO interface 140. When the GPIO pin used to control the power supply is at a high level, the switch device 160 is turned on and supplies power to the storage device 110_1. When that GPIO pin is at a low level, the switch device 160 is turned off and stops supplying power to the storage device 110_1. Although FIG. 3 uses only the storage device 110_1 as an example, the other storage devices 110_2 to 110_N can be connected to the processor 120 through a similar hardware configuration.
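
A power reset through the switch device can be sketched as driving the power-control pin low and then high again. `FakeGpio`, `power_reset`, and the `off_seconds` delay are hypothetical; the text specifies only that a low level cuts power and a high level restores it, not how long the device stays powered off.

```python
import time


class FakeGpio:
    """Stand-in for a GPIO interface; records every write for inspection."""

    def __init__(self):
        self.levels = {}
        self.history = []

    def write(self, pin: int, level: int) -> None:
        self.levels[pin] = level
        self.history.append((pin, level))


def power_reset(gpio, power_pin: int, off_seconds: float = 0.0) -> None:
    """Power the device off, wait, then power it on again."""
    gpio.write(power_pin, 0)  # low level: switch turns off, power is removed
    time.sleep(off_seconds)   # assumed hold time; not specified in the text
    gpio.write(power_pin, 1)  # high level: switch turns on, power is restored


gpio = FakeGpio()
power_reset(gpio, power_pin=7)
print(gpio.history)  # [(7, 0), (7, 1)]
```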

FIG. 4 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 4, the method of this embodiment can be executed by the storage system 10 of FIG. 3, and the details of each step in FIG. 4 are described below with reference to the components shown in FIG. 3.

In step S402, the processor 120 detects a failure event of the storage device 110_1. In step S404, having detected the failure event, the processor 120 determines whether the electrical slot of the storage device 110_1 supports a power control function. Specifically, as shown in FIG. 3, if the electrical slot of the storage device 110_1 is connected to the power supply device through a switch device 160 that the processor 120 can control, the electrical slot supports the power control function. That is, the processor 120 can control whether the storage device 110_1 is powered by turning the switch device 160 on or off, and is therefore able to make the storage device 110_1 perform a power reset operation.

If the determination in step S404 is yes, in step S406 the processor 120 determines whether the electrical slot of the storage device 110_1 supports a presence detection function. Specifically, as shown in FIG. 3, if the electrical slot of the storage device 110_1 is connected to the processor 120 through a GPIO pin, and that GPIO pin is used to detect whether the storage device 110_1 is inserted in the electrical slot, the electrical slot supports the presence detection function.

If the determination in step S406 is yes, in step S408 the processor 120 determines whether the storage device 110_1 exists. The detailed operation of step S408 is described in the foregoing embodiment and is not repeated here. Note that if the determination in step S408 is yes, in step S410 the processor 120 determines whether the number of power reset operations performed on the storage device 110_1 exceeds a count threshold. For example, the processor 120 may determine whether the storage device 110_1 has undergone more than 5 power reset operations within one day. The count threshold may be chosen according to the actual application, and the present invention is not limited in this respect.

Specifically, if the processor 120 has performed too many power reset operations on the storage device 110_1 within a unit of time, the storage device 110_1 itself is probably damaged, and repeating the power reset operation will not restore it to normal operation. Therefore, if the determination in step S410 is yes, the processor 120 may give up recovering the storage device 110_1.
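
One way to track the reset count of step S410 is a sliding time window. This is a sketch under assumptions: the class name `ResetCounter` is hypothetical, the default of 5 resets per day follows the example in the text, and treating "one day" as a rolling 24-hour window rather than a calendar day is a choice made here for simplicity.

```python
from collections import deque


class ResetCounter:
    """Counts power resets of one device inside a rolling time window."""

    def __init__(self, max_resets: int = 5, window_seconds: float = 24 * 60 * 60):
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, now: float) -> None:
        """Remember that a power reset happened at time `now` (in seconds)."""
        self._times.append(now)

    def exceeded(self, now: float) -> bool:
        """S410: True when more than max_resets resets fall inside the window."""
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()  # forget resets older than the window
        return len(self._times) > self.max_resets
```

A caller would invoke `record` after each power reset and consult `exceeded` before scheduling the next one, giving up on the device when it returns True.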

If the determination in step S410 is no, in step S412, in response to determining that the number of power reset operations performed on the storage device 110_1 does not exceed the count threshold, the processor 120 adds the storage device 110_1 to the recovery attempt list. In step S414, the processor 120 determines whether the number of devices in the recovery ongoing list is less than or equal to the threshold value. If the determination in step S414 is yes, in step S416, in response to determining that the number of devices in the recovery ongoing list is less than or equal to the threshold value, the processor 120 immediately performs a power reset operation on the storage device 110_1. If the determination in step S414 is no, in step S418, in response to determining that the number of devices in the recovery ongoing list is not less than or equal to the threshold value, the processor 120 performs the power reset operation on the storage device after waiting for an elapsed time. The detailed operation of steps S412 to S418 is described in the embodiment of FIG. 2 and is not repeated here.
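
The chain of checks in FIG. 4 can be summarized in one function. This is a hedged sketch: `Slot`, `decide`, and the action strings are hypothetical, and returning "no_auto_recovery" when steps S404, S406, or S408 answer no is an assumption, because the text describes only the yes branches of those steps.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    supports_power_control: bool    # S404
    supports_presence_detect: bool  # S406
    device_present: bool            # S408
    resets_in_window: int           # input to S410


def decide(slot: Slot, ongoing_count: int, threshold: float, max_resets: int = 5) -> str:
    """Walk the checks of FIG. 4 in order and return the chosen action."""
    if not slot.supports_power_control:
        return "no_auto_recovery"  # assumed outcome of S404 "no"
    if not slot.supports_presence_detect:
        return "no_auto_recovery"  # assumed outcome of S406 "no"
    if not slot.device_present:
        return "no_auto_recovery"  # assumed outcome of S408 "no"
    if slot.resets_in_window > max_resets:
        return "give_up"           # S410 "yes": device is probably damaged
    # S412 (adding the device to the recovery attempt list) is omitted here.
    if ongoing_count <= threshold:  # S414
        return "reset_now"          # S416
    return "reset_after_wait"       # S418


print(decide(Slot(True, True, True, 0), ongoing_count=1, threshold=2.0))  # reset_now
```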

FIG. 5 is a flowchart of a drive recovery method of a storage system according to an embodiment of the present invention. Referring to FIG. 5, the method of this embodiment can be executed by the storage system 10 of FIG. 1, and the details of each step in FIG. 5 are described below with reference to the components shown in FIG. 1.

In step S502, the processor 120 detects a failure event of the storage device 110_1. In some embodiments, the processor 120 determines that a failure event of the storage device 110_1 has been detected when it does not receive a reply message from the storage device 110_1. For example, if the processor 120 and the storage device 110_1 communicate using the SATA protocol, the processor 120 may determine that a failure event has been detected when it cannot receive the frame D2h sent by the storage device 110_1, but the present invention is not limited thereto.

In some embodiments, the processor 120 begins detecting failure events of the storage devices 110_1 to 110_N after they have been initialized and installed. In some embodiments, at the initial stage of installing the storage devices 110_1 to 110_N, the processor 120 initializes them and records and saves their data configuration information. This data configuration information may include, for each storage device 110_1 to 110_N, the disk array it belongs to and that array's RAID level. Specifically, the storage devices 110_1 to 110_N may be assigned to one or more disk arrays (also called disk array groups) as required, and these disk arrays may correspond to different RAID levels, such as RAID-5 or RAID-6.

In step S504, the processor 120 determines whether the storage device 110_1 exists. If the determination in step S504 is yes, in step S506 the processor 120 adds the storage device 110_1 to the recovery attempt list. In step S508, the processor 120 determines whether the number of devices in the recovery ongoing list is less than or equal to the threshold value. The detailed operation of steps S504 to S508 is described in the foregoing embodiments and is not repeated here.

Notably, in some embodiments, the threshold value compared against the number of devices recorded in the recovery ongoing list may be configured according to the RAID level of the disk array to which the storage device 110_1 belongs. Specifically, when the storage device 110_1 corresponds to a first RAID level, the threshold value may be a first value; when it corresponds to a second RAID level, the threshold value may be a second value that differs from the first value. For a RAID level that tolerates more failed drives, the threshold value may be set to the lower first value. For a RAID level that tolerates fewer failed drives, the threshold value may be set to the higher second value. For example, when the storage device 110_1 belongs to a RAID-6 array, the threshold value may be the lower first value, which minimizes the power load on the storage system 10 while the array is less likely to become degraded. When the storage device 110_1 belongs to a RAID-5 array, the threshold value may be the higher second value, so that the storage device 110_1 can be recovered as soon as possible while the array is more likely to become degraded.
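
Selecting the threshold by RAID level could be expressed as a simple lookup. The numeric values below are invented for illustration; the text states only that the value for RAID-6 is lower than the value for RAID-5. The names `THRESHOLD_BY_RAID_LEVEL` and `threshold_for`, and the fallback default, are likewise hypothetical.

```python
# Illustrative values only: the source fixes the ordering, not the numbers.
THRESHOLD_BY_RAID_LEVEL = {
    "RAID-6": 1,  # tolerates more failed drives, so a lower threshold suffices
    "RAID-5": 2,  # tolerates fewer failed drives, so recover more aggressively
}


def threshold_for(raid_level: str, default: float = 2) -> float:
    """Return the recovery threshold configured for a device's RAID level."""
    return THRESHOLD_BY_RAID_LEVEL.get(raid_level, default)


print(threshold_for("RAID-6") < threshold_for("RAID-5"))  # True
```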

If the determination in step S508 is yes, then in step S510, in response to determining that the number of devices in the recovery execution device list is less than or equal to the threshold, the processor 120 removes the storage device 110_1 from the list of devices to be restored and adds it to the recovery execution device list. In other words, the storage device 110_1 is moved from the list of devices to be restored to the recovery execution device list. In step S512, in response to the storage device 110_1 being moved to the recovery execution device list, the processor 120 immediately performs a power restart operation on the storage device 110_1, for example by controlling the switching device 160 shown in FIG. 3 to power the storage device 110_1 off and back on.

If the determination in step S508 is no, then in step S522, in response to determining that the number of devices in the recovery execution device list is not less than or equal to the threshold (that is, it exceeds the threshold), the processor 120 performs the power restart operation on the storage device 110_1 after waiting for an elapsed time. In some embodiments, after waiting for the elapsed time, the processor 120 performs the power restart operation on the storage device 110_1 only once the number of devices in the recovery execution device list has changed from exceeding the threshold to being less than or equal to the threshold. In some embodiments, a storage device recorded in the recovery execution device list is removed from that list after its recovery operation is completed. The number of devices in the recovery execution device list therefore decreases as the recorded storage devices complete their recovery operations.
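The interaction between the two lists in steps S506 to S512 and S522 can be sketched as follows. This is an illustrative model, not the patented implementation: the class `RecoveryScheduler`, its method names, and the `power_cycle` callback are hypothetical, and the sketch follows the embodiment in which a waiting device is restarted once the execution list drops back to the threshold or below.

```python
from collections import deque


class RecoveryScheduler:
    """Toy model of the two device lists (all names are assumptions)."""

    def __init__(self, threshold, power_cycle):
        self.threshold = threshold
        self.power_cycle = power_cycle  # callable(device); assumed to toggle drive power
        self.pending = deque()          # list of devices to be restored
        self.executing = []             # recovery execution device list

    def report_failure(self, device):
        """S506: queue the failed device, then try to admit it."""
        self.pending.append(device)
        self.admit()

    def admit(self):
        """S508 to S512: admit devices while the execution list size <= threshold."""
        while self.pending and len(self.executing) <= self.threshold:
            device = self.pending.popleft()
            self.executing.append(device)
            self.power_cycle(device)

    def finish(self, device):
        """S520: remove a finished device, which may unblock a waiting one (S522)."""
        self.executing.remove(device)
        self.admit()
```

Because the comparison is "less than or equal to", a threshold of 1 allows two devices in the execution list at once in this sketch; a third failed device waits in the pending list until `finish` is called for one of them.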

In step S514, after the power restart operation has been performed on the storage device 110_1, the processor 120 determines whether the storage device 110_1 in the recovery execution device list has reconnected within a preset period. The preset period is, for example, 60 seconds, although the invention is not limited to this. For example, assuming the processor 120 and the storage device 110_1 communicate over the SATA protocol, the processor 120 may determine that the storage device 110_1 has reconnected when it receives the D2h frame sent by the storage device 110_1, although the invention is not limited to this.
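The timed check in step S514 amounts to polling for reconnection until a deadline. The sketch below is one possible way to express it; the function `wait_for_reconnect`, the `is_connected` callback, and the polling interval are assumptions for illustration, and how reconnection is actually detected (for example, on receipt of a SATA frame) is left to the caller.

```python
import time


def wait_for_reconnect(is_connected, timeout_s=60.0, poll_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Return True if is_connected() becomes true within timeout_s seconds.

    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_connected():
            return True       # S514 = yes: continue to S516
        if clock() >= deadline:
            return False      # S514 = no: give up on the device (S524)
        sleep(poll_s)
```

A monotonic clock is used so that wall-clock adjustments during recovery do not shorten or lengthen the preset period.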

If the determination in step S514 is no, the storage device 110_1 can no longer be brought back into operation by a power restart. Therefore, in step S524, the processor 120 gives up restoring the storage device 110_1. Then, in step S520, the processor 120 removes the storage device 110_1 from the recovery execution device list.

On the other hand, if the determination in step S514 is yes, then in step S516 the processor 120 determines whether the storage device 110_1 belongs to a disk array (RAID). In some embodiments, the processor 120 may make this determination based on the data configuration information recorded when the storage device 110_1 was initialized.

If the determination in step S516 is no, the storage device 110_1 does not need a disk-array data rebuild operation. Then, in step S520, the processor 120 removes the storage device 110_1 from the recovery execution device list.

If the determination in step S516 is yes, then in step S518, in response to determining that the storage device 110_1 belongs to a disk array, the processor 120 performs a data rebuild operation on the storage device 110_1 according to the rebuild function supported by the disk array. For example, if the disk array to which the storage device 110_1 belongs supports a fast rebuild function (for example, with the ZFS file system), the processor 120 performs the data rebuild operation on the storage device 110_1 through the fast rebuild function. If the disk array supports only a general rebuild function, the processor 120 performs the data rebuild operation through the general rebuild function. After the data rebuild operation is completed, in step S520, the processor 120 removes the storage device 110_1 from the recovery execution device list.
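The branches that follow the power restart (steps S514 to S524) all end with the device leaving the recovery execution device list. A compact sketch of that decision flow is given below; the function `finish_recovery`, its parameters, the outcome strings, and the `rebuild` callback are hypothetical names introduced for the example, and the rebuild itself is assumed to be provided by the array or file system.

```python
def finish_recovery(device, executing, reconnected, in_raid,
                    fast_rebuild_supported, rebuild):
    """Handle one device after its power restart and return the outcome.

    rebuild is a caller-supplied callable(device, fast=bool) that is assumed
    to block until the data rebuild has completed.
    """
    if not reconnected:
        outcome = "abandoned"             # S524: restart did not help
    elif not in_raid:
        outcome = "no_rebuild_needed"     # S516 = no
    else:
        # S518: use the rebuild function the array supports.
        rebuild(device, fast=fast_rebuild_supported)
        outcome = "rebuilt"
    executing.remove(device)              # S520: reached from every branch
    return outcome
```

Removing the device in every branch is what lets the execution list shrink so that waiting devices can be admitted.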

FIG. 6 is a schematic diagram of an event log of a storage system according to an embodiment of the present invention. Referring to FIG. 6, the processor 120 can record information about events occurring in the storage system 10, such as their content and time, as a log 61, and a user can view the contents of the log 61 through an operating system. When the connection between the processor 120 and the storage device 110_1 is interrupted, the processor 120 detects a failure event of the storage device 110_1 and records it as log message Msg1 in the log 61. The processor 120 can then perform a power restart operation on the storage device 110_1 so that the storage device 110_1 reconnects to the processor 120. The processor 120 records the reconnection of the storage device 110_1 as log message Msg2 in the log 61. Next, the processor 120 performs a data rebuild operation on the reconnected storage device 110_1, as shown by log messages Msg3 and Msg4 in the log 61. Thus, after the storage device 110_1 fails, it can automatically return to normal operation without manual intervention, and the process of that automatic recovery is recorded in the log.

It should be noted that the processing procedure of the hard disk recovery method executed by at least one processor is not limited to the examples in the above embodiments. For example, some of the above steps (processes) may be omitted, or the steps may be performed in a different order. Any two or more of the above steps may be combined, and part of a step may be modified or deleted. Alternatively, other steps may be performed in addition to the above steps.

In summary, in the embodiments of the present invention, when a storage device fails and cannot operate normally, it is added to the list of devices to be restored. When the number of devices recorded in the recovery execution device list is less than or equal to the threshold, the storage device is moved from the list of devices to be restored to the recovery execution device list, and a power restart operation is performed automatically on the storage devices in the recovery execution device list. In this way, a disconnected storage device can return to normal operation as soon as possible, and the risk of data loss is reduced. In addition, when the number of devices recorded in the recovery execution device list is not less than or equal to the threshold, the power restart operation is performed on the storage device after waiting for an elapsed time. This prevents too many storage devices from performing power restart operations at the same time and overloading the power supply of the storage system.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field may make minor changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.

10: Storage system

110_1~110_N: Storage devices

120: Processor

130: Memory

140: GPIO interface

150: Power supply device

160: Switching device

Msg1~Msg4: Log messages

S202~S212, S402~S418, S502~S524: Steps

FIG. 1 is a block diagram of a storage system according to an embodiment of the present invention.
FIG. 2 is a flow chart of a hard disk recovery method of a storage system according to an embodiment of the present invention.
FIG. 3 is a block diagram of a storage system according to an embodiment of the present invention.
FIG. 4 is a flow chart of a hard disk recovery method of a storage system according to an embodiment of the present invention.
FIG. 5 is a flow chart of a hard disk recovery method of a storage system according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of an event log of a storage system according to an embodiment of the present invention.

S202~S212: Steps

Claims (12)

1. A hard disk recovery method for a storage system, comprising: in response to a failure event of a storage device, determining whether the storage device is present; in response to determining that the storage device is present, adding the storage device to a list of devices to be restored; determining whether a number of devices in a recovery execution device list is less than or equal to a threshold; and in response to determining that the number of devices in the recovery execution device list is less than or equal to the threshold, immediately performing a power restart operation on the storage device, wherein the method further comprises: in response to determining that the number of devices in the recovery execution device list is less than or equal to the threshold, removing the storage device from the list of devices to be restored and adding the storage device to the recovery execution device list.

2. The hard disk recovery method for a storage system according to claim 1, further comprising: in response to determining that the number of devices in the recovery execution device list is not less than or equal to the threshold, performing the power restart operation on the storage device after waiting for an elapsed time.

3. The hard disk recovery method for a storage system according to claim 1, wherein the storage system comprises a plurality of device slots, the storage device is inserted in one of the device slots, and the threshold is the number of the device slots multiplied by a preset ratio.

4. The hard disk recovery method for a storage system according to claim 1, further comprising: after performing the power restart operation on the storage device, determining whether the storage device in the recovery execution device list reconnects within a preset period; and in response to determining that the storage device is restored within the preset period, performing a data rebuild operation on the storage device and removing the storage device from the recovery execution device list.

5. The hard disk recovery method for a storage system according to claim 4, wherein the step of performing the data rebuild operation on the storage device in response to determining that the storage device is restored within the preset period comprises: determining whether the storage device belongs to a disk array; and in response to determining that the storage device belongs to the disk array, performing the data rebuild operation on the storage device according to a rebuild function supported by the disk array.

6. The hard disk recovery method for a storage system according to claim 1, wherein the step of adding the storage device to the list of devices to be restored comprises: determining whether the number of times the power restart operation has been performed on the storage device exceeds a count threshold; and in response to determining that the number of times the power restart operation has been performed on the storage device does not exceed the count threshold, adding the storage device to the list of devices to be restored.

7. A storage system, comprising: at least one storage device; and a processor, connected to the storage device and configured to: in response to a failure event of the storage device, determine whether the storage device is present; in response to determining that the storage device is present, add the storage device to a list of devices to be restored; determine whether a number of devices in a recovery execution device list is less than or equal to a threshold; and in response to determining that the number of devices in the recovery execution device list is less than or equal to the threshold, immediately perform a power restart operation on the storage device, wherein the processor is further configured to: in response to determining that the number of devices in the list of devices to be restored is less than or equal to the threshold, remove the storage device from the list of devices to be restored and add the storage device to the recovery execution device list.

8. The storage system according to claim 7, wherein the processor is further configured to: in response to determining that the number of devices in the recovery execution device list is not less than or equal to the threshold, perform the power restart operation on the storage device after waiting for an elapsed time.

9. The storage system according to claim 7, further comprising a plurality of device slots, wherein each storage device is inserted in one of the device slots, and the threshold is the number of the device slots multiplied by a preset ratio.

10. The storage system according to claim 7, wherein the processor is further configured to: after performing the power restart operation on the storage device, determine whether the storage device in the recovery execution device list reconnects within a preset period; and in response to determining that the storage device is restored within the preset period, perform a data rebuild operation on the storage device and remove the storage device from the execution device list.

11. The storage system according to claim 10, wherein the processor is further configured to: determine whether the storage device belongs to a disk array; and in response to determining that the storage device belongs to the disk array, perform the data rebuild operation on the storage device according to a rebuild function supported by the disk array.

12. The storage system according to claim 7, wherein the processor is further configured to: determine whether the number of times the power restart operation has been performed on the storage device exceeds a count threshold; and in response to determining that the number of times the power restart operation has been performed on the storage device does not exceed the count threshold, add the storage device to the list of devices to be restored.
TW111127518A 2022-07-22 2022-07-22 Storage system and drive recovery method thereof TWI820814B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW111127518A TWI820814B (en) 2022-07-22 2022-07-22 Storage system and drive recovery method thereof
CN202211227693.9A CN117472619A (en) 2022-07-22 2022-10-09 Storage system and hard disk recovery method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW111127518A TWI820814B (en) 2022-07-22 2022-07-22 Storage system and drive recovery method thereof

Publications (2)

Publication Number Publication Date
TWI820814B true TWI820814B (en) 2023-11-01
TW202405655A TW202405655A (en) 2024-02-01

Family

ID=89624390

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111127518A TWI820814B (en) 2022-07-22 2022-07-22 Storage system and drive recovery method thereof

Country Status (2)

Country Link
CN (1) CN117472619A (en)
TW (1) TWI820814B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078495A1 (en) * 2006-08-25 2011-03-31 Hitachi, Ltd. Storage control apparatus and failure recovery method for storage control apparatus
US8725934B2 (en) * 2011-12-22 2014-05-13 Fusion-Io, Inc. Methods and appratuses for atomic storage operations
TW201423378A (en) * 2012-09-18 2014-06-16 Mitsubishi Electric Corp Raid failure self-repair device
TWI476610B (en) * 2008-04-29 2015-03-11 Maxiscale Inc Peer-to-peer redundant file server system and methods
US20150378858A1 (en) * 2013-02-28 2015-12-31 Hitachi, Ltd. Storage system and memory device fault recovery method

Also Published As

Publication number Publication date
CN117472619A (en) 2024-01-30
TW202405655A (en) 2024-02-01
