TW202026879A

TW202026879A - Method for remotely clearing abnormal status of racks applied in data center

Info

Publication number: TW202026879A
Application number: TW108100024A
Authority: TW
Inventors: 林韋成; 辛柏陞; 林政翰
Original assignee: 營邦企業股份有限公司
Priority date: 2019-01-02
Filing date: 2019-01-02
Publication date: 2020-07-16
Also published as: TWI698741B

Abstract

A method for remotely clearing an abnormal status of racks is disclosed and includes the following steps: periodically obtaining, by a management system, information about a rack management controller (RMC) and multiple baseboard management controllers (BMCs) of a rack; recording each operation performed by an administrator through the management system; analyzing the information and the operations, by the management system, to determine whether the RMC or any BMC is in one of multiple preset attention states; and automatically performing a remote rescue procedure on the RMC or BMC to eliminate its abnormal connection status when the RMC or BMC is determined to have lost its connection with the management system.

Description

Method for remotely clearing abnormal status of racks applied in data center

The present invention relates to data centers, and in particular to a method for analyzing and clearing abnormal states of racks in a data center.

Generally, a data center uses the Intelligent Platform Management Interface (IPMI) to remotely manage the rack management controllers (RMCs) and baseboard management controllers (BMCs) of the racks, endpoint servers, and other equipment in the data center.

Whatever method is used for remote management, as soon as the RMC or BMC of any rack or endpoint server becomes abnormal, the administrator receives many warning emails. However, it is generally difficult for the administrator to identify the real problem right away from these emails. Often, only as time passes, after hundreds of warning emails have arrived and the connection to the device has been lost, can the administrator confirm that the RMC or BMC has become abnormal.

Moreover, even though some management platforms collect error messages from different monitoring channels, consolidate them, and submit a fault assessment report to the administrator, this kind of monitoring still requires the administrator to make the final judgment and decide how to handle the problem. As long as human factors are involved, the possibility of misjudgment cannot be completely avoided.

In view of this, there is a real need in the field for a novel system and method that can automatically apply a remote repair mechanism to RMCs and BMCs in an abnormal state, thereby strengthening the monitoring capability of the data center, making rack management highly automated, reducing the time indirectly lost to human judgment, and avoiding human misjudgment.

The main objective of the present invention is to provide a method for remotely clearing abnormal states of racks applied in a data center, which can clear the abnormal state of a baseboard management controller directly from the remote end when the baseboard management controller is determined to have lost its connection with the rack server management system.

To achieve the above objective, in the method of the present invention a rack server management system periodically and remotely obtains information about a rack management controller and multiple baseboard management controllers in a rack, and records the operations performed by an administrator through the rack server management system. The rack server management system analyzes this information and these operations to determine whether the rack management controller or any of the baseboard management controllers in the rack is in one of multiple preset attention states.

If any baseboard management controller is determined to have lost its connection with the rack server management system, the rack server management system automatically applies a remote rescue mechanism to clear the lost-connection abnormal state of that baseboard management controller.

Compared with the related art, in the method of the present invention the rack server management system connected to the racks performs the analysis and automatically applies the remote rescue mechanism without waiting for the administrator's manual judgment of the abnormal state. This can greatly reduce management costs, and rack monitoring requires no human intervention and is unaffected by distance or time.

A preferred embodiment of the present invention is described in detail below with reference to the drawings.

The present invention discloses a method for remotely clearing abnormal states of racks (hereinafter referred to as the clearing method). The clearing method is mainly applied in a data center to help administrators automatically monitor, analyze, and clear abnormal states in the data center.

Refer to FIG. 1, a schematic diagram of the data center of the present invention. As shown in FIG. 1, the data center 1 mainly has multiple racks 2 and a rack server management system 3 (hereinafter referred to as the management system 3) connected remotely to the racks 2. The management system 3 can be located inside or outside the data center 1. It connects via a network to a public network switch 4, and through the public network switch 4 to the racks 2 in the data center 1.

The management system 3 can monitor the racks 2 in the data center 1 in real time, obtain information about the racks 2, and analyze that information. When it finds that an abnormal state has occurred or is about to occur in any rack 2, the management system 3 can automatically apply the corresponding handling mechanism to resolve the situation. The present invention can thus clear abnormal states that have already occurred in a rack 2, or prevent abnormal states that may be about to occur, with no human intervention, greatly reduced human misjudgment, and faster handling.

In one embodiment, the management system 3 can be a personal computer or a cloud server with one or more central processing units (not shown). Once started, the management system 3 can connect to the racks 2 in the data center 1 through the public network switch 4, and the one or more central processing units can execute specific applications and algorithms to monitor the racks 2, analyze data, and clear abnormal states.

The management system 3 also has a database 31 for temporarily or permanently storing the information obtained from the racks 2 in the data center 1. In the embodiment of FIG. 1, the database 31 is built into the management system 3. In other embodiments, the management system 3 can instead connect to one or more external databases 31; this is not limited.

Refer to FIG. 2, a block diagram of a first embodiment of the rack of the present invention. The embodiment of FIG. 2 is described using a single rack 2 in the data center 1 connected to the management system 3 as an example. However, the data center 1 can be equipped with multiple racks 2 as needed and is not limited to what is shown in FIG. 2.

As shown in FIG. 2, the rack 2 mainly includes at least one rack management controller (RMC) 21 and multiple endpoint servers 220 connected to the RMC 21, each endpoint server 220 having at least one baseboard management controller (BMC) 22.

The RMC 21 is an embedded system installed in the rack 2. Through various hardware lines, it helps handle all external communication for the internal hardware of the rack 2 (cooling fans, sensors, power supplies, and so on), and it communicates with the BMCs 22 of all endpoint servers 220 in the rack 2. The BMC 22 is also an embedded system; it is installed in the endpoint server 220 and helps handle all external communication for the internal hardware of the endpoint server 220 (sensors and so on).

In this embodiment, the RMC 21 is connected through an internal hardware line 24 to the BMCs 22 of all endpoint servers 220 in the rack 2, and communicates with each BMC 22 to control each endpoint server 220 and obtain the required information. The endpoint server 220 may be, for example, a tower server or a blade server, but is not limited thereto.

As shown in FIG. 2, each endpoint server 220 installed in the rack 2 has a fixed position number (#1, #2, #n, and so on in FIG. 2). When the external network function of an endpoint server 220 or BMC 22 fails, the RMC 21 can connect through the internal hardware line 24 to the designated position in the rack 2 (such as #1, #2, or #n) and communicate with the endpoint server 220 and BMC 22 at that position. In this way, even if an endpoint server 220 or BMC 22 loses its network connection, the rack 2 can still use the RMC 21 to monitor and manage each BMC 22 and clear its abnormal conditions.
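The fallback path described above can be sketched as follows. This is a minimal illustration only: the function name `query_bmc` and the two callables are hypothetical placeholders, and the specification does not say how the RMC 21 addresses a position over the internal hardware line 24.

```python
from typing import Any, Callable, Dict


def query_bmc(
    slot: int,
    query_over_network: Callable[[int], Dict[str, Any]],  # hypothetical: direct NIC path
    query_via_rmc: Callable[[int], Dict[str, Any]],       # hypothetical: RMC + internal line
) -> Dict[str, Any]:
    """Try the BMC's own network path first; fall back to the RMC by position number."""
    try:
        return {"route": "network", "data": query_over_network(slot)}
    except ConnectionError:
        # The BMC's network function has failed, so ask the RMC to reach the
        # fixed position (#1, #2, ... #n) over the internal hardware line.
        return {"route": "rmc", "data": query_via_rmc(slot)}
```

Keying the fallback on the fixed position number rather than on an IP address reflects the point made in the text: the position stays valid even when the controller's network identity is unreachable.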

In addition, the RMC 21 is provided with a network interface controller (NIC) 211, and each BMC 22 is provided with its own network interface controller (NIC) 221. The RMC 21 connects through the NIC 211 to an internal network switch 23 inside the rack 2, and each BMC 22 connects to the internal network switch 23 through its own NIC 221. The rack 2 connects through the internal network switch 23 to the public network switch 4, and through the public network switch 4 establishes a network connection with the management system 3. The management system 3 can thus remotely access the rack 2 in the data center 1 over the network, query and obtain information about the RMC 21 and all BMCs 22 in the rack 2, and store it in the database 31.

The main technical feature of the present invention is that the management system 3 can periodically access the rack 2 over the network and obtain information about the RMC 21 and all BMCs 22 in the rack 2 (for example, status data, event logs, system resource utilization, and readings from the internal sensors of the endpoint servers 220), and use this information to actively analyze whether the RMC 21 or a BMC 22 is in an abnormal state or is about to enter one. When its analysis shows that action is necessary, the management system 3 can apply the corresponding mechanism from the remote end, either directly clearing the abnormal state of the RMC 21 and/or BMC 22 or preventing the RMC 21 and/or BMC 22 from entering the abnormal state in the first place.

The technical solution of the present invention can handle abnormal states with no human intervention, which greatly reduces the possibility of human misjudgment and makes monitoring of the rack 2 highly automated.

Next, refer to FIG. 3A, a flowchart of a first embodiment of data collection according to the present invention.

As shown in FIG. 3A, an administrator who wants to monitor the rack 2 in the data center 1 can directly start the remote management system 3 (step S11). Once started, the management system 3 actively and remotely accesses the RMC 21 and all BMCs 22 in the rack 2 in the data center 1 (taking the single rack 2 in FIG. 2 as an example) (step S12). Through this remote access, the management system 3 obtains information about the RMC 21 and all BMCs 22 in the rack 2 (step S13), and stores the obtained information in the local database 31 (step S14).

Specifically, in this embodiment the management system 3 actively accesses the rack 2 at regular intervals after startup; that is, the access, information acquisition, and storage actions of steps S12, S13, and S14 form a routine that runs after startup. While executing this routine, the management system 3 continuously checks whether it has been shut down (step S15), and keeps executing steps S12 to S14 until it is shut down, so that the RMC 21 and BMCs 22 in the rack 2 are continuously monitored.
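The routine of steps S12 to S15 can be sketched as a simple polling loop. This is a sketch under stated assumptions, not the patented implementation: `poll`, `store`, and `is_shut_down` are hypothetical callables standing in for remote access, the database 31, and the shutdown check, and the 60-second interval is an arbitrary example value.

```python
import time
from typing import Callable, Dict


def run_collection_routine(
    poll: Callable[[], Dict[str, dict]],         # hypothetical: S12-S13, access and fetch
    store: Callable[[Dict[str, dict]], None],    # hypothetical: S14, write to database
    is_shut_down: Callable[[], bool],            # hypothetical: S15, shutdown check
    interval_s: float = 60.0,                    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeat steps S12-S14 until the management system is shut down (S15)."""
    cycles = 0
    while not is_shut_down():
        snapshot = poll()       # S12-S13: access the RMC/BMCs and fetch their info
        store(snapshot)         # S14: persist the snapshot
        cycles += 1
        sleep(interval_s)       # wait until the next scheduled access
    return cycles
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays.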

Refer to FIG. 3B, a flowchart of a second embodiment of data collection according to the present invention.

In this embodiment, after the administrator starts the management system 3 (step S21), the management system 3 can provide an operation interface (step S22). Through this operation interface, the administrator can log in to the management system 3 and use it to remotely monitor and control each rack 2 in the data center 1. The operation interface can be a physical interface or a web interface; this is not limited.

After providing the operation interface, the management system 3 continuously checks whether it has received an operation from the administrator through the operation interface (step S23). If it has, the management system 3 performs the corresponding remote management on the rack 2 and on the RMC 21 and BMCs 22 in the rack 2 according to the administrator's operation (step S24). The management system 3 can then record the administrator's operation (step S25), and can also obtain and record feedback information, such as responses, system parameters, and execution data, produced by the management system 3, the rack 2, each endpoint server 220, the RMC 21, and the BMCs 22 as a result of the remote management (step S26). Finally, the management system 3 stores the operation and the feedback information in the database 31 (step S27) for use in the subsequent analysis of abnormal states.
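Steps S24 to S27 can be sketched as one function that executes an operation and stores it together with its feedback. The record layout, the field names, and the `execute` callable are hypothetical; the specification does not define a storage schema for the database 31.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class OperationRecord:
    operator: str            # who issued the operation
    target: str              # hypothetical identifier, e.g. "rack2/bmc#1"
    action: str              # e.g. "query", "update", "reset"
    issued_at: datetime
    feedback: dict = field(default_factory=dict)


def handle_operation(
    operator: str,
    target: str,
    action: str,
    execute: Callable[[str, str], dict],   # hypothetical: performs the remote management
    database: List[OperationRecord],       # stand-in for the database
    now: Optional[datetime] = None,
) -> OperationRecord:
    issued_at = now or datetime.now(timezone.utc)
    feedback = execute(target, action)     # S24, then S26: collect the resulting feedback
    record = OperationRecord(operator, target, action, issued_at, feedback)  # S25
    database.append(record)                # S27: keep both for later analysis
    return record
```

Storing the operation and its feedback in one record keeps cause and effect together, which is what the later analysis of operator behavior relies on.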

Likewise, the management system 3 of this embodiment treats steps S22 to S27 as a routine that runs after startup. While executing this routine, it continuously checks whether it has been shut down (step S28), and keeps executing steps S22 to S27 until it is shut down, so as to continuously monitor and analyze the effect of the administrator's operations on the RMC 21 and BMCs 22 in the rack 2.

Next, refer to FIG. 4, a flowchart of a first embodiment of analysis and clearing according to the present invention.

As shown in FIG. 4, in this embodiment the management system 3 periodically accesses the database 31 (step S31), retrieves from it the information about the RMC 21 and BMCs 22 in the rack 2, the administrator's operations, and the feedback information (step S32), and analyzes them. From this data, the management system 3 can determine whether the RMC 21 or any BMC 22 in the rack 2 is in one of multiple preset attention states (step S33).

In one embodiment, the management system 3 can obtain the information about the RMC 21 and BMCs 22 in the rack 2 in real time, obtain the administrator's operations from the operation interface in real time, and analyze them directly. In another embodiment, the management system 3 can periodically store this data in the database 31 through step S14 of FIG. 3A and step S27 of FIG. 3B, and periodically read it back from the database 31 for analysis; this is not limited.

In one embodiment, the information about the RMC 21 and BMCs 22 can be, for example, status data (such as whether the controller is currently in working mode or update mode, its IP address, MAC address, subnet mask, gateway IP address, and number of IPMI sessions) and event logs. The operations can be, for example, data query, update, or reset operations performed by the administrator on a specific rack 2, endpoint server 220, RMC 21, or BMC 22, but are not limited thereto. From this data, the management system 3 can execute the corresponding algorithm to determine whether any RMC 21 or BMC 22 in the rack 2 currently needs immediate rescue.

In the embodiment of FIG. 4, the management system 3 can preset at least three types of attention state: a first, a second, and a third type. The three types correspond to different abnormal conditions of the RMC 21 or BMC 22, and each requires the management system 3 to apply a different mechanism from the remote end to clear or prevent the condition.

As shown in FIG. 4, if the management system 3, after analyzing the above data (mainly the status data, event logs, and administrator operations), finds that an RMC 21 or BMC 22 is already in an abnormal state but has not yet lost its connection with the management system 3, it determines that this RMC 21 or BMC 22 is in the first type of attention state (step S34). The management system 3 can then automatically apply a remote recovery mechanism to that RMC 21 or BMC 22 to remotely clear its abnormal state (step S37).

If the management system 3, after analyzing the above data (mainly the status data of the RMC 21 and BMCs 22), finds that an RMC 21 or BMC 22 is connected to the management system 3 normally but judges that an abnormal state may be imminent, it determines that this RMC 21 or BMC 22 is in the second type of attention state (step S35). The management system 3 can then automatically apply a remote service restart mechanism to that RMC 21 or BMC 22 to keep it from entering the possible abnormal state (step S38).

If the management system 3, after analyzing the above data (mainly the status data, administrator operations, and feedback information), finds that a BMC 22 has lost its network connection (that is, the management system 3 cannot directly access this BMC 22 remotely), it determines that this BMC 22 is in the third type of attention state (step S36). The management system 3 can then automatically apply a remote rescue mechanism to that BMC 22 to remotely clear the lost-connection state and restore the BMC 22's network connection to normal (step S39).
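The three-way decision of steps S34 to S39 can be sketched as a small classifier plus a lookup table. The boolean inputs `reachable`, `abnormal`, and `at_risk` are hypothetical summaries of the analysis; how each is derived from status data, event logs, and operator behavior is not fixed here. Checking reachability first is also an assumption, chosen because a lost connection is the condition that defines the third type.

```python
from enum import Enum
from typing import Optional


class Attention(Enum):
    FIRST = "abnormal, still connected"          # step S34
    SECOND = "connected, abnormality predicted"  # step S35
    THIRD = "connection lost"                    # step S36


# Mechanism applied for each attention state (steps S37, S38, S39).
MECHANISM = {
    Attention.FIRST: "remote recovery",
    Attention.SECOND: "remote service restart",
    Attention.THIRD: "remote rescue",
}


def classify(reachable: bool, abnormal: bool, at_risk: bool) -> Optional[Attention]:
    """Map a controller's analyzed condition to an attention state, or None if healthy."""
    if not reachable:
        return Attention.THIRD
    if abnormal:
        return Attention.FIRST
    if at_risk:
        return Attention.SECOND
    return None
```

For example, `MECHANISM[classify(reachable=True, abnormal=False, at_risk=True)]` selects the remote service restart mechanism.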

The following paragraphs discuss the first type of attention state.

Because some RMCs 21 and BMCs 22 do not have a basic input/output system (BIOS), they must set their time through the Network Time Protocol (NTP) service provided by an external server, or through the real-time clock (RTC) service provided by a hardware clock chip, to stay synchronized with other devices.

As described above, if a system event occurs before the time synchronization procedure of an RMC 21 or BMC 22 has completed, the event is still recorded in the event log of the RMC 21 or BMC 22, but its time field cannot hold the correct time of occurrence and instead contains only a marker such as "Pre-init". Without a correct time of occurrence, the administrator cannot rely on the event log as a reference for the system event, which leads to errors of judgment. In addition, if the RMC 21 or BMC 22 needs to be reset, the time of occurrence of such system events may also be recorded incorrectly or abnormally.

Refer to FIG. 5, a flowchart of a first embodiment of clearing the first type of attention state according to the present invention. In this embodiment, the management system 3 periodically accesses the database 31 (step S41) to retrieve the status data and event logs of the RMC 21 and BMCs 22 in the rack 2, and determines the status changes of the RMC 21 and BMCs 22 (step S42).

In this embodiment, the management system 3 mainly checks whether any system event in the retrieved event logs has an unknown or incorrect time of occurrence (step S43). If every system event in the event logs has a correct time of occurrence, the management system 3 takes no action.

If the analysis shows that an RMC 21 or BMC 22 has a system event with an unknown or incorrect time, the management system 3 regards this RMC 21 or BMC 22 as being in the first type of attention state (step S44); that is, it determines that the controller is in an abnormal state but has not yet lost its network connection.

In one embodiment, the management system 3 judges the time of occurrence of a system event to be unknown or incorrect when it is recorded in the event log as "Pre-init" or a similar marker (that is, when it does not correctly state when the system event occurred). In another embodiment, the management system 3 makes this judgment when it finds in the event log that an RMC 21 or BMC 22 has a system event with an unknown time of occurrence and also finds in the status data that this RMC 21 or BMC 22 has not yet completed its time synchronization procedure or needs to be reset.
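The first of these checks (step S43) can be sketched as a filter over event-log entries. The dictionary layout and the `YYYY-MM-DD HH:MM:SS` timestamp format are assumptions made for illustration; real controllers may format event times differently.

```python
from datetime import datetime
from typing import Dict, List, Optional

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of a valid event time


def parse_event_time(raw: str) -> Optional[datetime]:
    """Return the parsed time, or None for markers such as "Pre-init"."""
    try:
        return datetime.strptime(raw, TIME_FORMAT)
    except ValueError:
        return None


def events_with_unknown_time(event_log: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Step S43: collect events whose time of occurrence cannot be interpreted."""
    return [
        event for event in event_log
        if parse_event_time(str(event.get("time", ""))) is None
    ]
```

A non-empty result would put the owning controller in the first type of attention state (step S44).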

After determining in step S44 that an RMC 21 or BMC 22 is in the first type of attention state, the management system 3 first obtains the timestamp of its current access to the event log (step S45), uses this timestamp as fallback time identification information for the system event, and stores it in the database 31 (step S46). In one embodiment, the timestamp is the time at which the management system 3 accessed the database 31 to read the event log. In another embodiment, the timestamp is the time at which the management system 3 remotely accessed the rack 2 and obtained the event log from the RMC 21 or BMC 22; this is not limited.

For example, the original content of the event log may be as shown in the following table:

If the management system 3 accesses the event log at 11:32:23 PM on December 22, 2018 and finds that the time of occurrence of event 2 is incorrect, it can generate the fallback time identification information for event 2 on its own initiative and either modify the event log or generate a new event log. The new event log may be as shown in the following table:

When the administrator logs in to the management system 3 through the operation interface and queries the event log, the management system 3 can display the fallback time identification information as the time of occurrence of event 2, as shown in the above table. In this way, even if a system event occurs on an RMC 21 or BMC 22 before time synchronization has completed, the management system 3 can still assign that event an identifiable fallback time, which helps the management system 3 and the administrator interpret the event and strengthens the effect of remote recovery.
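Steps S45 and S46 can be sketched as producing a new event log in which untimed events carry the access timestamp. The `time` and `fallback_time` field names are hypothetical, and the sketch takes the variant in which a new log is generated rather than the original being modified.

```python
from datetime import datetime
from typing import Dict, List


def add_fallback_times(
    event_log: List[Dict[str, str]],
    accessed_at: datetime,              # S45: when the event log was accessed
    marker: str = "Pre-init",           # marker for an unknown time of occurrence
) -> List[Dict[str, str]]:
    """Return a new event log; untimed events gain a fallback time (S46)."""
    patched = []
    for event in event_log:
        entry = dict(event)             # copy, so the original log is left untouched
        if entry.get("time") == marker:
            entry["fallback_time"] = accessed_at.isoformat(sep=" ")
        patched.append(entry)
    return patched
```

With the access time from the example above, `datetime(2018, 12, 22, 23, 32, 23)`, an event marked "Pre-init" would receive the fallback time `2018-12-22 23:32:23`.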

After step S46, the management system 3 can further send a control command (for example, a first control command) over the network to the RMC 21 or BMC 22 in the first type of attention state, to run a time correction procedure on the controller whose abnormal state is a time error (step S47). In one embodiment, the time correction procedure directs the RMC 21 or BMC 22 to correct its time through the NTP service. In another embodiment, the time correction procedure forces the RMC 21 or BMC 22 to reset; this is not limited.

The following paragraphs describe other first-type attention states that may occur.

Because the data center 1 contains a large number of racks 2, it is impractical to update them manually one by one when an update is needed. Therefore, when the administrator wants to update the RMCs 21 and BMCs 22 in the racks 2 (for example, a firmware update), the administrator can operate the management system 3, whose program code sends the update command and the latest firmware version, thereby remotely updating the RMCs 21 and BMCs 22 of multiple racks 2 in the data center 1 at the same time.

If problems such as network congestion or an unstable network signal interrupt the network connection during the update, some RMC21 or BMC22 may be unable to complete the update operation through the normal update flow, and the update operation may fail. For some RMC21 or BMC22, however, a failed update only leaves the system unable to operate normally without losing the network connection (for example, the controller enters the update mode and cannot return to the working mode). In this case the management system 3 needs to intervene remotely to eliminate the abnormal condition.

Refer to FIG. 6, which is a second specific embodiment of the first type of attention state elimination flowchart of the present invention. In this embodiment, the management system 3 likewise accesses the database 31 at regular intervals (step S51) to obtain from the database 31 the status data and event logs of the RMC21 and BMC22 in the cabinet 2, together with the operation behaviors performed by the administrator through the operation interface, and determines the status changes of the RMC21 and BMC22 (step S52).

In this embodiment, the management system 3 may first analyze the status data and event logs of the RMC21 and BMC22 to determine whether the update operation of any RMC21 or BMC22 has timed out or encountered an error (step S54), and whether the network connection of that RMC21 or BMC22 is still normal (step S55). If the analysis shows that the update operation of any RMC21 or BMC22 has timed out or encountered an error while its network connection is still normal, the management system 3 may regard this RMC21 or BMC22 as being in the first type of attention state (step S56), that is, in an abnormal state but not yet disconnected.

More specifically, after step S52, the management system 3 may first determine from the operation behaviors whether the administrator has performed an update operation on the RMC21 and/or BMC22 in the cabinet 2 (step S53). Only after confirming that the administrator has performed an update operation does the management system 3 proceed to steps S54 and S55 to determine whether the update operations of these RMC21 and BMC22 have timed out or encountered an error, and whether their network connections are normal.

After accepting an update operation from the administrator, the RMC21 or BMC22 automatically enters the update mode and sets a flag in its status data indicating that it has entered the update mode. When a peripheral device communicates with the RMC21 or BMC22 and reads the update-mode flag, it automatically stops interacting with that RMC21 or BMC22. Consequently, an RMC21 or BMC22 whose update operation has failed and that cannot leave the update mode cannot operate normally. When the management system 3 finds that any RMC21 or BMC22 has accepted an update operation, that the update operation has timed out or encountered an error, and that the network connection has not been lost, it can determine that this RMC21 or BMC22 is in the first type of attention state.

After step S56, the management system 3 may further send a control command (for example, a second control command) over the network to the RMC21 or BMC22 in the first type of attention state, forcing the RMC21 or BMC22 whose update operation failed to leave the update mode (step S57).

As described above, in the update failure considered in this embodiment (that is, the controller cannot leave the update mode), the RMC21 or BMC22 can still receive and process commands; it is only the peripheral devices that automatically stop interacting with it when they read the update-mode flag. In this embodiment, the management system 3 has already determined that the RMC21 or BMC22 is in an abnormal state, so it ignores the flag and issues a control command to force the RMC21 or BMC22 to leave the update mode.

After step S57, the management system 3 may further send another control command (for example, a third control command) over the network to the RMC21 or BMC22 that has left the update mode, forcing it to perform a reset operation or to perform the update operation again (step S58). In this way, the management system 3 can ensure that the RMC21 or BMC22 has resumed normal operation and that its firmware or software is at the latest, fully updated version.
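The decision logic of steps S53 to S58 could be sketched as follows. This is only a minimal Python illustration under stated assumptions: the `ControllerStatus` fields, the `Action` names, and the two functions are hypothetical and do not appear in the disclosure, which does not specify how the status data is encoded.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    NONE = auto()               # nothing to do for this controller
    LEAVE_UPDATE_MODE = auto()  # second control command (step S57)
    RESET_OR_REUPDATE = auto()  # third control command (step S58)


@dataclass
class ControllerStatus:
    """Hypothetical snapshot of one RMC or BMC read from the database."""
    name: str
    update_requested: bool  # administrator started an update (step S53)
    update_failed: bool     # update timed out or reported an error (step S54)
    network_ok: bool        # controller is still reachable (step S55)


def in_first_attention_state(status: ControllerStatus) -> bool:
    """Step S56: the update is stuck, but the controller is still reachable."""
    return status.update_requested and status.update_failed and status.network_ok


def plan_recovery(status: ControllerStatus) -> list[Action]:
    """Return the ordered control commands for steps S57 and S58."""
    if not in_first_attention_state(status):
        return [Action.NONE]
    return [Action.LEAVE_UPDATE_MODE, Action.RESET_OR_REUPDATE]
```

A controller that failed its update but also lost its network connection yields `[Action.NONE]` here, because that case belongs to a different attention state in the text.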

The following paragraphs discuss the second type of attention state.

The RMC21 and BMC22 in the present invention are embedded systems. Therefore, even if the endpoint server 220 in the cabinet 2 is not powered on, the management system 3 can still communicate with the RMC21 and BMC22 to provide remote management functions such as remote power-on, remote power-off, and viewing device status.

Generally, when performing a remote management procedure, the administrator can use an Intelligent Platform Management Interface (IPMI) tool program on the management system 3 to send IPMI commands over the network and thereby communicate with the RMC21 and BMC22 in the cabinet 2. When the IPMI tool program is used, sending each command requires establishing an IPMI session with the destination RMC21 or BMC22 before communication is possible. Specifically, only after the IPMI session has been established can the management system 3 communicate over the network with the RMC21, the BMC22, and the underlying hardware of the cabinet 2 and the endpoint server 220, and then obtain the execution result of the command (for example, the firmware version or the readings of all sensors in the endpoint server 220).

However, the computing resources of an embedded system are quite limited. Beyond the basic resources needed for operation, communicating with the RMC21, communicating with the BMC22, and replying to the various monitoring systems in the data center 1 all consume additional computing resources of the embedded system.

Furthermore, when the administrator performs remote management procedures on the RMC21 and BMC22 through the management system 3, the computing resources of the RMC21 and BMC22 are also consumed. The most obvious effect is a sharp increase in the number of IPMI sessions on the RMC21 or BMC22, which can cause slow responses or request timeouts. At this point the RMC21 or BMC22 is not yet in an abnormal state, but the management system 3 may need to intervene remotely to prevent a future abnormal state from affecting the operation of the cabinet 2.

Refer to FIG. 7, which is a first specific embodiment of the second type of attention state elimination flowchart of the present invention. In this embodiment, the management system 3 likewise accesses the database 31 at regular intervals (step S61) to obtain from the database 31 the status data of the RMC21 and BMC22 in the cabinet 2, and determines the status changes of the RMC21 and BMC22 (step S62). In one embodiment, the management system 3 mainly obtains, in step S62, the current total number of IPMI sessions of the RMC21 and of each BMC22. In another embodiment, the management system 3 also obtains, in step S62, the current system resource utilization of the RMC21 and of each BMC22.

After step S62, the management system 3 determines whether the total number of IPMI sessions of any RMC21 or BMC22 is higher than a first threshold (step S63). When the total number of IPMI sessions of any RMC21 or BMC22 is higher than the first threshold, the management system 3 determines that this RMC21 or BMC22 is in the second type of attention state (step S65), that is, its connection is normal but an abnormal state is judged to be imminent.

It is worth mentioning that if the management system 3 also obtained the system resource utilization of the RMC21 and of each BMC22 in step S62, it can additionally determine whether the system resource utilization of any RMC21 or BMC22 is higher than a second threshold (step S64). In this case, the management system 3 determines that an RMC21 or BMC22 whose current total number of IPMI sessions is higher than the first threshold and whose system resource utilization is higher than the second threshold is in the second type of attention state.

In one embodiment, the system resource utilization is the utilization of the central processing unit or memory of the RMC21 or BMC22. In another embodiment, the system resource utilization is the utilization of the system resources that the RMC21 or BMC22 mainly uses to provide its services (such as the HyperText Transfer Protocol (HTTP) service or the IPMI service), but it is not limited thereto.

After the management system 3 determines that an RMC21 or BMC22 is in the second type of attention state, it may further send a control command (for example, a fourth control command) over the network to that RMC21 or BMC22 so that it restarts its IPMI service (step S66). The RMC21 or BMC22 thereby clears its accumulated IPMI sessions, which avoids the occurrence of an abnormal state.
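The threshold checks of steps S63 to S66 could be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the threshold values, the `ControllerLoad` fields, and the function names are all assumptions, since the text gives no concrete numbers or data layout.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the source does not give concrete values.
SESSION_THRESHOLD = 10     # "first threshold": total IPMI sessions
RESOURCE_THRESHOLD = 0.90  # "second threshold": resource utilization (0..1)


@dataclass
class ControllerLoad:
    name: str
    ipmi_sessions: int
    resource_usage: Optional[float] = None  # None if not collected in step S62


def in_second_attention_state(load: ControllerLoad) -> bool:
    too_many_sessions = load.ipmi_sessions > SESSION_THRESHOLD  # step S63
    if load.resource_usage is None:
        return too_many_sessions
    # Step S64: when utilization is available, both thresholds must be exceeded.
    return too_many_sessions and load.resource_usage > RESOURCE_THRESHOLD


def controllers_to_restart(loads: list[ControllerLoad]) -> list[str]:
    """Names of controllers that should receive the fourth control command (step S66)."""
    return [c.name for c in loads if in_second_attention_state(c)]
```

Treating a missing utilization value as "session count only" mirrors the two embodiments described for step S62; a real system might choose to handle missing data differently.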

In one embodiment, the fourth control command is a reset command: the management system 3 sends the reset command over the network to the RMC21 or BMC22 in the second type of attention state to force it to perform a reset operation. The RMC21 or BMC22 then restarts its IPMI service directly as part of the reset. This is only one specific embodiment of the present invention, which is not limited thereto.

With the above technical solution, the management system 3 can detect through analysis, ahead of time, that an RMC21 or BMC22 may be about to enter an abnormal state. It can therefore proactively apply the service restart mechanism from the remote end, preventing an actual abnormal state of the RMC21 or BMC22 from affecting the operation of the cabinet 2.

The following paragraphs discuss the third type of attention state.

As described above, the management system 3 of the present invention communicates with the RMC21 and BMC22 in the cabinets 2 of the data center 1 mainly over the network, and the administrator also performs remote management procedures on these RMC21 and BMC22 over the network. Therefore, when a BMC22 in the cabinet 2 loses its network connection, the management system 3 cannot communicate with that BMC22 and the administrator cannot manage it. In this embodiment, the loss of the network connection of the BMC22 may be caused by an incorrect IP address setting.

Generally, a BMC22 in the cabinet 2 may be set to use a dynamic IP address (that is, its network mode is set to the dynamic IP mode) or a static IP address (that is, its network mode is set to the static IP mode). If the network mode of the BMC22 is the dynamic IP mode, a Dynamic Host Configuration Protocol (DHCP) server in the data center 1 (not shown in the figures) assigns a dynamic IP address to the BMC22. If the network mode of the BMC22 is the static IP mode, the administrator can set a static IP address for the BMC22 through the operation interface of the management system 3.

To perform a network setting operation that gives the BMC22 a usable static IP address, the administrator needs to issue at least four commands to the BMC22 through the management system 3 (that is, four IPMI sessions need to be established): (1) set the network mode of the BMC22 to the static IP mode; (2) set the static IP address; (3) set the subnet mask (netmask); and (4) set the gateway IP address.
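The four commands listed above could, for instance, be issued with the open-source `ipmitool` utility. The source only refers to an "IPMI tool program" and does not name one, so the use of `ipmitool`, the LAN channel number, and the helper name `static_ip_commands` are assumptions made for this sketch. The helper only builds the argument lists and does not execute them; connection options such as host and credentials are omitted.

```python
def static_ip_commands(channel: int, ip: str, netmask: str, gateway: str) -> list[list[str]]:
    """Build the four `ipmitool lan set` invocations, in the order listed above."""
    base = ["ipmitool", "lan", "set", str(channel)]
    return [
        base + ["ipsrc", "static"],           # (1) switch to the static IP mode
        base + ["ipaddr", ip],                # (2) static IP address
        base + ["netmask", netmask],          # (3) subnet mask
        base + ["defgw", "ipaddr", gateway],  # (4) gateway IP address
    ]


# Example with placeholder addresses (channel 1 is an assumption):
for argv in static_ip_commands(1, "10.0.0.5", "255.255.255.0", "10.0.0.1"):
    print(" ".join(argv))
```

Because each invocation is a separate command, an error in step (2) or (4) can leave the BMC unreachable before the remaining settings are corrected, which is the failure mode the third type of attention state addresses.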

As described above, if the static IP address set by the administrator is wrong (for example, it duplicates one of the dynamic IP addresses allocated by the DHCP server), or the gateway IP address is set incorrectly, then in an environment where multiple subnets coexist, or where communication must pass through a gateway, the BMC22 cannot connect to the management system 3. From the perspective of the management system 3, the endpoint server 220 to which this BMC22 belongs still exists, but because the connection to the BMC22 has been lost, the management system 3 can no longer manage this BMC22 (or its endpoint server 220). In this case the management system 3 may need to intervene remotely so that the BMC22 restores its network connection.

Refer to FIG. 8, which is a first specific embodiment of the third type of attention state elimination flowchart of the present invention. In this embodiment, the management system 3 accesses the database 31 at regular intervals (step S71) to obtain from the database 31 the status data of each BMC22 in the cabinet 2, the operation behaviors performed by the administrator through the management system 3, and the feedback information that the management system 3 obtained in response to those operation behaviors, and determines the status changes of the BMC22 (step S72).

In one embodiment, the status data obtained by the management system 3 in step S72 includes at least the network mode of each BMC22 (static IP mode or dynamic IP mode), the static IP address currently in use, the subnet mask, and the gateway IP address, but is not limited thereto. The feedback information obtained by the management system 3 in step S72 mainly includes the feedback, system parameters, and execution data generated by the management system 3, the cabinet 2, and each endpoint server 220 (and each BMC22) in response to an operation behavior when it is performed, but is not limited thereto.

After step S72, the management system 3 first determines, from the status data and the feedback information, whether any BMC22 in the cabinet 2 has lost its connection with the management system 3 (step S73), and determines, from the operation behaviors, whether the administrator has just performed a network setting operation on any BMC22 in the cabinet 2 (step S74). If the analysis shows that the administrator has just performed a network setting operation on a BMC22 and that this BMC22 lost its connection right after the network setting operation, the management system 3 regards this BMC22 as being in the third type of attention state (step S75), that is, the BMC22 has lost its connection.
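The combination of steps S73 to S75 could be sketched as follows. This is a minimal Python illustration under assumptions: the source does not say how long after a network setting operation a disconnection still counts as "just" following it, so `RECENT_WINDOW` and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical window for "just performed"; the source does not define one.
RECENT_WINDOW = timedelta(minutes=5)


def in_third_attention_state(connection_lost_at, last_network_config_at,
                             window=RECENT_WINDOW):
    """True if the BMC lost its connection shortly after a network setting operation.

    Either timestamp may be None: no connection loss detected (step S73)
    or no network setting operation recorded (step S74).
    """
    if connection_lost_at is None or last_network_config_at is None:
        return False
    elapsed = connection_lost_at - last_network_config_at
    # The loss must come after the operation and within the window (step S75).
    return timedelta(0) <= elapsed <= window
```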

It is worth mentioning that, in step S73, the management system 3 may determine that a BMC22 has lost its network connection (it has already been lost, or may be lost) when the network mode of that BMC22 is set to the static IP mode and its static IP address duplicates one of the dynamic IP addresses allocated by the DHCP server.

In addition, in step S73, the management system 3 may also determine that a BMC22 has lost its network connection (it has already been lost, or may be lost) when the network mode of that BMC22 is set to the static IP mode and its gateway IP address is set incorrectly. These are only some specific implementation examples of the present invention, which should not be limited thereto.
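The two static-IP checks described for step S73 could be sketched with Python's standard `ipaddress` module. This is an illustration under assumptions: the source does not define what makes a gateway setting "incorrect", so the sketch uses one detectable case (the gateway lies outside the BMC's own subnet), and the function name and message strings are hypothetical.

```python
import ipaddress


def static_ip_problems(ip: str, netmask: str, gateway: str,
                       dhcp_leases: set[str]) -> list[str]:
    """Return configuration problems that could cut a static-IP BMC off the network."""
    problems = []
    leased = {ipaddress.ip_address(a) for a in dhcp_leases}
    if ipaddress.ip_address(ip) in leased:
        problems.append("static IP duplicates a DHCP-assigned address")
    # Derive the BMC's subnet from its address and netmask.
    network = ipaddress.ip_interface(f"{ip}/{netmask}").network
    if ipaddress.ip_address(gateway) not in network:
        # Assumed definition of an incorrect gateway setting.
        problems.append("gateway is outside the BMC subnet")
    return problems
```

An empty result does not prove the BMC is reachable; it only means neither of these two specific misconfigurations was found in the stored status data.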

After step S75, the management system 3 has determined that a BMC22 is in the third type of attention state. The management system 3 then determines which RMC21 in the data center 1 is responsible for this BMC22 (step S76), and controls this RMC21 to check, through the internal hardware line 24 of the cabinet 2, the endpoint server 220 to which the BMC22 belongs (step S77), in order to confirm whether this endpoint server 220 exists (step S78).

As shown in FIG. 2, the RMC21 in a cabinet 2 is physically connected, through the internal hardware line 24, to the BMC22 of every endpoint server 220 in that cabinet 2. Therefore, even if a BMC22 loses its network connection, the RMC21 in the same cabinet 2 can still communicate with that BMC22 through the internal hardware line 24.

If it is determined in step S78 that the endpoint server 220 does not exist (for example, it has been pulled out of the cabinet 2 or is damaged), the management system 3 issues a corresponding warning signal (step S79). In one embodiment, the management system 3 issues the warning signal (for example, text, light, or sound) through the operation interface to alert the administrator. In another embodiment, the management system 3 sends the warning signal (for example, a text message, an e-mail, or a message through communication software) to the administrator over the network.

If it is determined in step S78 that the endpoint server 220 still exists, the management system 3 controls the RMC21 to send a set of IPMI commands to the BMC22 through the internal hardware line 24 so that the BMC22 restores its network connection (step S80). In one embodiment, the management system 3 sends the IPMI commands to the BMC22 via the RMC21 to reset the static IP address of the BMC22 or to reset the gateway IP address of the BMC22, thereby restoring the connection between the BMC22 and the management system 3.
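The rescue flow of steps S76 to S80 could be sketched as follows. This is a minimal Python illustration under assumptions: the `Rmc` interface, its method names, the `rescue_bmc` function, and the string results are all hypothetical, since the source does not describe a programming interface for the RMC21 or for the warning signal.

```python
from typing import Callable, Protocol


class Rmc(Protocol):
    """Hypothetical interface to an RMC that is reachable over the network."""

    def server_present(self, bmc_id: str) -> bool: ...

    def send_ipmi_over_internal_line(self, bmc_id: str, command: str) -> None: ...


def rescue_bmc(bmc_id: str,
               rmc_for: Callable[[str], Rmc],
               alert: Callable[[str], None],
               command: str) -> str:
    rmc = rmc_for(bmc_id)                               # step S76: find the responsible RMC
    if not rmc.server_present(bmc_id):                  # steps S77 and S78: check the server
        alert(f"endpoint server of {bmc_id} is missing")  # step S79: warn the administrator
        return "alerted"
    rmc.send_ipmi_over_internal_line(bmc_id, command)   # step S80: restore the connection
    return "recovery command sent"
```

Passing `rmc_for` and `alert` as callables keeps the sketch independent of how the management system looks up RMCs or delivers warnings, both of which the text leaves open.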

With the above technical solution, the management system 3 can proactively apply a rescue mechanism to a BMC22 from the remote end after the BMC22 loses its connection, so that the BMC22 restores its network connection.

With the method of the present invention, the management system 3 automatically collects the required information and analyzes the status of all RMC21 and BMC22, and automatically applies the corresponding mechanism to eliminate the abnormal state whenever any RMC21 or BMC22 is in one of the attention states. The technical solution of the present invention can thus greatly reduce management costs, and allows the data center 1 to be monitored without human intervention and regardless of distance or time.

The above are only preferred specific embodiments of the present invention and do not limit the patent scope of the present invention. All equivalent changes made using the content of the present invention are likewise included within the scope of the present invention.

1: Data center

2: Cabinet

21: Rack management controller (RMC)

211, 221: Network interface controller

22: Baseboard management controller (BMC)

23: Internal network switch

24: Internal hardware line

3: Cabinet server management system

31: Database

4: Public network switch

S11~S15, S21~S28: Collection steps

S31~S39: Analysis and elimination steps

S41~S47, S51~S58, S61~S66, S71~S80: Elimination steps

FIG. 1 is a schematic diagram of the data center of the present invention.

FIG. 2 is a first specific embodiment of the block diagram of the cabinet of the present invention.

FIG. 3A is a first specific embodiment of the data collection flowchart of the present invention.

FIG. 3B is a second specific embodiment of the data collection flowchart of the present invention.

FIG. 4 is a first specific embodiment of the analysis and elimination flowchart of the present invention.

FIG. 5 is a first specific embodiment of the first type of attention state elimination flowchart of the present invention.

FIG. 6 is a second specific embodiment of the first type of attention state elimination flowchart of the present invention.

FIG. 7 is a first specific embodiment of the second type of attention state elimination flowchart of the present invention.

FIG. 8 is a first specific embodiment of the third type of attention state elimination flowchart of the present invention.


Claims

A remote elimination method for an abnormal state of a cabinet, applied to a data center having a cabinet and a cabinet server management system remotely connected to the cabinet, wherein the cabinet has a Rack Management Controller (RMC) and a plurality of endpoint servers, each endpoint server has a Baseboard Management Controller (BMC), and the remote elimination method comprises:
　　a) the cabinet server management system regularly accessing a database to obtain status data of each BMC, an operation behavior performed on the cabinet by an administrator through the cabinet server management system, and feedback information obtained in response to the operation behavior;
　　b) determining, according to the status data, the operation behavior and the feedback information, whether any of the BMCs is in one of a plurality of preset attention states; and
　　c) when any BMC is determined to be in a third type of attention state among the plurality of attention states, the cabinet server management system automatically performing a remote rescue mechanism on the BMC in the third type of attention state to eliminate the abnormal state in which the BMC has lost its network connection, wherein the third type of attention state means that the BMC has lost its connection with the cabinet server management system.

The remote elimination method for the abnormal state of the cabinet as described in claim 1, further comprising the following steps:
　　a01) starting the cabinet server management system;
　　a02) after step a01), the cabinet server management system periodically and actively accessing, from the remote end, the RMC and each BMC in the cabinet;
　　a03) obtaining the status data of the RMC and each BMC;
　　a04) storing the status data in the database; and
　　a05) repeating step a02) to step a04) until the cabinet server management system is shut down.

The remote elimination method for the abnormal state of the cabinet as described in claim 1, further comprising the following steps:
　　a11) starting the cabinet server management system;
　　a12) after step a11), the cabinet server management system providing an operation interface;
　　a13) when the operation behavior of the administrator is accepted through the operation interface, performing a remote management procedure on the RMC and each BMC according to the content of the operation behavior;
　　a14) obtaining the feedback information corresponding to the remote management procedure;
　　a15) storing the operation behavior and the feedback information in the database; and
　　a16) repeating step a12) to step a15) until the cabinet server management system is shut down.

The remote elimination method for the abnormal state of the cabinet as described in claim 1, wherein the status data includes at least the network mode, IP address, subnet mask, and gateway IP address of each BMC.

The remote elimination method for the abnormal state of the cabinet as described in claim 1, wherein the feedback information includes the feedback, system parameters and execution data respectively generated by the cabinet server management system, the cabinet, each endpoint server, the RMC, and each BMC when the operation behavior is performed.

The remote elimination method for the abnormal state of the cabinet as described in claim 1, wherein step b) comprises the following steps:
　　b1) determining, according to the status data and the feedback information, whether any of the BMCs has lost its connection with the cabinet server management system;
　　b2) determining, according to the operation behavior, whether a network setting operation has just been performed on any of the BMCs; and
　　b3) when the network setting operation has just been performed on any BMC and that BMC lost its connection after the network setting operation, determining that the BMC is in the third type of attention state.

The remote elimination method for the abnormal state of the cabinet as described in claim 6, wherein step b1) determines that a BMC has lost its connection when the network mode of the BMC is set to a static IP mode and the static IP address of the BMC duplicates one of a plurality of dynamic IP addresses allocated by a Dynamic Host Configuration Protocol (DHCP) server in the data center.

The remote elimination method for the abnormal state of the cabinet as described in claim 6, wherein step b1) determines that a BMC has lost its connection when the network mode of the BMC is set to a static IP mode and the gateway IP address of the BMC is set incorrectly.

The remote elimination method for the abnormal state of the cabinet as described in claim 6, wherein step c) comprises the following steps:
　　c1) when any BMC is determined to be in the third type of attention state, identifying the RMC connected to the BMC;
　　c2) controlling the RMC to check, through an internal hardware line of the cabinet, the endpoint server to which the BMC belongs, wherein the RMC is physically connected to all the BMCs in the cabinet through the internal hardware line;
　　c3) when the endpoint server does not exist, issuing a warning signal; and
　　c4) when the endpoint server exists, controlling the RMC to send an Intelligent Platform Management Interface (IPMI) command to the BMC through the internal hardware line to restore the connection between the BMC and the cabinet server management system.

The remote elimination method for the abnormal state of the cabinet as described in claim 9, wherein step c4) resets the static IP address of the BMC, or resets the gateway IP address of the BMC, through the IPMI command.