TW202145015A

TW202145015A - Device and method for monitoring server and storage medium

Info

Publication number: TW202145015A
Application number: TW109116998A
Authority: TW
Inventors: 林廷皆; 黃尹; 程俊德; 潘聖中; 董光展
Original assignee: 新加坡商鴻運科股份有限公司
Priority date: 2020-05-21
Filing date: 2020-05-21
Publication date: 2021-12-01

Abstract

A server monitoring method includes: collecting SDR data of a server cluster to be monitored, wherein the server cluster to be monitored includes at least one server; storing the SDR data in a predetermined file format in a specified storage area; analyzing the SDR data stored in the specified storage area to determine whether abnormal SDR data exists; and, if abnormal SDR data exists, outputting warning information of an abnormal component corresponding to the abnormal SDR data. A server monitoring device and a storage medium are also provided.

Description

Server monitoring device, method and computer readable storage medium

The present invention relates to the technical field of communication equipment, and in particular to a server monitoring device, a server monitoring method, and a computer-readable storage medium.

In recent years, with the rapid development of technology and the Internet, servers have become increasingly powerful. To keep a server operating stably, a baseboard management controller (BMC) is generally used to monitor and manage the operation of the server system. The BMC can record system events in a non-volatile system event log (SEL); recorded event types include, for example, abnormal temperature, abnormal voltage, and abnormal fan behavior. During monitoring, the BMC can also manage a non-volatile sensor data record repository (SDRR), from which system runtime information can be retrieved. The log information automatically generated by the BMC has therefore become an important indicator and reference for checking the running status of a server. However, abnormal events of the sensors used to monitor the running status of the server cannot be obtained from the log information automatically generated by the BMC.

In view of this, it is necessary to provide a server monitoring device, a server monitoring method, and a computer-readable storage medium that can monitor the running status of a server and of its internal sensors.

An embodiment of the present invention provides a server monitoring method. The method includes: collecting SDR data of a server cluster to be monitored, wherein the server cluster to be monitored includes at least one server; storing the collected SDR data in a preset file format in a designated storage area; analyzing the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists; and, if abnormal SDR data exists, outputting warning information of the abnormal element corresponding to the abnormal SDR data.

An embodiment of the present invention provides a server monitoring device. The device includes a processor and a memory, the memory stores a plurality of computer programs, and the processor implements the following steps when executing the computer programs stored in the memory: collecting SDR data of a server cluster to be monitored, wherein the server cluster to be monitored includes at least one server; storing the collected SDR data in a preset file format in a designated storage area; analyzing the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists; and, if abnormal SDR data exists, outputting warning information of the abnormal element corresponding to the abnormal SDR data.

An embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores a plurality of instructions that can be executed by one or more processors to implement the following steps: collecting SDR data of a server cluster to be monitored, wherein the server cluster to be monitored includes at least one server; storing the collected SDR data in a preset file format in a designated storage area; analyzing the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists; and, if abnormal SDR data exists, outputting warning information of the abnormal element corresponding to the abnormal SDR data.

Compared with the prior art, the above server monitoring device, method, and computer-readable storage medium analyze SDR data and can thereby report abnormal information that the BMC cannot detect, find signs of component degradation or damage in advance, and locate the faulty sensors and server components. They can also filter BMC log files and report abnormal logs.

Please refer to FIG. 1, which is a schematic diagram of a preferred embodiment of the server monitoring device of the present invention.

The server monitoring device 100 can monitor multiple servers in a data center 200. For example, the data center 200 includes at least one server cluster to be monitored, and the server cluster to be monitored may include multiple servers. It can be understood that the server monitoring device 100 can also monitor a server or server cluster designated by the user according to actual requirements, which is not limited herein.

The server monitoring device 100 may include a memory 10, a processor 20, and a server monitoring program 30 that is stored in the memory 10 and can run on the processor 20. When the processor 20 executes the server monitoring program 30, the steps of the server monitoring method embodiment are implemented, such as steps S300 to S306 shown in FIG. 3. Alternatively, when the processor 20 executes the server monitoring program 30, the functions of the modules shown in FIG. 2, such as modules 101 to 105, are implemented.

The server monitoring program 30 can be divided into one or more modules, which are stored in the memory 10 and executed by the processor 20 to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the server monitoring program 30 in the server monitoring device 100. For example, the server monitoring program 30 can be divided into the collection module 101, the storage module 102, the analysis module 103, the output module 104, and the conversion module 105 shown in FIG. 2. The specific functions of each module are given in the description of FIG. 2 below.

Those skilled in the art will understand that the schematic diagram is only an example of the server monitoring device 100 and does not limit it. The server monitoring device 100 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the server monitoring device 100 may further include an input/display device, a communication module, a bus, and the like.

The processor 20 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor 20 may be any conventional processor. The processor 20 can connect to the various parts of the server monitoring device 100 through various interfaces and buses.

The memory 10 can be used to store the server monitoring program 30 and/or the modules. The processor 20 implements the various functions of the server monitoring device 100 by running or executing the computer programs and/or modules stored in the memory 10 and by calling the data stored in the memory 10. The memory 10 may include high-speed random access memory, and may also include non-volatile memory such as a hard disk drive, internal memory, a plug-in hard disk drive, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state memory device.

FIG. 2 is a functional module diagram of a preferred embodiment of the server monitoring program of the present invention.

Referring to FIG. 2, the server monitoring program 30 may include a collection module 101, a storage module 102, an analysis module 103, an output module 104, and a conversion module 105. In one embodiment, these modules may be programmable software instructions that are stored in the memory 10 and can be invoked and executed by the processor 20. It can be understood that, in other embodiments, the modules may also be program instructions or firmware embedded in the processor 20.

The collection module 101 is used to collect the SDR data of the server cluster to be monitored.

In one embodiment, the server cluster to be monitored includes at least one server. The server cluster to be monitored may be a server cluster that a user designates for monitoring, such as a server cluster of the data center 200.

In one embodiment, the collection module 101 can access, by means of a web page, the SUT (System Under Test) associated with the server cluster to be monitored. The SUT may include an SDR monitor, and the server monitoring device 100 can send a control command to the SDR monitor so that the collection module 101 can collect the SDR data of each server in the server cluster to be monitored. In other embodiments of the present invention, the collection module 101 can also collect the SDR data of each server in the server cluster to be monitored by communicating with an IPMI (Intelligent Platform Management Interface) monitor of the server cluster.

It can be understood that, for each server, IPMI can be used to monitor the physical characteristics of the server, for example through sensors distributed over the server baseboard, system board, chassis, fans, and other locations. SDR data can be generated from the data these sensors monitor, and the SDR data can be stored in the sensor data record repository (SDRR). The physical characteristics may be temperature, voltage, fan working status, power status, and so on. The baseboard management controller (BMC) installed in the server can automatically monitor the management events of the server system and record the events that occur in the non-volatile system event log (SEL). While monitoring the server, the BMC can also manage the non-volatile SDRR and retrieve system runtime information from this repository.

The storage module 102 is used to store the collected SDR data in a preset file format in a designated storage area.

In one embodiment, the preset format can be determined according to actual needs and may be, for example, the CSV file format. The designated storage area can likewise be determined according to actual needs and may be, for example, a designated file server or a designated database. A file stored in the designated storage area can be named after the server's name or number, or after the name of the SUT corresponding to the server, so that files can be told apart, clutter is avoided, and administrators can easily view the files later. For example, the storage module 102 stores the collected SDR data on a designated file server as files named SDR_SUT_1.CSV, SDR_SUT_2.CSV, SDR_SUT_3.CSV, ..., SDR_SUT_n.CSV.
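The storage step can be illustrated with a minimal Python sketch. Only the CSV format and the SDR_SUT_n.CSV naming come from the description; the function name `store_sdr_records`, the column names, and the use of a local directory in place of a file server are assumptions made for illustration.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the description does not specify the CSV columns.
FIELDNAMES = ["timestamp", "sensor", "value", "unit"]


def store_sdr_records(records, sut_index, storage_dir):
    """Write one SUT's SDR readings to <storage_dir>/SDR_SUT_<n>.CSV.

    records: iterable of dicts keyed by FIELDNAMES.
    Returns the path of the file that was written.
    """
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    path = storage_dir / f"SDR_SUT_{sut_index}.CSV"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)
    return path
```

One file per SUT keeps each server's readings separate, which matches the stated goal of avoiding clutter in the storage area.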

The analysis module 103 is used to analyze the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists.

In one embodiment, the preset analysis rules may be predefined rules for analyzing different types of SDR data, and they can be adjusted according to actual needs. The different types may be different kinds of parameters such as voltage, temperature, and rotational speed. Parameters of the same type may still have different analysis rules for different components; for example, CPU temperature and hard disk temperature may have different analysis rules.

In one embodiment, the SDR data includes CPU temperature data and power consumption data, and 10 CPU temperature readings and 10 power consumption readings are taken both when a stress test starts and when it ends. The preset analysis rules then include the following. When the stress test is running (high load), the CPU temperature should be higher than the temperature recorded before the stress test started, and the power consumption should be greater than the power consumption recorded without the stress test. When the stress test ends, the CPU temperature should be lower than the temperature recorded during the stress test, and the power consumption should be lower than the power consumption recorded without the stress test. Otherwise, the analysis module 103 can determine that an abnormality exists and locate the abnormal point.
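A minimal Python sketch of the stress-test rule follows. The function name `check_stress_test` is hypothetical, and comparing the means of the sampled readings is an assumption, since the description does not say how the 10 readings are compared. The sketch also compares the post-test readings against the under-load readings for both temperature and power, which is an interpretation of the rule rather than its literal wording.

```python
from statistics import mean


def check_stress_test(idle, loaded, after):
    """Return a list of rule violations for one sensor (temperature or power).

    idle, loaded, after: non-empty lists of readings taken before, during,
    and after the stress test. Means are compared (an assumption).
    """
    anomalies = []
    if not mean(loaded) > mean(idle):
        anomalies.append("no rise under load")
    if not mean(after) < mean(loaded):
        anomalies.append("no drop after load")
    return anomalies
```

An empty result means the sensor behaved as the rule expects; each returned string names the phase in which the expected change did not occur, which helps locate the abnormal point.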

It can be understood that, while the server is running, chip voltage data may need to remain stable, whereas temperature or fan speed should be dynamic and cannot remain completely unchanged. The preset analysis rule may be: analyze n consecutive SDR readings (for example, 50). If they are voltage readings, the n consecutive readings should be identical; if they are temperature or fan speed readings, the n consecutive readings should be dynamic, that is, not all identical. Otherwise, the analysis module 103 can determine that an abnormality exists and locate the abnormal point.
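The consecutive-reading rule can be sketched in Python as follows. The function name `is_window_abnormal` and the string labels for sensor kinds are hypothetical; exact equality is used for the "identical" test, as the description states, although a real deployment might prefer a small tolerance.

```python
def is_window_abnormal(kind, readings):
    """Apply the n-consecutive-readings rule to one window of SDR data.

    kind: "voltage" expects all readings to be identical; any other kind
    (for example "temperature" or "fan_speed") expects some variation.
    Returns True when the window violates the rule.
    """
    if not readings:
        raise ValueError("readings must not be empty")
    all_identical = len(set(readings)) == 1
    if kind == "voltage":
        return not all_identical
    return all_identical
```

For example, `is_window_abnormal("temperature", [45] * 50)` flags a temperature sensor whose 50 consecutive readings never change, which may point to a stuck sensor.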

In one embodiment, the SDR data of components belonging to the same group should be substantially the same. The preset analysis rule may be: at any moment, the difference between the SDR data of the components in the group should fall within a preset difference interval. For example, if a group includes multiple normally working CPUs, then at any moment the difference between the CPUs' temperature readings should be within 15%; if a group includes multiple normally working solid-state drives (SSDs), then at any moment the difference between the SSDs' temperature readings should be within 10%; if a group includes multiple normally working fans, then at any moment the difference between the fans' speeds should be within 10%. Otherwise, the analysis module 103 can determine that an abnormality exists and locate the abnormal point.
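The group rule can be sketched in Python as below. The description gives the percentage limits but not the baseline against which the difference is measured, so measuring each component's deviation from the group median is an assumption, as is the function name `find_group_outliers`.

```python
from statistics import median


def find_group_outliers(readings, tolerance):
    """Return the names of components that deviate too far from their group.

    readings: dict mapping component name to its reading at one moment.
    tolerance: allowed relative deviation, e.g. 0.15 for 15%.
    The group median is used as the reference value (an assumption).
    """
    reference = median(readings.values())
    if reference == 0:
        # Relative deviation is undefined; flag anything that is non-zero.
        return [name for name, value in readings.items() if value != 0]
    return [
        name
        for name, value in readings.items()
        if abs(value - reference) / abs(reference) > tolerance
    ]
```

Using the median rather than the mean keeps a single faulty component from shifting the reference value, so the returned names point directly at the suspect components.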

The preset analysis rule may also be: during the server self-test, the fan speed should rise when the self-test starts and fall when the self-test ends. For example, the fan speed should rise by at least 20% when the self-test starts and fall by at least 20% when the self-test ends. Otherwise, the analysis module 103 can determine that an abnormality exists and locate the abnormal point.
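A short Python sketch of the self-test fan rule follows. The function name `check_self_test_fan` is hypothetical, and measuring the rise relative to the pre-test speed and the drop relative to the in-test speed is an assumption, since the description does not state the baselines.

```python
def check_self_test_fan(before, during, after, min_change=0.20):
    """Check that fan speed rises and then falls around a self-test.

    before, during, after: fan speeds (e.g. RPM) sampled before the
    self-test, while it runs, and after it ends.
    min_change: required relative change, 0.20 for the 20% in the example.
    Returns a list of violated expectations (empty when the fan is normal).
    """
    anomalies = []
    if during < before * (1 + min_change):
        anomalies.append("rise below threshold")
    if after > during * (1 - min_change):
        anomalies.append("drop below threshold")
    return anomalies
```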

In one embodiment, the analysis module 103 can also analyze the abnormal SDR data together with the SEL log file to determine whether the abnormal SDR data was caused by an abnormality of the sensor itself or by an abnormality of a component of the server.

In one embodiment, the abnormal SDR data may include a first type of abnormal SDR data and a second type of abnormal SDR data. The first type of abnormal SDR data is data that can trigger the BMC to generate a corresponding abnormal log file, and the second type of abnormal SDR data is data that cannot trigger the BMC to generate a corresponding abnormal log file. The analysis module 103 can therefore act on errors that the BMC does not report and find signs of component abnormality or degradation in advance, so that the user can learn that a component may be about to fail.

The output module 104 is used to output the warning information of the abnormal element corresponding to the abnormal SDR data.

In one embodiment, the abnormal element may be a sensor used to monitor a component of the server and/or a component of the server. In other words, the abnormality may lie in the sensor that monitors a server component, or in the server component itself. The warning information includes, but is not limited to, abnormality information (which records basic details of the abnormality, such as the component name, the component number, and a basic description of the abnormality) and the time interval in which the abnormality occurred.

In one embodiment, the server monitoring device 100 can also filter and report BMC abnormal log files. Specifically, the collection module 101 can also obtain, from the server cluster to be monitored, the log files generated by the BMC of each server, and the analysis module 103 can detect whether the log files generated by the BMC contain abnormal log files. For example, the analysis module 103 can check whether preset key matching information appears in a log file and whether the parameters of a component recorded in the log file exceed the corresponding thresholds, and thereby detect abnormal log files. The output module 104 can output and display the abnormal log files detected by the analysis module 103.
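The two detection criteria named above, keyword matching and threshold checks, can be sketched in Python. The entry layout (dicts with `message`, `sensor`, and `value` keys), the function name `find_abnormal_log_entries`, and the treatment of thresholds as upper limits are all assumptions; the description does not define the BMC log format.

```python
def find_abnormal_log_entries(entries, keywords, thresholds):
    """Select log entries that match a keyword or exceed a threshold.

    entries: list of dicts with a "message" string and, optionally,
        "sensor" and "value" keys (hypothetical layout).
    keywords: preset strings that mark an entry as abnormal.
    thresholds: dict mapping sensor name to its upper limit.
    """
    abnormal = []
    for entry in entries:
        message = entry.get("message", "").lower()
        keyword_hit = any(keyword.lower() in message for keyword in keywords)
        limit = thresholds.get(entry.get("sensor"))
        value = entry.get("value")
        threshold_hit = limit is not None and value is not None and value > limit
        if keyword_hit or threshold_hit:
            abnormal.append(entry)
    return abnormal
```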

In one embodiment, the analysis module 103 can also determine whether the abnormal log files contain a log file corresponding to the first type of abnormal SDR data; this double check makes the BMC system log mechanism more rigorous. If the abnormal log files contain no log file corresponding to the first type of abnormal SDR data, the BMC may have failed to record a system log event or the SDR data may have been recorded abnormally. In that case, the output module 104 can output preset prompt information to remind the administrator of the server cluster to be monitored to investigate.
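The cross-check between first-type abnormal SDR data and the BMC abnormal logs can be sketched in Python as follows. The description does not say how an SDR anomaly is matched to a log entry, so matching on the same sensor name within a time window is an assumption, as are the function name `find_unlogged_anomalies`, the record layout, and the 60-second default.

```python
def find_unlogged_anomalies(first_type_anomalies, abnormal_logs, window_seconds=60):
    """Return first-type SDR anomalies that have no matching BMC log entry.

    Both inputs are lists of dicts with "sensor" and "timestamp" keys,
    where "timestamp" is in seconds (hypothetical layout). An anomaly is
    considered logged when a log entry for the same sensor lies within
    window_seconds of it.
    """
    missing = []
    for anomaly in first_type_anomalies:
        matched = any(
            log["sensor"] == anomaly["sensor"]
            and abs(log["timestamp"] - anomaly["timestamp"]) <= window_seconds
            for log in abnormal_logs
        )
        if not matched:
            missing.append(anomaly)
    return missing
```

A non-empty result is the condition under which the prompt to the administrator would be issued.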

In one embodiment, when the analysis module 103 determines through analysis that abnormal SDR data exists, it can generate a corresponding SDR abnormal log file based on the abnormal SDR data, and the output module 104 can output the SDR abnormal log file so that the administrator can view it immediately. In other embodiments of the present invention, the SDR abnormal log file can also be stored in a designated database as a backup, so that the administrator can consult it later or use it for big data analysis.

In one embodiment, so that the administrator can quickly view the SDR information and understand the working status of each monitored component of each server, the conversion module 105 classifies the collected SDR data and converts it into SDR graphs, and the output module 104 can output the SDR graphs on a display interface.

It can be understood that each SDR graph can correspond to the data monitored by one sensor. The conversion module 105 can monitor the collected SDR data in real time and can thereby update the SDR graphs in real time.
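The classification step that precedes plotting can be sketched in Python. This sketch only groups readings into one time-ordered series per sensor and leaves the drawing to whatever plotting tool is used; the function name `build_sensor_series` and the record keys are assumptions.

```python
from collections import defaultdict


def build_sensor_series(records):
    """Group SDR readings into one time-ordered series per sensor.

    records: iterable of dicts with "sensor", "timestamp", and "value"
    keys (hypothetical layout). Returns a dict mapping each sensor name
    to a list of (timestamp, value) points sorted by timestamp, ready to
    be drawn as one curve per sensor.
    """
    series = defaultdict(list)
    for record in records:
        series[record["sensor"]].append((record["timestamp"], record["value"]))
    for points in series.values():
        points.sort(key=lambda point: point[0])
    return dict(series)
```

Calling the function again whenever new readings arrive gives the refreshed series needed to update the graphs.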

FIG. 3 is a flowchart of a server monitoring method in an embodiment of the present invention. Depending on requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.

Step S300: collect SDR data of the server cluster to be monitored, wherein the server cluster to be monitored includes at least one server.

Step S302: store the collected SDR data in a preset file format in a designated storage area.

Step S304: analyze the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists.

Step S306: if abnormal SDR data exists, output the warning information of the abnormal element corresponding to the abnormal SDR data.
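Steps S300 to S306 can be tied together in a minimal Python sketch. The function name `monitor_once` and the idea of passing each step in as a callable are assumptions made for illustration; the flowchart only fixes the order of the steps.

```python
def monitor_once(collect, store, analyze, alert):
    """Run one pass of the monitoring flow, steps S300 to S306.

    collect: callable returning the collected SDR records (S300).
    store: callable taking the records and returning their location (S302).
    analyze: callable taking the location and returning a list of
        anomalies, empty when nothing is abnormal (S304).
    alert: callable taking the anomalies; called only when any exist (S306).
    Returns the list of anomalies found in this pass.
    """
    records = collect()
    location = store(records)
    anomalies = analyze(location)
    if anomalies:
        alert(anomalies)
    return anomalies
```

Keeping the steps as separate callables mirrors the module split in FIG. 2 and lets each step be replaced or tested on its own.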

By analyzing SDR data, the above server monitoring device, method, and computer-readable storage medium can report abnormal information that the BMC cannot detect, find signs of component degradation or damage in advance, and locate the faulty sensors and server components. They can also filter BMC log files and report abnormal logs.

In summary, the present invention meets the requirements for an invention patent, and a patent application is hereby filed in accordance with the law. However, the above are only preferred embodiments of the present invention, and the scope of the present invention is not limited to these embodiments. Equivalent modifications or changes made by those skilled in the art in accordance with the spirit of the present invention shall all be covered by the following claims.

10: Memory
20: Processor
30: Server monitoring program
101: Collection module
102: Storage module
103: Analysis module
104: Output module
105: Conversion module
100: Server monitoring device

FIG. 1 is a functional module diagram of a server monitoring device according to an embodiment of the present invention.

FIG. 2 is a functional module diagram of a server monitoring program according to an embodiment of the present invention.

FIG. 3 is a flowchart of a server monitoring method according to an embodiment of the present invention.

Claims

A server monitoring method, the method comprising: collecting SDR data of a server cluster to be monitored, wherein the server cluster to be monitored includes at least one server; storing the collected SDR data in a preset file format in a designated storage area; analyzing the SDR data in the designated storage area using preset analysis rules to determine whether abnormal SDR data exists; and if abnormal SDR data exists, outputting warning information of the abnormal element corresponding to the abnormal SDR data.

The server monitoring method according to claim 1, further comprising: obtaining log files generated by the baseboard management controller of the server cluster to be monitored; and detecting whether the log files generated by the baseboard management controller contain an abnormal log file, and outputting the detected abnormal log file.

The server monitoring method according to claim 2, wherein the abnormal SDR data includes a first type of abnormal SDR data and a second type of abnormal SDR data, wherein the first type of abnormal SDR data can trigger the baseboard management controller to generate a corresponding abnormal log file, and the second type of abnormal SDR data cannot trigger the baseboard management controller to generate a corresponding abnormal log file.

The server monitoring method according to claim 3, further comprising: determining whether a log file corresponding to the first type of abnormal SDR data exists in the abnormal log file; and if no log file corresponding to the first type of abnormal SDR data exists in the abnormal log file, outputting preset prompt information.

The server monitoring method according to claim 1, wherein the step of outputting, if abnormal SDR data exists, the warning information of the abnormal element corresponding to the abnormal SDR data comprises: if abnormal SDR data exists, generating an SDR abnormal log file based on the abnormal SDR data; and outputting the SDR abnormal log file and the warning information of the abnormal element corresponding to the abnormal SDR data.

The server monitoring method according to claim 1, further comprising: converting the collected SDR data into an SDR graph, and outputting the SDR graph; and monitoring the collected SDR data to update the SDR graph.

The server monitoring method according to claim 1, wherein the abnormal element includes a sensor for monitoring a component of the server and/or a component of the server, and the warning information includes abnormality information and the time interval in which the abnormality occurred.

The server monitoring method according to claim 1, further comprising: accessing, by means of a web page, the system under test associated with the server cluster to be monitored, to collect the SDR data of the server cluster to be monitored.

A server monitoring device, the device comprising a processor and a memory, wherein a plurality of computer programs are stored in the memory, and the processor, when executing the computer programs stored in the memory, implements the steps of the server monitoring method according to any one of claims 1 to 8.

A computer-readable storage medium storing a plurality of instructions, wherein the instructions can be executed by one or more processors to implement the steps of the server monitoring method according to any one of claims 1 to 8.