TW201822002A - Error resolving method or switch - Google Patents


Info

Publication number
TW201822002A
Authority
TW
Taiwan
Prior art keywords
error
switch
switches
management controller
central processor
Prior art date
Application number
TW105140520A
Other languages
Chinese (zh)
Other versions
TWI601013B (en)
Inventor
胡翔竣
羅毅倫
Original Assignee
英業達股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 英業達股份有限公司
Priority to TW105140520A
Application granted
Publication of TWI601013B
Publication of TW201822002A

Landscapes

  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An error resolving method for switches is adapted to a server apparatus that includes a plurality of switches, a central processor, and a baseboard management controller. When the central processor executes a task, it generates at least one control signal and sends it to at least one of the switches. Some of the switches establish a connection according to the control signal. The switches in the connection are electrically connected between a source device and a destination device, so that a signal is transmitted from the source device to the destination device. When an error occurs while the central processor or the switches are executing the task, the central processor resets the connection. The baseboard management controller detects whether the error has been resolved. If the error has not been resolved, the baseboard management controller saves an error log and resets the server apparatus. After the server apparatus is reset, the baseboard management controller selectively resets the switches to their factory (preset) connection settings.

Description

Error resolving method for switches

The present invention relates to an error resolving method for switches, and more particularly to a method in which a baseboard management controller is used to resolve errors that occur in switches.

With the spread of Internet services and cloud computing, more and more enterprises rely on data centers to process and store large amounts of data. A traditional data center contains a large number of servers and nodes for remotely storing, processing, or distributing large amounts of data. As customer needs change and services diversify, servers continue to evolve and be upgraded.

To improve data transmission efficiency, switches are now used as intermediaries for data transmission on the server motherboard. Using PCIe (Peripheral Component Interconnect Express) technology, the switches provide a high-bandwidth, low-latency data transmission scheme. However, the switches on current server motherboards are all controlled and configured by the central processor on the motherboard. When the central processor halts or otherwise becomes inoperable, the server cannot automatically record the error that occurred, so the server administrator cannot obtain the cause of the error and correct it accordingly.

The present invention provides an error resolving method for switches, to solve the problem that a server cannot automatically record, recover from, or correct an error when the central processor halts or otherwise becomes inoperable.

The error resolving method for switches disclosed in the present invention is applicable to a server apparatus. The server apparatus has a plurality of switches, a central processor, and a baseboard management controller. In the method, the central processor generates at least one control signal and sends it to the switches when executing a task. The task involves transmitting a signal generated by a source device to a destination device. At least some of the switches establish a connection according to the control signal. The switches in the connection electrically connect the source device and the destination device. When an error occurs in the central processor or the switches while the task is being executed, the central processor resets the connection. The baseboard management controller detects whether the error has been resolved. If the error has not been resolved, the baseboard management controller records the error, resets the server apparatus, and, after the server apparatus is reset, selectively configures the switches with a preset connection.

In the error resolving method for switches disclosed above, the baseboard management controller detects whether an error that occurred in the central processor or the switches during execution of a task has been resolved. Thus, when the central processor halts or otherwise becomes inoperable, the baseboard management controller can obtain the state of the central processor or the switches, record the cause of the error, and reset the server so that the error can be cleared after the reset. If the error still cannot be cleared after the server is reset, the baseboard management controller can reconfigure the connection of the switches, further assisting the central processor in resolving the error.

The above description of the present disclosure and the following description of the embodiments are intended to demonstrate and explain the spirit and principles of the present invention, and to provide further explanation of the claims of the present invention.

The detailed features and advantages of the present invention are described in the embodiments below. The description is sufficient for anyone skilled in the relevant art to understand the technical content of the present invention and implement it accordingly, and, based on the content disclosed in this specification, the claims, and the drawings, anyone skilled in the relevant art can readily understand the related objects and advantages of the present invention. The following embodiments describe aspects of the present invention in further detail, but do not limit the scope of the present invention in any way.

Refer to FIG. 1 and FIG. 2. FIG. 1 is a functional block diagram of a server apparatus according to an embodiment of the present invention, and FIG. 2 is a flowchart of the steps of an error resolving method for switches according to an embodiment of the present invention. As shown, the server apparatus 1 has a plurality of switches 10, a central processor 12, and a baseboard management controller 14. The switches 10 are arranged in a three-by-three switch array 101, in which each switch 10 in the first row is electrically connected to each switch 10 in the second row, and each switch 10 in the second row is electrically connected to each switch 10 in the third row. The switches 10 in the first row and the third row are in turn connected to the source device 20 and the destination device 22 in the server apparatus 1, respectively. The source device 20 and the destination device 22 are, for example, a graphics processing unit (GPU), a host, a network interface card (NIC), a host bus adapter (HBA), or another suitable device; this embodiment imposes no limitation.

Each switch 10 in the switch array is electrically connected to both the central processor 12 and the baseboard management controller 14, and the central processor 12 is electrically connected to the baseboard management controller 14. In one embodiment, the central processor 12 is electrically connected to the management port of each switch 10, the baseboard management controller 14 is connected to the switches 10 through an I²C (Inter-Integrated Circuit) or GPIO (general-purpose input/output) interface, and the central processor 12 and the baseboard management controller 14 are connected by a PCI Express bus, but the invention is not limited thereto. The topology in FIG. 1 is an example; any number of switches, central processors, and baseboard management controllers may be included in the server apparatus of FIG. 1.
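The inter-switch wiring described for FIG. 1 can be illustrated with a minimal Python sketch. It assumes each switch is identified by a hypothetical `(row, index)` tuple and models only the links between adjacent rows; the identifiers and the function name are illustrative and do not come from the patent.

```python
from itertools import product

SIZE = 3  # FIG. 1 describes a three-by-three switch array


def build_links(size=SIZE):
    """Return the links between adjacent rows of the switch array.

    Every switch in row r is linked to every switch in row r + 1,
    which mirrors the full connectivity described for FIG. 1.
    Switch ids are hypothetical (row, index) tuples.
    """
    links = set()
    for row in range(size - 1):
        for a, b in product(range(size), repeat=2):
            links.add(((row, a), (row + 1, b)))
    return links


links = build_links()
print(len(links))  # number of links between adjacent rows
```

With three rows of three switches, each pair of adjacent rows contributes nine links, so any first-row switch can reach any third-row switch through any second-row switch.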

In one embodiment, in step S301, the central processor 12 generates at least one control signal and sends it to the switches 10 when executing a task. In step S303, at least some of the switches 10 establish a connection according to the control signal. The control signal may, for example, be sent only to the switches 10 that are to form the connection, or it may be sent to every switch 10; this embodiment imposes no limitation. The control signal instructs a switch 10 to select the pin that receives the signal and the pin that outputs the signal. In other words, the task executed by the central processor 12 involves transmitting a signal generated by the source device 20 to the destination device 22. The central processor 12 therefore generates control signals according to the switches 10 to which the source device 20 and the destination device 22 are connected, instructing those switches 10 to select their receiving and output pins. This establishes a connection through which the signal generated by the source device 20 can reach the destination device 22 via the switches 10 in the connection.
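Steps S301 and S303 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `ControlSignal` type, the pin labels, and the idea of passing an explicit path of switch ids are all hypothetical, since the patent only says that each control signal tells a switch which pin receives and which pin outputs the signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignal:
    """Hypothetical control signal: which pins one switch should select."""
    switch: tuple   # (row, index) id of the switch being configured
    in_pin: str     # label of the pin that receives the signal
    out_pin: str    # label of the pin that outputs the signal


def plan_connection(path):
    """Build one control signal per switch on a source-to-destination path.

    `path` lists the switches in order, starting with the switch attached
    to the source device and ending with the one attached to the
    destination device.
    """
    signals = []
    for i, switch in enumerate(path):
        in_pin = "source" if i == 0 else f"from:{path[i - 1]}"
        out_pin = "destination" if i == len(path) - 1 else f"to:{path[i + 1]}"
        signals.append(ControlSignal(switch, in_pin, out_pin))
    return signals


signals = plan_connection([(0, 0), (1, 1), (2, 2)])
for signal in signals:
    print(signal)
```

Only the switches on the chosen path receive a signal here; the text also allows the central processor to send control signals to every switch.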

In step S305, when an error occurs in the central processor 12 or the switches 10 during execution of the task, the central processor 12 resets the connection. For example, the central processor 12 may halt or otherwise become inoperable while executing the task, which is regarded as an error of the central processor 12. Alternatively, the central processor 12 may generate an incorrect control signal that causes the switches 10 to form a wrong connection, so that the signal of the source device 20 cannot be transmitted to the destination device 22; this is regarded as an error of the switches 10. Errors may occur in the central processor 12 and the switches 10 separately or at the same time; this embodiment imposes no limitation.

In step S307, the baseboard management controller 14 detects whether the error has been resolved. If it has, in step S309 the central processor 12 and the switches 10 continue executing the task or execute the next task. In other words, when the central processor 12 recovers from the halt or other inoperable condition, or regenerates the control signal and thereby corrects the connection of the switches 10, the error of the central processor 12 or the switches 10 is recoverable, and the central processor 12 and the switches 10 continue executing the task or execute the next task.

If the error has not been resolved, that is, if the error of the central processor 12 or the switches 10 is not recoverable, then in step S311 the baseboard management controller 14 records the error, resets the server apparatus 1, and, after the server apparatus 1 is reset, selectively configures the switches 10 with a preset connection. In one embodiment, the baseboard management controller 14 reads the state of the central processor 12 over the PCI Express bus and reads the state of the switches 10 through I²C or GPIO. The baseboard management controller 14 stores the states of the central processor 12 and the switches 10 as the error record. After the server apparatus 1 is reset, the content recorded by the baseboard management controller 14 can still be looked up to analyze and determine the error that occurred in the central processor 12 or the switches 10, which helps prevent subsequent errors.
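The error record kept by the baseboard management controller can be pictured with a short Python sketch. The record layout, the field names, and the use of JSON lines are assumptions made for illustration; the patent only states that the states of the central processor and the switches are stored as the error record.

```python
import json


def make_error_record(cpu_state, switch_states, timestamp):
    """Bundle the states read by the BMC into one serializable record.

    The field names are hypothetical; the source only says that the CPU
    state and the switch states together form the error record.
    """
    return {
        "timestamp": timestamp,
        "cpu": dict(cpu_state),
        "switches": {str(sid): dict(state) for sid, state in switch_states.items()},
    }


def append_record(log_lines, record):
    """Append the record as one JSON line so it can be read back later."""
    log_lines.append(json.dumps(record, sort_keys=True))


log = []
record = make_error_record({"halted": True}, {"sw0": {"status": 1}}, timestamp=1000.0)
append_record(log, record)
print(log[0])
```

In a real controller the log would be written to storage that survives the server reset; the in-memory list here only stands in for that storage.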

If the error of the central processor 12 or the switches 10 is still unresolved after the server apparatus 1 is reset, the baseboard management controller 14 configures the switches 10 with the preset connection. In one embodiment, each switch 10 has a pin mapping table stored in its EEPROM (Electrically-Erasable Programmable Read-Only Memory). Each pin mapping table indicates the preset connection of the pins of that switch 10, that is, the other switch 10, the source device 20, or the destination device 22 to which each pin is connected. After the server apparatus 1 is reset, when the baseboard management controller 14 determines that the error of the central processor 12 or the switches 10 is still unresolved, the baseboard management controller 14 or the central processor 12 controls each switch 10 to restore the setting of each pin according to the pin mapping table stored in its EEPROM.
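Restoring the preset connection from the per-switch pin mapping tables might look like the sketch below. The dictionary layout (`"eeprom"` for the stored table, `"live"` for the current pin settings) and the switch and pin names are hypothetical stand-ins for the EEPROM contents and the switch registers.

```python
def restore_defaults(switches):
    """Reset every switch's live pin settings to its stored pin mapping table.

    `switches` maps a switch id to {"eeprom": {...}, "live": {...}}, where
    each inner dict maps a pin name to the peer it connects to.
    Returns the pins whose setting actually changed, per switch.
    """
    changed = {}
    for switch_id, switch in switches.items():
        diff = [pin for pin, peer in switch["eeprom"].items()
                if switch["live"].get(pin) != peer]
        switch["live"] = dict(switch["eeprom"])  # copy the preset connection
        if diff:
            changed[switch_id] = sorted(diff)
    return changed


switches = {
    "sw0": {"eeprom": {"p0": "source", "p1": "sw3"},
            "live": {"p0": "source", "p1": "sw4"}},
    "sw1": {"eeprom": {"p0": "sw0"},
            "live": {"p0": "sw0"}},
}
changed = restore_defaults(switches)
print(changed)
```

Returning the changed pins is a design choice of this sketch; it gives the controller something concrete to add to the error record.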

In this way, when an error occurs in the central processor 12 or the switches 10, the baseboard management controller 14 records the error and, when the error is not recoverable, resets the server apparatus 1 so that the central processor 12 and the switches 10 can continue executing the task or execute the next task.
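The recovery sequence of steps S305 to S311 can be summarized in a minimal Python sketch. The five callables are hypothetical hooks standing in for hardware actions, and the returned strings are labels invented for this example.

```python
def resolve_error(reset_connection, error_cleared, record_error,
                  reset_server, restore_defaults):
    """Walk through the recovery sequence of FIG. 2 after an error is seen.

    All five arguments are hypothetical callables supplied by the caller.
    """
    reset_connection()              # S305: the CPU resets the connection
    if error_cleared():             # S307: the BMC checks the outcome
        return "task_resumed"       # S309: continue or start the next task
    record_error()                  # S311: log before resetting anything
    reset_server()
    if error_cleared():
        return "recovered_by_server_reset"
    restore_defaults()              # selective step: preset connection
    return "default_connection_restored"


calls = []
outcome = resolve_error(
    reset_connection=lambda: calls.append("reset_connection"),
    error_cleared=lambda: False,    # simulate an error that never clears
    record_error=lambda: calls.append("record_error"),
    reset_server=lambda: calls.append("reset_server"),
    restore_defaults=lambda: calls.append("restore_defaults"),
)
print(outcome, calls)
```

Recording the error before the server reset matters in this ordering, because the reset would otherwise discard the state the administrator needs for diagnosis.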

Next, refer to FIG. 1 and FIG. 3 together. FIG. 3 is a flowchart of the steps of an error resolving method for switches according to another embodiment of the present invention. As shown, this embodiment provides another error resolving method for switches, applicable to a server apparatus. For convenience, the method is again described using the server apparatus 1 disclosed in FIG. 1, but is not limited thereto.

In step S401, the central processor 12 generates at least one control signal and sends it to the switches 10 when executing a task. In step S403, at least some of the switches 10 establish a connection according to the control signal. As before, this embodiment does not limit whether the control signal generated by the central processor 12 is sent only to the switches 10 that are to form the connection or to every switch 10. The task executed by the central processor 12 involves transmitting a signal generated by the source device 20 to the destination device 22. The central processor 12 therefore generates control signals according to the switches 10 to which the source device 20 and the destination device 22 are connected, so that the switches 10 establish a connection that carries the signal generated by the source device 20 to the destination device 22.

In step S405, the central processor 12 sends status information to the baseboard management controller 14 at every preset time interval, informing the baseboard management controller 14 of the state of the task being executed by the central processor 12. In step S407, when the baseboard management controller 14 has not received the status information for longer than the preset time interval, it determines that an error has occurred in the central processor 12 or the switches 10 during execution of the task. In that case, in step S409, the central processor 12 attempts, within a reset time window, to reset the connection of the switches 10 to recover from the error.

In step S411, after the reset time window, the baseboard management controller 14 determines whether the error has been resolved according to whether it has received status information generated by the central processor 12. If the error has been resolved, that is, if the error of the central processor 12 or the switches 10 was recoverable, then in step S413 the central processor 12 and the switches 10 continue executing the current task or execute the next task.

If the error of the central processor 12 or the switches 10 is not recoverable, that is, if the error has not been resolved, then in step S415 the baseboard management controller 14 records the states of the central processor 12 and the switches 10 and resets the server apparatus 1. After the server apparatus 1 is reset, the baseboard management controller 14 again determines, according to the status information generated by the central processor 12, whether the error of the central processor 12 or the switches 10 has been resolved, and accordingly selectively configures the switches 10 with the preset connection.
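The heartbeat-based detection of steps S405 to S415 can be sketched as a small watchdog. The class name, the returned labels, and the choice to pass the current time in explicitly (which keeps the sketch deterministic) are assumptions for illustration; the patent does not specify how the baseboard management controller keeps time.

```python
class HeartbeatMonitor:
    """Hypothetical BMC-side watchdog for the CPU's periodic status information."""

    def __init__(self, interval, reset_window):
        self.interval = interval          # preset time interval between reports
        self.reset_window = reset_window  # time allowed for the CPU's own reset
        self.last_seen = 0.0
        self.error_since = None

    def heartbeat(self, now):
        """Called when status information arrives from the CPU (S405)."""
        self.last_seen = now
        self.error_since = None

    def check(self, now):
        """Classify the situation at time `now`."""
        if self.error_since is None:
            if now - self.last_seen <= self.interval:
                return "ok"
            self.error_since = now            # S407: report overdue
            return "error_detected"
        if now - self.error_since < self.reset_window:
            return "waiting_for_cpu_reset"    # S409 in progress
        return "unresolved"                   # S411 failed, go to S415


monitor = HeartbeatMonitor(interval=1.0, reset_window=5.0)
monitor.heartbeat(0.0)
states = [monitor.check(t) for t in (0.5, 1.5, 3.0, 6.5)]
print(states)
```

Any heartbeat received during the reset window clears the error state, which corresponds to the recovered branch in step S413.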

In yet another embodiment, refer to FIG. 1 and FIG. 4 together. FIG. 4 is a flowchart of the steps of an error resolving method for switches according to yet another embodiment of the present invention. The method of FIG. 4 is likewise applicable to any server apparatus having switches, a central processor, and a baseboard management controller. For convenience, this embodiment is again described using the server apparatus 1 disclosed in FIG. 1, but is not limited thereto.

In step S501, the central processor 12 generates at least one control signal and sends it to the switches 10 when executing a task. In step S503, at least some of the switches 10 establish a connection according to the control signal, where the task executed by the central processor 12 involves transmitting a signal generated by the source device 20 to the destination device 22. The central processor 12 generates the control signal according to the task being executed, so as to control the switches 10 to establish a connection that carries the signal generated by the source device 20 to the destination device 22.

In step S505, when an error occurs in the switches 10 during execution of the task, at least one switch 10 sends a status signal to the baseboard management controller 14 to inform it that an error has occurred. The status signal is, for example, an interrupt signal or an error signal, and is generated by the switch in which the error occurred. In step S507, the central processor 12 attempts, within a reset time window, to reset the connection of the switches 10 to recover from the error.

In step S509, after the reset time window, the baseboard management controller 14 determines whether the error has been resolved according to the status signal generated by the switches 10. If the error has been resolved, in step S511 the central processor 12 and the switches 10 continue executing the task or execute the next task. If the baseboard management controller 14 determines from the status signal generated by the switches 10 that the error has not been resolved, then in step S513 the baseboard management controller 14 records the states of the central processor 12 and the switches 10 and resets the server apparatus 1.
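The signal-driven variant of steps S505 to S513 can be sketched as below. The patent does not say how the baseboard management controller interprets the status signal after the reset time window, so this sketch assumes that a switch whose signal is still asserted counts as unresolved; the class and method names are hypothetical.

```python
class StatusSignalTracker:
    """Hypothetical BMC-side bookkeeping for switch status signals."""

    def __init__(self):
        self.asserted = set()

    def on_status_signal(self, switch_id, asserted):
        """Handle an interrupt or error signal raised or cleared by a switch."""
        if asserted:
            self.asserted.add(switch_id)      # S505: the switch reports an error
        else:
            self.asserted.discard(switch_id)  # the signal was withdrawn

    def unresolved(self):
        """Switches still signalling an error (assumed meaning of S509)."""
        return sorted(self.asserted)


tracker = StatusSignalTracker()
tracker.on_status_signal("sw4", asserted=True)
tracker.on_status_signal("sw7", asserted=True)
# During the reset time window the CPU's reset clears the fault on sw4 only.
tracker.on_status_signal("sw4", asserted=False)
print(tracker.unresolved())
```

An empty result after the reset window corresponds to step S511; a non-empty one leads to recording the states and resetting the server in step S513.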

In one embodiment, refer to FIG. 1 and FIG. 5 together. FIG. 5 is a flowchart of the steps of an error resolving method for switches according to still another embodiment of the present invention. The method of FIG. 5 is likewise applicable to any server apparatus having switches, a central processor, and a baseboard management controller. The following embodiment is again described using the server apparatus 1 disclosed in FIG. 1, but is not limited thereto.

In step S601, the central processor 12 generates at least one control signal and sends it to the switches 10 when executing a task, and in step S603 at least some of the switches 10 establish a connection according to the control signal. The switches 10 in the connection carry the signal generated by the source device 20 to the destination device 22. In step S605, the baseboard management controller 14 polls the switches 10 at every preset time interval and determines, according to the status register of each switch 10, whether an error has occurred in the central processor 12 or the switches 10 during execution of the task.

When an error occurs, in step S607 the central processor 12 attempts, within a reset time window, to reset the connection of the switches 10 to recover from the error. In step S609, after the reset time window, the baseboard management controller 14 polls each switch 10 to determine whether the error has been resolved. If the error has been resolved, in step S611 the central processor 12 and the switches 10 continue executing the task or execute the next task. If the baseboard management controller 14 determines from the status signal generated by the switches 10 that the error has not been resolved, then in step S613 the baseboard management controller 14 records the states of the central processor 12 and the switches 10 and resets the server apparatus 1.
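The polling variant of steps S605 to S613 reduces to reading each switch's status register and collecting the switches that report a fault. In this minimal sketch, `read_status_register` is a hypothetical callable standing in for an I²C or GPIO read, and the convention that a nonzero register value means an error is an assumption.

```python
def poll_switches(read_status_register, switch_ids):
    """Return the ids of switches whose status register reports an error.

    `read_status_register` is a hypothetical callable taking a switch id;
    a nonzero value is assumed to mean that an error is present.
    """
    return [sid for sid in switch_ids if read_status_register(sid) != 0]


# Simulated registers: sw2 reports a fault at the first poll (S605).
registers = {"sw0": 0, "sw1": 0, "sw2": 0x1}
first_poll = poll_switches(registers.get, sorted(registers))

# The CPU's reset during the reset time window (S607) clears the fault.
registers["sw2"] = 0
second_poll = poll_switches(registers.get, sorted(registers))  # S609
print(first_poll, second_poll)
```

The same function serves both the periodic poll and the check after the reset window; only the action taken on a non-empty result differs.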

In summary, the embodiments of the present invention provide an error resolving method for switches in which the baseboard management controller determines, according to the states of the central processor and the switches, whether an error has occurred in the central processor or the switches. When the central processor cannot resolve the error, the baseboard management controller records the cause of the error and resets the server so that the error can be cleared after the reset. If the error still cannot be cleared after the server is reset, the baseboard management controller can further reconfigure the connection of the switches, strengthening the mechanism that assists the central processor in resolving errors.

Although the present invention is disclosed above by way of the foregoing embodiments, they are not intended to limit the present invention. Changes and modifications made without departing from the spirit and scope of the present invention fall within the scope of patent protection of the present invention. For the scope of protection defined by the present invention, refer to the appended claims.

1‧‧‧Server apparatus

10‧‧‧Switch

101‧‧‧Switch array

12‧‧‧Central processor

14‧‧‧Baseboard management controller

20‧‧‧Source device

22‧‧‧Destination device

S301~S311, S401~S415, S501~S513, S601~S613‧‧‧Steps

FIG. 1 is a functional block diagram of a server apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of the steps of an error resolving method for switches according to an embodiment of the present invention.
FIG. 3 is a flowchart of the steps of an error resolving method for switches according to another embodiment of the present invention.
FIG. 4 is a flowchart of the steps of an error resolving method for switches according to yet another embodiment of the present invention.
FIG. 5 is a flowchart of the steps of an error resolving method for switches according to still another embodiment of the present invention.

Claims (10)

1. A switch error resolving method, adapted to a server apparatus that includes a plurality of switches, a central processor, and a baseboard management controller, the switch error resolving method comprising: generating, by the central processor when executing a task, at least one control signal to the switches, wherein the task is associated with transmitting a signal generated by a source device to a destination device; establishing, by at least some of the switches, a connection relationship according to the control signal, wherein the switches in the connection relationship electrically connect the source device and the destination device; resetting, by the central processor, the connection relationship when an error occurs in the central processor or the switches while executing the task; detecting, by the baseboard management controller, whether the error has been resolved; and when the error has not been resolved, recording the error, resetting the server apparatus, and, after the server apparatus is reset, selectively configuring the switches with a default connection relationship, by the baseboard management controller.
2. The switch error resolving method of claim 1, wherein the central processor generates status information to the baseboard management controller once every preset time interval, the status information being associated with the state of the central processor executing the task, and the switch error resolving method further comprises: when the baseboard management controller has not received the status information for longer than the preset time interval, determining, by the baseboard management controller, that an error has occurred in the central processor or the switches while executing the task.
3. The switch error resolving method of claim 2, wherein the central processor further resets the connection relationship within a reset time interval, and when the baseboard management controller still has not received the status information after the reset time interval, the baseboard management controller determines that the error has not been resolved.
4. The switch error resolving method of claim 1, wherein when an error occurs in the switches while executing the task, at least one of the switches generates a status signal to the baseboard management controller.
5. The switch error resolving method of claim 4, wherein the central processor further resets the connection relationship within a reset time interval, and after the reset time interval, the baseboard management controller determines, according to the status signal, whether the error has been resolved.
6. The switch error resolving method of claim 1, wherein the baseboard management controller polls the switches once every preset time interval and determines, according to a status register of each of the switches, whether an error has occurred in the central processor or the switches while executing the task.
7. The switch error resolving method of claim 6, wherein the central processor resets the connection relationship within a reset time interval, and after the reset time interval, the baseboard management controller polls the status register of each of the switches to determine whether the error has been resolved.
8. The switch error resolving method of claim 1, wherein when the error has not been resolved, the switch error resolving method further comprises reading, by the baseboard management controller, the states of the central processor and the switches, and using the states of the central processor and the switches as the record of the error.
9. The switch error resolving method of claim 1, wherein after the server apparatus is reset, the baseboard management controller further determines whether the error has been resolved according to at least one of status information generated by the central processor, a status signal generated by at least one of the switches, and a status register of each of the switches, and when the error has not been resolved, configures the switches with the default connection relationship.
10. The switch error resolving method of claim 1, wherein each of the switches has a pin correspondence table, each pin correspondence table indicates the default connection relationship, and when the error still has not been resolved after the server apparatus is reset, each of the switches is reset according to its pin correspondence table.
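The escalation order recited in the claims (reset the connection, check again, record the error, reset the server apparatus, check again, then restore the default connection) can be illustrated with a minimal Python sketch. This is not the patented implementation: `Switch`, `Outcome`, `heartbeat_missed`, and `resolve_error` are hypothetical names introduced only for illustration, the hardware actions are passed in as plain callables, and the error record here captures only switch state, whereas the claims also include the central processor's state.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List


class Outcome(Enum):
    HEALTHY = auto()
    FIXED_BY_CONNECTION_RESET = auto()
    FIXED_BY_SERVER_RESET = auto()
    DEFAULT_CONNECTION_RESTORED = auto()


@dataclass
class Switch:
    # Hypothetical model: default_pin_map stands in for the pin correspondence
    # table, error_flag for the status register read by the controller.
    default_pin_map: Dict[int, int]
    pin_map: Dict[int, int] = field(default_factory=dict)
    error_flag: bool = False

    def restore_default(self) -> None:
        # Reconfigure the switch with the default connection relationship.
        self.pin_map = dict(self.default_pin_map)
        self.error_flag = False


def heartbeat_missed(last_seen: float, now: float, interval: float) -> bool:
    """True when no status information arrived within the preset interval."""
    return now - last_seen > interval


def resolve_error(
    switches: List[Switch],
    error_present: Callable[[], bool],     # assumed detection hook (heartbeat, signal, or polling)
    reset_connection: Callable[[], None],  # assumed action of the central processor
    reset_server: Callable[[], None],      # assumed action of the controller
    error_log: List[list],
) -> Outcome:
    if not error_present():
        return Outcome.HEALTHY
    reset_connection()                     # first attempt: reset the connection
    if not error_present():
        return Outcome.FIXED_BY_CONNECTION_RESET
    # Error persists: record the switch states, then reset the whole server.
    error_log.append([(sw.error_flag, dict(sw.pin_map)) for sw in switches])
    reset_server()
    if not error_present():
        return Outcome.FIXED_BY_SERVER_RESET
    for sw in switches:                    # last resort: default connection
        sw.restore_default()
    return Outcome.DEFAULT_CONNECTION_RESTORED
```

Passing the detection hook as a callable keeps the sketch neutral about which of the three detection variants is used: a missed heartbeat from the central processor, a status signal raised by a switch, or a polled status register.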
TW105140520A 2016-12-07 2016-12-07 Error resolving method or switch TWI601013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW105140520A TWI601013B (en) 2016-12-07 2016-12-07 Error resolving method or switch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW105140520A TWI601013B (en) 2016-12-07 2016-12-07 Error resolving method or switch

Publications (2)

Publication Number Publication Date
TWI601013B TWI601013B (en) 2017-10-01
TW201822002A TW201822002A (en) 2018-06-16

Family

ID=61010881

Family Applications (1)

Application Number Title Priority Date Filing Date
TW105140520A TWI601013B (en) 2016-12-07 2016-12-07 Error resolving method or switch

Country Status (1)

Country Link
TW (1) TWI601013B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6894970B1 (en) * 2000-10-31 2005-05-17 Chiaro Networks, Ltd. Router switch fabric protection using forward error correction
US7206963B2 (en) * 2003-06-12 2007-04-17 Sun Microsystems, Inc. System and method for providing switch redundancy between two server systems
US7418633B1 (en) * 2004-05-13 2008-08-26 Symantec Operating Corporation Method and apparatus for immunizing applications on a host server from failover processing within a switch
US8418039B2 (en) * 2009-08-03 2013-04-09 Airbiquity Inc. Efficient error correction scheme for data transmission in a wireless in-band signaling system
TWI479310B (en) * 2011-01-10 2015-04-01 Hon Hai Prec Ind Co Ltd Server and method for controlling opening of channels
DE112011105911T5 (en) * 2011-12-01 2014-09-11 Intel Corporation Server with switch circuits
CN104238480A (en) * 2013-06-21 2014-12-24 鸿富锦精密工业(深圳)有限公司 Cabinet server BMC startup and shutdown control system and method

Also Published As

Publication number Publication date
TWI601013B (en) 2017-10-01

Similar Documents

Publication Publication Date Title
US7536584B2 (en) Fault-isolating SAS expander
EP2052326B1 (en) Fault-isolating sas expander
US5313386A (en) Programmable controller with backup capability
US10127095B2 (en) Seamless automatic recovery of a switch device
WO2021027481A1 (en) Fault processing method, apparatus, computer device, storage medium and storage system
US10027532B2 (en) Storage control apparatus and storage control method
US5392424A (en) Apparatus for detecting parity errors among asynchronous digital signals
TWI576682B (en) Rack havng multi-rmms and firmware updating method for the rack
JP6662987B2 (en) Method and system for checking cable errors
CN108108254B (en) Switch error elimination method
JPWO2012029147A1 (en) System and fault handling method
US10142169B2 (en) Diagnosis device, diagnosis method, and non-transitory recording medium storing diagnosis program
US20180267870A1 (en) Management node failover for high reliability systems
US20220137830A1 (en) Using path quarantining to identify and handle backend issues
CN111176913A (en) Circuit and method for detecting Cable Port in server
TW202022610A (en) Method for detecting a server
TWI601013B (en) Error resolving method or switch
CN105843336A (en) Rack with a plurality of rack management modules and method for updating firmware thereof
US9454452B2 (en) Information processing apparatus and method for monitoring device by use of first and second communication protocols
US9246848B2 (en) Relay apparatus, storage system, and method of controlling relay apparatus
JPWO2011145541A1 (en) Bus control device and bus control method
US20140173365A1 (en) Semiconductor apparatus, management apparatus, and data processing apparatus
TWI530782B (en) Server system
CN113760627B (en) Method and device for controlling interface debugging in bus by adopting response mechanism
US11232197B2 (en) Computer system and device management method