TW201423390A - Computer system and operating method thereof - Google Patents

Computer system and operating method thereof

Info

Publication number
TW201423390A
TW201423390A TW101145892A TW101145892A TW201423390A TW 201423390 A TW201423390 A TW 201423390A TW 101145892 A TW101145892 A TW 101145892A TW 101145892 A TW101145892 A TW 101145892A TW 201423390 A TW201423390 A TW 201423390A
Authority
TW
Taiwan
Prior art keywords
monitored device
monitored
logic control
computer system
control device
Prior art date
Application number
TW101145892A
Other languages
Chinese (zh)
Inventor
Chia-Hsiang Chen
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp filed Critical Inventec Corp
Priority to TW101145892A priority Critical patent/TW201423390A/en
Publication of TW201423390A publication Critical patent/TW201423390A/en

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

A computer system and an operating method thereof are disclosed herein. The computer system includes at least one monitored device and a logic control device. The logic control device is connected to the monitored device, and is configured to monitor status signals from the monitored device so as to determine whether the monitored device is in an error state. When the monitored device is in the error state, the logic control device counts a predetermined time period, and determines whether the monitored device recovers to normal after the predetermined time period, and determines whether the monitored device has been reset during the predetermined time period. If the monitored device does not recover to normal and the monitored device has not been reset during the predetermined time period, then the logic control device resets the monitored device.

Description

Computer system and operating method thereof

The present invention relates to an electronic system and an operating method thereof, and more particularly to a computer system and an operating method thereof.

With the development of digital technology, computer systems have come into wide use in daily life, for example desktop and notebook computers for personal use, and network processors and servers that provide network services.

In general, a computer system includes multiple devices that operate separately, such as a central processing unit, a south bridge chip, a storage device, and a basic input/output system. When one of these devices encounters an error, it can send an error message to a management controller in the computer system (such as a baseboard management controller) so that the management controller restarts the device. However, the management controller itself may malfunction or fail, so that it does not restart a device when that device encounters an error. The computer system may then remain in an error state for a long time. If the computer system is a server that provides network services, the quality of the network service may degrade, which in turn causes user dissatisfaction.

Therefore, to ensure reliable error recovery in computer systems, the above problem urgently needs to be solved.

One aspect of the present invention is a computer system that uses a logic control device for signal monitoring and error recovery.

According to an embodiment of the invention, the computer system includes at least one monitored device and a logic control device. The logic control device is connected to the monitored device and monitors the status signals of the monitored device to determine whether the monitored device is in an error state. When the monitored device is in the error state, the logic control device times a preset period and, after the preset period, determines whether the monitored device has returned to normal and whether the monitored device was reset during the preset period. If the monitored device has not returned to normal and was not reset during the preset period, the logic control device resets the monitored device.

According to an embodiment of the invention, the logic control device further includes a state mapping table, and the logic control device stores the status signals of the monitored device at corresponding addresses in the state mapping table as correct-operation data.

According to an embodiment of the invention, the logic control device compares the status signals of the monitored device with the correct-operation data stored at the corresponding addresses in the state mapping table to determine whether the monitored device is in the error state.

According to an embodiment of the invention, the logic control device further includes a timer for timing the preset period.

According to an embodiment of the invention, the logic control device determines whether the monitored device is in the error state according to whether a normal signal from the monitored device has failed to arrive, or according to whether the monitored device has issued an error signal.

According to an embodiment of the invention, the logic control device restarts the main power rail so that the monitored device is rebooted.

Another aspect of the present invention is an operating method of a computer system. According to an embodiment of the invention, the computer system includes a logic control device and at least one monitored device, and the logic control device is connected to the monitored device. The operating method includes: monitoring the status signals of the monitored device; determining from the status signals of the monitored device whether the monitored device is in an error state; when the monitored device is in the error state, timing a preset period; after the preset period, determining whether the monitored device has returned to normal and whether the monitored device was reset during the preset period; and, if the monitored device has not returned to normal and was not reset during the preset period, resetting the monitored device.

According to an embodiment of the invention, the logic control device includes a state mapping table, and the step of determining from the status signals of the monitored device whether the monitored device is in the error state includes: storing the status signals of the monitored device at corresponding addresses in the state mapping table as correct-operation data; and then comparing the status signals of the monitored device with the correct-operation data stored at the corresponding addresses in the state mapping table to determine whether the monitored device is in the error state.

According to an embodiment of the invention, the step of determining from the status signals of the monitored device whether the monitored device is in the error state includes: making the determination according to whether a normal signal from the monitored device has not been detected, or according to whether the monitored device has issued an error signal.

According to an embodiment of the invention, the step of resetting the monitored device includes restarting the main power rail so that the monitored device is rebooted.

In summary, with the above embodiments, when an internal device of the computer system encounters an error, recovery can be carried out by the logic control device. Because the logic control device can be implemented with logic elements, it is less prone to failure and can therefore provide a more reliable error recovery mechanism.

The spirit of the present disclosure is clearly explained below with the drawings and detailed description. After understanding the preferred embodiments of the present disclosure, any person having ordinary skill in the art may make changes and modifications based on the techniques taught herein without departing from the spirit and scope of the present disclosure.

As used herein, "connected" may mean that two or more elements are in direct physical or electrical contact with each other, or in indirect physical or electrical contact with each other; "connected" may also mean that two or more elements operate or act on one another.

One aspect of the present invention is a computer system that uses a logic control device for signal monitoring and error recovery. The computer system may be a desktop computer, a notebook computer, a network processor, a server, or the like. For clarity, the following paragraphs use a server as an example.

FIG. 1 is a block diagram of a computer system 100 according to an embodiment of the invention. The computer system 100 includes at least one monitored device (for example, seven monitored devices D1-D7) and a logic control device 110. It should be noted that a monitored device may be an internal device of the computer system 100, for example, but not limited to, any of a south bridge chip, a basic input/output system (BIOS), a baseboard management controller (BMC), a central processing unit (CPU), a power supply unit (PSU), a storage device, or a voltage regulator (voltage regulator down, VRD). For clarity, the following paragraphs use seven monitored devices D1-D7 as an example, where D1 may be the south bridge chip, D2 the basic input/output system, D3 the baseboard management controller, D4 the central processing unit, D5 the power supply unit, D6 the storage device, and D7 the voltage regulator. The logic control device 110 may be implemented with, but is not limited to, a logic circuit, a programmable logic device (PLD), a complex programmable logic device (CPLD), or a field programmable gate array (FPGA).

The logic control device 110 is connected to each of the monitored devices D1-D7 and monitors their status signals to determine whether the monitored devices D1-D7 are in an error state. For example, the logic control device 110 may monitor, over a low pin count (LPC) bus, whether the south bridge chip D1 and the basic input/output system D2 issue a normal signal (such as a heartbeat signal); monitor, over a peripheral component interconnect extended (PCI-X) bus, whether the baseboard management controller D3 issues a normal signal (such as a heartbeat signal); and monitor, through general purpose input/output (GPIO) pins, whether the central processing unit D4 issues an overheat signal or an error signal (such as CPU_ierr, CPU_mcerr, or Thermal_trip), whether the power supply unit D5 issues an overheat signal and/or a normal signal (such as a power good signal), and whether the storage device D6 and the voltage regulator D7 issue an error signal and/or a normal signal (such as a power fault signal and/or a power good signal). Because the voltage regulator D7 may output multiple voltage levels to the internal devices of the computer system 100, the logic control device 110 may separately monitor the error signal and/or normal signal of each voltage level output by the voltage regulator D7. By monitoring the error signals and/or normal signals of the monitored devices D1-D7 in this way, the logic control device 110 can determine whether the monitored devices D1-D7 are in an error state according to whether a normal signal from them has not been detected, or according to whether they have issued an error signal.
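The determination described above can be illustrated with a short Python sketch. It is not part of the disclosed embodiment: the `StatusSample` type, its field names, and the convention that `None` means a device does not provide that kind of signal are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusSample:
    """One snapshot of a monitored device's status lines (hypothetical layout)."""
    # True if a normal signal (e.g. a heartbeat or power good signal) was seen
    # in the last sampling window; None if the device provides no such signal.
    normal_signal: Optional[bool] = None
    # True if an error signal (e.g. CPU_ierr) is asserted; None if the device
    # provides no such signal.
    error_signal: Optional[bool] = None


def in_error_state(sample: StatusSample) -> bool:
    """Error if an expected normal signal is missing or an error signal is raised."""
    missing_normal = sample.normal_signal is False
    error_raised = sample.error_signal is True
    return missing_normal or error_raised
```

Under these assumptions, a device that only emits a heartbeat is judged by `normal_signal` alone, and a device that only emits error signals is judged by `error_signal` alone.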

When one of the monitored devices D1-D7 is in an error state, the logic control device 110 may time a preset period and, after that period, determine whether the monitored device has returned to normal (for example, whether the normal signal is received again or the error signal has disappeared) and whether the monitored device was reset during the preset period. For example, the logic control device 110 may use multiple general purpose input/output pins to separately monitor the multiple voltage levels output by the voltage regulator D7, or the power good signals of those voltage levels, and determine whether the monitored devices D1-D7 have been reset according to whether these voltage levels were restarted (that is, turned off and then on again).
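The reset check, in which a voltage level that turns off and then on again is taken as evidence of a reset, might be sketched as follows. This is an illustrative assumption rather than the disclosed circuit: `was_reset` is a hypothetical helper, and it presumes the power good line was sampled periodically during the preset period.

```python
from typing import Iterable


def was_reset(power_good_samples: Iterable[bool]) -> bool:
    """Return True if the samples contain an off-then-on transition.

    power_good_samples: periodic readings of one power good line taken during
    the preset period, oldest first (hypothetical sampling scheme).
    """
    seen_off = False
    for level in power_good_samples:
        if not level:
            seen_off = True      # the rail was observed off
        elif seen_off:
            return True          # the rail came back on after being off
    return False
```

A rail that goes off and stays off is not counted as a reset in this sketch, because the power-on half of the cycle was never observed.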

Next, if the monitored device has not returned to normal and was not reset during the preset period, the logic control device 110 may reset it. For example, the logic control device 110 may reset a single one of the monitored devices D1-D7 by sending it a reset signal, or restart the main power rail to reboot the computer system 100.
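The text names two reset options but does not say how to choose between them. The sketch below shows one possible policy, stated purely as an assumption: devices with a dedicated reset line receive a reset signal, and any other device falls back to restarting the main power rail. The set `DEVICES_WITH_RESET_LINE` is hypothetical wiring, not something taken from the disclosure.

```python
from enum import Enum


class ResetAction(Enum):
    DEVICE_RESET = "send a reset signal to the failing device"
    POWER_CYCLE = "restart the main power rail"


# Hypothetical wiring: which monitored devices have their own reset line.
DEVICES_WITH_RESET_LINE = frozenset({"D1", "D3", "D4"})


def choose_reset_action(device: str,
                        resettable: frozenset = DEVICES_WITH_RESET_LINE) -> ResetAction:
    """Pick the narrowest reset available for the failing device (assumed policy)."""
    if device in resettable:
        return ResetAction.DEVICE_RESET
    return ResetAction.POWER_CYCLE
```

Preferring the single-device reset keeps the rest of the system running; the power cycle is the fallback because it reboots every device on the rail.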

With the above configuration, the logic control device 110 can monitor the status of the monitored devices D1-D7 and, when one of them encounters an error and is neither recovered nor reset, restart the computer system 100 or only the monitored device in which the error occurred, thereby ensuring correct operation of the computer system 100. In addition, because the logic control device 110 can be implemented with logic elements, it can provide a more reliable error recovery mechanism than a higher-level management controller (such as a baseboard management controller).

In an embodiment of the invention, the logic control device 110 may further include a state mapping table 112. While the computer system 100 is operating, the logic control device 110 may store the status signals of the monitored devices D1-D7 at corresponding addresses in the state mapping table 112 as correct-operation data. For example, the logic level received on a first general purpose input/output pin may be stored at a first address in the state mapping table 112, the logic level received on a second general purpose input/output pin at a second address, and the logic level received on a first pin of the LPC bus at a third address. It should be noted that in some embodiments each address in the state mapping table 112 may point to multiple register spaces in order to store the status signal at different times, or to store a periodic status signal (such as a heartbeat signal).
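A state mapping table in which each address holds several register slots can be modelled in software as a mapping from address to a fixed-depth history. The following Python sketch is only an analogy for the hardware table: the class name `StateMap`, the `depth` parameter, and the integer addresses are assumptions.

```python
from collections import deque
from typing import Deque, Dict, List


class StateMap:
    """Software analogy of a state mapping table (hypothetical structure)."""

    def __init__(self, depth: int = 4) -> None:
        self._depth = depth                      # register slots per address
        self._slots: Dict[int, Deque[int]] = {}

    def store(self, address: int, level: int) -> None:
        """Record a logic level at an address, dropping the oldest when full."""
        self._slots.setdefault(address, deque(maxlen=self._depth)).append(level)

    def history(self, address: int) -> List[int]:
        """Return the stored levels for an address, oldest first."""
        return list(self._slots.get(address, ()))
```

Keeping a short history per address is what would allow a periodic signal such as a heartbeat to be stored as a pattern rather than as a single level.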

After obtaining the correct-operation data, the logic control device 110 may compare the currently received status signals of the monitored devices D1-D7 with the correct-operation data previously stored at the corresponding addresses in the state mapping table 112 to determine whether the monitored devices D1-D7 are in an error state. In the same way, the logic control device 110 may determine whether a monitored device has returned to normal after an error. For example, if the overheat signal (such as Thermal_trip) of the central processing unit D4 stored at the second address in the state mapping table 112 is at a high logic level, then when the logic control device 110 finds that the overheat signal of the central processing unit D4 received on the second general purpose input/output pin is at a low logic level, the logic control device 110 can determine that the central processing unit D4 is in an error state.
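The comparison against stored correct-operation data can be sketched as a per-address equality check. The function name `mismatched_addresses` and the use of plain dictionaries keyed by address are illustrative assumptions, not the disclosed implementation.

```python
from typing import Dict, List


def mismatched_addresses(known_good: Dict[int, int],
                         current: Dict[int, int]) -> List[int]:
    """Return the addresses whose current level differs from the stored one.

    known_good: logic levels captured while the system operated correctly.
    current:    logic levels sampled now, keyed by the same addresses.
    Addresses with no stored reference are ignored.
    """
    return sorted(
        address
        for address, level in current.items()
        if address in known_good and known_good[address] != level
    )
```

With a stored high level at address 2 and a sampled low level at the same address, the function reports address 2, which mirrors the Thermal_trip example in the paragraph above. An empty result after an earlier mismatch corresponds to the device having returned to normal.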

It should be noted that in other embodiments the logic control device 110 may instead compare the status signals of the monitored devices D1-D7 with values preset by an administrator to determine whether the monitored devices D1-D7 are in an error state; the determination is not limited to the above embodiment.

In some embodiments, the logic control device 110 may also make an overall error determination based on a plurality of status signals from the monitored devices D1-D7.

In addition, in an embodiment of the invention, the logic control device 110 may further include a timer 114 for timing the aforementioned preset period.

Furthermore, those skilled in the art will understand that, without departing from the spirit of the invention, the status signals of the monitored devices D1-D7 may be any signals that can indicate whether the monitored devices D1-D7 are operating normally, and are not limited to the signals in the foregoing embodiments.

Another aspect of the invention is an operating method of a computer system. The operating method may be applied to a computer system whose structure is the same as or similar to that shown in FIG. 1. For convenience, the operating method is described below using the embodiment shown in FIG. 1 as an example, but it is not limited to that embodiment.

It should be noted that the steps of the following operating method have no particular order unless otherwise stated. The steps may also be performed simultaneously or overlap in execution time.

FIG. 2 is a flowchart of an operating method 200 according to an embodiment of the invention. The operating method 200 may include steps S1-S6. After the computer system 100 starts, the status signals of the monitored devices D1-D7 are monitored (step S1), and whether the monitored devices D1-D7 are in an error state is determined from those status signals (step S2). When a monitored device is in an error state, timing of a preset period begins (step S3), and timing then proceeds (step S4). Once the preset period has elapsed, it is determined whether the monitored device has returned to normal and whether it was reset during the preset period (step S5). If the monitored device has not returned to normal and was not reset during the preset period, the monitored device is reset (step S6).
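The flow of steps S1-S6 can be sketched as a small tick-driven state machine. This is a software illustration under stated assumptions, not the disclosed logic circuit: the `Watchdog` class, the `Phase` names, and the idea of measuring the preset period in sampling ticks are all hypothetical.

```python
from enum import Enum, auto


class Phase(Enum):
    MONITORING = auto()   # steps S1-S2: watch status signals
    WAITING = auto()      # steps S3-S4: preset period is running


class Watchdog:
    """Tick-driven model of steps S1-S6 for one monitored device."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks   # preset period, in sampling ticks
        self.phase = Phase.MONITORING
        self.elapsed = 0
        self.reset_seen = False

    def tick(self, in_error: bool, reset_observed: bool = False) -> bool:
        """Advance one sampling period.

        Returns True only when the watchdog itself must reset the device (S6).
        """
        if self.phase is Phase.MONITORING:
            if in_error:                     # S2 detected an error: start S3
                self.phase = Phase.WAITING
                self.elapsed = 0
                self.reset_seen = False
            return False

        # S4: keep timing and remember any reset seen during the period.
        self.elapsed += 1
        self.reset_seen = self.reset_seen or reset_observed
        if self.elapsed < self.timeout_ticks:
            return False

        # S5: the period has elapsed; decide, then go back to S1 either way.
        self.phase = Phase.MONITORING
        recovered = not in_error
        return (not recovered) and (not self.reset_seen)
```

In this model a device that stays in error for the whole period with no reset observed triggers a reset, while a device that recovers or is reset by some other mechanism during the period simply returns to monitoring.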

For details of the monitored devices D1-D7, refer to the preceding aspect; they are not repeated here.

As an implementation example, in step S1 the computer system 100 may monitor whether the south bridge chip D1, the basic input/output system D2, and the baseboard management controller D3 issue a normal signal (such as a heartbeat signal); whether the central processing unit D4 issues an overheat signal or an error signal (such as CPU_ierr, CPU_mcerr, or Thermal_trip); whether the power supply unit D5 issues an overheat signal and/or a normal signal (such as a power good signal); and whether the storage device D6 and the voltage regulator D7 issue an error signal and/or a normal signal (such as a power fault signal and/or a power good signal). The computer system 100 may separately monitor the error signal and/or normal signal of each voltage level output by the voltage regulator D7.

In step S2, the computer system 100 may determine whether the monitored devices D1-D7 are in an error state according to whether a normal signal from them has not been detected, or according to whether they have issued an error signal. If the monitored devices D1-D7 are not in an error state, the computer system 100 performs step S1 again to keep monitoring their status signals.

In step S3, the computer system 100 may start timing with a timer. In some embodiments, the computer system 100 continues to monitor the status signals of the monitored devices D1-D7 during this period to determine whether further errors occur, and then makes an overall error determination.

In step S5, the computer system 100 may determine whether the monitored device has returned to normal according to whether the normal signal is received again or the error signal has disappeared. It may also separately monitor the multiple voltage levels output by the voltage regulator D7, or the power good signals of those voltage levels, and determine whether the monitored device has been reset according to whether these voltage levels were restarted (that is, turned off and then on again). If the computer system 100 determines that the monitored device has returned to normal or has been reset, the monitored device may already have been handled by another error recovery mechanism, so the computer system 100 may perform step S1 again to resume monitoring the status signals of the monitored devices D1-D7.

In step S6, the computer system 100 may reset a single one of the monitored devices D1-D7 by sending it a reset signal, or restart the main power rail to reboot the monitored devices D1-D7 in the computer system 100.

With the above configuration, the computer system 100 can monitor the status of the monitored devices D1-D7 and, when one of them encounters an error and is neither recovered nor reset, restart all of the monitored devices D1-D7 or only the monitored device in which the error occurred, thereby ensuring correct operation of the computer system 100.

In an embodiment of the invention, step S2 may include the following sub-steps: (a) storing the status signals of the monitored devices D1-D7 at corresponding addresses in the state mapping table 112 as correct-operation data; and then (b) comparing the status signals of the monitored devices D1-D7 with the correct-operation data stored at the corresponding addresses in the state mapping table 112 to determine whether the monitored devices D1-D7 are in an error state.

For example, the computer system 100 may store the logic level of the overheat signal (such as Thermal_trip) of the central processing unit D4 at the second address in the state mapping table 112 as correct-operation data of the computer system 100. The computer system 100 may then determine whether the central processing unit D4 is in an error state by checking whether the received overheat signal of the central processing unit D4 matches the logic level stored at the second address in the state mapping table 112.

In addition, in some embodiments, the computer system 100 may likewise use the correct-operation data stored in the state mapping table 112 to determine whether a monitored device D1-D7 has returned to normal after an error.

It should be noted that in other embodiments the computer system 100 may instead compare the status signals of the monitored devices D1-D7 with values preset by an administrator to determine whether the monitored devices D1-D7 are in an error state; the way errors are determined is therefore not limited to the above embodiment.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.

100‧‧‧Computer system

110‧‧‧Logic control device

112‧‧‧State mapping table

200‧‧‧Operating method

D1-D7‧‧‧Monitored devices

S1-S6‧‧‧Steps

114‧‧‧Timer

LPC, PCI-X‧‧‧Buses

FIG. 1 is a block diagram of a computer system according to an embodiment of the invention; FIG. 2 is a flowchart of an operating method of a computer system according to an embodiment of the invention.


Claims (10)

一種電腦系統,包括:至少一受監控裝置;以及一邏輯控制裝置,連接該受監控裝置,用以監控該受監控裝置的狀態訊號,以判斷該受監控裝置是否處於錯誤狀態,其中當該受監控裝置處於錯誤狀態時,該邏輯控制裝置計時一預設時間,並在該預設時間後判斷該受監控裝置是否恢復正常,且判斷該受監控裝置是否在該預設時間內進行重置,其中若該受監控裝置未恢復正常且該受監控裝置未在該預設時間內進行重置,則該邏輯控制裝置重置該受監控裝置。 A computer system comprising: at least one monitored device; and a logic control device connected to the monitored device for monitoring a status signal of the monitored device to determine whether the monitored device is in an error state, wherein When the monitoring device is in an error state, the logic control device counts a preset time, and after the preset time, determines whether the monitored device returns to normal, and determines whether the monitored device is reset within the preset time. If the monitored device does not return to normal and the monitored device does not reset within the preset time, the logic control device resets the monitored device. 如請求項1所述的電腦系統,其中該邏輯控制裝置更包括一狀態映射表,該邏輯控制裝置儲存該受監控裝置的狀態訊號於該狀態映射表中的對應位址以作為正確運作資料。 The computer system of claim 1, wherein the logic control device further comprises a state mapping table, wherein the logic control device stores the status signal of the monitored device in a corresponding address in the state mapping table as a correct operation data. 如請求項2所述的電腦系統,其中該邏輯控制裝置比對該受監控裝置的狀態訊號與儲存於該狀態映射表中相應位址的正確運作資料,以判斷該受監控裝置是否處於錯誤狀態。 The computer system of claim 2, wherein the logic control device compares the status signal of the monitored device with the correct operation data stored in the corresponding address in the status mapping table to determine whether the monitored device is in an error state. . 如請求項1所述的電腦系統,其中該邏輯控制裝置更包括一計時器,用以計時該預設時間。 The computer system of claim 1, wherein the logic control device further comprises a timer for counting the preset time. 
如請求項1所述的電腦系統,其中該邏輯控制裝置依據是否未接收到該受監控裝置所發出之正常訊號,或依據該受監控裝置是否發出錯誤訊號,以判斷該受監控裝置是否處於錯誤狀態。 The computer system of claim 1, wherein the logic control device determines whether the monitored device is in error according to whether a normal signal sent by the monitored device is not received, or whether the monitored device sends an error signal according to whether the monitored device sends an error signal. status. 如請求項1所述的電腦系統,其中該邏輯控制裝置重啟主要電力軌(main power rail)以使該受監控裝置重新開機。 The computer system of claim 1, wherein the logic control device restarts a main power rail to cause the monitored device to be powered back on. 一種電腦系統的操作方法,其中該電腦系統包括一邏輯控制裝置以及至少一受監控裝置,該邏輯控制裝置連接該受監控裝置,該操作方法包括:監控該受監控裝置的狀態訊號;根據該受監控裝置的狀態訊號以判斷該受監控裝置是否處於錯誤狀態;當該受監控裝置處於錯誤狀態時,計時一預設時間;在該預設時間後,判斷該受監控裝置是否恢復正常,且判斷該受監控裝置是否在該預設時間內進行重置;以及若該受監控裝置未恢復正常且受監控裝置未在該預設時間內進行重置,則重置該受監控裝置。 A computer system operating method, wherein the computer system includes a logic control device and at least one monitored device, the logic control device is connected to the monitored device, the operating method includes: monitoring a status signal of the monitored device; Monitoring the status signal of the device to determine whether the monitored device is in an error state; when the monitored device is in an error state, counting a preset time; after the preset time, determining whether the monitored device returns to normal, and determining Whether the monitored device is reset within the preset time; and if the monitored device does not return to normal and the monitored device does not reset within the preset time, the monitored device is reset. 
8. The operating method of claim 7, wherein the logic control device comprises a state mapping table, and the step of determining, according to the status signal of the monitored device, whether the monitored device is in the error state comprises: storing the status signal of the monitored device at a corresponding address in the state mapping table as correct operation data; and then comparing the status signal of the monitored device with the correct operation data stored at the corresponding address in the state mapping table to determine whether the monitored device is in the error state.
9. The operating method of claim 7, wherein the step of determining, according to the status signal of the monitored device, whether the monitored device is in the error state comprises: determining whether the monitored device is in the error state according to whether a normal signal sent by the monitored device has not been detected, or according to whether the monitored device has sent an error signal.
10. The operating method of claim 7, wherein the step of resetting the monitored device comprises: restarting the main power rail to reboot the monitored device.
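
The monitoring flow recited in the claims above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `LogicControlDevice`, the methods `record_good_status`, `note_self_reset`, and `poll`, and the `reset_device` callback are all hypothetical names chosen for illustration. The sketch further assumes that a status signal can be represented as an integer, that the caller supplies the current time in seconds, and that the caller reports when a device has reset itself, since the claims do not say how such a reset is observed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LogicControlDevice:
    """Hypothetical software model of the claimed logic control device."""

    preset_time: float                    # the "preset time" to wait after an error
    reset_device: Callable[[str], None]   # assumed hook, e.g. restarts the main power rail
    # State mapping table: device name -> status stored as correct operation data.
    state_map: Dict[str, int] = field(default_factory=dict)
    # Time at which the current error was first seen (acts as the timer).
    error_since: Dict[str, float] = field(default_factory=dict)
    # Time of the most recent reset performed by the device itself.
    self_reset_at: Dict[str, float] = field(default_factory=dict)

    def record_good_status(self, device: str, status: int) -> None:
        """Store a status signal as the correct operation data for a device."""
        self.state_map[device] = status

    def note_self_reset(self, device: str, now: float) -> None:
        """Record that the device was reset by something other than this monitor."""
        self.self_reset_at[device] = now

    def poll(self, device: str, status: int, now: float) -> bool:
        """Process one status sample; return True if this call reset the device."""
        # Error detection by comparison with the stored correct operation data.
        in_error = status != self.state_map[device]
        started = self.error_since.get(device)

        if started is None:
            if in_error:
                self.error_since[device] = now  # start counting the preset time
            return False

        if now - started < self.preset_time:
            return False  # preset time has not elapsed yet

        # Preset time elapsed: stop the timer and evaluate both conditions.
        del self.error_since[device]
        recovered = not in_error
        reset_time = self.self_reset_at.pop(device, None)
        was_reset = reset_time is not None and reset_time >= started

        if recovered or was_reset:
            return False  # leave the device alone
        self.reset_device(device)
        return True
```

In this sketch the monitor forces a reset only when both conditions hold after the preset time: the status still differs from the stored correct operation data, and no reset of the device was reported during the waiting period. A device that is already recovering on its own is therefore not interrupted by a second reset.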
TW101145892A 2012-12-06 2012-12-06 Computer system and operating method thereof TW201423390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW101145892A TW201423390A (en) 2012-12-06 2012-12-06 Computer system and operating method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW101145892A TW201423390A (en) 2012-12-06 2012-12-06 Computer system and operating method thereof

Publications (1)

Publication Number Publication Date
TW201423390A true TW201423390A (en) 2014-06-16

Family

ID=51393998

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101145892A TW201423390A (en) 2012-12-06 2012-12-06 Computer system and operating method thereof

Country Status (1)

Country Link
TW (1) TW201423390A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649002A (en) * 2015-10-29 2017-05-10 佛山市顺德区顺达电脑厂有限公司 Server and method for automatically overhauling baseboard management controller
CN106649002B (en) * 2015-10-29 2020-01-31 佛山市顺德区顺达电脑厂有限公司 Server and method for automatically overhauling baseboard management controller
TWI602054B (en) * 2016-04-01 2017-10-11 神雲科技股份有限公司 Method of providing error status data for computer device
CN107451035A (en) * 2016-05-31 2017-12-08 佛山市顺德区顺达电脑厂有限公司 Error state data for computer installation provides method
CN107451035B (en) * 2016-05-31 2020-11-10 佛山市顺德区顺达电脑厂有限公司 Error state data providing method for computer device
CN107544878A (en) * 2016-06-28 2018-01-05 佛山市顺德区顺达电脑厂有限公司 Error state data for computer installation automatically provides method
TWI767378B (en) * 2020-10-27 2022-06-11 英業達股份有限公司 Error type determination system and method thereof
TWI811597B (en) * 2020-12-18 2023-08-11 新唐科技股份有限公司 A method and a communication interface controller for restoring communication interface interruption

Similar Documents

Publication Publication Date Title
US10789117B2 (en) Data error detection in computing systems
CN107122321B (en) Hardware repair method, hardware repair system, and computer-readable storage device
TW201423390A (en) Computer system and operating method thereof
CN101126995B (en) Method and apparatus for processing serious hardware error
US20140143597A1 (en) Computer system and operating method thereof
US7461303B2 (en) Monitoring VRM-induced memory errors
WO2020239060A1 (en) Error recovery method and apparatus
US8677182B2 (en) Computer system capable of generating an internal error reset signal according to a catastrophic error signal
US20040003317A1 (en) Method and apparatus for implementing fault detection and correction in a computer system that requires high reliability and system manageability
US7672247B2 (en) Evaluating data processing system health using an I/O device
TWI529624B (en) Method and system of fault tolerance for multiple servers
US11687395B2 (en) Detecting and recovering from fatal storage errors
US10275330B2 (en) Computer readable non-transitory recording medium storing pseudo failure generation program, generation method, and generation apparatus
TW201715396A (en) Server and error detecting method thereof
TWI518680B (en) Method for maintaining file system of computer system
US8495353B2 (en) Method and circuit for resetting register
US8689059B2 (en) System and method for handling system failure
JP2005135063A (en) Information processor and clock abnormality detecting program for information processor
TWI715005B (en) Monitor method for demand of a bmc
TWI421701B (en) Computer system
TW201423590A (en) Computer system and operating method thereof
JP2014146110A (en) Information processing device, method for diagnosing error detection function, and computer program
TWI426379B (en) System and method for detecting system error of a computer
WO2016194170A1 (en) Error detection device and error detection system
CN112084049A (en) Method for monitoring resident program of baseboard management controller