TW591376B - System and method for detecting server failure and the restoring of the same - Google Patents

System and method for detecting server failure and the restoring of the same

Info

Publication number
TW591376B
Authority
TW
Taiwan
Prior art keywords
patent application
server
scope
item
signal
Prior art date
Application number
TW90133427A
Other languages
Chinese (zh)
Inventor
Chung-Chih Tung
Original Assignee
Mitac Technology Corp
Priority date
Filing date
Publication date
Application filed by Mitac Technology Corp
Priority to TW90133427A
Application granted
Publication of TW591376B

Landscapes

  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

A system for detecting and recovering from server failure is provided. The system is suitable for use in a server and includes an event signal interception module and a power control circuit. The event signal interception module receives a system signal and, based on that signal, sends a counter reset signal to the power control circuit. The power control circuit includes a counter that is decremented once per unit of time. When the power control circuit receives the counter reset signal, the count value of the counter is reset to a first predetermined value. When the count value equals a third predetermined value, the power of the server is turned off and the server is restarted.
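The counter behaviour summarized above can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the patented circuit: the class name `WatchdogCounter` and its method names are hypothetical, and the default values 180, 1, and 0 are the example figures given later in the description.

```python
from dataclasses import dataclass, field


@dataclass
class WatchdogCounter:
    """Hypothetical software model of the counter in the power control circuit."""
    first_value: int = 180   # reload value (example figure from the description)
    second_value: int = 1    # amount subtracted after each unit of time
    third_value: int = 0     # count value that triggers the power cycle
    count: int = field(init=False)

    def __post_init__(self) -> None:
        # The count value starts at the first predetermined value.
        self.count = self.first_value

    def reset(self) -> None:
        """Counter reset signal received: reload the first predetermined value."""
        self.count = self.first_value

    def tick(self) -> bool:
        """One unit of time has elapsed.

        Returns True when the count value equals the third predetermined
        value, i.e. when the server should be powered off and restarted.
        The equality test follows the wording of the abstract.
        """
        self.count -= self.second_value
        return self.count == self.third_value


# With the example values and no reset, the 180th tick is the first to return True.
watchdog = WatchdogCounter()
ticks_until_restart = next(n for n in range(1, 1000) if watchdog.tick())
```

In this model a caller would invoke `reset()` whenever a counter reset signal arrives and `tick()` once per unit of time; how the real circuit is clocked is not specified in the source.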

Description

V. Description of the Invention

The present invention relates to a system and method for detecting server failure, and in particular to a system and method for detecting and recovering from server failure that can automatically restore a failed server.

In recent years, with the rapid growth of networks, system management (MIS) work for the overall network system has become increasingly important, and maintaining a stable, normally functioning server at any time and place has likewise become an important issue.

In addition, as the concept of the global village has taken shape, long-distance and multinational services, as well as information services available anytime and anywhere, have become common modes of operation. If a server fails unexpectedly in the middle of the night or while no system administrator is present, for example when it crashes because of improper user operation or a system software problem, it must wait for the network administrator to carry out troubleshooting. The server's service is interrupted in the meantime, which affects the overall service quality of the system.

In many cases, however, a server crash is caused by detailed design problems in the system software itself or by a user's improper access to the server, and normal service can be restored simply by powering the server off and on again. In such cases, where a restart is all that is needed, shortening the time between the server crash and the restart becomes another important issue.

In view of this, the main object of the present invention is to provide a system and method for detecting and recovering from server failure that, when a server crashes, automatically restarts the server while automatically notifying the administrator, so that the failed server can resume service, thereby reducing the losses incurred after the server stops providing service.

To achieve the above object, the present invention provides a system and method for detecting and recovering from server failure.

A system for detecting and recovering from server failure according to an embodiment of the invention is suitable for use in a server and includes an event signal interception module and a power control circuit. The event signal interception module receives a system signal and, based on the system signal, sends a counter reset signal to the power control circuit.

The power control circuit may include a counter that is decremented once per unit of time. When the power control circuit receives the counter reset signal, it resets the count value of the counter to a first predetermined value; when the count value equals a third predetermined value, it turns off the power of the server and restarts the server.

In addition, when the count value equals the third predetermined value, a fault occurrence signal is also output to notify the system administrator. The system signal may be an idle event (Idle Event) signal issued by an operating system module, and the power control circuit is an independent power control circuit that may be arranged on the motherboard or on an interface card of the server.

Brief description of the drawings

To make the above objects, features, and advantages of the invention clearer and easier to understand, a specific embodiment is described in detail below with reference to the accompanying drawings:

Figure 1 is a schematic diagram of the system architecture of the system for detecting and recovering from server failure according to an embodiment of the invention.

Figure 2 is a flowchart of the method for detecting and recovering from server failure according to an embodiment of the invention.

Explanation of symbols

10: operating system module;
20: event signal interception module;
30: power control circuit;
31: counter;
S100, ..., S108: operation steps.

Embodiments

An embodiment of the invention is described in detail below with reference to the accompanying drawings.

Figure 1 is a schematic diagram of the system architecture of the system for detecting and recovering from server failure according to an embodiment of the invention. Referring to Figure 1, the system includes an operating system module 10, an event signal interception module 20, and a power control circuit 30.

The power control circuit 30 is an independent power control circuit capable of powering the server on and off and of switching the power supply on and off. It may be arranged on the motherboard or on an interface card of the server (not shown).

The power control circuit 30 includes a counter 31. The counter 31 holds a count value that starts at a first predetermined value, and after each predetermined unit of time a second predetermined value is subtracted from the count value. The first predetermined value is greater than the second predetermined value. For example, the initial value of the counter 31 (the first predetermined value) is 180, and every 1 second (the predetermined unit of time) the value 1 (the second predetermined value) is subtracted.

The operating system module 10 may be the operating system installed on the server, such as Windows or Linux. When the server's central processing unit (CPU) has nothing to process, the operating system module 10 issues a system signal, such as an idle event (Idle Event) signal, to the overall server system so that power management procedures can be carried out.

The event signal interception module 20 may be a circuit or a driver. It intercepts the system signal issued by the operating system module 10 and, based on the received system signal, sends a counter reset signal to the power control circuit 30.

When the power control circuit 30 receives the counter reset signal, it resets the count value of the counter 31 to the initial value (the first predetermined value). On the other hand, when the count value of the counter 31 has decreased over time to a third predetermined value, for example 0, the server has not issued a system signal for some time. If the system signal is the idle event signal, this means that the server's CPU has crashed or has been busy the whole time (the latter being very unlikely). The power control circuit 30 therefore turns off the power of the server and restarts it, and outputs a fault occurrence signal, such as a pager message, to notify the system administrator that the server has failed.

Figure 2 is a flowchart of the method for detecting and recovering from server failure according to an embodiment of the invention. The operation flow of the embodiment is described below with reference to Figures 1 and 2.
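As an aside on the event signal interception module 20 described above, its role when implemented as a driver might be sketched in Python as follows. The names `SystemSignal`, `EventSignalInterceptor`, and `send_counter_reset` are hypothetical, and the choice to ignore signals other than the idle event is an assumption made for illustration; the source only states that a counter reset signal is sent based on the received system signal.

```python
from enum import Enum, auto
from typing import Callable


class SystemSignal(Enum):
    """Hypothetical signal kinds; only the idle event is named in the description."""
    IDLE_EVENT = auto()
    OTHER = auto()


class EventSignalInterceptor:
    """Sketch of module 20: turns intercepted idle events into counter reset signals."""

    def __init__(self, send_counter_reset: Callable[[], None]) -> None:
        # send_counter_reset stands in for the link to power control circuit 30.
        self._send_counter_reset = send_counter_reset
        self.forwarded = 0  # number of counter reset signals sent so far

    def on_system_signal(self, signal: SystemSignal) -> None:
        # Assumption: signals other than the idle event do not reset the counter.
        if signal is SystemSignal.IDLE_EVENT:
            self._send_counter_reset()
            self.forwarded += 1


# Usage: record the reset signals that would reach the power control circuit.
resets = []
interceptor = EventSignalInterceptor(lambda: resets.append("reset"))
interceptor.on_system_signal(SystemSignal.IDLE_EVENT)
interceptor.on_system_signal(SystemSignal.OTHER)
```

Passing the reset action in as a callable keeps the sketch independent of how the power control circuit is actually reached (for example a register write on a motherboard or an interface card), which the source leaves open.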


First, in step S100, after each second (the predetermined unit of time), one (the second predetermined value) is subtracted from the count value of the counter 31 in the power control circuit 30. Next, in step S102, it is determined whether the count value equals zero (the third predetermined value). If the count value is not zero, then in step S104 it is determined whether a counter reset signal sent by the event signal interception module 20 has been received.

If the power control circuit 30 has not received a counter reset signal, the flow returns directly to step S100. If the power control circuit 30 has received a counter reset signal, then in step S106 the count value of the counter 31 is reset to 180 (the first predetermined value), and the flow returns to step S100.

On the other hand, if the count value is determined to be zero in step S102, then in step S108 the power control circuit 30 turns off the power of the server, restarts the server, and outputs a fault occurrence signal (not shown in Figure 2), such as a pager message, to notify the system administrator.
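The flow of steps S100 to S108 can be traced with a short Python simulation. This is a sketch for illustration only: the function name `run_flow` and the representation of the reset signal as one boolean per unit of time are assumptions, while the default values 180, 1, and 0 come from the example in the description.

```python
from typing import Iterable, Optional


def run_flow(reset_received: Iterable[bool],
             first_value: int = 180,
             second_value: int = 1,
             third_value: int = 0) -> Optional[int]:
    """Simulate steps S100-S108.

    reset_received yields one boolean per unit of time: True if a counter
    reset signal arrived during that unit. Returns the 1-based unit of time
    at which step S108 fires, or None if it never fires.
    """
    count = first_value
    for elapsed, got_reset in enumerate(reset_received, start=1):
        count -= second_value        # S100: subtract after each unit of time
        if count == third_value:     # S102: compare with the third value
            return elapsed           # S108: power off, restart, fault signal
        if got_reset:                # S104: counter reset signal received?
            count = first_value      # S106: reload the first value
    return None


# A server that reports an idle event every 60 seconds is never restarted.
healthy = run_flow(t % 60 == 0 for t in range(1, 601))

# A server whose last idle event arrives at second 120 is restarted 180 seconds later.
crashed = run_flow(t in (60, 120) for t in range(1, 1001))
```

Because S102 is evaluated before S104, a reset signal that arrives in the same unit of time in which the count reaches the third value does not prevent the restart in this sketch; that ordering follows the step sequence as described.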
Thus, with the system and method for detecting and recovering from server failure provided by the invention, when a server crashes the administrator is notified automatically and the server is restarted automatically, so that the failed server resumes service in the shortest possible time, reducing the losses incurred after the server stops providing service.

Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

Claims (1)

VI. Scope of Patent Application

1. A system for detecting and recovering from server failure, suitable for use in a server, the system comprising:
an event signal interception module for receiving a system signal and sending a counter reset signal based on the system signal; and
a power control circuit for receiving the counter reset signal, the power control circuit comprising:
a counter having a count value with a first predetermined value, wherein a second predetermined value is subtracted from the count value after each predetermined unit of time,
wherein when the power control circuit receives the counter reset signal, the count value of the counter is reset to the first predetermined value, and when the count value equals a third predetermined value, a power supply of the server is turned off and the server is restarted.

2. The system of claim 1, further comprising an operating system module for outputting the system signal.

3. The system of claim 1, wherein the power control circuit further outputs a fault occurrence signal when the count value equals the third predetermined value.

4. The system of claim 1 or 2, wherein the system signal is an idle event (Idle Event) signal.

5. The system of claim 1, wherein the power control circuit is arranged on a motherboard of the server.

6. The system of claim 1, wherein the power control circuit is arranged on an interface card of the server.

7. The system of claim 1, wherein the first predetermined value is greater than the second predetermined value.

8. The system of claim 1, wherein the third predetermined value is zero.

9. A method for detecting and recovering from server failure, comprising the following steps:
subtracting a second predetermined value from a count value of a counter after each predetermined unit of time;
setting the count value to a first predetermined value when a counter reset signal is received; and
turning off a power supply of a server and restarting the server when the count value equals a third predetermined value.

10. The method of claim 9, further comprising receiving a system signal and sending the counter reset signal based on the system signal.

11. The method of claim 10, further comprising an operating system module outputting the system signal.

12. The method of claim 9, further comprising outputting a fault occurrence signal when the count value equals the third predetermined value.

13. The method of claim 10 or 11, wherein the system signal is an idle event (Idle Event) signal.

14. The method of claim 9, wherein the counter is arranged in a power control circuit.

15. The method of claim 9, wherein the first predetermined value is greater than the second predetermined value.

16. The method of claim 9, wherein the third predetermined value is zero.

17. The method of claim 14, wherein the power control circuit is arranged on a motherboard of the server.

18. The method of claim 14, wherein the power control circuit is arranged on an interface card of the server.

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW90133427A TW591376B (en) 2001-12-31 2001-12-31 System and method for detecting server failure and the restoring of the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW90133427A TW591376B (en) 2001-12-31 2001-12-31 System and method for detecting server failure and the restoring of the same

Publications (1)

Publication Number Publication Date
TW591376B (en) 2004-06-11

Family

ID=34057361

Family Applications (1)

Application Number Title Priority Date Filing Date
TW90133427A TW591376B (en) 2001-12-31 2001-12-31 System and method for detecting server failure and the restoring of the same

Country Status (1)

Country Link
TW (1) TW591376B (en)

Similar Documents

Publication Publication Date Title
US7756048B2 (en) Method and apparatus for customizable surveillance of network interfaces
WO2018095107A1 (en) Bios program abnormal processing method and apparatus
WO2017215441A1 (en) Self-recovery method and apparatus for board configuration in distributed system
CN101047564A (en) Network communication equipment platform and method for implementing high reliability on it
US20040123183A1 (en) Method and apparatus for recovering from a failure in a distributed event notification system
TW591376B (en) System and method for detecting server failure and the restoring of the same
JP2735514B2 (en) Process status management method
JP3158517B2 (en) Failure detection method
KR0133337B1 (en) Tarket system control
JP2008152552A (en) Computer system and failure information management method
US6622257B1 (en) Computer network with swappable components
JP3325785B2 (en) Computer failure detection and recovery method
CN104394003B (en) Power supply trouble processing method, device and power supply unit
US7243257B2 (en) Computer system for preventing inter-node fault propagation
JP6654662B2 (en) Server device and server system
JP2004013723A (en) Device and method for fault recovery of information processing system adopted cluster configuration using shared memory
CN112084049B (en) Method for monitoring resident program of baseboard management controller
JP2000148525A (en) Method for reducing load of active system in service processor duplex system
JPS58225738A (en) Dispersion type transmission system
TWI715005B (en) Monitor method for demand of a bmc
WO2024066589A1 (en) Processing method for hardware error reporting, and related device
JP2001175545A (en) Server system, fault diagnosing method, and recording medium
JPH0271336A (en) Monitor system for fault state of processor
JP3107104B2 (en) Standby redundancy method
CN111654434A (en) Flow switching method and device and forwarding equipment

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees