TW200307200A - Multiple fault location in a series of devices - Google Patents

Multiple fault location in a series of devices

Info

Publication number
TW200307200A
Authority
TW
Taiwan
Prior art keywords
devices
scope
item
error
processing system
Prior art date
Application number
TW092107381A
Other languages
Chinese (zh)
Other versions
TWI265408B (en)
Inventor
Alongkorn Kitamorn
Ashwini Kulkarni
Gordon D Mcintosh
Kanisha Patel
Michael Anthony Perez
Original Assignee
IBM
Priority date
Filing date
Publication date
Application filed by IBM
Publication of TW200307200A
Application granted
Publication of TWI265408B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a virtual computing platform, e.g. logically partitioned systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0745: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in an input/output transactions management context
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

A method, computer program product, and data processing system for locating hardware faults occurring in multiple devices in a data processing system are disclosed. The devices have a scanning order in which the devices (or at least information regarding the devices) are scanned to analyze any possible error condition. When a new error is detected in a device, an identification of the device is stored in a data structure. If another error is detected and causes the devices to be scanned again, the scanning process skips over the device whose identity is stored in the data structure, so that the new error can be located.

Description

Description of the Invention:

[Technical Field of the Invention]

The present invention relates to identifying and handling hardware faults in a data processing system. More specifically, the present invention provides a method, computer program product, and data processing system for identifying and handling multiple errors that occur in a series of devices that are scanned sequentially to locate errors.

[Prior Art]

A logical partitioning (LPAR) function within a data processing system (platform) allows multiple copies of a single operating system (OS), or multiple heterogeneous operating systems, to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources.
These platform-allocatable resources include one or more architecturally distinct processors with their interrupt management areas, regions of system memory, and I/O adapter bus slots. The partition's resources are represented to the operating system image by the platform's firmware.

Each distinct OS, or image of an OS, running within the platform is protected from the others, so that a software error in one logical partition cannot affect the correct operation of any other partition. This protection is provided by allocating a disjoint set of platform resources to be managed directly by each OS image, and by providing mechanisms that ensure the various images cannot control any resource that has not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocatable resources within the platform.

The hardware resources in an LPAR system are shared among the partitions in a mutually exclusive manner. That is, a single resource may be assigned to only one partition at any given time, although any given resource may be assigned to any one of the partitions. This makes each partition appear to be a stand-alone computer. Resources that can be shared in this way include input/output (I/O) adapters, random access memory (RAM), non-volatile random access memory (NVRAM), and hard disk drives, although this list is by no means exhaustive. Each partition within an LPAR system may be booted and shut down over and over without power-cycling the whole system.

A group of I/O devices may be controlled by a common piece of hardware, such as a Peripheral Component Interconnect (PCI) host bridge, which may have many I/O adapters controlled by it or connected beneath it.
This bridge may be thought of as being shared by all of the partitions that are assigned to its slots. Hence, if the bridge becomes inoperable, it affects every partition that shares the devices beneath the bridge. Indeed, the problem may be so severe that the entire LPAR system will crash if any partition attempts to use the bridge further. In other words, the whole LPAR system would fail. In this situation, the normal course of action is to terminate the running partitions that share the bridge, which keeps the system from crashing because of the failure.

Usually the failure originates in an I/O adapter, which causes the bridge to enter a non-usable (error) state. When this happens, the I/O failure invokes a machine check interrupt handler (MCIH), which reports the error and then terminates the appropriate partitions. This procedure is the "normal" solution that keeps the entire LPAR system from crashing because of this problem.

To correct the fault, the specific I/O adapter or I/O adapter slot in which the failure occurred must be identified. This is usually done by sequentially scanning the status registers associated with each I/O adapter. A problem arises, however, when multiple I/O adapters under the control of a single bridge encounter errors. If an error first occurs in an adapter earlier in the sequence, and an error then occurs in an adapter later in the sequence, the scan may stop at the first error, and the second error may never be reported. This happens because the first error condition cannot be cleared: it must persist so that the bridge remains in a frozen (error) state.

Therefore, a method is needed for identifying multiple faults in a series of adapters.

[Summary of the Invention]

The present invention discloses a method, computer program product, and data processing system for locating hardware faults occurring in multiple devices in a data processing system.
The devices have a scanning order in which the devices (or at least information regarding the devices) are scanned to analyze any possible error condition. When a new error is detected in a device, an identification of the device is stored in a data structure. If another error is detected and causes the devices to be scanned again, the scanning process skips over the device whose identity is stored in the data structure, so that the new error can be located.

[Brief Description of the Drawings]

The novel features of the invention are set forth in the appended claims. The invention itself, however, together with a preferred mode of use and further objectives and advantages, is best understood by reference to the following description of specific embodiments when read with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a data processing system in which the present invention may be implemented;
FIG. 2 is a diagram depicting a series of devices (I/O adapters in slots), including a slot in which an error has occurred, in a data processing system such as the one depicted in FIG. 1;
FIG. 3 is a diagram depicting the result of a machine check interrupt handler detecting the error in the series of slots originally depicted in FIG. 2;
FIG. 4 is a diagram depicting the series of slots of FIG. 3 with an additional error occurring in a slot that follows, in the scanning order, the slot in which the error occurred in FIG. 2;
FIG. 5 is a diagram depicting a series of slots with the same errors as in FIG. 4, but including an additional data structure, in accordance with a preferred embodiment of the present invention;
FIG. 6 is a diagram depicting the result of a machine check interrupt handler detecting the second, subsequent error, in accordance with a preferred embodiment of the present invention; and
FIG. 7 is a flowchart of a process for locating errors in a series of devices, in accordance with a preferred embodiment of the present invention.

[Embodiments]

With reference now to the figures, and in particular to FIG. 1, a block diagram of a data processing system in which the present invention may be implemented is depicted. Data processing system 100 may be a symmetric multiprocessor (SMP) system including a plurality of processors 101, 102, 103, and 104 connected to system bus 106. For example, data processing system 100 may be an IBM RS/6000, a product of International Business Machines Corporation in Armonk, New York, implemented as a server within a network. Alternatively, a single-processor system may be employed. Also connected to system bus 106 is memory controller/cache 108, which provides an interface to a plurality of local memories 160-163. I/O bus bridge 110 is connected to system bus 106 and provides an interface to I/O bus 112. Memory controller/cache 108 and I/O bus bridge 110 may be integrated as depicted.

Data processing system 100 is a logically partitioned data processing system. Thus, data processing system 100 may have multiple heterogeneous operating systems (or multiple instances of a single operating system) running simultaneously. Each of these operating systems may have any number of software programs executing within it. Data processing system 100 is logically partitioned such that different PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, and hard disk adapter 149 may be assigned to different logical partitions. In this case, graphics adapter 148 provides a connection for a display device (not shown), while hard disk adapter 149 provides a connection to control hard disk 150.

Thus, for example, suppose data processing system 100 is divided into three logical partitions, P1, P2, and P3.
Each of PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, hard disk adapter 149, each of host processors 101-104, and each of local memories 160-163 is assigned to one of the three partitions. For example, processor 101, local memory 160, and PCI I/O adapters 120, 128, and 129 may be assigned to logical partition P1; processors 102-103, local memory 161, and PCI I/O adapters 121 and 136 may be assigned to partition P2; and processor 104, local memories 162-163, graphics adapter 148, and hard disk adapter 149 may be assigned to logical partition P3.

Each operating system executing within data processing system 100 is assigned to a different logical partition. Thus, each operating system executing within data processing system 100 may access only those I/O units that are within its own logical partition. For example, one instance of the Advanced Interactive Executive (AIX) operating system may be executing within partition P1, a second instance (image) of the AIX operating system may be executing within partition P2, and a Windows 2000 operating system may be operating within logical partition P3. Windows 2000 is a product and trademark of Microsoft Corporation of Redmond, Washington.

Peripheral Component Interconnect (PCI) host bridge 114, connected to I/O bus 112, provides an interface to PCI local bus 115. A number of PCI I/O adapters 120-121 may be connected to PCI bus 115 through PCI-to-PCI bridge 116, PCI bus 118, PCI bus 119, I/O slot 170, and I/O slot 171. PCI-to-PCI bridge 116 provides an interface to PCI bus 118 and PCI bus 119. PCI I/O adapters 120 and 121 are placed into I/O slots 170 and 171, respectively. Typical PCI bus implementations support between four and eight I/O adapters (that is, expansion slots for add-in connectors).
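
The P1/P2/P3 assignment described above can be modeled in a few lines. The following is only an illustrative sketch: the `PartitionTable` class, its method names, and the resource labels are hypothetical and do not come from the patent, and in a real LPAR platform this bookkeeping is enforced by firmware rather than by application code.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionTable:
    """Hypothetical model: records which logical partition owns each resource."""
    owner: dict[str, str] = field(default_factory=dict)

    def assign(self, partition: str, *resources: str) -> None:
        for res in resources:
            current = self.owner.get(res)
            if current is not None and current != partition:
                # Mutual exclusion: a resource belongs to one partition at a time.
                raise ValueError(f"{res} is already assigned to {current}")
            self.owner[res] = partition

    def may_access(self, partition: str, resource: str) -> bool:
        """An OS image may use only the resources owned by its own partition."""
        return self.owner.get(resource) == partition


# The example assignment from the text, using the reference numerals as labels.
table = PartitionTable()
table.assign("P1", "processor 101", "memory 160",
             "adapter 120", "adapter 128", "adapter 129")
table.assign("P2", "processor 102", "processor 103", "memory 161",
             "adapter 121", "adapter 136")
table.assign("P3", "processor 104", "memory 162", "memory 163",
             "graphics adapter 148", "hard disk adapter 149")
```

With this table, `table.may_access("P1", "adapter 128")` is true while `table.may_access("P2", "adapter 128")` is false, and an attempt to assign adapter 120 to P2 raises an error because P1 already owns it.
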
Each PCI I/O adapter 120-121 provides an interface between data processing system 100 and input/output devices such as, for example, other network computers, which are clients to data processing system 100.

An additional PCI host bridge 122 provides an interface for an additional PCI bus 123. PCI bus 123 is connected to a plurality of PCI I/O adapters 128-129. PCI I/O adapters 128-129 may be connected to PCI bus 123 through PCI-to-PCI bridge 124, PCI bus 126, PCI bus 127, I/O slot 172, and I/O slot 173. PCI-to-PCI bridge 124 provides an interface to PCI bus 126 and PCI bus 127. PCI I/O adapters 128 and 129 are placed into I/O slots 172 and 173, respectively. In this manner, additional I/O devices, such as modems or network adapters, may be supported through each of PCI I/O adapters 128-129, and data processing system 100 can connect to multiple network computers.

A memory-mapped graphics adapter 148 inserted into I/O slot 174 may be connected to I/O bus 112 through PCI bus 144, PCI-to-PCI bridge 142, PCI bus 141, and PCI host bridge 140. Hard disk adapter 149 may be placed into I/O slot 175, which is connected to PCI bus 145. In turn, this bus is connected to PCI-to-PCI bridge 142, which is connected to PCI host bridge 140 by PCI bus 141.

A PCI host bridge 130 provides an interface for PCI bus 131 to connect to I/O bus 112. PCI I/O adapter 136 is connected to I/O slot 176, which is connected to PCI-to-PCI bridge 132 by PCI bus 133. PCI-to-PCI bridge 132 is connected to PCI bus 131. This PCI bus also connects PCI host bridge 130 to the service processor mailbox interface and ISA bus access pass-through logic 194 and to PCI-to-PCI bridge 132. Service processor mailbox interface and ISA bus access pass-through logic 194 forwards PCI accesses destined for PCI/ISA bridge 193.
NVRAM storage 192 is connected to ISA bus 196. Service processor 135 is coupled to service processor mailbox interface and ISA bus access pass-through logic 194 through its local PCI bus 195. Service processor 135 is also connected to processors 101-104 via a plurality of JTAG/I2C buses 134. JTAG/I2C buses 134 are a combination of JTAG/scan buses (see IEEE 1149.1) and Philips I2C buses. Alternatively, however, JTAG/I2C buses 134 may be replaced by only Philips I2C buses or only JTAG/scan buses. All SP-ATTN signals of host processors 101, 102, 103, and 104 are connected together to an interrupt input signal of the service processor. Service processor 135 has its own local memory 191 and has access to hardware OP-panel 190.

When data processing system 100 is initially powered up, service processor 135 uses JTAG/I2C buses 134 to interrogate the system (host) processors 101-104, memory controller/cache 108, and I/O bridge 110. At the completion of this step, service processor 135 has an inventory and topology understanding of data processing system 100. Service processor 135 also executes built-in self-tests (BISTs), basic assurance tests (BATs), and memory tests on all elements found by interrogating host processors 101-104, memory controller/cache 108, and I/O bridge 110. Any error information for failures detected during the BISTs, BATs, and memory tests is gathered and reported by service processor 135.

If a meaningful/valid configuration of system resources is still possible after taking out the elements found to be faulty during the BISTs, BATs, and memory tests, then data processing system 100 is allowed to proceed to load executable code into local (host) memories 160-163. Service processor 135 then releases host processors 101-104 for execution of the code loaded into host memories 160-163.
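
The power-up decision described above (take out the elements that failed testing, then proceed only if a valid configuration remains) might be sketched as follows. The text does not define what counts as a "meaningful/valid configuration"; this sketch assumes, purely for illustration, that at least one processor and one local memory must survive. The function names and the inventory layout are likewise hypothetical.

```python
def usable_elements(inventory: dict[str, set[str]],
                    failed: set[str]) -> dict[str, set[str]]:
    """Drop every element that failed a BIST, BAT, or memory test."""
    return {kind: ids - failed for kind, ids in inventory.items()}


def may_load_code(inventory: dict[str, set[str]],
                  failed: set[str],
                  required: tuple[str, ...] = ("processor", "memory")) -> bool:
    """Assumed validity rule: each required kind keeps at least one element."""
    remaining = usable_elements(inventory, failed)
    return all(bool(remaining.get(kind)) for kind in required)


# Inventory gathered by interrogating the system, keyed by reference numeral.
inventory = {
    "processor": {"101", "102", "103", "104"},
    "memory": {"160", "161", "162", "163"},
}
```

Under this assumed rule, losing processor 101 and memory 160 still permits loading code, whereas losing all four local memories does not.
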
When the host processor 101_104 executes in the data processing system 100 When operating system code, the service processor 135 enters a listening mode and reports errors. The types of items monitored by the service processor 135 include, for example, the speed and operation of the cooling fan, temperature sensor, power supply Regulators, and recoverable and non-recoverable errors reported by processors 101-104, local memory 160-163, and 1/0 bridge u〇. The service processor 135 is responsible for storing and reporting the data processing department 84360 -14- 200307200 = 错误 Error messages for all related monitoring items. Service processor ⑴ Processor 135 can ... take action on my privacy. For example, service σ is handled as / wangyiyi There are too many recoverable errors in the cache memory and determine the predictability of hardware failures. Based on this decision 'in the ^ section being executed and the future starting program to be manned during the period: 135 可以 做 # 々 、 人 、 , 次 π χ ^ Start "The two originals are non-configuration. 1PLs are sometimes referred to as" 1 ". Data processing can be implemented using various computer systems. For example, data processing systems ! = IBM Business Machine's 84e system from the International Business Machines Corporation: The current system can be used to support logical partitioning. The system can be purchased from the International Business Machines Corporation. Hard :: generally familiar with this technology People here will appreciate that the hardware two described in Figure 1 can be changed. For example, other peripheral devices, such as CD-ROM temples, can be used in addition to or instead of the hardware described. This description The examples do not imply any structural limitations of the present invention. This article provides a method, a computer program product, and a data processing system for finding the location of a fault in a series of devices. Trace the sequence to find the errors. 
Figure 2 is a diagram depicting a series of devices with a scanning sequence within the data processing system, as described in Figure i.-The pa host bridge 200 handles the slots 2G2, outline, and 2_device / ι processing. The adapter card has an error in slot 204. In order to state the failure occurred in slot 204, the machine checks the interrupt handler to find the error. Typically, the machine The inspection interrupt handler must follow a predetermined scan; the sequence (in this example, the sequence is from left to right) scans are related to each slot 2202, outline, 84360 -15- 200307200 206 and 208 Status register to find errors. The status register can be included in an I / O bridge, such as: 1 / () bridge in the figure, a pci king bridge, such as: PCI host bridge 200 or in Within the adapter card itself, for example: adapter cards 204, 206, and 208 in slot 202. In order to find the error occurred in the adapter card 204 in the slot, the machine check interrupt processor first checks the slot 200 related to the status register. Check out that the adapter card built-in error in slot 202 has occurred. The machine check interrupt handler searches for slot 204, which is the next slot in the sequence. When slot 204 contains an adapter card and an error occurs, the machine check interrupt handler will recognize that the slot has failed, as described in Figure 3, as ,, over, and out. An error that occurs in the adapter card of the identification slot will cause the PCI host bridge 2000 to be placed in an error state, as described in Figure 3, as "overpassed." Then, the machine checks the interrupt handler. The process of finding the error will end. The PCI host bridge of 200 must remain in an error state until the problem of slot 204 is corrected to avoid system crash. As a result, slot 204 cannot clear its error state. 
One consequence is that if an additional error occurs in an adapter card whose slot comes after an already-failed slot in the scan order, the additional error may not be recognized. For example, Figure 4 shows that the adapter card in slot 206 has encountered an error. Because slot 204 also contains an adapter card that is experiencing an error, the machine check interrupt handler will check the status register for slot 202, then check the status register for slot 204, detect the error condition in the adapter card in slot 204, and end the process of locating the error before reaching slot 206.
The present invention introduces an additional data structure to address this situation, for example data structure 500 depicted in Figure 5. In a preferred embodiment, data structure 500 is stored in a storage device, such as NVRAM storage 192 in Figure 1. Data structure 500 is a log that records errors as they are identified by the machine check interrupt handler. In Figure 5, data structure 500 shows that the error in the adapter card in slot 204 has been detected.
In a preferred embodiment of the present invention, when the machine check interrupt handler scans slots 202, 204, 206, and 208, it first checks the status register associated with slot 202 and then the status register associated with slot 204. When the machine check interrupt handler reaches slot 204, however, it consults data structure 500 to see whether the error in that slot has already been recorded. When the machine check interrupt handler finds that the error in slot 204 is recorded in data structure 500, it goes on to check the status register associated with slot 206. As shown in Figure 6, the error occurring in slot 206 will then be identified, and data structure 500 will be updated to include the newly detected error.
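One way to picture data structure 500 is as a small persistent set of slot identifiers. The sketch below is an assumption-laden illustration: the class name `ErrorLog` is hypothetical, and a JSON file in a temporary directory stands in for NVRAM storage 192, whose real layout the text does not describe.

```python
import json
import os
import tempfile


class ErrorLog:
    """Hypothetical stand-in for data structure 500: a persistent set of failed slots."""

    def __init__(self, path):
        self.path = path
        self.slots = set()
        if os.path.exists(path):
            # Reload errors recorded by earlier handler invocations.
            with open(path) as f:
                self.slots = set(json.load(f))

    def contains(self, slot):
        return slot in self.slots

    def record(self, slot):
        self.slots.add(slot)
        # Write through immediately so the record survives the handler exiting.
        with open(self.path, "w") as f:
            json.dump(sorted(self.slots), f)


log_path = os.path.join(tempfile.mkdtemp(), "error_log.json")
log = ErrorLog(log_path)
log.record(204)                 # the state shown in Figure 5
reloaded = ErrorLog(log_path)   # a later handler invocation sees the same log
```

Recording slot 206 afterwards would leave the log holding both 204 and 206, which corresponds to the updated state shown in Figure 6.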
Figure 7 is a flowchart representation of a process for locating faults in a series of devices, in accordance with a preferred embodiment of the present invention. In the preferred embodiment, the faults occur in slots containing adapter cards. One of ordinary skill in the art will recognize, however, that any group of devices that are communicatively coupled and have a scan order can be scanned for errors with the procedure described in Figure 7. The procedure is not limited to the preferred embodiment.
First, a determination is made as to whether all slots have been scanned (step 700). If not, that is, if any slot has not yet been scanned for errors, the status register associated with the next slot in the scan order is checked (step 702). A determination is then made as to whether an error has occurred in that slot (step 704). If not, the process returns to step 700 to check the next slot, if there is one. If there is an error, a determination is made as to whether the error has already been recorded in an appropriate data structure, such as data structure 500 in Figure 5 (step 706). If it has been recorded, the process returns to step 700 to check the next slot, if there is one. If the error has not been recorded, however, the slot is identified as having encountered an error (step 708), and a record of the error is stored in the appropriate data structure, such as data structure 500 in Figure 5 (step 710). After step 710, the process ends. Alternatively, the process may end at step 700 if there are no more slots to scan.
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention can be distributed in the form of a computer-readable medium of instructions and in a variety of other forms, and that the present invention applies equally regardless of the particular type of signal-bearing media actually used to carry out the distribution.
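The flow of Figure 7 can be sketched as a single function. This is an illustrative reading of steps 700 to 710, not the patented implementation: the function name, the callable used to read a status register, and the use of a plain Python set for the log are all assumptions.

```python
def locate_fault(scan_order, read_status, error_log):
    """Follow the Figure 7 flow; return the newly identified slot, or None."""
    for slot in scan_order:              # step 700: are there slots left to scan?
        has_error = read_status(slot)    # step 702: check the slot's status register
        if not has_error:                # step 704: no error, move to the next slot
            continue
        if slot in error_log:            # step 706: error already recorded, skip it
            continue
        error_log.add(slot)              # steps 708 and 710: identify and record
        return slot                      # the process ends after step 710
    return None                          # step 700: no more slots to scan


# Hypothetical register contents matching Figure 4: slots 204 and 206 have failed.
status = {202: False, 204: True, 206: True, 208: False}
log = set()
first = locate_fault([202, 204, 206, 208], status.get, log)
second = locate_fault([202, 204, 206, 208], status.get, log)
third = locate_fault([202, 204, 206, 208], status.get, log)
```

Each call stands for one invocation of the handler. The first identifies slot 204, the second skips the logged slot 204 and identifies slot 206, and the third finds nothing new, because both failed slots are already in the log.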
Examples of computer-readable media include recordable-type media, such as a floppy disk, a hard disk drive, random access memory (RAM), CD-ROM, and DVD-ROM, and transmission-type media, such as digital and analog communication links and wired or wireless communication links using transmission forms such as radio frequency and optical transmission. The computer-readable media may take the form of coded formats that are decoded for actual use in a particular data processing system. Functional descriptive material is information that imparts functionality to a machine. Functional descriptive material includes, but is not limited to, computer programs, instructions, and rules.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
[Description of Symbols of the Drawings]
100 Data processing system
101, 102, 103, 104 Processor
106 System bus
108 Memory controller / cache memory
110 I/O bridge
112 I/O bus
114, 122, 130, 140 PCI host bridge
116, 124, 132, 142 PCI-to-PCI bridge
118, 119, 123, 126, 131, 133, 141, 144, 145 PCI bus
120, 121, 128, 129, 136 PCI I/O adapter
134 JTAG/I2C bus
135 Service processor
148 Graphics adapter
149 Hard disk adapter
150 Hard disk
160, 161, 162, 163 Local memory
170, 171, 172, 173, 174, 175, 176 I/O slot
190 OP panel
191 Memory
192 Non-volatile random access memory (NVRAM)
194 Service processor mailbox interface and ISA bus access pass-through logic
196 ISA bus
200 PCI host bridge
202 Slot 1
204 Slot 2
206 Slot 3
208 Slot 4

Claims (1)

Scope of patent application:
1. A method, comprising: detecting an error in a first device from among a plurality of devices, wherein the plurality of devices is associated with a scan order; and scanning information regarding the plurality of devices in the scan order to identify the first device, while skipping each device identified in a data structure.
2. The method of claim 1, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
3. The method of claim 2, further comprising: in response to detecting the error, at least partially disabling functionality of the supervisory device.
4. The method of claim 1, further comprising: inserting an identification of the first device into the data structure.
5. The method of claim 1, wherein the plurality of devices includes at least one integrated circuit.
6. The method of claim 5, wherein the at least one integrated circuit includes at least one input/output interface integrated circuit.
7. The method of claim 1, wherein the plurality of devices includes at least one peripheral component in a data processing system.
8. The method of claim 1, wherein scanning information regarding the plurality of devices includes: checking error registers within an interface circuit, wherein each error register represents the state of an associated device among the plurality of devices.
9. The method of claim 1, wherein scanning information regarding the plurality of devices includes: analyzing the behavior of a current device, taken from the scan order of the plurality of devices, to determine the state of the current device.
10. A computer program product in a computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including: detecting an error in a first device from among a plurality of devices, wherein the plurality of devices is associated with a scan order; and scanning information regarding the plurality of devices in the scan order to identify the first device, while skipping each device identified in a data structure.
11. The computer program product of claim 10, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
12. The computer program product of claim 11, comprising additional functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including: in response to detecting the error, at least partially disabling functionality of the supervisory device.
13. The computer program product of claim 10, comprising additional functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including: inserting an identification of the first device into the data structure.
14. The computer program product of claim 10, wherein the plurality of devices includes at least one integrated circuit.
15. The computer program product of claim 14, wherein the at least one integrated circuit includes at least one input/output interface integrated circuit.
16. The computer program product of claim 10, wherein the plurality of devices includes at least one peripheral component in a data processing system.
17. The computer program product of claim 10, wherein scanning information regarding the plurality of devices includes: checking error registers within an interface circuit, wherein each error register represents the state of an associated device among the plurality of devices.
18. The computer program product of claim 10, wherein scanning information regarding the plurality of devices includes: analyzing the behavior of a current device, taken from the scan order of the plurality of devices, to determine the state of the current device.
19. A data processing system, comprising: at least one processor; memory in communication with the at least one processor; a plurality of devices in communication with the at least one processor and having a scan order; and a set of instructions in the memory, wherein the at least one processor executes the set of instructions to perform acts including: detecting an error in a first device from among the plurality of devices, wherein the plurality of devices is associated with the scan order; and scanning information regarding the plurality of devices in the scan order to identify the first device, while skipping each device identified in a data structure.
20. The data processing system of claim 19, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
21. The data processing system of claim 20, wherein the at least one processor executes the set of instructions to perform additional acts including: in response to detecting the error, at least partially disabling functionality of the supervisory device.
22. The data processing system of claim 19, wherein the at least one processor executes the set of instructions to perform additional acts including: inserting an identification of the first device into the data structure.
23. The data processing system of claim 19, wherein the plurality of devices includes at least one integrated circuit.
24. The data processing system of claim 23, wherein the at least one integrated circuit includes at least one input/output interface integrated circuit.
25. The data processing system of claim 19, wherein the plurality of devices includes at least one peripheral component in a data processing system.
26. The data processing system of claim 19, wherein scanning information regarding the plurality of devices includes: checking error registers within an interface circuit, wherein each error register represents the state of an associated device among the plurality of devices.
27. The data processing system of claim 19, wherein scanning information regarding the plurality of devices includes: analyzing the behavior of a current device, taken from the scan order of the plurality of devices, to determine the state of the current device.
TW092107381A 2002-04-04 2003-04-01 Method, computer-readable medium and data processing system for locating hardware faults TWI265408B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/116,522 US20030191978A1 (en) 2002-04-04 2002-04-04 Multiple fault location in a series of devices

Publications (2)

Publication Number Publication Date
TW200307200A true TW200307200A (en) 2003-12-01
TWI265408B TWI265408B (en) 2006-11-01

Family

ID=28674005

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092107381A TWI265408B (en) 2002-04-04 2003-04-01 Method, computer-readable medium and data processing system for locating hardware faults

Country Status (2)

Country Link
US (1) US20030191978A1 (en)
TW (1) TWI265408B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI579768B (en) * 2016-01-12 2017-04-21 英業達股份有限公司 Updating system of firmware of complex programmable logic device and updating method thereof

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7644118B2 (en) 2003-09-11 2010-01-05 International Business Machines Corporation Methods, systems, and media to enhance persistence of a message
US7266727B2 (en) * 2004-03-18 2007-09-04 International Business Machines Corporation Computer boot operation utilizing targeted boot diagnostics
DE102004019151A1 (en) * 2004-04-21 2005-11-10 Daimlerchrysler Ag Computer-aided diagnostic system based on heuristics and system topologies
CN100395717C (en) * 2005-07-11 2008-06-18 英业达股份有限公司 Method and system for monitoring hard-disk damage
US8785217B2 (en) 2011-09-12 2014-07-22 International Business Machines Corporation Tunable radiation source

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4371930A (en) * 1980-06-03 1983-02-01 Burroughs Corporation Apparatus for detecting, correcting and logging single bit memory read errors
US4459693A (en) * 1982-01-26 1984-07-10 Genrad, Inc. Method of and apparatus for the automatic diagnosis of the failure of electrical devices connected to common bus nodes and the like
US4514845A (en) * 1982-08-23 1985-04-30 At&T Bell Laboratories Method and apparatus for bus fault location
US4606024A (en) * 1982-12-20 1986-08-12 At&T Bell Laboratories Hierarchical diagnostic testing arrangement for a data processing system having operationally interdependent circuit boards
US4535455A (en) * 1983-03-11 1985-08-13 At&T Bell Laboratories Correction and monitoring of transient errors in a memory system
US4604751A (en) * 1984-06-29 1986-08-05 International Business Machines Corporation Error logging memory system for avoiding miscorrection of triple errors
US4951283A (en) * 1988-07-08 1990-08-21 Genrad, Inc. Method and apparatus for identifying defective bus devices
US5072450A (en) * 1989-07-27 1991-12-10 Zenith Data Systems Corporation Method and apparatus for error detection and localization
US5245615A (en) * 1991-06-06 1993-09-14 International Business Machines Corporation Diagnostic system and interface for a personal computer
US5263032A (en) * 1991-06-27 1993-11-16 Digital Equipment Corporation Computer system operation with corrected read data function
JP2888401B2 (en) * 1992-08-03 1999-05-10 インターナショナル・ビジネス・マシーンズ・コーポレイション Synchronization method for redundant disk drive arrays
US5504859A (en) * 1993-11-09 1996-04-02 International Business Machines Corporation Data processor with enhanced error recovery
US5729767A (en) * 1994-10-07 1998-03-17 Dell Usa, L.P. System and method for accessing peripheral devices on a non-functional controller
US6032271A (en) * 1996-06-05 2000-02-29 Compaq Computer Corporation Method and apparatus for identifying faulty devices in a computer system
US5889933A (en) * 1997-01-30 1999-03-30 Aiwa Co., Ltd. Adaptive power failure recovery
WO1999005599A1 (en) * 1997-07-28 1999-02-04 Intergraph Corporation Apparatus and method for memory error detection and error reporting
US6061788A (en) * 1997-10-02 2000-05-09 Siemens Information And Communication Networks, Inc. System and method for intelligent and reliable booting
US6496945B2 (en) * 1998-06-04 2002-12-17 Compaq Information Technologies Group, L.P. Computer system implementing fault detection and isolation using unique identification codes stored in non-volatile memory
US6317848B1 (en) * 1998-09-24 2001-11-13 Xerox Corporation System for tracking and automatically communicating printer failures and usage profile aspects
DE19947135A1 (en) * 1999-09-30 2001-04-05 Siemens Ag Method for treating peripheral units reported as faulty in a communication system

Also Published As

Publication number Publication date
TWI265408B (en) 2006-11-01
US20030191978A1 (en) 2003-10-09

Similar Documents

Publication Publication Date Title
US8135985B2 (en) High availability support for virtual machines
US7313717B2 (en) Error management
JP5579354B2 (en) Method and apparatus for storing track data cross-reference for related applications
US7519866B2 (en) Computer boot operation utilizing targeted boot diagnostics
TWI317868B (en) System and method to detect errors and predict potential failures
US6934879B2 (en) Method and apparatus for backing up and restoring data from nonvolatile memory
US7979749B2 (en) Method and infrastructure for detecting and/or servicing a failing/failed operating system instance
US7139940B2 (en) Method and apparatus for reporting global errors on heterogeneous partitioned systems
CN102597962B (en) Method and system for fault management in virtual computing environments
US6901537B2 (en) Method and apparatus for preventing the propagation of input/output errors in a logical partitioned data processing system
US7107495B2 (en) Method, system, and product for improving isolation of input/output errors in logically partitioned data processing systems
US20040221198A1 (en) Automatic error diagnosis
US10037238B2 (en) System and method for encoding exception conditions included at a remediation database
US8949659B2 (en) Scheduling workloads based on detected hardware errors
TWI310899B (en) Method, system, and product for utilizing a power subsystem to diagnose and recover from errors
CN100375960C (en) Method and apparatus for regulating input/output fault
US6789048B2 (en) Method, apparatus, and computer program product for deconfiguring a processor
JP4366336B2 (en) Method for managing trace data in logical partition data processing system, logical partition data processing system for managing trace data, computer program for causing computer to manage trace data, logical partition data Processing system
JPH0950424A (en) Dump sampling device and dump sampling method
US7996707B2 (en) Method to recover from ungrouped logical path failures
TW200307200A (en) Multiple fault location in a series of devices
CN111966520A (en) Database high-availability switching method, device and system
US7302690B2 (en) Method and apparatus for transparently sharing an exception vector between firmware and an operating system
JP3334174B2 (en) Fault handling verification device
KR100862407B1 (en) System and method to detect errors and predict potential failures

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees