TW200307200A - Multiple fault location in a series of devices

Info

Publication number: TW200307200A
Application number: TW092107381A
Authority: TW
Inventors: Alongkorn Kitamorn; Ashwini Kulkarni; Gordon D. McIntosh; Kanisha Patel; Michael Anthony Perez
Original assignee: IBM
Priority date: 2002-04-04
Filing date: 2003-04-01
Publication date: 2003-12-01
Also published as: TWI265408B; US20030191978A1

Abstract

A method, computer program product, and data processing system for locating hardware faults occurring in multiple devices in a data processing system are disclosed. The devices have a scanning order in which the devices (or at least information regarding the devices) are scanned to analyze any possible error condition. When a new error is detected in a device, an identification of the device is stored in a data structure. If another error is detected and causes the devices to be scanned again, the scanning process skips the device whose identity is stored in the data structure so that the new error can be located.

Description

[Technical Field]

The present invention relates to identifying and handling hardware failures in a data processing system. More specifically, the present invention provides a method, computer program product, and data processing system for identifying and handling multiple errors occurring in a series of devices that are scanned sequentially to locate errors.

[Prior Art]

A logical partitioning (LPAR) function within a data processing system (platform) allows multiple copies of a single operating system (OS), or multiple heterogeneous operating systems, to run simultaneously on a single data processing system platform.
A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform-allocatable resources include one or more architecturally distinct processors with their interrupt management areas, regions of system memory, and I/O adapter bus slots. The partition's resources are represented to the operating system image by the platform's firmware.

Each distinct OS, or image of an OS, running within the platform is protected from the others, so that software errors on one logical partition cannot affect the correct operation of any other partition. This is provided by allocating a disjoint set of platform resources to be directly managed by each OS image, and by providing mechanisms that ensure the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocatable resources within the platform.

The hardware resources in an LPAR system are shared among the partitions in a mutually exclusive manner. That is, a single resource is assigned to only one partition at any given time, although any given resource may be assigned to any one of the partitions. This makes each partition appear to be a stand-alone computer. Resources that can be shared in this way include input/output (I/O) adapters, random access memory (RAM), non-volatile random access memory (NVRAM), and hard disk drives, although this list is by no means exhaustive. Each partition within an LPAR system can be booted and shut down repeatedly without power-cycling the whole system.
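The mutually exclusive assignment described above can be illustrated with a small sketch. It is not taken from the patent: the `PartitionTable` class, its method names, and the resource and partition labels are hypothetical, and real platform firmware would enforce ownership through its own tables and hardware mechanisms rather than a Python dictionary.

```python
class PartitionTable:
    """Hypothetical record of which logical partition owns each resource."""

    def __init__(self):
        self._owner = {}  # resource name -> partition name

    def assign(self, resource, partition):
        # A resource may belong to at most one partition at a time.
        current = self._owner.get(resource)
        if current is not None and current != partition:
            raise ValueError(f"{resource} is already assigned to {current}")
        self._owner[resource] = partition

    def release(self, resource):
        # Releasing makes the resource assignable to any partition again.
        self._owner.pop(resource, None)

    def can_access(self, partition, resource):
        # Only the owning partition may use the resource.
        return self._owner.get(resource) == partition


table = PartitionTable()
table.assign("io_slot_0", "partition_a")
owner_ok = table.can_access("partition_a", "io_slot_0")
other_ok = table.can_access("partition_b", "io_slot_0")
```

Any resource can go to any partition, but a second `assign` of an owned resource to a different partition is rejected until the resource is released.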
A group of I/O devices may be controlled by a common piece of hardware, such as a Peripheral Component Interconnect (PCI) host bridge, which may control many I/O adapters or slots attached beneath the bridge. The bridge can be thought of as shared by all partitions to which its slots are assigned. Hence, if the bridge becomes inoperable, all partitions sharing the devices attached beneath the bridge are affected. Indeed, the problem may be so severe that the whole LPAR system will crash if any partition attempts to use the bridge further. In other words, the entire LPAR system fails. The normal course of action in this situation is to terminate the running partitions that share the bridge. This keeps the system from crashing due to the failure.

Usually, what fails is an I/O adapter, which causes the bridge to assume a non-usable (error) state. When this happens, the I/O failure invokes a machine check interrupt handler (MCIH), which in turn reports the error and terminates the appropriate partitions. This process is the "normal" solution that prevents the whole LPAR system from crashing due to this problem.

To correct the failure, the specific I/O adapter or I/O slot where the failure occurred must be identified. This is usually done by sequentially scanning the status registers associated with each I/O adapter. A problem arises, however, when multiple I/O adapters under the control of a single bridge encounter errors. If an error first occurs in an adapter earlier in the sequence and an error then occurs in an adapter later in the sequence, the scan may stop at the first error, and the second error may go unreported. This is because the first error condition cannot be cleared: the error condition must persist in order to keep the bridge in a frozen (error) state. A method is therefore needed for identifying multiple failures within a series of adapters.
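The limitation of the conventional scan can be made concrete with a minimal sketch. The function name `naive_scan`, the slot labels, and the dictionary standing in for the status registers are illustrative assumptions, not part of the patent.

```python
def naive_scan(scan_order, has_error):
    """Return the first slot, in scan order, whose status shows an error."""
    for slot in scan_order:
        if has_error[slot]:
            return slot  # the scan stops here; later slots are never examined
    return None


slots = ["slot_a", "slot_b", "slot_c", "slot_d"]
status = {"slot_a": False, "slot_b": True, "slot_c": False, "slot_d": False}

first = naive_scan(slots, status)

# A second adapter, later in the order, now fails. The first error cannot
# be cleared, so slot_b still reports an error on the next scan.
status["slot_c"] = True
second = naive_scan(slots, status)
```

Both scans report `slot_b`, so the failure in `slot_c` is never located, which is the gap the invention addresses.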
[Summary of the Invention]

The present invention discloses a method, computer program product, and data processing system for locating hardware faults occurring in multiple devices within a data processing system. The devices have a scanning order in which the devices (or at least information regarding the devices) are scanned to analyze any possible error condition. When a new error is detected in a device, an identification of the device is stored in a data structure. If another error is detected and causes the devices to be scanned again, the scanning process skips the device whose identity is stored in the data structure so that the new error can be located.

[Brief Description of the Drawings]

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use and further objects and advantages thereof, will best be understood by reference to the following description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a data processing system in which the present invention may be implemented;
FIG. 2 is a diagram depicting a series of devices (I/O adapters in slots), including a slot experiencing an error, in a data processing system such as that depicted in FIG. 1;
FIG. 3 is a diagram depicting the result of a machine check interrupt handler having detected the error in the series of slots originally depicted in FIG. 2;
FIG. 4 is a diagram depicting the series of slots of FIG. 3 with an additional error occurring in a slot that follows, in scanning order, the slot experiencing an error in FIG. 2;
FIG. 5 is a diagram, in accordance with a preferred embodiment of the present invention, depicting a series of slots experiencing the same errors as in FIG. 4, but including an additional data structure;
FIG. 6 is a diagram, in accordance with a preferred embodiment of the present invention, depicting the result of a machine check interrupt handler detecting a second, subsequent error; and
FIG. 7 is a flowchart, in accordance with a preferred embodiment of the present invention, of a process for locating errors in a series of devices.

[Embodiments]

With reference now to the figures, and in particular to FIG. 1, a block diagram of a data processing system in which the present invention may be implemented is depicted. Data processing system 100 may be a symmetric multiprocessor (SMP) system including a plurality of processors 101, 102, 103, and 104 connected to system bus 106. For example, data processing system 100 may be an IBM RS/6000, a product of International Business Machines Corporation in Armonk, New York, implemented as a server within a network. Alternatively, a single-processor system may be employed. Also connected to system bus 106 is memory controller/cache 108, which provides an interface to a plurality of local memories 160-163. I/O bus bridge 110 is connected to system bus 106 and provides an interface to I/O bus 112. Memory controller/cache 108 and I/O bus bridge 110 may be integrated as depicted.

Data processing system 100 is a logically partitioned data processing system. Thus, data processing system 100 may have multiple operating systems (or multiple instances of a single operating system) running simultaneously. Each of these operating systems may have any number of software programs executing within it. Data processing system 100 is logically partitioned such that different PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, and hard disk adapter 149 may be assigned to different logical partitions.
In this case, graphics adapter 148 provides a connection for a display device (not shown), while hard disk adapter 149 provides a connection to control hard disk 150.

Thus, for example, suppose data processing system 100 is divided into three logical partitions, P1, P2, and P3. Each of PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, hard disk adapter 149, each of host processors 101-104, and each of local memories 160-163 is assigned to one of the three partitions. For example, processor 101, local memory 160, and PCI I/O adapters 120, 128, and 129 may be assigned to logical partition P1; processors 102-103, local memory 161, and PCI I/O adapters 121 and 136 may be assigned to partition P2; and processor 104, local memories 162-163, graphics adapter 148, and hard disk adapter 149 may be assigned to logical partition P3.

Each operating system executing within data processing system 100 is assigned to a different logical partition. Thus, each operating system executing within data processing system 100 may access only those I/O units that are within its logical partition. For example, one instance of the Advanced Interactive Executive (AIX) operating system may be executing within partition P1, a second instance (image) of the AIX operating system may be executing within partition P2, and a Windows 2000 operating system may be operating within logical partition P3. Windows 2000 is a product and trademark of Microsoft Corporation of Redmond, Washington.

Peripheral Component Interconnect (PCI) host bridge 114, connected to I/O bus 112, provides an interface to PCI local bus 115. A number of PCI I/O adapters 120-121 may be connected to PCI bus 115 through PCI-to-PCI bridge 116, PCI bus 118, PCI bus 119, I/O slot 170, and I/O slot 171.
PCI-to-PCI bridge 116 provides an interface to PCI bus 118 and PCI bus 119. PCI I/O adapters 120 and 121 are placed into I/O slots 170 and 171, respectively. Typical PCI bus implementations support between four and eight I/O adapters (that is, expansion slots for add-in connectors). Each PCI I/O adapter 120-121 provides an interface between data processing system 100 and input/output devices such as, for example, other network computers that are clients of data processing system 100.

An additional PCI host bridge 122 provides an interface for an additional PCI bus 123. PCI bus 123 is connected to a plurality of PCI I/O adapters 128-129. PCI I/O adapters 128-129 may be connected to PCI bus 123 through PCI-to-PCI bridge 124, PCI bus 126, PCI bus 127, I/O slot 172, and I/O slot 173. PCI-to-PCI bridge 124 provides an interface to PCI bus 126 and PCI bus 127. PCI I/O adapters 128 and 129 are placed into I/O slots 172 and 173, respectively. In this manner, additional I/O devices, such as modems or network adapters, may be supported through each of PCI I/O adapters 128-129, and data processing system 100 can be connected to multiple network computers.

A memory-mapped graphics adapter 148 inserted into I/O slot 174 may be connected to I/O bus 112 through PCI bus 144, PCI-to-PCI bridge 142, PCI bus 141, and host bridge 140. Hard disk adapter 149 may be placed into I/O slot 175, which is connected to PCI bus 145. In turn, this bus is connected to PCI-to-PCI bridge 142, which is connected to PCI host bridge 140 by PCI bus 141.

A PCI host bridge 130 provides an interface for PCI bus 131 to connect to I/O bus 112. PCI I/O adapter 136 is connected to I/O slot 176, which is connected to PCI-to-PCI bridge 132 by PCI bus 133.
PCI-to-PCI bridge 132 is connected to PCI bus 131. This PCI bus also connects PCI host bridge 130 to the service processor mailbox interface and ISA bus access pass-through logic 194 and to PCI-to-PCI bridge 132. Service processor mailbox interface and ISA bus access pass-through logic 194 forwards PCI accesses destined for PCI/ISA bridge 193. NVRAM storage 192 is connected to ISA bus 196. Service processor 135 is coupled to service processor mailbox interface and ISA bus access pass-through logic 194 through its local PCI bus 195. Service processor 135 is also connected to processors 101-104 via a plurality of JTAG/I2C buses 134. JTAG/I2C buses 134 are a combination of JTAG/scan buses (see IEEE 1149.1) and Philips I2C buses. Alternatively, however, JTAG/I2C buses 134 may be replaced by only Philips I2C buses or only JTAG/scan buses. All SP-ATTN signals of host processors 101, 102, 103, and 104 are connected together to an interrupt input signal of the service processor. Service processor 135 has its own local memory 191 and has access to hardware OP-panel 190.

When data processing system 100 is initially powered up, service processor 135 uses JTAG/I2C buses 134 to interrogate the system (host) processors 101-104, memory controller/cache 108, and I/O bridge 110. At the completion of this step, service processor 135 has an inventory and topology understanding of data processing system 100. Service processor 135 also executes built-in self-tests (BISTs), basic assurance tests (BATs), and memory tests on all elements found by interrogating host processors 101-104, memory controller/cache 108, and I/O bridge 110. Any error information for failures detected during the BISTs, BATs, and memory tests is gathered and reported by service processor 135.
If a meaningful/valid configuration of system resources is still possible after taking out the elements found to be faulty during the BISTs, BATs, and memory tests, then data processing system 100 is allowed to proceed to load executable code into local (host) memories 160-163. Service processor 135 then releases host processors 101-104 for execution of the code loaded into host memories 160-163. While host processors 101-104 are executing code from the respective operating systems within data processing system 100, service processor 135 enters a mode of monitoring and reporting errors. The types of items monitored by service processor 135 include, for example, cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by processors 101-104, local memories 160-163, and I/O bridge 110.

Service processor 135 is responsible for saving and reporting error information related to all the monitored items in data processing system 100. Service processor 135 may also take action based on defined thresholds. For example, service processor 135 may take note of excessive recoverable errors on a processor's cache memory and decide that this is predictive of a hard failure. Based on this determination, service processor 135 may mark that resource for deconfiguration during the current running session and future initial program loads (IPLs). IPLs are also sometimes referred to as a "boot" or "bootstrap".

Data processing system 100 may be implemented using various commercially available computer systems. For example, data processing system 100 may be implemented using an IBM eServer iSeries Model 840 system available from International Business Machines Corporation. Such a system may support logical partitioning using an OS/400 operating system, which is also available from International Business Machines Corporation.

Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 1 may vary.
For example, other peripheral devices, such as optical disk drives and the like, may also be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

The present invention provides a method, computer program product, and data processing system for locating faults within a series of devices having a scanning order for locating errors. FIG. 2 is a diagram depicting a series of devices having a scanning order within a data processing system such as that depicted in FIG. 1. A PCI host bridge 200 handles I/O for the devices in slots 202, 204, 206, and 208. The adapter in slot 204 is experiencing an error.

To address the error condition in slot 204, the machine check interrupt handler must locate the error. Typically, the machine check interrupt handler scans the status registers associated with each of slots 202, 204, 206, and 208 according to a predetermined scanning order (in this example, from left to right) to locate the error. The status registers may be contained within an I/O bridge, such as I/O bridge 110 in FIG. 1; within a PCI host bridge, such as PCI host bridge 200; or within the adapters themselves, such as the adapters in slots 202, 204, 206, and 208.

To locate the error in the adapter in slot 204, the machine check interrupt handler first checks the status register associated with slot 202. Finding that no error has occurred in the adapter in slot 202, the machine check interrupt handler proceeds to slot 204, the next slot in the sequence. Because slot 204 contains an adapter experiencing an error, the machine check interrupt handler identifies the slot as failed, depicted in FIG. 3 as "crossed out". Identifying the error in the adapter in that slot causes PCI host bridge 200 to be placed in an error state, likewise depicted in FIG. 3 as "crossed out".
The machine check interrupt handler then ends the process of locating the error.

PCI host bridge 200 must remain in an error state until the problem in slot 204 is corrected, in order to avoid a system crash. As a result, slot 204 cannot clear its error state. One consequence is that if an additional error occurs in an adapter in a slot later in the scanning order, the additional error may not be identified. For example, FIG. 4 shows the adapter in slot 206 encountering an error. Because slot 204 also contains an adapter experiencing an error, the machine check interrupt handler checks the status register for slot 202, then checks the status register for slot 204, detects the error condition in the adapter in slot 204, and ends the process of locating errors before ever reaching slot 206.

The present invention introduces an additional data structure to address this situation, such as data structure 500 depicted in FIG. 5. In a preferred embodiment, data structure 500 is recorded in a memory device such as NVRAM storage 192 in FIG. 1. Data structure 500 serves as a log, recording errors as they are identified by the machine check interrupt handler. Because the error in the adapter in slot 204 has already been detected in FIG. 5, data structure 500 shows a record for that slot.

In a preferred embodiment of the present invention, the next time the machine check interrupt handler scans slots 202, 204, 206, and 208, it first checks the status register associated with slot 202 and then the status register associated with slot 204. When the machine check interrupt handler reaches slot 204, however, it searches data structure 500 for a record of the error in that slot.
When the machine check interrupt handler finds that the error in slot 204 is already recorded in data structure 500, it goes on to check the status register associated with slot 206. As shown in FIG. 6, the error in slot 206 is then identified, and data structure 500 is updated to include the newly detected error.

FIG. 7 is a flowchart of a process for locating errors in a series of devices in accordance with a preferred embodiment of the present invention. In a preferred embodiment, the errors occur in I/O adapters contained in a series of I/O slots. One of ordinary skill in the art will recognize, however, that any set of devices arranged in a series and having a scanning order may be scanned for errors by the process depicted in FIG. 7. The process is not limited to the preferred embodiment.

First, a determination is made as to whether all of the slots have been scanned (step 700). If not, that is, if any slots have not yet been scanned for errors, the status register associated with the next slot in the sequence is checked (step 702). A determination is then made as to whether an error has occurred in that slot (step 704). If not, the process returns to step 700 to check the next slot, if any. If an error has occurred, a determination is made as to whether the error has already been recorded in an appropriate data structure, such as data structure 500 in FIG. 5 (step 706). If the error has been recorded, the process returns to step 700 to check the next slot, if any. If the error has not been recorded, however, the slot is identified as experiencing an error (step 708), and a record of the error is stored in an appropriate data structure, such as data structure 500 in FIG. 5 (step 710). After step 710, the process ends.
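The flow of FIG. 7 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `locate_new_error`, the use of a Python `set` to stand in for data structure 500, and a dictionary lookup standing in for reading a slot's status register are all hypothetical, and an actual machine check interrupt handler would run in firmware and keep its log in a store such as NVRAM.

```python
def locate_new_error(scan_order, read_status, logged):
    """One pass of the FIG. 7 flow.

    scan_order  -- slot identifiers in scanning order
    read_status -- callable: slot -> True if its status shows an error
    logged      -- set of slots already recorded (stand-in for structure 500)
    """
    for slot in scan_order:        # steps 700 and 702: take the next slot
        if not read_status(slot):  # step 704: no error in this slot
            continue
        if slot in logged:         # step 706: error already recorded, skip it
            continue
        logged.add(slot)           # steps 708 and 710: identify and record
        return slot
    return None                    # step 700: every slot has been scanned


order = [202, 204, 206, 208]
status = {202: False, 204: True, 206: False, 208: False}
log = set()

first = locate_new_error(order, status.get, log)   # error in slot 204

status[206] = True  # slot 204 stays in error; slot 206 now fails as well
second = locate_new_error(order, status.get, log)  # slot 204 is skipped

third = locate_new_error(order, status.get, log)   # nothing new to report
```

Because slot 204 is already in the log on the second pass, the scan continues past it and reaches slot 206, which the unmodified scan would never examine.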
Alternatively, the process may end at step 700 when no slots remain to be scanned.

It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer-readable medium of instructions and in a variety of other forms, and that the present invention applies equally regardless of the particular type of signal-bearing media actually used to carry out the distribution. Examples of computer-readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, and DVD-ROMs, and transmission-type media, such as digital and analog communications links and wired or wireless communications links using transmission forms such as radio frequency and light wave transmissions. The computer-readable media may take the form of coded formats that are decoded for actual use in a particular data processing system. Functional descriptive material is information that imparts functionality to a machine. Functional descriptive material includes, but is not limited to, computer programs, instructions, rules, facts, definitions of computable functions, objects, and data structures.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
[Description of Symbols of the Drawings]
100  Data processing system
101, 102, 103, 104  Processor
106  System bus
108  Memory controller/cache memory
110  I/O bridge
112  I/O bus
114, 122, 130, 140  PCI host bridge
116, 124, 132, 142  PCI-to-PCI bridge
118, 119, 123, 126, 131, 133, 141, 144, 145  PCI bus
120, 121, 128, 129, 136  PCI I/O adapter
134  JTAG/I2C bus
135  Service processor
148  Graphics adapter
149  Hard disk adapter
150  Hard disk
160, 161, 162, 163  Local memory
170, 171, 172, 173, 174, 175, 176  I/O slot
190  OP panel
191  Memory
192  Non-volatile random access memory
194  Service processor mailbox interface and ISA access pass-through logic
196  ISA bus
200  PCI host bridge
201  Slot 1
204  Slot 2
206  Slot 3
208  Slot 4

Claims

1. A method comprising: detecting an occurrence of an error in a first device among a plurality of devices, wherein the plurality of devices is associated with a scanning order; and scanning information about the plurality of devices in the scanning order to identify the first device, while skipping each of the plurality of devices that is identified in a data structure.
2. The method of claim 1, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
3. The method of claim 2, further comprising: in response to detecting the occurrence of the error, at least partially disabling the functionality of the supervisory device.
4. The method of claim 1, further comprising: inserting an identification of the first device in the data structure.
5. The method of claim 1, wherein the plurality of devices includes at least one integrated circuit.
6. The method of claim 5, wherein the at least one integrated circuit includes at least one input/output interface circuit.
7. The method of claim 1, wherein the plurality of devices includes at least one peripheral component in a data processing system.
8. The method of claim 1, wherein scanning information about the plurality of devices includes: examining error registers in an interface circuit, wherein each of the error registers represents the status of an associated device in the plurality of devices.
9. The method of claim 1, wherein scanning information about the plurality of devices includes: analyzing the behavior of a current device from the scanning order of the plurality of devices to determine the status of the current device.
10. A computer program product in a computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including: detecting an occurrence of an error in a first device among a plurality of devices, wherein the plurality of devices is associated with a scanning order; and scanning information about the plurality of devices in the scanning order to identify the first device, while skipping each of the plurality of devices that is identified in a data structure.
11. The computer program product of claim 10, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
12. The computer program product of claim 11, comprising additional functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including: in response to detecting the occurrence of the error, at least partially disabling the functionality of the supervisory device.
13. The computer program product of claim 10, comprising additional functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including: inserting an identification of the first device in the data structure.
14. The computer program product of claim 10, wherein the plurality of devices includes at least one integrated circuit.
15. The computer program product of claim 14, wherein the at least one integrated circuit includes at least one input/output interface circuit.
16. The computer program product of claim 10, wherein the plurality of devices includes at least one peripheral component in a data processing system.
17. The computer program product of claim 10, wherein scanning information about the plurality of devices includes: examining error registers in an interface circuit, wherein each of the error registers represents the status of an associated device in the plurality of devices.
18. The computer program product of claim 10, wherein scanning information about the plurality of devices includes: analyzing the behavior of a current device from the scanning order of the plurality of devices to determine the status of the current device.
19. A data processing system comprising: at least one processor; memory in communication with the at least one processor; a plurality of devices in communication with the at least one processor and having a scanning order; and a set of instructions in the memory, wherein the at least one processor executes the set of instructions to perform acts including: detecting an occurrence of an error in a first device among the plurality of devices, wherein the plurality of devices is associated with the scanning order; and scanning information about the plurality of devices in the scanning order to identify the first device, while skipping each of the plurality of devices that is identified in a data structure.
20. The data processing system of claim 19, wherein the data structure is stored in a supervisory device that is in communication with the plurality of devices.
21. The data processing system of claim 20, wherein the at least one processor executes the set of instructions to perform additional acts including: in response to detecting the occurrence of the error, at least partially disabling the functionality of the supervisory device.
22. The data processing system of claim 19, wherein the at least one processor executes the set of instructions to perform additional acts including: inserting an identification of the first device in the data structure.
23. The data processing system of claim 19, wherein the plurality of devices includes at least one integrated circuit.
24. The data processing system of claim 23, wherein the at least one integrated circuit includes at least one input/output interface circuit.
25. The data processing system of claim 19, wherein the plurality of devices includes at least one peripheral component in a data processing system.
26. The data processing system of claim 19, wherein scanning information about the plurality of devices includes: examining error registers in an interface circuit, wherein each of the error registers represents the status of an associated device in the plurality of devices.
27. The data processing system of claim 19, wherein scanning information about the plurality of devices includes: analyzing the behavior of a current device from the scanning order of the plurality of devices to determine the status of the current device.