CN110389849A - A kind of Fault Locating Method of PCIe device, system and server - Google Patents

Fault locating method, system and server for a PCIe device

Info

Publication number
CN110389849A
Authority
CN
China
Prior art keywords
address
pcie device
server
error
pcie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910655652.1A
Other languages
Chinese (zh)
Inventor
聂海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd filed Critical Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201910655652.1A priority Critical patent/CN110389849A/en
Publication of CN110389849A publication Critical patent/CN110389849A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault locating method for a PCIe device, applied to a BMC. The BMC receives the IO addresses of all PCIe devices in the server, as captured by the BIOS, and saves them to a storage unit. When a server fault is subsequently detected, the BMC automatically captures the faulting IO address from the status registers and compares it with the IO addresses in the storage unit to determine the faulty PCIe device corresponding to the faulting IO address. The whole process is implemented by the BMC alone, with no external debugging equipment and no reserved interfaces, which reduces cost. In addition, the chassis does not need to be opened and no manual intervention by the user is required, so the locating process is simple and highly reliable. The invention also discloses a fault locating system for a PCIe device and a server, which have the same beneficial effects as the fault locating method.

Description

Fault locating method, system and server for a PCIe device
Technical field
The present invention relates to the field of fault locating technology, and in particular to a fault locating method, system and server for a PCIe device.
Background
With the rapid development of servers, users place ever higher demands on server stability, requiring both simple operation and guaranteed performance. In the prior art, however, fault locating for PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) devices fails to meet the requirement of simple operation.
Specifically, to locate a fault in a PCIe device in the prior art, the user usually first has to connect a DCI (Direct Connect Interface) device to a USB 3.0 port of the server, or open the server chassis and connect an ITP (In-Target Probe) device to the ITP interface on the motherboard. Through the DCI or ITP device, the user then double-clicks the Cscripts script and manually issues commands to capture the faulting IO address, and then enters pci.resources() to obtain the IO addresses of all PCIe devices in the server, so that the faulty PCIe device can be determined from the faulting IO address. The prior art therefore requires the server to reserve ITP and USB 3.0 interfaces and depends on DCI and ITP devices, which is costly. Moreover, the user must connect the equipment in advance or even open the chassis (after which the fault may no longer reproduce, making the result unreliable), and must operate the DCI or ITP device manually on site, so the locating process is cumbersome.
Summary of the invention
The object of the present invention is to provide a fault locating method, system and server for a PCIe device that reduce cost and make the locating process simple and highly reliable.
To solve the above technical problem, the present invention provides a fault locating method for a PCIe device, applied to a BMC, comprising:
receiving the IO addresses of all PCIe devices in the server, as captured by the BIOS, and saving the IO addresses to a storage unit;
detecting whether the server has failed and, if so, automatically capturing the faulting IO address from the status registers;
comparing the faulting IO address with the IO addresses in the storage unit to determine the faulty PCIe device corresponding to the faulting IO address.
Preferably, the IO addresses include MMIO addresses and/or I/O space addresses.
Preferably, detecting whether the server has failed comprises:
detecting whether a physical address error has appeared in the system log of the server.
Preferably, the status registers include MSR registers, CSR registers and MC registers.
Preferably, the storage unit is shared memory.
Preferably, after determining the faulty PCIe device corresponding to the faulting IO address, the method further comprises:
sending an SEL log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
To solve the above technical problem, the present invention also provides a fault locating system for a PCIe device, applied to a BMC, comprising:
a receiving unit, configured to receive the IO addresses of all PCIe devices in the server, as captured by the BIOS, and to save the IO addresses to a storage unit;
a detection unit, configured to detect whether the server has failed and, if so, to trigger a capture unit;
the capture unit, configured to automatically capture the faulting IO address from the status registers;
a determination unit, configured to compare the faulting IO address with the IO addresses in the storage unit and determine the faulty PCIe device corresponding to the faulting IO address.
Preferably, the system further comprises:
a sending unit, configured to send an SEL log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
To solve the above technical problem, the present invention also provides a server, comprising:
a PCIe interface;
a PCIe device connected to the PCIe interface;
a BMC, configured to implement the steps of the fault locating method for a PCIe device described above when executing a computer program.
Preferably, the PCIe device includes a network card and/or a graphics card and/or a storage card and/or a sound card.
The present invention provides a fault locating method for a PCIe device, applied to a BMC. The BMC receives the IO addresses of all PCIe devices in the server, as captured by the BIOS, and saves them to a storage unit. When a server fault is subsequently detected, the BMC automatically captures the faulting IO address from the status registers and compares it with the IO addresses in the storage unit to determine the faulty PCIe device corresponding to the faulting IO address. The whole process is implemented by the BMC alone, with no external debugging equipment and no reserved interfaces, which reduces cost. In addition, the chassis does not need to be opened and no manual intervention by the user is required, so the locating process is simple and highly reliable.
The present invention also provides a fault locating system for a PCIe device and a server, which have the same beneficial effects as the above fault locating method.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the prior art and the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a fault locating method for a PCIe device provided by the present invention;
Fig. 2 is a structural diagram of a fault locating system for a PCIe device provided by the present invention;
Fig. 3 is a schematic structural diagram of a server provided by the present invention.
Detailed description of the embodiments
The core of the present invention is to provide a fault locating method, system and server for a PCIe device that reduce cost and make the locating process simple and highly reliable.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a flowchart of a fault locating method for a PCIe device provided by the present invention. The method is applied to a BMC (Baseboard Management Controller) and comprises:
S11: receiving the IO addresses of all PCIe devices in the server, as captured by the BIOS (Basic Input/Output System), and saving the IO addresses to a storage unit;
It should first be noted that a PCIe device in this application means a device connected to a PCIe interface, such as a network card, graphics card, storage card or sound card. In addition, in this application the BMC model may be, but is not limited to, the AST2500.
Specifically, after the server is powered on it first performs a power-on self-test, and then each component in the server (including the PCIe devices) is initialized. Once all PCIe devices in the server have completed initialization, the BIOS calls its memory address mapping module to capture the IO addresses of all PCIe devices in the server, that is, each PCIe device together with its corresponding IO address. The BIOS then sends the captured IO addresses to the BMC through the LPC (Low Pin Count) interface, and the BMC saves them to the storage unit on receipt so that they are available for the later IO address comparison.
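The receive-and-store step can be sketched as follows. This is only an illustration: the patent does not specify how the BIOS encodes the address list, so the comma-separated record layout, the `IoRange` type, the `parse_bios_records` helper and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoRange:
    """One address window claimed by a PCIe device (hypothetical record layout)."""
    device: str  # device identifier as reported by the BIOS, e.g. bus:device.function
    kind: str    # "mmio" or "io"
    base: int    # first address of the window
    size: int    # window length in bytes


def parse_bios_records(lines):
    """Parse 'device,kind,base,size' text records into IoRange objects."""
    table = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        device, kind, base, size = [field.strip() for field in line.split(",")]
        table.append(IoRange(device, kind, int(base, 16), int(size, 16)))
    return table


# Example records; the identifiers and addresses are made up for illustration.
records = [
    "0000:3b:00.0,mmio,0xc6000000,0x100000",
    "0000:5e:00.0,io,0x3000,0x100",
]
address_table = parse_bios_records(records)
```

Keeping the base and size of each window, rather than a single address, lets the later comparison test whether a faulting address falls anywhere inside a device's window.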
As a preferred embodiment, the storage unit is shared memory. In this embodiment shared memory is chosen as the storage unit so that the IO addresses can be accessed by different processes, which makes better use of the data. Other types of storage unit may of course be used to store the IO addresses; this application places no particular limitation on the choice.
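As a rough illustration of why shared memory helps here, the sketch below writes an address table into a named shared-memory segment and reads it back by name, as a second process could. It uses Python's standard `multiprocessing.shared_memory` module (Python 3.8 or later); the fixed 16-byte record layout and the `write_table` and `read_table` helpers are assumptions for the example, not something the patent describes.

```python
import struct
from multiprocessing import shared_memory

# Assumed record layout: base and size as unsigned 64-bit little-endian integers.
RECORD = struct.Struct("<QQ")


def write_table(ranges):
    """Create a shared-memory segment and pack (base, size) pairs into it."""
    shm = shared_memory.SharedMemory(create=True, size=max(1, RECORD.size * len(ranges)))
    for i, (base, size) in enumerate(ranges):
        RECORD.pack_into(shm.buf, i * RECORD.size, base, size)
    return shm


def read_table(name, count):
    """Attach to an existing segment by name and unpack `count` records."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return [RECORD.unpack_from(shm.buf, i * RECORD.size) for i in range(count)]
    finally:
        shm.close()


shm = write_table([(0xC6000000, 0x100000), (0x3000, 0x100)])
try:
    table = read_table(shm.name, 2)  # another process would only need shm.name
finally:
    shm.close()
    shm.unlink()  # release the segment once no process needs it
```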
As a preferred embodiment, the IO addresses include MMIO addresses and/or I/O space addresses.
This embodiment takes into account that, in practice, some PCIe devices use the memory in the server, in which case the IO address of the PCIe device is an MMIO address, while other PCIe devices use their own on-board memory, in which case the IO address of the PCIe device is an I/O space address. By allowing different types of PCIe device to have different kinds of IO address, this application improves the fault locating rate for PCIe devices.
S12: detecting whether the server has failed and, if so, proceeding to S13;
Specifically, the BMC detects whether the server has failed. In practice, the BMC can perform fault detection by checking whether a physical address error or a CATERR has appeared in the system log of the server, which is a simple and reliable approach. Alternatively, a corresponding detection chip can be connected and the fault detected from the high or low level of the relevant pin of that chip. This application places no particular limitation on which detection approach is chosen.
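The log-based detection described above might look like the following sketch. The exact wording of the log entries depends on the platform firmware and operating system and is not given in the patent, so the two regular expressions and the sample log lines are assumptions.

```python
import re

# Hypothetical patterns; real log wording varies by platform.
FAULT_PATTERNS = [
    re.compile(r"CATERR", re.IGNORECASE),
    re.compile(r"physical address\s*[:=]?\s*0x[0-9a-f]+", re.IGNORECASE),
]


def server_has_fault(log_lines):
    """Return True if any log line matches one of the fault patterns."""
    return any(
        pattern.search(line)
        for line in log_lines
        for pattern in FAULT_PATTERNS
    )


sample_log = [
    "system boot complete",
    "Hardware error: physical address: 0xc6000040",
]
fault_detected = server_has_fault(sample_log)
```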
S13: automatically capturing the faulting IO address from the status registers;
This application relies on the fact that, after the server fails, the faulting IO address is temporarily held in the status registers. Accordingly, once a server fault has been detected, the BMC automatically scans the status registers and captures the faulting IO address (an MMIO address and/or an I/O space address) from them. Specifically, the status registers here include MSR (Model Specific Register) registers, CSR (Control and Status Register) registers, MC (Machine Check) registers and the like.
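As one possible illustration of this scan, the sketch below filters a snapshot of machine-check banks for entries that carry a valid address. It assumes the x86 machine-check convention, in which bit 63 (VAL) of a bank's status register marks a valid error record and bit 58 (ADDRV) marks the bank's address register as valid. The patent does not say which bits the BMC inspects or how it reads the registers, so the snapshot format and the `faulting_addresses` helper are hypothetical.

```python
MCI_STATUS_VAL = 1 << 63    # bank holds a valid error record
MCI_STATUS_ADDRV = 1 << 58  # the bank's address register holds a valid address


def faulting_addresses(banks):
    """Return addresses from banks whose status has both VAL and ADDRV set.

    banks: iterable of (status, addr) integer pairs, one per bank snapshot.
    """
    found = []
    for status, addr in banks:
        if status & MCI_STATUS_VAL and status & MCI_STATUS_ADDRV:
            found.append(addr)
    return found


# Made-up snapshot: only the second bank reports a valid address.
snapshot = [
    (0, 0x1234),
    (MCI_STATUS_VAL | MCI_STATUS_ADDRV, 0xC6000040),
    (MCI_STATUS_VAL, 0x5),
]
captured = faulting_addresses(snapshot)
```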
S14: comparing the faulting IO address with the IO addresses in the storage unit to determine the faulty PCIe device corresponding to the faulting IO address.
After capturing the faulting IO address, the BMC compares it with the IO addresses in the storage unit, finds the PCIe device to which the faulting IO address belongs, and identifies that device as the faulty PCIe device, thereby locating the fault.
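The comparison in S14 amounts to a range lookup. The sketch below assumes each stored entry is a (device, kind, base, size) tuple and treats a match as the faulting address falling inside a device's window; the patent only says the addresses are compared, so the window test, the tuple layout and the sample values are assumptions.

```python
def locate_faulty_device(address_table, fault_addr, kind="mmio"):
    """Return the device whose window of the given kind contains fault_addr, else None."""
    for device, range_kind, base, size in address_table:
        if range_kind == kind and base <= fault_addr < base + size:
            return device
    return None


# Made-up table: one device may own several windows of different kinds.
address_table = [
    ("0000:3b:00.0", "mmio", 0xC6000000, 0x100000),
    ("0000:5e:00.0", "mmio", 0xC7000000, 0x200000),
    ("0000:5e:00.0", "io", 0x3000, 0x100),
]
faulty = locate_faulty_device(address_table, 0xC7000040)
```

Matching on the address kind as well as the value avoids confusing a small I/O space address with an MMIO address that happens to have the same numeric value.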
The whole process is thus implemented by the BMC alone, with no external debugging equipment and no reserved interfaces, which reduces cost. In addition, the chassis does not need to be opened and no manual intervention by the user is required, so the locating process is simple and highly reliable.
As a preferred embodiment, after determining the faulty PCIe device corresponding to the faulting IO address, the method further comprises:
sending an SEL (System Event Log) log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
After the faulty PCIe device has been determined, the BMC sends an SEL log to its web interface for display. On seeing the SEL log, the user can decide, according to the actual situation, whether to reseat the faulty PCIe device or replace it, and then power the server on again to check whether the fault has been cleared. Displaying the result in this way lets the user learn of the fault promptly and carry out follow-up maintenance, which improves the operational reliability and stability of the server.
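A minimal sketch of the notification payload is shown below. It builds a human-readable record only and is not the binary SEL record format defined by IPMI; the field names, the severity value and the `build_sel_message` helper are assumptions for illustration.

```python
from datetime import datetime, timezone


def build_sel_message(device, fault_addr, when=None):
    """Build a display-oriented event record for the faulty PCIe device."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "severity": "critical",
        "device": device,
        "fault_address": hex(fault_addr),
        "advice": "Reseat or replace the PCIe device, then power on again.",
    }


event = build_sel_message(
    "0000:5e:00.0",
    0xC7000040,
    when=datetime(2019, 7, 19, tzinfo=timezone.utc),
)
```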
Referring to Fig. 2, Fig. 2 is a structural diagram of a fault locating system for a PCIe device provided by the present invention. The fault locating system is applied to a BMC and comprises:
a receiving unit 21, configured to receive the IO addresses of all PCIe devices in the server, as captured by the BIOS, and to save the IO addresses to a storage unit;
a detection unit 22, configured to detect whether the server has failed and, if so, to trigger a capture unit 23;
the capture unit 23, configured to automatically capture the faulting IO address from the status registers;
a determination unit 24, configured to compare the faulting IO address with the IO addresses in the storage unit and determine the faulty PCIe device corresponding to the faulting IO address.
As a preferred embodiment, the system further comprises:
a sending unit, configured to send an SEL log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
For details of the fault locating system for a PCIe device provided by the present invention, refer to the method embodiment above; they are not repeated here.
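The way the units hand off to one another can be sketched as a small class in which the detection and capture units are injected as callables. The `FaultLocator` class, its method names and the stub callables are hypothetical; the sketch only mirrors the division of responsibilities among units 21 to 24 and leaves the sending unit out.

```python
class FaultLocator:
    """Wires together the receiving, detection, capture and determination roles."""

    def __init__(self, detect, capture):
        self._table = []         # storage unit: list of (device, base, size)
        self._detect = detect    # detection unit: () -> bool
        self._capture = capture  # capture unit: () -> int or None

    def receive(self, ranges):
        """Receiving unit: save the address list sent by the BIOS."""
        self._table = list(ranges)

    def run_once(self):
        """Detect, capture and determine; return the faulty device or None."""
        if not self._detect():
            return None
        addr = self._capture()
        if addr is None:
            return None
        # Determination unit: find the window that contains the captured address.
        for device, base, size in self._table:
            if base <= addr < base + size:
                return device
        return None


# Stub units standing in for real log scanning and register reads.
locator = FaultLocator(detect=lambda: True, capture=lambda: 0xC6000040)
locator.receive([("0000:3b:00.0", 0xC6000000, 0x100000)])
result = locator.run_once()
```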
Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a server provided by the present invention. The server comprises:
a PCIe interface 32;
a PCIe device 31 connected to the PCIe interface 32;
a BMC 33, configured to implement the steps of the above fault locating method for a PCIe device when executing a computer program.
For details of the server provided by the present invention, refer to the method embodiment above; they are not repeated here.
As a preferred embodiment, the PCIe device 31 includes a network card and/or a graphics card and/or a storage card and/or a sound card. The PCIe device here may of course also include other types of card; this application places no particular limitation on this.
It should be noted that, in this specification, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Unless further limited, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A fault locating method for a PCIe device, characterized in that it is applied to a BMC and comprises:
receiving the IO addresses of all PCIe devices in the server, as captured by the BIOS, and saving the IO addresses to a storage unit;
detecting whether the server has failed and, if so, automatically capturing the faulting IO address from the status registers;
comparing the faulting IO address with the IO addresses in the storage unit to determine the faulty PCIe device corresponding to the faulting IO address.
2. The fault locating method for a PCIe device according to claim 1, characterized in that the IO addresses include MMIO addresses and/or I/O space addresses.
3. The fault locating method for a PCIe device according to claim 1, characterized in that detecting whether the server has failed comprises:
detecting whether a physical address error has appeared in the system log of the server.
4. The fault locating method for a PCIe device according to claim 1, characterized in that the status registers include MSR registers, CSR registers and MC registers.
5. The fault locating method for a PCIe device according to claim 1, characterized in that the storage unit is shared memory.
6. The fault locating method for a PCIe device according to any one of claims 1 to 5, characterized in that, after determining the faulty PCIe device corresponding to the faulting IO address, the method further comprises:
sending an SEL log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
7. A fault locating system for a PCIe device, characterized in that it is applied to a BMC and comprises:
a receiving unit, configured to receive the IO addresses of all PCIe devices in the server, as captured by the BIOS, and to save the IO addresses to a storage unit;
a detection unit, configured to detect whether the server has failed and, if so, to trigger a capture unit;
the capture unit, configured to automatically capture the faulting IO address from the status registers;
a determination unit, configured to compare the faulting IO address with the IO addresses in the storage unit and determine the faulty PCIe device corresponding to the faulting IO address.
8. The fault locating system for a PCIe device according to claim 7, characterized in that it further comprises:
a sending unit, configured to send an SEL log to the web interface of the BMC for display, to notify the user to reseat or replace the faulty PCIe device.
9. A server, characterized in that it comprises:
a PCIe interface;
a PCIe device connected to the PCIe interface;
a BMC, configured to implement the steps of the fault locating method for a PCIe device according to any one of claims 1 to 6 when executing a computer program.
10. The server according to claim 9, characterized in that the PCIe device includes a network card and/or a graphics card and/or a storage card and/or a sound card.
CN201910655652.1A 2019-07-19 2019-07-19 A kind of Fault Locating Method of PCIe device, system and server Withdrawn CN110389849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910655652.1A CN110389849A (en) 2019-07-19 2019-07-19 A kind of Fault Locating Method of PCIe device, system and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910655652.1A CN110389849A (en) 2019-07-19 2019-07-19 A kind of Fault Locating Method of PCIe device, system and server

Publications (1)

Publication Number Publication Date
CN110389849A true CN110389849A (en) 2019-10-29

Family

ID=68286807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910655652.1A Withdrawn CN110389849A (en) 2019-07-19 2019-07-19 A kind of Fault Locating Method of PCIe device, system and server

Country Status (1)

Country Link
CN (1) CN110389849A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625382A (en) * 2020-05-21 2020-09-04 浪潮电子信息产业股份有限公司 Server fault diagnosis method, device, equipment and medium
CN112527192A (en) * 2020-12-01 2021-03-19 联想(北京)有限公司 Data acquisition method and device and service equipment
CN112699073A (en) * 2021-01-06 2021-04-23 同方计算机有限公司 PCIE card on-line replacement method and system with controllable BMC system
CN113868051A (en) * 2021-09-18 2021-12-31 苏州浪潮智能科技有限公司 PCIe fault detection device, method, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625382A (en) * 2020-05-21 2020-09-04 浪潮电子信息产业股份有限公司 Server fault diagnosis method, device, equipment and medium
CN111625382B (en) * 2020-05-21 2022-06-10 浪潮电子信息产业股份有限公司 Server fault diagnosis method, device, equipment and medium
CN112527192A (en) * 2020-12-01 2021-03-19 联想(北京)有限公司 Data acquisition method and device and service equipment
CN112699073A (en) * 2021-01-06 2021-04-23 同方计算机有限公司 PCIE card on-line replacement method and system with controllable BMC system
CN113868051A (en) * 2021-09-18 2021-12-31 苏州浪潮智能科技有限公司 PCIe fault detection device, method, equipment and storage medium
CN113868051B (en) * 2021-09-18 2023-08-08 苏州浪潮智能科技有限公司 PCIe fault detection device, method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110389849A (en) A kind of Fault Locating Method of PCIe device, system and server
JP6530774B2 (en) Hardware failure recovery system
CN107479721B (en) Storage device, system and method for remote multicomputer switching technology
US6189109B1 (en) Method of remote access and control of environmental conditions
US6742139B1 (en) Service processor reset/reload
US6088816A (en) Method of displaying system status
US6163849A (en) Method of powering up or powering down a server to a maintenance state
US7543191B2 (en) Method and apparatus for isolating bus failure
CN1949182A (en) Detecting correctable errors and logging information relating to their location in memory
CN112286709B (en) Diagnosis method, diagnosis device and diagnosis equipment for server hardware faults
US20080140895A1 (en) Systems and Arrangements for Interrupt Management in a Processing Environment
CN109143954B (en) System and method for realizing controller reset
WO2021212943A1 (en) Server power supply maintenance method, apparatus and device, and medium
US10275330B2 (en) Computer readable non-transitory recording medium storing pseudo failure generation program, generation method, and generation apparatus
US7378977B2 (en) Current overload detecting system and method
KR20100038038A (en) Single shared power domain dynamic load based power loss detection and notification
CN112000535A (en) SAS Expander card-based hard disk abnormity identification method and processing method
CN113656339B (en) NVME hot plug processing method, BMC, device, equipment and medium
US20080288828A1 (en) structures for interrupt management in a processing environment
WO2019128784A1 (en) Nvme storage extension system
CN109710479B (en) Processing method, first device and second device
CN115964218A (en) Method and device for identifying fault of high-speed serial computer expansion bus equipment
US8689059B2 (en) System and method for handling system failure
US11314582B2 (en) Systems and methods for dynamically resolving hardware failures in an information handling system
US10817397B2 (en) Dynamic device detection and enhanced device management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20191029