WO2020048174A1 - Fault diagnosis system and server - Google Patents

Fault diagnosis system and server

Info

Publication number
WO2020048174A1
Authority
WO
WIPO (PCT)
Prior art keywords
pull
switch
unit
control module
fault diagnosis
Prior art date
Application number
PCT/CN2019/090352
Inventor
金科
周栋树
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP19857042.6A (patent EP3835903B1)
Publication of WO2020048174A1
Priority to US17/193,048 (patent US11347611B2)

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00 Testing or monitoring of control systems or parts thereof
    • G05B23/02 Electric testing or monitoring
    • G05B23/0205 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G05B23/0218 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterised by the fault detection method dealing with either existing or incipient faults
    • G05B23/0224 Process history based detection method, e.g. whereby history implies the availability of large amounts of data
    • G05B23/0227 Qualitative history assessment, whereby the type of data acted upon, e.g. waveforms, images or patterns, is not relevant, e.g. rule based assessment; if-then decisions
    • G05B23/0235 Qualitative history assessment, whereby the type of data acted upon, e.g. waveforms, images or patterns, is not relevant, e.g. rule based assessment; if-then decisions based on a comparison with predetermined threshold or range, e.g. "classical methods", carried out during normal operation; threshold adaptation or choice; when or how to compare with the threshold
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00 Testing or monitoring of control systems or parts thereof
    • G05B23/02 Electric testing or monitoring
    • G05B23/0205 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/04 Programme control other than numerical control, i.e. in sequence controllers or logic controllers
    • G05B19/05 Programmable logic controllers, e.g. simulating logic interconnections of signals according to ladder diagrams or function charts
    • G05B19/058 Safety, monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0772 Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G06F11/2242 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03K PULSE TECHNIQUE
    • H03K17/00 Electronic switching or gating, i.e. not by contact-making and –breaking
    • H03K17/002 Switching arrangements with several input- or output terminals
    • H03K17/005 Switching arrangements with several input- or output terminals with several inputs only

Definitions

  • the embodiments of the present application relate to the technical field of circuits, and in particular, to a fault diagnosis system and a server.
  • CPU central processing unit
  • In the CPU fault indication signal circuit topology, the CPU fault indication signal of each node board is transmitted separately to the server management board through the backplane, and the CPU fault indication signals of every two CPUs are aggregated on the server management board.
  • After level shifting, the CPU fault indication signal is uploaded to the complex programmable logic device (CPLD), and the signals are aggregated inside the CPLD according to the current hard partition status to obtain the CPU fault indication signal.
  • Because the CPU fault indication signal of each node board is transmitted separately to the server management board through the backplane, all isolation circuits and level conversion circuits must be placed on the server management board, which increases the circuit complexity of the server management board and the difficulty of wiring the CPU fault indication signal circuit topology.
  • An embodiment of the present application provides a fault diagnosis system for detecting fault indication signals of multiple CPUs of a server through the fault diagnosis system, thereby reducing the complexity of circuit wiring of the fault diagnosis system.
  • the embodiment of the present application provides a fault diagnosis system according to a first aspect.
  • the fault diagnosis system includes:
  • The fault diagnosis system may be applied to a server. The fault diagnosis system includes a control unit, a first management board, a first pull-up unit and a second pull-up unit, a first pull-up switch and a second pull-up switch, and at least one central processing unit. The first pull-up unit is electrically connected to the first pull-up switch, the second pull-up unit is electrically connected to the second pull-up switch, the control unit is electrically connected to the first management board, and the control unit is electrically connected to the first pull-up switch and the second pull-up switch, respectively. The control unit is configured to receive a hard partition signal sent by the first management board and, according to the hard partition signal, control the first pull-up switch and the second pull-up switch to be closed respectively, so that each central processing unit is electrically connected to the first pull-up switch and the second pull-up switch respectively, forming a fault diagnosis line.
  • the fault diagnosis line includes a line from a first pull-up unit, a first pull-up switch, the
  • The fault diagnosis system further includes a first analog switch and a second analog switch.
  • The control unit includes a first control module and a second control module. The first control module is electrically connected to the first pull-up switch and the first analog switch, respectively, and the second control module is electrically connected to the second pull-up switch and the second analog switch, respectively.
  • The first control module is configured to receive the hard partition signal sent by the first management board and, according to the hard partition signal, control the first pull-up switch and the first analog switch to be closed respectively.
  • The second control module is configured to receive the hard partition signal sent by the first management board and, according to the hard partition signal, control the second pull-up switch and the second analog switch to be closed respectively, so that each central processor is also electrically connected to the first analog switch and the second analog switch, respectively, to form the fault diagnosis line.
  • Between the first pull-up switch and the second pull-up switch, the fault diagnosis line further includes the first analog switch and the second analog switch.
  • The fault diagnosis system further includes a third analog switch and a fourth analog switch.
  • The control unit further includes a third control module and a fourth control module. The third control module is electrically connected to the third analog switch, and the fourth control module is electrically connected to the fourth analog switch.
  • The third control module is configured to receive the hard partition signal sent by the first management board and control the third analog switch to be closed according to the hard partition signal.
  • The fourth control module is configured to receive the hard partition signal sent by the first management board and control the fourth analog switch to be closed according to the hard partition signal, so that each central processor is further electrically connected to the third analog switch and the fourth analog switch, respectively, to form the fault diagnosis circuit.
  • The fault diagnosis circuit further includes the third analog switch and the fourth analog switch.
  • The fault diagnosis system further includes a third pull-up unit and a third pull-up switch, and a fourth pull-up unit and a fourth pull-up switch, wherein the third pull-up switch is electrically connected to the first management board, the third pull-up unit is electrically connected to the third pull-up switch, the fourth pull-up switch is electrically connected to the first management board, and the fourth pull-up unit is electrically connected to the fourth pull-up switch.
  • The fault diagnosis system further includes a fifth pull-up unit and a fifth pull-up switch, a sixth pull-up unit and a sixth pull-up switch, a seventh pull-up unit and a seventh pull-up switch, and an eighth pull-up unit and an eighth pull-up switch, wherein the fifth pull-up switch is electrically connected to the first management board, the fifth pull-up unit is electrically connected to the fifth pull-up switch, the sixth pull-up switch is electrically connected to the first management board, the sixth pull-up unit is electrically connected to the sixth pull-up switch, the seventh pull-up switch is electrically connected to the first management board, the seventh pull-up unit is electrically connected to the seventh pull-up switch, the eighth pull-up switch is electrically connected to the first management board, and the eighth pull-up unit is electrically connected to the eighth pull-up switch.
  • the fault diagnosis system further includes a center backplane for connecting the node boards corresponding to the first control module, the second control module, the third control module, and the fourth control module, respectively.
  • The first management board is further configured to send in-position information to the first control module, the second control module, the third control module, and/or the fourth control module, where the in-position information is used to indicate a state of the node board corresponding to the first control module, the second control module, the third control module, and/or the fourth control module.
  • the fault diagnosis system further includes a level conversion unit, and the level conversion unit is electrically connected to the first management board.
  • The fault diagnosis system further includes a second management board, and the second management board is electrically connected to the first control module, the second control module, the third control module, and the fourth control module, respectively.
  • The hard partition information includes 2P mode information, 4P mode information, or 8P mode information, wherein the 2P mode information corresponds to 2 processors of the at least one processor, the 4P mode information corresponds to 4 processors of the at least one processor, and the 8P mode information corresponds to 8 processors of the at least one processor.
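The three hard partition modes can be sketched as a small lookup. This is only an illustration: the names `HardPartitionMode` and `partitions`, and the choice to group consecutively numbered CPUs, are assumptions and are not stated in the application.

```python
from enum import Enum


class HardPartitionMode(Enum):
    """Hard partition modes named in the text; the value is the CPU count per partition."""
    MODE_2P = 2
    MODE_4P = 4
    MODE_8P = 8


def partitions(total_cpus: int, mode: HardPartitionMode) -> list[list[int]]:
    """Split CPU numbers 1..total_cpus into consecutive groups of mode.value CPUs.

    Consecutive grouping is an assumption made for this sketch.
    """
    size = mode.value
    if total_cpus % size != 0:
        raise ValueError("total_cpus must be a multiple of the partition size")
    return [list(range(start, start + size))
            for start in range(1, total_cpus + 1, size)]


if __name__ == "__main__":
    # An 8-CPU server split into two 4P partitions.
    print(partitions(8, HardPartitionMode.MODE_4P))
```

Under these assumptions, an 8-CPU server in 4P mode yields two groups of four CPUs, and in 8P mode a single group of eight.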
  • the first management board includes a first complex programmable logic device CPLD.
  • the control unit includes a second CPLD.
  • A second aspect of the embodiments of the present application provides a server, and the server includes the fault diagnosis system according to the first aspect or any one of the first implementation manner of the first aspect to the eleventh implementation manner of the first aspect.
  • control unit and the switch are all located on the node board.
  • the control unit can control the switch to change the fault indication signal topology according to the current hard partition information and in-position information, and in combination with the slot number information where the node board is located.
  • Fault synchronization between the node boards can be carried out through the middle backplane without the participation of the management board, so the line is completely decoupled from the management board, which simplifies the management board's circuits and functions. Therefore, the fault diagnosis system provided in the embodiment of the present application reduces the wiring of the management board of the server, thereby reducing the complexity of wiring between the node board and the server management board.
  • FIG. 1 is a schematic diagram of a fault diagnosis system according to an embodiment of the present application.
  • FIG. 2 is another schematic diagram of a fault diagnosis system according to an embodiment of the present application.
  • FIG. 3 is another schematic diagram of a fault diagnosis system according to an embodiment of the present application.
  • FIG. 4 is another schematic diagram of a fault diagnosis system according to an embodiment of the present application.
  • FIG. 5 is another schematic diagram of a fault diagnosis system according to an embodiment of the present application.
  • An embodiment of the present application provides a fault diagnosis system for detecting fault indication signals of multiple CPUs of a server through the fault diagnosis system, thereby reducing the complexity of circuit wiring of the fault diagnosis system.
  • FIG. 1 is a schematic diagram of the fault diagnosis system provided by the embodiment of the present application.
  • An embodiment of the fault diagnosis system provided by the embodiment of the present application includes:
  • the first pull-up unit 1031 is electrically connected to the first pull-up switch 1051
  • the second pull-up unit 1032 is electrically connected to the second pull-up switch 1052
  • the control unit 102 is electrically connected to the first pull-up switch 1051 and the second pull-up switch 1052, respectively
  • the control unit 102 is electrically connected to the first management board 1011
  • the first level conversion unit 1081 is electrically connected to the first management board 1011
  • the center backplane 104 is connected to the first management board 1011 and the first node board, respectively.
  • the control unit 102 is configured to receive the hard partition information and the in-position information sent by the first management board 1011.
  • the in-position information is the in-position information of other node boards and is used to indicate the in-position status of other node boards.
  • According to the hard partition information and the slot number information of the first node board, the control unit 102 sends control signals to the first pull-up switch 1051 and the second pull-up switch 1052, so that the first pull-up switch 1051 and the second pull-up switch 1052 are closed, respectively.
  • CPU1 and CPU2 are each electrically connected to the first pull-up switch 1051 and to the second pull-up switch 1052, forming a fault diagnosis circuit.
  • The fault diagnosis circuit includes the line from the first pull-up unit 1031 through the first pull-up switch 1051, CPU1, CPU2, and the second pull-up switch 1052 to the second pull-up unit 1032. It should be noted that the slot number information is used to indicate the relative position of the node board currently inserted in the center backplane.
  • The first level conversion unit 1081 and the first management board 1011 can be electrically connected through general-purpose input/output (GPIO). The control unit 102 is electrically connected to the first management board 1011, and the specific communication protocol is not limited.
  • GPIO general input / output
  • The first pull-up unit 1031 and the second pull-up unit 1032 are used to pull up the level of the fault indication signal of the fault diagnosis circuit to obtain a pulled-up target signal, whose level is higher than a preset threshold.
  • The target signal is transmitted to the first level conversion unit 1081, and the first level conversion unit 1081 sends the target signal to the first management board 1011.
  • The first management board 1011 is used to detect whether the level of the target signal is lower than the preset threshold. When the level of the target signal is lower than the preset threshold, the first management board 1011 determines that at least one CPU is faulty, that is, that at least one of CPU1 and CPU2 has a fault.
  • After the complex programmable logic device (CPLD) of the first management board 1011 determines that at least one CPU has failed, it notifies the baseboard management controller (BMC) or another management chip to collect the failure information of the CPU.
  • CPLD complex programmable logic device
  • the first management board 1011 may be a management board of a server.
  • the first management board 1011 may send hard partition information to the control unit 102 through the first CPLD.
  • The control unit 102 provided in this embodiment may be the second CPLD on the node board.
  • When any CPU in the fault diagnosis system fails, the other CPUs may sense that the level of the target signal is lower than the preset threshold and determine that a fault has occurred. For example, when CPU1 fails, CPU2 determines that the level of the target signal is lower than the preset threshold, determines that a failure has occurred, terminates the current service, and synchronizes the failure status. It can be understood that fault synchronization between the node boards can be carried out through the middle backplane without the participation of the management board. The lines are completely decoupled from the management board, which simplifies the management board's circuits and functions. Therefore, the fault diagnosis system provided in this embodiment reduces the routing of the management board of the server and reduces the complexity of wiring between the node board and the server management board.
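The threshold check described for FIG. 1 can be illustrated with a minimal sketch. It assumes the fault indication signal behaves like a shared active-low line that the pull-up units hold high and any faulty CPU drives low; the text implies this but does not state it. The voltage values and the names `target_signal_level`, `fault_detected`, and `management_board_check` are hypothetical.

```python
from typing import Callable

PULL_UP_LEVEL_V = 3.3     # hypothetical level of the pulled-up target signal
PRESET_THRESHOLD_V = 1.5  # hypothetical preset threshold


def target_signal_level(cpu_faulty: dict[str, bool]) -> float:
    """Level of the shared line: pulled high unless some CPU drives it low (assumed)."""
    return 0.0 if any(cpu_faulty.values()) else PULL_UP_LEVEL_V


def fault_detected(level_v: float) -> bool:
    """A fault is reported when the target signal falls below the preset threshold."""
    return level_v < PRESET_THRESHOLD_V


def management_board_check(cpu_faulty: dict[str, bool],
                           notify_bmc: Callable[[float], None]) -> bool:
    """Model of the management board: detect a low level, then notify the BMC."""
    level = target_signal_level(cpu_faulty)
    if fault_detected(level):
        notify_bmc(level)  # the BMC would then collect the CPU fault information
        return True
    return False


if __name__ == "__main__":
    notifications: list[float] = []
    print(management_board_check({"CPU1": True, "CPU2": False}, notifications.append))
    print(notifications)
```

Because the line is shared, the same comparison could be made by any healthy CPU on the line, which is how CPU2 would sense a failure of CPU1 in this model.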
  • FIG. 2 is another schematic diagram of a fault diagnosis system provided by an embodiment of the present application.
  • Another embodiment of the fault diagnosis system provided by the embodiment of the present application includes:
  • the fault diagnosis system may further include a first analog switch 201 and a second analog switch 202.
  • the control unit 102 may include a first control module 1021 and a second control module 1022.
  • The first analog switch 201 and the first control module 1021 are located on the first node board, and the second analog switch 202 and the second control module 1022 are located on the second node board.
  • The first pull-up unit 1031 and the first pull-up switch 1051 are located on the first node board.
  • The second pull-up unit 1032 and the second pull-up switch 1052 are located on the second node board.
  • The control module involved in this embodiment and subsequent embodiments includes a second CPLD on the node board. The first control module 1021 is the CPLD on the first node board, and the second control module 1022 may be the CPLD on the second node board.
  • At least one CPU in this embodiment is described by taking CPU1, CPU2, CPU3, and CPU4 as examples, where CPU1 and CPU2 may be located on a first node board, and CPU3 and CPU4 may be located on a second node board.
  • the first control module 1021 is electrically connected to the first pull-up switch 1051, and the first control module 1021 may also be electrically connected to the first analog switch 201.
  • The second control module 1022 is electrically connected to the second pull-up switch 1052, and the second control module 1022 can also be electrically connected to the second analog switch 202. The first analog switch 201 is electrically connected to the second analog switch 202, and the middle backplane 104 is connected to the first management board 1011, the first node board, and the second node board, respectively.
  • The first control module 1021 and the second control module 1022 are respectively used to receive the hard partition information and the in-position information sent by the first management board 1011.
  • The in-position information is the in-position information of other node boards and is used to indicate the in-position status of other node boards.
  • The first control module 1021 receives the in-position information of the second node board, and the second control module 1022 receives the in-position information of the first node board.
  • The first control module 1021 sends a control signal to the first pull-up switch 1051 according to the hard partition information and the slot number information of the first node board. After the first pull-up switch 1051 receives the control signal, it closes, thereby enabling the first pull-up unit 1031. After the first control module 1021 receives the hard partition information, the first control module 1021 may also send a control signal to the first analog switch 201 so that the first analog switch 201 is closed.
  • The second control module 1022 sends a control signal to the second pull-up switch 1052 according to the hard partition information and the slot number information of the second node board. After the second pull-up switch 1052 receives the control signal, it closes, thereby enabling the second pull-up unit 1032. After the second control module 1022 receives the hard partition information, the second control module 1022 may also send a control signal to the second analog switch 202 so that the second analog switch 202 is closed.
  • Once their pull-up switches are closed, the first pull-up unit 1031 and the second pull-up unit 1032 are used to pull up the level of the fault indication signal of the fault diagnosis line to obtain a pulled-up target signal, whose level is higher than a preset threshold.
  • CPU1, CPU2, CPU3, and CPU4 are each electrically connected to the first analog switch 201 and also to the second analog switch 202, forming a fault diagnosis circuit.
  • Between the first pull-up switch 1051 and the second pull-up switch 1052, the fault diagnosis circuit also includes the first analog switch 201 and the second analog switch 202.
  • The first pull-up unit 1031 and the second pull-up unit 1032 pull up the fault indication signal to obtain the target signal and then transmit the target signal to the first level conversion unit 1081, and the first level conversion unit 1081 sends the target signal to the first management board 1011. The first management board 1011 is used to detect whether the level of the target signal is lower than the preset threshold. When the level of the target signal is lower than the preset threshold, the first management board 1011 determines that at least one of CPU1 to CPU4 has failed. After the first management board 1011 determines that at least one CPU is faulty, the CPLD of the first management board 1011 notifies the BMC or another management chip to collect fault information of the central processing unit.
  • When any CPU in the fault diagnosis system fails, the other CPUs may sense that the level of the target signal is lower than the preset threshold and determine that a fault has occurred.
  • For example, when CPU1 fails, CPU2 to CPU4 each determine that the level of the target signal is lower than the preset threshold and determine that a fault has occurred.
  • CPU2 to CPU4 terminate the current service and synchronize the fault status. It can be understood that fault synchronization between the node boards can be carried out through the middle backplane without the participation of the management board. The lines are completely decoupled from the management board, which simplifies the management board's circuits and functions. Therefore, the fault diagnosis system provided in this embodiment reduces the routing of the management board of the server and reduces the complexity of wiring between the node board and the server management board.
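How a fault on one node board reaches CPUs on another board through the analog switches can be sketched as follows. The model is an assumption for illustration: each board has a local line segment shared by its CPUs, and every board whose analog switch is closed is joined to one common segment on the middle backplane. The function name `cpus_sensing_fault` and the board labels are hypothetical.

```python
def cpus_sensing_fault(boards: dict[str, list[str]],
                       analog_closed: dict[str, bool],
                       faulty_cpu: str) -> set[str]:
    """Return the healthy CPUs that share a line with the faulty CPU.

    boards maps a node board name to the CPUs it carries; analog_closed gives
    the state of each board's analog switch (assumed model, see lead-in).
    """
    home = next((b for b, cpus in boards.items() if faulty_cpu in cpus), None)
    if home is None:
        raise ValueError(f"unknown CPU: {faulty_cpu}")
    reached = set(boards[home])           # CPUs on the same board always share the line
    if analog_closed[home]:               # the fault leaves the board only via a closed switch
        for board, cpus in boards.items():
            if analog_closed[board]:
                reached.update(cpus)
    reached.discard(faulty_cpu)
    return reached


if __name__ == "__main__":
    boards = {"node1": ["CPU1", "CPU2"], "node2": ["CPU3", "CPU4"]}
    print(sorted(cpus_sensing_fault(boards, {"node1": True, "node2": True}, "CPU1")))
```

With both analog switches closed, a CPU1 fault is sensed by CPU2, CPU3, and CPU4, matching the example in the text; with the second board's switch open, only CPU2 would sense it in this model.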
  • FIG. 3 is another schematic diagram of the fault diagnosis system provided by the embodiment of the present application. Another embodiment of the fault diagnosis system provided by the embodiment of the present application includes:
  • The fault diagnosis system includes a first management board 1011, a first node board, a second node board, a third node board, a fourth node board, and a middle backplane 104, wherein the middle backplane is connected to the first management board 1011, the first node board, the second node board, the third node board, and the fourth node board.
  • the first node board includes a first control module 1021, a first analog switch 201, a first pull-up unit 1031, a first pull-up switch 1051, a first level conversion unit 1081, a CPU1, and a CPU2.
  • The first control module 1021 is electrically connected to the first analog switch 201 and the first pull-up switch 1051, respectively, and the first pull-up unit 1031 is electrically connected to the first pull-up switch 1051.
  • The first level conversion unit 1081 is electrically connected to the first management board 1011 through the middle backplane 104.
  • The first control module 1021 is electrically connected to the first management board 1011 through the middle backplane 104.
  • the second node board includes a second control module 1022, a second analog switch 202, a CPU3, and a CPU4.
  • the second control module 1022 is electrically connected to the second analog switch 202, and the second control module 1022 is electrically connected to the first management board 1011 through the center backplane 104.
  • the third node board includes a third control module 1023, a third analog switch 203, CPU5, and CPU6.
  • The third control module 1023 is electrically connected to the third analog switch 203, and the third control module 1023 is electrically connected to the first management board 1011.
  • The fourth node board includes a fourth control module 1024, a fourth analog switch 204, a second pull-up unit 1032, a second pull-up switch 1052, a CPU7, and a CPU8.
  • the fourth control module 1024 is electrically connected to the fourth analog switch 204 and the second pull-up switch 1052, respectively.
  • the fourth control module 1024 is electrically connected to the first management board 1011 through the center backplane 104.
  • The first control module 1021, the second control module 1022, the third control module 1023, and the third analog switch 203 are respectively electrically connected through the center backplane 104.
  • The first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024 are respectively used to receive the hard partition information and the in-position information sent by the management board.
  • the in-position information is the in-position information of other node boards and is used to indicate the in-position status of other node boards.
  • the first control module 1021 receives in-position information corresponding to the second node board, the third node board, and the fourth node board, respectively.
  • The first control module 1021 sends a control signal to the first pull-up switch 1051 according to the hard partition information and the slot number information of the first node board. After receiving the control signal, the first pull-up switch 1051 closes, thereby enabling the first pull-up unit 1031.
  • the first control module 1021 may also send a control signal to the first analog switch 201 so that the first analog switch 201 is closed.
  • the second control module 1022 sends a control signal to the second analog switch 202 according to the hard partition information and the slot number information where the second node board is located, so that the second analog switch 202 is closed.
  • the third control module 1023 sends a control signal to the third analog switch 203 according to the hard partition information and the slot number information where the third node board is located, so that the third analog switch 203 is closed.
  • The fourth control module 1024 sends a control signal to the second pull-up switch 1052 according to the hard partition information and the slot number information of the fourth node board. After receiving the control signal, the second pull-up switch 1052 closes, thereby enabling the second pull-up unit 1032.
  • the fourth control module 1024 may also send a control signal to the fourth analog switch 204 so that the fourth analog switch 204 is closed.
  • After the first analog switch 201, the second analog switch 202, the third analog switch 203, the fourth analog switch 204, the first pull-up switch 1051, and the second pull-up switch 1052 are closed, the first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal of the fault diagnosis line to obtain a target signal whose level is higher than a preset threshold.
  • Between the first pull-up switch 1051 and the second pull-up switch 1052, the fault diagnosis line includes the first analog switch 201, the second analog switch 202, the third analog switch 203, and the fourth analog switch 204.
  • After the first pull-up unit 1031 and the second pull-up unit 1032 pull up the fault indication signal to obtain the target signal, the target signal is transmitted to the first level conversion unit 1081, which sends it to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU8 has failed. The CPLD of the first management board 1011 then notifies the BMC or another management chip to collect the fault information of the central processing unit.
  • When any CPU in the fault diagnosis system fails, the other CPUs can determine that the level of the target signal is lower than the preset threshold and therefore that a faulty CPU exists.
  • For example, when CPU1 fails, CPU2 to CPU8 determine that the level of the target signal is lower than the preset threshold, conclude that a fault has occurred, terminate their current services, and synchronize the fault status.
  • Fault synchronization between the node boards can be carried out through the middle backplane, without the management board taking part.
  • The fault lines are thus fully decoupled from the management board, which simplifies the management board's circuits and functions. The fault diagnosis system of this embodiment therefore reduces the routing on the server's management board and the complexity of the wiring between the node boards and the server management board.
  • the fourth node board may be the second node board in the embodiment in FIG. 2 described above, and the order of the node boards is not limited herein.
  • FIG. 4 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application. Another embodiment of the fault diagnosis system includes:
  • In an 8-way server, the fault diagnosis system includes a first management board 1011, a second management board 1012, a first node board, a second node board, a third node board, a fourth node board, and a middle backplane 104. The middle backplane 104 is connected to the first management board 1011, the second management board 1012, and the first to fourth node boards, respectively.
  • The first node board includes a first control module 1021, a first analog switch 201, a fifth analog switch 205, a first pull-up unit 1031, a first pull-up switch 1051, a third pull-up unit 1033, a third pull-up switch 1053, a first level conversion unit 1081, CPU1, and CPU2.
  • the first control module 1021 is electrically connected to the first analog switch 201, the fifth analog switch 205, the first pull-up switch 1051, and the third pull-up switch 1053, respectively.
  • the first pull-up unit 1031 and the first pull-up switch 1051 are electrically connected.
  • The third pull-up unit 1033 is electrically connected to the third pull-up switch 1053, the first level conversion unit 1081 is electrically connected to the first management board 1011 through the middle backplane 104, and the first control module 1021 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
  • The second node board includes a second control module 1022, a second analog switch 202, a sixth analog switch 206, a second pull-up unit 1032, a second pull-up switch 1052, a fourth pull-up unit 1034, a fourth pull-up switch 1054, CPU3, and CPU4.
  • the second control module 1022 is electrically connected to the second analog switch 202, the sixth analog switch 206, the second pull-up switch 1052, and the fourth pull-up switch 1054, respectively.
  • the second pull-up unit 1032 and the second pull-up switch 1052 are electrically connected.
  • the fourth pull-up unit 1034 is electrically connected to the fourth pull-up switch 1054, and the second control module 1022 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
  • The third node board includes a third control module 1023, a third analog switch 203, a seventh analog switch 207, a fifth pull-up unit 1035, a fifth pull-up switch 1055, a sixth pull-up unit 1036, a sixth pull-up switch 1056, CPU5, CPU6, and a second level conversion unit 1082.
  • the third control module 1023 is electrically connected to the third analog switch 203, the seventh analog switch 207, the fifth pull-up switch 1055, and the sixth pull-up switch 1056, respectively.
  • the fifth pull-up unit 1035 and the fifth pull-up switch 1055 are electrically connected.
  • The sixth pull-up unit 1036 is electrically connected to the sixth pull-up switch 1056, the second level conversion unit 1082 is electrically connected to the second management board 1012 through the middle backplane 104, and the third control module 1023 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
  • The fourth node board includes a fourth control module 1024, a fourth analog switch 204, an eighth analog switch 208, a seventh pull-up unit 1037, a seventh pull-up switch 1057, an eighth pull-up unit 1038, an eighth pull-up switch 1058, CPU7, and CPU8. The fourth control module 1024 is electrically connected to the fourth analog switch 204, the eighth analog switch 208, the seventh pull-up switch 1057, and the eighth pull-up switch 1058, respectively, and is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
  • The first analog switch 201 and the second analog switch 202 are electrically connected through the middle backplane 104, as are the third analog switch 203 and the fourth analog switch 204. The fifth analog switch 205, the sixth analog switch 206, the seventh analog switch 207, and the eighth analog switch 208 are likewise electrically connected through the middle backplane 104.
  • The first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024 each receive the hard partition information sent by the first management board 1011 together with in-position information, which indicates the in-position status of the other node boards.
  • When the hard partition information indicates the 2P mode, the first control module 1021 sends a control signal to the first pull-up switch 1051 and the third pull-up switch 1053.
  • The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the third pull-up switch 1053 closes according to the control signal, thereby enabling the third pull-up unit 1033.
  • Once enabled, the first pull-up unit 1031 and the third pull-up unit 1033 pull up the level of the fault indication signal to obtain a target signal whose level is higher than a preset threshold.
  • The target signal is transmitted to the first management board 1011 through the first level conversion unit 1081. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 and CPU2 has failed.
  • The 2P mode in this embodiment means that the service currently run by the server needs only 2 CPUs. In this mode the second control module 1022, the third control module 1023, and the fourth control module 1024 need not perform any action.
  • When the hard partition information indicates the 4P mode, the first control module 1021 sends a control signal to the first pull-up switch 1051 and the first analog switch 201.
  • The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the first analog switch 201 closes according to the control signal.
  • The second control module 1022 sends a control signal to the second pull-up switch 1052 and the second analog switch 202.
  • The second pull-up switch 1052 closes according to the control signal, thereby enabling the second pull-up unit 1032, and the second analog switch 202 closes according to the control signal.
  • After these switches are closed, the first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal to obtain a target signal whose level is higher than a preset threshold.
  • The target signal is transmitted to the first management board 1011 through the first level conversion unit 1081. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU4 has failed.
  • When the hard partition information indicates the 8P mode, the first control module 1021 sends a control signal to the first pull-up switch 1051 and the fifth analog switch 205.
  • The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the fifth analog switch 205 closes according to the control signal.
  • The second control module 1022 sends a control signal to the sixth analog switch 206, which closes according to the control signal.
  • The third control module 1023 sends a control signal to the seventh analog switch 207, which closes according to the control signal.
  • The fourth control module 1024 sends control signals to the eighth pull-up switch 1058 and the eighth analog switch 208.
  • The eighth pull-up switch 1058 closes according to the control signal, thereby enabling the eighth pull-up unit 1038, and the eighth analog switch 208 closes according to the control signal.
  • After the fifth to eighth analog switches 205 to 208, the first pull-up switch 1051, and the eighth pull-up switch 1058 are closed, the first pull-up unit 1031 and the eighth pull-up unit 1038 pull up the level of the fault indication signal to obtain a target signal whose level is higher than a preset threshold.
  • The target signal is transmitted to the first management board 1011 through the first level conversion unit 1081. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU8 has failed.
  • The fault diagnosis system provided in this embodiment can switch the server into different modes according to the hard partition information.
  • For example, the server may be switched to a 2P mode, a 4P mode, or an 8P mode. In this embodiment the control module uses the current hard partition information and in-position information, together with the slot number information of its own node board, to control the switches and change the fault indication signal topology so that it suits the current hard partition settings and system requirements.
  • The circuit adapts readily to different configurations, which improves the flexibility of the fault diagnosis system.
  • The fault diagnosis system shown in FIG. 4 can also be switched to two 4P modes.
  • The first management board 1011 and the units in the first node board and the second node board form one 4P mode.
  • The second management board 1012 and the units in the third node board and the fourth node board form the other 4P mode.
  • The functions and actions performed by the first management board 1011 and by the units in the first node board and the second node board are similar to those in the embodiment corresponding to FIG. 2 described above.
  • The same applies to the functions and actions performed by the second management board 1012 and by the units in the third node board and the fourth node board, which are not repeated here.
  • The in-position information of other node boards sent by the first management board 1011 in this embodiment may indicate that any one, any two, or any three of the first to fourth node boards are not in their slots.
  • The control module of a node board that is not in its slot, or that is isolated, performs no action.
  • For example, the first management board 1011 sends hard partition information and slot number information to the first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024.
  • When the hard partition information indicates the 8P mode, after the four control modules receive the hard partition information, the third control module 1023 determines from the slot number information that its node board is not in the slot or is isolated, and performs no action. The first control module 1021, the second control module 1022, and the fourth control module 1024 each determine from the slot number information that their node boards are in place.
  • The first control module 1021 sends a control signal to the first pull-up switch 1051 and the fifth analog switch 205.
  • The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the fifth analog switch 205 closes according to the control signal.
  • The second control module 1022 sends a control signal to the sixth analog switch 206, which closes according to the control signal.
  • The fourth control module 1024 sends control signals to the eighth pull-up switch 1058 and the eighth analog switch 208.
  • The eighth pull-up switch 1058 closes according to the control signal, thereby enabling the eighth pull-up unit 1038, and the eighth analog switch 208 closes according to the control signal.
  • The first pull-up unit 1031 and the eighth pull-up unit 1038 pull up the level of the fault indication signal to obtain a target signal whose level is higher than a preset threshold.
  • The target signal is transmitted to the first management board 1011 through the first level conversion unit 1081.
  • The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU4 and CPU7 to CPU8 has failed.
  • FIG. 5 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application. Another embodiment of the fault diagnosis system includes:
  • The fault diagnosis system of two 4-way servers includes a first management board 1011, a second management board 1012, a first node board, a second node board, a third node board, a fourth node board, and a middle backplane 104. The middle backplane 104 is connected to the first management board 1011, the second management board 1012, and the first to fourth node boards, respectively.
  • For the units included in the first to fourth node boards, refer to FIG. 5; details are not repeated here.
  • The first management board 1011 is electrically connected to the first control module 1021 and the second control module 1022, respectively.
  • The first management board 1011 is electrically connected to the first level conversion unit 1081 through the middle backplane 104.
  • The second management board 1012 is electrically connected to the third control module 1023 and the fourth control module 1024, respectively.
  • The second management board 1012 is electrically connected to the second level conversion unit 1082 through the middle backplane 104.
  • The first management board 1011 sends the hard partition information and the in-position information of its node boards to the first control module 1021 and the second control module 1022, and the second management board 1012 sends the hard partition information and the in-position information of its node boards to the third control module 1023 and the fourth control module 1024.
  • the hard partition information is 4P mode.
  • the first management board 1011 and each unit in the first node board and the second node board form a 4P mode.
  • The functions and actions performed by the first management board 1011 and by the units in the first node board and the second node board are similar to those in the embodiment corresponding to FIG. 2, and details are not repeated here.
  • the second management board 1012 and each unit in the third node board and the fourth node board form another 4P mode.
  • the functions and actions performed by the second management board 1012, the units on the third node board and the fourth node board are similar to the embodiment corresponding to FIG. 2 described above, and are not repeated here.
  • A fault diagnosis system for four 2-way servers may also be provided.
  • It is similar to the fault diagnosis system for the two 4-way servers and is not described in further detail here.
  • An embodiment of this application further provides a server that includes the fault diagnosis system corresponding to FIG. 1, FIG. 2, FIG. 3, FIG. 4, or FIG. 5.
  • For the fault diagnosis system, refer to the embodiments corresponding to FIG. 1 to FIG. 5; details are not repeated here.
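Looking back at the 8P example in which the third node board is absent, the effect of the in-position information can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the mapping of node board n to CPU(2n-1) and CPU(2n) is an assumption taken from the CPU numbering used in the embodiments.

```python
def acting_boards(in_position: dict[int, bool]) -> list[int]:
    """Node boards whose control modules act; absent or isolated boards do nothing."""
    return sorted(board for board, present in in_position.items() if present)


def monitored_cpus(in_position: dict[int, bool]) -> list[str]:
    """CPUs that remain on the fault diagnosis line (assumes two CPUs per node board)."""
    cpus: list[str] = []
    for board in acting_boards(in_position):
        cpus += [f"CPU{2 * board - 1}", f"CPU{2 * board}"]
    return cpus


# Third node board not in its slot, as in the example in the text.
status = {1: True, 2: True, 3: False, 4: True}
print(monitored_cpus(status))
```

With the third node board absent, the remaining CPUs are CPU1 to CPU4 and CPU7 to CPU8, which matches the set the first management board checks in that example.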

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Automation & Control Theory (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

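A fault diagnosis system includes a control unit (102), a first management board (1011), a first pull-up unit (1031), a second pull-up unit (1032), a first pull-up switch (1051), a second pull-up switch (1052), and at least one central processing unit (CPU1, ..., CPU8). The first pull-up unit (1031) is electrically connected to the first pull-up switch (1051), the second pull-up unit (1032) is electrically connected to the second pull-up switch (1052), the control unit (102) is electrically connected to the first management board (1011), and the control unit (102) is electrically connected to the first pull-up switch (1051) and the second pull-up switch (1052) respectively. The control unit (102) is configured to receive a hard partition signal sent by the first management board (1011). The first pull-up unit (1031) and the second pull-up unit (1032) are configured to pull up the fault indication signal of a fault diagnosis line to obtain a target signal. The first management board (1011) is configured to detect whether the level of the target signal is lower than a preset threshold and, when it is, to determine that a faulty central processing unit exists among the at least one central processing unit (CPU1, ..., CPU8).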

Description

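Fault diagnosis system and server
Technical Field
The embodiments of this application relate to the field of circuit technologies, and in particular to a fault diagnosis system and a server.
Background
As the demand for server performance keeps rising, servers with a single central processing unit (CPU) can no longer meet high-performance computing needs. Servers have evolved into 2-way, 4-way, and 8-way high-performance servers that must also support hard partitioning. This evolution places ever higher demands on the service processing capability of the server management board, and the circuit topology of the CPU fault indication signal is becoming increasingly complex.
In the prior art, the CPU fault indication signal of each node board in the CPU fault indication signal topology is transmitted separately to the server management board through the backplane. On the server management board, the CPU fault indication signals of every two CPUs are aggregated; the aggregated signal passes through a level shift and is then sent to a complex programmable logic device (CPLD), which performs further aggregation internally according to the current hard partition state to obtain the CPU fault indication signal.
However, because the CPU fault indication signal of each node board is transmitted separately to the server management board through the backplane, all isolation circuits and level shift circuits must be placed on the server management board. This increases the circuit complexity of the server management board and the routing difficulty of the CPU fault indication signal topology.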
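Summary
The embodiments of this application provide a fault diagnosis system that detects the fault indication signals of the multiple CPUs of a server and reduces the complexity of the circuit wiring of the fault diagnosis system.
A first aspect of the embodiments of this application provides a fault diagnosis system, which includes:
The fault diagnosis system can be applied to a server and includes a control unit, a first management board, a first pull-up unit, a second pull-up unit, a first pull-up switch, a second pull-up switch, and at least one central processing unit. The first pull-up unit is electrically connected to the first pull-up switch, the second pull-up unit is electrically connected to the second pull-up switch, the control unit is electrically connected to the first management board, and the control unit is electrically connected to the first pull-up switch and the second pull-up switch respectively. The control unit is configured to receive a hard partition signal sent by the first management board and, according to the hard partition signal, to control the first pull-up switch and the second pull-up switch to close, so that each central processing unit is electrically connected to the first pull-up switch and the second pull-up switch respectively, forming a fault diagnosis line. The fault diagnosis line runs from the first pull-up unit through the first pull-up switch, the at least one central processing unit, and the second pull-up switch to the second pull-up unit. The first pull-up unit and the second pull-up unit are configured to pull up the fault indication signal of the fault diagnosis line to obtain a pulled-up target signal. The first management board is configured to detect whether the level of the target signal is lower than a preset threshold and, when the level of the target signal is lower than the diagnostic threshold, to determine that a faulty central processing unit exists among the at least one central processing unit on the fault diagnosis line. The control unit, the first and second pull-up units, the first and second pull-up switches, and the at least one central processing unit are located on a first node board.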
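Based on the first aspect, in a first implementation of the first aspect, the fault diagnosis system further includes a first analog switch and a second analog switch. The control unit includes a first control module and a second control module; the first control module is electrically connected to the first pull-up switch and the first analog switch respectively, and the second control module is electrically connected to the second pull-up switch and the second analog switch respectively. The first control module is configured to receive the hard partition signal sent by the first management board and, according to it, to control the first pull-up switch and the first analog switch to close. The second control module is configured to receive the hard partition signal sent by the first management board and, according to it, to control the second pull-up switch and the second analog switch to close, so that each central processing unit is also electrically connected to the first analog switch and the second analog switch, forming the fault diagnosis line. Between the first pull-up switch and the second pull-up switch, the fault diagnosis line further includes the first analog switch and the second analog switch.
Based on the first aspect or the first implementation of the first aspect, in a second implementation of the first aspect, the fault diagnosis system further includes a third analog switch and a fourth analog switch. The control unit further includes a third control module, which is electrically connected to the third analog switch, and a fourth control module, which is electrically connected to the fourth analog switch. The third control module is configured to receive the hard partition signal sent by the first management board and, according to it, to control the third analog switch to close. The fourth control module is configured to receive the hard partition signal sent by the first management board and, according to it, to control the fourth analog switch to close, so that each central processing unit is also electrically connected to the third analog switch and the fourth analog switch, forming the fault diagnosis line. Between the first analog switch and the second analog switch, the fault diagnosis line further includes the third analog switch and the fourth analog switch.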
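Based on the first aspect or either of the first and second implementations of the first aspect, in a third implementation of the first aspect, the fault diagnosis system further includes a third pull-up unit, a third pull-up switch, a fourth pull-up unit, and a fourth pull-up switch. The third pull-up switch is electrically connected to the first management board, the third pull-up unit is electrically connected to the third pull-up switch, the fourth pull-up switch is electrically connected to the first management board, and the fourth pull-up unit is electrically connected to the fourth pull-up switch.
Based on the first aspect or any one of the first to third implementations of the first aspect, in a fourth implementation of the first aspect, the fault diagnosis system further includes a fifth pull-up unit and a fifth pull-up switch, a sixth pull-up unit and a sixth pull-up switch, a seventh pull-up unit and a seventh pull-up switch, and an eighth pull-up unit and an eighth pull-up switch. Each of the fifth, sixth, seventh, and eighth pull-up switches is electrically connected to the first management board, and each of the fifth, sixth, seventh, and eighth pull-up units is electrically connected to the pull-up switch of the same ordinal.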
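Based on the first aspect or any one of the first to fourth implementations of the first aspect, in a fifth implementation of the first aspect, the fault diagnosis system further includes a middle backplane, which connects the node boards corresponding to the first control module, the second control module, the third control module, and the fourth control module.
Based on the first aspect or any one of the first to fifth implementations of the first aspect, in a sixth implementation of the first aspect, the first management board is further configured to send in-position information to the first control module, the second control module, the third control module, and/or the fourth control module. The in-position information indicates the status of the node boards corresponding to those control modules.
Based on the first aspect or any one of the first to sixth implementations of the first aspect, in a seventh implementation of the first aspect, the fault diagnosis system further includes a level conversion unit, which is electrically connected to the first management board.
Based on the first aspect or any one of the first to seventh implementations of the first aspect, in an eighth implementation of the first aspect, the fault diagnosis system further includes a second management board, which is electrically connected to the first control module, the second control module, the third control module, and the fourth control module respectively.
Based on the first aspect or any one of the first to eighth implementations of the first aspect, in a ninth implementation of the first aspect, the hard partition information includes 2P mode information, 4P mode information, or 8P mode information. The 2P mode information corresponds to 2 of the at least one processor, the 4P mode information to 4 of them, and the 8P mode information to 8 of them.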
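As a rough illustration of the three hard partition modes, the Python sketch below maps each mode to its processor count. The enum and helper names are hypothetical, and the figure of two CPUs per node board is an assumption borrowed from the embodiments (CPU1 and CPU2 on the first node board, and so on), not something this implementation states.

```python
from enum import Enum


class HardPartitionMode(Enum):
    """Hard partition modes named in the text; the value is the processor count."""
    MODE_2P = 2
    MODE_4P = 4
    MODE_8P = 8


CPUS_PER_NODE_BOARD = 2  # assumption: matches the node boards in the embodiments


def node_boards_needed(mode: HardPartitionMode) -> int:
    """Number of node boards that take part in the given mode."""
    return mode.value // CPUS_PER_NODE_BOARD


for mode in HardPartitionMode:
    print(mode.name, mode.value, node_boards_needed(mode))
```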
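Based on the first aspect or any one of the first to ninth implementations of the first aspect, in a tenth implementation of the first aspect, the first management board includes a first complex programmable logic device (CPLD).
Based on the first aspect or any one of the first to tenth implementations of the first aspect, in an eleventh implementation of the first aspect, the control unit includes a second CPLD.
A second aspect of the embodiments of this application provides a server that includes the fault diagnosis system according to the first aspect or any one of the first to eleventh implementations of the first aspect.
It can be seen from the above technical solutions that the embodiments of this application have the following advantages:
In the embodiments of this application, the control unit and the switches are all located on the node boards. Based on the current hard partition information and in-position information, combined with the slot number information of its own node board, the control unit can control the switches to change the topology of the fault indication signal so as to suit the current hard partition settings and system requirements. Fault synchronization between node boards can be carried out through the middle backplane without the management board taking part, so the lines are fully decoupled from the management board and the management board's circuits and functions are simplified. The fault diagnosis system provided by the embodiments of this application therefore reduces the routing on the server's management board and, with it, the complexity of the wiring between the node boards and the server management board.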
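Brief Description of the Drawings
FIG. 1 is a schematic diagram of the fault diagnosis system provided by the embodiments of this application;
FIG. 2 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application;
FIG. 3 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application;
FIG. 4 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application;
FIG. 5 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application.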
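Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of the embodiments of this application.
The terms "first", "second", "third", "fourth", and so on (if any) in the specification, claims, and drawings of the embodiments of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way is interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
The embodiments of this application provide a fault diagnosis system that detects the fault indication signals of the multiple CPUs of a server and reduces the complexity of the circuit wiring of the fault diagnosis system.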
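The fault diagnosis system provided by the embodiments of this application is described below. Referring to FIG. 1, FIG. 1 is a schematic diagram of the fault diagnosis system provided by the embodiments of this application. One embodiment of the fault diagnosis system includes:
A first management board 1011, a control unit 102, a first pull-up unit 1031, a second pull-up unit 1032, a first pull-up switch 1051, a second pull-up switch 1052, a middle backplane 104, a first level conversion unit 1081, and at least one central processing unit (CPU). Specifically, the first management board 1011 is the management board of the server, while the control unit 102, the first pull-up unit 1031, the second pull-up unit 1032, the first pull-up switch 1051, the second pull-up switch 1052, the first level conversion unit 1081, and the at least one CPU are located on the first node board of the server. In this embodiment, the at least one CPU is described using CPU1 and CPU2 as an example.
The connections among the units of the fault diagnosis system in this embodiment may be as follows:
The first pull-up unit 1031 is electrically connected to the first pull-up switch 1051, the second pull-up unit 1032 is electrically connected to the second pull-up switch 1052, the control unit 102 is electrically connected to the first pull-up switch 1051 and the second pull-up switch 1052 respectively, the control unit 102 is electrically connected to the first management board 1011, the first level conversion unit 1081 is electrically connected to the first management board 1011, and the middle backplane 104 is connected to the first management board 1011 and to the first node board.
The control unit 102 receives the hard partition information and the in-position information sent by the first management board 1011. The in-position information is that of the other node boards and indicates their in-position status. According to the hard partition information and the slot number information of the first node board, the control unit 102 sends control signals to the first pull-up switch 1051 and the second pull-up switch 1052 so that both close. CPU1 and CPU2 are then each electrically connected to the first pull-up switch 1051 and to the second pull-up switch 1052, forming a fault diagnosis line that runs from the first pull-up unit 1031 through the first pull-up switch 1051, CPU1, CPU2, and the second pull-up switch 1052 to the second pull-up unit 1032. The slot number information indicates the relative position at which the node board is currently inserted in the middle backplane.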
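It should be noted that, in this and subsequent embodiments, the first level conversion unit 1081 and the first management board 1011 may be electrically connected through general-purpose input/output (GPIO). The control unit 102 and the first management board 1011 are electrically connected; the specific communication protocol is not limited.
The first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal of the fault diagnosis line to obtain a pulled-up target signal, whose level is higher than a preset threshold.
After the first pull-up unit 1031 and the second pull-up unit 1032 pull up the fault indication signal to obtain the target signal, the target signal is transmitted to the first level conversion unit 1081, which sends it to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that a faulty CPU exists among the at least one CPU on the fault diagnosis line, that is, that at least one of CPU1 and CPU2 has failed. After the complex programmable logic device (CPLD) of the first management board 1011 determines that at least one CPU has failed, it notifies the baseboard management controller (BMC) or another management chip to collect the fault information of the central processing unit.
It should be noted that in this embodiment the first management board 1011 may be the management board of the server and may send the hard partition information to the control unit 102 through a first CPLD. The control unit 102 provided in this embodiment may be a second CPLD on the node board.
In this embodiment, when any CPU in the fault diagnosis system fails, the other CPUs can sense that the level of the target signal is lower than the preset threshold and determine that a fault has occurred. For example, when CPU1 fails, CPU2 determines that the level of the target signal is lower than the preset threshold, concludes that a fault has occurred, terminates its current service, and synchronizes the fault status. Fault synchronization between node boards can be carried out through the middle backplane without the management board taking part, so the lines are fully decoupled from the management board and the management board's circuits and functions are simplified. The fault diagnosis system provided in this embodiment therefore reduces the routing on the server's management board and the complexity of the wiring between the node boards and the server management board.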
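The detection rule in this embodiment can be sketched as a small Python model. It is a simplified illustration, not the circuit itself: the voltages, the threshold value, and the function names are all assumptions chosen for the example, and the model assumes that a faulty CPU drives the shared line low while healthy CPUs leave it pulled up.

```python
# All numeric values below are assumed for illustration only.
PULLED_UP_LEVEL_V = 3.3   # assumed line level with the pull-up units enabled and no fault
FAULT_LEVEL_V = 0.0       # assumed line level when any CPU signals a fault
PRESET_THRESHOLD_V = 1.0  # assumed preset threshold checked by the management board


def target_signal_level(cpu_faulted: dict[str, bool]) -> float:
    """Level of the shared target signal: low if any CPU on the line has failed."""
    return FAULT_LEVEL_V if any(cpu_faulted.values()) else PULLED_UP_LEVEL_V


def management_board_detects_fault(level_v: float) -> bool:
    """The management board only compares the level against the threshold."""
    return level_v < PRESET_THRESHOLD_V


healthy = target_signal_level({"CPU1": False, "CPU2": False})
faulted = target_signal_level({"CPU1": True, "CPU2": False})
print(management_board_detects_fault(healthy), management_board_detects_fault(faulted))
```

Because every CPU sees the same line, the same comparison also describes how CPU2 learns of a CPU1 fault without going through the management board.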
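One fault diagnosis system provided by the embodiments of this application has been described above; another is described below.
Referring to FIG. 2, FIG. 2 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application. Another embodiment of the fault diagnosis system includes:
The fault diagnosis system may further include a first analog switch 201 and a second analog switch 202, and the control unit 102 may include a first control module 1021 and a second control module 1022. The first analog switch 201 and the first control module 1021 are located on the first node board, and the second analog switch 202 and the second control module 1022 are located on the second node board. In this embodiment, the first pull-up unit 1031 and the first pull-up switch 1051 are located on the first node board, and the second pull-up unit 1032 and the second pull-up switch 1052 are located on the second node board.
It should be noted that the control modules in this and subsequent embodiments include the second CPLD on a node board. For example, the first control module 1021 is the CPLD on the first node board, and the second control module may be the CPLD on the second node board.
In this embodiment, the at least one CPU is described using CPU1, CPU2, CPU3, and CPU4 as an example. CPU1 and CPU2 may be located on the first node board, and CPU3 and CPU4 on the second node board.
The connections among the units of the fault diagnosis system in this embodiment may be as follows:
The first control module 1021 is electrically connected to the first pull-up switch 1051 and may also be electrically connected to the first analog switch 201. The second control module 1022 is electrically connected to the second pull-up switch 1052 and may also be electrically connected to the second analog switch 202. The first analog switch 201 is electrically connected to the second analog switch 202, and the middle backplane 104 is connected to the first management board 1011, the first node board, and the second node board.
In this embodiment, the first control module 1021 and the second control module 1022 each receive the hard partition information and the in-position information sent by the first management board 1011. The in-position information is that of the other node boards and indicates their in-position status: the first control module 1021 receives the in-position information of the second node board, and the second control module 1022 receives that of the first node board.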
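The first control module 1021 sends a control signal to the first pull-up switch 1051 according to the hard partition information and the slot number information of the first node board. After receiving the control signal, the first pull-up switch 1051 closes, thereby enabling the first pull-up unit 1031. After receiving the hard partition information, the first control module 1021 may also send a control signal to the first analog switch 201 so that the first analog switch 201 closes.
The second control module 1022 sends a control signal to the second pull-up switch 1052 according to the hard partition information and the slot number information of the second node board. After receiving the control signal, the second pull-up switch 1052 closes, thereby enabling the second pull-up unit 1032. After receiving the hard partition information, the second control module 1022 may also send a control signal to the second analog switch 202 so that the second analog switch 202 closes.
After the first analog switch 201, the second analog switch 202, the first pull-up switch 1051, and the second pull-up switch 1052 are closed, the first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal of the fault diagnosis line to obtain a pulled-up target signal, whose level is higher than a preset threshold.
With these four switches closed, CPU1, CPU2, CPU3, and CPU4 are each electrically connected to the first analog switch and to the second analog switch, forming the fault diagnosis line. Between the first pull-up switch 1051 and the second pull-up switch 1052, the fault diagnosis line further includes the first analog switch 201 and the second analog switch 202.
After the first pull-up unit 1031 and the second pull-up unit 1032 pull up the fault indication signal to obtain the target signal, the target signal is transmitted to the first level conversion unit 1081, which sends it to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU4 has failed. The CPLD of the first management board 1011 then notifies the BMC or another management chip to collect the fault information of the central processing unit.
In this embodiment, when any CPU in the fault diagnosis system fails, the other CPUs can sense that the level of the target signal is lower than the preset threshold and determine that a fault has occurred. For example, when CPU1 fails, CPU2 to CPU4 each determine that the level of the target signal is lower than the preset threshold, conclude that a fault has occurred, terminate their current services, and synchronize the fault status. Fault synchronization between node boards can be carried out through the middle backplane without the management board taking part, so the lines are fully decoupled from the management board and the management board's circuits and functions are simplified. The fault diagnosis system provided in this embodiment therefore reduces the routing on the server's management board and the complexity of the wiring between the node boards and the server management board.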
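The condition under which the two-board fault diagnosis line of FIG. 2 is formed can be sketched as follows. The switch labels reuse the reference numerals from the text, while the helper function and the string naming scheme are hypothetical and serve only to make the "all four switches closed" condition explicit.

```python
# Switches that lie on the FIG. 2 fault diagnosis line, in order along the line.
FIG2_LINE_SWITCHES = (
    "pull_up_switch_1051",
    "analog_switch_201",
    "analog_switch_202",
    "pull_up_switch_1052",
)


def line_is_formed(closed_switches: set[str]) -> bool:
    """The line spans both node boards only when every switch on it is closed."""
    return all(switch in closed_switches for switch in FIG2_LINE_SWITCHES)


# Only the first node board has acted: the line does not yet reach CPU3 and CPU4.
print(line_is_formed({"pull_up_switch_1051", "analog_switch_201"}))
# Both control modules have closed their switches.
print(line_is_formed(set(FIG2_LINE_SWITCHES)))
```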
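Another fault diagnosis system provided by the embodiments of this application is described below. Referring to FIG. 3, FIG. 3 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application. Another embodiment of the fault diagnosis system includes:
The fault diagnosis system includes a first management board 1011, a first node board, a second node board, a third node board, a fourth node board, and a middle backplane 104. The middle backplane is connected to the first management board 1011 and to the first to fourth node boards, respectively.
The first node board includes a first control module 1021, a first analog switch 201, a first pull-up unit 1031, a first pull-up switch 1051, a first level conversion unit 1081, CPU1, and CPU2. The first control module 1021 is electrically connected to the first analog switch 201 and the first pull-up switch 1051 respectively, the first pull-up unit 1031 is electrically connected to the first pull-up switch 1051, the first level conversion unit 1081 is electrically connected to the first management board 1011 through the middle backplane 104, and the first control module 1021 is electrically connected to the first management board 1011 through the middle backplane 104.
The second node board includes a second control module 1022, a second analog switch 202, CPU3, and CPU4. The second control module 1022 is electrically connected to the second analog switch 202 and is electrically connected to the first management board 1011 through the middle backplane 104.
The third node board includes a third control module 1023, a third analog switch 203, CPU5, and CPU6. The third control module 1023 is electrically connected to the third analog switch 203 and is electrically connected to the first management board 1011 through the middle backplane 104.
The fourth node board includes a fourth control module 1024, a fourth analog switch 204, a second pull-up unit 1032, a second pull-up switch 1052, CPU7, and CPU8. The fourth control module 1024 is electrically connected to the fourth analog switch 204 and the second pull-up switch 1052 respectively, and is electrically connected to the first management board 1011 through the middle backplane 104.
In this embodiment, the first control module 1021, the second control module 1022, the third control module 1023, and the third analog switch 203 are each electrically connected through the middle backplane 104.
The first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024 each receive the hard partition information and the in-position information sent by the management board. The in-position information is that of the other node boards and indicates their in-position status. For example, the first control module 1021 receives the in-position information corresponding to the second, third, and fourth node boards.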
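The first control module 1021 sends a control signal to the first pull-up switch 1051 according to the hard partition information and the slot number information of the first node board. After receiving the control signal, the first pull-up switch 1051 closes, thereby enabling the first pull-up unit 1031. The first control module 1021 may also send a control signal to the first analog switch 201 so that the first analog switch 201 closes.
The second control module 1022 sends a control signal to the second analog switch 202 according to the hard partition information and the slot number information of the second node board, so that the second analog switch 202 closes.
The third control module 1023 sends a control signal to the third analog switch 203 according to the hard partition information and the slot number information of the third node board, so that the third analog switch 203 closes.
The fourth control module 1024 sends a control signal to the second pull-up switch 1052 according to the hard partition information and the slot number information of the fourth node board. After receiving the control signal, the second pull-up switch 1052 closes, thereby enabling the second pull-up unit 1032. The fourth control module 1024 may also send a control signal to the fourth analog switch 204 so that the fourth analog switch 204 closes.
After the first analog switch 201, the second analog switch 202, the third analog switch 203, the fourth analog switch 204, the first pull-up switch 1051, and the second pull-up switch 1052 are closed, the first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal of the fault diagnosis line to obtain a pulled-up target signal, whose level is higher than a preset threshold.
With these switches closed, the fault diagnosis line is formed. Between the first pull-up switch 1051 and the second pull-up switch 1052, it includes the first analog switch 201, the second analog switch 202, the third analog switch 203, and the fourth analog switch 204.
After the first pull-up unit 1031 and the second pull-up unit 1032 pull up the fault indication signal to obtain the target signal, the target signal is transmitted to the first level conversion unit 1081, which sends it to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold; when the level of the target signal is lower than the diagnostic threshold, the first management board 1011 determines that at least one of CPU1 to CPU8 has failed. The CPLD of the first management board 1011 then notifies the BMC or another management chip to collect the fault information of the central processing unit.
In this embodiment, when any CPU in the fault diagnosis system fails, the other CPUs can determine that the level of the target signal is lower than the preset threshold and that a faulty CPU exists. For example, when CPU1 fails, CPU2 to CPU8 determine that the level of the target signal is lower than the preset threshold, conclude that a fault has occurred, terminate their current services, and synchronize the fault status. Fault synchronization between node boards can be carried out through the middle backplane without the management board taking part, so the lines are fully decoupled from the management board and the management board's circuits and functions are simplified. The fault diagnosis system provided in this embodiment therefore reduces the routing on the server's management board and the complexity of the wiring between the node boards and the server management board.
It should be noted that in this embodiment the fourth node board may be the second node board of the embodiment in FIG. 2 described above; the order of the node boards is not limited here.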
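The fault diagnosis system provided by the embodiments of this application is described below using an 8-way server. Referring to FIG. 4, FIG. 4 is another schematic diagram of the fault diagnosis system provided by the embodiments of this application. Another embodiment of the fault diagnosis system includes:
In an 8-way server, the fault diagnosis system includes a first management board 1011, a second management board 1012, a first node board, a second node board, a third node board, a fourth node board, and a middle backplane 104. The middle backplane 104 is connected to the first management board 1011, the second management board 1012, and the first to fourth node boards, respectively.
The first node board includes a first control module 1021, a first analog switch 201, a fifth analog switch 205, a first pull-up unit 1031, a first pull-up switch 1051, a third pull-up unit 1033, a third pull-up switch 1053, a first level conversion unit 1081, CPU1, and CPU2. The first control module 1021 is electrically connected to the first analog switch 201, the fifth analog switch 205, the first pull-up switch 1051, and the third pull-up switch 1053 respectively. The first pull-up unit 1031 is electrically connected to the first pull-up switch 1051, and the third pull-up unit 1033 to the third pull-up switch 1053. The first level conversion unit 1081 is electrically connected to the first management board 1011 through the middle backplane 104, and the first control module 1021 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
The second node board includes a second control module 1022, a second analog switch 202, a sixth analog switch 206, a second pull-up unit 1032, a second pull-up switch 1052, a fourth pull-up unit 1034, a fourth pull-up switch 1054, CPU3, and CPU4. The second control module 1022 is electrically connected to the second analog switch 202, the sixth analog switch 206, the second pull-up switch 1052, and the fourth pull-up switch 1054 respectively. The second pull-up unit 1032 is electrically connected to the second pull-up switch 1052, and the fourth pull-up unit 1034 to the fourth pull-up switch 1054. The second control module 1022 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
The third node board includes a third control module 1023, a third analog switch 203, a seventh analog switch 207, a fifth pull-up unit 1035, a fifth pull-up switch 1055, a sixth pull-up unit 1036, a sixth pull-up switch 1056, CPU5, CPU6, and a second level conversion unit 1082. The third control module 1023 is electrically connected to the third analog switch 203, the seventh analog switch 207, the fifth pull-up switch 1055, and the sixth pull-up switch 1056 respectively. The fifth pull-up unit 1035 is electrically connected to the fifth pull-up switch 1055, and the sixth pull-up unit 1036 to the sixth pull-up switch 1056. The second level conversion unit 1082 is electrically connected to the second management board 1012 through the middle backplane 104, and the third control module 1023 is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.
The fourth node board includes a fourth control module 1024, a fourth analog switch 204, an eighth analog switch 208, a seventh pull-up unit 1037, a seventh pull-up switch 1057, an eighth pull-up unit 1038, an eighth pull-up switch 1058, CPU7, and CPU8. The fourth control module 1024 is electrically connected to the fourth analog switch 204, the eighth analog switch 208, the seventh pull-up switch 1057, and the eighth pull-up switch 1058 respectively, and is electrically connected to the first management board 1011 and the second management board 1012 through the middle backplane 104.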
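In this embodiment, the first analog switch 201 and the second analog switch 202 in the fault diagnosis system are electrically connected through the midplane 104, the third analog switch 203 and the fourth analog switch 204 are electrically connected through the midplane 104, and the fifth analog switch 205, the sixth analog switch 206, the seventh analog switch 207, and the eighth analog switch 208 are electrically connected through the midplane 104.
In this embodiment, the first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024 are each configured to receive the hard partition information and the in-position information of the other node boards sent by the first management board 1011. The in-position information indicates the in-position status of the other node boards.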
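When the hard partition information indicates 2P mode, the first control module 1021 sends control signals to the first pull-up switch 1051 and the third pull-up switch 1053. The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the third pull-up switch 1053 closes according to the control signal, thereby enabling the third pull-up unit 1033.
Once enabled, the first pull-up unit 1031 and the third pull-up unit 1033 pull up the level of the fault indication signal to obtain the target signal, whose level is higher than the preset threshold. The target signal is transmitted through the first level conversion unit 1081 to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold, and when it is, determines that at least one of CPU1 and CPU2 has failed.
It should be noted that the 2P mode in this embodiment means that the service currently executed by the server requires only 2 CPUs; the second control module 1022, the third control module 1023, and the fourth control module 1024 in the fault diagnosis system need not execute services.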
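When the hard partition information indicates 4P mode, the first control module 1021 sends control signals to the first pull-up switch 1051 and the first analog switch 201. The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the first analog switch 201 closes according to the control signal. The second control module 1022 sends control signals to the second pull-up switch 1052 and the second analog switch 202. The second pull-up switch 1052 closes according to the control signal, thereby enabling the second pull-up unit 1032, and the second analog switch 202 closes according to the control signal.
After the first analog switch 201 and the second analog switch 202 are closed, and the first pull-up switch 1051 and the second pull-up switch 1052 are closed, the first pull-up unit 1031 and the second pull-up unit 1032 pull up the level of the fault indication signal to obtain the target signal, whose level is higher than the preset threshold. The target signal is transmitted through the first level conversion unit 1081 to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold, and when it is, determines that at least one of CPU1 to CPU4 has failed.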
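When the hard partition information indicates 8P mode, the first control module 1021 sends control signals to the first pull-up switch 1051 and the fifth analog switch 205. The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the fifth analog switch 205 closes according to the control signal. The second control module 1022 sends a control signal to the sixth analog switch 206, which closes according to the control signal. The third control module 1023 sends a control signal to the seventh analog switch 207, which closes according to the control signal. The fourth control module 1024 sends control signals to the eighth pull-up switch 1058 and the eighth analog switch 208. The eighth pull-up switch 1058 closes according to the control signal, thereby enabling the eighth pull-up unit 1038, and the eighth analog switch 208 closes according to the control signal.
After the fifth analog switch 205 through the eighth analog switch 208 are closed, and the first pull-up switch 1051 and the eighth pull-up switch 1058 are closed, the first pull-up unit 1031 and the eighth pull-up unit 1038 pull up the level of the fault indication signal to obtain the target signal, whose level is higher than the preset threshold. The target signal is transmitted through the first level conversion unit 1081 to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold, and when it is, determines that at least one of CPU1 to CPU8 has failed.
The fault diagnosis system provided in this embodiment can switch the server between different modes according to the hard partition information; for example, the server can be switched to 2P mode, 4P mode, or 8P mode. The control modules in this embodiment can therefore control the switches to change the topology of the fault indication signal based on the current hard partition information and in-position information, combined with the slot number information of their own node boards, so as to adapt to the current hard partition setting and system requirements. This gives the circuit strong adaptability and improves the flexibility of the fault diagnosis system.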
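The mode-dependent switch control described for the FIG. 4 embodiment can be summarized as a lookup table. The sketch below is only an illustration: the reference numerals follow the text, but the table layout, the function name `switches_to_close`, and the choice to treat a control module that is not mentioned for a mode as taking no action are assumptions.

```python
# Switches each control module closes per hard-partition mode, following the
# description of the FIG. 4 embodiment. The table layout and function name
# are illustrative assumptions; the numerals are the ones used in the text.
SWITCH_PLAN = {
    "2P": {1021: {1051, 1053}},
    "4P": {1021: {1051, 201}, 1022: {1052, 202}},
    "8P": {1021: {1051, 205}, 1022: {206}, 1023: {207}, 1024: {1058, 208}},
}


def switches_to_close(mode, control_module):
    """Return the set of switch numerals a control module closes in a mode."""
    if mode not in SWITCH_PLAN:
        raise ValueError(f"unknown hard partition mode: {mode}")
    # Assumption: a module with no entry for the mode takes no action.
    return set(SWITCH_PLAN[mode].get(control_module, set()))


for mode in ("2P", "4P", "8P"):
    for module in (1021, 1022, 1023, 1024):
        print(mode, module, sorted(switches_to_close(mode, module)))
```

Keeping the plan as data rather than branching logic mirrors the idea in the text that the same hardware adapts to a new partition simply by closing a different set of switches.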
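Optionally, the fault diagnosis system shown in FIG. 4 can also be switched to two 4P modes. For example, the first management board 1011 and the units in the first node board and the second node board form one 4P mode, and the second management board 1012 and the units in the third node board and the fourth node board form another 4P mode. The functions and actions performed by the first management board 1011 and the units in the first node board and the second node board are similar to those in the embodiment corresponding to FIG. 2 above, as are those performed by the second management board 1012 and the units in the third node board and the fourth node board; details are not repeated here.
Optionally, in this embodiment the in-position information of the other node boards sent by the first management board 1011 may indicate that any one, any two, or any three of the first to fourth node boards are not in their slots or are isolated. The control module corresponding to a node board that is not in its slot or is isolated performs no action.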
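For example, when the in-position information indicates that the third node board is not in its slot or is isolated, the first management board 1011 sends the hard partition information and the slot number information to the first control module 1021, the second control module 1022, the third control module 1023, and the fourth control module 1024. When the hard partition information indicates 8P mode, after the four control modules receive the hard partition information, the third control module 1023 determines from the slot number information that its board is not in its slot or is isolated, and performs no action. The first control module 1021, the second control module 1022, and the fourth control module 1024 each determine from the slot number information that their boards are in position. The first control module 1021 then sends control signals to the first pull-up switch 1051 and the fifth analog switch 205. The first pull-up switch 1051 closes according to the control signal, thereby enabling the first pull-up unit 1031, and the fifth analog switch 205 closes according to the control signal. The second control module 1022 sends a control signal to the sixth analog switch 206, which closes according to the control signal. The fourth control module 1024 sends control signals to the eighth pull-up switch 1058 and the eighth analog switch 208. The eighth pull-up switch 1058 closes according to the control signal, thereby enabling the eighth pull-up unit 1038, and the eighth analog switch 208 closes according to the control signal.
The first pull-up unit 1031 and the eighth pull-up unit 1038 pull up the level of the fault indication signal to obtain the target signal, whose level is higher than the preset threshold. The target signal is transmitted through the first level conversion unit 1081 to the first management board 1011. The first management board 1011 detects whether the level of the target signal is lower than the preset threshold, and when it is, determines that at least one of CPU1 to CPU4 and CPU7 to CPU8 has failed.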
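The effect of the in-position information on the monitored CPU set can be sketched as follows. The board-to-CPU assignment is the one given for the FIG. 4 embodiment; the function name `monitored_cpus` and the dictionary layout are hypothetical, and the sketch covers only the 8P case discussed in the example.

```python
# CPUs carried by each node board in the FIG. 4 embodiment (from the text).
BOARD_CPUS = {
    1: ["CPU1", "CPU2"],
    2: ["CPU3", "CPU4"],
    3: ["CPU5", "CPU6"],
    4: ["CPU7", "CPU8"],
}


def monitored_cpus(in_position):
    """List the CPUs on the 8P fault line, given board number -> in-position flag.

    A board that is out of its slot or isolated takes no action, so its
    CPUs are left out of the monitored set.
    """
    cpus = []
    for board in sorted(BOARD_CPUS):
        if in_position.get(board, False):
            cpus.extend(BOARD_CPUS[board])
    return cpus


# Third node board absent or isolated, as in the example in the text.
result = monitored_cpus({1: True, 2: True, 3: False, 4: True})
print(result)
```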
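The fault diagnosis system provided in the embodiments of this application is described below using two 4-way servers as an example. Referring to FIG. 5, which is another schematic diagram of the fault diagnosis system provided in the embodiments of this application, another embodiment of the fault diagnosis system includes the following.
For the two 4-way servers, the fault diagnosis system includes a first management board 1011, a second management board 1012, a first node board, a second node board, a third node board, a fourth node board, and a midplane 104. The midplane 104 is connected to each of the first management board 1011, the second management board 1012, the first node board, the second node board, the third node board, and the fourth node board. For the units contained in each of the first to fourth node boards, refer to FIG. 5; details are not repeated here.
As shown in FIG. 5, the first management board 1011 is electrically connected to each of the first control module 1021 and the second control module 1022, and is electrically connected to the first level conversion unit 1081 through the midplane 104. The second management board 1012 is electrically connected to each of the third control module 1023 and the fourth control module 1024, and is electrically connected to the second level conversion unit 1082 through the midplane 104.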
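In this embodiment, the first management board 1011 sends the hard partition information and the in-position information of the other node boards to each of the first control module 1021 and the second control module 1022, and the second management board 1012 may send the hard partition information and the in-position information of the other node boards to each of the third control module 1023 and the fourth control module 1024. The hard partition information indicates 4P mode.
After the first control module 1021 and the second control module 1022 receive the 4P hard partition information sent by the first management board 1011, the first management board 1011 and the units in the first node board and the second node board form one 4P mode. The functions and actions performed by the first management board 1011 and the units in the first node board and the second node board are similar to those in the embodiment corresponding to FIG. 2 above; details are not repeated here.
After the third control module 1023 and the fourth control module 1024 receive the 4P hard partition information sent by the second management board 1012, the second management board 1012 and the units in the third node board and the fourth node board form another 4P mode. The functions and actions performed by the second management board 1012 and the units in the third node board and the fourth node board are similar to those in the embodiment corresponding to FIG. 2 above; details are not repeated here.
It should be noted that this embodiment can also provide a fault diagnosis system for four 2-way servers, which is similar to the fault diagnosis system for two 4-way servers described above; details are not repeated here.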
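The grouping of the FIG. 5 embodiment into two independent 4P partitions can be written down as a small data structure. This is a hedged sketch: the assignment of control modules and level conversion units to management boards follows the text, while the dictionary layout and the helper name `management_board_for` are hypothetical.

```python
# Two 4P partitions of the FIG. 5 embodiment. Which control modules and level
# conversion unit belong to which management board follows the text; the
# layout and the helper name are illustrative assumptions.
PARTITIONS = {
    1011: {"control_modules": (1021, 1022), "level_conversion_unit": 1081},
    1012: {"control_modules": (1023, 1024), "level_conversion_unit": 1082},
}


def management_board_for(control_module):
    """Return the management board that sends 4P information to a control module."""
    for board, partition in PARTITIONS.items():
        if control_module in partition["control_modules"]:
            return board
    raise ValueError(f"unknown control module: {control_module}")


for module in (1021, 1022, 1023, 1024):
    print(module, "->", management_board_for(module))
```

Each partition has its own management board and level conversion unit, so a fault on one fault diagnosis line is reported only to the management board of that partition.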
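An embodiment of this application further provides a server that includes the fault diagnosis system corresponding to FIG. 1, FIG. 2, FIG. 3, FIG. 4, or FIG. 5. For details, refer to the embodiments corresponding to FIG. 1 to FIG. 5; they are not repeated here.
The foregoing embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.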

Claims (13)

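  1. A fault diagnosis system, characterized in that the fault diagnosis system comprises a control unit and a first management board, a first pull-up unit and a second pull-up unit, a first pull-up switch and a second pull-up switch, and at least one central processing unit;
    the first pull-up unit is electrically connected to the first pull-up switch, the second pull-up unit is electrically connected to the second pull-up switch, the control unit is electrically connected to the first management board, and the control unit is electrically connected to each of the first pull-up switch and the second pull-up switch;
    the control unit is configured to receive a hard partition signal sent by the first management board, and to control, according to the hard partition signal, the first pull-up switch and the second pull-up switch to close respectively, so that each central processing unit is electrically connected to the first pull-up switch and the second pull-up switch respectively to form a fault diagnosis line, the fault diagnosis line comprising a line from the first pull-up unit through the first pull-up switch, the at least one central processing unit, and the second pull-up switch to the second pull-up unit;
    the first pull-up unit and the second pull-up unit are configured to pull up a fault indication signal of the fault diagnosis line to obtain a pulled-up target signal;
    the first management board is configured to detect whether the level of the target signal is lower than a preset threshold, and when the level of the target signal is lower than the preset threshold, to determine that a faulty central processing unit exists among the at least one central processing unit on the fault diagnosis line.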
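  2. The fault diagnosis system according to claim 1, characterized in that the fault diagnosis system further comprises a first analog switch and a second analog switch;
    the control unit comprises a first control module and a second control module, the first control module is electrically connected to each of the first pull-up switch and the first analog switch, and the second control module is electrically connected to each of the second pull-up switch and the second analog switch;
    the first control module is configured to receive the hard partition signal sent by the first management board and to control, according to the hard partition signal, the first pull-up switch and the first analog switch to close respectively, and the second control module is configured to receive the hard partition signal sent by the first management board and to control, according to the hard partition signal, the second pull-up switch and the second analog switch to close respectively, so that each central processing unit is further electrically connected to the first analog switch and the second analog switch respectively to form the fault diagnosis line, the fault diagnosis line further comprising the first analog switch and the second analog switch between the first pull-up switch and the second pull-up switch.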
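  3. The fault diagnosis system according to claim 2, characterized in that the fault diagnosis system further comprises a third analog switch and a fourth analog switch;
    the control unit further comprises a third control module and a fourth control module, the third control module is electrically connected to the third analog switch, and the fourth control module is electrically connected to the fourth analog switch;
    the third control module is configured to receive the hard partition signal sent by the first management board and to control, according to the hard partition signal, the third analog switch to close, and the fourth control module is configured to receive the hard partition signal sent by the first management board and to control, according to the hard partition signal, the fourth analog switch to close, so that each central processing unit is further electrically connected to the third analog switch and the fourth analog switch respectively to form the fault diagnosis line, the fault diagnosis line further comprising the third analog switch and the fourth analog switch between the first analog switch and the second analog switch.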
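  4. The fault diagnosis system according to claim 2, characterized in that the fault diagnosis system further comprises a third pull-up unit and a third pull-up switch, and a fourth pull-up unit and a fourth pull-up switch, wherein the third pull-up switch is electrically connected to the first management board, the third pull-up unit is electrically connected to the third pull-up switch, the fourth pull-up switch is electrically connected to the first management board, and the fourth pull-up unit is electrically connected to the fourth pull-up switch.
  5. The fault diagnosis system according to claim 3, characterized in that the fault diagnosis system further comprises a fifth pull-up unit and a fifth pull-up switch, a sixth pull-up unit and a sixth pull-up switch, a seventh pull-up unit and a seventh pull-up switch, and an eighth pull-up unit and an eighth pull-up switch, wherein the fifth pull-up switch is electrically connected to the first management board, the fifth pull-up unit is electrically connected to the fifth pull-up switch, the sixth pull-up switch is electrically connected to the first management board, the sixth pull-up unit is electrically connected to the sixth pull-up switch, the seventh pull-up switch is electrically connected to the first management board, the seventh pull-up unit is electrically connected to the seventh pull-up switch, the eighth pull-up switch is electrically connected to the first management board, and the eighth pull-up unit is electrically connected to the eighth pull-up switch.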
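  6. The fault diagnosis system according to any one of claims 1 to 5, characterized in that the fault diagnosis system further comprises a midplane, the midplane being configured to connect the node boards respectively corresponding to the first control module, the second control module, the third control module, and the fourth control module.
  7. The fault diagnosis system according to claim 6, characterized in that the first management board is further configured to send in-position information to the first control module, the second control module, the third control module, and/or the fourth control module, the in-position information indicating the in-position status of the node boards respectively corresponding to the first control module, the second control module, the third control module, and/or the fourth control module.
  8. The fault diagnosis system according to any one of claims 1 to 6, characterized in that the fault diagnosis system further comprises a level conversion unit, the level conversion unit being electrically connected to the first management board.
  9. The fault diagnosis system according to claim 8, characterized in that the fault diagnosis system further comprises a second management board, the second management board being electrically connected to each of the first control module, the second control module, the third control module, and the fourth control module.
  10. The fault diagnosis system according to claim 9, characterized in that the hard partition information comprises 2P mode information, 4P mode information, or 8P mode information, wherein the 2P mode information corresponds to 2 processors of the at least one processor, the 4P mode information corresponds to 4 processors of the at least one processor, and the 8P mode information corresponds to 8 processors of the at least one processor.
  11. The fault diagnosis system according to any one of claims 1 to 6, wherein the first management board comprises a first complex programmable logic device (CPLD).
  12. The fault diagnosis system according to claim 11, wherein the control unit comprises a second CPLD.
  13. A server, characterized in that the server comprises the fault diagnosis system according to any one of claims 1 to 12.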
PCT/CN2019/090352 2018-09-06 2019-06-06 Fault diagnosis system and server WO2020048174A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19857042.6A EP3835903B1 (en) 2018-09-06 2019-06-06 Fault diagnosis system and server
US17/193,048 US11347611B2 (en) 2018-09-06 2021-03-05 Fault diagnosis system and server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811039579.7A CN109101009B (zh) 2018-09-06 2018-09-06 Fault diagnosis system and server
CN201811039579.7 2018-09-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/193,048 Continuation US11347611B2 (en) 2018-09-06 2021-03-05 Fault diagnosis system and server

Publications (1)

Publication Number Publication Date
WO2020048174A1 true WO2020048174A1 (zh) 2020-03-12

Family

ID=64865427

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090352 WO2020048174A1 (zh) 2018-09-06 2019-06-06 Fault diagnosis system and server

Country Status (4)

Country Link
US (1) US11347611B2 (zh)
EP (1) EP3835903B1 (zh)
CN (1) CN109101009B (zh)
WO (1) WO2020048174A1 (zh)

Also Published As

Publication number Publication date
US11347611B2 (en) 2022-05-31
CN109101009A (zh) 2018-12-28
CN109101009B (zh) 2020-08-14
EP3835903A1 (en) 2021-06-16
EP3835903A4 (en) 2021-10-13
US20210191831A1 (en) 2021-06-24
EP3835903B1 (en) 2023-01-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19857042; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019857042; Country of ref document: EP; Effective date: 20210311)