WO2017041533A1 - Method and apparatus for disconnecting a link between a PCIe device and a host - Google Patents

Info

Publication number
WO2017041533A1
Authority
WO
WIPO (PCT)
Prior art keywords
host
tlp packet
packet
error type
error
Prior art date
Application number
PCT/CN2016/083801
Inventor
张浩鹏
吴沛
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to ES16843471T (ES2748228T3)
Priority to EP16843471.0A (EP3296885B1)
Publication of WO2017041533A1
Priority to US15/819,440 (US10565043B2)
Priority to US16/740,717 (US11620175B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/382 Information transfer, e.g. on bus using universal interface adapter
    • G06F13/385 Information transfer, e.g. on bus using universal interface adapter for adaptation of a particular data processing system to different peripheral devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0745 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in an input/output transactions management context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4027 Coupling between buses using bus bridges
    • G06F13/4031 Coupling between buses using bus bridges with arbitration
    • G06F13/4036 Coupling between buses using bus bridges with arbitration and deadlock prevention
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4204 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
    • G06F13/4221 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/12 Arrangements for detecting or preventing errors in the information received by using return channel
    • H04L1/16 Arrangements for detecting or preventing errors in the information received by using return channel in which the return channel carries supervisory signals, e.g. repetition request signals
    • H04L1/18 Automatic repetition systems, e.g. Van Duuren systems
    • H04L1/1867 Arrangements specially adapted for the transmitter end
    • H04L1/188 Time-out mechanisms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/24 Testing correct operation

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a method and apparatus for disconnecting a link between a PCIe device and a host.
  • a host can connect multiple IO (input output) devices.
  • Each IO device in a plurality of IO devices includes a PCIe device.
  • the host connects to multiple PCIe devices, and performs data interaction with other devices except the host through multiple PCIe devices to complete the services of the host.
  • the PCIe device may be abnormal during operation.
  • the host is not sure which PCIe device is abnormal.
  • the host's CPU (Central Processing Unit)
  • the host disconnects all PCIe devices connected to the host, so the host can no longer interact with other devices, which affects the services of the host.
  • the present invention provides a method and apparatus for disconnecting a link between a PCIe device and a host.
  • the technical solutions are as follows:
  • the present invention provides a method for disconnecting a link between a PCIe (bus and interface standard) device and a host, the PCIe device including an endpoint (EP) device, the method comprising:
  • the EP device counts the duration of occurrence of the error type
  • the EP device disconnects from the host.
  • the EP device acquires an error type of transmitting a TLP packet error between the PCIe device and the host, including:
  • the EP device determines that the error type of transmitting the TLP packet error between the PCIe device and the host is a negative-acknowledgment (NAK) error type.
  • the EP device acquires an error type of transmitting a TLP packet error between the PCIe device and the host, including:
  • the EP device determines that the error type of transmitting the TLP packet error between the PCIe device and the host is a transmission error type.
  • the determining, by the EP device, whether the TLP packet is a preset TLP packet includes:
  • the EP device acquires a first sequence number of the TLP packet, and predicts a third sequence number of the TLP packet according to a second sequence number of a last TLP packet that is closest to the current time;
  • the EP device determines that the TLP packet is not a preset TLP packet.
  • the transmission error type includes a retransmission error type and a missed transmission error type
  • the method further includes:
  • the EP device determines that the error type of transmitting the TLP packet error between the PCIe device and the host is a missed transmission error type
  • the EP device determines that the error type of transmitting the TLP packet error between the PCIe device and the host is a retransmission error type.
  • the EP device acquires an error type of transmitting a TLP packet error between the PCIe device and the host, including:
  • the EP device determines that the error type of transmitting a TLP packet error between the PCIe device and the host is a credit value insufficient error type.
  • the acquiring, by the EP device, the first credit value required for the TLP packet to be sent by the host includes:
  • the EP device determines a first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the EP device acquires an error type of transmitting a TLP packet error between the PCIe device and the host, including:
  • the EP device detects whether an abnormality occurs in the PCIe device
  • the EP device determines that the error type of transmitting the TLP packet error between the PCIe device and the host is a self-abnormal error type.
  • the EP device disconnects from the host, including:
  • the EP device sets a system clock of the PCIe device to an unavailable state by a gated clock, where the unavailable state is used to instruct the PCIe device to refuse to process a processing request sent by the host.
  • the method further includes:
  • the EP device disconnects from the host.
  • the present invention provides an apparatus for disconnecting a link between a PCIe (bus and interface standard) device and a host, the PCIe device including an endpoint (EP) device, the apparatus comprising:
  • An acquiring module configured to acquire an error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
  • a statistics module configured to count the duration of the error type if the error type is a repairable error type specified in the PCIe protocol;
  • the acquiring module includes:
  • a first receiving unit configured to receive a TLP packet sent by the host
  • a first determining unit configured to determine whether the TLP packet is damaged
  • a second determining unit configured to determine, if the TLP packet is damaged, that the error type of transmitting the TLP packet error between the PCIe device and the host is a negative-acknowledgment (NAK) error type.
  • the acquiring module includes:
  • a second receiving unit configured to receive a TLP packet sent by the host
  • a third determining unit configured to determine whether the TLP packet is a preset TLP packet
  • a fourth determining unit configured to determine, when the TLP packet is not the preset TLP packet, an error type of transmitting the TLP packet error between the PCIe device and the host as a transmission error type.
  • the third determining unit is configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the last TLP packet that is closest to the current time, and, if the first sequence number and the third sequence number are not equal, determine that the TLP packet is not the preset TLP packet.
  • the transmission error type includes a retransmission error type and a missed transmission error type
  • the acquiring module further includes:
  • a fifth determining unit configured to determine, when the TLP packet is newer than the preset TLP packet, that an error type of transmitting the TLP packet error between the PCIe device and the host is a missed transmission type
  • a sixth determining unit configured to determine, when the TLP packet is older than the preset TLP packet, that the error type of the PCIe device and the host transmitting the TLP packet error is a retransmission error type.
  • the acquiring module includes:
  • An acquiring unit configured to acquire a first credit value required by the TLP packet to be sent by the host, and a second credit value currently remaining by the EP device;
  • a seventh determining unit configured to determine, when the first credit value is greater than the second credit value, that an error type of transmitting a TLP packet error between the PCIe device and the host is a credit value insufficient error type.
  • the acquiring unit is configured to acquire a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and to determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the acquiring module includes:
  • a detecting unit configured to detect whether an abnormality occurs in the PCIe device
  • an eighth determining unit configured to determine, if the detecting unit detects that the PCIe device is abnormal, that the error type of transmitting a TLP packet error between the PCIe device and the host is a self-abnormal error type.
  • the disconnecting module is configured to set, by using a gated clock, a system clock of the PCIe device to an unavailable state, where the unavailable state is used to instruct the PCIe device to refuse to process a processing request sent by the host.
  • the disconnecting module is further configured to disconnect the link to the host if the error type is an unrepairable error type specified in the PCIe protocol.
  • the present invention provides a PCIe (bus and interface standard) device, the PCIe device comprising an endpoint (EP) device, the EP device comprising a memory and a processor, the memory being used to store data for the processor;
  • the processor is configured to acquire an error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
  • the processor is further configured to count the duration of the error type if the error type is a repairable error type specified in the PCIe protocol;
  • the processor is further configured to disconnect the link to the host if the duration reaches a preset duration.
  • the processor is further configured to receive a TLP packet sent by the host, and determine whether the TLP packet is damaged.
  • the processor is further configured to determine, if the TLP packet is damaged, that the error type of transmitting the TLP packet error between the PCIe device and the host is a non-acknowledgment NAK error type.
  • the processor is further configured to receive a TLP packet sent by the host, and determine whether the TLP packet is a preset TLP packet;
  • the processor is further configured to determine, if the TLP packet is not the preset TLP packet, an error type of transmitting the TLP packet error between the PCIe device and the host as a transmission error type.
  • the processor is further configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the last TLP packet that is closest to the current time, and, if the first sequence number and the third sequence number are not equal, determine that the TLP packet is not the preset TLP packet.
  • the transmission error type includes a retransmission error type and a missed transmission error type
  • the processor is further configured to: if the TLP packet is newer than the preset TLP packet, determine that the error type of transmitting the TLP packet error between the PCIe device and the host is a missed transmission type;
  • the processor is further configured to: if the TLP packet is older than the preset TLP packet, determine that the error type of the PCIe device and the host transmitting the TLP packet error is a retransmission error type.
  • the processor is further configured to acquire a first credit value required for the TLP packet to be sent by the host, and a second credit value currently remaining in the EP device;
  • the processor is further configured to determine that an error type of transmitting a TLP packet error between the PCIe device and the host is a credit value insufficient error type if the first credit value is greater than the second credit value.
  • the processor is further configured to acquire a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and to determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the processor is further configured to detect whether an abnormality occurs in the PCIe device;
  • the processor is further configured to determine that the error type of transmitting a TLP packet error between the PCIe device and the host is a self-abnormal error type if an abnormality is detected in the PCIe device.
  • the processor is further configured to set a system clock of the PCIe device to an unavailable state by using a gated clock, where the unavailable state is used to instruct the PCIe device to refuse to process a processing request sent by the host.
  • the processor is further configured to disconnect the link to the host if the error type is an unrepairable error type specified in the PCIe protocol.
  • In the embodiments of the present invention, the EP device acquires the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device counts the duration of the error type; if the duration reaches the preset duration, the EP device disconnects the link to the host. The EP device thus determines whether the link between the PCIe device and the host is abnormal by detecting the error type of the TLP transmission error, and disconnects that link when it is abnormal. The host therefore does not need to disconnect its links to all PCIe devices, which reduces the impact on host services.
  • FIG. 1-1 is an application scenario diagram of disconnecting a link between a PCIe device and a host according to an embodiment of the present invention.
  • FIG. 1-2 is a flowchart of a method for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention.
  • FIG. 2-1 is a flowchart of a method for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention.
  • FIG. 2-2 is a hardware diagram for detecting a NAK error type according to an embodiment of the present invention.
  • FIG. 2-3 is a hardware diagram for detecting a transmission error type according to an embodiment of the present invention.
  • FIG. 2-4 is a hardware diagram for detecting an insufficient credit value error type according to an embodiment of the present invention.
  • FIG. 3-1 is a schematic structural diagram of an apparatus for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention.
  • FIG. 3-2 is a schematic structural diagram of an acquiring module according to an embodiment of the present invention.
  • FIG. 3-3 is a schematic structural diagram of another acquiring module according to an embodiment of the present invention.
  • FIG. 3-4 is a schematic structural diagram of another acquiring module according to an embodiment of the present invention.
  • FIG. 3-5 is a schematic structural diagram of another acquiring module according to an embodiment of the present invention.
  • FIG. 3-6 is a schematic structural diagram of another acquiring module according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a PCIe device according to an embodiment of the present invention.
  • An embodiment of the present invention provides an application scenario of a method for disconnecting a link between a PCIe device and a host.
  • When the host is connected to only one IO device, the host connects directly to the IO device through the RP port;
  • when the host is connected to multiple IO devices, the host connects to them through a PCIe SW (Switch).
  • the IO device includes a PCIe device, and the PCIe device includes an EP (End Point) device.
  • The PCIe SW includes a UP (Upstream Port) and a DP (Downstream Port).
  • The PCIe SW is connected to the CPU of the host through the UP, and to the EP device of the PCIe device of the IO device through the DP.
  • The EP device includes a PL (Physical Layer), a DL (Data Link Layer), and a TL (Transaction Layer).
  • The TL is used to interact with the user, and the DL is used to perform data interaction with the host.
  • The PL is used to interact with the PCIe device. When the DL or the TL detects an abnormality, the link between the PCIe device and the host can be disconnected.
  • the embodiment of the present invention provides a method for disconnecting a link between a PCIe device and a host.
  • the PCIe device includes an EP device.
  • the execution body of the method may be an EP device. Referring to FIG. 1-2, the method includes:
  • Step 101: The EP device acquires the error type of a TLP (Transaction Layer Packet) transmission error between the PCIe device and the host.
  • Step 102: If the error type is a repairable error type specified in the PCIe protocol, the EP device counts the duration of the error type.
  • Step 103: If the duration reaches the preset duration, the EP device disconnects the link to the host.
  • In the embodiments of the present invention, the EP device acquires the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device counts the duration of the error type; if the duration reaches the preset duration, the EP device disconnects the link to the host. The EP device thus determines whether the link between the PCIe device and the host is abnormal by detecting the error type of the TLP transmission error, and disconnects that link when it is abnormal. The host therefore does not need to disconnect its links to all PCIe devices, which reduces the impact on host services.
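  • Steps 101 to 103 can be sketched in software as follows. This is only an illustrative model, not the patented hardware: the class name `LinkErrorMonitor`, the use of seconds as the time unit, and the choice to reset the counter when the error clears are assumptions that do not come from the text. The immediate disconnect for an unrepairable error type follows the disconnecting module described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkErrorMonitor:
    """Hypothetical helper that counts how long a repairable error persists."""
    preset_duration: float               # threshold; seconds are an assumed unit
    error_since: Optional[float] = None  # time the current error was first seen
    disconnected: bool = False

    def observe(self, now: float, error: Optional[str], repairable: bool = True) -> bool:
        """Feed one observation; return True once the link should be disconnected."""
        if self.disconnected:
            return True
        if error is None:
            # Assumption: a cleared error resets the duration counter.
            self.error_since = None
            return False
        if not repairable:
            # Unrepairable error type: disconnect without waiting.
            self.disconnected = True
            return True
        if self.error_since is None:
            self.error_since = now
        if now - self.error_since >= self.preset_duration:
            self.disconnected = True
        return self.disconnected


monitor = LinkErrorMonitor(preset_duration=5.0)
results = [monitor.observe(t, "NAK") for t in (0.0, 2.0, 5.0)]
print(results)
```

  • In this run the NAK error persists from time 0 to time 5, so only the third observation reaches the preset duration and triggers the disconnect.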
  • the embodiment of the present invention provides a method for disconnecting a link between a PCIe device and a host.
  • the PCIe device includes an EP device, and the execution body of the method may be an EP device. Referring to FIG. 2-1, the method includes:
  • Step 201 The EP device acquires an error type of transmitting a TLP packet error between the PCIe device and the host.
  • When the host performs a service interaction with the PCIe device, the host sends a resource request to the EP device included in the PCIe device, where the resource request carries the packet header type, the packet data type, and the packet data length of the TLP packet. The EP device receives the resource request, calculates the credit value required by the TLP packet according to the resource request, and sends that credit value to the host. The host receives the credit value sent by the EP device and sends the TLP packet to the EP device using the credit value.
  • A TLP packet transmission error may occur between the PCIe device and the host because the link between them is abnormal, the credit value is insufficient, or the EP device itself is abnormal. Therefore, this step may be implemented in the following first, second, third, or fourth manner.
  • In the first manner, this step may be:
  • The EP device receives the TLP packet sent by the host and determines whether the TLP packet is damaged. If the TLP packet is damaged, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is a NAK (Negative Acknowledgment) error type.
  • The EP device determines whether the TLP packet carries the damaged identifier. If the TLP packet carries the damaged identifier, the EP device determines that the TLP packet is damaged; if the TLP packet does not carry the damaged identifier, the EP device determines that the TLP packet is not damaged.
  • If the EP device determines that the TLP packet is not damaged, the EP device sends an ACK (Acknowledgement) to the host. The host receives the ACK, determines from it that the EP device correctly received the TLP packet, and then sends the next TLP packet to the EP device.
  • If the EP device determines that the TLP packet is damaged, the EP device sends a NAK to the host. The host receives the NAK, determines from it that the EP device did not correctly receive the TLP packet, and then resends the TLP packet to the EP device until it receives an ACK from the EP device.
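  • The ACK/NAK exchange described above can be modelled with a short sketch. The names `Tlp`, `receive`, and `send_until_acked` are hypothetical, the boolean `damaged` field merely stands in for the damaged identifier carried by the packet, and the `max_attempts` bound is an assumption added so that the simulation terminates.

```python
from dataclasses import dataclass


@dataclass
class Tlp:
    seq: int
    damaged: bool  # stands in for the "damaged identifier" carried by the packet


def receive(tlp: Tlp) -> str:
    """EP side: answer NAK for a damaged packet, ACK otherwise."""
    return "NAK" if tlp.damaged else "ACK"


def send_until_acked(seq: int, damaged_attempts: int, max_attempts: int = 10) -> int:
    """Host side: resend the same packet until the EP device answers ACK.

    The first `damaged_attempts` transmissions are simulated as damaged.
    Returns the number of transmissions that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        tlp = Tlp(seq=seq, damaged=attempt <= damaged_attempts)
        if receive(tlp) == "ACK":
            return attempt
    raise RuntimeError("no ACK received within max_attempts")


attempts = send_until_acked(seq=7, damaged_attempts=2)
print(attempts)
```

  • Each NAK answered here is one occurrence of the NAK error type; when NAKs keep recurring, the duration counted in the next step keeps growing.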
  • In the second manner, this step can be implemented by the following steps (1) and (2):
  • the EP device receives the TLP packet sent by the host, and determines whether the TLP packet is a preset TLP packet.
  • So that the EP device can determine whether the host has repeatedly sent a TLP packet or has missed sending a TLP packet, the host carries the sequence number of the TLP packet in the TLP packet sent to the EP device, and the sequence numbers of two adjacent TLP packets differ by one. Therefore, the EP device can determine the preset TLP packet according to the sequence number of the last TLP packet that is closest to the current time.
  • the preset TLP packet is the TLP packet that the host should currently send to the EP device.
  • This step can be implemented by the following steps (1-1) to (1-2), including:
  • the EP device obtains the first sequence number of the TLP packet, and predicts the third sequence number of the TLP packet according to the second sequence number of the last TLP packet that is closest to the current time;
  • the EP device obtains the sequence number carried in the TLP packet.
  • For ease of distinction, the sequence number carried by the TLP packet is referred to as the first sequence number, and the first sequence number is stored in the sequence number list so that it can be obtained later.
  • The sequence number list stores the sequence numbers of the TLP packets that the EP device has received, and the EP device obtains from the list the sequence number of the last TLP packet that is closest to the current time. For ease of distinction, the sequence number of that previous TLP packet is referred to as the second sequence number.
  • The EP device adds one to the second sequence number to obtain the predicted sequence number of the TLP packet, which is referred to as the third sequence number.
  • The EP device determines whether the first sequence number and the third sequence number are equal. If they are equal, the EP device determines that the TLP packet is the preset TLP packet; if they are not equal, the EP device determines that the TLP packet is not the preset TLP packet.
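  • A minimal sketch of steps (1-1) and (1-2) follows. It assumes the 12-bit sequence number mentioned in the rollover discussion, so the prediction wraps from 4095 to 0; the function names are hypothetical, and storing the first sequence number in the list is left out for brevity.

```python
from typing import Sequence

SEQ_BITS = 12                # assumed width of the sequence number
SEQ_MODULUS = 1 << SEQ_BITS  # 4096 possible values, 0..4095


def predict_next(second_seq: int) -> int:
    """Third sequence number: the previous packet's number plus one, with rollover."""
    return (second_seq + 1) % SEQ_MODULUS


def is_preset_packet(first_seq: int, received_seqs: Sequence[int]) -> bool:
    """Compare the received (first) number with the predicted (third) number."""
    second_seq = received_seqs[-1]  # last TLP packet closest to the current time
    third_seq = predict_next(second_seq)
    return first_seq == third_seq


print(is_preset_packet(5, [3, 4]))  # packet 5 follows packet 4
print(is_preset_packet(0, [4095]))  # rollover from 4095 to 0
print(is_preset_packet(6, [3, 4]))  # packet 5 is absent
```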
  • step 201 is performed. If the TLP packet is not a preset TLP packet, perform the following step (2).
  • the EP device determines that the error type of the TLP packet error transmitted between the PCIe device and the host is a transmission error type.
  • The EP device may further determine, according to the first sequence number and the third sequence number, whether the TLP packet is newer or older than the preset TLP packet. Because the sequence number is a 12-bit unsigned number, it rolls over to 0 after counting to 4095 and continues counting. Therefore, the magnitude of a sequence number alone does not indicate whether a TLP packet is newer or older. For example, if the sequence number of the TLP packet is 4095 and the sequence number of the preset TLP packet is 0, then although 4095 is greater than 0, the TLP packet is older than the preset TLP packet.
  • Therefore, the EP device may determine, according to the first sequence number and the third sequence number, whether the TLP packet is newer or older than the preset TLP packet through the following process:
  • The EP device obtains the number of bits of the sequence number generated by the host, calculates the sequence number difference between the first sequence number and the third sequence number, calculates the first value according to the number of bits, and calculates the remainder of the sequence number difference modulo the first value. If the remainder is greater than the second value, the EP device determines that the TLP packet is older than the preset TLP packet; if the remainder is less than the second value, the EP device determines that the TLP packet is newer than the preset TLP packet.
  • The first value is equal to 2 raised to the power of the number of bits, and the first value is divided by 2 to obtain the second value.
  • the first value is 4096
  • the second value is 2047
  • the first sequence number is A_Seq
  • the second sequence number is B_Seq
  • If the TLP packet is newer than the preset TLP packet, the EP device determines that the host missed sending one or more TLP packets; if the TLP packet is older than the preset TLP packet, the EP device determines that the host repeatedly sent the TLP packet to the EP device. Therefore, the transmission error type includes the retransmission error type and the missed transmission error type. If the TLP packet is newer than the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the missed transmission error type; if the TLP packet is older than the preset TLP packet, the EP device determines that the error type is the retransmission error type.
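  • The newer/older decision can be illustrated with modular arithmetic. The sketch below assumes the difference is taken as first minus third, which matches the 4095-versus-0 example, and derives the second value by halving the first value as the text states (for 12 bits that gives 2048, whereas the listed example value is 2047). The function name and the handling of a remainder exactly equal to the second value are assumptions.

```python
def classify_transmission(first_seq: int, third_seq: int, bits: int = 12) -> str:
    """Classify a received TLP against the predicted (preset) sequence number."""
    first_value = 1 << bits          # 2 to the power of the number of bits
    second_value = first_value // 2  # assumption: plain halving of the first value
    remainder = (first_seq - third_seq) % first_value
    if remainder == 0:
        return "preset"               # the expected packet, no transmission error
    if remainder > second_value:
        return "retransmission"       # packet is older than the preset packet
    if remainder < second_value:
        return "missed transmission"  # packet is newer than the preset packet
    return "unspecified"              # the text does not cover equality


print(classify_transmission(4095, 0))  # older, despite 4095 > 0
print(classify_transmission(2, 0))     # newer: packets 0 and 1 were skipped
```

  • Python's `%` operator always returns a non-negative result for a positive modulus, so a negative difference such as 0 minus 1 also lands in the upper half and is classified as a retransmission.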
  • In the third manner, this step can be implemented by the following steps (A) and (B):
  • When the host performs a service interaction with the PCIe device, the host sends a resource request to the EP device included in the PCIe device, where the resource request carries the packet header type, the packet data type, and the packet data length of the TLP packet. The EP device receives the resource request sent by the host and calculates the credit value required for the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the credit value required for the TLP packet is referred to as the first credit value.
  • the EP device stores a correspondence between the packet header type and the credit value, and stores a correspondence relationship between the packet data type, the packet data length, and the credit value.
  • the step of calculating, by the EP device, the first credit value required for the TLP packet according to the packet header type, the packet data type, and the packet data length may be:
  • the EP device obtains, according to the packet header type, a third credit value required for the packet header of the TLP packet from the correspondence between the packet header type and the credit value; obtains, according to the packet data type and the packet data length, a fourth credit value required for the packet data of the TLP packet from the correspondence among the packet data type, the packet data length, and the credit value; and adds the third credit value and the fourth credit value to obtain the first credit value required by the TLP packet.
  • the packet header type may be PH (Posted Header) or NPH (Non-Posted Header); for both PH and NPH, the header of each TLP packet consumes exactly one credit value.
  • the packet data types include PD (Posted Data) and NPD (Non-Posted Data). For NPD, the packet data of each TLP packet consumes exactly one credit value; for PD, the EP device determines the number of credit values required for the packet data from the packet data length.
  • because the host can initiate only ordinary read and write operations, the embodiment of the present invention may also set the PD-type packet data of each TLP packet to consume one credit value.
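  • as an illustration, the calculation of the first credit value from the two correspondences can be sketched as follows. This is a minimal sketch; the function name `required_credits` is hypothetical, and the figure of 16 bytes of posted data per credit is an assumption, since the text only says that the PD credit count depends on the packet data length.

```python
HEADER_CREDITS = {"PH": 1, "NPH": 1}  # each header consumes one credit value
BYTES_PER_DATA_CREDIT = 16            # assumption: one data credit covers 16 bytes


def required_credits(header_type: str, data_type: str, data_length: int) -> int:
    """Return the first credit value: header credits plus data credits."""
    third = HEADER_CREDITS[header_type]              # third credit value (header)
    if data_type == "NPD":
        fourth = 1                                   # NPD data consumes one credit
    elif data_type == "PD":
        # Round up so that a partially filled unit still costs a full credit.
        fourth = -(-data_length // BYTES_PER_DATA_CREDIT)
    else:
        raise ValueError(f"unknown packet data type: {data_type}")
    return third + fourth                            # first credit value
```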
  • the step of the EP device acquiring the current remaining second credit value may be:
  • the EP device sets up a register that records the credit value the EP device has already consumed, and calculates the currently remaining credit value from the total credit value of the EP device and the credit value already consumed. For ease of distinction, the currently remaining credit value is referred to as the second credit value.
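  • (B) if the first credit value is greater than the second credit value, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type.
  • further, if the first credit value is not greater than the second credit value, the EP device sends the first credit value to the host; the host receives the first credit value sent by the EP device and uses it to send the TLP packet to the PCIe device.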
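  • as an illustration, the register of consumed credits and the comparison of the first and second credit values can be sketched as follows. This is a minimal sketch; the class `CreditTracker`, its method names, and the returned labels are hypothetical, and adding a granted request to the consumed total is an assumption about how the register is updated.

```python
class CreditTracker:
    """Models the EP device's total credits and its consumed-credit register."""

    def __init__(self, total_credits: int):
        self.total = total_credits
        self.consumed = 0                  # register of credits already consumed

    def remaining(self) -> int:
        return self.total - self.consumed  # second credit value

    def request(self, first_credit: int) -> str:
        if first_credit > self.remaining():
            return "insufficient_credit"   # error type reported by the EP device
        self.consumed += first_credit      # assumption: a grant consumes the credits
        return "granted"
```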
  • for the fourth implementation, this step can be:
  • the EP device detects whether the PCIe device itself is abnormal. If the EP device detects that the PCIe device is abnormal, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type.
  • Step 202: The EP device determines whether the error type is a repairable error type specified in the PCIe protocol. If the error type is a repairable error type, step 203 is performed; if the error type is an unrepairable error type, step 205 is performed.
  • the EP device stores a library of the repairable error types specified in the PCIe protocol; the library includes the NAK (negative acknowledgment) error type, the missed-transmission error type, the retransmission error type, the insufficient-credit error type, and the self-exception error type.
  • the EP device determines whether the error type exists in the repairable error type library; if it does, the EP device determines that the error type is a repairable error type; if it does not, the EP device determines that the error type is an unrepairable error type.
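  • as an illustration, the library lookup in step 202 can be sketched as a set membership test. This is a minimal sketch; the string labels and the function name `next_step` are hypothetical.

```python
# Repairable error types listed in the text (labels are illustrative).
REPAIRABLE_ERROR_TYPES = frozenset({
    "nak",
    "missed_transmission",
    "retransmission",
    "insufficient_credit",
    "self_exception",
})


def next_step(error_type: str) -> int:
    """Return the step to perform after step 202."""
    if error_type in REPAIRABLE_ERROR_TYPES:
        return 203   # repairable: start measuring the duration
    return 205       # unrepairable: disconnect the link immediately
```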
  • Step 203: If the error type is a repairable error type, the EP device measures the duration for which the error type persists.
  • if the error type is a repairable error type, the PCIe protocol stipulates that the error can be repaired without disconnecting the link with the host. However, if the EP device never succeeds in repairing the error, the error can still cause the CPU of the host to hang. Therefore, in the embodiment of the present invention, the duration of the error type is measured, and whether to disconnect the link with the host is decided according to that duration.
  • when the error type occurs, the EP device starts a counter to begin timing; when the error is repaired, the EP device obtains the duration of the error type and clears the counter.
  • for example, when the error type is the NAK timeout error type, the EP device sets the NAK_SCHEDULED bit to the valid state and uses a counter to record how long the NAK_SCHEDULED bit remains valid. When the NAK timeout error is repaired, the EP device sets the NAK_SCHEDULED bit to the invalid state, stops timing, obtains the duration recorded by the counter, and clears the counter.
  • as another example, when the error type is the transmission error type, the EP device starts the counter to begin timing; when the transmission error is repaired, the EP device stops timing, obtains the duration recorded by the counter, and clears the counter.
  • as another example, when the error type is the insufficient-credit error type, the EP device starts the counter to begin timing; when the insufficient-credit error is repaired, the EP device stops timing, obtains the duration recorded by the counter, and clears the counter.
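  • as an illustration, the start, read, and clear behavior of the counter in step 203 can be sketched in software as follows. This is a minimal sketch; the class `ErrorDurationCounter` and its method names are hypothetical, and timestamps are passed in explicitly rather than read from a hardware clock.

```python
class ErrorDurationCounter:
    """Tracks how long one error type has persisted (times in seconds)."""

    def __init__(self):
        self.started_at = None        # None means the counter is cleared

    def error_occurred(self, now: float) -> None:
        if self.started_at is None:   # start timing on the first occurrence only
            self.started_at = now

    def duration(self, now: float) -> float:
        return 0.0 if self.started_at is None else now - self.started_at

    def error_repaired(self, now: float) -> float:
        elapsed = self.duration(now)  # duration recorded by the counter
        self.started_at = None        # clear the counter
        return elapsed
```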
  • Step 204: The EP device determines whether the duration has reached the preset duration. If it has, step 205 is performed; if it has not, step 201 is performed.
  • the preset duration can be set and changed according to the error type; that is, the embodiment of the present invention stores a correspondence between error types and preset durations, and the EP device obtains the preset duration corresponding to the error type from that correspondence. Different error types can thus have different preset durations, which effectively prevents the CPU of the host from hanging.
  • for example, when the error type is the NAK timeout error type, the EP device determines that the error is caused by an abnormality of the downlink between the PCIe device and the host. The EP device uses NAK_SCHEDULED (the negative-acknowledgment status bit) as the reset signal: when NAK_SCHEDULED is valid, the EP device starts the counter to begin timing; when NAK_SCHEDULED is invalid, the EP device stops the counter, and the counter is immediately cleared and held. When the duration obtained by the counter reaches the preset duration, the EP device determines that the link with the host needs to be disconnected, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-2.
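  • as an illustration, the counter that NAK_SCHEDULED resets can be modeled tick by tick as follows. This is a minimal behavioral sketch, not the circuit of Figure 2-2; the function name `first_link_down_tick` and the idea of sampling NAK_SCHEDULED once per tick are assumptions.

```python
def first_link_down_tick(nak_scheduled, preset_ticks):
    """Return the tick at which link-down would be enabled, or None.

    nak_scheduled: iterable of booleans, one sample of the bit per tick.
    preset_ticks:  preset duration expressed in ticks.
    """
    counter = 0
    for tick, valid in enumerate(nak_scheduled):
        if not valid:
            counter = 0      # NAK_SCHEDULED invalid: clear the counter and hold
            continue
        counter += 1         # NAK_SCHEDULED valid: keep timing
        if counter >= preset_ticks:
            return tick      # the link-down enable signal would be output here
    return None
```

A single invalid sample restarts the measurement, so only an uninterrupted run of valid samples can reach the preset duration.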
  • as another example, when the error type is the missed-transmission error type, the EP device determines that the error is caused by TLP packets being lost because of an abnormality of the downlink between the PCIe device and the host. In this case the EP device sends a NAK to the host; if packets keep being lost, the downlink is determined to be extremely unreliable. Therefore, to prevent the CPU of the host from hanging, the EP device needs to disconnect the link with the host when the duration of the missed-transmission error type reaches the preset duration.
  • when the error type is the retransmission error type, the EP device determines that the host has retransmitted the TLP packet; when the duration of the retransmission error type reaches the preset duration, the EP device needs to disconnect the link with the host, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-3.
  • as another example, when the error type is the insufficient-credit error type, the host cannot send TLP packets to the EP device. If the CPU keeps issuing a large number of read and write operations, the host's buffer fills up and exerts back pressure on the CPU side, which eventually causes CPU instructions to time out and the CPU to hang. Therefore, when the duration of the insufficient-credit error type reaches the preset duration, the EP device needs to disconnect the link with the host, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-4.
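  • Step 205: The EP device disconnects the link with the host.
  • when it determines that the link with the host needs to be disconnected, the EP device sets the hardware link-down enable signal link_down=1; on detecting link_down=1, the EP device sets the system clock of the PCIe device to the unavailable state by using a gated clock.
  • when the PCIe device detects that the system clock is in the unavailable state, the PCIe device refuses to process the processing requests sent by the host, thereby disconnecting the link with the host.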
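  • as an illustration, the effect of gating the system clock in step 205 can be modeled as follows. This is a minimal behavioral sketch; `link_down` is the signal name used in the description, while the class `EndpointModel`, its methods, and the use of `None` for an unanswered request are hypothetical.

```python
class EndpointModel:
    """Toy model of an EP device whose clock can be gated off."""

    def __init__(self):
        self.link_down = 0            # hardware link-down enable signal
        self.clock_available = True   # state of the PCIe device's system clock

    def disconnect(self) -> None:
        self.link_down = 1            # request the disconnection
        self._apply_clock_gate()

    def _apply_clock_gate(self) -> None:
        if self.link_down == 1:       # gated clock: clock becomes unavailable
            self.clock_available = False

    def handle_request(self, request: str):
        if not self.clock_available:
            return None               # request is refused: the host gets no response
        return f"processed {request}"
```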
  • in the embodiment of the present invention, the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; if the duration reaches the preset duration, the EP device disconnects the link with the host. By detecting the error type of the TLP packet transmission error, the EP device determines whether the link between the PCIe device and the host is abnormal and disconnects that link when an abnormality is detected. The links between the host and all PCIe devices therefore do not need to be disconnected, which reduces the impact on host services.
  • An embodiment of the present invention provides an apparatus for disconnecting a link between a Peripheral Component Interconnect Express (PCIe) device and a host, where the PCIe device includes an endpoint (EP) device and the apparatus is configured to perform the above method for disconnecting the link between the PCIe device and the host. Referring to Figure 3-1, the apparatus includes:
  • the obtaining module 301 is configured to obtain the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
  • the statistics module 302 is configured to measure the duration for which the error type persists if the error type is a repairable error type specified in the PCIe protocol;
  • the disconnection module 303 is configured to disconnect the link with the host if the duration reaches a preset duration.
  • the obtaining module 301 includes:
  • the first receiving unit 3011 is configured to receive a TLP packet sent by the host.
  • the first determining unit 3012 is configured to determine whether the TLP packet is damaged.
  • the second determining unit 3013 is configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the NAK (negative acknowledgment) error type if the TLP packet is damaged.
  • the obtaining module 301 includes:
  • the second receiving unit 3014 is configured to receive a TLP packet sent by the host.
  • the third determining unit 3015 is configured to determine whether the TLP packet is a preset TLP packet.
  • the fourth determining unit 3016 is configured to determine that the error type of the TLP packet error transmitted between the PCIe device and the host is a transmission error type if the TLP packet is not a preset TLP packet.
  • the third determining unit 3015 is configured to obtain a first sequence number of the TLP packet, predict a third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet, and determine that the TLP packet is not the preset TLP packet if the first sequence number and the third sequence number are not equal.
  • the transmission error type includes a retransmission error type and a missed-transmission error type.
  • the obtaining module 301 further includes:
  • the fifth determining unit 3017 is configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type if the TLP packet is newer than the preset TLP packet;
  • the sixth determining unit 3018 is configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type if the TLP packet is older than the preset TLP packet.
  • the obtaining module 301 includes:
  • the obtaining unit 3019 is configured to acquire a first credit value required for the TLP packet to be sent by the host, and a second credit value currently remaining by the EP device;
  • the seventh determining unit 30110 is configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type if the first credit value is greater than the second credit value.
  • the obtaining unit 3019 is configured to acquire a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the obtaining module 301 includes:
  • the detecting unit 30111 is configured to detect whether an abnormality occurs in the PCIe device.
  • the eighth determining unit 30112 is configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type if the detecting unit detects that the PCIe device is abnormal.
  • the disconnecting module 303 is configured to set the system clock of the PCIe device to an unavailable state by using a gated clock, and the unavailable state is used to instruct the PCIe device to refuse to process the processing request sent by the host.
  • disconnection module 303 is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.
  • in the embodiment of the present invention, the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; if the duration reaches the preset duration, the EP device disconnects the link with the host. By detecting the error type of the TLP packet transmission error, the EP device determines whether the link between the PCIe device and the host is abnormal and disconnects that link when an abnormality is detected. The links between the host and all PCIe devices therefore do not need to be disconnected, which reduces the impact on host services.
  • an embodiment of the present invention provides a Peripheral Component Interconnect Express (PCIe) device, which is configured to perform the above method for disconnecting the link between the PCIe device and the host.
  • the PCIe device includes an endpoint (EP) device, and the EP device includes a memory 401 and a processor 402.
  • the memory 401 is configured to store data obtained by the processor 402.
  • the processor 402 is configured to obtain the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host.
  • the processor 402 is further configured to measure the duration for which the error type persists if the error type is a repairable error type specified in the PCIe protocol;
  • the processor 402 is further configured to disconnect the link with the host if the duration reaches a preset duration.
  • the processor 402 is further configured to receive a TLP packet sent by the host, and determine whether the TLP packet is damaged.
  • the processor 402 is further configured to determine that the error type of the TLP packet error transmitted between the PCIe device and the host is a non-acknowledgment NAK error type if the TLP packet is damaged.
  • the processor 402 is further configured to receive a TLP packet sent by the host, and determine whether the TLP packet is a preset TLP packet;
  • the processor 402 is further configured to determine that the error type of the TLP packet error transmitted between the PCIe device and the host is a transmission error type if the TLP packet is not a preset TLP packet.
  • the processor 402 is further configured to obtain a first sequence number of the TLP packet, predict a third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet, and determine that the TLP packet is not the preset TLP packet if the first sequence number and the third sequence number are not equal.
  • the transmission error type includes a retransmission error type and a missed-transmission error type.
  • the processor 402 is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type if the TLP packet is newer than the preset TLP packet;
  • the processor 402 is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type if the TLP packet is older than the preset TLP packet.
  • the processor 402 is further configured to acquire a first credit value required for the TLP packet to be sent by the host, and a second credit value currently remaining by the EP device;
  • the processor 402 is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type if the first credit value is greater than the second credit value.
  • the processor 402 is further configured to obtain a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and determine a first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  • the processor 402 is further configured to detect whether an abnormality occurs in the PCIe device.
  • the processor 402 is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type if an abnormality of the PCIe device is detected.
  • the processor 402 is further configured to set a system clock of the PCIe device to an unavailable state by using a gated clock, where the unavailable state is used to instruct the PCIe device to refuse to process the processing request sent by the host.
  • further, the processor 402 is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.
  • in the embodiment of the present invention, the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; if the duration reaches the preset duration, the EP device disconnects the link with the host. By detecting the error type of the TLP packet transmission error, the EP device determines whether the link between the PCIe device and the host is abnormal and disconnects that link when an abnormality is detected. The links between the host and all PCIe devices therefore do not need to be disconnected, which reduces the impact on host services.
  • it should be noted that, when the apparatus provided by the foregoing embodiment disconnects the link between the PCIe device and the host, the division into the foregoing functional modules is used only as an example.
  • in practice, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above.
  • in addition, the apparatus for disconnecting the link between the PCIe device and the host provided by the foregoing embodiment is based on the same concept as the method embodiment for disconnecting the link between the PCIe device and the host; for the specific implementation process, refer to the method embodiment, which is not repeated here.
  • a person skilled in the art may understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the related hardware, and the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Retry When Errors Occur (AREA)
  • Information Transfer Systems (AREA)
  • Communication Control (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)

Abstract

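The present invention discloses a method and an apparatus for disconnecting a link between a PCIe device and a host, and belongs to the field of computer technologies. The method includes: the PCIe device includes an endpoint (EP) device; the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; and if the duration reaches a preset duration, the EP device disconnects the link with the host. The apparatus includes an obtaining module, a statistics module, and a disconnection module. The present invention can reduce the impact on host services.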

Description

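Method and apparatus for disconnecting a link between a PCIe device and a host
This application claims priority to Chinese Patent Application No. 201510580109.1, filed with the Chinese Patent Office on September 11, 2015 and entitled "Method and apparatus for disconnecting a link between a PCIe device and a host", which is incorporated herein by reference in its entirety.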
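Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for disconnecting a link between a PCIe device and a host.
Background
With the spread of the PCIe (Peripheral Component Interconnect Express, a bus and interface standard) protocol, one host can be connected to multiple IO (input/output) devices. Each of these IO devices includes one PCIe device, so the host is connected to multiple PCIe devices and exchanges data with devices other than the host through them to carry out the host's services.
A PCIe device may become abnormal during operation. When a PCIe device becomes abnormal, the host cannot tell which PCIe device it is, so to prevent the host's CPU (Central Processing Unit) from hanging, the host disconnects all the PCIe devices connected to it.
The prior art has at least the following problem:
The host disconnects all the PCIe devices connected to it, so the host can no longer exchange data with other devices, which affects the host's services.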
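Summary
To solve the problem in the prior art, the present invention provides a method and an apparatus for disconnecting a link between a PCIe device and a host. The technical solutions are as follows:
According to a first aspect, the present invention provides a method for disconnecting a link between a Peripheral Component Interconnect Express (PCIe) device and a host, where the PCIe device includes an endpoint (EP) device, and the method includes:
the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; and
if the duration reaches a preset duration, the EP device disconnects the link with the host.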
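With reference to the first aspect, in a first possible implementation of the first aspect, that the EP device obtains the error type of the TLP packet transmission error between the PCIe device and the host includes:
the EP device receives a TLP packet sent by the host and determines whether the TLP packet is damaged; and
if the TLP packet is damaged, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the NAK (negative acknowledgment) error type.
With reference to the first aspect, in a second possible implementation of the first aspect, that the EP device obtains the error type of the TLP packet transmission error between the PCIe device and the host includes:
the EP device receives a TLP packet sent by the host and determines whether the TLP packet is the preset TLP packet; and
if the TLP packet is not the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is a transmission error type.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, that the EP device determines whether the TLP packet is the preset TLP packet includes:
the EP device obtains a first sequence number of the TLP packet and predicts a third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet; and
if the first sequence number and the third sequence number are not equal, the EP device determines that the TLP packet is not the preset TLP packet.
With reference to the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the transmission error type includes a retransmission error type and a missed-transmission error type, and the method further includes:
if the TLP packet is newer than the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type; and
if the TLP packet is older than the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type.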
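With reference to the first aspect, in a fifth possible implementation of the first aspect, that the EP device obtains the error type of the TLP packet transmission error between the PCIe device and the host includes:
the EP device obtains a first credit value required for the TLP packet to be sent by the host and a second credit value currently remaining on the EP device; and
if the first credit value is greater than the second credit value, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, that the EP device obtains the first credit value required for the TLP packet to be sent by the host includes:
the EP device obtains the packet header type, the packet data type, and the packet data length of the TLP packet to be sent by the host; and
the EP device determines the first credit value required for the TLP packet according to the packet header type, the packet data type, and the packet data length.
With reference to the first aspect, in a seventh possible implementation of the first aspect, that the EP device obtains the error type of the TLP packet transmission error between the PCIe device and the host includes:
the EP device detects whether the PCIe device is abnormal; and
if the EP device detects that the PCIe device is abnormal, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type.
With reference to the first aspect, in an eighth possible implementation of the first aspect, that the EP device disconnects the link with the host includes:
the EP device sets the system clock of the PCIe device to an unavailable state by using a gated clock, where the unavailable state instructs the PCIe device to refuse to process the processing requests sent by the host.
With reference to the first aspect, in a ninth possible implementation of the first aspect, the method further includes:
if the error type is an unrepairable error type specified in the PCIe protocol, the EP device disconnects the link with the host.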
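According to a second aspect, the present invention provides an apparatus for disconnecting a link between a Peripheral Component Interconnect Express (PCIe) device and a host, where the PCIe device includes an endpoint (EP) device, and the apparatus includes:
an obtaining module, configured to obtain the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
a statistics module, configured to measure the duration for which the error type persists if the error type is a repairable error type specified in the PCIe protocol; and
a disconnection module, configured to disconnect the link with the host if the duration reaches a preset duration.
With reference to the second aspect, in a first possible implementation of the second aspect, the obtaining module includes:
a first receiving unit, configured to receive a TLP packet sent by the host;
a first determining unit, configured to determine whether the TLP packet is damaged; and
a second determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the NAK (negative acknowledgment) error type if the TLP packet is damaged.
With reference to the second aspect, in a second possible implementation of the second aspect, the obtaining module includes:
a second receiving unit, configured to receive a TLP packet sent by the host;
a third determining unit, configured to determine whether the TLP packet is the preset TLP packet; and
a fourth determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is a transmission error type if the TLP packet is not the preset TLP packet.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the third determining unit is configured to obtain a first sequence number of the TLP packet, predict a third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet, and determine that the TLP packet is not the preset TLP packet if the first sequence number and the third sequence number are not equal.
With reference to the second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the transmission error type includes a retransmission error type and a missed-transmission error type, and the obtaining module further includes:
a fifth determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type if the TLP packet is newer than the preset TLP packet; and
a sixth determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type if the TLP packet is older than the preset TLP packet.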
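With reference to the second aspect, in a fifth possible implementation of the second aspect, the obtaining module includes:
an obtaining unit, configured to obtain a first credit value required for the TLP packet to be sent by the host and a second credit value currently remaining on the EP device; and
a seventh determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type if the first credit value is greater than the second credit value.
With reference to the fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the obtaining unit is configured to obtain the packet header type, the packet data type, and the packet data length of the TLP packet to be sent by the host, and determine the first credit value required for the TLP packet according to the packet header type, the packet data type, and the packet data length.
With reference to the second aspect, in a seventh possible implementation of the second aspect, the obtaining module includes:
a detecting unit, configured to detect whether the PCIe device is abnormal; and
an eighth determining unit, configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type if the detecting unit detects that the PCIe device is abnormal.
With reference to the second aspect, in an eighth possible implementation of the second aspect, the disconnection module is configured to set the system clock of the PCIe device to an unavailable state by using a gated clock, where the unavailable state instructs the PCIe device to refuse to process the processing requests sent by the host.
With reference to the second aspect, in a ninth possible implementation of the second aspect, the disconnection module is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.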
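According to a third aspect, the present invention provides a Peripheral Component Interconnect Express (PCIe) device, where the PCIe device includes an endpoint (EP) device, the EP device includes a memory and a processor, and the memory is configured to store data obtained by the processor;
the processor is configured to obtain the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host;
the processor is further configured to measure the duration for which the error type persists if the error type is a repairable error type specified in the PCIe protocol; and
the processor is further configured to disconnect the link with the host if the duration reaches a preset duration.
With reference to the third aspect, in a first possible implementation of the third aspect, the processor is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is damaged; and
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the NAK (negative acknowledgment) error type if the TLP packet is damaged.
With reference to the third aspect, in a second possible implementation of the third aspect, the processor is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is the preset TLP packet; and
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is a transmission error type if the TLP packet is not the preset TLP packet.
With reference to the second possible implementation of the third aspect, in a third possible implementation of the third aspect, the processor is further configured to obtain a first sequence number of the TLP packet, predict a third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet, and determine that the TLP packet is not the preset TLP packet if the first sequence number and the third sequence number are not equal.
With reference to the second possible implementation of the third aspect, in a fourth possible implementation of the third aspect, the transmission error type includes a retransmission error type and a missed-transmission error type;
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type if the TLP packet is newer than the preset TLP packet; and
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type if the TLP packet is older than the preset TLP packet.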
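With reference to the third aspect, in a fifth possible implementation of the third aspect, the processor is further configured to obtain a first credit value required for the TLP packet to be sent by the host and a second credit value currently remaining on the EP device; and
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type if the first credit value is greater than the second credit value.
With reference to the fifth possible implementation of the third aspect, in a sixth possible implementation of the third aspect, the processor is further configured to obtain the packet header type, the packet data type, and the packet data length of the TLP packet to be sent by the host, and determine the first credit value required for the TLP packet according to the packet header type, the packet data type, and the packet data length.
With reference to the third aspect, in a seventh possible implementation of the third aspect, the processor is further configured to detect whether the PCIe device is abnormal; and
the processor is further configured to determine that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type if an abnormality of the PCIe device is detected.
With reference to the third aspect, in an eighth possible implementation of the third aspect, the processor is further configured to set the system clock of the PCIe device to an unavailable state by using a gated clock, where the unavailable state instructs the PCIe device to refuse to process the processing requests sent by the host.
With reference to the third aspect, in a ninth possible implementation of the third aspect, the processor is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.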
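In the embodiments of the present invention, the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; if the duration reaches the preset duration, the EP device disconnects the link with the host. By detecting the error type of the TLP packet transmission error, the EP device determines whether the link between the PCIe device and the host is abnormal and disconnects that link when an abnormality is detected. The links between the host and all PCIe devices therefore do not need to be disconnected, which reduces the impact on host services.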
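Brief Description of the Drawings
Figure 1-1 is a diagram of an application scenario for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention;
Figure 1-2 is a flowchart of a method for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention;
Figure 2-1 is a flowchart of a method for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention;
Figure 2-2 is a hardware diagram for detecting the NAK exception type according to an embodiment of the present invention;
Figure 2-3 is a hardware diagram for detecting the transmission error type according to an embodiment of the present invention;
Figure 2-4 is a hardware diagram for detecting the insufficient-credit error type according to an embodiment of the present invention;
Figure 3-1 is a schematic structural diagram of an apparatus for disconnecting a link between a PCIe device and a host according to an embodiment of the present invention;
Figure 3-2 is a schematic structural diagram of an obtaining module according to an embodiment of the present invention;
Figure 3-3 is a schematic structural diagram of another obtaining module according to an embodiment of the present invention;
Figure 3-4 is a schematic structural diagram of another obtaining module according to an embodiment of the present invention;
Figure 3-5 is a schematic structural diagram of another obtaining module according to an embodiment of the present invention;
Figure 3-6 is a schematic structural diagram of another obtaining module according to an embodiment of the present invention;
Figure 4 is a schematic structural diagram of a PCIe device according to an embodiment of the present invention.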
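Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the present invention provides an application scenario of a method for disconnecting a link between a PCIe device and a host. Referring to Figure 1-1, when the host is connected to only one IO device, the host connects to the IO device directly through an RP port; when the host is connected to multiple IO devices, the host connects to them through a PCIe SW (switch).
The IO device includes a PCIe device, and the PCIe device includes an EP (endpoint) device. The PCIe SW includes one UP (upstream port) and multiple DPs (downstream ports); the PCIe SW is connected to the CPU of the host through the UP and to the EP device of each IO device's PCIe device through a DP.
The EP device includes a PL (physical layer), a DL (data link layer), and a TL (transaction layer). The TL is used to interact with the user, the DL is used to exchange data with the host, and the PL is used to interact with the PCIe device. When the DL or the TL detects an abnormality, the link between the PCIe device and the host can be disconnected.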
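An embodiment of the present invention provides a method for disconnecting a link between a PCIe device and a host. The PCIe device includes an EP device, and the method may be executed by the EP device. Referring to Figure 1-2, the method includes:
Step 101: The EP device obtains the error type of a TLP (transaction layer packet) transmission error between the PCIe device and the host;
Step 102: If the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists;
Step 103: If the duration reaches a preset duration, the EP device disconnects the link with the host.
In the embodiment of the present invention, the EP device obtains the error type of a transaction layer packet (TLP) transmission error between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type persists; if the duration reaches the preset duration, the EP device disconnects the link with the host. By detecting the error type of the TLP packet transmission error, the EP device determines whether the link between the PCIe device and the host is abnormal and disconnects that link when an abnormality is detected. The links between the host and all PCIe devices therefore do not need to be disconnected, which reduces the impact on host services.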
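An embodiment of the present invention provides a method for disconnecting a link between a PCIe device and a host. The PCIe device includes an EP device, and the method may be executed by the EP device. Referring to Figure 2-1, the method includes:
Step 201: The EP device obtains the error type of a TLP packet transmission error between the PCIe device and the host;
When the host performs a service interaction with the PCIe device, the host sends a resource request to the EP device included in the PCIe device, where the resource request carries the packet header type, the packet data type, and the packet data length of the TLP packet. The EP device receives the resource request, calculates the credit value required for the TLP packet from the resource request, and sends the required credit value to the host. The host receives the credit value sent by the EP device and uses it to send the TLP packet to the EP device.
In this step, a TLP packet transmission error between the PCIe device and the host may be caused by an abnormal link between the PCIe device and the host, by insufficient credit, or by an abnormality of the EP device itself. This step can therefore be implemented in the following first, second, third, and fourth manners. For the first implementation, this step can be:
The EP device receives the TLP packet sent by the host and determines whether the TLP packet is damaged. If the TLP packet is damaged, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the NAK (Negative Acknowledgment) error type.
If the TLP packet is damaged during transmission, it carries a damage marker. The EP device therefore checks whether the TLP packet carries a damage marker: if it does, the TLP packet is determined to be damaged; if it does not, the TLP packet is determined to be undamaged.
Further, if the EP device determines that the TLP packet is not damaged, the EP device sends an ACK (Acknowledgement) to the host. The host receives the ACK, determines from it that the EP device has received the TLP packet correctly, and then sends the next TLP packet to the EP device.
Further, if the EP device determines that the TLP packet is damaged, the EP device sends a NAK to the host. The host receives the NAK, determines from it that the EP device has not received the TLP packet correctly, and then resends the TLP packet to the EP device until the EP device returns an ACK.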
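The ACK/NAK exchange of the first implementation can be sketched as follows. This is a minimal illustration only; the function names `ep_respond` and `host_send`, and the representation of the damage marker as a boolean per transmission attempt, are assumptions made for the example.

```python
def ep_respond(has_damage_marker: bool) -> str:
    """EP side: reply NAK to a damaged TLP packet, ACK otherwise."""
    return "NAK" if has_damage_marker else "ACK"


def host_send(damage_per_attempt):
    """Host side: resend the same TLP packet until the EP device returns ACK.

    damage_per_attempt: booleans, one per transmission attempt, telling
    whether that copy of the packet arrived carrying a damage marker.
    Returns the list of replies the host received.
    """
    replies = []
    for damaged in damage_per_attempt:
        reply = ep_respond(damaged)
        replies.append(reply)
        if reply == "ACK":
            break            # packet accepted: move on to the next TLP packet
    return replies
```

If every attempt is damaged, the list contains only NAKs, which is the situation whose duration the later steps measure.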
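For the second implementation, this step can be implemented by the following steps (1) and (2):
(1): The EP device receives the TLP packet sent by the host and determines whether the TLP packet is the preset TLP packet;
To make it possible to determine whether the host has sent a TLP packet to the EP device repeatedly or has skipped a TLP packet, each TLP packet that the host sends to the EP device carries a sequence number, and the sequence numbers of two adjacent TLP packets differ by 1. The EP device can therefore determine the preset TLP packet from the sequence number of the most recently received previous TLP packet; the preset TLP packet is the TLP packet that the host should currently be sending to the EP device.
This step can be implemented by the following steps (1-1) and (1-2):
(1-1): The EP device obtains the first sequence number of the TLP packet and predicts the third sequence number of the TLP packet from the second sequence number of the most recently received previous TLP packet;
The EP device obtains the sequence number carried in the TLP packet. For ease of distinction, this sequence number is referred to as the first sequence number; it is stored in a sequence number list so that it can be retrieved later.
The sequence number list stores the sequence numbers of the TLP packets that the EP device has already received. The EP device obtains from the list the sequence number of the most recently received previous TLP packet, referred to as the second sequence number for ease of distinction, and adds 1 to it to obtain the predicted sequence number of the TLP packet, referred to as the third sequence number.
(1-2): If the first sequence number and the third sequence number are not equal, the EP device determines that the TLP packet is not the preset TLP packet.
The EP device determines whether the first sequence number and the third sequence number are equal. If they are equal, the EP device determines that the TLP packet is the preset TLP packet; if they are not equal, the EP device determines that the TLP packet is not the preset TLP packet.
Further, if the TLP packet is the preset TLP packet, step 201 is performed; if the TLP packet is not the preset TLP packet, the following step (2) is performed.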
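(2): If the TLP packet is not the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the transmission error type.
Further, the EP device can also determine, from the first sequence number and the third sequence number, whether the TLP packet is newer or older than the preset TLP packet. The host generates sequence numbers as 12-bit unsigned numbers, so after counting to 4095 the count wraps around to 0 and continues. The numerical order of two sequence numbers therefore does not always match the order in which the packets were generated. For example, if the sequence number of the TLP packet is 4095 and that of the preset TLP packet is 0, the TLP packet is older than the preset TLP packet even though 4095 is greater than 0. The EP device can therefore determine whether the TLP packet is newer or older than the preset TLP packet through the following process:
The EP device obtains the number of bits the host uses to generate sequence numbers, calculates the difference between the first sequence number and the third sequence number, calculates a first value from the number of bits, and takes the remainder of that difference modulo the first value. If the remainder is greater than or equal to the second value, the TLP packet is determined to be older than the preset TLP packet; if the remainder is less than the second value, the TLP packet is determined to be newer than the preset TLP packet. The first value equals 2 raised to the power of the number of bits, and the second value is the first value divided by 2.
For example, if the number of bits is 12, the first value is 4096 and the second value is 2048. Let the first sequence number be A_Seq and the third sequence number be B_Seq: if (A_Seq - B_Seq) % 4096 >= 2048, the TLP packet is older than the preset TLP packet; if (A_Seq - B_Seq) % 4096 < 2048, the TLP packet is newer than the preset TLP packet.
Further, if the TLP packet is newer than the preset TLP packet, the EP device determines that the host has skipped one or more TLP packets; if the TLP packet is older than the preset TLP packet, the EP device determines that the host has sent a TLP packet to the EP device again. The transmission error type therefore includes a retransmission error type and a missed-transmission error type: if the TLP packet is newer than the preset TLP packet, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type; if the TLP packet is older than the preset TLP packet, the EP device determines that the error type is the retransmission error type.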
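Step (1) of the second implementation, predicting the third sequence number and checking the received packet against it, can be sketched as follows. This is a minimal illustration assuming 12-bit sequence numbers; the function names are hypothetical, and applying the wraparound when adding 1 to 4095 is an assumption inferred from the statement that the count wraps to 0.

```python
SEQ_MODULUS = 1 << 12   # 12-bit sequence numbers wrap from 4095 back to 0


def predict_third_seq(second_seq: int) -> int:
    """Predict the next sequence number from the last received one."""
    return (second_seq + 1) % SEQ_MODULUS


def is_preset_tlp(first_seq: int, second_seq: int) -> bool:
    """True if the received packet is the one the host should be sending now."""
    return first_seq == predict_third_seq(second_seq)
```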
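For the third implementation, this step can be implemented by the following steps (A) and (B):
(A): The EP device obtains the first credit value required for the TLP packet to be sent by the host and the second credit value currently remaining on the EP device;
When the host performs a service interaction with the PCIe device, the host sends a resource request to the EP device included in the PCIe device, where the resource request carries the packet header type, the packet data type, and the packet data length of the TLP packet. The EP device receives the resource request and calculates the credit value required for the TLP packet from the packet header type, the packet data type, and the packet data length. For ease of distinction, the credit value required for the TLP packet is referred to as the first credit value.
The EP device stores a correspondence between packet header types and credit values, and a correspondence among packet data types, packet data lengths, and credit values. The step in which the EP device calculates the first credit value required for the TLP packet from the packet header type, the packet data type, and the packet data length may be:
The EP device obtains, according to the packet header type, a third credit value required for the packet header of the TLP packet from the correspondence between packet header types and credit values; obtains, according to the packet data type and the packet data length, a fourth credit value required for the packet data of the TLP packet from the correspondence among packet data types, packet data lengths, and credit values; and adds the third credit value and the fourth credit value to obtain the first credit value required for the TLP packet.
The packet header type may be PH (Posted Header) or NPH (Non-Posted Header); for both PH and NPH, the header of each TLP packet consumes exactly one credit value. The packet data types include PD (Posted Data) and NPD (Non-Posted Data). For NPD, the packet data of each TLP packet consumes exactly one credit value; for PD, the EP device determines the number of credit values required for the packet data from the packet data length. Because the host can initiate only ordinary read and write operations, the embodiment of the present invention may also set the PD-type packet data of each TLP packet to consume one credit value.
The step in which the EP device obtains the currently remaining second credit value may be:
The EP device sets up a register that records the credit value the EP device has already consumed, and calculates the currently remaining credit value from the total credit value of the EP device and the credit value already consumed. For ease of distinction, the currently remaining credit value is referred to as the second credit value.
(B): If the first credit value is greater than the second credit value, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type.
Further, if the first credit value is not greater than the second credit value, the EP device sends the first credit value to the host; the host receives the first credit value sent by the EP device and uses it to send the TLP packet to the PCIe device.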
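For the fourth implementation, this step can be:
The EP device detects whether the PCIe device is abnormal. If the EP device detects that the PCIe device is abnormal, the EP device determines that the error type of the TLP packet transmission error between the PCIe device and the host is the self-exception error type.
Step 202: The EP device determines whether the error type is a repairable error type specified in the PCIe protocol. If the error type is a repairable error type, step 203 is performed; if the error type is an unrepairable error type, step 205 is performed;
The EP device stores a library of the repairable error types specified in the PCIe protocol; the library includes the NAK (negative acknowledgment) error type, the missed-transmission error type, the retransmission error type, the insufficient-credit error type, and the self-exception error type.
The EP device determines whether the error type exists in the repairable error type library. If it does, the EP device determines that the error type is a repairable error type; if it does not, the EP device determines that the error type is an unrepairable error type.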
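Step 203: If the error type is a repairable error type, the EP device measures the duration for which the error type persists;
If the error type is a repairable error type, the PCIe protocol stipulates that the error can be repaired without disconnecting the link with the host. However, if the EP device never succeeds in repairing the error, the error can still cause the CPU of the host to hang. Therefore, in the embodiment of the present invention, the duration of the error type is measured, and whether to disconnect the link with the host is decided according to that duration.
When the error type occurs, the EP device starts a counter to begin timing; when the error is repaired, the EP device obtains the duration of the error type and clears the counter.
For example, when the error type is the NAK timeout error type, the EP device sets the NAK_SCHEDULED bit to the valid state and uses a counter to record how long the NAK_SCHEDULED bit remains valid. When the NAK timeout error is repaired, the EP device sets the NAK_SCHEDULED bit to the invalid state, stops timing, obtains the duration recorded by the counter, and clears the counter.
As another example, when the error type is the transmission error type, the EP device starts the counter to begin timing; when the transmission error is repaired, the EP device stops timing, obtains the duration recorded by the counter, and clears the counter.
As another example, when the error type is the insufficient-credit error type, the EP device starts the counter to begin timing; when the insufficient-credit error is repaired, the EP device stops timing, obtains the duration recorded by the counter, and clears the counter.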
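Step 204: The EP device determines whether the duration has reached the preset duration. If it has, step 205 is performed; if it has not, step 201 is performed;
The preset duration can be set and changed according to the error type; that is, the embodiment of the present invention stores a correspondence between error types and preset durations, and the EP device obtains the preset duration corresponding to the error type from that correspondence. Different error types can thus have different preset durations, which effectively prevents the CPU of the host from hanging.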
例如,错误类型和预设时长的对应关系如下表1所示:
表1
错误类型 预设时长
NAK错误类型 10s
重传错误类型 20s
漏传错误类型 15s
信用值不足错误类型 8s
自身异常错误类型 5s
…… ……
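The lookup in Table 1 and the decision in step 204 can be sketched as a dictionary and a comparison. The key names and the function name are hypothetical; the durations are the example values from Table 1.

```python
# Correspondence between error type and preset duration (seconds), from Table 1.
PRESET_DURATION_S = {
    "NAK": 10,
    "RETRANSMISSION": 20,
    "MISSED_TRANSMISSION": 15,
    "INSUFFICIENT_CREDIT": 8,
    "SELF_ABNORMALITY": 5,
}


def step_204(error_type: str, duration_s: float) -> int:
    """Return 205 (disconnect) if the duration reaches the preset, else 201."""
    preset = PRESET_DURATION_S[error_type]
    return 205 if duration_s >= preset else 201
```

Keeping the thresholds in a table rather than in code is what allows them to be changed per error type, as the text describes.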
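For example, when the error type is the NAK timeout error type, the EP device determines that the error is caused by an abnormality of the downlink between the PCIe device and the host. The EP device uses NAK_SCHEDULED (the negative-acknowledgement status bit) as the reset signal: when NAK_SCHEDULED is valid, the EP device starts the counter to begin timing; when NAK_SCHEDULED is invalid, the EP device stops the counter, and the counter is immediately cleared and held. When the duration obtained by the counter reaches the preset duration, the EP device determines that the link with the host needs to be disconnected, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-2.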
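A cycle-by-cycle software model of this counter may help make the reset behaviour concrete. It is a rough sketch, not the circuit of Figure 2-2: the function name is hypothetical, the threshold is expressed in clock cycles rather than seconds, and latching link_down once it is asserted is an assumption.

```python
def simulate_nak_watchdog(nak_scheduled_samples, threshold_cycles):
    """Return a list of (counter, link_down) values, one per clock cycle.

    nak_scheduled_samples: iterable of 0/1 values of NAK_SCHEDULED per cycle.
    threshold_cycles: preset duration expressed in clock cycles.
    """
    counter = 0
    link_down = 0
    trace = []
    for nak_scheduled in nak_scheduled_samples:
        if link_down:
            pass  # assumption: link_down stays asserted once set
        elif nak_scheduled:
            counter += 1  # NAK_SCHEDULED valid: keep timing
            if counter >= threshold_cycles:
                link_down = 1  # output the hardware link-down enable signal
        else:
            counter = 0  # NAK_SCHEDULED invalid: clear and hold the counter
        trace.append((counter, link_down))
    return trace
```

In this model a NAK condition that clears before the threshold resets the count to zero, so only an uninterrupted run of valid cycles can assert link_down.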
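For another example, when the error type is the missed-transmission error type, the EP device determines that the error is caused by TLP packets being lost because of an abnormality of the downlink between the PCIe device and the host. In this case the EP device sends a NAK to the host. If packets keep being lost, the downlink is determined to be extremely unreliable. Therefore, to prevent the CPU of the host from hanging, when the missed-transmission error type has been present for the preset duration, the EP device needs to disconnect the link with the host. When the error type is the retransmission error type, the EP device determines that the host has retransmitted the TLP packet; when the retransmission error type has been present for the preset duration, the EP device needs to disconnect the link with the host, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-3.
For another example, when the error type is the insufficient-credit error type, the host cannot send TLP packets to the EP device. If the CPU continues to issue a large number of read and write operations, the buffer of the host fills up and exerts back pressure on the CPU side, which eventually causes CPU instructions to time out and the CPU to hang. Therefore, when the insufficient-credit error type has been present for the preset duration, the EP device needs to disconnect the link with the host, outputs the hardware link-down enable signal, and performs step 205. The hardware circuit in the EP device is shown in Figure 2-4.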
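Step 205: The EP device disconnects the link with the host.
When it is determined that the link with the host needs to be disconnected, the EP device sets the hardware link-down enable signal link_down=1. When the EP device detects that link_down=1, it sets the system clock of the PCIe device to the unavailable state by means of a gated clock.
When the PCIe device detects that the system clock is in the unavailable state, the PCIe device refuses to process the processing requests sent by the host, thereby disconnecting the link with the host.
When the host cannot obtain a response from the PCIe device, its LTSSM (Link Training and Status State Machine) jumps to the Disabled state because bit lock and symbol lock are lost, and the link with the EP device is disconnected. When the host detects the Disabled state, it clears the cached content related to the EP device, which completes the isolation of the abnormal EP device.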
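The effect of gating the system clock with link_down can be modelled very roughly in software. Everything here is an illustrative assumption: the gate is modelled as a simple AND of the clock with the inverted link_down signal, and the class, method, and return strings are hypothetical.

```python
def gated_clock(sys_clk: int, link_down: int) -> int:
    """Clock seen by the PCIe device: blocked whenever link_down is 1."""
    return sys_clk & (link_down ^ 1)


class GatedPcieDevice:
    """Toy model: the device only processes requests on an active clock edge."""

    def __init__(self):
        self.link_down = 0  # hardware link-down enable signal

    def tick(self, sys_clk: int, request=None):
        """Return a completion for the request, or None if nothing is processed."""
        if gated_clock(sys_clk, self.link_down) and request is not None:
            return f"completion for {request}"
        return None  # clock unavailable: the host gets no response
```

Once link_down is set, the device produces no response at all, which is the condition under which the host side is described as dropping the link.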
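In this embodiment of the present invention, the EP device acquires the error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type has been present; if the duration reaches the preset duration, the EP device disconnects the link with the host. In this way, the EP device determines, by detecting the error type of the TLP packet transmission error, whether the link between the PCIe device and the host is abnormal, and disconnects that link when an abnormality is detected. The host therefore does not need to disconnect its links with all PCIe devices, which reduces the impact on host services.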
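The overall flow of steps 201 to 205 can be summarized in one small loop. This is a sketch under simplifying assumptions: observations arrive as (time in seconds, error type or None) pairs, None means that no error is present or the error has been repaired, and any error type missing from the preset table is treated as unrepairable. The function and parameter names are hypothetical.

```python
def monitor_link(observations, preset_s):
    """Return the time at which the link is disconnected, or None.

    observations: iterable of (time_s, error_type or None).
    preset_s: dict mapping each repairable error type to its preset duration.
    """
    current, start = None, None
    for time_s, error_type in observations:      # step 201: acquire error type
        if error_type is None:                   # repaired: clear the counter
            current, start = None, None
            continue
        if error_type not in preset_s:           # step 202: unrepairable
            return time_s                        # step 205: disconnect
        if error_type != current:                # step 203: start timing
            current, start = error_type, time_s
        if time_s - start >= preset_s[error_type]:  # step 204
            return time_s                        # step 205: disconnect
    return None
```

For example, a NAK error observed continuously from 0 s to 10 s with a 10 s preset leads to a disconnect at 10 s, whereas a repair in between restarts the measurement.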
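An embodiment of the present invention provides an apparatus for disconnecting a link between a bus and interface standard (PCIe) device and a host. The PCIe device includes an endpoint (EP) device, and the apparatus is configured to perform the foregoing method for disconnecting the link between the PCIe device and the host. Referring to Figure 3-1, the apparatus includes:
an acquiring module 301, configured to acquire the error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host;
a measuring module 302, configured to measure, if the error type is a repairable error type specified in the PCIe protocol, the duration for which the error type has been present; and
a disconnecting module 303, configured to disconnect the link with the host if the duration reaches a preset duration.
Further, referring to Figure 3-2, the acquiring module 301 includes:
a first receiving unit 3011, configured to receive a TLP packet sent by the host;
a first determining unit 3012, configured to determine whether the TLP packet is damaged; and
a second determining unit 3013, configured to determine, if the TLP packet is damaged, that the error type of the TLP packet transmission error between the PCIe device and the host is the negative acknowledgement (NAK) error type.
Further, referring to Figure 3-3, the acquiring module 301 includes:
a second receiving unit 3014, configured to receive a TLP packet sent by the host;
a third determining unit 3015, configured to determine whether the TLP packet is the preset TLP packet; and
a fourth determining unit 3016, configured to determine, if the TLP packet is not the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the transmission error type.
Further, the third determining unit 3015 is configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the most recent previous TLP packet, and determine, if the first sequence number and the third sequence number are not equal, that the TLP packet is not the preset TLP packet.
Further, the transmission error type includes the retransmission error type and the missed-transmission error type. Referring to Figure 3-4, the acquiring module 301 further includes:
a fifth determining unit 3017, configured to determine, if the TLP packet is newer than the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type; and
a sixth determining unit 3018, configured to determine, if the TLP packet is older than the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type.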
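The sequence-number check performed by the third, fifth, and sixth determining units can be sketched as follows. The text does not say how "newer" and "older" are decided when sequence numbers wrap around; this example assumes 12-bit sequence numbers, predicts the next number as the previous one plus one, and uses a half-window comparison, all of which are assumptions of the sketch. The function name is hypothetical.

```python
SEQ_MODULUS = 4096  # assumption: 12-bit sequence numbers that wrap around


def classify_tlp(first_seq: int, second_seq: int):
    """Classify a received TLP packet by its sequence number.

    first_seq: sequence number of the received TLP packet.
    second_seq: sequence number of the most recent previous TLP packet.
    Returns None if the packet is the preset (expected) one.
    """
    third_seq = (second_seq + 1) % SEQ_MODULUS  # predicted sequence number
    if first_seq == third_seq:
        return None
    ahead = (first_seq - third_seq) % SEQ_MODULUS
    # Newer than expected: packets in between were lost.
    if ahead < SEQ_MODULUS // 2:
        return "MISSED_TRANSMISSION"
    # Older than expected: the host sent the packet again.
    return "RETRANSMISSION"
```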
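Further, referring to Figure 3-5, the acquiring module 301 includes:
an acquiring unit 3019, configured to acquire the first credit value required by a TLP packet to be sent by the host and the second credit value currently remaining in the EP device; and
a seventh determining unit 30110, configured to determine, if the first credit value is greater than the second credit value, that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type.
Further, the acquiring unit 3019 is configured to acquire the packet header type, the packet data type, and the packet data length of the TLP packet to be sent by the host, and determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
Further, referring to Figure 3-6, the acquiring module 301 includes:
a detecting unit 30111, configured to detect whether the PCIe device is abnormal; and
an eighth determining unit 30112, configured to determine, if the detecting unit detects that the PCIe device is abnormal, that the error type of the TLP packet transmission error between the PCIe device and the host is the self-abnormality error type.
Further, the disconnecting module 303 is configured to set the system clock of the PCIe device to the unavailable state by means of a gated clock, where the unavailable state is used to instruct the PCIe device to refuse to process the processing requests sent by the host.
Further, the disconnecting module 303 is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.
In this embodiment of the present invention, the EP device acquires the error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type has been present; if the duration reaches the preset duration, the EP device disconnects the link with the host. In this way, the EP device determines, by detecting the error type of the TLP packet transmission error, whether the link between the PCIe device and the host is abnormal, and disconnects that link when an abnormality is detected. The host therefore does not need to disconnect its links with all PCIe devices, which reduces the impact on host services.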
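An embodiment of the present invention provides a bus and interface standard (PCIe) device, configured to perform the foregoing method for disconnecting the link between the PCIe device and the host. Referring to Figure 4, the PCIe device includes an endpoint (EP) device, and the EP device includes a memory 401 and a processor 402, where the memory 401 is configured to store data obtained by the processor 402.
The processor 402 is configured to acquire the error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host.
The processor 402 is further configured to measure, if the error type is a repairable error type specified in the PCIe protocol, the duration for which the error type has been present.
The processor 402 is further configured to disconnect the link with the host if the duration reaches a preset duration.
Further, the processor 402 is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is damaged.
The processor 402 is further configured to determine, if the TLP packet is damaged, that the error type of the TLP packet transmission error between the PCIe device and the host is the negative acknowledgement (NAK) error type.
Further, the processor 402 is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is the preset TLP packet.
The processor 402 is further configured to determine, if the TLP packet is not the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the transmission error type.
Further,
the processor 402 is further configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the most recent previous TLP packet, and determine, if the first sequence number and the third sequence number are not equal, that the TLP packet is not the preset TLP packet.
Further, the transmission error type includes the retransmission error type and the missed-transmission error type.
The processor 402 is further configured to determine, if the TLP packet is newer than the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the missed-transmission error type.
The processor 402 is further configured to determine, if the TLP packet is older than the preset TLP packet, that the error type of the TLP packet transmission error between the PCIe device and the host is the retransmission error type.
Further,
the processor 402 is further configured to acquire the first credit value required by a TLP packet to be sent by the host and the second credit value currently remaining in the EP device.
The processor 402 is further configured to determine, if the first credit value is greater than the second credit value, that the error type of the TLP packet transmission error between the PCIe device and the host is the insufficient-credit error type.
Further,
the processor 402 is further configured to acquire the packet header type, the packet data type, and the packet data length of the TLP packet to be sent by the host, and determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
Further,
the processor 402 is further configured to detect whether the PCIe device is abnormal.
The processor 402 is further configured to determine, if it detects that the PCIe device is abnormal, that the error type of the TLP packet transmission error between the PCIe device and the host is the self-abnormality error type.
Further,
the processor 402 is further configured to set the system clock of the PCIe device to the unavailable state by means of a gated clock, where the unavailable state is used to instruct the PCIe device to refuse to process the processing requests sent by the host.
Further,
the processor 402 is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.
In this embodiment of the present invention, the EP device acquires the error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host; if the error type is a repairable error type specified in the PCIe protocol, the EP device measures the duration for which the error type has been present; if the duration reaches the preset duration, the EP device disconnects the link with the host. In this way, the EP device determines, by detecting the error type of the TLP packet transmission error, whether the link between the PCIe device and the host is abnormal, and disconnects that link when an abnormality is detected. The host therefore does not need to disconnect its links with all PCIe devices, which reduces the impact on host services.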
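It should be noted that, when the apparatus provided in the foregoing embodiment disconnects the link between the PCIe device and the host, the division into the foregoing functional modules is used only as an example. In practice, the foregoing functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above. In addition, the apparatus for disconnecting the link between the PCIe device and the host provided in the foregoing embodiment and the method embodiment for disconnecting the link between the PCIe device and the host belong to the same concept. For the specific implementation process, see the method embodiment; details are not repeated here.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.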

Claims (30)

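  1. A method for disconnecting a link between a bus and interface standard (PCIe) device and a host, characterized in that the PCIe device comprises an endpoint (EP) device, and the method comprises:
    acquiring, by the EP device, an error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host;
    if the error type is a repairable error type specified in the PCIe protocol, measuring, by the EP device, the duration for which the error type has been present; and
    if the duration reaches a preset duration, disconnecting, by the EP device, the link with the host.
  2. The method according to claim 1, characterized in that the acquiring, by the EP device, of the error type of the TLP packet transmission error between the PCIe device and the host comprises:
    receiving, by the EP device, a TLP packet sent by the host, and determining whether the TLP packet is damaged; and
    if the TLP packet is damaged, determining, by the EP device, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a negative acknowledgement (NAK) error type.
  3. The method according to claim 1, characterized in that the acquiring, by the EP device, of the error type of the TLP packet transmission error between the PCIe device and the host comprises:
    receiving, by the EP device, a TLP packet sent by the host, and determining whether the TLP packet is a preset TLP packet; and
    if the TLP packet is not the preset TLP packet, determining, by the EP device, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a transmission error type.
  4. The method according to claim 3, characterized in that the determining, by the EP device, of whether the TLP packet is the preset TLP packet comprises:
    acquiring, by the EP device, a first sequence number of the TLP packet, and predicting a third sequence number of the TLP packet according to a second sequence number of the most recent previous TLP packet; and
    if the first sequence number and the third sequence number are not equal, determining, by the EP device, that the TLP packet is not the preset TLP packet.
  5. The method according to claim 3, characterized in that the transmission error type comprises a retransmission error type and a missed-transmission error type, and the method further comprises:
    if the TLP packet is newer than the preset TLP packet, determining, by the EP device, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the missed-transmission error type; and
    if the TLP packet is older than the preset TLP packet, determining, by the EP device, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the retransmission error type.
  6. The method according to claim 1, characterized in that the acquiring, by the EP device, of the error type of the TLP packet transmission error between the PCIe device and the host comprises:
    acquiring, by the EP device, a first credit value required by a TLP packet to be sent by the host and a second credit value currently remaining in the EP device; and
    if the first credit value is greater than the second credit value, determining, by the EP device, that the error type of the TLP packet transmission error between the PCIe device and the host is an insufficient-credit error type.
  7. The method according to claim 6, characterized in that the acquiring, by the EP device, of the first credit value required by the TLP packet to be sent by the host comprises:
    acquiring, by the EP device, a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host; and
    determining, by the EP device, the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  8. The method according to claim 1, characterized in that the acquiring, by the EP device, of the error type of the TLP packet transmission error between the PCIe device and the host comprises:
    detecting, by the EP device, whether the PCIe device is abnormal; and
    if the EP device detects that the PCIe device is abnormal, determining, by the EP device, that the error type of the TLP packet transmission error between the PCIe device and the host is a self-abnormality error type.
  9. The method according to claim 1, characterized in that the disconnecting, by the EP device, of the link with the host comprises:
    setting, by the EP device, the system clock of the PCIe device to an unavailable state by means of a gated clock, wherein the unavailable state is used to instruct the PCIe device to refuse to process processing requests sent by the host.
  10. The method according to claim 1, characterized in that the method further comprises:
    if the error type is an unrepairable error type specified in the PCIe protocol, disconnecting, by the EP device, the link with the host.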
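  11. An apparatus for disconnecting a link between a bus and interface standard (PCIe) device and a host, characterized in that the PCIe device comprises an endpoint (EP) device, and the apparatus comprises:
    an acquiring module, configured to acquire an error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and the host;
    a measuring module, configured to measure, if the error type is a repairable error type specified in the PCIe protocol, the duration for which the error type has been present; and
    a disconnecting module, configured to disconnect the link with the host if the duration reaches a preset duration.
  12. The apparatus according to claim 11, characterized in that the acquiring module comprises:
    a first receiving unit, configured to receive a TLP packet sent by the host;
    a first determining unit, configured to determine whether the TLP packet is damaged; and
    a second determining unit, configured to determine, if the TLP packet is damaged, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a negative acknowledgement (NAK) error type.
  13. The apparatus according to claim 11, characterized in that the acquiring module comprises:
    a second receiving unit, configured to receive a TLP packet sent by the host;
    a third determining unit, configured to determine whether the TLP packet is a preset TLP packet; and
    a fourth determining unit, configured to determine, if the TLP packet is not the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a transmission error type.
  14. The apparatus according to claim 13, characterized in that
    the third determining unit is configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the most recent previous TLP packet, and determine, if the first sequence number and the third sequence number are not equal, that the TLP packet is not the preset TLP packet.
  15. The apparatus according to claim 13, characterized in that the transmission error type comprises a retransmission error type and a missed-transmission error type, and the acquiring module further comprises:
    a fifth determining unit, configured to determine, if the TLP packet is newer than the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the missed-transmission error type; and
    a sixth determining unit, configured to determine, if the TLP packet is older than the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the retransmission error type.
  16. The apparatus according to claim 11, characterized in that the acquiring module comprises:
    an acquiring unit, configured to acquire a first credit value required by a TLP packet to be sent by the host and a second credit value currently remaining in the EP device; and
    a seventh determining unit, configured to determine, if the first credit value is greater than the second credit value, that the error type of the TLP packet transmission error between the PCIe device and the host is an insufficient-credit error type.
  17. The apparatus according to claim 16, characterized in that
    the acquiring unit is configured to acquire a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  18. The apparatus according to claim 11, characterized in that the acquiring module comprises:
    a detecting unit, configured to detect whether the PCIe device is abnormal; and
    an eighth determining unit, configured to determine, if the detecting unit detects that the PCIe device is abnormal, that the error type of the TLP packet transmission error between the PCIe device and the host is a self-abnormality error type.
  19. The apparatus according to claim 11, characterized in that
    the disconnecting module is configured to set the system clock of the PCIe device to an unavailable state by means of a gated clock, wherein the unavailable state is used to instruct the PCIe device to refuse to process processing requests sent by the host.
  20. The apparatus according to claim 11, characterized in that
    the disconnecting module is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.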
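  21. A bus and interface standard (PCIe) device, characterized in that the PCIe device comprises an endpoint (EP) device, and the EP device comprises a memory and a processor, wherein the memory is configured to store data obtained by the processor;
    the processor is configured to acquire an error type of an error in transmitting a transaction layer packet (TLP) between the PCIe device and a host;
    the processor is further configured to measure, if the error type is a repairable error type specified in the PCIe protocol, the duration for which the error type has been present; and
    the processor is further configured to disconnect the link with the host if the duration reaches a preset duration.
  22. The PCIe device according to claim 21, characterized in that
    the processor is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is damaged; and
    the processor is further configured to determine, if the TLP packet is damaged, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a negative acknowledgement (NAK) error type.
  23. The PCIe device according to claim 21, characterized in that
    the processor is further configured to receive a TLP packet sent by the host and determine whether the TLP packet is a preset TLP packet; and
    the processor is further configured to determine, if the TLP packet is not the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is a transmission error type.
  24. The PCIe device according to claim 23, characterized in that
    the processor is further configured to acquire a first sequence number of the TLP packet, predict a third sequence number of the TLP packet according to a second sequence number of the most recent previous TLP packet, and determine, if the first sequence number and the third sequence number are not equal, that the TLP packet is not the preset TLP packet.
  25. The PCIe device according to claim 23, characterized in that the transmission error type comprises a retransmission error type and a missed-transmission error type,
    the processor is further configured to determine, if the TLP packet is newer than the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the missed-transmission error type; and
    the processor is further configured to determine, if the TLP packet is older than the preset TLP packet, that the error type of the error in transmitting the TLP packet between the PCIe device and the host is the retransmission error type.
  26. The PCIe device according to claim 21, characterized in that
    the processor is further configured to acquire a first credit value required by a TLP packet to be sent by the host and a second credit value currently remaining in the EP device; and
    the processor is further configured to determine, if the first credit value is greater than the second credit value, that the error type of the TLP packet transmission error between the PCIe device and the host is an insufficient-credit error type.
  27. The PCIe device according to claim 26, characterized in that
    the processor is further configured to acquire a packet header type, a packet data type, and a packet data length of the TLP packet to be sent by the host, and determine the first credit value required by the TLP packet according to the packet header type, the packet data type, and the packet data length.
  28. The PCIe device according to claim 21, characterized in that
    the processor is further configured to detect whether the PCIe device is abnormal; and
    the processor is further configured to determine, if it detects that the PCIe device is abnormal, that the error type of the TLP packet transmission error between the PCIe device and the host is a self-abnormality error type.
  29. The PCIe device according to claim 21, characterized in that
    the processor is further configured to set the system clock of the PCIe device to an unavailable state by means of a gated clock, wherein the unavailable state is used to instruct the PCIe device to refuse to process processing requests sent by the host.
  30. The PCIe device according to claim 21, characterized in that
    the processor is further configured to disconnect the link with the host if the error type is an unrepairable error type specified in the PCIe protocol.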
PCT/CN2016/083801 2015-09-11 2016-05-28 Method and apparatus for disconnecting link between PCIe device and host WO2017041533A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
ES16843471T ES2748228T3 (es) 2015-09-11 2016-05-28 Método de desconexión de enlace entre un equipo PCIe y un concentrador y dispositivo que utiliza este último
EP16843471.0A EP3296885B1 (en) 2015-09-11 2016-05-28 Method of disconnecting link between pcie equipment and host and device utilizing same
US15/819,440 US10565043B2 (en) 2015-09-11 2017-11-21 Method and apparatus for disconnecting link between PCIE device and host
US16/740,717 US11620175B2 (en) 2015-09-11 2020-01-13 Method and apparatus for disconnecting link between PCIe device and host

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510580109.1A CN105205021B (zh) 2015-09-11 2015-09-11 Method and apparatus for disconnecting link between PCIe device and host
CN201510580109.1 2015-09-11

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/819,440 Continuation US10565043B2 (en) 2015-09-11 2017-11-21 Method and apparatus for disconnecting link between PCIE device and host

Publications (1)

Publication Number Publication Date
WO2017041533A1 true WO2017041533A1 (zh) 2017-03-16

Family

ID=54952714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/083801 WO2017041533A1 (zh) 2015-09-11 2016-05-28 Method and apparatus for disconnecting link between PCIe device and host

Country Status (5)

Country Link
US (2) US10565043B2 (zh)
EP (1) EP3296885B1 (zh)
CN (1) CN105205021B (zh)
ES (1) ES2748228T3 (zh)
WO (1) WO2017041533A1 (zh)

Also Published As

Publication number Publication date
EP3296885A1 (en) 2018-03-21
US11620175B2 (en) 2023-04-04
ES2748228T3 (es) 2020-03-16
EP3296885B1 (en) 2019-08-14
US20200151045A1 (en) 2020-05-14
CN105205021B (zh) 2018-02-13
EP3296885A4 (en) 2018-07-25
US10565043B2 (en) 2020-02-18
US20180095817A1 (en) 2018-04-05
CN105205021A (zh) 2015-12-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16843471

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2016843471

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE