CN110968443A - Equipment abnormity detection method and device - Google Patents

Info

Publication number
CN110968443A
Authority
CN
China
Prior art keywords
pcie
terminal equipment
data packet
link
capacity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811145890.XA
Other languages
Chinese (zh)
Other versions
CN110968443B (en)
Inventor
郑晓
龙欣
谢峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811145890.XA
Publication of CN110968443A
Application granted
Publication of CN110968443B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0745 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in an input/output transactions management context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and a device for detecting device abnormality. The method comprises the following steps: monitoring, through the flow control feature of PCIe (a high-speed serial computer expansion bus), the capacity available for storing data packets on the PCIe link at a PCIe terminal device; when the capacity for data packets reaches a preset threshold, closing the PCIe link and triggering an error report message, wherein the error report message is triggered by the error reporting mechanism of PCIe; and triggering, through the error report message, a driver to check the state of the PCIe terminal device so as to determine whether the PCIe terminal device is abnormal. The invention addresses the technical problem in the related art that, because the AER driver responds slowly, faulty hardware is not repaired in time, which creates a risk of host downtime.

Description

Equipment abnormity detection method and device
Technical Field
The invention relates to the field of equipment detection, in particular to a method and a device for detecting equipment abnormity.
Background
In heterogeneous computing products, GPU/FPGA resources are sold as part of a computing service by passing the devices through directly to virtual machines. However, hardware errors raised by the device itself, or triggered by improper handling of the device inside a virtual machine, can render the PCIe interface unusable. The stability, reliability and security isolation of heterogeneous computing products have therefore always been important. Under certain conditions, because of hardware instability or other unpredictable causes, accesses by the GPU or FPGA computing service to the passed-through GPU/FPGA hardware receive no response, which in turn causes serious system errors; and because the AER driver responds slowly, the hardware is not repaired in time and the host goes down.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a method and a device for detecting device abnormality, so as to at least solve the technical problem in the related art that, because the AER driver responds slowly, faulty hardware is not repaired in time and the host goes down.
According to an aspect of the embodiments of the present invention, there is provided a method for detecting an apparatus abnormality, including: monitoring the capacity of a PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of a high-speed serial computer expansion bus PCIe; under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of the PCIe link; and triggering a driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal.
Further, when the capacity of the data packet reaches a preset threshold, before controlling the PCIe link to be closed and triggering an error report message, the method further includes: when the PCIe terminal equipment system generates errors and cannot respond to a PCIe link, or when the processing speed of the PCIe terminal equipment to the data packet is lower than a preset speed, the data packet starts to be accumulated; and under the condition that the data packets are accumulated to fill the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
Further, when the capacity of the data packet reaches a preset threshold, controlling the PCIe link to be closed, and triggering an error report message includes: under the condition that the capacity of the data packet reaches a preset threshold value, the PCIe transfer station or the PCIe root component cannot send the data packet to the PCIe link; and under the condition that the PCIe transfer station or the PCIe root component cannot send a data packet to the PCIe link, the PCIe transfer station controls the PCIe link to be closed and triggers the error report message.
Further, triggering a driver to detect the status of the PCIe endpoint device through the error report message to determine whether the PCIe endpoint device is abnormal comprises: reading the running state of the PCIe terminal equipment through a system management bus; if the PCIe terminal equipment is read to be abnormal in running state, determining that the PCIe terminal equipment is abnormal; and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
Further, the method further comprises: under the condition that the PCIe terminal equipment is determined to be abnormal, resetting the PCIe terminal equipment to repair the PCIe terminal equipment; and after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal, starting the PCIe link.
Further, the PCIe terminal device is a GPU device or an FPGA device.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for detecting an apparatus abnormality, including: the monitoring unit is used for monitoring the capacity of PCIe link storage data packets of the PCIe terminal equipment through the flow control characteristic of the PCIe of the high-speed serial computer expansion bus; the control unit is used for controlling the PCIe link to be closed and triggering an error report message under the condition that the capacity of the data packet reaches a preset threshold value, wherein the error report message is triggered by an error report mechanism of the PCIe link; and the detection unit is used for triggering a driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal or not.
Further, the apparatus further comprises: a first accumulation unit, used such that, before the PCIe link is closed and the error report message is triggered when the capacity of the data packet reaches the preset threshold, data packets start to accumulate when the PCIe terminal device suffers a system error and cannot respond to the PCIe link, or when the speed at which the PCIe terminal device processes data packets is lower than a preset speed; and a first determining unit, used for determining that the capacity of the data packet reaches the preset threshold when the accumulated data packets fill the capacity for storing data packets on the PCIe link at the PCIe terminal device.
Further, the control unit includes: a first determining module, used for determining that the PCIe transfer station or the PCIe root component cannot send data packets to the PCIe link when the capacity of the data packet reaches the preset threshold; and a control module, used for causing the PCIe transfer station to close the PCIe link and trigger the error report message when the PCIe transfer station or the PCIe root component cannot send a data packet to the PCIe link.
Further, the detection unit includes: the reading module is used for reading the running state of the PCIe terminal equipment through a system management bus; the second determining module is used for determining that the PCIe terminal equipment is abnormal under the condition that the running state of the PCIe terminal equipment is read to be abnormal; and the third determining module is used for determining that the PCIe terminal equipment is not abnormal under the condition that the PCIe terminal equipment is read to be in a normal running state.
Further, the apparatus further comprises: the resetting unit is used for resetting the PCIe terminal equipment to repair the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal; and the starting unit is used for starting the PCIe link after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal.
Further, the PCIe terminal device is a GPU device or an FPGA device.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, a device on which the storage medium is located is controlled to execute any one of the above-mentioned device abnormality detection methods.
According to another aspect of the embodiments of the present invention, there is further provided a processor, where the processor is configured to execute a program, where the program executes the method for detecting the device exception described in any one of the above.
In the embodiments of the invention, the capacity for data packets at the PCIe terminal device is actively monitored; when that capacity reaches a preset threshold, the PCIe link is closed and an error report message is triggered, wherein the error report message is triggered by the error reporting mechanism of PCIe; and the driver is triggered through the error report message to check the state of the PCIe terminal device so as to determine whether it is abnormal. This avoids the risk of host downtime caused by the AER driver responding slowly and the hardware not being repaired in time, ensures that normal business logic is not affected, and thereby solves the technical problem in the related art that slow response of the AER driver leaves hardware unrepaired in time and puts the host at risk of downtime.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a computing device for implementing a method for detecting device anomalies according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for detecting device anomalies according to a first embodiment of the invention;
FIG. 3 is a schematic view of an apparatus for detecting abnormality of a device according to a second embodiment of the present invention; and
FIG. 4 is a block diagram of an alternative computer terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
TLP: Transaction Layer Packet, the basic datagram unit of PCIe.
AER: Advanced Error Reporting, the error reporting mechanism of PCIe.
Data accesses over PCIe fall into two categories, posted requests and non-posted requests:
Posted request: a write TLP issued from one end of the link. Because PCIe provides flow control, the initiating end of the link decides whether to issue a TLP according to the credits of its peer. A posted TLP is bound to be delivered, because the initiator has confirmed, before issuing it, that the destination has credits available to accept it. For example, most MMIO data writes sent from the CPU to an endpoint are posted requests, and such requests do not require the endpoint to reply on whether the write succeeded. PCIe flow control guarantees that a posted request succeeds.
Non-posted request: such requests are typically read TLPs for MMIO, or I/O accesses to the device. When the TLP of a non-posted request reaches the endpoint, the endpoint replies with a completion TLP to notify the initiator that the request has succeeded.
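The credit check that gates a posted request can be illustrated with a minimal Python sketch. It is a simplified model only: real PCIe flow control tracks separate header and data credits per request type, and the names `Receiver` and `try_send_posted` are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    """Hypothetical receiving end of a link; advertises free TLP slots as credits."""
    credits: int


def try_send_posted(receiver: Receiver) -> bool:
    """Issue one posted TLP only if the peer has a free credit.

    Returns True when the TLP was issued, False when the sender must hold it.
    """
    if receiver.credits <= 0:
        return False  # no credit: the TLP is not put on the link
    receiver.credits -= 1  # the credit is consumed when the TLP is issued
    return True
```

Because the sender checks credits before issuing, a TLP that is issued always has a slot waiting for it, which is why a posted request needs no reply.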
Example 1
There is also provided, in accordance with an embodiment of the present invention, an embodiment of a method for detecting device anomalies, it being noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing a method of detecting device abnormality. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the device abnormality detection method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implements the device abnormality detection method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
Under the operating environment, the application provides a method for detecting the device abnormality as shown in fig. 2. Fig. 2 is a flowchart of a method for detecting device abnormality according to a first embodiment of the present invention.
Step S201, monitoring the capacity of PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of the PCIe of the high-speed serial computer expansion bus.
It should be noted that the credits of the PCIe terminal device can be monitored through the flow control feature of PCIe, where the credits indicate the remaining capacity for storing data packets on the PCIe link at the PCIe terminal device. The data packet here is the basic datagram unit of PCIe, abbreviated as TLP. For example, if the link storage of the PCIe terminal device can hold 4 TLPs, the credits are 4 when no TLP is currently stored and 2 when 2 TLPs are currently stored.
While TLPs are sent from the initiating end of the PCIe link to the other end, PCIe monitors the credits of the PCIe terminal device through its flow control feature.
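The arithmetic of the example above (a link store of 4 TLPs) can be written out as a short Python sketch. The function name `remaining_credits` is hypothetical, and counting credits in whole TLPs is a simplification taken from the example in the text.

```python
def remaining_credits(capacity: int, stored_tlps: int) -> int:
    """Credits left = TLP slots in the link store minus TLPs currently held."""
    if not 0 <= stored_tlps <= capacity:
        raise ValueError("stored_tlps must be between 0 and capacity")
    return capacity - stored_tlps
```

With a capacity of 4, an empty store gives 4 credits, a store holding 2 TLPs gives 2 credits, and a full store gives 0.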
Step S202, under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of PCIe.
For example, the preset threshold is 0, that is, the PCIe link has no capacity for storing data packets, and when credits are exhausted, the corresponding link is actively closed, and an AER report message is triggered by an error reporting mechanism of PCIe.
Step S203, the driver is triggered by the error report message to detect the status of the PCIe terminal device, so as to determine whether the PCIe terminal device is abnormal.
After triggering the AER report message, the AER driver is actively triggered to check if the PCIe end device status is normal.
Through the above steps, if the device is detected to be abnormal, the hardware can be handled early, before the error spreads to the PCIe root component or to the system bus, so that host downtime is avoided. This removes the risk of host downtime caused by the AER driver responding slowly and the hardware not being repaired in time, ensures that normal business logic is not affected, and thereby solves the corresponding technical problem in the related art.
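Steps S201 and S202 can be sketched in Python as a single check on a link model. This is an illustrative sketch, not the patented implementation: the `Link` class, the `check_link` function and the message text are hypothetical, and the threshold of 0 follows the example given in the text.

```python
from dataclasses import dataclass, field
from typing import List

PRESET_THRESHOLD = 0  # example value from the text: no capacity left


@dataclass
class Link:
    """Hypothetical model of one PCIe link as seen by the monitor."""
    credits: int
    is_open: bool = True
    aer_messages: List[str] = field(default_factory=list)


def check_link(link: Link) -> bool:
    """Close the link and record an error report once credits hit the threshold.

    Returns True only when this call closed the link.
    """
    if link.is_open and link.credits <= PRESET_THRESHOLD:
        link.is_open = False                            # S202: close the link
        link.aer_messages.append("credits exhausted")   # S202: raise the report
        return True
    return False
```

The recorded message stands in for the AER report that, in step S203, prompts the driver to check the device state.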
As an optional embodiment, before controlling the PCIe link to be closed and triggering the error report message when the capacity of the data packet reaches the preset threshold, the method further includes: when the PCIe terminal equipment system generates errors and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal equipment to the data packets is lower than the preset speed, the data packets start to be stacked; and under the condition that the data packets are accumulated to fill up the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
In the above scheme, when a system error occurs in the PCIe terminal device, the device may stop responding to the PCIe bus, and TLPs start to accumulate. For example, if the PCIe terminal device suffers a system error and cannot receive the TLPs issued by the initiating end of the PCIe link, the TLPs start to accumulate; once they fill the capacity for storing data packets on the PCIe link, the credits are exhausted and reach the preset threshold, and it is determined that the capacity of the data packet reaches the preset threshold.
Alternatively, in another scheme, the PCIe terminal device has no fault, but TLPs accumulate because they are not processed in time. That is, when the speed at which the PCIe terminal device processes TLPs is lower than the speed at which the initiating end of the PCIe link sends them, the TLPs start to accumulate. Once they fill the capacity for storing data packets on the PCIe link, the credits are exhausted and reach the preset threshold, and it is determined that the capacity of the data packet reaches the preset threshold.
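Both accumulation cases (an unresponsive device, and a device that drains TLPs more slowly than they arrive) can be illustrated with a small discrete-time simulation in Python. The function name, the per-tick rates and the tick model are assumptions made for illustration; they do not appear in the source.

```python
from typing import Optional


def ticks_until_exhausted(capacity: int, send_rate: int, drain_rate: int,
                          max_ticks: int = 1000) -> Optional[int]:
    """Return the tick at which the link store fills (credits reach 0).

    Each tick the sender issues up to send_rate TLPs, limited by free credits,
    then the device processes up to drain_rate TLPs. Returns None if the store
    never fills within max_ticks.
    """
    stored = 0
    for tick in range(1, max_ticks + 1):
        stored += min(send_rate, capacity - stored)  # sender is credit-limited
        if stored == capacity:
            return tick  # credits exhausted
        stored -= min(drain_rate, stored)  # device processes some TLPs
    return None
```

A drain rate of 0 models the device that no longer responds; a drain rate below the send rate models the slow device. In both cases the store eventually fills, whereas a device that keeps up never exhausts the credits.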
As an optional embodiment, in the case that the capacity of the data packet reaches the preset threshold, controlling the PCIe link to be closed, and triggering the error report message includes: under the condition that the capacity of the data packet reaches a preset threshold value, a transfer station of PCIe or a root component of the PCIe cannot send the data packet to a PCIe link; under the condition that the transfer station of the PCIe or the root component of the PCIe can not send data packets to the PCIe link, the transfer station of the PCIe controls the PCIe link to be closed and triggers an error report message.
The PCIe transfer station is a PCIe switch, for example a PLX9797 or PLX8747. Such a switch can monitor the state of the link and can be configured to raise an AER alert to the CPU when the credits are exhausted. The PCIe root component is the Root Complex. When the PCIe switch or the Root Complex cannot send a TLP, the AER error report message is triggered.
As an optional implementation, the triggering, by the error report message, the driver to detect the status of the PCIe endpoint device to determine whether the PCIe endpoint device is abnormal includes: reading the running state of PCIe terminal equipment through a system management bus; if the operating state of the PCIe terminal equipment is read to be abnormal, determining that the PCIe terminal equipment is abnormal; and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
When it is determined that the PCIe terminal device is not abnormal, the PCIe link is opened to restore the normal business process.
In the application, if the PCIe terminal device does not respond, the operating state of the PCIe terminal device is determined to be abnormal, and it is determined that the PCIe terminal device is abnormal.
As an optional implementation, the method further comprises: resetting the PCIe terminal equipment to repair the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal; and after the PCIe terminal equipment is repaired, the PCIe link is opened.
It should be noted that, in addition to the above-mentioned resetting of the PCIe endpoint device to repair the PCIe endpoint device, other manners that may implement the repair of the PCIe endpoint device are not limited in this embodiment of the application. And after the PCIe terminal equipment is repaired, starting a PCIe link to recover the normal business process.
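The check-and-recover logic described above (read the state, reset if abnormal, then reopen the link) can be sketched in Python. All names are hypothetical: `read_status` stands in for a query over the system management bus, and a `TimeoutError` is used here, as an assumption, to model a device that does not respond.

```python
from typing import Callable


def handle_aer(read_status: Callable[[], str],
               reset_device: Callable[[], None],
               open_link: Callable[[], None]) -> str:
    """Check the device after an error report, repair if needed, reopen the link.

    Returns "healthy" if the device was normal, "reset" if it had to be reset.
    """
    try:
        status = read_status()
    except TimeoutError:
        status = "no_response"  # no reply is treated as an abnormal state

    if status == "normal":
        outcome = "healthy"
    else:
        reset_device()  # repair by reset; other repair methods are possible
        outcome = "reset"

    open_link()  # resume normal traffic in either case
    return outcome
```

Passing the three operations in as callables keeps the sketch independent of any particular driver or bus library.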
As an optional implementation, the PCIe terminal device is a GPU device or an FPGA device.
In summary, the state of the credits is observed through the flow control function of the PCIe switch or Root Complex, and when the credits are exhausted, the link is closed and AER requests driver intervention. The AER driver actively queries the hardware state. If the hardware is normal, the link is reopened and service continues. If a hardware error is found, the hardware is repaired. The method avoids the risk of host downtime caused by the AER driver responding slowly and the hardware not being repaired in time, and it also ensures that normal business logic is not affected.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is also provided an apparatus for implementing the above method for detecting an abnormality of a device, as shown in fig. 3, the apparatus including: a monitoring unit 301, a control unit 302 and a detection unit 303.
Specifically, the monitoring unit 301 is configured to monitor, through the flow control characteristic of the high-speed serial computer expansion bus PCIe, the capacity of the PCIe link of the PCIe terminal device to store data packets.
It should be noted that the credit value of the PCIe terminal device may be monitored through the flow control characteristic of PCIe, where the credit value (credits) indicates the capacity of the PCIe link on the PCIe terminal device side to store data packets. The data packet is the basic PCIe message unit, the transaction layer packet (TLP). For example, if the link storage of the PCIe terminal device can hold 4 TLPs, the credits are 4 when no TLP is currently stored in the link of the PCIe terminal device, and the credits are 2 when 2 TLPs are currently stored.
While TLPs are sent from the originating end of the PCIe link to the other end, PCIe monitors the credits of the PCIe terminal device through its flow control characteristic.
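The credit arithmetic in the example above can be illustrated with a minimal sketch. The class and method names (`LinkBuffer`, `receive_tlp`, `process_tlp`) are hypothetical and do not come from the embodiment. The sketch also assumes a single credit counter measured in whole TLPs, which is a simplification: real PCIe flow control keeps separate header and data credits for each transaction type.

```python
from dataclasses import dataclass


@dataclass
class LinkBuffer:
    """Toy model of the link storage on the PCIe terminal device side."""
    capacity: int      # total number of TLPs the storage can hold
    stored: int = 0    # TLPs currently held and not yet processed

    @property
    def credits(self) -> int:
        # Remaining credits equal the number of free TLP slots.
        return self.capacity - self.stored

    def receive_tlp(self) -> bool:
        """Accept one TLP if a credit is available; otherwise refuse it."""
        if self.credits == 0:
            return False
        self.stored += 1
        return True

    def process_tlp(self) -> None:
        """The terminal device consumes one TLP, which frees one credit."""
        if self.stored > 0:
            self.stored -= 1


buf = LinkBuffer(capacity=4)
empty_credits = buf.credits   # no TLP stored yet
buf.receive_tlp()
buf.receive_tlp()
half_credits = buf.credits    # two TLPs stored
```

With a capacity of 4 TLPs, the sketch reproduces the values in the text: 4 credits when the storage is empty and 2 credits once 2 TLPs are stored.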
The control unit 302 is configured to control the PCIe link to be closed and trigger an error report message when the capacity of the data packet reaches a preset threshold, where the error report message is an error report message triggered by an error report mechanism of PCIe.
For example, the preset threshold is 0, meaning that the PCIe link has no remaining capacity for storing data packets. When the credits are exhausted, the corresponding link is actively closed, and an AER report message is triggered through the error reporting mechanism of PCIe.
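The behavior of the control unit can be sketched as follows, purely for illustration. `DownstreamPort`, `on_credit_update` and the message string are hypothetical names; an actual PCIe Switch or Root Complex performs this in hardware or firmware, and the list of messages merely stands in for the AER report delivered to the host.

```python
from dataclasses import dataclass, field
from typing import List

PRESET_THRESHOLD = 0  # example threshold from the text: no credits left


@dataclass
class DownstreamPort:
    """Toy model of the port that faces the PCIe terminal device."""
    link_open: bool = True
    aer_messages: List[str] = field(default_factory=list)


def on_credit_update(port: DownstreamPort, credits: int) -> bool:
    """Close the link and record one AER message when credits hit the threshold.

    Returns True only when this call closed the link.
    """
    if port.link_open and credits <= PRESET_THRESHOLD:
        port.link_open = False
        port.aer_messages.append("AER: flow control credits exhausted")
        return True
    return False


port = DownstreamPort()
first = on_credit_update(port, credits=2)   # credits remain: nothing happens
second = on_credit_update(port, credits=0)  # exhausted: link closed, AER raised
third = on_credit_update(port, credits=0)   # already closed: no second report
```

Checking `link_open` before acting keeps the sketch from raising a second report for a link that is already closed.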
The detecting unit 303 is configured to trigger the driver to detect a status of the PCIe endpoint device through the error report message, so as to determine whether the PCIe endpoint device is abnormal.
The detecting unit 303 is configured to actively trigger the AER driver to check whether the PCIe terminal device status is normal after triggering the AER report message.
Through the monitoring unit 301, the control unit 302 and the detection unit 303, if the device is detected to be abnormal, the hardware can be handled in advance to prevent the error from spreading to the PCIe root component or the system bus, thereby avoiding host downtime. This solves the technical problem in the related art that a slow-responding AER driver cannot repair the hardware in time and exposes the host to a downtime risk, and it ensures that normal business logic is not affected.
As an optional implementation, the apparatus further includes a first stacking unit and a first determining unit. The first stacking unit applies before the PCIe link is controlled to be closed and the error report message is triggered in the case that the capacity of the data packet reaches the preset threshold: the data packets start to be stacked when the PCIe terminal device system makes an error and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal device for the data packets is less than a preset speed. The first determining unit is configured to determine that the capacity of the data packet reaches the preset threshold when the stacked data packets fill the capacity of the PCIe link on the PCIe terminal device side to store data packets.
In the above scheme, when a system error occurs in the PCIe terminal device, the device may stop responding to the PCIe bus, and TLPs start to pile up. For example, if the PCIe terminal device cannot receive the TLPs issued by the originating end of the PCIe link, the TLPs accumulate. Once they fill the capacity of the PCIe link to store data packets, the credits are exhausted, the credit value reaches the preset threshold, and it is determined that the capacity of the data packet has reached the preset threshold.
In another alternative, the PCIe terminal device has no exception, but TLPs pile up because they are not processed in time. That is, when the PCIe terminal device processes TLPs more slowly than the originating end of the PCIe link sends them, the TLPs start to accumulate. Once they fill the capacity of the PCIe link to store data packets, the credits are exhausted, the credit value reaches the preset threshold, and it is determined that the capacity of the data packet has reached the preset threshold.
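The two pile-up scenarios can be compared with a small discrete-time sketch. The function `ticks_until_exhausted` and its parameters are hypothetical, and time is modelled in abstract ticks rather than real link timing; each tick the originating end sends first and the terminal device then processes.

```python
from typing import Optional


def ticks_until_exhausted(capacity: int, send_per_tick: int,
                          process_per_tick: int,
                          max_ticks: int = 1000) -> Optional[int]:
    """Return the tick at which credits reach 0, or None if they never do."""
    stored = 0
    for tick in range(1, max_ticks + 1):
        # The sender may only send as many TLPs as there are credits.
        stored += min(send_per_tick, capacity - stored)
        if capacity - stored == 0:
            return tick
        # The terminal device drains what it can before the next tick.
        stored -= min(process_per_tick, stored)
    return None


# Terminal device does not respond at all: nothing is ever processed.
unresponsive = ticks_until_exhausted(capacity=4, send_per_tick=1, process_per_tick=0)
# Terminal device is healthy but slower than the originating end.
slow = ticks_until_exhausted(capacity=4, send_per_tick=2, process_per_tick=1)
# Terminal device keeps up with the originating end.
keeping_up = ticks_until_exhausted(capacity=4, send_per_tick=1, process_per_tick=1)
```

Under these assumptions the credits run out in both failure scenarios, while a terminal device that keeps pace never exhausts them. This is why the exhaustion event alone cannot distinguish a faulty device from a slow one, and a subsequent state check is needed.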
As an optional embodiment, the control unit 302 includes: a first determining module, configured to determine that the PCIe transfer station or the PCIe root component cannot send data packets to the PCIe link when the capacity of the data packet reaches the preset threshold; and a control module, configured to cause the PCIe transfer station to close the PCIe link and trigger the error report message when the PCIe transfer station or the PCIe root component cannot send data packets to the PCIe link.
The PCIe transfer station is a PCIe Switch, for example, PLX9797 or PLX8747. Such a Switch can monitor the status of the link and can be configured to issue an AER alert to the CPU when the credits are exhausted. The PCIe root component is the Root Complex. When the PCIe Switch or the Root Complex fails to send a TLP, the AER error report message is triggered.
As an optional implementation, the detecting unit 303 includes: the reading module is used for reading the running state of the PCIe terminal equipment through a system management bus; the second determining module is used for determining that the PCIe terminal equipment is abnormal under the condition that the operating state of the PCIe terminal equipment is read to be abnormal; and the third determining module is used for determining that the PCIe terminal equipment is not abnormal under the condition that the PCIe terminal equipment is read to be in the normal running state.
A second starting unit is configured to open the PCIe link when it is determined that the PCIe terminal device is not abnormal, so as to restore the normal business process.
In this application, if the PCIe terminal device does not respond, its operating state is read as abnormal, and the PCIe terminal device is therefore determined to be abnormal.
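The decision made by the detection unit can be sketched as below. The embodiment does not specify the system management bus register layout, so the read is passed in as a callable; `STATUS_OK`, `endpoint_is_abnormal` and the use of `TimeoutError` or `None` to represent "no response" are all assumptions made for illustration.

```python
from typing import Callable, Optional

STATUS_OK = 0  # hypothetical status code meaning "running normally"


def endpoint_is_abnormal(read_status: Callable[[], Optional[int]]) -> bool:
    """Decide whether the terminal device is abnormal from one status read.

    `read_status` stands in for a system management bus read. A timeout or
    a missing value is treated as "no response", which counts as abnormal.
    """
    try:
        status = read_status()
    except TimeoutError:
        return True
    if status is None:
        return True
    return status != STATUS_OK


def timed_out() -> Optional[int]:
    # Simulates a terminal device that never answers the bus read.
    raise TimeoutError("no response on the system management bus")


normal_case = endpoint_is_abnormal(lambda: STATUS_OK)
error_case = endpoint_is_abnormal(lambda: 7)   # any non-OK code
silent_case = endpoint_is_abnormal(timed_out)
```

Passing the bus read in as a callable keeps the decision logic independent of any particular bus library, which would otherwise have to be assumed.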
As an optional implementation, the apparatus further comprises: the resetting unit is used for resetting the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal so as to repair the PCIe terminal equipment; the starting unit is used for starting the PCIe link after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal.
It should be noted that resetting the PCIe terminal device is only one way to repair it; the embodiments of this application do not limit other manners of repairing the PCIe terminal device. After the PCIe terminal device is repaired, the PCIe link is opened to restore the normal business process.
As an optional implementation, the PCIe terminal device is a GPU device or an FPGA device.
In summary, the state of the credits is observed through the flow control function of the PCIe Switch or Root Complex; once the credits are exhausted, the link is closed and AER requests driver intervention. The AER driver then actively queries the hardware state. If the hardware is normal, the link is reopened and service continues. If a hardware error is found, the hardware is repaired. This avoids the risk of host downtime caused by an AER driver that responds too slowly to repair the hardware in time, and it ensures that normal business logic is not affected.
It should be noted here that the monitoring unit 301, the control unit 302 and the detection unit 303 correspond to steps S201 to S203 in embodiment 1. The three units share the implementation examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the units described above, as part of the apparatus, may run in the computer terminal 10 provided in the first embodiment.
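The recovery sequence summarized above, which runs after the link has been closed and the AER report has reached the driver, can be sketched as a single handler. The function `handle_aer_report` and its three callables are hypothetical, and the sketch assumes that a reset always repairs the terminal device; a real driver would have to verify the repair before reopening the link.

```python
from typing import Callable, List


def handle_aer_report(is_abnormal: Callable[[], bool],
                      reset_endpoint: Callable[[], None],
                      open_link: Callable[[], None]) -> List[str]:
    """Query the hardware state, repair it if needed, then reopen the link.

    Returns the ordered list of actions taken, for inspection.
    """
    actions: List[str] = []
    if is_abnormal():
        # Hardware error found: repair the terminal device by resetting it.
        reset_endpoint()
        actions.append("reset")
    # Hardware is normal (or assumed repaired): resume service.
    open_link()
    actions.append("open_link")
    return actions


healthy_actions = handle_aer_report(lambda: False, lambda: None, lambda: None)
faulty_actions = handle_aer_report(lambda: True, lambda: None, lambda: None)
```

In both branches the link is reopened last, so traffic resumes only after the state check, and after the reset when one was needed.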
Example 3
The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: monitoring the capacity of a PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of a high-speed serial computer expansion bus PCIe; under the condition that the capacity of the data packet reaches a preset threshold value, controlling a PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of PCIe; and triggering the driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and before triggering an error report message, the method further comprises the following steps: when the PCIe terminal equipment system generates errors and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal equipment to the data packets is lower than the preset speed, the data packets start to be stacked; and under the condition that the data packets are accumulated to fill the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and triggering an error report message comprises the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, a transfer station of PCIe or a root component of the PCIe cannot send the data packet to a PCIe link; under the condition that the transfer station of the PCIe or the root component of the PCIe can not send data packets to the PCIe link, the transfer station of the PCIe controls the PCIe link to be closed and triggers an error report message.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: triggering the driver to detect the status of the PCIe endpoint device via the error report message to determine whether the PCIe endpoint device is anomalous comprises: reading the running state of PCIe terminal equipment through a system management bus; if the operating state of the PCIe terminal equipment is read to be abnormal, determining that the PCIe terminal equipment is abnormal; and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: the method further comprises the following steps: resetting the PCIe terminal equipment to repair the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal; after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal, the PCIe link is opened.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for detecting device abnormality of an application program: the PCIe terminal equipment is GPU equipment or FPGA equipment.
Fig. 4 is a block diagram of an alternative computer terminal according to an embodiment of the present invention. As shown in fig. 4, the computer terminal 10 may include: one or more processors (only one shown) and memory.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for detecting device abnormality in the embodiments of the present invention. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, it implements the method for detecting device abnormality described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor, and these remote memories may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: monitoring the capacity of a PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of a high-speed serial computer expansion bus PCIe; under the condition that the capacity of the data packet reaches a preset threshold value, controlling a PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of PCIe; and triggering the driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and before triggering an error report message, the method further comprises the following steps: when the PCIe terminal equipment system generates errors and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal equipment to the data packets is lower than the preset speed, the data packets start to be stacked; and under the condition that the data packets are accumulated to fill the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and triggering an error report message comprises the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, a transfer station of PCIe or a root component of the PCIe cannot send the data packet to a PCIe link; under the condition that the transfer station of the PCIe or the root component of the PCIe can not send data packets to the PCIe link, the transfer station of the PCIe controls the PCIe link to be closed and triggers an error report message.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: triggering the driver to detect the status of the PCIe endpoint device via the error report message to determine whether the PCIe endpoint device is anomalous comprises: reading the running state of PCIe terminal equipment through a system management bus; if the operating state of the PCIe terminal equipment is read to be abnormal, determining that the PCIe terminal equipment is abnormal; and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: the method further comprises the following steps: resetting the PCIe terminal equipment to repair the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal; after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal, the PCIe link is opened.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: the PCIe terminal equipment is GPU equipment or FPGA equipment.
The embodiment of the invention provides a scheme for detecting device abnormality. The capacity of the PCIe terminal device to store data packets is actively monitored. When the capacity of the data packet reaches a preset threshold, the PCIe link is controlled to be closed and an error report message is triggered, where the error report message is triggered by the error report mechanism of PCIe. The driver is then triggered by the error report message to detect the state of the PCIe terminal device and determine whether the PCIe terminal device is abnormal. This avoids the host downtime risk caused by an AER driver that responds too slowly to repair the hardware in time, ensures that normal business logic is not affected, and thereby solves the corresponding technical problem in the related art.
It can be understood by those skilled in the art that the structure shown in fig. 4 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 4 does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 4, or have a different configuration than shown in FIG. 4.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 4
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the method for detecting a device abnormality provided in the first embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: monitoring the capacity of a PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of a high-speed serial computer expansion bus PCIe; under the condition that the capacity of the data packet reaches a preset threshold value, controlling a PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of PCIe; and triggering the driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and before triggering an error report message, the method further comprises the following steps: when the PCIe terminal equipment system generates errors and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal equipment to the data packets is lower than the preset speed, the data packets start to be stacked; and under the condition that the data packets are accumulated to fill the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed, and triggering an error report message comprises the following steps: under the condition that the capacity of the data packet reaches a preset threshold value, a transfer station of PCIe or a root component of the PCIe cannot send the data packet to a PCIe link; under the condition that the transfer station of the PCIe or the root component of the PCIe can not send data packets to the PCIe link, the transfer station of the PCIe controls the PCIe link to be closed and triggers an error report message.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: triggering the driver to detect the status of the PCIe endpoint device via the error report message to determine whether the PCIe endpoint device is anomalous comprises: reading the running state of PCIe terminal equipment through a system management bus; if the operating state of the PCIe terminal equipment is read to be abnormal, determining that the PCIe terminal equipment is abnormal; and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the method further comprises the following steps: resetting the PCIe terminal equipment to repair the PCIe terminal equipment under the condition that the PCIe terminal equipment is determined to be abnormal; after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal, the PCIe link is opened.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the PCIe terminal equipment is GPU equipment or FPGA equipment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (14)

1. A method for detecting device abnormality, comprising:
monitoring the capacity of a PCIe link storage data packet of PCIe terminal equipment through the flow control characteristic of a high-speed serial computer expansion bus PCIe;
under the condition that the capacity of the data packet reaches a preset threshold value, controlling the PCIe link to be closed and triggering an error report message, wherein the error report message is triggered by an error report mechanism of the PCIe link;
and triggering a driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal.
2. The method of claim 1, wherein, before controlling the PCIe link to be closed and triggering an error report message in the case that the capacity of the data packet reaches a preset threshold, the method further comprises:
when the PCIe terminal equipment system generates errors and cannot respond to a PCIe link, or when the processing speed of the PCIe terminal equipment to the data packet is lower than a preset speed, the data packet starts to be accumulated;
and under the condition that the data packets are accumulated to fill the capacity of the PCIe link storage data packets of the PCIe terminal equipment, determining that the capacity of the data packets reaches a preset threshold value.
3. The method of claim 1, wherein, in the case that the capacity of the data packet reaches a preset threshold, controlling the PCIe link to be closed and triggering an error report message comprises:
under the condition that the capacity of the data packet reaches a preset threshold value, the PCIe transfer station or the PCIe root component cannot send the data packet to the PCIe link;
and under the condition that the PCIe transfer station or the PCIe root component cannot send a data packet to the PCIe link, the PCIe transfer station controls the PCIe link to be closed and triggers the error report message.
4. The method of claim 1, wherein triggering a driver through the error report message to detect the state of the PCIe terminal device, so as to determine whether the PCIe terminal device is abnormal, comprises:
reading the running state of the PCIe terminal equipment through a system management bus;
if the PCIe terminal equipment is read to be abnormal in running state, determining that the PCIe terminal equipment is abnormal;
and if the PCIe terminal equipment is read to be in a normal running state, determining that the PCIe terminal equipment is not abnormal.
5. The method of claim 4, wherein the method further comprises:
under the condition that the PCIe terminal equipment is determined to be abnormal, resetting the PCIe terminal equipment to repair the PCIe terminal equipment;
and after the PCIe terminal equipment is repaired or under the condition that the PCIe terminal equipment is determined not to be abnormal, starting the PCIe link.
6. The method of claim 1, wherein the PCIe terminal device is a GPU device or an FPGA device.
7. An apparatus for detecting device abnormality, comprising:
the monitoring unit is used for monitoring the capacity of PCIe link storage data packets of the PCIe terminal equipment through the flow control characteristic of the PCIe of the high-speed serial computer expansion bus;
the control unit is used for controlling the PCIe link to be closed and triggering an error report message under the condition that the capacity of the data packet reaches a preset threshold value, wherein the error report message is triggered by an error report mechanism of the PCIe link;
and the detection unit is used for triggering a driver to detect the state of the PCIe terminal equipment through the error report message so as to determine whether the PCIe terminal equipment is abnormal or not.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the first accumulation unit is used so that, before the PCIe link is controlled to be closed and the error report message is triggered in the case that the capacity of the data packet reaches a preset threshold, the data packet starts to be accumulated when the PCIe terminal device system generates an error and cannot respond to the PCIe link, or when the processing speed of the PCIe terminal device for the data packet is lower than a preset speed;
the first determining unit is used for determining that the capacity of the data packet reaches a preset threshold value under the condition that the data packet is accumulated to fill up the capacity of the PCIe link storage data packet of the PCIe terminal equipment end.
9. The apparatus of claim 7, wherein the control unit comprises:
the first determining module is used for determining that the PCIe transfer station or the PCIe root component cannot send the data packet to the PCIe link under the condition that the capacity of the data packet reaches a preset threshold value;
and the control module is used for controlling the PCIe link to be closed by the PCIe transfer station and triggering the error report message under the condition that the PCIe transfer station or the PCIe root component cannot send a data packet to the PCIe link.
10. The apparatus of claim 7, wherein the detection unit comprises:
the reading module is configured to read the running state of the PCIe terminal equipment through a system management bus;
the second determining module is configured to determine that the PCIe terminal equipment is abnormal when the running state read from the PCIe terminal equipment is abnormal;
and the third determining module is configured to determine that the PCIe terminal equipment is not abnormal when the running state read from the PCIe terminal equipment is normal.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the resetting unit is configured to reset the PCIe terminal equipment, so as to repair it, when the PCIe terminal equipment is determined to be abnormal;
and the starting unit is configured to start the PCIe link after the PCIe terminal equipment is repaired, or when the PCIe terminal equipment is determined not to be abnormal.
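The driver-side handling described in claims 10 and 11 might look roughly like the following. It is a minimal sketch with stand-in objects: `FakeDevice`, `FakeLink`, `read_state_over_smbus`, and `on_error_report` are hypothetical names, the system management bus read is simulated by returning a stored value, and a reset is assumed to always repair the device.

```python
from enum import Enum


class DeviceState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


class FakeDevice:
    """Stand-in for a GPU or FPGA device; `state` is what an out-of-band read returns."""

    def __init__(self, state: DeviceState):
        self.state = state
        self.reset_count = 0

    def read_state_over_smbus(self) -> DeviceState:
        # Hypothetical helper: a real driver would issue an SMBus transaction here.
        return self.state

    def reset(self) -> None:
        # Assumption: a reset always returns the device to a normal state.
        self.reset_count += 1
        self.state = DeviceState.NORMAL


class FakeLink:
    """Stand-in for the PCIe link, which is closed when the error report fires."""

    def __init__(self):
        self.open = False

    def start(self) -> None:
        self.open = True


def on_error_report(device: FakeDevice, link: FakeLink) -> bool:
    """Driver callback for the error report message.

    Returns True if the device was found abnormal (and was reset).
    """
    abnormal = device.read_state_over_smbus() is DeviceState.ABNORMAL
    if abnormal:
        device.reset()      # repair by reset
    link.start()            # restart the link after repair, or if nothing was wrong
    return abnormal


# Usage: one abnormal device and one healthy device.
bad_dev, bad_link = FakeDevice(DeviceState.ABNORMAL), FakeLink()
ok_dev, ok_link = FakeDevice(DeviceState.NORMAL), FakeLink()
bad_result = on_error_report(bad_dev, bad_link)
ok_result = on_error_report(ok_dev, ok_link)
```

Reading the state over a side channel matters here because the PCIe link itself is closed at this point, so the driver cannot query the device in-band.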
12. The apparatus of claim 7, wherein the PCIe terminal device is a GPU device or an FPGA device.
13. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the method for detecting device abnormality according to any one of claims 1 to 6.
14. A processor, configured to run a program, wherein the program executes the method for detecting the device abnormality according to any one of claims 1 to 6.
CN201811145890.XA 2018-09-28 2018-09-28 Equipment abnormity detection method and device Active CN110968443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811145890.XA CN110968443B (en) 2018-09-28 2018-09-28 Equipment abnormity detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811145890.XA CN110968443B (en) 2018-09-28 2018-09-28 Equipment abnormity detection method and device

Publications (2)

Publication Number Publication Date
CN110968443A CN110968443A (en) 2020-04-07
CN110968443B CN110968443B (en) 2023-04-11

Family

ID=70027556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811145890.XA Active CN110968443B (en) 2018-09-28 2018-09-28 Equipment abnormity detection method and device

Country Status (1)

Country Link
CN (1) CN110968443B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782456A (en) * 2020-06-30 2020-10-16 平安国际智慧城市科技股份有限公司 Anomaly detection method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130219194A1 (en) * 2012-02-16 2013-08-22 Hon Hai Precision Industry Co., Ltd. Test apparatus and method for testing pcie slot
CN103384204A (en) * 2011-12-31 2013-11-06 华为数字技术(成都)有限公司 Method and device for processing serial concurrent conversion circuit failure
CN103543961A (en) * 2013-10-12 2014-01-29 浙江宇视科技有限公司 PCIe-based storage extension system and method
CN105205021A (en) * 2015-09-11 2015-12-30 华为技术有限公司 Method and device for disconnecting link between PCIe (peripheral component interface express) equipment and host computer
CN105700967A (en) * 2016-01-08 2016-06-22 华为技术有限公司 PCIe (Peripheral Component Interconnect Express) equipment and detection method thereof
US20160321155A1 (en) * 2015-04-30 2016-11-03 Fujitsu Limited Bus connection target device, storage control device and bus communication system
CN107678994A (en) * 2017-09-15 2018-02-09 华为技术有限公司 PCIe device hot drawing method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103384204A (en) * 2011-12-31 2013-11-06 华为数字技术(成都)有限公司 Method and device for processing serial concurrent conversion circuit failure
US20130219194A1 (en) * 2012-02-16 2013-08-22 Hon Hai Precision Industry Co., Ltd. Test apparatus and method for testing pcie slot
CN103543961A (en) * 2013-10-12 2014-01-29 浙江宇视科技有限公司 PCIe-based storage extension system and method
US20160321155A1 (en) * 2015-04-30 2016-11-03 Fujitsu Limited Bus connection target device, storage control device and bus communication system
CN105205021A (en) * 2015-09-11 2015-12-30 华为技术有限公司 Method and device for disconnecting link between PCIe (peripheral component interface express) equipment and host computer
US20180095817A1 (en) * 2015-09-11 2018-04-05 Huawei Technologies Co., Ltd. Method and apparatus for disconnecting link between pcie device and host
CN105700967A (en) * 2016-01-08 2016-06-22 华为技术有限公司 PCIe (Peripheral Component Interconnect Express) equipment and detection method thereof
CN107678994A (en) * 2017-09-15 2018-02-09 华为技术有限公司 PCIe device hot drawing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
廖寅龙; 田泽; 赵强; 马超: "Design and Implementation of an Elastic Buffer for the PCIe Bus Physical Layer" (PCIe总线物理层弹性缓冲设计与实现) *

Also Published As

Publication number Publication date
CN110968443B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN107222426B (en) Flow control method, device and system
CN103399546B (en) Triple redundance control method and system
CN100365994C (en) Method and system for regulating ethernet
CN106502814B (en) Method and device for recording error information of PCIE (peripheral component interface express) equipment
RU2614569C2 (en) Rack with automatic recovery function and method of automatic recovery for this rack
CN112398692B (en) Consensus process processing method and device and electronic equipment
CN110968443B (en) Equipment abnormity detection method and device
CN113001590A (en) Robot fault recovery method, device, equipment and computer readable storage medium
CN110889143A (en) File verification method and device
KR20160023873A (en) Hardware management communication protocol
CN111805544A (en) Robot control method and device
CN114826962A (en) Link fault detection method, device, equipment and machine readable storage medium
CN110912985A (en) Network link scheduling method and related equipment
CN105210043A (en) Information processing device
CN110968456B (en) Method and device for processing fault disk in distributed storage system
CN109039761B (en) Method and device for processing fault link in cluster control channel
CN113568398B (en) Configuration deleting method and system for distributed control system
CN106911557B (en) Message transmission method and device
CN112214437B (en) Storage device, communication method and device and computer readable storage medium
CN109495463B (en) Link width negotiation method, device and computer readable storage medium
CN114531257A (en) Network attack handling method and device
CN111625831B (en) Trusted security measurement method and device
CN109918257B (en) Hard disk exception handling method and device
EP3349398B1 (en) Method for monitoring an iot device and using it as battery protection watchdog for iot devices
CN112463446B (en) PCIe device recovery method and system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40026971; Country of ref document: HK)
GR01 Patent grant