CN116781569A - Network card fault determination method and device - Google Patents

Info

Publication number
CN116781569A
Authority
CN
China
Prior art keywords
network card
event
target network
type information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310720959.1A
Other languages
Chinese (zh)
Inventor
王震
李冬冬
刘清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310720959.1A
Publication of CN116781569A
Current legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a network card fault determining method and device, wherein the method comprises the following steps: polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event; and determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event. According to the network card fault determining method and device, the asynchronous event and the completion queue event of the target network card are monitored through polling, and the fault state of the target network card is determined based on the event type information of the asynchronous event and/or the event type information of the completion queue event. The method and the device realize the automatic determination of whether the target network card is in a fault state based on the monitoring of the target network card event. The method provides a basis for carrying out targeted fault processing on the target network card based on the fault state of the target network card, and improves the stability of the link of the target network card.

Description

Network card fault determination method and device
Technical Field
The present invention relates to the field of server technologies, and in particular, to a method and an apparatus for determining a network card failure.
Background
Remote Direct Memory Access (RDMA) is a network interconnect technology with high bandwidth, low latency, and low central processing unit (CPU) consumption. It overcomes the latency bottleneck of the conventional TCP/IP network protocol stack and the sustained high CPU load that the stack causes. To remove the bottleneck that traditional network communication imposes on computing tasks, RDMA adopts protocol stack bypass and zero-copy techniques, which provide low latency, reduce CPU usage, relieve the memory bandwidth bottleneck, and yield high bandwidth utilization. This makes RDMA suitable for building the main network communication link between storage devices.
When a network card that uses the RDMA network protocol runs services, abnormal scenarios inevitably arise and cause card faults. If a card fault is not detected in time and handled effectively, the network card may suffer data transmission delays and data inconsistency, and in severe cases the link may crash, seriously affecting services.
How to determine the fault state of a network card that uses the RDMA network protocol is therefore an important problem to be solved in the industry.
Disclosure of Invention
The invention provides a network card fault determining method and device for determining the fault state of a network card that uses the RDMA network protocol.
The invention provides a network card fault determining method, which comprises the following steps:
polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using the remote direct memory access (RDMA) network protocol;
and determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
According to the network card fault determining method provided by the invention, determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event comprises the following steps:
judging event type information of the asynchronous event, and determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event;
and/or,
judging the event type information of the completion queue event, and determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event.
According to the network card fault determining method provided by the invention, judging the event type information of the asynchronous event and determining that the target network card is in a fault state based on the first error reporting information in the event type information of the asynchronous event comprises the following steps:
judging event type information of the asynchronous event, and determining that the target network card is in a fault state when the first error reporting information contains a fatal error type, or when the number of occurrences of the data transmission anomaly type in the first error reporting information is greater than a preset count threshold.
According to the network card fault determining method provided by the invention, judging the event type information of the completion queue event and determining that the target network card is in a fault state based on the second error reporting information in the event type information of the completion queue event comprises the following steps:
judging event type information of the completion queue event, and determining that the target network card is in a fault state when the second error reporting information contains a fatal error type or an unknown error type, or when the number of occurrences of the data transmission anomaly type in the second error reporting information is greater than a preset count threshold.
According to the network card fault determining method provided by the invention, after determining that the target network card is in a fault state, the method further comprises the following steps:
sending a fault message to the operation and maintenance personnel responsible for the target network card, so that they perform a hot-plug operation on the target network card after receiving the fault message.
According to the network card fault determining method provided by the invention, after determining that the target network card is in a fault state, the method further comprises the following steps:
determining a redundant link corresponding to the target network card;
and switching the services running on the target network card to the redundant link, so that the services run on the redundant link.
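The switch to a redundant link can be sketched as a lookup followed by a reassignment. The following Python sketch is illustrative only: the `services` and `redundant_links` mappings and the `fail_over` helper are hypothetical names that do not appear in the invention, and a real system would move traffic through its own link management interface.

```python
from typing import Dict

def fail_over(services: Dict[str, str],
              redundant_links: Dict[str, str],
              failed_nic: str) -> str:
    """Move every service running on failed_nic to its redundant link.

    services maps a service name to the link it currently runs on.
    redundant_links maps a network card to its standby link.
    Both mappings are assumptions made for this sketch.
    """
    backup = redundant_links[failed_nic]      # determine the redundant link
    for name, link in services.items():
        if link == failed_nic:
            services[name] = backup           # switch the running service
    return backup

services = {"replication": "nic0", "heartbeat": "nic1"}
chosen = fail_over(services, {"nic0": "nic2"}, "nic0")
```

Services that were not running on the failed network card are left where they are, which keeps the switch limited to the affected link.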
According to the network card fault determining method provided by the invention, before the asynchronous event of the target network card and the completion queue event of the target network card are polled and monitored, the method further comprises the following step:
setting the file descriptor of the asynchronous event to non-blocking mode, so that the asynchronous event of the target network card can be acquired.
The invention also provides a network card fault determining device, which comprises:
a monitoring module, used for polling and monitoring asynchronous events of a target network card and completion queue events of the target network card, and determining event type information of the asynchronous events and event type information of the completion queue events, wherein the target network card is a network card using the remote direct memory access (RDMA) network protocol;
and a state determining module, used for determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
The invention also provides an electronic device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the network card fault determination methods described above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the network card fault determination methods described above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a network card fault determination method as described in any one of the above.
According to the network card fault determining method and device, the asynchronous event of the target network card and the completion queue event of the target network card are monitored in a polling mode, and the fault state of the target network card is determined based on the event type information of the asynchronous event and/or the event type information of the completion queue event. The method and the device realize the automatic determination of whether the target network card is in a fault state based on the monitoring of the target network card event. The method provides a basis for carrying out targeted fault processing on the target network card based on the fault state of the target network card, and improves the stability of the link of the target network card.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of the network card fault determining method provided by the invention;
Fig. 2 is a schematic structural diagram of a device to which the network card fault determining method provided by the invention is applied;
Fig. 3 is a schematic diagram of the judgment performed by the asynchronous event polling monitoring module provided by the invention;
Fig. 4 is a schematic diagram of the judgment performed by the completion queue event polling monitoring module provided by the invention;
Fig. 5 is a schematic structural diagram of the network card fault determining device provided by the invention;
Fig. 6 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a network card fault determining method, and Fig. 1 is a schematic flow chart of the method. Referring to Fig. 1, the network card fault determining method provided by the present invention may include:
step 110, polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using the remote direct memory access (RDMA) network protocol;
and step 120, determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
The network card fault determination method provided by the invention may be executed by an electronic device, or by a component, integrated circuit, or chip in the electronic device. The electronic device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet, notebook, palmtop, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), and the non-mobile electronic device may be a server, network attached storage (NAS), or personal computer (PC); the invention is not particularly limited in this respect.
The technical scheme of the invention is described in detail below by taking a computer to execute the network card fault determination method provided by the invention as an example.
In step 110, the asynchronous events of the target network card and the completion queue events of the target network card are polled and monitored, and the event type information of the asynchronous events and the event type information of the completion queue events is determined.
Asynchronous events refer to mechanisms that notify an application when a data transfer or other operation is completed. These events may be send or receive data completions, error occurrences, or other events related to RDMA operations.
A completion queue (CQ) is a data structure used to manage asynchronous completion notifications, and a completion queue event is an entry reported through it. When an RDMA operation completes, the adapter places information about the operation into the CQ and notifies the application. CQ events typically contain information about the RDMA operation, such as the number of bytes transferred, the operation type (read or write), and the source and destination addresses. CQ events may also carry information about errors and failures, such as a transmission failure or an abnormal disconnection.
To poll the events of the target network card, a polling monitoring module can be constructed that periodically polls the target network card for asynchronous events and completion queue events.
It should be noted that, before the asynchronous events of the target network card are polled, the file descriptor for its asynchronous events, namely async_fd, needs to be initialized and set to non-blocking mode. Only in non-blocking mode can the asynchronous events of the target network card be acquired by polling or by a timer trigger.
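The effect of non-blocking mode can be illustrated with an ordinary file descriptor. The Python sketch below is an assumption-laden stand-in: a pipe plays the role of async_fd, and `make_nonblocking` and `try_read_event` are hypothetical helper names. With a real RDMA device, the descriptor would be the device context's async_fd and events would be fetched through the verbs interface rather than a raw read.

```python
import os
from typing import Optional

def make_nonblocking(fd: int) -> None:
    """Switch a file descriptor to non-blocking mode (O_NONBLOCK)."""
    os.set_blocking(fd, False)

def try_read_event(fd: int, size: int = 64) -> Optional[bytes]:
    """Return pending bytes, or None when nothing is queued yet."""
    try:
        return os.read(fd, size)
    except BlockingIOError:
        # In blocking mode this read would stall the polling loop instead.
        return None

# A pipe stands in for the device's async_fd (assumption for this sketch).
read_fd, write_fd = os.pipe()
make_nonblocking(read_fd)
first = try_read_event(read_fd)       # nothing queued: returns at once
os.write(write_fd, b"PORT_ERR")
second = try_read_event(read_fd)      # an event is now available
```

Because the first read returns immediately instead of waiting, a periodic poller or timer callback can check the descriptor on every cycle without stalling.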
In the process of polling and monitoring the asynchronous event in the target network card, monitoring whether the asynchronous event is generated in the target network card, acquiring the asynchronous event based on a corresponding interface after the asynchronous event is generated, and determining event type information of the asynchronous event.
The event type of an asynchronous event in the target network card may be a card fault type, such as IBV_EVENT_DEVICE_FATAL. This event indicates that a fatal error has occurred on the target network card, which must be taken offline for handling. The event type of an asynchronous event may also be a fault type related to data transmission, such as IBV_EVENT_PORT_ERR. This event indicates an error on the link port of the network card: the data link of the target network card is connecting and disconnecting intermittently, which, in the absence of external intervention, indicates that the link is unstable.
In the process of carrying out polling monitoring on the completion queue event in the target network card, monitoring whether the completion queue event is generated in the target network card, acquiring the completion queue event based on a corresponding interface after the completion queue event is generated, and determining event type information of the completion queue event.
Based on the completion queue events obtained by polling the target network card, the event type information of each completion queue event is determined and used to analyze whether the target network card is abnormal. For example, a completion queue event whose event type information is IBV_WC_FATAL_ERR indicates that a fatal error occurred in data transmission on the target network card, and a completion queue event whose event type information is IBV_WC_GENERAL_ERR indicates that an unknown error occurred in data transmission on the target network card.
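The periodic polling of both event sources described above can be sketched as a simple loop. This Python sketch is a simulation under stated assumptions: `poll_monitor` is a hypothetical name, the two fetch callables stand in for whatever interface returns the next asynchronous event or completion queue entry, and event types are represented as plain strings taken from the text.

```python
import time
from typing import Callable, List, Optional, Tuple

def poll_monitor(get_async: Callable[[], Optional[str]],
                 get_cq: Callable[[], Optional[str]],
                 cycles: int,
                 interval_s: float = 0.0) -> List[Tuple[str, str]]:
    """Poll both event sources once per cycle and record what was seen.

    Each fetch callable returns an event type name, or None when no
    event is pending (the non-blocking case).
    """
    seen: List[Tuple[str, str]] = []
    for _ in range(cycles):
        for source, fetch in (("async", get_async), ("cq", get_cq)):
            event_type = fetch()
            if event_type is not None:
                seen.append((source, event_type))
        time.sleep(interval_s)                # polling period
    return seen

# Scripted event streams stand in for a real network card.
async_events = iter([None, "IBV_EVENT_PORT_ERR", None])
cq_events = iter(["IBV_WC_WR_FLUSH_ERR", None, None])
observed = poll_monitor(lambda: next(async_events),
                        lambda: next(cq_events), cycles=3)
```

The recorded `(source, event_type)` pairs are what step 120 would then evaluate to decide the fault state.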
Optionally, the network card using the RDMA network protocol may be an RDMA over Converged Ethernet (RoCE) network card. A RoCE network card is a network card that allows the RDMA network protocol to be used over Ethernet.
In step 120, after the event type information of the asynchronous event and the event type information of the completion queue event is determined, the event type information of the asynchronous event and/or the event type information of the completion queue event is analyzed to determine the fault state of the target network card.
The event type information of the asynchronous event and the event type information of the completion queue event reflect the running state of the current target network card. Therefore, after determining the event type information of the asynchronous event and the event type information of the completion queue event, it is possible to implement a determination as to whether the target network card is in a failure state.
Whether the target network card is in a fault state can be determined from the event type information of the asynchronous event alone, from the event type information of the completion queue event alone, or from both together.
For example, during polling and monitoring of the target network card, if the event type of an asynchronous event of the target network card is IBV_EVENT_DEVICE_FATAL, the event indicates a fatal error on the target network card, and the target network card can be determined to be in a fault state. The event type of the asynchronous event may also be a fault type related to data transmission, such as IBV_EVENT_PORT_ERR. This event indicates an error on the link port of the network card: the link of the target network card is connecting and disconnecting intermittently, which, in the absence of external intervention, indicates that the link is unstable. If the number of such occurrences is greater than the preset tolerance, the target network card can be determined to be in a fault state.
Therefore, whether the target network card is in a failure state can be determined based on the judgment of the event type information based solely on the asynchronous event.
During polling and monitoring of the target network card, if the event type of a completion queue event of the target network card is the fatal error type IBV_WC_FATAL_ERR, the target network card can be determined to be in an abnormal state. If the event type information of the completion queue event of the target network card is IBV_EVENT_PORT_ACTIVE, which indicates a data queue sending anomaly, and the number of occurrences of this anomaly is greater than the preset tolerance, the target network card can be determined to be in a fault state.
Therefore, whether the target network card is in a failure state can be determined based on a judgment based solely on event type information of the completion queue event.
As described above, whether the target network card is in a fault state can be determined from the event type information of the asynchronous event alone or from the event type information of the completion queue event alone. The two can also be used together to determine the fault state of the target network card.
For example, while the asynchronous events and the completion queue events of the target network card are being monitored, if the event type of an asynchronous event is IBV_EVENT_DEVICE_FATAL and, at the same time, the event type of a completion queue event is the fatal error type IBV_WC_FATAL_ERR, the target network card can be determined to be in a fault state.
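The three ways of reaching a verdict (asynchronous events alone, completion queue events alone, or both together) can be written as one small function. This Python sketch is a hypothetical illustration: the `basis` parameter and its values are names introduced here, not terms from the invention.

```python
def nic_in_fault_state(async_fault: bool, cq_fault: bool,
                       basis: str = "either") -> bool:
    """Combine the two per-source verdicts into one fault decision.

    basis selects which evidence is required (names are assumptions):
      "async"  - asynchronous events alone
      "cq"     - completion queue events alone
      "both"   - both sources must report a fault at the same time
      "either" - a fault from either source is sufficient
    """
    rules = {
        "async": async_fault,
        "cq": cq_fault,
        "both": async_fault and cq_fault,
        "either": async_fault or cq_fault,
    }
    if basis not in rules:
        raise ValueError(f"unknown basis: {basis}")
    return rules[basis]

# Both sources report a fatal error at once, as in the example above.
joint = nic_in_fault_state(True, True, basis="both")
# Only the completion queue reports a fault.
cq_only = nic_in_fault_state(False, True, basis="cq")
```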
Optionally, once the target network card is determined to be in a fault state, and before it is repaired, a fault log of the target network card can be obtained and sent to the manufacturer of the target network card. When the target network card is found to be in a fault state, the manufacturer is often needed to assist with analysis, and the manufacturer's tool must be invoked to collect the logs of the target network card. An instruction can be sent to the corresponding process through inter-process communication (IPC) to run the manufacturer's data collection (for example, by calling the corresponding script). When log collection finishes, a callback function is executed to inform the RDMA process that collection is complete, after which repair actions can proceed. The fault log can be used to analyze the target network card and, once sent to the manufacturer, to guide targeted improvements to subsequent network cards.
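The collect-then-call-back sequence can be sketched as follows. This Python sketch makes several assumptions: a child process started with `subprocess` stands in for the IPC instruction to the collecting process, the command is a placeholder rather than any manufacturer's actual tool, and `collect_logs` and `on_done` are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List, Tuple

def collect_logs(command: List[str],
                 on_done: Callable[[int, str], None]) -> None:
    """Run a log collection command, then report completion via callback."""
    result = subprocess.run(command, capture_output=True, text=True)
    # The callback tells the caller that collection has finished,
    # so that repair actions may begin.
    on_done(result.returncode, result.stdout.strip())

completed: List[Tuple[int, str]] = []

# Placeholder for a vendor collection script (assumption for this sketch).
fake_vendor_tool = [sys.executable, "-c", "print('nic0 log bundle written')"]
collect_logs(fake_vendor_tool,
             lambda code, out: completed.append((code, out)))
```

Deferring repair until the callback fires keeps the fault evidence intact while the logs are being gathered.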
According to the network card fault determining method provided by the embodiment of the invention, the asynchronous event of the target network card and the completion queue event of the target network card are monitored in a polling manner, and the fault state of the target network card is determined based on the event type information of the asynchronous event and/or the event type information of the completion queue event. The method and the device realize the automatic determination of whether the target network card is in a fault state based on the monitoring of the target network card event. The method provides a basis for carrying out targeted fault processing on the target network card based on the fault state of the target network card, and improves the stability of the link of the target network card.
In one embodiment, determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event includes: judging event type information of the asynchronous event, and determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event; and/or judging the event type information of the completion queue event, and determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event.
After the event type information of the asynchronous event is determined, it is judged in order to identify the first error reporting information it contains. The first error reporting information is the error-related information in the asynchronous event. For example, if the first error reporting information in the event type information of the asynchronous event is of the IBV_EVENT_DEVICE_FATAL type, the event indicates a fatal error on the target network card, and the target network card that raised the asynchronous event can be determined to be in a fault state.
After the event type information of the completion queue event is determined, it is judged in order to identify the second error reporting information it contains, from which the target network card is determined to be in a fault state. The second error reporting information is the error-related information in the completion queue event. For example, if the second error reporting information of the completion queue event is of the fatal error type IBV_WC_FATAL_ERR, the target network card that raised the completion queue event can be determined to be in an abnormal state.
The network card fault determining method provided by the embodiment of the invention realizes the determination that the target network card is in an abnormal state by judging the event type information of the asynchronous event and the error reporting information in the event type information of the completion queue event.
In one embodiment, determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event includes: judging event type information of the asynchronous event, and determining that the target network card is in a fault state when the first error reporting information contains a fatal error type, or when the number of occurrences of the data transmission anomaly type in the first error reporting information is greater than a preset count threshold.
After the event type information of the asynchronous event is determined, the determined event type information is judged, and whether first error reporting information in the event type information meets the requirement of determining that the target network card is in a fault state is determined.
The specific judging process is as follows. First, it is determined whether the first error reporting information is of a fatal error type; if it is, the target network card can be determined to be in a fault state. For example, the first error reporting information may be IBV_EVENT_DEVICE_FATAL (network card fatal error), IBV_EVENT_PORT_ERR/ACTIVE (intermittent network card link error), IBV_EVENT_QP_ERR (send/receive queue error), IBV_EVENT_SRQ_ERR (shared receive queue error), IBV_EVENT_CQ_ERR (completion queue error), and so on.
When the fatal error type IBV_EVENT_DEVICE_FATAL occurs, the target network card has suffered a serious error and can be directly determined to be in a fault state.
Errors of the data transmission anomaly types, such as IBV_EVENT_PORT_ERR/ACTIVE (intermittent network card link error), IBV_EVENT_QP_ERR (send/receive queue error), IBV_EVENT_SRQ_ERR (shared receive queue error), and IBV_EVENT_CQ_ERR (completion queue error), can be counted. When the number of occurrences is greater than the preset count threshold, that is, greater than the set tolerance value, the target network card is determined to be in a fault state.
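The judging process for asynchronous events can be sketched as a small stateful classifier. This Python sketch is illustrative under stated assumptions: event types are plain strings copied from the text, `AsyncFaultJudge` is a hypothetical name, and the counter is kept as one total across all data transmission anomaly types, which is one possible reading of the counting rule.

```python
FATAL_ASYNC = {"IBV_EVENT_DEVICE_FATAL"}
TRANSFER_ASYNC = {
    "IBV_EVENT_PORT_ERR",
    "IBV_EVENT_QP_ERR",
    "IBV_EVENT_SRQ_ERR",
    "IBV_EVENT_CQ_ERR",
}

class AsyncFaultJudge:
    """Decide the fault state from a stream of asynchronous event types."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold    # preset count threshold (tolerance)
        self.anomalies = 0            # data transmission anomalies so far

    def observe(self, event_type: str) -> bool:
        """Return True once the network card should be treated as faulty."""
        if event_type in FATAL_ASYNC:
            return True               # fatal error: fault immediately
        if event_type in TRANSFER_ASYNC:
            self.anomalies += 1
            return self.anomalies > self.threshold
        return False                  # other events do not indicate a fault

judge = AsyncFaultJudge(threshold=2)
verdicts = [judge.observe("IBV_EVENT_PORT_ERR") for _ in range(3)]
fatal_verdict = AsyncFaultJudge(threshold=2).observe("IBV_EVENT_DEVICE_FATAL")
```

With a threshold of 2, the first two port errors are tolerated and the third one tips the verdict, while a fatal event needs no counting at all.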
According to the network card fault determination method provided by the embodiment of the invention, after the event type information of the asynchronous event is determined, it is judged whether the first error reporting information in the event type information meets the requirement for determining that the target network card is in a fault state, thereby determining that the target network card is in a fault state.
In one embodiment, determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event includes: judging event type information of the completion queue event, and determining that the target network card is in a fault state when the second error reporting information contains a fatal error type or an unknown error type, or when the number of occurrences of the data transmission anomaly type in the second error reporting information is greater than a preset count threshold.
After determining event type information of the completed queue event, judging the determined event type information, and determining whether second error reporting information in the event type information meets the requirement of determining that the target network card is in a fault state.
The specific judging process is as follows: it is determined whether the second error-reporting information is of a fatal error type. If the type of the fatal error is determined, the target network card can be determined to be in a fault state. For example, the second error message may be ibv_wc_fault_err: network card fatal error; ibv_wc_general_err: unknown error types; ibv_wc_wr_flush_err: data transmission anomaly type, etc.
When determining the type of the FATAL error of the IBV_WC_FATAL_ERR network card, determining that the target network card has serious errors, and directly determining that the target network card is in a fault state.
And for ibv_wc_wr_flush_err: the data transmission abnormality type can adopt a counting mode, and when the occurrence times of errors are determined to be larger than a preset time threshold value, the error exceeds a set tolerance value, and the target network card is determined to be in a fault state.
According to the network card fault determination method provided by the embodiment of the invention, the determined event type information is judged after the event type information of the queue event is completed, and whether the second error reporting information in the event type information is compounded with the requirement of determining that the target network card is in the fault state is determined, so that the determination of the target network card is in the fault state is realized.
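The completion queue judgment can be illustrated in the same spirit. The sketch below assumes that the statuses of polled work completions are available as a sequence of strings; the function name `judge_completions` and the `Verdict` enum are hypothetical, and the default threshold of 100 follows the example figure given later in the description.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    FATAL = "fatal error"
    UNKNOWN = "unknown error"
    THRESHOLD = "flush errors over threshold"


def judge_completions(statuses, flush_threshold=100):
    """Classify a sequence of work completion statuses (given as strings)."""
    flush_errors = 0
    for status in statuses:
        if status == "IBV_WC_FATAL_ERR":
            # Fatal error type: fault state is determined directly.
            return Verdict.FATAL
        if status == "IBV_WC_GENERAL_ERR":
            # Unknown error type: also treated as a fault directly.
            return Verdict.UNKNOWN
        if status == "IBV_WC_WR_FLUSH_ERR":
            # Data transmission abnormal type: fault only past the threshold.
            flush_errors += 1
            if flush_errors > flush_threshold:
                return Verdict.THRESHOLD
    return Verdict.HEALTHY
```

Returning a verdict rather than a bare boolean keeps the reason for the fault available to whatever raises the alarm.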
In one embodiment, after determining that the target network card is in the fault state, the method further includes: sending a fault message to the operation and maintenance personnel of the target network card, so that the operation and maintenance personnel perform a hot plug operation on the target network card after receiving the fault message.
After the target network card is determined to be in the fault state, it needs to be handled accordingly so that it can recover to a normal running state.
Specifically, a fault message is sent to the operation and maintenance personnel of the target network card, so that they perform a hot plug operation on the target network card after receiving the fault message.
A fault state of the target network card is a serious error that cannot be repaired by ordinary repair methods; the card can only be repaired by reloading the driver through hot plug. After the driver has been reloaded through hot plug, a corresponding automated configuration can be set up to automatically restore the deployed IP address and keep the RDMA state machine in its initial healthy state.
According to the network card fault determination method provided by the embodiment of the invention, after the target network card is determined to be in the fault state, a fault message is sent to the operation and maintenance personnel of the target network card, so that they perform a hot plug operation on the target network card after receiving the fault message, thereby realizing restart and recovery of the target network card.
In one embodiment, after determining that the target network card is in the fault state, the method further includes: determining a redundant link corresponding to the target network card; and switching the services running on the target network card to the redundant link, so that the services run on the redundant link.
A redundant link is generally set up for the services running on the target network card, and the services can be switched to the redundant link when needed to ensure that they keep running normally.
After the target network card is determined to be in the fault state, the redundant link corresponding to the target network card is determined, and the services currently running on the target network card are switched to that redundant link, so that they continue to run there. This makes it convenient to perform a hot plug operation or other fault repair operations on the faulty target network card without affecting the normal running of the services.
According to the network card fault determination method provided by the embodiment of the invention, after the target network card is determined to be in the fault state, the redundant link corresponding to the target network card is determined, and the services currently running on the target network card are switched to that redundant link, so that they continue to run there. This makes it convenient to perform a hot plug operation or other fault repair operations on the faulty target network card without affecting the normal running of the services.
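The switch to a redundant link can be pictured with a small bookkeeping sketch. It assumes that the running services and the redundant-link configuration are held in plain dictionaries keyed by link name; the function name `fail_over` and the link and service names in the usage note are hypothetical, and a real system would also have to re-establish connections on the new link.

```python
def fail_over(services_by_link, faulty_link, redundant_links):
    """Return a new mapping with all services of faulty_link moved to its backup."""
    backup = redundant_links.get(faulty_link)
    if backup is None:
        raise LookupError(f"no redundant link configured for {faulty_link}")
    # Copy the mapping so the caller's view is not modified in place.
    updated = {link: list(svcs) for link, svcs in services_by_link.items()}
    moved = updated.pop(faulty_link, [])
    updated.setdefault(backup, []).extend(moved)
    # The faulty link stays listed, but empty, until it has been repaired.
    updated[faulty_link] = []
    return updated
```

For example, `fail_over({"roce0": ["io-a"], "roce1": []}, "roce0", {"roce0": "roce1"})` moves `io-a` to `roce1` and leaves `roce0` empty, which frees the faulty card for a hot plug operation.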
In one embodiment, before polling and monitoring the asynchronous event of the target network card and the completion queue event of the target network card, the method further includes: setting the asynchronous event file descriptor of the asynchronous event to non-blocking mode so as to acquire the asynchronous event of the target network card.
Before the asynchronous events of the target network card are polled, the asynchronous event file descriptor async_fd of the target network card needs to be initialized and set to non-blocking mode. Only in non-blocking mode can the asynchronous events of the target network card be acquired by polling or by timer triggering.
According to the network card fault determination method provided by the embodiment of the invention, before the asynchronous events of the target network card are polled and monitored, the asynchronous event file descriptor of the target network card is initialized and set to non-blocking mode, thereby realizing the acquisition of the asynchronous events of the target network card.
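The effect of non-blocking mode on polling can be demonstrated without RDMA hardware. In the sketch below, a pipe stands in for async_fd (an assumption made purely so the example runs on any POSIX system), and the helper name `poll_once` is hypothetical. With a real device, the descriptor would come from the verbs context, and a readable descriptor would be followed by a call to ibv_get_async_event rather than a plain read.

```python
import os
import select


def poll_once(fd, timeout_ms=100):
    """Return the bytes readable on fd, or None if nothing arrives in time."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    if not poller.poll(timeout_ms):
        return None  # timed out: no event pending in this polling period
    try:
        return os.read(fd, 4096)
    except BlockingIOError:
        # Non-blocking mode: a spurious wakeup returns here instead of hanging.
        return None


# A pipe stands in for the device's async_fd in this sketch.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)  # put the descriptor in non-blocking mode

first = poll_once(read_fd)       # nothing has been written yet
os.write(write_fd, b"event")
second = poll_once(read_fd)      # the pending "event" is picked up

os.close(read_fd)
os.close(write_fd)
```

Because the descriptor never blocks, the same loop can go on to poll the completion queue in every period even when no asynchronous event is pending.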
The technical solution provided by the present invention is described below with reference to a schematic structural diagram of a device for the network card fault determination method. As shown in fig. 2, the device includes: asynchronous event polling monitoring module 210, CQ event polling monitoring module 220, alarm reporting module 230, vendor log collection module 240, and hot plug repair module 250.
The asynchronous event polling monitoring module 210 is configured to initialize the asynchronous event file descriptor async_fd during the initialization phase and to set async_fd to non-blocking mode, in which the module can acquire asynchronous events (async events) by polling or by timer triggering. Based on a fixed polling period, the asynchronous event polling monitoring module 210 checks async_fd with a poll function in each polling period to determine whether an asynchronous event has occurred. The specific judging process is shown in fig. 3, a judging schematic diagram of the asynchronous event polling monitoring module provided by the invention: if an asynchronous event has occurred, it can be obtained by calling the ibv_get_async_event interface. The asynchronous event most relevant to a card fault of a RoCE network card is IBV_EVENT_DEVICE_FATAL; this event indicates that a fatal error has occurred on the card and that the card needs to be taken offline for processing. Other events can be handled by counting, with the tolerance adjusted to decide whether the card is faulty. For example, if the RoCE network card triggers the asynchronous events IBV_EVENT_PORT_ERR/ACTIVE, IBV_EVENT_QP_ERR, IBV_EVENT_SRQ_ERR or IBV_EVENT_CQ_ERR many times, the network card is working intermittently; if the link is unstable in this way without any external intervention, the RoCE network card is determined to be in a fault state once the preset count threshold is reached, and the fault handling flow is then carried out.
The CQ event polling monitoring module 220 may call the corresponding interface to create a CQ during the RDMA initialization phase and, after the CQ is created, polls CQ events through the ibv_poll_cq interface. Each time an input/output (IO) operation completes, a CQ event is polled to check the completion status of the IO, and this status corresponds to one of several event types. The specific judging process is shown in fig. 4, a judging schematic diagram of the completion queue event polling monitoring module provided by the invention: the module identifies whether the event type of the completion queue event is one of two verified fault types. The first is the IBV_WC_FATAL_ERR type, which indicates that a fatal error has occurred; the second is the IBV_WC_GENERAL_ERR type, which indicates that an unknown error has occurred. For other error events, whether the RoCE network card is in a fault state can be determined by setting a tolerance and counting: if a given error reaches the preset count threshold, it is handled as a card fault. An example is IBV_WC_WR_FLUSH_ERR, which indicates that the QP had a problem while handling IO. If this error occurs many times, for example 100 times, fault handling is carried out and the card is repaired by reloading the driver through hot plug.
The alarm reporting module 230 is configured to report an alarm when the RoCE network card is detected to be in a fault state, so that the user is aware of the fault; the customer then decides how to handle the service. If the service can be suspended, the IO service is suspended; if it cannot be suspended, the service can be switched to a redundant link for temporary processing and redeployed after the card has been repaired.
The vendor log collection module 240 is configured to call the manufacturer's tools to collect the RoCE network card logs when the RoCE network card is detected to be in a fault state, since the manufacturer's assistance is often needed for analysis. An instruction can be sent through an IPC method to the corresponding process, which executes the manufacturer's data collection (for example by calling the corresponding script). When log collection is complete, a corresponding callback function is executed to notify the RDMA process that log collection has finished, after which the next repair action can be performed.
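The collect-then-callback flow of the log collection module can be sketched as below. This is an illustration under assumptions: a child process started with `subprocess` stands in for the IPC mechanism, a one-line Python command stands in for the manufacturer's collection tool, and the names `collect_vendor_logs` and `log_collection_done` are hypothetical.

```python
import subprocess
import sys
import threading


def collect_vendor_logs(command, on_done):
    """Run the collection command in the background, then call on_done."""
    def worker():
        result = subprocess.run(command, capture_output=True, text=True)
        # Callback tells the caller that collection finished and how it went.
        on_done(result.returncode, result.stdout)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


finished = {}


def log_collection_done(returncode, output):
    # In the described flow, the next repair action would be triggered here.
    finished["returncode"] = returncode
    finished["output"] = output


# Placeholder for the manufacturer's tool; the real command is vendor specific.
placeholder_tool = [sys.executable, "-c", "print('vendor log collected')"]
collect_vendor_logs(placeholder_tool, log_collection_done).join()
```

Running the collection in a background thread keeps the monitoring loop responsive, and the callback gives a single place to start the hot plug repair once the logs are safely stored.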
The hot plug repair module 250 is configured to repair the card by reloading the driver through hot plug, because a serious error such as the RoCE network card being in a fault state cannot be repaired by ordinary repair methods. After the driver has been reloaded through hot plug, a corresponding automated configuration can be set up to restore the deployed IP address and keep the RDMA state machine in its initial healthy state.
Fig. 5 is a schematic structural diagram of a network card failure determining device provided by the present invention, as shown in fig. 5, the device includes:
the monitoring module 510 is configured to poll and monitor an asynchronous event of a target network card and a completion queue event of the target network card, and determine event type information of the asynchronous event and event type information of the completion queue event, where the target network card is a network card using a remote direct data access network protocol;
The state determining module 520 is configured to determine a fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
The network card fault determining device provided by the embodiment of the invention monitors the asynchronous event of the target network card and the completion queue event of the target network card in a polling way, and determines the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event. The method and the device realize the automatic determination of whether the target network card is in a fault state based on the monitoring of the target network card event. The method provides a basis for carrying out targeted fault processing on the target network card based on the fault state of the target network card, and improves the stability of the link of the target network card.
In one embodiment, the state determination module 520 is specifically configured to:
based on the event type information of the asynchronous event and/or the event type information of the completion queue event, determining the fault state of the target network card includes:
judging event type information of the asynchronous event, and determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event;
and/or,
and judging the event type information of the completion queue event, and determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event.
In one embodiment, the state determination module 520 is further specifically configured to:
judging the event type information of the asynchronous event, and determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event, wherein the method comprises the following steps:
judging event type information of the asynchronous event, and determining that the target network card is in a fault state when the first error reporting information contains a fatal error type or the number of occurrences of the data transmission abnormal type in the first error reporting information is greater than a preset count threshold.
In one embodiment, the state determination module 520 is further specifically configured to:
judging the event type information of the completion queue event, and determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event, wherein the method comprises the following steps:
judging event type information of the completion queue event, and determining that the target network card is in a fault state when the second error reporting information contains a fatal error type or an unknown error type, or when the number of occurrences of the data transmission abnormal type in the second error reporting information is greater than a preset count threshold.
In one embodiment, the state determination module 520 is further specifically configured to:
after determining that the target network card is in the fault state, the method further comprises:
and sending a fault message to an operation and maintenance person of the target network card, so that the operation and maintenance person executes hot plug operation on the target network card after receiving the fault message.
In one embodiment, the state determination module 520 is further specifically configured to:
after determining that the target network card is in the fault state, the method further comprises:
determining a redundant link corresponding to the target network card;
and switching the services running on the target network card to the redundant link, so that the services run on the redundant link.
In one embodiment, the monitoring module 510 is specifically configured to:
before polling and monitoring the asynchronous event of the target network card and the completion queue event of the target network card, the method further comprises the following step:
and setting the asynchronous event file descriptor of the asynchronous event to a non-blocking mode so as to acquire the asynchronous event of the target network card.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a network card failure determination method comprising:
Polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using a remote direct data access network protocol;
and determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the network card failure determination method provided by the above methods, the method comprising:
polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using a remote direct data access network protocol;
and determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the network card failure determination method provided above, the method comprising:
polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using a remote direct data access network protocol;
And determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A network card failure determination method, the method comprising:
polling and monitoring an asynchronous event of a target network card and a completion queue event of the target network card, and determining event type information of the asynchronous event and event type information of the completion queue event, wherein the target network card is a network card using a remote direct data access network protocol;
and determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
2. The network card failure determination method according to claim 1, wherein determining the failure state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event includes:
Judging event type information of the asynchronous event, and determining that the target network card is in a fault state based on first error reporting information in the event type information of the asynchronous event;
and/or,
and judging the event type information of the completion queue event, and determining that the target network card is in a fault state based on second error reporting information in the event type information of the completion queue event.
3. The network card failure determining method according to claim 2, wherein the determining the event type information of the asynchronous event, and determining that the target network card is in a failure state based on first error reporting information in the event type information of the asynchronous event, includes:
judging event type information of the asynchronous event, and determining that the target network card is in a fault state when the first error reporting information contains a fatal error type or the number of occurrences of the data transmission abnormal type in the first error reporting information is greater than a preset count threshold.
4. The network card failure determination method according to claim 2, wherein the determining the event type information of the completion queue event, and determining that the target network card is in a failure state based on second error reporting information in the event type information of the completion queue event, includes:
judging event type information of the completion queue event, and determining that the target network card is in a fault state when the second error reporting information contains a fatal error type or an unknown error type, or when the number of occurrences of the data transmission abnormal type in the second error reporting information is greater than a preset count threshold.
5. The network card failure determination method according to claim 2, wherein after the determination that the target network card is in the failure state, further comprising:
and sending a fault message to an operation and maintenance person of the target network card, so that the operation and maintenance person executes hot plug operation on the target network card after receiving the fault message.
6. The network card failure determination method according to claim 2, wherein after the determination that the target network card is in the failure state, further comprising:
determining a redundant link corresponding to the target network card;
and switching the services running on the target network card to the redundant link, so that the services run on the redundant link.
7. The network card failure determination method according to claim 1, wherein before polling and monitoring the asynchronous event of the target network card and the completion queue event of the target network card, the method further comprises:
setting the asynchronous event file descriptor of the asynchronous event to non-blocking mode so as to acquire the asynchronous event of the target network card.
8. A network card failure determination apparatus, comprising:
the monitoring module is used for polling and monitoring asynchronous events of a target network card and completion queue events of the target network card, determining event type information of the asynchronous events and event type information of the completion queue events, wherein the target network card is a network card using a remote direct data access network protocol;
and the state determining module is used for determining the fault state of the target network card based on the event type information of the asynchronous event and/or the event type information of the completion queue event.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the network card failure determination method according to any of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the network card failure determination method according to any of claims 1 to 7.
CN202310720959.1A 2023-06-16 2023-06-16 Network card fault determination method and device Pending CN116781569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310720959.1A CN116781569A (en) 2023-06-16 2023-06-16 Network card fault determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310720959.1A CN116781569A (en) 2023-06-16 2023-06-16 Network card fault determination method and device

Publications (1)

Publication Number Publication Date
CN116781569A true CN116781569A (en) 2023-09-19

Family

ID=87987326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310720959.1A Pending CN116781569A (en) 2023-06-16 2023-06-16 Network card fault determination method and device

Country Status (1)

Country Link
CN (1) CN116781569A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination