WO2022078070A1 - Device detection method and apparatus, and communication device - Google Patents

Device detection method and apparatus, and communication device

Info

Publication number
WO2022078070A1
Authority
WO
WIPO (PCT)
Prior art keywords
response
time period
timeouts
heartbeat data
counted
Prior art date
Application number
PCT/CN2021/114169
Other languages
French (fr)
Chinese (zh)
Inventor
帅煜韬
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2022078070A1 publication Critical patent/WO2022078070A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/50Testing arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1095Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Definitions

  • the present application relates to the field of communications, and in particular, to a device detection method, apparatus, and communication device.
  • In a distributed network, heartbeat detection is commonly used to obtain the status of each network device in time and to detect whether any device has failed. In heartbeat detection, a device periodically sends a heartbeat data packet to another device and then judges whether the receiving device is in a normal state from the response data packets that device feeds back. For example, as shown in Figure 1, device 1 periodically sends a heartbeat data packet to device 2 and waits for device 2 to feed back a response data packet. If device 1 does not receive a response data packet from device 2 within a preset time period, device 1 determines that device 2 is faulty and reports an alarm message.
  • However, this approach may misjudge when the network fluctuates. For example, during network fluctuation the network line may be intermittently connected, so that heartbeat data packets periodically sent by the sending device (such as device 1) are lost at the sending device or at a router, a situation referred to as "probabilistic packet loss". As a result, the receiving device 2 does not send a response data packet. Device 1 can then only conclude that device 2 is faulty, although in fact the transmission line between device 1 and device 2 may be faulty while device 2 itself is not. The plain heartbeat detection method therefore cannot accurately detect the actual state of a device, and its accuracy is low.
  • The embodiments of the present application provide a device detection method, a device detection apparatus, and a communication device, which are used to solve the technical problem that the device state cannot be accurately detected when network fluctuation occurs in a distributed network.
  • the application discloses the following technical solutions:
  • The present application provides a device detection method, the method comprising: when a first device detects that a response fed back by a second device times out, acquiring historical heartbeat data synchronized by the first device within a first time period, where the historical heartbeat data includes, for each of the first device, the second device and a third device, the responses of the other two devices to heartbeat data packets as detected by that device;
  • determining, according to the responses detected by each device in the historical heartbeat data, the reason for the timeout of the response of the second device, where the reason is either that the second device has failed or that the transmission link between the second device and the first device has failed.
  • In this method, the historical heartbeat data detected by two peripheral devices, together with the historical heartbeat data obtained by the device itself, is used to examine the device in an abnormal state, and the numbers of timeouts of each device over the past period of time are compared to determine whether the cause is a failure of the device itself or a link failure caused by probabilistic packet loss.
  • Because the obtained historical heartbeat data consists of heartbeat timeouts detected and reported by multiple devices, the decision is made using global information. Compared with detection based on the historical heartbeat data of a single device, the method improves the accuracy of device failure detection in a distributed network and avoids misjudgment caused by probabilistic packet loss when the network fluctuates.
  • In a possible implementation, determining the reason for the timeout of the second device's response according to the responses detected by each device in the historical heartbeat data includes: determining, according to the responses detected by each device, the total numbers N1, N2 and N3 of response timeouts of the first device, the second device and the third device within the first time period; and, when a first condition is satisfied, determining that the cause is a failure of the second device. The first condition is: the total number N2 of response timeouts corresponding to the second device is the largest, and the total number N3 of response timeouts corresponding to the third device is greater than 0.
  • In another possible implementation, when a second condition is satisfied, it is determined that the reason is a failure of the transmission link between the second device and the first device. The second condition is: the total number N1 of response timeouts corresponding to the first device is greater than 0, the total number N2 of response timeouts corresponding to the second device is greater than 0, and the total number N3 of response timeouts corresponding to the third device is equal to 0.
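  • In a possible implementation, the historical heartbeat data includes: the cumulative number a12 of response timeouts of the second device and the cumulative number a13 of response timeouts of the third device, both counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device and the cumulative number a23 of response timeouts of the third device, both counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device and the cumulative number a31 of response timeouts of the first device, both counted by the third device within the first time period.
  • The totals are then:
  • $N_1 = a_{12} + a_{13} + a_{21} + a_{31}$
  • $N_2 = a_{12} + a_{21} + a_{23} + a_{32}$
  • $N_3 = a_{13} + a_{23} + a_{31} + a_{32}$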
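  • The three totals can be computed mechanically from the six counters. The following is a minimal sketch, assuming the counters are held in a dictionary keyed by (observer, target) pairs; the function name `total_timeouts` and the sample values are illustrative and do not come from the application.

```python
def total_timeouts(a):
    """Sum, for each device, every counter in which that device takes part.

    `a` maps (observer, target) -> cumulative timeout count, so a[(1, 2)]
    plays the role of a12 (timeouts of device 2 as counted by device 1).
    """
    totals = {}
    for device in (1, 2, 3):
        totals[device] = sum(
            count
            for (observer, target), count in a.items()
            if device in (observer, target)
        )
    return totals


# Hypothetical sample counters, chosen only for illustration.
example = {
    (1, 2): 5, (1, 3): 0,
    (2, 1): 4, (2, 3): 0,
    (3, 1): 0, (3, 2): 6,
}
n = total_timeouts(example)
print(n)  # {1: 9, 2: 15, 3: 6}
```

  • Each counter is added to exactly two totals, once for its observer and once for its target, which matches the three sums above.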
  • In another possible implementation, the historical heartbeat data includes: the cumulative number a12 of response timeouts of the second device, counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device, counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device, counted by the third device within the first time period.
  • the first condition is: a 12 >0, a 21 >0, and a 32 >0;
  • In another possible implementation, the historical heartbeat data includes: the cumulative number a12 of response timeouts of the second device, counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device, counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device and the cumulative number a31 of response timeouts of the first device, both counted by the third device within the first time period.
  • the first condition is: a 12 >0, a 21 >0, and a 32 +a 23 >0;
  • In a possible implementation, before acquiring the historical heartbeat data reported by the third device and synchronized by the first device within the first time period, the method further includes: selecting the third device from two or more devices, where the third device is the device whose historical heartbeat data is received first when the first device sends a request for historical heartbeat data to each of the two or more devices.
  • In a possible implementation, before the first device detects that the response fed back by the second device times out, the method further includes: periodically sending heartbeat data packets to the second device and the third device in the network; receiving the responses fed back by the second device and the third device according to the heartbeat data packets; and counting, within the first time period, the cumulative number of response timeouts of the second device and the cumulative number of response timeouts of the third device.
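  • The "first to answer" selection rule can be sketched as follows. This is only an illustration: it assumes the replies have already been collected as (device id, arrival time, history) tuples, and both that structure and the name `pick_third_device` are hypothetical.

```python
def pick_third_device(replies):
    """Return the reply whose historical heartbeat data arrived first.

    `replies` is a list of (device_id, arrival_time, history) tuples,
    one per candidate device that answered the request. Returns None
    when no candidate answered.
    """
    if not replies:
        return None
    return min(replies, key=lambda reply: reply[1])


# Hypothetical replies from three candidate devices.
replies = [
    ("device-4", 0.42, {"a41": 0, "a42": 3}),
    ("device-3", 0.17, {"a31": 0, "a32": 6}),
    ("device-5", 0.90, {"a51": 1, "a52": 2}),
]
print(pick_third_device(replies)[0])  # device-3
```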
  • each device in the distributed network periodically obtains the heartbeat timeout of other devices in the past period of time, and synchronizes the historical heartbeat data of these devices, so as to provide accurate detection in the event of a failure.
  • The present application provides a device detection apparatus, the apparatus including: a data synchronization module, configured to acquire, when the first device detects that the response fed back by the second device times out, the historical heartbeat data synchronized by the first device within the first time period, where the historical heartbeat data includes, for each of the first device, the second device and the third device, the responses of the other two devices to heartbeat data packets as detected by that device; and a processing module, configured to determine the reason for the timeout of the response of the second device according to the responses detected by each device in the historical heartbeat data, where the reason is either that the second device is faulty or that the transmission link between the second device and the first device has failed.
  • In a possible implementation, the processing module is specifically configured to determine the total numbers N1, N2 and N3 of response timeouts of the first device, the second device and the third device within the first time period, and, when the first condition is satisfied, to determine that the cause is a failure of the second device. The first condition is: the total number N2 of response timeouts corresponding to the second device is the largest, and the total number N3 of response timeouts corresponding to the third device is greater than 0.
  • In a possible implementation, the processing module is further configured to determine, when a second condition is satisfied, that the reason is a failure of the transmission link between the second device and the first device. The second condition is: the total number N1 of response timeouts corresponding to the first device is greater than 0, the total number N2 of response timeouts corresponding to the second device is greater than 0, and the total number N3 of response timeouts corresponding to the third device is equal to 0.
  • In a possible implementation, the historical heartbeat data includes the following parameters: the cumulative number a12 of response timeouts of the second device and the cumulative number a13 of response timeouts of the third device, both counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device and the cumulative number a23 of response timeouts of the third device, both counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device and the cumulative number a31 of response timeouts of the first device, both counted by the third device within the first time period.
  • In another possible implementation, the historical heartbeat data includes the following parameters: the cumulative number a12 of response timeouts of the second device, counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device, counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device, counted by the third device within the first time period.
  • In another possible implementation, the historical heartbeat data includes the following parameters: the cumulative number a12 of response timeouts of the second device, counted by the first device within the first time period; the cumulative number a21 of response timeouts of the first device, counted by the second device within the first time period; and the cumulative number a32 of response timeouts of the second device and the cumulative number a31 of response timeouts of the first device, both counted by the third device within the first time period.
  • In a possible implementation, the processing module is further configured to select the third device from among two or more devices, where the third device is the device whose historical heartbeat data is received first by the data synchronization module when the first device sends a request for historical heartbeat data to each of the two or more devices.
  • In a possible implementation, the apparatus further includes a heartbeat detection module and a sampling module. The heartbeat detection module is configured to periodically send heartbeat data packets to the second device and the third device in the network before the first device detects that the response fed back by the second device times out.
  • The sampling module is configured to receive the responses fed back by the second device and the third device according to the heartbeat data packets, and to count the cumulative number of response timeouts of the second device and the cumulative number of response timeouts of the third device within the first time period.
  • The present application also provides a chip system. The chip system includes a processor and a memory, where the processor is coupled with the memory, the memory is used for storing computer program instructions, and the processor is used for executing the instructions stored in the memory, so that the chip system executes the methods of the aforementioned first aspect and of its various implementations.
  • the chip system further includes an interface circuit, and the interface circuit is used to realize the communication between the chip system and other external modules.
  • the chip system is a chip circuit.
  • the present application further provides a communication device.
  • The communication device may be the device detection apparatus described in the second aspect, or may include the chip system described in the third aspect, so as to be able to execute the methods of the first aspect and of its various implementations.
  • the communication device may include, but is not limited to, a processor, a memory, a communication interface, a sensor module, a mobile communication module, a wireless communication module, a display screen, a camera, a USB interface, a power management module, and the like.
  • the present application also provides a computer-readable storage medium, in which instructions are stored, so that when the instructions are executed on a computer or a processor, the instructions can be used to execute the foregoing first aspect and each of the first aspects. method in an implementation.
  • the present application also provides a computer program product, the computer program product includes computer instructions, when the instructions are executed by a computer or a processor, the aforementioned first aspect and the methods in various implementation manners of the first aspect can be implemented.
  • The beneficial effects of the technical solutions in the various implementations of the second aspect to the fifth aspect are the same as those of the first aspect and its various implementations. For details, refer to the description of the beneficial effects of the first aspect, which is not repeated here.
  • FIG. 1 is a schematic diagram of a network structure in which the heartbeat detection method is used to detect device failure, provided by this application;
  • FIG. 2 is a schematic diagram of another network structure in which the heartbeat detection method is used to detect device failure, provided by an embodiment of this application;
  • FIG. 3 is a flowchart of a device detection method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a device detection apparatus provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of another device detection method provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a communication device according to an embodiment of the present application.
  • the technical solution of the present application can be applied to a distributed network, and the network can be a centralized network or a decentralized network, such as applied to a smart home environment.
  • the so-called decentralization is a concept relative to centralization. Every device in the decentralized network is equal, and there is no master device and slave device.
  • In a centralized network, the master device generally sends heartbeat packets to the slave devices and waits for the slave devices to feed back response packets, while in a decentralized network, devices send heartbeat packets and response packets to each other.
  • FIG. 2 is a schematic structural diagram of a decentralized network provided in this embodiment.
  • the network includes at least three electronic devices, such as device 1 , device 2 and device 3 .
  • other electronic devices such as switches, routers, and servers, may also be included, and this embodiment does not limit the type and number of electronic devices included in the network.
  • Any one of the devices 1 to 3 may be a terminal device. The terminal device may be a portable device, such as a smart terminal, a mobile phone, a notebook computer, a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a foldable terminal, a vehicle terminal, a wearable device with a wireless communication function (such as a smart watch or bracelet), user equipment (UE), or an augmented reality (AR) or virtual reality (VR) device.
  • the terminal device may also be a smart home device, such as audio, air conditioners, refrigerators, TVs, washing machines, and water heaters deployed in indoor homes.
  • The above-mentioned terminal devices include, but are not limited to, devices running Apple (iOS), Android, Microsoft, or other operating systems.
  • any one of the devices 1 to 3 may also be a network device, such as a switch, a gateway, a server, etc., which is not limited in this embodiment.
  • If the above-mentioned device is a terminal device, communication between devices can be carried over a wireless network, such as WiFi; if the above-mentioned device is a network device, communication between devices can be carried over the Internet, for example by optical fiber. This application does not limit the specific transmission medium between devices.
  • This embodiment provides a device detection method, which can be applied to the above-mentioned distributed network, and can detect the abnormal response response caused by probabilistic packet loss when the network fluctuates.
  • a possible situation of the probabilistic packet loss is that due to the instability of the line when the network fluctuates, the phenomenon of the network link being on and off occurs, resulting in the loss of heartbeat packets during the link transmission process.
  • the receiving end device may be in a normal state, can receive heartbeat data packets, and feed back a response message.
  • the method includes the following steps:
  • the historical heartbeat data includes responses to the heartbeat data packets from the other two devices detected by each of the first device, the second device, and the third device. Specifically, the historical heartbeat data includes at least the following three parts:
  • the historical heartbeat data of the first device including: the second device detected by the first device in the first time period, and/or the response of the third device to the heartbeat data packet sent by itself (the first device) ;
  • the historical heartbeat data of the second device including: the first device detected by the second device in the first time period, and/or the response of the third device to the heartbeat data packet sent by itself (the second device) ;
  • the historical heartbeat data of the third device including: the first device detected by the third device in the first time period, and/or the response of the second device to the heartbeat data packet sent by itself (the third device) .
  • The historical heartbeat data of the first device is obtained from the first device's own statistics, while the historical heartbeat data of the second device and of the third device is actively reported to the first device by the respective device and obtained by the first device on receipt.
  • One way of acquiring the historical heartbeat data of the first device is as follows: the first device periodically sends heartbeat data packets to the second device and the third device in the network.
  • The sending cycle and sending range can be customized. For example, if the sending cycle is set to 1s (second), the first device sends a heartbeat data packet or heartbeat message to each other device every 1s.
  • When the second device and the third device receive the heartbeat data packet sent by the first device, they send a response to the first device, such as a response data packet or a response message. The first device starts timing after sending a heartbeat data packet and determines whether a response from the second device and from the third device is received within a preset time.
  • the preset time can be customized.
  • If the first device receives a response sent by the second device or the third device (the receiving end) within the preset time, the response of the receiving end has not timed out; if the response is received after the preset time has elapsed, or no response is received at all, the response of the receiving end has timed out.
  • the response feedback timeout may also be referred to as abnormal heartbeat.
  • For example, device 1 sends heartbeat message 1 to device 2 and device 3 at time t1, then receives response message 1 fed back by device 2 at time t2 and response message 2 fed back by device 3 at time t3. If the interval between t1 and t2 is within the preset time interval, device 1 records that the response to the heartbeat message sent to device 2 at time t1 has not timed out; if the interval between t1 and t3 exceeds the preset time interval, or response message 2 from device 3 is not received within the preset time interval, device 1 records that the response to the heartbeat message sent to device 3 at time t1 has timed out.
  • For a response that has not timed out, the first device marks it as "0", and for a timed-out response, the first device marks it as "1".
  • For the above device 2, device 1 marks the response to heartbeat message 1 sent at time t1 as "0", and for the above device 3, device 1 marks the response to heartbeat message 1 sent at time t1 as "1".
  • the device 1 may also use other methods to mark the situation that the response response it receives has timed out or not, and the marking method adopted by the device 1 is not limited in this embodiment.
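  • The "0"/"1" marking rule can be written as a small function. This is a minimal sketch rather than the application's implementation: the name `mark_response`, the use of `None` for a response that never arrives, and treating a response that arrives exactly at the preset time as on time are all assumptions.

```python
def mark_response(sent_at, received_at, preset):
    """Return 0 if the response arrived within `preset` seconds, else 1.

    `received_at` is None when no response was received at all, which
    is treated as a timeout. An arrival exactly at the preset limit is
    counted as on time (an assumption; the text does not say).
    """
    if received_at is None:
        return 1
    return 0 if (received_at - sent_at) <= preset else 1


t1 = 0.0
print(mark_response(t1, 0.4, preset=1.0))   # device 2 answers at t2 = 0.4 -> 0
print(mark_response(t1, 1.7, preset=1.0))   # device 3 answers at t3 = 1.7 -> 1
print(mark_response(t1, None, preset=1.0))  # no answer at all -> 1
```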
  • a two-dimensional array of all "1"s is recorded.
  • the second device also periodically sends heartbeat data packets to the first device and the third device, respectively, and records the response timeouts of the first device and the third device to form historical heartbeat data of the second device.
  • the third device also periodically sends heartbeat data packets to the second device and the first device, respectively, and records the response timeouts of the second device and the first device to form historical heartbeat data of the third device.
  • the historical heartbeat data can be refreshed regularly, such as storing historical heartbeat data at 1min (minute) intervals on each device side, or refreshing the local storage record every 1min, and updating the historical heartbeat data to the latest 1-2min Historical heartbeat data.
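  • One way to keep only recent marks is a sliding time window. Note that this is a swapped-in simplification: the text describes refreshing the stored record at 1min intervals, whereas the sketch below evicts old samples on every new record. The class name `TimeoutWindow` and its methods are hypothetical.

```python
from collections import deque


class TimeoutWindow:
    """Keep the 0/1 marks of the last `window_seconds` for one peer."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, mark) pairs, oldest first

    def record(self, timestamp, mark):
        """Store a new mark and drop samples that fell out of the window."""
        self.samples.append((timestamp, mark))
        while self.samples and self.samples[0][0] <= timestamp - self.window:
            self.samples.popleft()

    def timeouts(self):
        """Cumulative number of timeouts (marks equal to 1) in the window."""
        return sum(mark for _, mark in self.samples)


window = TimeoutWindow(window_seconds=60)
for second in range(120):
    # Hypothetical trace: the peer times out for 30s, then recovers.
    window.record(second, 1 if second < 30 else 0)
print(window.timeouts())  # 0, because the early timeouts have aged out
```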
  • This embodiment takes the first device as an example.
  • Before the first device detects that the response fed back by the second device has timed out, the first device and the second device communicate normally, and the devices at both ends exchange heartbeat data packets and responses with each other. When the first device sends a heartbeat data packet to the second device at a certain time but does not receive the response packet fed back by the second device within a preset time (for example, 1s), the first device confirms that the current response fed back by the second device has timed out and starts the method of step 101.
  • the first condition and the second condition can be used to determine whether the cause of the timeout is a failure of the device itself or a failure of a transmission link between devices.
  • the first condition is: the total number N2 of response timeouts corresponding to the second device is the largest, and the total number N3 of response timeouts corresponding to the third device is greater than 0. Expressed as: N2>N1, N2>N3, and N3>0.
  • the second condition is: the total number N1 of response timeouts corresponding to the first device is greater than 0, the total number N2 of response timeouts corresponding to the second device is greater than 0, and the total number N3 of response timeouts corresponding to the third device equal to 0.
  • Expressed as: N1>0, N2>0, and N3=0.
  • N1 represents the total number of response timeouts of the first device within the first time period, N2 represents the total number of response timeouts of the second device within the first time period, and N3 represents the total number of response timeouts of the third device within the first time period.
  • the first time period can be set freely, for example, set to 30s, 60s, 90s, 120s, etc., which is not limited in this embodiment.
  • the first device determines the cause of the failure of the second device according to the historical heartbeat data.
  • for example, each device sends heartbeat data packets at a frequency of one per second (a period of 1 second), the first time period is 60s, and the preset time is 1s; that is, after device 1 sends a heartbeat data packet, the response of the receiving end does not time out if it arrives within 1s.
  • within a detection period of 60s, assume that device 1 continuously receives 60 response packets sent by device 2, each within the preset time.
  • device 1 then counts the responses of device 2 to the heartbeat data packets sent by device 1 in the past 60s.
  • the number of response timeouts is 0, that is, the number of abnormal heartbeats is 0.
  • if device 1 receives only 4 response packets from device 3 within the preset time of 1s during the 60s, it marks 4 "0"s; the responses to the remaining 56 heartbeat packets have timed out, so 56 "1"s are marked. Device 1 then counts the responses of device 3 in the past 60s: the number of response timeouts is 56, that is, the number of abnormal heartbeats is 56.
  • the number of abnormal heartbeats can be represented by the letter "a": a12 represents the number of abnormal heartbeats (or response situation) of device 2 counted by device 1 in the first time period, and a13 represents the number of abnormal heartbeats of device 3 counted by device 1 in the first time period.
  • the historical heartbeat data of device 2 and device 3 counted by device 1 in the first time period is {a12, a13}.
  • the historical heartbeat data of the second device may be represented as {a21, a23}
  • the historical heartbeat data of the third device may be represented as {a31, a32}.
  • a21 represents the number of abnormal heartbeats of device 1 counted by device 2 in the first time period
  • a23 represents the number of abnormal heartbeats of device 3 counted by device 2 in the first time period
  • a31 represents the number of abnormal heartbeats of device 1 counted by device 3 in the first time period
  • a32 represents the number of abnormal heartbeats of device 2 counted by device 3 in the first time period.
  • the device 1 stores these historical heartbeat data in the local storage medium of the device 1 .
  • the above method further includes: the first device sends the historical heartbeat data collected by the first device in the first time period to the second device and the third device, respectively.
  • the second device sends its historical heartbeat data collected in the first time period to the first device and the third device respectively.
  • the third device sends its historical heartbeat data collected in the first time period to the first device and the second device respectively.
  • the first device, the second device, and the third device all obtain the historical heartbeat data counted by the other two devices, respectively.
  • the data synchronization module of the first device obtains the historical heartbeat data of the first device, and simultaneously obtains the historical heartbeat data from other devices in the network.
  • the heartbeat data of a12 is set as the default data.
  • the above-mentioned reason for analyzing the feedback timeout of the second device according to the historical heartbeat data of the first device, the second device and the third device in the first time period obtained by the first device is specifically:
  • the total number N1 of response timeouts is the cumulative abnormal heartbeat number N1
  • the cumulative abnormal heartbeat number N1 is the sum of the abnormal heartbeat times detected by all devices.
  • the second device (i.e., device 2) has failed.
  • the device with the most accumulated heartbeat timeouts (or abnormal times) is determined as the failed device.
  • if N1>N2>N3, and N3>0, or, N1>N2, N1>N3, and N3>0, it is determined that the first device (i.e., device 1) is faulty. If N3>N2>N1, and N3>0, or, N3>N2, N3>N1, and N3>0, it is determined that the third device (i.e., device 3) is faulty.
  • the method provided in this embodiment uses the historical heartbeat data detected by two peripheral devices and the historical heartbeat data obtained by the device itself to detect a device in an abnormal state. By comparing the number of timeouts of each device over the past period of time, it determines whether the cause of the failure is a fault of the device itself or a link failure caused by probabilistic packet loss.
  • because the obtained historical heartbeat data consists of the heartbeat timeouts detected and reported by multiple devices, global information is used to make decisions.
  • the method therefore improves the accuracy of device fault detection in a distributed network and avoids misjudgment caused by probabilistic packet loss during network fluctuations.
  • each device in the distributed network periodically obtains the heartbeat timeout of other devices in the past period of time, and synchronizes the historical heartbeat data of these devices, thereby preparing for accurate detection in the event of a failure.
  • This embodiment takes Device 1, Device 2, and Device 3 as examples.
  • the data synchronization module of Device 1 synchronizes its historical heartbeat data from Device 2 and Device 3 and processes it. Assume that the historical heartbeat data obtained by device 1 includes:
  • the second condition is to determine that the link between device 1 and device 2 is faulty.
  • the historical heartbeat data of device 3 is used to determine whether the failure belongs to device 2 itself or the transmission link between device 1 and device 2, thereby improving the performance of distributed networks in the network.
  • This embodiment is similar to the aforementioned second possible embodiment; the difference is that the historical heartbeat data acquired by device 1 includes a23 in addition to a12, a21 and a32 of the second possible embodiment.
  • a31 represents the cumulative number of response timeouts of device 1 counted by device 3 in the first time period.
  • N1 = a12
  • N2 = a21
  • N3 = a32 + a23, then the above steps 102.
  • Determine whether the cause of the failure is the failure of the device 2 or the transmission link between the device 1 and the device 2 through the historical heartbeat data, and the determination method is as follows:
  • in step 101 above, if the first device detects that the second device is abnormal, and the network contains two or more terminal devices in addition to the first device and the second device, one of these terminal devices must first be selected as the third device.
  • a specific selection method is that the first device sends a request for obtaining historical heartbeat data to each of the two or more devices, and each device that receives the request sends its own record to the first device.
  • the device whose historical heartbeat data is received first is determined as the third device.
  • the device whose historical heartbeat data is received first has the fastest response speed, or is the closest to the first device, so selecting this device as the third device gives higher processing efficiency.
  • steps 101 and 102 in the foregoing embodiment are performed to analyze and process the reason for the timeout of the response of the second device.
  • for the specific process, refer to the foregoing description of the first, second or third implementation manner, which is not repeated in this embodiment.
  • when there are multiple devices, the first device sends a request for obtaining historical heartbeat data to each of these devices at the same time, and selects the device whose historical heartbeat data is received first as the third device, which can improve the detection efficiency.
  • selection criteria may also be used to determine the third device, such as a device closest to the first device as the third device, and this embodiment does not limit the above judgment criteria for selecting and determining the third device.
  • the method provided in this embodiment is applied to a distributed network.
  • the historical heartbeat data of other devices in the network is acquired and synchronized, and the historical heartbeat data is merged and processed, and the processed data is used.
  • the historical heartbeat data is converted into the cumulative number of timeouts of a certain device in the past period of time, and by comparing the cumulative number of timeouts of each device in the past period of time, the faulty device is determined, that is, the device with the most cumulative timeouts.
  • the first device is used as an example to detect the abnormal condition of the second device.
  • the second device and the third device can also use the same method to detect the first device.
  • FIG. 4 is a schematic structural diagram of a device detection apparatus provided by an embodiment of the present application.
  • the apparatus may be a communication device, or a component located in the communication device, such as a chip or a system of chips.
  • the device can implement the device detection method in the foregoing embodiment.
  • the apparatus may include: a data synchronization module 401 , a processing module 402 , a heartbeat detection module 403 and a sampling module 404 .
  • the apparatus may also include other units or modules such as a storage unit.
  • each module has at least the following functions, as shown in Figure 5,
  • the heartbeat detection module 403 is configured to periodically send a heartbeat data packet to the second device and the third device in the network before the first device detects that the response response fed back by the second device times out.
  • the sampling module 404 is configured to respectively receive the response responses fed back by the second device and the third device according to the heartbeat data packet, and count the cumulative timeout of the response responses fed back by the second device within the first time period number of times, and, the cumulative number of timeouts that the third device feeds back the response response.
  • the sampling module 404 sends the statistics of the response responses of each device in the first time period to the data synchronization module 401 as historical heartbeat data.
  • the data synchronization module 401 acquires the historical heartbeat data synchronized by the first device within the first time period when the first device detects that the response response fed back by the second device times out (that is, when the device 2 is in an abnormal state message),
  • the historical heartbeat data includes response responses to the heartbeat data packet by the other two devices detected by each of the first device, the second device, and the third device.
  • the processing module 402 is configured to determine the reason for the timeout of the response response of the second device according to the response response situation detected by each device in the historical heartbeat data, and the reasons include: failure of the second device, Or the transmission link between the second device and the first device fails.
  • the processing module 402 is specifically configured to determine, according to the response situation detected by each device, the total numbers N1, N2 and N3 of response timeouts of the first device, the second device and the third device within the first time period, respectively, and, when the first condition is satisfied, to determine that the cause is the failure of the second device.
  • the first condition is: the total number N2 of response timeouts corresponding to the second device is the largest, and the total number N3 of response timeouts corresponding to the third device is greater than 0.
  • the processing module 402 is further configured to, when the second condition is satisfied, determine that the cause is that the transmission link between the second device and the first device has failed.
  • the second condition is: the total number N1 of response timeouts corresponding to the first device is greater than 0, the total number N2 of response timeouts corresponding to the second device is greater than 0, and the third device corresponds to The total number of ack response timeouts N3 is equal to 0.
  • the historical heartbeat data includes: the cumulative number of timeouts a12 of the responses fed back by the second device, as counted by the first device in the first time period, and the cumulative number of timeouts a13 of the responses fed back by the third device, as counted by the first device in the first time period; the cumulative number of timeouts a21 of the responses fed back by the first device, as counted by the second device in the first time period, and the cumulative number of timeouts a23 of the responses fed back by the third device, as counted by the second device in the first time period; the cumulative number of timeouts a32 of the responses fed back by the second device, as counted by the third device in the first time period, and the cumulative number of timeouts a31 of the responses fed back by the first device, as counted by the third device in the first time period.
  • the first condition is: N2>N1, N2>N3, and N3>0; wherein, N1 = a12 + a13 + a21 + a31, N
  • the historical heartbeat data includes: the cumulative number of timeouts a12 of the responses fed back by the second device, as counted by the first device within the first time period; the cumulative number of timeouts a21 of the responses fed back by the first device, as counted by the second device within the first time period;
  • the first condition is: a12>0, a21>0, and a32>0;
  • the processing module 402 is further configured to select the third device from among two or more devices, where the third device is the third device in the third device.
  • this embodiment also provides a communication device, and the communication device may be a terminal device or a network device, or a component integrated on the above-mentioned terminal device or network device.
  • FIG. 6 shows a schematic structural diagram of a communication device, and the communication device may include: a processor 110, a memory 120, and at least one communication interface 130.
  • the processor 110, the memory 120 and the at least one communication interface 130 may be coupled through a communication bus.
  • the processor 110 is the control center of the communication device, and can be used for communication between devices, for example, including information transmission with the second device, the third device, and other devices.
  • the processor 110 may be composed of an integrated circuit (Integrated Circuit, IC), for example, may be composed of a single packaged IC, or may be composed of a plurality of packaged ICs connected with the same function or different functions.
  • the processor 110 may include a central processing unit (Central Processing Unit, CPU) or a digital signal processor (Digital Signal Processor, DSP) or the like.
  • the processor 110 may further include a hardware chip, and the hardware chip may be an application specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the hardware chip is a processing chip or a chip circuit.
  • the memory 120 is used for storing and exchanging various types of data or software, including storing historical heartbeat data, heartbeat data packets, response packets or response messages, and the like.
  • computer programs and codes may be stored in the memory 120 .
  • the memory 120 may include volatile memory, such as random access memory (Random Access Memory, RAM); may also include non-volatile memory, such as flash memory, a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD); the memory 120 may also include a combination of the above-mentioned types of memory.
  • the communication interface 130 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), Virtual Extensible Local Area Network (VXLAN), etc.
  • the above communication device may also include other more or less components, and the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the communication device. And the components shown in FIG. 6 may be implemented in hardware, software, firmware or any combination thereof.
  • the heartbeat detection module 403 and the sampling module 404 in the aforementioned apparatus shown in FIG. 4 can be implemented through a communication interface, the functions of the data synchronization module 401 and the processing module 402 can be implemented by the processor 110, and the functions of the storage unit may be implemented by the memory 120.
  • the communication device uses the communication interface to receive response responses sent by at least two other devices, and when the processor 110 detects that the response response fed back by the second device times out, obtains its own historical heartbeat data synchronized within the first time period , and then determine the reason for the timeout of the response response of the second device according to the response response situation detected by each device in the historical heartbeat data.
  • the program code in the memory 120 is called to execute the method shown in FIG. 3 or FIG. 5 in the foregoing embodiment.
  • the communication device also includes a mobile communication module, a wireless communication module, and the like.
  • the mobile communication module includes modules with wireless communication functions such as 2G/3G/4G/5G.
  • filters, switches, power amplifiers, low noise amplifiers (LNAs), etc. may also be included.
  • the wireless communication module can provide wireless communication solutions including WLAN, bluetooth (bluetooth), global navigation satellite system (GNSS), frequency modulation (frequency modulation, FM), etc. applied to communication equipment.
  • an embodiment of the present application also provides a network system, and the network system structure may be a distributed network architecture as shown in the foregoing FIG. 2 , including at least three communication devices, such as device 1 to device 3 .
  • the structure of each device may be a communication device as shown in FIG. 6 , which is used to implement the device detection method in the foregoing embodiment.
  • the historical heartbeat data detected by two peripheral devices and the historical heartbeat data obtained by the device itself are used to detect the device in an abnormal state. By comparing the number of timeouts of each device over the past period of time, it is determined whether the cause of the failure is a fault of the device itself or a link failure caused by probabilistic packet loss. Because the obtained historical heartbeat data consists of the heartbeat timeouts detected and reported by multiple devices, global information is used to make decisions. Compared with detection by a single device, the method improves the accuracy of device fault detection in the distributed network and avoids misjudgment caused by probabilistic packet loss during network fluctuations.
  • Embodiments of the present application also provide a computer program product, where the computer program product includes one or more computer program instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer program instructions may be stored in a computer readable storage medium, or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from a communication device, computer, server or data center to another communication device by wire or wirelessly.
  • the computer program product and the computer program instructions may be located in the memory of the aforementioned communication device, so as to implement the device detection method described in the embodiments of the present application.
  • the at least one refers to one or more than one
  • the at least three refers to three or more.
  • words such as "first", "second" and "third" are used to distinguish identical or similar items whose functions and effects are basically the same. Those skilled in the art can understand that words such as "first", "second" and "third" do not limit quantity or execution order, and that items labeled in this way are not necessarily different.
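
The counting example given in this embodiment (one heartbeat per second, a 60s first time period, a 1s preset time, "0" for an in-time response and "1" for a timeout) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: the class name `HeartbeatWindow`, its methods, and the convention of passing `None` for a missing response are all hypothetical.

```python
from collections import deque
from typing import Optional


class HeartbeatWindow:
    """Rolling record of one peer's responses over the first time period (hypothetical helper)."""

    def __init__(self, window_size: int = 60, preset_time_s: float = 1.0) -> None:
        self.preset_time_s = preset_time_s
        # Only the newest `window_size` marks are kept; older marks fall out.
        self.marks = deque(maxlen=window_size)

    def record(self, response_delay_s: Optional[float]) -> None:
        """Mark 0 for an in-time response, 1 for a late or missing one (None)."""
        timed_out = response_delay_s is None or response_delay_s > self.preset_time_s
        self.marks.append(1 if timed_out else 0)

    def abnormal_count(self) -> int:
        """Number of response timeouts in the window, e.g. a13 for device 3."""
        return sum(self.marks)


# Device 1 observing device 3: 4 in-time responses, then 56 timeouts.
window_13 = HeartbeatWindow()
for _ in range(4):
    window_13.record(0.2)
for _ in range(56):
    window_13.record(None)
print(window_13.abnormal_count())  # 56
```

A bounded `deque` is used here so that the count always reflects only the most recent window, which matches the idea of refreshing the historical heartbeat data regularly.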

Abstract

Disclosed are a device detection method and apparatus, and a communication device. The method comprises: when a first device detects that a reply response which is fed back by a second device times out, acquiring historical heartbeat data synchronized by the first device within a first time period, the historical heartbeat data comprising the conditions of reply responses which are detected by each of the first device, the second device and a third device and are made by the other two devices to heartbeat data packets; and according to the conditions of the reply responses, which are detected by each device, in the historical heartbeat data, determining the cause of the timeout of the reply response from the second device, the cause comprising the second device being faulty or a transmission link between the second device and the first device being faulty. In the method, a device in an abnormal state is detected by using historical heartbeat data which is detected by peripheral devices and historical heartbeat data acquired by a device itself, thereby improving the accuracy of device fault detection in a distributed network and avoiding incorrect determination caused by network fluctuations.

Description

Device detection method and apparatus, and communication device
This application claims priority to Chinese Patent Application No. 202011093314.2, filed with the China Patent Office on October 13, 2020 and entitled "Device detection method and apparatus, and communication device", which is incorporated herein by reference in its entirety.
Technical field
The present application relates to the field of communications, and in particular, to a device detection method and apparatus, and a communication device.
Background
In a distributed network, heartbeat detection is commonly used to learn the status of each network device in time and to detect whether any device has failed. Specifically, one device periodically sends heartbeat data packets to another device, and then determines, based on the response data packets fed back by the other device, whether the receiving device is in a normal state. For example, as shown in FIG. 1, device 1 periodically sends a heartbeat data packet to device 2 and waits for device 2 to feed back a response data packet. If device 1 does not receive a response data packet from device 2 within a preset time period, device 1 determines that device 2 has failed, and alarm information needs to be reported.
In practice, technicians have found that when heartbeat detection is used to check device status, network fluctuations may cause the system to misjudge. For example, during network fluctuations the network line may connect and disconnect intermittently, so that heartbeat data packets periodically sent by a sending-end device or router, such as device 1, are lost (known as "probabilistic packet loss"). As a result, device 2 at the receiving end does not send a response data packet, and device 1 can only conclude that device 2 has failed. In fact, the transmission line between device 1 and device 2 may have failed while device 2 itself has not. Heartbeat detection therefore cannot accurately determine the actual state of a device, and its accuracy is low.
Summary of the invention
Embodiments of the present application provide a device detection method and apparatus, and a communication device, to solve the technical problem that device status cannot be accurately detected when network fluctuations occur in a distributed network. To solve this technical problem, the present application discloses the following technical solutions:
According to a first aspect, the present application provides a device detection method. The method includes: when a first device detects that a response fed back by a second device has timed out, acquiring historical heartbeat data synchronized by the first device within a first time period, where the historical heartbeat data includes, for each of the first device, the second device and a third device, the responses of the other two devices to heartbeat data packets as detected by that device; and determining, according to the responses detected by each device in the historical heartbeat data, the cause of the response timeout of the second device, where the cause includes a failure of the second device or a failure of the transmission link between the second device and the first device.
This method detects a device in an abnormal state by using the historical heartbeat data detected by two peripheral devices together with the historical heartbeat data obtained by the device itself. By comparing the number of timeouts of each device over the past period of time, it determines whether the cause of the failure is a fault of the device itself or a link failure caused by probabilistic packet loss. Because the historical heartbeat data consists of heartbeat timeouts that multiple devices detect for one another and report, decisions are made with global information. Compared with detection based on the historical heartbeat data of a single device, the method improves the accuracy of device fault detection in a distributed network and avoids misjudgment caused by probabilistic packet loss during network fluctuations.
With reference to the first aspect, in a possible implementation of the first aspect, determining the cause of the response timeout of the second device according to the responses detected by each device in the historical heartbeat data includes: determining, according to the responses detected by each device, the total numbers N1, N2 and N3 of response timeouts of the first device, the second device and the third device within the first time period, respectively; and when a first condition is satisfied, determining that the cause is a failure of the second device, where the first condition is: the total number N2 of response timeouts corresponding to the second device is the largest, and the total number N3 of response timeouts corresponding to the third device is greater than 0.
In this implementation, the first condition makes it possible to accurately detect whether the fault lies in the device itself, which improves the accuracy of fault detection in a distributed network under network fluctuations.
With reference to the first aspect, in another possible implementation of the first aspect, when a second condition is satisfied, it is determined that the cause is a failure of the transmission link between the second device and the first device, where the second condition is: the total number N1 of response timeouts corresponding to the first device is greater than 0, the total number N2 of response timeouts corresponding to the second device is greater than 0, and the total number N3 of response timeouts corresponding to the third device is equal to 0.
In this implementation, the second condition makes it possible to accurately detect whether the fault lies in the transmission link between devices, which improves the accuracy of fault detection in a distributed network under network fluctuations.
With reference to the first aspect, in yet another possible implementation of the first aspect, the historical heartbeat data includes:
the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period, and the cumulative number of timeouts $a_{13}$ of the third device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period, and the cumulative number of timeouts $a_{23}$ of the third device's responses, as counted by the second device within the first time period; the cumulative number of timeouts $a_{32}$ of the second device's responses, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of the first device's responses, as counted by the third device within the first time period. With the historical heartbeat data listed above, the first condition is: N2 > N1, N2 > N3, and N3 > 0, where $N1 = a_{12} + a_{13} + a_{21} + a_{31}$, $N2 = a_{12} + a_{21} + a_{23} + a_{32}$, and $N3 = a_{13} + a_{23} + a_{31} + a_{32}$.
With reference to the first aspect, in yet another possible implementation of the first aspect, with the historical heartbeat data listed above, the second condition is: N1 = N2 > 0 and N3 = 0, where $N1 = a_{12} + a_{13} + a_{21} + a_{31}$, $N2 = a_{12} + a_{21} + a_{23} + a_{32}$, and $N3 = a_{13} + a_{23} + a_{31} + a_{32}$.
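The two conditions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `TimeoutCounts`, `totals` and `classify`, the string labels, and the "undetermined" fallback for inputs that satisfy neither condition are all hypothetical and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutCounts:
    """a_ij: timeouts of device j's responses, as counted by device i
    within the first time period (hypothetical container)."""
    a12: int = 0
    a13: int = 0
    a21: int = 0
    a23: int = 0
    a31: int = 0
    a32: int = 0


def totals(c: TimeoutCounts) -> tuple:
    """Return (N1, N2, N3) using the sums given in the text."""
    n1 = c.a12 + c.a13 + c.a21 + c.a31
    n2 = c.a12 + c.a21 + c.a23 + c.a32
    n3 = c.a13 + c.a23 + c.a31 + c.a32
    return n1, n2, n3


def classify(c: TimeoutCounts) -> str:
    n1, n2, n3 = totals(c)
    # First condition: N2 > N1, N2 > N3 and N3 > 0 -> the second device is faulty.
    if n2 > n1 and n2 > n3 and n3 > 0:
        return "device_fault"
    # Second condition: N1 = N2 > 0 and N3 = 0 -> the link between devices 1 and 2 is faulty.
    if n1 == n2 > 0 and n3 == 0:
        return "link_fault"
    # Assumption: the text does not say what happens otherwise.
    return "undetermined"
```

When only $a_{12}$ and $a_{21}$ are non-zero, N3 is 0 and N1 equals N2, so the sketch reports a link fault; once the third device also records timeouts for the second device, N2 grows past N1 and N3 becomes positive, so it reports a device fault.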
With reference to the first aspect, in yet another possible implementation of the first aspect, the historical heartbeat data includes: the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period; and the cumulative number of timeouts $a_{32}$ of the second device's responses, as counted by the third device within the first time period. With this historical heartbeat data, the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} > 0$; the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} = 0$.
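With only three counters, the decision reduces to whether the third device also saw the second device time out. The sketch below illustrates this; the function name `classify_reduced`, the string labels and the "undetermined" fallback are assumptions introduced here, and the counts are assumed to be non-negative integers.

```python
def classify_reduced(a12: int, a21: int, a32: int) -> str:
    """Apply the three-counter form of the first and second conditions."""
    if a12 > 0 and a21 > 0:
        if a32 > 0:
            # First condition: the third device also observed timeouts from device 2.
            return "device_fault"
        if a32 == 0:
            # Second condition: only the 1<->2 direction shows timeouts.
            return "link_fault"
    # Assumption: neither condition holds, so no conclusion is drawn.
    return "undetermined"
```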
With reference to the first aspect, in yet another possible implementation of the first aspect, the historical heartbeat data includes: the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period; and the cumulative number of timeouts $a_{32}$ of the second device's responses and the cumulative number of timeouts $a_{31}$ of the first device's responses, both as counted by the third device within the first time period. With this historical heartbeat data, the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} > 0$; the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} = 0$.
With reference to the first aspect, in yet another possible implementation of the first aspect, before obtaining the historical heartbeat data reported by the third device and synchronized by the first device within the first time period, the method further includes: selecting the third device from two or more devices, where, when the first device sends a request for historical heartbeat data to each of the two or more devices, the third device is the device from which the first piece of historical heartbeat data is received.
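The "first responder wins" selection can be sketched with Python's standard `concurrent.futures` module. This is only an illustration: `select_third_device` and the caller-supplied `fetch_history` callable are hypothetical names, and the application does not specify how the requests are transported or what happens if no device answers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def select_third_device(candidates, fetch_history):
    """Request history from every candidate in parallel and return
    (device_id, history) for the first one whose data arrives.

    fetch_history(device_id) is a caller-supplied (hypothetical) function
    that blocks until that device's historical heartbeat data is received.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
        futures = {pool.submit(fetch_history, dev): dev for dev in candidates}
        # as_completed yields futures in the order in which they finish.
        for fut in as_completed(futures):
            if fut.exception() is None:
                return futures[fut], fut.result()
    # Assumption: raising is one possible way to signal that nobody answered.
    raise RuntimeError("no candidate returned historical heartbeat data")
```

Note that leaving the `with` block waits for the slower requests to finish; a production version would probably cancel or detach them instead.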
With reference to the first aspect, in yet another possible implementation of the first aspect, before the first device detects that the response fed back by the second device has timed out, the method further includes: periodically sending heartbeat data packets to the second device and the third device in the network; receiving the responses fed back by the second device and the third device, respectively, according to the heartbeat data packets; and counting, within the first time period, the cumulative number of timeouts of the second device's responses and the cumulative number of timeouts of the third device's responses.
In this implementation, each device in the distributed network periodically obtains the heartbeat timeout status of the other devices over a past period of time and synchronizes the historical heartbeat data of these devices, in preparation for accurate detection when a fault occurs.
In a second aspect, this application provides a device detection apparatus, including: a data synchronization module, configured to obtain, when a first device detects that a response fed back by a second device has timed out, the historical heartbeat data synchronized by the first device within a first time period, where the historical heartbeat data includes, for each of the first device, the second device and a third device, the response status of the other two devices to heartbeat data packets as detected by that device; and a processing module, configured to determine, according to the response status detected by each device in the historical heartbeat data, the cause of the second device's response timeout, where the cause includes: a fault of the second device, or a fault of the transmission link between the second device and the first device.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to determine, according to the response status detected by each device, the total numbers of response timeouts N1, N2 and N3 of the first device, the second device and the third device, respectively, within the first time period, and, when a first condition is satisfied, to determine that the cause is a fault of the second device; the first condition is: the total number of response timeouts N2 corresponding to the second device is the largest, and the total number of response timeouts N3 corresponding to the third device is greater than 0.
With reference to the second aspect, in another possible implementation of the second aspect, the processing module is further configured to determine, when a second condition is satisfied, that the cause is a fault of the transmission link between the second device and the first device; the second condition is: the total number of response timeouts N1 corresponding to the first device is greater than 0, the total number of response timeouts N2 corresponding to the second device is greater than 0, and the total number of response timeouts N3 corresponding to the third device is equal to 0.
With reference to the second aspect, in yet another possible implementation of the second aspect, when the historical heartbeat data includes the following parameters: the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period, and the cumulative number of timeouts $a_{13}$ of the third device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period, and the cumulative number of timeouts $a_{23}$ of the third device's responses, as counted by the second device within the first time period; the cumulative number of timeouts $a_{32}$ of the second device's responses, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of the first device's responses, as counted by the third device within the first time period;
the first condition is: N2 > N1, N2 > N3, and N3 > 0, where $N1 = a_{12} + a_{13} + a_{21} + a_{31}$, $N2 = a_{12} + a_{21} + a_{23} + a_{32}$, and $N3 = a_{13} + a_{23} + a_{31} + a_{32}$.
Further, in this case, the second condition is: N1 = N2 > 0 and N3 = 0.
With reference to the second aspect, in yet another possible implementation of the second aspect, when the historical heartbeat data includes the following parameters: the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period; and the cumulative number of timeouts $a_{32}$ of the second device's responses, as counted by the third device within the first time period;
the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} > 0$; the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} = 0$.
With reference to the second aspect, in yet another possible implementation of the second aspect, when the historical heartbeat data includes the following parameters: the cumulative number of timeouts $a_{12}$ of the second device's responses, as counted by the first device within the first time period; the cumulative number of timeouts $a_{21}$ of the first device's responses, as counted by the second device within the first time period; and the cumulative number of timeouts $a_{32}$ of the second device's responses and the cumulative number of timeouts $a_{31}$ of the first device's responses, both as counted by the third device within the first time period;
the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} > 0$; the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} = 0$.
With reference to the second aspect, in yet another possible implementation of the second aspect, the processing module is further configured to select the third device from two or more devices, where, when the first device sends a request for historical heartbeat data to each of the two or more devices, the third device is the device from which the first piece of historical heartbeat data is received through the data synchronization module.
With reference to the second aspect, in yet another possible implementation of the second aspect, the apparatus further includes: a heartbeat detection module, configured to periodically send heartbeat data packets to the second device and the third device in the network before the first device detects that the response fed back by the second device has timed out; and a sampling module, configured to receive the responses fed back by the second device and the third device, respectively, according to the heartbeat data packets, and to count, within the first time period, the cumulative number of timeouts of the second device's responses and the cumulative number of timeouts of the third device's responses.
In a third aspect, this application further provides a chip system. The chip system includes a processor and a memory, where the processor is coupled to the memory, the memory is configured to store computer program instructions, and the processor is configured to execute the instructions stored in the memory, so that the chip system performs the method in the first aspect and the various implementations of the first aspect.
In addition, the chip system further includes an interface circuit, which is used for communication between the chip system and other external modules.
Optionally, the chip system is a chip circuit.
In a fourth aspect, this application further provides a communication device. The communication device may be the device detection apparatus described in the second aspect, or may contain the chip system described in the third aspect, so that it can perform the method in the first aspect and the various implementations of the first aspect.
The communication device may include, but is not limited to, a processor, a memory, a communication interface, a sensor module, a mobile communication module, a wireless communication module, a display screen, a camera, a USB interface, a power management module, and the like.
In a fifth aspect, this application further provides a computer-readable storage medium. The storage medium stores instructions which, when run on a computer or a processor, can be used to perform the method in the first aspect and the various implementations of the first aspect.
In addition, this application further provides a computer program product. The computer program product includes computer instructions which, when executed by a computer or a processor, implement the method in the first aspect and the various implementations of the first aspect.
It should be noted that the beneficial effects of the technical solutions of the various implementations of the second to fifth aspects are the same as those of the first aspect and its various implementations. For details, refer to the description of the beneficial effects of the first aspect and its various implementations; they are not repeated here.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a network structure, provided by this application, in which device faults are detected by the heartbeat detection method;
FIG. 2 is a schematic diagram of another network structure, provided by an embodiment of this application, in which device faults are detected by the heartbeat detection method;
FIG. 3 is a flowchart of a device detection method provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of a device detection apparatus provided by an embodiment of this application;
FIG. 5 is a flowchart of another device detection method provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of a communication device provided by an embodiment of this application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of this application, and to make the above objectives, features and advantages of the embodiments clearer, the technical solutions in the embodiments of this application are described in further detail below with reference to the accompanying drawings.
Before the technical solutions of the embodiments of this application are described, the application scenarios of the embodiments are first described with reference to the accompanying drawings.
The technical solution of this application can be applied to a distributed network, which may be a centralized network or a decentralized network, for example in a smart home environment. Decentralization is a concept defined relative to centralization: in a decentralized network every device is equal, and there are no master or slave devices. In a centralized network, the master device generally sends heartbeat data packets to the slave devices and waits for them to return response packets, whereas in a decentralized network the devices send heartbeat data packets and response packets to one another.
For example, FIG. 2 is a schematic structural diagram of a decentralized network provided by this embodiment. The network includes at least three electronic devices, such as device 1, device 2 and device 3. It may also include other electronic devices, such as switches, routers and servers; this embodiment does not limit the type or number of electronic devices included in the network.
Optionally, any one of device 1 to device 3 may be a terminal device. The terminal device may be a portable device, such as a smart terminal, a mobile phone, a notebook computer, a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a foldable terminal, a vehicle-mounted terminal, a wearable device with a wireless communication function (for example, a smart watch or wristband), a user device or user equipment (UE), or an augmented reality (AR) or virtual reality (VR) device. The terminal device may also be a smart home device, such as a speaker, air conditioner, refrigerator, TV, washing machine or water heater deployed in a home. The embodiments of this application do not limit the specific form of the terminal device. In addition, the above terminal devices include, but are not limited to, devices running Apple (iOS), Android, Microsoft or other operating systems.
In addition, any one of device 1 to device 3 may also be a network device, such as a switch, a gateway or a server, which is not limited in this embodiment.
It should be noted that, if the above devices are terminal devices, communication between them may be carried over a wireless network such as WiFi; if the above devices are network devices, communication between them may be carried over the Internet, for example over optical fiber. This application does not limit the specific transmission medium between devices.
This embodiment provides a device detection method. The method can be applied to the distributed network described above and can detect abnormal responses caused by probabilistic packet loss during network fluctuation.
One possible case of probabilistic packet loss is that, during network fluctuation, an unstable line causes the network link to go up and down intermittently, so that heartbeat messages are lost in transit. In this case the receiving device may itself be in a normal state, able to receive heartbeat data packets and to feed back response messages.
As shown in FIG. 3, the method includes the following steps:
101: When the first device detects that a response fed back by the second device has timed out, obtain the historical heartbeat data synchronized by the first device within a first time period.
The historical heartbeat data includes, for each of the first device, the second device and a third device, the response status of the other two devices to heartbeat data packets as detected by that device. Specifically, the historical heartbeat data includes at least the following three parts:
1. The historical heartbeat data of the first device, including: the response status, detected by the first device within the first time period, of the second device and/or the third device to the heartbeat data packets sent by the first device itself;
2. The historical heartbeat data of the second device, including: the response status, detected by the second device within the first time period, of the first device and/or the third device to the heartbeat data packets sent by the second device itself;
3. The historical heartbeat data of the third device, including: the response status, detected by the third device within the first time period, of the first device and/or the second device to the heartbeat data packets sent by the third device itself.
The historical heartbeat data of the first device is obtained from the first device's own statistics, whereas the historical heartbeat data of the second device and of the third device is actively reported to the first device by the respective device and obtained by the first device on receipt.
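As a rough illustration of how the first device might combine its own statistics with the reports it receives, the sketch below merges them into a single table keyed by (observer, target). The function name `merge_histories` and the dictionary layout are assumptions made for this example; the application does not prescribe a data format.

```python
def merge_histories(local_id, local_counts, reports):
    """Build {(observer, target): timeout_count} for the first time period.

    local_counts: {target: count}, counted by the local (first) device itself.
    reports: {observer: {target: count}}, reported by the other devices.
    Both layouts are hypothetical.
    """
    table = {(local_id, target): n for target, n in local_counts.items()}
    for observer, counts in reports.items():
        for target, n in counts.items():
            table[(observer, target)] = n
    return table


# Example: device 1 merges its own counts with the reports from devices 2 and 3.
history = merge_histories(
    1,
    {2: 56, 3: 0},
    {2: {1: 50, 3: 0}, 3: {1: 0, 2: 40}},
)
```

Here `history[(1, 2)]` holds the number of timeouts of device 2's responses as counted by device 1.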
Specifically, taking the historical heartbeat data of the first device as an example, one way of obtaining it is as follows. The first device periodically sends heartbeat data packets to the second device and the third device in the network. The sending period and sending range can be customized; for example, if the sending period is set to 1 s (second), the first device sends one heartbeat data packet or heartbeat message to each of the other devices every 1 s. When the second device and the third device receive a heartbeat data packet from the first device, each sends a response to the first device, for example a response data packet or response message. After sending a heartbeat data packet, the first device starts a timer and determines whether the responses from the second device and the third device are received within a preset time. The preset time can be customized.
If the first device receives a response from the second device or the third device (the receiving end) within the preset time, the response of that receiving end has not timed out; if the response is received after the preset time has elapsed, or no response is received at all, the response of that receiving end has timed out. Optionally, a response timeout may also be called a heartbeat anomaly.
For example, device 1 sends heartbeat message 1 to device 2 and to device 3 at time t1; device 1 then receives response message 1 from device 2 at time t2 and response message 2 from device 3 at time t3. If the interval between t1 and t2 is within the preset time interval, device 1 records that the response to the heartbeat message sent to device 2 at time t1 did not time out. If the interval between t1 and t3 exceeds the preset time interval, or response message 2 from device 3 is not received within the preset time interval, device 1 records that the response to the heartbeat message sent to device 3 at time t1 timed out.
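The timeout judgment in this example can be written as a small predicate. This is a sketch under assumptions: the name `response_timed_out` is hypothetical, a lost response is modelled as `None`, and a response that arrives exactly at the preset time is treated as on time, which the text does not specify.

```python
from typing import Optional


def response_timed_out(sent_at: float,
                       received_at: Optional[float],
                       preset: float) -> bool:
    """Return True if the response to a heartbeat counts as a timeout.

    sent_at: time the heartbeat was sent (t1 in the example).
    received_at: time the response arrived (t2 or t3), or None if it never arrived.
    preset: the preset time interval, in the same unit.
    """
    if received_at is None:
        return True  # no response received at all
    return (received_at - sent_at) > preset  # response arrived too late
```

With t1 = 0.0 and a preset interval of 1.0, a response at t2 = 0.2 is on time, while a response at t3 = 1.5 or a missing response is a timeout.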
Optionally, the first device marks a response that did not time out as "0" and a response that timed out as "1". For device 2 above, device 1 marks the response to heartbeat message 1 sent at time t1 as "0"; for device 3 above, device 1 marks the response to heartbeat message 1 sent at time t1 as "1". It should be understood that device 1 may also mark timed-out and non-timed-out responses in other ways; this embodiment does not limit the marking method used by device 1. In this embodiment, when it is detected that the second device's response has timed out, that is, when the heartbeat response is abnormal, a two-dimensional array of all "1"s is recorded.
Similarly, the second device periodically sends heartbeat data packets to the first device and the third device and records their response timeouts, forming the historical heartbeat data of the second device. The third device likewise periodically sends heartbeat data packets to the second device and the first device and records their response timeouts, forming the historical heartbeat data of the third device.
In addition, the historical heartbeat data can be refreshed periodically. For example, each device stores the historical heartbeat data of a 1 min (minute) interval; alternatively, the local storage record is refreshed every 1 min, so that the historical heartbeat data is updated to that of the most recent 1 to 2 min.
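One way to keep only recent history is a sliding window that discards marks older than the retention interval. The sketch below is an assumption about how such a store could look, not the storage scheme of the application: the class name `HeartbeatHistory`, its methods, and the choice to prune on every insertion are all hypothetical.

```python
from collections import deque


class HeartbeatHistory:
    """Sliding window of 0/1 timeout marks for one peer device."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._marks = deque()  # entries are (timestamp, mark)

    def record(self, timestamp: float, timed_out: bool) -> None:
        """Store one mark (1 = timeout, 0 = on time) and drop stale entries."""
        self._marks.append((timestamp, 1 if timed_out else 0))
        self._prune(timestamp)

    def _prune(self, now: float) -> None:
        # Keep only marks recorded within the last window_s seconds.
        while self._marks and self._marks[0][0] <= now - self.window_s:
            self._marks.popleft()

    def timeout_count(self) -> int:
        """Cumulative number of timeouts inside the current window."""
        return sum(mark for _, mark in self._marks)
```

A device would hold one such window per peer and report `timeout_count()` when asked for its historical heartbeat data.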
This embodiment takes the first device as an example. In step 101 above, the first device detecting that the response fed back by the second device has timed out means the following: the first device and the second device are communicating normally and exchange heartbeat data packets and responses; when the first device sends a heartbeat data packet to the second device at a certain moment and does not receive a response message from the second device within the preset time (for example, 1 s), the first device confirms that the current response from the second device has timed out and starts the method of step 101.
102: Determine, according to the response status detected by each device in the historical heartbeat data, the cause of the second device's response timeout, where the cause includes: a fault of the second device, or a fault of the transmission link between the second device and the first device.
Specifically, the first condition and the second condition can be used to determine whether the timeout is caused by a fault of the device itself or by a fault of the transmission link between devices.
The first condition is: the total number of response timeouts N2 corresponding to the second device is the largest, and the total number of response timeouts N3 corresponding to the third device is greater than 0. Expressed as a formula: N2 > N1, N2 > N3, and N3 > 0.
The second condition is: the total number of response timeouts N1 corresponding to the first device is greater than 0, the total number of response timeouts N2 corresponding to the second device is greater than 0, and the total number of response timeouts N3 corresponding to the third device is equal to 0. Expressed as a formula: N1 > 0, N2 > 0, and N3 = 0.
Here, N1 is the total number of response timeouts of the first device within the first time period, N2 is the total number of response timeouts of the second device within the first time period, and N3 is the total number of response timeouts of the third device within the first time period. The first time period can be set freely, for example to 30 s, 60 s, 90 s or 120 s, which is not limited in this embodiment.
The following describes several possible implementations in which the first device determines the cause of the failure of the second device according to the historical heartbeat data.
First implementation
In this implementation, assume that each device sends heartbeat data packets at a rate of one per second (a period of 1 s), that the first time period is 60 s, and that the preset time is 1 s; that is, after device 1 sends a heartbeat data packet, the response from the receiving end counts as not timed out if it arrives within 1 s. Within one 60 s detection period, if device 1 receives 60 consecutive response packets from device 2, and device 2 feeds each one back quickly (for example, within milliseconds), then at the 61st second device 1 finds that, over the detection period from the 1st to the 60th second (the first time period), the number of response timeouts of device 2 for the heartbeat data packets sent by device 1 is 0; that is, the number of abnormal heartbeats is 0. Similarly, if within 60 s device 1 receives only 4 response packets from device 3 that arrive within the 1 s preset time, it marks 4 "0"s; the responses to the remaining 56 heartbeat packets all time out, so it marks 56 "1"s. Device 1 then counts 56 response timeouts for device 3 over the past 60 s; that is, the number of abnormal heartbeats is 56.
Optionally, the number of abnormal heartbeats may be denoted by the letter "a": a12 denotes the number of abnormal heartbeats (the response status) of device 2 counted by device 1 in the first time period, and a13 denotes the number of abnormal heartbeats of device 3 counted by device 1 in the first time period. The historical heartbeat data of device 2 and device 3 counted by device 1 in the first time period is therefore {a12, a13}. In the example above, a12=0 and a13=56, so the historical heartbeat data counted by device 1 over 60 s is {0, 56}.
Similarly, the historical heartbeat data of the second device can be expressed as {a21, a23}, and the historical heartbeat data of the third device can be expressed as {a31, a32}. Here, a21 denotes the number of abnormal heartbeats of device 1 counted by device 2 in the first time period, a23 denotes the number of abnormal heartbeats of device 3 counted by device 2 in the first time period, a31 denotes the number of abnormal heartbeats of device 1 counted by device 3 in the first time period, and a32 denotes the number of abnormal heartbeats of device 2 counted by device 3 in the first time period.
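The counting and the a-notation above can be sketched in Python as follows. This is an illustrative sketch only: the function name `abnormal_count`, the per-second mark lists, and the dictionary keyed by (observer, target) pairs are assumptions chosen for the example.

```python
def abnormal_count(marks):
    """Sum the per-heartbeat marks: 0 = response on time, 1 = response timed out."""
    return sum(marks)


# Marks recorded by device 1 over a 60 s first time period (one heartbeat per second).
marks_dev2 = [0] * 60            # device 2 answered all 60 heartbeats in time
marks_dev3 = [0] * 4 + [1] * 56  # device 3 answered 4 in time, 56 timed out

# Key (i, j) stands for a_ij: timeouts of device j as counted by device i.
history_dev1 = {
    (1, 2): abnormal_count(marks_dev2),
    (1, 3): abnormal_count(marks_dev3),
}
print(history_dev1)  # {(1, 2): 0, (1, 3): 56}
```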
At this point, all the historical heartbeat data counted by device 1 to device 3 that device 1 has obtained for the first time period is
Figure PCTCN2021114169-appb-000001
Device 1 stores this historical heartbeat data in its local storage medium.
In addition, the above method further includes: the first device sends the historical heartbeat data that it counted in the first time period to the second device and the third device respectively. The second device sends the historical heartbeat data that it counted in the first time period to the first device and the third device respectively. The third device sends the historical heartbeat data that it counted in the first time period to the first device and the second device respectively. In this way, the first device, the second device, and the third device each obtain the historical heartbeat data counted by the other two devices.
When an abnormal state of the second device in the network is detected, the data synchronization module of the first device obtains the historical heartbeat data of the first device and, at the same time, obtains historical heartbeat data from the other devices in the network. If obtaining it times out, the heartbeat data a12 of the second device is set to default data.
The reason for the response timeout of the second device is analyzed from the historical heartbeat data of the first device, the second device, and the third device in the first time period obtained by the first device, specifically as follows:
First, according to the response status detected by each device, the total numbers of response timeouts N1, N2, and N3 of the first device, the second device, and the third device within the first time period are determined respectively. The total number of response timeouts is also called the cumulative number of abnormal heartbeats. For a given device, it is the sum of the abnormal heartbeat counts detected by all devices in which that device takes part, whether as the detecting device or as the detected device.
For example, the cumulative number of abnormal heartbeats of device 1 is N1, where N1=a12+a13+a21+a31,
the cumulative number of abnormal heartbeats of device 2 is N2, where N2=a12+a21+a23+a32,
and the cumulative number of abnormal heartbeats of device 3 is N3, where N3=a13+a23+a31+a32.
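The three sums above follow one rule: each count a_ij contributes to the total of both device i and device j. A minimal Python sketch of that rule is given below. The function name `cumulative_timeouts`, the dictionary layout keyed by (observer, target), and the sample counts are assumptions made for illustration.

```python
from typing import Dict, Tuple


def cumulative_timeouts(a: Dict[Tuple[int, int], int],
                        devices=(1, 2, 3)) -> Dict[int, int]:
    """Return {device: N} where N sums every a_ij that involves the device."""
    totals = {d: 0 for d in devices}
    for (observer, target), count in a.items():
        totals[observer] += count  # a_ij counts toward the detecting device i
        totals[target] += count    # and toward the detected device j
    return totals


# Illustrative counts; key (i, j) stands for a_ij.
a = {(1, 2): 6, (1, 3): 3, (2, 1): 3, (2, 3): 0, (3, 1): 5, (3, 2): 0}
print(cumulative_timeouts(a))  # {1: 17, 2: 9, 3: 8}
```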
If N2>N1, N2>N3, and N3>0 (for example, when N1, N2, and N3 are all different and N2>N1>N3 with N3>0), the first condition above is satisfied, and the timeout is determined to be caused by a failure of the second device (device 2). In this implementation, the device with the largest cumulative number of heartbeat timeouts (abnormal heartbeats) is determined to be the failed device.
Similarly, if N1>N2, N1>N3, and N3>0 (for example, N1>N2>N3 with N3>0), it is determined that the first device (device 1) has failed. If N3>N2, N3>N1, and N3>0 (for example, N3>N2>N1 with N3>0), it is determined that the third device (device 3) has failed.
For example, when device 1 detects that device 2 is abnormal, the historical heartbeat data of device 1 to device 3 in the first time period that it obtains is
Figure PCTCN2021114169-appb-000002
Then N1=6+3+3+5=17, N2=6+3+0+0=9, and N3=3+0+0+5=8; that is, N1>N2>N3 and N3>0, so device 1 is determined to be faulty.
If N1=N2>0 and N3=0, the second condition above is satisfied, and the timeout is determined to be caused by a failure of the transmission link between the second device (device 2) and the first device (device 1).
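The decision rules of this first implementation can be collected into one function. The following Python sketch is a hedged illustration, not the claimed method itself: the function name `diagnose`, the returned strings, and the "undetermined" fallback for cases the text does not cover (such as ties) are assumptions.

```python
def diagnose(n1: int, n2: int, n3: int) -> str:
    """Classify the cause of a timeout from the cumulative totals N1, N2, N3."""
    totals = {1: n1, 2: n2, 3: n3}
    # Link failure between devices 1 and 2: N1 = N2 > 0 and no timeout involves device 3.
    if n1 == n2 and n1 > 0 and n3 == 0:
        return "link 1-2 fault"
    # Device failure: one device has strictly the largest total, and N3 > 0.
    if n3 > 0:
        top = max(totals, key=totals.get)
        if all(totals[top] > v for d, v in totals.items() if d != top):
            return f"device {top} fault"
    # Ties and other combinations are not covered by the rules in the text.
    return "undetermined"


print(diagnose(17, 9, 8))  # device 1 fault
print(diagnose(4, 4, 0))   # link 1-2 fault
```

The first call uses the totals from the worked example above (N1=17, N2=9, N3=8).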
The method provided in this embodiment uses the historical heartbeat data detected by two peer devices, together with the historical heartbeat data collected by the device itself, to examine a device in an abnormal state. By comparing the number of timeouts of each device over a past period, it determines whether the failure is a failure of the device itself or a link failure caused by probabilistic packet loss. Because the historical heartbeat data consists of heartbeat timeouts that multiple devices detect and report to one another, the decision is made on global information. Compared with detection based on the historical heartbeat data of a single device, the method therefore improves the accuracy of device fault detection in a distributed network and avoids misjudgments caused by probabilistic packet loss when the network fluctuates.
In addition, in this method each device in the distributed network periodically obtains the heartbeat timeout status of the other devices over a past period and synchronizes their historical heartbeat data, in preparation for accurate detection when a failure occurs.
Second implementation
This implementation takes device 1, device 2, and device 3 as an example. After device 1 detects that the heartbeat feedback of device 2 has timed out, the data synchronization module of device 1 synchronizes the historical heartbeat data of device 2 and device 3 and processes it. Assume that the historical heartbeat data obtained by device 1 includes:
the cumulative number of timeouts a12 of the responses fed back by device 2, as counted by device 1 in the first time period; the cumulative number of timeouts a21 of the responses fed back by device 1, as counted by device 2 in the first time period; and the cumulative number of timeouts a32 of the responses fed back by device 2, as counted by device 3 in the first time period. In this case N1=a12, N2=a21, and N3=a32. Whether the failure is a failure of device 2 or a failure of the transmission link between device 1 and device 2 is then judged from this historical heartbeat data as follows:
If a12>0, a21>0, and a32>0, the first condition above is satisfied, and it is determined that device 2 has failed. If a12>0, a21>0, and a32=0, the second condition above is satisfied, and it is determined that the link between device 1 and device 2 has failed.
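In this second implementation, device 3 acts as a witness: if it also sees timeouts from device 2, the fault is attributed to device 2, otherwise to the link. A minimal Python sketch follows. The function name `diagnose_with_witness`, the returned strings, and the "undetermined" fallback for cases the text does not mention are assumptions made for illustration.

```python
def diagnose_with_witness(a12: int, a21: int, a32: int) -> str:
    """Judge the fault from a12, a21 and the third device's count a32."""
    if a12 > 0 and a21 > 0:
        if a32 > 0:
            # Device 3 also observes timeouts from device 2.
            return "device 2 fault"
        # Only devices 1 and 2 observe timeouts from each other.
        return "link 1-2 fault"
    # The text does not describe cases where a12 or a21 is zero.
    return "undetermined"


print(diagnose_with_witness(5, 4, 3))  # device 2 fault
print(diagnose_with_witness(5, 4, 0))  # link 1-2 fault
```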
When a failure of device 2 is detected, this implementation uses the historical heartbeat data of device 3 to judge whether the failure is a failure of device 2 itself or a failure of the transmission link between device 1 and device 2, thereby improving the accuracy of fault detection in a distributed network under network fluctuation.
Third implementation
This implementation is similar to the second possible implementation described above. The difference is that, in addition to a12, a21, and a32 of the second possible implementation, the historical heartbeat data obtained by device 1 also includes a23; a31 denotes the cumulative number of timeouts of the responses fed back by device 1, as counted by device 3 in the first time period. In this case N1=a12, N2=a21, and N3=a32+a23. In step 102 above, whether the failure is a failure of device 2 or a failure of the transmission link between device 1 and device 2 is judged from the historical heartbeat data as follows:
If a12>0, a21>0, and a32+a23>0, the first condition above is satisfied, and it is determined that device 2 has failed. If a12>0, a21>0, and a32+a23=0, the second condition above is satisfied, and it is determined that the failure is a failure of the link between device 1 and device 2.
It should be noted that, depending on the historical heartbeat data available, more or fewer judgment methods may be included; this embodiment does not describe each specific judgment method one by one.
Fourth implementation
In this implementation, before step 101 above, if the network contains two or more terminal devices in addition to the first device and the second device when the first device detects that the second device is abnormal, one of those terminal devices must first be selected as the third device.
One specific selection method is as follows: the first device sends a request for historical heartbeat data to each of the two or more devices, and each device that receives the request sends the historical heartbeat data it has recorded to the first device. When the first device receives the first historical heartbeat data, it determines the device that sent it to be the third device. That device has the fastest response or is closest to the first device, so selecting it as the third device gives higher processing efficiency.
After the third device has been determined and the historical heartbeat data reported by the third device has been received, steps 101 and 102 of the foregoing embodiment are performed to analyze the reason for the response timeout of the second device. For the specific process, refer to the description of the first, second, or third implementation above, which is not repeated here.
In this embodiment, when there are multiple devices, the first device sends a request for historical heartbeat data to each of them at the same time and selects the device whose historical heartbeat data arrives first as the third device, which can improve detection efficiency.
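The "first responder becomes the third device" selection can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions: `fetch_history` is a hypothetical stand-in for the real network request (it just sleeps to simulate round-trip time), and the device names and delays are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_history(device_id: str, delay_s: float) -> dict:
    """Hypothetical request for a device's history; delay_s simulates latency."""
    time.sleep(delay_s)
    return {"device": device_id, "history": {}}


def select_third_device(candidates: dict) -> str:
    """Request history from every candidate at once and pick the first to answer."""
    if not candidates:
        raise ValueError("no candidate devices")
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = [pool.submit(fetch_history, dev, delay)
                   for dev, delay in candidates.items()]
        # as_completed yields futures in the order they finish.
        for future in as_completed(futures):
            return future.result()["device"]
    raise ValueError("no candidate responded")


print(select_third_device({"dev3": 0.3, "dev4": 0.01, "dev5": 0.15}))  # dev4
```

In this sketch the executor still waits for the slower requests to finish when the `with` block exits; a real implementation would more likely cancel or ignore them.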
It should be understood that other selection criteria may also be used to determine the third device, for example taking the device closest to the first device as the third device. This embodiment does not limit the criteria for selecting the third device.
The method provided in this embodiment is applied in a distributed network. When one of the devices becomes abnormal, the historical heartbeat data of the other devices in the network is obtained and synchronized, the historical heartbeat data is merged, and the processed data is used to locate the faulty device accurately. Specifically, the historical heartbeat data is converted into the cumulative number of timeouts of each device over a past period, and by comparing these cumulative numbers the faulty device is determined, namely the device with the most cumulative timeouts. The method improves the accuracy of device fault detection in a distributed network under network fluctuation and avoids misjudgments caused by making decisions on single-device information.
It should be noted that this embodiment takes the first device detecting an abnormal condition of the second device as an example. Similarly, in a distributed network, the second device and the third device can use the same method to detect the reason for an abnormal state of the first device. The detection method used by the second device when it detects that the first device is abnormal is the same as the method in the foregoing embodiment; refer to the method steps above, which are not repeated here.
The following describes apparatus embodiments corresponding to the foregoing method embodiments.
FIG. 4 is a schematic structural diagram of a device detection apparatus provided by an embodiment of the present application. The apparatus may be a communication device, or a component located in the communication device, such as a chip or a chip system. The apparatus can implement the device detection method in the foregoing embodiments.
Specifically, as shown in FIG. 4, the apparatus may include: a data synchronization module 401, a processing module 402, a heartbeat detection module 403, and a sampling module 404. In addition, the apparatus may include other units or modules, such as a storage unit.
Each module has at least the following functions, as shown in FIG. 5:
501: The heartbeat detection module 403 is configured to periodically send heartbeat data packets to the second device and the third device in the network before the first device detects that the response fed back by the second device has timed out. The sampling module 404 is configured to receive the responses that the second device and the third device feed back for the heartbeat data packets, and to count, within the first time period, the cumulative number of timeouts of the responses fed back by the second device and the cumulative number of timeouts of the responses fed back by the third device.
502: The sampling module 404 sends the response status of each device counted within the first time period to the data synchronization module 401 as historical heartbeat data.
503: When the first device detects that the response fed back by the second device has timed out (that is, upon a message that the state of device 2 is abnormal), the data synchronization module 401 obtains the historical heartbeat data synchronized by the first device within the first time period. The historical heartbeat data includes, for each of the first device, the second device, and the third device, the response status of the other two devices to heartbeat data packets as detected by that device.
504: The processing module 402 is configured to determine, according to the response status detected by each device in the historical heartbeat data, the reason why the response of the second device timed out. The reason is either that the second device has failed, or that the transmission link between the second device and the first device has failed.
Optionally, in a specific implementation of this embodiment, the processing module 402 is specifically configured to determine, according to the response status detected by each device, the total numbers of response timeouts N1, N2, and N3 of the first device, the second device, and the third device within the first time period respectively, and, when the first condition is satisfied, to determine that the reason is that the second device has failed. The first condition is: the total number of response timeouts N2 corresponding to the second device is the largest, and the total number of response timeouts N3 corresponding to the third device is greater than 0.
Optionally, in another specific implementation of this embodiment, the processing module 402 is further configured to determine, when the second condition is satisfied, that the reason is that the transmission link between the second device and the first device has failed. The second condition is: the total number of response timeouts N1 corresponding to the first device is greater than 0, the total number of response timeouts N2 corresponding to the second device is greater than 0, and the total number of response timeouts N3 corresponding to the third device is equal to 0.
Further, in a possible implementation, the historical heartbeat data includes: the cumulative number of timeouts a12 of the responses fed back by the second device and the cumulative number of timeouts a13 of the responses fed back by the third device, as counted by the first device in the first time period; the cumulative number of timeouts a21 of the responses fed back by the first device and the cumulative number of timeouts a23 of the responses fed back by the third device, as counted by the second device in the first time period; and the cumulative number of timeouts a32 of the responses fed back by the second device and the cumulative number of timeouts a31 of the responses fed back by the first device, as counted by the third device in the first time period. The first condition is: N2>N1, N2>N3, and N3>0, where N1=a12+a13+a21+a31, N2=a12+a21+a23+a32, and N3=a13+a23+a31+a32.
Further, in another possible implementation, the second condition is N1=N2>0 and N3=0.
Optionally, in another possible implementation, when the historical heartbeat data includes the cumulative number of timeouts a12 of the responses fed back by the second device, as counted by the first device in the first time period, the cumulative number of timeouts a21 of the responses fed back by the first device, as counted by the second device in the first time period, and the cumulative number of timeouts a32 of the responses fed back by the second device, as counted by the third device in the first time period, the first condition is: a12>0, a21>0, and a32>0; and the second condition is: a12>0, a21>0, and a32=0.
Optionally, in yet another possible implementation, when the historical heartbeat data includes the cumulative number of timeouts a12 of the responses fed back by the second device, as counted by the first device in the first time period, the cumulative number of timeouts a21 of the responses fed back by the first device, as counted by the second device in the first time period, the cumulative number of timeouts a32 of the responses fed back by the second device, as counted by the third device in the first time period, and the cumulative number of timeouts a31 of the responses fed back by the first device, as counted by the third device in the first time period, the first condition is: a12>0, a21>0, and a32+a23>0; and the second condition is: a12>0, a21>0, and a32+a23=0.
Optionally, in yet another specific implementation of this embodiment, the processing module 402 is further configured to select the third device from two or more devices. The third device is the device from which the data synchronization module receives the first historical heartbeat data after the first device has sent a request for historical heartbeat data to each of the two or more devices.
In addition, in a specific hardware implementation, this embodiment also provides a communication device. The communication device may be a terminal device or a network device, or a component integrated in such a terminal device or network device.
FIG. 6 shows a schematic structural diagram of a communication device. The communication device may include a processor 110, a memory 120, and at least one communication interface 130, which may be coupled through a communication bus.
The processor 110 is the control center of the communication device and can be used for communication between devices, including, for example, information transmission with the second device, the third device, and other devices.
The processor 110 may consist of integrated circuits (Integrated Circuit, IC), for example a single packaged IC, or multiple packaged ICs with the same or different functions connected together. For example, the processor 110 may include a central processing unit (Central Processing Unit, CPU) or a digital signal processor (Digital Signal Processor, DSP).
In addition, the processor 110 may further include a hardware chip, which may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. Optionally, the hardware chip is a processing chip or a chip circuit.
The memory 120 is used to store and exchange various types of data or software, including historical heartbeat data, heartbeat data packets, and response packets or response messages. In addition, the memory 120 may store computer programs and code.
Specifically, the memory 120 may include volatile memory, such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD). The memory 120 may also include a combination of the above types of memory.
The communication interface 130 uses any transceiver-type apparatus to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), a wireless local area network (Wireless Local Area Network, WLAN), or a virtual extensible local area network (Virtual Extensible Local Area Network, VXLAN).
It should be understood that the communication device may include more or fewer components, and the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the communication device. The components shown in FIG. 6 may be implemented in hardware, software, firmware, or any combination thereof.
When implemented in software, the implementation may be wholly or partly in the form of a computer program product. For example, in the apparatus shown in FIG. 4, the heartbeat detection module 403 and the sampling module 404 may be implemented by the communication interface, the functions of the data synchronization module 401 and the processing module 402 may be implemented by the processor 110, and the function of the storage unit may be implemented by the memory 120.
Specifically, the communication device uses the communication interface to receive acknowledgment responses sent by at least two other devices. When the processor 110 detects that the acknowledgment response fed back by the second device has timed out, it obtains the historical heartbeat data that the device itself synchronized within the first time period, and then determines the cause of the second device's acknowledgment response timeout from the response status detected by each device in that historical heartbeat data. Specifically, when the timeout is detected, the processor calls the program code in the memory 120 to perform the method shown in FIG. 3 or FIG. 5 of the foregoing embodiments.
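As an illustration of the bookkeeping this paragraph implies, the following is a minimal sketch of how a device might count acknowledgment response timeouts per peer within the first time period. The class name `TimeoutWindow`, its methods, and the use of a sliding window keyed on timestamps are assumptions made for this example; the application does not prescribe a data structure.

```python
from collections import defaultdict, deque


class TimeoutWindow:
    """Hypothetical per-peer timeout counter over a sliding time window."""

    def __init__(self, window_seconds: float) -> None:
        # Length of the "first time period", in seconds (assumed unit).
        self.window_seconds = window_seconds
        # peer id -> timestamps at which an acknowledgment response timed out
        self._timeouts = defaultdict(deque)

    def record_timeout(self, peer: str, timestamp: float) -> None:
        """Record that the response from `peer` timed out at `timestamp`."""
        self._timeouts[peer].append(timestamp)

    def count(self, peer: str, now: float) -> int:
        """Cumulative timeouts for `peer` in (now - window_seconds, now]."""
        events = self._timeouts[peer]
        # Drop timeouts that have aged out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)


window = TimeoutWindow(window_seconds=60.0)
window.record_timeout("device2", timestamp=5.0)
window.record_timeout("device2", timestamp=50.0)
window.record_timeout("device3", timestamp=58.0)
print(window.count("device2", now=70.0))  # prints 1: the timeout at t=5.0 has aged out
```

Counts kept this way on each device would correspond to that device's share of the historical heartbeat data, which is then synchronized with its peers.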
In addition, the communication device further includes a mobile communication module, a wireless communication module, and the like. The mobile communication module includes modules providing wireless communication functions such as 2G/3G/4G/5G, and may further include a filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The wireless communication module may provide wireless communication solutions applied on the communication device, including WLAN, Bluetooth, global navigation satellite system (GNSS), and frequency modulation (FM).
In addition, an embodiment of this application further provides a network system. The network system may have the distributed network architecture shown in FIG. 2 and includes at least three communication devices, for example device 1 to device 3. Each device may have the structure of the communication device shown in FIG. 6 and is configured to implement the device detection method of the foregoing embodiments.
In this embodiment, a device in an abnormal state is examined using the historical heartbeat data detected by two peripheral devices together with the historical heartbeat data obtained by the device itself. By comparing the number of timeouts of each device over a past period, the method determines whether the cause of the fault is a failure of the device itself or a link failure caused by probabilistic packet loss. Because the historical heartbeat data reflects heartbeat timeouts that multiple devices detect for one another and report, the decision draws on global information. Compared with detection based on the historical heartbeat data of a single device, the method therefore improves the accuracy of device fault detection in a distributed network and helps avoid misjudgments caused by probabilistic packet loss under network fluctuation.
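The comparison of timeout counts can be sketched as follows, using the totals N1, N2, and N3 and the first and second conditions as they are worded in claims 4 and 5. The function name `classify_timeout`, the dictionary layout, and the `"undetermined"` return value for cases that match neither condition are assumptions made for this example and are not part of the application.

```python
from typing import Dict, Tuple


def classify_timeout(a: Dict[Tuple[int, int], int]) -> str:
    """Classify why device 2 timed out.

    a[(i, j)] is the cumulative number of timeouts that device i counted
    for device j within the first time period.
    """
    # Per-device totals as defined in claim 4.
    n1 = a[1, 2] + a[1, 3] + a[2, 1] + a[3, 1]
    n2 = a[1, 2] + a[2, 1] + a[2, 3] + a[3, 2]
    n3 = a[1, 3] + a[2, 3] + a[3, 1] + a[3, 2]

    # First condition: N2 is the largest and the third device also saw timeouts.
    if n2 > n1 and n2 > n3 and n3 > 0:
        return "device_fault"
    # Second condition (claim 5): only the link between devices 1 and 2 is affected.
    if n1 == n2 > 0 and n3 == 0:
        return "link_fault"
    # Neither condition holds; this label is an assumption of the sketch.
    return "undetermined"


# Device 3 also fails to reach device 2, so device 2 itself is suspected.
print(classify_timeout({(1, 2): 3, (1, 3): 0, (2, 1): 0,
                        (2, 3): 0, (3, 1): 0, (3, 2): 2}))  # device_fault
# Only devices 1 and 2 lose each other's heartbeats, so the link is suspected.
print(classify_timeout({(1, 2): 3, (1, 3): 0, (2, 1): 3,
                        (2, 3): 0, (3, 1): 0, (3, 2): 0}))  # link_fault
```

In the first call N1 = 3, N2 = 5, and N3 = 2, which satisfies the first condition; in the second call N1 = N2 = 6 and N3 = 0, which satisfies the second.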
An embodiment of this application further provides a computer program product that includes one or more computer program instructions. When a computer loads and executes the computer program instructions, the processes or functions described in the foregoing embodiments are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer program instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one communication device, computer, server, or data center to another communication device in a wired or wireless manner.
The computer program product and the computer program instructions may be located in the memory of the foregoing communication device, so as to implement the device detection method described in the embodiments of this application.
In addition, in the description of the embodiments of this application, "at least one" means one or more, and "at least three" means three or more. Furthermore, to describe the technical solutions clearly, the embodiments use words such as "first", "second", and "third" to distinguish identical or similar items whose functions and effects are basically the same. Those skilled in the art will understand that words such as "first", "second", and "third" do not limit quantity or execution order, and do not require the items to be different.
The embodiments of this application described above do not limit the protection scope of this application.

Claims (20)

  1. A device detection method, characterized in that the method comprises:
    when a first device detects that an acknowledgment response fed back by a second device has timed out, obtaining historical heartbeat data synchronized by the first device within a first time period, wherein the historical heartbeat data comprises, for each of the first device, the second device, and a third device, the acknowledgment response status of the other two devices to heartbeat data packets as detected by that device; and
    determining, according to the acknowledgment response status detected by each device in the historical heartbeat data, a cause of the acknowledgment response timeout of the second device, wherein the cause comprises: a failure of the second device, or a failure of the transmission link between the second device and the first device.
  2. The method according to claim 1, characterized in that determining, according to the acknowledgment response status detected by each device in the historical heartbeat data, the cause of the acknowledgment response timeout of the second device comprises:
    determining, according to the acknowledgment response status detected by each device, the total numbers of acknowledgment response timeouts $N_1$, $N_2$, and $N_3$ of the first device, the second device, and the third device, respectively, within the first time period; and
    when a first condition is met, determining that the cause is a failure of the second device, wherein the first condition is: the total number of acknowledgment response timeouts $N_2$ corresponding to the second device is the largest, and the total number of acknowledgment response timeouts $N_3$ corresponding to the third device is greater than 0.
  3. The method according to claim 2, characterized in that the method further comprises:
    when a second condition is met, determining that the cause is a failure of the transmission link between the second device and the first device, wherein the second condition is: the total number of acknowledgment response timeouts $N_1$ corresponding to the first device is greater than 0, the total number of acknowledgment response timeouts $N_2$ corresponding to the second device is greater than 0, and the total number of acknowledgment response timeouts $N_3$ corresponding to the third device is equal to 0.
  4. The method according to claim 2 or 3, characterized in that the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period, and the cumulative number of timeouts $a_{13}$ of acknowledgment responses fed back by the third device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period, and the cumulative number of timeouts $a_{23}$ of acknowledgment responses fed back by the third device, as counted by the second device within the first time period;
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of acknowledgment responses fed back by the first device, as counted by the third device within the first time period;
    the first condition is: $N_2 > N_1$, $N_2 > N_3$, and $N_3 > 0$;
    wherein $N_1 = a_{12} + a_{13} + a_{21} + a_{31}$, $N_2 = a_{12} + a_{21} + a_{23} + a_{32}$, and $N_3 = a_{13} + a_{23} + a_{31} + a_{32}$.
  5. The method according to claim 4, characterized in that the second condition is: $N_1 = N_2 > 0$ and $N_3 = 0$.
  6. The method according to claim 2 or 3, characterized in that, when the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period; and
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period;
    the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} > 0$; and
    the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} = 0$.
  7. The method according to claim 2 or 3, characterized in that, when the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period; and
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of acknowledgment responses fed back by the first device, as counted by the third device within the first time period;
    the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} > 0$; and
    the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} = 0$.
  8. The method according to any one of claims 1 to 7, characterized in that, before obtaining the historical heartbeat data reported by the third device and synchronized by the first device within the first time period, the method further comprises:
    selecting the third device from two or more devices, wherein the third device is the device from which the first piece of historical heartbeat data is received when the first device sends a request for historical heartbeat data to each of the two or more devices.
  9. The method according to any one of claims 1 to 8, characterized in that, before the first device detects that the acknowledgment response fed back by the second device has timed out, the method further comprises:
    periodically sending heartbeat data packets to the second device and the third device in the network;
    receiving the acknowledgment responses fed back by the second device and the third device, respectively, in response to the heartbeat data packets; and
    counting, within the first time period, the cumulative number of timeouts of acknowledgment responses fed back by the second device and the cumulative number of timeouts of acknowledgment responses fed back by the third device.
  10. A device detection apparatus, characterized in that the apparatus comprises:
    a data synchronization module, configured to: when a first device detects that an acknowledgment response fed back by a second device has timed out, obtain historical heartbeat data synchronized by the first device within a first time period, wherein the historical heartbeat data comprises, for each of the first device, the second device, and a third device, the acknowledgment response status of the other two devices to heartbeat data packets as detected by that device; and
    a processing module, configured to determine, according to the acknowledgment response status detected by each device in the historical heartbeat data, a cause of the acknowledgment response timeout of the second device, wherein the cause comprises: a failure of the second device, or a failure of the transmission link between the second device and the first device.
  11. The apparatus according to claim 10, characterized in that:
    the processing module is specifically configured to determine, according to the acknowledgment response status detected by each device, the total numbers of acknowledgment response timeouts $N_1$, $N_2$, and $N_3$ of the first device, the second device, and the third device, respectively, within the first time period, and, when a first condition is met, determine that the cause is a failure of the second device;
    the first condition is: the total number of acknowledgment response timeouts $N_2$ corresponding to the second device is the largest, and the total number of acknowledgment response timeouts $N_3$ corresponding to the third device is greater than 0.
  12. The apparatus according to claim 11, characterized in that:
    the processing module is further configured to, when a second condition is met, determine that the cause is a failure of the transmission link between the second device and the first device;
    the second condition is: the total number of acknowledgment response timeouts $N_1$ corresponding to the first device is greater than 0, the total number of acknowledgment response timeouts $N_2$ corresponding to the second device is greater than 0, and the total number of acknowledgment response timeouts $N_3$ corresponding to the third device is equal to 0.
  13. The apparatus according to claim 11 or 12, characterized in that the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period, and the cumulative number of timeouts $a_{13}$ of acknowledgment responses fed back by the third device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period, and the cumulative number of timeouts $a_{23}$ of acknowledgment responses fed back by the third device, as counted by the second device within the first time period;
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of acknowledgment responses fed back by the first device, as counted by the third device within the first time period;
    the first condition is: $N_2 > N_1$, $N_2 > N_3$, and $N_3 > 0$;
    wherein $N_1 = a_{12} + a_{13} + a_{21} + a_{31}$, $N_2 = a_{12} + a_{21} + a_{23} + a_{32}$, and $N_3 = a_{13} + a_{23} + a_{31} + a_{32}$.
  14. The apparatus according to claim 13, characterized in that:
    the second condition is: $N_1 = N_2 > 0$ and $N_3 = 0$.
  15. The apparatus according to claim 11 or 12, characterized in that, when the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period; and
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period;
    the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} > 0$; and
    the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} = 0$.
  16. The apparatus according to claim 11 or 12, characterized in that, when the historical heartbeat data comprises:
    the cumulative number of timeouts $a_{12}$ of acknowledgment responses fed back by the second device, as counted by the first device within the first time period;
    the cumulative number of timeouts $a_{21}$ of acknowledgment responses fed back by the first device, as counted by the second device within the first time period; and
    the cumulative number of timeouts $a_{32}$ of acknowledgment responses fed back by the second device, as counted by the third device within the first time period, and the cumulative number of timeouts $a_{31}$ of acknowledgment responses fed back by the first device, as counted by the third device within the first time period;
    the first condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} > 0$; and
    the second condition is: $a_{12} > 0$, $a_{21} > 0$, and $a_{32} + a_{23} = 0$.
  17. The apparatus according to any one of claims 10 to 16, characterized in that:
    the processing module is further configured to select the third device from two or more devices, wherein the third device is the device from which the first piece of historical heartbeat data is received through the data synchronization module when the first device sends a request for historical heartbeat data to each of the two or more devices.
  18. The apparatus according to any one of claims 10 to 17, characterized in that the apparatus further comprises:
    a heartbeat detection module, configured to periodically send heartbeat data packets to the second device and the third device in the network before the first device detects that the acknowledgment response fed back by the second device has timed out; and
    a sampling module, configured to receive the acknowledgment responses fed back by the second device and the third device, respectively, in response to the heartbeat data packets, and to count, within the first time period, the cumulative number of timeouts of acknowledgment responses fed back by the second device and the cumulative number of timeouts of acknowledgment responses fed back by the third device.
  19. A communication device, comprising a processor and a memory, the processor being coupled to the memory, characterized in that:
    the memory is configured to store computer program instructions; and
    the processor is configured to execute the instructions stored in the memory, so that the communication device performs the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions, and when the computer program instructions are run, the method according to any one of claims 1 to 9 is implemented.
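
As an informal illustration of the third-device selection recited in claims 8 and 17, the sketch below requests historical heartbeat data from several candidates and keeps whichever answers first. The function names `fetch_history` and `pick_third_device`, the simulated delays, and the use of `asyncio` are all assumptions made for this example; the application does not specify a transport or concurrency model.

```python
import asyncio
from typing import Dict, Tuple


async def fetch_history(device: str, delay: float) -> Tuple[str, dict]:
    """Hypothetical stand-in for a history request; `delay` simulates latency."""
    await asyncio.sleep(delay)
    return device, {"source": device}


async def pick_third_device(candidates: Dict[str, float]) -> str:
    """Return the candidate whose historical heartbeat data arrives first."""
    tasks = [asyncio.create_task(fetch_history(name, delay))
             for name, delay in candidates.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Slower candidates are no longer needed once one has answered.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    device, _history = next(iter(done)).result()
    return device


# device3 is simulated as the faster responder, so it is selected.
print(asyncio.run(pick_third_device({"device3": 0.01, "device4": 0.3})))
```

Picking the first responder favors a peer with a healthy, low-latency path to the first device, which is one plausible reading of why the claims select it this way.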
PCT/CN2021/114169 2020-10-13 2021-08-24 Device detection method and apparatus, and communication device WO2022078070A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011093314.2 2020-10-13
CN202011093314.2A CN114422412B (en) 2020-10-13 2020-10-13 Equipment detection method and device and communication equipment

Publications (1)

Publication Number Publication Date
WO2022078070A1 true WO2022078070A1 (en) 2022-04-21

Family

ID=81208917

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/114169 WO2022078070A1 (en) 2020-10-13 2021-08-24 Device detection method and apparatus, and communication device

Country Status (2)

Country Link
CN (1) CN114422412B (en)
WO (1) WO2022078070A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115983393A (en) * 2022-12-30 2023-04-18 北京百度网讯科技有限公司 Quantum circuit task timeout reason determining method, device, equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117424838A (en) * 2023-10-31 2024-01-19 北京中瑞浩航科技有限公司 Self-learning detection method for Internet of things equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120008506A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Detecting intermittent network link failures
CN108964977A (en) * 2018-06-05 2018-12-07 平安科技(深圳)有限公司 Node abnormality eliminating method and system, storage medium and electronic equipment
CN109887125A (en) * 2019-02-02 2019-06-14 北京主线科技有限公司 Fault detection method and device
US20190235939A1 (en) * 2018-01-26 2019-08-01 International Business Machines Corporation Heartbeat failure detection
CN110224880A (en) * 2018-03-01 2019-09-10 华为技术有限公司 A kind of heartbeat inspecting method and monitoring device

Also Published As

Publication number Publication date
CN114422412A (en) 2022-04-29
CN114422412B (en) 2023-11-17

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879116

Country of ref document: EP

Kind code of ref document: A1