CN115361434A - Multipath heartbeat detection method under high load condition of distributed system - Google Patents

Info

Publication number
CN115361434A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210930712.8A
Other languages
Chinese (zh)
Inventor
李永进
贾宗秀
刘尧
蒋旭
姬涛涛
赵冬伟
张昕尧
朱亚楠
吴嵩
周勇亮
刘勇生
桑国彪
乐承予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA TECHNOLOGY CO LTD
Original Assignee
TIANJIN SHENZHOU GENERAL DATA TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-08-04
Filing date
2022-08-04
Publication date
2022-11-18
Application filed by TIANJIN SHENZHOU GENERAL DATA TECHNOLOGY CO LTD
Priority to CN202210930712.8A
Publication of CN115361434A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/143 Termination or inactivation of sessions, e.g. event-controlled end of session
    • H04L67/145 Termination or inactivation of sessions, e.g. event-controlled end of session avoiding end of session, e.g. keep-alive, heartbeats, resumption message or wake-up for inactive or interrupted session
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing

Abstract

The invention relates to a multipath heartbeat detection method for a distributed system under high load. The method comprises three detection paths that run simultaneously: an independent heartbeat detection method, a service function heartbeat detection method, and a peer active response heartbeat detection method. If any one of the paths reports a normal result, the heartbeat detection result of the current round is normal. The method combines high-frequency detection, long timeouts, and multipath detection. Through independent heartbeat detection, service function detection, peer active response, and dynamic identification of timeout events, it can identify a heartbeat fault within a short time, avoids misjudging heartbeat faults when system pressure is high, and improves the stability and availability of the database.

Description

Multipath heartbeat detection method under high load condition of distributed system
Technical Field
The invention belongs to the technical field of databases, and particularly relates to a multipath heartbeat detection method under a high-load condition of a distributed system.
Background
An MPP (massively parallel processing) database is a distributed database composed of multiple shared-nothing nodes. Heartbeat detection is the mechanism used to discover and identify the state of those nodes. As computer technology develops, the performance requirements placed on databases keep rising, and using system resources efficiently while pushing a system toward its performance limit is a long-term trend. For a high-performance system, an important problem is therefore how an MPP database under extreme pressure can tell performance fluctuation apart from a genuine fault: it must detect faults quickly while recognizing responses that are merely delayed by system pressure.
Existing heartbeat detection techniques usually use TCP or UDP to send specially agreed data to a peer computer at a specified IP address, and judge from the returned value whether the network and the peer computer are working normally. This approach works when system pressure is low. When system pressure is high, however, timeouts easily occur even though nothing has failed, so the system is misjudged as faulty and normal operation is disrupted. Setting the timeout to a longer interval avoids the misjudgment but makes it hard to discover a real fault within a short time.
In summary, how to detect system faults promptly while avoiding misjudgment under high load in an MPP database system is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multipath heartbeat detection method for a distributed system under high load that is reasonably designed, detects system faults quickly, and avoids misjudgment.
The invention solves the technical problem by adopting the following technical scheme:
A multipath heartbeat detection method for a distributed system under high load comprises an independent heartbeat detection method, a service function heartbeat detection method, and a peer active response heartbeat detection method that run simultaneously; if the result of any one of them is normal, the heartbeat detection result of the current round is normal.
The independent heartbeat detection method is as follows: the local system sends a network data packet to the peer computer at intervals of t1 to detect whether the peer computer and the local system are normal. If sending the network data packet takes longer than t2, the local system sends special data using OOB (out-of-band data) to probe the network and checks whether the OOB probe succeeds. If the local system waits longer than t22 for the peer computer's response packet, it likewise sends special data using OOB to probe the network and checks whether the OOB probe succeeds. If an exception or error occurs during detection, the system is considered abnormal. Detection is considered successful only if the local system receives the peer computer's response data within t3 and the response passes verification.
The service function heartbeat detection method is as follows: the local system records the time of every communication with the peer computer that carries business activity. If the interval between the current time and the most recently recorded communication time is within t3, the system is considered normal. If the interval exceeds t3, the peer computer is considered possibly abnormal; the states of the other detection paths to the peer computer are then checked, and if all detection paths are abnormal, the system is considered abnormal.
The peer active response heartbeat detection method is as follows: the peer computer sends system state information to the local system at intervals of t1. If the local system receives this information within t1 and it passes verification, the peer computer and the local system are considered normal; otherwise, the system is considered abnormal.
Here, t1 is the heartbeat initiation interval, t2 is the maximum time allowed from starting to send data until sending completes, t22 is the maximum time to wait for data from the peer, and t3 is the final maximum timeout.
Further, the independent heartbeat detection method is implemented as follows:
First, the local system sends a heartbeat detection data packet to the peer computer at intervals of t1. After receiving the packet, the peer computer verifies its content and sequence number. If the data violates the agreed rule, the peer treats this as a system exception and sends an abnormal state to the local system; if the data conforms to the agreed rule, the peer considers the system normal and sends a normal state to the local system. The local system receives the normal-state data and verifies it against the agreed rule; if verification passes, the path from the peer computer to the local system is considered normal. Otherwise, the peer computer and the local system are considered to be in an abnormal state.
If sending data from the local system times out after t2, OOB is used to send special data to probe the network. If the probe succeeds within t21, the local system checks whether the elapsed time of the current detection exceeds t3: if it does, the current detection is considered abnormal; if it does not, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded. If the OOB probe fails, an abnormality is considered to have occurred.
If the local system has not received the peer computer's response after t22, OOB is used to send special data to probe the network. If the probe succeeds within t31, the local system checks whether the elapsed time of the current detection exceeds t3: if it does, the current detection is considered abnormal; if it does not, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded. If the OOB probe fails, an abnormality is considered to have occurred.
Finally, after t1 has elapsed, a new round of detection starts. If the previous round has not finished, its saved environment continues to be used and detection resumes from the last executed step; if the previous round ended with a normal or abnormal result, detection restarts from the beginning.
During independent heartbeat detection, if an exception or error occurs, the system is considered abnormal.
Here, t21 is the maximum waiting time for the OOB probe after a send timeout, and t31 is the maximum waiting time for the OOB probe after a receive timeout.
Further, t1 is set to a very short time, and the other times are set to longer values or the timeout is set dynamically.
Further, the timeout is set dynamically as follows: the time consumed by the whole detection process in each of several past detections is saved, a fitted value is computed from these samples, and the fitted value is multiplied by a coefficient K to obtain the final maximum timeout t3.
The invention has the advantages and positive effects that:
The heartbeat detection method is reasonably designed and combines high-frequency detection, long timeouts, and multipath detection. Through independent heartbeat detection, service function detection, peer active response, and dynamic identification of timeout events, it can identify a heartbeat fault within a short time, avoids misjudging heartbeat faults when system pressure is high, and improves the stability and availability of the database.
Drawings
FIG. 1 is a flow chart of the detection of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail with reference to the accompanying drawings.
The invention provides a multipath heartbeat detection method for a distributed system under high load. As shown in FIG. 1, it comprises three detection paths that run simultaneously: an independent heartbeat detection method, a service function heartbeat detection method, and a peer active response heartbeat detection method. If the detection result of any one path is normal, the heartbeat detection result of the current round is considered normal.
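As a non-limiting illustration, the aggregation rule for one round can be sketched in Python as follows. The names `PathResult` and `round_result`, and the `PENDING` state for a path that has not yet produced a result in the current round, are assumptions introduced for this sketch and do not appear in the original disclosure.

```python
from enum import Enum


class PathResult(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    PENDING = "pending"  # assumption: the path has not finished this round


def round_result(independent, service, peer_push):
    """Combine the three path results into one result for the round."""
    results = (independent, service, peer_push)
    # Any single normal path is enough for the round to be normal.
    if any(r is PathResult.NORMAL for r in results):
        return PathResult.NORMAL
    # The system is treated as abnormal only when every path is abnormal.
    if all(r is PathResult.ABNORMAL for r in results):
        return PathResult.ABNORMAL
    # Otherwise at least one path is still undecided.
    return PathResult.PENDING
```

Treating the round as abnormal only when all paths agree is what lets a slow but healthy node survive a single delayed heartbeat.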
For convenience of explanation, during the heartbeat detection, the following time parameters are defined:
t1 is the heartbeat initiation interval, that is, a heartbeat is triggered once every t1. For example, t1 = 10 seconds means a heartbeat is triggered every 10 seconds. t1 is greater than or equal to 0.
t2 is the maximum time allowed from starting to send data until sending completes. t2 is greater than or equal to 0.
t21 is the maximum waiting time for the OOB probe after a send timeout. t21 is greater than or equal to 0.
t22 is the maximum time to wait for data from the peer. t22 is greater than or equal to 0.
t31 is the maximum waiting time for the OOB probe after a receive timeout. t31 is greater than or equal to 0.
t3 is the final maximum timeout; t3 must be greater than (t2 + t21 + t22 + t31).
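The constraints on these time parameters can be checked mechanically. The following Python sketch is illustrative only: the class name `HeartbeatTimers`, the use of seconds as the unit, and the example values in the usage note are assumptions, while the two validation rules are the ones stated in the parameter definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatTimers:
    """Time parameters of the detection method, assumed to be in seconds."""
    t1: float   # heartbeat initiation interval
    t2: float   # maximum time for sending data to complete
    t21: float  # maximum wait for the OOB probe after a send timeout
    t22: float  # maximum wait for data from the peer
    t31: float  # maximum wait for the OOB probe after a receive timeout
    t3: float   # final maximum timeout

    def __post_init__(self):
        # Every parameter except t3 only has to be non-negative.
        for name in ("t1", "t2", "t21", "t22", "t31"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be >= 0")
        # t3 must be strictly greater than the sum of the stage timeouts.
        if self.t3 <= self.t2 + self.t21 + self.t22 + self.t31:
            raise ValueError("t3 must be greater than t2 + t21 + t22 + t31")
```

For example, `HeartbeatTimers(t1=1, t2=5, t21=5, t22=20, t31=20, t3=300)` is accepted, whereas the same values with `t3=50` are rejected because 50 is not greater than 5 + 5 + 20 + 20.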
1. Independent heartbeat detection method.
First, the local system sends a network data packet to the peer computer at intervals of t1. The packet is a heartbeat detection data packet whose content follows the convention agreed for the current round of detection. After receiving the packet, the peer computer verifies its content and sequence number. If the data violates the agreed rule, the peer treats this as a system exception and sends an abnormal state to the local system, and the local system decides its next action according to that abnormal state. If the data received by the peer computer conforms to the agreed rule, the peer considers the system normal and sends a normal state to the local system. The local system receives the normal-state data and verifies it against the agreed rule; if verification passes, the path from the peer computer to the local system is considered normal. Otherwise, the peer computer and the local system are considered to be in an abnormal state.
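The disclosure does not specify the packet layout or the agreed rule. Purely as a hypothetical example, a probe packet could carry a magic tag, a round identifier, a sequence number, and a CRC32 checksum, so that the receiver can verify both content and sequence number. The field layout, the `HBT1` tag, and the function names below are all assumptions made for this sketch.

```python
import struct
import zlib

# Hypothetical layout: 4-byte magic, 32-bit round id, 64-bit sequence number,
# all in network byte order, followed by a CRC32 of those 16 bytes.
_HEADER = struct.Struct("!4sIQ")
_CRC = struct.Struct("!I")
MAGIC = b"HBT1"


def encode_probe(round_id: int, seq: int) -> bytes:
    """Build a heartbeat probe packet for the given round and sequence number."""
    header = _HEADER.pack(MAGIC, round_id, seq)
    return header + _CRC.pack(zlib.crc32(header))


def verify_probe(packet: bytes, expected_round: int, expected_seq: int) -> bool:
    """Check length, checksum, magic tag, round id, and sequence number."""
    if len(packet) != _HEADER.size + _CRC.size:
        return False
    header = packet[:_HEADER.size]
    (crc,) = _CRC.unpack(packet[_HEADER.size:])
    if zlib.crc32(header) != crc:
        return False
    magic, round_id, seq = _HEADER.unpack(header)
    return magic == MAGIC and round_id == expected_round and seq == expected_seq
```

A failed `verify_probe` on the peer corresponds to the case in which the data violates the agreed rule and an abnormal state is returned to the local system.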
If sending data from the local system times out after t2, OOB (out-of-band data) is used to send special data to probe the network. If the OOB probe succeeds within t21, the local system checks whether the elapsed time of the current detection exceeds t3: if it does, the current detection is considered abnormal; if it does not, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded. If the OOB probe fails, an abnormality is considered to have occurred.
If the local system has not received the peer computer's response after t22, OOB (out-of-band data) is used to send special data to probe the network. If the probe succeeds within t31, the local system checks whether the elapsed time of the current detection exceeds t3: if it does, the current detection is considered abnormal; if it does not, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded. If the OOB probe fails, an abnormality is considered to have occurred.
Then, after t1 has elapsed, a new round of detection is initiated. If the previous round has not finished, its saved environment continues to be used and detection resumes from the last executed step. If the previous round ended with a normal or abnormal result, detection restarts from the beginning.
During independent heartbeat detection, if an exception or error occurs, the system is considered abnormal.
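The decision logic of the independent path after a stage timeout can be summarized in a short Python sketch. It is a simplified model under stated assumptions: the OOB probe itself is not implemented and is represented by the boolean `oob_probe_ok` (true when the probe succeeded within t21 or t31), `elapsed` is the total time spent on the current detection including earlier unfinished rounds, and the names `Outcome`, `on_stage_timeout`, and `on_reply` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    UNFINISHED = "unfinished"  # round ends early; context is kept for the next round


def on_stage_timeout(step: str, oob_probe_ok: bool, elapsed: float,
                     t3: float) -> Tuple[Outcome, Optional[str]]:
    """Handle a send timeout (t2) or a receive timeout (t22) after the OOB probe.

    Returns the outcome and, when the round is unfinished, the step to resume from.
    """
    if not oob_probe_ok:
        return Outcome.ABNORMAL, None   # the OOB probe failed
    if elapsed > t3:
        return Outcome.ABNORMAL, None   # the final timeout has been exceeded
    return Outcome.UNFINISHED, step     # save the step and resume after t1


def on_reply(verified: bool, elapsed: float, t3: float) -> Outcome:
    """A reply counts as success only if it is verified and arrives within t3."""
    if verified and elapsed <= t3:
        return Outcome.NORMAL
    return Outcome.ABNORMAL
```

In this model, a caller that receives `(Outcome.UNFINISHED, "receive")` would store that step and, at the next t1 tick, continue waiting for the reply instead of sending a new packet.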
2. Service function heartbeat detection method.
This detection method uses the service function itself as a heartbeat. Business activity between the local system and the peer computer can be used to determine the state of the peer computer. The local system records the time of every communication with the peer computer that carries business activity. If the interval between the current time and the most recently recorded communication time is within t3, the system is considered normal. If the interval exceeds t3, the peer computer is considered possibly abnormal. In that case, the other connection states with the peer computer are checked, and if all connections are abnormal, the system is considered abnormal.
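A minimal Python sketch of this path is given below for illustration. The class name `BusinessActivityMonitor`, the injectable `clock` parameter, and the choice to report "not normal" before any business activity has been recorded are assumptions of the sketch; the comparison against t3 follows the description.

```python
import time


class BusinessActivityMonitor:
    """Tracks the last business communication with a peer (times in seconds)."""

    def __init__(self, t3, clock=time.monotonic):
        self._t3 = t3
        self._clock = clock
        self._last_activity = None  # no business traffic seen yet

    def record_activity(self):
        """Call whenever business data is exchanged with the peer."""
        self._last_activity = self._clock()

    def is_normal(self):
        """True if the last business communication is at most t3 old.

        False only means "possibly abnormal": the other paths must be
        checked before the system is declared abnormal.
        """
        if self._last_activity is None:
            return False
        return self._clock() - self._last_activity <= self._t3
```

Because this path piggybacks on traffic that already exists, it adds no extra load to a system that is busy, which is exactly when the dedicated heartbeat is most likely to be delayed.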
3. Peer active response heartbeat detection method.
In this detection method, the heartbeat is initiated by the peer computer. The peer computer sends system state information to the local system at intervals of t1. If the local system receives this information within t1 and it passes verification, the peer computer and the local system are considered normal. Otherwise, the system is considered abnormal.
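On the receiving side, this path behaves like a watchdog that must be refreshed at least once every t1. The Python sketch below is illustrative only: the class name `PeerPushWatchdog`, the injectable `clock`, and the decision to start the first deadline at construction time are assumptions, and message verification is reduced to a boolean supplied by the caller.

```python
import time


class PeerPushWatchdog:
    """Expects a verified status message from the peer within every t1 seconds."""

    def __init__(self, t1, clock=time.monotonic):
        self._t1 = t1
        self._clock = clock
        self._deadline = clock() + t1  # assumption: first deadline starts now

    def on_status(self, verified: bool) -> None:
        """Handle a status message; only verified messages extend the deadline."""
        if verified:
            self._deadline = self._clock() + self._t1

    def is_normal(self) -> bool:
        """True while the current deadline has not passed."""
        return self._clock() <= self._deadline
```

A message that fails verification leaves the deadline unchanged, so it is treated the same way as a message that never arrived.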
When all three heartbeat detection paths report an abnormality for the local system, the system is considered abnormal.
In the present invention, the interval t1 may be set to a short time, for example on the order of seconds, while the system timeout may be set to a longer time, for example hundreds of seconds, or may be set dynamically.
In the present invention, the timeout is set dynamically as follows: the time consumed by the whole detection process in each of several past detections is saved, a fitted value is computed from these samples, and the fitted value is multiplied by a coefficient K to obtain the final maximum timeout t3, where K is a numeric coefficient.
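The disclosure does not state how the fitted value is computed. The Python sketch below assumes, for illustration only, that the fitted value is the arithmetic mean of a sliding window of recent detection durations. The class name `DynamicTimeout`, the window size, and the `floor` parameter (which could be set just above t2 + t21 + t22 + t31 to keep the t3 constraint satisfied) are likewise assumptions.

```python
from collections import deque
from statistics import fmean


class DynamicTimeout:
    """Derives t3 from recent detection durations (all values in seconds)."""

    def __init__(self, k, window=20, floor=0.0):
        self._k = k                           # coefficient K from the description
        self._samples = deque(maxlen=window)  # most recent detection durations
        self._floor = floor                   # assumed lower bound for t3

    def record(self, duration):
        """Save the time consumed by one complete detection."""
        self._samples.append(duration)

    def t3(self):
        """Return K times the fitted value, never below the floor."""
        if not self._samples:
            return self._floor
        # Assumption: the fitted value is the mean of the saved samples.
        return max(self._floor, self._k * fmean(self._samples))
```

With K = 3 and recorded durations of 1, 2, and 3 seconds, the mean is 2 seconds and the resulting t3 is 6 seconds. Other fitting choices, such as a high percentile or an exponentially weighted average, would fit the same description.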
It should be emphasized that the embodiments described herein are illustrative and not restrictive, and thus the present invention includes embodiments not limited to the embodiments described herein, and that other embodiments derived from the teachings of the present invention by those skilled in the art are also within the scope of the present invention.

Claims (4)

1. A multipath heartbeat detection method for a distributed system under high load, characterized in that: the method comprises an independent heartbeat detection method, a service function heartbeat detection method, and a peer active response heartbeat detection method that run simultaneously, wherein if the result of any one of them is normal, the heartbeat detection result of the current round is normal;
the independent heartbeat detection method comprises: the local system sends a network data packet to the peer computer at intervals of t1 to detect whether the peer computer and the local system are normal; if sending the network data packet takes longer than t2, the local system sends special data using OOB to probe the network and checks whether the OOB probe succeeds; if the local system waits longer than t22 for the peer computer's response packet, the local system sends special data using OOB to probe the network and checks whether the OOB probe succeeds; if an exception or error occurs during detection, the system is considered abnormal; detection is considered successful only if the local system receives the peer computer's response data within t3 and the response passes verification;
the service function heartbeat detection method comprises: the local system records the time of every communication with the peer computer that carries business activity; if the interval between the current time and the recorded communication time is within t3, the system is considered normal; if the interval exceeds t3, the peer computer is considered possibly abnormal, the states of the other detection paths to the peer computer are checked, and if all detection paths are abnormal, the system is considered abnormal;
the peer active response heartbeat detection method comprises: the peer computer sends system state information to the local system at intervals of t1; if the local system receives the information within t1 and it passes verification, the peer computer and the local system are considered normal; otherwise, the system is considered abnormal;
wherein t1 is the heartbeat initiation interval, t2 is the maximum time allowed from starting to send data until sending completes, t22 is the maximum time to wait for data from the peer, and t3 is the final maximum timeout.
2. The multipath heartbeat detection method for a distributed system under high load according to claim 1, wherein the independent heartbeat detection method is implemented as follows:
first, the local system sends a heartbeat detection data packet to the peer computer at intervals of t1, and the peer computer, after receiving the packet, verifies its content and sequence number; if the data violates the agreed rule, this is treated as a system exception and an abnormal state is sent to the local system; if the data conforms to the agreed rule, the system is considered normal and a normal state is sent to the local system; the local system receives the normal-state data and verifies it against the agreed rule, and if verification passes, the path from the peer computer to the local system is considered normal; otherwise, the peer computer and the local system are considered to be in an abnormal state;
if sending data from the local system times out after t2, OOB is used to send special data to probe the network; if the probe succeeds within t21, whether the elapsed time of the current detection exceeds t3 is checked; if it exceeds t3, the current detection is considered abnormal; if it does not exceed t3, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded; if the OOB probe fails, an abnormality is considered to have occurred;
if the local system has not received the peer computer's response after t22, OOB is used to send special data to probe the network; if the probe succeeds within t31, whether the elapsed time of the current detection exceeds t3 is checked; if it exceeds t3, the current detection is considered abnormal; if it does not exceed t3, the current round ends, the environment used by the detection is saved, and the step currently being executed is recorded; if the OOB probe fails, an abnormality is considered to have occurred;
finally, after t1 has elapsed, a new round of detection starts; if the previous round has not finished, its saved environment continues to be used and detection resumes from the last executed step; if the previous round ended with a normal or abnormal result, detection restarts from the beginning;
during independent heartbeat detection, if an exception or error occurs, the system is considered abnormal;
wherein t21 is the maximum waiting time for the OOB probe after a send timeout, and t31 is the maximum waiting time for the OOB probe after a receive timeout.
3. The multipath heartbeat detection method for a distributed system under high load according to claim 1 or 2, wherein t1 is set to a very short time, and the other times are set to longer values or the timeout is set dynamically.
4. The multipath heartbeat detection method for a distributed system under high load according to claim 3, wherein the timeout is set dynamically as follows: the time consumed by the whole detection process in each of several past detections is saved, a fitted value is computed from these samples, and the fitted value is multiplied by a coefficient K to obtain the final maximum timeout t3.
CN202210930712.8A 2022-08-04 2022-08-04 Multipath heartbeat detection method under high load condition of distributed system Pending CN115361434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210930712.8A CN115361434A (en) 2022-08-04 2022-08-04 Multipath heartbeat detection method under high load condition of distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210930712.8A CN115361434A (en) 2022-08-04 2022-08-04 Multipath heartbeat detection method under high load condition of distributed system

Publications (1)

Publication Number Publication Date
CN115361434A (en) 2022-11-18

Family

ID=84033784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210930712.8A Pending CN115361434A (en) 2022-08-04 2022-08-04 Multipath heartbeat detection method under high load condition of distributed system

Country Status (1)

Country Link
CN (1) CN115361434A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797323A (en) * 2023-08-21 2023-09-22 北京嗨飞科技有限公司 Data processing method, device and equipment
CN116797323B (en) * 2023-08-21 2023-11-14 北京嗨飞科技有限公司 Data processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination