CN117255001A - Barrier removing method and device for service node, load balancer and readable storage medium - Google Patents

Barrier removing method and device for service node, load balancer and readable storage medium

Info

Publication number
CN117255001A
Authority
CN
China
Prior art keywords
service node
end service
state
fault
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311317985.6A
Other languages
Chinese (zh)
Inventor
于文超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202311317985.6A priority Critical patent/CN117255001A/en
Publication of CN117255001A publication Critical patent/CN117255001A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer And Data Communications (AREA)

Abstract

An embodiment of the invention provides a fault removal method and apparatus for a service node, a load balancer, and a readable storage medium. The method is applied to a load balancer in a load balancing system, where the load balancing system comprises the load balancer and a plurality of back-end service nodes each connected to the load balancer. The method comprises: in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a fault state and stopping the scheduling of service requests to that back-end service node. The method can ensure the accuracy of faulty-node detection, remove faulty back-end service nodes from the system in a timely manner, avoid intruding on back-end services, and reduce the overhead of fault removal.

Description

Barrier removing method and device for service node, load balancer and readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular to a fault removal method and apparatus for a service node, a load balancer, and a readable storage medium.
Background
A network load balancer (Load Balancer) is a network software device deployed in front of back-end service nodes; it receives user requests and forwards them to different back-end service nodes according to a scheduling algorithm policy. By using a load balancer, a system can distribute load among different service nodes, thereby reducing the pressure on any single service node, lowering its risk of failure, and improving its response speed. Fig. 1 is a schematic diagram of the basic working principle of a load balancer. Referring to Fig. 1, the load balancer is deployed in front of back-end service nodes a1, a2 and a3, and schedules each user request to back-end service node a1, a2 or a3 according to the configured scheduling algorithm policy, so that the load is shared among back-end service nodes a1, a2 and a3.
A network load balancer may operate at different layers of the network protocol stack, for example transport-layer (four-layer) load balancing and application-layer (seven-layer) load balancing. Four-layer load balancing distributes load mainly based on IP (Internet Protocol) addresses and port numbers, while seven-layer load balancing distributes load more intelligently according to the content of application-layer protocols such as HTTP (HyperText Transfer Protocol).
Health checking is one of the important functions of a load balancer. Its purpose is to discover and remove abnormal back-end service nodes in time, so that user requests are not forwarded to failed back-end service nodes. In a typical implementation, the load balancer actively sends probe requests to the back-end service nodes at a fixed interval, for example a Ping (Packet Internet Groper) probe, a TCP (Transmission Control Protocol) handshake probe, or an HTTP/HTTPS (Hypertext Transfer Protocol Secure) probe, to determine whether each back-end service node is operating normally. This mode is called active health checking. Current mainstream load balancers, such as LVS (Linux Virtual Server), HAProxy (High Availability Proxy), and Nginx, all employ active health checks. LVS is a four-layer load balancer implemented in the Linux kernel; it has no health check function of its own and instead relies on a user-mode tool such as keepalived (a liveness detection tool) or seesaw (a load balancing system). Active health checking mainly has the following problems:
First, probe requests for health checks are generally simple and often differ greatly from real traffic in request format and network path, which may make the probe results inaccurate.
Second, a probe request is a synthetic request constructed by the health check program and must be sent to the back-end service nodes periodically and frequently, which can interfere with their normal operation to some extent.
Third, a large cluster may contain tens of thousands of back-end service nodes. In that case the health check program must periodically send a probe request to every back-end service node, wait for it to respond or time out, collect the probe result, and remove or restore the node accordingly. This not only consumes a large amount of server resources but also increases probing latency, so failed back-end service nodes cannot be discovered and removed in time.
Disclosure of Invention
Embodiments of the invention aim to provide a fault removal method and apparatus for a service node, a load balancer, and a readable storage medium, so as to ensure the accuracy of faulty-node detection, remove faulty back-end service nodes from the system in a timely manner, avoid intruding on back-end services, and reduce the overhead of fault removal. The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a fault removal method for a service node, applied to a load balancer in a load balancing system, where the load balancing system includes the load balancer and a plurality of back-end service nodes each connected to the load balancer, and the method includes:
in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a fault state, and stopping the scheduling of service requests to the back-end service node; wherein the connection state abnormal event includes: receiving response information, returned by the back-end service node, that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent.
Optionally, the step of determining that the back-end service node is in the fault state if a connection state abnormal event is detected for any one of the back-end service nodes includes:
when a connection state abnormal event is detected for any back-end service node and the back-end service node is not already in a state to be confirmed, determining that the back-end service node is in the state to be confirmed, and judging whether the number of connection state abnormal events detected for the back-end service node while it is in the state to be confirmed is not less than a maximum number of abnormal events; wherein the duration of the state to be confirmed is a preset state confirmation duration;
if yes, determining that the back-end service node is in a fault state;
if not, determining that the back-end service node is in a normal state.
Optionally, the method further comprises:
after a fault suppression duration has elapsed since any back-end service node was determined to be in a fault state, resuming the scheduling of service requests to the back-end service node.
Optionally, the method further comprises:
for a back-end service node that has been determined to be in a fault state, after the scheduling of service requests to the back-end service node has resumed, if a connection state normal event is detected for the back-end service node, determining that the back-end service node is in a normal state, and, each time a connection state normal event is detected for the back-end service node, shortening the fault suppression duration configured for the back-end service node by a first preset proportion if a preset shortening condition is met; wherein the connection state normal event includes: receiving response information, returned by the back-end service node, that indicates a normal connection state; and the preset shortening condition includes: the shortened fault suppression duration is not less than a preset minimum duration.
Optionally, the method further comprises:
each time any back-end service node is determined to be in a fault state, extending the fault suppression duration configured for the back-end service node by a second preset proportion if a preset extension condition is met; wherein the preset extension condition includes: the extended fault suppression duration is not greater than a preset maximum duration.
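The shortening and extension rules above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the class name `SuppressionTimer`, the concrete durations, and the reading of the two "preset proportions" as multiplicative factors (0.5 and 2.0) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class SuppressionTimer:
    """Per-node fault suppression duration in seconds (all values hypothetical)."""
    duration: float = 10.0       # current fault suppression duration
    min_duration: float = 5.0    # preset minimum duration
    max_duration: float = 80.0   # preset maximum duration
    shorten_ratio: float = 0.5   # "first preset proportion" (assumed multiplicative)
    extend_ratio: float = 2.0    # "second preset proportion" (assumed multiplicative)

    def on_fault(self) -> float:
        """Extend the duration only if the result does not exceed the maximum."""
        extended = self.duration * self.extend_ratio
        if extended <= self.max_duration:
            self.duration = extended
        return self.duration

    def on_normal(self) -> float:
        """Shorten the duration only if the result does not drop below the minimum."""
        shortened = self.duration * self.shorten_ratio
        if shortened >= self.min_duration:
            self.duration = shortened
        return self.duration
```

Under these assumptions, a node that fails repeatedly is kept out of rotation for progressively longer periods, while a node that keeps answering normally after recovery drifts back toward the minimum suppression duration.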
Optionally, in the case of performing back-end service node fault removal based on the TCP protocol, the connection state abnormal event includes any one or more of the following:
the connection state at the load balancer changes from the SYN-RECV state or the SYN-SENT state to the CLOSE state;
the load balancer times out in the SYN-RECV state or the SYN-SENT state;
an RST packet returned by the back-end service node is received;
in the case of performing back-end service node fault removal based on the UDP protocol, the connection state abnormal event includes any one or more of the following:
when the UDP connection between the load balancer and the back-end service node carries bidirectional data, no response information returned by the back-end service node is received within a preset waiting time after the service request is sent;
an ICMP error message of the host-unreachable type or the port-unreachable type returned by the back-end service node is received;
in the case of performing back-end service node fault removal based on the HTTP/HTTPS protocol, the connection state abnormal event includes any one or more of the following:
an error HTTP response status code returned by the back-end service node is received;
no response information returned by the back-end service node is received within a preset waiting time after the service request is sent.
Optionally, in the case of performing back-end service node fault removal based on the TCP protocol, the connection state normal event includes any one or more of the following:
the connection state at the load balancer enters the Established state;
a SYN/ACK packet returned by the back-end service node is received;
in the case of performing back-end service node fault removal based on the UDP protocol, the connection state normal event includes:
response information returned by the back-end service node is received after the service request is sent;
in the case of performing back-end service node fault removal based on the HTTP/HTTPS protocol, the connection state normal event includes:
a normal HTTP response status code returned by the back-end service node is received.
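The per-protocol abnormal and normal events listed above could be classified along the following lines. This is a hedged sketch: the function names, the string state labels, and in particular the choice to treat 5xx status codes as "error" and 2xx/3xx as "normal" are assumptions made for illustration, since the text does not fix which status codes count as errors.

```python
from enum import Enum
from typing import Optional


class Event(Enum):
    ABNORMAL = "abnormal"
    NORMAL = "normal"


HALF_OPEN = {"SYN-RECV", "SYN-SENT"}  # handshake not yet completed


def classify_tcp(prev_state: str, new_state: str, rst: bool = False,
                 timed_out: bool = False, syn_ack: bool = False) -> Optional[Event]:
    """Map a TCP connection transition to an event; None means 'no signal'."""
    if rst:
        return Event.ABNORMAL
    if prev_state in HALF_OPEN and (timed_out or new_state == "CLOSE"):
        return Event.ABNORMAL
    if syn_ack or new_state == "ESTABLISHED":
        return Event.NORMAL
    return None


def classify_udp(bidirectional: bool, got_response: bool,
                 icmp_unreachable: bool = False) -> Optional[Event]:
    """got_response=False is taken to mean the preset waiting time has elapsed."""
    if icmp_unreachable:
        return Event.ABNORMAL
    if got_response:
        return Event.NORMAL
    # A missing reply is only meaningful when replies are expected.
    return Event.ABNORMAL if bidirectional else None


def classify_http(status: Optional[int]) -> Optional[Event]:
    """status=None models a timeout; the 5xx / 2xx-3xx split is an assumption."""
    if status is None or 500 <= status <= 599:
        return Event.ABNORMAL
    if 200 <= status <= 399:
        return Event.NORMAL
    return None
```

Returning `None` for transitions that carry no health signal (for example an established connection closing normally) keeps ordinary traffic from being counted either for or against a node.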
In a second aspect of the present invention, there is also provided a load balancer, including a control unit and a plurality of forwarding units, each forwarding unit being in communication connection with the control unit;
the forwarding unit is configured to forward service requests to the back-end service nodes and, if a connection state abnormal event is detected for any back-end service node while forwarding service requests, to report error information for the back-end service node to the control unit; the connection state abnormal event includes: receiving response information, returned by the back-end service node, that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent;
the control unit is configured to, when error information reported by a forwarding unit is received for any back-end service node, determine that the back-end service node is in a fault state and issue a removal instruction for the back-end service node to all forwarding units;
the forwarding unit is further configured to stop forwarding service requests to a back-end service node after receiving a removal instruction for that back-end service node.
Optionally, the control unit is specifically configured to: when error information is received for a back-end service node that is not in a state to be confirmed, determine that the back-end service node is in the state to be confirmed, and judge whether the number of error messages reported by the forwarding units for the back-end service node while it is in the state to be confirmed is not less than a maximum number of abnormal events, where the duration of the state to be confirmed is a preset state confirmation duration; if so, determine that the back-end service node is in a fault state and issue a removal instruction for the back-end service node to all forwarding units; otherwise, determine that the back-end service node is in a normal state.
Optionally, the control unit is further configured to issue a recovery instruction for a back-end service node to all forwarding units once the fault suppression duration has elapsed after the back-end service node was determined to be in a fault state;
the forwarding unit is further configured to resume forwarding service requests to a back-end service node after receiving a recovery instruction for that back-end service node.
Optionally, the forwarding unit is further configured to, after resuming the forwarding of service requests to a back-end service node, report normal information for the back-end service node to the control unit if a connection state normal event is detected for it;
the control unit is further configured to, for a back-end service node determined to be in a fault state and after a recovery instruction has been issued for it, determine that the back-end service node has returned to a normal state upon receiving normal information reported by a forwarding unit for the back-end service node, and, each time such normal information is received, shorten the fault suppression duration configured for the back-end service node by a first preset proportion if a preset shortening condition is met; the preset shortening condition includes: the shortened fault suppression duration is not less than a preset minimum duration.
Optionally, after receiving a recovery instruction issued by the control unit for any back-end service node, the forwarding unit reports the normal information for that back-end service node to the control unit no more than a preset maximum number of reporting times.
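The division of labour between the control unit and the forwarding units might look like the sketch below. It models only the basic scheme (a single error report marks the node as faulty) and uses direct method calls in place of whatever inter-unit communication the load balancer actually uses; the class and method names are hypothetical.

```python
class ControlUnit:
    """Collects reports and broadcasts removal/recovery instructions."""

    def __init__(self):
        self.units = []      # all registered forwarding units
        self.faulty = set()  # nodes currently considered to be in a fault state

    def register(self, unit):
        self.units.append(unit)

    def report_error(self, node):
        # Basic scheme: one error report is enough to mark the node faulty.
        if node not in self.faulty:
            self.faulty.add(node)
            for unit in self.units:
                unit.remove(node)   # removal instruction to ALL forwarding units

    def recover(self, node):
        # Assumed to be called once the fault suppression duration has elapsed.
        self.faulty.discard(node)
        for unit in self.units:
            unit.restore(node)      # recovery instruction to ALL forwarding units


class ForwardingUnit:
    """Forwards requests and reports what it observes on real traffic."""

    def __init__(self, control, nodes):
        self.control = control
        self.available = set(nodes)
        control.register(self)

    def on_forward_result(self, node, ok):
        if not ok:
            self.control.report_error(node)

    def remove(self, node):
        self.available.discard(node)

    def restore(self, node):
        self.available.add(node)
```

Centralising the decision in the control unit means an error observed by one forwarding unit takes the node out of rotation on every forwarding unit, rather than each unit having to rediscover the fault on its own traffic.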
In a third aspect of the present invention, there is also provided a fault removal apparatus for a service node, applied to a load balancer in a load balancing system, where the load balancing system includes the load balancer and a plurality of back-end service nodes each connected to the load balancer, and the apparatus includes:
a determining unit, configured to, in the process of forwarding service requests to the back-end service nodes, determine that a back-end service node is in a fault state and stop the scheduling of service requests to it if a connection state abnormal event is detected for that back-end service node; wherein the connection state abnormal event includes: receiving response information, returned by the back-end service node, that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent.
In a fourth aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is configured to implement the fault removal method for a service node described above when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the fault removal method for a service node according to any one of the above.
In yet another aspect of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the fault removal method for a service node according to any one of the above.
In the fault removal method and apparatus for a service node, the load balancer, and the readable storage medium provided by the embodiments of the invention, while the load balancer forwards service requests to the back-end service nodes, whether a back-end service node is in a fault state is determined from its response to those service requests; specifically, if a connection state abnormal event is detected for any back-end service node during this process, that back-end service node is determined to be in a fault state. Faulty back-end service nodes are thus detected from real service requests, with no need to construct probe requests. This ensures the accuracy of faulty-node detection, allows faulty back-end service nodes to be removed from the system in a timely manner, does not intrude on back-end services, and keeps the overhead of fault removal low.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic diagram of the basic operating principle of a load balancer;
Fig. 2 is a flow chart of a fault removal method for a service node according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an operating principle of a load balancer according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a control unit according to an embodiment of the present invention;
Fig. 5 is a schematic state diagram of a forwarding unit according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a fault removal apparatus for a service node according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
To solve the problems of the existing active health check method, an embodiment of the invention provides a fault removal method for a service node, applied to a load balancer in a load balancing system, where the load balancing system includes the load balancer and a plurality of back-end service nodes each connected to the load balancer.
The structure of the load balancing system may be as shown in Fig. 1. The back-end service nodes connected behind the load balancer, shown as back-end service nodes a1-a3, provide the same or similar network services, and the load balancer distributes service requests among them according to a configured load balancing policy so as to serve users. To ensure service availability, the load balancer needs to monitor the back-end service nodes and promptly detect and remove abnormal ones, so that a back-end service node abnormality does not affect the service; this process is the health check of the back-end service nodes.
Fig. 2 is a flow chart of a fault removal method for a service node according to an embodiment of the present invention. Referring to Fig. 2, the method includes:
step S201: in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a fault state, and stopping the scheduling of service requests to the back-end service node; wherein the connection state abnormal event includes: receiving response information, returned by the back-end service node, that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent.
As described above, the load balancer is specifically configured to distribute the service request to the back-end service node, and in practical application, in order to ensure communication efficiency, the back-end service node generally responds to the received service request.
In the embodiment of the invention, the load balancer can judge the availability of a back-end service node from that node's response to service requests. Specifically, while the load balancer forwards service requests to a back-end service node, if the node's response to a service request constitutes a connection state abnormal event as defined in the embodiment of the invention, the back-end service node is considered to be in a fault state.
In the embodiment of the invention, the connection state abnormal event specifically includes: receiving response information, returned by the back-end service node, that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent.
In practical applications, after receiving a service request, a back-end service node may return different types of response information depending on how the request was actually processed, for example response information indicating that the service request was received successfully, or response information indicating that the service request could not be recognized. In the embodiment of the invention, when the response information returned by a back-end service node indicates that the connection state between the load balancer and that node is abnormal, the node is considered unable to establish a normal communication connection with the load balancer, and service requests distributed to it will not be processed; the back-end service node can therefore be considered to be in a fault state.
After receiving the response information returned by a back-end service node, the load balancer can judge, from specific fields carried in the response information, whether it indicates an abnormal connection state, and if so, consider the back-end service node to be in a fault state. Which types of response information indicate an abnormal connection state depends on the communication protocol used by the load balancer; in the embodiment of the invention, the types treated as indicating an abnormal connection state can be selected according to actual requirements.
In some application scenarios, after a back-end service node receives a service request forwarded by the load balancer, it returns response information to the load balancer as specified by the communication protocol. In this case, if the load balancer does not receive response information returned by the back-end service node within the preset waiting time after the service request is sent, the connection state between the load balancer and the back-end service node may be considered abnormal, and the back-end service node may therefore also be considered to be in a fault state.
In addition, under some communication protocols, for example TCP, the load balancer and the back-end service node must first establish a communication connection before service data can be transmitted. In this case, if no response information returned by the back-end service node is received within the preset waiting time after the connection request corresponding to the service request is sent, a normal communication connection cannot be established with the back-end service node, and the back-end service node may be considered to be in a fault state.
After determining that any back-end service node is in a fault state, the load balancer may stop scheduling service requests to that back-end service node. In practical applications, the load balancer is generally configured with an available node list and distributes service requests only to the back-end service nodes in that list. Stopping the scheduling of service requests to a back-end service node in a fault state can therefore be understood as removing that node from the available node list, which prevents user requests from being forwarded to the faulty back-end service node.
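A minimal sketch of removal from an available node list is shown below. The `Scheduler` class and the round-robin policy are assumptions chosen for illustration; the text only requires that some configured scheduling policy picks from the nodes still in the list.

```python
class Scheduler:
    """Round-robin scheduling over the available node list (policy is assumed)."""

    def __init__(self, nodes):
        self.available = list(nodes)  # the available node list
        self._next = 0                # position of the next pick

    def pick(self):
        """Return the back-end service node that should receive the next request."""
        if not self.available:
            raise RuntimeError("no available back-end service node")
        node = self.available[self._next % len(self.available)]
        self._next += 1
        return node

    def mark_fault(self, node):
        """Stop scheduling to a node by removing it from the available list."""
        if node in self.available:
            self.available.remove(node)
```

For example, with nodes `a1`, `a2`, `a3` and `a2` marked as faulty before any request is scheduled, successive calls to `pick()` alternate between `a1` and `a3` only.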
In the fault removal method for a service node provided by the embodiment of the invention, while the load balancer forwards service requests to the back-end service nodes, whether a back-end service node is in a fault state is determined from its response to those service requests; specifically, if a connection state abnormal event is detected for any back-end service node during this process, that back-end service node is determined to be in a fault state. Faulty back-end service nodes are thus detected from real service requests, with no need to construct probe requests. This ensures the accuracy of faulty-node detection, allows faulty back-end service nodes to be removed from the system in a timely manner, does not intrude on back-end services, and keeps the overhead of fault removal low.
Tests show that when the fault removal method for a service node provided by the embodiment of the invention is applied to a four-layer load balancer, fault detection and fault removal can be performed for hundreds of thousands of back-end service nodes without affecting the load balancer's normal service forwarding, service configuration, and other functions.
In one embodiment of the present invention, the step of determining that the back-end service node is in the failure state if the connection state exception event is detected for any back-end service node includes:
when a connection state abnormal event is detected for any back-end service node and the back-end service node is not already in a state to be confirmed, determining that the back-end service node is in the state to be confirmed, and judging whether the number of connection state abnormal events detected for the back-end service node while it is in the state to be confirmed is not less than a maximum number of abnormal events; wherein the duration of the state to be confirmed is a preset state confirmation duration;
if yes, determining that the back-end service node is in a fault state;
if not, determining that the back-end service node is in a normal state.
In practical applications, a temporary failure may occur in the back-end service node, or a temporary fluctuation may occur in the communication link between the back-end service node and the load balancer, in which case the load balancer is able to detect a connection state anomaly event for the back-end service node, but the back-end service node is still able to process the service request.
In order to avoid interference caused by temporary faults and to reduce the system burden caused by frequently changing the state of a back-end service node, in the embodiment of the present invention a state confirmation duration may be preset. Only after multiple connection state abnormal events have been detected for the back-end service node within the preset state confirmation duration, that is, after the back-end service node has been detected to be unavailable multiple times within that duration, is the back-end service node determined to be in a fault state and the scheduling of service requests to it terminated.
Specifically, when a connection state abnormal event is detected for any back-end service node, before the back-end service node is determined to be in a fault state, the back-end service node may be determined to be in a state to be confirmed, and the duration of the state to be confirmed is specifically a preset state confirmation duration, and it is determined whether the number of times of detecting the connection state abnormal event for the back-end service node is not less than the maximum abnormal number during the period that the back-end service node is in the state to be confirmed.
If so, the connection state abnormal event can be considered not to have been caused by a temporary fault, and the back-end service node can therefore be determined to be in a fault state; if not, the connection state abnormal event is considered to have been caused by a temporary fault, and the back-end service node can therefore be confirmed to be in a normal state.
The specific values of the preset state confirmation time length and the maximum abnormal times can be selected according to actual requirements, and the embodiment of the invention is not limited to the specific values.
In the embodiment of the invention, when a connection state abnormal event is detected for a back-end service node, the back-end service node is first determined to be in the state to be confirmed. Only when the number of connection state abnormal events detected for the back-end service node during the state to be confirmed is not less than the maximum abnormal number is the back-end service node determined to be in the fault state. This avoids erroneous detection results caused by temporary faults and reduces the system burden caused by frequently changing the state of the back-end service node.
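The confirmation logic described above can be illustrated with a minimal Python sketch. The class and parameter names (`FaultConfirmer`, `confirm_duration`, `max_anomalies`) are hypothetical and do not come from the embodiment, and the sketch assumes that the event which opens the state to be confirmed is itself counted, a detail the description leaves open.

```python
from dataclasses import dataclass

NORMAL, PENDING, FAULT = "normal", "to_be_confirmed", "fault"


@dataclass
class NodeState:
    state: str = NORMAL
    window_start: float = 0.0   # time at which the state to be confirmed began
    anomaly_count: int = 0      # abnormal events seen inside the current window


class FaultConfirmer:
    """Hypothetical helper: confirms a fault only after repeated anomalies."""

    def __init__(self, confirm_duration: float, max_anomalies: int):
        self.confirm_duration = confirm_duration  # preset state confirmation duration
        self.max_anomalies = max_anomalies        # maximum abnormal number
        self.nodes = {}

    def _expire(self, node: NodeState, now: float) -> None:
        # The window ended without enough anomalies: treat them as temporary.
        if node.state == PENDING and now - node.window_start >= self.confirm_duration:
            node.state = NORMAL
            node.anomaly_count = 0

    def on_anomaly(self, node_id: str, now: float) -> str:
        node = self.nodes.setdefault(node_id, NodeState())
        self._expire(node, now)
        if node.state == FAULT:
            return FAULT
        if node.state == NORMAL:
            # First anomaly: enter the state to be confirmed.
            node.state = PENDING
            node.window_start = now
            node.anomaly_count = 0
        # Assumption: the event that opened the window is counted as well.
        node.anomaly_count += 1
        if node.anomaly_count >= self.max_anomalies:
            node.state = FAULT
        return node.state

    def state_of(self, node_id: str, now: float) -> str:
        node = self.nodes.setdefault(node_id, NodeState())
        self._expire(node, now)
        return node.state
```

With `confirm_duration=10.0` and `max_anomalies=3`, three anomalies inside ten seconds mark the node as faulty, while a single anomaly followed by a quiet window returns the node to the normal state.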
In an embodiment of the present invention, the method for removing an obstacle of a service node provided in the embodiment of the present invention further includes:
after a fault suppression duration has elapsed since any back-end service node was determined to be in a fault state, resuming the scheduling of service requests to the back-end service node.
In practical application, the back-end service node with the fault may be repaired and recovered to a normal state, in this case, the repaired back-end service node can normally process the service request, so that the repaired back-end service node can be added back to the available node list.
It should be understood that, when the fault removal method for a service node provided by the embodiment of the present invention is applied, the load balancer does not actively send detection requests for health checks to the back-end service nodes, but determines whether a back-end service node is in a normal state based on that node's response to service requests. Thus, in the embodiment of the invention, for back-end service nodes that have been confirmed to be in the fault state, the load balancer cannot actively sense whether they have been restored to the normal state, and therefore cannot wait until a back-end service node is confirmed to have recovered before resuming the scheduling of service requests to it.
Therefore, in order to ensure that a back-end service node that has recovered to the normal state can participate in the service forwarding of the system in time, the scheduling of service requests to a back-end service node can be resumed once the fault suppression duration has elapsed since the node was determined to be in the fault state; that is, the back-end service node is added back to the available node list of the load balancer.
The fault suppression duration can be understood as the estimated duration of a fault of the back-end service node between two adjacent periods of normal operation. Once the fault suppression duration has elapsed since a back-end service node was determined to be in the fault state, the node can be considered likely to have been restored to the normal state, so the scheduling of service requests to it can be resumed. In practical applications, the fault suppression duration may be determined based on information such as the specifications of the back-end service node and its actual operating parameters.
After the scheduling of service requests is resumed to a back-end service node that was previously confirmed to be in a fault state, if a connection state abnormal event is detected for the back-end service node, the back-end service node may be determined to be in the fault state or in the state to be confirmed according to the foregoing embodiments of the present invention.
In the embodiment of the invention, once the fault suppression duration has elapsed since any back-end service node was determined to be in the fault state, the back-end service node is expected to have been restored to the normal state, and the scheduling of service requests to it is resumed. Back-end service nodes that have actually recovered can thus participate in the service forwarding of the system in time, which improves the service processing efficiency of the system.
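As an illustration only, the timed re-admission described above might be sketched in Python as follows. `SuppressionTable` and its method names are hypothetical, and a single suppression duration shared by all nodes is assumed here for brevity.

```python
class SuppressionTable:
    """Hypothetical record of when each node was determined to be faulty."""

    def __init__(self, suppression_seconds: float):
        self.suppression_seconds = suppression_seconds  # fault suppression duration
        self._fault_since = {}                          # node id -> time of fault

    def mark_fault(self, node_id: str, now: float) -> None:
        # Called when the node is determined to be in the fault state.
        self._fault_since[node_id] = now

    def is_schedulable(self, node_id: str, now: float) -> bool:
        since = self._fault_since.get(node_id)
        if since is None:
            return True  # never marked faulty, or already re-admitted
        if now - since >= self.suppression_seconds:
            # Suppression elapsed: put the node back on the available list.
            # Its real health is only learned from the next real request.
            del self._fault_since[node_id]
            return True
        return False
```

The scheduler would consult `is_schedulable` before selecting a node; no probe is sent, so a re-admitted node that is still broken is caught by the next connection state abnormal event.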
In an embodiment of the present invention, the foregoing method for removing an obstacle of a service node further includes:
for a back-end service node determined to be in a fault state, if a connection state normal event is detected for the back-end service node, determining that the back-end service node has been restored to a normal state; and, each time a connection state normal event is detected for the back-end service node, shortening the fault suppression duration configured for the back-end service node based on a first preset proportion if a preset shortening condition is met; the connection state normal event includes: receiving response information, returned by the back-end service node, indicating that the connection state is normal; the preset shortening condition includes: the shortened fault suppression duration is not less than a preset shortest duration.
As mentioned above, for the back-end service nodes determined to be in the failure state, the load balancer cannot actively sense whether the back-end service nodes have recovered to the normal state, so that it is only possible to determine whether there are back-end service nodes that have recovered to the normal state if service requests are scheduled for the back-end service nodes.
After the scheduling of service requests is resumed to a back-end service node that was confirmed to be in the fault state, the load balancer can judge whether the back-end service node has returned to the normal state based on that node's response to the service requests.
Specifically, after the scheduling of service requests to any back-end service node is resumed, if the node's response to a service request constitutes a connection state normal event as defined in the embodiment of the present invention, the back-end service node may be considered to have recovered to a normal state.
In the embodiment of the invention, the normal event of the connection state specifically comprises: and receiving response information which is returned by the back-end service node and represents that the connection state is normal.
In the foregoing, after receiving the service request, the back-end service node may return different types of response information, where part of the types of response information can indicate that the connection state between the load balancer and the back-end service node is normal, for example, indicate that the service request is correct. In the case that the connection state between the load balancer and the back-end service node is determined to be normal, the back-end service node is considered to be capable of being used for processing the service request, so that the back-end service node can be determined to be in a normal state.
Specifically, after receiving the response information returned by the back-end service node, the load balancer may determine, based on a specific field carried in the response information, whether the response information indicates that the connection state is normal, and if yes, consider that the back-end service node is restored to the normal state. Specifically, a specific type of response information capable of representing that the connection state is normal is related to a specific communication protocol adopted by the load balancer, and in the embodiment of the invention, which types of response information are capable of representing that the connection state is normal can be selected based on actual requirements.
It should be understood that, with the fault removal method for a service node provided by the embodiment of the invention, a service request must be sent to a back-end service node in the fault state in order to detect whether it has returned to the normal state. If the fault suppression duration is configured too long, the load balancer still cannot detect the node's state after the node has been restored to the normal state, and the node cannot be used to process service requests, which increases the load on the remaining back-end service nodes.
Therefore, in the embodiment of the invention, a fault suppression duration can be configured for each back-end service node individually and dynamically adjusted according to the detection result for that node. Specifically, for a back-end service node determined to be in a fault state, after the scheduling of service requests to it is resumed, each time a connection state normal event is detected for the node, the node is considered less likely to fail again, so the fault suppression duration configured for it can be shortened based on a first preset proportion. When the node is later determined to be in the fault state again, the scheduling of service requests to it can then be resumed after the shorter fault suppression duration in order to detect whether it has recovered.
In this process, each time a connection state normal event is detected for the back-end service node, that is, each time the back-end service node is detected to be available, the fault suppression duration configured for it is shortened based on the first preset proportion, but the shortened duration must not be less than the preset shortest duration.
The first preset proportion and the preset shortest time period may be set based on actual requirements, which is not limited in the embodiment of the present invention, and as an example, the first preset proportion may be set to 1/2.
In the embodiment of the invention, after the scheduling of service requests is resumed to a back-end service node previously determined to be in the fault state, the fault suppression duration configured for that node is shortened each time a connection state normal event is detected for it. The fault suppression duration can thus be dynamically adjusted based on the detection result for the node, so that back-end service nodes that have recovered are detected in time and the service processing efficiency of the system is improved.
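The shortening rule can be made concrete with a small Python sketch. The function name `shorten_suppression` and the numeric values are illustrative assumptions; the first preset proportion of 1/2 follows the example given above, and the sketch reads the preset shortening condition literally, leaving the duration unchanged when the shortened value would fall below the preset shortest duration.

```python
def shorten_suppression(current: float, first_ratio: float, min_duration: float) -> float:
    """Return the fault suppression duration after one connection state normal event.

    The duration is multiplied by first_ratio only if the result is not less
    than min_duration (the preset shortening condition); otherwise it is kept.
    """
    candidate = current * first_ratio
    return candidate if candidate >= min_duration else current


# Illustrative trace with a 60 s duration, ratio 1/2 and a 10 s floor.
duration = 60.0
history = []
for _ in range(4):
    duration = shorten_suppression(duration, 0.5, 10.0)
    history.append(duration)
# history is [30.0, 15.0, 15.0, 15.0]: 7.5 s would violate the floor.
```

An implementation could instead clamp the result to the shortest duration; the description does not say which behavior is intended.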
In an embodiment of the present invention, the method for removing an obstacle of a service node provided in the embodiment of the present invention further includes:
After determining that any back-end service node is in a fault state each time, under the condition that a preset extension condition is met, extending the fault suppression time length configured for the back-end service node based on a second preset proportion; the preset extension conditions include: the prolonged fault suppression time is not longer than the preset maximum time.
As described above, with the fault removal method for a service node provided by the embodiment of the invention, whether a back-end service node in the fault state has returned to the normal state is detected by sending service requests to it. However, if the back-end service node has not recovered, the service requests sent to it cannot be processed correctly, which may affect the service of the load balancing system. To avoid this problem, it should be ensured that a back-end service node is highly likely to have returned to the normal state once the fault suppression duration has elapsed since it was confirmed to be in the fault state.
In the embodiment of the invention, in the process of forwarding service requests to the back-end service nodes, each time any back-end service node is determined to be in a fault state, the fault suppression duration configured for that back-end service node can be extended based on a second preset proportion.
Specifically, the more times a back-end service node has been determined to be in the fault state, the more prone to faults it is considered to be. Therefore, each time any back-end service node is determined to be in the fault state, the fault suppression duration configured for it can be extended based on the second preset proportion, so as to reduce the impact on the load balancing system of the node still being in the fault state when the scheduling of service requests to it is resumed. In this process, the extended fault suppression duration must not exceed a preset maximum duration.
The second preset proportion and the preset maximum duration may be selected based on actual requirements, which is not limited in the embodiment of the present invention, and as an example, the second preset proportion may be 2.
In the embodiment of the invention, each time any back-end service node is determined to be in a fault state, the fault suppression duration configured for that node is extended, so that the fault suppression duration of each back-end service node is dynamically adjusted. This reduces the impact on the system's service processing of a back-end service node still being in the fault state when the scheduling of service requests to it is resumed after the fault suppression duration.
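A corresponding Python sketch of the extension rule is given below. The name `extend_suppression` and the numbers are hypothetical; the second preset proportion of 2 follows the example above, and the duration is left unchanged when the extended value would exceed the preset maximum duration, which is one literal reading of the preset extension condition.

```python
def extend_suppression(current: float, second_ratio: float, max_duration: float) -> float:
    """Return the fault suppression duration after one more fault determination.

    The duration is multiplied by second_ratio only if the result does not
    exceed max_duration (the preset extension condition); otherwise it is kept.
    """
    candidate = current * second_ratio
    return candidate if candidate <= max_duration else current


# Illustrative trace: a node that keeps failing, starting at 30 s with a 200 s cap.
duration = 30.0
trace = []
for _ in range(4):
    duration = extend_suppression(duration, 2.0, 200.0)
    trace.append(duration)
# trace is [60.0, 120.0, 120.0, 120.0]: 240 s would exceed the cap.
```

Repeated faults therefore back the node off multiplicatively up to the cap, which limits how often real requests are spent on a node that keeps failing.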
In practical applications, depending on the application scenario, there may be a difference in the communication protocol used by the load balancer in forwarding the service request to the backend service node. According to the service node obstacle removing method provided by the embodiment of the invention, the application can be realized in different scenes only by determining a specific communication protocol for realizing the service node obstacle removing according to actual conditions and determining specific contents contained in abnormal events of the connection state and normal events of the connection state according to protocol contents. This is described below in connection with specific embodiments.
In one embodiment of the present invention, in the case of back-end service node barrier removal based on the TCP protocol, the connection state exception event includes any one or more of the following:
the connection state of the load balancer changes from the SYN-RECV (synchronize-received, i.e., a SYN packet is to be forwarded to the back-end service node) state or the SYN-SENT (synchronize-sent) state to the CLOSE (closed) state;
the load balancer times out in the SYN-RECV state or the SYN-SENT state;
an RST (reset) packet returned by the back-end service node is received.
TCP is a connection-based transport protocol and therefore requires the establishment of a communication connection between a backend service node and a load balancer prior to data transmission based on a service request. In the process, the load balancer and the back-end service node can complete the establishment of communication connection based on the interaction of information, and the load balancer and the back-end service node can record the connection state of the load balancer and the back-end service node.
The following describes an exemplary connection establishment procedure under the TCP protocol, where the device that actively initiates the connection is A and the device that passively waits for the connection is B. Before a connection is established, A and B are both in the CLOSE state, and the connection between them is established through a three-way handshake. (1) First handshake: when a connection needs to be established, A sends a SYN (synchronization request) packet to B and enters the SYN-SENT state. (2) Second handshake: after receiving the SYN packet sent by A, B returns a SYN/ACK packet as acknowledgement and enters the SYN-RECV state; the SYN/ACK packet can be understood as a response packet in which both the SYN flag bit and the ACK flag bit are set. (3) Third handshake: after receiving B's acknowledgement, A returns an ACK packet to B as acknowledgement; after this packet is sent, both sides enter the Established state, and the communication connection is established.
Under the TCP protocol connection, the RST packet is specifically used to close the abnormal connection, so when the RST packet returned by the backend service node is received, it can be considered that a normal communication connection cannot be established with the backend service node, and thus the backend service node can be considered as unavailable.
In addition, since the load balancer records its own connection state under the TCP protocol, the response result of the back-end service node can be judged directly from the connection state of the load balancer. Specifically, when the load balancer is in the SYN-RECV state or the SYN-SENT state, if it times out in that state, that is, no acknowledgement from the back-end service node is received within the preset waiting time, or if it transitions directly to the CLOSE state, that is, information returned by the back-end service node to close the connection is received, it can be considered that a normal communication connection cannot be established with the corresponding back-end service node, so the back-end service node can be determined to be unavailable.
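The three TCP conditions listed above can be expressed as a single predicate. The following Python sketch is illustrative only: `ConnState` and `is_tcp_anomaly` are hypothetical names, and the sketch assumes the load balancer exposes the previous and new connection state together with timeout and RST flags.

```python
from enum import Enum
from typing import Optional


class ConnState(Enum):
    CLOSE = "CLOSE"
    SYN_RECV = "SYN-RECV"
    SYN_SENT = "SYN-SENT"
    ESTABLISHED = "ESTABLISHED"


# States in which the handshake with the back-end node is still in progress.
HANDSHAKE_STATES = {ConnState.SYN_RECV, ConnState.SYN_SENT}


def is_tcp_anomaly(prev: ConnState,
                   new: Optional[ConnState] = None,
                   timed_out: bool = False,
                   rst_received: bool = False) -> bool:
    """Return True if the observation is a connection state abnormal event."""
    if rst_received:
        return True  # RST packet returned by the back-end service node
    if prev in HANDSHAKE_STATES:
        if timed_out:
            return True  # timed out in SYN-RECV or SYN-SENT
        if new is ConnState.CLOSE:
            return True  # handshake state changed directly to CLOSE
    return False
```

For example, `is_tcp_anomaly(ConnState.SYN_SENT, ConnState.CLOSE)` is true, whereas an ordinary close of an established connection is not treated as an abnormal event.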
Correspondingly, in the case of back-end service node fault removal based on UDP (User Datagram Protocol), the connection state abnormal event includes any one or more of the following:
when the UDP connection between the load balancer and the back-end service node is a UDP connection with bidirectional data transmission, no response information returned by the back-end service node is received within a preset waiting time after the service request is sent;
an ICMP (Internet Control Message Protocol) error message of the host-unreachable type or the port-unreachable type returned by the back-end service node is received.
Specifically, under a UDP connection with bidirectional data transmission, a back-end service node that receives a service request sent by the load balancer returns response information for the service request. Therefore, if no response information is received from the back-end service node within the preset waiting time after the service request is sent, the connection state with the back-end service node can be considered abnormal, and the back-end service node is confirmed to be unavailable.
ICMP is a network layer protocol for transmitting error reports and control information. Therefore, in the embodiment of the present invention, when the back-end service node returns an ICMP error message, the connection state with the back-end service node can be determined based on the error category indicated by the ICMP error message. When the ICMP error message indicates that the host of the back-end service node is unreachable or that the port is unreachable, the connection state with the back-end service node is confirmed to be abnormal, and the back-end service node is considered unavailable.
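As a rough illustration, the UDP conditions might be checked as in the Python sketch below. The function `is_udp_anomaly` and its parameters are hypothetical. The numeric values are the standard ICMP ones: Destination Unreachable is type 3, with code 1 for host unreachable and code 3 for port unreachable.

```python
from typing import Optional, Tuple

ICMP_DEST_UNREACHABLE = 3        # ICMP type: Destination Unreachable
ICMP_CODE_HOST_UNREACHABLE = 1   # code within type 3
ICMP_CODE_PORT_UNREACHABLE = 3   # code within type 3


def is_udp_anomaly(bidirectional: bool,
                   got_response: bool,
                   waited: float,
                   wait_limit: float,
                   icmp: Optional[Tuple[int, int]] = None) -> bool:
    """Return True if the observation is a connection state abnormal event.

    icmp is an optional (type, code) pair taken from a received ICMP error.
    """
    if icmp is not None:
        icmp_type, icmp_code = icmp
        if icmp_type == ICMP_DEST_UNREACHABLE and icmp_code in (
            ICMP_CODE_HOST_UNREACHABLE,
            ICMP_CODE_PORT_UNREACHABLE,
        ):
            return True
    # A missing reply only counts when the flow is expected to be bidirectional.
    if bidirectional and not got_response and waited >= wait_limit:
        return True
    return False
```

The timeout rule is deliberately skipped for one-way UDP flows, since silence from the back-end service node carries no information there.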
Correspondingly, in the case of performing back-end service node barrier removal based on the HTTP/HTTPS protocol, the abnormal event of the connection state includes any one or more of the following:
an erroneous HTTP response status code returned by the back-end service node is received;
no response information returned by the back-end service node is received within a preset waiting time after the service request is sent.
Specifically, the HTTP response status code indicates whether an HTTP request was successfully completed. In the embodiment of the present invention, when the load balancer receives an erroneous HTTP response status code returned by a back-end service node, the back-end service node may be determined to be unavailable. As one example, the erroneous HTTP response status code may include a 5XX response code (i.e., a response code of 500-599) that characterizes a server error.
In addition, in the case that the response information returned by the back-end service node is not received within the preset waiting time after the service request is sent, the back-end service node can be confirmed to be unavailable.
In the embodiment of the invention, the corresponding abnormal event of the connection state can be determined based on the communication protocol actually adopted by the load balancing system, so that the application of the obstacle removing method of the service node provided by the embodiment of the invention can be realized under various communication protocols, the applicability of the obstacle removing method is strong, and the application under different communication protocols is simple and efficient.
In one embodiment of the present invention, in the case of performing back-end service node barrier removal based on the TCP protocol, the connection state normal event includes any one or more of the following:
the connection state of the load balancer enters the Established state;
and receiving a SYN/ACK packet returned by the back-end service node.
Based on the foregoing description of the connection procedure under the TCP protocol, it can be seen that when the connection state of the load balancer enters the Established state, that is, the load balancer and the back-end service node can establish a normal communication connection, so that it can be confirmed based on the connection state that the connection state between the load balancer and the back-end service node is normal, that is, the back-end service node is available.
Furthermore, based on the description of the connection procedure under the TCP protocol, it can be seen that, in the case where the back-end service node returns a SYN/ACK packet during the connection procedure, it can be considered that a normal communication connection can be established between the load balancer and the back-end service node.
Correspondingly, under the condition of performing back-end service node obstacle removal based on UDP protocol, the normal event of the connection state comprises:
response information returned by the back-end service node is received after the service request is sent out.
Specifically, in the case that the connection between the load balancer and the back-end service node is specifically UDP for bidirectional data transmission, if the back-end service node receives a service request sent by the load balancer, response information is returned, so that in the case that the load balancer receives the response information returned by the back-end service node, the connection state with the back-end service node can be considered to be normal.
Correspondingly, under the condition that the back-end service node is used for removing the barrier based on the HTTP/HTTPS protocol, the normal event of the connection state comprises the following steps:
and receiving a normal HTTP response status code returned by the back-end service node.
In the embodiment of the invention, when the load balancer receives a normal HTTP response code, namely, an HTTP response code representing that the state of the back-end service node is normal, the back-end service node can be considered to be available. As an example, a normal HTTP response status code may refer to an HTTP response status code other than a 5XX response code, which, as mentioned above, characterizes that an internal error has occurred in the backend service node if the backend service node returns a 5XX response code. If the back-end service node returns other types of HTTP response status codes, it characterizes that the service request is successfully received, or the error cause is not caused by the back-end service node, in which case the back-end service node may be considered available.
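The HTTP/HTTPS rules above reduce to a simple classification, sketched here in Python. The function name `classify_http_result` is hypothetical, and the choice of treating exactly the 5XX range as erroneous follows the example in the description rather than a fixed requirement.

```python
from typing import Optional


def classify_http_result(status_code: Optional[int]) -> str:
    """Classify one forwarded HTTP request as 'abnormal' or 'normal'.

    status_code is None when no response arrived within the preset waiting time.
    """
    if status_code is None:
        return "abnormal"   # no response within the waiting time
    if 500 <= status_code <= 599:
        return "abnormal"   # 5XX: error inside the back-end service node
    return "normal"         # other codes: request received, or error not caused by the node
```

Under this sketch a 404 or 429 counts as a connection state normal event, because the node itself is still reachable and answering.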
In the embodiment of the invention, the corresponding normal event of the connection state can be determined based on the communication protocol actually adopted by the load balancing system, so that the application of the back-end service node obstacle removing method provided by the embodiment of the invention can be realized under various communication protocols, the applicability of the obstacle removing method is strong, and the application under different communication protocols is simple and efficient.
In addition, in practical applications, since the load balancer needs to support processing forwarding capability far higher than that of the back-end service node, many load balancers may use a multi-core concurrent processing model to ensure performance requirements. Based on the multi-core concurrent processing model, the embodiment of the invention particularly provides a load balancer capable of realizing barrier removal of a back-end service node.
Fig. 3 is a schematic diagram of the operating principle of a load balancer provided in an embodiment of the present invention. Referring to Fig. 3, the load balancer includes a control unit and a plurality of forwarding units, each of which is communicatively connected to the control unit.
As an example, in such a multi-core concurrent load balancer model, the control unit may be responsible for managing forwarding rules and forwarding policies and for collecting states and statistics, while the forwarding units are responsible for data forwarding, each forwarding unit working independently. In the design of the multi-core concurrent load balancer, a physical CPU (Central Processing Unit) can be bound to the control unit and to each forwarding unit, and communication between the control unit and the forwarding units is supported through a lightweight synchronization mechanism, so that the load balancer achieves the best forwarding performance and network throughput.
In order to realize fault removal for the back-end service nodes, in the embodiment of the invention, the forwarding unit is configured to forward service requests to the back-end service nodes and, if a connection state abnormal event is detected for any back-end service node in the process of forwarding service requests, to report error information for the back-end service node to the control unit; the connection state abnormal event includes: receiving response information, returned by the back-end service node, indicating that the connection state is abnormal, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request or a connection request corresponding to the service request is sent;
the control unit is configured to determine, when error information reported by a forwarding unit is received for any back-end service node, that the back-end service node is in a fault state, and to issue a removal instruction for the back-end service node to all the forwarding units;
and the forwarding unit is further configured to stop forwarding service requests to any back-end service node after receiving a removal instruction for the back-end service node.
In particular, since service requests are forwarded by the forwarding units, connection state abnormal events for the back-end service nodes are detected by the forwarding units. In the embodiment of the invention, when a forwarding unit detects a connection state abnormal event for any back-end service node, it reports error information for the back-end service node to the control unit. After receiving the error information, the control unit issues a removal instruction to all the forwarding units, instructing all the forwarding units to stop forwarding service requests to the back-end service node.
Taking Fig. 3 as an example, for data stream 1 issued by a user, forwarding unit b1 forwards the service request to back-end service node c2. If forwarding unit b1 detects a connection state abnormal event for back-end service node c2 in this process, it reports error information for back-end service node c2 to the control unit. After receiving the error information reported by forwarding unit b1, the control unit determines that back-end service node c2 is in a fault state and issues a removal instruction to forwarding units b1-b4, so that forwarding units b1-b4 all stop forwarding service requests to back-end service node c2. Similarly, for data stream 2, forwarding unit b4 forwards the service request to back-end service node c1. If forwarding unit b4 detects a connection state abnormal event for back-end service node c1 in this process, it reports error information for back-end service node c1 to the control unit. After receiving the error information reported by forwarding unit b4, the control unit determines that back-end service node c1 is in a fault state and issues a removal instruction to forwarding units b1-b4, so that forwarding units b1-b4 all stop forwarding service requests to back-end service node c1.
In the embodiment of the invention, the control unit and the forwarding unit can communicate through a lock-free message mechanism in particular, so that the communication efficiency between the control unit and the forwarding unit is improved.
In the load balancer provided by the embodiment of the invention, the forwarding unit is responsible for detecting the state of the back-end service nodes, and the control unit is responsible for removing back-end service nodes in a fault state from the available node list. Specifically, when a forwarding unit detects a connection state abnormal event for a back-end service node, it reports error information to the control unit; based on that error information, the control unit can determine that the back-end service node is in a fault state and send a removal instruction for it to all forwarding units, so that forwarding of service requests to that back-end service node is terminated. Fault removal of back-end service nodes is thus based on real service requests, and no detection request needs to be constructed. This ensures the accuracy of the fault node detection result, allows faulty back-end service nodes in the load balancing system to be removed in time, is non-invasive to the back-end service, and keeps the fault removal overhead low. Through message communication between the control unit and the forwarding units, fault removal of service nodes can also be achieved in a multi-core concurrency scenario, and the fault removal process is fast, efficient, and scales well with concurrency.
In one embodiment of the present invention, the control unit is specifically configured to: each time error information reported by a forwarding unit for any back-end service node is received, determine that the back-end service node is in a state to be confirmed if it is not already in that state, and judge whether the number of error messages reported by the forwarding units for the back-end service node while it is in the state to be confirmed is not less than a maximum number of abnormalities, where the duration of the state to be confirmed is a preset state confirmation duration; if yes, determine that the back-end service node is in a fault state and issue a removal instruction for the back-end service node to all forwarding units; otherwise, determine that the back-end service node is in a normal state.
In the embodiment of the invention, for a back-end service node in the state to be confirmed, each time the control unit receives error information reported by a forwarding unit for that node, a forwarding unit can be considered to have detected one connection state abnormal event for the node. The number of error messages received for the back-end service node while it is in the state to be confirmed can therefore be taken as the number of times a connection state abnormal event has been detected for that node.
In this process, after receiving error information reported by a forwarding unit for a back-end service node, the control unit places the back-end service node in the state to be confirmed, and determines that the node is in a fault state only when the number of error messages reported for it during the state to be confirmed is not less than the maximum number of abnormalities. This avoids false detection results caused by transient faults and reduces the system burden caused by frequently changing the state of the back-end service node.
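The confirmation logic described above can be illustrated with a small sketch. It assumes that timestamps are passed in explicitly (in seconds) and that an expired confirmation window is simply discarded when the next report arrives; the class name `DownWaitWindow`, the method `on_down`, and the returned string labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownWaitWindow:
    down_retry: int                     # maximum number of abnormalities
    down_wait: float                    # preset state confirmation duration (s)
    started_at: Optional[float] = None  # None means "not in the state to be confirmed"
    count: int = 0

    def on_down(self, now: float) -> str:
        """Record one error report and return the resulting node state."""
        expired = (
            self.started_at is not None
            and now - self.started_at >= self.down_wait
        )
        if expired:
            # The window ended below the threshold: the node was normal.
            self.started_at, self.count = None, 0
        if self.started_at is None:
            self.started_at = now       # enter the state to be confirmed
        self.count += 1
        if self.count >= self.down_retry:
            self.started_at, self.count = None, 0
            return "FAULT"
        return "TO_BE_CONFIRMED"
```

With `down_retry=3` and `down_wait=10.0`, three reports at t = 0, 1, 2 yield a fault on the third report, whereas reports at t = 0, 1, 20 do not, because the third report falls outside the first window and starts a new one.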
In one embodiment of the present invention, the control unit is further configured to issue a recovery instruction for a back-end service node to all forwarding units once that back-end service node has been in the fault state for the fault suppression duration;
the forwarding unit is further configured to resume forwarding service requests to the back-end service node after receiving the recovery instruction for that back-end service node.
As described above, once any back-end service node has been in the fault state for the fault suppression duration, scheduling of service requests to that node may be resumed. In a load balancer using a multi-core concurrency model, the forwarding unit is responsible for forwarding service requests and for detecting connection state abnormal events and connection state normal events, while the control unit is responsible for removing back-end service nodes from, and restoring them to, the available node list. Therefore, in the embodiment of the invention, once any back-end service node has been in the fault state for the fault suppression duration, the control unit can issue a recovery instruction for that node to all forwarding units, so that all forwarding units resume forwarding service requests to it.
In the embodiment of the invention, once any back-end service node has been in the fault state for the fault suppression duration, the control unit sends a recovery instruction to all forwarding units so that the back-end service node is added back to the available node list, allowing a back-end service node that has recovered to the normal state to take part in service forwarding in time.
In one embodiment of the present invention, the forwarding unit is further configured to, after resuming forwarding of service requests to a back-end service node, report correct information for the back-end service node to the control unit if a connection state normal event is detected for that node;
the control unit is further configured to, after issuing a recovery instruction for a back-end service node previously determined to be in a fault state, determine that the back-end service node has returned to the normal state upon receiving correct information reported by a forwarding unit for that node, and, each time such correct information is received, shorten the fault suppression duration configured for the back-end service node based on a first preset proportion if a preset shortening condition is satisfied; the preset shortening condition includes: the shortened fault suppression duration is not less than the preset shortest duration.
In the embodiment of the invention, when a forwarding unit detects a connection state normal event for a back-end service node, it reports correct information for that node to the control unit. After the fault suppression duration of a back-end service node in the fault state has elapsed, if the control unit receives correct information reported by a forwarding unit for that node, it can determine that the node has recovered to the normal state. If it instead receives error information reported by a forwarding unit for that node, it can place the node in the state to be confirmed.
In the foregoing embodiment of the present invention, for a back-end service node that was determined to be in a fault state, after scheduling of service requests to the node is resumed, the fault suppression duration configured for the node is shortened once each time a connection state normal event is detected for it. As with the error information reported by the forwarding unit, each piece of correct information received from a forwarding unit can be regarded as one detection of a connection state normal event for the back-end service node, so the fault suppression duration configured for the node can be shortened based on the first preset proportion each time a piece of correct information is received for it.
In the embodiment of the invention, for a back-end service node that was determined to be in a fault state, after scheduling of service requests to the node is resumed, the control unit can shorten the fault suppression duration of the node according to the correct information reported by the forwarding units, so that the fault suppression duration is adjusted dynamically and efficiently on the basis of that information.
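A minimal sketch of the shortening rule follows. It reads the preset shortening condition literally: the duration is shortened only if the result would not fall below the preset shortest duration, and is otherwise left unchanged. The function name `shorten_inhibit` and the sample values are hypothetical; the ratio 0.5 is chosen only for illustration.

```python
def shorten_inhibit(current: float, ratio: float, shortest: float) -> float:
    """Fault suppression duration after one piece of correct information.

    current  -- fault suppression duration currently configured for the node
    ratio    -- first preset proportion (0 < ratio < 1)
    shortest -- preset shortest duration
    """
    shortened = current * ratio
    # Preset shortening condition: the shortened duration must not be less
    # than the preset shortest duration; otherwise nothing changes.
    return shortened if shortened >= shortest else current


# Three consecutive pieces of correct information, starting from 40 seconds.
duration = 40.0
history = []
for _ in range(3):
    duration = shorten_inhibit(duration, ratio=0.5, shortest=10.0)
    history.append(duration)
```

Here `history` becomes `[20.0, 10.0, 10.0]`: the third report no longer shortens the duration because halving again would drop below the 10-second floor.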
In one embodiment of the present invention, after receiving the recovery instruction issued by the control unit for any back-end service node, the forwarding unit reports the correct information to the control unit for the back-end service node for a number of times not exceeding a preset maximum reporting number.
In the embodiment of the invention, the forwarding unit can detect the health state of a back-end service node in real time while forwarding service requests to it. When the forwarding unit detects a connection state normal event for a back-end service node in this process, it reports correct information to the control unit. For a back-end service node previously determined to be in the fault state, this report notifies the control unit that the node has recovered, and allows the control unit to dynamically adjust the fault suppression duration configured for the node.
Once the control unit has determined, based on the correct information reported by a forwarding unit for a back-end service node, that the node has recovered to the normal state and its health state has stabilized, any further correct information reported for that node by the forwarding unit only adds processing load to the load balancer.
Therefore, in the embodiment of the invention, a maximum number of times that a forwarding unit reports correct information for any back-end service node after the control unit issues a recovery instruction for that node can be set. Limiting the number of correct-information reports in this way reduces the communication overhead of the load balancer.
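The reporting limit can be sketched as a per-node countdown kept by each forwarding unit. The class `UpReporter` and its methods are hypothetical names; the sketch assumes the budget is reset each time a recovery instruction is received for a node.

```python
class UpReporter:
    """Tracks how many correct-information reports may still be sent per node."""

    def __init__(self, max_reports: int) -> None:
        self.max_reports = max_reports  # preset maximum reporting number
        self.remaining = {}             # node -> reports still allowed

    def on_recovery_instruction(self, node: str) -> None:
        # A recovery instruction grants a fresh reporting budget for the node.
        self.remaining[node] = self.max_reports

    def should_report(self, node: str) -> bool:
        """Call once per connection state normal event; True means 'report it'."""
        left = self.remaining.get(node, 0)
        if left <= 0:
            return False                # budget used up, or node never recovered
        self.remaining[node] = left - 1
        return True


reporter = UpReporter(max_reports=2)
reporter.on_recovery_instruction("c1")
decisions = [reporter.should_report("c1") for _ in range(4)]
```

With a budget of two, `decisions` is `[True, True, False, False]`: later successes are still served but no longer generate messages to the control unit.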
As an example, to improve the performance and ease of use of the load balancer, the maximum reporting number may be configured separately for each forwarding unit; the embodiment of the present invention does not limit the specific configuration manner.
Specifically, in the process of performing the back-end service node fault removal by the load balancer, the detection process of the back-end service node by the control unit and the forwarding unit can be divided into four states:
UP state: the back-end service node has been detected as normal and is in the available node list of the load balancer;
DOWN state: a fault has been detected for the back-end service node, and the node has been removed from the available node list of the load balancer;
DOWN-WAIT state: an abnormality has been detected for the back-end service node, and further detection and confirmation are awaited;
UP-WARM state: the fault suppression duration of the back-end service node has ended, and the node is tentatively added back to the available node list of the load balancer while its availability is detected again.
Fig. 4 is a schematic state diagram of a control unit provided by an embodiment of the present invention, and fig. 5 is a schematic state diagram of a forwarding unit provided by an embodiment of the present invention, and a barrier removal process of a back-end service node is further described below with reference to fig. 4 and 5 and a specific example.
Specifically, in the process of performing the barrier removal process of the back-end service node, the following five configuration parameters may be specified for each back-end service node:
down_retry (i.e., the maximum number of abnormalities): the number of times an abnormality must be detected for the back-end service node before it is removed from the available node list;
down_wait (i.e., the preset state confirmation duration): the DOWN-WAIT duration; the back-end service node is removed if abnormalities are detected for it down_retry times within this duration;
up_confirm (i.e., the preset maximum reporting number): the number of Up messages (i.e., correct information) that the forwarding unit sends to the control unit after the failed back-end service node has been added back to the available node list and detected as normal;
inhibit_min: the preset shortest duration;
inhibit_max: the preset maximum duration.
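For illustration, the five parameters can be grouped into one per-node configuration record. The sketch below is an assumption-laden example: the class name `NodeCheckConfig`, the units (seconds), the sample values, and the validation rules are all hypothetical and are not specified by the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeCheckConfig:
    down_retry: int     # maximum number of abnormalities
    down_wait: float    # preset state confirmation duration, seconds (assumed unit)
    up_confirm: int     # preset maximum reporting number
    inhibit_min: float  # preset shortest duration, seconds (assumed unit)
    inhibit_max: float  # preset maximum duration, seconds (assumed unit)

    def __post_init__(self) -> None:
        # Sanity checks assumed for this sketch only.
        if self.down_retry < 1:
            raise ValueError("down_retry must be at least 1")
        if self.up_confirm < 0:
            raise ValueError("up_confirm must not be negative")
        if self.down_wait <= 0:
            raise ValueError("down_wait must be positive")
        if not 0 < self.inhibit_min <= self.inhibit_max:
            raise ValueError("require 0 < inhibit_min <= inhibit_max")


# Hypothetical values for one back-end service node.
config = NodeCheckConfig(
    down_retry=3, down_wait=10.0, up_confirm=5,
    inhibit_min=5.0, inhibit_max=300.0,
)
```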
Specifically, while the forwarding unit is forwarding service requests to a back-end service node, if the forwarding unit detects that the back-end service is unavailable, that is, detects a connection state abnormal event for the back-end service node, it sends a Down message (i.e., error information) to the control unit. The Down message may contain specific information about the unavailable back-end service node.
After receiving the Down message, the control unit starts a Down-Wait timer whose duration is the preset state confirmation duration. Until the timer expires, the control unit waits for further Down messages sent by the forwarding units for the back-end service node in order to determine whether the node is in a fault state.
If, before the Down-Wait timer expires, the number of Down messages received by the control unit reaches the specified threshold (i.e. down_retry), the control unit sends a Close message (i.e. a removal instruction), starts an Inhibit timer, and then doubles the Inhibit time (i.e. the fault suppression duration), subject to the upper limit inhibit_max; after receiving the Close message, the forwarding unit removes the back-end service node from the available node list. If the number of Down messages received is still smaller than down_retry when the Down-Wait timer expires, the control unit clears the Down message counter and keeps the back-end service node in the UP state.
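The effect of doubling the Inhibit time on repeated faults can be shown with a short worked example. It assumes one reading of the ordering above: the Inhibit timer is started with the current Inhibit time, and the doubled value applies to the next fault. The function name `on_fault_confirmed` and the values 5 and 60 seconds are hypothetical.

```python
from typing import Tuple


def on_fault_confirmed(inhibit_time: float, inhibit_max: float) -> Tuple[float, float]:
    """Return (duration of the Inhibit timer now, Inhibit time for the next fault)."""
    timer_duration = inhibit_time
    # Double the stored Inhibit time, but never beyond inhibit_max.
    next_inhibit = min(inhibit_time * 2, inhibit_max)
    return timer_duration, next_inhibit


# Six consecutive confirmed faults for the same back-end service node.
inhibit = 5.0
timers = []
for _ in range(6):
    timer, inhibit = on_fault_confirmed(inhibit, inhibit_max=60.0)
    timers.append(timer)
```

The suppression periods grow as 5, 10, 20, 40, 60, 60 seconds: a node that keeps failing is kept out of the available node list for longer each time, up to the cap.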
After the Inhibit timer expires, the control unit sends an Open message (i.e. a recovery instruction) to all forwarding units. After receiving the Open message, each forwarding unit adds the corresponding back-end service node back to the available node list of the load balancer, so that the back-end service node again takes part in the scheduling and forwarding of service requests.
After the back-end service node has been added back to the available node list, if the forwarding unit detects that the back-end service is available, it sends Up messages (i.e. correct information) to the control unit: one Up message is sent for each successful detection, up to up_confirm messages in total. After receiving an Up message, the control unit determines that the back-end service node is normal, and halves the Inhibit time each time an Up message is received (but not below inhibit_min). The number of Up messages to send (i.e. the preset maximum reporting number) is configurable, so that the forwarding unit sends only the specified number of Up messages to the control unit.
Correspondingly, after the back-end service node has been added back to the available node list, if the forwarding unit detects that the back-end service is unavailable, it sends a Down message for the back-end service node to the control unit. After receiving the Down message, the control unit starts the Down-Wait timer again to determine whether the back-end service node is available.
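The message and timer handling described above can be summarized as a transition table over the four states. This sketch is inferred from the text only (figs. 4 and 5 are not reproduced here); the event names are hypothetical labels, and events that do not apply to a state are assumed to leave it unchanged.

```python
from enum import Enum


class NodeState(Enum):
    UP = "UP"
    DOWN_WAIT = "DOWN-WAIT"
    DOWN = "DOWN"
    UP_WARM = "UP-WARM"


# (current state, event) -> next state, as seen by the control unit.
TRANSITIONS = {
    (NodeState.UP, "down_msg"): NodeState.DOWN_WAIT,
    (NodeState.DOWN_WAIT, "down_retry_reached"): NodeState.DOWN,       # Close sent
    (NodeState.DOWN_WAIT, "down_wait_timer_expired"): NodeState.UP,
    (NodeState.DOWN, "inhibit_timer_expired"): NodeState.UP_WARM,      # Open sent
    (NodeState.UP_WARM, "up_msg"): NodeState.UP,
    (NodeState.UP_WARM, "down_msg"): NodeState.DOWN_WAIT,
}


def next_state(state: NodeState, event: str) -> NodeState:
    """Return the next state, or the current state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


# One full cycle: fault detected, confirmed, suppressed, then recovered.
state = NodeState.UP
for event in ("down_msg", "down_retry_reached", "inhibit_timer_expired", "up_msg"):
    state = next_state(state, event)
```

After the four events the node is back in the UP state; had a Down message arrived in UP-WARM instead of an Up message, the node would have returned to DOWN-WAIT.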
In the embodiment of the invention, if the forwarding unit detects that a back-end service node is unavailable, it can reschedule the service request to another back-end service node in the normal state, provided the service request still permits this (for example, the user has not cancelled the service request and the service request has not timed out), so that the service request can still be processed.
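The rescheduling behaviour can be sketched as a loop over candidate nodes. This is a simplified illustration: the function name, the `send` callback (returning False when a connection state abnormal event is detected), and the `still_allowed` callback (False once the user has cancelled or the request has timed out) are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def forward_with_reschedule(
    nodes: Iterable[str],
    send: Callable[[str], bool],
    still_allowed: Callable[[], bool],
) -> Optional[str]:
    """Try candidate nodes in order; return the node that served the request."""
    for node in nodes:
        if not still_allowed():
            return None   # request cancelled or timed out: stop rescheduling
        if send(node):
            return node   # request handled by this back-end service node
        # send() failed for this node; fall through to the next candidate.
    return None


# c2 is unavailable, so the request is rescheduled to c1.
served_by = forward_with_reschedule(
    ["c2", "c1"],
    send=lambda node: node != "c2",
    still_allowed=lambda: True,
)
```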
It can be seen that, with a load balancer of this model, the availability of a back-end service node is detected from the node's response while the forwarding unit is sending service requests to it. No detection request needs to be constructed, the accuracy of the fault detection result is ensured, faulty back-end service nodes in the system can be removed in time, the back-end service is not intruded upon, and the cost of health checking is low. In addition, the two intermediate states DOWN-WAIT and UP-WARM ensure the reliability of the fault detection result; dynamic adjustment of the fault suppression duration reduces the impact on users of detecting faulty back-end service nodes through real service requests; and the message mechanism between the control unit and the forwarding units enables high-performance processing in a multi-core concurrent load balancer.
Based on the same inventive concept, the embodiment of the invention also provides a barrier removal device for a service node, which is applied to a load balancer in a load balancing system, where the load balancing system comprises the load balancer and a plurality of back-end service nodes respectively connected with the load balancer. Referring to fig. 6, the device comprises:
A determining module 601, configured to, in the process of forwarding service requests to the back-end service nodes, determine that a back-end service node is in a fault state if a connection state abnormal event is detected for that node, and terminate scheduling of service requests to that node; wherein the connection state abnormal event includes: receiving response information returned by the back-end service node that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting duration after a service request, or a connection request corresponding to the service request, is sent.
In the barrier removal device for a service node provided by the embodiment of the invention, while the load balancer forwards service requests to the back-end service nodes, whether a back-end service node is in a fault state is determined from its response to the service request; specifically, if a connection state abnormal event is detected for any back-end service node in this process, that node is determined to be in a fault state. Detection of faulty back-end service nodes is therefore based on real service requests, and no detection request needs to be constructed. This ensures the accuracy of the fault node detection result, allows faulty back-end service nodes in the system to be removed in time, does not intrude upon the back-end service, and keeps the fault removal overhead low.
In one embodiment of the present invention, the determining module 601 includes:
a first determining unit, configured to, each time a connection state abnormal event is detected for any back-end service node, determine that the back-end service node is in a state to be confirmed if it is not already in that state, and judge whether the number of times a connection state abnormal event has been detected for the back-end service node in the state to be confirmed is not less than the maximum number of abnormalities; the duration of the state to be confirmed is a preset state confirmation duration;
if yes, determining that the back-end service node is in a fault state;
if not, determining that the back-end service node is in a normal state.
In one embodiment of the invention, the apparatus further comprises:
a recovery module, configured to resume scheduling of service requests to a back-end service node once that node has been in the fault state for the fault suppression duration.
In one embodiment of the invention, the apparatus further comprises:
a shortening module, configured to, for a back-end service node that was determined to be in a fault state, after scheduling of service requests to the node is resumed, determine that the node has returned to the normal state if a connection state normal event is detected for it, and, each time a connection state normal event is detected for the node, shorten the fault suppression duration configured for the node based on a first preset proportion if a preset shortening condition is satisfied; the connection state normal event includes: receiving response information returned by the back-end service node that indicates a normal connection state; the preset shortening condition includes: the shortened fault suppression duration is not less than the preset shortest duration.
In one embodiment of the invention, the apparatus further comprises:
the extension module is used for extending the fault suppression time length configured for the back-end service node based on a second preset proportion under the condition that the preset extension condition is met after each time that any back-end service node is determined to be in a fault state; the preset extension conditions include: the prolonged fault suppression time is not longer than the preset maximum time.
In one embodiment of the present invention, in the case of back-end service node barrier removal based on the TCP protocol, the connection state exception event includes any one or more of the following:
the connection state of the load balancer changes from the SYN-RECV state or the SYN-SENT state to the CLOSE state;
the load balancer times out in the SYN-RECV state or the SYN-SENT state;
receiving a RST packet returned by a back-end service node;
in the case of performing back-end service node barrier removal based on the UDP protocol, the connection state exception event includes any one or more of the following:
when the UDP connection between the load balancer and the back-end service node is a UDP connection with bidirectional data transmission, no response information returned by the back-end service node is received within the preset waiting duration after the service request is sent;
Receiving ICMP error messages of host unreachable type or port unreachable type returned by the back-end service node;
in the case of back-end service node barrier removal based on HTTP/HTTPs protocol, the connection status exception event includes any one or more of:
receiving an error HTTP response status code returned by the back-end service node;
not receiving response information returned by the back-end service node within a preset waiting duration after the service request is sent.
In one embodiment of the present invention, in the case of performing back-end service node barrier removal based on the TCP protocol, the connection state normal event includes any one or more of the following:
the connection state of the load equalizer enters an Established state;
receiving a SYN/ACK packet returned by the back-end service node;
under the condition of performing back-end service node barrier removal based on UDP protocol, the normal events of the connection state comprise:
receiving response information returned by the back-end service node after sending a service request;
in the case of performing back-end service node barrier removal based on the HTTP/HTTPs protocol, the connection state normal event includes:
receiving a normal HTTP response status code returned by the back-end service node.
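The TCP and HTTP event lists above can be illustrated with two small classifier functions. The sketch makes several assumptions that the description does not fix: TCP states are passed as plain strings, the boolean flags stand for observations made by the forwarding path, and an "error" HTTP status code is taken to mean 5xx. All names are hypothetical.

```python
from typing import Optional

HANDSHAKE_STATES = {"SYN-RECV", "SYN-SENT"}


def classify_tcp(
    old_state: str,
    new_state: str,
    timed_out: bool = False,
    rst_received: bool = False,
    syn_ack_received: bool = False,
) -> Optional[str]:
    """Return "abnormal", "normal", or None if the observation is inconclusive."""
    if rst_received:
        return "abnormal"
    if old_state in HANDSHAKE_STATES and (timed_out or new_state == "CLOSE"):
        return "abnormal"
    if syn_ack_received or new_state == "ESTABLISHED":
        return "normal"
    return None


def classify_http(status_code: Optional[int]) -> str:
    """status_code=None models 'no response within the preset waiting duration'."""
    # Assumption for this sketch: only 5xx codes count as error status codes.
    if status_code is None or status_code >= 500:
        return "abnormal"
    return "normal"
```

For example, `classify_tcp("SYN-SENT", "CLOSE")` yields `"abnormal"`, `classify_tcp("SYN-SENT", "ESTABLISHED")` yields `"normal"`, and `classify_http(None)` yields `"abnormal"`.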
The embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 communicate with each other through the communication bus 704;
a memory 703 for storing a computer program;
the processor 701 is configured to execute the program stored in the memory 703, and implement the following steps:
in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a fault state, and stopping scheduling of service requests to the back-end service node; wherein the connection state abnormal event includes: receiving response information returned by the back-end service node that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting duration after a service request, or a connection request corresponding to the service request, is sent.
The communication bus of the above electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, abbreviated as PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated as EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented by a single bold line in the figure, which does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, where a computer program is stored, where the computer program, when executed by a processor, implements the health checking method of a backend service node in a load balancing system according to any one of the foregoing embodiments.
In yet another embodiment of the present invention, a computer program product comprising instructions that, when executed on a computer, cause the computer to perform the method for health checking of a backend service node in a load balancing system according to any of the previous embodiments is provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for embodiments of the load balancer, the barrier apparatus of the service node, the electronic device, and the readable storage medium, the description is relatively simple since it is substantially similar to the method embodiments, and the relevant points are referred to in the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (15)

1. A barrier removal method for a service node, characterized in that the method is applied to a load balancer in a load balancing system, the load balancing system comprising the load balancer and a plurality of back-end service nodes respectively connected with the load balancer, and the method comprises:
in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a fault state, and stopping scheduling of service requests to the back-end service node; wherein the connection state abnormal event includes: receiving response information returned by the back-end service node that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting duration after the service request, or a connection request corresponding to the service request, is sent.
2. The method of claim 1, wherein the step of determining that the back-end service node is in a failure state if a connection state anomaly event is detected for any of the back-end service nodes comprises:
each time a connection state abnormal event is detected for any back-end service node, determining that the back-end service node is in a state to be confirmed if it is not already in that state, and judging whether the number of times a connection state abnormal event has been detected for the back-end service node in the state to be confirmed is not less than the maximum number of abnormalities; the duration of the state to be confirmed is a preset state confirmation duration;
if yes, determining that the back-end service node is in a fault state;
if not, determining that the back-end service node is in a normal state.
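As a non-limiting illustration of claims 1 and 2, the confirmation logic may be sketched as a small per-node state machine. The class name NodeHealth, the field names and the default values (a 5-second confirmation duration and a threshold of 3) are hypothetical and do not come from the claims; the sketch also assumes that a node in the state to be confirmed continues to receive service requests.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    NORMAL = "normal"
    TO_BE_CONFIRMED = "to_be_confirmed"
    FAULT = "fault"


@dataclass
class NodeHealth:
    confirm_window: float = 5.0  # hypothetical preset state confirmation duration (s)
    max_abnormal: int = 3        # hypothetical maximum number of anomalies
    state: NodeState = NodeState.NORMAL
    window_start: float = 0.0
    abnormal_count: int = 0

    def _expire_window(self, now: float) -> None:
        # A confirmation window that ends below the threshold returns the node to normal.
        if (self.state is NodeState.TO_BE_CONFIRMED
                and now - self.window_start >= self.confirm_window):
            self.state = NodeState.NORMAL
            self.abnormal_count = 0

    def on_abnormal_event(self, now: float) -> NodeState:
        """Record one connection state abnormal event observed at time `now`."""
        self._expire_window(now)
        if self.state is NodeState.FAULT:
            return self.state
        if self.state is NodeState.NORMAL:
            # First anomaly: open the confirmation window.
            self.state = NodeState.TO_BE_CONFIRMED
            self.window_start = now
            self.abnormal_count = 0
        self.abnormal_count += 1
        if self.abnormal_count >= self.max_abnormal:
            self.state = NodeState.FAULT
        return self.state

    def schedulable(self, now: float) -> bool:
        # Assumption: only a node in the fault state is excluded from scheduling.
        self._expire_window(now)
        return self.state is not NodeState.FAULT
```

Time is passed in explicitly so that the behaviour can be exercised deterministically; a real load balancer would likely read a monotonic clock instead.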
3. The method as recited in claim 1, further comprising:
after a fault suppression duration has elapsed since any back-end service node was determined to be in a fault state, resuming scheduling of service requests to the back-end service node.
4. A method according to claim 3, characterized in that the method further comprises:
for a back-end service node determined to be in a fault state, after a service request is scheduled to the back-end service node, if a connection state normal event is detected for the back-end service node, determining that the back-end service node is in a normal state, and, each time the connection state normal event is detected for the back-end service node, shortening the fault suppression duration configured for the back-end service node based on a first preset proportion if a preset shortening condition is met; wherein the connection state normal event includes: receiving response information returned by the back-end service node that indicates a normal connection state; and the preset shortening condition includes: the shortened fault suppression duration is not less than a preset shortest duration.
5. A method according to claim 3, characterized in that the method further comprises:
each time any back-end service node is determined to be in a fault state, extending the fault suppression duration configured for the back-end service node based on a second preset proportion if a preset extension condition is met; wherein the preset extension condition includes: the extended fault suppression duration is not greater than a preset maximum duration.
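As a non-limiting illustration of claims 3 to 5, the adjustment of the fault suppression duration may be sketched as follows. The claims do not state how a "preset proportion" is applied; this sketch assumes both proportions are multiplicative factors, and it assumes that an extension takes effect from the next fault onward. The class name SuppressionTimer and all default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuppressionTimer:
    duration: float = 10.0      # current fault suppression duration (s), hypothetical
    min_duration: float = 5.0   # preset shortest duration, hypothetical
    max_duration: float = 80.0  # preset maximum duration, hypothetical
    shorten_ratio: float = 0.5  # "first preset proportion", assumed multiplicative
    extend_ratio: float = 2.0   # "second preset proportion", assumed multiplicative
    resume_at: Optional[float] = None

    def on_fault(self, now: float) -> float:
        """Record a fault at `now` and return the time at which scheduling resumes."""
        self.resume_at = now + self.duration
        # Assumption: the extension applies to the next fault, not the current one.
        extended = self.duration * self.extend_ratio
        if extended <= self.max_duration:
            self.duration = extended
        return self.resume_at

    def on_normal_event(self) -> None:
        """Shorten the duration after a connection state normal event, if allowed."""
        shortened = self.duration * self.shorten_ratio
        if shortened >= self.min_duration:
            self.duration = shortened

    def may_resume(self, now: float) -> bool:
        return self.resume_at is not None and now >= self.resume_at
```

With these assumed factors, a node that fails repeatedly is suppressed for progressively longer periods up to the maximum, while a node that keeps answering normally after recovery drifts back toward the shortest duration.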
6. The method according to any one of claims 1-5, wherein in the case of performing back-end service node barrier removal based on the TCP protocol, the connection state abnormal event comprises any one or more of the following:
the connection state of the load balancer changes from the SYN-RECV state or the SYN-SENT state to the CLOSE state;
the load balancer times out in the SYN-RECV state or the SYN-SENT state;
receiving an RST packet returned by the back-end service node;
in the case of performing back-end service node barrier removal based on the UDP protocol, the connection state abnormal event comprises any one or more of the following:
when the UDP connection between the load balancer and the back-end service node carries bidirectional data transmission, not receiving response information returned by the back-end service node within the preset waiting time after the service request is sent;
receiving an ICMP error message of the host-unreachable type or the port-unreachable type returned by the back-end service node;
in the case of performing back-end service node barrier removal based on the HTTP/HTTPS protocol, the connection state abnormal event comprises any one or more of the following:
receiving an error HTTP response status code returned by the back-end service node;
not receiving response information returned by the back-end service node within the preset waiting time after the service request is sent.
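As a non-limiting illustration of claim 6, the per-protocol conditions may be sketched as a single predicate over an observed exchange. The Observation record and its field names are hypothetical. The claim does not say which HTTP status codes count as errors, so treating 5xx codes as errors is an assumption of this sketch. ICMP type 3 (Destination Unreachable) with code 1 (host unreachable) or code 3 (port unreachable) follows the standard ICMP numbering.

```python
from dataclasses import dataclass
from typing import Optional

HANDSHAKE_STATES = {"SYN-RECV", "SYN-SENT"}


@dataclass
class Observation:
    protocol: str                         # "tcp", "udp" or "http"
    prev_tcp_state: Optional[str] = None  # TCP state before the transition
    tcp_state: Optional[str] = None       # current TCP state
    rst_received: bool = False
    timed_out: bool = False               # no reply within the preset waiting time
    udp_bidirectional: bool = False       # UDP flow expects replies
    icmp_type: Optional[int] = None
    icmp_code: Optional[int] = None
    http_status: Optional[int] = None


def is_abnormal(obs: Observation) -> bool:
    """Return True if the observation matches a connection state abnormal event."""
    if obs.protocol == "tcp":
        handshake_closed = (obs.prev_tcp_state in HANDSHAKE_STATES
                            and obs.tcp_state == "CLOSE")
        handshake_timeout = obs.tcp_state in HANDSHAKE_STATES and obs.timed_out
        return handshake_closed or handshake_timeout or obs.rst_received
    if obs.protocol == "udp":
        # A missing reply only counts when the flow is bidirectional.
        no_reply = obs.udp_bidirectional and obs.timed_out
        unreachable = obs.icmp_type == 3 and obs.icmp_code in (1, 3)
        return no_reply or unreachable
    if obs.protocol == "http":
        # Assumption: 5xx status codes are the "error" codes.
        error_status = obs.http_status is not None and obs.http_status >= 500
        return error_status or obs.timed_out
    raise ValueError(f"unsupported protocol: {obs.protocol}")
```

Keeping the classification in one pure function makes it straightforward to feed the result into whatever fault-confirmation logic the load balancer uses.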
7. The method according to claim 4, wherein in the case of performing back-end service node barrier removal based on the TCP protocol, the connection state normal event comprises any one or more of the following:
the connection state of the load balancer enters the Established state;
receiving a SYN/ACK packet returned by the back-end service node;
in the case of performing back-end service node barrier removal based on the UDP protocol, the connection state normal event comprises:
receiving response information returned by the back-end service node after the service request is sent;
in the case of performing back-end service node barrier removal based on the HTTP/HTTPS protocol, the connection state normal event comprises:
receiving a normal HTTP response status code returned by the back-end service node.
8. A load balancer, comprising: a control unit and a plurality of forwarding units each communicatively connected to the control unit;
the forwarding unit is configured to forward service requests to back-end service nodes, and, if a connection state abnormal event is detected for any back-end service node during forwarding of a service request, report error information for the back-end service node to the control unit; the connection state abnormal event includes: receiving response information returned by the back-end service node that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after the service request, or a connection request corresponding to the service request, is sent;
the control unit is configured to, upon receiving error information reported by a forwarding unit for any back-end service node, determine that the back-end service node is in a fault state, and issue a removal instruction for the back-end service node to all forwarding units;
the forwarding unit is further configured to stop forwarding service requests to the back-end service node after receiving the removal instruction for any back-end service node.
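As a non-limiting illustration of claim 8, the interaction between the control unit and the forwarding units may be sketched in-process as follows. The class and method names are hypothetical, the `send` callable stands in for the actual network exchange, and the sketch omits the confirmation window of claim 9, so a single error report leads directly to removal.

```python
from typing import Callable, Iterable, List, Set


class ControlUnit:
    def __init__(self) -> None:
        self.units: List["ForwardingUnit"] = []
        self.faulty: Set[str] = set()

    def register(self, unit: "ForwardingUnit") -> None:
        self.units.append(unit)

    def report_error(self, node: str) -> None:
        # Mark the node as faulty and issue a removal instruction to every unit.
        self.faulty.add(node)
        for unit in self.units:
            unit.remove_node(node)


class ForwardingUnit:
    def __init__(self, control: ControlUnit, nodes: Iterable[str]) -> None:
        self.control = control
        self.active: Set[str] = set(nodes)
        control.register(self)

    def remove_node(self, node: str) -> None:
        # Removal instruction: stop forwarding to this node.
        self.active.discard(node)

    def forward(self, node: str, send: Callable[[str], bool]) -> bool:
        """Forward one request; `send` returns False on a connection state abnormal event."""
        if node not in self.active:
            return False
        if not send(node):
            self.control.report_error(node)
            return False
        return True
```

The point of the split is that an anomaly seen by one forwarding unit removes the node from all of them, so the other units do not each have to discover the fault independently.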
9. The load balancer of claim 8, wherein the control unit is specifically configured to: each time error information reported by a forwarding unit is received for any back-end service node, if the back-end service node is not in a state to be confirmed, determine that the back-end service node is in the state to be confirmed, and judge whether the number of pieces of error information received from the forwarding units for the back-end service node while it is in the state to be confirmed is not less than a maximum number of anomalies, wherein the duration of the state to be confirmed is a preset state confirmation duration; if so, determine that the back-end service node is in a fault state and issue a removal instruction for the back-end service node to all forwarding units; otherwise, determine that the back-end service node is in a normal state.
10. The load balancer of claim 8, wherein the control unit is further configured to issue a recovery instruction for any back-end service node to all forwarding units after a fault suppression duration has elapsed since the back-end service node was determined to be in a fault state;
the forwarding unit is further configured to resume forwarding service requests to the back-end service node after receiving the recovery instruction for any back-end service node.
11. The load balancer of claim 10, wherein the forwarding unit is further configured to, after resuming forwarding the service request to the back-end service nodes, report, to the control unit, normal information for any of the back-end service nodes if a connection state normal event is detected for the back-end service node;
the control unit is further configured to, for a back-end service node determined to be in a fault state, after issuing a recovery instruction for the back-end service node, each time normal information reported by a forwarding unit for the back-end service node is received, determine that the back-end service node has returned to a normal state and, if a preset shortening condition is met, shorten the fault suppression duration configured for the back-end service node based on a first preset proportion; wherein the preset shortening condition includes: the shortened fault suppression duration is not less than a preset shortest duration.
12. The load balancer of claim 11, wherein, after receiving a recovery instruction issued by the control unit for any back-end service node, the number of times the forwarding unit reports the normal information for the back-end service node to the control unit does not exceed a preset maximum number of reports.
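As a non-limiting illustration of claims 11 and 12, the cap on normal-information reports may be sketched as a per-node budget that is refilled by each recovery instruction. The class name RecoveringForwarder, the callback `report_normal` and the default cap of 3 are hypothetical.

```python
from typing import Callable, Dict


class RecoveringForwarder:
    def __init__(self, report_normal: Callable[[str], None], max_reports: int = 3) -> None:
        self.report_normal = report_normal    # hypothetical callback into the control unit
        self.max_reports = max_reports        # preset maximum number of reports
        self.remaining: Dict[str, int] = {}   # node -> reports still allowed

    def on_recovery_instruction(self, node: str) -> None:
        # Each recovery instruction grants a fresh reporting budget for the node.
        self.remaining[node] = self.max_reports

    def on_normal_event(self, node: str) -> None:
        # Report only while the budget for this node is not exhausted.
        left = self.remaining.get(node, 0)
        if left > 0:
            self.remaining[node] = left - 1
            self.report_normal(node)
```

Bounding the reports in this way keeps a healthy, busy node from flooding the control unit with one message per successful request.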
13. A barrier removal device for a service node, wherein the device is applied to a load balancer in a load balancing system, the load balancing system comprises the load balancer and a plurality of back-end service nodes respectively connected to the load balancer, and the device comprises:
a determining module, configured to, in the process of forwarding service requests to the back-end service nodes, if a connection state abnormal event is detected for any back-end service node, determine that the back-end service node is in a fault state and stop scheduling service requests to the back-end service node; wherein the connection state abnormal event includes: receiving response information returned by the back-end service node that indicates an abnormal connection state, and/or not receiving response information returned by the back-end service node within a preset waiting time after a service request, or a connection request corresponding to the service request, is sent.
14. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to carry out the method steps of any one of claims 1-7 when executing the program stored in the memory.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311317985.6A CN117255001A (en) 2023-10-12 2023-10-12 Barrier removing method and device for service node, load balancer and readable storage medium

Publications (1)

Publication Number Publication Date
CN117255001A true CN117255001A (en) 2023-12-19

Family

ID=89136703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311317985.6A Pending CN117255001A (en) 2023-10-12 2023-10-12 Barrier removing method and device for service node, load balancer and readable storage medium

Country Status (1)

Country Link
CN (1) CN117255001A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination