CN110837432A - Method and device for determining abnormal node in service cluster and monitoring server - Google Patents

Publication number
CN110837432A
Authority
CN
China
Prior art keywords
service
service node
node
probability
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911117768.6A
Other languages
Chinese (zh)
Inventor
汤爱迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd, Beijing Kingsoft Cloud Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN201911117768.6A priority Critical patent/CN110837432A/en
Publication of CN110837432A publication Critical patent/CN110837432A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits

Abstract

The invention provides a method, a device and a monitoring server for determining abnormal nodes in a service cluster. The method comprises: acquiring operation parameters of each service node in a preset service cluster; performing first screening processing on the service nodes in the service cluster according to the operation parameters and their corresponding parameter thresholds to obtain a first screening result; determining the probability of abnormal operation of each service node according to the operation parameters, and performing second screening processing on the service nodes in the service cluster based on the probability and a preset probability threshold to obtain a second screening result; and determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result. Because the service nodes in the service cluster are screened twice and only the service nodes present in both screening results are determined to be abnormal nodes, the accuracy of judging service node abnormality is improved, the number of normal service nodes in the service cluster is preserved, and the operation stability of the service cluster is improved.

Description

Method and device for determining abnormal node in service cluster and monitoring server
Technical Field
The present invention relates to the technical field of servers, and in particular, to a method and an apparatus for determining an abnormal node in a service cluster, and a monitoring server.
Background
In a service cluster including a plurality of service nodes, network jitter or service failure often causes a service node to enter an unavailable or abnormal state. An abnormal-service detection and isolation mechanism is therefore usually introduced to perform request isolation and elastic recovery for such service nodes.
In the related technology, a server usually records parameters such as the total number of requests, the number of request errors and the request duration of each service node. When the error rate or the timeout rate of a service node exceeds its threshold, the service node is judged to be abnormal and is then isolated. Judging abnormality in this way has a high misjudgment rate: a large number of normal service nodes are easily isolated by mistake, which greatly reduces the number of normal service nodes and increases the operating pressure of the service cluster.
Disclosure of Invention
The invention aims to provide a method and a device for determining abnormal nodes in a service cluster and a monitoring server, so as to improve the accuracy of judging the abnormal nodes of the service cluster and improve the operation stability of the service cluster.
In a first aspect, an embodiment of the present invention provides a method for determining an abnormal node in a service cluster, where the method includes: acquiring operation parameters of each service node in a preset service cluster; the service cluster comprises a plurality of service nodes; according to the operation parameters and the parameter threshold values corresponding to the operation parameters, performing first screening processing on the service nodes in the service cluster to obtain first screening results; determining the probability of abnormal operation of each service node according to the operation parameters; performing second screening processing on the service nodes in the service cluster according to the probability of abnormal operation of each service node and a preset probability threshold value to obtain a second screening result; and determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result.
In a preferred embodiment of the present invention, the step of performing a first screening process on the service nodes in the service cluster according to the operation parameters and the parameter threshold corresponding to the operation parameters to obtain a first screening result includes: for each service node in the service cluster, if the operation parameter of the current service node meets the parameter threshold corresponding to the operation parameter, adding the current service node into the first screening result; wherein the operating parameters include: a request error rate and/or a request timeout rate.
In a preferred embodiment of the present invention, the step of determining the probability of the abnormal operation of each service node according to the operation parameters includes: for each service node in the service cluster, inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of abnormal operation of the current service node.
In a preferred embodiment of the present invention, the step of inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of the abnormal operation of the current service node includes: calculating the mean of the operation parameters of the service nodes in the service cluster
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$
calculating the variance of the operation parameters of the service nodes in the service cluster
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2$$
where $m$ represents the total number of service nodes in the service cluster and $x_i$ represents the operation parameter corresponding to the $i$-th service node; and inputting the operation parameters of the current service node, the mean and the variance into a preset Gaussian model to obtain the probability of the abnormal operation of the current service node.
In a preferred embodiment of the present invention, the current service node includes a plurality of operation parameters; the step of inputting the operation parameters of the current service node into the preset probability distribution model to obtain the probability of the abnormal operation of the current service node includes: for each operation parameter, inputting the current operation parameter into a preset probability distribution model to obtain the probability corresponding to the current operation parameter; and multiplying the probabilities corresponding to each operation parameter to obtain the probability of the abnormal operation of the current service node.
In a preferred embodiment of the present invention, the step of performing a second screening process on the service nodes in the service cluster according to the probability of the abnormal operation of each service node and a preset probability threshold to obtain a second screening result includes: for each service node in the service cluster, if the probability of the abnormal operation of the current service node is smaller than a preset probability threshold, adding the current service node to the second screening result.
In a preferred embodiment of the present invention, the step of determining a service node with abnormal operation in the service cluster according to the first screening result and the second screening result includes: calculating the intersection of the first screening result and the second screening result; and determining the service nodes in the intersection as the service nodes with abnormal operation in the service cluster.
In a preferred embodiment of the present invention, after the step of determining a service node with abnormal operation in the service cluster, the method further includes: setting the service node with abnormal operation as an isolation state, and counting the time length of the service node with abnormal operation in the isolation state; if the time length reaches the preset time length, sending a recovery request to the service node with abnormal operation; if a signal of successful recovery returned by the service node with abnormal operation is received, setting the service node with abnormal operation as a normal operation state; and if a signal of failure recovery returned by the service node with abnormal operation is received, resetting the duration, and continuously performing the step of counting the duration of the service node with abnormal operation in the isolation state.
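The isolation and recovery cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `IsolationTracker`, the `tick` method, and the `send_recovery_request` callback (assumed to return True for a success signal and False for a failure signal) are all hypothetical, and the current time is passed in explicitly so the example stays deterministic.

```python
class IsolationTracker:
    """Tracks isolated service nodes and retries recovery after a fixed duration.

    All names here are hypothetical; the source only describes the behaviour.
    """

    def __init__(self, isolation_seconds, send_recovery_request):
        self.isolation_seconds = isolation_seconds
        # Assumed callback: callable(node_id) -> bool (True = recovery succeeded).
        self.send_recovery_request = send_recovery_request
        # node_id -> time at which the current isolation period started
        self.isolated_since = {}

    def isolate(self, node_id, now):
        # Set the abnormal node to the isolation state and start timing.
        self.isolated_since[node_id] = now

    def is_isolated(self, node_id):
        return node_id in self.isolated_since

    def tick(self, now):
        # Iterate over a copy because entries may be removed during the loop.
        for node_id, since in list(self.isolated_since.items()):
            if now - since < self.isolation_seconds:
                continue  # preset duration not reached yet
            if self.send_recovery_request(node_id):
                # Success signal: the node returns to the normal operation state.
                del self.isolated_since[node_id]
            else:
                # Failure signal: reset the duration and keep the node isolated.
                self.isolated_since[node_id] = now
```

Calling `tick` periodically from the monitoring loop would give each isolated node one recovery attempt per isolation period.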
In a second aspect, an embodiment of the present invention provides an apparatus for determining an abnormal node in a service cluster, where the apparatus includes: the parameter acquisition module is used for acquiring the operation parameters of each service node in the preset service cluster; the service cluster comprises a plurality of service nodes; the first screening module is used for carrying out first screening processing on the service nodes in the service cluster according to the operation parameters and the parameter threshold values corresponding to the operation parameters to obtain a first screening result; the probability determining module is used for determining the probability of abnormal operation of each service node according to the operation parameters; the second screening module is used for performing second screening processing on the service nodes in the service cluster according to the probability of the abnormal operation of each service node and a preset probability threshold value to obtain a second screening result; and the abnormal node determining module is used for determining the service nodes which run abnormally in the service cluster according to the first screening result and the second screening result.
In a third aspect, an embodiment of the present invention provides a monitoring server, which includes a processor and a memory, where the memory stores machine executable instructions that can be executed by the processor, and the processor executes the machine executable instructions to implement the method for determining an abnormal node in the service cluster.
In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described method for determining an abnormal node in a service cluster.
The embodiment of the invention has the following beneficial effects:
According to the method, the device and the monitoring server for determining abnormal nodes in a service cluster provided by the embodiments of the present invention, the operation parameters of each service node in a preset service cluster are first obtained; then, according to the operation parameters and the parameter thresholds corresponding to the operation parameters, first screening processing is performed on the service nodes in the service cluster to obtain a first screening result; the probability of abnormal operation of each service node is determined according to the operation parameters; then, according to the probability of abnormal operation of each service node and a preset probability threshold, second screening processing is performed on the service nodes in the service cluster to obtain a second screening result; and the service nodes with abnormal operation in the service cluster are determined according to the first screening result and the second screening result. In this manner, the service nodes in the service cluster are screened twice, and the service nodes present in both screening results are determined to be abnormal nodes, so that the accuracy of judging service node abnormality is improved, the number of normal service nodes in the service cluster is preserved to a certain extent, and the operation stability of the service cluster is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of an application scenario for determining an abnormal node in a service cluster according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for determining an abnormal node in a service cluster according to an embodiment of the present invention;
Fig. 3 is a flowchart of another method for determining an abnormal node in a service cluster according to an embodiment of the present invention;
Fig. 4 is a flowchart of another method for determining an abnormal node in a service cluster according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus for determining an abnormal node in a service cluster according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a monitoring server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding, an application scenario for determining an abnormal node in a service cluster is first shown, and as shown in fig. 1, the scenario includes the service cluster and a monitoring server; the service cluster comprises a plurality of service nodes, and the service nodes are deployed on different servers; the monitoring server is typically connected to a server in which the service nodes are deployed to monitor the operational status of the service nodes.
In the related technology, a monitoring server usually records parameters such as the total number of requests, the number of request errors and the request duration of each service node, and calculates an error rate and a timeout rate for each service node over a specified time period from the recorded parameters. When the error rate of a service node exceeds an error threshold or its timeout rate exceeds a timeout threshold, the service node is determined to be abnormal and is isolated. This determination depends entirely on the preset error threshold and timeout threshold. In actual production, however, an increase in the error rate or the timeout rate is not necessarily caused by an abnormality of the service node. For example, in an image recognition service, when a requester sends a batch of large images, the recognition time of the service node increases and its requests time out; or the requester's calls are abnormal, causing the service node to report errors, which raises the error rate. This manner of judging whether a service node is abnormal therefore has a high misjudgment rate, so a large number of normal service nodes are easily isolated by mistake, which greatly reduces the number of normal service nodes and increases the operating pressure of the service cluster.
Based on the above problems, embodiments of the present invention provide a method and an apparatus for determining an abnormal node in a service cluster, and a monitoring server, where the technique may be applied to monitoring processing scenarios of various service clusters, especially monitoring scenarios of an operating state, an abnormal state, and the like of a service node. To facilitate understanding of the present embodiment, first, a detailed description is given of a method for determining an abnormal node in a service cluster, as shown in fig. 2, where the method includes the following steps:
Step S202, obtaining operation parameters of each service node in a preset service cluster; the service cluster comprises a plurality of service nodes.
The service cluster may also be called a server cluster. It generally refers to a plurality of servers grouped together to provide the same service, which appear to the client as a single server. A service cluster can use multiple computers for parallel computation to obtain high computation speed, and can also use multiple computers for backup, so that the whole system continues to run normally even if any one machine fails.
The service cluster generally includes a plurality of service nodes, and each service node may be regarded as a server or a service process on the server. The operation parameters of the service node may be one or more, and generally include an error rate or a request timeout rate of the service node during operation; the error rate is related to the total request times and the request error times of the service node, and the request timeout rate is related to the total request times and the request duration of the service node.
Step S204, according to the operation parameters and the parameter threshold corresponding to the operation parameters, performing a first screening process on the service nodes in the service cluster to obtain a first screening result.
The parameter threshold is usually set by a user according to requirements, and its value usually affects the number of service nodes in the first screening result; for example, when the parameter threshold is higher, relatively fewer service nodes may meet it. In a specific implementation, the operation parameter is usually compared with the parameter threshold, and the service node whose operation parameter meets the parameter threshold is put into the first screening result; for example, when the operation parameter is greater than the parameter threshold, the corresponding service node is put into the first screening result.
Step S206, determining the probability of abnormal operation of each service node according to the operation parameters.
The probability that a service node operates abnormally or normally is judged from the value of the operation parameter, where the probability of abnormal operation equals one minus the probability of normal operation. For example, for the error rate among the above operation parameters, generally, the higher the error rate, the higher the probability that the service node is operating abnormally. The probability of abnormal operation of each service node is usually obtained from a probability distribution function or a probability density function.
Step S208, performing second screening processing on the service nodes in the service cluster according to the abnormal operation probability of each service node and a preset probability threshold to obtain a second screening result.
The preset probability threshold is usually set by a user according to requirements. In a specific implementation, the probability of the abnormal operation of each service node is usually compared with a preset probability threshold, and the service node corresponding to the probability meeting the probability threshold is put into the second screening result, for example, when the probability of the abnormal operation is greater than the probability threshold, the service node corresponding to the probability of the abnormal operation is put into the second screening result.
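As a minimal sketch of this comparison, assuming the abnormal-operation probabilities have already been computed and following the "greater than" example given in this paragraph (the function name and the dictionary layout are hypothetical, not taken from the source):

```python
def second_screening(abnormal_probabilities, probability_threshold):
    """Return the node ids whose abnormal-operation probability exceeds the threshold.

    abnormal_probabilities: dict mapping node id -> probability of abnormal operation
    (hypothetical layout chosen for this illustration).
    """
    return {
        node_id
        for node_id, probability in abnormal_probabilities.items()
        if probability > probability_threshold
    }
```

For instance, `second_screening({"a": 0.2, "b": 0.97}, 0.9)` keeps only node `"b"`.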
Step S210, determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result.
In a specific implementation, all service nodes that appear in either the first screening result or the second screening result may be determined as abnormal service nodes (equivalent to the service nodes with abnormal operation); alternatively, only the service nodes that appear in both the first screening result and the second screening result may be determined as abnormal service nodes; or the service nodes contained in the first screening result and the second screening result may be screened again based on the operation parameters, and the service nodes remaining after that screening are determined as abnormal service nodes.
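The first two ways of combining the screening results (every node that appears in either result, or only the nodes that appear in both) can be illustrated with plain set operations. This is a sketch only; the function name and the `mode` argument are hypothetical, and the third option of re-screening on the operation parameters is left out because the source does not specify how it is done.

```python
def combine_screenings(first_result, second_result, mode="intersection"):
    """Combine two screening results given as iterables of node ids."""
    first, second = set(first_result), set(second_result)
    if mode == "intersection":
        # Only nodes flagged by both screenings are treated as abnormal.
        return first & second
    if mode == "union":
        # Every node flagged by at least one screening is treated as abnormal.
        return first | second
    raise ValueError(f"unknown mode: {mode!r}")
```

The intersection isolates the fewest nodes, which matches the stated goal of reducing misjudgment of normal service nodes.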
Firstly, acquiring operation parameters of each service node in a preset service cluster; then, according to the operation parameter and a parameter threshold corresponding to the operation parameter, performing first screening processing on service nodes in the service cluster to obtain a first screening result; determining the probability of abnormal operation of each service node according to the operation parameters; then, according to the probability of abnormal operation of each service node and a preset probability threshold value, performing second screening processing on the service nodes in the service cluster to obtain a second screening result; and determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result. In the method, the service nodes in the service cluster are screened twice, and the service nodes existing in the screening results of the two times are determined as abnormal nodes, so that the accuracy of judging the abnormality of the service nodes is improved, the number of normal service nodes in the service cluster is ensured to a certain extent, and the operation stability of the service cluster is improved.
The embodiment of the invention also provides another method for determining the abnormal node in the service cluster, which is realized on the basis of the method in the embodiment; the method mainly describes a specific process of determining a first screening result (realized by the following step S304), and a specific process of determining the probability of each service node operating abnormally according to the operating parameters (realized by the following step S306); as shown in fig. 3, the method comprises the steps of:
Step S302, obtaining the operation parameters of each service node in the preset service cluster.
Step S304, aiming at each service node in the service cluster, if the operation parameter of the current service node meets the parameter threshold corresponding to the operation parameter, adding the current service node into a first screening result; wherein the operating parameters include: a request error rate and/or a request timeout rate.
In specific implementation, each service node in a service cluster needs to be screened, whether the operating parameters of the current service node meet parameter thresholds is judged, and if yes, the current service node is added to a first screening result; and then, taking the next service node of the current service node as a new current service node, and continuously judging whether the operating parameters of the current service node meet the parameter threshold value until all the service nodes in the service cluster are judged. The operating parameter satisfying the parameter threshold may be that the operating parameter is greater than the parameter threshold, or that the operating parameter is less than the parameter threshold.
The operating parameters may include only the error rate or only the request timeout rate, or may include both. The monitoring service can record the total number of requests, the number of errors and the request duration of the service node at each moment; the total number of requests, the number of request errors and the request durations of the service node within a certain time period can then be counted through a sliding window algorithm, and the error rate and the request timeout rate are determined from them. The error rate is generally equal to the number of request errors divided by the total number of requests, and the request timeout rate is equal to the number of timeouts divided by the total number of requests, where the number of timeouts depends on the duration of each request and a preset normal request duration.
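The sliding-window statistics described above might look like the following sketch. It assumes one record per finished request and treats a request as timed out when its duration exceeds the preset normal duration; the names `RequestRecord` and `SlidingWindowStats` and the exact window boundary rule are assumptions for illustration, not details given in the source.

```python
from collections import deque
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RequestRecord:
    timestamp: float  # time the request finished, in seconds
    failed: bool      # True if the request returned an error
    duration: float   # request duration, in seconds


class SlidingWindowStats:
    """Keeps the requests of one service node that fall inside the window."""

    def __init__(self, window_seconds: float, normal_duration: float):
        self.window_seconds = window_seconds
        # Assumption: a request longer than this counts as a timeout.
        self.normal_duration = normal_duration
        self.records = deque()

    def add(self, record: RequestRecord) -> None:
        # Records are assumed to arrive in timestamp order.
        self.records.append(record)

    def rates(self, now: float) -> Tuple[float, float]:
        """Return (error_rate, timeout_rate) over the last window_seconds."""
        # Drop records that have slid out of the window.
        while self.records and self.records[0].timestamp <= now - self.window_seconds:
            self.records.popleft()
        total = len(self.records)
        if total == 0:
            return 0.0, 0.0
        errors = sum(1 for r in self.records if r.failed)
        timeouts = sum(1 for r in self.records if r.duration > self.normal_duration)
        return errors / total, timeouts / total
```

Returning zero rates for an empty window is a design choice of this sketch; a real monitor might instead report that no data is available.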
When the operation parameters include both the error rate and the request timeout rate, the parameter thresholds include an error threshold and a timeout threshold. In this case the error rate and the request timeout rate are judged separately, and as long as either parameter satisfies its threshold, the corresponding service node is added to the first screening result. For example, if the error rate of the service node is greater than or equal to the error threshold, the service node is added to the first screening result; likewise, if the request timeout rate of the service node is greater than or equal to the timeout threshold, the service node is added to the first screening result.
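A minimal sketch of this first screening, assuming each node's error rate and request timeout rate are already available as a pair (the function name and the dictionary layout are hypothetical):

```python
def first_screening(node_rates, error_threshold, timeout_threshold):
    """Return the ids of nodes whose error rate or timeout rate meets its threshold.

    node_rates: dict mapping node id -> (error_rate, timeout_rate)
    (hypothetical layout chosen for this illustration).
    """
    return {
        node_id
        for node_id, (error_rate, timeout_rate) in node_rates.items()
        # One parameter meeting its threshold is enough to select the node.
        if error_rate >= error_threshold or timeout_rate >= timeout_threshold
    }
```

A node with a low error rate but a high timeout rate is still selected, which is why the second, probability-based screening is used to narrow the result.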
Step S306, for each service node in the service cluster, inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of the abnormal operation of the current service node.
The preset probability distribution model may be a Gaussian model, a Poisson distribution model, a Bernoulli distribution model, or the like. The model generally represents the approximate distribution of the operation parameters of the service nodes in the service cluster. The undetermined parameters of the model can be obtained by fitting to the operation parameters of all service nodes in the service cluster. The operation parameters of the current service node are then input into the fitted model, which outputs the probability that the current node operates normally, and the probability of abnormality is determined from this normal probability (the two probabilities sum to one).
When the predetermined probability distribution model is a gaussian model, the above step S306 can be generally implemented by the following steps 30-32:
step 30, calculating the average value of the operation parameters of each service node in the service cluster
Figure BDA0002273702980000101
Where m represents the total number of service nodes in the service cluster, xiAnd representing the operating parameters corresponding to the ith service node. Usually, the operating parameters of all the service nodes in the service cluster are summed and averaged to obtain the average value of the operating parameters.
Step 31, calculating the variance of the operation parameters of each service node in the service cluster
Figure BDA0002273702980000102
The variance of the operating parameter can be obtained according to the average value of the operating parameter of all the service nodes in the service cluster and the corresponding operating parameter.
Step 32, inputting the operating parameter of the current service node, together with the mean and the variance, into the preset Gaussian model to obtain the probability of abnormal operation of the current service node.

The formula of the Gaussian distribution model is:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

In the formula, $x$ represents the operating parameter of the current service node. Substituting the mean $\mu$, the variance $\sigma^2$, and the operating parameter of the current service node into the formula yields the probability that the current service node operates normally; subtracting this value from one gives the probability that the current service node operates abnormally.
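Steps 30-32 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the sample error rates, the 1/m form of the variance, and the handling of zero variance are not taken from the patent. Note also that the Gaussian formula returns a density, which can exceed one when the variance is small, so "one minus the output" behaves like a probability only when the parameter values are spread widely enough (the sample values are percentages for that reason).

```python
import math


def gaussian_abnormal_scores(values):
    """Steps 30-32: fit a Gaussian to all nodes, then score each node as 1 - p(x)."""
    m = len(values)
    mean = sum(values) / m                                  # step 30: mean over all nodes
    variance = sum((x - mean) ** 2 for x in values) / m     # step 31: 1/m form (assumption)
    if variance == 0:
        # Assumption: identical values on every node mean that no node stands out.
        return [0.0] * m
    scores = []
    for x in values:
        # Gaussian density at x; a density, not a true probability.
        density = math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)
        scores.append(1.0 - density)                        # step 32: one minus the model output
    return scores


# Hypothetical request error rates, in percent, for five service nodes.
error_rates = [1.0, 2.0, 1.5, 1.2, 60.0]
scores = gaussian_abnormal_scores(error_rates)
most_suspect = scores.index(max(scores))  # index of the node that deviates most
```

With these sample values the fifth node, whose error rate is far from the cluster mean, receives the largest score.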
In another implementation, the current service node may have a plurality of operating parameters, for example both an error rate and a request timeout rate. In this case, step S306 can generally be implemented by the following steps 40-41:
Step 40, for each operating parameter, inputting the current operating parameter into the preset probability distribution model to obtain the probability corresponding to that operating parameter. When the operating parameters include an error rate and a request timeout rate, each of them is input separately into the preset probability distribution model (such as a Gaussian model), yielding a probability corresponding to the error rate and a probability corresponding to the request timeout rate.
Step 41, multiplying the probabilities corresponding to the operating parameters to obtain the probability of abnormal operation of the current service node. In a specific implementation, the probability corresponding to the error rate and the probability corresponding to the request timeout rate may be multiplied to obtain the probability that the current service node operates normally, and the probability that the current service node operates abnormally is then determined from this normal probability.
For example, suppose the preset probability distribution model is a Gaussian model and a service node has two operating parameters. The mean and the variance of each operating parameter are calculated, and, following steps 40-41, the probability of abnormal operation of the current service node may be expressed as:

$$P_{\text{abnormal}} = 1 - \prod_{j=1}^{2}\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(y_j-\mu_j)^2}{2\sigma_j^2}\right)$$

In the formula, $y_j$ represents the two operating parameters $y_1$ and $y_2$, $\mu_j$ represents the corresponding means $\mu_1$ and $\mu_2$, and $\sigma_j^2$ represents the corresponding variances $\sigma_1^2$ and $\sigma_2^2$.
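The multi-parameter case of steps 40-41 can be sketched as below. The helper names, the sample data, the 1/m variance, and the decision to raise an error on zero variance are all assumptions made for illustration; the patent does not specify them.

```python
import math


def fit(values):
    """Return the mean and the 1/m variance (assumed form) of one operating parameter."""
    m = len(values)
    mean = sum(values) / m
    variance = sum((v - mean) ** 2 for v in values) / m
    if variance == 0:
        raise ValueError("zero variance: the Gaussian model is degenerate")
    return mean, variance


def gaussian_density(x, mean, variance):
    """Gaussian density at x for the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def multi_parameter_scores(nodes):
    """nodes: one tuple per service node, one value per operating parameter."""
    columns = list(zip(*nodes))               # group the values by operating parameter
    fitted = [fit(col) for col in columns]    # one (mean, variance) pair per parameter
    scores = []
    for node in nodes:
        normal = 1.0
        for y, (mean, variance) in zip(node, fitted):
            normal *= gaussian_density(y, mean, variance)   # step 40, multiplied in step 41
        scores.append(1.0 - normal)           # abnormal value derived from the normal product
    return scores


# Hypothetical (error rate %, request timeout rate %) pairs for five nodes.
nodes = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (1.2, 2.5), (60.0, 45.0)]
multi_scores = multi_parameter_scores(nodes)
worst = multi_scores.index(max(multi_scores))
```

Multiplying the per-parameter outputs treats the operating parameters as independent, which is a simplification implied by step 41 rather than a property of real error and timeout rates.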
Step S308, performing a second screening process on the service nodes in the service cluster according to the probability of abnormal operation of each service node and a preset probability threshold, so as to obtain a second screening result.
Step S310, determining a service node with abnormal operation in the service cluster according to the first screening result and the second screening result.
In the above method for determining an abnormal node in a service cluster, for each service node in the service cluster, if the operating parameter of the current service node meets the parameter threshold corresponding to that operating parameter, the current service node is added to the first screening result; the operating parameter of the current service node is also input into a preset probability distribution model to obtain the probability of abnormal operation of the current service node. A second screening process is then performed on the service nodes in the service cluster according to the probability of abnormal operation of each service node and a preset probability threshold, so as to obtain a second screening result. Finally, the service nodes with abnormal operation in the service cluster are determined according to the first screening result and the second screening result. Because the abnormal service nodes are determined by screening the service nodes twice, the method can effectively avoid misjudging normally running service nodes as abnormal when the errors or timeouts are actually caused by problems on the requester side. This helps keep the normally running service nodes stable and reduces the operating pressure on the service cluster.
An embodiment of the present invention further provides another method for determining an abnormal node in a service cluster, implemented on the basis of the method of the foregoing embodiment. This method mainly describes the specific process of determining the second screening result (implemented by step S408 below) and the specific process of determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result (implemented by steps S410-S412 below). As shown in fig. 4, the method comprises the following steps:
Step S402, obtaining the operating parameters of each service node in the preset service cluster.
Step S404, performing a first screening process on the service nodes in the service cluster according to the operation parameters and the parameter thresholds corresponding to the operation parameters, so as to obtain a first screening result.
Step S406, determining the probability of abnormal operation of each service node according to the operation parameters.
Step S408, for each service node in the service cluster, if the probability that the current service node operates abnormally is smaller than a preset probability threshold, adding the current service node to a second screening result.
The preset probability threshold is usually set according to user requirements. In a specific implementation, it is first judged whether the probability of abnormal operation of the current service node is smaller than the probability threshold; if it is smaller, the current service node is added to the second screening result. The next service node is then taken as the new current service node, and the judgment is repeated until all the service nodes in the service cluster have been judged.
Step S410, calculating the intersection of the first screening result and the second screening result. The service nodes placed in the first screening result and in the second screening result are compared one by one, and the service nodes included in both results are placed into the intersection.
Step S412, determining the service node in the intersection as a service node with abnormal operation in the service cluster.
The operating parameters of the service nodes in the intersection meet the preset parameter threshold, and the probability of abnormal operation of these service nodes is greater than the preset probability threshold. When the intersection is empty, no service node in the service cluster is operating abnormally, and every service node can receive service requests; when the intersection is not empty, the service nodes in the intersection are determined to be the service nodes with abnormal operation, and these service nodes are isolated.
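The two screenings and the intersection of steps S404-S412 can be sketched with Python sets. The node names, thresholds, and input values are hypothetical. Two interpretations are assumptions: "meets the parameter threshold" is read as "reaches or exceeds it", and the second screening copies the comparison from step S408 as written (smaller than the threshold). If the probability value is defined so that larger means more anomalous, the second comparison would be reversed.

```python
def first_screening(params, param_threshold):
    """Step S404. Assumption: a parameter 'meets' the threshold when it reaches or exceeds it."""
    return {node for node, value in params.items() if value >= param_threshold}


def second_screening(probabilities, prob_threshold):
    """Step S408 as written: keep nodes whose value is smaller than the threshold."""
    return {node for node, p in probabilities.items() if p < prob_threshold}


def determine_abnormal(first_result, second_result):
    """Steps S410-S412: only nodes present in both screening results are abnormal."""
    return first_result & second_result


# Hypothetical request error rates (as fractions) and model outputs per node.
params = {"node-a": 0.02, "node-b": 0.55, "node-c": 0.40}
probabilities = {"node-a": 0.9, "node-b": 0.1, "node-c": 0.8}

first = first_screening(params, param_threshold=0.3)
second = second_screening(probabilities, prob_threshold=0.5)
abnormal = determine_abnormal(first, second)

if not abnormal:
    print("no abnormal nodes; every node keeps receiving requests")
else:
    print("isolate:", sorted(abnormal))
```

In this example node-c passes the first screening but not the second, so it stays in service; only node-b is isolated.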
Step S414, setting the service node with abnormal operation to be in an isolated state.
A service node with abnormal operation is usually isolated by an isolator, that is, it is set to the isolated state. While a service node is in the isolated state, it can generally continue to process the service requests it is already running, but no new service requests are assigned to it, so that service callers do not keep sending service requests to that service node.
Step S416, timing how long the service node with abnormal operation has been in the isolated state.
Step S418, if the duration reaches the preset duration, sending a recovery request to the service node with abnormal operation.
The preset duration can be set according to user requirements, for example 10 minutes or 20 minutes. The recovery request is usually a probe request, which checks whether the service node with abnormal operation has returned to a normal operating state. While the recovery request is being sent to the service node with abnormal operation, the service node is usually in a half-open (semi-open) state.
Step S420, judging whether a signal of successful recovery returned by the abnormally operated service node is received, if so, executing step S422; otherwise, step S416 is performed.
Step S422, the service node with abnormal operation is set to be in a normal operation state.
When the service node with abnormal operation receives the recovery request, it returns a signal corresponding to its own operating state. If it returns a recovery-success signal, the service node is determined to have recovered its normal operating state. If it returns a recovery-failure signal, the service node is determined to be still operating abnormally and must remain in the isolated state; the time the service node spends in the isolated state continues to be counted until the recovery request is sent again.
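The isolation and recovery cycle of steps S414-S422 resembles a small state machine, sketched below. The class, the state names, and the `probe` callback (standing in for sending the recovery request and reading the returned signal) are hypothetical; time is passed in explicitly so the sketch does not depend on a real clock.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    NORMAL = "normal"
    ISOLATED = "isolated"
    HALF_OPEN = "half_open"


@dataclass
class MonitoredNode:
    name: str
    state: NodeState = NodeState.NORMAL
    isolated_at: float = 0.0

    def isolate(self, now):
        """Step S414: enter the isolated state and start timing (step S416)."""
        self.state = NodeState.ISOLATED
        self.isolated_at = now

    def tick(self, now, preset_duration, probe):
        """Steps S416-S422; probe(name) returns True for a recovery-success signal."""
        if self.state is not NodeState.ISOLATED:
            return self.state
        if now - self.isolated_at < preset_duration:
            return self.state                  # still waiting for the preset duration
        self.state = NodeState.HALF_OPEN       # step S418: recovery request in flight
        if probe(self.name):
            self.state = NodeState.NORMAL      # step S422: recovery succeeded
        else:
            self.isolate(now)                  # recovery failed: reset the timer, back to S416
        return self.state


node = MonitoredNode("node-b")
node.isolate(now=0.0)
early = node.tick(now=300.0, preset_duration=600.0, probe=lambda n: True)
failed = node.tick(now=600.0, preset_duration=600.0, probe=lambda n: False)
recovered = node.tick(now=1200.0, preset_duration=600.0, probe=lambda n: True)
```

Resetting the timer after a failed probe means the node waits a full preset duration before the next recovery request, rather than being probed continuously.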
In a specific implementation, an extreme condition such as a large-area network fault may cause a large number of service nodes to appear abnormal. If all of these service nodes are isolated, the operating pressure on the remaining normal service nodes inevitably increases; the normal service nodes eventually become unavailable under excessive load, causing a service cluster avalanche, that is, all the service nodes become unavailable. The embodiment of the present invention can avoid this situation: when a large number of service nodes appear abnormal (that is, a large number of service nodes are placed in the first screening result), the service nodes are screened a second time to determine the final abnormal service nodes. This avoids isolating service nodes because of misjudgment, increases to a certain extent the number of service nodes operating normally, and improves the stability of service cluster operation.
The method for determining the abnormal node in the service cluster can accurately identify the abnormal service node, and avoids the phenomenon that the normal service node is judged to be abnormal due to the abnormal operation of the service calling party and is isolated, so that the normal service node can normally receive the service request, and the operation stability of the service cluster can be ensured.
Corresponding to the embodiment of the method for determining an abnormal node in a service cluster, an embodiment of the present invention further provides a device for determining an abnormal node in a service cluster, as shown in fig. 5, where the device includes:
a parameter obtaining module 50, configured to obtain an operation parameter of each service node in a preset service cluster; the service cluster comprises a plurality of service nodes;
a first screening module 51, configured to perform a first screening process on the service nodes in the service cluster according to the operation parameter and the parameter threshold corresponding to the operation parameter, so as to obtain a first screening result;
a probability determining module 52, configured to determine, according to the operation parameters, a probability that each service node operates abnormally;
a second screening module 53, configured to perform a second screening process on the service nodes in the service cluster according to the probability that each service node operates abnormally and a preset probability threshold, so as to obtain a second screening result;
an abnormal node determining module 54, configured to determine a service node in the service cluster that operates abnormally, according to the first screening result and the second screening result.
The device for determining an abnormal node in a service cluster first obtains the operating parameters of each service node in the preset service cluster. It then performs a first screening process on the service nodes in the service cluster according to the operating parameter and the parameter threshold corresponding to the operating parameter, obtaining a first screening result, and determines the probability of abnormal operation of each service node according to the operating parameters. Next, it performs a second screening process on the service nodes in the service cluster according to the probability of abnormal operation of each service node and a preset probability threshold, obtaining a second screening result. Finally, it determines the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result. In this approach, the service nodes in the service cluster are screened twice, and the service nodes that appear in both screening results are determined to be abnormal nodes. This improves the accuracy of abnormality judgment for service nodes, preserves to a certain extent the number of normal service nodes in the service cluster, and improves the operation stability of the service cluster.
Further, the first screening module 51 is configured to: for each service node in the service cluster, if the operation parameter of the current service node meets the parameter threshold corresponding to the operation parameter, adding the current service node into the first screening result; wherein the operating parameters include: a request error rate and/or a request timeout rate.
Further, the probability determination module 52 is configured to: for each service node in the service cluster, input the operation parameters of the current service node into a preset probability distribution model to obtain the probability of abnormal operation of the current service node.
Further, the probability determination module 52 is further configured to: calculate the mean of the operating parameters of the service nodes in the service cluster,

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$

calculate the variance of the operating parameters of the service nodes in the service cluster,

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu\right)^2$$

where $m$ represents the total number of service nodes in the service cluster and $x_i$ represents the operating parameter of the $i$-th service node; and input the operating parameter of the current service node, the mean, and the variance into a preset Gaussian model to obtain the probability of abnormal operation of the current service node.
Specifically, the current service node includes a plurality of operation parameters; the probability determination module 52 is configured to: for each operation parameter, inputting the current operation parameter into a preset probability distribution model to obtain the probability corresponding to the current operation parameter; and multiplying the probabilities corresponding to each operation parameter to obtain the probability of the abnormal operation of the current service node.
Further, the second screening module 53 is configured to: for each service node in the service cluster, if the probability of abnormal operation of the current service node is smaller than a preset probability threshold, add the current service node to the second screening result.
Further, the abnormal node determining module 54 is configured to: calculating the intersection of the first screening result and the second screening result; and determining the service nodes in the intersection as the service nodes with abnormal operation in the service cluster.
Further, after the abnormal node determining module 54, the apparatus further includes an isolating module, configured to: setting the service node with abnormal operation as an isolation state, and counting the time length of the service node with abnormal operation in the isolation state; if the duration reaches the preset duration, sending a recovery request to the service node with abnormal operation; if a signal of successful recovery returned by the service node with abnormal operation is received, setting the service node with abnormal operation as a normal operation state; and if a signal of failure recovery returned by the service node with abnormal operation is received, resetting the time length, and continuously performing the step of counting the time length of the service node with abnormal operation in the isolation state.
The implementation principle and technical effects of the device for determining an abnormal node in a service cluster provided by the embodiment of the present invention are the same as those of the foregoing method embodiment. For brevity, where the device embodiment does not mention a detail, reference may be made to the corresponding content of the foregoing method embodiment.
Referring to fig. 6, the monitoring server includes a processor 101 and a memory 100, where the memory 100 stores machine executable instructions capable of being executed by the processor 101, and the processor 101 executes the machine executable instructions to implement the method for determining the abnormal node in the service cluster.
Further, the monitoring server shown in fig. 6 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103 and the memory 100 are connected through the bus 102.
The memory 100 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in FIG. 6, but this does not mean that there is only one bus or one type of bus.
The processor 101 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 100; the processor 101 reads the information in the memory 100 and completes the steps of the method of the foregoing embodiment in combination with its hardware.
The embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the method for determining an abnormal node in the service cluster.
The computer program product of the method and device for determining an abnormal node in a service cluster and of the monitoring server provided in the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code may be used to execute the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and/or the electronic device described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed by the present invention, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for determining abnormal nodes in a service cluster, the method comprising:
acquiring operation parameters of each service node in a preset service cluster; the service cluster comprises a plurality of service nodes;
performing first screening processing on the service nodes in the service cluster according to the operation parameters and parameter thresholds corresponding to the operation parameters to obtain first screening results;
determining the probability of abnormal operation of each service node according to the operation parameters;
performing second screening processing on the service nodes in the service cluster according to the probability of the abnormal operation of each service node and a preset probability threshold value to obtain a second screening result;
and determining the service nodes with abnormal operation in the service cluster according to the first screening result and the second screening result.
2. The method according to claim 1, wherein the step of performing first screening processing on the service nodes in the service cluster according to the operation parameters and parameter thresholds corresponding to the operation parameters to obtain a first screening result comprises:
for each service node in the service cluster, if the operation parameter of the current service node meets the parameter threshold corresponding to the operation parameter, adding the current service node to a first screening result; wherein the operating parameters include: a request error rate and/or a request timeout rate.
3. The method of claim 1, wherein the step of determining the probability of abnormal operation of each service node according to the operation parameters comprises: for each service node in the service cluster, inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of abnormal operation of the current service node.
4. The method according to claim 3, wherein the step of inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of the abnormal operation of the current service node comprises:
calculating the mean value of the operating parameters of each service node in the service cluster: $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$;
calculating the variance of the operating parameters of each service node in the service cluster: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu\right)^2$;
where $m$ represents the total number of service nodes in the service cluster, and $x_i$ represents the operation parameter corresponding to the $i$-th service node;
and inputting the operation parameters, the mean value and the variance of the current service node into a preset Gaussian model to obtain the probability of abnormal operation of the current service node.
5. The method of claim 3, wherein the current service node comprises a plurality of operating parameters;
the step of inputting the operation parameters of the current service node into a preset probability distribution model to obtain the probability of the abnormal operation of the current service node comprises the following steps:
for each operation parameter, inputting the current operation parameter into a preset probability distribution model to obtain the probability corresponding to the current operation parameter;
and multiplying the probabilities corresponding to each operation parameter to obtain the probability of the abnormal operation of the current service node.
6. The method according to claim 1, wherein the step of performing second screening processing on the service nodes in the service cluster according to the probability of the abnormal operation of each service node and a preset probability threshold to obtain a second screening result comprises:
and for each service node in the service cluster, if the probability of the abnormal operation of the current service node is smaller than a preset probability threshold value, adding the current service node into a second screening result.
7. The method according to claim 1, wherein the step of determining the service node with abnormal operation in the service cluster according to the first screening result and the second screening result comprises:
calculating the intersection of the first screening result and the second screening result;
and determining the service nodes in the intersection as the service nodes with abnormal operation in the service cluster.
8. The method of claim 1, wherein after the step of determining a service node in the service cluster that is operating abnormally, the method further comprises:
setting the service node with abnormal operation to be in an isolation state, and counting the time length of the service node with abnormal operation in the isolation state;
if the time length reaches the preset time length, sending a recovery request to the service node with abnormal operation;
if a signal of successful recovery returned by the service node with abnormal operation is received, setting the service node with abnormal operation as a normal operation state;
if a signal that the recovery of the service node with the abnormal operation fails is received, resetting the duration, and continuing to perform the step of counting the duration of the service node with the abnormal operation in the isolation state.
9. An apparatus for determining abnormal nodes in a service cluster, the apparatus comprising:
the parameter acquisition module is used for acquiring the operation parameters of each service node in the preset service cluster; the service cluster comprises a plurality of service nodes;
the first screening module is used for carrying out first screening processing on the service nodes in the service cluster according to the operation parameters and the parameter threshold values corresponding to the operation parameters to obtain first screening results;
the probability determining module is used for determining the probability of abnormal operation of each service node according to the operation parameters;
the second screening module is used for performing second screening processing on the service nodes in the service cluster according to the probability of the abnormal operation of each service node and a preset probability threshold value to obtain a second screening result;
and the abnormal node determining module is used for determining the service node which runs abnormally in the service cluster according to the first screening result and the second screening result.
10. A monitoring server comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of determining an anomalous node in a service cluster of any one of claims 1 to 8.
11. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to carry out a method of determining an anomalous node in a service cluster according to any one of claims 1 to 8.
CN201911117768.6A 2019-11-14 2019-11-14 Method and device for determining abnormal node in service cluster and monitoring server Pending CN110837432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911117768.6A CN110837432A (en) 2019-11-14 2019-11-14 Method and device for determining abnormal node in service cluster and monitoring server

Publications (1)

Publication Number Publication Date
CN110837432A true CN110837432A (en) 2020-02-25

Family

ID=69575037

Country Status (1)

Country Link
CN (1) CN110837432A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1944699A1 (en) * 2005-10-31 2008-07-16 Fujitsu Ltd. Performance failure analysis device, method, program, and performance failure analysis device analysis result display method
CN104426696A (en) * 2013-08-29 2015-03-18 深圳市腾讯计算机系统有限公司 Fault processing method and device
US20170097863A1 (en) * 2015-10-05 2017-04-06 Fujitsu Limited Detection method and information processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ling (王玲): "Data Mining Learning Methods" (《数据挖掘学习方法》), Beijing: Metallurgical Industry Press (冶金工业出版社), pages: 98 - 101 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111510351A (en) * 2020-04-10 2020-08-07 星辰天合(北京)数据科技有限公司 Anomaly detection method and device based on Prometheus monitoring system
CN112052053A (en) * 2020-10-10 2020-12-08 国科晋云技术有限公司 Method and system for cleaning mining program in high-performance computing cluster
CN112052053B (en) * 2020-10-10 2023-12-19 国科晋云技术有限公司 Method and system for cleaning ore mining program in high-performance computing cluster
CN113055246A (en) * 2021-03-11 2021-06-29 中国工商银行股份有限公司 Abnormal service node identification method, device, equipment and storage medium
CN114697322A (en) * 2022-02-17 2022-07-01 许强 Data screening method based on cloud service processing
CN114697322B (en) * 2022-02-17 2024-03-22 上海生慧樘科技有限公司 Data screening method based on cloud service processing
CN115801203A (en) * 2023-01-19 2023-03-14 苏州浪潮智能科技有限公司 Distributed cluster reliability management method, device and equipment

Similar Documents

Publication Publication Date Title
CN110837432A (en) Method and device for determining abnormal node in service cluster and monitoring server
CN107644194B (en) System and method for providing monitoring data
US20150046757A1 (en) Performance Metrics of a Computer System
CN106330588B (en) BFD detection method and device
US20150113337A1 (en) Failure symptom report device and method for detecting failure symptom
CN111782462A (en) Alarm method and device and electronic equipment
CN110674149B (en) Service data processing method and device, computer equipment and storage medium
CN114168071B (en) Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium
CN112152833B (en) Network abnormity alarm method and device and electronic equipment
CN107222497B (en) Network flow abnormity monitoring method and electronic equipment
US20150281008A1 (en) Automatic derivation of system performance metric thresholds
CN112073329A (en) Distributed current limiting method and device, electronic equipment and storage medium
CN116820828A (en) Method and device for setting correctable error threshold, electronic equipment and storage medium
US11777785B2 (en) Alert throttling
CN108964992B (en) Node fault detection method and device and computer readable storage medium
CN110995522A (en) Information processing method and device
CN114610560B (en) System abnormality monitoring method, device and storage medium
CN114816915A (en) Link tracking method and device
US9054995B2 (en) Method of detecting measurements in service level agreement based systems
CN114116128A (en) Method, device, equipment and storage medium for fault diagnosis of container instance
US20170153924A1 (en) Method for request scheduling and scheduling device
CN112822166A (en) Abnormal process detection method, device, equipment and medium
CN109104299B (en) Method and device for reducing cluster oscillation
EP3457609A1 (en) System and method for computing of anomalies based on frequency driven transformation and computing of new features based on point anomaly density
CN112152834B (en) Network abnormity alarm method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination