CN113055246B - Abnormal service node identification method, device, equipment and storage medium - Google Patents

Abnormal service node identification method, device, equipment and storage medium

Info

Publication number
CN113055246B
Authority
CN
China
Prior art keywords
service node
abnormal
node
intermediate data
suspected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110264381.4A
Other languages
Chinese (zh)
Other versions
CN113055246A (en)
Inventor
吴旭东
徐翥
徐砚劼
张易知
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110264381.4A
Publication of CN113055246A
Application granted
Publication of CN113055246B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L43/106 Active monitoring, e.g. heartbeat, ping or trace-route using time related information in packets, e.g. by adding timestamps
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring

Abstract

An embodiment of the specification provides a method, a device, equipment and a storage medium for identifying an abnormal service node, wherein the method comprises the following steps: receiving node operation data in real time; the node operation data comprises various operation parameters of the service node; judging whether the service node is suspected to be abnormal or not according to the various operation parameters; determining the suspected abnormal rate of the service node in a set judgment period; and when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value, identifying the service node as an abnormal service node. The embodiment of the specification can improve the identification accuracy of the abnormal node.

Description

Abnormal service node identification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying an abnormal service node.
Background
As business continues to grow, the number of service nodes in a service cluster increases geometrically in order to support large-scale business volume. Most users of a service transact normally, but some users experience slow transaction responses or outright transaction failures, which may be caused by an abnormality in one or several service nodes in the service cluster. Abnormal service nodes therefore need to be isolated promptly, quickly and automatically to eliminate their influence.
Therefore, how to accurately discover the abnormal condition of a single service node in a service cluster is a technical problem which needs to be solved at present.
Disclosure of Invention
An embodiment of the present specification aims to provide an abnormal service node identification method, apparatus, device and storage medium, so as to improve the accuracy of identification of an abnormal node.
In order to achieve the above object, in one aspect, an embodiment of the present specification provides an abnormal service node identification method, including:
receiving node operation data in real time; the node operation data comprises various operation parameters of the service node;
judging whether the service node is suspected to be abnormal or not according to the various operation parameters;
determining the suspected abnormal rate of the service node in a set judgment period;
and when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value, identifying the service node as an abnormal service node.
In an embodiment of this specification, the determining whether the service node is suspected to be abnormal according to the operation parameter includes:
comparing various operation parameters of the service node with corresponding operation parameter thresholds respectively to correspondingly obtain identification result sub-values of the operation parameters;
carrying out weighted summation on the recognition result sub-values of the operation parameters according to a preset weighted summation formula to obtain a weighted sum;
comparing the weighted sum to a first threshold;
when the weighted sum is greater than the first threshold, identifying the service node as suspected abnormal.
In the embodiment of the present specification, in the weighted summation formula, the weight parameter of each operating parameter is periodically updated by:
determining the times that each operation parameter exceeds a corresponding operation parameter threshold value in a specified historical period for all service nodes in the service cluster;
determining new weight parameters of the operation parameters according to the times;
and updating the weight parameters in the weighted summation formula by using the new weight parameters.
In an embodiment of the present specification, the node operation data further includes a sampling timestamp, an application identifier, and a service node IP; the judgment result of the judgment is represented by a corresponding status flag bit;
after the determining whether the service node is suspected to be abnormal according to the plurality of operation parameters, the method further includes:
intercepting a specified part in the sampling time stamp to be used as a period time stamp;
splicing the periodic timestamp, the application identifier, the service node IP and the state flag bit into intermediate data;
and writing the intermediate data into an intermediate data table.
In an embodiment of this specification, the writing the intermediate data into an intermediate data table includes:
judging whether old intermediate data with the same period timestamp exist in the intermediate data table;
and when old intermediate data with the same period time stamp exists in the intermediate data table, replacing the old intermediate data with the intermediate data.
In an embodiment of the present specification, the determining a suspected abnormality rate of the service node in a set determination period includes:
extracting all intermediate data from the intermediate data table;
determining the number of intermediate data with suspected abnormal marks in all the intermediate data;
and dividing the number of the intermediate data with the suspected abnormal marks by the number of all the intermediate data to obtain the suspected abnormal rate of the service node in the judgment period.
In an embodiment of the present specification, the method enables multiple threads to process in parallel; wherein each thread is assigned node operational data according to the following:
fetching one piece of node operation data from a database of the node operation data;
adding all numerical values of the IP field of the service node in the node operation data to obtain a summary value;
performing a remainder calculation on the summary value and the number of started threads to obtain a remainder value;
and allocating the node operation data to the thread whose thread number is the same as the remainder value.
On the other hand, an embodiment of the present specification further provides an abnormal service node identification apparatus, including:
the receiving module is used for receiving the node operation data in real time; the node operation data comprises various operation parameters of the service node;
the judging module is used for judging whether the service node is suspected to be abnormal or not according to the various operation parameters;
the determining module is used for determining the suspected abnormal rate of the service node in a set judging period;
and the identification module is used for identifying the service node as an abnormal service node when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value.
In another aspect, the embodiments of the present specification further provide a computer device, which includes a memory, a processor, and a computer program stored in the memory, and when the computer program is executed by the processor, the computer program executes the instructions of the method.
In another aspect, the present specification further provides a computer storage medium, on which a computer program is stored, and the computer program is executed by a processor of a computer device to execute the instructions of the method.
As can be seen from the technical solutions provided in the embodiments of the present specification, the node operation data received in real time includes multiple operation parameters of the service node, so that whether the service node is suspected to be abnormal is determined according to the multiple operation parameters, thereby avoiding erroneous determination that may be caused by considering only a single operation parameter. Moreover, in the embodiments of the present description, it is determined whether the service node is an abnormal node by counting the suspected abnormal rate of the service node in the set determination period, so that the accidental jitter of the service node at a certain time is prevented from being identified as an abnormality, and the accuracy of identifying the abnormality of the service node is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments described in the present specification, and for those skilled in the art, other drawings may be obtained according to these drawings without creative efforts. In the drawings:
FIG. 1 illustrates a flow diagram of abnormal service node identification in some embodiments of the present description;
FIG. 2 is a schematic diagram illustrating multithreaded identification of abnormal service nodes in one embodiment of the present description;
FIG. 3 is a diagram illustrating a data structure of operational data in one embodiment of the present description;
FIG. 4 is a diagram illustrating a data structure of intermediate data in an embodiment of the present specification;
FIG. 5 is a block diagram of an abnormal service node identification apparatus in some embodiments of the present description;
FIG. 6 shows a block diagram of a computer device in accordance with some embodiments of the present disclosure.
[ description of reference ]
51. A receiving module;
52. a judgment module;
53. a determination module;
54. an identification module;
602. a computer device;
604. a processor;
606. a memory;
608. a drive mechanism;
610. an input/output module;
612. an input device;
614. an output device;
616. a presentation device;
618. a graphical user interface;
620. a network interface;
622. a communication link;
624. a communication bus.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
The service cluster in this specification is a server cluster, that is, a group of servers brought together to perform the same service, which appears to a client as a single server. The service cluster can use a plurality of servers to perform parallel computation so as to obtain high computation speed, thereby supporting large-scale traffic. For convenience of description, the servers in the service cluster are hereinafter referred to as service nodes. The service cluster can also use a plurality of service nodes for backup, so that the whole system can still operate normally when any one service node is abnormal. However, if the abnormal node is not isolated, many service processing failures will result. Therefore, abnormal nodes in a service cluster need to be identified and isolated automatically to reduce their adverse effects on the service cluster. An abnormality in the present specification generally refers to a problem such as node failure or downtime; failure and downtime are generally continuous rather than sporadic, instantaneous jitter.
To automatically isolate an abnormal service node in real time, the abnormal service node must first be accurately located. Conventionally, anomaly diagnosis at the service node level collects the operation data of the service node at a certain moment and determines an anomaly as soon as the operation data exceeds a threshold value. However, such threshold monitoring is overly sensitive: if a service node experiences occasional jitter, the automatic isolation operation is triggered, which increases the pressure on the rest of the service cluster and may cause a greater systemic risk. Therefore, how to discover the abnormal condition of a service node in a service cluster in a timely and accurate manner is a technical problem that currently needs to be solved.
In view of the above, in order to solve the above problems, embodiments of the present specification provide a new abnormal service node identification method, which can be applied to any suitable computer device. Referring to fig. 1, in some embodiments of the present specification, the abnormal service node identification method may include the following steps:
S101, receiving node operation data in real time; the node operation data comprises various operation parameters of the service node.
S102, judging whether the service node is suspected to be abnormal or not according to the multiple operation parameters.
S103, determining the suspected abnormal rate of the service node in a set judgment period.
S104, when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value, identifying the service node as an abnormal service node.
In the embodiment of the present specification, the node operation data received in real time includes multiple operation parameters of the service node, so that whether the service node is suspected to be abnormal is determined according to the multiple operation parameters, thereby avoiding misjudgment possibly caused by considering only a single operation parameter. Moreover, in the embodiments of the present specification, a suspected abnormal rate of the service node in a set determination period is counted to determine whether the service node is an abnormal node, so that identification of accidental jitter of the service node at a certain time as an abnormality is avoided, and accuracy of identifying the abnormality of the service node is further improved.
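The benefit of deciding on a rate over a judgment period, rather than on a single sample, can be illustrated with a minimal sketch. The function name `is_abnormal`, the 0/1 flag encoding of the per-sample result, and the rate threshold of 0.5 are assumptions made for illustration only; the specification does not fix a value for the set threshold.

```python
from typing import Sequence


def is_abnormal(suspected_flags: Sequence[int], rate_threshold: float) -> bool:
    """Return True when the share of suspected samples exceeds the threshold.

    suspected_flags holds one 0/1 value per sample in the judgment period
    (1 = suspected abnormal), i.e. the outcome of step S102 for each sample.
    """
    if not suspected_flags:
        raise ValueError("at least one sample is needed in the judgment period")
    suspected_rate = sum(suspected_flags) / len(suspected_flags)  # step S103
    return suspected_rate > rate_threshold                        # step S104


RATE_THRESHOLD = 0.5  # hypothetical value, not taken from the specification

# One isolated spike in 30 samples (occasional jitter).
jitter = [0] * 29 + [1]
# A sustained problem: 25 of 30 samples are suspected abnormal.
sustained = [0] * 5 + [1] * 25

jitter_result = is_abnormal(jitter, RATE_THRESHOLD)
sustained_result = is_abnormal(sustained, RATE_THRESHOLD)
```

Under these assumptions the isolated spike is ignored, while the sustained problem is flagged as an abnormal service node.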
In order to improve the timeliness of identifying abnormal nodes, the node operation data can be received in real time and processed in time. In an embodiment of the present specification, the node operation data may be node operation data with a node as a granularity, which is summarized by an upstream node at a specified time interval. The node operation data may include a sampling timestamp, a service type, a node identifier, an operation parameter, and the like of the node operation data. In addition, in order to avoid erroneous judgment possibly caused by considering only a single operation parameter, the node operation data of the embodiments of the present specification may include a plurality of operation parameters (which operation parameters may be specifically specified as needed).
For example, in the embodiment shown in fig. 3, the node operational data may include: a timestamp (i.e., a sampling timestamp of the operation parameter), an application identifier (i.e., a traffic type identifier), a server IP (i.e., a service node IP, for identifying a service node), and five operation parameters (transaction amount, system success rate, transaction response time, CPU usage rate, and memory usage rate). For example, in an exemplary embodiment, a piece of node operation data may be represented as:
data=[20210101001000∣F-ABCD∣192.168.1.1∣731∣0.9644∣123∣90∣80]
wherein 20210101001000 is a timestamp; F-ABCD is an application identifier; 192.168.1.1 is the IP address of a certain service node under the application of the F-ABCD; 731 is the transaction amount of the service node; 0.9644 is the system success rate of the service node; 123 (milliseconds) is the transaction response time of the serving node; 90 (percent) is the CPU usage of the service node; 80 (percent) is the memory usage of the service node.
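As a minimal sketch, a record of this shape could be parsed into typed fields as follows. The class name `NodeRecord`, its field names, and the use of an ASCII vertical bar as the delimiter are assumptions for illustration; the specification only describes the order and meaning of the fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    timestamp: str        # sampling timestamp, e.g. "20210101001000"
    app_id: str           # application identifier
    node_ip: str          # service node IP
    tx_volume: int        # transaction amount
    success_rate: float   # system success rate
    response_ms: float    # transaction response time in milliseconds
    cpu_pct: float        # CPU usage in percent
    mem_pct: float        # memory usage in percent


def parse_record(raw: str, sep: str = "|") -> NodeRecord:
    """Split one bracketed, delimiter-separated record into typed fields."""
    fields = [f.strip() for f in raw.strip().strip("[]").split(sep)]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ts, app, ip, vol, ok, rt, cpu, mem = fields
    return NodeRecord(ts, app, ip, int(vol), float(ok),
                      float(rt), float(cpu), float(mem))


record = parse_record("[20210101001000|F-ABCD|192.168.1.1|731|0.9644|123|90|80]")
```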
Obviously, these operational parameters are for a single service node of the application, not for the service cluster as a whole. Furthermore, those skilled in the art will appreciate that the above several operating parameters are merely exemplary and are not intended to be limiting in this specification. In actual implementation, addition, deletion, modification and the like can be performed as required.
In view of the fact that the number of service nodes in a service cluster is usually large, it is generally difficult to efficiently complete the work of identifying abnormal service nodes in time by enabling a single thread. Thus, in embodiments of the present description, multiple threads may be enabled for parallel processing (e.g., as shown in FIG. 2). The number of enabled threads can be selected according to actual needs. Referring to fig. 2, in a multi-thread scenario, node operation data sent by an upstream node may be cached in an operation database, and then may be distributed to each thread for processing (each job in job0 to job9 in fig. 2 may be a node operation data). Wherein each thread may be assigned node operational data by:
1) Fetching one piece of node operation data from the database of the node operation data.
2) Adding the numerical values of the IP field of the service node in the node operation data to obtain a summary value.
3) Performing a remainder calculation on the summary value and the number of started threads to obtain a remainder value.
4) Allocating the node operation data to the thread whose thread number is the same as the remainder value.
For example, in the embodiment shown in fig. 2, 10 threads are started to determine node abnormality. For the node operation data with the service node IP 192.168.1.1, the four segment values are added: 192 + 168 + 1 + 1 = 362. According to the rule (362 mod 10 = 2), the node operation data is handed to thread No. 2 for processing. By analogy, if the calculated remainder is 5, the data is handed to thread No. 5 for processing.
For a large amount of node operation data, this data sharding rule keeps the load across the threads as balanced as possible, thereby improving the efficiency of data processing.
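The allocation rule can be sketched in a few lines. The function name `assign_thread` is hypothetical, and the sketch assumes a dotted IPv4 address and threads numbered from 0.

```python
def assign_thread(node_ip: str, num_threads: int) -> int:
    """Map a service node IP to a thread number by summing its segments."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    summary_value = sum(int(segment) for segment in node_ip.split("."))
    return summary_value % num_threads


# 192 + 168 + 1 + 1 = 362, and 362 mod 10 = 2, as in the example above.
thread_no = assign_thread("192.168.1.1", 10)
```

Because the thread number depends only on the IP, all samples from the same service node are handled by the same thread.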
For each thread, the determining whether the service node has a suspected exception according to the operating parameter may include the following steps:
1) Comparing the various operation parameters of the service node with the corresponding operation parameter thresholds respectively to obtain the identification result sub-value of each operation parameter.
For example, taking the above five operation parameters (transaction amount, system success rate, transaction response time, CPU utilization rate, and memory utilization rate) as an example, a threshold may be set for each operation parameter: 500, 0.99, 100 (milliseconds), 80 (percent), and 80 (percent), respectively. If one piece of node operation data is:
[20210101001000∣F-ABCD∣192.168.1.1∣731∣0.9644∣123∣90∣80]
then it can be seen that: the transaction amount is above its lower threshold and is not triggered (731 > 500); the system success rate is below its lower threshold and is triggered (0.9644 < 0.99); the transaction response time is above its upper threshold and is triggered (123 > 100); the CPU usage rate is above its upper threshold and is triggered (90 > 80); and the memory usage rate reaches its upper threshold and is triggered (80 >= 80). The identification result sub-value of each operation parameter may be represented by 0 or 1: when triggered, the sub-value is 1, and when not triggered, the sub-value is 0.
2) Carrying out weighted summation on the identification result sub-values of all the operation parameters according to a preset weighted summation formula to obtain a weighted sum. The weight corresponding to each operation parameter is configurable.
For example, if the weight of each of the above five operation parameters is 0.2, then, according to the result of the above step, the weighted sum can be calculated as: sum = 0.2×0 + 0.2×1 + 0.2×1 + 0.2×1 + 0.2×1 = 0.8.
3) The weighted sum is compared to a first threshold.
The first threshold may be set according to actual needs, for example, in an exemplary embodiment, the first threshold may be set to 0.6. The calculation result (0.8) of the above step is compared with 0.6.
4) When the weighted sum is greater than the first threshold, identifying the service node as suspected abnormal.
Continuing the example in the above steps, since 0.8 > 0.6, the service node with the IP address 192.168.1.1 under the application F-ABCD can be identified as suspected abnormal.
It should be noted that, according to the abnormal service node identification method in this specification, a service node is not directly considered abnormal merely because the weighted sum is greater than the first threshold. To distinguish this intermediate result from the abnormality finally identified in the subsequent steps, it is referred to herein as a suspected abnormality.
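The four steps can be sketched as follows, using the threshold, weight and first-threshold values from the example. The field names, the `"min"`/`"max"` direction labels, and the choice to treat a value equal to an upper threshold as triggered (following the memory usage case in the example) are assumptions for illustration.

```python
# (threshold, kind): "min" triggers when the value falls below the threshold,
# "max" triggers when the value reaches or exceeds it (assumed comparison).
THRESHOLDS = {
    "tx_volume": (500, "min"),
    "success_rate": (0.99, "min"),
    "response_ms": (100, "max"),
    "cpu_pct": (80, "max"),
    "mem_pct": (80, "max"),
}
WEIGHTS = {name: 0.2 for name in THRESHOLDS}  # equal weights, as in the example
FIRST_THRESHOLD = 0.6


def sub_value(value: float, limit: float, kind: str) -> int:
    """Return 1 when the parameter is triggered, otherwise 0."""
    if kind == "min":
        return 1 if value < limit else 0
    return 1 if value >= limit else 0


def weighted_sum(params: dict) -> float:
    total = sum(WEIGHTS[name] * sub_value(params[name], *THRESHOLDS[name])
                for name in THRESHOLDS)
    # Rounding guards against floating-point noise near the first threshold.
    return round(total, 6)


def is_suspected(params: dict) -> bool:
    return weighted_sum(params) > FIRST_THRESHOLD


sample = {"tx_volume": 731, "success_rate": 0.9644,
          "response_ms": 123, "cpu_pct": 90, "mem_pct": 80}
score = weighted_sum(sample)
suspected = is_suspected(sample)
```

With the sample record, four of the five parameters are triggered, so the score is about 0.8 and the node is marked as suspected abnormal.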
As the operation of the cluster system develops and changes, the degree to which an abnormality of a service node affects each operation parameter may change. Therefore, to improve the accuracy of abnormality identification, the weights of the operation parameters in the weighted summation formula may be adjusted periodically. For example, in some embodiments of the present disclosure, the weight parameter of each operation parameter in the weighted summation formula is periodically updated as follows:
1) Determining, for all the service nodes in the service cluster, the number of times each operation parameter exceeds the corresponding operation parameter threshold within a specified historical period. The specified historical period may be a specified recent period, such as the last 30 days, the last two weeks, or the last three months. In specific implementation, the specified historical period may be set according to actual needs, and this specification does not limit this.
2) Determining new weight parameters of the operation parameters according to the number of times.
For example, taking a historical period of 30 days, the number of times each operation parameter crosses its threshold in the data of the last 30 days can be counted, and the weight of each operation parameter can be redetermined from these counts. For example, for the application F-ABCD, suppose the weights of the five operation parameters are all 0.2 (i.e., 20%), and that within 30 days, across all nodes, the transaction response time exceeded its threshold 600 times, the transaction success rate was below its threshold 160 times, the transaction amount was below its threshold 40 times, the CPU usage exceeded its threshold 500 times, and the memory usage exceeded its threshold 200 times. From this, the new weight parameter of each operation parameter can be calculated as follows:
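Weight of response time:
$$w_{\text{response time}} = \frac{600}{600 + 160 + 40 + 500 + 200} = \frac{600}{1500} = 40\%$$
That is, the weight of the response time is increased to 40%;
Weight of transaction success rate:
$$w_{\text{success rate}} = \frac{160}{600 + 160 + 40 + 500 + 200} = \frac{160}{1500} \approx 11\%$$
That is, the weight of the transaction success rate is reduced to 11%;
Weight of transaction amount:
$$w_{\text{transaction amount}} = \frac{40}{600 + 160 + 40 + 500 + 200} = \frac{40}{1500} \approx 3\%$$
That is, the weight of the transaction amount is reduced to 3%;
Weight of CPU usage:
$$w_{\text{CPU usage}} = \frac{500}{600 + 160 + 40 + 500 + 200} = \frac{500}{1500} \approx 33\%$$
That is, the weight of the CPU usage is increased to 33%;
Weight of memory usage:
$$w_{\text{memory usage}} = \frac{200}{600 + 160 + 40 + 500 + 200} = \frac{200}{1500} \approx 13\%$$
That is, the weight of the memory usage is reduced to 13%.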
3) Updating the weight parameters in the weighted summation formula by using the new weight parameters. That is, the newly calculated weights replace the corresponding original weights in the weighted summation formula.
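The weight update can be sketched as a normalization of the per-parameter trigger counts, which reproduces the percentages in the 30-day example. The function name `recompute_weights`, the parameter keys, and the decision to raise an error when no threshold was crossed in the historical period are assumptions; the specification does not say how that case is handled.

```python
def recompute_weights(trigger_counts: dict[str, int]) -> dict[str, float]:
    """Turn per-parameter trigger counts into weights that sum to 1."""
    total = sum(trigger_counts.values())
    if total == 0:
        # Assumed behavior: leave the decision to the caller, who may
        # simply keep the previous weights.
        raise ValueError("no threshold was crossed in the historical period")
    return {name: count / total for name, count in trigger_counts.items()}


# Counts from the 30-day example for the application F-ABCD.
counts = {
    "response_ms": 600,
    "success_rate": 160,
    "tx_volume": 40,
    "cpu_pct": 500,
    "mem_pct": 200,
}
new_weights = recompute_weights(counts)
percentages = {name: round(w * 100) for name, w in new_weights.items()}
```

The resulting dictionary would then replace the previous weights used by the weighted summation.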
In some embodiments of this specification, after the determining whether the service node has the suspected abnormality according to the multiple operation parameters, the following may be further included:
1) Intercepting a specified part of the sampling timestamp to be used as a period timestamp.
In an embodiment of the present specification, intercepting the specified part of the sampling timestamp may mean taking a timestamp in year-month-day-hour-minute-second format (YYYYMMDDHHMISS) and keeping only its last three digits as the stored value. For example, for data at the time points [20210209095600 and 20210209095620], only the three digits [600 and 620] are kept, so that the data of each decision point recurs periodically. That is, once the decision period and the sampling period of the node operation data are fixed, the same period timestamps recur in every decision period; 600 and 620 are both such period timestamps. The corresponding old data can then be conveniently replaced according to the period timestamp, which realizes automatic aging of the data and improves the efficiency of abnormal node judgment.
If the decision period is 10 minutes and the sampling period of the node operation data is 20 seconds, there are 30 pieces of intermediate data in one decision period for each service node. For example, for the service node with the IP address 192.168.1.1, starting from zero minutes and zero seconds (000), the data in one decision period may be as follows:
000∣F-ABCD∣192.168.1.1∣0,
020∣F-ABCD∣192.168.1.1∣0,
040∣F-ABCD∣192.168.1.1∣0,
100∣F-ABCD∣192.168.1.1∣1,
120∣F-ABCD∣192.168.1.1∣1,
140∣F-ABCD∣192.168.1.1∣0,
200∣F-ABCD∣192.168.1.1∣1,
220∣F-ABCD∣192.168.1.1∣1,
240∣F-ABCD∣192.168.1.1∣1,
300∣F-ABCD∣192.168.1.1∣0,
320∣F-ABCD∣192.168.1.1∣1,
340∣F-ABCD∣192.168.1.1∣1,
400∣F-ABCD∣192.168.1.1∣1,
420∣F-ABCD∣192.168.1.1∣1,
440∣F-ABCD∣192.168.1.1∣1,
500∣F-ABCD∣192.168.1.1∣1,
520∣F-ABCD∣192.168.1.1∣1,
540∣F-ABCD∣192.168.1.1∣1,
600∣F-ABCD∣192.168.1.1∣1,
620∣F-ABCD∣192.168.1.1∣1,
640∣F-ABCD∣192.168.1.1∣1,
700∣F-ABCD∣192.168.1.1∣1,
720∣F-ABCD∣192.168.1.1∣1,
740∣F-ABCD∣192.168.1.1∣1,
800∣F-ABCD∣192.168.1.1∣1,
820∣F-ABCD∣192.168.1.1∣1,
840∣F-ABCD∣192.168.1.1∣0,
900∣F-ABCD∣192.168.1.1∣1,
920∣F-ABCD∣192.168.1.1∣1,
940∣F-ABCD∣192.168.1.1∣1.
2) Splicing the period timestamp, the application identifier, the service node IP and the status flag bit into intermediate data. The data structure corresponding to the intermediate data may be as shown in fig. 4.
For example for the above-mentioned node operational data:
[20210101001000∣F-ABCD∣192.168.1.1∣731∣0.9644∣123∣90∣80]
after intercepting the specified portion of the sampling timestamp (here, the last three digits, as an example), the node operation data becomes:
[000∣F-ABCD∣192.168.1.1∣731∣0.9644∣123∣90∣80]
In some embodiments of the present disclosure, the determination result of step S102 may be represented by a corresponding status flag bit. When step S102 confirms that the corresponding service node is suspected to be abnormal, the status flag bit of the service node corresponding to the node operation data is set to 1 (that is, 1 indicates suspected abnormal); correspondingly, when step S102 determines that the corresponding service node is not suspected to be abnormal (i.e., normal), the status flag bit of the service node corresponding to the node operation data is set to 0 (i.e., 0 indicates normal).
Accordingly, the periodic timestamp [000], the application identifier [ F-ABCD ], the service node IP [192.168.1.1], the status flag [1] can be concatenated into intermediate data: [000 | F-ABCD | 192.168.1.1 | 1].
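The truncation and splicing steps can be sketched as follows. The function name `to_intermediate`, the `keep_digits` parameter, and the ASCII vertical bar delimiter are assumptions for illustration.

```python
def to_intermediate(timestamp: str, app_id: str, node_ip: str,
                    suspected: bool, keep_digits: int = 3,
                    sep: str = "|") -> str:
    """Build one piece of intermediate data from a judged sample."""
    if keep_digits <= 0 or keep_digits > len(timestamp):
        raise ValueError("keep_digits must fit inside the timestamp")
    # Keep only the tail of the YYYYMMDDHHMISS timestamp; with the last three
    # digits this is the final minute digit plus the two second digits.
    period_timestamp = timestamp[-keep_digits:]
    status_flag = "1" if suspected else "0"
    return sep.join([period_timestamp, app_id, node_ip, status_flag])


intermediate = to_intermediate("20210101001000", "F-ABCD", "192.168.1.1", True)
```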
3) Writing the intermediate data into an intermediate data table.
Compared with the node operation data, the intermediate data is more concise while still containing the period timestamp, the application identifier, the service node IP and the status flag bit required for data aging. Integrating the node operation data and the status flag bit into the intermediate data therefore helps improve the efficiency of judging service node abnormality. However, if no data aging is performed, the same number of new pieces of intermediate data will be added in each subsequent decision period. Therefore, to improve the efficiency of abnormal node determination, a data aging strategy can be used to keep the amount of data in the intermediate data table constant.
Thus, in some embodiments of the present specification, writing the intermediate data into the intermediate data table may include:
1) Judging whether old intermediate data with the same period timestamp exists in the intermediate data table.
2) When old intermediate data with the same period timestamp exists in the intermediate data table, replacing the old intermediate data with the intermediate data.
For example, taking the above-mentioned 30 pieces of intermediate data as an example, if the newly obtained intermediate data is [040 | F-ABCD | 192.168.1.1 | 1] and the intermediate data table already contains [040 | F-ABCD | 192.168.1.1 | 0], then, because the period timestamps of both pieces of intermediate data are 040, old intermediate data with the same period timestamp exists in the intermediate data table and is replaced by the new intermediate data.
Of course, if no old intermediate data with the same period timestamp exists in the intermediate data table, the intermediate data may be written directly into the intermediate data table. Before the first judgment period starts, the intermediate data table is empty; in the first judgment period, therefore, no old intermediate data with the same period timestamp exists, and the intermediate data corresponding to each period timestamp can be written directly. From the second judgment period onward, however, the intermediate data table already holds intermediate data for every period timestamp, so each new piece of intermediate data replaces the old one with the same period timestamp. This data aging keeps the amount of intermediate data fixed at all times.
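The aging rule described above amounts to an insert-or-replace keyed on the period timestamp. A minimal sketch follows, assuming the intermediate data table for a single service node is held in an in-memory dictionary; the names `Record` and `write_intermediate` are hypothetical, and a real deployment would more likely use a database table.

```python
from typing import Dict, Tuple

# (period timestamp, application identifier, service node IP, status flag bit)
Record = Tuple[str, str, str, int]


def write_intermediate(table: Dict[str, Record], record: Record) -> bool:
    """Write one record; return True if old data with the same period
    timestamp existed and was replaced, False if it was newly written."""
    period_ts = record[0]
    replaced = period_ts in table
    table[period_ts] = record  # replace old data or write directly
    return replaced


table: Dict[str, Record] = {}
first = write_intermediate(table, ("040", "F-ABCD", "192.168.1.1", 0))
second = write_intermediate(table, ("040", "F-ABCD", "192.168.1.1", 1))
print(first, second, len(table))
```

Because a record is only ever replaced by one with the same period timestamp, the table size stops growing once every period timestamp has been seen.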
Accordingly, in this case, determining the suspected abnormality rate of the service node in the set judgment period may include:
1) Extracting all intermediate data from the intermediate data table.
2) Determining the number of intermediate data with the suspected abnormal mark among all the intermediate data.
3) Dividing the number of intermediate data with the suspected abnormal mark by the number of all the intermediate data to obtain the suspected abnormality rate of the service node in the judgment period.
For example, all 30 pieces of intermediate data are extracted. Of these 30 pieces, only 4 are normal and the remaining 26 are suspected abnormal. Therefore, the suspected abnormality rate of the service node in the set judgment period can be determined as 26/30 ≈ 86.7%, which is greater than the set threshold of 80%; therefore, the service node is determined to be abnormal.
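The rate calculation and threshold comparison can be sketched as below. The function names and the ten-record sample are illustrative assumptions; only the division of flagged records by all records and the comparison against a set threshold come from the description.

```python
from typing import Iterable, Tuple

# (period timestamp, application identifier, service node IP, status flag bit)
Record = Tuple[str, str, str, int]


def suspected_abnormality_rate(records: Iterable[Record]) -> float:
    """Number of records flagged 1 divided by the number of all records."""
    items = list(records)
    if not items:
        return 0.0  # assumption: an empty table yields a rate of zero
    suspected = sum(1 for r in items if r[3] == 1)
    return suspected / len(items)


def is_abnormal(records: Iterable[Record], threshold: float = 0.8) -> bool:
    """The node is identified as abnormal when the rate exceeds the threshold."""
    return suspected_abnormality_rate(records) > threshold


# Hypothetical sample: 10 records, 9 of them flagged as suspected abnormal.
sample = [(f"{i:03d}", "F-ABCD", "192.168.1.1", 1) for i in range(9)]
sample.append(("009", "F-ABCD", "192.168.1.1", 0))
print(suspected_abnormality_rate(sample), is_abnormal(sample))
```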
In some embodiments of the present specification, after a service node is determined to be abnormal, relevant information about the abnormal node (for example, its IP address and application identifier) may be immediately inserted into a to-be-isolated table, so that an isolation operation can be performed subsequently. In this way, the abnormality judgment method, which superimposes real-time operation indexes on periodic operation indexes and is combined with multi-threshold combination and self-learning dynamic adjustment of the threshold weights, improves the accuracy of abnormal node judgment in the cluster. An abnormal node can be discovered and isolated as soon as it occurs, which improves the service continuity of the business system.
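The multi-threshold combination and the periodic weight adjustment, as recited in claims 1 and 2, can be sketched as follows. Several points here are assumptions rather than statements of the specification: a sub-value of 1 is assigned when a parameter exceeds its threshold and 0 otherwise, the new weights are taken to be proportional to the exceedance counts, and the parameter names (`cpu`, `mem`, `latency`) and all numeric values are invented for the example.

```python
from typing import Dict


def judge_suspected(params: Dict[str, float],
                    thresholds: Dict[str, float],
                    weights: Dict[str, float],
                    first_threshold: float) -> bool:
    """Weighted sum of per-parameter sub-values compared to a first threshold."""
    weighted_sum = 0.0
    for name, value in params.items():
        # Assumed sub-value rule: 1 if the parameter exceeds its threshold.
        sub_value = 1 if value > thresholds[name] else 0
        weighted_sum += weights[name] * sub_value
    return weighted_sum > first_threshold


def update_weights(exceed_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed update rule: weights proportional to how often each parameter
    exceeded its threshold across the cluster in the historical period."""
    total = sum(exceed_counts.values())
    if total == 0:
        # No exceedances observed: fall back to equal weights.
        return {name: 1.0 / len(exceed_counts) for name in exceed_counts}
    return {name: count / total for name, count in exceed_counts.items()}


params = {"cpu": 90, "mem": 80, "latency": 123}
thresholds = {"cpu": 85, "mem": 85, "latency": 100}
weights = update_weights({"cpu": 2, "mem": 1, "latency": 1})
print(weights, judge_suspected(params, thresholds, weights, 0.6))
```

In this sample, `cpu` and `latency` exceed their thresholds, so the weighted sum is 0.5 + 0.25 = 0.75 and the node is judged suspected abnormal against a first threshold of 0.6.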
While the process flows described above include operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
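For the multi-threaded case, claim 4 assigns node operation data to threads by summing the numerical values of the service node IP field and taking the remainder against the number of started threads. A minimal sketch follows, assuming an IPv4 address whose four dotted fields are summed and thread numbers that start at 0; the function name `assign_thread` is hypothetical.

```python
def assign_thread(node_ip: str, thread_count: int) -> int:
    """Return the number of the thread that should process this node's data."""
    if thread_count <= 0:
        raise ValueError("thread_count must be positive")
    # Add all numerical values of the IP field to obtain a summary value.
    summary_value = sum(int(part) for part in node_ip.split("."))
    # The remainder against the number of started threads selects the thread.
    return summary_value % thread_count


# 192 + 168 + 1 + 1 = 362, and 362 % 4 = 2
print(assign_thread("192.168.1.1", 4))
```

Since the result depends only on the IP, all data from one service node is handled by the same thread.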
Corresponding to the above abnormal service node identification method, the present specification also provides an embodiment of an abnormal service node identification apparatus. Referring to fig. 5, in some embodiments of the present specification, the abnormal service node identifying apparatus may include:
a receiving module 51, which may be configured to receive node operation data in real time; the node operation data comprises multiple operation parameters of the service node;
a judging module 52, which may be configured to judge whether the service node is suspected to be abnormal according to the multiple operation parameters;
a determining module 53, which may be configured to determine the suspected abnormality rate of the service node in a set judgment period;
an identifying module 54, which may be configured to identify the service node as an abnormal service node when the suspected abnormality rate of the service node in the set judgment period exceeds a set threshold.
For convenience of description, the above apparatus is described as being divided into various units by function, and the units are described separately. Of course, when implementing this specification, the functions of the various units may be implemented in the same one or more pieces of software and/or hardware.
Embodiments of the present description also provide a computer device. As shown in fig. 6, in some embodiments of the present description, the computer device 602 may include one or more processors 604, such as one or more Central Processing Units (CPUs) or Graphics Processing Units (GPUs), each of which may implement one or more hardware threads. The computer device 602 may also include any memory 606 for storing any kind of information, such as code, settings, data, etc., and, in a particular embodiment, a computer program stored on the memory 606 and executable on the processor 604; when executed by the processor 604, the computer program may execute the instructions of the above-described method. For example, and without limitation, memory 606 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 602. In one case, when the processor 604 executes the associated instructions, which are stored in any memory or combination of memories, the computer device 602 may perform any of the operations of the associated instructions. The computer device 602 also includes one or more drive mechanisms 608, such as a hard disk drive mechanism, an optical disk drive mechanism, etc., for interacting with any memory.
Computer device 602 may also include an input/output module 610 (I/O) for receiving various inputs (via input device 612) and for providing various outputs (via output device 614). One particular output mechanism may include a presentation device 616 and an associated graphical user interface 618 (GUI). In other embodiments, the input/output module 610 (I/O), input device 612, and output device 614 may be omitted, in which case the computer device 602 serves merely as a computer device in a network. Computer device 602 may also include one or more network interfaces 620 for exchanging data with other devices via one or more communication links 622. One or more communication buses 624 couple the above-described components together.
Communication link 622 may be implemented in any manner, such as over a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communication link 622 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products of some embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processor to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processor, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processor to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processor to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computer device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computer device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processors that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for the relevant points reference may be made to the corresponding description of the method embodiment. In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, provided they are not mutually inconsistent, various embodiments or examples and features of various embodiments or examples described in this specification can be combined by one skilled in the art.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (7)

1. An abnormal service node identification method is characterized by comprising the following steps:
receiving node operation data in real time; the node operation data comprises various operation parameters of the service node;
judging whether the service node is suspected to be abnormal or not according to the multiple operation parameters; wherein the judging whether the service node is suspected to be abnormal according to the multiple operation parameters includes: comparing the multiple operation parameters of the service node with corresponding operation parameter thresholds respectively to correspondingly obtain identification result sub-values of the operation parameters; carrying out weighted summation on the identification result sub-values of the operation parameters according to a preset weighted summation formula to obtain a weighted sum; comparing the weighted sum to a first threshold; identifying the service node as suspected abnormal when the weighted sum is greater than the first threshold;
determining the suspected abnormal rate of the service node in a set judgment period;
when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value, identifying the service node as an abnormal service node;
the node operation data also comprises a sampling timestamp, an application identifier and a service node IP; the judgment result of the judging is represented by a corresponding status flag bit; after the judging whether the service node is suspected to be abnormal according to the multiple operation parameters, the method further comprises: intercepting a specified part of the sampling timestamp to be used as a period timestamp; splicing the period timestamp, the application identifier, the service node IP and the status flag bit into intermediate data; writing the intermediate data into an intermediate data table;
the determining the suspected abnormality rate of the service node in a set determination period includes:
extracting all intermediate data from the intermediate data table;
determining the number of intermediate data with suspected abnormal marks in all the intermediate data;
and dividing the number of the intermediate data with the suspected abnormal marks by the number of all the intermediate data to obtain the suspected abnormal rate of the service node in the judgment period.
2. The abnormal service node identification method of claim 1, wherein in the weighted sum formula, the weight parameter of each operating parameter is periodically updated by:
determining the times that each operation parameter exceeds a corresponding operation parameter threshold value in a specified historical time period for all service nodes in the service cluster;
determining new weight parameters of the operation parameters according to the times;
and updating the weight parameters in the weighted summation formula by using the new weight parameters.
3. The abnormal service node identification method of claim 1, wherein said writing the intermediate data to an intermediate data table comprises:
judging whether old intermediate data with the same period timestamp exist in the intermediate data table;
and when old intermediate data with the same period time stamp exist in the intermediate data table, replacing the old intermediate data with the intermediate data.
4. The abnormal service node identification method of claim 1, wherein the method enables multiple threads to process in parallel; wherein each thread is assigned node operation data as follows:
fetching one piece of node operation data from a database of node operation data;
adding all numerical values of the service node IP field in the node operation data to obtain a summary value;
performing a remainder calculation on the summary value and the number of started threads to obtain a remainder value;
and allocating the node operation data to the thread whose thread number is the same as the remainder value.
5. An abnormal service node identification apparatus, comprising:
the receiving module is used for receiving the node operation data in real time; the node operation data comprises various operation parameters of the service node;
the judging module is used for judging whether the service node is suspected to be abnormal or not according to the various operation parameters; wherein the determining whether the service node is suspected to be abnormal according to the operation parameters includes: comparing various operation parameters of the service node with corresponding operation parameter thresholds respectively to correspondingly obtain identification result sub-values of the operation parameters; carrying out weighted summation on the recognition result sub-values of the operation parameters according to a preset weighted summation formula to obtain a weighted sum; comparing the weighted sum to a first threshold; identifying the service node as suspected abnormal when the weighted sum is greater than the first threshold;
the determining module is used for determining the suspected abnormal rate of the service node in a set judging period;
the identification module is used for identifying the service node as an abnormal service node when the suspected abnormal rate of the service node in the set judgment period exceeds a set threshold value;
the node operation data also comprises a sampling timestamp, an application identifier and a service node IP; the judgment result of the judging is represented by a corresponding status flag bit; the judging module is further configured to intercept a specified part of the sampling timestamp as a period timestamp after judging whether the service node is suspected to be abnormal according to the multiple operation parameters; splice the period timestamp, the application identifier, the service node IP and the status flag bit into intermediate data; and write the intermediate data into an intermediate data table;
the determining the suspected abnormality rate of the service node in a set determination period includes: extracting all intermediate data from the intermediate data table;
determining the number of intermediate data with suspected abnormal marks in all the intermediate data;
and dividing the number of the intermediate data with the suspected abnormal mark by the number of all the intermediate data to obtain the suspected abnormal rate of the service node in the judgment period.
6. A computer device comprising a memory, a processor, and a computer program stored on the memory, characterized in that the computer program, when executed by the processor, executes the instructions of the method according to any one of claims 1-4.
7. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor of a computer device, executes instructions of a method according to any one of claims 1-4.
CN202110264381.4A 2021-03-11 2021-03-11 Abnormal service node identification method, device, equipment and storage medium Active CN113055246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264381.4A CN113055246B (en) 2021-03-11 2021-03-11 Abnormal service node identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264381.4A CN113055246B (en) 2021-03-11 2021-03-11 Abnormal service node identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113055246A CN113055246A (en) 2021-06-29
CN113055246B true CN113055246B (en) 2022-11-22

Family

ID=76511391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264381.4A Active CN113055246B (en) 2021-03-11 2021-03-11 Abnormal service node identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113055246B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886757A (en) * 2021-08-30 2022-01-04 国网山东省电力公司信息通信公司 Power communication network PTN network service operation reliability assessment method
CN113965497B (en) * 2021-10-20 2022-12-06 深圳平安医疗健康科技服务有限公司 Server abnormity identification method and device, computer equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106856489A (en) * 2015-12-08 2017-06-16 阿里巴巴集团控股有限公司 A kind of service node switching method and apparatus of distributed memory system
CN110908824A (en) * 2019-12-04 2020-03-24 支付宝(杭州)信息技术有限公司 Fault identification method, device and equipment
CN111338903A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Transaction abnormity alarming method and device
CN111897705A (en) * 2020-07-06 2020-11-06 上海泛微网络科技股份有限公司 Service state processing method, service state processing device, model training method, model training device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9071535B2 (en) * 2013-01-03 2015-06-30 Microsoft Technology Licensing, Llc Comparing node states to detect anomalies
US20170243268A1 (en) * 2016-02-23 2017-08-24 International Business Machines Corporation Method and system for determining an optimized service package model for market participation
CN108616429B (en) * 2018-05-14 2019-12-13 平安科技(深圳)有限公司 reconnection method and device for push service
CN110475224B (en) * 2019-07-01 2022-03-11 南京邮电大学 Sensor data processing and collaborative prediction method based on edge calculation
CN110837432A (en) * 2019-11-14 2020-02-25 北京金山云网络技术有限公司 Method and device for determining abnormal node in service cluster and monitoring server
CN111031017B (en) * 2019-11-29 2021-12-14 腾讯科技(深圳)有限公司 Abnormal business account identification method, device, server and storage medium

Also Published As

Publication number Publication date
CN113055246A (en) 2021-06-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant