CN113094215A

CN113094215A - Fault detection method, system and device

Info

Publication number: CN113094215A
Application number: CN201911335084.3A
Authority: CN
Inventors: 陶娟
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Henan Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Henan Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2021-07-09

Abstract

The invention discloses a fault detection method and belongs to the technical field of computer networks. In the method, index data associating a client server with its client can be actively acquired and analyzed to determine whether the client server communicates normally with the client, and thereby whether the client server has failed; and a test system comprising a simulation terminal and middleware is constructed for the client server, the test system initiates a test command to the client server, and whether the client server has failed is judged from the test result. Whereas the prior art determines whether the client server has failed in a manual, passive manner, this method actively monitors the index data of the client server and can also test the client server through the simulation system. This solves the problem that determining whether the client server has failed consumes a great deal of manpower and time, and improves fault detection efficiency.

Description

Fault detection method, system and device

Technical Field

The present invention relates to the field of computer networks, and in particular to a fault detection method, system, and device.

Background

An Internet of Things (IoT) client server system is customer-owned equipment, so the operator cannot directly obtain data from inside it. Only when the client server system fails and the IoT service becomes unusable does the operator start troubleshooting, after receiving complaints.

Troubleshooting currently relies mainly on communicating with the customer, collecting feedback, and confirming step by step whether the IoT service is unusable because of a fault in the client server system. The current approach is therefore passive and requires a great deal of manpower and time.

Disclosure of Invention

The invention provides a fault detection method, system, and device to solve the problem that existing troubleshooting is passive and consumes a large amount of manpower and time.

In a first aspect, the present invention provides a fault detection method, including:

acquiring at least one index data of a client server, wherein the index data is the associated data of the client server and a client;

analyzing the index data to determine whether the client server fails;

and/or

constructing a test system for the client server, wherein the test system comprises a simulation terminal and middleware, the middleware establishing a communication connection between the simulation terminal and the client server;

the simulation terminal or the middleware initiates a test command to the client server to obtain a test result;

and determining whether the client server fails according to the test result.

In the foregoing method for detecting a failure, the analyzing the index data to determine whether the client server fails includes:

respectively calculating the dispersion of the numerical values of the index data in two successive preset times, and calculating the change degree of the two successive dispersion;

when the change degree exceeds a preset degree, determining that the client server fails;

and/or

Comparing the value of the index data with a preset limit range;

and when the numerical value of the index data falls outside the preset limit range, determining that the client server fails.

In the above fault detection method, the two preset periods of time include a first preset period of time and a second preset period of time, and the number of time cycles included in the first preset period of time and the second preset period of time are the same.

In the above fault detection method, the first preset time period is earlier than the second preset time period by at least one time period.

In the above fault detection method, constructing a test system for a client server, where the test system includes a simulation terminal and a middleware, and the middleware establishes a communication connection between the simulation terminal and the client server includes:

acquiring attribute information of a client server, wherein the attribute information comprises a service type, a host identity, an IP address and a use protocol;

and establishing the connection between the simulation terminal and the client server through the middleware based on the attribute information.

In the above fault detection method, the test command initiated by the simulation terminal to the client server includes:

based on the corresponding relationship between the preset use protocol and the test tool, the simulation client initiates a test command to the client server by using the corresponding test tool according to the type of the use protocol of the client server.

In the above fault detection method, the initiating, by the middleware, a test command to the client server includes:

initiating a test command to the client server by a core network element in the middleware that is closest to the client server.

In a second aspect, the present invention provides a fault detection system comprising a memory and a processor;

the memory stores index data;

the processor is configured to analyze the index data to determine whether the client server has failed; and/or

the memory stores a test command;

the processor is configured to initiate a test on the client server to determine whether the client server has failed.

In a third aspect, the present invention provides an apparatus comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method as claimed in any one of the above.

In a fourth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as defined in any one of the above.

The fault detection method provided by the embodiment of the invention comprises two implementation modes. In the first mode, index data associating the client server with the client is actively acquired and analyzed to determine whether the client server communicates normally with the client, and thereby whether the client server has failed. In the second mode, a test system comprising a simulation terminal and middleware is constructed for the client server, the test system initiates a test command to the client server, and whether the client server has failed is judged from the test result. Either mode can be used on its own to determine whether the client server has failed, or both can be used together, in which case the client server can be considered to have failed as soon as one mode reports a failure. Whereas the prior art determines whether the client server has failed in a manual, passive manner, the method provided by the invention actively monitors the index data of the client server and can also test the client server through the simulation system. This solves the problem that determining whether the client server has failed consumes a great deal of manpower and time, and improves fault detection efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart of a fault detection method according to a second embodiment of the present invention;

FIG. 2 is a flowchart of a fault detection method according to a third embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the connection between the simulation terminal and the client server in the third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the specific embodiments of the present invention and the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The fault detection process provided by the first embodiment of the present invention may be executed by a test system, which is connected to the client server to perform fault detection on it. The fault detection flow includes step S100 and/or step S200.

S100 includes S140 and S160.

S140: acquiring at least one index data of the client server, wherein the index data is the associated data of the client server and the client.

S160: analyzing the index data to determine whether the client server fails.

In S100, data on the client server that is associated with the client is acquired as index data, and the index data is then analyzed to determine whether the client server communicates normally with the client, and hence whether the client server has failed.

S200 includes S220, S240, and S260.

S220: constructing a test system for the client server, wherein the test system comprises a simulation terminal and middleware that establishes a communication connection between the simulation terminal and the client server.

S240: the simulation terminal or the middleware initiates a test command to the client server to obtain a test result.

S260: determining whether the client server fails according to the test result.

In S200, a test system comprising a simulation terminal and middleware is constructed for the client server, the test system issues a test command to the client server, and whether the client server has failed is determined from the test result.

The fault detection method of this embodiment can determine whether the client server has failed using only one of S100 and S200. To improve detection accuracy and avoid missed faults, S100 and S200 can also be used in combination, in which case the client server is determined to have failed when either of them reports a failure. Alternatively, when S100 and S200 are combined, the client server may be determined to have failed only when both of them report a failure.
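The two combination policies can be sketched as follows. This is only an illustration; the function name `combine_verdicts` and the `require_both` flag are hypothetical and do not appear in the disclosure.

```python
def combine_verdicts(s100_fault: bool, s200_fault: bool,
                     require_both: bool = False) -> bool:
    """Combine the verdicts of S100 (index analysis) and S200 (active test).

    require_both=False: a fault reported by either mode is enough,
                        which reduces missed faults.
    require_both=True:  both modes must report a fault.
    """
    if require_both:
        return s100_fault and s200_fault
    return s100_fault or s200_fault
```

With the default policy, `combine_verdicts(True, False)` reports a fault, while the stricter policy does not.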

After S160, the method further includes S180: when the client server fails, the fault detection system sends out an alarm.

After S260, the method further includes S280: when the client server fails, the simulation terminal sends out an alarm.

Example two

This embodiment describes the flow of S100 of the first embodiment in detail. FIG. 1 is a flowchart of the fault detection method in the second embodiment of the present invention, which may be executed by a test system or by a DPI system capable of acquiring data of the client server.

As analyzed above, S100 includes S140 and S160.

S140: acquiring at least one index data of the client server, wherein the index data is the associated data of the client server and the client. The index data can be formed by the client server from the detection results reported by each data probe, and there may be one, two, three, or more kinds of index data. Specifically, the index data can be selected to match the key services and service revenue of the client server and the transmission mode of the service data.

S160: analyzing the index data to determine whether the client server fails. S160 includes at least one of S161 and S162, and whether the client server fails may be determined based on at least one of them.

S161: calculating the dispersion of the values of the index data in each of two successive preset time periods, calculating the degree of change between the two dispersions, and determining that the client server fails when the degree of change exceeds a preset degree. Since there is at least one kind of index data, the client server may be determined to have failed when the degree of change of at least one kind (or at least two kinds, or 50% of all kinds, etc.) of index data between the first and second preset time periods exceeds the preset degree.
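The "at least one kind, at least two kinds, or 50% of all kinds" rule can be expressed as a small quorum check. The sketch below is illustrative only: `quorum_fault`, `min_count`, and `min_fraction` are hypothetical names, and the input is assumed to be a mapping from index name to whether that index exceeded the preset degree.

```python
import math
from typing import Dict, Optional


def quorum_fault(exceeded: Dict[str, bool], min_count: int = 1,
                 min_fraction: Optional[float] = None) -> bool:
    """Return True when enough kinds of index data exceed the preset degree.

    min_count    -- minimum number of kinds that must exceed it
    min_fraction -- optional minimum share of all kinds (e.g. 0.5 for 50%)
    """
    hits = sum(exceeded.values())
    required = min_count
    if min_fraction is not None:
        required = max(required, math.ceil(min_fraction * len(exceeded)))
    return hits >= required
```

For four monitored indexes of which one exceeds the preset degree, the default rule reports a fault, whereas `min_count=2` or `min_fraction=0.5` does not.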

When the analysis is based on S161, the index data acquired in S140 includes, but is not limited to, at least one of: uplink and downlink traffic, number of active users, TCP first and second handshake success rate, HTTP 5XX failure rate, HTTP first-packet delay, uplink RTT delay, downlink RTT delay, CoAP 5.XX failure rate, and MQTT response success rate.

Uplink and downlink traffic is used to judge whether traffic is generated between the client terminal and the client server and to monitor how the traffic changes.

The number of active users tracks the number of users whose client terminals generate traffic with the client server; when the client server is down and unavailable, the number of active users is 0.

The TCP first and second handshake success rate reflects the fact that the first step of communication between the client terminal and the client server is to establish a TCP connection. When the client server fails, for example when a port is unavailable, the TCP connection cannot be established, which shows up as a failure of the first and second handshakes.

The HTTP 5XX failure rate is the number of failures caused by the client server side divided by the number of HTTP requests. It reflects the proportion of HTTP failures attributable to the client server and is used to monitor the state of the server. The HTTP 5XX failures include the events in Table 1.

The HTTP first-packet delay is the average delay of successful HTTP responses within the statistical period.

The uplink RTT delay is the average uplink data transmission delay within the statistical period, and the downlink RTT delay is the average downlink data transmission delay within the statistical period.

The CoAP 5.XX failure rate is the number of failures caused by the client server side divided by the number of CoAP requests. It reflects the proportion of CoAP failures attributable to the client server and is used to monitor its state.

The MQTT response success rate is the number of responses from the client server side divided by the number of request messages from the client terminal. It is used to monitor the state of the MQTT server; when the MQTT server fails, the MQTT response success rate is low. The client server responses and client terminal request messages are listed in Table 2.
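The three ratio-type indexes defined above can be computed from per-period counters as in the following sketch. The `PeriodCounters` fields and function names are hypothetical; the disclosure only gives the numerator and denominator of each ratio. Returning `None` when there were no requests is an assumption made here so that an idle period is not mistaken for a healthy or a failed one.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PeriodCounters:
    """Counters for one statistical period (hypothetical field names)."""
    http_requests: int
    http_5xx_failures: int
    coap_requests: int
    coap_5xx_failures: int
    mqtt_terminal_requests: int
    mqtt_server_responses: int


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when nothing was requested."""
    return numerator / denominator if denominator else None


def ratio_indexes(c: PeriodCounters) -> Dict[str, Optional[float]]:
    return {
        # server-side failures / HTTP requests
        "http_5xx_failure_rate": safe_ratio(c.http_5xx_failures, c.http_requests),
        # server-side failures / CoAP requests
        "coap_5xx_failure_rate": safe_ratio(c.coap_5xx_failures, c.coap_requests),
        # server responses / terminal request messages
        "mqtt_response_success_rate": safe_ratio(
            c.mqtt_server_responses, c.mqtt_terminal_requests),
    }
```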

TABLE 1

TABLE 2

Corresponding to S161, the index data acquired in S140 includes, but is not limited to: uplink and downlink traffic, number of active users, TCP first and second handshake success rate, HTTP 5XX failure rate, HTTP first-packet delay, uplink RTT delay, downlink RTT delay, CoAP 5.XX failure rate, and MQTT response success rate.

S161 calculates the degree of change in the dispersion of the values of the index data, where the index data may be that of the top-N IoT services ranked by number of users (N equal to 1, 2, 3, and so on). S161 includes S1611, S1612, and S1613.

S1611: and acquiring the numerical values of the index data of the latest Y time periods based on the preset period duration T from the acquired index data. The index data may be obtained by using the client ID or IP as a dimension, the preset period duration T may default to 10 minutes (or 15 minutes, 5 minutes, 1 minute, etc.), and the latest Y time periods may be the latest 5 periods (or 15 periods, 10 periods, 3 periods, etc.).
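One way to obtain the per-cycle values in S1611 is to bucket raw samples by cycle and keep the newest Y buckets. The sketch below assumes samples arrive as `(epoch_seconds, value)` pairs and that the samples within a cycle are averaged; both are assumptions for illustration, as is the name `latest_cycle_values`.

```python
from collections import defaultdict
from statistics import fmean


def latest_cycle_values(samples, cycle_seconds=600, y=5):
    """Average (epoch_seconds, value) samples per cycle of length T and
    return the values of the latest `y` cycles that contain data,
    in chronological order. Defaults follow T = 10 minutes, Y = 5."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts // cycle_seconds)].append(value)
    newest = sorted(buckets)[-y:]
    return [fmean(buckets[k]) for k in newest]
```

In practice the samples would first be filtered by client ID or IP, which is the dimension named in S1611.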

S1612: dividing the numerical values of the index data of the Y time periods into two groups of numerical values according to time, wherein the first group of numerical values are the numerical values of the index data of a first preset time period, and the second group of numerical values are the numerical values of the index data of a second preset time period, and calculating the absolute value of the difference value between the standard deviation of the first group of numerical values and the standard deviation of the second group of numerical values.

The first and second preset time periods may contain the same or a different number of time cycles, and they may or may not overlap. For example, when Y is equal to 5, the first preset time period may be the first 4 of the 5 cycles and the second preset time period the last 4 of the 5 cycles; alternatively, when Y is equal to 5, the first preset time period may be the first 2 of the 5 cycles and the second preset time period the last 2 of the 5 cycles.

In addition, to obtain a more accurate calculation result, the first preset time period may be earlier than the second preset time period by one time cycle, so that the two preset time periods overlap. For example, when Y is equal to 5, the first preset time period may be the first 4 of the 5 cycles and the second preset time period the last 4 of the 5 cycles; or, when Y is equal to 7, the first preset time period may be the first 6 of the 7 cycles and the second preset time period the last 6 of the 7 cycles.
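The window choices in these examples amount to taking the first and last `window` values of the Y-cycle series. A minimal sketch, with the hypothetical helper name `split_windows`:

```python
def split_windows(values, window):
    """Split a chronological series into the first and the last `window`
    values. The two groups overlap whenever window > len(values) / 2."""
    if not 0 < window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return list(values[:window]), list(values[-window:])
```

With Y = 5, `split_windows(series, 4)` yields two groups offset by one cycle and sharing three cycles, while `split_windows(series, 2)` yields two non-overlapping groups.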

The standard deviation can be used for calculating the dispersion degree of the index data, and if the dispersion degree of the index data suddenly increases, the index is suddenly changed, and the server may have a fault. Of course, as a variant, it is also possible to calculate the difference between the variance of the first set of values and the variance of the second set of data. In addition to the standard deviation and variance, other common concepts that measure the degree of dispersion of the index data may be utilized.

S1613: when the absolute value of the difference is greater than a preset difference value, determining that the client server fails. The preset difference value defaults to 2 and can be adjusted to 1, 3, 4, and so on according to actual requirements.

The following takes the TCP first and second handshake success rate as the index data and describes how to determine whether the client server fails based on the degree of change of that rate over two successive preset time periods.

First, the values of the TCP first and second handshake success rate for the latest 5 cycles are obtained and arranged in chronological order, each cycle being 10 minutes. The obtained values are: 99.21% at 1:10, 99.38% at 1:20, 99.25% at 1:30, 99.31% at 1:40, and 23.01% at 1:50. The first to fourth values are taken as the first group and their standard deviation is calculated, giving σ(r)₁; the second to fifth values are taken as the second group and their standard deviation is calculated, giving σ(r)₂.

The standard deviation formula is as follows:

$$\sigma(r) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - r\right)^2}$$

where N is the number of values of the index data (index values for short); x_i is an index value; r is the mean of the N index values; and σ(r) is the standard deviation. The standard deviation reflects the dispersion: the larger the standard deviation, the higher the dispersion and the larger the fluctuation of the index; conversely, the smaller the standard deviation, the smaller the fluctuation.

After the standard deviation of the first group of values (the first standard deviation) and that of the second group (the second standard deviation) have been calculated, the first standard deviation is subtracted from the second. If the resulting difference is greater than z, the preset difference value, the index of the server is considered to fluctuate strongly and to be abnormal. The expression is as follows:

σ(r)₂ - σ(r)₁ > z

Alternatively, the absolute value of the difference between the second and first standard deviations can be calculated, and an abnormality is considered to exist when that absolute value is greater than z.
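The worked example can be reproduced with a few lines of Python. This sketch assumes the population standard deviation (division by N, via `statistics.pstdev`), success rates expressed in percent, and the default threshold z = 2 interpreted in the same units; the function name `dispersion_change` is hypothetical.

```python
from statistics import pstdev


def dispersion_change(values, window):
    """Second-window standard deviation minus first-window standard deviation."""
    first, second = values[:window], values[-window:]
    return pstdev(second) - pstdev(first)


# TCP first and second handshake success rate (%), 1:10 through 1:50
rates = [99.21, 99.38, 99.25, 99.31, 23.01]
change = dispersion_change(rates, window=4)
is_fault = abs(change) > 2  # z = 2, the default preset difference value
```

The first group has a standard deviation well under 0.1, while the drop to 23.01% pushes the second group's standard deviation to roughly 33, so the change far exceeds z and the client server is flagged as faulty.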

In S161, the client server is determined to have failed when the degree of change of the values of at least one index data over two successive preset time periods exceeds the preset degree. To ensure the accuracy of the judgment, the client server may instead be considered to have failed only when the degree of change of at least two or three kinds of index data exceeds the preset degree. For example, the degree of change in the dispersion of the MQTT response success rate, or of the number of active users, over the two preset time periods may also be calculated.

S161 above determines whether the client server operates normally from the dispersion of the index values over two successive preset time periods, that is, from whether the index values change suddenly within a certain time. S162, which determines whether the client server fails according to a preset limit range, is described next.

S162: comparing the value of the index data with a preset limit range, and determining that the client server fails when the value falls outside the preset limit range.

To obtain the preset limit range used in S162, service data of a large number of users is collected for each index data, the range within which that index data falls when the client server is in normal use is derived, and this range is taken as the preset limit range.

Corresponding to S162, the index data acquired in S140 includes, but is not limited to, at least one of: TCP first and second handshake success rate, HTTP 5XX failure rate, HTTP first-packet delay, uplink RTT delay, downlink RTT delay, CoAP 5.XX failure rate, and MQTT response success rate. Unlike S161, S162 does not need the uplink and downlink traffic or the number of active users, because the values checked in S162 depend on the specific services of the users and not on the number of users or the volume of traffic. In other words, S162 tolerates changes in user activity and in uplink and downlink traffic.

Specifically, the preset limit range of each index data in S162 is shown in Table 3. The ranges in Table 3 are examples; in practical applications, the values may be increased or decreased by 10%.

TABLE 3

In S162, if the value of at least one index data falls outside its preset limit range, the client server is determined to have failed. To improve the accuracy of the judgment, the client server may instead be considered to have failed only when the values of at least two or three kinds of index data fall outside their corresponding preset limit ranges.
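The limit-range check of S162 can be sketched as follows. The index names and the limit values used in the usage note are placeholders chosen for illustration, not the ranges of Table 3, and `check_limits` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def check_limits(values: Dict[str, float],
                 limits: Dict[str, Tuple[float, float]],
                 min_violations: int = 1) -> Tuple[bool, List[str]]:
    """Compare each index value with its (low, high) preset limit range.

    Returns (is_fault, names of the indexes outside their range).
    Indexes without a configured range are ignored.
    """
    violations = [
        name for name, value in values.items()
        if name in limits and not (limits[name][0] <= value <= limits[name][1])
    ]
    return len(violations) >= min_violations, violations
```

For instance, with placeholder limits of 0.9 to 1.0 for the handshake success rate and 0 to 500 ms for the first-packet delay, a success rate of 0.23 alone triggers a fault under the default rule but not when `min_violations=2`.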

Example three

This embodiment describes the flow of S200 of the first embodiment in detail. FIG. 2 is a flowchart of the fault detection method in the third embodiment of the present invention.

As analyzed above, S200 includes S220, S240, and S260.

S220: constructing a test system for the client server, wherein the test system comprises a simulation terminal and middleware that establishes a communication connection between the simulation terminal and the client server. Specifically, the service type can be obtained through investigation and DPI service identification, and the host identifier, IP address, port information, use protocol, and so on of the client server can be determined from the APN information. The information so obtained is referred to as attribute information and is provided to the constructed test system, so that the simulation terminal in the test system can establish a communication connection with the client server through the middleware. Middleware refers to the gateways, switches, routers, network elements, and the like between the simulation terminal and the client server.
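The attribute information handed to the test system could be represented as a small validated record, as sketched below. The class and field names are hypothetical, and the `endpoint()` format assumes an IPv4 address.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerAttributes:
    """Attribute information of a client server (hypothetical field names)."""
    service_type: str
    host_id: str
    ip_address: str
    port: int
    protocol: str  # use protocol, e.g. "TCP" or "UDP"

    def __post_init__(self):
        ipaddress.ip_address(self.ip_address)  # raises ValueError if malformed
        if not 0 < self.port < 65536:
            raise ValueError("port out of range")

    def endpoint(self) -> str:
        """Return the 'address:port' form used by port-level test commands."""
        return f"{self.ip_address}:{self.port}"
```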

S240: the simulation terminal or the middleware initiates a test command to the client server to obtain a test result. S240 includes S241 and S242; S241 may be executed by the simulation terminal, and S242 is executed by the middleware connecting the simulation terminal and the client server.

S241: based on the preset correspondence between use protocols and test tools, the simulation client initiates a test command to the client server with the test tool corresponding to the type of use protocol of the client server. Different use protocols call for different test tools.

For example, when the simulation client and the client server communicate over TCP, the available test tools include Tcping, Psping, Paping, and Tracert. Tcping provides a connectivity test; Psping provides ICMP ping, TCP ping, delay test, and bandwidth test; and Tracert provides route tracing. When the simulation client and the client server communicate over UDP, the available test tool is Ncat, which provides a UDP port connectivity test. See Tables 4 and 5 for the tools, their functions, and examples and analyses of test results.

TABLE 4

Specifically, the ICMP ping test command in Table 4 is, for example, psping -4 -n 10 -w 2 -h 10 180.76.76.76, where -4 forces an IPv4 connection; -n sets the number of regular pings, or the test duration in seconds when the value is suffixed with s; -w sets the number of warm-up pings, that is, how many warm-up test connections are made before the regular test; -h prints a histogram of the measured delays in milliseconds, from minimum to maximum; and 180.76.76.76 is the destination address.

The parameters of the TCP ping test command have the same meaning as those of the ICMP ping test command. TCP ping tests a port: for example, psping -n 10 -w 2 -h 42.159.27.213:8081 performs a connectivity test on port 8081.
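For readers without Psping at hand, the idea of a TCP port connectivity test can be illustrated with a plain socket connection. This is a substitute technique using Python's standard library, not the Psping tool itself, and the function name `tcp_port_reachable` is hypothetical.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within
    `timeout` seconds, False if it is refused, unreachable, or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A refused or timed-out connection corresponds to the handshake failure that the TCP first and second handshake success rate is meant to capture.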

In the delay test command psping -l 1500 -n 300 -h 10 xxx.xxx:xxx, -l 1500 means that each packet sent is 1500 bytes in size (the MTU, or Maximum Transmission Unit, on Layer 2 Ethernet is usually capped at 1500 bytes), and -n 300 means that 300 packets are sent.

In the bandwidth test, -b indicates that a bandwidth test is to be performed, -l 1500 indicates packets of 1500 bytes, and -n 15000 indicates that 15000 packets are used.
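The Psping command lines described above can be assembled programmatically, for example by a dial testing platform. The sketch below only builds argument lists from the flags named in the text (-4, -n, -w, -h, -l, -b) and does not run anything; the helper names and defaults are hypothetical.

```python
from typing import List


def psping_icmp(target: str, count: int = 10, warmup: int = 2,
                buckets: int = 10) -> List[str]:
    """ICMP ping: force IPv4, `count` pings after `warmup` warm-up pings."""
    return ["psping", "-4", "-n", str(count), "-w", str(warmup),
            "-h", str(buckets), target]


def psping_delay(endpoint: str, size: int = 1500, count: int = 300,
                 buckets: int = 10) -> List[str]:
    """Delay test: `count` packets of `size` bytes to address:port."""
    return ["psping", "-l", str(size), "-n", str(count),
            "-h", str(buckets), endpoint]


def psping_bandwidth(endpoint: str, size: int = 1500,
                     count: int = 15000) -> List[str]:
    """Bandwidth test: -b with `count` packets of `size` bytes."""
    return ["psping", "-b", "-l", str(size), "-n", str(count), endpoint]
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems if the command is later passed to a process launcher.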

TABLE 5

In addition, Tcping and Tracert of table 4, and Ncat of table 5 are common test tools in the art, and therefore, each parameter in its test command is not explained in detail.
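The preset correspondence between use protocol and test tools in S241 is essentially a lookup table. A minimal sketch using the tools named in this embodiment, with the hypothetical names `TOOLS_BY_PROTOCOL` and `tools_for`:

```python
TOOLS_BY_PROTOCOL = {
    "TCP": ("Tcping", "Psping", "Paping", "Tracert"),
    "UDP": ("Ncat",),
}


def tools_for(protocol: str) -> tuple:
    """Return the test tools registered for a use protocol (case-insensitive)."""
    try:
        return TOOLS_BY_PROTOCOL[protocol.strip().upper()]
    except KeyError:
        raise ValueError(f"no test tool registered for protocol {protocol!r}") from None
```

Supporting a further protocol then only requires adding an entry to the table.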

S242: the test command to the client server is initiated by the core network element in the middleware that is closest to the client server. Specifically, referring to FIG. 3, of the two core network elements PGW and SGW, the PGW is closer to the client server, so in S242 the test command to the client server is initiated by the PGW. From the PGW, a test command may be initiated to the client server system using Ping and Traceroute, as shown in Table 6.

TABLE 6

In addition, Ping and Traceroute of table 6 are common test tools in the art, and therefore, the parameters in their test commands are not explained in detail.
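Choosing the initiating network element in S242 can be sketched as picking the element with the smallest distance to the client server. The hop counts below are invented for illustration; the disclosure only states that the PGW is closer than the SGW.

```python
from typing import Dict


def closest_element(hops_to_server: Dict[str, int]) -> str:
    """Return the core network element with the fewest hops to the client server."""
    if not hops_to_server:
        raise ValueError("no core network elements given")
    return min(hops_to_server, key=hops_to_server.get)


# Hypothetical hop counts, consistent with the PGW being closer than the SGW
initiator = closest_element({"SGW": 2, "PGW": 1})
```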

In addition, referring to FIG. 3, the simulation terminal includes a simulation client and a dial testing platform. The simulation client may be a mobile terminal such as a mobile phone. The dial testing platform is connected to the simulation client and can be used to download and install tools such as Tcping and Psping, and thereby to issue test commands to the simulation client.

Example four

The embodiment of the invention provides a fault detection system, a device and a computer readable storage medium.

An embodiment of the invention provides a fault detection system comprising a memory and a processor. The memory stores index data; the processor is configured to analyze the index data to determine whether the client server has failed.

An embodiment of the invention provides a fault detection system comprising a memory and a processor. The memory stores a test command; the processor is configured to initiate a test on the client server to determine whether the client server has failed.

The fault detection system provided in the embodiment of the present invention may further execute the method shown in FIG. 1 or FIG. 2 and implement the functions of the fault detection system in the embodiment shown in FIG. 1 or FIG. 2, which are not described herein again.

An apparatus provided in an embodiment of the present invention includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the above fault detection method when executed by the processor.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned fault detection method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of fault detection, comprising:

analyzing the index data to determine whether the client server fails;

and/or

and determining whether the client server fails according to the test result.

2. The method according to claim 1, wherein the analyzing the index data to determine whether the client server fails comprises:

and/or

Comparing the value of the index data with a preset limit range;

3. The method according to claim 2, wherein the two predetermined periods of time include a first predetermined period of time and a second predetermined period of time, and the first predetermined period of time and the second predetermined period of time include the same number of time cycles.

4. The fault detection method according to claim 3, wherein the first preset time period is advanced by at least one time period from the second preset time period.

5. The method for detecting the fault according to claim 1, wherein in constructing a test system for a client server, the test system comprising a simulation terminal and middleware, and the middleware for establishing a communication connection between the simulation terminal and the client server, the method comprises:

6. The fault detection method according to claim 5, wherein the initiating, by the simulation terminal, the test command to the client server includes:

7. The method of claim 5, wherein the initiating, by the middleware, a test command to the client server comprises:

8. A fault detection system comprising a memory and a processor;

the memory stores metric data;

The memory stores a test command;

9. An apparatus, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.