CN116366424A

CN116366424A - Intelligent network fault detection system and method based on big data

Info

Publication number: CN116366424A
Application number: CN202310320235.8A
Authority: CN
Inventors: 方彬; 韦武文
Original assignee: Shenzhen Credible Cloud Technology Co ltd
Current assignee: Shenzhen Credible Cloud Technology Co ltd
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-06-30

Abstract

The invention relates to the technical field of network fault detection, and in particular to an intelligent network fault detection system and method based on big data. The invention analyzes recent usage data in the cloud server's historical data from two perspectives, the cloud server as a whole and the individual tenant. It accurately predicts, for each time point within a period following the current time, the network state interference between a given tenant and the remaining tenants of the server, judges whether a network fault exists for a tenant in the cloud server and manages the tenant accordingly, thereby effectively reducing the interference of the "noisy neighbor" effect on L3 cache time delay.

Description

Intelligent network fault detection system and method based on big data

Technical Field

The invention relates to the technical field of network fault detection, in particular to a network fault intelligent detection system and method based on big data.

Background

The cloud server is a simple and efficient computing service with elastically scalable processing capacity. It is simpler and more efficient to manage than a physical server, and a user can quickly create or release any number of cloud servers without purchasing hardware in advance. Cloud servers also reduce the difficulty of development, operation and maintenance as well as the overall IT cost, and are therefore favored by most small and medium-sized enterprises.

However, when the same cloud server hardware serves multiple tenants, a "noisy neighbor" effect arises among the tenants: a VM may be adversely affected by the excessive storage or network demands of other cloud servers on the same hardware, so that a user's network is affected by the remaining tenants (modern multi-core processors prefer to allocate a single virtual machine to an L3 cache to speed up data exchange among cores, so the operations of other machines accessing the same processor take considerably longer to complete).

Current intelligent network fault detection systems based on big data detect a fault only from the behavior of the network once the fault has occurred, and then screen for the specific fault step by step. This approach is not suitable for a cloud server subject to the "noisy neighbor" effect, because it cannot accurately identify network delay faults caused by that effect.

Disclosure of Invention

The invention aims to provide a network fault intelligent detection system and method based on big data, which are used for solving the problems in the background technology.

In order to solve the technical problems, the invention provides the following technical scheme: a network fault intelligent detection method based on big data, the method comprising the steps of:

S1, acquiring cloud service information corresponding to the multiple tenants in the same cloud server hardware, and recording the cloud service information corresponding to the jth tenant in the ith cloud server hardware as a set Aij, wherein the cloud service information comprises the rented CPU size and the hardware operating temperature, and the hardware operating temperature represents the operating temperature of the cloud server;

S2, acquiring the relation between different hardware operating temperatures and cloud server performance in the historical data, wherein the cloud server performance represents the average time delay of each tenant on the corresponding cloud server hardware in the L3 cache information;

S3, acquiring the relation between the hardware operating temperature of the same cloud server in the historical data and the CPU utilization rate corresponding to each tenant, and acquiring the time-varying relation of the hardware operating temperature of the ith cloud server according to the historical data;

S4, acquiring the hardware operating temperature Ti, the CPU utilization rate and the L3 cache time delay SYi corresponding to the jth tenant in the ith cloud server hardware at the current time, and acquiring the L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware;

S5, predicting the network state corresponding to the jth tenant in the ith cloud server hardware within a subsequent first unit time based on the current time, and managing the jth tenant in the ith cloud server hardware according to the prediction result.

Further, the method in S2 for obtaining the relation between different hardware operating temperatures and cloud server performance in the historical data comprises the following steps:

S21, acquiring from the historical data the average time delay of each tenant of the ith cloud server in the L3 cache information at different hardware operating temperatures, and recording the average time delay of each tenant of the ith cloud server in the L3 cache information when the hardware operating temperature is T as SYiT, so as to obtain a hardware performance relation data pair (T, SYiT);

S22, fitting the hardware performance relation data pairs (T, SYiT) obtained for the different values of T using the first reference models in a database, calculating for each fitting result the sum of the distances between the corresponding function and each hardware performance relation data pair, selecting the fitting result with the smallest sum of distances as the relation between different hardware operating temperatures of the ith cloud server and cloud server performance in the historical data, and recording the function corresponding to this relation as G (T).

The invention analyzes the relation between different hardware operating temperatures and cloud server performance because the hardware operating temperature reflects the overall usage of the cloud server. When the environment of the cloud server remains unchanged (cloud servers are generally managed centrally, so their environment stays unchanged), the degree of use of the cloud server is directly proportional to the hardware operating temperature, and a higher degree of use indicates that the cloud server processes more data per unit time. Because a multi-core processor prefers to allocate a single virtual machine to an L3 cache to speed up data exchange among cores, all operations of the other machines accessing the same processor then take considerably longer to complete. This analysis also provides reference data for obtaining, in the subsequent steps, the L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware.
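The model selection in S22 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the "first reference models" are assumed to be low-degree polynomials, the "distance" is assumed to be the vertical absolute residual, and the function name `fit_temperature_delay` and all sample numbers are hypothetical.

```python
import numpy as np


def fit_temperature_delay(temps, delays, degrees=(1, 2, 3)):
    """Fit each candidate model to the (T, SYiT) pairs and keep the best one.

    ASSUMPTION: candidate models are polynomials of the given degrees and
    "distance" is the absolute vertical residual between model and data.
    Returns (G, degree, distance_sum), where G is callable as G(T).
    """
    temps = np.asarray(temps, dtype=float)
    delays = np.asarray(delays, dtype=float)
    best = None
    for deg in degrees:
        model = np.poly1d(np.polyfit(temps, delays, deg))
        distance_sum = float(np.sum(np.abs(model(temps) - delays)))
        if best is None or distance_sum < best[2]:
            best = (model, deg, distance_sum)
    return best


# Hypothetical historical samples: temperature in degrees Celsius,
# average L3 cache time delay in arbitrary units.
sample_temps = [40, 50, 60, 70, 80]
sample_delays = [0.01 * t * t for t in sample_temps]
G, chosen_degree, chosen_distance = fit_temperature_delay(
    sample_temps, sample_delays, degrees=(1, 2)
)
```

The returned `G` plays the role of G (T) in the later steps; other model families or distance measures could be substituted without changing the selection logic.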

Further, the method for obtaining the relationship between the hardware operating temperature of the same cloud server and the CPU utilization rate corresponding to each tenant in the historical data in S3 includes the following steps:

S31, obtaining the CPU utilization rates corresponding to all tenants of the same cloud server at the same time point, and recording the CPU utilization rate corresponding to the jth tenant of the ith cloud server at time t as Lijt;

S32, obtaining the comprehensive utilization rate LZit of the CPU in the ith cloud server at time t,

$$LZ_{it} = \frac{\sum_{j=1}^{j1} Z_{ij} \cdot L_{ijt}}{\sum_{j=1}^{j1} Z_{ij}}$$

wherein Zij represents the CPU size rented by the jth tenant in the ith cloud server, and j1 represents the number of tenants in the ith cloud server at time t;

S33, constructing a first characteristic data pair (LZit, Tt) corresponding to the ith cloud server at time t, wherein Tt represents the hardware operating temperature of the ith cloud server at time t; taking the comprehensive utilization rate of the CPU in the cloud server as the independent variable and the hardware operating temperature as the dependent variable, fitting the first characteristic data pairs corresponding to the ith cloud server with a linear regression equation, and recording the function corresponding to the fitting result, which is the relation function between the hardware operating temperature of the ith cloud server and the CPU utilization rate corresponding to each tenant, as G1 (L);

the method for acquiring the time-varying relation of the hardware operating temperature of the ith cloud server according to the historical data in the S3 comprises the following steps:

S301, taking one day as a time period, acquiring from the historical data the hardware operating temperatures of the ith cloud server at different time points in each of the previous n time periods based on the current time, obtaining for each of those time periods a line graph of the temperature of the ith cloud server over time, and recording the time interval at which the hardware operating temperature of the ith cloud server is acquired as t1, wherein the duration of each time period is an integer multiple of t1;

S302, acquiring the function corresponding to the temperature-versus-time line graph of the ith cloud server in each of the previous n time periods based on the current time, wherein the function corresponding to each line graph is a piecewise function, and recording the function corresponding to the line graph of the n1-th previous time period based on the current time as Fn1 (tx), wherein tx represents the time point at a duration tx from the starting point of the corresponding time period;

s303, obtaining a time-varying relation function F (tx) of the hardware operation temperature of the ith cloud server,

The invention acquires the relation between the hardware operating temperature of the same cloud server and the CPU utilization rate corresponding to each tenant in the historical data because the hardware operating temperature of a cloud server is influenced by the comprehensive utilization rate of its CPU, which in turn is influenced by the CPU utilization rate corresponding to each tenant.
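The fitting in S33 and the line-graph functions in S301 and S302 can be sketched as follows. This is a hedged illustration: the function names are hypothetical, each line graph is assumed to be piecewise linear between samples, and, because the formula for F (tx) in S303 is not reproduced in this text, combining the n per-period functions by their arithmetic mean is purely an assumption.

```python
import numpy as np


def fit_g1(comprehensive_utilization, temperatures):
    """Linear regression of temperature (dependent) on utilization (independent)."""
    slope, intercept = np.polyfit(comprehensive_utilization, temperatures, 1)
    return lambda l: slope * l + intercept


def line_graph_function(sample_times, samples):
    """Piecewise-linear function through the samples of one time period.

    sample_times must be increasing offsets from the start of the period.
    """
    times = np.asarray(sample_times, dtype=float)
    values = np.asarray(samples, dtype=float)
    return lambda tx: float(np.interp(tx, times, values))


def combine_periods(period_functions):
    """ASSUMPTION: F(tx) is taken as the mean of the per-period functions."""
    return lambda tx: sum(f(tx) for f in period_functions) / len(period_functions)


# Hypothetical data: utilization as a fraction, temperature in degrees Celsius,
# sample times in hours from the start of each day.
G1 = fit_g1([0.2, 0.4, 0.6], [50.0, 60.0, 70.0])
F = combine_periods([
    line_graph_function([0, 6, 12], [40.0, 50.0, 60.0]),
    line_graph_function([0, 6, 12], [50.0, 60.0, 70.0]),
])
```

The same `line_graph_function` and `combine_periods` helpers could be reused for per-tenant CPU utilization samples, since the text builds that relation by the same procedure.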

Further, the method for obtaining the L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware in S4 includes the following steps:

S41, acquiring the hardware operating temperature Ti, the CPU utilization rate LYij and the L3 cache time delay SYi corresponding to the jth tenant in the ith cloud server hardware at the current time;

S42, obtaining G (T) and G1 (L);

S43, obtaining a first cache time delay calibration value P1,

P1 = G(Ti) - SYi

wherein G (Ti) represents the value of G (T) when Ti is substituted for T;

S44, obtaining a second cache time delay calibration value P2,

P2 = G(G1(LYij)) - SYi

wherein G1 (LYij) represents the value of G1 (L) when LYij is substituted for L, and G (G1 (LYij)) represents the value of G (T) when G1 (LYij) is substituted for T.

The first and second cache time delay calibration values are obtained so that, in the subsequent steps, the predicted cache time delay of the jth tenant in the ith cloud server hardware within the subsequent first unit time based on the current time can be calibrated, which allows the network state of that tenant within the subsequent first unit time based on the current time to be predicted accurately.
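The two calibration values can be computed directly from the formulas in S43 and S44. The sketch below is illustrative only: `calibration_values` is a hypothetical name, and the stand-in functions used for G and G1 as well as the numbers are made up for the example.

```python
def calibration_values(G, G1, Ti, LYij, SYi):
    """Return (P1, P2) as defined in S43 and S44.

    P1 = G(Ti) - SYi        (temperature-based estimate minus measured delay)
    P2 = G(G1(LYij)) - SYi  (utilization-based estimate minus measured delay)
    """
    P1 = G(Ti) - SYi
    P2 = G(G1(LYij)) - SYi
    return P1, P2


# Hypothetical stand-ins for the fitted functions G(T) and G1(L).
def example_G(T):
    return 2.0 * T


def example_G1(L):
    return 40.0 + 50.0 * L


P1, P2 = calibration_values(example_G, example_G1, Ti=60.0, LYij=0.5, SYi=100.0)
```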

Further, the method for predicting the network state corresponding to the jth tenant in the ith cloud server hardware in the subsequent first unit time based on the current time in S5 includes the following steps:

s51, combining a method for acquiring the time-varying relation of the operating temperature of the ith cloud server hardware, acquiring the time-varying relation of the CPU utilization rate of the jth tenant in the ith cloud server hardware, and recording the time-varying relation function of the CPU utilization rate of the jth tenant in the ith cloud server hardware as FL (tx);

the method for acquiring the relationship of the CPU utilization rate of the jth tenant in the ith cloud server hardware over time comprises the following steps:

S511, taking one day as a time period, acquiring from the historical data the CPU utilization rates of the jth tenant in the ith cloud server at different time points in each of the previous n time periods based on the current time, obtaining for each of those time periods a line graph of the CPU utilization rate of the jth tenant in the ith cloud server over time, and recording the time interval at which the CPU utilization rate of the jth tenant in the ith cloud server is acquired as t2, wherein the duration of each time period is an integer multiple of t2;

S512, acquiring the function corresponding to the line graph of the CPU utilization rate of the jth tenant in the ith cloud server in each of the previous n time periods based on the current time, wherein the function corresponding to each line graph is a piecewise function, and recording the function corresponding to the line graph of the n2-th previous time period based on the current time as FLn2 (tx), wherein tx represents the time point at a duration tx from the starting point of the corresponding time period;

s513, obtaining a relation function FL (tx) of the CPU utilization rate of the jth tenant in the ith cloud server hardware along with the time change,

s52, G (T), G1 (L) and F (tx) are obtained;

S53, predicting a first predicted value $Q1_{tx1}$ of the L3 cache time delay of the jth tenant in the ith cloud server hardware at the subsequent time tx1 based on the current time,

$Q1_{tx1} = G(F(tx1))$;

S54, predicting a second predicted value $Q2_{tx1}$ of the L3 cache time delay of the jth tenant in the ith cloud server hardware at the subsequent time tx1 based on the current time,

$Q2_{tx1} = G(G1(FL(tx1)))$;

S55, comparing $Q1_{tx1} - P1$ with $Q2_{tx1} - P2$ to obtain the calibrated L3 cache time delay deviation of the jth tenant in the ith cloud server hardware at the subsequent time tx1 based on the current time, recorded as $Rij_{tx1}$,

when $Q1_{tx1} - P1 < Q2_{tx1} - P2$, determining that the jth tenant in the ith cloud server hardware is an L3 cache time delay interference source at the subsequent time tx1 based on the current time,

when $Q1_{tx1} - P1 > Q2_{tx1} - P2$, determining that the jth tenant in the ith cloud server hardware is an interfered end of the L3 cache time delay at the subsequent time tx1 based on the current time,

when $Q1_{tx1} - P1 = Q2_{tx1} - P2$, determining that the L3 cache time delay of the jth tenant in the ith cloud server hardware is normal at the subsequent time tx1 based on the current time,

obtaining $Rij_{tx1} = H(Q1_{tx1} - P1,\ Q2_{tx1} - P2,\ \beta)$, where $\beta$ represents a preset delay threshold in the database,

if $Q1_{tx1} - P1 \le \beta$ and $Q2_{tx1} - P2 \le \beta$, determining that the L3 cache time delay is within the normal fluctuation range of the cloud server cache time delay, and $H(Q1_{tx1} - P1,\ Q2_{tx1} - P2,\ \beta) = 0$,

if $Q1_{tx1} - P1 \le \beta \le Q2_{tx1} - P2$, then $H(Q1_{tx1} - P1,\ Q2_{tx1} - P2,\ \beta) = \beta - (Q2_{tx1} - P2)$,

if $Q2_{tx1} - P2 \le \beta \le Q1_{tx1} - P1$, then $H(Q1_{tx1} - P1,\ Q2_{tx1} - P2,\ \beta) = (Q1_{tx1} - P1) - \beta$,

if $Q1_{tx1} - P1 \ge \beta$ and $Q2_{tx1} - P2 \ge \beta$, then $H(Q1_{tx1} - P1,\ Q2_{tx1} - P2,\ \beta) = (Q1_{tx1} - P1) - (Q2_{tx1} - P2)$,

The duration corresponding to the first unit time is one day.
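The piecewise definition of H and the comparison in S55 can be written compactly. The sketch below is illustrative; `delay_deviation` and `classify_tenant` are hypothetical names, `d1` stands for the calibrated first prediction (Q1 at tx1 minus P1) and `d2` for the calibrated second prediction (Q2 at tx1 minus P2).

```python
def delay_deviation(d1, d2, beta):
    """H(d1, d2, beta) following the four cases stated in S55."""
    if d1 <= beta and d2 <= beta:
        return 0.0            # both within the normal fluctuation range
    if d1 <= beta <= d2:
        return beta - d2      # only the second value exceeds the threshold
    if d2 <= beta <= d1:
        return d1 - beta      # only the first value exceeds the threshold
    return d1 - d2            # both values are at or above the threshold


def classify_tenant(d1, d2):
    """Role of the tenant at time tx1 according to the comparison in S55."""
    if d1 < d2:
        return "interference source"
    if d1 > d2:
        return "interfered end"
    return "normal"
```

Where the case conditions overlap at equality with the threshold, the branches give the same value, so the order in which they are checked does not change the result.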

Further, the method for managing the jth tenant in the ith cloud server hardware according to the prediction result in S5 comprises the following steps:

S501, acquiring the calibrated L3 cache time delay deviation $Rij_{tx1}$ of the jth tenant in the ith cloud server hardware at each subsequent time tx1 based on the current time, where 0 ≤ tx1 ≤ one day;

S502, predicting a comprehensive deviation value Kij of the calibrated L3 cache time delay of the jth tenant in the ith cloud server hardware over the day following the current time,

S503, managing the jth tenant in the ith cloud server hardware according to Kij,

when Kij > 0, determining that the network of the jth tenant in the ith cloud server hardware is abnormal over the time period and that the L3 cache time delay of the tenant is interfered with by other tenants, and suggesting that the tenant adjust the cloud server leasing environment,

when Kij < 0, determining that the network of the jth tenant in the ith cloud server hardware is abnormal over the time period and that the tenant, as an interference source, interferes with the L3 cache time delay of the other tenants, and reminding the jth tenant in the ith cloud server hardware to expand capacity,

when Kij = 0, determining that the network state of the jth tenant in the ith cloud server hardware is normal over the time period, and the cloud service state of the jth tenant in the ith cloud server hardware does not need to be changed.

The invention takes into account that a tenant uses a rented cloud server over time intervals, so the network state of the jth tenant in the ith cloud server hardware at a single time point within the subsequent first unit time based on the current time cannot accurately reflect how the tenant uses the server. By considering a time interval instead, and combining the time-varying relation of the cloud server hardware operating temperature with the time-varying relation of the CPU utilization rate of each tenant, the overall network state of the jth tenant in the ith cloud server hardware over the interval is analyzed, which accurately reflects the tenant's use of the cloud server. From this it is determined whether the network of the jth tenant in the ith cloud server hardware has a fault, and the tenant is managed accordingly.
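The management decision in S503 can be sketched as follows. The formula for Kij in S502 is not reproduced in this text, so the aggregation below (a trapezoidal integral of the per-time-point deviations over the day) is an assumption made only for illustration; the function names, the tolerance `tol` and the returned strings are likewise hypothetical.

```python
import numpy as np


def comprehensive_deviation(tx_points, r_values):
    """ASSUMPTION: Kij is approximated by integrating Rij over tx1 in [0, one day].

    tx_points: increasing time offsets from the current time (e.g. hours).
    r_values:  calibrated deviation Rij at each of those offsets.
    """
    tx = np.asarray(tx_points, dtype=float)
    r = np.asarray(r_values, dtype=float)
    # Trapezoidal rule written out explicitly.
    return float(np.sum((r[1:] + r[:-1]) * np.diff(tx)) / 2.0)


def manage_tenant(kij, tol=1e-9):
    """Map Kij to the action described in S503."""
    if kij > tol:
        return "interfered by other tenants: suggest adjusting the leasing environment"
    if kij < -tol:
        return "interference source: remind the tenant to expand capacity"
    return "normal: no change needed"


# Hypothetical deviations sampled at 0 h, 12 h and 24 h.
Kij = comprehensive_deviation([0, 12, 24], [0.0, 2.0, 0.0])
action = manage_tenant(Kij)
```

A small tolerance is used for the zero case because a floating-point aggregate is rarely exactly zero; the text itself states the condition as Kij = 0.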

A big data based network fault intelligent detection system, the system comprising the following modules:

the cloud service tenant information acquisition module acquires cloud service information corresponding to multiple tenants in the same cloud server hardware, and marks cloud service information corresponding to a j-th tenant in the i-th cloud server hardware as a set Aij, wherein the cloud service information comprises the rented CPU size and hardware operation temperature, and the hardware operation temperature represents the operation temperature of the cloud server;

the cache time delay analysis module acquires the relation between different hardware running temperatures in the historical data and cloud server performances, wherein the cloud server performances represent the average time delay of each tenant on corresponding cloud server hardware in L3 cache information;

the hardware operation temperature analysis module acquires the relation between the hardware operation temperature of the same cloud server in the historical data and the CPU utilization rate corresponding to each tenant, and acquires the time-varying relation of the hardware operation temperature of the ith cloud server according to the historical data;

the cache time delay calibration module acquires a hardware running temperature Ti, a CPU utilization rate and an L3 cache time delay SYi corresponding to a jth tenant in the ith cloud server hardware at the current time, and acquires an L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware;

and the network state judging and managing module predicts the network state corresponding to the jth tenant in the ith cloud server hardware in the first unit time based on the current time and manages the jth tenant in the ith cloud server hardware according to the prediction result.

Further, in the network status determination and management module,

when the network is abnormal in the jth tenant period in the ith cloud server hardware, determining that the L3 cache time delay of the tenant is interfered by other tenants, suggesting the tenant to adjust the cloud server leasing environment,

when the network is abnormal in the jth tenant period in the ith cloud server hardware, judging that the tenant is used as an interference source to interfere the L3 cache time delay of the rest tenants, reminding the jth tenant in the ith cloud server hardware to expand the capacity,

when the network state in the jth tenant period in the ith cloud server hardware is normal, the cloud service state of the jth tenant in the ith cloud server hardware does not need to be changed.

Compared with the prior art, the invention has the following beneficial effects: the invention analyzes recent usage data in the cloud server's historical data from two perspectives, the cloud server as a whole and the individual tenant. Assuming by default that tenants use the cloud server periodically, it takes the actual data of the cloud server at the current time as the calibration reference, accurately predicts, for each time point within a period following the current time, the network state interference between a given tenant and the remaining tenants of the server, comprehensively evaluates the network condition of the cloud server over that period, judges whether a network fault exists for a tenant in the cloud server and manages the tenant accordingly, and effectively reduces the interference of the "noisy neighbor" effect on L3 cache time delay.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of a network fault intelligent detection system based on big data;

FIG. 2 is a flow chart of a network fault intelligent detection method based on big data.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-2, the present invention provides the following technical solutions: a network fault intelligent detection method based on big data, the method comprising the steps of:

the S2 method for acquiring the relation between different hardware operation temperatures and cloud server performances in the historical data comprises the following steps:

the method for acquiring the relation between the hardware running temperature of the same cloud server and the CPU utilization rate corresponding to each tenant in the historical data in the S3 comprises the following steps:

in this embodiment, suppose three tenants U1, U2 and U3 exist in cloud server A,

the CPU size rented by tenant U1 is U1V, and its CPU utilization rate at time point t is U1L;

the CPU size rented by tenant U2 is U2V, and its CPU utilization rate at time point t is U2L;

the CPU size rented by tenant U3 is U3V, and its CPU utilization rate at time point t is U3L;

then the comprehensive utilization rate of the CPU in cloud server A at time t is equal to

(U1V*U1L+U2V*U2L+U3V*U3L)/(U1V+U2V+U3V);
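The weighted average in this embodiment can be checked with a few lines of code. The sketch below is illustrative: `comprehensive_utilization` is a hypothetical name, and the CPU sizes (in cores) and utilization rates (as fractions) are made-up numbers.

```python
def comprehensive_utilization(cpu_sizes, utilizations):
    """Utilization of each tenant weighted by the CPU size that tenant rents."""
    if len(cpu_sizes) != len(utilizations):
        raise ValueError("one utilization value is needed per tenant")
    total_size = sum(cpu_sizes)
    weighted = sum(size * rate for size, rate in zip(cpu_sizes, utilizations))
    return weighted / total_size


# Hypothetical tenants U1, U2, U3: rented sizes and utilization at time t.
u_sizes = [4, 8, 4]
u_rates = [0.5, 0.25, 1.0]
lz = comprehensive_utilization(u_sizes, u_rates)  # (2 + 2 + 4) / 16
```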

the method for obtaining the L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware in S4 comprises the following steps:

S42, obtaining G (T) and G1 (L);

S43, obtaining a first cache time delay calibration value P1,

P1 = G(Ti) - SYi

S44, obtaining a second cache time delay calibration value P2,

P2 = G(G1(LYij)) - SYi

The method for predicting the network state corresponding to the jth tenant in the ith cloud server hardware in the subsequent first unit time based on the current time in the S5 includes the following steps:

s52, G (T), G1 (L) and F (tx) are obtained;

$Q1_{tx1} = G(F(tx1))$;

S54, predicting a second predicted value $Q2_{tx1}$ of the L3 cache time delay of the jth tenant in the ith cloud server hardware at the subsequent time tx1 based on the current time,

$Q2_{tx1} = G(G1(FL(tx1)))$;

when $Q1_{tx1} - P1 < Q2_{tx1} - P2$, determining that the jth tenant in the ith cloud server hardware is an L3 cache time delay interference source at the subsequent time tx1 based on the current time,

The duration corresponding to the first unit time is one day.

In the network state determination and management module,

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited thereto, but it is to be understood that modifications and equivalents of some of the technical features described in the foregoing embodiments may be made by those skilled in the art, although the present invention has been described in detail with reference to the foregoing embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The intelligent network fault detection method based on big data is characterized by comprising the following steps:

2. The intelligent network fault detection method based on big data according to claim 1, characterized in that the method in S2 for obtaining the relation between different hardware operating temperatures and cloud server performance in the historical data comprises the following steps:

3. The intelligent network fault detection method based on big data according to claim 2, characterized in that the method in S3 for obtaining the relation between the hardware operating temperature of the same cloud server and the CPU utilization rate corresponding to each tenant in the historical data comprises the following steps:

4. The intelligent network fault detection method based on big data according to claim 3, characterized in that the method in S4 for obtaining the L3 cache time delay calibration value corresponding to the jth tenant in the ith cloud server hardware comprises the following steps:

S42, obtaining G (T) and G1 (L);

S43, obtaining a first cache time delay calibration value P1,

P1＝G(Ti)-SYi

s44, obtaining a second cache time delay calibration value P2,

P2＝G(G1(LYij))-SYi

5. The intelligent network fault detection method based on big data according to claim 4, characterized in that the method in S5 for predicting the network state corresponding to the jth tenant in the ith cloud server hardware within the subsequent first unit time based on the current time comprises the following steps:

S52, obtaining G (T), G1 (L) and F (tx);

$Q1_{tx1} = G(F(tx1))$;

$Q2_{tx1} = G(G1(FL(tx1)))$;

The duration corresponding to the first unit time is one day.

6. The intelligent network fault detection method based on big data according to claim 5, characterized in that the method in S5 for managing the jth tenant in the ith cloud server hardware according to the prediction result comprises the following steps:

7. A big data based network fault intelligent detection system, the system comprising the following modules:

8. The intelligent network fault detection system based on big data as claimed in claim 7, wherein: in the network state determination and management module,