CN117093461A - Method, system, equipment and storage medium for time delay detection and analysis - Google Patents

Method, system, equipment and storage medium for time delay detection and analysis

Info

Publication number
CN117093461A
Authority
CN
China
Prior art keywords
data
time delay
throughput
delay
latency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311117356.9A
Other languages
Chinese (zh)
Inventor
贾上坤
郭坤
张小康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Data Technology Co Ltd filed Critical Jinan Inspur Data Technology Co Ltd
Priority to CN202311117356.9A priority Critical patent/CN117093461A/en
Publication of CN117093461A publication Critical patent/CN117093461A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Abstract

The application provides a method, a system, equipment and a storage medium for detecting and analyzing time delay, wherein the method comprises the following steps: preprocessing the acquired delay data and throughput data; clustering the preprocessed time delay data and throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result; predicting new time delay data through the relation model, and determining whether an alarm is generated according to the relation between the new time delay data and a preset range; and in response to generating the alarm, performing association analysis according to the topological relation to determine the component most relevant to the time delay. The application can judge, at the level of the whole system, whether a high-latency event has occurred in the distributed storage system, which saves system resources and is more efficient.

Description

Method, system, equipment and storage medium for time delay detection and analysis
Technical Field
The present application relates to the field of distributed storage systems, and in particular, to a method, system, device, and storage medium for latency detection analysis.
Background
When a distributed storage system is in a high-latency state, users wait longer to access and retrieve data, which degrades user experience and satisfaction. Detecting and handling high-latency states intelligently, efficiently and promptly is therefore important for a distributed storage system. High-latency detection is nevertheless a challenging task, for three reasons. First, distributed storage systems typically handle large amounts of data and a large number of read and write operations; modeling every component in the system and checking each one for a high-latency state would consume substantial system resources and could itself increase the latency of the system. Second, it is difficult to decide whether the system is actually in a high-latency state: an increase in workload pressure raises the latency of the system, and a latency increase of this kind is normal. Detecting it as an abnormal state and handling it wastes operation and maintenance resources, so detection based on a model of latency alone is unreliable. Third, it is difficult to select the latency threshold, that is, to decide what latency counts as high. Because distributed storage systems run in different working environments, no single fixed range can serve as the threshold for all of them; the threshold should instead be selected according to the running state of the system, so that the method is as general as possible.
Disclosure of Invention
In view of the above, an object of the embodiments of the present application is to provide a method, a system, an electronic device and a computer-readable storage medium for time delay detection and analysis. High-latency detection is performed at the system level and therefore occupies only limited system resources; modeling with both latency and throughput helps filter out latency increases caused by workload increases and reduces the high-latency false alarm rate; the detection threshold is determined automatically from the confidence interval of the polynomial regression prediction, which gives the method good generality; and the degree of abnormality is graded so that different alarm strategies can be adopted. The application can detect and handle high latency in time, which helps improve the stability, performance and user experience of the system and keeps the system running efficiently and reliably.
Based on the above objects, an aspect of the embodiments of the present application provides a method for detecting and analyzing a time delay, including the following steps: preprocessing the acquired delay data and throughput data; clustering the preprocessed time delay data and throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result; predicting new time delay data through the relation model, and determining whether an alarm is generated or not according to the relation between the new time delay data and a preset range; and in response to generating the alert, performing association analysis according to the topological relation to determine the component most relevant to the time delay.
In some embodiments, the step of preprocessing the acquired latency data and throughput data includes: acquiring write delay and throughput data of each node, filling missing data with the data at the previous moment, and calculating the average value of the filled data.
In some embodiments, the step of clustering the preprocessed latency data and throughput data includes: defining a neighborhood according to density, dividing the data points into core points, boundary points and noise points, and removing the noise points.
In some embodiments, the step of establishing a relation model between latency and throughput according to the clustering result includes: taking throughput as the independent variable and write time delay as the dependent variable, and splitting a data set into a training set and a test set; training a polynomial regression model by using the training set, and learning the relation between throughput and write delay by fitting a polynomial function; calculating an error index between the predicted result and the actual result by using the test set so as to evaluate the trained polynomial regression model; and dynamically adjusting parameters of the polynomial regression model according to the evaluation result.
In some embodiments, the step of establishing a relation model between latency and throughput according to the clustering result further includes: calculating a predicted value of the write time delay through the polynomial regression model, and calculating a standard error of the predicted value; and calculating a confidence interval according to the standard error, the write delay and the predicted value.
In some embodiments, the step of determining whether to generate the alarm according to the relationship between the new time delay data and the preset range includes: calculating the ratio of the new time delay data at each of a plurality of consecutive moments to the upper confidence limit, and generating an alarm in response to the ratio exceeding a threshold at more than half of the moments.
In some embodiments, the step of performing association analysis according to the topological relation to determine the component most relevant to the time delay includes: performing correlation analysis between the system time delay and the time delay of each storage node, storage medium, switch and router, and determining the component most correlated with the time delay by using the Pearson correlation coefficient method.
In another aspect of the embodiments of the present application, a system for delay detection analysis is provided, including: the processing module is used for preprocessing the acquired time delay data and throughput data; the clustering module is used for clustering the preprocessed time delay data and the preprocessed throughput data and establishing a relation model between the time delay and the throughput according to a clustering result; the alarm module is used for predicting new time delay data through the relation model and determining whether an alarm is generated or not according to the relation between the new time delay data and a preset range; and an analysis module for performing association analysis according to the topological relation to determine the component most relevant to the time delay in response to generating the alarm.
In still another aspect of the embodiment of the present application, there is also provided an electronic device, including: at least one processor; and a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method as above.
In yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.
The application has the following beneficial technical effects:
1. High-latency events in the distributed storage system are detected at the system level, as a whole, which saves system resources and is efficient;
2. Abnormal values are removed by the DBSCAN clustering algorithm, a prediction model of the normal index is established through polynomial regression, and the upper threshold is determined from a confidence interval, so the threshold adapts to the production environment of the distributed storage system and the method has good generality;
3. Write latency and throughput are both used as modeling inputs. Compared with modeling on latency data alone, this takes the workload of the distributed storage system into account, is more accurate, and avoids judging latency changes caused by normal workload increases as abnormal. In addition, the abnormality degree evaluation module measures high-latency risk according to duration and degree of abnormality, which filters out normal instantaneous latency increases and reduces the false alarm rate;
4. Different alarm notification strategies can be selected according to the degree of the latency anomaly, and correlation analysis based on the topological relation of the distributed storage system identifies the components most correlated with changes in system write latency, which facilitates follow-up investigation by operation and maintenance personnel and improves the stability of the distributed storage system.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of a method for delay detection analysis provided by the present application;
FIG. 2 is a flow chart of an embodiment of a method for delay detection analysis provided by the present application;
FIG. 3 is a schematic diagram of DBSCAN clustering provided by the application;
FIG. 4 is a schematic diagram of a polynomial regression scheme provided by the present application;
FIG. 5 is a schematic view of sliding window evaluation provided by the present application;
FIG. 6 is a schematic diagram of an embodiment of a system for delay detection analysis provided by the present application;
FIG. 7 is a schematic hardware structure diagram of an embodiment of an electronic device for delay detection and analysis according to the present application;
FIG. 8 is a schematic diagram of an embodiment of a computer storage medium for delay detection analysis according to the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the following embodiments of the present application will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present application, the expressions "first" and "second" are used only to distinguish two entities or parameters that share a name but are not the same. They are used for convenience of expression and should not be construed as limiting the embodiments of the present application; this is not repeated in the embodiments below.
In a first aspect of the embodiment of the present application, an embodiment of a method for latency detection analysis is provided. Fig. 1 is a schematic diagram of an embodiment of a method for delay detection analysis provided by the present application.
As shown in fig. 1, the embodiment of the present application includes the following steps:
S1, preprocessing the acquired time delay data and throughput data;
S2, clustering the preprocessed time delay data and throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result;
S3, predicting new time delay data through the relation model, and determining whether an alarm is generated according to the relation between the new time delay data and a preset range; and
S4, in response to generating the alarm, performing association analysis according to the topological relation to determine the component most relevant to the time delay.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that divides data points into clusters according to their density and handles noise and outliers in the data effectively. It does not require the number of clusters to be specified in advance, can automatically find clusters of arbitrary shape and size, is robust to noise and outliers, and depends on relatively few parameters. Polynomial regression is a regression method used to build a polynomial relationship model between independent and dependent variables; it is widely used in both academic and practical applications and is particularly useful for exploring nonlinear relationships in data. Its core idea is to fit nonlinear data by introducing higher-order terms: the raw data is transformed into a polynomial feature space and the regression coefficients are estimated using least squares or another regression method, which enables the model to adapt to more complex data patterns and provide more accurate predictions. Accordingly, the DBSCAN clustering algorithm is first used to remove abnormal points from the data; a polynomial regression model is then built on the remaining data to calculate the normal range of the index, and the upper limit of that range is used to monitor the running state of the system and detect whether a high-latency event occurs.
The embodiment of the application collects latency data and throughput data in the distributed storage system and preprocesses them, mainly by handling missing values, to ensure the accuracy and consistency of the data. The preprocessed data is clustered with the DBSCAN algorithm: data points with similar patterns are grouped into the same cluster, and data points that differ substantially from every cluster are treated as abnormal points, which are removed and replaced by forward filling. A polynomial regression model is fitted to the data with abnormal values removed, and a relation model between latency and throughput is established by selecting an appropriate polynomial order. The established model is used to calculate the normal range of the index and to predict new latency data. If a new latency value exceeds the normal range, the data point may be marked as a high-latency point. High-latency events are scored according to duration and severity; once the set threshold is exceeded, an alarm is generated and association analysis is carried out according to the topological relation to list the five components most relevant to the system latency curve.
Fig. 2 is a flowchart of an embodiment of a method for delay detection and analysis provided by the present application, and an embodiment of the present application is described with reference to fig. 2.
Preprocessing the acquired delay data and throughput data.
In some embodiments, the step of preprocessing the acquired latency data and throughput data includes: acquiring write delay and throughput data of each node, filling missing data with the data at the previous moment, and calculating the average value of the filled data.
This step obtains the write delay and throughput indexes of each node and calculates the average write delay and throughput of the system at each moment. Performance monitoring tools or agents are deployed to monitor the latency and throughput of the distributed storage system; such tools capture and record key performance index data, and common examples include Prometheus, Grafana and Nagios. The write delay and throughput indexes of each node at a given moment are acquired, and missing data is filled with the data from the previous moment. The filled data is then averaged to obtain the average write delay and throughput of the system at that moment.
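As a non-limiting illustration, the forward filling and averaging described above may be sketched as follows. The node names, the sample values and the use of None to mark a missing sample are assumptions made for this sketch only; they do not come from the application.

```python
from statistics import mean


def forward_fill(series):
    """Replace each None entry with the most recent earlier value."""
    filled, last = [], None
    for value in series:
        if value is None:
            value = last  # stays None when no earlier sample exists
        filled.append(value)
        last = value
    return filled


def system_average(per_node):
    """Average the forward-filled node series at each moment."""
    filled = [forward_fill(series) for series in per_node.values()]
    averages = []
    for samples in zip(*filled):
        present = [v for v in samples if v is not None]
        averages.append(mean(present) if present else None)
    return averages


# Hypothetical write-latency samples (ms) for three nodes; None marks a gap.
write_latency = {
    "node-a": [2.0, None, 4.0],
    "node-b": [4.0, 6.0, None],
    "node-c": [None, 3.0, 5.0],
}
avg_latency = system_average(write_latency)
```

The same two functions can be applied unchanged to the throughput series. A gap at the very first moment has no earlier value to copy, so this sketch simply leaves that node out of the average for that moment.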
Clustering the preprocessed time delay data and throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result.
In some embodiments, the step of clustering the preprocessed latency data and throughput data includes: defining a neighborhood according to density, dividing the data points into core points, boundary points and noise points, and removing the noise points. Before the polynomial regression model of the normal index is established, the influence of abnormal values is removed, mainly by using the DBSCAN clustering algorithm, whose density-based clustering can effectively identify abnormal values in the data. Outliers tend to be classified as noise points that do not belong to any cluster, so identifying the noise points determines the outliers in the data. The key step of the DBSCAN algorithm is to define a neighborhood according to density and divide the data points into core points, boundary points and noise points. A core point has a sufficient number of sample points in its neighborhood; a boundary point has too few sample points in its own neighborhood but lies within the neighborhood of a core point; and a noise point has too few sample points in its neighborhood and does not lie within the neighborhood of any core point. Fig. 3 is a schematic diagram of DBSCAN clustering provided by the present application; as shown in Fig. 3, the data in the two dotted boxes belong to two different clusters.
The acquired write delay and throughput indexes of the distributed storage system form the input data of the DBSCAN algorithm, with throughput on the horizontal axis and write delay on the vertical axis. The parameters of the DBSCAN algorithm are then determined, namely the neighborhood radius and the minimum number of neighborhood points. The neighborhood radius defines the neighborhood range of a sample, and the minimum number of neighborhood points is the number of samples a neighborhood must contain for its center to be a core point. The Euclidean distance is used to calculate the distance between samples in the data set. The data is clustered with the DBSCAN algorithm, and the data points are divided into core points, boundary points and noise points according to the neighborhood radius and the minimum number of neighborhood points. Noise points, that is, samples not assigned to any cluster during clustering, are identified as outliers; they are removed from the data and filled with the write delay and throughput from the previous moment.
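As a non-limiting illustration, the division into core points, boundary points and noise points may be sketched as follows. The sketch performs only the point classification of DBSCAN and does not assign cluster identifiers, since only the noise points are needed for outlier removal. The parameter values and sample points are assumptions for this sketch, and the raw values are used unscaled; in practice the two axes would probably need to be brought to comparable scales before a Euclidean distance is meaningful.

```python
from math import dist


def dbscan_labels(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' (Euclidean distance)."""
    # Neighborhood of a point: every point within eps, the point itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # too few neighbors, but near a core point
        else:
            labels.append("noise")
    return labels


def remove_noise(points, eps, min_pts):
    """Replace each noise point with the previous retained sample."""
    cleaned, last = [], None
    for point, label in zip(points, dbscan_labels(points, eps, min_pts)):
        if label == "noise":
            if last is None:
                continue  # nothing earlier to copy, so the point is dropped
            point = last
        cleaned.append(point)
        last = point
    return cleaned


# Hypothetical (throughput, write latency) samples in time order.
samples = [(100, 2.0), (101, 2.1), (102, 2.2), (103, 2.3), (101, 9.0), (104, 2.4)]
labels = dbscan_labels(samples, eps=1.5, min_pts=3)
cleaned = remove_noise(samples, eps=1.5, min_pts=3)
```

In this example the sample (101, 9.0) has no neighbor within the radius, is labeled noise, and is replaced by the preceding sample (103, 2.3).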
The data with abnormal values removed is modeled by polynomial regression to obtain a model of the system in its normal state, and the confidence interval of the predicted data is computed to obtain the threshold. Fig. 4 is a schematic diagram of polynomial regression provided by the present application; as shown in Fig. 4, the dotted line is the upper limit and the solid line is the fitted curve.
In some embodiments, the step of establishing a relation model between latency and throughput according to the clustering result includes: taking throughput as the independent variable and write time delay as the dependent variable, and splitting a data set into a training set and a test set; training a polynomial regression model by using the training set, and learning the relation between throughput and write delay by fitting a polynomial function; calculating an error index between the predicted result and the actual result by using the test set so as to evaluate the trained polynomial regression model; and dynamically adjusting parameters of the polynomial regression model according to the evaluation result.
The throughput of the system is taken as the independent variable and the write latency as the dependent variable, the data set is split into a training set and a test set by cross-validation, and polynomial regression is selected as the modeling algorithm. The polynomial regression model is trained on the training set, learning the relationship between throughput and write latency by fitting a polynomial function. The trained model is evaluated on the test set by calculating error indexes between the predicted and actual results, such as the mean square error (MSE) and the root mean square error (RMSE). According to the evaluation result, automatic parameter tuning tools or techniques such as grid search are used to adjust the hyperparameters of the model and optimize its performance.
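As a non-limiting illustration, the training and evaluation described above may be sketched as follows. The sketch fits the polynomial by ordinary least squares through the normal equations, uses a single hold-out split in place of cross-validation, and treats the polynomial degree as the only hyperparameter, searched over a small grid. All of these choices, together with the synthetic data, are assumptions for this sketch.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients [b0, b1, ..., b_degree]."""
    m = degree + 1
    # Normal equations (A^T A) b = A^T y for the Vandermonde matrix A.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - tail) / aug[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at throughput x."""
    return sum(b * x ** i for i, b in enumerate(coeffs))


def rmse(coeffs, xs, ys):
    """Root mean square error of the model on (xs, ys)."""
    mse = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse ** 0.5


# Synthetic data: throughput x (hypothetical units) and write latency y.
throughput = [0.1 * i for i in range(1, 21)]
latency = [1.0 + 0.5 * x + 0.2 * x * x for x in throughput]

# Hold-out split: even positions for training, odd positions for testing.
train_x, test_x = throughput[0::2], throughput[1::2]
train_y, test_y = latency[0::2], latency[1::2]

# Grid search over the polynomial degree, scored by test-set RMSE.
test_errors = {
    d: rmse(fit_polynomial(train_x, train_y, d), test_x, test_y)
    for d in (1, 2, 3)
}
coeffs = fit_polynomial(train_x, train_y, 2)
```

The normal equations become ill-conditioned for high degrees or large raw throughput values, so a production implementation would more likely rely on a numerically stable least-squares routine from a numerical library.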
In some embodiments, the step of establishing a relation model between latency and throughput according to the clustering result further includes: calculating a predicted value of the write time delay through the polynomial regression model, and calculating a standard error of the predicted value; and calculating a confidence interval according to the standard error, the write delay and the predicted value.
Assume that a quadratic polynomial regression model is obtained, in the form $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$, where $y$ is the response variable (write latency), $x$ is the independent variable (throughput), $\beta_0$, $\beta_1$ and $\beta_2$ are the regression coefficients, and $\varepsilon$ is the error term. The standard error of the predicted value is then calculated. The calculation can be made by the following formula (given here in the residual-standard-error form implied by the symbol definitions and the degrees of freedom stated below):
$$SE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}$$
where $SE$ is the standard error, $y_i$ is the observed response variable value, $\hat{y}_i$ is the predicted value calculated according to the regression model, $n$ is the number of samples and $k$ is the number of parameters in the regression model. The confidence level and the degrees of freedom are then determined. A confidence level of 95% or 99% is chosen, with $n - k - 1$ degrees of freedom. The critical value of the corresponding t distribution or standard normal distribution is looked up according to the selected confidence level and degrees of freedom. For the large sample case, the critical value of the standard normal distribution may be used; for the small sample case, the critical value $\gamma$ of the t distribution may be used. The confidence interval is calculated using the following formula (given here in the standard form implied by the quantities defined above):
$$\hat{y} \pm \gamma \cdot SE$$
After the confidence interval of the write latency prediction is obtained, the upper confidence limit may be used as a threshold to detect whether an observed write latency value is a high-latency point.
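As a non-limiting illustration, the upper confidence limit may be computed as in the following sketch. It assumes that the standard error is the residual standard error with n - k - 1 degrees of freedom, that the interval has the form predicted value ± critical value × standard error, and that k counts the non-intercept terms (k = 2 for a quadratic model). For brevity the critical value is taken from the standard normal distribution (the large sample case); a small sample would call for the t distribution instead. The sample values are hypothetical and far fewer than a real large sample.

```python
from statistics import NormalDist


def standard_error(actual, predicted, k):
    """Residual standard error with n - k - 1 degrees of freedom."""
    n = len(actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return (sse / (n - k - 1)) ** 0.5


def upper_limit(predicted_value, se, confidence=0.95):
    """Upper bound of the two-sided confidence interval (normal critical value)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return predicted_value + z * se


# Hypothetical observed and fitted write latencies (ms) from a quadratic model.
observed = [2.0, 2.5, 3.1, 3.4, 4.2, 4.6]
fitted = [2.1, 2.4, 3.0, 3.6, 4.0, 4.7]
se = standard_error(observed, fitted, k=2)

# Threshold for a new moment whose predicted write latency is 5.0 ms.
threshold = upper_limit(5.0, se)
is_high_latency = 5.6 > threshold  # compare a new observation with the limit
```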
Predicting new time delay data through the relation model, and determining whether an alarm is generated according to the relationship between the new time delay data and a preset range.
In some embodiments, the step of determining whether to generate the alarm according to the relationship between the new time delay data and the preset range includes: calculating the ratio of the new time delay data at each of a plurality of consecutive moments to the upper confidence limit, and generating an alarm in response to the ratio exceeding a threshold at more than half of the moments.
While the distributed storage system is running, an instantaneous increase in latency is likely to be caused by a change in workload pressure. Such normal latency changes are acceptable, and a high false alarm rate results if they are not distinguished from anomalies. This step therefore scores the degree of abnormality over a sliding window, comprehensively evaluating the abnormality of the latency by calculating the ratio of the observed latency to the upper limit obtained by modeling, referred to as the slow speed ratio (SR).
Assume that the time series of the acquired system latency is $Y = (y_1, y_2, \dots, y_t, \dots, y_{n-1}, y_n)$ and that the upper threshold obtained by the polynomial regression prediction module is $S = (s_1, s_2, \dots, s_t, \dots, s_{n-1}, s_n)$. Each window contains 5 data points. Suppose the latency data in the current sliding window is $Y_5 = (y_{t-2}, y_{t-1}, y_t, y_{t+1}, y_{t+2})$ and the corresponding upper threshold data is $S_5 = (s_{t-2}, s_{t-1}, s_t, s_{t+1}, s_{t+2})$; the slow speed ratio of the window is then $R = (r_{t-2}, r_{t-1}, r_t, r_{t+1}, r_{t+2})$,
where $r_t = y_t / s_t$. The threshold of $r_t$ is set to 1 and the proportion to 50%; that is, when more than 50% of the values in the window exceed 1, the window is considered to form a high-latency event.
Fig. 5 is a schematic diagram of sliding window evaluation provided by the present application. Fig. 5 shows a high-latency window: the solid line is the write latency, and the dotted line is the upper threshold obtained by modeling. A point that exceeds the upper limit is an abnormal point and is represented by a black square; otherwise the point is represented by a gray square. If two consecutive windows are both high-latency windows, they are regarded as a whole.
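As a non-limiting illustration, the sliding window evaluation may be sketched as follows. The window size of 5, the ratio threshold of 1 and the 50% proportion follow the description; the window step of one sample, the merging rule for consecutive windows and the sample values are assumptions for this sketch.

```python
def slow_speed_ratios(latencies, upper_limits):
    """Ratio of each observed latency to its modeled upper limit."""
    return [y / s for y, s in zip(latencies, upper_limits)]


def high_latency_windows(ratios, size=5, threshold=1.0, proportion=0.5):
    """Start indices of windows in which more than `proportion` of the
    ratios exceed `threshold`."""
    starts = []
    for start in range(len(ratios) - size + 1):
        window = ratios[start:start + size]
        exceeding = sum(r > threshold for r in window)
        if exceeding > proportion * size:
            starts.append(start)
    return starts


def merge_windows(starts, size=5):
    """Merge consecutive high-latency windows into (first, last) index spans."""
    events = []
    for start in starts:
        if events and start <= events[-1][1] + 1:
            events[-1] = (events[-1][0], start + size - 1)
        else:
            events.append((start, start + size - 1))
    return events


# Hypothetical observed write latencies and modeled upper limits (ms).
latencies = [1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0]
limits = [2.0] * 10
ratios = slow_speed_ratios(latencies, limits)
starts = high_latency_windows(ratios)
events = merge_windows(starts)
```

Here the three windows starting at indices 0, 1 and 2 each contain three ratios above 1, so they are high-latency windows and are merged into a single event spanning indices 0 to 6.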
In high-latency detection, an instantaneous increase in latency cannot by itself be defined as an abnormal condition; the duration must also be taken into account. The present application therefore classifies different risk levels according to the duration and severity of the latency anomaly. For example, according to the time span over which the windows are abnormal, the duration is classified as temporary (1 to 5 minutes), medium (5 to 20 minutes) or long (20 minutes or more). In addition, according to the window-average slow speed ratio (SR), the latency anomaly is rated as mild (1 ≤ SR < 2), moderate (2 ≤ SR < 5) or severe (SR ≥ 5), as shown in the following table. The subsequent alarm and association module may choose different alarm notification policies based on severity.
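As a non-limiting illustration, the duration and severity grading may be sketched as follows. The boundaries follow the ranges given above; treating the duration ranges as half-open intervals (so that exactly 5 minutes counts as medium) and returning None below the lowest range are assumptions for this sketch. How the two grades combine into a risk level is not reproduced here, so the sketch returns them as a pair.

```python
def duration_level(minutes):
    """Grade the time span of a latency anomaly."""
    if minutes < 1:
        return None  # shorter than the lowest range in the description
    if minutes < 5:
        return "temporary"
    if minutes < 20:
        return "medium"
    return "long"


def severity_level(mean_sr):
    """Grade the window-average slow speed ratio (SR)."""
    if mean_sr < 1:
        return None  # latency stays within the modeled upper limit
    if mean_sr < 2:
        return "mild"
    if mean_sr < 5:
        return "moderate"
    return "severe"


def grade_event(minutes, mean_sr):
    """Return the (duration, severity) pair for one high-latency event."""
    return duration_level(minutes), severity_level(mean_sr)


# Hypothetical event: 12 minutes long with an average SR of 2.4.
grade = grade_event(12, 2.4)
```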
In response to generating the alert, an association analysis is performed according to the topological relationship to determine the component most relevant to the latency.
When the latency score exceeds a preset threshold, the alarm module generates a corresponding alarm and, according to the severity, selects one of several ways to notify the relevant personnel or monitoring system; the association analysis module performs association from top to bottom using a correlation coefficient method and finds the five components most relevant to the latency change trend of the distributed storage system.
System integration: the alarm information is integrated into the existing monitoring system or work order system so that the relevant personnel can receive and process alarms in their usual work interface; this is applicable to alarms of all levels. Mail notification: the alarm information is sent to the email addresses of the relevant personnel, conveying the alarm content promptly together with the necessary details; this is suitable for important alarms. SMS notification: the alarm information is sent to the mobile phones of the relevant personnel by short message; this is a quick and direct way of notification, applicable where an urgent response is required. Telephone call: the alarm information is communicated directly to the relevant personnel through an automatic telephone dialing system or by calling them manually; this is an important way of notification in an emergency.
In some embodiments, the step of performing association analysis according to the topological relation to determine the component most relevant to the time delay includes: performing correlation analysis between the system time delay and the time delay of each storage node, storage medium, switch and router, and determining the component most correlated with the time delay by using the Pearson correlation coefficient method.
Assume that the delay variable of the system is Y and the delay variable of a component on which association analysis is to be performed is X, where each of the two time series contains n observations. The following symbols are defined: the observations of X are $x_1, x_2, \dots, x_t, \dots, x_{n-1}, x_n$; the observations of Y are $y_1, y_2, \dots, y_t, \dots, y_{n-1}, y_n$; the mean of X is $\mu_x$ and the mean of Y is $\mu_y$; the standard deviation of X is $\sigma_x$ and the standard deviation of Y is $\sigma_y$. The Pearson correlation coefficient can be calculated by the following formula:

$$r = \frac{\frac{1}{n}\sum_{t=1}^{n}(x_t-\mu_x)(y_t-\mu_y)}{\sigma_x\,\sigma_y}$$

The formula calculates the covariance between the variables X and Y and then divides it by the product of the standard deviations of X and Y to normalize the value of the correlation coefficient. The result ranges from -1 to 1, where -1 represents a complete negative correlation, 0 represents no correlation, and 1 represents a complete positive correlation. The associated components may be storage nodes, storage media, switches, routers, and so on; in practice they need to be determined according to the topological relation of the system.
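The correlation step described above can be sketched as follows, using only the Python standard library. The function names, the dictionary layout of the component series, and the choice to rank by the signed coefficient (rather than its absolute value) are assumptions made for illustration; the text only states that the five most correlated components are located.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        # Assumption: a constant series is treated as uncorrelated.
        return 0.0
    return cov / (sigma_x * sigma_y)


def top_correlated(system_latency, component_latency, k=5):
    """Rank components by correlation with the system latency, highest first.

    component_latency maps a (hypothetical) component name to its latency series.
    """
    scores = {
        name: pearson(series, system_latency)
        for name, series in component_latency.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

A caller would pass the system latency series for the alarm window together with the latency series of each storage node, storage medium, switch, or router taken from the topology.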
The embodiment of the application acquires the time delay and throughput of the system, removes abnormal points through DBSCAN, and builds a model of the normal index using polynomial regression. The upper limit of the index is determined from the confidence interval and used as the threshold for subsequent delay anomaly detection. The delay risk is evaluated according to the duration and degree of the abnormality. Once a high-delay event occurs, different alarm strategies are selected according to severity, and association analysis is carried out to locate the five components most correlated with the system's delay changes (five is only an example; a different number of components can be located as needed). Operation and maintenance personnel can formulate corresponding countermeasures according to the alarm level and the association analysis result, so that risks are avoided in time, the stability of the system is improved, and operation and maintenance costs are reduced.
The embodiment of the application acquires the write delay and throughput data of each node of the distributed storage system and fills missing data with the data from the previous moment, thereby obtaining the write delay and throughput data of the system. The preprocessed delay data is cluster-analyzed with the DBSCAN algorithm to identify abnormal and normal delay points. The DBSCAN algorithm is based on the density reachability principle; it can find regions of densely packed data points and divide them into different clusters. Using polynomial regression, a model is built from the normal delay and throughput data points, so that the nonlinear relationship between the delay index and throughput can be captured and the delay behavior modeled more accurately. The confidence interval of the delay index is calculated from the polynomial regression model. The upper limit of normal delay is determined from the confidence interval and used as the threshold for subsequent delay anomaly detection. When the delay at a given moment exceeds the threshold, that point is determined to be a high-delay point. Factors such as the duration and degree of the abnormality are considered together to assess the risk of the high-delay event; the assessment result determines the severity and persistence of the event and provides a basis for subsequent alarm strategies and response measures. According to the severity of the high-delay event, a suitable alarm strategy is selected, including mail notification, short message reminders, and the like, and an alarm level is set. At the same time, association analysis is carried out to find the five parts or components most correlated with the system's delay changes.
In summary, the application can accurately detect and analyze the high-delay event in the distributed storage system by acquiring the time delay and throughput data of the system and utilizing the cluster analysis and regression modeling technology. By determining the threshold value, evaluating the risk, selecting the alarm strategy and performing association analysis, the operation and maintenance personnel can be helped to timely cope with the high-delay event, the stability and the performance of the system are improved, and the operation and maintenance cost is saved. The technology has practical value, can be applied to various distributed storage systems, and provides effective tools and methods for operation and maintenance teams so as to ensure the normal operation and optimization of the system.
It should be noted that, in the embodiments of the method for delay detection and analysis, the steps may be interleaved, replaced, added, or removed. Methods for delay detection and analysis obtained through such reasonable permutations and combinations therefore also fall within the protection scope of the present application, and the protection scope of the present application should not be limited to the embodiments.
Based on the above object, a second aspect of the embodiments of the present application provides a system for delay detection and analysis. As shown in fig. 6, the system 200 includes the following modules: the processing module is used for preprocessing the acquired time delay data and throughput data; the clustering module is used for clustering the preprocessed time delay data and the preprocessed throughput data and establishing a relation model between the time delay and the throughput according to a clustering result; the alarm module is used for predicting new time delay data through the relation model and determining whether an alarm is generated or not according to the relation between the new time delay data and a preset range; and an analysis module for performing association analysis according to the topological relation to determine the component most relevant to the time delay in response to generating the alarm.
In some embodiments, the processing module is further configured to: acquire the write delay and throughput data of each node, fill missing data with the data from the previous moment, and calculate the average value of the filled data.
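The preprocessing step can be sketched as a forward fill followed by a per-moment average. This is an illustrative sketch under assumptions: missing samples are represented as `None`, the average is taken across nodes at each moment, and a leading gap with no earlier sample is left unfilled. The function names are hypothetical.

```python
def forward_fill(series):
    """Replace each missing sample (None) with the most recent earlier sample."""
    filled = []
    last = None
    for value in series:
        if value is None:
            value = last  # stays None when no earlier sample exists
        filled.append(value)
        last = value
    return filled


def system_series(per_node):
    """Average the forward-filled series of all nodes, moment by moment.

    per_node maps a node name to its list of samples (write delay or throughput).
    """
    filled = [forward_fill(samples) for samples in per_node.values()]
    result = []
    for moment in zip(*filled):
        present = [value for value in moment if value is not None]
        result.append(sum(present) / len(present) if present else None)
    return result
```

The same two functions can be applied separately to the write delay series and the throughput series to obtain system-level data.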
In some embodiments, the clustering module is further to: the neighborhood is defined according to the density, the data points are divided into core points, boundary points and noise points, and the noise points are removed.
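The density-based noise removal can be illustrated with a compact DBSCAN written from scratch. This is a sketch for small data sets only (it compares all pairs of points); the parameter names `eps` and `min_pts` follow common DBSCAN usage and their values are assumptions that would need tuning, typically after scaling delay and throughput to comparable ranges.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE for noise points."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # The eps-neighborhood of point i, including the point itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a boundary point
            continue
        labels[i] = cluster  # i is a core point and starts a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels


def remove_noise(points, eps, min_pts):
    """Keep only the points that DBSCAN assigns to some cluster."""
    labels = dbscan(points, eps, min_pts)
    return [p for p, label in zip(points, labels) if label != NOISE]
```

Each point would be a (throughput, write delay) pair; the points returned by `remove_noise` are the normal samples passed on to regression.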
In some embodiments, the clustering module is further configured to: take throughput as the independent variable and write delay as the dependent variable, and split the data set into a training set and a test set; train a polynomial regression model on the training set, learning the relationship between throughput and write delay by fitting a polynomial function; calculate an error index between the predicted and actual results on the test set to evaluate the trained polynomial regression model; and dynamically adjust the parameters of the polynomial regression model according to the evaluation result.
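The training and evaluation loop can be sketched with a small least-squares polynomial fit, treating throughput as the input and write delay as the target. Several choices here are assumptions rather than details from the text: the error index is RMSE, the adjusted parameter is the polynomial degree, the split is 80/20, and the fit solves the normal equations directly, which is adequate only for low degrees and modest value ranges.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns coefficients c with y ~ c[0] + c[1]*x + ..."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def rmse(coeffs, xs, ys):
    """Root-mean-square error between predictions and actual values."""
    return (sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def train_and_select(xs, ys, degrees=(1, 2, 3), test_fraction=0.2, seed=0):
    """Split the data, fit each candidate degree, keep the lowest test RMSE."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    cut = max(1, int(len(order) * test_fraction))
    test, train = order[:cut], order[cut:]
    best = None
    for degree in degrees:
        coeffs = fit_polynomial([xs[i] for i in train], [ys[i] for i in train], degree)
        error = rmse(coeffs, [xs[i] for i in test], [ys[i] for i in test])
        if best is None or error < best[1]:
            best = (coeffs, error, degree)
    return best  # (coefficients, test RMSE, degree)
```

In practice a numerical library would normally replace the hand-written solver; the sketch only shows the split, fit, evaluate, and adjust cycle.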
In some embodiments, the clustering module is further to: calculating a predicted value through the polynomial regression model according to the write time delay, and calculating a standard error of the predicted value; and calculating a confidence interval according to the standard error, the write delay and the predicted value.
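One simple way to turn the standard error into a confidence interval is sketched below. It is a simplification, not necessarily the method of the application: the standard error is taken as the residual standard error of the fit (observed write delay minus predicted value), the interval uses a normal quantile, and the 95% level is an assumed default. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def residual_standard_error(actual, predicted, n_params):
    """Standard error estimated from residuals, with n - n_params degrees of freedom."""
    dof = len(actual) - n_params
    if dof <= 0:
        raise ValueError("need more observations than model parameters")
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return math.sqrt(sse / dof)


def confidence_interval(predicted_value, std_error, confidence=0.95):
    """Return (lower, upper) bounds around a predicted delay value."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return predicted_value - z * std_error, predicted_value + z * std_error
```

The upper bound returned by `confidence_interval` plays the role of the confidence upper limit used as the anomaly threshold.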
In some embodiments, the alert module is further configured to: calculate the ratio of the new time delay data to the confidence upper limit at each of a plurality of consecutive moments, and raise an alarm in response to the ratios for more than half of the moments exceeding a threshold value.
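The majority rule above can be sketched in a few lines. The default ratio threshold of 1.0 (delay above the confidence upper limit) is an assumption, and the function names are hypothetical.

```python
def slow_ratios(latencies, upper_limits):
    """Ratio of each observed delay to the confidence upper limit at that moment."""
    return [latency / limit for latency, limit in zip(latencies, upper_limits)]


def should_alarm(latencies, upper_limits, ratio_threshold=1.0):
    """Alarm when the ratio exceeds the threshold at more than half of the moments."""
    ratios = slow_ratios(latencies, upper_limits)
    exceeding = sum(1 for ratio in ratios if ratio > ratio_threshold)
    return exceeding * 2 > len(ratios)
```

With four consecutive moments, three exceeding ratios trigger an alarm while exactly two do not, since two is not more than half.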
In some embodiments, the analysis module is further configured to: perform correlation analysis between the system time delay and the time delay of each storage node, storage medium, switch, and router, and determine the component most correlated with the system time delay using the Pearson correlation coefficient method.
In view of the above object, a third aspect of an embodiment of the present application provides an electronic device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: s1, preprocessing acquired time delay data and throughput data; s2, clustering the preprocessed time delay data and the preprocessed throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result; s3, predicting new delay data through the relation model, and determining whether an alarm is generated or not according to the relation between the new delay data and a preset range; and S4, responding to the generation of the alarm, and carrying out association analysis according to the topological relation to determine the component most relevant to the time delay.
In some embodiments, the step of preprocessing the acquired latency data and throughput data includes: acquiring the write delay and throughput data of each node, filling missing data with the data from the previous moment, and calculating the average value of the filled data.
In some embodiments, the step of clustering the pre-processed latency data and throughput data includes: the neighborhood is defined according to the density, the data points are divided into core points, boundary points and noise points, and the noise points are removed.
In some embodiments, the step of modeling a relationship between latency and throughput based on the clustering result includes: taking throughput as the independent variable and write delay as the dependent variable, and splitting the data set into a training set and a test set; training a polynomial regression model on the training set, learning the relationship between throughput and write delay by fitting a polynomial function; calculating an error index between the predicted and actual results on the test set to evaluate the trained polynomial regression model; and dynamically adjusting the parameters of the polynomial regression model according to the evaluation result.
In some embodiments, the step of modeling a relationship between latency and throughput based on the clustering result includes: calculating a predicted value through the polynomial regression model according to the write time delay, and calculating a standard error of the predicted value; and calculating a confidence interval according to the standard error, the write delay and the predicted value.
In some embodiments, the step of determining whether to generate the alarm according to the relationship between the new time delay data and the preset range includes: calculating the ratio of the new time delay data to the confidence upper limit at each of a plurality of consecutive moments, and raising an alarm in response to the ratios for more than half of the moments exceeding a threshold value.
In some embodiments, the step of performing association analysis based on the topological relation to determine the component most relevant to the time delay comprises: performing correlation analysis between the system time delay and the time delay of each storage node, storage medium, switch, and router, and determining the component most correlated with the system time delay using the Pearson correlation coefficient method.
Fig. 7 is a schematic hardware structure of an embodiment of the electronic device for delay detection and analysis according to the present application.
Taking the example of the apparatus shown in fig. 7, a processor 301 and a memory 302 are included in the apparatus.
The processor 301 and the memory 302 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 7.
The memory 302, as a non-volatile computer readable storage medium, is used for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as the program instructions/modules corresponding to the method of delay detection and analysis in embodiments of the present application. The processor 301 executes the various functional applications and data processing of the server, that is, implements the method of delay detection and analysis, by running the non-volatile software programs, instructions, and modules stored in the memory 302.
Memory 302 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the method of latency detection analysis, etc. In addition, memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Computer instructions 303 corresponding to one or more methods of delay detection analysis are stored in memory 302 that, when executed by processor 301, perform the methods of delay detection analysis in any of the method embodiments described above.
Any one embodiment of the electronic device that performs the above-described method for detecting and analyzing a time delay may achieve the same or similar effects as any one of the foregoing method embodiments corresponding thereto.
The application also provides a computer readable storage medium storing a computer program for executing the method of latency detection analysis when executed by a processor.
Fig. 8 is a schematic diagram of an embodiment of the computer storage medium for the above-mentioned delay detection analysis according to the present application. Taking a computer storage medium as shown in fig. 8 as an example, the computer-readable storage medium 401 stores a computer program 402 that, when executed by a processor, performs the above method.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the embodiments described above may be implemented by a computer program to instruct related hardware, and the program of the method for latency detection analysis may be stored in a computer readable storage medium, where the program may include processes in the embodiments of the methods described above when executed. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the application, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the application, and many other variations of the different aspects of the embodiments of the application as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present application.

Claims (10)

1. A method of delay detection analysis comprising the steps of:
preprocessing the acquired delay data and throughput data;
clustering the preprocessed time delay data and throughput data, and establishing a relation model between the time delay and the throughput according to a clustering result;
predicting new time delay data through the relation model, and determining whether an alarm is generated or not according to the relation between the new time delay data and a preset range; and
in response to generating the alert, an association analysis is performed according to the topological relationship to determine the component most relevant to the latency.
2. The method of latency detection analysis of claim 1, wherein the step of preprocessing the acquired latency data and throughput data comprises:
acquiring the write delay and throughput data of each node, filling missing data with the data from the previous moment, and calculating the average value of the filled data.
3. The method of latency detection analysis of claim 1, wherein the step of clustering the preprocessed latency data and throughput data comprises:
the neighborhood is defined according to the density, the data points are divided into core points, boundary points and noise points, and the noise points are removed.
4. The method of latency detection analysis according to claim 1, wherein the step of modeling a relationship between latency and throughput based on the clustering result comprises:
taking throughput as the independent variable and write time delay as the dependent variable, and splitting a data set into a training set and a testing set;
training a polynomial regression model by using the training set, and learning a relation between throughput and write delay by fitting a polynomial function;
calculating an error index between a predicted result and an actual result by using the test set so as to evaluate the trained polynomial regression model; and
and dynamically adjusting parameters of the polynomial regression model according to the evaluation result.
5. The method of latency detection analysis according to claim 4, wherein the step of modeling a relationship between latency and throughput based on the clustering result comprises:
calculating a predicted value through the polynomial regression model according to the write time delay, and calculating a standard error of the predicted value; and
and calculating a confidence interval according to the standard error, the write delay and the predicted value.
6. The method of latency detection analysis according to claim 5, wherein the step of determining whether to generate an alarm based on a relationship between new latency data and a predetermined range comprises:
calculating the ratio of the new time delay data to the confidence upper limit at each of a plurality of consecutive moments, and raising an alarm in response to the ratios for more than half of the moments exceeding a threshold value.
7. The method of latency detection analysis according to claim 1, wherein the step of performing association analysis to determine components most relevant to latency based on topology comprises:
performing correlation analysis between the system time delay and the time delay of each storage node, storage medium, switch, and router, and determining the component most correlated with the system time delay using the Pearson correlation coefficient method.
8. A system for delay detection analysis, comprising:
the processing module is used for preprocessing the acquired time delay data and throughput data;
the clustering module is used for clustering the preprocessed time delay data and the preprocessed throughput data and establishing a relation model between the time delay and the throughput according to a clustering result;
the alarm module is used for predicting new time delay data through the relation model and determining whether an alarm is generated or not according to the relation between the new time delay data and a preset range; and
and the analysis module is used for carrying out association analysis according to the topological relation to determine the component most relevant to the time delay in response to the generation of the alarm.
9. An electronic device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311117356.9A CN117093461A (en) 2023-08-31 2023-08-31 Method, system, equipment and storage medium for time delay detection and analysis

Publications (1)

Publication Number Publication Date
CN117093461A true CN117093461A (en) 2023-11-21

Family

ID=88773425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311117356.9A Pending CN117093461A (en) 2023-08-31 2023-08-31 Method, system, equipment and storage medium for time delay detection and analysis

Country Status (1)

Country Link
CN (1) CN117093461A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117459418A (en) * 2023-12-25 2024-01-26 天津神州海创科技有限公司 Real-time data acquisition and storage method and system
CN117807055A (en) * 2024-02-29 2024-04-02 济南浪潮数据技术有限公司 Method and related device for predicting and analyzing key performance indexes of storage system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117459418A (en) * 2023-12-25 2024-01-26 天津神州海创科技有限公司 Real-time data acquisition and storage method and system
CN117459418B (en) * 2023-12-25 2024-03-08 天津神州海创科技有限公司 Real-time data acquisition and storage method and system
CN117807055A (en) * 2024-02-29 2024-04-02 济南浪潮数据技术有限公司 Method and related device for predicting and analyzing key performance indexes of storage system

Similar Documents

Publication Publication Date Title
CN117093461A (en) Method, system, equipment and storage medium for time delay detection and analysis
CN102957579B (en) A kind of exception flow of network monitoring method and device
CN111885012B (en) Network situation perception method and system based on information acquisition of various network devices
CN107154950B (en) Method and system for detecting log stream abnormity
CN110147387B (en) Root cause analysis method, root cause analysis device, root cause analysis equipment and storage medium
CN108206747B (en) Alarm generation method and system
Chhabra et al. Distributed spatial anomaly detection
KR20180120558A (en) System and method for predicting communication apparatuses failure based on deep learning
EP3058679A1 (en) Alarm prediction in a telecommunication network
US20150207696A1 (en) Predictive Anomaly Detection of Service Level Agreement in Multi-Subscriber IT Infrastructure
CN111262750B (en) Method and system for evaluating baseline model
CN114978568A (en) Data center management using machine learning
EP3923517A1 (en) System and method for predicting and handling short-term overflow
CN115454778A (en) Intelligent monitoring system for abnormal time sequence indexes in large-scale cloud network environment
US8661113B2 (en) Cross-cutting detection of event patterns
CN106452941A (en) Network anomaly detection method and device
CN114610559A (en) Equipment operation environment evaluation method, judgment model training method and electronic equipment
CN115237717A (en) Micro-service abnormity detection method and system
CN114095965A (en) Index detection model obtaining and fault positioning method, device, equipment and storage medium
CN115622867A (en) Industrial control system safety event early warning classification method and system
CN110647086B (en) Intelligent operation and maintenance monitoring system based on operation big data analysis
CN116471196B (en) Operation and maintenance monitoring network maintenance method, system and equipment
CN117076258A (en) Remote monitoring method and system based on Internet cloud
CN111078503B (en) Abnormality monitoring method and system
US20220092457A1 (en) Electronic device and method for analyzing reliability of facility

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination