CN114328198A

CN114328198A - System fault detection method, device, equipment and medium

Info

Publication number: CN114328198A
Application number: CN202111554982.5A
Authority: CN
Inventors: 赵利强
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd; Guangdong Inspur Smart Computing Technology Co Ltd
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-04-12
Also published as: WO2023109251A1

Abstract

The application discloses a system fault detection method, a device, equipment and a medium, wherein the method comprises the following steps: acquiring current operation data of each service node in a service system to be detected; standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively; training a model to be trained constructed based on a logistic regression algorithm by using historical operating data with fault type labels to obtain a trained supervised learning model; extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; and respectively carrying out weighted calculation on the standard scores of the corresponding running state data by using the weight coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores. The method and the device for detecting the system fault obtain a supervised learning model based on historical operating data, and detect the current operating data by using the model in a weighting calculation mode so as to detect the system fault.

Description

System fault detection method, device, equipment and medium

Technical Field

The present invention relates to the field of computer systems, and in particular, to a method, an apparatus, a device, and a medium for detecting system faults.

Background

The cloud native environment is mainly characterized by microservices, automated release, continuous delivery and containerization. The microservice architecture has great advantages in independent deployment, rapid delivery and scalability, but because a microservice system contains numerous services, the calling relationships among the services become extremely complex, and when the system has problems, it is difficult for operation and maintenance personnel to find and troubleshoot them quickly, accurately and comprehensively. Therefore, in a service system environment, fault detection and root cause localization require more intelligent algorithmic models.

At present, in service system scenarios with many services and large volumes of operation and maintenance data, such as private cloud monitoring, large-scale microservice troubleshooting and intelligent operation and maintenance of cloud native platforms, the service nodes are numerous and the calling relationships among them can become extremely complex. In the prior art, faults are mostly found and investigated through methods such as threshold detection and rule-based alarms, so when a problem occurs in the service system, operation and maintenance personnel often find it difficult to locate faults and troubleshoot problems quickly, accurately and comprehensively.

In summary, how to automatically, rapidly, accurately and comprehensively detect and locate faults in a service system is a problem to be solved at present.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a method, an apparatus, a device and a medium for detecting system faults, which can automatically, rapidly, accurately and comprehensively detect and locate faults in a service system. The specific scheme is as follows:

In a first aspect, the present application discloses a system fault detection method, including:

acquiring current operation data of each service node in a service system to be detected; the current operating data comprises a plurality of operating state data;

standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively;

training a model to be trained constructed based on a logistic regression algorithm by using historical operating data with fault type labels to obtain a trained supervised learning model;

extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;

and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.

Optionally, the obtaining current operation data of each service node in the service system to be detected includes:

the method comprises the steps of obtaining system performance index data, micro-service call chain data and system log data of each service node in a service system to be detected so as to obtain current operation data of each service node.

Optionally, the obtaining system performance index data, micro service call chain data, and system log data of each service node in the service system to be detected to obtain current operation data of each service node includes:

determining a time length of a sliding window of the time series;

sampling system performance index data of each service node in the service system to be detected based on a first preset time interval within the time length of each sliding window to obtain multiple groups of system performance index data which are arranged according to a time sequence and correspond to the sliding windows;

sampling the micro-service call chain data of each service node in the service system to be detected based on a second preset time interval within the time length of each sliding window to obtain a plurality of groups of micro-service call chain data which are arranged according to a time sequence and correspond to the sliding windows;

and sampling system log data of each service node in the service system to be detected based on a third preset time interval within the time length of each sliding window to obtain multiple groups of system log data which are arranged according to a time sequence and correspond to the sliding windows.
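The window-based sampling described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the application's implementation: the function name `sample_windows`, the representation of raw data as time-sorted `(timestamp, value)` pairs, the `stride` parameter, and the choice of taking the latest reading at or before each sampling instant are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def sample_windows(
    readings: Sequence[Tuple[float, float]],  # (timestamp, value), sorted by timestamp
    window_length: float,                     # time length of each sliding window
    interval: float,                          # preset sampling interval inside a window
    start: float,
    end: float,
    stride: Optional[float] = None,           # assumption: defaults to non-overlapping windows
) -> List[List[Optional[float]]]:
    """Return one time-ordered group of samples per window that fits in [start, end]."""
    step = window_length if stride is None else stride
    timestamps = [t for t, _ in readings]
    samples_per_window = int(window_length // interval)
    groups: List[List[Optional[float]]] = []
    window_start = start
    while window_start + window_length <= end:
        group: List[Optional[float]] = []
        for k in range(samples_per_window):
            instant = window_start + k * interval
            # Index of the latest reading taken at or before this sampling instant.
            idx = bisect_right(timestamps, instant) - 1
            group.append(readings[idx][1] if idx >= 0 else None)
        groups.append(group)
        window_start += step
    return groups
```

The same helper could be called three times with different `interval` values, once each for the performance index data, the call chain data and the log data, which mirrors the first, second and third preset time intervals.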

Optionally, the normalizing the current operating data by using a preset data normalization method to obtain standard scores corresponding to the various operating state data respectively includes:

calculating a z-score corresponding to each group of the system performance index data and a z-score corresponding to first-order difference data between different system performance index data in each group of the system performance index data;

acquiring micro-service calling time in each group of micro-service calling chain data, and calculating a z-score corresponding to the micro-service calling time in each group of micro-service calling chain data and a z-score corresponding to first-order difference data between different micro-service calling times in each group of micro-service calling chain data;

and matching each group of the system log data by using a preset log template to obtain matching scores corresponding to different system log data in each group of the system log data, and calculating the z-scores of the matching scores corresponding to the different system log data in each group of the system log data and the z-score of first-order difference data between the different matching scores corresponding to the different system log data in each group of the system log data.
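As a rough illustration of the standardization just described, the sketch below computes z-scores for one group of values and for the first-order differences of that group. The function names are hypothetical, and the use of the population standard deviation (and returning zeros for a constant group) is an assumption, since the text does not say which convention is used.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize a group of values to zero mean and unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # A constant group carries no deviation information.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def first_differences(values: Sequence[float]) -> List[float]:
    """First-order differences between consecutive values in a group."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical call durations from one window of call chain data.
durations = [120.0, 118.0, 121.0, 119.0, 480.0]
duration_z = z_scores(durations)
difference_z = z_scores(first_differences(durations))
```

The same two functions would apply unchanged to performance index values and to log template matching scores, since all three are reduced to groups of numbers before standardization.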

Optionally, the process of normalizing any group of operation state data in the current operation data includes:

respectively calculating the mean value and the variance corresponding to the group of operating state data by using an optimized mean value calculation formula and an optimized variance calculation formula, and calculating the z-score corresponding to the group of operating state data based on the mean value and the variance; wherein,

the optimized mean value calculation formula is as follows:

the optimized variance calculation formula is as follows:

wherein n represents the data sample size corresponding to the group of operating state data, x_i represents the i-th data sample in the group of operating state data, mean represents the mean value, and s² represents the variance.
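The two optimized formulas themselves are not reproduced in this text, so the following is not a reconstruction of them. As a clearly labeled assumption, one well-known way to obtain the mean and the sample variance s² in a single pass over streaming data is Welford's incremental update, sketched here with a hypothetical `RunningStats` class; whether it matches the application's formulas cannot be confirmed from the text.

```python
from math import sqrt


class RunningStats:
    """Single-pass mean and variance (Welford's method; an assumed stand-in)."""

    def __init__(self) -> None:
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance s^2 with the n - 1 denominator (assumption).
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def z_score(self, x: float) -> float:
        std = sqrt(self.variance)
        return (x - self.mean) / std if std > 0 else 0.0
```

An incremental form like this avoids storing the whole group and re-scanning it for every new sample, which is the usual motivation for replacing the textbook two-pass formulas.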

Optionally, before training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data with the fault type label, the method further includes:

acquiring historical normal operation data and historical fault operation data;

adding label information containing corresponding running time interval labels and fault-free type labels to the historical normal running data to obtain first historical running data serving as a negative sample;

adding label information including corresponding operation time interval labels and fault type labels to the historical fault operation data, and resampling the historical fault operation data with the label information added to obtain second historical operation data serving as a positive sample, so that the proportion between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset positive-negative sample proportion.

Optionally, the performing weighted calculation on the standard scores of the corresponding operating condition data by using the weighting coefficients of the linear parameters includes:

acquiring expert knowledge for optimizing the weight coefficient of the linear parameter through a preset expert knowledge acquisition interface;

correspondingly adjusting the weight coefficient of the linear parameter by using the expert knowledge to obtain the adjusted weight coefficient of the linear parameter;

and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by utilizing the adjusted weight coefficients of the linear parameters.

Optionally, the performing weighted calculation on the corresponding standard scores of the operating state data by using the weighting coefficients of the linear parameters, and performing fault location on the service system to be detected based on the weighted scores includes:

respectively carrying out weighted calculation on the standard score of the corresponding running state data in each service node by using the weight coefficient of the linear parameter so as to obtain the weighted score of each service node;

screening a preset number of service nodes with weighted scores larger than a preset threshold value from all the service nodes according to the sequence of the weighted scores from large to small, and determining a target service node with a fault based on the service nodes obtained after screening;

screening out the maximum weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the maximum weight coefficient as the corresponding fault root cause.
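The three localization steps above can be sketched as follows. This is a minimal illustration under assumptions: standard scores are held in a nested dictionary keyed by node name and parameter name, and the function names, the sample node names and the sample parameter names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, Dict[str, float]]  # node name -> {parameter name -> standard score}


def node_scores(standard_scores: Scores, weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted score of each node: sum of weight * standard score over its parameters."""
    return {
        node: sum(weights.get(name, 0.0) * z for name, z in params.items())
        for node, params in standard_scores.items()
    }


def locate_faults(
    standard_scores: Scores, weights: Dict[str, float], threshold: float, top_k: int
) -> List[Tuple[str, float]]:
    """Keep nodes scoring above the threshold, sorted descending, limited to top_k."""
    scored = node_scores(standard_scores, weights)
    above = [(node, s) for node, s in scored.items() if s > threshold]
    above.sort(key=lambda item: item[1], reverse=True)
    return above[:top_k]


def root_cause(node_params: Dict[str, float], weights: Dict[str, float]) -> Optional[str]:
    """Parameter type with the largest weight coefficient among the node's parameters."""
    candidates = [name for name in node_params if name in weights]
    return max(candidates, key=lambda name: weights[name]) if candidates else None


# Hypothetical inputs for two service nodes.
scores = {
    "node-a": {"cpu": 3.0, "latency": 1.0},
    "node-b": {"cpu": 0.1, "latency": 0.2},
}
weights = {"cpu": 0.5, "latency": 2.0}
suspects = locate_faults(scores, weights, threshold=1.0, top_k=1)
```

Note that `root_cause` follows the wording of the text literally and selects by weight coefficient alone; a variant that selects by the weighted contribution (weight times standard score) would be a different design and is not what the text describes.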

In a second aspect, the present application discloses a system fault detection apparatus, comprising:

the data acquisition module is used for acquiring the current operation data of each service node in the service system to be detected; the current operating data comprises a plurality of operating state data;

the standardization processing module is used for carrying out standardization processing on the current operation data by utilizing a preset data standardization method so as to obtain standard scores corresponding to various operation state data respectively;

the model training module is used for training a model to be trained, which is constructed based on a logistic regression algorithm, by using historical operating data with fault type labels so as to obtain a trained supervised learning model;

the weight coefficient extraction module is used for extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;

and the fault positioning module is used for respectively carrying out weighted calculation on the corresponding standard scores of the running state data by utilizing the weighting coefficients of the linear parameters and carrying out fault positioning on the service system to be detected based on the weighted scores.

In a third aspect, the present application discloses an electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the system fault detection method disclosed in the foregoing.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the steps of the system fault detection method disclosed in the foregoing when executed by a processor.

It can be seen that the present application acquires current operation data of each service node in the service system to be detected, the current operation data comprising a plurality of types of operating state data; then standardizes the current operation data by using a preset data standardization method to obtain standard scores corresponding to the various operating state data respectively; then trains a model to be trained, constructed based on a logistic regression algorithm, by using historical operating data with fault type labels to obtain a trained supervised learning model; extracts a weight coefficient corresponding to each linear parameter in the trained supervised learning model, wherein different linear parameters respectively correspond to different operating state data; and finally performs weighted calculation on the standard scores of the corresponding operating state data by using the weight coefficients of the linear parameters, and performs fault localization on the service system to be detected based on the weighted scores. In other words, the standard scores of the operating state data of each service node are weighted by the weight coefficients extracted from the trained supervised learning model; the resulting weighted scores are sorted, the weighted scores meeting a preset condition are screened out, and the corresponding service nodes are determined, thereby locating the fault within the system. The root cause of the system fault is then determined from the weight coefficients associated with the component information in the located service node, which improves fault localization efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a system fault detection method disclosed herein;

FIG. 2 is a flow chart of a particular system fault detection method disclosed herein;

FIG. 3 is a flow chart of a particular system fault detection method disclosed herein;

FIG. 4 is a flow chart of a specific fault detection and root cause determination process disclosed herein;

FIG. 5 is a flow chart of a particular system fault detection method disclosed herein;

FIG. 6 is a schematic diagram of a system fault detection apparatus according to the present disclosure;

FIG. 7 is a block diagram of an electronic device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In service system scenes with more services and operation and maintenance data, such as private cloud monitoring, large-scale micro-service troubleshooting, cloud native platform intelligent operation and maintenance and the like, when problems occur in a service system, due to the fact that service nodes in the service system are numerous, calling relations among the service nodes can become extremely complex, most of the prior art means search and troubleshoot faults through methods of threshold detection, rule alarming and the like, and operation and maintenance personnel are often difficult to search and troubleshoot the problems quickly, accurately and comprehensively. Therefore, the embodiment of the application discloses a system fault detection method, device, equipment and medium, which can automatically, quickly, accurately and comprehensively detect and position faults in a service system.

Referring to fig. 1, an embodiment of the present application discloses a system fault detection method, including:

step S11: acquiring current operation data of each service node in a service system to be detected; the current operating data includes a plurality of operating state data.

In this embodiment, current operation data of each service node in a service system to be detected needs to be acquired, where the service system to be detected may include, but is not limited to, any one of a private cloud monitoring system, a large-scale micro-service system, or a cloud native platform intelligent operation and maintenance system. In addition, the current operation data of each service node includes various operation state data, that is, various data capable of representing the operation state of the service system to be detected.

Step S12: and carrying out standardization processing on the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively.

In this embodiment, the different operating state data in the acquired current operation data have different dimensions, data units or orders of magnitude. To facilitate comparison and weighting between different operating state data, the current operation data is standardized by using a preset data standardization method, which converts it into dimensionless pure numerical values, namely the standard scores corresponding to the respective operating state data.

Step S13: and training the model to be trained constructed based on the logistic regression algorithm by using the historical operating data carrying the fault type label to obtain the trained supervised learning model.

In this embodiment, the model to be trained is trained by using pre-prepared historical operating data carrying fault type labels to obtain a trained supervised learning model. The model to be trained is constructed based on a logistic regression algorithm, that is, supervised training is performed with a logistic regression classifier. It should be noted that the purpose of building the supervised learning model with logistic regression is not to use the classifier itself to perform fault detection and root cause localization on the system to be detected, but to use the logistic regression fitting process to adjust the weights of the corresponding linear parameters in the model from a limited amount of labeled historical operating data.
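To make the idea of training only to obtain weights concrete, the sketch below fits a small logistic regression with batch gradient descent and returns the learned coefficient of each linear parameter by name. It is an illustration under assumptions: the function names, the learning rate, the epoch count, the feature names and the toy data are hypothetical, and the text does not specify an optimizer or any regularization.

```python
from math import exp
from typing import Dict, List, Sequence


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + exp(-t))
    e = exp(t)
    return e / (1.0 + e)


def train_logistic_regression(
    X: Sequence[Sequence[float]],  # rows of standardized operating state features
    y: Sequence[int],              # 1 = fault type label present, 0 = no fault
    feature_names: List[str],
    lr: float = 0.1,               # assumed learning rate
    epochs: int = 500,             # assumed number of passes
) -> Dict[str, float]:
    """Fit by batch gradient descent and return one weight coefficient per feature."""
    w = [0.0] * len(feature_names)
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
            err = p - label  # gradient of the log loss with respect to the logit
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    # Only the linear coefficients are kept; the classifier output is not used.
    return dict(zip(feature_names, w))


# Toy data: the first feature separates faulty rows, the second is mostly noise.
X = [[2.0, 0.1], [1.8, -0.2], [-0.1, 0.0], [0.2, 0.1], [0.0, -0.1], [-0.2, 0.2]]
y = [1, 1, 0, 0, 0, 0]
learned = train_logistic_regression(X, y, ["cpu_z", "latency_z"])
```

Because each coefficient multiplies one standardized feature, the returned dictionary maps directly onto the operating state data types, which is what allows the coefficients to be reused as weights in the later weighted calculation.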

Step S14: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.

In this embodiment, as described in step S13, the purpose of using the logistic regression classification algorithm to establish the supervised learning model is to adjust the weights of the corresponding linear parameters in the model, so after the model training is completed, the weight coefficient corresponding to each linear parameter is extracted from the supervised learning model, and it should be noted that different linear parameters correspond to different operating state data.

Step S15: and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.

In this embodiment, performing weighted calculation on the standard scores of the corresponding operating state data by using the weight coefficients of the linear parameters may further include: acquiring, through a preset expert knowledge acquisition interface, expert knowledge for optimizing the weight coefficients of the linear parameters; adjusting the weight coefficients of the linear parameters accordingly by using the expert knowledge to obtain adjusted weight coefficients; and performing weighted calculation on the standard scores of the corresponding operating state data by using the adjusted weight coefficients. It can be understood that after the weight coefficients are extracted from the supervised learning model, the corresponding expert knowledge can be obtained through the preset expert knowledge acquisition interface and used to adjust the weight coefficients of different linear parameters. The expert knowledge may be optimization data that is extracted from an optimized weight coefficient model built on a historical expert knowledge base and acquired through the preset expert knowledge acquisition interface, which shortens the weight coefficient optimization process and reduces the waste of human resources; or it may be an instruction for manually optimizing the weight coefficients, acquired through the same interface. Optimizing the weight coefficients with expert knowledge can improve the accuracy and comprehensiveness of the subsequent fault localization of the service system to be detected.
The purpose of adjusting the weight coefficients with expert knowledge is to improve the sensitivity of the model to system faults that have not been encountered before. For example, the network performance index "system.tcp.syn_recv" is often related to network faults; if no fault type label in the historical operating data indicates such a fault, the weight coefficient of the linear parameter corresponding to this index needs to be adjusted appropriately.
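One simple way to picture such an adjustment is a per-parameter floor supplied by an expert, as in the sketch below. The text does not define the form that expert knowledge takes, so the floor-based rule, the function name `adjust_weights`, the numeric values and the metric name `system.cpu.pct` are all assumptions; the network index is written here as `system.tcp.syn_recv`.

```python
from typing import Dict


def adjust_weights(
    learned: Dict[str, float], expert_floors: Dict[str, float]
) -> Dict[str, float]:
    """Raise each listed weight coefficient to at least the expert-supplied floor."""
    adjusted = dict(learned)  # leave the learned coefficients untouched
    for name, floor in expert_floors.items():
        adjusted[name] = max(adjusted.get(name, 0.0), floor)
    return adjusted


# The training data contained no labeled network faults, so the learned
# coefficient for the SYN_RECV index is close to zero (hypothetical values).
learned = {"system.cpu.pct": 0.8, "system.tcp.syn_recv": 0.0}
adjusted = adjust_weights(learned, {"system.tcp.syn_recv": 0.5})
```

Keeping the learned and adjusted coefficients in separate dictionaries makes it easy to compare localization results with and without the expert adjustment.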

It can be seen that the present application first acquires current operation data of each service node in the service system to be detected, the current operation data comprising a plurality of types of operating state data; then standardizes the current operation data by using a preset data standardization method to obtain standard scores corresponding to the various operating state data respectively; then trains a model to be trained, constructed based on a logistic regression algorithm, by using historical operating data with fault type labels to obtain a trained supervised learning model; extracts a weight coefficient corresponding to each linear parameter in the trained supervised learning model, wherein different linear parameters respectively correspond to different operating state data; and finally performs weighted calculation on the standard scores of the corresponding operating state data by using the weight coefficients of the linear parameters, and performs fault localization on the service system to be detected based on the weighted scores. In other words, the standard scores of the operating state data of each service node are weighted by the weight coefficients extracted from the trained supervised learning model; the resulting weighted scores are sorted, the weighted scores meeting a preset condition are screened out, and the corresponding service nodes are determined, thereby locating the fault within the system. The root cause of the system fault is then determined from the weight coefficients associated with the component information in the located service node, which improves fault localization efficiency.

Referring to fig. 2, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:

step S21: the method comprises the steps of obtaining system performance index data, micro-service call chain data and system log data of each service node in a service system to be detected so as to obtain current operation data of each service node.

In this embodiment, current operation data of each service node in the service system to be detected needs to be acquired, where the current operation data mainly includes the following three types of operating state data: system performance index data, microservice call chain data, and system log data. The index types in the system performance index data may include, but are not limited to, any one or more of CPU (Central Processing Unit), memory, disk, database, JVM (Java Virtual Machine), network, I/O (Input/Output), and HA (High Availability, dual-machine cluster). In the microservice call chain data, each group of call chain data comprises a link number (TraceId), the unit number of each call (SpanId), the called service name (ServiceName), the physical unit where the call takes place (CmdbId), and the call duration (Duration). The system log data likewise comprises multiple groups of log data of the types corresponding to the system performance indexes.
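The call chain fields listed above could be represented as a small record type, as in the sketch below. The class name, the attribute names, the choice of milliseconds for the duration, and the grouping helper are assumptions made for illustration; only the five field meanings come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CallChainSpan:
    trace_id: str       # link number (TraceId)
    span_id: str        # unit number of this call (SpanId)
    service_name: str   # called service name (ServiceName)
    cmdb_id: str        # physical unit where the call takes place (CmdbId)
    duration_ms: float  # call duration (Duration); the unit is an assumption


def durations_by_node(spans: Sequence[CallChainSpan]) -> Dict[str, List[float]]:
    """Group call durations by physical unit, preserving arrival order."""
    grouped: Dict[str, List[float]] = {}
    for span in spans:
        grouped.setdefault(span.cmdb_id, []).append(span.duration_ms)
    return grouped


# Hypothetical spans from two traces.
spans = [
    CallChainSpan("t1", "s1", "order", "node-1", 12.0),
    CallChainSpan("t1", "s2", "payment", "node-2", 30.0),
    CallChainSpan("t2", "s3", "order", "node-1", 15.0),
]
per_node = durations_by_node(spans)
```

Grouping durations per node in this way yields the per-node numeric groups on which the z-score standardization of call times operates.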

Step S22: and carrying out standardization processing on the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively.

Step S23: and acquiring historical normal operation data and historical fault operation data.

In this embodiment, the existing historical normal operation data and historical fault operation data in the system need to be obtained. It should be noted that most of the operation data in the system (for example, more than 99%) is normal, and only a very small portion (less than 1%) corresponds to faults.

Step S24: and adding label information containing corresponding running time interval labels and fault-free type labels to the historical normal running data to obtain first historical running data serving as a negative sample.

In this embodiment, the obtained historical normal operation data is used as a negative sample to obtain first historical operation data, and label information including a corresponding operation time interval label and a fault-free type label is also required to be added to the historical normal operation data, where the fault-free type label is label information that represents that the operation data does not cause a system fault.

Step S25: adding label information including corresponding operation time interval labels and fault type labels to the historical fault operation data, and resampling the historical fault operation data with the label information added to obtain second historical operation data serving as a positive sample, so that the proportion between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset positive-negative sample proportion.

In this embodiment, similar to step S24, the historical fault operation data obtained in step S23 is used as positive samples to obtain the second historical operation data, and label information including a corresponding operation time interval label and a fault type label needs to be added to the historical fault operation data, where the fault type label is label information indicating that the operation data may cause a system fault. It should be noted that, because most of the operation data in the system is normal and only a very small portion corresponds to faults, the ratio between the positive and negative samples obtained from the historical fault operation data and the historical normal operation data respectively is highly imbalanced. To address this imbalance, the labeled historical fault operation data is resampled to increase the share of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches a preset positive-negative sample ratio. In this embodiment, the ratio of positive samples to negative samples is 1:10.
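The resampling step can be illustrated with random oversampling of the positive samples until the 1:10 ratio is reached. The text only says that the fault data is resampled, so sampling with replacement, the function name, the fixed seed and the toy data below are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def oversample_positives(
    positives: Sequence[T],
    negatives: Sequence[T],
    pos_parts: int = 1,   # preset positive share of the ratio
    neg_parts: int = 10,  # preset negative share of the ratio
    seed: int = 0,
) -> List[T]:
    """Duplicate positive samples at random until positives : negatives reaches the ratio."""
    # Ceiling division in integer arithmetic, so the target ratio is met or exceeded.
    target = -((-len(negatives) * pos_parts) // neg_parts)
    result = list(positives)
    if not positives or len(result) >= target:
        return result
    rng = random.Random(seed)
    # Sampling with replacement (assumption): existing fault records are repeated.
    result.extend(rng.choices(list(positives), k=target - len(result)))
    return result


# Hypothetical data: 2 fault records against 100 normal records.
fault_records = ["fault-1", "fault-2"]
normal_records = list(range(100))
second_historical = oversample_positives(fault_records, normal_records)
```

With these inputs the positive set grows from 2 to 10 records, giving the 1:10 ratio against the 100 negative samples.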

Step S26: and training a model to be trained constructed based on a logistic regression algorithm by using the first historical operating data and the second historical operating data to obtain a trained supervised learning model.

In this embodiment, a model to be trained, which is constructed based on a logistic regression algorithm, is trained by using first historical operating data and second historical operating data that conform to a preset positive-negative sample ratio, so as to obtain a trained supervised learning model.

Step S27: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.

Step S28: and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.

For more specific processing procedures of the steps S22, S27, and S28, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

Therefore, in the embodiment of the application, system performance index data, micro-service call chain data and system log data of each service node in the service system to be detected are obtained, and the corresponding standard scores are calculated by using a preset data standardization method. Next, the acquired historical normal operation data and historical fault operation data are labeled with corresponding tag information and used as the source of negative and positive samples, respectively. Where the numbers of positive and negative samples are imbalanced, the tagged historical fault operation data is resampled to increase the proportion of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches a preset positive-to-negative sample ratio. Finally, the model to be trained, which is based on logistic regression, is trained with the first historical operating data and the second historical operating data to obtain a supervised learning model; the corresponding weight coefficients are extracted from the supervised learning model, and weighted calculation is performed on the standard scores of the operating state data so as to perform fault location on the service system to be detected. In this way, a supervised learning model is established from a small amount of historical fault operation data and historical normal operation data, and the model can then be used to detect the various operating state data in the current operation data in a streaming calculation mode.

Referring to fig. 3 and 4, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:

Step S31: determining the time length of the sliding window of the time series, and, within the time length of each sliding window, sampling the system performance index data, the micro-service call chain data and the system log data of each service node in the service system to be detected based on a first preset time interval, a second preset time interval and a third preset time interval respectively, so as to obtain multiple groups of system performance index data, multiple groups of micro-service call chain data and multiple groups of system log data which are arranged in time order and correspond to multiple sliding windows.

In this embodiment, first, the length of the sliding window of the time sequence needs to be determined, and in this embodiment, the length of the sliding window is determined to be 30 minutes, and then, the current operation data needs to be sampled based on different preset time intervals within 30 minutes. In this embodiment, three different operation state data, that is, system performance index data, micro service call chain data, and system log data, in the current operation data of each service node in the service system to be detected are sampled based on a first preset time interval, a second preset time interval, and a third preset time interval, respectively, and multiple sets of system performance index data, multiple sets of micro service call chain data, and multiple sets of system log data, which are arranged according to a time sequence and correspond to multiple sliding windows, are obtained after sampling. It should be noted that the first preset time interval, the second preset time interval and the third preset time interval may be equal to or different from each other. For example, for the system performance index data, the first preset time interval may be set to 1 minute, and then 30 sampling points may be obtained after sampling in a sliding window with a sliding window length of 30 minutes, and data values corresponding to the 30 sampling points are arranged according to a time sequence to obtain a set of system performance index data. The same is true for micro-service call chain data and system log data, the number of sampling points is determined by preset time intervals, and when the preset time intervals are different, the number of sampling points obtained after sampling is different; when the preset time intervals are the same, the number of the sampling points is also the same.
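The relationship between the window length, the preset time interval and the number of sampling points can be illustrated by the following minimal sketch. It assumes consecutive, non-overlapping 30-minute windows and timestamps expressed in seconds; the function name `group_by_window` and the sample values are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute window, as in this embodiment


def group_by_window(samples, window_seconds=WINDOW_SECONDS):
    """Group (timestamp_seconds, value) samples into per-window lists,
    keeping the values of each window in time order."""
    windows = defaultdict(list)
    for ts, value in sorted(samples):
        windows[ts // window_seconds].append(value)
    return dict(windows)


# Hypothetical data: CPU sampled every 1 minute, call time every 5 minutes,
# both over one hour.
cpu_samples = [(t * 60, 0.5) for t in range(60)]
call_samples = [(t * 300, 120.0) for t in range(12)]

cpu_groups = group_by_window(cpu_samples)
call_groups = group_by_window(call_samples)
print(len(cpu_groups[0]), len(call_groups[0]))
```

A 1-minute interval yields 30 sampling points per window, while a 5-minute interval yields 6, which matches the statement that the number of sampling points is determined by the preset time interval.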

Step S32: calculating a z-score corresponding to each group of the system performance index data and a z-score corresponding to first-order difference data among different system performance index data in each group of the system performance index data.

In this embodiment, after the multiple sets of system performance index data are obtained, each set of system performance index data needs to be classified according to its data index type and then standardized; for example, when the data index type is CPU, the calculation is performed only over the acquired CPU-related index data. In this embodiment, the z-score standardization method is mainly adopted, that is, a z-score corresponding to each set of system performance index data is calculated; in addition, a z-score corresponding to the first-order difference data between different system performance index data within each set also needs to be calculated. It should be noted that, in this embodiment, before the system performance index data is standardized, data cleaning is also performed on the acquired system performance index data. The purpose of data cleaning is to remove duplicate and redundant data, to fill in missing data, and to correct or delete erroneous data, so as to improve data quality and reduce the error rate when the data is used.

The z-score is calculated as follows:

$$\mathrm{Metric}_i = \frac{\mathrm{value}_i - \mathrm{mean}}{\mathrm{std}}$$

where Metric_i is the z-score of the corresponding data index type for a set of operating state data; value_i is the data value of each sampling point in the set of operating state data; mean is the mean of the set of operating state data within a sliding window; and std is the standard deviation of the set of operating state data within a sliding window.
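The z-score of a set of operating state data and of its first-order difference data can be illustrated by the following minimal sketch. It assumes the population standard deviation (division by n) and returns zero scores when the standard deviation is zero; both choices, as well as the function names and the sample values, are assumptions made for illustration.

```python
import math


def z_scores(values):
    """Return the z-score of every sampling point in one set of data."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # constant series: no point deviates from the mean
    return [(v - mean) / std for v in values]


def first_order_difference(values):
    """Difference between each sampling point and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical CPU index values within one sliding window.
cpu = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cpu_z = z_scores(cpu)
cpu_diff = first_order_difference(cpu)
cpu_diff_z = z_scores(cpu_diff)
print(cpu_z)
print(cpu_diff)
```

The first-order difference series has one element fewer than the original series, and its z-scores highlight sudden changes rather than sustained high values.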

The conventional formulas for calculating the mean and the variance are:

In streaming calculation the data volume is huge, and calculating the mean and the variance with the conventional method performs poorly. The invention therefore provides a mean and variance calculation method with a time complexity of O(1), which can effectively improve the performance of the model. The method optimizes the above formulas as follows:

Expanding the conventional variance formula gives:

Further, let:

Then the following can be obtained:

The optimized mean calculation formula:

The optimized variance calculation formula:

That is, when any one set of operating state data in the current operating data is standardized, the embodiment of the present application may proceed as follows: the mean and the variance corresponding to the set of operating state data are calculated using the optimized mean calculation formula and the optimized variance calculation formula respectively, and the z-score corresponding to the set is then calculated from that mean and variance. Here, n denotes the number of data samples in the set of operating state data, x_i denotes the i-th data sample in the set, mean denotes the mean, and s² denotes the variance. It can be understood that, when calculating both the z-score of the set of operating state data and the z-score of its first-order difference data, the optimized mean calculation formula and the optimized variance calculation formula may be used to calculate the respective mean and variance, from which the respective z-score is then obtained.
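One derivation consistent with the above description is sketched below. It is an assumption rather than a reproduction of the original formulas: it takes the variance to be the population variance (division by n), and it takes S0 and S1 to be the running sum and the running sum of squares of the samples in the window.

$$\mathrm{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mathrm{mean}\right)^2$$

Expanding the variance:

$$s^2 = \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 \;-\; 2\,\mathrm{mean}\sum_{i=1}^{n} x_i \;+\; n\,\mathrm{mean}^2\right)$$

Letting

$$S_0 = \sum_{i=1}^{n} x_i, \qquad S_1 = \sum_{i=1}^{n} x_i^2$$

gives

$$\mathrm{mean} = \frac{S_0}{n}, \qquad s^2 = \frac{1}{n}\left(S_1 - 2\,\frac{S_0}{n}\,S_0 + n\,\frac{S_0^2}{n^2}\right) = \frac{S_1}{n} - \left(\frac{S_0}{n}\right)^2$$

Under these assumptions, once S0 and S1 are known, the mean and the variance follow from a constant number of arithmetic operations, independent of n.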

Thus, only a single queue that maintains the x_i within the window interval needs to be designed in order to quickly update the values of S0 and S1, and the standard score corresponding to each current operating state datum can then be calculated with O(1) time complexity. In other words, regardless of the scale of the data, the mean and the variance of a set of operating state data within the sliding window are obtained in a single calculation. This greatly increases the speed of calculating the mean and the variance of a set of operating state data within the sliding window, and in turn improves the speed of calculating the standard scores and the efficiency of system fault detection.
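The queue-based update can be illustrated by the following minimal sketch. It assumes that S0 is the running sum and S1 the running sum of squares of the samples currently in the window, and that the variance is the population variance; the class name `SlidingWindowStats` and its methods are hypothetical.

```python
import math
from collections import deque


class SlidingWindowStats:
    """Maintain S0 = sum(x) and S1 = sum(x*x) over a fixed-size window so
    that mean, variance and z-score cost O(1) per new sample."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()
        self.s0 = 0.0
        self.s1 = 0.0

    def push(self, x):
        # Add the new sample, then evict the oldest one if the window is full.
        self.window.append(x)
        self.s0 += x
        self.s1 += x * x
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.s0 -= old
            self.s1 -= old * old

    def mean(self):
        return self.s0 / len(self.window)

    def variance(self):
        n = len(self.window)
        m = self.s0 / n
        # Clamp at zero to absorb small negative rounding errors.
        return max(self.s1 / n - m * m, 0.0)

    def z_score(self, x):
        std = math.sqrt(self.variance())
        return 0.0 if std == 0 else (x - self.mean()) / std


stats = SlidingWindowStats(size=3)
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.push(value)
print(stats.mean(), stats.variance(), stats.z_score(4.0))
```

After the four pushes the window holds 2, 3 and 4, so the mean is 3 and the variance is 2/3. The sum-of-squares form can lose precision when the values are large relative to their spread, so a production implementation might prefer a numerically more robust update.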

Step S33: acquiring micro-service calling time in each group of micro-service calling chain data, and calculating z-scores corresponding to the micro-service calling time in each group of micro-service calling chain data and z-scores corresponding to first-order difference data between different micro-service calling times in each group of micro-service calling chain data.

In this embodiment, after the multiple sets of micro-service call chain data are obtained, graph decomposition needs to be performed on each set of micro-service call chain data, and the response time of each service call needs to be calculated. Then, the z-score corresponding to the total micro-service call time in each set of micro-service call chain data and the z-score corresponding to the first-order difference data between different micro-service call times in each set of micro-service call chain data are calculated. The specific calculation method of the z-score is as described in step S32.

Step S34: matching each group of the system log data by using a preset log template to obtain matching scores corresponding to different system log data in each group of the system log data, and calculating the z-scores of the matching scores corresponding to the different system log data in each group of the system log data and the z-score of first-order difference data between the different matching scores corresponding to the different system log data in each group of the system log data.

In this embodiment, after the multiple sets of system log data are obtained, matching and detection need to be performed on each set of system log data by using a corresponding log template, so as to determine the type of the system log data (that is, the system log data belong to a CPU, a memory, or a magnetic disk, etc.), obtain matching scores corresponding to different types of system log data in each set of system log data, and then calculate z-scores of matching scores corresponding to different types of system log data in each set of system log data and z-scores of first-order difference data between different matching scores corresponding to each set of system log data. The specific calculation method of the z-score is shown in step S32.
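The embodiment does not define how a matching score is computed. Purely as an illustrative assumption, the following sketch treats each log template as a regular expression attached to a log type and defines the matching score of a type as the fraction of log lines in the group that match any of its templates. The templates, the log lines and the function name `matching_scores` are hypothetical.

```python
import re

# Hypothetical log templates, keyed by the log type they identify.
LOG_TEMPLATES = {
    "cpu": [re.compile(r"cpu .* (overload|throttl)", re.I)],
    "memory": [re.compile(r"out of memory|oom", re.I)],
    "disk": [re.compile(r"disk .* (full|i/o error)", re.I)],
}


def matching_scores(log_lines, templates=LOG_TEMPLATES):
    """Assumed score: share of lines in the group matching each log type."""
    total = len(log_lines)
    scores = {}
    for log_type, patterns in templates.items():
        hits = sum(
            1 for line in log_lines if any(p.search(line) for p in patterns)
        )
        scores[log_type] = hits / total if total else 0.0
    return scores


# Hypothetical group of system log data from one sliding window.
logs = [
    "kernel: Out of memory: killed process 4242",
    "app: request served in 12 ms",
    "kernel: disk sda1 is full",
    "app: OOM while allocating buffer",
]
scores = matching_scores(logs)
print(scores)
```

The per-type scores obtained for successive groups form a time series, to which the z-score and first-order difference calculations of step S32 can then be applied.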

Step S35: training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data carrying the fault type label, to obtain the trained supervised learning model.

Step S36: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.

Step S37: performing weighted calculation on the standard scores of the corresponding operating state data by using the weight coefficients of the linear parameters, and performing fault location on the service system to be detected based on the weighted scores.

For more specific processing procedures of the steps S35, S36, and S37, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

In the embodiment of the application, the length of the sliding window of the time sequence is determined, and the system performance index data, the micro-service call chain data and the system log data of each service node in the service system to be detected are sampled based on the preset time interval within the time length of the sliding window to obtain multiple groups of system performance index data, multiple groups of micro-service call chain data and multiple groups of system log data; and then calculating each group of system performance index data, micro-service calling time in each group of micro-service calling chain data, matching scores corresponding to different system log data in each group of system log data and z scores of corresponding first-order difference data thereof, so as to perform weighting calculation on the obtained z scores by using weighting coefficients corresponding to different linear parameters extracted from a supervised learning model in the following process, and performing fault location on the service system to be detected based on the weighting scores.

Referring to fig. 5, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:

Step S41: acquiring current operation data of each service node in a service system to be detected; the current operating data includes a plurality of operating state data.

Step S42: standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to the various operating state data respectively.

Step S43: training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data carrying the fault type label, to obtain the trained supervised learning model.

Step S44: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.

Step S45: performing weighted calculation on the standard scores of the corresponding operating state data in each service node by using the weight coefficients of the linear parameters, so as to obtain the weighted score of each service node.

In this embodiment, after the weight coefficient corresponding to each linear parameter is obtained, the weight coefficient is used to perform weighted calculation on the standard score of the corresponding operation state data in each service node in the system to be detected, for example, for three different operation state data in one service node, if the system index type data includes four index types, that is, a CPU, a memory, a disk, and a database, the CPU, the memory, the disk, and the database all have respective standard scores, and the call time of the micro-service call chain data and the matching score of the system log data all have respective standard scores, the weight coefficient of each linear parameter extracted from the model is used to perform weighted calculation on the corresponding standard score, so as to obtain the weighted score of each service node.
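The weighted calculation of step S45 can be illustrated by the following minimal sketch, which assumes that the weighted score of a node is the plain sum of each standard score multiplied by its weight coefficient. Whether any further normalization is applied is not stated in the embodiment; the weight values, the node names and the standard scores below are hypothetical.

```python
def weighted_score(standard_scores, weights):
    """Sum of weight coefficient * standard score over all parameters."""
    missing = set(standard_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight coefficient for: {sorted(missing)}")
    return sum(weights[name] * score for name, score in standard_scores.items())


# Hypothetical weight coefficients extracted from the trained model.
weights = {
    "cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2,
    "call_time": 0.5, "log_match": 0.3,
}

# Hypothetical standard scores of two service nodes.
node_standard_scores = {
    "node-a": {"cpu": 0.5, "memory": 2.0, "disk": 0.0, "database": 0.0,
               "call_time": 1.0, "log_match": 0.0},
    "node-b": {"cpu": 0.25, "memory": 0.0, "disk": 0.0, "database": 0.0,
               "call_time": 0.0, "log_match": 0.0},
}

node_weighted = {
    node: weighted_score(scores, weights)
    for node, scores in node_standard_scores.items()
}
print(node_weighted)
```

Here node-a receives a much higher weighted score than node-b because its memory and call-time standard scores are large and carry large weight coefficients.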

Step S46: screening, from all the service nodes and in descending order of weighted score, a preset number of service nodes whose weighted scores are larger than a preset threshold, so as to determine the target service node with the fault based on the service nodes obtained after screening.

In this embodiment, the weighted score of each service node can be obtained through step S45, and then a preset number of service nodes with weighted scores greater than a preset threshold are screened from all service nodes according to the sequence of weighted scores from large to small, in this embodiment, the preset number is set to 3, and the preset threshold is set to 0.9, that is, after all weighted scores are sorted according to the sequence of large to small, the first 3 service nodes with scores greater than 0.9 are screened, and a target service node with a fault is determined based on the 3 service nodes.
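The screening of step S46 can be illustrated by the following minimal sketch, using the preset number 3 and the preset threshold 0.9 of this embodiment. The function name `screen_nodes` and the node scores are hypothetical.

```python
def screen_nodes(weighted_scores, threshold=0.9, top_k=3):
    """Sort nodes by weighted score in descending order and keep at most
    top_k nodes whose score is strictly greater than the threshold."""
    ranked = sorted(weighted_scores.items(), key=lambda item: item[1], reverse=True)
    return [(node, score) for node, score in ranked if score > threshold][:top_k]


# Hypothetical weighted scores of five service nodes.
scores = {
    "node-a": 0.95, "node-b": 0.40, "node-c": 0.99,
    "node-d": 0.91, "node-e": 0.97,
}
candidates = screen_nodes(scores)
print(candidates)
```

Four nodes exceed the threshold in this example, but only the three with the highest scores are retained as candidates for the target service node.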

The target service node may be determined in two ways. In one specific embodiment, all 3 service nodes are taken as target service nodes that have failed, that is, all 3 service nodes are considered to be faulty. In another specific embodiment, one or two service nodes are further selected from the 3 service nodes, with manual participation and according to a certain rule, as the target service nodes that have failed.

Step S47: screening out the maximum weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the maximum weight coefficient as the corresponding fault root cause.

In this embodiment, after the target service node with the fault is determined, the maximum weight coefficient is screened from the weight coefficients of all the linear parameters corresponding to the target service node, for example, if the weight coefficient of the CPU in one target service node is 0.4, the weight coefficient of the memory is 0.8, the weight coefficient of the disk is 0.8, the weight coefficient of the database is 0.2, and the weight coefficient of the network is 0.5, the maximum weight coefficient in the target service node is 0.8, and then the parameter type of the linear parameter corresponding to the maximum weight coefficient 0.8 is determined, that is, the memory and the disk are determined as the root cause of the fault.
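The root cause selection of step S47 can be illustrated by the following minimal sketch, which uses the weight coefficients of the example above and returns every parameter type that shares the maximum weight coefficient. The function name `root_causes` is hypothetical.

```python
def root_causes(weight_coefficients):
    """Return all parameter types whose weight coefficient is the maximum."""
    top = max(weight_coefficients.values())
    return [name for name, w in weight_coefficients.items() if w == top]


# Weight coefficients from the example in this embodiment.
node_weights = {
    "cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2, "network": 0.5,
}
causes = root_causes(node_weights)
print(causes)
```

Because the memory and the disk share the maximum weight coefficient 0.8, both are reported as the fault root cause, in line with the example in the text.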

For more specific processing procedures of the steps S41, S42, S43, and S44, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

As can be seen, in the embodiment of the present application, the weight coefficients of the linear parameters are used to perform weighted calculation on the standard scores of the operating state data in each service node, so as to obtain the weighted score of each service node. A preset number of service nodes whose weighted scores are larger than a preset threshold are then screened out from all the service nodes in descending order of weighted score, so as to determine the target service node with the fault. Finally, the parameter type of the linear parameter corresponding to the maximum weight coefficient in the target service node is determined as the corresponding fault root cause. Therefore, by calculating and ranking the weighted score of each service node to determine the target service node with the fault, the embodiment of the application can detect faults in each service node of the system to be detected and determine the root cause of the fault in the target service node.

Referring to fig. 6, an embodiment of the present application further discloses a system fault detection apparatus, including:

the data acquisition module 11 is configured to acquire current operating data of each service node in the service system to be detected; the current operating data comprises a plurality of operating state data;

the standardization processing module 12 is configured to standardize the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data;

the model training module 13 is used for training a model to be trained, which is constructed based on a logistic regression algorithm, by using historical operating data with a fault type label to obtain a trained supervised learning model;

a weight coefficient extraction module 14, configured to extract a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;

and the fault positioning module 15 is configured to perform weighted calculation on the corresponding standard scores of the operating state data by using the weighting coefficients of the linear parameters, and perform fault positioning on the service system to be detected based on the weighted scores.

Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the system fault detection method executed by the computer device as disclosed in any of the foregoing embodiments.

In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the computer device 20; the communication interface 24 can create a data transmission channel between the computer device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.

The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing calculation operations related to machine learning.

In addition, the memory 22 serves as a carrier for storing resources and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored thereon include an operating system 221, a computer program 222, data 223, and the like, and the storage may be transient or permanent.

The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the computer device 20, so as to realize the operation and processing of the mass data 223 in the memory 22 by the processor 21, which may be Windows, Unix, Linux, or the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the system fault detection method disclosed in any of the foregoing embodiments and executed by the computer device 20. The data 223 may include data received by the computer device and transmitted from an external device, data collected by the input/output interface 25, and the like.

Further, an embodiment of the present application further discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the method steps executed in the system fault detection process disclosed in any of the foregoing embodiments are implemented.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The system fault detection method, apparatus, device and storage medium provided by the present invention are described in detail above, and a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for system fault detection, comprising:

2. The method according to claim 1, wherein the obtaining current operation data of each service node in the service system to be detected comprises:

3. The method according to claim 2, wherein the obtaining system performance index data, micro service call chain data, and system log data of each service node in the service system to be detected to obtain current operating data of each service node comprises:

determining a time length of a sliding window of the time series;

4. The method for detecting system faults according to claim 3, wherein the normalizing the current operation data by using a preset data normalizing method to obtain standard scores corresponding to various operation state data respectively comprises:

5. The method according to claim 4, wherein the step of normalizing any one set of operation state data in the current operation data comprises:

the optimized mean value calculation formula is as follows:

the optimized variance calculation formula is as follows:

wherein,

6. The system fault detection method according to claim 1, wherein before training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data carrying the fault type label, the method further comprises:

acquiring historical normal operation data and historical fault operation data;

7. The method according to claim 1, wherein the performing weighted calculation on the respective standard scores of the operating state data by using the weighting coefficients of the linear parameters comprises:

8. The system fault detection method according to any one of claims 1 to 7, wherein the performing weighted calculation on the standard scores of the corresponding operating state data by using the weighting coefficients of the linear parameters, and performing fault location on the service system to be detected based on the weighted scores comprises:

9. A system fault detection device, comprising:

10. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program for carrying out the steps of the system fault detection method according to any one of claims 1 to 8.

11. A computer-readable storage medium for storing a computer program; wherein the computer program realizes the steps of the system fault detection method according to any one of claims 1 to 8 when executed by a processor.