WO2023109251A1

WO2023109251A1 - System fault detection method and apparatus, device, and medium

Info

Publication number: WO2023109251A1
Application number: PCT/CN2022/122295
Authority: WO
Inventors: 赵利强
Original assignee: 浪潮电子信息产业股份有限公司
Priority date: 2021-12-17
Filing date: 2022-09-28
Publication date: 2023-06-22
Also published as: CN114328198A

Abstract

The present application discloses a system fault detection method and apparatus, a device, and a medium. The method comprises: acquiring current operation data of each service node in a service system to be detected; standardizing the current operation data by using a preset data standardization method, so as to obtain standard scores respectively corresponding to various operation state data; training, by using historical operation data carrying a fault type label, a model to be trained constructed based on a logistic regression algorithm, so as to obtain a trained supervised learning model; extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; and respectively performing weighted calculation on the standard scores of the corresponding operation state data by using the weight coefficients of the linear parameters, and performing fault positioning on said service system on the basis of weighted scores. According to the present application, the supervised learning model is obtained on the basis of the historical operation data, and the model is used to detect the current operation data by means of the weighted calculation, so as to detect a system fault.

Description

A system fault detection method, device, equipment and medium

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202111554982.5, titled "A System Fault Detection Method, Device, Equipment and Medium" and filed with the China Patent Office on December 17, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of computer systems, in particular to a system fault detection method, device, equipment and medium.

Background Art

The cloud-native environment has four main characteristics: microservices, automated publishing, continuous delivery, and containerization. The microservice architecture offers great advantages in independent deployment, fast delivery, and scalability. At the same time, because a microservice system contains a large number of services, the calling relationships between services become extremely complicated. When the system has a problem, it is difficult for operation and maintenance administrators to find faults and troubleshoot problems quickly, accurately and comprehensively. Therefore, fault detection and root cause location in a service system environment require more intelligent algorithm models.

Currently, in private cloud monitoring, large-scale microservice troubleshooting, cloud-native platform intelligent operation and maintenance, and other service system scenarios with a large amount of operation and maintenance data, the large number of service nodes in the service system makes the call relationships between nodes extremely complicated. Most existing technical methods rely on threshold detection and rule-based alarms to find and troubleshoot faults, so it is often difficult for operation and maintenance personnel to find faults and troubleshoot problems quickly, accurately and comprehensively.

To sum up, how to automatically, quickly, accurately and comprehensively detect and locate faults in the service system is a problem to be solved at present.

Summary of the Invention

In view of this, the purpose of this application is to provide a system fault detection method, device, equipment and medium that can automatically, quickly, accurately and comprehensively detect and locate faults in the service system. The specific solution is as follows:

In a first aspect, the present application discloses a system fault detection method, including:

Obtain the current running data of each service node in the service system to be tested; the current running data includes various running status data;

Use the preset data standardization method to standardize the current operating data to obtain the standard scores corresponding to various operating status data;

Use the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain the trained supervised learning model;

Extract the weight coefficient corresponding to each linear parameter in the supervised learning model after training; wherein, different linear parameters correspond to different operating status data;

The weight coefficients of the linear parameters are used to weight the corresponding standard scores of the operating status data, and based on the weighted scores, the fault location of the service system to be detected is performed.

Optionally, obtain the current running data of each service node in the service system to be tested, including:

Obtain the system performance index data, microservice call chain data and system log data of each service node in the service system to be tested, so as to obtain the current operation data of each service node.

Optionally, obtain the system performance index data, microservice call chain data and system log data of each service node in the service system to be tested, so as to obtain the current operating data of each service node, including:

Determine the time length of the sliding window for the time series;

Within the time length of each sliding window, sample the system performance index data of each service node in the service system to be detected at a first preset time interval, so as to obtain multiple sets of system performance index data corresponding to multiple sliding windows arranged in time order;

Within the time length of each sliding window, sample the microservice call chain data of each service node in the service system to be detected at a second preset time interval, so as to obtain multiple sets of microservice call chain data corresponding to multiple sliding windows arranged in time order;

Within the time length of each sliding window, sample the system log data of each service node in the service system to be detected at a third preset time interval, so as to obtain multiple sets of system log data corresponding to multiple sliding windows arranged in time order.
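The windowed sampling described above can be illustrated with a minimal sketch. It assumes readings are held in memory as time-ordered (timestamp, value) pairs with timestamps in seconds, that windows are consecutive and non-overlapping, and that the first reading in each sampling interval is kept; the application does not specify these details, and the function name `sample_windows` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, value); assumed layout


def sample_windows(readings: Sequence[Reading], window_len: float,
                   interval: float) -> List[List[float]]:
    """Group time-ordered readings into consecutive sliding windows and
    keep one reading per sampling interval inside each window."""
    if not readings:
        return []
    start = readings[0][0]
    groups: Dict[int, Dict[int, float]] = {}  # window index -> slot index -> value
    for ts, value in readings:
        offset = ts - start
        window = int(offset // window_len)
        slot = int((offset - window * window_len) // interval)
        # setdefault keeps the first reading that falls into each slot
        groups.setdefault(window, {}).setdefault(slot, value)
    return [[slots[s] for s in sorted(slots)]
            for _, slots in sorted(groups.items())]
```

The same helper could be applied to each of the three data types with its own interval (the first, second and third preset time intervals), producing one group of samples per sliding window.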

Optionally, the current operating data is standardized using a preset data standardization method to obtain standard scores corresponding to various operating status data, including:

Calculate the z-score corresponding to each set of system performance index data and the z-score corresponding to the first-order difference data between different system performance index data in each set of system performance index data;

Obtain the microservice call time in each set of microservice call chain data, and calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data;

Use the preset log template to match each set of system log data to obtain the matching scores corresponding to different system log data in each set of system log data, and calculate the z-scores of those matching scores and the z-scores of the first-order difference data between different matching scores in each set of system log data.
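As an illustration of the two kinds of standard score mentioned above, the following minimal sketch computes the z-scores of a group of values and the z-scores of its first-order differences. It assumes the population standard deviation and returns zeros for a constant series; the application does not state either choice, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Standard score of each value relative to the group mean and
    (population) standard deviation; a constant group yields zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def first_difference(values: Sequence[float]) -> List[float]:
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


cpu_usage = [41.0, 42.0, 40.0, 43.0, 95.0]   # illustrative readings
level_scores = z_scores(cpu_usage)            # how unusual each level is
change_scores = z_scores(first_difference(cpu_usage))  # how unusual each jump is
```

The z-score of the raw values reflects an unusual level, while the z-score of the first-order differences reflects an unusually sharp change between adjacent samples.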

Optionally, calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data, including:

For each set of microservice call chain data, determine the parent node and child node of the microservice call chain corresponding to the set of microservice call chain data;

Add a call duration and a call direction to the parent node and the child node to represent the call relationship between them;

Calculate the total microservice call time in the group of microservice call chain data based on the call duration and call direction of the parent node and child nodes;

Calculate the z-score corresponding to the total microservice call time in the group of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data.
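The parent/child handling above can be sketched as follows. This is only an illustration: the `parent_id` field is a hypothetical way to record which call invoked which, and the total is taken here as the sum of all call durations in the group, because the application does not specify how the durations and directions are aggregated.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Span:
    trace_id: str             # link number of the call chain
    span_id: str              # unit number of this call
    parent_id: Optional[str]  # hypothetical field; None for the root call
    service_name: str
    duration_ms: float


def call_edges(spans: Sequence[Span]) -> List[Tuple[str, str, float]]:
    """Directed (caller, callee, duration) edges between parent and child."""
    by_id = {s.span_id: s for s in spans}
    edges = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is not None:
            edges.append((parent.service_name, s.service_name, s.duration_ms))
    return edges


def total_call_time(spans: Sequence[Span]) -> float:
    """Assumed aggregation: sum of the durations of all calls in the group."""
    return sum(s.duration_ms for s in spans)
```

The resulting totals, one per group of call chain data, would then be standardized in the same way as any other series of operating status data.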

Optionally, the process of standardizing any set of operating status data in the current operating data includes:

Using the optimized mean calculation formula and the optimized variance calculation formula, respectively calculate the mean and variance corresponding to the group of operating status data, and calculate the z-score of the group of operating status data based on that mean and variance;

The formula for calculating the mean value after optimization is:

The formula for calculating variance after optimization is:

where n represents the data sample size corresponding to the group of operating state data, x_i represents the i-th data sample in the group of operating state data, mean represents the mean value, and s² represents the variance.

Optionally, the mean value and variance corresponding to the group of operating status data are respectively calculated by using the optimized mean value calculation formula and the optimized variance calculation formula, including:

For any set of running status data, use the preset target queue to maintain the data samples in the set of running status data;

Obtain data samples in the target queue;

According to the data samples, the mean value and variance corresponding to the group of operating status data are respectively calculated by using the optimized mean value calculation formula and the optimized variance calculation formula.
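One common way to maintain a group of samples in a queue and keep its mean and variance cheap to recompute is to track a running sum and a running sum of squares. The sketch below follows that approach; it is an assumption for illustration and not necessarily the optimized formulas of this application. It uses the sample variance s² = (Σx_i² − n·mean²) / (n − 1), and the class name `RollingStats` is hypothetical.

```python
import math
from collections import deque


class RollingStats:
    """Fixed-capacity queue of samples with O(1) mean/variance updates."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue = deque()
        self.total = 0.0      # running sum of the queued samples
        self.total_sq = 0.0   # running sum of their squares

    def push(self, x: float) -> None:
        # Drop the oldest sample once the queue is full
        if len(self.queue) == self.capacity:
            old = self.queue.popleft()
            self.total -= old
            self.total_sq -= old * old
        self.queue.append(x)
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        return self.total / len(self.queue)

    def variance(self) -> float:
        n = len(self.queue)
        if n < 2:
            return 0.0
        m = self.mean()
        # Clamp at zero to absorb floating-point rounding
        return max((self.total_sq - n * m * m) / (n - 1), 0.0)

    def z_score(self, x: float) -> float:
        sd = math.sqrt(self.variance())
        return 0.0 if sd == 0 else (x - self.mean()) / sd
```

With this layout, adding a new sample never requires re-scanning the whole queue, which matters when the statistics are refreshed for every sliding window.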

Optionally, before training the model to be trained based on the logistic regression algorithm using the historical operation data carrying the fault type label, it also includes:

Obtain historical normal operation data and historical fault operation data;

Add label information including the corresponding running time interval label and the non-fault type label to the historical normal operation data to obtain the first historical operation data as a negative sample;

Add label information including the corresponding running time interval label and fault type label to the historical fault operation data, and resample the historical fault operation data with the added label information to obtain the second historical operation data as a positive sample, so that the ratio between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset ratio of positive and negative samples.

Optionally, the weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data, including:

The expert knowledge used to optimize the weight coefficient of the linear parameter is obtained through the preset expert knowledge acquisition interface;

Use expert knowledge to adjust the weight coefficient of the linear parameter accordingly to obtain the adjusted weight coefficient of the linear parameter;

The adjusted weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data.

Optionally, the expert knowledge used to optimize the weight coefficient of the linear parameter is obtained through the preset expert knowledge acquisition interface, including:

Extract optimization data from the optimization weight coefficient model established based on the historical expert knowledge base;

Obtain the optimization data through the preset expert knowledge acquisition interface, and use the optimization data as the expert knowledge for optimizing the weight coefficient of the linear parameter;

or, obtain an instruction for manually optimizing the weight coefficient through the preset expert knowledge acquisition interface, and use that instruction as the expert knowledge for optimizing the weight coefficient of the linear parameter.

Optionally, use expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters, including:

Determine the fault type label corresponding to the linear parameter;

In the case that the fault type label is not included in the second historical operation data, the weight coefficient of the linear parameter corresponding to the fault type label is adjusted by using expert knowledge, so that the adjusted weight coefficient of the linear parameter is greater than the original weight coefficient of the linear parameter.

Optionally, use the weight coefficient of the linear parameter to perform weighted calculations on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores, including:

Use the weight coefficient of the linear parameter to carry out weighted calculation on the standard score of the corresponding running status data in each service node, so as to obtain the weighted score of each service node;

Screen out a preset number of service nodes with a weighted score greater than a preset threshold from all service nodes in order of weighted scores from large to small, so as to determine a faulty target service node based on the service nodes obtained after screening;

The maximum weight coefficient is selected from the weight coefficients of all linear parameters corresponding to the target service node, and the parameter type of the linear parameter corresponding to the maximum weight coefficient is determined as the corresponding fault root cause.
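The scoring, screening and root-cause steps above can be sketched as follows. The data layout (per-node dictionaries of standard scores keyed by parameter name) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def weighted_score(z: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of weight coefficient times standard score over all parameters."""
    return sum(w * z.get(name, 0.0) for name, w in weights.items())


def locate_faults(node_scores: Dict[str, Dict[str, float]],
                  weights: Dict[str, float],
                  threshold: float, top_k: int) -> List[Tuple[str, float]]:
    """Rank nodes by weighted score (descending) and keep at most top_k
    nodes whose score exceeds the threshold."""
    scored = {node: weighted_score(z, weights) for node, z in node_scores.items()}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [(node, s) for node, s in ranked if s > threshold][:top_k]


def root_cause(node_z: Dict[str, float], weights: Dict[str, float]) -> str:
    """Parameter of the target node that carries the largest weight coefficient."""
    return max((p for p in node_z if p in weights), key=lambda p: weights[p])


weights = {"cpu": 0.5, "latency": 2.0, "log": 1.0}   # illustrative values
nodes = {
    "a": {"cpu": 1.0, "latency": 3.0, "log": 0.0},
    "b": {"cpu": 0.2, "latency": 0.1, "log": 0.0},
    "c": {"cpu": 4.0, "latency": 1.0, "log": 1.0},
}
suspects = locate_faults(nodes, weights, threshold=1.0, top_k=2)
```

In this example, nodes "a" and "c" exceed the threshold and are returned in descending order of score, and the parameter with the largest weight coefficient on node "a" is reported as the root cause type.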

Optionally, a preset number of service nodes with a weighted score greater than a preset threshold are screened out from all service nodes in descending order of weighted score, so as to determine a faulty target service node based on the service nodes obtained after screening, including:

Screen out a preset number of service nodes with a weighted score greater than a preset threshold from all service nodes in descending order of weighted scores;

Determine the screened-out service nodes whose weighted scores are greater than the preset threshold as the faulty target service nodes;

or, in response to the user's selection operation on the screened-out service nodes whose weighted scores are greater than the preset threshold, determine the service node selected by the user;

and determine the service node selected by the user as the faulty target service node.

Optionally, before using the preset data standardization method to standardize the current operating data, the method further includes:

Perform data cleaning processing on the current operating data, and the data cleaning processing includes one or more of the following: removing duplicate data in the current operating data, supplementing missing data in the current operating data, and correcting erroneous data in the current operating data.
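A minimal sketch of the three cleaning operations follows. It assumes samples are (timestamp, value) pairs, treats repeated timestamps as duplicates, treats values outside a valid range as erroneous, and fills both missing and erroneous values with the last valid value; none of these rules is specified by the application, and the function name `clean` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[int, Optional[float]]  # (timestamp, value or None if missing)


def clean(samples: Sequence[Sample], lo: float, hi: float) -> List[Tuple[int, float]]:
    """Remove duplicates, fill missing values, and correct out-of-range values."""
    seen = set()
    cleaned: List[Tuple[int, float]] = []
    last_good: Optional[float] = None
    for ts, value in sorted(samples, key=lambda s: s[0]):
        if ts in seen:          # duplicate timestamp: keep the first occurrence
            continue
        seen.add(ts)
        if value is None or not (lo <= value <= hi):
            if last_good is None:
                continue        # nothing valid yet to fill with
            value = last_good   # assumed rule: carry the last valid value forward
        last_good = value
        cleaned.append((ts, value))
    return cleaned
```

For a percentage metric, for example, `clean(samples, 0.0, 100.0)` would drop repeated timestamps and replace gaps or impossible readings before standardization.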

In a second aspect, the present application discloses a system fault detection device, including:

The data acquisition module is used to acquire the current operation data of each service node in the service system to be detected; the current operation data includes various operation status data;

The standardization processing module is used to standardize the current operation data by using a preset data standardization method, so as to obtain standard scores corresponding to various operation state data;

The model training module is used to train the model to be trained based on the logistic regression algorithm by using the historical operation data carrying the fault type label to obtain the trained supervised learning model;

The weight coefficient extraction module is used to extract the weight coefficient corresponding to each linear parameter in the supervised learning model after training; wherein, different linear parameters correspond to different operating state data;

The fault location module is configured to use the weight coefficient of the linear parameter to carry out weighted calculation on the standard scores of the corresponding operation status data, and perform fault location on the service system to be detected based on the weighted scores.

In a third aspect, the present application discloses an electronic device, comprising:

a memory, used to store a computer program;

a processor, used to execute the computer program to implement the steps of the system fault detection method disclosed above.

In a fourth aspect, the present application discloses a non-volatile computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, the steps of the system fault detection method disclosed above are implemented.

It can be seen that this application first obtains the current operating data of each service node in the service system to be tested, where the current operating data includes various operating state data; then standardizes the current operating data using the preset data standardization method to obtain the standard scores corresponding to the various operating status data; then uses the historical operation data carrying the fault type label to train the model to be trained, constructed based on the logistic regression algorithm, to obtain the trained supervised learning model; extracts the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating status data; and finally uses the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating status data, and performs fault location on the service system to be detected based on the weighted scores. In other words, the weight coefficient corresponding to each linear parameter is obtained from the trained supervised learning model, and the standard scores corresponding to the operating status data of each service node are weighted to obtain weighted scores. The weighted scores are sorted to filter out those that meet the preset conditions, and the service nodes corresponding to those weighted scores are determined, thereby locating the fault within the system. The root cause of the system fault is then determined according to the weight coefficients of the component information of the determined service node, which improves the efficiency of fault location.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application or the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative work.

FIG. 1 is a flow chart of a system fault detection method disclosed in the present application;

FIG. 2 is a flow chart of a specific system fault detection method disclosed in the present application;

FIG. 3 is a flow chart of a specific system fault detection method disclosed in the present application;

FIG. 4 is a flow chart of a specific fault detection and root cause determination process disclosed in the present application;

FIG. 5 is a flow chart of a specific system fault detection method disclosed in the present application;

FIG. 6 is a schematic structural diagram of a system fault detection device disclosed in the present application;

FIG. 7 is a structural diagram of an electronic device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments of the application. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of this application.

In private cloud monitoring, large-scale microservice troubleshooting, cloud-native platform intelligent operation and maintenance, and other service system scenarios with a large amount of operation and maintenance data, when a problem occurs in the service system, the large number of service nodes makes the call relationships among them extremely complicated. Existing technical methods mostly use threshold detection and rule-based alarms to find and troubleshoot faults, so it is often difficult for operation and maintenance personnel to find faults and troubleshoot problems quickly, accurately and comprehensively. To this end, the embodiment of the present application discloses a system fault detection method, device, equipment and medium, which can automatically, quickly, accurately and comprehensively detect and locate faults in the service system.

Referring to Fig. 1, the embodiment of the present application discloses a system fault detection method, which includes:

Step S11: Obtain the current operating data of each service node in the service system to be tested; the current operating data includes various operating state data.

In this embodiment, the current operating data of each service node in the service system to be tested needs to be obtained. The service system to be tested may include, but is not limited to, any of a private cloud monitoring system, a large microservice system, or a cloud-native platform intelligent operation and maintenance system. Moreover, the current operating data of each service node includes various operating state data, that is, various data that can characterize the operating state of the service system to be detected.

Step S12: Standardize the current operating data by using a preset data standardization method to obtain standard scores corresponding to various operating status data.

In this embodiment, after the current operating data is obtained, it must be taken into account that different operating state data in the current operating data have different dimensions, data units, or orders of magnitude. To make different operating status data comparable and suitable for weighting, the current operating data is standardized using the preset data standardization method, which converts it into dimensionless pure values. The result of the standardization is the standard score corresponding to each kind of operating status data.

Step S13: Use the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain the trained supervised learning model.

In this embodiment, pre-prepared historical operating data carrying fault type labels is used to train the model to be trained to obtain a trained supervised learning model. The model to be trained in this embodiment is constructed based on a logistic regression algorithm; that is, this embodiment uses a logistic regression classifier for supervised learning training. It should be noted that the purpose of building a supervised learning model with logistic regression classification is not to use supervised learning directly to detect faults and locate root causes in the system to be detected, but to use the logistic regression calculation process, together with the limited historical operating data carrying fault type labels, to adjust the weights of the corresponding linear parameters in the model.

Step S14: extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; where different linear parameters correspond to different operating state data.

In this embodiment, as noted for step S13, the purpose of using the logistic regression classification algorithm to establish a supervised learning model is to adjust the weights of the corresponding linear parameters in the model. Therefore, after the model training is completed, the weight coefficient of each linear parameter is extracted from the supervised learning model. It should be pointed out that different linear parameters correspond to different operating state data.
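Steps S13 and S14 can be illustrated with a small, dependency-free sketch that fits a logistic regression by batch gradient descent and then reads out one weight coefficient per input feature. The training routine, learning rate, number of epochs, feature names and toy data are all assumptions for illustration; a practical implementation would more likely rely on an existing machine learning library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(t: float) -> float:
    # Numerically stable form for both signs of t
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   lr: float = 0.1, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy labeled history: standard scores of two (hypothetical) parameters;
# label 1 marks a fault sample, label 0 a normal sample.
feature_names = ["latency_z", "log_match_z"]
X = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.0],
     [-0.1, 0.1], [0.2, -0.1], [0.0, 0.2], [-0.3, 0.0], [0.1, -0.2]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

w, b = train_logistic(X, y)
weight_coefficients: Dict[str, float] = dict(zip(feature_names, w))
```

Only `weight_coefficients` is carried forward to the weighting step; consistent with the description above, the classifier's own predictions are not used for fault detection.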

Step S15: Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.

In this embodiment, using the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating status data may also include: obtaining, through the preset expert knowledge acquisition interface, expert knowledge used to optimize the weight coefficients of the linear parameters; using the expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain adjusted weight coefficients; and using the adjusted weight coefficients to perform weighted calculation on the standard scores of the corresponding operating status data. It can be understood that after the weight coefficients are extracted from the supervised learning model, corresponding expert knowledge can be obtained through the preset expert knowledge acquisition interface and used to adjust the weight coefficients of different linear parameters. The expert knowledge can be optimization data that is extracted from the optimization weight coefficient model established based on the historical expert knowledge base and obtained through the preset expert knowledge acquisition interface, which shortens the weight coefficient optimization process and reduces the waste of human resources; it can also be an instruction for manually optimizing the weight coefficients, obtained through the same interface. Optimizing the weight coefficients with expert knowledge can improve the accuracy and comprehensiveness of the subsequent fault location of the service system to be detected. The purpose of adjusting the weight coefficients with expert knowledge is to increase the sensitivity of the model to system faults that have not been encountered before.

In an optional embodiment, the process of using expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters may include:

Determine the fault type label corresponding to the linear parameter; in the case that the second historical operation data does not include the fault type label, use expert knowledge to adjust the weight coefficient of the linear parameter corresponding to the fault type label, so that the adjusted weight coefficient of the linear parameter is greater than the original weight coefficient of the linear parameter. Since different operating state data correspond to different linear parameters, the fault type label of the linear parameter corresponding to the operating state data can be determined according to the fault type label corresponding to the operating state data. For example, the network-related performance indicator "system.tcp.syn_recv" is often related to network failures, so the fault type label corresponding to this indicator belongs to the network failure category. If, however, the historical operating data contains no fault type label indicating this kind of system fault, the weight coefficient of the linear parameter corresponding to this type of fault needs to be adjusted appropriately. Here, TCP (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream-based transport layer communication protocol. syn_recv refers to the state that the server enters after a passive open, when it has received the client's SYN and sent its acknowledgment. SYN (synchronize sequence numbers) is the handshake signal used by TCP/IP to establish a connection. ACK (acknowledge character) is an acknowledgment character: in data communication, it is a transmission control character sent by the receiving station to the sending station to indicate that the transmitted data has been received without error.
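The adjustment rule in this optional embodiment can be sketched as follows. The sketch assumes that expert knowledge arrives as a positive increment per parameter, which guarantees that an adjusted weight is greater than the original; the application does not specify the form of the adjustment, and the function name, the `cpu.usage` parameter and all numeric values are hypothetical.

```python
from typing import Dict, Set


def adjust_weights(weights: Dict[str, float],
                   param_fault_type: Dict[str, str],
                   seen_fault_types: Set[str],
                   expert_increment: Dict[str, float]) -> Dict[str, float]:
    """Raise the weight of every parameter whose fault type never appears
    in the labeled fault history; leave all other weights unchanged."""
    adjusted = dict(weights)
    for param, fault_type in param_fault_type.items():
        if param not in adjusted or fault_type in seen_fault_types:
            continue
        delta = expert_increment.get(param, 0.0)
        if delta > 0:  # only strictly positive increments are applied
            adjusted[param] = adjusted[param] + delta
    return adjusted


weights = {"system.tcp.syn_recv": 0.25, "cpu.usage": 0.8}
param_fault_type = {"system.tcp.syn_recv": "network", "cpu.usage": "cpu"}
seen_fault_types = {"cpu"}  # no network fault in the labeled history
expert_increment = {"system.tcp.syn_recv": 0.5, "cpu.usage": 0.3}

adjusted = adjust_weights(weights, param_fault_type, seen_fault_types,
                          expert_increment)
```

Here only the weight of "system.tcp.syn_recv" is raised, because its fault type (network) is absent from the labeled fault history, while the weight learned for the CPU parameter is kept as trained.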

It can be seen that, in the embodiment of the present application, the current operating data of each service node in the service system to be tested is first obtained, where the current operating data includes various operating state data; the current operating data is then standardized using the preset data standardization method to obtain the standard scores corresponding to the various operating status data; the historical operating data carrying the fault type label is then used to train the model to be trained, constructed based on the logistic regression algorithm, to obtain the trained supervised learning model; the weight coefficient corresponding to each linear parameter in the trained supervised learning model is extracted, where different linear parameters correspond to different operating status data; and finally the weight coefficients of the linear parameters are used to perform weighted calculation on the standard scores of the corresponding operating status data, and fault location is performed on the service system to be tested based on the weighted scores. In other words, the weight coefficient corresponding to each linear parameter is obtained from the trained supervised learning model, and the standard scores corresponding to the operating status data of each service node are weighted to obtain weighted scores. The weighted scores are sorted to filter out those that meet the preset conditions, and the service nodes corresponding to those weighted scores are determined, thereby locating the fault within the system. The root cause of the system fault is then determined according to the weight coefficients of the component information of the determined service node, which improves the efficiency of fault location.

Referring to FIG. 2, the embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, the method includes:

Step S21: Obtain the system performance index data, microservice call chain data and system log data of each service node in the service system to be tested, so as to obtain the current operation data of each service node.

In this embodiment, the current operation data of each service node in the service system to be tested needs to be obtained. The current operation data mainly includes the following three types of operation status data: system performance index data, microservice call chain data and system log data. The types of indicators in the system performance index data may include, but are not limited to, any one or more of CPU (Central Processing Unit), memory, disk, database, JVM (Java Virtual Machine), network, I/O (Input/Output), and HA (High Availability, that is, a two-machine cluster). In the microservice call chain data, each set of call chain data includes the link number (TraceId), the unit number of each call (SpanId), the service name of the call (ServiceName), the physical unit (CmdbId), and the call duration (Duration). The system log data includes multiple sets of log data corresponding to the above system performance indicators.

Step S22: Use a preset data standardization method to standardize the current operating data to obtain standard scores corresponding to various operating status data.

Step S23: Obtain historical normal operation data and historical fault operation data.

In this embodiment, the historical normal operation data and historical fault operation data already existing in the system need to be obtained. Only a small fraction (less than 1%) of the operating data involves faults.

Step S24: Add label information including the corresponding running time interval label and non-fault type label to the historical normal operation data to obtain the first historical operation data as a negative sample.

In this embodiment, the historical normal operation data obtained above is used as negative samples. Label information including the corresponding running time interval label and a non-fault type label is added to the historical normal operation data to obtain the first historical operation data. The non-fault type label indicates that the operating data will not cause a system failure.

Step S25: Add label information including the corresponding running time interval label and fault type label to the historical fault operation data, and resample the labeled historical fault operation data to obtain the second historical operation data as positive samples, so that the ratio between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset ratio of positive and negative samples.

In this embodiment, similarly to step S24, the historical fault operation data obtained in step S23 is used as positive samples. Label information including the corresponding running time interval label and a fault type label is added to the historical fault operation data; the fault type label indicates that the operating data will cause a system failure. It should be pointed out that most of the operating data in the system is normal and only a very small amount involves faults, so the ratio between the positive samples obtained from the historical fault operation data and the negative samples obtained from the historical normal operation data is highly unbalanced. To address this imbalance, the labeled historical fault operation data needs to be resampled to increase the proportion of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches a preset ratio of positive and negative samples. In this embodiment, the preset ratio of positive to negative samples is 1:10.
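The resampling step can be sketched as follows. The description does not specify the resampling technique, so random oversampling with replacement is assumed here; the function name `oversample_positives`, its parameters and the sample identifiers are hypothetical.

```python
import random


def oversample_positives(positives, negatives, pos_per_neg=0.1, seed=0):
    """Resample positives with replacement until positives:negatives reaches pos_per_neg.

    pos_per_neg=0.1 corresponds to the 1:10 ratio of this embodiment.
    Random oversampling is an assumption; the text only says "resample".
    """
    if not positives:
        raise ValueError("at least one positive sample is required")
    target = int(round(len(negatives) * pos_per_neg))
    if target <= len(positives):
        return list(positives)  # already at or above the preset ratio
    rng = random.Random(seed)
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return list(positives) + extra


negatives = [f"normal-{i}" for i in range(1000)]  # first historical operation data
positives = [f"fault-{i}" for i in range(5)]      # labeled historical fault data
resampled = oversample_positives(positives, negatives)
```

With 1000 negative samples and 5 original positive samples, `resampled` contains 100 positive samples, every one of them drawn from the original 5.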

Step S26: Train the model to be trained, constructed based on the logistic regression algorithm, using the first historical operation data and the second historical operation data to obtain a trained supervised learning model.

In this embodiment, the model to be trained based on the logistic regression algorithm is trained by using the first historical operating data and the second historical operating data conforming to the preset ratio of positive and negative samples to obtain a trained supervised learning model.
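A minimal sketch of training a logistic regression model on labeled samples and reading back its linear weight coefficients is given below. It uses plain batch gradient descent in standard-library Python; the feature names, sample values, learning rate and epoch count are all assumptions, and the application does not state which optimizer or library is used.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = sum(w * v for w, v in zip(weights, row)) + bias
            error = sigmoid(z) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Hypothetical features per sample: [cpu z-score, memory z-score]; label 1 = fault.
X = [[0.1, 0.0], [0.2, -0.1], [-0.3, 0.1], [2.5, 0.2], [3.0, -0.2], [2.8, 0.1]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
coefficients = dict(zip(["cpu", "memory"], weights))  # one coefficient per linear parameter
```

In this toy data set only the CPU feature separates faulty from normal samples, so the learned CPU coefficient dominates the memory coefficient.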

Step S27: extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.

Step S28: Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.

Wherein, for more specific processing procedures of the above-mentioned steps S22, S27 and S28, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.

It can be seen that, in this embodiment of the present application, the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected are obtained, and the corresponding standard scores are calculated using the preset data standardization method. The obtained historical normal operation data and historical fault operation data are marked with the corresponding label information and used as the sources of negative and positive samples respectively. To address the imbalance between the numbers of positive and negative samples, the labeled historical fault operation data is resampled to increase the proportion of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches the preset ratio of positive and negative samples. Finally, the model to be trained, constructed based on logistic regression, is trained using the first historical operation data and the second historical operation data to obtain a supervised learning model, and the corresponding weight coefficients are extracted from it to weight the standard scores of the operating state data and thereby locate faults in the service system to be detected. In this way, a supervised learning model is established from a small amount of historical fault operation data together with historical normal operation data, and the model can then be used, through stream computing, to detect the various operating state data in the current operating data.

Referring to FIG. 3 and FIG. 4, an embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, the method includes:

Step S31: Determine the time length of the sliding window of the time series, and, within the time length of each sliding window, sample the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected based on the first preset time interval, the second preset time interval and the third preset time interval respectively, so as to obtain multiple sets of system performance index data, multiple sets of microservice call chain data and multiple sets of system log data that are arranged in time series and correspond to multiple sliding windows.

In this embodiment, the length of the sliding window of the time series is first determined. Here the length of the sliding window is set to 30 minutes, and the current operating data is sampled within those 30 minutes based on different preset time intervals. Specifically, the three kinds of operating state data in the current operating data of each service node in the service system to be detected, namely system performance index data, microservice call chain data and system log data, are sampled at the first preset time interval, the second preset time interval and the third preset time interval respectively. After sampling, multiple sets of system performance index data, multiple sets of microservice call chain data and multiple sets of system log data are obtained, each arranged in time series and corresponding to multiple sliding windows. It should be noted that the first, second and third preset time intervals may be equal or unequal to one another. For example, for system performance index data, the first preset time interval can be set to 1 minute; sampling within a 30-minute sliding window then yields 30 sampling points, and the data values corresponding to these 30 sampling points are arranged in time series to obtain one set of system performance index data. The same applies to microservice call chain data and system log data: the number of sampling points is determined by the preset time interval. When the preset time intervals differ, the numbers of sampling points obtained also differ; when the preset time intervals are the same, the numbers of sampling points are also the same.
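The relationship between window length, preset time interval and number of sampling points can be sketched as follows. The 30-minute window and the 1-minute interval for performance indexes come from this embodiment; the 5-minute intervals for call chain and log data, and all names in the code, are assumed for illustration only.

```python
def samples_per_window(window_minutes, interval_minutes):
    """Number of sampling points obtained in one sliding window."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    return int(window_minutes // interval_minutes)


WINDOW_MINUTES = 30  # sliding window length used in this embodiment

# First, second and third preset time intervals, in minutes.
# Only the 1-minute value is taken from the text; the other two are assumptions.
intervals = {"performance_index": 1, "call_chain": 5, "system_log": 5}

points = {
    name: samples_per_window(WINDOW_MINUTES, step)
    for name, step in intervals.items()
}
```

Equal intervals give equal numbers of sampling points, and different intervals give different numbers, as stated above.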

Step S32: Calculate the z-score corresponding to each set of system performance index data and the z-score corresponding to the first-order difference data between different system performance index data in each set of system performance index data.

In this embodiment, after the multiple sets of system performance index data are obtained, each set needs to be classified and standardized according to its data index type. For example, when the data index type is CPU, only the obtained index data related to the CPU needs to be calculated. This embodiment mainly uses the z-score standardization method: the z-score corresponding to each set of system performance index data is calculated, and the z-score corresponding to the first-order difference data between different system performance index data in each set is also calculated. It should be noted that, before the system performance index data is standardized, the acquired data needs to be cleaned: incomplete data is supplemented, and erroneous data is corrected or deleted, so as to improve data quality and reduce the error rate when the data is used.

The z-score is calculated as follows:

$$\mathrm{Metric}_i = \frac{\mathrm{value}_i - \mathrm{mean}}{\mathrm{std}}$$

where Metric_i is the z-score, for the corresponding data index type, of the i-th sampling point in a group of operating state data; value_i is the data value of that sampling point; mean is the mean of the group of operating state data within a sliding window; and std is the standard deviation of the group of operating state data within a sliding window.
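The z-score of a group of operating state data and of its first-order difference data can be sketched in Python as follows. The use of the population standard deviation, the convention of returning zeros for a constant series, and the sample CPU values are assumptions of this sketch rather than details stated in the application.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """z-score of every sampling point in one window: (value - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return [0.0] * len(values)  # assumed convention for a constant series
    return [(v - mean) / std for v in values]


def first_difference(values):
    """First-order difference between consecutive sampling points."""
    return [b - a for a, b in zip(values, values[1:])]


cpu = [40.0, 42.0, 41.0, 43.0, 90.0]  # hypothetical CPU usage samples in one window
cpu_z = z_scores(cpu)
cpu_diff_z = z_scores(first_difference(cpu))
```

The sudden jump in the last sample produces the largest z-score both in the raw series and in the first-order difference series.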

In the traditional formula for calculating mean and variance:

In stream computing, the data volume is huge, so calculating the mean and variance with the traditional formulas performs poorly. This application therefore proposes a method for calculating the mean and variance with a time complexity of O(1), which can effectively improve the performance of the model. The method optimizes the above formulas as follows:

After expanding the traditional formula for calculating variance, we can get:

Reorder

Then you can get:

The formula for calculating the mean value after optimization is:

The formula for calculating the variance after optimization is:

That is to say, in this embodiment of the present application, any group of operating state data in the current operating data can be standardized as follows: the mean and variance corresponding to the group of operating state data are calculated using the optimized mean formula and the optimized variance formula respectively, and the z-score corresponding to the group is then calculated from that mean and variance. In the formulas, n represents the data sample size of the group of operating state data, x_i represents the i-th data sample in the group, mean represents the mean value, and s² represents the variance. It can be understood that, both for the z-score of the group of operating state data itself and for the z-score of its first-order difference data, the optimized mean and variance formulas can be used to calculate the corresponding mean and variance, from which the corresponding z-scores are then obtained.
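One derivation consistent with the description above is sketched here. It assumes the population variance (division by n) and assumes that S_0 denotes the running sum of the samples in the window and S_1 the running sum of their squares; the exact definitions of S_0 and S_1 are assumptions of this sketch.

$$
\begin{aligned}
\mathrm{mean} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mathrm{mean}\right)^2 \\
s^2 &= \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - 2\,\mathrm{mean}\sum_{i=1}^{n} x_i + n\,\mathrm{mean}^2\right) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mathrm{mean}^2 \\
S_0 &= \sum_{i=1}^{n} x_i, \quad S_1 = \sum_{i=1}^{n} x_i^2 \;\Longrightarrow\; \mathrm{mean} = \frac{S_0}{n}, \quad s^2 = \frac{S_1}{n} - \left(\frac{S_0}{n}\right)^2
\end{aligned}
$$

Under these assumptions, when a new sample x_new enters the window and the oldest sample x_old leaves it, the sums are updated as S_0 ← S_0 + x_new − x_old and S_1 ← S_1 + x_new² − x_old², which costs a constant number of operations regardless of the window size.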

In this way, it is only necessary to maintain a queue of the x_i within the interval in order to update the values of S0 and S1 quickly, and the standard score corresponding to the current operating state data can be calculated with O(1) time complexity. That is, no matter how large the data set is, the mean and the variance of a group of operating state data within a sliding window can be obtained in a single calculation. This speeds up the calculation of the mean and variance of a group of operating state data within a sliding window, and thereby improves the speed of calculating standard scores and the efficiency of system fault detection.
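The queue-based update can be sketched in Python as follows. The sketch assumes that S0 is the running sum of the samples and S1 the running sum of their squares, uses a fixed number of samples per window, and uses the population variance; the class name `RollingZScore` and its methods are hypothetical.

```python
import math
from collections import deque


class RollingZScore:
    """Sliding-window mean, variance and z-score with O(1) work per new sample."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.values = deque()  # queue of the x_i currently inside the window
        self.s0 = 0.0          # running sum of samples (assumed meaning of S0)
        self.s1 = 0.0          # running sum of squared samples (assumed meaning of S1)

    def push(self, x):
        """Add a sample and evict the oldest one once the window is full."""
        self.values.append(x)
        self.s0 += x
        self.s1 += x * x
        if len(self.values) > self.window_size:
            old = self.values.popleft()
            self.s0 -= old
            self.s1 -= old * old

    def mean(self):
        return self.s0 / len(self.values)

    def variance(self):
        m = self.mean()
        # Clamp at zero to absorb small negative values caused by rounding.
        return max(self.s1 / len(self.values) - m * m, 0.0)

    def z_score(self, x):
        std = math.sqrt(self.variance())
        return 0.0 if std == 0 else (x - self.mean()) / std


roller = RollingZScore(window_size=3)
for sample in [1.0, 2.0, 3.0, 4.0]:
    roller.push(sample)
```

After the four pushes the window holds 2.0, 3.0 and 4.0. One design caveat: subtracting two large sums can lose precision when the values are large relative to their spread, so a production implementation might prefer a compensated or Welford-style update.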

Step S33: Obtain the microservice call time in each set of microservice call chain data, and calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data.

In this embodiment, after the multiple sets of microservice call chain data are obtained, each set of call chain data needs to be decomposed as a graph and the response time of each service call calculated. In addition, since each call chain contains parent nodes and child nodes, the call duration and the call direction representing the calling relationship need to be added for both the parent node and the child node of each set of call chain data. Based on the call time and call duration of the microservices, the total microservice call time in each set of microservice call chain data is obtained. The z-score corresponding to the total microservice call time in each set of microservice call chain data, and the z-score corresponding to the first-order difference data between different microservice call times in each set, are then calculated. For the specific calculation of the z-score, refer to step S32.
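The decomposition of a call chain into directed parent-to-child calls with durations, and the total call time per chain, can be sketched as follows. The tuple layout, the parent span field and the treatment of each duration as the call's own time (so that summing does not double count) are assumptions of this sketch; the application does not specify how the total is aggregated.

```python
from collections import defaultdict


def decompose(spans):
    """Return (edges, totals) for a list of call records.

    Each record is (trace_id, span_id, parent_span_id or None, service_name, duration_ms).
    edges holds (caller service, callee service, duration) for every parent -> child call;
    totals maps each trace_id to the sum of its call durations.
    """
    name_of = {(t, s): name for t, s, _parent, name, _d in spans}
    edges = []
    totals = defaultdict(float)
    for trace_id, _span_id, parent_id, name, duration in spans:
        totals[trace_id] += duration
        if parent_id is not None:
            edges.append((name_of[(trace_id, parent_id)], name, duration))
    return edges, dict(totals)


spans = [
    ("t1", "a", None, "gateway", 5.0),
    ("t1", "b", "a", "orders", 12.0),
    ("t1", "c", "b", "db", 30.0),
    ("t2", "a", None, "gateway", 4.0),
]
edges, totals = decompose(spans)
```

The per-chain totals collected over a sliding window form the series whose z-score and first-order difference z-score are then calculated as in step S32.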

Step S34: Match each set of system log data against the preset log templates to obtain the matching scores corresponding to different system log data in each set of system log data, and calculate the z-scores of the matching scores corresponding to different system log data in each set of system log data and the z-scores of the first-order difference data between the different matching scores corresponding to each set of system log data.

In this embodiment, after the multiple sets of system log data are obtained, each set needs to be matched against the corresponding log templates. The purpose is to determine the type of the system log data (that is, whether it belongs to CPU, memory, disk, and so on) and to obtain the matching scores corresponding to the different system log data types in each set of system log data. The z-scores of the matching scores corresponding to different system log data in each set, and the z-scores of the first-order difference data between the different matching scores, are then calculated. For the specific calculation of the z-score, refer to step S32.
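Template matching of log data can be sketched as follows. The application does not list the actual log templates or define how the matching score is computed, so the regular expressions below are hypothetical and the matching score is assumed to be the number of log lines in a window that match each type's template.

```python
import re

# Hypothetical templates, one per index type; the real templates are not given in the text.
LOG_TEMPLATES = {
    "cpu": re.compile(r"cpu (usage|load)", re.IGNORECASE),
    "memory": re.compile(r"out of memory|oom", re.IGNORECASE),
    "disk": re.compile(r"disk (full|i/o error)", re.IGNORECASE),
}


def matching_scores(log_lines):
    """Count, per index type, how many log lines match that type's template."""
    scores = {name: 0 for name in LOG_TEMPLATES}
    for line in log_lines:
        for name, pattern in LOG_TEMPLATES.items():
            if pattern.search(line):
                scores[name] += 1
    return scores


window_logs = [
    "WARN cpu usage above 95%",
    "ERROR Out of memory: kill process 4242",
    "ERROR disk full on /var",
    "WARN CPU load high",
]
scores = matching_scores(window_logs)
```

The per-window scores for each type then form a time series that can be standardized with the z-score method of step S32.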

Step S35: Using the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain a trained supervised learning model.

Step S36: Extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.

Step S37: Using the weight coefficient of the linear parameter to perform weighted calculations on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.

Wherein, for more specific processing procedures of the above-mentioned steps S35, S36 and S37, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.

It can be seen that, in this embodiment of the present application, the length of the sliding window of the time series is first determined, and the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected are sampled based on the preset time intervals within the time length of the sliding window to obtain multiple sets of system performance index data, multiple sets of microservice call chain data and multiple sets of system log data. The z-scores of each set of system performance index data, of the microservice call time in each set of microservice call chain data, and of the matching scores corresponding to different system log data in each set of system log data are then calculated, together with the z-scores of the corresponding first-order difference data. The weight coefficients corresponding to the different linear parameters extracted from the supervised learning model can then be used to weight the obtained z-scores, and fault location is performed on the service system to be detected based on the weighted scores.

Referring to FIG. 5, an embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, the method includes:

Step S41: Obtain the current operation data of each service node in the service system to be detected; the current operation data includes various operation state data.

Step S42: Standardize the current operating data by using a preset data standardization method to obtain standard scores corresponding to various operating status data.

Step S43: Use the historical operating data with the fault type label to train the model to be trained based on the logistic regression algorithm to obtain a trained supervised learning model.

Step S44: extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.

Step S45: Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard score of the running status data in each service node, so as to obtain the weighted score of each service node.

In this embodiment, after the weight coefficient corresponding to each linear parameter is obtained, the weight coefficients are used to weight the standard scores of the corresponding operating state data in each service node of the system to be detected. For example, consider the three kinds of operating state data of one service node. Assuming that the system performance index data includes the four index types CPU, memory, disk and database, each of these four has its own standard score; the call time in the microservice call chain data and the matching scores of the system log data also have their own standard scores. The weight coefficients of the linear parameters extracted from the model are then used to weight these standard scores, giving the weighted score of each service node.
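The weighted calculation can be sketched as a weighted sum of a node's standard scores. The weight values, node names and standard scores below are made up for illustration, and any normalization applied to the weighted score is left out because the application does not specify it.

```python
def weighted_score(standard_scores, weights):
    """Weighted sum of one node's standard scores, keyed by operating state data name."""
    missing = set(standard_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight coefficient for: {sorted(missing)}")
    return sum(weights[name] * score for name, score in standard_scores.items())


# Hypothetical weight coefficients extracted from a trained model.
weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2,
           "call_time": 0.5, "log_match": 0.3}

# Hypothetical standard scores (z-scores) per service node.
node_scores = {
    "node-a": {"cpu": 0.5, "memory": 2.0, "disk": 0.0, "database": 0.0,
               "call_time": 1.0, "log_match": 0.0},
    "node-b": {"cpu": 0.0, "memory": 0.0, "disk": 0.0, "database": 0.5,
               "call_time": 0.0, "log_match": 0.0},
}
weighted = {node: weighted_score(s, weights) for node, s in node_scores.items()}
```

Here node-a, whose memory and call time deviate most from normal, receives a much higher weighted score than node-b.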

Step S46: Screen out a preset number of service nodes with a weighted score greater than a preset threshold from all service nodes in descending order of weighted scores, so as to determine a faulty target service node based on the service nodes obtained after screening.

In this embodiment, the weighted score of each service node is obtained through step S45, and a preset number of service nodes with weighted scores greater than a preset threshold are then screened out from all service nodes in descending order of weighted score. In this embodiment, the preset number is set to 3 and the preset threshold is set to 0.9. That is, after all weighted scores are sorted in descending order, the first 3 service nodes with scores greater than 0.9 are screened out, and the faulty target service node is determined based on these three service nodes.

The target service node can be determined in two ways. In one specific implementation, all three service nodes are taken as faulty target service nodes, that is, all three are considered to have faults. In another specific implementation, one or two service nodes are further selected from the three, with manual participation and according to certain rules, as the faulty target service nodes.
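The screening in step S46 can be sketched as follows, using the preset number 3 and the preset threshold 0.9 of this embodiment. The function name `screen_nodes` and the weighted scores are hypothetical.

```python
def screen_nodes(weighted_scores, threshold=0.9, limit=3):
    """Return up to `limit` node names with score > threshold, highest score first."""
    ranked = sorted(weighted_scores.items(), key=lambda item: item[1], reverse=True)
    return [node for node, score in ranked if score > threshold][:limit]


# Hypothetical weighted scores of six service nodes.
scores = {
    "node-a": 0.97, "node-b": 0.42, "node-c": 0.93,
    "node-d": 0.99, "node-e": 0.91, "node-f": 0.95,
}
candidates = screen_nodes(scores)
```

Five nodes exceed the threshold, but only the three highest-scoring ones are kept as candidates for the faulty target service node.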

Step S47: Screen out the largest weight coefficient from the weight coefficients of all linear parameters corresponding to the target service node, and determine the parameter type of the linear parameter corresponding to the largest weight coefficient as the corresponding root cause of the failure.

In this embodiment, after the faulty target service node is determined, the largest weight coefficient is selected from the weight coefficients of all linear parameters corresponding to the target service node. For example, suppose that in a target service node the weight coefficient of the CPU is 0.4, that of the memory is 0.8, that of the disk is 0.8, that of the database is 0.2 and that of the network is 0.5. The largest weight coefficient of the target service node is then 0.8, and the parameter types of the linear parameters corresponding to this largest weight coefficient, namely memory and disk, are determined as the root cause of the fault.
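The root cause selection in the example above, including the tie between memory and disk, can be sketched as follows. The function name `root_causes` is hypothetical, and returning every parameter type tied for the largest coefficient is how this sketch reads the example.

```python
def root_causes(weights):
    """Return every parameter type whose weight coefficient equals the maximum."""
    top = max(weights.values())
    return sorted(name for name, w in weights.items() if w == top)


# Weight coefficients from the example in this embodiment.
node_weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2, "network": 0.5}
causes = root_causes(node_weights)
```

For these coefficients `causes` contains both `disk` and `memory`, matching the example in which memory and disk are jointly determined as the root cause.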

Wherein, for more specific processing procedures of the above-mentioned steps S41, S42, S43 and S44, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.

It can be seen that, in this embodiment of the present application, the standard scores of the operating state data in each service node are weighted using the weight coefficients of the linear parameters to obtain the weighted score of each service node; a preset number of service nodes with weighted scores greater than the preset threshold are screened out from all service nodes in descending order of weighted score to determine the faulty target service node; and finally the parameter type of the linear parameter corresponding to the largest weight coefficient in the target service node is determined as the corresponding root cause of the fault. In this way, by calculating and sorting the weighted scores of the service nodes to determine the faulty target service node, each service node in the system to be detected can serve as a detection point for fault detection, and the root cause of the fault in the target service node can be determined.

Referring to FIG. 6, an embodiment of the present application also discloses a system fault detection device. The device includes:

The data acquisition module 11 is used to acquire the current operation data of each service node in the service system to be detected; the current operation data includes various operation state data;

The standardization processing module 12 is used to standardize the current operation data by using a preset data standardization method, so as to obtain standard scores corresponding to various operation state data;

The model training module 13 is used to train the model to be trained, constructed based on the logistic regression algorithm, using historical operation data carrying fault type labels, so as to obtain the trained supervised learning model;

The weight coefficient extraction module 14 is used to extract the weight coefficient corresponding to each linear parameter in the supervised learning model after training; wherein, different linear parameters correspond to different operating state data;

The fault location module 15 is configured to use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.

FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application. Specifically, it may include: at least one processor 21 , at least one memory 22 , a power supply 23 , a communication interface 24 , an input/output interface 25 and a communication bus 26 . Wherein, the memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the system fault detection method performed by the computer device disclosed in any of the foregoing embodiments.

In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the computer device 20. The communication interface 24 can create a data transmission channel between the computer device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here. The input/output interface 25 is used to obtain external input data or to output data externally; its specific interface type can be selected according to the needs of the specific application and is not specifically limited here.

The processor 21 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the wake-up state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.

In addition, the memory 22, as a resource storage carrier, can be a read-only memory, random access memory, magnetic disk or optical disk, etc., and the resources stored thereon include the operating system 221, computer program 222 and data 223, etc., and the storage method can be short-term storage or permanent storage.

Among them, the operating system 221 is used to manage and control each hardware device and computer program 222 on the computer device 20, so as to realize the operation and processing of the massive data 223 in the memory 22 by the processor 21, which can be Windows, Unix, Linux, etc. In addition to the computer program 222 that can be used to complete the system fault detection method performed by the computer device 20 disclosed in any of the foregoing embodiments, the computer program 222 can further include a computer program that can be used to complete other specific tasks. The data 223 may not only include data received by the computer device and transmitted from an external device, but may also include data collected by its own input and output interface 25 and the like.

Further, an embodiment of the present application also discloses a storage medium in which a computer program is stored. When the computer program is loaded and executed by a processor, the steps of the system fault detection method disclosed in any of the foregoing embodiments are implemented.

Finally, it should also be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprises", "includes" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or apparatus that includes the element.

The system fault detection method, apparatus, device and storage medium provided by this application have been introduced in detail above. Specific examples have been used herein to illustrate the principle and implementation of this application, and the description of the above embodiments is only intended to help in understanding the method of this application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of this application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims

A system fault detection method, including:

Obtain the current operation data of each service node in the service system to be detected; the current operation data includes various operation status data;

Standardize the current operating data by using a preset data standardization method to obtain standard scores corresponding to various operating state data;

Use the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain the trained supervised learning model;

Extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data;

The weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.
The system fault detection method according to claim 1, wherein said obtaining the current operating data of each service node in the service system to be detected comprises:

The system performance index data, microservice call chain data and system log data of each service node in the service system to be tested are obtained to obtain the current operation data of each service node.
The system fault detection method according to claim 2, wherein said obtaining the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected, so as to obtain the current operating data of each said service node, comprises:

Determining the time length of the sliding window for the time series;

Sampling the system performance index data of each service node in the service system to be detected at a first preset time interval within the time length of each sliding window, so as to obtain multiple sets of system performance index data corresponding to multiple sliding windows of the time series;

Sampling the microservice call chain data of each service node in the service system to be detected at a second preset time interval within the time length of each sliding window, so as to obtain multiple sets of microservice call chain data corresponding to multiple sliding windows of the time series;

Sampling the system log data of each service node in the service system to be detected at a third preset time interval within the time length of each sliding window, so as to obtain multiple sets of system log data corresponding to multiple sliding windows of the time series.
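The windowed sampling described in this claim might look like the sketch below. The claim does not state the window stride, so consecutive non-overlapping windows are assumed here; the function name `sample_windows`, the integer-second timestamps, and the sample readings are likewise illustrative.

```python
def sample_windows(series, window_len, interval):
    """Group (timestamp, value) pairs into consecutive windows.

    Each window spans `window_len` seconds; within a window, only readings
    that fall on the `interval`-second sampling grid are kept.
    """
    if not series:
        return []
    start = series[0][0]
    windows = {}
    for ts, value in series:
        offset = ts - start
        if offset % interval != 0:
            continue  # not on the sampling grid for this data source
        windows.setdefault(offset // window_len, []).append(value)
    return [windows[k] for k in sorted(windows)]

# Hypothetical CPU readings, one per second for 12 seconds.
cpu = [(t, 10 + t) for t in range(12)]
groups = sample_windows(cpu, window_len=6, interval=2)
# groups == [[10, 12, 14], [16, 18, 20]]
```

The same function could be called with a different `interval` for call chain data and for log data, which corresponds to the first, second and third preset time intervals in the claim.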
The system fault detection method according to claim 3, wherein, the standardization process is performed on the current operation data by using a preset data standardization method, so as to obtain standard scores corresponding to various operation status data, including:

Calculating the z-score corresponding to each set of system performance index data and the z-score corresponding to the first-order difference data between different system performance index data in each set of system performance index data;

Obtain the microservice call time in each set of microservice call chain data, and calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data;

Match each set of system log data against a preset log template to obtain the matching scores corresponding to the different system log data in each set, and calculate the z-score of the matching scores corresponding to each set of system log data and the z-score of the first-order difference data between different matching scores corresponding to each set of system log data.
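A small sketch of the three building blocks named in this claim: z-scores, first-order differences, and template-based log matching scores. The population standard deviation, the regular-expression templates, and their score values are assumptions made for illustration; the claim does not specify them.

```python
import math
import re

def z_scores(values):
    """Standardize values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # constant series carries no deviation signal
    return [(v - mean) / std for v in values]

def first_order_diff(values):
    """Difference between each sample and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical log templates and the matching score each one yields.
TEMPLATES = {
    r"connection to \S+ timed out": 1.0,
    r"retrying request": 0.5,
}

def match_score(line):
    """Highest score among the templates that match the log line."""
    hits = (s for pattern, s in TEMPLATES.items() if re.search(pattern, line))
    return max(hits, default=0.0)

call_times = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
call_z = z_scores(call_times)                     # z-scores of the raw values
diff_z = z_scores(first_order_diff(call_times))   # z-scores of the differences

logs = ["connection to db-1 timed out", "retrying request 3", "request served"]
log_scores = [match_score(line) for line in logs]  # [1.0, 0.5, 0.0]
```

Standardizing the differences as well as the raw values lets a sudden jump stand out even when the absolute level is still within its usual range.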
The system fault detection method according to claim 4, wherein said calculating the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data comprises:

For each set of microservice call chain data, determine the parent node and child node of the microservice call chain corresponding to the set of microservice call chain data;

Adding, for the parent node and the child node, a call duration and a call direction that represent the call relationship between them;

Calculate the total microservice invocation time in the group of microservice invocation chain data based on the invocation duration and invocation direction of the parent node and the child node;

Calculate the z-score corresponding to the total microservice invocation time in the group of microservice invocation chain data and the z-score corresponding to the first-order difference data between different microservice invocation times in each group of microservice invocation chain data.
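One possible reading of the total call time computation is sketched below. The claim does not define how durations are aggregated, so this example assumes the total is the sum of the durations of all calls reachable from a root service when following the parent-to-child direction; the edge tuples and service names are hypothetical.

```python
from collections import defaultdict

def total_call_time(edges, root):
    """Sum call durations over every call reachable from `root`.

    `edges` holds (parent, child, duration_ms) tuples; the tuple order
    encodes the call direction (parent calls child).
    """
    children = defaultdict(list)
    for parent, child, duration_ms in edges:
        children[parent].append((child, duration_ms))

    total, stack, seen = 0.0, [root], {root}
    while stack:
        node = stack.pop()
        for child, duration_ms in children[node]:
            total += duration_ms
            if child not in seen:  # expand each service only once
                seen.add(child)
                stack.append(child)
    return total

# Hypothetical call chain; the last edge is not reachable from "gateway".
chain = [
    ("gateway", "orders", 12.0),
    ("orders", "inventory", 30.0),
    ("orders", "payments", 45.5),
    ("audit", "archive", 99.0),
]
total_ms = total_call_time(chain, "gateway")  # 12.0 + 30.0 + 45.5
```

The per-window totals produced this way would then be standardized in the same manner as any other numeric series.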
The system fault detection method according to claim 4, wherein the process of standardizing any set of operating status data in the current operating data includes:

Using an optimized mean calculation formula and an optimized variance calculation formula, respectively calculate the mean and variance corresponding to the group of operating status data, and calculate the z-score corresponding to the group of operating status data based on that mean and variance; wherein,

The optimized mean calculation formula is:

The optimized variance calculation formula is:

where n represents the data sample size corresponding to the group of operating state data, x_i represents the i-th data sample in the group of operating state data, mean represents the mean value, and s^2 represents the variance.
The system fault detection method according to claim 6, wherein said calculation of the mean value and variance corresponding to the group of operating state data by using the optimized mean calculation formula and the optimized variance calculation formula includes:

For any group of operating status data, use a preset target queue to maintain the data samples in that group;

Obtaining data samples in the target queue;

According to the data samples, the mean value and variance corresponding to the group of operating state data are respectively calculated by using the optimized mean value calculation formula and the optimized variance calculation formula.
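A queue-maintained mean and variance can be sketched as follows. The optimized formulas of claim 6 are not reproduced in this text, so the example substitutes a standard running-sum technique (sum and sum of squares, sample variance with n - 1); the class name `WindowStats` and the fixed capacity are assumptions.

```python
from collections import deque

class WindowStats:
    """Fixed-capacity queue with O(1) mean and variance updates."""

    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.total = 0.0      # running sum of the queued samples
        self.total_sq = 0.0   # running sum of their squares

    def push(self, x):
        if len(self.queue) == self.capacity:
            old = self.queue.popleft()  # evict the oldest sample
            self.total -= old
            self.total_sq -= old * old
        self.queue.append(x)
        self.total += x
        self.total_sq += x * x

    @property
    def mean(self):
        return self.total / len(self.queue)

    @property
    def variance(self):
        n = len(self.queue)
        if n < 2:
            return 0.0
        # Clamp at zero to absorb floating-point cancellation.
        return max(0.0, (self.total_sq - n * self.mean ** 2) / (n - 1))

    def z_score(self, x):
        std = self.variance ** 0.5
        return 0.0 if std == 0 else (x - self.mean) / std

stats = WindowStats(capacity=4)
for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.push(sample)
# The queue now holds 2, 3, 4, 5: mean 3.5, sample variance 5/3.
```

Keeping running sums means each new sample updates the statistics without rescanning the queue, which is the usual motivation for maintaining samples this way.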
The system fault detection method according to claim 1, wherein, before using the historical operation data carrying the fault type label to train the model to be trained based on the logistic regression algorithm, further comprising:

Obtain historical normal operation data and historical fault operation data;

Adding label information including a corresponding running time interval label and a non-fault type label to the historical normal operation data to obtain the first historical operation data as a negative sample;

Adding label information including a corresponding running time interval label and a fault type label to the historical fault operation data, and resampling the labeled historical fault operation data to obtain second historical operation data as positive samples, so that the ratio of the number of samples in the second historical operation data to the number of samples in the first historical operation data reaches a preset positive-to-negative sample ratio.
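The labeling and resampling step could be sketched like this. Random over- or under-sampling with replacement is an assumption (the claim only says "resampling"), as are the record layout, the `no_fault` label value, and the function name `build_training_set`.

```python
import random

def build_training_set(normal, faulty, pos_neg_ratio, seed=0):
    """Label normal records as negatives and resample faulty records
    until positives / negatives matches `pos_neg_ratio`."""
    rng = random.Random(seed)
    negatives = [dict(record, label="no_fault") for record in normal]
    target = max(1, round(pos_neg_ratio * len(negatives)))
    if len(faulty) >= target:
        positives = rng.sample(faulty, target)  # under-sample
    else:
        extra = rng.choices(faulty, k=target - len(faulty))  # over-sample
        positives = list(faulty) + extra
    return negatives, positives

# Hypothetical records: each carries a running time interval; faulty
# records additionally carry their fault type label.
normal = [{"interval": (i, i + 60)} for i in range(0, 600, 60)]
faulty = [
    {"interval": (900, 960), "label": "disk_full"},
    {"interval": (1200, 1260), "label": "net_loss"},
]
negatives, positives = build_training_set(normal, faulty, pos_neg_ratio=0.5)
```

With 10 negative samples and a ratio of 0.5, the two fault records are over-sampled to 5 positives.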
The system fault detection method according to claim 8, wherein the weighted calculation of the corresponding standard scores of the operating state data by using the weight coefficients of the linear parameters includes:

Obtaining expert knowledge for optimizing the weight coefficients of the linear parameters through a preset expert knowledge acquisition interface;

Using the expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters;

The adjusted weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data.
The system fault detection method according to claim 9, wherein said obtaining the expert knowledge for optimizing the weight coefficient of the linear parameter through the preset expert knowledge acquisition interface comprises:

Extract optimization data from the optimization weight coefficient model established based on the historical expert knowledge base;

The optimization data is acquired through a preset expert knowledge acquisition interface, and the optimization data is used as expert knowledge for optimizing the weight coefficients of the linear parameters.
The system fault detection method according to claim 9, wherein said acquisition of expert knowledge for optimizing the weight coefficient of said linear parameter through a preset expert knowledge acquisition interface comprises:

The instruction for manually optimizing the weight coefficient is obtained through the preset expert knowledge acquisition interface, and the instruction for manually optimizing the weight coefficient is used as expert knowledge for optimizing the weight coefficient of the linear parameter.
The system fault detection method according to claim 9, wherein said using said expert knowledge to adjust the weight coefficients of said linear parameters correspondingly, so as to obtain the adjusted weight coefficients of said linear parameters, comprises:

Determining a fault type label corresponding to the linear parameter;

In the case that the fault type label is not included in the second historical operation data, using the expert knowledge to increase the weight coefficient of the linear parameter corresponding to the fault type label, so that the adjusted weight coefficient of the linear parameter is greater than the original weight coefficient of the linear parameter.
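The adjustment rule in this claim can be illustrated as below. How expert knowledge is encoded is not specified, so this sketch assumes it arrives as a positive increment per parameter; the parameter names, fault type names, and the `adjust_weights` function are hypothetical.

```python
def adjust_weights(weights, param_fault_type, seen_fault_types, increments):
    """Raise the weight of every parameter whose fault type never appeared
    in the positive training samples.

    `increments` is the expert-supplied amount to add per parameter and
    must be positive so the adjusted weight exceeds the original.
    """
    adjusted = dict(weights)
    for param, fault_type in param_fault_type.items():
        if fault_type in seen_fault_types:
            continue  # the model already learned from this fault type
        increment = increments[param]
        if increment <= 0:
            raise ValueError(f"increment for {param} must be positive")
        adjusted[param] = weights[param] + increment
    return adjusted

weights = {"cpu_z": 0.8, "disk_z": 0.1}
param_fault_type = {"cpu_z": "cpu_overload", "disk_z": "disk_full"}
seen_fault_types = {"cpu_overload"}  # "disk_full" never occurred in training
adjusted = adjust_weights(weights, param_fault_type, seen_fault_types,
                          increments={"disk_z": 0.6})
```

A fault type that never occurred in the history leaves its parameter with a weakly trained coefficient, which is why the expert correction only ever increases it in this sketch.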
The system fault detection method according to any one of claims 1 to 12, wherein said using the weight coefficients of the linear parameters to perform weighted calculations on the corresponding standard scores of the operating status data, and performing fault location on the service system to be detected based on the weighted scores, comprises:

Using the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard score of the operation status data in each of the service nodes, so as to obtain the weighted score of each of the service nodes;

Screening out, in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold from all the service nodes, so as to determine the faulty target service node based on the service nodes obtained after screening;

Selecting the largest weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the largest weight coefficient as the corresponding root cause of the fault.
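The three steps of this claim (per-node weighted score, threshold and top-N screening, root cause by largest weight coefficient) can be sketched as follows. The data layout, the names `locate_faults`, `threshold` and `top_k`, and the example values are assumptions.

```python
def locate_faults(node_scores, weights, threshold, top_k):
    """Return {node: (weighted_score, root_cause)} for suspected nodes.

    `node_scores` maps each node to {parameter: standard score};
    `weights` maps each parameter to its weight coefficient.
    """
    weighted = {
        node: sum(weights[p] * z for p, z in scores.items())
        for node, scores in node_scores.items()
    }
    # Rank nodes from the largest weighted score to the smallest, then
    # keep at most `top_k` of those above the threshold.
    ranked = sorted(weighted, key=weighted.get, reverse=True)
    targets = [n for n in ranked if weighted[n] > threshold][:top_k]

    result = {}
    for node in targets:
        # Root cause: the node's parameter with the largest weight coefficient.
        root_cause = max(node_scores[node], key=lambda p: weights[p])
        result[node] = (weighted[node], root_cause)
    return result

weights = {"cpu_z": 0.5, "latency_z": 1.5, "log_z": 1.0}
node_scores = {
    "a": {"cpu_z": 0.2, "latency_z": 0.1},
    "b": {"cpu_z": 1.0, "latency_z": 3.0, "log_z": 2.0},
    "c": {"cpu_z": 4.0, "log_z": 1.0},
}
faults = locate_faults(node_scores, weights, threshold=1.0, top_k=2)
```

Here node `b` scores 7.0 and node `c` scores 3.0, so both pass the threshold, while node `a` at 0.25 is screened out.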
The system fault detection method according to claim 13, wherein said screening out, in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold from all the service nodes, so as to determine the faulty target service node based on the service nodes obtained after screening, comprises:

Screening out a preset number of service nodes whose weighted scores are greater than a preset threshold from all the service nodes in descending order of weighted scores;

Determining the screened-out service nodes whose weighted scores are greater than the preset threshold as the faulty target service nodes.
The system fault detection method according to claim 13, wherein said screening out, in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold from all the service nodes, so as to determine the faulty target service node based on the service nodes obtained after screening, comprises:

Screen out a preset number of service nodes whose weighted scores are greater than a preset threshold from all the service nodes in descending order of weighted scores;

Determining, in response to the user's selection operation on the screened-out service nodes whose weighted scores are greater than the preset threshold, the service node selected by the user;

Determining the service node selected by the user as the faulty target service node.
The system fault detection method according to claim 1, wherein, before using a preset data standardization method to standardize the current operating data, the method further comprises:

Perform data cleaning processing on the current operating data, the data cleaning processing including one or more of the following: removing duplicate data in the current operating data, supplementing missing data in the current operating data, and correcting erroneous data in the current operating data.
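The three cleaning operations might be combined as in the sketch below. The concrete policies are assumptions, since the claim names only the operations: duplicates are detected by timestamp, missing values are filled with the last good value, and out-of-range values are clamped to a valid range.

```python
def clean(records, valid_range=(0.0, 100.0)):
    """Deduplicate, fill and correct (timestamp, value) readings."""
    low, high = valid_range
    seen, cleaned, last_good = set(), [], None
    for ts, value in sorted(records, key=lambda r: r[0]):
        if ts in seen:
            continue  # remove duplicate data
        seen.add(ts)
        if value is None:
            value = last_good  # supplement missing data
        elif not low <= value <= high:
            value = min(max(value, low), high)  # correct erroneous data
        if value is None:
            continue  # nothing earlier to fill from, drop the reading
        last_good = value
        cleaned.append((ts, value))
    return cleaned

# Hypothetical CPU utilisation readings in percent.
raw = [(1, 20.0), (1, 20.0), (2, None), (3, 250.0), (4, 35.0)]
cleaned = clean(raw)
# cleaned == [(1, 20.0), (2, 20.0), (3, 100.0), (4, 35.0)]
```

Cleaning before standardization matters because a single out-of-range reading would otherwise inflate the variance and flatten every other z-score in the window.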
A system fault detection device, including:

A data acquisition module, configured to acquire the current operating data of each service node in the service system to be detected; the current operating data includes various operating state data;

A standardization processing module, configured to perform standardization processing on the current operation data by using a preset data standardization method, so as to obtain standard scores corresponding to various operation state data;

The model training module is used to train the model to be trained based on the logistic regression algorithm by using the historical operation data carrying the fault type label to obtain the trained supervised learning model;

A weight coefficient extraction module, configured to extract weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein different linear parameters correspond to different operating state data;

The fault location module is configured to use the weight coefficients of the linear parameters to perform weighted calculations on the corresponding standard scores of the operation status data, and perform fault location on the service system to be detected based on the weighted scores.
The system fault detection device according to claim 17, wherein the device further comprises a weight coefficient adjustment module, configured to:

Obtaining expert knowledge for optimizing the weight coefficients of the linear parameters through a preset expert knowledge acquisition interface;

Using the expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters;

The adjusted weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data.
An electronic device, comprising:

A memory, configured to store a computer program;

A processor, configured to execute the computer program to realize the steps of the system fault detection method according to any one of claims 1 to 16.
A non-volatile computer-readable storage medium, configured to store a computer program; wherein, when the computer program is executed by a processor, the steps of the system fault detection method according to any one of claims 1 to 16 are realized.