WO2023109251A1 - System fault detection method and apparatus, device, and medium

System fault detection method and apparatus, device, and medium

Info

Publication number
WO2023109251A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
service
preset
weighted
scores
Prior art date
Application number
PCT/CN2022/122295
Other languages
French (fr)
Chinese (zh)
Inventor
赵利强
Original Assignee
浪潮电子信息产业股份有限公司
Priority date
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2023109251A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software

Definitions

  • The present application relates to the field of computer systems, and in particular to a system fault detection method, device, equipment and medium.
  • The cloud-native environment has four main characteristics: microservices, automated publishing, continuous delivery, and containerization.
  • The microservice architecture offers great advantages in independent deployment, fast delivery, and scalability, but because a microservice system contains a large number of services, the calling relationships between services become extremely complicated.
  • When the system has problems, it is difficult for operation and maintenance administrators to find faults and troubleshoot them quickly, accurately, and comprehensively. Therefore, in a service system environment, fault detection and root cause location require more intelligent algorithm models.
  • The purpose of this application is to provide a system fault detection method, device, equipment and medium that can automatically, quickly, accurately, and comprehensively detect and locate faults in the service system.
  • the specific plan is as follows:
  • the current running data includes various running status data
  • the weight coefficients of the linear parameters are used to weight the corresponding standard scores of the operating status data, and based on the weighted scores, the fault location of the service system to be detected is performed.
  • obtain the current running data of each service node in the service system to be tested including:
  • calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data including:
  • For each set of microservice call chain data determine the parent node and child node of the microservice call chain corresponding to the set of microservice call chain data;
  • the process of standardizing any set of operating status data in the current operating data includes:
  • the optimized mean calculation formula and the optimized variance calculation formula respectively calculate the mean and variance corresponding to the group of operating status data, and calculate the corresponding z-score of the group of operating status data based on the mean and variance corresponding to the group of operating status data;
  • n is the data sample size corresponding to the group of operating state data
  • x_i is the i-th data sample in the group of operating state data
  • mean is the mean value
  • s^2 is the variance
  • the mean value and variance corresponding to the group of operating status data are respectively calculated by using the optimized mean value calculation formula and the optimized variance calculation formula.
  • the model to be trained based on the logistic regression algorithm using the historical operation data carrying the fault type label it also includes:
  • the weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data, including:
  • the expert knowledge used to optimize the weight coefficient of the linear parameter is obtained through the preset expert knowledge acquisition interface
  • the adjusted weight coefficients of the linear parameters are used to perform weighted calculations on the corresponding standard scores of the operating status data.
  • the expert knowledge used to optimize the weight coefficient of the linear parameter is obtained through the preset expert knowledge acquisition interface, including:
  • the optimization data is obtained through the preset expert knowledge acquisition interface, and the optimization data is used as expert knowledge for optimizing the weight coefficient of the linear parameter.
  • the expert knowledge used to optimize the weight coefficient of the linear parameter is obtained through the preset expert knowledge acquisition interface, including:
  • the instruction for manually optimizing the weight coefficient is obtained through the preset expert knowledge acquisition interface, and the instruction for manually optimizing the weight coefficient is used as expert knowledge for optimizing the weight coefficient of the linear parameter.
  • use expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters including:
  • the weight coefficient of the linear parameter corresponding to the fault type label is adjusted by using expert knowledge, so that the adjusted weight coefficient of the linear parameter is greater than the original weight coefficient of the linear parameter.
  • weight coefficient of the linear parameter to perform weighted calculations on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores, including:
  • the maximum weight coefficient is selected from the weight coefficients of all linear parameters corresponding to the target service node, and the parameter type of the linear parameter corresponding to the maximum weight coefficient is determined as the corresponding fault root cause.
  • a preset number of service nodes with a weighted score greater than a preset threshold are screened out from all service nodes in order of weighted scores from large to small, so as to determine a faulty target service node based on the filtered service nodes, including:
  • the service node selected by the user is determined as the target service node where the failure occurs.
  • the method before using the preset data standardization method to standardize the current operating data, the method further includes:
  • performing data cleaning on the current operating data, where the data cleaning includes one or more of the following: removing duplicate data in the current operating data, supplementing missing data in the current operating data, and correcting erroneous data in the current operating data.
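The three cleaning operations named above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `(timestamp, value)` sample layout, the function name `clean_samples`, the valid range of 0 to 100, the clamping rule for erroneous values, and the forward-fill rule for missing values are all assumptions chosen for the example.

```python
def clean_samples(samples, lower=0.0, upper=100.0):
    """Clean a list of (timestamp, value) samples; value may be None if missing.

    Assumed rules (not taken from the source): one record per timestamp,
    values outside [lower, upper] are clamped, gaps are forward-filled.
    """
    # 1) Remove duplicate data: keep the first record seen for each timestamp.
    seen, unique = set(), []
    for ts, value in sorted(samples, key=lambda s: s[0]):
        if ts not in seen:
            seen.add(ts)
            unique.append([ts, value])

    # 2) Correct erroneous data: clamp values into the assumed valid range.
    for record in unique:
        if record[1] is not None:
            record[1] = min(max(record[1], lower), upper)

    # 3) Supplement missing data: forward-fill, seeding with the first valid value.
    last = next((v for _, v in unique if v is not None), None)
    for record in unique:
        if record[1] is None:
            record[1] = last
        else:
            last = record[1]
    return [tuple(record) for record in unique]
```

Other repair strategies (interpolation, dropping the record, range checks per indicator type) would fit the same three-step structure.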
  • system fault detection device including:
  • the data acquisition module is used to acquire the current operation data of each service node in the service system to be detected; the current operation data includes various operation status data;
  • the standardization processing module is used to standardize the current operation data by using a preset data standardization method, so as to obtain standard scores corresponding to various operation state data;
  • the model training module is used to train the model to be trained based on the logistic regression algorithm by using the historical operation data carrying the fault type label to obtain the trained supervised learning model;
  • the weight coefficient extraction module is used to extract the weight coefficient corresponding to each linear parameter in the supervised learning model after training; wherein, different linear parameters correspond to different operating state data;
  • the fault location module is configured to use the weight coefficient of the linear parameter to carry out weighted calculation on the standard scores of the corresponding operation status data, and perform fault location on the service system to be detected based on the weighted scores.
  • an electronic device comprising:
  • the processor is used to execute the computer program to realize the steps of the aforementioned disclosed system fault detection method.
  • the present application discloses a computer non-volatile readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, the steps of the aforementioned disclosed system fault detection method are implemented.
  • This application first obtains the current operating data of each service node in the service system to be tested, where the current operating data includes various operating state data. It then standardizes the current operating data with the preset data standardization method to obtain the standard scores corresponding to the various operating state data. Next, it uses the historical operating data carrying fault type labels to train the model to be trained based on the logistic regression algorithm, obtaining the trained supervised learning model, and extracts the weight coefficient corresponding to each linear parameter in the trained model, where different linear parameters correspond to different operating state data. Finally, it uses the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating state data, and performs fault location on the service system to be detected based on the weighted scores.
  • The standard scores corresponding to the operating status data of each service node are weighted to obtain weighted scores. The weighted scores are then sorted to filter out those that meet the preset conditions, and the corresponding service nodes are determined, thereby locating the fault in the system. Further, the root cause of the system failure is determined from the weight coefficients of the component information in the determined service node, which improves the efficiency of fault location.
  • Fig. 1 is a flow chart of a system fault detection method disclosed in the present application.
  • Fig. 2 is a flow chart of a specific system fault detection method disclosed in the present application.
  • Fig. 3 is a flow chart of a specific system fault detection method disclosed in the present application.
  • Fig. 4 is a flow chart of a specific fault detection and root cause determination process disclosed in the present application.
  • Fig. 5 is a flow chart of a specific system fault detection method disclosed in the present application.
  • Fig. 6 is a schematic structural diagram of a system fault detection device disclosed in the present application.
  • Fig. 7 is a structural diagram of an electronic device disclosed in the present application.
  • the embodiment of the present application discloses a system fault detection method, device, equipment and medium, which can automatically, quickly, accurately and comprehensively detect and locate faults in the service system.
  • a system fault detection method which includes:
  • Step S11 Obtain the current operating data of each service node in the service system to be tested; the current operating data includes various operating state data.
  • the service system to be tested may be, but is not limited to, any of the following: a private cloud monitoring system, a large microservice system, or a cloud-native platform intelligent operation and maintenance system.
  • the current operating data of each service node includes various operating state data, that is, various data that can characterize the operating state of the service system to be detected.
  • Step S12 Standardize the current operating data by using a preset data standardization method to obtain standard scores corresponding to various operating status data.
  • Step S13 Use the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain the trained supervised learning model.
  • In this embodiment, pre-prepared historical operating data carrying fault type labels is used to train the model to be trained and obtain a trained supervised learning model. The model to be trained is constructed based on a logistic regression algorithm; that is, this embodiment uses a Logistic Classifier (Logistic Regression Classifier) for supervised learning training.
  • The purpose of using logistic regression classification to establish a supervised learning model is not to use supervised learning to detect faults and locate root causes in the system to be detected, but to use the calculation process of logistic regression to adjust the weights of the corresponding linear parameters in the model from a limited amount of historical operating data carrying fault type labels.
  • Step S14 extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; where different linear parameters correspond to different operating state data.
  • As noted for step S13, the purpose of using the logistic regression classification algorithm to establish a supervised learning model is to adjust the weights of the corresponding linear parameters in the model. Therefore, after the model training is completed, the weight coefficient of each linear parameter is extracted from the above supervised learning model. It should be pointed out that different linear parameters correspond to different operating state data.
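Steps S13 and S14 can be illustrated with a small, self-contained sketch: fit a logistic regression on labelled historical samples, then read off one weight coefficient per input feature. The plain batch gradient descent below is a stand-in chosen so the example has no dependencies; the source does not say which solver or library is used, and the function name, learning rate, epoch count, feature names, and sample values are all hypothetical.

```python
import math


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent (illustrative solver).

    X: list of feature rows (standard scores), y: list of 0/1 fault labels.
    Returns (weights, bias); weights[j] is the coefficient of feature j.
    """
    n_samples, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted fault probability
            err = p - label
            for j in range(n_features):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


# Hypothetical data: column 0 is a CPU z-score, column 1 a memory z-score.
feature_names = ["cpu", "memory"]
X = [[2.0, 0.1], [1.5, -0.2], [-1.0, 0.0], [-2.0, 0.3], [1.8, 0.2], [-1.5, -0.1]]
y = [1, 1, 0, 0, 1, 0]

w, _bias = train_logistic(X, y)
weight_by_parameter = dict(zip(feature_names, w))  # step S14: one weight per parameter
```

In this toy data the fault label follows the CPU column, so the extracted CPU weight ends up much larger in magnitude than the memory weight, which is the property the later weighting step relies on.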
  • Step S15 Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.
  • The expert knowledge can be optimization data extracted from a weight coefficient optimization model built on a historical expert knowledge base and obtained through the preset expert knowledge acquisition interface; in this way, the time spent on the weight coefficient optimization process is reduced and the waste of human resources is avoided. Alternatively, the expert knowledge can be an instruction for manually optimizing the weight coefficient, obtained through the preset expert knowledge acquisition interface.
  • Optimizing the weight coefficient through expert knowledge can improve the accuracy and comprehensiveness of the subsequent fault location of the service system to be detected.
  • the purpose of adjusting the weight coefficients by expert knowledge is to increase the sensitivity of the model to system faults that have not been encountered.
  • the process of using expert knowledge to adjust the weight coefficients of the linear parameters accordingly to obtain the adjusted weight coefficients of the linear parameters may include:
  • the fault type label of the linear parameter corresponding to the operating state data can be determined according to the fault type label corresponding to the operating state data. For example, if the network-related performance indicator "system.tcp.syn_recv" is often associated with network failures, then the failure type label corresponding to the indicator is a label of the network failure category.
  • tcp: Transmission Control Protocol.
  • syn_recv: the state in which the server, after being passively opened, has received the client's syn and sent an ack.
  • syn: synchronize sequence numbers.
  • ack: acknowledgement character, a transmission control character sent by the receiving station to the sending station to indicate that the incoming data has been received without error.
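The expert-knowledge adjustment described above requires only that the adjusted weight of a parameter tied to a given fault type label be greater than its original weight. A minimal sketch, assuming the expert knowledge arrives as a positive increment per fault type label: the function name, the additive rule, the indicator name `system.cpu.usage`, and the label strings are hypothetical, while `system.tcp.syn_recv` is the indicator used as an example in the text.

```python
def adjust_weights(weights, fault_label_of, expert_increment):
    """Return a copy of `weights` raised according to expert knowledge.

    weights:          parameter name -> weight coefficient from the trained model
    fault_label_of:   parameter name -> fault type label (e.g. "network")
    expert_increment: fault type label -> positive amount to add (assumed form)
    """
    adjusted = dict(weights)
    for param, label in fault_label_of.items():
        delta = expert_increment.get(label)
        if delta is None or param not in adjusted:
            continue
        if delta <= 0:
            raise ValueError("expert increment must be positive")
        # Adding a positive delta guarantees adjusted > original,
        # even when the original coefficient is negative.
        adjusted[param] = adjusted[param] + delta
    return adjusted
```

An additive rule is used instead of a multiplicative one because multiplying a negative coefficient by a factor above 1 would make it smaller, which would violate the stated requirement.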
  • In this embodiment, the current operating data of each service node in the service system to be tested is first obtained, where the current operating data includes various operating state data. The current operating data is then standardized with the preset data standardization method to obtain the standard scores corresponding to each kind of operating state data. Next, the historical operating data carrying fault type labels is used to train the model to be trained based on the logistic regression algorithm, obtaining the trained supervised learning model, and the weight coefficient corresponding to each linear parameter in the trained model is extracted, where different linear parameters correspond to different operating state data. Finally, the standard scores of the corresponding operating state data are weighted with the weight coefficients of the linear parameters, and fault location is performed on the service system to be tested based on the weighted scores.
  • The standard scores corresponding to the operating status data of each service node are weighted to obtain weighted scores. The weighted scores are then sorted to filter out those that meet the preset conditions, and the corresponding service nodes are determined, thereby locating the fault in the system. Further, the root cause of the system failure is determined from the weight coefficients of the component information in the determined service node, which improves the efficiency of fault location.
  • The embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, it includes:
  • Step S21 Obtain the system performance index data, microservice call chain data and system log data of each service node in the service system to be tested, so as to obtain the current operation data of each service node.
  • the above-mentioned current operation data mainly includes the following three types of operation status data: system performance index data, microservice call chain data and system log data.
  • the types of indicators in the system performance indicator data may include but not limited to CPU (Central Processing Unit, i.e. central processing unit), memory, disk, database, JVM (Java Virtual Machine, i.e.
  • each set of call chain data includes the link number (TraceId), the unit number of each call (SpanId), the name of the called service (ServiceName), the physical unit (CmdbId), and the call duration (Duration);
  • the system log data also includes multiple sets of log data corresponding to the above system performance indicators.
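The call chain fields listed above map naturally onto a small record type. The sketch below is illustrative only: the class and function names, the millisecond unit for Duration, and the optional `parent_span_id` field (added to express the parent and child relationship mentioned elsewhere in the description) are assumptions, not fields confirmed by the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CallSpan:
    trace_id: str                         # TraceId: link number shared by one call chain
    span_id: str                          # SpanId: unit number of this call
    service_name: str                     # ServiceName: name of the called service
    cmdb_id: str                          # CmdbId: physical unit
    duration_ms: float                    # Duration: call duration (unit assumed)
    parent_span_id: Optional[str] = None  # hypothetical link to the parent call


def group_by_trace(spans: List[CallSpan]) -> Dict[str, List[CallSpan]]:
    """Group spans into call chains keyed by their link number."""
    chains: Dict[str, List[CallSpan]] = defaultdict(list)
    for span in spans:
        chains[span.trace_id].append(span)
    return dict(chains)


def total_duration(chain: List[CallSpan]) -> float:
    """Sum the call durations recorded in one call chain."""
    return sum(span.duration_ms for span in chain)
```

Grouping by `trace_id` yields one list per call chain, which is the unit on which the later call-time statistics are computed.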
  • Step S22 Use a preset data standardization method to standardize the current operating data to obtain standard scores corresponding to various operating status data.
  • Step S23 Obtain historical normal operation data and historical fault operation data.
  • Step S24 Add label information including the corresponding running time interval label and non-fault type label to the historical normal operation data to obtain the first historical operation data as a negative sample.
  • the historical normal operation data obtained above is used as negative samples to obtain the first historical operation data, and label information including the corresponding running time interval label and non-fault type label must also be added to the historical normal operation data. The above-mentioned non-fault type label is label information indicating that the operating data will not cause a system failure.
  • Step S25 Add tag information including the corresponding running time interval tag and fault type tag to the historical fault operation data, and resample the historical fault operation data with added tag information to obtain the second historical operation data as a positive sample, to Make the ratio between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reach a preset ratio of positive and negative samples.
  • In this embodiment, the historical fault operation data obtained in step S23 is used as positive samples to obtain the second historical operation data, and label information containing the corresponding running time interval label and fault type label must also be added to the historical fault operation data. The fault type label is label information indicating that the operation data will cause a system failure. It should be pointed out that most of the operating data in the system is normal and only a very small portion is faulty, so the ratio between the positive samples and negative samples obtained from the historical fault operation data and the historical normal operation data respectively is very unbalanced.
  • The historical fault operation data with added label information is therefore resampled, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches a preset ratio of positive and negative samples.
  • the preset ratio of positive and negative samples is 1:10.
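Reaching the 1:10 ratio from a handful of fault samples can be sketched with simple random oversampling. The source says only that the labelled fault data is resampled; sampling with replacement, the fixed seed, the function name, and the down-sampling branch for the case where there are already too many positives are assumptions made for the illustration.

```python
import random


def resample_positives(positives, negatives, pos_part=1, neg_part=10, seed=0):
    """Resample fault (positive) samples to reach pos_part:neg_part against negatives."""
    if not positives:
        raise ValueError("at least one positive sample is required")
    target = max(1, len(negatives) * pos_part // neg_part)
    rng = random.Random(seed)
    if len(positives) >= target:
        # Assumed behaviour: down-sample without replacement if there are too many.
        return rng.sample(list(positives), target)
    # Keep every original positive and draw the remainder with replacement.
    extra = rng.choices(list(positives), k=target - len(positives))
    return list(positives) + extra
```

With 3 fault samples and 100 normal samples, the function returns 10 positives, giving the preset 1:10 ratio while keeping each original fault sample at least once.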
  • Step S26 Use the first historical operation data and the second historical operation data to train the model to be trained based on the logistic regression algorithm to obtain a trained supervised learning model.
  • the model to be trained based on the logistic regression algorithm is trained by using the first historical operating data and the second historical operating data conforming to the preset ratio of positive and negative samples to obtain a trained supervised learning model.
  • Step S27 extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.
  • Step S28 Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.
  • In this embodiment, the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected are obtained, and the corresponding standard scores are calculated using the preset data standardization method.
  • The historical fault operation data with added label information is resampled to increase the proportion of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches the preset ratio of positive and negative samples.
  • The embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, it includes:
  • the length of the sliding window of the time series is determined to be 30 minutes, and the current operating data needs to be sampled based on different preset time intervals within 30 minutes.
  • the three different kinds of running state data in the current running data of each service node in the service system to be detected, that is, system performance index data, microservice call chain data and system log data, are sampled at the first preset time interval, the second preset time interval and the third preset time interval respectively. After sampling, multiple sets of system performance index data, multiple sets of microservice call chain data and multiple sets of system log data are obtained, each corresponding to multiple sliding windows and arranged in time series.
  • $\text{Metric}_i = \dfrac{\text{value}_i - \text{mean}}{\text{std}}$, where $\text{Metric}_i$ is the z-score of the corresponding data index type of a group of operating status data; $\text{value}_i$ is the data value of each sampling point in the group; mean is the mean value of the group within a sliding window; and std is the standard deviation of the group within a sliding window.
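The z-score defined by these symbols can be computed directly for one sliding window. A minimal sketch, assuming the population standard deviation (dividing by n) and returning zeros for a constant window; the source does not say which convention it uses, and the function name is hypothetical.

```python
import math


def z_scores(window):
    """Return the z-score of every sampling point in one sliding window."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((value - mean) ** 2 for value in window) / n)
    if std == 0:
        # A constant window carries no deviation; avoid dividing by zero.
        return [0.0] * n
    return [(value - mean) / std for value in window]
```

For the window `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5 and the standard deviation is 2, so the last point has a z-score of 2.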
  • this application proposes a method for calculating the mean and variance with a time complexity of O(1). It can effectively improve the model performance.
  • This method mainly optimizes the above formula, and the specific optimization method is as follows:
  • n represents the data sample size corresponding to the group of operating state data
  • x_i represents the i-th data sample in the group of operating state data
  • mean represents the mean value
  • s^2 represents the variance
  • the above-mentioned optimized mean calculation formula and optimized variance calculation formula can be used to calculate the corresponding mean and variance, and then to solve the corresponding z-scores.
  • In this way, the standard scores corresponding to the current running status data can be calculated within O(1) time complexity; that is, no matter how large the scale of the data is, the mean and the variance of a group of operating state data in a sliding window can each be obtained after one calculation.
  • This increases the speed of calculating the mean and the variance of a group of operating state data within a sliding window, thereby improving the speed of calculating standard scores and the efficiency of system fault detection.
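One common way to obtain the mean and variance of a sliding window in O(1) per update is to maintain the running sum and running sum of squares, adding the newest sample and subtracting the one that leaves the window. The sketch below illustrates that general technique; it is an assumption about how such a result can be achieved, not a reproduction of the application's optimized formulas. The class name and the population-variance identity s^2 = (Σ x_i^2) / n − mean^2 are choices made for the example.

```python
from collections import deque


class SlidingStats:
    """Mean and variance of the last `size` samples with O(1) work per update."""

    def __init__(self, size):
        self.size = size
        self.values = deque()
        self.total = 0.0      # running sum of x_i
        self.total_sq = 0.0   # running sum of x_i squared

    def push(self, x):
        self.values.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.values) > self.size:
            old = self.values.popleft()   # sample leaving the window
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.values)

    def variance(self):
        n = len(self.values)
        m = self.total / n
        # Clamp at zero to absorb floating-point cancellation error.
        return max(self.total_sq / n - m * m, 0.0)
```

The sum-of-squares form can lose precision when values are large relative to their spread; Welford-style updates are a numerically safer alternative when that matters.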
  • Step S33 Obtain the microservice call time in each set of microservice call chain data, and calculate the z-score corresponding to the microservice call time in each set of microservice call chain data and the z-score corresponding to the first-order difference data between different microservice call times in each set of microservice call chain data.
  • each set of call chains has parent nodes and child nodes, so for a set of call chain data, the call duration and the call direction representing the call relationship must be added for both the parent node and the child node.
  • the total microservice call time in each group of microservice call chain data is obtained, and then the z-score corresponding to the total microservice call time in each group and the z-score corresponding to the first-order difference data between different microservice call times in each group are calculated.
  • the specific calculation method of the z-score refers to that shown in step S32.
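The first-order difference used here is the change between consecutive call times, and its z-score highlights sudden jumps in latency. A small sketch under stated assumptions: the helper names are hypothetical, the population standard deviation is assumed, and the duration values are invented for illustration.

```python
import math


def z_scores(values):
    """Population z-scores; returns zeros for a constant series."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [0.0] * n if std == 0 else [(v - mean) / std for v in values]


def first_differences(values):
    """First-order difference: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical total call times (ms) for successive call chains of one service.
durations = [100.0, 102.0, 101.0, 103.0, 180.0]
duration_z = z_scores(durations)
diff_z = z_scores(first_differences(durations))
```

In this series the jump from 103 ms to 180 ms produces by far the largest difference, so the last entry of `diff_z` stands out from the rest.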
  • the purpose is to determine the type of the system log data (that is, whether it belongs to CPU, memory, disk, etc.) and to obtain the matching scores corresponding to the different system log data types in each group of system log data, and then to calculate the z-score of the matching scores corresponding to the different system log data in each group and the z-score of the first-order difference data between different matching scores.
  • the specific calculation method of the z-score refers to that shown in step S32.
  • Step S35 Using the historical operating data carrying the fault type label to train the model to be trained based on the logistic regression algorithm to obtain a trained supervised learning model.
  • Step S36 Extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.
  • Step S37 Using the weight coefficient of the linear parameter to perform weighted calculations on the corresponding standard scores of the operating status data, and perform fault location on the service system to be detected based on the weighted scores.
  • the microservice call chain data and system log data are sampled to obtain multiple sets of system performance index data, multiple sets of microservice call chain data, and multiple sets of system log data;
  • the obtained z-score is weighted and calculated, and based on the weighted score, the fault location of the service system to be detected is performed.
  • The embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically, it includes:
  • Step S41 Obtain the current operation data of each service node in the service system to be detected; the current operation data includes various operation state data.
  • Step S42 Standardize the current operating data by using a preset data standardization method to obtain standard scores corresponding to various operating status data.
  • Step S43 Use the historical operating data with the fault type label to train the model to be trained based on the logistic regression algorithm to obtain a trained supervised learning model.
  • Step S44 extracting weight coefficients corresponding to each linear parameter in the trained supervised learning model; wherein, different linear parameters correspond to different operating state data.
  • Step S45 Use the weight coefficient of the linear parameter to perform weighted calculation on the corresponding standard score of the running status data in each service node, so as to obtain the weighted score of each service node.
  • After the weight coefficient corresponding to each linear parameter is obtained, the weight coefficients are used to weight the standard scores of the corresponding running status data in each service node of the system to be detected. For example, for a service node whose system indicator data includes the four indicator types CPU, memory, disk and database, each of the four has its own standard score, and the call time in the microservice call chain data and the matching scores of the system log data also have their own standard scores. The weight coefficients of the various linear parameters extracted from the model are then used to weight these standard scores to obtain the weighted score of each service node.
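The per-node weighting can be sketched as a weighted sum of standard scores. The source says only that the standard scores are weighted by the coefficients, so combining them by summation is an assumption, as are the function name, the parameter names, and the treatment of a missing score as zero.

```python
def weighted_score(weights, standard_scores):
    """Combine one node's standard scores into a single weighted score.

    weights:         parameter name -> weight coefficient from the trained model
    standard_scores: parameter name -> z-score of that parameter on this node
    Parameters with no score on this node contribute nothing (assumed).
    """
    return sum(
        weight * standard_scores.get(name, 0.0)
        for name, weight in weights.items()
    )
```

For example, with weights of 0.4 for CPU and 0.8 for memory and standard scores of 1.0 and 2.0, the node's weighted score is 0.4 × 1.0 + 0.8 × 2.0 = 2.0.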
  • Step S46 Screen out a preset number of service nodes with a weighted score greater than a preset threshold from all service nodes in descending order of weighted scores, so as to determine a faulty target service node based on the service nodes obtained after screening.
  • In this embodiment, the weighted score of each service node is obtained through step S45, and a preset number of service nodes whose weighted scores are greater than a preset threshold are then screened out from all service nodes in descending order of weighted score. For example, with the preset number set to 3 and the preset threshold set to 0.9, all weighted scores are sorted in descending order and the first 3 service nodes with scores greater than 0.9 are screened out; the faulty target service node is then determined based on these three service nodes.
  • In one specific implementation, all three service nodes can be taken as faulty target service nodes, that is, all three are considered faulty. In another specific implementation, one or two service nodes can be selected from the three through manual participation, according to certain rules, as the faulty target service nodes.
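The screening of step S46 can be sketched as below, using the preset number 3 and preset threshold 0.9 from the example above. The function name, service names and scores are hypothetical and chosen only for illustration.

```python
def screen_nodes(node_scores, preset_number=3, preset_threshold=0.9):
    """Sort nodes by weighted score (descending) and keep at most
    `preset_number` nodes whose score exceeds `preset_threshold`."""
    ranked = sorted(node_scores.items(), key=lambda item: item[1], reverse=True)
    above_threshold = [(node, score) for node, score in ranked
                       if score > preset_threshold]
    return above_threshold[:preset_number]


# Hypothetical weighted scores for five service nodes.
node_scores = {"order": 1.4, "payment": 0.95, "cart": 0.3,
               "user": 2.1, "search": 1.0}

candidates = screen_nodes(node_scores)
print(candidates)  # [('user', 2.1), ('order', 1.4), ('search', 1.0)]
```

Here "payment" exceeds the threshold but is dropped because only the top three nodes are kept; the returned candidates can then be accepted as a whole or narrowed down manually, matching the two implementations described above.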
  • Step S47 Screen out the largest weight coefficient from the weight coefficients of all linear parameters corresponding to the target service node, and determine the parameter type of the linear parameter corresponding to the largest weight coefficient as the corresponding root cause of the failure.
  • In this embodiment, the largest weight coefficient is selected from the weight coefficients of all linear parameters corresponding to the target service node. For example, suppose that in a target service node the weight coefficient of the CPU is 0.4, that of the memory is 0.8, that of the disk is 0.8, that of the database is 0.2, and that of the network is 0.5. The largest weight coefficient of the target service node is then 0.8, and the parameter types of the linear parameters corresponding to 0.8, namely memory and disk, are determined as the root cause of the failure.
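The root-cause selection of step S47 can be sketched with the weight coefficients from the example above. Returning every parameter type that ties for the largest coefficient is an assumption made so that the sketch reproduces the "memory and disk" outcome; the function name is hypothetical.

```python
def root_causes(weights):
    """Return every parameter type whose weight coefficient equals the maximum."""
    largest = max(weights.values())
    return [name for name, weight in weights.items() if weight == largest]


# Weight coefficients of the target service node, taken from the example above.
weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2, "network": 0.5}

print(root_causes(weights))  # ['memory', 'disk']
```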
  • In this embodiment, the weight coefficients of the linear parameters are used to perform weighted calculation on the standard scores of the operating state data in each service node to obtain the weighted score of each service node; a preset number of service nodes whose weighted scores are greater than the preset threshold are screened out from all service nodes in descending order of weighted score to determine the faulty target service node; and finally the parameter type of the linear parameter corresponding to the largest weight coefficient in the target service node is determined as the corresponding root cause of the failure.
  • In the embodiment of the present application, calculating the weighted score of each service node and sorting the weighted scores to determine the faulty target service node makes it possible to locate faults among the service nodes in the system to be detected.
  • As shown in Figure 6, the embodiment of the present application also discloses a system fault detection device, which includes:
  • a data acquisition module 11, used to acquire the current operating data of each service node in the service system to be detected, where the current operating data includes various operating state data;
  • a standardization processing module 12, used to standardize the current operating data using a preset data standardization method, so as to obtain the standard scores corresponding to the various operating state data;
  • a model training module 13, used to train a to-be-trained model constructed based on a logistic regression algorithm using historical operating data carrying fault type labels, so as to obtain a trained supervised learning model;
  • a weight coefficient extraction module 14, used to extract the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data;
  • a fault location module 15, used to perform weighted calculation on the standard scores of the corresponding operating state data using the weight coefficients of the linear parameters, and to perform fault location on the service system to be detected based on the weighted scores.
  • It can be seen that the present application first obtains the current operating data of each service node in the service system to be detected, where the current operating data includes various operating state data; then standardizes the current operating data using a preset data standardization method to obtain the standard scores corresponding to the various operating state data; then uses historical operating data carrying fault type labels to train a to-be-trained model constructed based on a logistic regression algorithm, obtaining a trained supervised learning model; extracts the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data; and finally uses the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data, and performs fault location on the service system to be detected based on the weighted scores.
  • In other words, the weight coefficient of each linear parameter obtained from the trained supervised learning model is used to weight the standard scores of the operating state data of each service node, giving a weighted score. The weighted scores are sorted, those that meet the preset conditions are screened out, and the corresponding service nodes are determined, which locates the fault within the system. The root cause of the system failure is then determined from the weight coefficients of the component information in the determined service node, improving the efficiency of fault location.
  • FIG. 7 is a schematic structural diagram of a computer device 20 provided by an embodiment of the present application. Specifically, the computer device 20 may include at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25 and a communication bus 26.
  • The memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the system fault detection method performed by the computer device as disclosed in any of the foregoing embodiments.
  • The power supply 23 is used to provide operating voltage for each hardware device on the computer device 20. The communication interface 24 can create a data transmission channel between the computer device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here. The input/output interface 25 is used to obtain external input data or to output data externally; its specific interface type can be selected according to the needs of the application and is not specifically limited here.
  • The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • The processor 21 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array).
  • The processor 21 may also include a main processor and a coprocessor. The main processor, also called the CPU (Central Processing Unit), processes data in the wake-up state; the coprocessor is a low-power processor that processes data in the standby state.
  • The processor 21 may be integrated with a GPU (Graphics Processing Unit), which is used for rendering and drawing the content to be displayed on the display screen.
  • The processor 21 may also include an AI (Artificial Intelligence) processor, which is used to handle computing operations related to machine learning.
  • The memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, magnetic disk, optical disk, etc. The resources stored on it include the operating system 221, the computer program 222 and the data 223, and the storage may be short-term or permanent.
  • The operating system 221 is used to manage and control each hardware device and the computer program 222 on the computer device 20, so that the processor 21 can operate on and process the massive data 223 in the memory 22; it may be Windows, Unix, Linux, etc.
  • In addition to the computer program used to perform the system fault detection method, the computer program 222 may further include computer programs used to complete other specific tasks.
  • The data 223 may include not only data received by the computer device from external devices, but also data collected through its own input/output interface 25 and the like.
  • The embodiment of the present application also discloses a storage medium in which a computer program is stored. When the computer program is loaded and executed by a processor, the steps of the system fault detection method disclosed in any of the foregoing embodiments are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application discloses a system fault detection method and apparatus, a device, and a medium. The method comprises: acquiring current operation data of each service node in a service system to be detected; standardizing the current operation data by using a preset data standardization method, so as to obtain standard scores respectively corresponding to various operation state data; training, by using historical operation data carrying a fault type label, a model to be trained constructed based on a logistic regression algorithm, so as to obtain a trained supervised learning model; extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; and respectively performing weighted calculation on the standard scores of the corresponding operation state data by using the weight coefficients of the linear parameters, and performing fault positioning on said service system on the basis of weighted scores. According to the present application, the supervised learning model is obtained on the basis of the historical operation data, and the model is used to detect the current operation data by means of the weighted calculation, so as to detect a system fault.

Description

A system fault detection method, device, equipment and medium
Cross References to Related Applications
This application claims priority to the Chinese patent application No. 202111554982.5, entitled "A System Fault Detection Method, Device, Equipment and Medium", filed with the China Patent Office on December 17, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer systems, and in particular to a system fault detection method, device, equipment and medium.
Background
A cloud-native environment has four main characteristics: microservices, automated release, continuous delivery, and containerization. The microservice architecture shows great advantages in independent deployment, fast delivery and scalability. At the same time, because a microservice system contains a large number of services, the calling relationships between services become extremely complicated, and when the system has problems it is difficult for operation and maintenance administrators to find faults and troubleshoot problems quickly, accurately and comprehensively. Fault detection and root cause location in a service system environment therefore require more intelligent algorithm models.
At present, in service system scenarios with many services and a large amount of operation and maintenance data, such as private cloud monitoring, large-scale microservice troubleshooting, and intelligent operation and maintenance of cloud-native platforms, the large number of service nodes makes the calling relationships between service nodes extremely complicated when a problem occurs in the service system. Existing techniques mostly find and troubleshoot faults through threshold detection and rule-based alarms, and operation and maintenance personnel often find it difficult to find faults and troubleshoot problems quickly, accurately and comprehensively.
In summary, how to detect and locate faults in a service system automatically, quickly, accurately and comprehensively is a problem that remains to be solved.
Summary of the invention
In view of this, the purpose of the present application is to provide a system fault detection method, device, equipment and medium that can detect and locate faults in a service system automatically, quickly, accurately and comprehensively. The specific solution is as follows:
In a first aspect, the present application discloses a system fault detection method, including:
Obtaining the current operating data of each service node in the service system to be detected, where the current operating data includes various operating state data;
Standardizing the current operating data using a preset data standardization method, so as to obtain the standard scores corresponding to the various operating state data;
Training a to-be-trained model constructed based on a logistic regression algorithm using historical operating data carrying fault type labels, so as to obtain a trained supervised learning model;
Extracting the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data;
Using the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data, and performing fault location on the service system to be detected based on the weighted scores.
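The training and weight extraction steps of the first aspect can be illustrated with a small, dependency-free logistic regression. This is only a sketch under stated assumptions: it treats the problem as binary (fault versus no fault) although the text speaks of fault type labels, it uses plain batch gradient descent, and the feature names, training data and hyperparameters are invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(samples, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical labelled history: standard scores [cpu, memory]; 1 = fault, 0 = normal.
feature_names = ["cpu", "memory"]
samples = [[2.5, 0.1], [3.0, -0.2], [2.0, 0.3],
           [0.1, 0.2], [-0.3, -0.1], [0.2, 0.0]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(samples, labels)

# One weight coefficient per linear parameter, i.e. per type of operating state data.
coefficients = dict(zip(feature_names, weights))
print(coefficients)
```

In this toy data set faults coincide with high CPU scores, so the CPU coefficient comes out much larger than the memory coefficient. A library implementation would normally replace the hand-written training loop; scikit-learn's `LogisticRegression`, for example, exposes the fitted coefficients through its `coef_` attribute.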
Optionally, obtaining the current operating data of each service node in the service system to be detected includes:
Obtaining the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected, so as to obtain the current operating data of each service node.
Optionally, obtaining the system performance index data, microservice call chain data and system log data of each service node in the service system to be detected, so as to obtain the current operating data of each service node, includes:
Determining the time length of the sliding window of the time series;
Within the time length of each sliding window, sampling the system performance index data of each service node in the service system to be detected at a first preset time interval, so as to obtain multiple groups of system performance index data, arranged in time order and corresponding to the multiple sliding windows;
Within the time length of each sliding window, sampling the microservice call chain data of each service node in the service system to be detected at a second preset time interval, so as to obtain multiple groups of microservice call chain data, arranged in time order and corresponding to the multiple sliding windows;
Within the time length of each sliding window, sampling the system log data of each service node in the service system to be detected at a third preset time interval, so as to obtain multiple groups of system log data, arranged in time order and corresponding to the multiple sliding windows.
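The windowed sampling described above can be sketched as follows. The sketch assumes integer timestamps in seconds and consecutive, non-overlapping windows; the text does not state the window stride, so that choice, like the function name and the data, is an assumption.

```python
def sample_windows(series, window_length, interval):
    """Group (timestamp, value) pairs into consecutive windows of
    `window_length` seconds, keeping one sample every `interval` seconds."""
    if not series:
        return []
    start = series[0][0]
    groups = {}
    for timestamp, value in series:
        offset = timestamp - start
        if offset % interval != 0:
            continue  # this point is not on the sampling grid
        groups.setdefault(offset // window_length, []).append(value)
    return [groups[index] for index in sorted(groups)]


# Hypothetical metric reported once per second for 12 seconds.
series = [(t, t * 10) for t in range(12)]

# 6-second windows sampled every 2 seconds.
print(sample_windows(series, window_length=6, interval=2))
# [[0, 20, 40], [60, 80, 100]]
```

The same helper could be called three times with different `interval` values to mirror the first, second and third preset time intervals used for performance indicators, call chain data and log data.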
Optionally, standardizing the current operating data using a preset data standardization method, so as to obtain the standard scores corresponding to the various operating state data, includes:
Calculating the z-score corresponding to each group of system performance index data, and the z-score corresponding to the first-order difference data between different system performance index data within each group;
Obtaining the microservice call time in each group of microservice call chain data, and calculating the z-score corresponding to the microservice call time in each group, and the z-score corresponding to the first-order difference data between different microservice call times within each group;
Matching each group of system log data against preset log templates to obtain the matching scores corresponding to the different system log data in each group, and calculating the z-scores of those matching scores and the z-scores of the first-order difference data between the different matching scores of each group.
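A minimal sketch of the two kinds of standard score named above, the z-score of the raw samples and the z-score of their first-order differences. It assumes the first-order difference is taken between consecutive samples within a group and that the population standard deviation is used; neither detail is fixed by the text, and the sample values are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant series: no deviation to report
    return [(v - mean) / std for v in values]


def first_difference(values):
    """Difference between each sample and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical CPU usage samples from one sliding window.
cpu_usage = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(z_scores(cpu_usage))          # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(first_difference(cpu_usage))  # [2.0, 0.0, 0.0, 1.0, 0.0, 2.0, 2.0]
diff_z = z_scores(first_difference(cpu_usage))
```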
Optionally, calculating the z-score corresponding to the microservice call time in each group of microservice call chain data, and the z-score corresponding to the first-order difference data between different microservice call times within each group, includes:
For each group of microservice call chain data, determining the parent node and child node of the microservice call chain corresponding to that group;
Adding, to both the parent node and the child node, a call duration and a call direction representing the call relationship;
Calculating the total microservice call time in the group of microservice call chain data based on the call durations and call directions of the parent node and child node;
Calculating the z-score corresponding to the total microservice call time in the group of microservice call chain data, and the z-score corresponding to the first-order difference data between different microservice call times within each group of microservice call chain data.
Optionally, the process of standardizing any group of operating state data in the current operating data includes:
Using an optimized mean calculation formula and an optimized variance calculation formula to calculate the mean and variance corresponding to the group of operating state data, and calculating the z-score corresponding to the group based on that mean and variance; wherein,
The optimized mean calculation formula is:
Figure PCTCN2022122295-appb-000001
The optimized variance calculation formula is:
Figure PCTCN2022122295-appb-000002
where,
Figure PCTCN2022122295-appb-000003
n represents the number of data samples in the group of operating state data, x_i represents the i-th data sample in the group, mean represents the mean, and s^2 represents the variance.
Optionally, using the optimized mean calculation formula and the optimized variance calculation formula to calculate the mean and variance corresponding to the group of operating state data includes:
For any group of operating state data, maintaining the data samples of the group in a preset target queue;
Obtaining the data samples in the target queue;
Calculating, from the data samples, the mean and variance corresponding to the group of operating state data using the optimized mean calculation formula and the optimized variance calculation formula.
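One way to maintain a group's samples in a queue while keeping the mean and variance cheap to update is sketched below. The optimized formulas themselves appear only as figures that are not reproduced in this text, so the sketch does not claim to match them: it assumes the common running-sums identity s^2 = (1/n)·Σx_i^2 − mean^2, and the class and method names are hypothetical.

```python
from collections import deque


class RunningStats:
    """Fixed-size queue of samples with O(1) mean and variance updates."""

    def __init__(self, max_samples):
        self.queue = deque()
        self.max_samples = max_samples
        self.total = 0.0      # running sum of x_i
        self.total_sq = 0.0   # running sum of x_i squared

    def push(self, x):
        """Add a sample, evicting the oldest one when the queue is full."""
        self.queue.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.queue) > self.max_samples:
            old = self.queue.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        # Requires at least one sample in the queue.
        return self.total / len(self.queue)

    def variance(self):
        # Population variance; clamped at 0 to absorb rounding error.
        m = self.mean()
        return max(self.total_sq / len(self.queue) - m * m, 0.0)

    def z_score(self, x):
        var = self.variance()
        return 0.0 if var == 0 else (x - self.mean()) / var ** 0.5


stats = RunningStats(max_samples=4)
for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.push(sample)

print(list(stats.queue), stats.mean(), stats.variance())
# [2.0, 3.0, 4.0, 5.0] 3.5 1.25
```

The running-sums form can lose precision when the values are large relative to their spread; a production version might prefer an incremental update such as Welford's method.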
Optionally, before training the to-be-trained model constructed based on the logistic regression algorithm using the historical operating data carrying fault type labels, the method further includes:
Obtaining historical normal operation data and historical fault operation data;
Adding label information, including the corresponding running time interval label and a no-fault type label, to the historical normal operation data, so as to obtain first historical operation data serving as negative samples;
Adding label information, including the corresponding running time interval label and the fault type label, to the historical fault operation data, and resampling the labelled historical fault operation data to obtain second historical operation data serving as positive samples, so that the ratio between the number of samples of the second historical operation data and the number of samples of the first historical operation data reaches a preset positive-to-negative sample ratio.
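The resampling step can be sketched as simple random oversampling of the fault (positive) samples until the preset ratio is reached. The text does not say which resampling technique is used, so sampling with replacement is an assumption, as are the record fields, the function name and the ratio of 0.5.

```python
import math
import random


def oversample_positives(positives, negatives, ratio, seed=0):
    """Repeat randomly chosen positive samples until
    len(result) / len(negatives) reaches `ratio`."""
    rng = random.Random(seed)
    target = math.ceil(ratio * len(negatives))
    resampled = list(positives)
    while len(resampled) < target:
        resampled.append(rng.choice(positives))
    return resampled


# Hypothetical labelled history: 10 normal windows and 2 fault windows.
normal = [{"window": i, "fault_type": None} for i in range(10)]
faults = [{"window": 10, "fault_type": "cpu"},
          {"window": 11, "fault_type": "disk"}]

second_history = oversample_positives(faults, normal, ratio=0.5)
print(len(second_history), len(normal))  # 5 10
```

If the positive samples already meet the ratio, the sketch returns them unchanged; it does not downsample.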
Optionally, using the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data includes:
Obtaining, through a preset expert knowledge acquisition interface, expert knowledge used to optimize the weight coefficients of the linear parameters;
Adjusting the weight coefficients of the linear parameters according to the expert knowledge, so as to obtain adjusted weight coefficients of the linear parameters;
Using the adjusted weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data.
Optionally, obtaining, through the preset expert knowledge acquisition interface, the expert knowledge used to optimize the weight coefficients of the linear parameters includes:
Extracting optimization data from an optimized weight coefficient model established based on a historical expert knowledge base;
Obtaining the optimization data through the preset expert knowledge acquisition interface, and using the optimization data as the expert knowledge for optimizing the weight coefficients of the linear parameters.
Optionally, obtaining, through the preset expert knowledge acquisition interface, the expert knowledge used to optimize the weight coefficients of the linear parameters includes:
Obtaining, through the preset expert knowledge acquisition interface, an instruction for manually optimizing the weight coefficients, and using that instruction as the expert knowledge for optimizing the weight coefficients of the linear parameters.
Optionally, adjusting the weight coefficients of the linear parameters according to the expert knowledge, so as to obtain the adjusted weight coefficients of the linear parameters, includes:
Determining the fault type label corresponding to the linear parameter;
If the second historical operation data does not include that fault type label, using the expert knowledge to raise the weight coefficient of the linear parameter corresponding to the fault type label, so that the adjusted weight coefficient of the linear parameter is greater than its original weight coefficient.
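The upward adjustment can be sketched as follows. The sketch assumes that expert knowledge arrives as a positive increment per fault type and that each linear parameter is keyed by the fault type label it corresponds to; both are illustrative assumptions, as are all names and values.

```python
def apply_expert_knowledge(weights, labels_in_training_data, expert_increments):
    """Raise the weight of parameters whose fault type never appeared
    in the labelled fault history, leaving all other weights unchanged."""
    adjusted = dict(weights)
    for fault_type, increment in expert_increments.items():
        if increment <= 0:
            raise ValueError("expert increments must be positive")
        if fault_type in adjusted and fault_type not in labels_in_training_data:
            adjusted[fault_type] += increment
    return adjusted


# Hypothetical model weights; no network fault was ever seen during training.
weights = {"cpu": 0.5, "memory": 0.75, "network": 0.0}
labels_in_training_data = {"cpu", "memory"}
expert_increments = {"network": 0.5, "cpu": 0.25}

adjusted = apply_expert_knowledge(weights, labels_in_training_data, expert_increments)
print(adjusted)  # {'cpu': 0.5, 'memory': 0.75, 'network': 0.5}
```

An additive increment is used rather than a multiplier so that the adjusted weight is greater than the original even when the original weight is zero or negative.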
Optionally, using the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data, and performing fault location on the service system to be detected based on the weighted scores, includes:
Using the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data in each service node, so as to obtain the weighted score of each service node;
Screening out, from all service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold, so as to determine the faulty target service node based on the service nodes obtained by the screening;
Selecting the largest weight coefficient from the weight coefficients of all linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the largest weight coefficient as the corresponding root cause of the fault.
Optionally, screening out, from all service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold, so as to determine the faulty target service node based on the service nodes obtained by the screening, includes:
Screening out, from all service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than the preset threshold;
Determining the screened-out service nodes whose weighted scores are greater than the preset threshold as the faulty target service nodes.
Optionally, screening out, from all service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold, so as to determine the faulty target service node based on the service nodes obtained by the screening, includes:
Screening out, from all service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than the preset threshold;
In response to a user's selection operation on the screened-out service nodes whose weighted scores are greater than the preset threshold, determining the service node selected by the user;
Determining the service node selected by the user as the faulty target service node.
Optionally, before standardizing the current operating data using the preset data standardization method, the method further includes:
Performing data cleaning on the current operating data, where the data cleaning includes one or more of the following: removing duplicate data from the current operating data, supplementing missing data in the current operating data, and correcting erroneous data in the current operating data.
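The three cleaning operations can be sketched on a list of timestamped samples. The specific policies used here (a repeated timestamp counts as a duplicate, missing values are forward-filled, out-of-range values are clamped to a valid range) are assumptions chosen for illustration; the text names the operations but not how they are carried out.

```python
def clean_records(records, valid_range=(0.0, 100.0)):
    """Clean (timestamp, value) pairs that are sorted by timestamp.

    - duplicates: a repeated timestamp is dropped
    - missing:    a None value is replaced by the previous valid value
    - erroneous:  a value outside `valid_range` is clamped into the range
    """
    low, high = valid_range
    cleaned = []
    seen = set()
    last_value = None
    for timestamp, value in records:
        if timestamp in seen:
            continue  # duplicate sample
        seen.add(timestamp)
        if value is None:
            value = last_value  # forward-fill the gap
        elif not low <= value <= high:
            value = min(max(value, low), high)  # clamp an impossible reading
        if value is None:
            continue  # nothing earlier to fill from
        cleaned.append((timestamp, value))
        last_value = value
    return cleaned


# Hypothetical CPU usage percentages with a duplicate, a gap and a bad reading.
records = [(0, 10.0), (0, 10.0), (1, None), (2, 250.0), (3, 40.0)]
print(clean_records(records))
# [(0, 10.0), (1, 10.0), (2, 100.0), (3, 40.0)]
```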
In a second aspect, the present application discloses a system fault detection device, including:
A data acquisition module, used to acquire the current operating data of each service node in the service system to be detected, where the current operating data includes various operating state data;
A standardization processing module, used to standardize the current operating data using a preset data standardization method, so as to obtain the standard scores corresponding to the various operating state data;
A model training module, used to train a to-be-trained model constructed based on a logistic regression algorithm using historical operating data carrying fault type labels, so as to obtain a trained supervised learning model;
A weight coefficient extraction module, used to extract the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data;
A fault location module, used to perform weighted calculation on the standard scores of the corresponding operating state data using the weight coefficients of the linear parameters, and to perform fault location on the service system to be detected based on the weighted scores.
In a third aspect, the present application discloses an electronic device, including:
A memory, used to store a computer program;
A processor, used to execute the computer program to implement the steps of the system fault detection method disclosed above.
In a fourth aspect, the present application discloses a computer non-volatile readable storage medium for storing a computer program, where the steps of the system fault detection method disclosed above are implemented when the computer program is executed by a processor.
It can be seen that the present application first obtains the current operating data of each service node in the service system to be detected, where the current operating data includes various operating state data; then standardizes the current operating data using a preset data standardization method to obtain the standard scores corresponding to the various operating state data; then uses historical operating data carrying fault type labels to train a to-be-trained model constructed based on a logistic regression algorithm, obtaining a trained supervised learning model; extracts the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data; and finally uses the weight coefficients of the linear parameters to perform weighted calculation on the standard scores of the corresponding operating state data, and performs fault location on the service system to be detected based on the weighted scores. In other words, the weight coefficient of each linear parameter obtained from the trained supervised learning model is used to weight the standard scores of the operating state data of each service node, giving a weighted score. The weighted scores are sorted, those that meet the preset conditions are screened out, and the corresponding service nodes are determined, which locates the fault within the system. The root cause of the system failure is then determined from the weight coefficients of the component information in the determined service node, improving the efficiency of fault location.
Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a system fault detection method disclosed in the present application;
FIG. 2 is a flowchart of a specific system fault detection method disclosed in the present application;
FIG. 3 is a flowchart of a specific system fault detection method disclosed in the present application;
FIG. 4 is a flowchart of a specific fault detection and root cause determination process disclosed in the present application;
FIG. 5 is a flowchart of a specific system fault detection method disclosed in the present application;
FIG. 6 is a schematic structural diagram of a system fault detection apparatus disclosed in the present application;
FIG. 7 is a structural diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
In service-system scenarios with many services and large volumes of operation and maintenance data, such as private cloud monitoring, large-scale microservice troubleshooting, and intelligent operation and maintenance of cloud-native platforms, the service system contains many service nodes, and the calling relationships among them become extremely complex. When a problem occurs, existing techniques mostly find and troubleshoot faults through threshold detection and rule-based alarms, and operation and maintenance personnel often cannot find faults and troubleshoot problems quickly, accurately, and comprehensively. To this end, the embodiments of the present application disclose a system fault detection method, apparatus, device, and medium that can detect and locate faults in a service system automatically, quickly, accurately, and comprehensively.
Referring to FIG. 1, an embodiment of the present application discloses a system fault detection method, which includes:
Step S11: Obtain the current operating data of each service node in the service system to be detected; the current operating data includes multiple kinds of operating state data.
In this embodiment, the current operating data of each service node in the service system to be detected is obtained. The service system to be detected may include, but is not limited to, any one of a private cloud monitoring system, a large-scale microservice system, or an intelligent operation and maintenance system of a cloud-native platform. The current operating data of each service node includes multiple kinds of operating state data, that is, multiple kinds of data that characterize the operating state of the service system to be detected.
Step S12: Standardize the current operating data using a preset data standardization method to obtain a standard score for each kind of operating state data.
In this embodiment, the different kinds of operating state data in the current operating data have different dimensions, units, or orders of magnitude. So that they can be compared with and weighted against one another, the current operating data is standardized using a preset data standardization method, which converts it into dimensionless pure numbers. These numbers are the standard scores corresponding to the respective kinds of operating state data.
Step S13: Train a to-be-trained model built on a logistic regression algorithm using historical operating data carrying fault type labels to obtain a trained supervised learning model.
In this embodiment, the to-be-trained model is trained on historical operating data that has been prepared in advance and carries fault type labels, yielding a trained supervised learning model. The to-be-trained model is built on a logistic regression algorithm; that is, this embodiment uses a logistic classifier (logistic regression classifier) for supervised learning. Note that the purpose of building a supervised learning model with logistic regression is not to use supervised learning itself to detect faults and locate root causes in the system to be detected. Rather, the computation performed by logistic regression is used to adjust the weights of the corresponding linear parameters in the model from a limited amount of historical operating data carrying fault type labels.
Step S14: Extract the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data.
In this embodiment, as noted for step S13, the supervised learning model is built with the logistic regression classification algorithm in order to adjust the weights of the corresponding linear parameters in the model. Therefore, once training is complete, the weight coefficient of each linear parameter is extracted from the supervised learning model. Different linear parameters correspond to different operating state data.
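The idea in steps S13 and S14, training a logistic regression only to read off its per-feature weights, can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the gradient-descent trainer, the feature names (`cpu`, `memory`, `network`), and the toy samples are all assumptions made for the example.

```python
import math

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - y                      # gradient of log-loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / m for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Hypothetical standardized features per sample: [cpu, memory, network]
FEATURES = ["cpu", "memory", "network"]
rows = [
    [0.1, -0.2, 0.0], [-0.3, 0.1, 0.2], [0.2, 0.0, -0.1], [0.0, 0.3, 0.1],  # normal
    [2.5, 0.1, 0.0], [3.0, -0.2, 0.3], [2.8, 0.2, -0.1], [2.2, 0.0, 0.1],   # faulty
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = sample carries a fault type label

w, b = train_logistic_regression(rows, labels)
# Step S14: one weight coefficient per linear parameter (i.e. per metric)
weights = dict(zip(FEATURES, w))
```

In this toy data only the CPU feature separates faulty from normal samples, so its extracted weight dominates the other two.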
Step S15: Weight the standard scores of the corresponding operating state data by the weight coefficients of the linear parameters, and locate faults in the service system to be detected based on the weighted scores.
In this embodiment, weighting the standard scores of the corresponding operating state data by the weight coefficients of the linear parameters may further include: obtaining, through a preset expert knowledge acquisition interface, expert knowledge for optimizing the weight coefficients of the linear parameters; adjusting the weight coefficients of the linear parameters according to the expert knowledge to obtain adjusted weight coefficients; and weighting the standard scores of the corresponding operating state data by the adjusted weight coefficients. That is, after the weight coefficients have been extracted from the supervised learning model, expert knowledge can be obtained through the preset expert knowledge acquisition interface and used to adjust the weight coefficients of the different linear parameters. The expert knowledge may be optimization data extracted from a weight coefficient optimization model built on a historical expert knowledge base and delivered through the preset expert knowledge acquisition interface, which shortens the weight coefficient optimization process and reduces wasted human effort. Alternatively, it may be an instruction for manually optimizing the weight coefficients, received through the same interface. Optimizing the weight coefficients with expert knowledge improves the accuracy and comprehensiveness of the subsequent fault location in the service system to be detected. The purpose of adjusting the weight coefficients with expert knowledge is to increase the model's sensitivity to system faults it has not yet encountered.
In an optional embodiment, adjusting the weight coefficients of the linear parameters according to the expert knowledge to obtain the adjusted weight coefficients may include:
Determining the fault type label corresponding to a linear parameter; and, when the second historical operating data does not include that fault type label, using the expert knowledge to raise the weight coefficient of the linear parameter corresponding to that fault type label, so that the adjusted weight coefficient of the linear parameter is greater than its original weight coefficient. Because different operating state data correspond to different linear parameters, the fault type label of a linear parameter can be determined from the fault type label of the operating state data to which it corresponds. For example, consider the network performance indicator "system.tcp.syn_recv". If this indicator is frequently associated with network faults, the fault type label corresponding to it is a network-fault label. If the historical operating data contains no fault type label for this kind of system fault, the weight coefficient of the linear parameter corresponding to this kind of fault needs to be raised appropriately. Here, tcp (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream-based transport layer communication protocol. syn_recv is the state in which the server, after a passive open, has received the client's syn and sent an ack. syn (synchronize sequence numbers) is the handshake signal used when TCP/IP establishes a connection. ack (acknowledge character) is a transmission control character sent by the receiving station to the sending station in data communication, indicating that the data sent has been received without error.
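The upward adjustment described above can be sketched as a small helper. The source only requires that the adjusted coefficient be greater than the original one; the multiplicative `boost`, the additive floor `min_bump`, the metric name `system.cpu.user`, and the label strings are assumptions chosen for illustration.

```python
def apply_expert_adjustment(weights, metric_fault_type, labels_in_history,
                            boost=1.5, min_bump=0.05):
    """Raise the weight of every metric whose fault type never appears in history.

    weights            -- metric name -> weight coefficient from the trained model
    metric_fault_type  -- metric name -> fault type label assigned by experts
    labels_in_history  -- set of fault type labels present in the historical data
    """
    adjusted = dict(weights)
    for metric, fault_type in metric_fault_type.items():
        if metric in adjusted and fault_type not in labels_in_history:
            w = adjusted[metric]
            # max(...) guarantees a strict increase even for zero or negative weights
            adjusted[metric] = max(w * boost, w + min_bump)
    return adjusted

# Hypothetical inputs: the history contains CPU faults but no network faults.
weights = {"system.tcp.syn_recv": 0.2, "system.cpu.user": 1.1}
metric_fault_type = {"system.tcp.syn_recv": "network", "system.cpu.user": "cpu"}
adjusted = apply_expert_adjustment(weights, metric_fault_type, {"cpu"})
```

Only `system.tcp.syn_recv` is raised here, because its fault type is absent from the labeled history; the CPU weight is left as trained.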
It can be seen that this embodiment of the present application first obtains the current operating data of each service node in the service system to be detected, the current operating data including multiple kinds of operating state data. It then standardizes the current operating data using a preset data standardization method to obtain a standard score for each kind of operating state data. Next, it trains a to-be-trained model built on a logistic regression algorithm using historical operating data carrying fault type labels, obtaining a trained supervised learning model, and extracts from that model the weight coefficient of each linear parameter, where different linear parameters correspond to different operating state data. Finally, it weights the standard scores of the corresponding operating state data by the weight coefficients of the linear parameters and locates faults in the service system to be detected based on the weighted scores. In other words, the weight coefficients obtained from the trained supervised learning model are applied to the standard scores of each service node's operating state data to obtain weighted scores; the weighted scores of the groups are sorted, and those meeting a preset condition are selected; and the service nodes corresponding to the selected weighted scores are determined, thereby locating the fault. The root cause of the system fault is then determined from the weight coefficients of the component information in the identified service nodes, which improves the efficiency of fault location.
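The weighting and ranking in step S15 can be sketched as below. The passage does not spell out the preset condition or the exact root cause rule, so this sketch assumes a simple score threshold and treats the metric with the largest weighted contribution as the root cause candidate; the node names, metrics, and numbers are hypothetical.

```python
def weighted_scores(node_z_scores, weights):
    """Weighted sum of standard scores per service node."""
    return {
        node: sum(weights.get(metric, 0.0) * z for metric, z in zs.items())
        for node, zs in node_z_scores.items()
    }

def rank_nodes(scores, threshold):
    """Sort nodes by weighted score (descending) and keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [node for node, score in ranked if score >= threshold]

def top_contributor(z_by_metric, weights):
    """Metric with the largest weighted contribution on one node (assumed root cause rule)."""
    return max(z_by_metric, key=lambda m: weights.get(m, 0.0) * z_by_metric[m])

node_z_scores = {
    "svc-a": {"cpu": 0.2, "network": 0.1},
    "svc-b": {"cpu": 3.0, "network": 0.5},
}
weights = {"cpu": 1.0, "network": 0.5}

scores = weighted_scores(node_z_scores, weights)
suspects = rank_nodes(scores, threshold=2.0)
root_cause = top_contributor(node_z_scores[suspects[0]], weights)
```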
Referring to FIG. 2, an embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. It specifically includes:
Step S21: Obtain the system performance indicator data, microservice call chain data, and system log data of each service node in the service system to be detected, to obtain the current operating data of each service node.
In this embodiment, the current operating data of each service node in the service system to be detected is obtained. The current operating data mainly includes three kinds of operating state data: system performance indicator data, microservice call chain data, and system log data. The indicator types in the system performance indicator data may include, but are not limited to, any one or more of CPU (Central Processing Unit), memory, disk, database, JVM (Java Virtual Machine), network, I/O (Input/Output), and HA (High Availability, that is, a dual-machine cluster). In the microservice call chain data, each group of call chain data contains the trace number (TraceId), the unit number of each call (SpanId), the name of the called service (ServiceName), the physical unit where the call runs (CmdbId), and the call duration (Duration). The system log data likewise contains multiple groups of log data in categories corresponding to the system performance indicators above.
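As a rough sketch of the call chain record described above, the five fields can be held in a small data class. The field types, the millisecond unit for the duration, and the grouping helper are assumptions; the source names the fields only.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    trace_id: str       # TraceId: identifies the whole call chain
    span_id: str        # SpanId: identifies one call within the chain
    service_name: str   # ServiceName: the service that was called
    cmdb_id: str        # CmdbId: the physical unit hosting the call
    duration_ms: float  # Duration: call duration (unit assumed to be ms)

def durations_by_service(spans):
    """Group call durations by service so each service yields one time series."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.service_name].append(span.duration_ms)
    return dict(grouped)

spans = [
    Span("t1", "s1", "order", "host-01", 12.0),
    Span("t1", "s2", "payment", "host-02", 8.0),
    Span("t2", "s1", "order", "host-01", 30.0),
]
by_service = durations_by_service(spans)
```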
Step S22: Standardize the current operating data using a preset data standardization method to obtain a standard score for each kind of operating state data.
Step S23: Obtain historical normal operating data and historical faulty operating data.
In this embodiment, the historical normal operating data and historical faulty operating data already present in the system are obtained. Note that the vast majority of the operating data in the system (for example, more than 99%) is normal, and only a very small fraction (less than 1%) corresponds to faults.
Step S24: Add label information, including a corresponding operating time interval label and a no-fault type label, to the historical normal operating data to obtain first historical operating data serving as negative samples.
In this embodiment, the historical normal operating data obtained above is used as negative samples to form the first historical operating data. Label information containing the corresponding operating time interval label and a no-fault type label is added to the historical normal operating data. The no-fault type label indicates that the operating data does not lead to a system fault.
Step S25: Add label information, including a corresponding operating time interval label and a fault type label, to the historical faulty operating data, and resample the labeled historical faulty operating data to obtain second historical operating data serving as positive samples, so that the ratio of the number of samples in the second historical operating data to the number of samples in the first historical operating data reaches a preset positive-to-negative sample ratio.
In this embodiment, similarly to step S24, the historical faulty operating data obtained in step S23 is used as positive samples to form the second historical operating data. Label information containing the corresponding operating time interval label and a fault type label is added to the historical faulty operating data. The fault type label indicates that the operating data leads to a system fault. Because the vast majority of the operating data in the system is normal and only a very small fraction corresponds to faults, the positive and negative samples obtained from the historical faulty and historical normal operating data are highly imbalanced. To address this imbalance, the labeled historical faulty operating data is resampled to increase the share of positive samples in the total, so that the ratio of the number of positive samples in the second historical operating data to the number of negative samples in the first historical operating data reaches the preset positive-to-negative sample ratio. In this embodiment, the preset positive-to-negative sample ratio is 1:10.
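One simple way to reach the 1:10 ratio is random oversampling of the positive samples with replacement, sketched below. The source says only that the faulty data is resampled, so the choice of random oversampling, the function name, and the fixed seed are assumptions.

```python
import random

def oversample_positives(positives, negatives, negatives_per_positive=10, seed=0):
    """Duplicate positive samples at random until positives:negatives reaches 1:N."""
    # Ceiling division: the number of positives needed for the target ratio
    target = -(-len(negatives) // negatives_per_positive)
    if not positives or len(positives) >= target:
        return list(positives)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return list(positives) + extra

# Hypothetical data: 1000 normal windows, 10 labeled faulty windows.
negatives = [("normal", i) for i in range(1000)]
positives = [("fault", i) for i in range(10)]
resampled = oversample_positives(positives, negatives)
```

With 1000 negatives and 10 original positives, the resampled positive set contains 100 entries, giving the 1:10 ratio used in this embodiment.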
Step S26: Train the to-be-trained model built on the logistic regression algorithm using the first historical operating data and the second historical operating data to obtain a trained supervised learning model.
In this embodiment, the to-be-trained model built on the logistic regression algorithm is trained using the first historical operating data and the second historical operating data, which together satisfy the preset positive-to-negative sample ratio, to obtain the trained supervised learning model.
Step S27: Extract the weight coefficient corresponding to each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating state data.
Step S28: Weight the standard scores of the corresponding operating state data by the weight coefficients of the linear parameters, and locate faults in the service system to be detected based on the weighted scores.
For more details on steps S22, S27, and S28, refer to the corresponding content disclosed in the foregoing embodiment; they are not repeated here.
It can be seen that, in this embodiment of the present application, the system performance indicator data, microservice call chain data, and system log data of each service node in the service system to be detected are obtained, and the corresponding standard scores are calculated using the preset data standardization method. The obtained historical normal operating data and historical faulty operating data are marked with the corresponding label information and serve as the source of the negative and positive samples. Because the numbers of positive and negative samples are imbalanced, the labeled historical faulty operating data is resampled to increase the share of positive samples in the total, so that the ratio of the number of positive samples in the second historical operating data to the number of negative samples in the first historical operating data reaches the preset positive-to-negative sample ratio. Finally, the to-be-trained model based on logistic regression is trained on the first and second historical operating data to obtain the supervised learning model, and the corresponding weight coefficients are extracted from it to weight the standard scores of the operating state data, so as to locate faults in the service system to be detected. In this way, a supervised learning model is built from a small amount of historical faulty operating data together with historical normal operating data, and the model can then be used, by means of stream computing, to examine the multiple kinds of operating state data in the current operating data.
Referring to FIG. 3 and FIG. 4, an embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. It specifically includes:
Step S31: Determine the time length of the sliding window of the time series, and, within the time length of each sliding window, sample the system performance indicator data, microservice call chain data, and system log data of each service node in the service system to be detected at a first preset time interval, a second preset time interval, and a third preset time interval respectively, to obtain multiple groups of system performance indicator data, multiple groups of microservice call chain data, and multiple groups of system log data that are arranged in time order and correspond to the multiple sliding windows.
In this embodiment, the sliding window length of the time series is determined first. Here it is set to 30 minutes, so the current operating data is sampled at the respective preset time intervals within each 30-minute window. The three kinds of operating state data in the current operating data of each service node, namely the system performance indicator data, microservice call chain data, and system log data, are sampled at the first, second, and third preset time intervals respectively. This yields multiple groups of system performance indicator data, microservice call chain data, and system log data that are arranged in time order and correspond to the multiple sliding windows. The first, second, and third preset time intervals may or may not be equal to one another. For example, for the system performance indicator data, the first preset time interval can be set to 1 minute; sampling within a 30-minute sliding window then yields 30 sampling points, and arranging the data values of these 30 sampling points in time order gives one group of system performance indicator data. The same applies to the microservice call chain data and the system log data. The number of sampling points is determined by the preset time interval: different intervals yield different numbers of sampling points, and equal intervals yield equal numbers.
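The relationship between window length, sampling interval, and number of sampling points can be checked with a short sketch. The helper name and the 5-minute interval are assumptions; the 30-minute window and 1-minute interval come from the example above.

```python
def sample_offsets(window_seconds, interval_seconds):
    """Offsets (in seconds from the window start) at which one window is sampled."""
    return list(range(0, window_seconds, interval_seconds))

WINDOW = 30 * 60  # 30-minute sliding window

metric_offsets = sample_offsets(WINDOW, 60)   # 1-minute interval, as in the example
log_offsets = sample_offsets(WINDOW, 5 * 60)  # hypothetical 5-minute interval
```

A 1-minute interval gives 30 sampling points per window, while the hypothetical 5-minute interval gives 6.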
Step S32: Calculate the z-score corresponding to each group of system performance indicator data, and the z-score corresponding to the first-order difference data between the different system performance indicator data within each group.
In this embodiment, after the multiple groups of system performance indicator data have been obtained, each group is classified by its indicator type and standardized. For example, when the indicator type is CPU, only the obtained CPU-related indicator data needs to be calculated. This embodiment mainly uses z-score standardization; that is, it calculates the z-score corresponding to each group of system performance indicator data, and additionally calculates the z-score corresponding to the first-order difference data between the different system performance indicator data within each group. Before the system performance indicator data is standardized, it also needs to be cleaned. The purpose of data cleaning is to remove duplicate or redundant data, fill in missing data, and correct or delete erroneous data, thereby improving data quality and reducing the error rate when the data is used.
The z-score is calculated as follows:
$$\mathrm{Metric}_i = \frac{\mathrm{value}_i - \mathrm{mean}}{\mathrm{std}}$$
Here, Metric_i is the z-score for the corresponding indicator type in a group of operating state data; value_i is the data value of each sampling point in the group; mean is the mean of the group of operating state data within one sliding window; and std is the standard deviation of the group of operating state data within one sliding window.
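The z-score standardization of a window and of its first-order difference can be sketched as follows. Two points are assumptions: the standard deviation is taken as the population standard deviation, and the first-order difference is read as the difference between consecutive sampling points. A constant window is mapped to all zeros to avoid dividing by zero.

```python
import statistics

def z_scores(values):
    """Metric_i = (value_i - mean) / std for every sampling point in one window."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return [0.0] * len(values)   # constant series: no deviation to report
    return [(v - mean) / std for v in values]

def first_order_difference(values):
    """Differences between consecutive sampling points."""
    return [b - a for a, b in zip(values, values[1:])]

cpu_window = [1.0, 2.0, 3.0]  # hypothetical samples from one sliding window
cpu_z = z_scores(cpu_window)
cpu_diff_z = z_scores(first_order_difference([1.0, 4.0, 9.0]))
```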
The traditional formulas for calculating the mean and the variance are:
[Formula image: PCTCN2022122295-appb-000005]
[Formula image: PCTCN2022122295-appb-000006]
In stream computing, however, the data volume is very large, and calculating the mean and variance with the traditional method performs poorly. The present application therefore proposes a method for calculating the mean and variance with O(1) time complexity, which effectively improves model performance. The method optimizes the formulas above as follows:
Expanding the traditional variance formula gives:
[Formula image: PCTCN2022122295-appb-000007]
Then let
[Formula image: PCTCN2022122295-appb-000008]
from which the following are obtained:
优化后均值计算公式:
Figure PCTCN2022122295-appb-000009
The formula for calculating the mean value after optimization is:
Figure PCTCN2022122295-appb-000009
优化后方差计算公式:
Figure PCTCN2022122295-appb-000010
The formula for calculating the variance after optimization is:
Figure PCTCN2022122295-appb-000010
That is, when standardizing any set of operating status data in the current operating data, this embodiment may proceed as follows: use the optimized mean formula and the optimized variance formula to calculate the mean and variance of the set, then calculate the z-score of the set from that mean and variance. In the formulas, n is the number of data samples in the set, x_i is the i-th data sample in the set, mean is the mean, and s² is the variance. The same optimized formulas can be used both for the z-score of the set of operating status data itself and for the z-score of its first-order difference data: in each case they yield the mean and variance from which the z-score is then derived.
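The equation images referenced above are not reproduced in this text. One reading that is consistent with the surrounding description is sketched here. It assumes the population variance (division by n) and assumes that S0 and S1, the two quantities named in the description, denote the running sum and the running sum of squares; the text confirms neither assumption.

```latex
% Assumed reconstruction; the original equation images are not available here.
\begin{align*}
\text{mean} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \text{mean}\right)^2 \\
s^2 &= \frac{1}{n}\sum_{i=1}^{n} x_i^2
      - 2\,\text{mean}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + \text{mean}^2
     = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \text{mean}^2 \\
S_0 &= \sum_{i=1}^{n} x_i, \qquad S_1 = \sum_{i=1}^{n} x_i^2 \\
\text{mean} &= \frac{S_0}{n}, \qquad
s^2 = \frac{S_1}{n} - \left(\frac{S_0}{n}\right)^2
\end{align*}
```

Under this reading, adding or removing one sample changes S0 and S1 by a single term each, which is what makes a constant-time window update possible.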
In this way, only a queue that maintains the x_i in the current interval is needed to update the values of S0 and S1 quickly, so the standard score for each item of operating status data can be calculated in O(1) time. That is, however large the data set, the mean and the variance of a set of operating status data within a sliding window can be obtained in a single calculation. The method therefore greatly speeds up the calculation of the sliding-window mean and variance, and in turn the calculation of the standard scores, which improves the efficiency of system fault detection.
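A minimal sketch of the queue-based update follows. It assumes that S0 is the running sum of the samples and S1 the running sum of their squares (the text names these quantities without defining them) and that the variance is the population variance. The class name `RollingZScore` and its method are hypothetical.

```python
from collections import deque
from math import sqrt

class RollingZScore:
    """Sliding-window z-score with O(1) work per new sample."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.samples = deque()  # queue holding the x_i currently in the window
        self.s0 = 0.0           # assumed meaning: running sum of x_i
        self.s1 = 0.0           # assumed meaning: running sum of x_i squared

    def push(self, x):
        """Add one sample, evict the oldest if the window is full, return z."""
        self.samples.append(x)
        self.s0 += x
        self.s1 += x * x
        if len(self.samples) > self.window_size:
            old = self.samples.popleft()
            self.s0 -= old
            self.s1 -= old * old
        n = len(self.samples)
        mean = self.s0 / n
        variance = max(self.s1 / n - mean * mean, 0.0)  # clamp rounding noise
        std = sqrt(variance)
        return 0.0 if std == 0.0 else (x - mean) / std

roller = RollingZScore(window_size=4)
scores = [roller.push(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

Each call touches only the newest and the oldest sample, so the cost does not grow with the window length. One caveat of this form is that subtracting two large, nearly equal sums can lose floating-point precision; the clamp to zero only guards against a slightly negative variance.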
Step S33: Obtain the microservice call time in each set of microservice call chain data, and calculate the z-score of the microservice call time in each set as well as the z-score of the first-order difference data between the different microservice call times in each set.
In this embodiment, after the multiple sets of microservice call chain data are obtained, each set is decomposed as a graph and the response time of each service call is calculated. Because every call chain has parent nodes and child nodes, a call duration and a call direction representing the call relationship must be added to both the parent node and the child node for each set of call chain data. The total microservice call time in each set is then obtained from these call times and call durations, after which the z-score of the total microservice call time in each set and the z-score of the first-order difference data between the different microservice call times in each set are calculated. The z-score is calculated as described in step S32.
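The decomposition described above can be illustrated with a small sketch. The span record layout, the service names, and the choice to attribute each call's duration to the called service are all assumptions made for illustration; the text does not specify the data format or how the total call time is aggregated.

```python
from collections import defaultdict

# Hypothetical span records: (span_id, parent_id, service, duration_ms)
spans = [
    ("a", None, "gateway", 120.0),
    ("b", "a", "orders", 80.0),
    ("c", "b", "inventory", 30.0),
    ("d", "a", "billing", 25.0),
]

def decompose(spans):
    """Split a call chain into directed parent -> child edges with durations."""
    service_of = {span_id: service for span_id, _, service, _ in spans}
    edges = []
    for span_id, parent_id, service, duration in spans:
        if parent_id is None:
            continue  # the root span has no caller
        edges.append({
            "caller": service_of[parent_id],  # call direction: caller -> callee
            "callee": service,
            "duration_ms": duration,
        })
    return edges

def total_call_time(edges):
    """Sum the call durations attributed to each called service."""
    totals = defaultdict(float)
    for edge in edges:
        totals[edge["callee"]] += edge["duration_ms"]
    return dict(totals)

edges = decompose(spans)
totals = total_call_time(edges)
```

The per-service totals collected over successive windows would then form the series whose z-scores and first-order difference z-scores are calculated as in step S32.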
Step S34: Match each set of system log data against preset log templates to obtain the matching scores of the different system log data in each set, and calculate the z-scores of those matching scores as well as the z-scores of the first-order difference data between the different matching scores of each set.
In this embodiment, after the multiple sets of system log data are obtained, each set is matched and checked against the corresponding log templates in order to determine the type of the system log data (that is, whether it relates to CPU, memory, disk, and so on) and to obtain the matching scores of the different system log data types in each set. The z-scores of the matching scores of the different system log data in each set are then calculated, together with the z-scores of the first-order difference data between the different matching scores of each set. The z-score is calculated as described in step S32.
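As an illustration of template matching, the sketch below scores one group of log lines against a few regular-expression templates. The templates, the log lines, and the rule of adding one point per matching line are hypothetical; the text does not describe the template format or how a matching score is computed.

```python
import re

# Hypothetical templates: (resource type, pattern, score awarded on a match)
LOG_TEMPLATES = [
    ("cpu", re.compile(r"cpu (usage|load) .*(high|exceeded)", re.I), 1.0),
    ("memory", re.compile(r"out of memory|oom", re.I), 1.0),
    ("disk", re.compile(r"no space left on device|i/o error", re.I), 1.0),
]

def match_scores(log_lines):
    """Accumulate a matching score per resource type for one group of logs."""
    scores = {name: 0.0 for name, _, _ in LOG_TEMPLATES}
    for line in log_lines:
        for name, pattern, weight in LOG_TEMPLATES:
            if pattern.search(line):
                scores[name] += weight
    return scores

group = [
    "kernel: Out of memory: Killed process 4312 (java)",
    "app: CPU usage threshold exceeded on node-3",
    "app: request served in 12 ms",
]
scores = match_scores(group)
```

The per-type scores from successive windows can then be standardized in the same way as the performance index data.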
Step S35: Train a to-be-trained model built on the logistic regression algorithm with historical operating data carrying fault type labels, to obtain a trained supervised learning model.
Step S36: Extract the weight coefficient of each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating status data.
Step S37: Use the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating status data, and perform fault location on the service system to be detected based on the weighted scores.
For more detail on steps S35, S36, and S37, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
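Steps S35 and S36 can be sketched with a small logistic regression fitted by batch gradient descent. This is an illustrative stand-in rather than the training procedure of the application: the feature names, the labelled history, the learning rate, and the epoch count are all hypothetical.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

FEATURES = ["cpu", "memory", "call_time"]  # hypothetical feature order
# Hypothetical labelled history: z-scores per feature, label 1 = fault.
history = [
    ([0.1, 2.5, 0.2], 1), ([0.0, 3.0, -0.1], 1), ([0.3, 2.2, 0.4], 1),
    ([0.2, 0.1, 0.0], 0), ([-0.1, -0.3, 0.2], 0), ([0.4, 0.0, -0.2], 0),
]
rows = [x for x, _ in history]
labels = [y for _, y in history]
weights, bias = train_logistic_regression(rows, labels)
coefficients = dict(zip(FEATURES, weights))  # one weight per linear parameter
```

Because only the memory feature separates the faulty samples from the healthy ones in this toy history, it receives the largest coefficient, which is the quantity step S36 extracts.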
It can be seen that, in this embodiment, the sliding window length of the time series is determined first, and within the duration of the sliding window the system performance index data, microservice call chain data, and system log data of each service node in the service system to be detected are sampled at preset time intervals to obtain multiple sets of system performance index data, multiple sets of microservice call chain data, and multiple sets of system log data. The z-scores of each set of system performance index data, of the microservice call time in each set of microservice call chain data, and of the matching scores of the different system log data in each set of system log data are then calculated, together with the z-scores of the corresponding first-order difference data. These z-scores are subsequently weighted with the weight coefficients of the different linear parameters extracted from the supervised learning model, and fault location is performed on the service system to be detected based on the weighted scores.
Referring to FIG. 5, an embodiment of the present application discloses a specific system fault detection method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. It specifically includes:
Step S41: Obtain the current operating data of each service node in the service system to be detected; the current operating data includes multiple kinds of operating status data.
Step S42: Standardize the current operating data with a preset data standardization method to obtain the standard scores corresponding to the various kinds of operating status data.
Step S43: Train a to-be-trained model built on the logistic regression algorithm with historical operating data carrying fault type labels, to obtain a trained supervised learning model.
Step S44: Extract the weight coefficient of each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating status data.
Step S45: Use the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating status data in each service node, to obtain the weighted score of each service node.
In this embodiment, after the weight coefficient of each linear parameter is obtained, it is used to weight the standard scores of the corresponding operating status data in each service node of the system to be detected. For example, consider the three kinds of operating status data of one service node. Suppose the system performance index data covers four index types: CPU, memory, disk, and database. Each of these has its own standard score, as do the call time in the microservice call chain data and the matching score of the system log data. The weight coefficients of the linear parameters extracted from the model are then applied to the corresponding standard scores to obtain the weighted score of each service node.
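The weighting in step S45 might look like the sketch below, which assumes that the weighted calculation is a plain weighted sum of the standard scores. The text does not give the exact formula, and all weights, node names, and scores here are hypothetical.

```python
# Hypothetical weight coefficients extracted from the trained model.
weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2,
           "call_time": 0.5, "log_match": 0.3}

# Hypothetical standard scores (z-scores) per service node.
node_scores = {
    "orders":    {"cpu": 0.2, "memory": 2.0, "disk": 0.1, "database": 0.0,
                  "call_time": 1.0, "log_match": 1.0},
    "inventory": {"cpu": 0.1, "memory": 0.0, "disk": 0.2, "database": 0.1,
                  "call_time": 0.2, "log_match": 0.0},
}

def weighted_score(z_by_metric, weights):
    """Sum of weight * z-score over the metrics present in both mappings."""
    return sum(weights[m] * z for m, z in z_by_metric.items() if m in weights)

node_weighted = {node: weighted_score(z, weights)
                 for node, z in node_scores.items()}
```

With these numbers the "orders" node scores far higher than "inventory", driven mainly by its memory z-score.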
Step S46: In descending order of weighted score, screen out from all service nodes a preset number of service nodes whose weighted scores exceed a preset threshold, and determine the faulty target service node based on the service nodes obtained by the screening.
In this embodiment, step S45 yields the weighted score of each service node, and a preset number of service nodes whose weighted scores exceed a preset threshold are then screened out from all service nodes in descending order of weighted score. Here the preset number is set to 3 and the preset threshold to 0.9: after all weighted scores are sorted in descending order, the top 3 service nodes with scores greater than 0.9 are selected, and the faulty target service node is determined from these 3 service nodes.
The target service node can be determined in two ways. In one implementation, all 3 service nodes are taken as faulty target service nodes, that is, all 3 are considered to have faults. In another implementation, one or two of the 3 service nodes are selected manually, according to certain rules, as the faulty target service nodes.
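The screening in step S46 can be sketched as a filter followed by a sort, using the preset number 3 and the preset threshold 0.9 from this embodiment. The function name and the node scores are hypothetical.

```python
def screen_nodes(weighted, threshold=0.9, top_k=3):
    """Keep nodes scoring above threshold, highest first, at most top_k."""
    above = [(node, score) for node, score in weighted.items() if score > threshold]
    above.sort(key=lambda item: item[1], reverse=True)
    return above[:top_k]

weighted = {"orders": 2.56, "inventory": 0.32, "billing": 1.4,
            "gateway": 0.95, "search": 1.1}  # hypothetical weighted scores
candidates = screen_nodes(weighted)
```

Here "gateway" exceeds the threshold but ranks fourth, so it is dropped; the three remaining candidates can either all be treated as faulty or be narrowed down manually, as described above.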
Step S47: Select the largest weight coefficient from the weight coefficients of all linear parameters corresponding to the target service node, and determine the parameter type of the linear parameter with the largest weight coefficient as the corresponding root cause of the fault.
In this embodiment, after the faulty target service node is determined, the largest weight coefficient is selected from the weight coefficients of all linear parameters corresponding to that node. For example, suppose that in a target service node the weight coefficient is 0.4 for CPU, 0.8 for memory, 0.8 for disk, 0.2 for database, and 0.5 for network. The largest weight coefficient is then 0.8, and the parameter types of the linear parameters with that coefficient, namely memory and disk, are determined as the root causes of the fault.
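The selection in step S47, including the tie between memory and disk in the example above, can be sketched as follows. The function name is hypothetical; the weight values are those of the example.

```python
def root_causes(weight_by_parameter):
    """Return every parameter type that shares the largest weight coefficient."""
    largest = max(weight_by_parameter.values())
    return [name for name, w in weight_by_parameter.items() if w == largest]

weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2, "network": 0.5}
causes = root_causes(weights)
```

Exact equality is adequate for the literal values of the example; with coefficients taken from a trained model, comparing within a small tolerance would be a more robust way to detect ties.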
For more detail on steps S41, S42, S43, and S44, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
It can be seen that, in this embodiment, the weight coefficients of the linear parameters are used to weight the standard scores of the operating status data in each service node, yielding a weighted score for each service node. A preset number of service nodes whose weighted scores exceed a preset threshold are then screened out from all service nodes in descending order of weighted score, to determine the faulty target service node. Finally, the parameter type of the linear parameter with the largest weight coefficient in the target service node is determined as the corresponding root cause of the fault. By calculating a weighted score for each service node and ranking the scores to determine the faulty target service node, this embodiment can perform fault detection on each service node in the system to be detected and determine the root cause of the fault in the target service node.
Referring to FIG. 6, an embodiment of the present application further discloses a system fault detection apparatus, which includes:
a data acquisition module 11, configured to obtain the current operating data of each service node in the service system to be detected, where the current operating data includes multiple kinds of operating status data;
a standardization processing module 12, configured to standardize the current operating data with a preset data standardization method to obtain the standard scores corresponding to the various kinds of operating status data;
a model training module 13, configured to train a to-be-trained model built on the logistic regression algorithm with historical operating data carrying fault type labels, to obtain a trained supervised learning model;
a weight coefficient extraction module 14, configured to extract the weight coefficient of each linear parameter in the trained supervised learning model, where different linear parameters correspond to different operating status data;
a fault location module 15, configured to use the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating status data, and to perform fault location on the service system to be detected based on the weighted scores.
It can be seen that this application first obtains the current operating data of each service node in the service system to be detected, where the current operating data includes multiple kinds of operating status data. It then standardizes the current operating data with a preset data standardization method to obtain the standard scores corresponding to the various kinds of operating status data. Next, it trains a to-be-trained model built on the logistic regression algorithm with historical operating data carrying fault type labels to obtain a trained supervised learning model, and extracts the weight coefficient of each linear parameter in the trained model, where different linear parameters correspond to different operating status data. Finally, it uses the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating status data and performs fault location on the service system to be detected based on the weighted scores. In other words, the weight coefficient of each linear parameter obtained from the trained supervised learning model is used to weight the standard scores of each service node's operating status data, producing weighted scores. These weighted scores are sorted to screen out those that meet the preset conditions, and the corresponding service nodes are identified, which locates the fault within the system. The root cause of the system fault is then determined from the weight coefficients of the component information in the identified service node, which improves the efficiency of fault location.
FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The device may specifically include at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the system fault detection method performed by the computer device as disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 provides the operating voltage for each hardware component of the computer device 20. The communication interface 24 creates a data transmission channel between the computer device 20 and external devices; it may follow any communication protocol applicable to the technical solution of this application, which is not specifically limited here. The input/output interface 25 obtains input data from the outside or outputs data to the outside; its specific interface type may be chosen according to the needs of the application and is not specifically limited here.
The processor 21 may include one or more processing cores, for example a 4-core or an 8-core processor. The processor 21 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor. The main processor, also called the CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.
In addition, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored on it include an operating system 221, a computer program 222, and data 223, and the storage may be temporary or permanent.
The operating system 221 manages and controls the hardware components of the computer device 20 and the computer program 222, so that the processor 21 can compute on and process the massive data 223 in the memory 22; it may be Windows, Unix, Linux, or the like. Besides the computer program that carries out the system fault detection method performed by the computer device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks. The data 223 may include data received by the computer device from external devices as well as data collected through its own input/output interface 25.
Further, an embodiment of the present application also discloses a storage medium storing a computer program. When the computer program is loaded and executed by a processor, it implements the method steps performed in the system fault detection process disclosed in any of the foregoing embodiments.
Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The system fault detection method, apparatus, device, and storage medium provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (20)

  1. A system fault detection method, comprising:
    obtaining current operating data of each service node in a service system to be detected, wherein the current operating data includes multiple kinds of operating status data;
    standardizing the current operating data with a preset data standardization method to obtain standard scores respectively corresponding to the various kinds of operating status data;
    training a to-be-trained model built on a logistic regression algorithm with historical operating data carrying fault type labels, to obtain a trained supervised learning model;
    extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model, wherein different linear parameters respectively correspond to different operating status data;
    using the weight coefficients of the linear parameters to weight the standard scores of the corresponding operating status data, and performing fault location on the service system to be detected based on the weighted scores.
  2. The system fault detection method according to claim 1, wherein obtaining the current operating data of each service node in the service system to be detected comprises:
    obtaining system performance index data, microservice call chain data, and system log data of each service node in the service system to be detected, to obtain the current operating data of each service node.
  3. The system fault detection method according to claim 2, wherein obtaining the system performance index data, microservice call chain data, and system log data of each service node in the service system to be detected, to obtain the current operating data of each service node, comprises:
    determining a time length of a sliding window of a time series;
    sampling, within the time length of each sliding window and at a first preset time interval, the system performance index data of each service node in the service system to be detected, to obtain multiple sets of system performance index data arranged in time order and corresponding to multiple sliding windows;
    sampling, within the time length of each sliding window and at a second preset time interval, the microservice call chain data of each service node in the service system to be detected, to obtain multiple sets of microservice call chain data arranged in time order and corresponding to multiple sliding windows;
    sampling, within the time length of each sliding window and at a third preset time interval, the system log data of each service node in the service system to be detected, to obtain multiple sets of system log data arranged in time order and corresponding to multiple sliding windows.
  4. The system fault detection method according to claim 3, wherein standardizing the current operating data with the preset data standardization method to obtain the standard scores respectively corresponding to the various kinds of operating status data comprises:
    calculating a z-score corresponding to each set of system performance index data and a z-score corresponding to first-order difference data between different system performance index data in each set of system performance index data;
    obtaining a microservice call time in each set of microservice call chain data, and calculating a z-score corresponding to the microservice call time in each set of microservice call chain data and a z-score corresponding to first-order difference data between different microservice call times in each set of microservice call chain data;
    matching each set of system log data against preset log templates to obtain matching scores corresponding to different system log data in each set of system log data, and calculating z-scores of the matching scores corresponding to the different system log data in each set of system log data and z-scores of first-order difference data between the different matching scores corresponding to each set of system log data.
  5. The system fault detection method according to claim 4, wherein calculating the z-score of the microservice call time in each group of microservice call chain data and the z-score of the first-order difference data between different microservice call times within each group comprises:
    for each group of microservice call chain data, determining the parent nodes and child nodes of the microservice call chain corresponding to that group;
    adding, to both the parent nodes and the child nodes, a call duration and a call direction that represents the call relationship;
    calculating the total microservice call time in that group based on the call durations and call directions of the parent nodes and the child nodes;
    calculating the z-score of the total microservice call time in that group and the z-score of the first-order difference data between different microservice call times within each group.
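One possible reading of the call chain step, sketched in Python. The claim does not say how durations are combined, so the sketch simply assumes that each call is a directed parent-to-child edge carrying its own duration and that the total call time of a chain is the sum over its edges. The `Call` type, the service names, and the durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    parent: str         # calling service (parent node)
    child: str          # called service (child node); parent -> child is the call direction
    duration_ms: float  # call duration attached to this edge


def total_call_time(chain):
    """Sum the durations of all parent -> child calls in one call chain."""
    return sum(call.duration_ms for call in chain)


# Three hypothetical call chains for the same request path.
chains = [
    [Call("gateway", "orders", 12.0), Call("orders", "db", 5.0)],
    [Call("gateway", "orders", 13.0), Call("orders", "db", 6.0)],
    [Call("gateway", "orders", 90.0), Call("orders", "db", 70.0)],
]
totals = [total_call_time(chain) for chain in chains]
```

The resulting totals are the values whose z-scores, and whose first-order difference z-scores, the claim then computes. If child durations are already included in their parent's duration, summing only the calls issued by the root node would avoid double counting.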
  6. The system fault detection method according to claim 4, wherein standardizing any group of operating state data in the current operating data comprises:
    calculating the mean and the variance of the group of operating state data using an optimized mean calculation formula and an optimized variance calculation formula, respectively, and calculating the z-score of the group based on that mean and variance; wherein
    the optimized mean calculation formula is:
    Figure PCTCN2022122295-appb-100001
    the optimized variance calculation formula is:
    Figure PCTCN2022122295-appb-100002
    wherein
    Figure PCTCN2022122295-appb-100003
    n denotes the number of data samples in the group of operating state data, x_i denotes the i-th data sample in the group, mean denotes the mean, and s^2 denotes the variance.
  7. The system fault detection method according to claim 6, wherein calculating the mean and the variance of the group of operating state data using the optimized mean calculation formula and the optimized variance calculation formula, respectively, comprises:
    for any group of operating state data, maintaining the data samples of that group in a preset target queue;
    obtaining the data samples in the target queue;
    calculating, from the data samples, the mean and the variance of the group of operating state data using the optimized mean calculation formula and the optimized variance calculation formula, respectively.
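A minimal Python sketch of queue-maintained statistics. The optimized formulas are only available as figures in the source, so the sketch substitutes the standard running-sum identities (mean from the sum, sample variance from the sum of squares). It should be read as a stand-in, not as the claimed formulas. The class name `WindowStats` and the fixed window length are assumptions.

```python
from collections import deque


class WindowStats:
    """Keep the latest samples in a queue and update sums incrementally."""

    def __init__(self, maxlen):
        self.queue = deque()
        self.maxlen = maxlen
        self.total = 0.0     # running sum of the samples in the queue
        self.total_sq = 0.0  # running sum of their squares

    def push(self, x):
        self.queue.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.queue) > self.maxlen:
            old = self.queue.popleft()  # evict the oldest sample
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        n = len(self.queue)
        return self.total / n if n else 0.0

    def variance(self):
        n = len(self.queue)
        if n < 2:
            return 0.0
        m = self.mean()
        # Sample variance; clamp tiny negative values caused by rounding.
        return max((self.total_sq - n * m * m) / (n - 1), 0.0)

    def z_score(self, x):
        var = self.variance()
        return 0.0 if var == 0 else (x - self.mean()) / var ** 0.5


stats = WindowStats(maxlen=3)
for sample in [1.0, 2.0, 3.0, 4.0]:
    stats.push(sample)
```

Each new sample costs constant time because only the sums are updated. The sum-of-squares form can lose precision when values are large relative to their spread, which is one reason a production implementation might prefer a different update rule.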
  8. The system fault detection method according to claim 1, wherein, before training the to-be-trained model built on a logistic regression algorithm using the historical operating data carrying fault type labels, the method further comprises:
    obtaining historical normal operating data and historical fault operating data;
    adding, to the historical normal operating data, label information comprising a corresponding operating time interval label and a no-fault type label, to obtain first historical operating data serving as negative samples;
    adding, to the historical fault operating data, label information comprising a corresponding operating time interval label and a fault type label, and resampling the labeled historical fault operating data to obtain second historical operating data serving as positive samples, such that the ratio of the number of samples in the second historical operating data to the number of samples in the first historical operating data reaches a preset positive-to-negative sample ratio.
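The labeling and resampling step might look like the following Python sketch. It assumes that resampling means random sampling with replacement when there are too few fault samples and random subsampling when there are too many. The claim does not fix the resampling technique. The dictionary keys, the fault type names, and the function name are illustrative.

```python
import random


def resample_positives(positives, negatives, ratio, seed=0):
    """Resample fault samples so that len(result) / len(negatives) is about `ratio`."""
    if not positives:
        raise ValueError("at least one fault sample is needed for resampling")
    target = max(1, round(ratio * len(negatives)))
    rng = random.Random(seed)
    if target <= len(positives):
        return rng.sample(positives, target)  # subsample without replacement
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return list(positives) + extra            # keep originals, add duplicates


# Negative samples: time interval label plus a no-fault type label.
negatives = [{"window": (t, t + 60), "fault_type": "none"} for t in range(0, 600, 60)]
# Positive samples: time interval label plus a fault type label.
positives = [
    {"window": (600, 660), "fault_type": "cpu_overload"},
    {"window": (660, 720), "fault_type": "network_delay"},
]
balanced = resample_positives(positives, negatives, ratio=0.5)
```

Faults are rare in normal operation, so without resampling a classifier could reach high accuracy by always predicting "no fault".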
  9. The system fault detection method according to claim 8, wherein performing the weighted calculation on the standard scores of the corresponding operating state data using the weight coefficients of the linear parameters comprises:
    obtaining, through a preset expert knowledge acquisition interface, expert knowledge for optimizing the weight coefficients of the linear parameters;
    adjusting the weight coefficients of the linear parameters according to the expert knowledge, to obtain adjusted weight coefficients of the linear parameters;
    performing the weighted calculation on the standard scores of the corresponding operating state data using the adjusted weight coefficients of the linear parameters.
  10. The system fault detection method according to claim 9, wherein obtaining, through the preset expert knowledge acquisition interface, the expert knowledge for optimizing the weight coefficients of the linear parameters comprises:
    extracting optimization data from an optimized weight coefficient model built on a historical expert knowledge base;
    obtaining the optimization data through the preset expert knowledge acquisition interface, and using the optimization data as the expert knowledge for optimizing the weight coefficients of the linear parameters.
  11. The system fault detection method according to claim 9, wherein obtaining, through the preset expert knowledge acquisition interface, the expert knowledge for optimizing the weight coefficients of the linear parameters comprises:
    obtaining, through the preset expert knowledge acquisition interface, an instruction to manually optimize the weight coefficients, and using that instruction as the expert knowledge for optimizing the weight coefficients of the linear parameters.
  12. The system fault detection method according to claim 9, wherein adjusting the weight coefficients of the linear parameters according to the expert knowledge, to obtain the adjusted weight coefficients of the linear parameters, comprises:
    determining the fault type label corresponding to the linear parameter;
    where the second historical operating data does not include that fault type label, increasing, according to the expert knowledge, the weight coefficient of the linear parameter corresponding to that fault type label, such that the adjusted weight coefficient of the linear parameter is greater than its original weight coefficient.
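A minimal Python sketch of the upward adjustment, assuming that expert knowledge arrives as a positive increment per fault type. The claim only requires that the adjusted coefficient be greater than the original, so the additive form, the parameter names, and the fault type names are all assumptions.

```python
def adjust_weights(weights, fault_type_of, seen_fault_types, expert_increment):
    """Raise the weights of parameters whose fault type never appeared in training.

    weights          -- linear parameter name -> learned weight coefficient
    fault_type_of    -- linear parameter name -> fault type label
    seen_fault_types -- fault type labels present in the positive samples
    expert_increment -- fault type label -> positive increment from expert knowledge
    """
    adjusted = dict(weights)
    for param, weight in weights.items():
        fault_type = fault_type_of[param]
        if fault_type in seen_fault_types:
            continue  # the model already learned from examples of this fault
        increment = expert_increment.get(fault_type, 0.0)
        if increment > 0:
            adjusted[param] = weight + increment  # strictly greater than before
    return adjusted


weights = {"cpu_z": 0.75, "disk_z": 0.25}
fault_type_of = {"cpu_z": "cpu_overload", "disk_z": "disk_full"}
adjusted = adjust_weights(
    weights,
    fault_type_of,
    seen_fault_types={"cpu_overload"},
    expert_increment={"disk_full": 0.5},
)
```

An additive increment is used instead of a multiplicative factor because multiplying a zero or negative coefficient would not make it larger.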
  13. The system fault detection method according to any one of claims 1 to 12, wherein performing the weighted calculation on the standard scores of the corresponding operating state data using the weight coefficients of the linear parameters, and locating faults in the service system to be detected based on the weighted scores, comprises:
    performing the weighted calculation on the standard scores of the corresponding operating state data in each service node using the weight coefficients of the linear parameters, to obtain a weighted score for each service node;
    selecting, from all the service nodes in descending order of weighted score, a preset number of service nodes whose weighted scores are greater than a preset threshold, so as to determine the faulty target service node based on the selected service nodes;
    selecting the largest weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the largest weight coefficient as the corresponding fault root cause.
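The fault localization steps can be sketched in Python as follows. The sketch assumes one shared set of weight coefficients for all service nodes, so every target node reports the same root cause. The claim refers to the coefficients corresponding to the target node, which may differ per node in practice. The node names, parameter names, and numbers are made up.

```python
def weighted_score(z_scores, weights):
    """Weighted sum of one node's standard scores."""
    return sum(weights[name] * z_scores[name] for name in weights)


def select_faulty_nodes(node_scores, threshold, top_k):
    """Return at most top_k nodes above the threshold, highest score first."""
    ranked = sorted(node_scores, key=node_scores.get, reverse=True)
    return [node for node in ranked if node_scores[node] > threshold][:top_k]


def root_cause(weights):
    """Return the parameter type that has the largest weight coefficient."""
    return max(weights, key=weights.get)


weights = {"cpu_z": 0.5, "latency_z": 2.0}
node_z = {
    "orders": {"cpu_z": 3.0, "latency_z": 4.0},
    "cart": {"cpu_z": 0.5, "latency_z": 0.0},
    "auth": {"cpu_z": 2.0, "latency_z": 1.0},
}
node_scores = {node: weighted_score(z, weights) for node, z in node_z.items()}
targets = select_faulty_nodes(node_scores, threshold=2.0, top_k=2)
causes = {node: root_cause(weights) for node in targets}
```

The threshold removes nodes that rank high only because every node is healthy, and the preset count keeps the candidate list short enough for an operator to review.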
  14. The system fault detection method according to claim 13, wherein selecting, from all the service nodes in descending order of weighted score, the preset number of service nodes whose weighted scores are greater than the preset threshold, so as to determine the faulty target service node based on the selected service nodes, comprises:
    selecting, from all the service nodes in descending order of weighted score, the preset number of service nodes whose weighted scores are greater than the preset threshold;
    determining the selected service nodes whose weighted scores are greater than the preset threshold as the faulty target service nodes.
  15. The system fault detection method according to claim 13, wherein selecting, from all the service nodes in descending order of weighted score, the preset number of service nodes whose weighted scores are greater than the preset threshold, so as to determine the faulty target service node based on the selected service nodes, comprises:
    selecting, from all the service nodes in descending order of weighted score, the preset number of service nodes whose weighted scores are greater than the preset threshold;
    in response to a user's selection operation on the selected service nodes whose weighted scores are greater than the preset threshold, determining the service node chosen by the user;
    determining the service node chosen by the user as the faulty target service node.
  16. The system fault detection method according to claim 1, wherein, before standardizing the current operating data using the preset data standardization method, the method further comprises:
    performing data cleaning on the current operating data, the data cleaning comprising one or more of the following: removing duplicate data from the current operating data, filling in missing data in the current operating data, and correcting erroneous data in the current operating data.
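The three cleaning operations can be illustrated with a short Python sketch. The concrete rules are assumptions chosen for illustration: duplicates are detected by timestamp, missing values are filled with the previous valid value, and erroneous values are corrected by clamping to a valid range (here 0 to 100, as for a percentage metric). The claim does not prescribe any of these rules.

```python
def clean(records, low=0.0, high=100.0):
    """Clean (timestamp, value) records for one metric.

    - drop records whose timestamp was already seen (duplicates)
    - fill a missing value (None) with the previous valid value
    - clamp out-of-range values into [low, high]
    """
    seen = set()
    cleaned = []
    last_value = None
    for timestamp, value in records:
        if timestamp in seen:
            continue  # duplicate record
        seen.add(timestamp)
        if value is None:
            if last_value is None:
                continue  # nothing to fill from yet, so skip the record
            value = last_value
        value = min(max(value, low), high)
        cleaned.append((timestamp, value))
        last_value = value
    return cleaned


raw = [(1, 10.0), (1, 10.0), (2, None), (3, 250.0), (4, -5.0)]
cleaned = clean(raw)
```

Cleaning before standardization matters because a single corrupt reading can distort both the mean and the variance that the z-scores depend on.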
  17. A system fault detection apparatus, comprising:
    a data acquisition module, configured to acquire the current operating data of each service node in the service system to be detected, the current operating data comprising multiple types of operating state data;
    a standardization module, configured to standardize the current operating data using a preset data standardization method, to obtain the standard score corresponding to each type of operating state data;
    a model training module, configured to train a to-be-trained model built on a logistic regression algorithm using historical operating data carrying fault type labels, to obtain a trained supervised learning model;
    a weight coefficient extraction module, configured to extract the weight coefficient corresponding to each linear parameter in the trained supervised learning model, wherein different linear parameters correspond to different operating state data;
    a fault locating module, configured to perform a weighted calculation on the standard scores of the corresponding operating state data using the weight coefficients of the linear parameters, and to locate faults in the service system to be detected based on the weighted scores.
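The model training and weight coefficient extraction modules can be illustrated with a small, dependency-free Python sketch. It trains a binary logistic regression (fault versus no fault) by batch gradient descent and reads off one coefficient per input feature. The claims refer to fault type labels, so a real system would likely be multi-class. The feature names, training rows, learning rate, and epoch count are all made up.

```python
import math


def sigmoid(t):
    t = max(min(t, 35.0), -35.0)  # clamp to keep exp() well within range
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


feature_names = ["cpu_z", "latency_z"]  # one linear parameter per operating state
rows = [[0.1, -1.0], [-0.2, -0.8], [0.0, -1.2],   # normal operation
        [0.2, 1.0], [-0.1, 1.1], [0.1, 0.9]]      # fault
labels = [0, 0, 0, 1, 1, 1]

learned, bias = train_logistic(rows, labels)
coefficients = dict(zip(feature_names, learned))  # extracted weight coefficients
```

In this toy data only the latency feature separates the two classes, so its coefficient ends up much larger than the CPU coefficient. That ordering is what makes the coefficients usable as weights when scoring nodes.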
  18. The system fault detection apparatus according to claim 17, wherein the apparatus further comprises a weight coefficient adjustment module, configured to:
    obtain, through a preset expert knowledge acquisition interface, expert knowledge for optimizing the weight coefficients of the linear parameters;
    adjust the weight coefficients of the linear parameters according to the expert knowledge, to obtain adjusted weight coefficients of the linear parameters;
    perform the weighted calculation on the standard scores of the corresponding operating state data using the adjusted weight coefficients of the linear parameters.
  19. An electronic device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program to implement the steps of the system fault detection method according to any one of claims 1 to 16.
  20. A non-volatile computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the system fault detection method according to any one of claims 1 to 16.
PCT/CN2022/122295 2021-12-17 2022-09-28 System fault detection method and apparatus, device, and medium WO2023109251A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111554982.5 2021-12-17
CN202111554982.5A CN114328198A (en) 2021-12-17 2021-12-17 System fault detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023109251A1 (en) 2023-06-22

Family

ID=81053078

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122295 WO2023109251A1 (en) 2021-12-17 2022-09-28 System fault detection method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN114328198A (en)
WO (1) WO2023109251A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116725613A (en) * 2023-08-11 2023-09-12 威海市博华医疗设备有限公司 Control method, device and storage medium based on pneumatic hemostatic equipment
CN116882701A (en) * 2023-07-27 2023-10-13 上海洲固电力科技有限公司 Electric power material intelligent scheduling system and method based on zero-carbon mode
CN116990450A (en) * 2023-07-18 2023-11-03 欧几里德(苏州)医疗科技有限公司 Defect detection method and system for cornea shaping mirror
CN117130819A (en) * 2023-10-27 2023-11-28 江西师范大学 Micro-service fault diagnosis method based on time delay variance and correlation coefficient value
CN117312879A (en) * 2023-11-09 2023-12-29 江门塚田正川科技有限公司 Injection molding machine production data supervision and early warning method, system and medium
CN117572159A (en) * 2024-01-17 2024-02-20 成都英华科技有限公司 Power failure detection method and system based on big data analysis
CN117873909A (en) * 2024-03-13 2024-04-12 上海爱可生信息技术股份有限公司 Fault diagnosis execution method, fault diagnosis execution system, electronic device, and storage medium
CN116882701B (en) * 2023-07-27 2024-05-31 上海洲固电力科技有限公司 Electric power material intelligent scheduling system and method based on zero-carbon mode

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328198A (en) * 2021-12-17 2022-04-12 浪潮电子信息产业股份有限公司 System fault detection method, device, equipment and medium
CN115225460B (en) * 2022-07-15 2023-11-28 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN117348605B (en) * 2023-12-05 2024-03-12 东莞栢能电子科技有限公司 Optimization method and system applied to control system of release film tearing machine

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710555A (en) * 2018-05-23 2018-10-26 郑州云海信息技术有限公司 A kind of server error diagnosis method based on supervised learning
CN109446049A (en) * 2018-11-01 2019-03-08 郑州云海信息技术有限公司 A kind of server error diagnosis method and apparatus based on supervised learning
US20200125465A1 (en) * 2018-10-23 2020-04-23 Gluesys, Co, Ltd. Automatic prediction system for server failure and method of automatically predicting server failure
CN111782532A (en) * 2020-07-02 2020-10-16 北京航空航天大学 Software fault positioning method and system based on network abnormal node analysis
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN114328198A (en) * 2021-12-17 2022-04-12 浪潮电子信息产业股份有限公司 System fault detection method, device, equipment and medium

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116990450A (en) * 2023-07-18 2023-11-03 欧几里德(苏州)医疗科技有限公司 Defect detection method and system for cornea shaping mirror
CN116990450B (en) * 2023-07-18 2024-04-26 欧几里德(苏州)医疗科技有限公司 Defect detection method and system for cornea shaping mirror
CN116882701A (en) * 2023-07-27 2023-10-13 上海洲固电力科技有限公司 Electric power material intelligent scheduling system and method based on zero-carbon mode
CN116882701B (en) * 2023-07-27 2024-05-31 上海洲固电力科技有限公司 Electric power material intelligent scheduling system and method based on zero-carbon mode
CN116725613A (en) * 2023-08-11 2023-09-12 威海市博华医疗设备有限公司 Control method, device and storage medium based on pneumatic hemostatic equipment
CN116725613B (en) * 2023-08-11 2024-01-26 威海市博华医疗设备有限公司 Control device based on pneumatic hemostatic equipment
CN117130819B (en) * 2023-10-27 2024-01-30 江西师范大学 Micro-service fault diagnosis method based on time delay variance and correlation coefficient value
CN117130819A (en) * 2023-10-27 2023-11-28 江西师范大学 Micro-service fault diagnosis method based on time delay variance and correlation coefficient value
CN117312879A (en) * 2023-11-09 2023-12-29 江门塚田正川科技有限公司 Injection molding machine production data supervision and early warning method, system and medium
CN117572159A (en) * 2024-01-17 2024-02-20 成都英华科技有限公司 Power failure detection method and system based on big data analysis
CN117572159B (en) * 2024-01-17 2024-03-26 成都英华科技有限公司 Power failure detection method and system based on big data analysis
CN117873909A (en) * 2024-03-13 2024-04-12 上海爱可生信息技术股份有限公司 Fault diagnosis execution method, fault diagnosis execution system, electronic device, and storage medium
CN117873909B (en) * 2024-03-13 2024-05-28 上海爱可生信息技术股份有限公司 Fault diagnosis execution method, fault diagnosis execution system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN114328198A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2023109251A1 (en) System fault detection method and apparatus, device, and medium
US20220292190A1 (en) Methods and apparatus for analyzing sequences of application programming interface traffic to identify potential malicious actions
CN110351150B (en) Fault source determination method and device, electronic equipment and readable storage medium
CN107683586A (en) Method and apparatus for rare degree of the calculating in abnormality detection based on cell density
AU2021309929B2 (en) Anomaly detection in network topology
EP3668007B1 (en) System for identifying and assisting in the creation and implementation of a network service configuration using hidden markov models (hmms)
CN103117879A (en) Network monitoring system for computer hardware processing parameters
US10372572B1 (en) Prediction model testing framework
CN111431819A (en) Network traffic classification method and device based on serialized protocol flow characteristics
US20230132116A1 (en) Prediction of impact to data center based on individual device issue
CN107094086A (en) A kind of information acquisition method and device
US20230133541A1 (en) Alert correlating using sequence model with topology reinforcement systems and methods
CN114363212B (en) Equipment detection method, device, equipment and storage medium
CN114969332A (en) Method and device for training text audit model
CN113282920B (en) Log abnormality detection method, device, computer equipment and storage medium
Auger et al. Towards the internet of everything: Deployment scenarios for a QoO-aware integration platform
CN113093695A (en) Data-driven SDN controller fault diagnosis system
CN116975081A (en) Log diagnosis set updating method, device, equipment and storage medium
US11651271B1 (en) Artificial intelligence system incorporating automatic model updates based on change point detection using likelihood ratios
CN110287256A (en) A kind of electric network data parallel processing system (PPS) and its processing method based on cloud computing
CN108874646A (en) The method and apparatus for analyzing data
CN114385398A (en) Request response state determination method, device, equipment and storage medium
CN112579402A (en) Method and device for positioning faults of application system
CN115277436B (en) Micro-service software architecture identification method based on topological structure
CN112948154A (en) System abnormity diagnosis method, device and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22906003

Country of ref document: EP

Kind code of ref document: A1