CN114328198A - System fault detection method, device, equipment and medium - Google Patents

System fault detection method, device, equipment and medium

Info

Publication number
CN114328198A
Authority
CN
China
Prior art keywords
data
service
fault
scores
historical
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111554982.5A
Other languages
Chinese (zh)
Inventor
赵利强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Guangdong Inspur Smart Computing Technology Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202111554982.5A
Publication of CN114328198A
Priority to PCT/CN2022/122295 (WO2023109251A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software

Abstract

The application discloses a system fault detection method, apparatus, device and medium. The method comprises: acquiring current operation data of each service node in a service system to be detected; standardizing the current operation data using a preset data standardization method to obtain standard scores corresponding to the various kinds of operation state data; training a model constructed on a logistic regression algorithm with historical operation data carrying fault type labels to obtain a trained supervised learning model; extracting the weight coefficient corresponding to each linear parameter in the trained supervised learning model; and weighting the standard scores of the corresponding operation state data with the weight coefficients of the linear parameters, and locating faults in the service system to be detected based on the weighted scores. A supervised learning model is thus obtained from historical operation data, and the current operation data is checked by weighted calculation using that model, so as to detect system faults.

Description

System fault detection method, device, equipment and medium
Technical Field
The present invention relates to the field of computer systems, and in particular, to a method, an apparatus, a device, and a medium for detecting system faults.
Background
Cloud-native environments are mainly characterized by micro-services, automated release, continuous delivery and containerization. A micro-service architecture offers significant advantages in independent deployment, rapid delivery and scalability. However, because a micro-service system contains many services, the call relationships among them become extremely complex, and when the system fails it is difficult for operation and maintenance personnel to find and troubleshoot the problem quickly, accurately and comprehensively. Fault detection and root cause localization in a service system environment therefore require more intelligent algorithmic models.
At present, in service system scenarios with many services and large volumes of operation and maintenance data, such as private cloud monitoring, large-scale micro-service troubleshooting and intelligent operation and maintenance of cloud-native platforms, the many service nodes make the call relationships among them extremely complex. The prior art mostly searches for and troubleshoots faults through threshold detection, rule-based alarms and similar methods, so when a problem occurs, operation and maintenance personnel often find it difficult to locate the fault and troubleshoot it quickly, accurately and comprehensively.
In summary, how to automatically, rapidly, accurately and comprehensively detect and locate faults in a service system is a problem to be solved at present.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a method, an apparatus, a device and a medium for detecting system faults, which can automatically, rapidly, accurately and comprehensively detect and locate faults in a service system. The specific scheme is as follows:
In a first aspect, the present application discloses a system fault detection method, including:
acquiring current operation data of each service node in a service system to be detected; the current operating data comprises a plurality of operating state data;
standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively;
training a model to be trained constructed based on a logistic regression algorithm by using historical operating data with fault type labels to obtain a trained supervised learning model;
extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;
and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.
Optionally, the obtaining current operation data of each service node in the service system to be detected includes:
the method comprises the steps of obtaining system performance index data, micro-service call chain data and system log data of each service node in a service system to be detected so as to obtain current operation data of each service node.
Optionally, the obtaining system performance index data, micro service call chain data, and system log data of each service node in the service system to be detected to obtain current operation data of each service node includes:
determining a time length of a sliding window of the time series;
sampling system performance index data of each service node in the service system to be detected based on a first preset time interval within the time length of each sliding window to obtain multiple groups of system performance index data which are arranged according to a time sequence and correspond to the sliding windows;
sampling the micro-service call chain data of each service node in the service system to be detected based on a second preset time interval within the time length of each sliding window to obtain a plurality of groups of micro-service call chain data which are arranged according to a time sequence and correspond to the sliding windows;
and sampling system log data of each service node in the service system to be detected based on a third preset time interval within the time length of each sliding window to obtain multiple groups of system log data which are arranged according to a time sequence and correspond to the sliding windows.
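The windowed sampling described above can be sketched as follows. This is only a minimal illustration under stated assumptions: each data source is available as a time-ordered list of (timestamp, value) pairs, windows are taken as consecutive and non-overlapping for simplicity, and the name `sample_windows` and its parameters are hypothetical rather than taken from the application.

```python
from typing import Dict, List, Sequence, Set, Tuple


def sample_windows(
    series: Sequence[Tuple[float, float]],
    window_length: float,
    interval: float,
) -> List[List[float]]:
    """Group a time-ordered series into consecutive windows and keep
    one value per sampling interval inside each window."""
    if not series:
        return []
    start = series[0][0]
    windows: Dict[int, List[float]] = {}
    seen: Set[Tuple[int, int]] = set()
    for ts, value in series:
        offset = ts - start
        window_idx = int(offset // window_length)
        slot_idx = int((offset % window_length) // interval)
        if (window_idx, slot_idx) in seen:
            continue  # keep only the first sample of each interval slot
        seen.add((window_idx, slot_idx))
        windows.setdefault(window_idx, []).append(value)
    # One list of samples per window, in time order.
    return [windows[idx] for idx in sorted(windows)]
```

Under these assumptions the same function would be called once per data source, passing the first, second or third preset time interval as `interval`.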
Optionally, the normalizing the current operating data by using a preset data normalization method to obtain standard scores corresponding to the various operating state data respectively includes:
calculating a z-score corresponding to each group of the system performance index data and a z-score corresponding to first-order difference data between different system performance index data in each group of the system performance index data;
acquiring micro-service calling time in each group of micro-service calling chain data, and calculating a z-score corresponding to the micro-service calling time in each group of micro-service calling chain data and a z-score corresponding to first-order difference data between different micro-service calling times in each group of micro-service calling chain data;
and matching each group of the system log data by using a preset log template to obtain matching scores corresponding to different system log data in each group of the system log data, and calculating the z-scores of the matching scores corresponding to the different system log data in each group of the system log data and the z-score of first-order difference data between the different matching scores corresponding to the different system log data in each group of the system log data.
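As a rough illustration of the standardization above, the sketch below computes z-scores for one group of values and for its first-order differences. It assumes the population standard deviation is used and that a constant group maps to z-scores of zero; neither choice is stated in the application, and the function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Z-score of each value relative to the mean and standard deviation of its group."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in values]  # constant group: no deviation to report
    return [(v - mean) / std for v in values]


def first_order_difference(values: Sequence[float]) -> List[float]:
    """Differences between consecutive samples in a group."""
    return [b - a for a, b in zip(values, values[1:])]


# Example: one group of call durations and the z-scores of its differences.
durations = [10.0, 12.0, 11.0, 30.0]
duration_z = z_scores(durations)
difference_z = z_scores(first_order_difference(durations))
```

The same two functions would apply to performance index values, call durations and log template matching scores alike, since each is reduced to a group of numbers before standardization.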
Optionally, the process of normalizing any group of operation state data in the current operation data includes:
calculating the mean and the variance of the group of operation state data using an optimized mean calculation formula and an optimized variance calculation formula respectively, and calculating the z-score of the group of operation state data based on that mean and variance; wherein
the optimized mean calculation formula is as follows:
Figure BDA0003418385930000031
the optimized variance calculation formula is as follows:
Figure BDA0003418385930000032
wherein
Figure BDA0003418385930000033
n represents the number of data samples in the group of operation state data, x_i represents the i-th data sample in the group, mean represents the mean, and s^2 represents the variance.
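The formulas themselves appear only as figure images and are not reproduced in this text. For orientation, one common way to compute a mean and variance incrementally, without re-reading earlier samples, is Welford's online algorithm, sketched below. This is a substitute technique chosen for illustration and may differ from the optimized formulas of the application; the class name `RunningStats` and the use of the population variance (division by n) are assumptions.

```python
import math


class RunningStats:
    """Welford's online mean and variance (illustrative, not the patent's formulas)."""

    def __init__(self) -> None:
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; returns 0.0 before any sample arrives.
        return self._m2 / self.n if self.n > 0 else 0.0

    def z_score(self, x: float) -> float:
        std = math.sqrt(self.variance)
        return 0.0 if std == 0 else (x - self.mean) / std
```

Each new sample costs constant time and memory, which suits data that arrives as a stream.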
Optionally, before training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data with the fault type label, the method further includes:
acquiring historical normal operation data and historical fault operation data;
adding label information containing corresponding running time interval labels and fault-free type labels to the historical normal running data to obtain first historical running data serving as a negative sample;
adding label information including corresponding operation time interval labels and fault type labels to the historical fault operation data, and resampling the historical fault operation data with the label information added to obtain second historical operation data serving as a positive sample, so that the proportion between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset positive-negative sample proportion.
Optionally, the performing weighted calculation on the standard scores of the corresponding operating condition data by using the weighting coefficients of the linear parameters includes:
acquiring expert knowledge for optimizing the weight coefficient of the linear parameter through a preset expert knowledge acquisition interface;
correspondingly adjusting the weight coefficient of the linear parameter by using the expert knowledge to obtain the adjusted weight coefficient of the linear parameter;
and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by utilizing the adjusted weight coefficients of the linear parameters.
Optionally, the performing weighted calculation on the corresponding standard scores of the operating state data by using the weighting coefficients of the linear parameters, and performing fault location on the service system to be detected based on the weighted scores includes:
respectively carrying out weighted calculation on the standard score of the corresponding running state data in each service node by using the weight coefficient of the linear parameter so as to obtain the weighted score of each service node;
screening a preset number of service nodes with weighted scores larger than a preset threshold value from all the service nodes according to the sequence of the weighted scores from large to small, and determining a target service node with a fault based on the service nodes obtained after screening;
screening out the maximum weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the maximum weight coefficient as the root cause of the corresponding fault.
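The three localization steps above can be sketched as follows. The sketch assumes that standard scores are stored per node as a mapping from a metric name to a single z-score, and that weight coefficients are keyed by the same metric names; the function name `locate_faults` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def locate_faults(
    node_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    threshold: float,
    top_k: int,
) -> List[Tuple[str, float, str]]:
    """Return (node, weighted score, suspected root-cause metric) for up to top_k nodes."""
    ranked: List[Tuple[str, float]] = []
    for node, scores in node_scores.items():
        # Weighted sum of the node's standard scores.
        weighted = sum(weights.get(metric, 0.0) * z for metric, z in scores.items())
        if weighted > threshold:
            ranked.append((node, weighted))
    # Sort from the largest weighted score to the smallest and keep top_k nodes.
    ranked.sort(key=lambda item: item[1], reverse=True)
    results: List[Tuple[str, float, str]] = []
    for node, weighted in ranked[:top_k]:
        # Root cause: the metric of this node with the largest weight coefficient.
        root_cause = max(node_scores[node], key=lambda m: weights.get(m, 0.0))
        results.append((node, weighted, root_cause))
    return results
```

A node whose weighted score does not exceed the threshold is never reported, even when fewer than `top_k` nodes qualify.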
In a second aspect, the present application discloses a system fault detection apparatus, comprising:
the data acquisition module is used for acquiring the current operation data of each service node in the service system to be detected; the current operating data comprises a plurality of operating state data;
the standardization processing module is used for carrying out standardization processing on the current operation data by utilizing a preset data standardization method so as to obtain standard scores corresponding to various operation state data respectively;
the model training module is used for training a model to be trained, which is constructed based on a logistic regression algorithm, by using historical operating data with fault type labels so as to obtain a trained supervised learning model;
the weight coefficient extraction module is used for extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;
and the fault positioning module is used for respectively carrying out weighted calculation on the corresponding standard scores of the running state data by utilizing the weighting coefficients of the linear parameters and carrying out fault positioning on the service system to be detected based on the weighted scores.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the system fault detection method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the steps of the system fault detection method disclosed in the foregoing when executed by a processor.
In the present application, current operation data of each service node in a service system to be detected is first acquired, the current operation data comprising a plurality of kinds of operation state data. The current operation data is then standardized using a preset data standardization method to obtain standard scores corresponding to the various operation state data. Next, a model constructed on a logistic regression algorithm is trained with historical operation data carrying fault type labels to obtain a trained supervised learning model, and the weight coefficient corresponding to each linear parameter in the trained model is extracted, the different linear parameters corresponding to the different operation state data. Finally, the standard scores of the corresponding operation state data are weighted with the weight coefficients of the linear parameters, and faults in the service system to be detected are located based on the weighted scores. In other words, the standard scores of the operation state data of each service node are weighted with the coefficients obtained from the trained supervised learning model; the resulting weighted scores are sorted, those meeting a preset condition are selected, and the corresponding service nodes are identified, which locates the fault. The root cause of the fault is then determined from the weight coefficients of the component information in the identified service node, which improves the efficiency of fault localization.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a system fault detection method disclosed herein;
FIG. 2 is a flow chart of a particular system fault detection method disclosed herein;
FIG. 3 is a flow chart of a particular system fault detection method disclosed herein;
FIG. 4 is a flow chart of a specific fault detection and root cause determination disclosed herein;
FIG. 5 is a flow chart of a particular system fault detection method disclosed herein;
FIG. 6 is a schematic diagram of a system fault detection apparatus according to the present disclosure;
FIG. 7 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In service system scenarios with many services and large volumes of operation and maintenance data, such as private cloud monitoring, large-scale micro-service troubleshooting and intelligent operation and maintenance of cloud-native platforms, the many service nodes make the call relationships among them extremely complex. Most prior-art approaches search for and troubleshoot faults through threshold detection, rule-based alarms and similar methods, and operation and maintenance personnel often find it difficult to locate and troubleshoot problems quickly, accurately and comprehensively. The embodiments of the present application therefore disclose a system fault detection method, apparatus, device and medium that can automatically, quickly, accurately and comprehensively detect and locate faults in a service system.
Referring to fig. 1, an embodiment of the present application discloses a system fault detection method, including:
Step S11: acquiring current operation data of each service node in a service system to be detected; the current operation data includes a plurality of kinds of operation state data.
In this embodiment, current operation data of each service node in a service system to be detected needs to be acquired, where the service system to be detected may include, but is not limited to, any one of a private cloud monitoring system, a large-scale micro-service system, or a cloud native platform intelligent operation and maintenance system. In addition, the current operation data of each service node includes various operation state data, that is, various data capable of representing the operation state of the service system to be detected.
Step S12: standardizing the current operation data using a preset data standardization method to obtain standard scores corresponding to the various operation state data.
In this embodiment, the different operation state data in the acquired current operation data have different dimensions, units or orders of magnitude. To allow the different operation state data to be compared and weighted, the current operation data is standardized using a preset data standardization method and converted into dimensionless pure numerical values, that is, the standard scores corresponding to the respective operation state data.
Step S13: training a model constructed on a logistic regression algorithm with historical operation data carrying fault type labels to obtain a trained supervised learning model.
In this embodiment, the model to be trained is trained with previously prepared historical operation data carrying fault type labels to obtain a trained supervised learning model. The model to be trained is constructed on a logistic regression algorithm, that is, supervised learning is performed with a logistic regression classifier. It should be noted that the purpose of building the supervised learning model with logistic regression classification is not to use the classifier itself for fault detection and root cause localization on the system to be detected, but to use the logistic regression computation to fit the weights of the corresponding linear parameters in the model from a limited amount of historical operation data carrying fault type labels.
Step S14: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.
In this embodiment, as described in step S13, the purpose of using the logistic regression classification algorithm to establish the supervised learning model is to adjust the weights of the corresponding linear parameters in the model, so after the model training is completed, the weight coefficient corresponding to each linear parameter is extracted from the supervised learning model, and it should be noted that different linear parameters correspond to different operating state data.
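To make the role of the logistic regression model concrete, the sketch below fits a small model by batch gradient descent and returns its linear coefficients, which are the quantities that steps S13 and S14 train and extract. The optimizer, learning rate, epoch count and function names are assumptions made for illustration; the application does not specify how the model is fitted.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    learning_rate: float = 0.1,
    epochs: int = 500,
) -> List[float]:
    """Fit by batch gradient descent; returns [bias, w_1, ..., w_d]."""
    dim = len(features[0])
    n = len(features)
    params = [0.0] * (dim + 1)
    for _ in range(epochs):
        grads = [0.0] * (dim + 1)
        for x, y in zip(features, labels):
            z = params[0] + sum(w * xi for w, xi in zip(params[1:], x))
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            grads[0] += err
            for j, xi in enumerate(x):
                grads[j + 1] += err * xi
        params = [p - learning_rate * g / n for p, g in zip(params, grads)]
    return params


def extract_weights(params: Sequence[float], names: Sequence[str]) -> dict:
    """Map each operation state feature name to its linear weight (bias dropped)."""
    return dict(zip(names, params[1:]))
```

In this sketch only the coefficients are kept after training; the fitted classifier is not used to predict faults directly, which matches the purpose stated for step S13.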
Step S15: weighting the standard scores of the corresponding operation state data with the weight coefficients of the linear parameters, and locating faults in the service system to be detected based on the weighted scores.
In this embodiment, weighting the standard scores of the corresponding operation state data with the weight coefficients of the linear parameters may further include: acquiring, through a preset expert knowledge acquisition interface, expert knowledge for optimizing the weight coefficients of the linear parameters; adjusting the weight coefficients accordingly using the expert knowledge to obtain adjusted weight coefficients; and weighting the standard scores of the corresponding operation state data with the adjusted weight coefficients. That is, after the weight coefficients are extracted from the supervised learning model, the corresponding expert knowledge can be obtained through the preset expert knowledge acquisition interface and used to adjust the weight coefficients of the different linear parameters. The expert knowledge may be optimization data, acquired through the interface, that is extracted from a weight coefficient optimization model built on a historical expert knowledge base, which shortens the weight coefficient optimization process and reduces the human effort required; alternatively, it may be a manual weight coefficient optimization instruction acquired through the same interface. Optimizing the weight coefficients with expert knowledge improves the accuracy and comprehensiveness of the subsequent fault localization of the service system to be detected.
The purpose of adjusting the weight coefficients with expert knowledge is to improve the model's sensitivity to system faults it has not yet encountered. For example, the network performance index "system.tcp.syn_recv" is often related to network faults; if no fault type label in the historical operation data indicates such a fault, the weight coefficient of the linear parameter corresponding to that index needs to be adjusted appropriately.
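One simple way to picture this adjustment is shown below. The sketch assumes that expert knowledge arrives as a minimum weight per index, so that an index the training data never associated with a fault still contributes to the weighted score. This representation, the function name `apply_expert_adjustments` and the numeric values are hypothetical; the application does not define the format of the expert knowledge.

```python
from typing import Dict


def apply_expert_adjustments(
    learned_weights: Dict[str, float],
    expert_minimums: Dict[str, float],
) -> Dict[str, float]:
    """Return a copy of the learned weights in which each index named by the
    expert is raised to at least the expert-provided minimum."""
    adjusted = dict(learned_weights)  # leave the learned weights untouched
    for index_name, minimum in expert_minimums.items():
        adjusted[index_name] = max(adjusted.get(index_name, 0.0), minimum)
    return adjusted


# Illustrative values only: the learned weight for the SYN_RECV index is zero
# because no labeled fault in the history involved it.
learned = {"system.cpu.user": 1.2, "system.tcp.syn_recv": 0.0}
adjusted = apply_expert_adjustments(learned, {"system.tcp.syn_recv": 0.8})
```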
In summary, current operation data of each service node in a service system to be detected is first acquired, the current operation data comprising a plurality of kinds of operation state data. The current operation data is then standardized using a preset data standardization method to obtain standard scores corresponding to the various operation state data. Next, a model constructed on a logistic regression algorithm is trained with historical operation data carrying fault type labels to obtain a trained supervised learning model, and the weight coefficient corresponding to each linear parameter in the trained model is extracted, the different linear parameters corresponding to the different operation state data. Finally, the standard scores of the corresponding operation state data are weighted with the weight coefficients of the linear parameters, and faults in the service system to be detected are located based on the weighted scores. In other words, the standard scores of the operation state data of each service node are weighted with the coefficients obtained from the trained supervised learning model; the resulting weighted scores are sorted, those meeting a preset condition are selected, and the corresponding service nodes are identified, which locates the fault. The root cause of the fault is then determined from the weight coefficients of the component information in the identified service node, which improves the efficiency of fault localization.
Referring to fig. 2, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:
Step S21: acquiring system performance index data, micro-service call chain data and system log data of each service node in the service system to be detected, so as to obtain the current operation data of each service node.
In this embodiment, the current operation data acquired for each service node in the service system to be detected mainly comprises three kinds of operation state data: system performance index data, micro-service call chain data and system log data. The index types in the system performance index data may include, but are not limited to, any one or more of CPU (Central Processing Unit), memory, disk, database, JVM (Java Virtual Machine), network, I/O (Input/Output) and HA (dual-machine cluster). In the micro-service call chain data, each group of call chain data comprises a link number (TraceId), the unit number of each call (SpanId), the called service name (ServiceName), the physical unit where the call takes place (CmdbId) and the call duration (Duration). The system log data likewise comprises multiple groups of log data for the corresponding types in the system performance indexes.
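The call chain fields listed above could be represented as in the sketch below, which also groups call durations by service so that they can later be standardized per service. The class and function names, the field types and the absence of a duration unit are assumptions; only the five field meanings come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CallChainSpan:
    trace_id: str      # TraceId: link number of the whole call chain
    span_id: str       # SpanId: unit number of this individual call
    service_name: str  # ServiceName: name of the called service
    cmdb_id: str       # CmdbId: physical unit where the call takes place
    duration: float    # Duration: call duration (unit not stated in the text)


def durations_by_service(spans: Iterable[CallChainSpan]) -> Dict[str, List[float]]:
    """Collect call durations per called service, preserving input order."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for span in spans:
        grouped[span.service_name].append(span.duration)
    return dict(grouped)
```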
Step S22: standardizing the current operation data using a preset data standardization method to obtain standard scores corresponding to the various operation state data.
Step S23: acquiring historical normal operation data and historical fault operation data.
In this embodiment, the historical normal operation data and historical fault operation data already present in the system are acquired. Note that the vast majority (for example, more than 99%) of the operation data in the system is normal, and only a very small share (less than 1%) is associated with faults.
Step S24: adding label information containing the corresponding operation time interval label and a fault-free type label to the historical normal operation data to obtain first historical operation data serving as negative samples.
In this embodiment, the acquired historical normal operation data is used as negative samples to obtain the first historical operation data. Label information containing the corresponding operation time interval label and a fault-free type label is added to it; the fault-free type label indicates that the operation data does not cause a system fault.
Step S25: adding label information containing the corresponding operation time interval label and a fault type label to the historical fault operation data, and resampling the labeled historical fault operation data to obtain second historical operation data serving as positive samples, so that the ratio between the number of samples of the second historical operation data and the number of samples of the first historical operation data reaches a preset positive-to-negative sample ratio.
In this embodiment, similarly to step S24, the historical fault operation data acquired in step S23 is used as positive samples to obtain the second historical operation data, and label information containing the corresponding operation time interval label and a fault type label is added to it; the fault type label indicates that the operation data may cause a system fault. Because most operation data in the system is normal and only a very small share is associated with faults, the positive and negative samples obtained from the historical fault operation data and the historical normal operation data are highly imbalanced. To address this imbalance, the labeled historical fault operation data is resampled to raise the share of positive samples in the total, so that the ratio between the number of positive samples (the second historical operation data) and the number of negative samples (the first historical operation data) reaches the preset positive-to-negative sample ratio. In this embodiment, the ratio of positive to negative samples is 1:10.
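The resampling step can be illustrated with random oversampling, sketched below. The application does not say which resampling technique is used, so drawing existing fault samples again with replacement is an assumption, as are the function name `oversample_positives` and its parameters. The default of ten negatives per positive mirrors the 1:10 ratio of this embodiment.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def oversample_positives(
    positives: Sequence[T],
    negative_count: int,
    negatives_per_positive: int = 10,
    seed: int = 0,
) -> List[T]:
    """Duplicate randomly chosen positive samples until there is at least
    one positive for every `negatives_per_positive` negatives."""
    # Integer ceiling of negative_count / negatives_per_positive.
    target = -(-negative_count // negatives_per_positive)
    if not positives or len(positives) >= target:
        return list(positives)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return list(positives) + extra
```

For example, with two labeled fault samples and 100 normal samples, the function returns ten positive samples, eight of which are repeats of the original two.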
Step S26: and training a model to be trained constructed based on a logistic regression algorithm by using the first historical operating data and the second history to obtain a trained supervised learning model.
In this embodiment, a model to be trained, which is constructed based on a logistic regression algorithm, is trained by using first historical operating data and second historical operating data that conform to a preset positive-negative sample ratio, so as to obtain a trained supervised learning model.
Step S27: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.
Step S28: and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.
For more specific processing procedures of the steps S22, S27, and S28, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
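Steps S26 and S27 can be illustrated with a small, self-contained sketch. It assumes plain batch gradient descent and three hypothetical feature names (`cpu_z`, `memory_z`, `call_time_z`); the embodiment does not state how the logistic regression model is fitted, so this is one possible reading rather than the disclosed implementation.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    n = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical standardized features; label 1 marks a fault sample.
FEATURES = ["cpu_z", "memory_z", "call_time_z"]
X = [[0.1, 0.2, 0.0], [0.3, -0.1, 0.2], [-0.2, 0.1, 0.1], [0.0, -0.3, -0.1],
     [0.2, 3.1, 0.1], [-0.1, 2.8, 0.3], [0.1, 3.5, -0.2], [0.3, 2.9, 0.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(X, y)
weights = dict(zip(FEATURES, w))  # one weight coefficient per linear parameter
print(weights)
```

In this toy data only the memory feature separates fault samples from normal ones, so its coefficient ends up the largest.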
Therefore, in the embodiment of the application, system performance index data, micro-service call chain data and system log data of each service node in the service system to be detected are obtained, and the corresponding standard scores are calculated by using a preset data standardization method. Then, the acquired historical normal operation data and historical fault operation data are labeled with corresponding label information, and the labeled data are used as the source of negative and positive sample data, respectively. Where the numbers of positive and negative samples are unbalanced, the labeled historical fault operation data is resampled to increase the proportion of positive samples in the total number of samples, so that the ratio between the number of positive samples corresponding to the second historical operation data and the number of negative samples corresponding to the first historical operation data reaches the preset positive-negative sample ratio. Finally, the model to be trained based on logistic regression is trained by using the first historical operating data and the second historical operating data to obtain a supervised learning model, the corresponding weight coefficients are extracted from the supervised learning model, and weighted calculation is performed on the standard scores of the operating state data so as to perform fault positioning on the service system to be detected. In this way, a supervised learning model is established from a small amount of historical fault operation data and historical normal operation data, and the model can then be used to detect the various operation state data in the current operation data in a streaming calculation mode.
Referring to fig. 3 and 4, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:
Step S31: determining the time length of the sliding windows of a time sequence, and sampling system performance index data, micro-service call chain data and system log data of each service node in a service system to be detected based on a first preset time interval, a second preset time interval and a third preset time interval respectively within the time length of each sliding window, so as to obtain multiple groups of system performance index data, multiple groups of micro-service call chain data and multiple groups of system log data which are arranged according to a time sequence and correspond to multiple sliding windows.
In this embodiment, first, the length of the sliding window of the time sequence needs to be determined, and in this embodiment, the length of the sliding window is determined to be 30 minutes, and then, the current operation data needs to be sampled based on different preset time intervals within 30 minutes. In this embodiment, three different operation state data, that is, system performance index data, micro service call chain data, and system log data, in the current operation data of each service node in the service system to be detected are sampled based on a first preset time interval, a second preset time interval, and a third preset time interval, respectively, and multiple sets of system performance index data, multiple sets of micro service call chain data, and multiple sets of system log data, which are arranged according to a time sequence and correspond to multiple sliding windows, are obtained after sampling. It should be noted that the first preset time interval, the second preset time interval and the third preset time interval may be equal to or different from each other. For example, for the system performance index data, the first preset time interval may be set to 1 minute, and then 30 sampling points may be obtained after sampling in a sliding window with a sliding window length of 30 minutes, and data values corresponding to the 30 sampling points are arranged according to a time sequence to obtain a set of system performance index data. The same is true for micro-service call chain data and system log data, the number of sampling points is determined by preset time intervals, and when the preset time intervals are different, the number of sampling points obtained after sampling is different; when the preset time intervals are the same, the number of the sampling points is also the same.
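The relationship between the sliding-window length, the preset time interval and the number of sampling points can be sketched as follows. The function `sample_window`, the reader `fake_reader` and the 2-minute and 5-minute intervals are hypothetical; only the 30-minute window and the 1-minute interval come from the embodiment.

```python
def sample_window(read_value, window_start, interval_seconds, window_seconds=30 * 60):
    """Sample read_value(t) at a fixed interval inside one sliding window.

    Returns a time-ordered list of (timestamp, value) pairs.
    """
    timestamps = range(window_start, window_start + window_seconds, interval_seconds)
    return [(t, read_value(t)) for t in timestamps]


def fake_reader(t):
    # Stand-in for reading a metric, a call-chain record or a log score at time t.
    return float(t % 7)


metrics = sample_window(fake_reader, 0, 60)      # first preset interval: 1 minute
call_chain = sample_window(fake_reader, 0, 120)  # second preset interval (assumed 2 minutes)
logs = sample_window(fake_reader, 0, 300)        # third preset interval (assumed 5 minutes)
print(len(metrics), len(call_chain), len(logs))  # 30 15 6
```

Different intervals give different numbers of sampling points for the same window, as the embodiment notes.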
Step S32: and calculating a z-score corresponding to each group of the system performance index data and a z-score corresponding to first-order difference data among different system performance index data in each group of the system performance index data.
In this embodiment, after the plurality of sets of system performance index data are obtained, each set of system performance index data needs to be classified according to the data index type and standardized. For example, when the data index type is CPU, only the acquired CPU-related index data needs to be calculated. In this embodiment, the z-score normalization method is mainly adopted to standardize the data, that is, a z-score corresponding to each set of system performance index data is calculated; in addition, a z-score corresponding to the first-order difference data between different system performance index data within each set needs to be calculated. It should be noted that, before the system performance index data is standardized, data cleaning also needs to be performed on the acquired system performance index data. The purpose of data cleaning is to remove duplicate and redundant data, fill in missing data, and correct or delete erroneous data, so as to improve data quality and reduce the error rate in subsequent use of the data.
The z-score is calculated as follows:

$$\mathrm{Metric}_i = \frac{\mathrm{value}_i - \mathrm{mean}}{\mathrm{std}}$$

where $\mathrm{Metric}_i$ is the z-score of the corresponding data index type for a set of operating state data; $\mathrm{value}_i$ is the data value of each sampling point in a set of operating state data; mean is the mean of a set of operating state data within a sliding window; and std is the standard deviation of a set of operating state data within a sliding window.
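A minimal sketch of the z-score and first-order-difference calculation is given below. It assumes the population standard deviation (the embodiment does not say whether the population or sample form is used), and the CPU readings are invented for illustration.

```python
import statistics


def z_scores(values):
    """z-score of every sampling point in one group of operating state data."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return [0.0] * len(values)   # constant series: no deviation to score
    return [(v - mean) / std for v in values]


def first_order_difference(values):
    """Difference between each sampling point and the previous one."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical CPU utilisation readings within one sliding window.
cpu = [41.0, 40.0, 42.0, 41.0, 95.0]
cpu_z = z_scores(cpu)
diff_z = z_scores(first_order_difference(cpu))
print(cpu_z)
print(diff_z)
```

The sudden jump in the last reading produces the largest z-score both in the raw series and in the first-order difference series.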
The conventional formulas for calculating the mean and the variance are:
Figure BDA0003418385930000122
Figure BDA0003418385930000123
In streaming computation, the data volume is huge, and calculating the mean and the variance with the conventional method performs poorly. The invention therefore provides a mean and variance calculation method with O(1) time complexity, which can effectively improve the performance of the model. The method mainly optimizes the above formulas, and the specific optimization is as follows:
Expanding the conventional variance formula gives:
Figure BDA0003418385930000131
Then let
Figure BDA0003418385930000132
Then it is possible to obtain:
Optimized mean calculation formula:
Figure BDA0003418385930000133
Optimized variance calculation formula:
Figure BDA0003418385930000134
That is, when any one set of operation state data in the current operation data is standardized, the following processing manner may be adopted in the embodiment of the present application: the mean and the variance corresponding to the set of operation state data are calculated by using the optimized mean calculation formula and the optimized variance calculation formula, respectively, and the z-score corresponding to the set of operation state data is calculated based on that mean and variance. Here, n represents the data sample size corresponding to the set of operating state data, x_i represents the i-th data sample in the set of operating state data, mean represents the mean, and s^2 represents the variance. It can be understood that, when calculating the z-score corresponding to the set of operating state data and the z-score corresponding to the first-order difference data of the set of operating state data, the optimized mean calculation formula and the optimized variance calculation formula may be used to calculate the respective mean and variance, and the respective z-score is then solved.
Thus, it is only necessary to design a queue that maintains the x_i within the interval and is used to quickly update the values of S0 and S1; the standard score corresponding to each current operating state data can then be calculated with O(1) time complexity. That is, no matter how large the data scale is, the mean and the variance of a set of operating state data within a sliding window can be obtained after a single calculation. This greatly increases the speed of calculating the mean and the variance within the sliding window, and further improves the speed of calculating the standard score and the efficiency of detecting system faults.
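The queue-based update can be sketched as follows. The original formula images are not reproduced in this text, so the sketch assumes that S0 denotes the running sum of x_i and S1 the running sum of x_i squared, and that the population variance s^2 = S1/n - (S0/n)^2 is intended; the class name `RollingStats` is hypothetical.

```python
import math
from collections import deque


class RollingStats:
    """Mean and variance over a sliding window, updated in O(1) per sample."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.queue = deque()  # holds the x_i currently inside the window
        self.s0 = 0.0         # assumed meaning of S0: sum of x_i
        self.s1 = 0.0         # assumed meaning of S1: sum of x_i ** 2

    def push(self, x):
        self.queue.append(x)
        self.s0 += x
        self.s1 += x * x
        if len(self.queue) > self.window_size:
            old = self.queue.popleft()  # evict the oldest sample
            self.s0 -= old
            self.s1 -= old * old

    def mean(self):
        return self.s0 / len(self.queue)

    def variance(self):
        n = len(self.queue)
        m = self.s0 / n
        # Clamp at zero to absorb small negative values from rounding.
        return max(self.s1 / n - m * m, 0.0)

    def z_score(self, x):
        std = math.sqrt(self.variance())
        return 0.0 if std == 0 else (x - self.mean()) / std


stats = RollingStats(window_size=3)
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.push(value)
print(stats.mean(), stats.variance())  # window now holds 2.0, 3.0, 4.0
```

Each new sample costs one addition and at most one eviction, regardless of window size. The sum-of-squares form can lose precision when values are large relative to their spread, which is why the variance is clamped at zero here.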
Step S33: acquiring micro-service calling time in each group of micro-service calling chain data, and calculating z-scores corresponding to the micro-service calling time in each group of micro-service calling chain data and z-scores corresponding to first-order difference data between different micro-service calling times in each group of micro-service calling chain data.
In this embodiment, after the multiple sets of micro service invocation chain data are obtained, graph decomposition needs to be performed on each set of micro service invocation chain data, and response time of service invocation needs to be calculated. And calculating a z-score corresponding to the total micro-service calling time in each group of micro-service calling chain data and a z-score corresponding to first-order difference data between different micro-service calling times in each group of micro-service calling chain data. The specific calculation method of the z-score is shown in step S32.
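One way to picture the graph decomposition is to split each call chain into caller-to-callee edges and record the response time of each call. The span format below (`span_id`, `parent_id`, `service`, `start_ms`, `end_ms`) is a hypothetical tracing record, not a format defined in the embodiment.

```python
def call_times(spans):
    """Decompose a call chain into (caller, callee) edges with response times in ms."""
    by_id = {s["span_id"]: s for s in spans}
    edges = {}
    for s in spans:
        parent = by_id.get(s["parent_id"])
        caller = parent["service"] if parent else "client"  # root span has no parent
        duration = s["end_ms"] - s["start_ms"]
        edges.setdefault((caller, s["service"]), []).append(duration)
    return edges


# Hypothetical trace: client -> gateway -> orders -> db
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "start_ms": 0, "end_ms": 120},
    {"span_id": "b", "parent_id": "a", "service": "orders", "start_ms": 10, "end_ms": 90},
    {"span_id": "c", "parent_id": "b", "service": "db", "start_ms": 20, "end_ms": 70},
]

edges = call_times(spans)
print(edges)
```

The per-edge response times collected over a sliding window form the series to which the z-score calculation of step S32 is applied.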
Step S34: and matching each group of the system log data by using a preset log template to obtain matching scores corresponding to different system log data in each group of the system log data, and calculating the z-scores of the matching scores corresponding to the different system log data in each group of the system log data and the z-score of first-order difference data between the different matching scores corresponding to the different system log data in each group of the system log data.
In this embodiment, after the multiple sets of system log data are obtained, matching and detection need to be performed on each set of system log data by using a corresponding log template, so as to determine the type of the system log data (that is, the system log data belong to a CPU, a memory, or a magnetic disk, etc.), obtain matching scores corresponding to different types of system log data in each set of system log data, and then calculate z-scores of matching scores corresponding to different types of system log data in each set of system log data and z-scores of first-order difference data between different matching scores corresponding to each set of system log data. The specific calculation method of the z-score is shown in step S32.
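The template matching step might look like the sketch below. The regular-expression templates and the definition of the matching score as the fraction of log lines in a group that match a type's templates are both assumptions; the embodiment does not define either.

```python
import re

# Hypothetical log templates, grouped by the type of system log data they indicate.
LOG_TEMPLATES = {
    "cpu": [re.compile(r"soft lockup"), re.compile(r"cpu clock throttled")],
    "memory": [re.compile(r"Out of memory"), re.compile(r"oom-kill")],
    "disk": [re.compile(r"I/O error"), re.compile(r"No space left on device")],
}


def match_scores(lines):
    """Matching score per log type: share of lines matching any template of that type."""
    total = len(lines)
    scores = {}
    for kind, patterns in LOG_TEMPLATES.items():
        hits = sum(1 for line in lines if any(p.search(line) for p in patterns))
        scores[kind] = hits / total if total else 0.0
    return scores


group = [
    "kernel: Out of memory: Killed process 123",
    "kernel: blk_update_request: I/O error, dev sda",
    "systemd: service started",
    "kernel: oom-kill: constraint=CONSTRAINT_NONE",
]
scores = match_scores(group)
print(scores)
```

Computing such scores for each group of log data yields a per-type series whose z-scores and first-order difference z-scores can then be calculated as in step S32.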
Step S35: and training the model to be trained constructed based on the logistic regression algorithm by using the historical operating data carrying the fault type label to obtain the trained supervised learning model.
Step S36: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.
Step S37: and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.
For more specific processing procedures of the steps S35, S36, and S37, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
In the embodiment of the application, the length of the sliding window of the time sequence is determined, and the system performance index data, the micro-service call chain data and the system log data of each service node in the service system to be detected are sampled based on the preset time interval within the time length of the sliding window to obtain multiple groups of system performance index data, multiple groups of micro-service call chain data and multiple groups of system log data; and then calculating each group of system performance index data, micro-service calling time in each group of micro-service calling chain data, matching scores corresponding to different system log data in each group of system log data and z scores of corresponding first-order difference data thereof, so as to perform weighting calculation on the obtained z scores by using weighting coefficients corresponding to different linear parameters extracted from a supervised learning model in the following process, and performing fault location on the service system to be detected based on the weighting scores.
Referring to fig. 5, the embodiment of the present application discloses a specific system fault detection method, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical solution. The method specifically comprises the following steps:
Step S41: acquiring current operation data of each service node in a service system to be detected; the current operating data includes a plurality of operating state data.
Step S42: and carrying out standardization processing on the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively.
Step S43: and training the model to be trained constructed based on the logistic regression algorithm by using the historical operating data carrying the fault type label to obtain the trained supervised learning model.
Step S44: extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters correspond to the different operating state data, respectively.
Step S45: and respectively carrying out weighted calculation on the standard score of the corresponding running state data in each service node by using the weight coefficient of the linear parameter so as to obtain the weighted score of each service node.
In this embodiment, after the weight coefficient corresponding to each linear parameter is obtained, the weight coefficient is used to perform weighted calculation on the standard score of the corresponding operation state data in each service node in the system to be detected. For example, consider the three different kinds of operation state data in one service node: if the system performance index data includes four index types, namely CPU, memory, disk, and database, then each of the CPU, memory, disk, and database has its own standard score, and the call time of the micro-service call chain data and the matching score of the system log data also have their own standard scores. The weight coefficient of each linear parameter extracted from the model is then used to perform weighted calculation on the corresponding standard score, so as to obtain the weighted score of each service node.
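The weighted calculation for one service node can be sketched as a weighted sum of its standard scores. The weight values, the parameter names and the choice of a plain sum (with no further normalisation) are illustrative assumptions.

```python
def weighted_score(standard_scores, weights):
    """Weighted sum of a node's standard scores, one weight per linear parameter."""
    return sum(weights[name] * z for name, z in standard_scores.items())


# Hypothetical weight coefficients extracted from the trained model.
weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2,
           "call_time": 0.5, "log_match": 0.3}

# Hypothetical standard scores of one service node.
node_scores = {"cpu": 1.0, "memory": 2.0, "disk": 0.0, "database": 0.0,
               "call_time": 1.0, "log_match": 0.0}

score = weighted_score(node_scores, weights)
print(score)
```

Here the node's score is 0.4 * 1.0 + 0.8 * 2.0 + 0.5 * 1.0, that is, about 2.5.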
Step S46: and screening a preset number of service nodes with weighted scores larger than a preset threshold value from all the service nodes according to the sequence of the weighted scores from large to small so as to determine the target service node with the fault based on the service nodes obtained after screening.
In this embodiment, the weighted score of each service node can be obtained through step S45, and then a preset number of service nodes with weighted scores greater than a preset threshold are screened from all service nodes according to the sequence of weighted scores from large to small, in this embodiment, the preset number is set to 3, and the preset threshold is set to 0.9, that is, after all weighted scores are sorted according to the sequence of large to small, the first 3 service nodes with scores greater than 0.9 are screened, and a target service node with a fault is determined based on the 3 service nodes.
In the process of determining the target service node, the determination may be performed in two ways, and in a specific embodiment, all of the 3 service nodes may be used as the target service node that has a failure, that is, all of the 3 service nodes have a failure; in another specific embodiment, one or two service nodes can be screened out again from the 3 service nodes in a manual participation mode according to a certain rule to serve as a target service node with a fault.
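The screening in step S46 can be sketched as a sort followed by a threshold filter and a cut-off. The threshold 0.9 and the preset number 3 come from the embodiment; the node names and scores are invented.

```python
def screen_nodes(node_scores, threshold=0.9, top_k=3):
    """Return up to top_k node names with weighted score above threshold, largest first."""
    ranked = sorted(node_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold][:top_k]


# Hypothetical weighted scores per service node.
node_scores = {"gateway": 0.95, "orders": 1.4, "billing": 0.3,
               "auth": 0.91, "search": 2.1}

candidates = screen_nodes(node_scores)
print(candidates)  # ['search', 'orders', 'gateway']
```

Four nodes exceed the threshold here, but only the three with the highest scores are kept as candidate target service nodes.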
Step S47: screening out the maximum weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the maximum weight coefficient as the corresponding fault root.
In this embodiment, after the target service node with the fault is determined, the maximum weight coefficient is screened from the weight coefficients of all the linear parameters corresponding to the target service node, for example, if the weight coefficient of the CPU in one target service node is 0.4, the weight coefficient of the memory is 0.8, the weight coefficient of the disk is 0.8, the weight coefficient of the database is 0.2, and the weight coefficient of the network is 0.5, the maximum weight coefficient in the target service node is 0.8, and then the parameter type of the linear parameter corresponding to the maximum weight coefficient 0.8 is determined, that is, the memory and the disk are determined as the root cause of the fault.
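The example above, including the tie between memory and disk, can be reproduced with a short sketch. The function name `root_causes` is hypothetical.

```python
def root_causes(weights):
    """Return every parameter type whose weight coefficient equals the maximum."""
    top = max(weights.values())
    return [name for name, w in weights.items() if w == top]


# Weight coefficients of one target service node, taken from the example in the text.
node_weights = {"cpu": 0.4, "memory": 0.8, "disk": 0.8, "database": 0.2, "network": 0.5}

causes = root_causes(node_weights)
print(causes)  # ['memory', 'disk']
```

Returning a list rather than a single name keeps both parameter types when several share the maximum weight coefficient.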
For more specific processing procedures of the steps S41, S42, S43, and S44, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
As can be seen, in the embodiment of the present application, the weight coefficient of each linear parameter is used to perform weighted calculation on the standard score of the operating state data in each service node, so as to obtain the weighted score of each service node; a preset number of service nodes with weighted scores larger than a preset threshold are screened out from all the service nodes in descending order of weighted score, so as to determine the target service node with a fault; and finally, the parameter type of the linear parameter corresponding to the maximum weight coefficient in the target service node is determined as the corresponding fault root cause. Therefore, by calculating the weighted score of each service node and sorting the weighted scores to determine the target service node with a fault, the embodiment of the application can detect faults in each service node of the system to be detected and can determine the root cause of the fault in the target service node.
Referring to fig. 6, an embodiment of the present application further discloses a system fault detection apparatus, including:
the data acquisition module 11 is configured to acquire current operating data of each service node in the service system to be detected; the current operating data comprises a plurality of operating state data;
the standardization processing module 12 is configured to standardize the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data;
the model training module 13 is used for training a model to be trained, which is constructed based on a logistic regression algorithm, by using historical operating data with a fault type label to obtain a trained supervised learning model;
a weight coefficient extraction module 14, configured to extract a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;
and the fault positioning module 15 is configured to perform weighted calculation on the corresponding standard scores of the operating state data by using the weighting coefficients of the linear parameters, and perform fault positioning on the service system to be detected based on the weighted scores.
Therefore, the method includes the steps that current operation data of each service node in a service system to be detected are obtained; the current operating data comprises a plurality of operating state data; then, standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively; then training a model to be trained constructed based on a logistic regression algorithm by using historical operating data with fault type labels to obtain a trained supervised learning model; extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data; and finally, respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weight coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores. Therefore, in the application, the standard scores corresponding to the operation state data of each service node are weighted and calculated based on the weight coefficient corresponding to each linear parameter obtained from the trained supervised learning model to obtain the weighted calculation score, the weighted scores meeting the preset condition are screened out by sequencing the weighted scores corresponding to each group obtained by weighted calculation, and the service node corresponding to the weighted score is correspondingly determined, so that the fault location in the system fault is realized, the root cause of the system fault is further determined according to the determined weight coefficient of the component information in the service node, and the fault location efficiency is improved.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the system fault detection method executed by the computer device disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the computer device 20; the communication interface 24 can create a data transmission channel between the computer device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing calculation operations related to machine learning.
In addition, the storage 22 is used as a carrier for storing resources, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., the resources stored thereon include an operating system 221, a computer program 222, data 223, etc., and the storage may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the computer device 20, so as to realize the operation and processing of the mass data 223 in the memory 22 by the processor 21, which may be Windows, Unix, Linux, or the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the system fault detection method disclosed in any of the foregoing embodiments and executed by the computer device 20. The data 223 may include data received by the computer device and transmitted from an external device, data collected by the input/output interface 25, and the like.
Further, an embodiment of the present application further discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the method steps executed in the system fault detection process disclosed in any of the foregoing embodiments are implemented.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The system fault detection method, apparatus, device and storage medium provided by the present invention are described in detail above, and a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (11)

1. A method for system fault detection, comprising:
acquiring current operation data of each service node in a service system to be detected; the current operating data comprises a plurality of operating state data;
standardizing the current operation data by using a preset data standardization method to obtain standard scores corresponding to various operation state data respectively;
training a model to be trained constructed based on a logistic regression algorithm by using historical operating data with fault type labels to obtain a trained supervised learning model;
extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;
and respectively carrying out weighted calculation on the corresponding standard scores of the running state data by using the weighting coefficients of the linear parameters, and carrying out fault positioning on the service system to be detected based on the weighted scores.
2. The method according to claim 1, wherein the obtaining current operation data of each service node in the service system to be detected comprises:
acquiring system performance index data, micro-service call chain data and system log data of each service node in the service system to be detected, so as to obtain current operation data of each service node.
3. The method according to claim 2, wherein the obtaining system performance index data, micro service call chain data, and system log data of each service node in the service system to be detected to obtain current operating data of each service node comprises:
determining a time length of a sliding window of the time series;
sampling system performance index data of each service node in the service system to be detected based on a first preset time interval within the time length of each sliding window to obtain multiple groups of system performance index data which are arranged according to a time sequence and correspond to the sliding windows;
sampling the micro-service call chain data of each service node in the service system to be detected based on a second preset time interval within the time length of each sliding window to obtain a plurality of groups of micro-service call chain data which are arranged according to a time sequence and correspond to the sliding windows;
and sampling system log data of each service node in the service system to be detected based on a third preset time interval within the time length of each sliding window to obtain multiple groups of system log data which are arranged according to a time sequence and correspond to the sliding windows.
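The windowed sampling of claim 3 might look like the sketch below. It assumes consecutive, non-overlapping windows and timestamped readings given as (time, value) pairs; the claim itself does not say whether windows overlap, and the name `sample_windows` is hypothetical. The same function would be called once per data type with the first, second or third preset interval.

```python
def sample_windows(readings, window_len, interval):
    """Group time-ordered (timestamp, value) readings into windows.

    Within each window a reading is kept only if at least `interval`
    time units have passed since the last kept reading of that window.
    """
    if not readings:
        return []
    start = readings[0][0]
    groups = {}      # window index -> kept readings, in time order
    last_kept = {}   # window index -> timestamp of last kept reading
    for t, value in readings:
        k = int((t - start) // window_len)
        if k not in last_kept or t - last_kept[k] >= interval:
            groups.setdefault(k, []).append((t, value))
            last_kept[k] = t
    return [groups[k] for k in sorted(groups)]

# Hypothetical CPU readings, one per second for 12 seconds.
cpu_readings = [(t, t * 10) for t in range(12)]
cpu_groups = sample_windows(cpu_readings, window_len=6, interval=2)
```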
4. The method for detecting system faults according to claim 3, wherein the normalizing the current operation data by using a preset data normalizing method to obtain standard scores corresponding to various operation state data respectively comprises:
calculating a z-score corresponding to each group of the system performance index data and a z-score corresponding to first-order difference data between different system performance index data in each group of the system performance index data;
acquiring the micro-service call time in each group of the micro-service call chain data, and calculating a z-score corresponding to the micro-service call time in each group of the micro-service call chain data and a z-score corresponding to first-order difference data between different micro-service call times in each group of the micro-service call chain data;
and matching each group of the system log data by using a preset log template to obtain matching scores corresponding to different system log data in each group of the system log data, and calculating the z-scores of the matching scores corresponding to the different system log data in each group of the system log data and the z-score of first-order difference data between the different matching scores corresponding to the different system log data in each group of the system log data.
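As an illustration of the standardization in claim 4, the sketch below computes z-scores for one group of values and for its first-order differences. Using the population standard deviation and returning zeros for a constant group are assumptions of this example, not requirements stated in the claim.

```python
import statistics

def z_scores(values):
    """Standardize values to zero mean and unit (population) deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)  # constant group: no deviation to report
    return [(v - mean) / std for v in values]

def first_difference(values):
    """Difference between each sample and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical call times (ms) from one window of call chain data.
call_times = [12.0, 13.0, 12.5, 40.0]
level_scores = z_scores(call_times)
change_scores = z_scores(first_difference(call_times))
```

The difference series highlights sudden jumps, so a spike that is unremarkable in absolute level can still stand out in `change_scores`.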
5. The method according to claim 4, wherein the step of normalizing any one set of operation state data in the current operation data comprises:
calculating the mean and the variance corresponding to the group of operating state data by using an optimized mean calculation formula and an optimized variance calculation formula, respectively, and calculating the z-score corresponding to the group of operating state data based on the mean and the variance; wherein
the optimized mean calculation formula is as follows:
Figure FDA0003418385920000021
the optimized variance calculation formula is as follows:
Figure FDA0003418385920000022
wherein
Figure FDA0003418385920000023
n represents the data sample size corresponding to the group of operating state data, x_i represents the i-th data sample in the group of operating state data, mean represents the mean, and s^2 represents the variance.
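The formula images referenced in this claim are not available in this text, so the claimed formulas cannot be shown. For orientation only, the sketch below uses Welford's online algorithm, a well-known single-pass way to compute a mean and variance; it is a substitute chosen for illustration and may differ from the optimized formulas of the claim. The population variance (division by n) is also an assumption.

```python
def running_mean_variance(samples):
    """Single-pass mean and population variance (Welford's algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return mean, variance

def z_score(x, mean, variance):
    """z-score of one sample given a precomputed mean and variance."""
    return 0.0 if variance == 0 else (x - mean) / variance ** 0.5

mean, variance = running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9])
```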
6. The system fault detection method according to claim 1, wherein before training the model to be trained, which is constructed based on the logistic regression algorithm, by using the historical operating data carrying the fault type label, the method further comprises:
acquiring historical normal operation data and historical fault operation data;
adding label information containing corresponding running time interval labels and fault-free type labels to the historical normal running data to obtain first historical running data serving as a negative sample;
adding label information including corresponding operation time interval labels and fault type labels to the historical fault operation data, and resampling the historical fault operation data with the label information added to obtain second historical operation data serving as a positive sample, so that the proportion between the number of samples corresponding to the second historical operation data and the number of samples corresponding to the first historical operation data reaches a preset positive-negative sample proportion.
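A possible reading of the resampling step in claim 6 is random oversampling of the labelled fault samples until the preset positive-to-negative ratio is reached. The sketch below assumes that reading; the sample layout (dictionaries with `window` and `label` keys), the function name and the fixed seed are illustrative choices.

```python
import math
import random

def oversample_positives(positives, negatives, target_ratio, seed=0):
    """Resample fault samples with replacement until
    len(positives) / len(negatives) reaches target_ratio."""
    if not positives:
        raise ValueError("at least one fault sample is required")
    rng = random.Random(seed)
    needed = math.ceil(target_ratio * len(negatives))
    resampled = list(positives)
    while len(resampled) < needed:
        resampled.append(rng.choice(positives))
    return resampled

# Hypothetical labelled history: 10 normal windows, 2 fault windows.
normal = [{"window": (i, i + 1), "label": "no_fault"} for i in range(10)]
faulty = [
    {"window": (20, 21), "label": "cpu_overload"},
    {"window": (30, 31), "label": "slow_call"},
]
balanced_faulty = oversample_positives(faulty, normal, target_ratio=0.5)
```

If the fault samples already meet the ratio, the function returns them unchanged rather than discarding any.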
7. The method according to claim 1, wherein the performing weighted calculation on the respective standard scores of the operating state data by using the weighting coefficients of the linear parameters comprises:
acquiring expert knowledge for optimizing the weight coefficient of the linear parameter through a preset expert knowledge acquisition interface;
correspondingly adjusting the weight coefficient of the linear parameter by using the expert knowledge to obtain the adjusted weight coefficient of the linear parameter;
and performing weighted calculation on the standard scores of the corresponding operating state data by using the adjusted weight coefficients of the linear parameters, respectively.
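The claim does not say what form the expert knowledge takes. One simple assumption, sketched below, is that an operator supplies a multiplier for selected parameters; the parameter names and the multiplier convention are hypothetical.

```python
def apply_expert_adjustments(weights, adjustments):
    """Scale selected weight coefficients by expert-supplied multipliers."""
    adjusted = dict(weights)
    for name, factor in adjustments.items():
        if name not in adjusted:
            raise KeyError(f"unknown linear parameter: {name}")
        adjusted[name] *= factor
    return adjusted

def weighted_score(standard_scores, weights):
    """Sum of standard scores weighted by their parameter coefficients."""
    return sum(weights[name] * standard_scores[name] for name in weights)

learned = {"cpu_z": 1.5, "latency_z": 0.5}
adjusted = apply_expert_adjustments(learned, {"latency_z": 2.0})
score = weighted_score({"cpu_z": 2.0, "latency_z": 3.0}, adjusted)
```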
8. The system fault detection method according to any one of claims 1 to 7, wherein the performing weighted calculation on the standard scores of the corresponding operating state data by using the weighting coefficients of the linear parameters, and performing fault location on the service system to be detected based on the weighted scores comprises:
performing weighted calculation on the standard scores of the corresponding operating state data in each service node by using the weight coefficients of the linear parameters, so as to obtain the weighted score of each service node;
screening a preset number of service nodes with weighted scores larger than a preset threshold value from all the service nodes according to the sequence of the weighted scores from large to small, and determining a target service node with a fault based on the service nodes obtained after screening;
screening out the maximum weight coefficient from the weight coefficients of all the linear parameters corresponding to the target service node, and determining the parameter type of the linear parameter corresponding to the maximum weight coefficient as the corresponding fault root cause.
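The node ranking and root-cause selection of claim 8 can be sketched as below. The sketch assumes that each service node has its own set of linear parameters, stored as a nested dictionary; that layout, the node names and the numeric values are illustrative assumptions.

```python
def locate_faults(node_scores, node_weights, threshold, top_k):
    """Return {target node: parameter with the largest weight coefficient}."""
    totals = {
        node: sum(node_weights[node][p] * s for p, s in scores.items())
        for node, scores in node_scores.items()
    }
    # Rank nodes by weighted score, highest first, then keep at most
    # top_k nodes whose score exceeds the threshold.
    ranked = sorted(totals, key=totals.get, reverse=True)
    targets = [n for n in ranked if totals[n] > threshold][:top_k]
    return {n: max(node_weights[n], key=node_weights[n].get) for n in targets}

node_scores = {
    "gateway": {"cpu_z": 0.2, "latency_z": 0.1},
    "orders": {"cpu_z": 3.0, "latency_z": 2.5},
    "billing": {"cpu_z": 1.0, "latency_z": 2.0},
}
node_weights = {
    "gateway": {"cpu_z": 1.0, "latency_z": 0.5},
    "orders": {"cpu_z": 0.8, "latency_z": 1.6},
    "billing": {"cpu_z": 1.2, "latency_z": 0.4},
}
top_fault = locate_faults(node_scores, node_weights, threshold=1.0, top_k=1)
```

Note that, following the claim wording, the root cause is chosen by the largest weight coefficient, not by the largest weighted contribution.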
9. A system fault detection device, comprising:
the data acquisition module is used for acquiring the current operation data of each service node in the service system to be detected; the current operating data comprises a plurality of operating state data;
the standardization processing module is used for carrying out standardization processing on the current operation data by utilizing a preset data standardization method so as to obtain standard scores corresponding to various operation state data respectively;
the model training module is used for training a model to be trained, which is constructed based on a logistic regression algorithm, by using historical operating data with fault type labels so as to obtain a trained supervised learning model;
the weight coefficient extraction module is used for extracting a weight coefficient corresponding to each linear parameter in the trained supervised learning model; wherein the different linear parameters respectively correspond to the different operating state data;
and the fault location module is used for performing weighted calculation on the standard scores of the corresponding operating state data by using the weight coefficients of the linear parameters, respectively, and performing fault location on the service system to be detected based on the weighted scores.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program for carrying out the steps of the system fault detection method according to any one of claims 1 to 8.
11. A computer-readable storage medium for storing a computer program; wherein the computer program realizes the steps of the system fault detection method according to any one of claims 1 to 8 when executed by a processor.
CN202111554982.5A 2021-12-17 2021-12-17 System fault detection method, device, equipment and medium Pending CN114328198A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111554982.5A CN114328198A (en) 2021-12-17 2021-12-17 System fault detection method, device, equipment and medium
PCT/CN2022/122295 WO2023109251A1 (en) 2021-12-17 2022-09-28 System fault detection method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111554982.5A CN114328198A (en) 2021-12-17 2021-12-17 System fault detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114328198A true CN114328198A (en) 2022-04-12

Family

ID=81053078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111554982.5A Pending CN114328198A (en) 2021-12-17 2021-12-17 System fault detection method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN114328198A (en)
WO (1) WO2023109251A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116990450B (en) * 2023-07-18 2024-04-26 欧几里德(苏州)医疗科技有限公司 Defect detection method and system for cornea shaping mirror
CN116882701A (en) * 2023-07-27 2023-10-13 上海洲固电力科技有限公司 Electric power material intelligent scheduling system and method based on zero-carbon mode
CN116725613B (en) * 2023-08-11 2024-01-26 威海市博华医疗设备有限公司 Control device based on pneumatic hemostatic equipment
CN117130819B (en) * 2023-10-27 2024-01-30 江西师范大学 Micro-service fault diagnosis method based on time delay variance and correlation coefficient value
CN117572159B (en) * 2024-01-17 2024-03-26 成都英华科技有限公司 Power failure detection method and system based on big data analysis

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710555A (en) * 2018-05-23 2018-10-26 郑州云海信息技术有限公司 A kind of server error diagnosis method based on supervised learning
KR101984730B1 (en) * 2018-10-23 2019-06-03 (주) 글루시스 Automatic predicting system for server failure and automatic predicting method for server failure
CN109446049A (en) * 2018-11-01 2019-03-08 郑州云海信息技术有限公司 A kind of server error diagnosis method and apparatus based on supervised learning
CN111782532B (en) * 2020-07-02 2022-04-05 北京航空航天大学 Software fault positioning method and system based on network abnormal node analysis
CN112698975B (en) * 2020-12-14 2022-09-27 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN114328198A (en) * 2021-12-17 2022-04-12 浪潮电子信息产业股份有限公司 System fault detection method, device, equipment and medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109251A1 (en) * 2021-12-17 2023-06-22 浪潮电子信息产业股份有限公司 System fault detection method and apparatus, device, and medium
CN115225460A (en) * 2022-07-15 2022-10-21 北京天融信网络安全技术有限公司 Failure determination method, electronic device, and storage medium
CN115225460B (en) * 2022-07-15 2023-11-28 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN117348605A (en) * 2023-12-05 2024-01-05 东莞栢能电子科技有限公司 Optimization method and system applied to control system of release film tearing machine
CN117348605B (en) * 2023-12-05 2024-03-12 东莞栢能电子科技有限公司 Optimization method and system applied to control system of release film tearing machine

Also Published As

Publication number Publication date
WO2023109251A1 (en) 2023-06-22

Similar Documents

Publication Publication Date Title
CN114328198A (en) System fault detection method, device, equipment and medium
CN110472809A (en) Calculate the basic reason and forecast analysis of Environmental Technology problem
CN111475804A (en) Alarm prediction method and system
CN109471783B (en) Method and device for predicting task operation parameters
CN109992484B (en) Network alarm correlation analysis method, device and medium
CN110912737A (en) Dynamic perception performance early warning method based on hybrid model
CN115225536B (en) Virtual machine abnormality detection method and system based on unsupervised learning
AU2021309929B2 (en) Anomaly detection in network topology
US10372572B1 (en) Prediction model testing framework
CN112769605A (en) Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform
CN105207797A (en) Fault locating method and fault locating device
CN114969366A (en) Network fault analysis method, device and equipment
CN113282920B (en) Log abnormality detection method, device, computer equipment and storage medium
CN113676343B (en) Fault source positioning method and device for power communication network
CN113825165B (en) 5G slice network congestion early warning method and device based on time diagram network
CN113093695A (en) Data-driven SDN controller fault diagnosis system
CN115809818A (en) Multidimensional diagnosis and evaluation method and device for auxiliary equipment of pumped storage power station
CN116155541A (en) Automatic machine learning platform and method for network security application
CN114443738A (en) Abnormal data mining method, device, equipment and medium
CN116522213A (en) Service state level classification and classification model training method and electronic equipment
CN112579402A (en) Method and device for positioning faults of application system
CN112579429A (en) Problem positioning method and device
CN112948154A (en) System abnormity diagnosis method, device and storage medium
CN117113157B (en) Platform district power consumption fault detection system based on artificial intelligence
CN111722977A (en) System inspection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231121

Address after: Room 2301, No. 395 Linjiang Avenue, Tianhe District, Guangzhou City, Guangdong Province, 510655 (Location: Self made Unit 01)

Applicant after: Guangdong Inspur Intelligent Computing Technology Co.,Ltd.

Applicant after: INSPUR ELECTRONIC INFORMATION INDUSTRY Co.,Ltd.

Address before: No. 1036 Langchao Road, High-tech Zone, Jinan, Shandong

Applicant before: INSPUR ELECTRONIC INFORMATION INDUSTRY Co.,Ltd.