CN114461439A - Fault diagnosis method, device, equipment and storage medium - Google Patents


Publication number
CN114461439A
Authority
CN
China
Prior art keywords
model parameters
enterprise
fault diagnosis
target
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210381536.7A
Other languages
Chinese (zh)
Inventor
王斯
袁传博
张秀波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210381536.7A
Publication of CN114461439A
Priority to PCT/CN2022/101975 (WO2023197453A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The application discloses a fault diagnosis method, apparatus, device and storage medium, and relates to the field of computer technology. The method is applied to a server management center and comprises: acquiring model parameters that are sent by each enterprise system and used for performing fault diagnosis on a target type device, where the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm; performing a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters; and sending the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down it determines whether its own target type device has failed by using the hypothesis function and the target model parameters. With this technical solution, horizontal federated learning can be applied to fault diagnosis of enterprise servers, improving fault diagnosis accuracy.

Description

Fault diagnosis method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for fault diagnosis.
Background
With the development of computer systems and the widespread use of the Internet across industries, the number of servers in use keeps growing. How to build a server management system that manages servers more efficiently has become a central concern wherever servers are used, for example among server users, server operation and maintenance enterprises and regional supercomputing centers, and especially among Internet enterprises that run servers in large numbers. As the number of servers increases, enterprises pay more attention to server operation and maintenance efficiency. When a server fails, the enterprise urgently needs an efficient operation and maintenance strategy for handling the failed server, and operation and maintenance personnel need to locate the cause of the failure quickly; a key part of this is quickly identifying which component of the server has failed. In traditional server operation and maintenance, the cause of a failure is often difficult to determine from logs, so applying machine learning to fault diagnosis is undoubtedly a valuable direction. However, considering the three elements of machine learning (algorithms, examples and data), enterprises know little about server internals and the data of a single enterprise is insufficient to support a machine learning algorithm, so it is clearly more suitable for the server provider to supply the method that diagnoses faults and accurately locates the faulty component. What server providers lack most, however, is server failure data, and enterprises are unwilling to expose their own server failure data. It is therefore difficult for the server provider to train an effective machine learning algorithm for server fault diagnosis, which has become a major bottleneck in applying machine learning to server fault diagnosis.
In summary, how to locate the cause of a failure quickly when a server fails, and how to overcome insufficient data and data islands between enterprises when applying machine learning algorithms, are problems that remain to be solved.
Disclosure of Invention
In view of the above, the present invention provides a fault diagnosis method, apparatus, device and storage medium that can quickly locate the cause of a failure when a server fails, and that address the problems of insufficient data and data islands between enterprises when applying machine learning algorithms. The specific scheme is as follows:
In a first aspect, the present application discloses a fault diagnosis method applied to a server management center, including:
acquiring model parameters that are sent by each enterprise system and used for performing fault diagnosis on a target type device, where the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm;
performing a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters;
and sending the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it determines whether its own target type device has failed by using the hypothesis function and the target model parameters.
Optionally, the fault diagnosis method further includes:
acquiring, through a distributed network system built from the baseboard management controllers on the enterprise servers, the real-time data generated by the target type device in each enterprise server when the server goes down; inputting the real-time data into the cost function corresponding to the hypothesis function of the preset machine learning algorithm; and determining, based on the gradient descent function corresponding to the cost function, the parameters at which the cost function attains its minimum, to obtain the model parameters for performing fault diagnosis on the target type device.
Optionally, acquiring the real-time data generated by the target type device in each enterprise server when the server goes down includes:
reading a target register in a hardware error detection architecture through a platform environment control interface, to obtain the real-time data that the target type device generated when the server went down and that was collected in the target register.
Optionally, inputting the real-time data into the cost function corresponding to the hypothesis function of the preset machine learning algorithm, and then determining the parameters at which the cost function attains its minimum based on the gradient descent function corresponding to the cost function, includes:
inputting the real-time data into the cost function corresponding to the hypothesis function of a preset logistic regression algorithm, and then determining the parameters at which the cost function attains its minimum based on the gradient descent function corresponding to the cost function.
Optionally, acquiring the model parameters that are sent by each enterprise system and used for performing fault diagnosis on the target type device, and performing a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters, includes:
periodically acquiring, according to a preset time period, the model parameters that are sent by each enterprise system and used for performing fault diagnosis on the target type device; performing a weighted average of the model parameters sent by the different enterprise systems according to the preset weighting rule to obtain current target model parameters; and replacing the target model parameters obtained in the previous time period with the current target model parameters.
Optionally, determining whether its own target type device has failed by using the hypothesis function and the target model parameters includes:
inputting the target model parameters into the hypothesis function to obtain the fault probability output by the hypothesis function, and judging whether the fault probability is smaller than a preset threshold;
if the fault probability is smaller than the preset threshold, judging that its own target type device has not failed;
and if the fault probability is not smaller than the preset threshold, judging that its own target type device has failed.
Optionally, acquiring the model parameters that are sent by each enterprise system and used for performing fault diagnosis on the target type device includes:
acquiring the homomorphically encrypted model parameters that are sent by each enterprise system and used for performing fault diagnosis on the target type device.
In a second aspect, the present application discloses a fault diagnosis apparatus applied to a server management center, including:
a parameter acquisition module, configured to acquire model parameters that are sent by each enterprise system and used for performing fault diagnosis on a target type device, where the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm;
a parameter operation module, configured to perform a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters;
and a parameter sending module, configured to send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it determines whether its own target type device has failed by using the hypothesis function and the target model parameters.
In a third aspect, the present application discloses an electronic device comprising a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the fault diagnosis method as described above.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the fault diagnosis method as described above when executed by a processor.
As can be seen, the method of the present application is applied to a server management center and acquires model parameters that are sent by each enterprise system and used for performing fault diagnosis on a target type device, where the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm; performs a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters; and sends the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it determines whether its own target type device has failed by using the hypothesis function and the target model parameters. Because the server management center collects the model parameters sent by each enterprise system, horizontal federated learning is applied to server fault diagnosis, the server management center obtains sufficient data for machine learning training, and the problems of insufficient data and data islands between enterprises are addressed. The model parameters sent by the enterprises are combined by weighted average under the preset weighting rule to obtain the target model parameters, which the server management center then distributes to each enterprise system, so that the faulty component can be located quickly when a server fails, operation and maintenance pressure is relieved, fault diagnosis accuracy is improved, and the competitiveness of the server supplier is enhanced.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a fault diagnosis method disclosed herein;
FIG. 2 is a schematic diagram of a fault diagnosis method disclosed herein;
FIG. 3 is a sub-flow diagram of a fault diagnosis method disclosed herein;
FIG. 4 is a sub-flow diagram of a fault diagnosis method disclosed herein;
FIG. 5 is a flow chart of a particular fault diagnosis method disclosed herein;
FIG. 6 is a schematic structural diagram of a fault diagnosis apparatus disclosed in the present application;
FIG. 7 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, after a server fails, it is difficult in traditional server operation and maintenance to determine the cause of the failure from logs. When a machine learning algorithm is applied to server fault diagnosis, the data of a single enterprise is insufficient to support the algorithm, and enterprises are unwilling to expose their own server failure data. It is therefore difficult for a server provider to train an effective machine learning algorithm for server fault diagnosis, which has become a bottleneck in applying machine learning to server fault diagnosis.
The present application therefore provides a fault diagnosis scheme that can quickly locate the cause of a failure when a server fails and that addresses the problems of insufficient data and data islands between enterprises when applying machine learning algorithms.
An embodiment of the present invention discloses a fault diagnosis method, shown in FIG. 1, that is applied to a server management center and includes the following steps:
Step S11: acquiring model parameters that are sent by each enterprise system and used for performing fault diagnosis on a target type device, where the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm.
In this embodiment, the method is applied to a server management center, which consolidates the parameters reported by the enterprise systems by acquiring the model parameters that each enterprise system sends for performing fault diagnosis on the target type device. The model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm.
It can be appreciated that federated learning is a machine learning framework that helps multiple organizations use data and build machine learning models while meeting the requirements of user privacy protection, data security and government regulation. As a distributed machine learning paradigm, federated learning can effectively address the data island problem: participants model jointly without sharing data, so data islands are broken down technically and AI (Artificial Intelligence) cooperation is achieved. Therefore, when the server management center collects the model parameters sent by multiple enterprise systems, horizontal federated learning is applied in the server operation and maintenance system to perform fault diagnosis on servers, and the problems of insufficient data and data islands between enterprises when applying machine learning algorithms are addressed.
In this embodiment, the server management center may initially provide a set of initial model parameters to each enterprise system; these are generally obtained by consolidating the model parameters that the enterprises obtained through their own machine learning algorithms. When a server goes down, the enterprise server in each enterprise system first judges whether the target type device has failed and notifies the operation and maintenance personnel of the result. To improve the accuracy and efficiency of fault detection, the server management center continuously optimizes the initial model parameters: it may set an interval, such as one hour or one day, at which it receives the data reported by each enterprise system, keeps running the gradient descent algorithm to update the parameters, and issues the updated parameters to each enterprise at a specified time, thereby improving the accuracy of fault diagnosis.
Step S12: performing a weighted average of the model parameters sent by the different enterprise systems according to a preset weighting rule to obtain target model parameters.
In this embodiment, after acquiring the model parameters that are sent by each enterprise system and used for performing fault diagnosis on the target type device, the server management center consolidates them according to the preset weighting rule to obtain the target model parameters. For example, the weighted average may be computed according to the number of servers operated by each enterprise system: the larger the enterprise system, the more reliable its data is taken to be and the larger the weight given to its model parameters.
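The weighting rule described above can be sketched as follows. This is a minimal illustration, assuming the weight of each enterprise system is its share of the total server count; the function name `weighted_average`, the enterprise names and all numeric values are hypothetical and do not come from the application.

```python
from typing import Dict, List, Tuple


def weighted_average(reports: Dict[str, Tuple[int, List[float]]]) -> List[float]:
    """Combine per-enterprise parameter vectors into target model parameters.

    `reports` maps an enterprise name to (server_count, theta). The weight of
    each enterprise is server_count / total_server_count (an assumed rule).
    """
    total_servers = sum(count for count, _ in reports.values())
    if total_servers <= 0:
        raise ValueError("total server count must be positive")
    dim = len(next(iter(reports.values()))[1])
    target = [0.0] * dim
    for count, theta in reports.values():
        if len(theta) != dim:
            raise ValueError("all parameter vectors must have the same length")
        weight = count / total_servers
        for j, value in enumerate(theta):
            target[j] += weight * value
    return target


# Hypothetical reports: enterprise_a runs 300 servers, enterprise_b runs 100.
reports = {
    "enterprise_a": (300, [0.2, -1.0, 0.5]),
    "enterprise_b": (100, [0.6, -0.2, 0.1]),
}
target_theta = weighted_average(reports)
```

With these made-up numbers, enterprise_a contributes three quarters of each target parameter and enterprise_b one quarter. Other weighting rules, such as weighting by the number of reported downtime samples, would only change how `weight` is computed.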
In the embodiment of the application, each enterprise system reports the data collected by the enterprise system to the server management center periodically, so that the server management center periodically obtains the model parameters which are respectively sent by each enterprise system and used for fault diagnosis of the target type device according to a preset time period, and then performs weighted averaging on the model parameters sent by different enterprise systems by using preset weighting rules to obtain the current target model parameters. Therefore, the target model parameters obtained in the last time period can be updated by using the current target model parameters.
Step S13: sending the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it determines whether its own target type device has failed by using the hypothesis function and the target model parameters.
In this embodiment, after the model parameters for fault diagnosis sent by each enterprise system have been consolidated and new target model parameters have been computed, the target model parameters are sent to the different enterprise servers under each enterprise system, so that the model parameters of the enterprise systems are updated. When a server goes down, the enterprise server can then use the target model parameters to judge whether its own target type device has failed. It should be noted that one set of parameters and one algorithm only determine whether one type of device has failed, for example the CPU (Central Processing Unit) or the memory; in practice the server management center should therefore use multiple sets of parameters to determine the fault probabilities of different types of devices.
Specifically, the enterprise server uses the target model parameters to determine whether its own target type device has failed as follows: the target model parameters are input into the hypothesis function to obtain the fault probability output by the hypothesis function, and it is judged whether the fault probability is smaller than a preset threshold; if the fault probability is smaller than the preset threshold, the target type device is judged not to have failed; if the fault probability is not smaller than the preset threshold, the target type device is judged to have failed. It can be understood that an enterprise with strict requirements on fault probability can lower the preset threshold. As long as the output fault probability is not smaller than the preset threshold, the fault is detected promptly and the faulty component is identified, and the operation and maintenance personnel are then notified of the result so that the faulty component can be replaced.
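The threshold decision can be illustrated with a short sketch. It assumes the hypothesis function is the logistic (sigmoid) function used later in this description; the names `hypothesis` and `diagnose`, the parameter values and the feature values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic hypothesis: fault probability for feature vector x."""
    z = sum(t * v for t, v in zip(theta, x))
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def diagnose(theta: Sequence[float], x: Sequence[float],
             threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (fault probability, faulty flag).

    The device is judged faulty when the probability is not smaller than
    the threshold, matching the rule stated in the text.
    """
    probability = hypothesis(theta, x)
    return probability, not (probability < threshold)


# Hypothetical target model parameters and register-derived features.
target_theta = [-2.0, 3.0, 1.0]
register_features = [1.0, 0.9, 0.4]
probability, faulty = diagnose(target_theta, register_features)
```

An enterprise with stricter requirements would pass a lower `threshold`, so that the same probability is more readily classified as a fault.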
FIG. 2 is a schematic diagram of the fault diagnosis provided in this embodiment. The whole can be divided into several modules: a BMC (Baseboard Management Controller) fault data collection module, a BMC machine learning algorithm module, a BMC communication module, a center communication module and a center summary algorithm module. The BMC fault data collection module, the BMC machine learning algorithm module and the BMC communication module are located on the server user side; the center communication module and the center summary algorithm module are located in the server management center.
BMC fault data collection module: responsible for collecting fault data when a fault occurs. Taking CPU diagnosis as an example, the MSR (Model Specific Register) and CSR (Control State Register) registers of the MCA (Machine Check Architecture) can be read through the PECI (Platform Environment Control Interface) channel; there are usually hundreds to thousands of these registers.
BMC machine learning algorithm module: it has two main functions. The first is to identify the faulty component: when the server goes down, the collected MSR and CSR register data is taken as input and a machine learning algorithm outputs the probability that a given component has failed, for example the probability that CPU_0 has failed. The second is to run the algorithm that updates the model parameters: after the operation and maintenance personnel replace the actual faulty component, the result is fed back to the BMC, which runs the gradient descent algorithm multiple times based on the result and updates a set of model parameters; these model parameters can be broadcast to the other BMCs of the enterprise so that they update their parameters.
BMC communication module: responsible for interacting with the other BMCs of the enterprise and with the central node of the server provider; it can also homomorphically encrypt and decrypt the server's computation results.
Center communication module: responsible for interacting with the BMC on each enterprise server; it implements the underlying protocol stack and securely receives and sends data.
Center summary algorithm module: consolidates the encrypted parameters reported by each enterprise system, computes new encrypted parameters, and sends them to each enterprise for updating.
It should be noted that the model parameters collected by the server management center are obtained by the enterprise servers in each enterprise system through machine learning performed on their baseboard management controllers. However, a baseboard management controller is an embedded system with limited computing power and is not fast enough when executing the gradient descent algorithm. Multiple baseboard management controllers can therefore be combined into a distributed network, and distributed, decentralized computing can be used to overcome the computing-power limitation. Specifically, referring to FIG. 3, this embodiment may further include:
Step S21: acquiring, through a distributed network system built from the baseboard management controllers on the enterprise servers, the real-time data generated by the target type device in each enterprise server when the server goes down.
Step S22: inputting the real-time data into the cost function corresponding to the hypothesis function of the preset machine learning algorithm, and then determining, based on the gradient descent function corresponding to the cost function, the parameters at which the cost function attains its minimum, to obtain the model parameters for performing fault diagnosis on the target type device.
That is, in this embodiment, the distributed network system built from the baseboard management controllers on the enterprise servers processes the real-time data generated by the target type device when a server goes down, and thereby obtains the model parameters for performing fault diagnosis on the target type device.
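One way such a distributed network could share the gradient descent workload is data-parallel summation: each BMC computes the gradient contribution of the samples it holds, and the contributions are summed before a single parameter update. The sketch below assumes this scheme and the standard logistic regression gradient; the application does not specify how work is partitioned, and the names `partial_gradient` and `distributed_step` as well as the sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector x, label y in {0, 1})


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def partial_gradient(theta: Sequence[float], shard: Sequence[Sample]) -> List[float]:
    """Un-normalised gradient sum over the samples held by one BMC."""
    grad = [0.0] * len(theta)
    for x, y in shard:
        error = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad


def distributed_step(theta: Sequence[float], shards: Sequence[Sequence[Sample]],
                     alpha: float) -> List[float]:
    """One gradient descent step whose gradient is summed across shards."""
    m = sum(len(shard) for shard in shards)
    total = [0.0] * len(theta)
    for shard in shards:  # in a real deployment each shard would run on its own BMC
        for j, g in enumerate(partial_gradient(theta, shard)):
            total[j] += g
    return [t - alpha * g / m for t, g in zip(theta, total)]


# Hypothetical downtime samples; the first feature is a constant 1.0.
samples = [([1.0, 0.2], 0), ([1.0, 0.8], 1), ([1.0, 0.3], 0), ([1.0, 0.9], 1)]
theta0 = [0.0, 0.0]
one_node = distributed_step(theta0, [samples], alpha=0.1)
two_nodes = distributed_step(theta0, [samples[:2], samples[2:]], alpha=0.1)
```

Splitting the samples across two nodes gives the same update as processing them on one node, because the gradient is a sum over samples; only the per-node work shrinks.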
Specifically, in the constructed distributed network system, the baseboard management controller on each enterprise server reads a target register in the hardware error detection architecture through the platform environment control interface, to obtain the real-time data that the target type device generated when the server went down and that was collected in the target register. The real-time data is then input into the cost function corresponding to the hypothesis function of a preset logistic regression algorithm, and the parameters at which the cost function attains its minimum are determined based on the gradient descent function corresponding to the cost function. For example, this embodiment is described using the Logistic regression (log-odds regression) algorithm, which is the clearest and most practical:
Given a training set (also called the samples)
$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$$
where
$$x^{(i)} \in \mathbb{R}^{n+1}, \quad y^{(i)} \in \{0, 1\}, \quad i = 1, \dots, m$$
The specific meanings are as follows:
$(x^{(i)}, y^{(i)})$ denotes the data of one downtime event;
$x^{(i)}$ is an (n+1)-dimensional feature vector formed from the values of n+1 MSR and CSR registers;
$y^{(i)}$ indicates whether this downtime was caused by CPU_0 and takes only the values 0 and 1, where 0 means that CPU_0 did not cause the downtime and 1 means that it did.
The hypothesis function of the Logistic regression algorithm is:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$
where θ is an (n+1)-dimensional vector holding the coefficient by which each component of x is multiplied. Given a feature vector x, the function outputs the probability that CPU_0 has failed; its value is always greater than 0 and less than 1. Therefore, if the computed value is greater than 0.5, CPU_0 is judged to have failed; otherwise it is judged not to have failed. The key is to find a specific value of the (n+1)-dimensional vector θ, which is also the key to the machine learning algorithm.
The cost function is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
where m is the number of samples; $x^{(i)}$ is an (n+1)-dimensional feature vector; $y^{(i)}$ indicates whether the component failed when the server went down and takes only the values 0 and 1; and $h_\theta(x^{(i)})$ is the failure probability output by the hypothesis function. Because the gap between the probability computed by the hypothesis function and the actual outcome changes with θ, finding the θ that minimizes this gap requires using the gradient descent function corresponding to the cost function to find the θ at which $J(\theta)$ attains its minimum.
The gradient descent function is:
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 0, 1, \dots, n$$
where α is the learning rate. A set of θ values is chosen as the initial values and substituted into the formula to obtain a more effective set of θ values; repeating this calculation yields the most effective set of θ values.
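The training procedure can be sketched in plain Python as batch gradient descent for logistic regression with the standard cross-entropy cost. This is an illustration under assumptions, not the applicant's implementation: the function names, the learning rate, the iteration count and the toy data (a constant first feature plus one made-up register-derived value) are all hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta: Sequence[float], x: Sequence[float]) -> float:
    """Hypothesis function: probability that the component has failed."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def cost(theta: Sequence[float], xs: Sequence[Sequence[float]],
         ys: Sequence[int]) -> float:
    """Cross-entropy cost averaged over the m samples."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        h = min(max(predict(theta, x), eps), 1.0 - eps)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)


def gradient_descent(xs: Sequence[Sequence[float]], ys: Sequence[int],
                     alpha: float = 0.5, iterations: int = 1000) -> List[float]:
    """Repeat the update theta_j := theta_j - alpha * dJ/dtheta_j."""
    theta = [0.0] * len(xs[0])
    m = len(xs)
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = predict(theta, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj
        # All components are updated together from the same gradient.
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta


# Toy samples: label 1 means the component caused the downtime.
xs = [[1.0, 0.1], [1.0, 0.4], [1.0, 0.6], [1.0, 0.9]]
ys = [0, 0, 1, 1]
theta = gradient_descent(xs, ys)
initial_cost = cost([0.0, 0.0], xs, ys)
final_cost = cost(theta, xs, ys)
```

Starting from θ = 0 the cost equals log 2, and each iteration lowers it on this toy data. A real feature vector would hold hundreds to thousands of register values and would likely need scaling before training.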
FIG. 4 shows the diagnostic process of the baseboard management controller when the server goes down. Taking the CPU as an example, the BMC first collects CPU register data and checks whether data has been collected. If data has been collected, the BMC evaluates the hypothesis function for a given component and reports the result to the operation and maintenance personnel. After replacing the actual faulty component, the operation and maintenance personnel feed the result back to the BMC (the faulty component may or may not be CPU_0, so the hypothesis function may have judged correctly or wrongly). The BMC then updates the algorithm parameters according to this feedback: it runs the gradient descent algorithm multiple times to update the set of θ values and broadcasts them to the other BMCs of the enterprise so that they update their parameter θ. If no data has been collected, the process ends. It should be noted that all BMCs must use a consistent machine learning model, so downloading and updating the BMC firmware version from the server provider is a convenient way to achieve this. The method thus applies horizontal federated learning to fault diagnosis on the server BMC, provides a software method, and addresses the problems that server faults are difficult to locate, that data is insufficient when applying machine learning algorithms, and that data is isolated between enterprises; at the same time, the fault handling process is automated, which can save a large amount of labor cost.
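The diagnose, feedback and broadcast loop can be modelled roughly as below. The `BmcNode` class is a hypothetical stand-in for a BMC and does not correspond to any real firmware API; the register values, learning rate and step count are made up, and the broadcast is simulated by copying θ to peer objects in the same process.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class BmcNode:
    """Hypothetical model of one BMC that diagnoses, learns and broadcasts."""

    def __init__(self, theta: Sequence[float]) -> None:
        self.theta: List[float] = list(theta)
        self.samples: List[Tuple[List[float], int]] = []
        self.peers: List["BmcNode"] = []

    def diagnose(self, x: Sequence[float]) -> float:
        """Fault probability of the component for register features x."""
        return sigmoid(sum(t * v for t, v in zip(self.theta, x)))

    def feedback(self, x: Sequence[float], was_faulty: bool,
                 alpha: float = 0.1, steps: int = 20) -> None:
        """Record the maintenance result, run several gradient descent
        steps over the stored samples, then push theta to the peers."""
        self.samples.append((list(x), 1 if was_faulty else 0))
        m = len(self.samples)
        for _ in range(steps):
            grad = [0.0] * len(self.theta)
            for features, label in self.samples:
                error = self.diagnose(features) - label
                for j, xj in enumerate(features):
                    grad[j] += error * xj
            self.theta = [t - alpha * g / m for t, g in zip(self.theta, grad)]
        for peer in self.peers:  # simulated broadcast within the enterprise
            peer.theta = list(self.theta)


node_a = BmcNode([0.0, 0.0, 0.0])
node_b = BmcNode([0.0, 0.0, 0.0])
node_a.peers.append(node_b)

registers = [1.0, 0.7, 0.1]           # made-up register-derived features
before = node_a.diagnose(registers)   # probability before any feedback
node_a.feedback(registers, was_faulty=True)
after = node_a.diagnose(registers)    # probability after learning from feedback
```

After maintenance confirms that the component was faulty, the predicted probability for the same register pattern rises, and `node_b` holds the same θ as `node_a`. In practice the broadcast would go through the BMC communication module rather than a direct attribute assignment.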
The method is applied to a server management center. Model parameters that are sent by each enterprise system and used for fault diagnosis of a target type device are acquired; the model parameters are determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm. The model parameters sent by different enterprise systems are weighted and averaged by using a preset weighting rule to obtain target model parameters. The target model parameters are then sent to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters. Because the server management center acquires the model parameters sent by each enterprise system for fault diagnosis of the target type device, horizontal federated learning is applied to server fault diagnosis, the server management center has sufficient data for machine learning training, and the problems of insufficient data when applying machine learning algorithms and of data silos between enterprises are addressed. After the weighted average yields the target model parameters, the server management center distributes them to each enterprise system, so that the faulty component can be located quickly when a server fails, operation and maintenance pressure is relieved, fault diagnosis accuracy is improved, and the competitiveness of the server supplier is enhanced.
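The aggregation step at the server management center can be illustrated with a short sketch. The preset weighting rule is not spelled out in this passage, so the weights here are a hypothetical choice (for example, the number of outage records each enterprise contributed), and the function name is illustrative.

```python
def weighted_average(parameter_sets, weights):
    """Combine per-enterprise parameter vectors into target model parameters.

    parameter_sets: one list of theta values per enterprise system.
    weights: one non-negative weight per enterprise system (assumed rule).
    """
    if not parameter_sets or len(parameter_sets) != len(weights):
        raise ValueError("need one weight per parameter set")
    dim = len(parameter_sets[0])
    if any(len(p) != dim for p in parameter_sets):
        raise ValueError("all parameter sets must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * p[j] for p, w in zip(parameter_sets, weights)) / total
        for j in range(dim)
    ]


# Two enterprises; the second is weighted three times as heavily.
TARGET = weighted_average([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

With these inputs the first component is (1·1.0 + 3·3.0) / 4 and the second is (1·2.0 + 3·6.0) / 4.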
An embodiment of the application discloses a specific fault diagnosis method which, as shown in Fig. 5, is applied to a server management center and comprises the following steps:
Step S31: acquire the homomorphically encrypted model parameters that are sent by each enterprise system and used for fault diagnosis of a target type device.
In this embodiment of the application, within each enterprise system, different enterprise servers generate the corresponding model parameters for fault diagnosis of a target type device when a server is down. The enterprise servers communicate with one another, homomorphic encryption (HE) is applied to the operation results obtained by the enterprise servers, the different baseboard management controllers of the same enterprise interact with each other, and the homomorphically encrypted model parameters are then sent to the server management center. The model parameters are determined by different enterprise servers in the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm; for details, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
It will be appreciated that a homomorphic encryption algorithm has the following property: if the homomorphic encryption algorithm is regarded as a function f, then f(a + b) = f(a) + f(b). That is, after data is homomorphically encrypted and a specific calculation is performed on the ciphertext, the plaintext obtained by homomorphically decrypting the result of the ciphertext calculation equals the result of performing the same calculation directly on the plaintext data, so the data is "computable but invisible". By using a homomorphic encryption algorithm, the server management center therefore operates directly on the ciphertext rather than on the underlying data, which effectively addresses the unwillingness of each enterprise to expose its own server fault data. Furthermore, after the server management center finishes the update and sends the updated data to each enterprise system, the enterprise can decrypt the ciphertext to obtain the latest parameters, making fault diagnosis more accurate. Various homomorphic encryption algorithms may be selected; the Paillier encryption algorithm (a probabilistic public-key encryption system) may be used in this embodiment of the application, which is not specifically limited here.
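To make the additive property concrete, the following is a toy Paillier sketch in pure Python (3.8 or later). It is an illustration only: the primes are far too small for real use, the fixed-point scale and integer weights are assumptions introduced here, and it assumes the enterprises share the decryption key while the server management center holds only the public modulus. In Paillier, adding plaintexts corresponds to multiplying ciphertexts modulo n², and multiplying a plaintext by an integer corresponds to raising the ciphertext to that power.

```python
import math
import random


def generate_keys(p=1009, q=1013):
    """Toy key generation with generator g = n + 1 (tiny primes, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt integer m; negative values are mapped into the range mod n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """Decrypt and map the result back to a signed integer."""
    lam, mu = private_key
    m = ((pow(c, lam, n * n) - 1) // n) * mu % n
    return m - n if m > n // 2 else m


def add_encrypted(n, c1, c2):
    """Ciphertext of (m1 + m2), computed without decrypting."""
    return (c1 * c2) % (n * n)


def scale_encrypted(n, c, k):
    """Ciphertext of (k * m) for a non-negative integer weight k."""
    return pow(c, k, n * n)


# Weighted sum of two encrypted parameters using fixed-point encoding.
SCALE = 1000  # hypothetical fixed-point factor
N, PRIVATE_KEY = generate_keys()
WEIGHTS = (1, 3)
enc_a = encrypt(N, round(0.25 * SCALE))  # enterprise A's parameter
enc_b = encrypt(N, round(0.75 * SCALE))  # enterprise B's parameter
enc_sum = add_encrypted(
    N,
    scale_encrypted(N, enc_a, WEIGHTS[0]),
    scale_encrypted(N, enc_b, WEIGHTS[1]),
)
# Only a key holder can decrypt; the division finishes the weighted average.
AGGREGATED = decrypt(N, PRIVATE_KEY, enc_sum) / (SCALE * sum(WEIGHTS))
```

The center only ever handles `enc_a`, `enc_b`, and `enc_sum`; the division by the total weight is done after decryption because the scheme supports addition and integer scaling but not division on ciphertexts.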
Step S32: perform a weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain target model parameters.
Step S33: send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters.
For more specific processing procedures of the step S32 and the step S33, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The method is applied to a server management center. The homomorphically encrypted model parameters that are sent by each enterprise system and used for fault diagnosis of a target type device are acquired; the model parameters are determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm. The model parameters sent by different enterprise systems are weighted and averaged by using a preset weighting rule to obtain target model parameters. The target model parameters are then sent to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters. Because the server management center acquires the model parameters sent by each enterprise system for fault diagnosis of the target type device, horizontal federated learning is applied to server fault diagnosis, the server management center has sufficient data for machine learning training, and the problems of insufficient data when applying machine learning algorithms and of data silos between enterprises are addressed. Meanwhile, the model parameters of each enterprise system are processed with a homomorphic encryption algorithm, so the server management center operates directly on the ciphertext rather than on the underlying data, and no enterprise exposes its own server operation and maintenance data. After the weighted average yields the target model parameters, the server management center distributes them to each enterprise system, so that the faulty component can be located quickly when a server fails, operation and maintenance pressure is relieved, fault diagnosis accuracy is improved, and the competitiveness of the server supplier is enhanced.
Correspondingly, the embodiment of the present application further discloses a fault diagnosis device, as shown in fig. 6, the fault diagnosis device includes:
the parameter acquisition module 11, configured to acquire model parameters that are sent by each enterprise system and used for fault diagnosis of a target type device; the model parameters are determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm;
the parameter operation module 12, configured to perform a weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain target model parameters;
the parameter sending module 13, configured to send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters.
For more specific working processes of the modules, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The technical solution of this embodiment is thus applied to a server management center. Model parameters that are sent by each enterprise system and used for fault diagnosis of a target type device are acquired; the model parameters are determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm. The model parameters sent by different enterprise systems are weighted and averaged by using a preset weighting rule to obtain target model parameters. The target model parameters are then sent to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters. Because the server management center acquires the model parameters sent by each enterprise system for fault diagnosis of the target type device, horizontal federated learning is applied to server fault diagnosis, the server management center has sufficient data for machine learning training, and the problems of insufficient data when applying machine learning algorithms and of data silos between enterprises are addressed. After the weighted average yields the target model parameters, the server management center distributes them to each enterprise system, so that the faulty component can be located quickly when a server fails, operation and maintenance pressure is relieved, fault diagnosis accuracy is improved, and the competitiveness of the server supplier is enhanced.
Further, an electronic device is disclosed in the embodiments of the present application, and fig. 7 is a block diagram of an electronic device 20 according to an exemplary embodiment, which should not be construed as limiting the scope of the application.
Fig. 7 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the fault diagnosis method disclosed in any one of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22 is used as a carrier for storing resources, and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like, the resources stored thereon may include an operating system 221, a computer program 222, data 223, and the like, and the data 223 may include various data. The storage means may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device on the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux, or the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the fault diagnosis method performed by the electronic device 20 disclosed in any of the foregoing embodiments.
Further, embodiments of the present application disclose a computer-readable storage medium for storing a computer program. The computer-readable storage medium includes a Random Access Memory (RAM), a memory, a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a magnetic disk, an optical disk, or any other form of storage medium known in the art. The computer program, when executed by a processor, implements the aforementioned fault diagnosis method. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, it is described relatively briefly, and the relevant points can be found in the description of the method.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above detailed description is provided for a fault diagnosis method, apparatus, device and storage medium provided by the present invention, and the principle and implementation of the present invention are explained in this document by applying specific examples, and the description of the above examples is only used to help understanding the method and core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A fault diagnosis method, applied to a server management center, comprising the following steps:
acquiring model parameters which are respectively sent by each enterprise system and used for carrying out fault diagnosis on a target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm;
carrying out a weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain target model parameters;
and sending the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters.
2. The fault diagnosis method according to claim 1, further comprising:
obtaining, through a distributed network system constructed based on the baseboard management controller on each enterprise server, real-time data generated by the target type devices in the enterprise servers when the servers are down; inputting the real-time data into the cost function corresponding to the hypothesis function in a preset machine learning algorithm; and determining the parameter corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, so as to obtain the model parameters for carrying out fault diagnosis on the target type devices.
3. The method according to claim 2, wherein the obtaining real-time data generated by the target type device in each of the enterprise servers when the server is down comprises:
reading a target register in a Machine Check Architecture (MCA) through a Platform Environment Control Interface (PECI) to obtain the real-time data, collected in the target register, that the target type device generated when the server was down.
4. The fault diagnosis method according to claim 2, wherein the inputting the real-time data into the cost function corresponding to the hypothesis function in a preset machine learning algorithm, and then determining the parameter corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, comprises:
inputting the real-time data into the cost function corresponding to the hypothesis function in a preset logistic regression algorithm, and then determining the parameter corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function.
5. The method according to claim 1, wherein the obtaining model parameters which are respectively sent by each enterprise system and used for fault diagnosis of a target type device, and performing weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain target model parameters comprises:
according to a preset time period, regularly obtaining model parameters which are respectively sent by each enterprise system and used for carrying out fault diagnosis on a target type device, carrying out weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain current target model parameters, and updating the target model parameters obtained in the last time period by using the current target model parameters.
6. The method of claim 1, wherein the determining whether the target type device itself is faulty using the hypothesis function and the target model parameters comprises:
inputting the target model parameters into the hypothesis function to obtain the fault probability output by the hypothesis function, and judging whether the fault probability is smaller than a preset threshold value;
if the fault probability is smaller than the preset threshold value, judging that its own target type device is not faulty;
and if the fault probability is not smaller than the preset threshold value, judging that its own target type device is faulty.
7. The method according to any one of claims 1 to 6, wherein the obtaining model parameters respectively sent by each enterprise system for fault diagnosis of the target type device comprises:
and acquiring the model parameters which are sent by each enterprise system respectively and subjected to homomorphic encryption processing and are used for carrying out fault diagnosis on the target type device.
8. A fault diagnosis device, applied to a server management center, comprising:
a parameter acquisition module, configured to acquire model parameters which are respectively sent by each enterprise system and used for carrying out fault diagnosis on a target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system by using the cost function corresponding to the hypothesis function in a preset machine learning algorithm;
a parameter operation module, configured to carry out a weighted average on the model parameters sent by different enterprise systems by using a preset weighting rule to obtain target model parameters;
and a parameter sending module, configured to send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server is down, it determines whether its own target type device is faulty by using the hypothesis function and the target model parameters.
9. An electronic device, wherein the electronic device comprises a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the fault diagnosis method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements a fault diagnosis method as claimed in any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210381536.7A CN114461439A (en) 2022-04-13 2022-04-13 Fault diagnosis method, device, equipment and storage medium
PCT/CN2022/101975 WO2023197453A1 (en) 2022-04-13 2022-06-28 Fault diagnosis method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210381536.7A CN114461439A (en) 2022-04-13 2022-04-13 Fault diagnosis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114461439A true CN114461439A (en) 2022-05-10

Family

ID=81418613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210381536.7A Pending CN114461439A (en) 2022-04-13 2022-04-13 Fault diagnosis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114461439A (en)
WO (1) WO2023197453A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780283A (en) * 2022-06-20 2022-07-22 新华三信息技术有限公司 Fault processing method and device
WO2023197453A1 (en) * 2022-04-13 2023-10-19 苏州浪潮智能科技有限公司 Fault diagnosis method and apparatus, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461215A (en) * 2020-03-31 2020-07-28 支付宝(杭州)信息技术有限公司 Multi-party combined training method, device, system and equipment of business model
CN111537945A (en) * 2020-06-28 2020-08-14 南方电网科学研究院有限责任公司 Intelligent ammeter fault diagnosis method and equipment based on federal learning
CN111722043A (en) * 2020-06-29 2020-09-29 南方电网科学研究院有限责任公司 Power equipment fault detection method, device and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220292398A1 (en) * 2019-08-16 2022-09-15 Telefonaktiebolaget Lm Ericsson (Publ) Methods, apparatus and machine-readable media relating to machine-learning in a communication network
CN111737749A (en) * 2020-06-28 2020-10-02 南方电网科学研究院有限责任公司 Measuring device alarm prediction method and device based on federal learning
CN112101489A (en) * 2020-11-18 2020-12-18 天津开发区精诺瀚海数据科技有限公司 Equipment fault diagnosis method driven by united learning and deep learning fusion
CN114330740A (en) * 2021-12-17 2022-04-12 青岛鹏海软件有限公司 Manufacturing equipment fault monitoring model training system based on federal learning
CN114461439A (en) * 2022-04-13 2022-05-10 苏州浪潮智能科技有限公司 Fault diagnosis method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023197453A1 (en) 2023-10-19

Similar Documents

Publication Publication Date Title
Chen et al. Outage prediction and diagnosis for cloud service systems
US10530740B2 (en) Systems and methods for facilitating closed loop processing using machine learning
US8001063B2 (en) Method and apparatus for reward-based learning of improved policies for management of a plurality of application environments supported by a data processing system
CN101233491B (en) System and method for detecting imbalances in dynamic workload scheduling in clustered environments
CN114461439A (en) Fault diagnosis method, device, equipment and storage medium
EP3360096A1 (en) Systems and methods for security and risk assessment and testing of applications
CN112162878A (en) Database fault discovery method and device, electronic equipment and storage medium
US10489232B1 (en) Data center diagnostic information
US9280409B2 (en) Method and system for single point of failure analysis and remediation
US20230132116A1 (en) Prediction of impact to data center based on individual device issue
CN104615476A (en) Selected virtual machine replication and virtual machine restart techniques
CN113051019A (en) Flow task execution control method, device and equipment
JP7081741B2 (en) Methods and devices for determining the status of network devices
CN117171576B (en) Abnormality monitoring method and system applied to material purification system
CN111884859B (en) Network fault diagnosis method and device and readable storage medium
US11212173B2 (en) Model-driven technique for virtual network function rehoming for service chains
US11392821B2 (en) Detecting behavior patterns utilizing machine learning model trained with multi-modal time series analysis of diagnostic data
US10659289B2 (en) System and method for event processing order guarantee
CN111630534B (en) Method for collaborative machine learning of analytical models
EP1489499A1 (en) Tool and associated method for use in managed support for electronic devices
CN112163154A (en) Data processing method, device, equipment and storage medium
CN111082964B (en) Distribution method and device of configuration information
Martinez-Julia et al. Explained intelligent management decisions in virtual networks and network slices
Kawahara et al. Application of AI to network operation
KR20200063343A (en) System and method for managing operaiton in trust reality viewpointing networking infrastucture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220510