WO2023197453A1

WO2023197453A1 - Fault diagnosis method and apparatus, device, and storage medium

Info

Publication number: WO2023197453A1
Application number: PCT/CN2022/101975
Authority: WO
Inventors: 王斯; 袁传博; 张秀波
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2022-04-13
Filing date: 2022-06-28
Publication date: 2023-10-19
Also published as: CN114461439A

Abstract

The present application relates to the technical field of computers. Disclosed are a fault diagnosis method and apparatus, a device, and a storage medium. The method is applied to a server management center, and comprises: acquiring model parameters which are respectively sent by each enterprise system and are used for performing fault diagnosis on a target-type device, wherein the model parameters are parameters determined by different enterprise servers of the enterprise system by utilizing a cost function corresponding to a hypothesis function in a preset machine learning algorithm; performing, by using a preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain a target model parameter; and sending the target model parameter to the different enterprise servers of each enterprise system, so that when the enterprise servers are down, the enterprise servers determine, by using the hypothesis function and the target model parameter, whether a fault occurs in their own target-type devices.

Description

A fault diagnosis method, device, equipment and storage medium

Cross-references to related applications

This application claims priority to the Chinese patent application filed with the China Patent Office on April 13, 2022, with application number 202210381536.7 and entitled "A fault diagnosis method, device, equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

The present invention relates to the field of computer technology, and in particular to a fault diagnosis method, device, equipment and storage medium.

Background technique

With the development of computer systems and the widespread application of the Internet in all walks of life, more and more servers are in use. How to build a server management system that manages servers more efficiently has become an important issue for server users, server operation and maintenance companies, supercomputing centers, and other scenarios where servers are used, and it is of particular concern to Internet companies that use a large number of servers.

However, as the number of servers increases, enterprises are paying more attention to server operation and maintenance efficiency. When a server fails, an enterprise urgently needs an efficient operation and maintenance strategy for dealing with it, and operation and maintenance personnel need to locate the cause of the failure quickly. A key part of this is quickly diagnosing which component of the server is faulty. In traditional server operation and maintenance, it is often difficult to analyze the cause of a fault from logs, so applying machine learning to fault diagnosis is a valuable direction. Machine learning has three elements: algorithms, computing power, and data. Since enterprises do not know much about server internals, and the data of a single enterprise is not enough to support a machine learning algorithm, it is clearly more appropriate for server vendors to provide fault diagnosis that accurately locates the faulty component. However, the inventor realized that what server vendors lack most is server failure data, and enterprises are unwilling to expose their own server failure data. It is therefore difficult for server vendors to train effective machine learning algorithms for diagnosing server faults, which has become a major bottleneck in applying machine learning to server fault diagnosis.

In summary, how to quickly locate the cause of the failure when a server fails and solve the problem of insufficient data in the application of machine learning algorithms and data islands among various enterprises are currently unresolved issues.

Contents of the invention

In the first aspect, this application discloses a fault diagnosis method, which is applied to a server management center and includes:

Obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device, where the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;

Use preset weighting rules to perform a weighted average of model parameters sent by different enterprise systems to obtain target model parameters; and

The target model parameters are sent to different enterprise servers under each enterprise system, so that when the enterprise server goes down, the hypothesis function and the target model parameters can be used to determine whether its own target type device has failed.

In one embodiment, the above fault diagnosis method further includes:

Through a distributed network system built on the baseboard management controllers of the enterprise servers, real-time data generated by the target type devices in each enterprise server when the server is down is obtained. The real-time data is input into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and the parameters corresponding to the minimum value of the cost function are then determined based on the gradient descent function corresponding to the cost function, yielding the model parameters used for fault diagnosis of the target type device.

In one embodiment, real-time data generated by target type devices in each enterprise server when the server is down is obtained, including:

The target register in the hardware error detection architecture is read through the platform environment control interface, and the real-time data generated by the target type device collected in the target register when the server is down is obtained.

In one embodiment, inputting the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and then determining the parameters corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, includes:

The real-time data is input into the cost function corresponding to the hypothesis function in the preset logistic regression algorithm, and then the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function.

In one embodiment, obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, and using the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters, includes:

According to a preset time period, periodically obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device, and use the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the current target model parameters, so that the current target model parameters are used to update the target model parameters obtained in the previous time period.

In one embodiment, the hypothesis function and target model parameters are used to determine whether the own target type device fails, including:

Input the target model parameters into the hypothesis function to obtain the failure probability output by the hypothesis function, and determine whether the failure probability is less than the preset threshold; and

In response to the failure probability being less than the preset threshold, it is determined that the server's own target type device has not failed; in response to the failure probability being not less than the preset threshold, it is determined that the server's own target type device has failed.

In one embodiment, the model parameters sent by each enterprise system for fault diagnosis of the target type device are obtained, including:

Obtain the homomorphically encrypted model parameters sent by each enterprise system for fault diagnosis of the target type device.

In the second aspect, this application discloses a fault diagnosis device, which is used in a server management center and includes:

The parameter acquisition module is used to obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device, where the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;

The parameter calculation module is used to perform a weighted average of model parameters sent by different enterprise systems using preset weighting rules to obtain target model parameters; and

The parameter sending module is used to send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, it can use the hypothesis function and the target model parameters to determine whether its own target type device has failed.

In a third aspect, the present application discloses an electronic device, which includes one or more processors and a memory, wherein the memory is used to store computer-readable instructions, and the computer-readable instructions are loaded and executed by the one or more processors to implement the fault diagnosis method disclosed in the foregoing embodiments.

In a fourth aspect, the present application discloses a non-volatile computer-readable storage medium for storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, implement the fault diagnosis method disclosed in the foregoing embodiments.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below. Other features and advantages of the application will be apparent from the description, drawings, and claims.

Description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.

Figure 1 is a flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application;

Figure 2 is a schematic diagram of a fault diagnosis method disclosed in one or more embodiments of the present application;

Figure 3 is a sub-flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application;

Figure 4 is a schematic sub-flow diagram of a fault diagnosis method disclosed in one or more embodiments of the present application;

Figure 5 is a flow chart of a specific fault diagnosis method disclosed in one or more embodiments of the present application;

Figure 6 is a schematic structural diagram of a fault diagnosis device disclosed in one or more embodiments of the present application;

Figure 7 is a structural diagram of an electronic device disclosed in one or more embodiments of the present application.

Detailed description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present invention.

Currently, after a server fails, it is often difficult in traditional server operation and maintenance to analyze the cause of the failure from logs. When machine learning algorithms are applied to server fault diagnosis, the data of a single enterprise is not enough to support the algorithm, and enterprises are not willing to expose their own server fault data. As a result, having server vendors train effective machine learning algorithms to diagnose server faults has become a major bottleneck in applying machine learning to server fault diagnosis.

To this end, this application provides a fault diagnosis solution that can quickly locate the cause of the fault when a server fails and solve the problem of insufficient data in the application of machine learning algorithms and data islands among various enterprises.

An embodiment of the present invention discloses a fault diagnosis method, as shown in Figure 1, which is applied to a server management center. The method includes:

Step S11: Obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm.

In the embodiment of this application, by obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, the server management center integrates the parameters reported by each enterprise system, where the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm.

It can be understood that federated learning is a machine learning framework that helps multiple organizations use data and build machine learning models while meeting the requirements of user privacy protection, data security, and government regulations. As a distributed machine learning paradigm, federated learning addresses the problem of data islands: participants can model jointly without sharing data, which technically breaks down data islands and enables AI (Artificial Intelligence) collaboration. Therefore, when the server management center collects the model parameters sent by multiple enterprise systems, it applies horizontal federated learning in the server operation and maintenance system to perform fault diagnosis on servers, which solves both the problem of insufficient data in the application of machine learning algorithms and the problem of data islands among enterprises.

In the embodiment of this application, the server management center can issue an initial set of model parameters to each enterprise system at the initial stage. These parameters are usually obtained by integrating the initial model parameters that each enterprise has obtained through its own machine learning algorithm. When a server goes down, the enterprise servers under each enterprise system first determine whether the target type device has failed and inform the operation and maintenance personnel of the result. In order to improve the accuracy and efficiency of fault detection, the server management center continuously optimizes the initial model parameters. The server management center can therefore set a period, such as one hour or one day, at which it receives the data reported by each enterprise system, while the gradient descent algorithm is run continuously to update the parameters; the updated parameters are sent to each enterprise at the specified time to improve the accuracy of fault diagnosis.

Step S12: Use preset weighting rules to perform a weighted average of model parameters sent by different enterprise systems to obtain target model parameters.

In the embodiment of this application, after obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, the server management center integrates the model parameters according to the preset weighting rules to obtain the target model parameters. For example, the weighted average can be calculated based on the number of servers running in each enterprise system: the larger an enterprise system, the more reliable its data is considered to be, and the greater the weight given to its model parameters.
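The server-count weighting described above can be illustrated with a minimal Python sketch. It assumes that each enterprise reports a plain list of floating-point parameters of equal length and that the weight of an enterprise is its share of the total number of running servers; the function name `weighted_average_parameters`, the enterprise names, and the sample values are hypothetical and only serve the illustration.

```python
from typing import Dict, List


def weighted_average_parameters(
    reported: Dict[str, List[float]],
    server_counts: Dict[str, int],
) -> List[float]:
    """Combine per-enterprise parameter vectors into one target vector.

    Each vector is weighted by the enterprise's share of the total number
    of running servers (one possible "preset weighting rule").
    """
    total_servers = sum(server_counts[name] for name in reported)
    if total_servers <= 0:
        raise ValueError("total server count must be positive")

    dim = len(next(iter(reported.values())))
    target = [0.0] * dim
    for name, theta in reported.items():
        if len(theta) != dim:
            raise ValueError(f"parameter length mismatch for {name}")
        weight = server_counts[name] / total_servers
        for j, value in enumerate(theta):
            target[j] += weight * value
    return target


# Hypothetical example: two enterprises running 100 and 300 servers.
reported_params = {"enterprise_a": [0.0, 4.0], "enterprise_b": [4.0, 0.0]}
counts = {"enterprise_a": 100, "enterprise_b": 300}
target_params = weighted_average_parameters(reported_params, counts)
print(target_params)
```

In this example the larger enterprise contributes three quarters of the weight, so the target parameters lie closer to its reported vector.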

In the embodiment of this application, because each enterprise system regularly reports the data it has collected to the server management center, the server management center periodically obtains, according to the preset time period, the model parameters sent by each enterprise system for fault diagnosis of the target type device, and then uses the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the current target model parameters. In this way, the current target model parameters can be used to update the target model parameters obtained in the previous time period.

Step S13: Send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, use the hypothesis function and the target model parameters to determine whether its own target type device has failed.

In the embodiment of this application, after the model parameters sent by each enterprise system for fault diagnosis have been integrated and the new target model parameters calculated, the target model parameters are sent to the different enterprise servers under each enterprise system so that they update their own model parameters. In this way, when a server goes down, the enterprise server can use the target model parameters to determine whether its target type device has failed. It should be pointed out that one set of parameters and one algorithm only determine whether a certain type of device is faulty, such as a certain CPU (Central Processing Unit) or a certain memory module. Therefore, the server management center should in practice maintain multiple sets of parameters in order to determine the failure probability of different types of devices.

Specifically, when the enterprise server uses the target model parameters to determine whether its own target type device has failed, the process includes: inputting the target model parameters into the hypothesis function to obtain the failure probability output by the hypothesis function, and determining whether the failure probability is less than the preset threshold; if the failure probability is less than the preset threshold, it is determined that the target type device has not failed; if the failure probability is not less than the preset threshold, it is determined that the target type device has failed. It can be understood that if an enterprise has strict requirements on failure probability, the preset threshold can be set lower. As soon as the output failure probability is not less than the preset threshold, the fault and the faulty component can be identified in time, and the result can then be reported to the operation and maintenance personnel, who replace the faulty part.
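The threshold decision can be sketched in Python as follows. The sketch assumes the sigmoid hypothesis function of logistic regression, which is the algorithm this description uses as its example, and a feature vector whose first element is the constant 1; the function names and the sample parameter values are hypothetical.

```python
import math
from typing import Sequence


def failure_probability(theta: Sequence[float], x: Sequence[float]) -> float:
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * v for t, v in zip(theta, x))
    # Evaluate the sigmoid in a form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def has_failed(theta: Sequence[float], x: Sequence[float],
               threshold: float = 0.5) -> bool:
    """True when the failure probability is not less than the threshold."""
    return failure_probability(theta, x) >= threshold


# Hypothetical target model parameters and two feature vectors (x[0] == 1).
target_theta = [-1.0, 2.0]
print(has_failed(target_theta, [1.0, 1.0]))                 # default threshold
print(has_failed(target_theta, [1.0, 0.0]))                 # default threshold
print(has_failed(target_theta, [1.0, 0.0], threshold=0.2))  # stricter enterprise
```

Lowering `threshold` makes the check more sensitive, which corresponds to an enterprise with stricter requirements on failure probability.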

Figure 2 is a schematic diagram of a specific fault diagnosis scheme provided in the embodiment of this application. The whole can be divided into several modules: the BMC (Baseboard Management Controller) fault data collection module, the BMC machine learning algorithm module, the BMC communication module, the central communication module, and the central summary algorithm module. The BMC fault data collection module, the BMC machine learning algorithm module, and the BMC communication module reside on the server user side; the central communication module and the central summary algorithm module reside in the server management center.

BMC fault data collection module: responsible for collecting fault data when a fault occurs. Taking CPU diagnosis as an example, the register information of the MSRs (Model Specific Registers) and CSRs (Control and Status Registers) in the MCA (Machine Check Architecture, hardware error detection architecture) can be read out through the PECI (Platform Environment Control Interface) channel; there are usually hundreds or thousands of these registers.

BMC machine learning algorithm module: it has two main functions. The first is to determine the faulty component. When the server goes down, the collected MSR and CSR register data are used as input, and the machine learning algorithm calculates and outputs the probability that a component has failed, for example the probability that CPU_0 has failed. The second function is to run the algorithm that updates the model parameters. After the operation and maintenance personnel have replaced the part that was actually faulty, the result is fed back to the BMC. Based on the result, the BMC runs multiple iterations of the gradient descent algorithm and updates a set of model parameters, which can be broadcast to the other BMCs in the enterprise so that they update their parameters.

BMC communication module: responsible for interacting with the other BMCs of the enterprise and with the central node of the server provider. It can also perform homomorphic encryption and decryption of the server's calculation results.

Central communication module: responsible for interacting with the BMC on each enterprise server, implementing the underlying protocol stack, and securely receiving and sending data.

Central summary algorithm module: integrates the encrypted parameters reported by each enterprise system, calculates new encrypted parameters, and sends them to each enterprise for update.

It should be pointed out that the model parameters collected by the server management center from the enterprise servers in each enterprise system are obtained by machine learning performed on the baseboard management controller. However, as an embedded system the baseboard management controller has limited computing power, and it does not execute the gradient descent algorithm quickly enough. One option is therefore to form a distributed network out of multiple baseboard management controllers and apply distributed computing to overcome the computing power limitation of a single baseboard management controller. Specifically, as shown in Figure 3, this embodiment may further include:

Step S21: Through a distributed network system built based on the baseboard management controller on each enterprise server, obtain real-time data generated by the target type device in each enterprise server when the server is down.

Step S22: Input the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and then determine the parameters corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, to obtain the model parameters used for fault diagnosis of the target type device.

That is to say, in this embodiment, the distributed network system constructed by the baseboard management controller on each enterprise server is used to process the real-time data generated by the target type device when the server is down, and then obtain the data for the target type. Model parameters for device fault diagnosis.

Specifically, in the constructed distributed network system, the baseboard management controller on each enterprise server first reads the target register in the hardware error detection architecture through the platform environment control interface and obtains the real-time data, collected in the target register, that the target type device generated when the server went down. The real-time data is then input into the cost function corresponding to the hypothesis function in the preset logistic regression algorithm, and the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function. By way of illustration, this embodiment uses the clear and practical logistic regression algorithm for explanation:

Given a training set (or sample set)

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\},$$

where

$$x_0 = 1, \quad y \in \{0, 1\}.$$

The specific meaning is as follows: $(x^{(i)}, y^{(i)})$ represents the data of one outage; $x^{(i)}$ is the (n+1)-dimensional feature vector composed of the values of n+1 MSR and CSR registers; $y^{(i)}$ indicates whether CPU_0 was faulty in this outage, that is, whether the outage was caused by CPU_0. It takes only two values: 0 means that CPU_0 has no fault, and 1 means that CPU_0 has a fault.

The hypothesis function of the logistic regression algorithm is:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

where θ is an (n+1)-dimensional vector holding the coefficient that multiplies each component of x. The function takes a feature vector x as input and outputs the probability that CPU_0 has failed. Its value is always greater than 0 and less than 1, so if the calculated value is greater than 0.5, CPU_0 can be judged to be faulty; otherwise it is judged not to be faulty. The key is to find the specific value of the (n+1)-dimensional vector θ, which is the core task of the machine learning algorithm.

The cost function is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

where m is the number of samples; $x^{(i)}$ is an (n+1)-dimensional feature vector; $y^{(i)}$ indicates whether the component failed when the server went down and takes only the values 0 and 1; and $h_\theta(x^{(i)})$ is the failure probability output by the hypothesis function. As θ changes, the difference between the probability calculated by the hypothesis function and the real situation also changes. To find the value of θ that minimizes this difference, the gradient descent function corresponding to the cost function is used to find the value of θ that minimizes J(θ).
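The role of the cost function can be illustrated with a short Python sketch. It assumes the standard cross-entropy cost of logistic regression with a sigmoid hypothesis; the function names, the clipping constant `eps`, and the two-sample data set are hypothetical and chosen only to show that a better θ yields a lower cost.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost(theta: Sequence[float], xs: Sequence[Sequence[float]],
         ys: Sequence[int], eps: float = 1e-12) -> float:
    """Average cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * v for t, v in zip(theta, x)))
        h = min(max(h, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)


# Hypothetical samples: x[0] is the constant 1, y is 1 when the component failed.
samples = [[1.0, 0.0], [1.0, 1.0]]
outcomes = [0, 1]

cost_uninformed = cost([0.0, 0.0], samples, outcomes)   # always predicts 0.5
cost_better = cost([-2.0, 4.0], samples, outcomes)      # separates the samples
print(cost_uninformed, cost_better)
```

With all-zero parameters every prediction is 0.5 and the cost equals ln 2; parameters that separate the two samples give a smaller value, which is what gradient descent searches for.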

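The gradient descent function is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 0, 1, \ldots, n$$

where α is the learning rate. A set of θ values is selected as the initial value and substituted into this formula to calculate a more effective set of θ values. Repeating this step yields the most effective set of θ values.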
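The iterative update can be sketched in Python as batch gradient descent for logistic regression. The sketch assumes the standard update rule in which each θ_j moves against the average of (h_θ(x) − y)·x_j over all samples; the function names, the learning rate, the iteration count, and the four-sample data set are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta: Sequence[float], x: Sequence[float]) -> float:
    """Failure probability for one feature vector."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def gradient_descent(xs: Sequence[Sequence[float]], ys: Sequence[int],
                     alpha: float = 0.5, iterations: int = 1000) -> List[float]:
    """Repeat the update step, starting from an all-zero theta."""
    m = len(xs)
    theta = [0.0] * len(xs[0])
    for _ in range(iterations):
        errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
        gradient = [
            sum(e * x[j] for e, x in zip(errors, xs)) / m
            for j in range(len(theta))
        ]
        # All components are updated simultaneously from the same gradient.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical data: one scaled register value per sample, plus x[0] == 1.
features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
labels = [0, 0, 1, 1]

theta = gradient_descent(features, labels)
print(predict(theta, [1.0, 0.0]), predict(theta, [1.0, 3.0]))
```

In practice the raw register values would likely need to be scaled before training so that a single learning rate suits every feature; that preprocessing is an assumption of this sketch and is not described in the text.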

Figure 4 shows the diagnostic process of the baseboard management controller when the server crashes. Taking the CPU as an example, the BMC first collects the CPU register data and determines whether data has been collected. If data has been collected, the BMC evaluates the hypothesis function for a given component and reports the result to the operation and maintenance personnel. After the operation and maintenance personnel have replaced the component that was actually faulty, the result is fed back to the BMC (the faulty component may or may not be CPU_0, so the hypothesis function may have been right or wrong), and the algorithm parameters are then updated based on this result: the BMC runs multiple iterations of the gradient descent algorithm, updates a set of θ values, and can broadcast them to the other BMCs in the enterprise so that they update their parameters θ. If no data has been collected, the process ends. It should be noted that all BMCs should run a consistent machine learning model, which can conveniently be ensured by downloading and updating the BMC firmware version from the server vendor. It can be seen that applying horizontal federated learning to BMC-based server fault diagnosis provides a software method for solving the difficulty of locating server faults, the problem of insufficient data in the application of machine learning algorithms, and the problem of data islands among enterprises; at the same time, the fault handling process can be automated, which saves considerable labor costs.
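The diagnose, feedback, and broadcast cycle of Figure 4 can be modeled with a small Python sketch. Everything in it is a hypothetical simplification: the class `BmcDiagnoser`, its methods, the step count, and the learning rate are illustrative names and values, register collection over PECI is replaced by a plain list of already-scaled numbers, and the broadcast is a direct in-memory copy rather than a network protocol.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class BmcDiagnoser:
    """Hypothetical model of one BMC in the Figure 4 flow."""

    def __init__(self, theta: Sequence[float], alpha: float = 0.1,
                 steps: int = 10) -> None:
        self.theta = list(theta)
        self.alpha = alpha
        self.steps = steps
        self.peers: List["BmcDiagnoser"] = []

    def _probability(self, registers: Sequence[float]) -> float:
        x = [1.0] + list(registers)  # prepend the constant x_0 = 1
        return sigmoid(sum(t * v for t, v in zip(self.theta, x)))

    def diagnose(self, registers: Optional[Sequence[float]]) -> Optional[float]:
        """Return the failure probability, or None if no data was collected."""
        if registers is None:
            return None  # nothing collected: the flow ends here
        return self._probability(registers)

    def feedback(self, registers: Sequence[float],
                 component_was_faulty: bool) -> None:
        """Update theta from the confirmed outcome, then broadcast to peers."""
        x = [1.0] + list(registers)
        y = 1.0 if component_was_faulty else 0.0
        for _ in range(self.steps):
            error = self._probability(registers) - y
            self.theta = [t - self.alpha * error * v
                          for t, v in zip(self.theta, x)]
        for peer in self.peers:
            peer.theta = list(self.theta)


bmc_a = BmcDiagnoser(theta=[0.0, 0.0, 0.0])
bmc_b = BmcDiagnoser(theta=[0.0, 0.0, 0.0])
bmc_a.peers.append(bmc_b)

sample = [1.0, 2.0]  # hypothetical, already-scaled register features
before = bmc_a.diagnose(sample)
bmc_a.feedback(sample, component_was_faulty=True)
after = bmc_b.diagnose(sample)
print(before, after)
```

After the confirmed fault is fed back to the first BMC, the second BMC reports a higher failure probability for the same register pattern, because it received the broadcast parameters.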

In the embodiment of this application, the server management center obtains the model parameters sent by each enterprise system for fault diagnosis of the target type device, where the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm; uses the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters; and sends the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed. It can be seen that, by obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, the server management center applies horizontal federated learning to server fault diagnosis and gains sufficient data for machine learning training, which solves the problem of insufficient data in the application of machine learning algorithms and the problem of data islands among enterprises. The model parameters sent by each enterprise are weighted and averaged using the preset weighting rules to obtain the target model parameters, and the server management center then distributes the target model parameters to each enterprise system, so that when a server fails, the faulty component can be located quickly. This eases operation and maintenance pressure, improves the accuracy of fault diagnosis, and strengthens the competitiveness of server suppliers.

The embodiment of the present application discloses a specific fault diagnosis method, as shown in Figure 5, which is applied to the server management center. The method includes:

Step S31: Obtain the homomorphically encrypted model parameters sent by each enterprise system and used for fault diagnosis of the target type device.

In the embodiment of this application, in each enterprise system, the different enterprise servers generate the corresponding model parameters for fault diagnosis of the target type device when a server goes down. The enterprise servers communicate with one another, and the calculation results obtained by the enterprise servers are processed with homomorphic encryption (HE). The different baseboard management controllers of the same enterprise interact with each other, and the homomorphically encrypted model parameters are then sent to the server management center. The model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm. For details, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

It can be understood that a homomorphic encryption algorithm has the following characteristic: if the algorithm is regarded as a function f, then f(a+b)=f(a)+f(b). That is, after the data has been homomorphically encrypted, specific calculations are performed on the ciphertext, and the result obtained by homomorphically decrypting the calculated ciphertext is equivalent to performing the same calculation directly on the plaintext data, so the data can be computed on while remaining invisible. Using a homomorphic encryption algorithm therefore allows the server management center to operate on the ciphertext rather than on the data itself, which addresses the unwillingness of enterprises to expose their own server failure data. Furthermore, when the server management center updates the parameters and sends the updated parameters to each enterprise system, the enterprise can decrypt the ciphertext to obtain the latest parameters, which makes fault diagnosis more accurate. A variety of homomorphic encryption algorithms are available. The embodiment of this application can use the Paillier encryption algorithm (a probabilistic public-key cryptosystem), which is not specifically limited here.
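The additive property can be demonstrated with a toy Paillier implementation in Python. This is a sketch for illustration only: the primes are far too small to be secure, the function names are hypothetical, and the messages are small non-negative integers, so real-valued model parameters would first need some fixed-point encoding, which is an assumption not covered by the text. Note that in Paillier the addition of plaintexts corresponds to a multiplication of ciphertexts, and multiplying a plaintext by a known integer weight corresponds to raising the ciphertext to that power.

```python
import math
import random
from typing import Tuple

PublicKey = Tuple[int, int]    # (n, g)
PrivateKey = Tuple[int, int]   # (lambda, mu)


def generate_keypair(p: int, q: int) -> Tuple[PublicKey, PrivateKey]:
    """Build a Paillier key pair from two distinct primes (toy sizes here)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu is the modular inverse of L(g^lambda mod n^2), with L(u) = (u-1)//n.
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub: PublicKey, message: int) -> int:
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n, g = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n2) * pow(r, n, n2)) % n2


def decrypt(pub: PublicKey, priv: PrivateKey, cipher: int) -> int:
    n, _ = pub
    lam, mu = priv
    return ((pow(cipher, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub: PublicKey, c1: int, c2: int) -> int:
    """Ciphertext whose plaintext is the sum of the two plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def scale_encrypted(pub: PublicKey, cipher: int, weight: int) -> int:
    """Ciphertext whose plaintext is weight * plaintext."""
    n, _ = pub
    return pow(cipher, weight, n * n)


public_key, private_key = generate_keypair(61, 53)  # insecure demo primes
cipher_a = encrypt(public_key, 17)
cipher_b = encrypt(public_key, 25)

# The "center" combines ciphertexts without ever seeing 17 or 25.
cipher_sum = add_encrypted(public_key, cipher_a, cipher_b)
cipher_weighted = scale_encrypted(public_key, cipher_a, 3)

decrypted_sum = decrypt(public_key, private_key, cipher_sum)
decrypted_weighted = decrypt(public_key, private_key, cipher_weighted)
print(decrypted_sum, decrypted_weighted)
```

Because sums and integer-weighted multiples can be computed on ciphertexts, the numerator of a weighted average can be formed by a party that holds only the public key, while the division by the total weight can be done after decryption.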

Step S32: Use the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters.

Step S33: Send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed.

For more specific processing procedures of the above steps S32 and S33, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.
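The weighted averaging of step S32 can be sketched as follows, operating on plaintext parameter vectors for readability. This is a minimal sketch under stated assumptions: the function name weighted_average is hypothetical, and the choice of weights (for example, proportional to the amount of fault data each enterprise system contributed) is an assumption for illustration, since the preset weighting rule is not restated in this section.

```python
from typing import List, Sequence


def weighted_average(
    params: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-enterprise parameter vectors into one target parameter vector.

    params[i] is the parameter vector sent by enterprise system i;
    weights[i] is its (assumed) non-negative weight.
    """
    if not params or len(params) != len(weights):
        raise ValueError("params and weights must be non-empty and of equal length")
    dim = len(params[0])
    if any(len(p) != dim for p in params):
        raise ValueError("all parameter vectors must have the same length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Component-wise sum of w_i * theta_i, normalised by the sum of the weights.
    return [
        sum(w * p[j] for p, w in zip(params, weights)) / total
        for j in range(dim)
    ]
```

For example, weighted_average([[1.0, 2.0], [3.0, 6.0]], [1, 3]) gives three times as much influence to the second enterprise system as to the first.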

In this application, the method is applied to the server management center, which obtains the homomorphically encrypted model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm. The server management center uses the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters, and sends the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server goes down, the hypothesis function and the target model parameters can be used to determine whether its own target type device has failed. It can be seen that, by obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, the server management center realizes the application of horizontal federated learning in server fault diagnosis and is provided with sufficient data for machine learning training, which solves the problems of insufficient data in the application of machine learning algorithms and of data islands among enterprises. At the same time, the homomorphic encryption algorithm is used to process the model parameters of each enterprise system, so that the server management center can operate directly on the ciphertext without operating on the data itself, and each enterprise therefore does not expose its own server operation and maintenance data. The model parameters sent by each enterprise are weighted and averaged using the preset weighting rules to obtain the target model parameters, which the server management center then distributes to the enterprise systems, so that when a server fails, the faulty component can be located quickly. This eases operation and maintenance pressure, improves the accuracy of fault diagnosis, and strengthens the competitiveness of server suppliers.
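The enterprise-server side of this flow can be sketched as follows: model parameters are fitted by gradient descent on the logistic regression cost function, and the hypothesis function is then compared against a preset threshold. This is a minimal sketch under stated assumptions: the function names, the learning rate, the iteration count, the threshold value of 0.5, and the feature layout (a constant 1.0 bias term followed by numeric values read at downtime) are all hypothetical choices for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic regression hypothesis: estimated failure probability for x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def train(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> List[float]:
    """Batch gradient descent on the cross-entropy cost of logistic regression."""
    m, dim = len(X), len(X[0])
    theta = [0.0] * dim
    for _ in range(epochs):
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        # Gradient of the cost with respect to theta_j: mean of error * x_j.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(dim)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def diagnose(theta: Sequence[float], x: Sequence[float], threshold: float = 0.5) -> bool:
    """Return True (failed) when the failure probability is not less than the threshold."""
    return hypothesis(theta, x) >= threshold
```

In the federated setting described above, the theta passed to diagnose would be the target model parameters returned by the server management center rather than the locally trained values.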

Correspondingly, the embodiment of the present application also discloses a fault diagnosis device, as shown in Figure 6. The device includes:

The parameter acquisition module 11 is used to obtain model parameters sent by each enterprise system for fault diagnosis of target type devices; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;

The parameter calculation module 12 is used to perform a weighted average of model parameters sent by different enterprise systems using preset weighting rules to obtain target model parameters;

The parameter sending module 13 is used to send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, it can use the hypothesis function and the target model parameters to determine whether its own target type device has failed.

For more specific working processes of each of the above modules and the corresponding technical effects, please refer to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.

Furthermore, the embodiment of the present application also discloses an electronic device. Figure 7 is a structural diagram of the electronic device 20 according to an exemplary embodiment. The content in the figure cannot be considered as any limitation on the scope of the present application.

FIG. 7 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25 and a communication bus 26. The memory 22 is used to store computer readable instructions, which are loaded and executed by the at least one processor 21 to implement relevant steps in the fault diagnosis method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a server.

In this embodiment, the power supply 23 is used to provide working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any communication protocol applicable to the technical solution of this application, which is not specifically limited here; the input-output interface 25 is used to obtain external input data or to output data to the outside, and its specific interface type can be selected according to specific application needs, which is not specifically limited here.

In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc. The resources stored thereon can include the operating system 221, computer readable instructions 222 and data 223, etc. The data 223 can include all kinds of data. The storage method can be temporary storage or permanent storage.

The operating system 221 is used to manage and control each hardware device on the electronic device 20 and the computer readable instructions 222, and can be Windows Server, Netware, Unix, Linux, etc. In addition to computer-readable instructions that can be used to complete the fault diagnosis method executed by the electronic device 20 disclosed in any of the foregoing embodiments, the computer-readable instructions 222 may further include computer-readable instructions that can be used to complete other specific tasks.

Further, the embodiment of the present application also discloses a non-volatile computer-readable storage medium for storing computer-readable instructions. The non-volatile computer-readable storage medium mentioned here includes random access memory (Random Access Memory, RAM), memory, read-only memory (Read-Only Memory, ROM), electrically programmable ROM, electrically erasable programmable ROM, register, hard disk, magnetic disk or optical disk, or any other form of storage medium known in the technical field. When the computer-readable instructions are executed by one or more processors, the fault diagnosis method provided by any of the foregoing embodiments is implemented. Regarding the specific steps of this method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.

Each embodiment in this specification is described in a progressive manner. Each embodiment focuses on its differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other. As for the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple. For relevant details, please refer to the description in the method section.

The steps of fault diagnosis or algorithms described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, software modules executed by a processor, or a combination of both. Software modules may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

Finally, it should be noted that in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprises", "includes", or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article, or device that includes the foregoing element.

The fault diagnosis method, device, equipment and storage medium provided by the present invention have been introduced in detail above. Specific examples are used herein to illustrate the principles and implementation modes of the present invention. The description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea; at the same time, for those of ordinary skill in the field, there will be changes in the specific implementation and application scope based on the idea of the present invention. In summary, the content of this specification should not be understood as a limitation of the invention.

Claims

A fault diagnosis method, characterized in that it is applied to a server management center and includes:

Obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;

Using preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain target model parameters; and

Send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, the hypothesis function and the target model parameters are used to determine whether the enterprise server's own target type device has failed.
The fault diagnosis method according to claim 1, further comprising:

Through a distributed network system built based on the baseboard management controller on each of the enterprise servers, real-time data generated by the target type device in each of the enterprise servers when the server is down is obtained, the real-time data is input into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function, to obtain the model parameters for fault diagnosis of the target type device.
The fault diagnosis method according to claim 2, characterized in that said obtaining the real-time data generated by the target type device in each of the enterprise servers when the server is down includes:

The target register in the hardware error detection architecture is read through the platform environment control interface, and the real-time data generated by the target type device collected in the target register when the server is down is obtained.
The fault diagnosis method according to claim 2 or 3, characterized in that inputting the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and determining the parameters corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, includes:

The real-time data is input into the cost function corresponding to the hypothesis function in the preset logistic regression algorithm, and the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function.
The fault diagnosis method according to any one of claims 1 to 4, characterized in that obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device, and using preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain target model parameters, includes:

According to a preset time period, the model parameters sent by each enterprise system for fault diagnosis of the target type device are regularly obtained, the preset weighting rules are used to perform a weighted average of the model parameters sent by different enterprise systems to obtain current target model parameters, and the target model parameters obtained in the previous time period are updated using the current target model parameters.
The fault diagnosis method according to any one of claims 1 to 5, characterized in that using the hypothesis function and the target model parameters to determine whether its own target type device has failed includes:

Input the target model parameters into the hypothesis function to obtain the failure probability output by the hypothesis function, and determine whether the failure probability is less than a preset threshold; and

In response to the failure probability being less than the preset threshold, it is determined that its own target type device has not failed, or in response to the failure probability being not less than the preset threshold, it is determined that its own target type device has failed.
The fault diagnosis method according to any one of claims 1 to 6, characterized in that said obtaining the model parameters sent by each enterprise system for fault diagnosis of the target type device includes:

Obtain the homomorphically encrypted model parameters sent by each enterprise system for fault diagnosis of the target type device.
A fault diagnosis device, characterized in that it is applied to a server management center and includes:

A parameter acquisition module used to obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;

A parameter calculation module used to perform a weighted average of the model parameters sent by different enterprise systems using preset weighting rules to obtain target model parameters; and

A parameter sending module, configured to send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server is down, the hypothesis function and the target model parameters can be used to determine whether its own target type device has failed.
An electronic device, characterized in that the electronic device includes one or more processors and a memory; wherein the memory is used to store computer readable instructions, and the computer readable instructions are loaded and executed by the one or more processors to implement the fault diagnosis method as described in any one of claims 1 to 7.
A non-volatile computer-readable storage medium, characterized in that it is used to store computer-readable instructions; wherein the computer-readable instructions, when executed by one or more processors, implement the fault diagnosis method as described in any one of claims 1 to 7.