WO2023197453A1 - Fault diagnosis method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023197453A1
Authority
WO
WIPO (PCT)
Prior art keywords
model parameters
enterprise
fault diagnosis
target
server
Prior art date
Application number
PCT/CN2022/101975
Other languages
French (fr)
Chinese (zh)
Inventor
王斯
袁传博
张秀波
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023197453A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Definitions

  • the present invention relates to the field of computer technology, and in particular to a fault diagnosis method, device, equipment and storage medium.
  • this application discloses a fault diagnosis method, which is applied to a server management center and includes:
  • model parameters sent by each enterprise system for fault diagnosis of the target type device are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;
  • the target model parameters are sent to different enterprise servers under each enterprise system, so that when the enterprise server goes down, the hypothesis function and the target model parameters can be used to determine whether its own target type device has failed.
  • the above fault diagnosis method further includes:
  • real-time data generated by target type devices in each enterprise server when the server is down is obtained, including:
  • the target register in the hardware error detection architecture is read through the platform environment control interface, and the real-time data generated by the target type device collected in the target register when the server is down is obtained.
  • real-time data is input into a cost function corresponding to the hypothesis function in the preset machine learning algorithm, and then the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function, including:
  • the real-time data is input into the cost function corresponding to the hypothesis function in the preset logistic regression algorithm, and then the parameters corresponding to the minimum value of the cost function are determined based on the gradient descent function corresponding to the cost function.
  • the model parameters sent by each enterprise system for fault diagnosis of the target type device are obtained, and the preset weighting rules are used to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters, including:
  • according to the preset time period, regularly obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device, and use the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the current target model parameters, so that the current target model parameters can be used to update the target model parameters obtained in the previous time period.
  • the hypothesis function and target model parameters are used to determine whether the own target type device fails, including:
  • in response to the failure probability being less than the preset threshold, it is determined that the own target type device has not failed; in response to the failure probability being not less than the preset threshold, it is determined that the own target type device has failed.
  • model parameters sent by each enterprise system for fault diagnosis of the target type device are obtained, including:
  • this application discloses a fault diagnosis device, which is used in a server management center and includes:
  • the parameter acquisition module is used to obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;
  • the parameter calculation module is used to perform a weighted average of model parameters sent by different enterprise systems using preset weighting rules to obtain target model parameters;
  • the parameter sending module is used to send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, it can use the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  • the present application discloses an electronic device, which includes one or more processors and a memory; wherein the memory is used to store computer-readable instructions, and the computer-readable instructions are loaded and executed by the one or more processors to implement the fault diagnosis method disclosed in the previous embodiment.
  • the present application discloses a non-volatile computer-readable storage medium for storing computer-readable instructions; wherein the computer-readable instructions, when executed by one or more processors, implement the fault diagnosis method disclosed in the previous embodiment.
  • Figure 1 is a flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application.
  • Figure 2 is a schematic diagram of a fault diagnosis method disclosed in one or more embodiments of the present application.
  • Figure 3 is a sub-flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application.
  • Figure 4 is a schematic sub-flow diagram of a fault diagnosis method disclosed in one or more embodiments of the present application.
  • Figure 5 is a flow chart of a specific fault diagnosis method disclosed in one or more embodiments of the present application.
  • Figure 6 is a schematic structural diagram of a fault diagnosis device disclosed in one or more embodiments of the present application.
  • Figure 7 is a structural diagram of an electronic device disclosed in one or more embodiments of the present application.
  • this application provides a fault diagnosis solution that can quickly locate the cause of the fault when a server fails and solve the problem of insufficient data in the application of machine learning algorithms and data islands among various enterprises.
  • An embodiment of the present invention discloses a fault diagnosis method, as shown in Figure 1, which is applied to a server management center.
  • the method includes:
  • Step S11 Obtain the model parameters sent by each enterprise system for fault diagnosis of the target type device; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm.
  • the parameters reported by each enterprise system are integrated, where the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm.
  • federated learning is a machine learning framework that can effectively help multiple organizations perform data usage and machine learning modeling while meeting the requirements of user privacy protection, data security, and government regulations.
  • federated learning can effectively solve the problem of data islands: participants can model jointly without sharing data, which technically breaks data islands and enables AI (Artificial Intelligence) collaboration. Therefore, when the server management center collects model parameters sent by multiple enterprise systems, it applies horizontal federated learning in the server operation and maintenance system to perform fault diagnosis on servers, which solves the problem of insufficient data in the application of machine learning algorithms and the problem of data islands among enterprises.
  • the server management center can issue an initial set of model parameters to each enterprise system at the initial stage. These are usually obtained by integrating the initial model parameters that each enterprise obtained through its own machine learning algorithm. When a server goes down, each enterprise server under each enterprise system first determines whether the target type device has failed and informs the operation and maintenance personnel of the result. To improve the accuracy and efficiency of fault detection, the server management center continuously optimizes the initial model parameters. It can therefore set a time period, such as one hour or one day, for receiving the data reported by each enterprise system; the parameters are continuously updated with the gradient descent algorithm and sent to each enterprise at the specified time, improving the accuracy of fault diagnosis.
  • Step S12 Use preset weighting rules to perform a weighted average of model parameters sent by different enterprise systems to obtain target model parameters.
  • the server management center after obtaining the model parameters used for fault diagnosis of the target type device sent by each enterprise system, the server management center will integrate the various model parameters according to the preset weighting rules to obtain the target model parameters.
  • a weighted average calculation can be performed based on the number of servers running in each enterprise system. The larger the scale of an enterprise system, the greater the weight given to its model parameters, and its data will be more reliable.
  • the server management center regularly obtains, according to the preset time period, the model parameters sent by each enterprise system for fault diagnosis of the target type device, and then uses the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the current target model parameters. In this way, the current target model parameters can be used to update the target model parameters obtained in the previous time period.
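  • The weighting rule described for Step S12 can be sketched as follows. This is a minimal illustration, assuming each enterprise reports one parameter vector together with its number of running servers; the function name `weighted_average_parameters`, the report layout, and the sample numbers are hypothetical and not taken from the application.

```python
from typing import Dict, List, Tuple


def weighted_average_parameters(
    reports: Dict[str, Tuple[List[float], int]]
) -> List[float]:
    """Average parameter vectors, weighting each enterprise by its server count.

    `reports` maps an enterprise name to (parameter vector, running servers).
    """
    total_servers = sum(count for _, count in reports.values())
    if total_servers <= 0:
        raise ValueError("total server count must be positive")
    lengths = {len(theta) for theta, _ in reports.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter vectors must have the same length")

    target = [0.0] * lengths.pop()
    for theta, count in reports.values():
        weight = count / total_servers  # larger enterprises get larger weights
        for j, value in enumerate(theta):
            target[j] += weight * value
    return target


# Hypothetical reports: (model parameters, number of running servers).
reports = {
    "enterprise_a": ([0.2, -1.0, 0.5], 300),
    "enterprise_b": ([0.4, -0.8, 0.1], 100),
}
target_theta = weighted_average_parameters(reports)
```

  • In a periodic deployment, the same function would simply be called once per preset time period and its result would replace the target model parameters from the previous period.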
  • Step S13 Send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, use the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  • after the model parameters sent by each enterprise system for fault diagnosis are integrated and the new target model parameters are calculated, the target model parameters are sent to different enterprise servers under each enterprise system so that each server updates its own model parameters.
  • the enterprise server can use the target model parameters to determine whether its target type device has failed.
  • one set of parameters and algorithms determines only whether a certain type of device is faulty, such as a certain CPU (Central Processing Unit) or a certain memory. Therefore, the server management center should actually hold multiple sets of parameters in order to determine the failure probability of different types of devices.
  • when the enterprise server uses the target model parameters to determine whether its own target type device has failed, the process includes: inputting the target model parameters into the hypothesis function to obtain the failure probability output by the hypothesis function, and determining whether the failure probability is less than the preset threshold; if the failure probability is less than the preset threshold, it is determined that the target type device has not failed; if the failure probability is not less than the preset threshold, it is determined that the target type device has failed. It is understandable that if an enterprise has strict requirements on failure probability, the preset threshold can be lowered. As soon as the output failure probability is not less than the preset threshold, the fault and the faulty component can be identified in time, and the result is then reported to the operation and maintenance personnel so that they can replace the faulty part.
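  • The threshold decision of Step S13 can be sketched as below. It assumes the hypothesis function is the logistic (sigmoid) function used in the logistic regression example of this application; the names `hypothesis` and `device_has_failed` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic hypothesis: failure probability for feature vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # Numerically stable sigmoid: avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def device_has_failed(
    theta: Sequence[float], x: Sequence[float], threshold: float = 0.5
) -> bool:
    """Return True when the failure probability is not less than the threshold."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return hypothesis(theta, x) >= threshold


# Hypothetical target model parameters and register-derived features.
theta = [-2.0, 0.0]
x = [1.0, 5.0]
failed_default = device_has_failed(theta, x)          # threshold 0.5
failed_strict = device_has_failed(theta, x, 0.1)      # stricter enterprise
```

  • Lowering the threshold, as in the second call, makes the same failure probability count as a fault sooner, which matches the remark that an enterprise with strict requirements can adjust the preset threshold downward.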
  • BMC: Baseboard Management Controller
  • [Figure 2 module labels: BMC fault data collection module, BMC machine learning algorithm module, BMC communication module, center communication module, center summary algorithm module]
  • BMC fault data collection module: responsible for collecting fault data when a fault occurs. Taking CPU diagnosis as an example, the MSR (Model Specific Register) and CSR (Control State Register) information in the MCA (Machine Check Architecture) hardware error detection architecture can be read through the PECI (Platform Environment Control Interface) channel; there are usually hundreds or thousands of these registers.
  • BMC machine learning algorithm module: it has two main functions. The first is to determine the faulty component: when the server goes down, the collected MSR and CSR register data are used as input, and the machine learning algorithm calculates and outputs the probability that a component has failed, for example the probability that CPU_0 has failed.
  • the other function of the BMC machine learning algorithm module is to run the algorithm that updates the model parameters. After the operation and maintenance personnel replace the component that actually failed, the result is fed back to the BMC; the BMC then runs multiple iterations of the gradient descent algorithm based on the result to update a set of model parameters, which can be broadcast to the other BMCs in the enterprise so that they update their parameters.
  • BMC communication module: responsible for interacting with the other BMCs of the enterprise and with the central node of the server provider. It can also perform homomorphic encryption and decryption of server operation results.
  • Central communication module: responsible for interacting with the BMC on each enterprise server, implementing the underlying protocol stack, and securely receiving and sending data.
  • Central summary algorithm module: it can integrate the encrypted parameters reported by each enterprise system, calculate new encrypted parameters, and send them to each enterprise for update.
  • the server management center collects the model parameters obtained from the enterprise servers in each enterprise system and uses the baseboard management controller to perform machine learning.
  • the baseboard management controller, as an embedded system, has limited computing power and may not execute the gradient descent algorithm fast enough. One option is therefore to form a distributed network of multiple baseboard management controllers and apply distributed computing to solve the computing power problem of the baseboard management controller.
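  • The application does not specify how the work is split across baseboard management controllers. One possible scheme, shown here purely as an assumption, is data-parallel gradient computation: each BMC sums the logistic regression gradient terms over its own local samples, and a coordinator adds the partial sums and divides by the total sample count. All names and sample values below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector x, label y in {0, 1})


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def partial_gradient(
    theta: Sequence[float], samples: Sequence[Sample]
) -> Tuple[List[float], int]:
    """Work done on one BMC: un-normalised gradient sum over local samples."""
    grad_sum = [0.0] * len(theta)
    for x, y in samples:
        error = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for j, xj in enumerate(x):
            grad_sum[j] += error * xj
    return grad_sum, len(samples)


def combine_partials(partials: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Work done on the coordinator: merge partial sums into the mean gradient."""
    total = sum(count for _, count in partials)
    if total == 0:
        raise ValueError("no samples were contributed")
    n = len(partials[0][0])
    return [sum(g[j] for g, _ in partials) / total for j in range(n)]


theta = [0.0, 0.5]
bmc_a = [([1.0, 0.2], 0), ([1.0, 2.5], 1)]   # samples held by one BMC
bmc_b = [([1.0, 0.4], 0)]                    # samples held by another BMC
gradient = combine_partials(
    [partial_gradient(theta, bmc_a), partial_gradient(theta, bmc_b)]
)
```

  • Because the gradient is a sum over samples, the combined result equals the gradient a single controller would compute over all samples, while each BMC only processes its own share.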
  • this embodiment may further include:
  • Step S21 Through a distributed network system built based on the baseboard management controller on each enterprise server, obtain real-time data generated by the target type device in each enterprise server when the server is down.
  • Step S22 Input the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, then determine the parameters corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function, and thereby obtain the model parameters for fault diagnosis of the target type device.
  • the distributed network system constructed from the baseboard management controllers on each enterprise server is used to process the real-time data generated by the target type device when the server is down, and thereby obtain the model parameters for fault diagnosis of the target type device.
  • the baseboard management controller on each enterprise server reads the target register in the hardware error detection architecture through the platform environment control interface and obtains the real-time data, collected in the target register, that the target type device generated when the server went down; it then inputs the real-time data into the cost function corresponding to the hypothesis function in the preset logistic regression algorithm and determines the parameters corresponding to the minimum value of the cost function based on the gradient descent function corresponding to the cost function.
  • this embodiment uses the clearest and most practical Logistic Regression (Logarithmic Probability Regression) algorithm for explanation:
  • $(x^{(i)}, y^{(i)})$ represents the data recorded for one outage: $x^{(i)}$ is the n+1-dimensional feature vector composed of the values of n+1 MSR and CSR registers, and $y^{(i)}$ indicates whether CPU_0 was faulty in this outage, that is, whether the outage was caused by CPU_0.
  • the hypothesis function is: $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^{T} x}}$
  • $\theta$ is an n+1-dimensional vector representing the coefficient by which each component of $x$ is multiplied; the function takes a feature vector $x$ as input and outputs the probability that CPU_0 has failed.
  • the value of this function is always greater than 0 and less than 1. Therefore, if the calculated value is greater than 0.5, CPU_0 can be judged to be faulty; otherwise it is judged not to be faulty.
  • the key point is to find the specific value of the n+1-dimensional vector $\theta$, which is also the key to the machine learning algorithm.
  • the cost function is: $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
  • m is the number of samples; $x^{(i)}$ is an n+1-dimensional feature vector; $y^{(i)}$ represents whether the component failed when the server went down and takes only the values 0 and 1; $h_\theta(x^{(i)})$ represents the failure probability output by the hypothesis function.
  • as $\theta$ changes, the difference between the probability calculated by the hypothesis function and the real situation also changes. Therefore, to find the value of $\theta$ that minimizes this difference, the gradient descent function corresponding to the cost function is used to find the value of $\theta$ that minimizes $J(\theta)$.
  • the gradient descent function is: $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, for $j = 0, 1, \dots, n$
  • $\alpha$ represents the learning rate; select a set of $\theta$ values as the initial value and substitute it into this formula to calculate a more effective set of $\theta$ values. Repeating this yields the most effective set of $\theta$ values.
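  • The training procedure described above can be sketched in plain Python. This is a minimal illustration, assuming the standard logistic regression cost (mean negative log-likelihood) and batch gradient descent; the tiny data set, the choice of a constant first feature as a bias term, the learning rate, and the iteration count are all hypothetical and stand in for real MSR and CSR register values.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta: Sequence[float], x: Sequence[float]) -> float:
    """Hypothesis function: estimated failure probability for features x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """Mean negative log-likelihood J(theta) over all samples."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        h = min(max(predict(theta, x), eps), 1.0 - eps)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)


def gradient_descent(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    alpha: float = 0.1,
    iterations: int = 500,
) -> List[float]:
    """Repeat the update theta_j := theta_j - alpha/m * sum((h - y) * x_j)."""
    m, n = len(xs), len(xs[0])
    theta = [0.0] * n  # initial set of theta values
    for _ in range(iterations):
        errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
        theta = [
            theta[j] - alpha / m * sum(e * x[j] for e, x in zip(errors, xs))
            for j in range(n)
        ]
    return theta


# Hypothetical samples: x[0] = 1.0 acts as a bias term, x[1] is one feature.
xs = [[1.0, 0.1], [1.0, 0.4], [1.0, 0.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
ys = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(xs, ys)
```

  • After training, `predict(theta, x)` plays the role of the hypothesis function: a value above 0.5 would be read as a fault of the component in question under the rule stated above.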
  • Figure 4 shows the diagnostic process of the baseboard management controller when the server crashes.
  • the BMC collects the CPU register data and determines whether any data has been collected. When data has been collected, it evaluates the hypothesis function for a given component and reports the result to the operation and maintenance personnel. After the operation and maintenance personnel replace the component that actually failed, the result is fed back to the BMC (the faulty component may or may not be CPU_0, so the hypothesis function may have been right or wrong), and the algorithm parameters are then updated based on that feedback.
  • the BMC runs multiple iterations of the gradient descent algorithm based on the result to update a set of $\theta$ values, which can be broadcast to the other BMCs in the enterprise so that they update their parameters $\theta$; when no data has been collected, the process ends.
  • in this embodiment, the model parameters sent by each enterprise system for fault diagnosis of the target type device are obtained, the model parameters being parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm; preset weighting rules are used to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters; and the target model parameters are sent to different enterprise servers under each enterprise system, so that when an enterprise server goes down, the hypothesis function and the target model parameters are used to determine whether its own target type device has failed.
  • in this way, the server management center obtains the model parameters sent by each enterprise system for fault diagnosis of the target type device, which applies horizontal federated learning to server fault diagnosis and provides the server management center with sufficient data for machine learning and training; this solves the problem of insufficient data in the application of machine learning algorithms and the problem of data islands among enterprises. The model parameters sent by each enterprise are weighted and averaged using preset weighting rules to obtain the target model parameters, and the server management center then distributes the target model parameters to each enterprise system, so that when a server fails, the faulty component can be quickly located. This eases the operation and maintenance pressure, improves the accuracy of fault diagnosis, and strengthens the competitiveness of server suppliers.
  • the embodiment of the present application discloses a specific fault diagnosis method, as shown in Figure 5, which is applied to the server management center.
  • the method includes:
  • Step S31 Obtain the homomorphically encrypted model parameters sent by each enterprise system and used for fault diagnosis of the target type device.
  • HE Homomorphic Encryption
  • Different baseboard management controllers of the same enterprise will interact with each other, and then the model parameters after homomorphic encryption are sent to the server management center.
  • the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm. For details, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
  • the use of homomorphic encryption algorithms allows the server management center to directly operate on the ciphertext instead of directly operating on the data, which can effectively solve the problem of various enterprises being unwilling to expose their own server failure data.
  • after the server management center updates the data and sends the updated data to each enterprise system, the enterprise can decrypt the ciphertext to obtain the latest parameters, thereby making fault diagnosis more accurate.
  • a variety of homomorphic encryption algorithms are available to choose from.
  • the embodiment of this application can use the Paillier encryption algorithm (a probabilistic public-key encryption system); the choice is not specifically limited here.
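  • To illustrate why an additively homomorphic scheme such as Paillier lets the center aggregate parameters without seeing them, the following toy sketch computes a weighted sum directly on ciphertexts. Everything here is an assumption for illustration: the tiny primes are insecure, the fixed-point scale, the weights, and the idea that the enterprises hold the private key while the center sees only the public key are not specified in the application, and the final division by the total weight is done after decryption.

```python
import math
import random
from typing import List, Sequence, Tuple

SCALE = 100  # fixed-point scale: parameters are encoded as round(value * SCALE)


def generate_keypair(p: int, q: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Toy Paillier key generation with g = n + 1 (p and q must be primes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because L(g^lam mod n^2) = lam when g = n + 1
    return (n, n + 1), (lam, mu)


def encrypt(pub: Tuple[int, int], m: int) -> int:
    n, g = pub
    n2 = n * n
    while True:  # pick a random r that is coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2


def decrypt(pub: Tuple[int, int], priv: Tuple[int, int], c: int) -> int:
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    m = ((u - 1) // n * mu) % n
    return m - n if m > n // 2 else m  # map back to a signed integer


def weighted_sum_ciphertexts(pub: Tuple[int, int], cs: Sequence[int], ws: Sequence[int]) -> int:
    """Center side: Enc(sum w_i * m_i) = prod c_i^w_i mod n^2, no decryption."""
    n2 = pub[0] ** 2
    acc = 1  # 1 is a (non-randomised) encryption of 0
    for c, w in zip(cs, ws):
        acc = (acc * pow(c, w, n2)) % n2
    return acc


pub, priv = generate_keypair(293, 433)        # insecure demo primes
thetas = [[0.20, -1.00], [0.40, -0.80]]       # hypothetical enterprise parameters
weights = [3, 1]                              # hypothetical integer weights

encrypted = [[encrypt(pub, round(v * SCALE)) for v in theta] for theta in thetas]
aggregated = [
    weighted_sum_ciphertexts(pub, [enc[j] for enc in encrypted], weights)
    for j in range(2)
]
# Enterprise side: decrypt, then undo the scale and divide by the total weight.
target: List[float] = [
    decrypt(pub, priv, c) / (SCALE * sum(weights)) for c in aggregated
]
```

  • A production system would use a vetted library and keys of realistic size; the sketch only shows that multiplying ciphertexts and raising them to integer powers corresponds to adding and scaling the hidden parameters.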
  • Step S32 Use the preset weighting rules to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters.
  • Step S33 Send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, use the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  • in this embodiment, the server management center obtains the homomorphically encrypted model parameters sent by each enterprise system for fault diagnosis of the target type device, the model parameters being parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm; preset weighting rules are used to perform a weighted average of the model parameters sent by different enterprise systems to obtain the target model parameters; and the target model parameters are sent to different enterprise servers under each enterprise system, so that when an enterprise server goes down, the hypothesis function and the target model parameters can be used to determine whether its own target type device has failed.
  • in this way, the server management center obtains the model parameters sent by each enterprise system for fault diagnosis of the target type device, which applies horizontal federated learning to server fault diagnosis and provides the server management center with sufficient data for machine learning and training; this solves the problem of insufficient data in the application of machine learning algorithms and the problem of data islands among enterprises. The homomorphic encryption algorithm is used to process the model parameters of each enterprise system, so that the server management center operates directly on the ciphertext rather than on the plaintext data and no enterprise exposes its own server operation and maintenance data. The model parameters sent by each enterprise are weighted and averaged using preset weighting rules to obtain the target model parameters, and the server management center then distributes them to each enterprise system, so that when a server fails, the faulty component can be quickly located. This eases the operation and maintenance pressure, improves the accuracy of fault diagnosis, and strengthens the competitiveness of server suppliers.
  • the embodiment of the present application also discloses a fault diagnosis device, as shown in Figure 6.
  • the device includes:
  • the parameter acquisition module 11 is used to obtain model parameters sent by each enterprise system for fault diagnosis of target type devices; the model parameters are parameters determined by different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function in the preset machine learning algorithm;
  • the parameter calculation module 12 is used to perform a weighted average of model parameters sent by different enterprise systems using preset weighting rules to obtain target model parameters;
  • the parameter sending module 13 is used to send the target model parameters to different enterprise servers under each enterprise system, so that when the enterprise server goes down, it can use the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  • Figure 7 is a structural diagram of the electronic device 20 according to an exemplary embodiment. The content in the figure cannot be considered as any limitation on the scope of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application.
  • the electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25 and a communication bus 26.
  • the memory 22 is used to store computer readable instructions, which are loaded and executed by the at least one processor 21 to implement relevant steps in the fault diagnosis method disclosed in any of the foregoing embodiments.
  • the electronic device 20 in this embodiment may specifically be a server.
  • the power supply 23 is used to provide working voltage for each hardware device on the electronic device 20;
  • the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, which is not specifically limited here;
  • the input and output interface 25 is used to obtain external input data or output data to the outside, and its specific interface type can be selected according to specific application needs, which is not specifically limited here.
  • the memory 22, as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc.
  • the resources stored thereon can include the operating system 221, computer readable instructions 222 and data 223, etc.
  • the data 223 can include all kinds of data.
  • the storage method can be temporary storage or permanent storage.
  • the operating system 221 is used to manage and control each hardware device on the electronic device 20 and the computer readable instructions 222, and can be Windows Server, Netware, Unix, Linux, etc.
  • the computer-readable instructions 222 may further include computer-readable instructions that can be used to complete other specific tasks.
  • the embodiment of the present application also discloses a non-volatile computer-readable storage medium.
  • the non-volatile computer-readable storage medium mentioned here includes random access memory (Random Access Memory, RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, hard disk, magnetic disk or optical disk or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The present application relates to the technical field of computers. Disclosed are a fault diagnosis method and apparatus, a device, and a storage medium. The method is applied to a server management center, and comprises: acquiring model parameters which are respectively sent by each enterprise system and are used for performing fault diagnosis on a target-type device, wherein the model parameters are parameters determined by different enterprise servers of the enterprise system by utilizing a cost function corresponding to a hypothesis function in a preset machine learning algorithm; performing, by using a preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain a target model parameter; and sending the target model parameter to the different enterprise servers of each enterprise system, so that when the enterprise servers are down, the enterprise servers determine, by using the hypothesis function and the target model parameter, whether a fault occurs in their own target-type devices.

Description

A fault diagnosis method, apparatus, device, and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210381536.7, entitled "A fault diagnosis method, apparatus, device, and storage medium" and filed with the China Patent Office on April 13, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a fault diagnosis method, apparatus, device, and storage medium.
Background
With the development of computer systems and the widespread use of the Internet across industries, the number of servers in use keeps growing. How to build a server management system that manages servers more efficiently has become a key concern wherever servers are used, including server users, server operation and maintenance companies, and supercomputing centers, and especially for Internet companies that run servers in large numbers.
As the number of servers grows, enterprises pay more attention to the efficiency of server operation and maintenance. When a server fails, an enterprise urgently needs an efficient operation and maintenance strategy for handling it, and operation and maintenance personnel need to locate the cause of the fault quickly; a key step is quickly diagnosing which component of the server has failed. In traditional server operation and maintenance, the cause of a fault often cannot be determined from logs, so applying machine learning to fault diagnosis is a valuable direction. Machine learning rests on three elements: algorithms, computing power, and data. Because enterprises know little about the internals of servers, and the data of a single enterprise is not enough to support a machine learning algorithm, it is clearly more appropriate for the server vendor to provide the method for diagnosing faults and precisely locating faulty components. However, the inventor realized that what server vendors lack most is server fault data, and enterprises are unwilling to expose their own server fault data. It is therefore difficult for server vendors to train an effective machine learning algorithm for server fault diagnosis, which has become a major bottleneck in applying machine learning to server fault diagnosis.
In summary, how to quickly locate the cause of a fault when a server fails, and how to overcome the insufficient data and the data silos between enterprises that hinder the application of machine learning algorithms, are problems that remain to be solved.
Summary of the Invention
In a first aspect, the present application discloses a fault diagnosis method, applied to a server management center, comprising:
acquiring model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on a target-type device, wherein the model parameters are parameters determined by different enterprise servers of the enterprise system by using a cost function corresponding to a hypothesis function in a preset machine learning algorithm;
performing, by using a preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain target model parameters; and
sending the target model parameters to the different enterprise servers of each enterprise system, so that when an enterprise server goes down, the enterprise server determines, by using the hypothesis function and the target model parameters, whether its own target-type device has failed.
In one embodiment, the above fault diagnosis method further comprises:
acquiring, through a distributed network system built on the baseboard management controllers of the enterprise servers, real-time data generated by the target-type device in each enterprise server when the server goes down; inputting the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm; and then determining, based on a gradient descent function corresponding to the cost function, the parameters that minimize the cost function, to obtain the model parameters used for performing fault diagnosis on the target-type device.
In one embodiment, acquiring the real-time data generated by the target-type device in each enterprise server when the server goes down comprises:
reading a target register in the Machine Check Architecture through the Platform Environment Control Interface, to obtain the real-time data that the target register collected from the target-type device when the server went down.
In one embodiment, inputting the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and then determining, based on the gradient descent function corresponding to the cost function, the parameters that minimize the cost function comprises:
inputting the real-time data into a cost function corresponding to a hypothesis function in a preset logistic regression algorithm, and then determining, based on the gradient descent function corresponding to the cost function, the parameters that minimize the cost function.
In one embodiment, acquiring the model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on the target-type device, and performing, by using the preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain the target model parameters comprises:
periodically acquiring, according to a preset time period, the model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on the target-type device, and performing, by using the preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain current target model parameters, so that the current target model parameters are used to update the target model parameters obtained in the previous time period.
In one embodiment, determining, by using the hypothesis function and the target model parameters, whether its own target-type device has failed comprises:
inputting the target model parameters into the hypothesis function to obtain a failure probability output by the hypothesis function, and judging whether the failure probability is less than a preset threshold; and
in response to the failure probability being less than the preset threshold, determining that its own target-type device has not failed, or, in response to the failure probability being not less than the preset threshold, determining that its own target-type device has failed.
In one embodiment, acquiring the model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on the target-type device comprises:
acquiring homomorphically encrypted model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on the target-type device.
In a second aspect, the present application discloses a fault diagnosis apparatus, applied to a server management center, comprising:
a parameter acquisition module, configured to acquire model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on a target-type device, wherein the model parameters are parameters determined by different enterprise servers of the enterprise system by using a cost function corresponding to a hypothesis function in a preset machine learning algorithm;
a parameter calculation module, configured to perform, by using a preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain target model parameters; and
a parameter sending module, configured to send the target model parameters to the different enterprise servers of each enterprise system, so that when an enterprise server goes down, the enterprise server determines, by using the hypothesis function and the target model parameters, whether its own target-type device has failed.
In a third aspect, the present application discloses an electronic device comprising one or more processors and a memory, wherein the memory is configured to store computer-readable instructions, and the computer-readable instructions are loaded and executed by the one or more processors to implement the fault diagnosis method disclosed in any one of the foregoing embodiments.
In a fourth aspect, the present application discloses a non-volatile computer-readable storage medium configured to store computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, implement the fault diagnosis method disclosed in any one of the foregoing embodiments.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
Figure 1 is a flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application;
Figure 2 is a schematic diagram of a fault diagnosis method disclosed in one or more embodiments of the present application;
Figure 3 is a sub-flow chart of a fault diagnosis method disclosed in one or more embodiments of the present application;
Figure 4 is a schematic sub-flow diagram of a fault diagnosis method disclosed in one or more embodiments of the present application;
Figure 5 is a flow chart of a specific fault diagnosis method disclosed in one or more embodiments of the present application;
Figure 6 is a schematic structural diagram of a fault diagnosis apparatus disclosed in one or more embodiments of the present application;
Figure 7 is a structural diagram of an electronic device disclosed in one or more embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
At present, after a server fails, traditional server operation and maintenance often cannot determine the cause of the fault from logs. When machine learning is applied to server fault diagnosis, the data of a single enterprise is not enough to support a machine learning algorithm, and enterprises are unwilling to expose their own server fault data. Having the server vendor train an effective machine learning algorithm for server fault diagnosis has therefore become a major bottleneck in applying machine learning to server fault diagnosis.
To this end, the present application provides a fault diagnosis solution that can quickly locate the cause of a fault when a server fails and that addresses the insufficient data and the data silos between enterprises that hinder the application of machine learning algorithms.
An embodiment of the present invention discloses a fault diagnosis method. As shown in Figure 1, the method is applied to a server management center and includes:
Step S11: Acquire the model parameters that are respectively sent by each enterprise system and are used for performing fault diagnosis on a target-type device, where the model parameters are parameters determined by different enterprise servers of the enterprise system by using a cost function corresponding to a hypothesis function in a preset machine learning algorithm.
In this embodiment of the present application, the parameters reported by the enterprise systems are integrated by acquiring the model parameters that each enterprise system sends for performing fault diagnosis on the target-type device, where the model parameters are parameters determined by different enterprise servers of the enterprise system by using the cost function corresponding to the hypothesis function in the preset machine learning algorithm.
It can be understood that federated learning is a machine learning framework that helps multiple organizations use data and build machine learning models while meeting the requirements of user privacy protection, data security, and government regulations. As a distributed machine learning paradigm, federated learning can effectively solve the data silo problem: participants build a model jointly without sharing their data, which breaks down data silos technically and enables AI (Artificial Intelligence) collaboration. Therefore, when the server collects the model parameters sent by multiple enterprise systems, horizontal federated learning is applied in the server operation and maintenance system to diagnose server faults, which addresses the insufficient data and the data silos between enterprises that hinder the application of machine learning algorithms.
In this embodiment of the present application, the server management center may initially issue a set of initial model parameters to each enterprise system. These initial parameters are typically obtained by integrating the model parameters that the enterprises first obtained with their respective machine learning algorithms. When a server goes down, each enterprise server of each enterprise system first determines whether the target-type device has failed and informs the operation and maintenance personnel of the result. To improve the accuracy and efficiency of fault detection, the server management center continuously optimizes the initial model parameters. The server management center may therefore set an interval, such as one hour or one day, at which it receives the data reported by each enterprise system; the gradient descent algorithm is executed repeatedly to update the parameters, and at a specified time the parameters are issued to the enterprises for updating, which improves the accuracy of fault diagnosis.
Step S12: Perform, by using a preset weighting rule, weighted averaging on the model parameters sent by different enterprise systems to obtain target model parameters.
In this embodiment of the present application, after acquiring the model parameters that each enterprise system sends for performing fault diagnosis on the target-type device, the server management center integrates the model parameters according to the preset weighting rule to obtain the target model parameters. For example, the weighted average can be computed according to the number of servers that each enterprise system runs: the larger an enterprise system is, the greater the weight given to its model parameters, because its data is more reliable.
In this embodiment of the present application, because each enterprise system periodically reports the data it has collected to the server management center, the server management center periodically acquires, according to a preset time period, the model parameters that each enterprise system sends for performing fault diagnosis on the target-type device, and then performs weighted averaging on the model parameters sent by different enterprise systems by using the preset weighting rule to obtain the current target model parameters. In this way, the current target model parameters are used to update the target model parameters obtained in the previous time period.
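The server-count weighting described in step S12 can be sketched in a few lines. This is a minimal illustration only, assuming that each enterprise system reports a plaintext parameter vector of the same length together with its server count; the function name `weighted_average` and its arguments are hypothetical and do not appear in the application, and the homomorphic encryption mentioned elsewhere in the application is left out.

```python
from typing import List, Sequence


def weighted_average(
    params: Sequence[Sequence[float]],
    server_counts: Sequence[int],
) -> List[float]:
    """Combine per-enterprise parameter vectors, weighting each by server count.

    params[k] is the parameter vector reported by enterprise system k and
    server_counts[k] is the (assumed) number of servers that system runs.
    """
    if not params or len(params) != len(server_counts):
        raise ValueError("need exactly one server count per parameter vector")
    dim = len(params[0])
    if any(len(p) != dim for p in params):
        raise ValueError("all parameter vectors must have the same length")
    total = sum(server_counts)
    if total <= 0:
        raise ValueError("total server count must be positive")
    # Component-wise weighted mean: larger enterprise systems contribute more.
    return [
        sum(w * p[j] for p, w in zip(params, server_counts)) / total
        for j in range(dim)
    ]
```

Running this once per reporting period and replacing the previously stored vector with the result corresponds to the periodic update described above.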
Step S13: Send the target model parameters to the different enterprise servers of each enterprise system, so that when an enterprise server goes down, the enterprise server determines, by using the hypothesis function and the target model parameters, whether its own target-type device has failed.
In this embodiment of the present application, after the model parameters that each enterprise system sent for fault diagnosis have been integrated and new target model parameters have been calculated, the target model parameters are sent to the different enterprise servers of each enterprise system so that they update their own model parameters. In this way, when a server goes down, the enterprise server can use the target model parameters to determine whether its own target-type device has failed. Note that one set of parameters and one algorithm only judge whether one device is faulty, such as a particular CPU (Central Processing Unit) or a particular memory module. The server management center should therefore maintain multiple sets of parameters in practice to judge the failure probability of different types of devices.
Specifically, when an enterprise server uses the target model parameters to determine whether its own target-type device has failed, it inputs the target model parameters into the hypothesis function to obtain the failure probability output by the hypothesis function, and judges whether the failure probability is less than a preset threshold. If the failure probability is less than the preset threshold, it determines that its own target-type device has not failed; if the failure probability is not less than the preset threshold, it determines that its own target-type device has failed. It can be understood that an enterprise with strict requirements on failure probability can lower the preset threshold. Once the output failure probability reaches the preset threshold, the fault and the faulty component can be identified promptly, and the result is reported to the operation and maintenance personnel so that they can replace the faulty component.
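The threshold decision in step S13 can be sketched as follows. This is a minimal illustration, assuming the logistic (sigmoid) hypothesis function that the application uses in its logistic regression example; the names `failure_probability` and `is_faulty` are hypothetical, and the feature vector stands in for the register values collected at crash time.

```python
import math
from typing import Sequence


def failure_probability(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic hypothesis: sigmoid of the dot product of theta and x."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    z = sum(t * xi for t, xi in zip(theta, x))
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def is_faulty(
    theta: Sequence[float],
    x: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Return True when the probability is not less than the threshold."""
    return failure_probability(theta, x) >= threshold
```

Lowering `threshold` makes the check stricter, as described above for enterprises with tight failure-probability requirements.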
Figure 2 shows a specific fault diagnosis schematic provided in an embodiment of the present application. The whole can be divided into several modules: a BMC (Baseboard Management Controller) fault data collection module, a BMC machine learning algorithm module, a BMC communication module, a center communication module, and a center summary algorithm module. The BMC fault data collection module, the BMC machine learning algorithm module, and the BMC communication module reside on the server user side; the center communication module and the center summary algorithm module reside in the server management center.
BMC fault data collection module: responsible for collecting fault data when a fault occurs. Taking CPU diagnosis as an example, the MSR (Model Specific Register) and CSR (Control State Register) information in the MCA (Machine Check Architecture) can be read out through the PECI (Platform Environment Control Interface) channel; there are usually hundreds or thousands of these registers.
BMC machine learning algorithm module: has two main functions. The first is to identify the faulty component: when the server goes down, the collected MSR and CSR register data is used as input, and the machine learning algorithm outputs the probability that a given component has failed, for example the probability that CPU_0 has failed. The second is to run the algorithm to update the model parameters: after the operation and maintenance personnel replace the component that actually failed, they feed the result back to the BMC, which runs the gradient descent algorithm multiple times based on this result, updates a set of model parameters, and can broadcast them to the other BMCs of the enterprise so that they update their parameters.
BMC communication module: responsible for interacting with the other BMCs of the enterprise and with the central node of the server vendor. It can also homomorphically encrypt and decrypt the server's calculation results.
Center communication module: responsible for interacting with the BMCs on the enterprise servers, implementing the underlying protocol stack, and receiving and sending data securely.
Center summary algorithm module: integrates the encrypted parameters reported by the enterprise systems, calculates new encrypted parameters, and issues them to the enterprises for updating.
Note that what the server management center collects are the model parameters that the enterprise servers in each enterprise system obtain by running machine learning on their baseboard management controllers. A baseboard management controller, however, is an embedded system with limited computing power and is not fast enough when executing the gradient descent algorithm. Multiple baseboard management controllers can therefore be organized into a distributed network, and distributed computing can be applied to address the computing power limitation. Specifically, as shown in Figure 3, this embodiment may further include:
Step S21: Acquire, through a distributed network system built on the baseboard management controllers of the enterprise servers, the real-time data generated by the target-type device in each enterprise server when the server goes down.
Step S22: Input the real-time data into the cost function corresponding to the hypothesis function in the preset machine learning algorithm, and then determine, based on the gradient descent function corresponding to the cost function, the parameters that minimize the cost function, to obtain the model parameters used for performing fault diagnosis on the target-type device.
That is, in this embodiment, the distributed network system built on the baseboard management controllers of the enterprise servers processes the real-time data generated by the target-type device when the server goes down, and thereby obtains the model parameters used for performing fault diagnosis on the target-type device.
Specifically, in the distributed network system, the baseboard management controller on each enterprise server first reads the target register in the Machine Check Architecture through the Platform Environment Control Interface to obtain the real-time data that the target register collected from the target-type device when the server went down. The real-time data is then input into the cost function corresponding to the hypothesis function in a preset logistic regression algorithm, and the parameters that minimize the cost function are determined based on the gradient descent function corresponding to the cost function. By way of example, this embodiment is explained using the logistic regression algorithm, which is among the clearest and most practical:
Consider a training set (also called the samples):
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$,
where $x = (x_0, x_1, \ldots, x_n)^T \in \mathbb{R}^{n+1}$, $x_0 = 1$, and $y \in \{0, 1\}$.
Specifically, $(x^{(i)}, y^{(i)})$ represents the data from one crash: $x^{(i)}$ is the $(n+1)$-dimensional feature vector formed from the values of $n+1$ MSR and CSR registers, and $y^{(i)}$ indicates whether CPU_0 was faulty in that crash, that is, whether CPU_0 caused the crash. It takes only the values 0 and 1, where 0 means CPU_0 was not faulty and 1 means CPU_0 was faulty.
The hypothesis function of the logistic regression algorithm is:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
Here $\theta$ is an $(n+1)$-dimensional vector holding the coefficient that multiplies each component of $x$. The function takes a feature vector $x$ as input and returns the probability that CPU_0 is faulty; its value is always greater than 0 and less than 1. If the computed value is greater than 0.5, CPU_0 can be judged faulty; otherwise it is judged not faulty. The key point is finding the concrete values of the $(n+1)$-dimensional vector $\theta$, which is also the core of the machine learning algorithm.
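The cost function is (equation image PCTCN2022101975-appb-000003 in the source, given here in the standard logistic regression form that matches the symbols defined below):
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$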
Here, m is the number of samples; $x^{(i)}$ is the (n+1)-dimensional feature vector; $y^{(i)}$ indicates whether the component was faulty when the server crashed and takes only the values 0 and 1; and $h_\theta(x^{(i)})$ is the fault probability output by the hypothesis function. As θ changes, the gap between the probability computed by the hypothesis function and the actual outcome changes as well. To find the θ that makes this gap smallest, the gradient descent function corresponding to the cost function is used to find the θ that minimizes J(θ).
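To make the role of J(θ) concrete, the following sketch evaluates the standard logistic regression (cross-entropy) cost on a tiny data set. It assumes that standard form; the function names, the clipping constant `eps`, and the sample values are illustrative assumptions only.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy J(theta) over m samples."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        h = min(max(h, eps), 1.0 - eps)   # keep log() away from 0
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical samples: each row starts with the constant x_0 = 1.
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
j_zero = cost([0.0, 0.0], X, y)   # every prediction is 0.5, so J = ln 2
j_fit = cost([0.0, 2.0], X, y)    # predictions agree with the labels, so J is smaller
print(j_zero, j_fit)
```

A θ whose predictions agree with the recorded outcomes gives a lower cost than the all-zero starting point, which is the quantity gradient descent drives down.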
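The gradient descent function is (equation image PCTCN2022101975-appb-000004 in the source, given here in the standard form consistent with the cost function J(θ)):
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 0, 1, \ldots, n$$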
Here, α is the learning rate. A set of θ values is chosen as the initial values; substituting them into this formula yields a better set of θ values, and repeating the update yields the most effective set of θ values.
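The repeated update can be sketched as a short batch gradient descent loop. This is an illustration under the assumption of the standard logistic regression update rule; `gradient_step`, `train`, the learning rate, the iteration count, and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient_step(theta, X, y, alpha):
    """Apply one simultaneous update to every theta_j."""
    m = len(X)
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
              for x_i, y_i in zip(X, y)]
    return [t - alpha * sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
            for j, t in enumerate(theta)]

def train(X, y, alpha=0.1, iterations=1000):
    """Start from theta = 0 and repeat the update a fixed number of times."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta

# Toy data: one feature plus the constant x_0 = 1; larger values mean "faulty".
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
first = gradient_step([0.0, 0.0], X, y, 0.1)
theta = train(X, y)
print(first, theta)
```

All components of θ are updated from the same set of errors, so the step is simultaneous across j, as the update formula requires.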
Figure 4 shows the diagnostic flow of the baseboard management controller when a server crashes. Taking the CPU as an example, the BMC first collects the CPU register data and checks whether any data was collected. If data was collected, the BMC evaluates the hypothesis function for a given component and reports the result to the operations and maintenance (O&M) staff. After the O&M staff replace the component that actually failed, the outcome is fed back to the BMC (the faulty component may or may not have been CPU_0, so the hypothesis function may have judged correctly or incorrectly). Based on this feedback, the BMC runs several iterations of the gradient descent algorithm to update θ, and may broadcast the updated θ to the other BMCs of the same enterprise so that they update their parameters as well. If no data was collected, the flow ends. Note that all BMCs must use the same machine learning model, so downloading and updating the BMC firmware from the server vendor is a convenient way to keep them aligned. Applying horizontal federated learning to BMC-based server fault diagnosis thus provides a software approach to three problems: server faults being hard to locate, machine learning algorithms lacking sufficient data, and data silos between enterprises. Automating the fault handling process also saves considerable labor cost.
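One pass through this flow might look like the sketch below. It is only an illustration: `handle_crash`, the `confirm_fault` callback standing in for the O&M feedback, the single-sample update, and the default step count are assumptions, and the broadcast of θ to other BMCs is left out.

```python
import math

def hypothesis(theta, x):
    """Logistic hypothesis, evaluated in a numerically stable way."""
    z = sum(t * v for t, v in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def handle_crash(features, theta, confirm_fault, alpha=0.1, steps=5):
    """Diagnose one crash and refine theta from the O&M outcome.

    Returns (theta, predicted_faulty); predicted_faulty is None when no
    register data was collected, in which case the flow simply ends.
    """
    if features is None:
        return theta, None
    predicted = hypothesis(theta, features) > 0.5
    # Hypothetical callback: returns 1 if the component really was faulty, else 0.
    actual = confirm_fault(predicted)
    for _ in range(steps):
        error = hypothesis(theta, features) - actual
        theta = [t - alpha * error * v for t, v in zip(theta, features)]
    return theta, predicted

theta0 = [0.0, 0.0]
new_theta, predicted = handle_crash([1.0, 2.0], theta0, lambda guess: 1)
print(new_theta, predicted)
```

In this example the initial model is undecided (probability exactly 0.5), the feedback says the component was faulty, and the updated θ now assigns the same register pattern a probability above 0.5.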
In this embodiment of the application, the server management center obtains the model parameters that each enterprise system sends for fault diagnosis of the target type device; the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm. The model parameters sent by the different enterprise systems are weighted and averaged according to a preset weighting rule to obtain target model parameters. The target model parameters are then sent to the different enterprise servers under each enterprise system, so that when an enterprise server crashes it can use the hypothesis function and the target model parameters to determine whether its own target type device has failed. Because the server management center collects the model parameters from every enterprise system, horizontal federated learning is applied to server fault diagnosis, the server management center has sufficient data for machine learning training, and the problems of insufficient training data and of data silos between enterprises are addressed. The model parameters sent by each enterprise are weighted and averaged under the preset weighting rule to obtain the target model parameters, which the server management center distributes to every enterprise system. As a result, a faulty component can be located quickly when a server fails, which eases the O&M workload, improves the accuracy of fault diagnosis, and strengthens the competitiveness of the server vendor.
An embodiment of the present application discloses a specific fault diagnosis method, shown in Figure 5, that is applied to a server management center. The method includes:
Step S31: Obtain the homomorphically encrypted model parameters that each enterprise system sends for fault diagnosis of the target type device.
In this embodiment, within each enterprise system, the different enterprise servers generate model parameters for fault diagnosis of the target type device when a server crashes. The enterprise servers communicate with one another, and the different baseboard management controllers of the same enterprise exchange their results. The results obtained by the enterprise servers are processed with homomorphic encryption (HE), and the homomorphically encrypted model parameters are then sent to the server management center. The model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of the preset machine learning algorithm; for details, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
A homomorphic encryption algorithm has the following property: if the algorithm is regarded as a function f, then f(a+b)=f(a)+f(b). In other words, after data is homomorphically encrypted, a specific computation can be performed on the ciphertext, and decrypting the result gives the same plaintext as performing that computation directly on the plaintext data. The data is therefore "computable but not visible". Homomorphic encryption thus lets the server management center operate on ciphertext rather than on the data itself, which addresses the reluctance of enterprises to expose their own server fault data. Furthermore, after the server management center has updated the data and sent it back to each enterprise system, the enterprise can decrypt the ciphertext to obtain the latest parameters, making fault diagnosis more accurate. Several homomorphic encryption algorithms are available; this embodiment may use the Paillier algorithm (a probabilistic public-key cryptosystem), but no specific limitation is imposed here.
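The additive property can be demonstrated with a toy Paillier implementation. This is a sketch for illustration only: the primes are far too small to be secure, the generator is fixed to g = n + 1, the function names are hypothetical, and integer inputs stand in for model parameters (real parameters would first need a fixed-point encoding). Note that in Paillier the operation on ciphertexts that corresponds to plaintext addition is multiplication modulo n².

```python
import math
import random

def keygen(p, q):
    """Toy key generation from two small primes, with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)   # modular inverse; requires Python 3.8+
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiply ciphertexts so that the plaintexts are added."""
    return (c1 * c2) % (n * n)

n, priv = keygen(293, 433)            # insecure demo primes
c_sum = add_encrypted(n, encrypt(n, 15), encrypt(n, 27))
total = decrypt(n, priv, c_sum)       # the sum was computed without decrypting the inputs
print(total)
```

Whoever combines the ciphertexts never needs the private key, which is the property the embodiment relies on when the server management center aggregates encrypted parameters.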
Step S32: Use a preset weighting rule to compute a weighted average of the model parameters sent by the different enterprise systems, obtaining target model parameters.
Step S33: Send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server crashes it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed.
For a more detailed description of steps S32 and S33, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
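The weighted averaging in step S32 can be sketched on plaintext parameters as follows. The weighting rule used here (each enterprise weighted by its number of crash samples) is an assumption chosen for illustration, as are the function name and the numbers.

```python
def weighted_average(params_by_enterprise, weights):
    """Combine per-enterprise parameter vectors into target model parameters."""
    if len(params_by_enterprise) != len(weights):
        raise ValueError("one weight is required per enterprise")
    total = sum(weights)
    dim = len(params_by_enterprise[0])
    return [
        sum(w * p[j] for w, p in zip(weights, params_by_enterprise)) / total
        for j in range(dim)
    ]

# Hypothetical parameters from two enterprises and their sample counts.
theta_a = [1.0, 2.0]
theta_b = [3.0, 6.0]
target = weighted_average([theta_a, theta_b], weights=[1, 3])
print(target)   # [2.5, 5.0]
```

Weighting by sample count gives more influence to the enterprise that has observed more crashes; equal weights would reduce this to a plain average.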
In this application, the method is applied to a server management center, which obtains the homomorphically encrypted model parameters that each enterprise system sends for fault diagnosis of the target type device; the model parameters are determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of the preset machine learning algorithm. The model parameters sent by the different enterprise systems are weighted and averaged according to the preset weighting rule to obtain the target model parameters, which are sent to the different enterprise servers under each enterprise system so that, when an enterprise server crashes, it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed. Because the server management center collects the model parameters from every enterprise system, horizontal federated learning is applied to server fault diagnosis, the server management center has sufficient data for machine learning training, and the problems of insufficient training data and of data silos between enterprises are addressed. In addition, the model parameters of each enterprise system are processed with a homomorphic encryption algorithm, so the server management center operates on ciphertext rather than on the data itself and no enterprise exposes its own server O&M data. The model parameters sent by each enterprise are weighted and averaged under the preset weighting rule to obtain the target model parameters, which the server management center distributes to every enterprise system. As a result, a faulty component can be located quickly when a server fails, which eases the O&M workload, improves the accuracy of fault diagnosis, and strengthens the competitiveness of the server vendor.
Correspondingly, an embodiment of the present application also discloses a fault diagnosis apparatus, shown in Figure 6, which includes:
a parameter acquisition module 11, configured to obtain the model parameters that each enterprise system sends for fault diagnosis of the target type device, the model parameters being determined by the different enterprise servers under the enterprise system using the cost function corresponding to the hypothesis function of a preset machine learning algorithm;
a parameter calculation module 12, configured to compute, using a preset weighting rule, a weighted average of the model parameters sent by the different enterprise systems to obtain target model parameters;
a parameter sending module 13, configured to send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server crashes it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed.
For the more detailed working process of each of the above modules and the corresponding technical effects, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
Further, an embodiment of the present application also discloses an electronic device. Figure 7 is a structural diagram of the electronic device 20 according to an exemplary embodiment; the content of the figure shall not be regarded as limiting the scope of use of the present application in any way.
Figure 7 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application. The electronic device 20 may specifically include at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores computer-readable instructions, which are loaded and executed by the at least one processor 21 to implement the relevant steps of the fault diagnosis method disclosed in any of the foregoing embodiments. The electronic device 20 in this embodiment may specifically be a server.
In this embodiment, the power supply 23 provides the operating voltage for each hardware device on the electronic device 20. The communication interface 24 creates a data transmission channel between the electronic device 20 and external devices; it may follow any communication protocol applicable to the technical solution of this application, which is not specifically limited here. The input/output interface 25 obtains input data from the outside or outputs data to the outside; its specific interface type may be selected according to the application and is not specifically limited here.
In addition, the memory 22, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored on it may include an operating system 221, computer-readable instructions 222, and data 223, where the data 223 may include data of various kinds. Storage may be temporary or permanent.
The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer-readable instructions 222; it may be Windows Server, Netware, Unix, Linux, or the like. Besides the computer-readable instructions used to carry out the fault diagnosis method performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer-readable instructions 222 may further include computer-readable instructions used to carry out other specific tasks.
Further, an embodiment of the present application also discloses a non-volatile computer-readable storage medium. The non-volatile computer-readable storage medium referred to here includes random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a magnetic disk, an optical disc, or any other form of storage medium known in the technical field. When the computer-readable instructions are executed by one or more processors, the fault diagnosis method provided by any of the foregoing embodiments is implemented. For the specific steps of the method, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
The steps of the fault diagnosis method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The fault diagnosis method, apparatus, device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A fault diagnosis method, characterized in that it is applied to a server management center and comprises:
    obtaining model parameters that each enterprise system sends for fault diagnosis of a target type device, the model parameters being parameters determined by different enterprise servers under the enterprise system using a cost function corresponding to a hypothesis function of a preset machine learning algorithm;
    computing, using a preset weighting rule, a weighted average of the model parameters sent by the different enterprise systems to obtain target model parameters; and
    sending the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server crashes it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  2. The fault diagnosis method according to claim 1, characterized in that it further comprises:
    obtaining, through a distributed network system built on the baseboard management controllers of the enterprise servers, real-time data produced by the target type device in each enterprise server when the server crashes; inputting the real-time data into the cost function corresponding to the hypothesis function of the preset machine learning algorithm; and determining, based on a gradient descent function corresponding to the cost function, the parameters that minimize the cost function, thereby obtaining the model parameters for fault diagnosis of the target type device.
  3. The fault diagnosis method according to claim 2, characterized in that obtaining the real-time data produced by the target type device in each enterprise server when the server crashes comprises:
    reading target registers of a hardware error detection architecture through a Platform Environment Control Interface to obtain the real-time data, collected in the target registers, that the target type device produced when the server crashed.
  4. The fault diagnosis method according to claim 2 or 3, characterized in that inputting the real-time data into the cost function corresponding to the hypothesis function of the preset machine learning algorithm, and then determining, based on the gradient descent function corresponding to the cost function, the parameters that minimize the cost function, comprises:
    inputting the real-time data into the cost function corresponding to the hypothesis function of a preset logistic regression algorithm, and determining, based on the gradient descent function corresponding to the cost function, the parameters that minimize the cost function.
  5. The fault diagnosis method according to any one of claims 1 to 4, characterized in that obtaining the model parameters that each enterprise system sends for fault diagnosis of the target type device, and computing, using the preset weighting rule, a weighted average of the model parameters sent by the different enterprise systems to obtain the target model parameters, comprises:
    periodically obtaining, according to a preset time period, the model parameters that each enterprise system sends for fault diagnosis of the target type device, and computing, using the preset weighting rule, a weighted average of the model parameters sent by the different enterprise systems to obtain current target model parameters, so that the current target model parameters are used to update the target model parameters obtained in the previous time period.
  6. The fault diagnosis method according to any one of claims 1 to 5, characterized in that using the hypothesis function and the target model parameters to determine whether its own target type device has failed comprises:
    inputting the target model parameters into the hypothesis function to obtain a fault probability output by the hypothesis function, and determining whether the fault probability is less than a preset threshold; and
    in response to the fault probability being less than the preset threshold, determining that its own target type device has not failed, or, in response to the fault probability being not less than the preset threshold, determining that its own target type device has failed.
  7. The fault diagnosis method according to any one of claims 1 to 6, characterized in that obtaining the model parameters that each enterprise system sends for fault diagnosis of the target type device comprises:
    obtaining homomorphically encrypted model parameters that each enterprise system sends for fault diagnosis of the target type device.
  8. A fault diagnosis apparatus, characterized in that it is applied to a server management center and comprises:
    a parameter acquisition module, configured to obtain model parameters that each enterprise system sends for fault diagnosis of a target type device, the model parameters being parameters determined by different enterprise servers under the enterprise system using a cost function corresponding to a hypothesis function of a preset machine learning algorithm;
    a parameter calculation module, configured to compute, using a preset weighting rule, a weighted average of the model parameters sent by the different enterprise systems to obtain target model parameters; and
    a parameter sending module, configured to send the target model parameters to the different enterprise servers under each enterprise system, so that when an enterprise server crashes it uses the hypothesis function and the target model parameters to determine whether its own target type device has failed.
  9. An electronic device, characterized in that the electronic device comprises one or more processors and a memory, wherein the memory is configured to store computer-readable instructions, and the computer-readable instructions are loaded and executed by the one or more processors to implement the fault diagnosis method according to any one of claims 1 to 7.
  10. A non-volatile computer-readable storage medium, characterized in that it is configured to store computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, implement the fault diagnosis method according to any one of claims 1 to 7.
PCT/CN2022/101975 2022-04-13 2022-06-28 Fault diagnosis method and apparatus, device, and storage medium WO2023197453A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210381536.7A CN114461439A (en) 2022-04-13 2022-04-13 Fault diagnosis method, device, equipment and storage medium
CN202210381536.7 2022-04-13

Publications (1)

Publication Number Publication Date
WO2023197453A1 true WO2023197453A1 (en) 2023-10-19

Family

ID=81418613

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101975 WO2023197453A1 (en) 2022-04-13 2022-06-28 Fault diagnosis method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114461439A (en)
WO (1) WO2023197453A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461439A (en) * 2022-04-13 2022-05-10 苏州浪潮智能科技有限公司 Fault diagnosis method, device, equipment and storage medium
CN114780283B (en) * 2022-06-20 2022-11-01 新华三信息技术有限公司 Fault processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111537945A (en) * 2020-06-28 2020-08-14 南方电网科学研究院有限责任公司 Intelligent ammeter fault diagnosis method and equipment based on federal learning
CN111737749A (en) * 2020-06-28 2020-10-02 南方电网科学研究院有限责任公司 Measuring device alarm prediction method and device based on federal learning
CN112101489A (en) * 2020-11-18 2020-12-18 天津开发区精诺瀚海数据科技有限公司 Equipment fault diagnosis method driven by united learning and deep learning fusion
WO2021032496A1 (en) * 2019-08-16 2021-02-25 Telefonaktiebolaget Lm Ericsson (Publ) Methods, apparatus and machine-readable media relating to machine-learning in a communication network
CN114330740A (en) * 2021-12-17 2022-04-12 青岛鹏海软件有限公司 Manufacturing equipment fault monitoring model training system based on federal learning
CN114461439A (en) * 2022-04-13 2022-05-10 苏州浪潮智能科技有限公司 Fault diagnosis method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461215B (en) * 2020-03-31 2021-06-29 支付宝(杭州)信息技术有限公司 Multi-party combined training method, device, system and equipment of business model
CN111722043B (en) * 2020-06-29 2021-09-14 南方电网科学研究院有限责任公司 Power equipment fault detection method, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, YUN'AN: "Research on Multi-category Fault Diagnosis Method of Industrial Equipment Based on Softmax Regression", DIANZI ZHIZUO - PRACTICAL ELECTRONICS, SHIJIE ZHISHI CHUBANSHE, CN, no. 20, 15 October 2018 (2018-10-15), pages 18-20, 25, XP009549608, ISSN: 1006-5059, DOI: 10.16589/j.cnki.cn11-3571/tn.2018.20.008 *

Also Published As

Publication number Publication date
CN114461439A (en) 2022-05-10

Similar Documents

Publication Publication Date Title
Bharany et al. Energy efficient fault tolerance techniques in green cloud computing: A systematic survey and taxonomy
US10560313B2 (en) Pipeline system for time-series data forecasting
US10048996B1 (en) Predicting infrastructure failures in a data center for hosted service mitigation actions
WO2023197453A1 (en) Fault diagnosis method and apparatus, device, and storage medium
US9548886B2 (en) Help desk ticket tracking integration with root cause analysis
US10404551B2 (en) Automated event management
Gill et al. RADAR: Self‐configuring and self‐healing in resource management for enhancing quality of cloud services
US10489232B1 (en) Data center diagnostic information
US11818014B2 (en) Multi-baseline unsupervised security-incident and network behavioral anomaly detection in cloud-based compute environments
US11275617B2 (en) Self-managed intelligent elastic cloud stack
US20150280968A1 (en) Identifying alarms for a root cause of a problem in a data processing system
US20210255899A1 (en) Method for Establishing System Resource Prediction and Resource Management Model Through Multi-layer Correlations
US9280409B2 (en) Method and system for single point of failure analysis and remediation
Dehraj et al. A review on architecture and models for autonomic software systems
US9400731B1 (en) Forecasting server behavior
US20230132116A1 (en) Prediction of impact to data center based on individual device issue
US10372572B1 (en) Prediction model testing framework
CN115812298A (en) Block chain management of supply failure
US11392821B2 (en) Detecting behavior patterns utilizing machine learning model trained with multi-modal time series analysis of diagnostic data
Neto et al. MULTS: A multi-cloud fault-tolerant architecture to manage transient servers in cloud computing
Vizarreta et al. Dason: Dependability assessment framework for imperfect distributed sdn implementations
US11218378B1 (en) Cluster-aware networking fabric update system
KR20200063343A (en) System and method for managing operation in trust reality viewpointing networking infrastructure
Nivitha et al. A survey on machine learning based fault tolerant mechanisms in cloud towards uncertainty analysis
Zwietasch Online failure prediction for microservice architectures

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22937091

Country of ref document: EP

Kind code of ref document: A1