US20180267858A1 - Baseboard Management Controller To Deconfigure Field Replaceable Units According To Deep Learning Model - Google Patents
- Publication number
- US20180267858A1 (application US 15/463,713)
- Authority
- US
- United States
- Prior art keywords
- field replaceable
- error
- computing device
- replaceable units
- error condition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F11/1417—Boot up procedures (saving, restoring, recovering or retrying at system level)
- G06F11/0766—Error or fault reporting or storing
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
- G06F11/142—Reconfiguring to eliminate the error
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06N3/044—Recurrent networks, e.g. Hopfield networks (G06N3/0445)
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N7/005
- G06F2201/805—Real-time (indexing scheme relating to error detection, error correction, and monitoring)
Definitions
- High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic.
- some computing devices with the high availability characteristic do become unavailable.
- FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
- FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
- FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example;
- FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example; and
- FIG. 5 is a diagram of a deep learning model, according to one example.
- index number “N” appended to some of the reference numerals may be understood to merely denote plurality and may not necessarily represent the same quantity for each reference numeral having such an index number “N”. Additionally, use herein of a reference numeral without an index number, where such reference numeral is referred to elsewhere with an index number, may be a general reference to the corresponding plural elements, collectively or individually. In another example, an index number of “I,” “M,” etc. can be used in place of index number N.
- computer manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability.
- Error analysis tools may be static and could require a user to help determine a root cause of an error.
- the computing system may need to be shipped back to a lab to determine the cause of the error. There is a time and shipping cost for this type of analysis.
- various examples provided herein use a deep learning architecture that can autonomously assist IT personnel and field engineers in determining faulty components that may need to be replaced.
- the examples include usage of Recurrent Neural Networks (RNN) for processing system events to distinguish between the different causes and effects of a given failure, and make the appropriate predictions on which components to replace.
- BMCs provide so-called “lights-out” functionality for computing devices.
- the lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device.
- the BMC can run on auxiliary power; thus the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot.
- the BMC may provide management and so-called “out-of-band” services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like.
- a BMC has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device.
- the BMC may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC.
- the BMC can have access to system logs.
- the BMC can process system logs to determine a root cause for the error condition based on the deep learning approach.
- the system logs can come from Field Replaceable Units (FRUs) or be related to the FRUs.
- a field replaceable unit is a circuit board, part, or assembly that can be easily removed from a computing device and replaced by a user or technician without having to send the whole computing device to a repair facility.
- FRUs include parts that can attach to other parts of the computing device using a socket, a card, a module, etc.
- examples of FRUs can include computing modules, memory modules, peripheral cards and devices, etc.
- the system logs can include registers that provide particular information (e.g., an error flag for a particular component, a type of error, a current configuration, a location associated with an error, etc.).
- the BMC can process the information from the logs according to the deep learning model to determine scores associated with each of a number of the FRUs.
- the scores can relate to the likelihood that the FRU has responsibility for the error condition. In other examples, the scores can be associated with sets of FRUs.
- the FRU (or set of FRUs) with a highest likelihood of being responsible for the error condition can be deconfigured by the BMC. Once deconfigured, the computing device can be rebooted to determine if the error persists. In some examples, determining whether the error persists can include testing (e.g., initializing memory, writing to and reading back from various locations, etc.).
- the next FRU or set of FRUs likely to be responsible can be deconfigured. This can repeat. Moreover, in some examples, the failure to remove the error condition can be taken into account for re-scoring the FRUs.
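The deconfigure/reboot/retest cycle described above can be sketched as a simple loop. This is a hypothetical illustration: the function names (`score_frus`, `deconfigure`, `reboot_and_test`) and the ranking logic are assumptions for illustration, not part of the disclosure.

```python
def isolate_faulty_fru(error_log, fru_sets, score_frus, deconfigure, reboot_and_test):
    """Deconfigure candidate FRU sets in order of suspicion until the error clears.

    score_frus, deconfigure, and reboot_and_test are stand-ins for the BMC's
    model inference, hardware deconfiguration, and post-reboot test steps.
    """
    tried = []
    remaining = list(fru_sets)
    while remaining:
        # Re-score on every pass so failed attempts inform the next ranking.
        scores = score_frus(error_log, failed_attempts=tried)
        remaining.sort(key=lambda s: scores[s], reverse=True)
        candidate = remaining.pop(0)
        deconfigure(candidate)
        if reboot_and_test():      # True when the error condition is gone
            return candidate       # likely root cause
        tried.append(candidate)    # feedback for re-scoring
    return None                    # no single candidate cleared the error
```

A failed attempt is appended to `tried` so that, as the text notes, the failure to remove the error condition can be taken into account when re-scoring.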
- the BMC can send information about the logs (e.g., the logs themselves, a condensed version of the logs, etc.) as well as information about the FRU or set of FRUs deconfigured to an error analysis platform.
- the information sent can also include information about deconfigured FRUs that did not cause the error condition.
- the error analysis platform can take the feedback, along with parameters of a current deep learning model and feedback from other computing devices to update parameters for the deep learning model. The updated parameters can be provided to the BMC and other computing devices.
- the approaches described herein are autonomous and can self-learn.
- the approach can learn from multiple different computing devices providing feedback.
- a set of updated deep learning parameters can be determined and sent back to the computing devices.
- the deep learning model can be implemented while processing an error log in a computing device with an error condition. The implementation can also learn from mispredictions of a faulty component or field replaceable unit.
- a deep neural network can reduce the costs associated with handcrafting complex rules for analyzing and recovering from errors in computing devices that are used in statically defined analyzers.
- static analyzers may suffer from a lack of portability across different platform types and architectures.
- the approaches described herein offer a simpler approach where deep learning is used to capture mathematical functions for performing error analysis and recovery.
- a mathematical approach is advantageous because it can be generalized for other platforms and architectures.
- parameters from the deep learning model can be updated and provided back to BMCs within computing devices.
- FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example.
- FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example.
- the computing device 102 includes a central processing unit 110 , a number of field replaceable units 112 , and baseboard management controller 114 .
- an FRU 112 can include the central processing unit 110 .
- the computing device 102 can be included in a system 200 that can also include an error analysis platform 250 that can receive feedback from multiple devices with a local BMC 260 a - 260 n .
- the error analysis platform 250 can take the feedback information to determine updates to parameters for a deep learning model 116 that is used to autonomously diagnose a cause for an error condition of the computing device 102 .
- the BMC 114 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 114 , etc.). The BMC 114 can determine that an error condition is present. Further, the BMC 114 can use an error log 218 to analyze the error condition of the computing device 102 .
- the error log 218 can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs 112 or components of the FRUs 112 ), an operating system executing on the central processing unit 110 , or the like. In one example, the error log may include registers.
- each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc.
- the error log may identify the particular register or component as well. This can be used to map the information to the deep learning model.
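One hypothetical way to map register information from an error log onto model inputs is to one-hot encode the FRU and register identities and append the raw register value. The register names and vector layout below are illustrative assumptions, not taken from the disclosure.

```python
# Hypothetical identifier tables; a real platform would enumerate its own
# FRUs and log registers.
FRU_IDS = {"cpu0": 0, "dimm0": 1, "nic0": 2}
REGISTER_IDS = {"HW_ERR": 0, "UNCORR_ERR": 1, "CORR_ERR": 2}

def encode_entry(fru, register, value):
    """One-hot the FRU and register identity, then append the raw value."""
    vec = [0.0] * (len(FRU_IDS) + len(REGISTER_IDS) + 1)
    vec[FRU_IDS[fru]] = 1.0
    vec[len(FRU_IDS) + REGISTER_IDS[register]] = 1.0
    vec[-1] = float(value)
    return vec
```

Encoding identity alongside value is what lets the model associate a given register with a particular FRU, as described above.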
- the functionality of the BMC 114 described herein can be implemented by executing instructions stored in memory 232 .
- the processing of the error log 218 can include processing using the deep learning model 116 .
- Various deep learning models can be used. Examples of deep learning models include long short-term memory (LSTM) networks, convolutional neural networks, recurrent neural networks, neural history compressors, recursive neural networks, gated recurrent units (GRUs), etc.
- An advantage of a recurrent neural network is the inclusion of feedback.
- An example of one implementation of using an LSTM approach as the deep learning model 116 is provided in the explanation corresponding to FIG. 5 .
- the parameters used for the deep learning model 116 can be updated based on feedback from the computing device 102 or other devices with local BMCs 260 as discussed herein.
- the deep learning model 116 can be applied to determine one of the FRUs 112 or a set of the FRUs 112 that can be deconfigured in response to the error condition.
- a score can be assigned to each of the FRUs 112 and/or to sets of FRUs 112 .
- the scores can relate to probability that the FRU or set of FRUs 112 is a root cause for the error condition.
- the error log can be processed as characters.
- characters can represent registers associated with dumps from FRU components or systems logs.
- each character can be considered an input vector.
- each of the scores for the FRUs can be updated.
- the updated scores can be included as an input vector along with the next character.
- the processing can continue until a character represents an end of the log.
- characters can be broken up by special characters and taken as a group. For example, a first character may identify an FRU's log, a second, third, and fourth character may include log register information, and a special character (fifth character) may indicate that the information about the FRU's log is over. In this example, the five characters are meant to be processed together.
- the information may be forgotten (though the updated scores remain) and a next set of characters can be read to update the scores for the FRUs.
- the scores can be used to rank the probability that each of the FRUs or sets of FRUs are a root cause of the error condition.
- a softmax function may be used to organize the scores (e.g., the softmax function can be used to normalize the score vector into real values in the range [0, 1] that add up to 1).
- One of the FRUs or sets of FRUs can be selected based on the analysis (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs).
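The character-by-character scoring and softmax normalization described above can be sketched as follows. The toy `weight` function stands in for learned recurrent-network parameters, and all names are illustrative assumptions rather than the actual model.

```python
import math

def softmax(xs):
    """Normalize raw scores into probabilities in [0, 1] that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def score_frus_from_log(log_text, frus, weight):
    """Toy character-level scorer: each character nudges per-FRU scores,
    and the running scores are carried forward like a recurrent state."""
    scores = [0.0] * len(frus)
    for ch in log_text:
        for i in range(len(frus)):
            # weight(ch, fru) plays the role of learned parameters.
            scores[i] += weight(ch, frus[i])
    return softmax(scores)
```

The FRU (or set) with the highest resulting probability would then be the first candidate for deconfiguration.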
- the BMC 114 can be caused to deconfigure the FRU.
- the deconfiguration of the FRU can be implemented by disabling the FRU.
- the disabling of the FRU can include removing power to the FRU.
- disabling of the FRU can include removing communications capabilities from the FRU.
- disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state.
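As a sketch, the disabling mechanisms listed above might be modeled as an enumeration handed to a platform-specific action. Everything here is a hypothetical illustration, not an actual BMC interface.

```python
from enum import Enum

class DisableMethod(Enum):
    """The three disabling mechanisms enumerated in the text."""
    REMOVE_POWER = "remove power to the FRU"
    REMOVE_COMMS = "remove communications capabilities from the FRU"
    HOT_PLUG_DISCONNECT = "put the FRU in a disconnected hot plug/hot swap state"

def deconfigure_fru(fru_id, method, apply_action):
    """Apply one of the disabling mechanisms and record what was done.

    apply_action is a stand-in for platform-specific BMC register writes.
    """
    apply_action(fru_id, method)
    return {"fru": fru_id, "disabled_via": method.value}
```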
- the computing device 102 can be rebooted. Once reboot has occurred, the BMC 114 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 114 performs the test. In another example, the BMC 114 is communicatively coupled to another processor (e.g., CPU 110 ), which is instructed to perform the test.
- next most probable FRU or set of FRUs can be selected to be deconfigured.
- the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the FRUs/sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted.
- the selected FRU or set of FRUs can include at least one FRU that was not in the original selection.
- the selected FRU or set of FRUs is deconfigured and the computing device can be rebooted and tested again.
- the next set(s) of FRUs can be selected.
- Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Processes (MDPs), etc.
- the error log and the information regarding the deconfiguration can be sent to the error analysis platform 250 .
- the error analysis platform 250 can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device 102 .
- the error analysis platform 250 can update the parameters for the deep learning model 116 for the computing device 102 .
- the parameters can also be used in the other devices with local BMCs 260 .
- the updated parameters can be sent by the error analysis platform 250 back to the devices that can use the updated parameters for future error log processing.
- the deep learning model can be trained on the error analysis platform 250 or another platform.
- the training may include initial error log impressions from a technical expert making the training sets based on error log entries (e.g., an error log entry of a register indicating that a memory module has an unrecoverable hard error may be trained to indicate that the memory module is a root cause for that error).
- full systems configurations can be added to the sample sets as well. For example, a configuration where a peripheral network card FRU has a hardware error, but two other FRUs (e.g., memory modules) have errors that were caused by the network card FRU, the root cause may be trained to be the network card (for that specific case).
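A labeled training sample for the network-card scenario above might be shaped as follows. The field names and register identifiers are hypothetical; only the scenario (a network card fault cascading into two memory-module errors) comes from the text.

```python
# One labeled sample: flattened error-log entries plus the root-cause label
# a technical expert (or later, field feedback) would attach.
sample = {
    "log_entries": [
        {"fru": "nic0",  "register": "HW_ERR",     "value": 0x1},
        {"fru": "dimm0", "register": "UNCORR_ERR", "value": 0x1},
        {"fru": "dimm1", "register": "UNCORR_ERR", "value": 0x1},
    ],
    # label: the memory errors were caused by the network card,
    # so the network card is the root cause for this specific case
    "root_cause": ["nic0"],
}
```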
- the training sets can be determined from observations. Feedback can come from computing devices put into implementation or from test units.
- the feedback can be used as training data to update the parameters for the deep learning model.
- Various approaches can be used to implement the deep learning approach to update parameters on the error analysis platform 250 , for example, RMSprop, Adagrad, Adam, etc.
- gradient descent optimization algorithms can be used.
- the BMC 114 can receive the updated parameters for the deep learning model 116 from the error analysis platform 250 based on the error log and the information regarding the deconfigured FRU.
- the updated parameters may include log and information regarding other deconfigured FRUs from other devices with local BMCs 260 .
- a new error log associated with that error condition can be processed as discussed above using the updated parameters.
- each of the computing device 102 and the devices with local BMCs 260 can have a common technology platform.
- each of the devices may be part of the same server series or product line.
- particular FRUs may be tested for use with that common technology platform to provide sample training information.
- newly seen FRUs may create new training information as part of feedback.
- the error analysis platform 250 may be communicatively coupled to the BMC 114 .
- the error analysis platform 250 is on a separate network, but feedback can be provided via a message (e.g., email or via an API) and updated parameters may be provided in a similar way (e.g., an update file provided via an administrator device). Because access to BMCs 114 can be via a separate control network, the access between the error analysis platform 250 and the BMCs need not be constant.
- the deep learning model 116 can be trained using training data.
- the training data may include an error log entry and an identification of the FRU(s) that were the root cause of an error associated with the error log entry.
- the training data may include static data of error log information and root cause FRU identification.
- the deep learning parameters can be trained using a deep learning approach.
- the training can involve determination of a change to each parameter based on training information.
- Examples of such learning algorithms include gradient descent, various approaches used by DistBelief, Project Adam, and Hama, and stochastic gradient descent by backpropagation, among others.
- the training can be distributed across multiple workers, where each worker is, e.g., a central processing unit (CPU) or graphical processing unit (GPU).
- Each worker iteratively processes new training data from its subset of batches of the training data.
- the workers communicate by exchanging gradient updates.
- a parameter server is used to provide each of the workers the same model parameters.
- the error analysis platform can be implemented over a number of computing devices.
- each worker receives a subset of training data and a full set of model parameters for each iteration of training.
- every worker sends a pull request to the parameter server and gets a latest copy of the parameters W, which might contain a number of floating-point values for a deep learning model.
- Each copy of the parameters on each device is called a model replica.
- Each model replica works on a different input training data subset.
- each subset can contain error log information including an identification of one or more FRUs associated with the information and status registers that provide additional information (e.g., state information, error conditions, etc.).
- Each model replica calculates its data gradients (in an example with three workers, ΔD1, ΔD2, ΔD3) with its own mini-batch input and sends the gradients back (usually a push request) to the parameter server.
- the parameter server gathers the gradients from all the workers, calculates the average of the gradient, and updates the model accordingly.
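The pull/push iteration described above can be sketched as one synchronous parameter-server step. Function and variable names are illustrative assumptions.

```python
def train_step(params, worker_batches, grad_fn, lr=0.01):
    """One synchronous iteration: every worker pulls the same parameters W,
    computes a gradient on its own mini-batch, and pushes it back; the
    server averages the gradients and updates the shared model."""
    # Each worker's model replica computes its data gradient (push request).
    gradients = [grad_fn(params, batch) for batch in worker_batches]
    n = len(gradients)
    avg = [sum(g[i] for g in gradients) / n for i in range(len(params))]
    # The parameter server applies the averaged gradient.
    return [p - lr * a for p, a in zip(params, avg)]
```

Each element of `worker_batches` corresponds to one worker's mini-batch of error-log training data; the averaged update is what every replica pulls on the next iteration.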
- the deep learning model 116 can be initially trained using predefined training data and then updated based on real world feedback.
- a communication network can be used to communicatively couple the computing device with other computing devices and/or the error analysis platform.
- the communication network can use wired communications, wireless communications, or combinations thereof.
- the communication network can include multiple sub communication networks such as data networks, wireless networks, telephony networks, etc.
- Such networks can include, for example, a public data network such as the Internet, local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), cable networks, fiber optic networks, combinations thereof, or the like.
- wireless networks may include cellular networks, satellite communications, wireless LANs, etc.
- the communication network can be in the form of a direct network link between devices.
- Various communications structures and infrastructure can be utilized to implement the communication network(s).
- devices communicate with each other and other components with access to communication networks via a communication protocol or multiple protocols.
- a protocol can be a set of rules that defines how nodes of the communication network interact with other nodes.
- communications between network nodes can be implemented by exchanging discrete packets of data or sending messages. Packets can include header information associated with a protocol (e.g., information on the location of the network node(s) to contact) as well as payload information.
- the BMC 114 can include hardware and/or combinations of hardware and programming to perform functions provided herein, including the “lights-out” and “out-of-band” services described above.
- the BMC 114 may be capable of receiving error log information and deconfiguring FRUs 112 .
- a processor such as a central processing unit (CPU) 110 or a microprocessor suitable for retrieval and execution of instructions and/or electronic circuits can be configured to perform the functionality for the computing device 102 separately from the BMC 114 .
- FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example.
- FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example.
- execution of method 300 is described below with reference to BMC 400 , other suitable components for execution of method 300 can be utilized (e.g., computing device 102 ).
- Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 420 , and/or in the form of electronic circuitry.
- the BMC 400 may be part of a computing device with multiple FRUs and, as noted above, can provide the same “lights-out,” auxiliary-power, and “out-of-band” management capabilities described for BMC 114 .
- an auxiliary state is a state where the BMC 400 is capable of functionality while a main subsystem of the computing device is not capable of functionality (e.g., when the computing device is powered off, but plugged in, when the main subsystem is in an error condition state, etc.).
- the BMC 400 may host a web server that allows for communications via the network interface.
- the BMC 400 can have access to system logs.
- the BMC 400 can process system logs to determine a root cause for the error condition based on the deep learning approach.
- the processing element 410 can execute error condition instructions 422 to determine that an error condition has occurred in the computing device ( 302 ).
- the BMC 400 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 400 , etc.).
- the BMC 400 can determine that the error condition is present.
- the BMC 400 can also receive an error log.
- the model processing instructions 424 can be executed by the processing element 410 to process the error log according to a deep learning model.
- the error log can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs or components of the FRUs), an operating system executing on a central processing unit associated with a main subsystem of the computing device, or the like.
- the error log may include registers.
- each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc.
- the error log may identify the particular register or component as well. This can be used to map the information to the deep learning model.
- the processing can be used to determine a score for each of a number of sets of the FRUs of the computing device.
- a set of FRUs includes one FRU or multiple FRUs.
- the scores can relate to a probability to remove the error condition by deconfiguration of the set of FRUs.
- the deep learning model includes updated parameters based on error condition feedback from at least one other device.
- the deconfiguration instructions 426 can be executed by processing element 410 to deconfigure a first one of the sets of FRUs based on the score associated with the set (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs).
- the processing element 410 can be caused to deconfigure the FRU.
- the deconfiguration of the FRU can be implemented by disabling the FRU.
- the disabling of the FRU can include removing power to the FRU.
- disabling of the FRU can include removing communications capabilities from the FRU.
- disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state.
- a configuration parameter associated with the FRU can be set to indicate to the computing device/FRU that the FRU is not to function.
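The deconfiguration mechanisms listed above (removing power, removing communications, a disconnected hot-plug state, or a configuration parameter) might be recorded by the BMC roughly as below. The function and field names are illustrative assumptions, not an actual BMC API.

```python
from enum import Enum

class DisableMethod(Enum):
    REMOVE_POWER = "remove power"
    REMOVE_COMMS = "remove communications capabilities"
    HOT_UNPLUG = "disconnected hot plug/hot swap state"
    CONFIG_FLAG = "configuration parameter: do not function"

def deconfigure(fru, method, config):
    """Mark `fru` as not-to-function so it stays disabled across reboot.
    `config` stands in for persistent BMC configuration storage."""
    config[fru] = {"enabled": False, "method": method.value}
    return config

cfg = deconfigure("DIMM1", DisableMethod.CONFIG_FLAG, {})
```

On the next boot, firmware consulting this configuration would skip initialization of the disabled FRU.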
- the computing device can be rebooted ( 308 ). Once reboot has occurred, the BMC 400 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 400 performs the test. In another example, the BMC 400 is communicatively coupled to another processor in the main subsystem of the computing system (e.g., a CPU) that is not deconfigured, which is instructed to perform the test.
- next most probable set of FRUs can be selected to be deconfigured.
- the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted.
- the selected set of FRUs can include at least one FRU that was not in the original selection.
- the selected set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected.
- Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
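The deconfigure-reboot-test loop described in the preceding items can be sketched as follows. `rank_sets`, `deconfigure`, and `reboot_and_test` are placeholders for the BMC operations described in the text, not real APIs.

```python
def isolate_root_cause(rank_sets, deconfigure, reboot_and_test, max_attempts=5):
    """Deconfigure candidate FRU sets, most probable first, until the
    error condition no longer persists after reboot.

    rank_sets(history) re-scores the candidates given the failed
    attempts so far and returns them in descending probability order.
    """
    history = []
    for _ in range(max_attempts):
        candidates = rank_sets(history)
        if not candidates:
            break
        chosen = candidates[0]          # next most probable set
        deconfigure(chosen)
        history.append(chosen)
        if reboot_and_test():           # True -> error condition removed
            return chosen, history
    return None, history

# Toy usage: the second-ranked set is the actual root cause.
deconfigured = []
rank = lambda hist: [s for s in [("DIMM1",), ("CPU1",)] if s not in hist]
found, attempts = isolate_root_cause(
    rank, deconfigured.append, lambda: deconfigured[-1] == ("CPU1",))
# found == ("CPU1",); attempts == [("DIMM1",), ("CPU1",)]
```

Re-scoring inside `rank_sets` is where knowledge of the failed previous attempt feeds back into the model, as the text describes.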
- the error log and the information regarding the deconfiguration of FRUs can be sent to the error analysis platform. This allows for the feedback to be provided to other devices with local BMCs similar to the computing device.
- the error analysis platform can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device.
- the BMC 400 can receive updated parameters for the deep learning model from the error analysis platform that take into consideration the error log information and the information about the set of the FRUs that was deconfigured.
- the updated parameters may also take into consideration other sets of FRUs deconfigured in response to other error conditions associated with other similar computing devices.
- the FRUs deconfigured from the other similar computing devices may be considered additional training data from the other computing devices that represent real life experiences.
- the error analysis platform can update parameters for the deep learning model from the information provided and other training data.
- Processing element 410 may be one or multiple processing units, one or multiple semiconductor-based microprocessors, other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 420 , or combinations thereof.
- the processing element 410 can be a physical device.
- the processing element 410 may include multiple cores on a chip, include multiple cores across multiple chips, or combinations thereof.
- Processing element 410 may fetch, decode, and execute instructions 422 , 424 , 426 to implement method 300 .
- processing element 410 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 422 , 424 , 426 .
- Machine-readable storage medium 420 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions.
- machine-readable storage medium may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like.
- the machine-readable storage medium can be non-transitory.
- machine-readable storage medium 420 may be encoded with a series of executable instructions for implementing method 300 .
- FIG. 5 is a diagram of a deep learning model, according to one example.
- the deep learning model example includes LSTM. LSTM can be beneficial for error analysis because of its ability to remember appropriate sequences via gates.
- An input gate i_t 501 controls the amount of input written into a neuron's memory cell at time step t.
- the error log can provide input.
- a forget gate f_t 503 controls the amount of information to be forgotten from a neuron's memory cell at time step t.
- a set of characters can be grouped together to update an output and then cleared.
- the cell c_t 505 represents the content of the neuron's memory cell at time step t.
- the output gate o_t 507 controls the amount of information read from the neuron's cell and how much of it contributes to the output at time step t.
- the output h_t 509 represents the output of the cell to the next layer at time step t. This output is also fed back into the same neuron and used in the following time step t+1.
- x_t can be represented by error log+ 511 . Error log+ can be considered the input vector to the gates.
- the input vector can be the same for each gate as shown in FIG. 5 . In some examples, this can include information from the error log plus hidden inputs (e.g., h_t 509 before the end of the processing of the error log).
- b represents a parameter vector from the deep learning model
- the Ws represent parameter matrices for the deep learning model
- x represents an input vector
- h represents an output vector
- c represents a cell state vector
- f, i, and o represent gate vectors.
- equations 1-5 are representative of neurons for an entire layer within FIG. 5 .
- the W's are matrices.
- each row in W_l for hidden layer l would be mapped to neuron j where j ∈ [1, n].
- the · operator is a dot product operation.
- tanh and σ (sigmoid) activation functions are also outlined in equations 6 and 7 for clarity. These functions are applied as element-wise operations on the resulting vectors.
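Equations 1-7 referenced above do not survive legibly in this extraction. The standard LSTM formulation they describe, consistent with the gates, cell state, and activations defined in the surrounding text, can be written as below; whether the patent's exact equations match this term-for-term cannot be verified from the garbled text, so this is the conventional formulation from the literature.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi}\cdot x_t + W_{hi}\cdot h_{t-1} + b_i\right) &&\text{(1) input gate}\\
f_t &= \sigma\!\left(W_{xf}\cdot x_t + W_{hf}\cdot h_{t-1} + b_f\right) &&\text{(2) forget gate}\\
o_t &= \sigma\!\left(W_{xo}\cdot x_t + W_{ho}\cdot h_{t-1} + b_o\right) &&\text{(3) output gate}\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\!\left(W_{xc}\cdot x_t + W_{hc}\cdot h_{t-1} + b_c\right) &&\text{(4) cell state}\\
h_t &= o_t \circ \tanh(c_t) &&\text{(5) output}\\
\sigma(z) &= \frac{1}{1+e^{-z}} &&\text{(6)}\\
\tanh(z) &= \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} &&\text{(7)}
\end{aligned}
```

Here ∘ denotes element-wise multiplication, matching the statement that the activations are applied element-wise.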
- Other example LSTM models can be implemented, such as a Gated Recurrent Unit. Further, as noted above, other deep learning models may be used.
- the example operates through the consumption of characters as input vectors.
- For the purpose of this example, assume characters that are sourced from an MCE log as input.
- the BMC can focus on analysis where actions at the system level can be performed. However, the approach is capable of processing other log types as long as the model is trained with data in the desired format.
- the neural network can make delayed predictions as it consumes input vectors (consuming one character at a time) by generating <NOP> tags as output for each time step.
- the output h_t 509 can provide hidden output that can be used as feedback to include in the input vector for the next iteration until a prediction is made.
- a prediction is eventually made once the BMC processing the log according to the model receives a special <End-of-Log> tag as input.
- the prediction can go through a softmax processing layer to determine scores that can be used to deconfigure FRUs.
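The consume-one-character-at-a-time behavior, emitting <NOP> until the <End-of-Log> tag arrives, can be sketched as follows; `step` stands in for one LSTM time step and is a hypothetical callable, not the actual model.

```python
NOP, END_OF_LOG = "<NOP>", "<End-of-Log>"

def consume_log(tokens, step):
    """Feed the log one character (token) at a time; emit <NOP> for
    every step until <End-of-Log>, then emit the final raw scores."""
    hidden = None
    outputs = []
    for tok in tokens:
        logits, hidden = step(tok, hidden)   # one recurrent time step
        outputs.append(logits if tok == END_OF_LOG else NOP)
    return outputs

# Dummy step function that ignores its input and keeps no real state:
outs = consume_log(["a", "b", END_OF_LOG], lambda tok, h: ("logits", h))
# outs == ["<NOP>", "<NOP>", "logits"]
```

In the real model, `hidden` would carry h_t 509 forward between time steps, which is the feedback path described in the text.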
- the architecture in the example use can be fully connected, with the final stage going through a softmax layer which uses the form in equation 8, P(y = k | z) = e^(z_k) / Σ_j e^(z_j), in order to obtain confidence levels for replacing each FRU k where k ∈ [1, K] for K replaceable FRUs.
- the final output is a vector y that has the following format, where T is the transpose operator: [<NOP>, CPU_1, . . . CPU_p, DIMM_1, . . . DIMM_d, I/O-slot_1, . . . I/O-slot_s, . . . ]^T.
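Equation 8 is the standard softmax. A minimal numerically stable version, with made-up logits for three FRUs, is:

```python
import math

def softmax(z):
    """Map raw scores z_k to confidences e^(z_k) / sum_j e^(z_j),
    i.e., values in [0, 1] that sum to 1 (equation 8)."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

confidences = softmax([2.0, 1.0, 0.1])  # e.g., CPU_1, DIMM_1, I/O-slot_1
```

The resulting vector preserves the ranking of the raw scores, so the highest-confidence FRU is the first candidate for deconfiguration.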
- various parts of debugging an error condition of a computing device can be automated using a BMC.
- the solution can take into account other error conditions found in the field from similar computing devices.
- the autonomous nature allows for accurate metric reporting on failures in the field while minimizing downtime (e.g. the amount of time it may take to have a technician come out and troubleshoot the computing device).
- the accurate metric reporting can be fed into the deep learning model to self-improve the automated process.
- the approach allows for reducing the field replacement costs for FRUs that are unnecessarily replaced in customer systems, as well as personnel costs. Though specific examples of deep learning models are provided, other similar deep learning approaches can be implemented for training and/or execution using deep learning parameters.
Description
- Information Technology companies and manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability. High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic. However, some computing devices with the high availability characteristic do become unavailable.
- The following detailed description references the drawings, wherein:
FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example; -
FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example; -
FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example; -
FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example; and -
FIG. 5 is a diagram of a deep learning model, according to one example. - Throughout the drawings, identical reference numbers may designate similar, but not necessarily identical, elements. An index number “N” appended to some of the reference numerals may be understood to merely denote plurality and may not necessarily represent the same quantity for each reference numeral having such an index number “N”. Additionally, use herein of a reference numeral without an index number, where such reference numeral is referred to elsewhere with an index number, may be a general reference to the corresponding plural elements, collectively or individually. In another example, an index number of “I,” “M,” etc. can be used in place of index number N.
- Information Technology (IT) companies and computer manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability. High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic. However, some computing devices with the high availability characteristic do become unavailable.
- With today's businesses demanding near real-time analytics on big data in order to conduct their daily transactions, IT companies are constantly being challenged to produce highly complex, yet fault tolerant systems to empower datacenters. As such, having the ability to efficiently diagnose and repair failures of increasingly complex systems can be advantageous. Error analysis tools may be static and could require a user to help determine a root cause of an error. With complex failures, the computing system may need to be shipped back to a lab to determine the cause of the error. There is a time and shipping cost for this type of analysis.
- Accordingly, various examples provided herein use a deep learning architecture that can autonomously assist IT personnel and field engineers in determining faulty components that may need to be replaced. The examples include usage of Recurrent Neural Networks (RNN) for processing system events to distinguish between the different causes and effects of a given failure, and make the appropriate predictions on which components to replace. A baseboard management controller (BMC) can be used to perform the analysis at the computing device with an error.
- BMCs provide so-called “lights-out” functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC can run on auxiliary power; thus, the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC may provide management and so-called “out-of-band” services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC may comprise an interface, such as a network interface and/or serial interface, that an administrator can use to remotely communicate with the BMC.
- As noted, the BMC can have access to system logs. In one example, when an error condition occurs on the computing device, the BMC can process system logs to determine a root cause for the error condition based on the deep learning approach. In some examples, the system logs can come from Field Replaceable Units (FRUs) or be related to the FRUs. As used herein, a field replaceable unit is a circuit board, part, or assembly that can be easily removed from a computing device and replaced by a user or technician without having to send the whole computing device to a repair facility. Examples of FRUs include parts that can attach to other parts of the computing device using a socket, a card, a module, etc. Further, examples of FRUs can include computing modules, memory modules, peripheral cards and devices, etc. In some examples, the system logs can include registers that provide particular information (e.g., an error flag for a particular component, a type of error, a current configuration, a location associated with an error, etc.).
- The BMC can process the information from the logs according to the deep learning model to determine scores associated with each of a number of the FRUs. The scores can relate to the likelihood that the FRU has responsibility for the error condition. In other examples, the scores can be associated with sets of FRUs. Once each of the logs is processed, the FRU (or set of FRUs) with a highest likelihood of being responsible for the error condition can be deconfigured by the BMC. Once deconfigured, the computing device can be rebooted to determine if the error persists. In some examples, determining whether the error persists can include testing (e.g., initializing memory, writing to and reading back from various locations, etc.). In one example, if the error condition is not removed, the next FRU or set of FRUs likely to be responsible can be deconfigured. This can repeat. Moreover, in some examples, the failure to remove the error condition can be taken into account for re-scoring the FRUs.
- In one example, if the error condition is removed, the BMC can send information about the logs (e.g., the logs themselves, a condensed version of the logs, etc.) as well as information about the FRU or set of FRUs deconfigured to an error analysis platform. In some examples, the information sent can also include information about deconfigured FRUs that did not cause the error condition. The error analysis platform can take the feedback, along with parameters of a current deep learning model and feedback from other computing devices to update parameters for the deep learning model. The updated parameters can be provided to the BMC and other computing devices.
- Unlike a static error analysis engine, the approaches described herein are autonomous and can self-learn. In one example, the approach can learn from multiple different computing devices providing feedback. In this example, a set of updated deep learning parameters can be determined and sent back to the computing devices. In another example, the deep learning model can be implemented while processing an error log in a computing device with an error condition. The implementation can also learn from mispredictions of a faulty component or field replaceable unit.
- Further, the use of a deep neural network can reduce the costs associated with handcrafting complex rules for analyzing and recovering from errors in computing devices that are used in statically defined analyzers. Moreover, static analyzers may suffer from a lack of portability across different platform types and architectures. The approaches described herein offer a simpler approach where deep learning is used to capture mathematical functions for performing error analysis and recovery. A mathematical approach is advantageous because it can be generalized for other platforms and architectures. As noted, parameters from the deep learning model can be updated and provided back to BMCs within computing devices.
FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example. FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example. - In the example of
FIG. 1 , the computing device 102 includes a central processing unit 110 , a number of field replaceable units 112 , and a baseboard management controller 114 . In some examples, an FRU 112 can include the central processing unit 110 . In the example of FIG. 2 , the computing device 102 can be included in a system 200 that can also include an error analysis platform 250 that can receive feedback from multiple devices with a local BMC 260 a- 260 n. The error analysis platform 250 can take the feedback information to determine updates to parameters for a deep learning model 116 that is used to autonomously diagnose a cause for an error condition of the computing device 102 . - When an error condition affects the
computing device 102 , the BMC 114 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 114 , etc.). The BMC 114 can determine that an error condition is present. Further, the BMC 114 can use an error log 218 to analyze the error condition of the computing device 102 . The error log 218 can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs 112 or components of the FRUs 112 ), an operating system executing on the central processing unit 110 , or the like. In one example, the error log may include registers. In the example, each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc. The error log may identify the particular register or component as well. This can be used to map the information to the deep learning model. The functionality of the BMC 114 described herein can be implemented by executing instructions stored in memory 232 . - As noted, in one example, the processing of the error log 218 can include processing using the
deep learning model 116 . Various deep learning models can be used. Examples of deep learning models include long short-term memory (LSTM), convolutional neural networks, recurrent neural networks, neural history compressors, recursive neural networks, gated recurrent units (GRU), etc. An advantage of a recurrent neural network is the inclusion of feedback. An example of one implementation of using an LSTM approach as the deep learning model 116 is provided in the explanation corresponding to FIG. 5 . The parameters used for the deep learning model 116 can be updated based on feedback from the computing device 102 or other devices with local BMCs 260 as discussed herein. - The
deep learning model 116 can be applied to determine one of the FRUs 112 or a set of the FRUs 112 that can be deconfigured in response to the error condition. When the BMC 114 processes the error log 218 according to the deep learning model 116 , a score can be assigned to each of the FRUs 112 and/or to sets of FRUs 112 . The scores can relate to the probability that the FRU or set of FRUs 112 is a root cause for the error condition. - In one example model, the error log can be processed as characters. In the example model, characters can represent registers associated with dumps from FRU components or system logs. In one example, each character can be considered an input vector. When a character is processed, each of the scores for the FRUs can be updated. The updated scores can be included as an input vector along with the next character. The processing can continue until a character represents an end of the log. In an LSTM model, characters can be broken up by special characters and taken as a group. For example, a first character may identify an FRU's log, a second, third, and fourth character may include log register information, and a special character (fifth character) may indicate that the information about the FRU's log is over. In this example, the five characters are meant to be processed together.
- Once the information is processed, the information may be forgotten (though the updated scores remain) and a next set of characters can be read to update the scores for the FRUs. In some examples, the scores can be used to rank the probability that each of the FRUs or sets of FRUs is a root cause of the error condition. For example, a softmax function may be used to organize the scores (e.g., the softmax function can be used to normalize the vectors into real values in the range of [0, 1] that add up to 1).
- One of the FRUs or sets of FRUs can be selected based on the analysis (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs). The
BMC 114 can be caused to deconfigure the FRU. In some examples, the deconfiguration of the FRU can be implemented by disabling the FRU. In one example, the disabling of the FRU can include removing power to the FRU. In another example, disabling of the FRU can include removing communications capabilities from the FRU. In a further example, disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state. - Once the FRU(s) selected is deconfigured, the
computing device 102 can be rebooted. Once reboot has occurred, the BMC 114 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 114 performs the test. In another example, the BMC 114 is communicatively coupled to another processor (e.g., CPU 110 ), which is instructed to perform the test. - If the error condition persists, the next most probable FRU or set of FRUs can be selected to be deconfigured. In one example, the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the FRUs/sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted. The selected FRU or set of FRUs can include at least one FRU that was not in the original selection. The selected FRU or set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected. Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
- If the error condition does not persist after a reboot, the error log and the information regarding the deconfiguration can be sent to the
error analysis platform 250 . This allows for the feedback to be provided to other devices with local BMCs 260 similar to the computing device. The error analysis platform 250 can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device 102 . - The
error analysis platform 250 can update the parameters for the deep learning model 116 for the computing device 102 . The parameters can also be used in the other devices with local BMCs 260 . The updated parameters can be sent by the error analysis platform 250 back to the devices that can use the updated parameters for future error log processing. - The deep learning model can be trained on the
error analysis platform 250 or another platform. The training may include initial error log impressions from a technical expert making the training sets based on error log entries (e.g., an error log entry of a register indicating that a memory module has an unrecoverable hard error may be trained to indicate that the memory module is a root cause for that error). Similarly, full system configurations can be added to the sample sets as well. For example, in a configuration where a peripheral network card FRU has a hardware error but two other FRUs (e.g., memory modules) have errors that were caused by the network card FRU, the root cause may be trained to be the network card (for that specific case). The training sets can be determined from observations. Feedback can come from computing devices put into implementation or from test units. As noted, the feedback can be used as training data to update the parameters for the deep learning model. Various approaches can be used to implement the deep learning approach to update parameters on the error analysis platform 250 , for example, RMSprop, Adagrad, Adam, etc. In one example, gradient descent optimization algorithms can be used. - As such, the
BMC 114 can receive the updated parameters for the deep learning model 116 from the error analysis platform 250 based on the error log and the information regarding the deconfigured FRU. Similarly, the updated parameters may include logs and information regarding other deconfigured FRUs from other devices with local BMCs 260 . When another error condition occurs on the computing device or one of the other devices with local BMCs 260 capable of implementing this approach, a new error log associated with that error condition can be processed as discussed above using the updated parameters. - In some examples, each of the
computing device 102 and the devices with local BMCs 260 can have a common technology platform. For example, each of the devices may be part of a same series server line. Moreover, particular FRUs may be tested for use with that common technology platform to provide sample training information. In some examples, newly seen FRUs may create new training information as part of feedback. In one example, the error analysis platform 250 may be communicatively coupled to the BMC 114 . In another example, the error analysis platform 250 is on a separate network, but feedback can be provided via a message (e.g., email or via an API) and updated parameters may be provided in a similar way (e.g., an update file provided via an administrator device). Because access to BMCs 114 can be via a separate control network, the access between the error analysis platform 250 and the BMCs need not be constant. - The
deep learning model 116 can be trained using training data. In one example, the training data may include an error log entry and an identification of the FRU(s) that were the root cause of an error associated with the error log entry. In other examples, the training data may include static data of error log information and root cause FRU identification. - The deep learning parameters can be trained using a deep learning approach. The training can involve determination of a change to each parameter based on training information. Examples of such learning algorithms include gradient descent, various approaches used by Distbelief, Project Adam, and Hama, and stochastic gradient descent by backpropogation, among others.
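A single training sample of the kind described (an error log entry paired with the root-cause FRU identification) might look like the following; the log encoding and field names are invented for illustration and are not from the patent.

```python
# Hypothetical (error log, root cause) training pair; the character
# encoding of registers and FRU identifiers is illustrative only.
sample = {
    "error_log": "<FRU:DIMM1><REG:0x9C=UNCORR><End-of-Log>",
    "root_cause_frus": ["DIMM1"],
}
training_data = [sample]
```

Static samples of this shape would form the initial training set, with field feedback appended later as additional samples.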
- A commonly used technique in distributed deep learning for both convolution neural network and recurrent neural network models is data parallelism. In this example, each worker (e.g., a central processing unit (CPU) or graphical processing unit (GPU)) receives a subset of a batch of training data. Each worker iteratively processes new training data from its subset of batches of the training data. The workers communicate by exchanging gradient updates. A parameter server is used to provide each of the workers the same model parameters. As such, in some examples, the error analysis platform can be implemented over a number of computing devices.
- The following is an example model of distributed deep learning. In this example of distributed deep learning, each worker receives a subset of training data and a full set of model parameters for each iteration of training. At the beginning of one iteration, every worker sends a pull request to the parameter server and gets a latest copy of the parameters W, which might contain a number of floating-point values for a deep learning model. Each copy of the parameters on each device is called a model replica. Each model replica works on a different input training data subset. For example, each subset can contain error log information including an identification of one or more FRUs associated with the information and status registers that provide additional information (e.g., state information, error conditions, etc.).
- Each model replica calculates its data gradients (in an example with three workers ΔD1, ΔD2, ΔD3) with its own mini-batch input and sends the gradients back (usually a push request) to the parameter server. The parameter server gathers the gradients from all the workers, calculates the average of the gradient, and updates the model accordingly. For example, a new W′ can equal the previous W plus a learning rate times an average of the data gradients. Shown as an equation, the new W′ can be expressed as W′=W+learning rate*average (ΔD1, ΔD2, ΔD3). The
deep learning model 116 can be initially trained using predefined training data and then updated based on real world feedback. - A communication network can be used to communicatively couple the computing device with other computing devices and/or the error analysis platform. The communication network can use wired communications, wireless communications, or combinations thereof. Further, the communication network can include multiple sub communication networks such as data networks, wireless networks, telephony networks, etc. Such networks can include, for example, a public data network such as the Internet, local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), cable networks, fiber optic networks, combinations thereof, or the like. In certain examples, wireless networks may include cellular networks, satellite communications, wireless LANs, etc. Further, the communication network can be in the form of a direct network link between devices. Various communications structures and infrastructure can be utilized to implement the communication network(s).
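The parameter-server update rule above (W′ = W + learning rate × average(ΔD1, ΔD2, ΔD3)) can be sketched as follows. This is a minimal sketch, not the patent's implementation: the worker gradients and dimensions are invented for illustration, and the sign convention follows the equation as given in the text.

```python
import numpy as np

def parameter_server_update(W, worker_gradients, learning_rate):
    """Average the data gradients pushed by each worker and update the
    shared parameters, as in W' = W + learning_rate * average(dD1..dDn)."""
    avg_grad = np.mean(worker_gradients, axis=0)
    return W + learning_rate * avg_grad

# Three hypothetical workers push their mini-batch gradients.
W = np.zeros(4)
grads = [np.array([0.3, 0.0, -0.3, 0.6]),
         np.array([0.0, 0.3, 0.3, 0.0]),
         np.array([0.3, 0.3, 0.0, 0.0])]
W_new = parameter_server_update(W, grads, learning_rate=0.1)
# Each worker then pulls the same W_new (its "model replica") for the next iteration.
```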
- By way of example, devices communicate with each other and other components with access to communication networks via a communication protocol or multiple protocols. A protocol can be a set of rules that defines how nodes of the communication network interact with other nodes. Further, communications between network nodes can be implemented by exchanging discrete packets of data or sending messages. Packets can include header information associated with a protocol (e.g., information on the location of the network node(s) to contact) as well as payload information.
- The
BMC 114 can include hardware and/or combinations of hardware and programming to perform functions provided herein. As noted, the BMC 114 can provide so-called "lights-out" functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC 114 can run on auxiliary power; thus the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC 114 may provide management and so-called "out-of-band" services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC 114 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC 114 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 114. Moreover, as described herein, the BMC 114 may be capable of receiving error log information and deconfiguring FRUs 112. - A processor, such as a central processing unit (CPU) 110 or a microprocessor suitable for retrieval and execution of instructions and/or electronic circuits, can be configured to perform the functionality for the computing device 102 separately from the BMC 114. -
FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example. FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example. Although execution of method 300 is described below with reference to BMC 400, other suitable components for execution of method 300 can be utilized (e.g., computing device 102). Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 420, and/or in the form of electronic circuitry. - The
BMC 400 may be part of a computing device with multiple FRUs. As noted above, BMC 400 can provide so-called "lights-out" functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC 400 can run on auxiliary power; thus the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC 400 may provide management and so-called "out-of-band" services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC 400 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC 400 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 400. As used herein, an auxiliary state is a state where the BMC 400 is capable of functionality while a main subsystem of the computing device is not capable of functionality (e.g., when the computing device is powered off but plugged in, when the main subsystem is in an error condition state, etc.). In some examples, the BMC 400 may host a web server that allows for communications via the network interface. - As noted, the
BMC 400 can have access to system logs. In one example, when an error condition occurs on the computing device, the BMC 400 can process system logs to determine a root cause for the error condition based on the deep learning approach. The processing element 410 can execute error condition instructions 422 to determine that an error condition has occurred in the computing device (302). When the error condition affects the computing device, the BMC 400 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 400, etc.). The BMC 400 can determine that the error condition is present. - The
BMC 400 can also receive an error log. At 304, the model processing instructions 424 can be executed by the processing element 410 to process the error log according to a deep learning model. The error log can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs or components of the FRUs), an operating system executing on a central processing unit associated with a main subsystem of the computing device, or the like. In one example, the error log may include registers. In the example, each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc. The error log may identify the particular register or component as well. This can be used to map the information to the deep learning model. As noted above, the processing can be used to determine a score for each of a number of sets of the FRUs of the computing device. As used herein, a set of FRUs includes one FRU or multiple FRUs. As described above, the scores can relate to a probability of removing the error condition by deconfiguration of the set of FRUs. In some examples, the deep learning model includes updated parameters based on error condition feedback from at least one other device. - At 306, the
deconfiguration instructions 426 can be executed by processing element 410 to deconfigure a first one of the sets of FRUs based on the score associated with the set (e.g., the set of FRUs scored to have the highest probability of being the root cause of the error condition compared to the other FRUs). The processing element 410 can be caused to deconfigure the FRU. In some examples, the deconfiguration of the FRU can be implemented by disabling the FRU. In one example, the disabling of the FRU can include removing power to the FRU. In another example, disabling of the FRU can include removing communications capabilities from the FRU. In a further example, disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state. For example, a configuration parameter associated with the FRU can be set to indicate to the computing device/FRU that the FRU is not to function. - Once the set of FRUs selected is deconfigured, the computing device can be rebooted (308). Once reboot has occurred, the
BMC 400 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 400 performs the test. In another example, the BMC 400 is communicatively coupled to another processor in the main subsystem of the computing system (e.g., a CPU) that is not deconfigured, which is instructed to perform the test. - If the error condition persists, the next most probable set of FRUs can be selected to be deconfigured. In one example, the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted. The selected set of FRUs can include at least one FRU that was not in the original selection. The selected set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected. Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
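The deconfigure-reboot-test cycle just described can be sketched as a simple greedy loop. This is a deliberate simplification: the patent recomputes scores after each failed attempt, while the sketch ranks the candidate sets once; the `deconfigure`, `reboot`, and `error_persists` callbacks are hypothetical hooks into the BMC, not an actual API.

```python
def isolate_faulty_fru_set(scored_sets, deconfigure, reboot, error_persists):
    """Greedy isolation: deconfigure the most probable set of FRUs,
    reboot, re-test, and move to the next most probable set while the
    error persists. Returns (isolated_set_or_None, sets_tried)."""
    tried = []
    for fru_set, score in sorted(scored_sets, key=lambda p: p[1], reverse=True):
        deconfigure(fru_set)          # e.g., clear a configuration parameter
        tried.append(fru_set)
        reboot()                      # reboot the computing device (308)
        if not error_persists():      # test whether the error condition cleared
            return fru_set, tried     # likely root cause; report to the platform
    return None, tried                # no single candidate set cleared the error

# Simulated run: DIMM1 is the actual root cause and also the top-scored set.
faulty = frozenset({"DIMM1"})
state = {"deconfigured": set()}
scored = [(frozenset({"CPU1"}), 0.5), (faulty, 0.9), (frozenset({"IO1"}), 0.2)]
found, tried = isolate_faulty_fru_set(
    scored,
    deconfigure=lambda s: state["deconfigured"].update(s),
    reboot=lambda: None,
    error_persists=lambda: not (faulty <= state["deconfigured"]))
```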
- If the error condition does not persist after a reboot, the error log and the information regarding the deconfiguration of FRUs can be sent to the error analysis platform. This allows for the feedback to be provided to other devices with local BMCs similar to the computing device. The error analysis platform can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device. The BMC 400 can receive updated parameters for the deep learning model from the error analysis platform that take into consideration the error log information and the information about the set of the FRUs that was deconfigured. The updated parameters may also take into consideration other sets of FRUs deconfigured in response to other error conditions associated with other similar computing devices. The FRUs deconfigured from the other similar computing devices may be considered additional training data from the other computing devices that represent real life experiences. As noted above, the error analysis platform can update parameters for the deep learning model from the information provided and other training data. -
Processing element 410 may be one or multiple processing units, one or multiple semiconductor-based microprocessors, other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 420, or combinations thereof. The processing element 410 can be a physical device. Moreover, in one example, the processing element 410 may include multiple cores on a chip, include multiple cores across multiple chips, or combinations thereof. Processing element 410 may fetch, decode, and execute instructions 422, 424, 426 to implement method 300. As an alternative or in addition to retrieving and executing instructions, processing element 410 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 422, 424, 426. - Machine-readable storage medium 420 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, machine-readable storage medium 420 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like. As such, the machine-readable storage medium can be non-transitory. As described in detail herein, machine-readable storage medium 420 may be encoded with a series of executable instructions for implementing method 300. -
FIG. 5 is a diagram of a deep learning model, according to one example. The deep learning model example includes an LSTM. LSTM can be beneficial for error analysis because of its ability to remember appropriate sequences via gates. An input gate i_t 501 controls the amount of input written into a neuron's memory cell at time step t. In this scenario, the error log can provide input. A forget gate f_t 503 controls the amount of information to be forgotten from a neuron's memory cell at time step t. With this approach, a set of characters can be grouped together to update an output and then cleared. The cell c_t 505 represents the content of the neuron's memory cell at time step t. The output gate o_t 507 controls the amount of information read from the neuron's cell and how much of it contributes to the output at time step t. The output h_t 509 represents the output of the cell to the next layer at time step t. This output is also fed back into the same neuron and used in the following time step t+1. In the example of FIG. 5, x_t can be represented by error log+ 511. Error log+ can be considered the input vector to the gates. The input vector can be the same for each gate as shown in FIG. 5. In some examples, this can include information from the error log plus hidden inputs (e.g., h_t 509 before the end of the processing of the error log). - The following equations can be used to implement one example LSTM model: i_t = σ(W_xi x_t + W_hi h_t−1 + b_i) (Eq. 1); f_t = σ(W_xf x_t + W_hf h_t−1 + b_f) (Eq. 2); o_t = σ(W_xo x_t + W_ho h_t−1 + b_o) (Eq. 3); c_t = f_t ⊙ c_t−1 + i_t ⊙ tanh(W_xc x_t + W_hc h_t−1 + b_c) (Eq. 4); h_t = o_t ⊙ tanh(c_t) (Eq. 5); σ(z) = 1/(1 + e^(−z)) (Eq. 6); tanh(z) = 2σ(2z) − 1 (Eq. 7). In the equation set, the b's represent parameter (bias) vectors from the deep learning model, the W's represent parameter matrices for the deep learning model, x represents an input vector, h represents an output vector, c represents a cell state vector, and f, i, and o represent gate vectors.
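Equations 1-7 can be implemented directly. The following is a minimal sketch with zero-initialized stand-in parameters (a trained model would supply real W matrices and b vectors); the ⊙ products are applied element-wise, and the input/hidden dimensions are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # Eq. 6

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step implementing Eqs. 1-5; p holds the parameter
    matrices W_* and bias vectors b_* of the model."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])  # Eq. 1
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])  # Eq. 2
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])  # Eq. 3
    c_t = f_t * c_prev + i_t * np.tanh(
        p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])            # Eq. 4
    h_t = o_t * np.tanh(c_t)                                     # Eq. 5
    return h_t, c_t

# Tiny hypothetical model: 3-dim input (e.g., an encoded error-log
# character), 2 hidden units; zeros just to exercise the shapes.
n, m = 2, 3
p = {k: np.zeros((n, m)) for k in ("Wxi", "Wxf", "Wxo", "Wxc")}
p.update({k: np.zeros((n, n)) for k in ("Whi", "Whf", "Who", "Whc")})
p.update({k: np.zeros(n) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(np.ones(m), np.zeros(n), np.zeros(n), p)
```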
- Note that equations 1-5 are representative of neurons for an entire layer within
FIG. 5 . This implies that it, ft, ot, ct, ht, ht−1, and xt are vectors. In the example, the W's are matrices. In other words, if a given matrix W is augmented to include the weights for both x and h such that its dimensions become n×m, then each row in Wl for hidden layer l would be mapped to neuron j where jϵ[1, n]. Moreover, the ⊙ operator is a dot product operation. - The tan h and σ (sigmoid) activation functions are also outlined in equations 6 and 7 for clarity. These functions are applied as element wise operations on the resulting vectors. Other example LSTM models can be implemented, such as a Gated Recurrent Unit. Further, as noted above, other deep learning models may be used.
- The example operates through the consumption of characters as input vectors. For the purpose of this example, assume characters that are sourced from an MCE log as input. In the example, the BMC can focus on analysis where actions at the system level can be performed. However, the approach is capable of processing other log types as long as the model is trained with data in the desired format. The neural network can make delayed predictions as it consumes input vectors (consuming one character at a time) by generating <NOP> tags as output for each time step. As noted, the output, h_t 509, can provide hidden output that can be used as feedback to include in the input vector for the next iteration until a prediction is made. A prediction is eventually made once the BMC processing the log according to the model receives a special <End-of-Log> tag as input. As noted above, in some examples, the prediction can go through a softmax processing layer to determine scores that can be used to deconfigure FRUs. - Further, the architecture in the example can be fully connected, with the final stage going through a softmax layer which uses the form in equation 8, P(y_i|z_k) = e^(z_k) / Σ_j e^(z_j), in order to obtain confidence levels for replacing each FRU k, where k ∈ [1, K] for K replaceable FRUs. As such, the final output is a vector y that has the following format, where T is the transpose operator: [<NOP>, CPU1, . . . CPUp, DIMM1, . . . DIMMd, I/O-slot1, . . . I/O-slots, . . . ]^T.
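The softmax stage of equation 8 can be sketched as follows. The final-layer activations and the FRU label list below are hypothetical, chosen only to show how the confidence vector would be turned into a deconfiguration choice.

```python
import numpy as np

def softmax(z):
    """Eq. 8: P(y_i | z) = exp(z_k) / sum_j exp(z_j); the max is
    subtracted first for numerical stability (result is unchanged)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical final-layer outputs for [<NOP>, CPU1, DIMM1, IO-slot1]:
labels = ["<NOP>", "CPU1", "DIMM1", "IO-slot1"]
conf = softmax(np.array([0.1, 1.2, 3.0, 0.4]))
best = labels[int(np.argmax(conf))]  # highest-confidence FRU to deconfigure
```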
- With the approaches described herein, various parts of debugging an error condition of a computing device can be automated using a BMC. The solution can take into account other error conditions found in the field from similar computing devices. Moreover, the autonomous nature allows for accurate metric reporting on failures in the field while minimizing downtime (e.g., the amount of time it may take to have a technician come out and troubleshoot the computing device). The accurate metric reporting can be fed into the deep learning model to self-improve the automated process. Moreover, the approach allows for reducing the field replacement costs for FRUs that are unnecessarily replaced in customer systems, as well as personnel costs. Though specific examples of deep learning models are provided, other similar deep learning approaches can be implemented for both training and/or execution using deep learning parameters. -
- While certain implementations have been shown and described above, various changes in form and details may be made. For example, some features that have been described in relation to one implementation and/or process can be related to other implementations. In other words, processes, features, components, and/or properties described in relation to one implementation can be useful in other implementations. Furthermore, it should be appreciated that the systems and methods described herein can include various combinations and/or sub-combinations of the components and/or features of the different implementations described. Thus, features described with reference to one or more implementations can be combined with other implementations described herein.
Claims (19)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/463,713 US10552729B2 (en) | 2017-03-20 | 2017-03-20 | Baseboard management controller to deconfigure field replaceable units according to deep learning model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/463,713 US10552729B2 (en) | 2017-03-20 | 2017-03-20 | Baseboard management controller to deconfigure field replaceable units according to deep learning model |
Publications (2)
Publication Number | Publication Date |
---|---|
US20180267858A1 true US20180267858A1 (en) | 2018-09-20 |
US10552729B2 US10552729B2 (en) | 2020-02-04 |
Family
ID=63520096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/463,713 Active 2037-07-12 US10552729B2 (en) | 2017-03-20 | 2017-03-20 | Baseboard management controller to deconfigure field replaceable units according to deep learning model |
Country Status (1)
Country | Link |
---|---|
US (1) | US10552729B2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11226840B2 (en) * | 2015-10-08 | 2022-01-18 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Neural network unit that interrupts processing core upon condition |
US11221872B2 (en) * | 2015-10-08 | 2022-01-11 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Neural network unit that interrupts processing core upon condition |
US11403162B2 (en) * | 2019-10-17 | 2022-08-02 | Dell Products L.P. | System and method for transferring diagnostic data via a framebuffer |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5253184A (en) * | 1991-06-19 | 1993-10-12 | Storage Technology Corporation | Failure and performance tracking system |
DE10244131B4 (en) * | 2002-09-23 | 2006-11-30 | Siemens Ag | Method for supporting identification of a defective functional unit in a technical installation |
US6970804B2 (en) | 2002-12-17 | 2005-11-29 | Xerox Corporation | Automated self-learning diagnostic system |
US20040221198A1 (en) * | 2003-04-17 | 2004-11-04 | Vecoven Frederic Louis Ghislain Gabriel | Automatic error diagnosis |
US8001423B2 (en) * | 2008-09-26 | 2011-08-16 | Bae Systems Information And Electronic Systems Integration Inc. | Prognostic diagnostic capability tracking system |
US8504875B2 (en) * | 2009-12-28 | 2013-08-06 | International Business Machines Corporation | Debugging module to load error decoding logic from firmware and to execute logic in response to an error |
CN102455950A (en) * | 2010-10-28 | 2012-05-16 | 鸿富锦精密工业(深圳)有限公司 | Firmware recovery system and method of base board management controller |
CN103914735B (en) | 2014-04-17 | 2017-03-29 | 北京泰乐德信息技术有限公司 | A kind of fault recognition method and system based on Neural Network Self-learning |
US10817398B2 (en) * | 2015-03-09 | 2020-10-27 | Vapor IO Inc. | Data center management via out-of-band, low-pin count, external access to local motherboard monitoring and control |
US10339448B2 (en) * | 2017-01-09 | 2019-07-02 | Seagate Technology Llc | Methods and devices for reducing device test time |
- 2017-03-20: US application US15/463,713 filed; granted as US10552729B2 (status: Active)
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11119660B2 (en) | 2018-06-29 | 2021-09-14 | International Business Machines Corporation | Determining when to replace a storage device by training a machine learning module |
US20200004625A1 (en) * | 2018-06-29 | 2020-01-02 | International Business Machines Corporation | Determining when to perform error checking of a storage unit by training a machine learning module |
US11204827B2 (en) | 2018-06-29 | 2021-12-21 | International Business Machines Corporation | Using a machine learning module to determine when to perform error checking of a storage unit |
US11119850B2 (en) | 2018-06-29 | 2021-09-14 | International Business Machines Corporation | Determining when to perform error checking of a storage unit by using a machine learning module |
US11099743B2 (en) | 2018-06-29 | 2021-08-24 | International Business Machines Corporation | Determining when to replace a storage device using a machine learning module |
US11119663B2 (en) | 2018-06-29 | 2021-09-14 | International Business Machines Corporation | Determining when to perform a data integrity check of copies of a data set by training a machine learning module |
US11119851B2 (en) * | 2018-06-29 | 2021-09-14 | International Business Machines Corporation | Determining when to perform error checking of a storage unit by training a machine learning module |
US11119662B2 (en) | 2018-06-29 | 2021-09-14 | International Business Machines Corporation | Determining when to perform a data integrity check of copies of a data set using a machine learning module |
US10891181B2 (en) * | 2018-10-25 | 2021-01-12 | International Business Machines Corporation | Smart system dump |
CN110427371A (en) * | 2019-07-19 | 2019-11-08 | 苏州浪潮智能科技有限公司 | Server FRU field management method, device, equipment and readable storage medium storing program for executing |
US20210081238A1 (en) * | 2019-09-17 | 2021-03-18 | Western Digital Technologies, Inc. | Exception analysis for data storage devices |
US11768701B2 (en) * | 2019-09-17 | 2023-09-26 | Western Digital Technologies, Inc. | Exception analysis for data storage devices |
CN110751272A (en) * | 2019-10-30 | 2020-02-04 | 珠海格力电器股份有限公司 | Method, device and storage medium for positioning data in convolutional neural network model |
CN113536306A (en) * | 2020-04-14 | 2021-10-22 | 慧与发展有限责任合伙企业 | Processing health information to determine whether an exception occurred |
US11755729B2 (en) | 2020-08-07 | 2023-09-12 | Softiron Limited | Centralized server management for current monitoring for security |
US11748478B2 (en) | 2020-08-07 | 2023-09-05 | Softiron Limited | Current monitor for security |
US20220066890A1 (en) * | 2020-08-25 | 2022-03-03 | Softiron Limited | Centralized Server Management For Shadow Nodes |
US12019528B2 (en) * | 2020-08-25 | 2024-06-25 | Softiron Limited | Centralized server management for shadow nodes |
US11636004B1 (en) * | 2021-10-22 | 2023-04-25 | EMC IP Holding Company LLC | Method, electronic device, and computer program product for training failure analysis model |
CN114896212A (en) * | 2022-04-07 | 2022-08-12 | 支付宝(杭州)信息技术有限公司 | Log data analysis method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
US10552729B2 (en) | 2020-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10552729B2 (en) | Baseboard management controller to deconfigure field replaceable units according to deep learning model | |
US10579459B2 (en) | Log events for root cause error diagnosis | |
US11494295B1 (en) | Automated software bug discovery and assessment | |
US10489232B1 (en) | Data center diagnostic information | |
US11860721B2 (en) | Utilizing automatic labelling, prioritizing, and root cause analysis machine learning models and dependency graphs to determine recommendations for software products | |
KR101331935B1 (en) | Method and system of fault diagnosis and repair using based-on tracepoint | |
US11625315B2 (en) | Software regression recovery via automated detection of problem change lists | |
US11551085B2 (en) | Method, device, and computer program product for error evaluation | |
CN111414268B (en) | Fault processing method and device and server | |
CN113282461A (en) | Alarm identification method and device for transmission network | |
JP7435799B2 (en) | Rule learning device, rule engine, rule learning method, and rule learning program | |
CN117668706A (en) | Method and device for isolating memory faults of server, storage medium and electronic equipment | |
CN112257745A (en) | Hidden Markov-based method and device for predicting health degree of underground coal mine system | |
CN114691409A (en) | Memory fault processing method and device | |
CN116582414A (en) | Fault root cause positioning method, device, equipment and readable storage medium | |
US20200127882A1 (en) | Identification of cause of failure of computing elements in a computing environment | |
US20230385048A1 (en) | Predictive recycling of computer systems in a cloud environment | |
CN113839861A (en) | Routing engine switching based on health determined by support vector machine | |
Simeonov et al. | Proactive software rejuvenation based on machine learning techniques | |
US20230161661A1 (en) | Utilizing topology-centric monitoring to model a system and correlate low level system anomalies and high level system impacts | |
US20230161637A1 (en) | Automated reasoning for event management in cloud platforms | |
CN116490857A (en) | Method and system for providing maintenance service for recording medium of electronic device | |
CN116827759B (en) | Method and device for processing restarting instruction of converging current divider | |
US11956117B1 (en) | Network monitoring and healing based on a behavior model | |
CN112996026B (en) | Double-backup upgrading method and system for wireless network equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | AS | Assignment | Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BACHA, ANYS;UMAR, DODDYANTO HAMID;SIGNING DATES FROM 20170315 TO 20170316;REEL/FRAME:041647/0016
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS
 | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4