US20180267858A1 - Baseboard Management Controller To Deconfigure Field Replaceable Units According To Deep Learning Model - Google Patents


Info

Publication number
US20180267858A1
US20180267858A1 (application US15/463,713; granted as US10552729B2)
Authority
US
United States
Prior art keywords
field replaceable
error
computing device
replaceable units
error condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/463,713
Other versions
US10552729B2 (en)
Inventor
Anys Bacha
Doddyanto Hamid Umar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Enterprise Development LP
Original Assignee
Hewlett Packard Enterprise Development LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development LP
Priority to US15/463,713
Assigned to Hewlett Packard Enterprise Development LP (assignors: Doddyanto Hamid Umar, Anys Bacha)
Publication of US20180267858A1
Application granted
Publication of US10552729B2
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1417Boot up procedures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0445
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • G06N7/005
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805Real-time

Definitions

  • High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic.
  • some computing devices with the high availability characteristic do become unavailable.
  • FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
  • FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
  • FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example;
  • FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example; and
  • FIG. 5 is a diagram of a deep learning model, according to one example.
  • index number “N” appended to some of the reference numerals may be understood to merely denote plurality and may not necessarily represent the same quantity for each reference numeral having such an index number “N”. Additionally, use herein of a reference numeral without an index number, where such reference numeral is referred to elsewhere with an index number, may be a general reference to the corresponding plural elements, collectively or individually. In another example, an index number of “I,” “M,” etc. can be used in place of index number N.
  • computer manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability.
  • Error analysis tools may be static and could require a user to help determine a root cause of an error.
  • the computing system may need to be shipped back to a lab to determine the cause of the error. There is a time and shipping cost for this type of analysis.
  • various examples provided herein use a deep learning architecture that can autonomously assist IT personnel and field engineers in determining faulty components that may need to be replaced.
  • the examples include usage of Recurrent Neural Networks (RNN) for processing system events to distinguish between the different causes and effects of a given failure, and make the appropriate predictions on which components to replace.
  • a baseboard management controller (BMC) provides so-called “lights-out” functionality for computing devices.
  • the lights out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device.
  • the BMC can run on auxiliary power, thus the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot.
  • the BMC may provide management and so-called “out-of-band” services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like.
  • a BMC has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device.
  • the BMC may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC.
  • the BMC can have access to system logs.
  • the BMC can process system logs to determine a root cause for the error condition based on the deep learning approach.
  • the system logs can come from Field Replaceable Units (FRUs) or be related to the FRUs.
  • a field replaceable unit is a circuit board, part, or assembly that can be easily removed from a computing device and replaced by a user or technician without having to send the whole computing device to a repair facility.
  • FRUs include parts that can attach to other parts of the computing device using a socket, a card, a module, etc.
  • examples of FRUs can include computing modules, memory modules, peripheral cards and devices, etc.
  • the system logs can include registers that provide particular information (e.g., an error flag for a particular component, a type of error, a current configuration, a location associated with an error, etc.).
  • the BMC can process the information from the logs according to the deep learning model to determine scores associated with each of a number of the FRUs.
  • the scores can relate to the likelihood that the FRU has responsibility for the error condition. In other examples, the scores can be associated with sets of FRUs.
  • the FRU (or set of FRUs) with a highest likelihood of being responsible for the error condition can be deconfigured by the BMC. Once deconfigured, the computing device can be rebooted to determine if the error persists. In some examples, determining whether the error persists can include testing (e.g., initializing memory, writing to and reading back from various locations, etc.).
  • the next FRU or set of FRUs likely to be responsible can be deconfigured. This can repeat. Moreover, in some examples, the failure to remove the error condition can be taken into account for re-scoring the FRUs.
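As one illustrative sketch (not the patent's implementation), the score, deconfigure, reboot, and re-score loop described above can be written as follows; `score_frus`, `deconfigure`, and `reboot_and_test` are hypothetical callbacks standing in for the BMC's model inference and hardware control.

```python
def isolate_faulty_fru(error_log, frus, score_frus, deconfigure, reboot_and_test):
    """Repeatedly deconfigure the most probable FRU until the error clears.

    score_frus(error_log, feedback) returns a score per FRU; failed
    attempts are accumulated in `feedback` so re-scoring can take the
    persisting error condition into account.
    """
    tried = set()
    feedback = []  # records of failed attempts, fed back into scoring
    while len(tried) < len(frus):
        scores = score_frus(error_log, feedback)
        # pick the highest-scoring FRU not yet deconfigured
        candidates = [f for f in frus if f not in tried]
        best = max(candidates, key=lambda f: scores[f])
        deconfigure(best)
        tried.add(best)
        if reboot_and_test():  # True when the error condition is gone
            return best
        feedback.append(best)  # misprediction: re-score with this knowledge
    return None  # no single FRU removed the error
```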
  • the BMC can send information about the logs (e.g., the logs themselves, a condensed version of the logs, etc.) as well as information about the FRU or set of FRUs deconfigured to an error analysis platform.
  • the information sent can also include information about deconfigured FRUs that did not cause the error condition.
  • the error analysis platform can take the feedback, along with parameters of a current deep learning model and feedback from other computing devices to update parameters for the deep learning model. The updated parameters can be provided to the BMC and other computing devices.
  • the approaches described herein are autonomous and can self-learn.
  • the approach can learn from multiple different computing devices providing feedback.
  • a set of updated deep learning parameters can be determined and sent back to the computing devices.
  • the deep learning model can be implemented while processing an error log in a computing device with an error condition. The implementation can also learn from mispredictions of a faulty component or field replaceable unit.
  • a deep neural network can reduce the costs associated with handcrafting the complex rules used in statically defined analyzers for analyzing and recovering from errors in computing devices.
  • static analyzers may suffer from a lack of portability across different platform types and architectures.
  • the approaches described herein offer a simpler approach where deep learning is used to capture mathematical functions for performing error analysis and recovery.
  • a mathematical approach is advantageous because it can be generalized for other platforms and architectures.
  • parameters from the deep learning model can be updated and provided back to BMCs within computing devices.
  • FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example.
  • FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example.
  • the computing device 102 includes a central processing unit 110 , a number of field replaceable units 112 , and baseboard management controller 114 .
  • an FRU 112 can include the central processing unit 110 .
  • the computing device 102 can be included in a system 200 that can also include an error analysis platform 250 that can receive feedback from multiple devices with a local BMC 260 a - 260 n .
  • the error analysis platform 250 can take the feedback information to determine updates to parameters for a deep learning model 116 that is used to autonomously diagnose a cause for an error condition of the computing device 102 .
  • the BMC 114 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 114 , etc.). The BMC 114 can determine that an error condition is present. Further, the BMC 114 can use an error log 218 to analyze the error condition of the computing device 102 .
  • the error log 218 can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs 112 or components of the FRUs 112 ), an operating system executing on the central processing unit 110 , or the like. In one example, the error log may include registers.
  • each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc.
  • the error log may identify the particular register or component as well. This can be used to map the information to the deep learning model.
  • the functionality of the BMC 114 described herein can be implemented by executing instructions stored in memory 232 .
  • the processing of the error log 218 can include processing using the deep learning model 116 .
  • Various deep learning models can be used. Examples include long short-term memory (LSTM), convolutional neural networks, recurrent neural networks, neural history compressors, recursive neural networks, gated recurrent units (GRU), etc.
  • An advantage to a recurrent neural network is the inclusion of feedback.
  • An example of one implementation of using an LSTM approach as the deep learning model 116 is provided in the explanation corresponding to FIG. 5 .
  • the parameters used for the deep learning model 116 can be updated based on feedback from the computing device 102 or other devices with local BMCs 260 as discussed herein.
  • the deep learning model 116 can be applied to determine one of the FRUs 112 or a set of the FRUs 112 that can be deconfigured in response to the error condition.
  • a score can be assigned to each of the FRUs 112 and/or to sets of FRUs 112 .
  • the scores can relate to probability that the FRU or set of FRUs 112 is a root cause for the error condition.
  • the error log can be processed as characters.
  • characters can represent registers associated with dumps from FRU components or systems logs.
  • each character can be considered an input vector.
  • each of the scores for the FRUs can be updated.
  • the updated scores can be included as an input vector along with the next character.
  • the processing can continue until a character represents an end of the log.
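The per-character recurrence described above can be sketched abstractly: each character updates a hidden state that carries the running FRU scores, and processing stops at the character representing the end of the log. The `step` function here is a trivial stand-in for one trained LSTM/GRU cell update, not the patent's actual model.

```python
def process_log_chars(log, step, initial_state, end_marker="\0"):
    """Feed the error log one character at a time through a recurrent step.

    step(state, char) -> new_state is a hypothetical stand-in for one
    recurrent-cell update; the state carries the evolving FRU scores.
    """
    state = initial_state
    for ch in log:
        if ch == end_marker:  # character representing the end of the log
            break
        state = step(state, ch)
    return state
```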
  • characters can be broken up by special characters and taken as a group. For example, a first character may identify an FRU's log, a second, third, and fourth character may include log register information, and a special character (fifth character) may indicate that the information about the FRU's log is over. In this example, the five characters are meant to be processed together.
  • the information may be forgotten (though the updated scores remain) and a next set of characters can be read to update the scores for the FRUs.
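The grouping scheme described above (an FRU identifier, register characters, then a special character ending the group) can be sketched as a simple stream parser; the delimiter choice (`|`) and variable-length record layout are illustrative assumptions, not from the patent.

```python
def parse_log_records(stream, delimiter="|"):
    """Split a character stream into per-FRU records.

    Each record is (fru_id, register_chars); the delimiter marks the
    end of one FRU's log information, after which the parser 'forgets'
    the completed group and starts the next.
    """
    records = []
    group = []
    for ch in stream:
        if ch == delimiter:
            if group:
                records.append((group[0], "".join(group[1:])))
            group = []  # forget the completed group
        else:
            group.append(ch)
    if group:  # trailing record without a closing delimiter
        records.append((group[0], "".join(group[1:])))
    return records
```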
  • the scores can be used to rank the probability that each of the FRUs or sets of FRUs are a root cause of the error condition.
  • a softmax function may be used to organize the scores (e.g., the softmax function can be used to normalize the score vector into real values in the range [0, 1] that add up to 1).
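The normalization described above is the standard softmax function; this sketch uses plain Python rather than any particular BMC firmware API.

```python
import math

def softmax(scores):
    """Normalize raw FRU scores into probabilities in [0, 1] that sum to 1.

    Subtracting the maximum first is a standard numerical-stability
    trick; it does not change the result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```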
  • One of the FRUs or sets of FRUs can be selected based on the analysis (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs).
  • the BMC 114 can be caused to deconfigure the FRU.
  • the deconfiguration of the FRU can be implemented by disabling the FRU.
  • the disabling of the FRU can include removing power to the FRU.
  • disabling of the FRU can include removing communications capabilities from the FRU.
  • disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state.
  • the computing device 102 can be rebooted. Once reboot has occurred, the BMC 114 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 114 performs the test. In another example, the BMC 114 is communicatively coupled to another processor (e.g., CPU 110 ), which is instructed to perform the test.
  • next most probable FRU or set of FRUs can be selected to be deconfigured.
  • the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the FRUs/sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted.
  • the selected FRU or set of FRUs can include at least one FRU that was not in the original selection.
  • the selected FRU or set of FRUs is deconfigured and the computing device can be rebooted and tested again.
  • the next set(s) of FRUs can be selected.
  • Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Processes (MDP), etc.
  • the error log and the information regarding the deconfiguration can be sent to the error analysis platform 250 .
  • the error analysis platform 250 can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device 102 .
  • the error analysis platform 250 can update the parameters for the deep learning model 116 for the computing device 102 .
  • the parameters can also be used in the other devices with local BMCs 260 .
  • the updated parameters can be sent by the error analysis platform 250 back to the devices that can use the updated parameters for future error log processing.
  • the deep learning model can be trained on the error analysis platform 250 or another platform.
  • the training may include initial error log impressions from a technical expert making the training sets based on error log entries (e.g., an error log entry of a register indicating that a memory module has an unrecoverable hard error may be trained to indicate that the memory module is a root cause for that error).
  • full system configurations can be added to the sample sets as well. For example, in a configuration where a peripheral network card FRU has a hardware error and two other FRUs (e.g., memory modules) have errors that were caused by the network card FRU, the root cause may be trained to be the network card (for that specific case).
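Training samples of the kind described above can be sketched as (error-log, root-cause-label) pairs; the field names and example register strings here are illustrative assumptions, not from the patent.

```python
# Each sample pairs error-log text with the FRU(s) labeled as root cause.
training_set = [
    # simple case: one register entry, one root-cause FRU
    {"log": "dimm2:UNCORR_ERR", "root_cause": ["dimm2"]},
    # full-configuration case: the NIC's hardware error also corrupted
    # two memory modules, but the labeled root cause is the NIC
    {"log": "nic0:HW_ERR dimm0:CORR_ERR dimm1:CORR_ERR",
     "root_cause": ["nic0"]},
]

def label_vector(sample, fru_list):
    """One-hot-style target vector over the machine's FRUs."""
    return [1.0 if fru in sample["root_cause"] else 0.0 for fru in fru_list]
```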
  • the training sets can be determined from observations. Feedback can come from computing devices put into implementation or from test units.
  • the feedback can be used as training data to update the parameters for the deep learning model.
  • Various approaches can be used to implement the deep learning approach to update parameters on the error analysis platform 250 , for example, RMSprop, Adagrad, Adam, etc.
  • gradient descent optimization algorithms can be used.
  • the BMC 114 can receive the updated parameters for the deep learning model 116 from the error analysis platform 250 based on the error log and the information regarding the deconfigured FRU.
  • the updated parameters may include log and information regarding other deconfigured FRUs from other devices with local BMCs 260 .
  • a new error log associated with that error condition can be processed as discussed above using the updated parameters.
  • each of the computing device 102 and the devices with local BMCs 260 can have a common technology platform.
  • each of the devices may be part of a same series server line.
  • particular FRUs may be tested for use with that common technology platform to provide sample training information.
  • newly seen FRUs may create new training information as part of feedback.
  • the error analysis platform 250 may be communicatively coupled to the BMC 114 .
  • the error analysis platform 250 is on a separate network, but feedback can be provided via a message (e.g., email or via an API) and updated parameters may be provided in a similar way (e.g., an update file provided via an administrator device). Because access to BMCs 114 can be via a separate control network, the access between the error analysis platform 250 and the BMCs need not be constant.
  • the deep learning model 116 can be trained using training data.
  • the training data may include an error log entry and an identification of the FRU(s) that were the root cause of an error associated with the error log entry.
  • the training data may include static data of error log information and root cause FRU identification.
  • the deep learning parameters can be trained using a deep learning approach.
  • the training can involve determination of a change to each parameter based on training information.
  • Examples of such learning algorithms include gradient descent, various approaches used by DistBelief, Project Adam, and Hama, and stochastic gradient descent by backpropagation, among others.
  • each worker (e.g., a central processing unit (CPU) or graphical processing unit (GPU)) iteratively processes new training data from its subset of batches of the training data.
  • the workers communicate by exchanging gradient updates.
  • a parameter server is used to provide each of the workers the same model parameters.
  • the error analysis platform can be implemented over a number of computing devices.
  • each worker receives a subset of training data and a full set of model parameters for each iteration of training.
  • every worker sends a pull request to the parameter server and gets the latest copy of the parameters W, which might contain a number of floating-point values for a deep learning model.
  • Each copy of the parameters on each device is called a model replica.
  • Each model replica works on a different input training data subset.
  • each subset can contain error log information including an identification of one or more FRUs associated with the information and status registers that provide additional information (e.g., state information, error conditions, etc.).
  • Each model replica calculates its data gradients (in an example with three workers, ∇D1, ∇D2, ∇D3) with its own mini-batch input and sends the gradients back (usually a push request) to the parameter server.
  • the parameter server gathers the gradients from all the workers, calculates the average of the gradient, and updates the model accordingly.
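The gather-average-update cycle described above can be sketched as follows; the fixed learning rate and the in-memory representation are simplifications of a real distributed parameter server.

```python
def parameter_server_step(params, worker_gradients, lr=0.01):
    """Average gradients pushed by the workers and update the model.

    params: list of floats W held by the parameter server.
    worker_gradients: one gradient list per worker (e.g. the three
    workers' gradients above), each the same length as params.
    """
    n = len(worker_gradients)
    # element-wise average of the workers' gradients
    avg = [sum(g[i] for g in worker_gradients) / n for i in range(len(params))]
    # standard gradient-descent update with the averaged gradient
    return [w - lr * a for w, a in zip(params, avg)]
```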
  • the deep learning model 116 can be initially trained using predefined training data and then updated based on real world feedback.
  • a communication network can be used to communicatively couple the computing device with other computing devices and/or the error analysis platform.
  • the communication network can use wired communications, wireless communications, or combinations thereof.
  • the communication network can include multiple sub communication networks such as data networks, wireless networks, telephony networks, etc.
  • Such networks can include, for example, a public data network such as the Internet, local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), cable networks, fiber optic networks, combinations thereof, or the like.
  • wireless networks may include cellular networks, satellite communications, wireless LANs, etc.
  • the communication network can be in the form of a direct network link between devices.
  • Various communications structures and infrastructure can be utilized to implement the communication network(s).
  • devices communicate with each other and other components with access to communication networks via a communication protocol or multiple protocols.
  • a protocol can be a set of rules that defines how nodes of the communication network interact with other nodes.
  • communications between network nodes can be implemented by exchanging discrete packets of data or sending messages. Packets can include header information associated with a protocol (e.g., information on the location of the network node(s) to contact) as well as payload information.
  • the BMC 114 can include hardware and/or combinations of hardware and programming to perform functions provided herein, including the “lights-out” and “out-of-band” management services described above.
  • a BMC 114 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device.
  • the BMC 114 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 114 .
  • the BMC 114 may be capable to receive error log information and to deconfigure FRUs 112 .
  • a processor such as a central processing unit (CPU) 110 or a microprocessor suitable for retrieval and execution of instructions and/or electronic circuits can be configured to perform the functionality for the computing device 102 separately from the BMC 114 .
  • FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example.
  • FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example.
  • although execution of method 300 is described below with reference to BMC 400, other suitable components for execution of method 300 can be utilized (e.g., computing device 102 ).
  • Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 420 , and/or in the form of electronic circuitry.
  • the BMC 400 may be part of a computing device with multiple FRUs. As noted above, BMC 400 can provide the so-called “lights-out” functionality and “out-of-band” services described for BMC 114 , and can run on auxiliary power so the computing device need not be powered on to an on state.
  • a BMC 400 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device.
  • the BMC 400 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 400 .
  • an auxiliary state is a state where the BMC 400 is capable of functionality while a main subsystem of the computing device is not capable of functionality (e.g., when the computing device is powered off, but plugged in, when the main subsystem is in an error condition state, etc.).
  • the BMC 400 may host a web server that allows for communications via the network interface.
  • the BMC 400 can have access to system logs.
  • the BMC 400 can process system logs to determine a root cause for the error condition based on the deep learning approach.
  • the processing element 410 can execute error condition instructions 422 to determine that an error condition has occurred in the computing device ( 302 ).
  • the BMC 400 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 400 , etc.).
  • the BMC 400 can determine that the error condition is present.
  • the BMC 400 can also receive an error log.
  • the model processing instructions 424 can be executed by the processing element 410 to process the error log according to a deep learning model.
  • the error log can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs or components of the FRUs), an operating system executing on a central processing unit associated with a main subsystem of the computing device, or the like.
  • the error log may include registers.
  • each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc.
  • the error log may identify the particular register or component as well. This can be used to map the information to the deep learning model.
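  • As a minimal sketch of how register information might be mapped to model input, the following encodes one FRU's register dump as a character stream of one-hot input vectors. The vocabulary, field layout, and separator characters are illustrative assumptions, not details from this disclosure:

```python
# Hypothetical encoding of error-log register entries as a character stream.
# The vocabulary, separators, and field layout are illustrative assumptions.

VOCAB = list("0123456789abcdef|;") + ["<End-of-Log>"]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}

def encode_log_entry(fru_id, register_value):
    """Flatten one FRU's register dump into character indices, with ';'
    as a group separator so related characters are processed together."""
    text = f"{fru_id:x}|{register_value:08x};"
    return [CHAR_TO_INDEX[ch] for ch in text]

def one_hot(index, size=len(VOCAB)):
    """Turn a character index into an input vector x_t fed to the gates."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

indices = encode_log_entry(fru_id=3, register_value=0xDEADBEEF)
vectors = [one_hot(i) for i in indices]
```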
  • the processing can be used to determine a score for each of a number of sets of the FRUs of the computing device.
  • a set of FRUs includes one FRU or multiple FRUs.
  • the scores can relate to a probability that deconfiguration of the set of FRUs will remove the error condition.
  • the deep learning model includes updated parameters based on error condition feedback from at least one other device.
  • the deconfiguration instructions 426 can be executed by processing element 410 to deconfigure a first one of the sets of FRUs based on the score associated with the set (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs).
  • the processing element 410 can be caused to deconfigure the FRU.
  • the deconfiguration of the FRU can be implemented by disabling the FRU.
  • the disabling of the FRU can include removing power to the FRU.
  • disabling of the FRU can include removing communications capabilities from the FRU.
  • disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state.
  • a configuration parameter associated with the FRU can be set to indicate to the computing device/FRU that the FRU is not to function.
  • the computing device can be rebooted ( 308 ). Once reboot has occurred, the BMC 400 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 400 performs the test. In another example, the BMC 400 is communicatively coupled to another processor in the main subsystem of the computing system (e.g., a CPU) that is not deconfigured, which is instructed to perform the test.
  • next most probable set of FRUs can be selected to be deconfigured.
  • the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted.
  • the selected set of FRUs can include at least one FRU that was not in the original selection.
  • the selected set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected.
  • Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
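  • The deconfigure-reboot-test cycle described above can be sketched as follows. Note that `score_sets`, `deconfigure`, `reboot`, and `error_persists` are hypothetical callables standing in for the BMC's model processing and hardware control, and `max_attempts` is an assumed safeguard:

```python
# Sketch (under stated assumptions) of the iterative fault-isolation loop:
# deconfigure the most probable FRU set, reboot, test, and retry with the
# next candidate if the error condition persists.

def isolate_fault(error_log, score_sets, deconfigure, reboot, error_persists,
                  max_attempts=3):
    """Deconfigure candidate FRU sets in order of scored probability until
    the error condition clears or the attempts are exhausted."""
    tried = []
    for _ in range(max_attempts):
        # Re-score each time so the model can factor in failed attempts.
        ranked = score_sets(error_log, tried)
        candidate = ranked[0]  # set scored most likely to be the root cause
        deconfigure(candidate)
        reboot()
        if not error_persists():
            return candidate   # error removed; report this set as the cause
        tried.append(candidate)
    return None  # escalate to a technician
```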
  • the error log and the information regarding the deconfiguration of FRUs can be sent to the error analysis platform. This allows for the feedback to be provided to other devices with local BMCs similar to the computing device.
  • the error analysis platform can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device.
  • the BMC 400 can receive updated parameters for the deep learning model from the error analysis platform that take into consideration the error log information and the information about the set of the FRUs that was deconfigured.
  • the updated parameters may also take into consideration other sets of FRUs deconfigured in response to other error conditions associated with other similar computing devices.
  • the FRUs deconfigured from the other similar computing devices may be considered additional training data from the other computing devices that represent real life experiences.
  • the error analysis platform can update parameters for the deep learning model from the information provided and other training data.
  • Processing element 410 may be one or multiple processing units, one or multiple semiconductor-based microprocessors, other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 420 , or combinations thereof.
  • the processing element 410 can be a physical device.
  • the processing element 410 may include multiple cores on a chip, include multiple cores across multiple chips, or combinations thereof.
  • Processing element 410 may fetch, decode, and execute instructions 422 , 424 , 426 to implement method 300 .
  • processing element 410 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 422 , 424 , 426 .
  • Machine-readable storage medium 420 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions.
  • machine-readable storage medium may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like.
  • the machine-readable storage medium can be non-transitory.
  • machine-readable storage medium 420 may be encoded with a series of executable instructions for implementing method 300 .
  • FIG. 5 is a diagram of a deep learning model, according to one example.
  • the deep learning model example includes an LSTM. LSTM can be beneficial for error analysis because of its ability to remember appropriate sequences via gates.
  • An input gate i t 501 controls the amount of input written into a neuron's memory cell at time step t.
  • the error log can provide input.
  • a forget gate f t 503 controls the amount of information to be forgotten from a neuron's memory cell at time step t.
  • a set of characters can be grouped together to update an output and then cleared.
  • the cell c t 505 represents the content of the neuron's memory cell at time step t.
  • the output gate o t 507 controls the amount of information read from the neuron's cell and how much of it contributes to the output at time step t.
  • the output h t 509 represents the output of the cell to the next layer at time step t. This output is also fed back into the same neuron and used in the following time step t+1.
  • x t can be represented by error log+ 511. Error log+ can be considered the input vector to the gates.
  • the input vector can be the same for each gate as shown in FIG. 5 . In some examples, this can include information from the error log plus hidden inputs (e.g., h t 509 before the end of the processing of the error log).
  • b represents a parameter vector from the deep learning model
  • the Ws represent parameter matrices for the deep learning model
  • x represents an input vector
  • h represents an output vector
  • c represents a cell state vector
  • f, i, and o represent gate vectors.
  • equations 1-5 are representative of neurons for an entire layer within FIG. 5 .
  • the W's are matrices.
  • each row in W l for hidden layer l would be mapped to neuron j where j ∈ [1, n].
  • the · operator is a dot product operation.
  • tanh and σ (sigmoid) activation functions are also outlined in equations 6 and 7 for clarity. These functions are applied as element-wise operations on the resulting vectors.
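  • The gate descriptions above follow the standard LSTM formulation. Since equations 1-7 from the figure are not reproduced in the text, the following is a reconstruction using the standard form; the subscript and input-concatenation conventions are assumptions:

```latex
% Reconstruction of equations 1-5 (one layer of LSTM neurons), where
% \odot denotes the element-wise product and [h_{t-1}, x_t] the
% concatenated hidden output and input vector:
\begin{align}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ % input gate (1)
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\ % forget gate (2)
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\ % cell (3)
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ % output gate (4)
h_t &= o_t \odot \tanh(c_t) % output (5)
\end{align}
% Activation functions of equations 6 and 7, applied element-wise:
\begin{align}
\sigma(z) &= \frac{1}{1 + e^{-z}} \\ % (6)
\tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} % (7)
\end{align}
```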
  • Other example LSTM models can be implemented, such as a Gated Recurrent Unit. Further, as noted above, other deep learning models may be used.
  • the example operates through the consumption of characters as input vectors.
  • For the purpose of this example, assume characters that are sourced from an MCE log as input.
  • the BMC can focus on analysis where actions at the system level can be performed. However, the approach can be capable of processing other log types as long as the model is trained with data in the desired format.
  • the neural network can make delayed predictions as it consumes input vectors (consuming one character at a time) by generating <NOP> tags as output for each time step.
  • the output, h t 509 can provide hidden output that can be used as feedback to include in the input vector for the next iteration until a prediction is made.
  • a prediction is eventually made once the BMC processing the log according to the model receives a special <End-of-Log> tag as input.
  • the prediction can go through a softmax processing layer to determine scores that can be used to deconfigure FRUs.
  • the architecture in the example use can be fully connected with the final stage going through a softmax layer, which uses the form in equation 8, P(y_k) = e^(z_k) / Σ_j e^(z_j), in order to obtain confidence levels for replacing each FRU k where k ∈ [1, K] for K replaceable FRUs.
  • the final output is a vector y that has the following format where T is the transpose operator: [<NOP>, CPU 1 , . . . CPU p , DIMM 1 , . . . DIMM d , I/O-slot 1 , . . . I/O-slot s , . . . ] T .
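  • The softmax stage of equation 8 can be sketched as follows. The FRU labels and raw scores are illustrative, and the max-subtraction is a standard numerical-stability step added here, not part of the equation as stated:

```python
import math

# Sketch of the final softmax stage: converting raw output scores z_k into
# confidence levels for replacing each FRU. Labels/scores are illustrative.

def softmax(scores):
    """P(y_k) = e^(z_k) / sum_j e^(z_j), computed with the maximum
    subtracted from each score for numerical stability."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

fru_labels = ["<NOP>", "CPU_1", "DIMM_1", "I/O-slot_1"]
raw_scores = [0.1, 2.3, 1.1, -0.4]
confidences = softmax(raw_scores)
best = fru_labels[confidences.index(max(confidences))]
```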
  • various parts of debugging an error condition of a computing device can be automated using a BMC.
  • the solution can take into account other error conditions found in the field from similar computing devices.
  • the autonomous nature allows for accurate metric reporting on failures in the field while minimizing downtime (e.g. the amount of time it may take to have a technician come out and troubleshoot the computing device).
  • the accurate metric reporting can be fed into the deep learning model to self-improve the automated process.
  • the approach allows for reducing the field replacement costs for FRUs that are unnecessarily replaced in customer systems and personnel costs. Though specific examples of deep learning models are provided, other similar deep learning approaches can be implemented for both training and/or execution using deep learning parameters.

Abstract

Examples disclosed herein relate to a baseboard management controller (BMC) capable of execution while a computing device is powered to an auxiliary state. The BMC is to process an error log according to a deep learning model to determine one of multiple field replaceable units to deconfigure in response to the error condition. The BMC is to deconfigure the field replaceable unit. The computing device is rebooted. In response to the reboot of the computing device, the BMC is to determine whether the error condition persists.

Description

    BACKGROUND
  • Information Technology companies and manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability. High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic. However, some computing devices with the high availability characteristic do become unavailable.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The following detailed description references the drawings, wherein:
  • FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
  • FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example;
  • FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example;
  • FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example; and
  • FIG. 5 is a diagram of a deep learning model, according to one example.
  • Throughout the drawings, identical reference numbers may designate similar, but not necessarily identical, elements. An index number “N” appended to some of the reference numerals may be understood to merely denote plurality and may not necessarily represent the same quantity for each reference numeral having such an index number “N”. Additionally, use herein of a reference numeral without an index number, where such reference numeral is referred to elsewhere with an index number, may be a general reference to the corresponding plural elements, collectively or individually. In another example, an index number of “I,” “M,” etc. can be used in place of index number N.
  • DETAILED DESCRIPTION
  • Information Technology (IT) companies and computer manufacturers are challenged to deliver quality and value to consumers, for example by providing computing devices with high availability. High availability is a characteristic that aims to ensure a level of operational performance, such as uptime for a period higher than a system that does not have the high availability characteristic. However, some computing devices with the high availability characteristic do become unavailable.
  • With today's businesses demanding near real-time analytics on big data in order to conduct their daily transactions, IT companies are constantly being challenged to produce highly complex, yet fault tolerant systems to empower datacenters. As such, having the ability to efficiently diagnose and repair failures of increasingly complex systems can be advantageous. Error analysis tools may be static and could require a user to help determine a root cause of an error. With complex failures, the computing system may need to be shipped back to a lab to determine the cause of the error. There is a time and shipping cost for this type of analysis.
  • Accordingly, various examples provided herein use a deep learning architecture that can autonomously assist IT personnel and field engineers in determining faulty components that may need to be replaced. The examples include usage of Recurrent Neural Networks (RNN) for processing system events to distinguish between the different causes and effects of a given failure, and make the appropriate predictions on which components to replace. A baseboard management controller (BMC) can be used to perform the analysis at the computing device with an error.
  • BMCs provide so-called “lights-out” functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC can run on auxiliary power; thus the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC may provide management and so-called “out-of-band” services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC.
  • As noted, the BMC can have access to system logs. In one example, when an error condition occurs on the computing device, the BMC can process system logs to determine a root cause for the error condition based on the deep learning approach. In some examples, the system logs can come from Field Replaceable Units (FRUs) or be related to the FRUs. As used herein, a field replaceable unit is a circuit board, part, or assembly that can be easily removed from a computing device and replaced by a user or technician without having to send the whole computing device to a repair facility. Examples of FRUs include parts that can attach to other parts of the computing device using a socket, a card, a module, etc. Further, examples of FRUs can include computing modules, memory modules, peripheral cards and devices, etc. In some examples, the system logs can include registers that provide particular information (e.g., an error flag for a particular component, a type of error, a current configuration, a location associated with an error, etc.).
  • The BMC can process the information from the logs according to the deep learning model to determine scores associated with each of a number of the FRUs. The scores can relate to the likelihood that the FRU has responsibility for the error condition. In other examples, the scores can be associated with sets of FRUs. Once each of the logs is processed, the FRU (or set of FRUs) with the highest likelihood of being responsible for the error condition can be deconfigured by the BMC. Once deconfigured, the computing device can be rebooted to determine if the error persists. In some examples, determining whether the error persists can include testing (e.g., initializing memory, writing to and reading back from various locations, etc.). In one example, if the error condition is not removed, the next FRU or set of FRUs likely to be responsible can be deconfigured. This can repeat. Moreover, in some examples, the failure to remove the error condition can be taken into account for re-scoring the FRUs.
  • In one example, if the error condition is removed, the BMC can send information about the logs (e.g., the logs themselves, a condensed version of the logs, etc.) as well as information about the FRU or set of FRUs deconfigured to an error analysis platform. In some examples, the information sent can also include information about deconfigured FRUs that did not cause the error condition. The error analysis platform can take the feedback, along with parameters of a current deep learning model and feedback from other computing devices to update parameters for the deep learning model. The updated parameters can be provided to the BMC and other computing devices.
  • Unlike a static error analysis engine, the approaches described herein are autonomous and can self-learn. In one example, the approach can learn from multiple different computing devices providing feedback. In this example, a set of updated deep learning parameters can be determined and sent back to the computing devices. In another example, the deep learning model can be implemented while processing an error log in a computing device with an error condition. The implementation can also learn from mispredictions of a faulty component or field replaceable unit.
  • Further, the use of a deep neural network can reduce the costs associated with handcrafting complex rules for analyzing and recovering from errors in computing devices that are used in statically defined analyzers. Moreover, static analyzers may suffer from a lack of portability across different platform types and architectures. The approaches described herein offer a simpler approach where deep learning is used to capture mathematical functions for performing error analysis and recovery. A mathematical approach is advantageous because it can be generalized for other platforms and architectures. As noted, parameters from the deep learning model can be updated and provided back to BMCs within computing devices.
  • FIG. 1 is a block diagram of a computing device including a baseboard management controller capable to process an error log according to a deep learning model to determine field replaceable units to deconfigure, according to one example. FIG. 2 is a block diagram of a system including devices each with a baseboard management controller capable to process respective error logs according to a deep learning model to determine field replaceable units to deconfigure, according to one example.
  • In the example of FIG. 1, the computing device 102 includes a central processing unit 110, a number of field replaceable units 112, and a baseboard management controller 114. In some examples, an FRU 112 can include the central processing unit 110. In the example of FIG. 2, the computing device 102 can be included in a system 200 that can also include an error analysis platform 250 that can receive feedback from multiple devices with a local BMC 260 a-260 n. The error analysis platform 250 can take the feedback information to determine updates to parameters for a deep learning model 116 that is used to autonomously diagnose a cause for an error condition of the computing device 102.
  • When an error condition affects the computing device 102, the BMC 114 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 114, etc.). The BMC 114 can determine that an error condition is present. Further, the BMC 114 can use an error log 218 to analyze the error condition of the computing device 102. The error log 218 can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs 112 or components of the FRUs 112), an operating system executing on the central processing unit 110, or the like. In one example, the error log may include registers. In the example, each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc. The error log may identify the particular register or component as well. This can be used to map the information to the deep learning model. The functionality of the BMC 114 described herein can be implemented by executing instructions stored in memory 232.
  • As noted, in one example, the processing of the error log 218 can include processing using the deep learning model 116. Various deep learning models can be used. Examples of deep learning models include long short-term memory (LSTM), convolutional neural networks, recurrent neural networks, neural history compressors, recursive neural networks, gated recurrent units (GRU), etc. An advantage of a recurrent neural network is the inclusion of feedback. An example of one implementation of using an LSTM approach as the deep learning model 116 is provided in the explanation corresponding to FIG. 5. The parameters used for the deep learning model 116 can be updated based on feedback from the computing device 102 or other devices with local BMCs 260 as discussed herein.
  • The deep learning model 116 can be applied to determine one of the FRUs 112 or a set of the FRUs 112 that can be deconfigured in response to the error condition. When the BMC 114 processes the error log 218 according to the deep learning model 116, a score can be assigned to each of the FRUs 112 and/or to sets of FRUs 112. The scores can relate to the probability that the FRU or set of FRUs 112 is a root cause for the error condition.
  • In one example model, the error log can be processed as characters. In the example model, characters can represent registers associated with dumps from FRU components or systems logs. In one example, each character can be considered an input vector. When a character is processed, each of the scores for the FRUs can be updated. The updated scores can be included as an input vector along with the next character. The processing can continue until a character represents an end of the log. In an LSTM model, characters can be broken up by special characters and taken as a group. For example, a first character may identify an FRU's log, a second, third, and fourth character may include log register information, and a special character (fifth character) may indicate that the information about the FRU's log is over. In this example, the five characters are meant to be processed together.
  • Once the information is processed, the information may be forgotten (though the updated scores remain) and a next set of characters can be read to update the scores for the FRUs. In some examples, the scores can be used to rank the probability that each of the FRUs or sets of FRUs is a root cause of the error condition. For example, a softmax function may be used to organize the scores (e.g., the softmax function can be used to normalize the vectors into real values in the range of [0, 1] that add up to 1).
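  • The character-at-a-time consumption described above can be sketched as a simple inference loop. Here `step` is a hypothetical stand-in for one forward pass of the trained recurrent model, and the <NOP>/<End-of-Log> tags follow the behavior described for FIG. 5:

```python
# Illustrative sketch: the model emits <NOP> output for each character and
# only produces a prediction once the <End-of-Log> marker is consumed.
# `step` and `initial_state` are hypothetical stand-ins for the model.

def process_error_log(characters, step, initial_state):
    """Feed one character per time step; the hidden state carries score
    information forward until the end-of-log tag triggers the prediction."""
    state = initial_state
    outputs = []
    for ch in characters:
        state, out = step(state, ch)
        outputs.append(out)
        if ch == "<End-of-Log>":
            break  # final output can now go through the softmax layer
    return state, outputs
```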
  • One of the FRUs or sets of FRUs can be selected based on the analysis (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs). The BMC 114 can be caused to deconfigure the FRU. In some examples, the deconfiguration of the FRU can be implemented by disabling the FRU. In one example, the disabling of the FRU can include removing power to the FRU. In another example, disabling of the FRU can include removing communications capabilities from the FRU. In a further example, disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state.
  • Once the FRU(s) selected is deconfigured, the computing device 102 can be rebooted. Once reboot has occurred, the BMC 114 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 114 performs the test. In another example, the BMC 114 is communicatively coupled to another processor (e.g., CPU 110), which is instructed to perform the test.
  • If the error condition persists, the next most probable FRU or set of FRUs can be selected to be deconfigured. In one example, the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the FRUs/sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted. The selected FRU or set of FRUs can include at least one FRU that was not in the original selection. The selected FRU or set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected. Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
  • If the error condition does not persist after a reboot, the error log and the information regarding the deconfiguration can be sent to the error analysis platform 250. This allows for the feedback to be provided to other devices with local BMCs 260 similar to the computing device. The error analysis platform 250 can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real world experience of the computing device 102.
  • The error analysis platform 250 can update the parameters for the deep learning model 116 for the computing device 102. The parameters can also be used in the other devices with local BMCs 260. The updated parameters can be sent by the error analysis platform 250 back to the devices that can use the updated parameters for future error log processing.
  • The deep learning model can be trained on the error analysis platform 250 or another platform. The training may include initial error log impressions from a technical expert making the training sets based on error log entries (e.g., an error log entry of a register indicating that a memory module has an unrecoverable hard error may be trained to indicate that the memory module is a root cause for that error). Similarly, full system configurations can be added to the sample sets as well. For example, in a configuration where a peripheral network card FRU has a hardware error but two other FRUs (e.g., memory modules) have errors that were caused by the network card FRU, the root cause may be trained to be the network card (for that specific case). The training sets can be determined from observations. Feedback can come from computing devices put into implementation or from test units. As noted, the feedback can be used as training data to update the parameters for the deep learning model. Various approaches can be used to implement the deep learning approach to update parameters on the error analysis platform 250, for example, RMSprop, Adagrad, Adam, etc. In one example, gradient descent optimization algorithms can be used.
  • As such, the BMC 114 can receive the updated parameters for the deep learning model 116 from the error analysis platform 250 based on the error log and the information regarding the deconfigured FRU. Similarly, the updated parameters may include log and information regarding other deconfigured FRUs from other devices with local BMCs 260. When another error condition occurs on the computing device or one of the other devices with local BMCs 260 capable of implementing this approach, a new error log associated with that error condition can be processed as discussed above using the updated parameters.
  • In some examples, each of the computing device 102 and the devices with local BMCs 260 can have a common technology platform. For example, each of the devices may be part of a same series server line. Moreover, particular FRUs may be tested for use with that common technology platform to provide sample training information. In some examples, newly seen FRUs may create new training information as part of feedback. In one example, the error analysis platform 250 may be communicatively coupled to the BMC 114. In another example, the error analysis platform 250 is on a separate network, but feedback can be provided via a message (e.g., email or via an API) and updated parameters may be provided in a similar way (e.g., an update file provided via an administrator device). Because access to BMCs 114 can be via a separate control network, the access between the error analysis platform 250 and the BMCs need not be constant.
  • The deep learning model 116 can be trained using training data. In one example, the training data may include an error log entry and an identification of the FRU(s) that were the root cause of an error associated with the error log entry. In other examples, the training data may include static data of error log information and root cause FRU identification.
  • The deep learning parameters can be trained using a deep learning approach. The training can involve determination of a change to each parameter based on training information. Examples of such learning algorithms include gradient descent, various approaches used by DistBelief, Project Adam, and Hama, and stochastic gradient descent with backpropagation, among others.
  • A commonly used technique in distributed deep learning for both convolutional neural network and recurrent neural network models is data parallelism. In this example, each worker (e.g., a central processing unit (CPU) or graphics processing unit (GPU)) receives a subset of a batch of training data. Each worker iteratively processes new training data from its subset of batches of the training data. The workers communicate by exchanging gradient updates. A parameter server is used to provide each of the workers the same model parameters. As such, in some examples, the error analysis platform can be implemented over a number of computing devices.
  • The following is an example model of distributed deep learning. In this example of distributed deep learning, each worker receives a subset of training data and a full set of model parameters for each iteration of training. At the beginning of one iteration, every worker sends a pull request to the parameter server and gets a latest copy of the parameters W, which might contain a number of floating-point values for a deep learning model. Each copy of the parameters on each device is called a model replica. Each model replica works on a different input training data subset. For example, each subset can contain error log information including an identification of one or more FRUs associated with the information and status registers that provide additional information (e.g., state information, error conditions, etc.).
  • Each model replica calculates its data gradients (in an example with three workers ΔD1, ΔD2, ΔD3) with its own mini-batch input and sends the gradients back (usually a push request) to the parameter server. The parameter server gathers the gradients from all the workers, calculates the average of the gradient, and updates the model accordingly. For example, a new W′ can equal the previous W plus a learning rate times an average of the data gradients. Shown as an equation, the new W′ can be expressed as W′=W+learning rate*average (ΔD1, ΔD2, ΔD3). The deep learning model 116 can be initially trained using predefined training data and then updated based on real world feedback.
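The parameter-server update above can be sketched in a few lines. This is a minimal illustration of the stated rule W′ = W + learning rate * average(ΔD1, ΔD2, ΔD3); the array shapes and toy gradient values are assumptions for demonstration:

```python
import numpy as np

def server_update(W, worker_gradients, learning_rate):
    """Average the workers' data gradients and update the parameters W,
    following the update rule described in the text."""
    avg = np.mean(worker_gradients, axis=0)
    return W + learning_rate * avg

# Three model replicas each push a gradient from their own mini-batch.
W = np.zeros(4)                              # current parameters on the server
dD1 = np.array([1.0, 2.0, 3.0, 4.0])
dD2 = np.array([3.0, 2.0, 1.0, 0.0])
dD3 = np.array([2.0, 2.0, 2.0, 2.0])

W_new = server_update(W, [dD1, dD2, dD3], learning_rate=0.1)
# average gradient is [2, 2, 2, 2], so W' = [0.2, 0.2, 0.2, 0.2]
```

Each worker would then pull `W_new` at the start of its next iteration, keeping all model replicas synchronized.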
  • A communication network can be used to communicatively couple the computing device with other computing devices and/or the error analysis platform. The communication network can use wired communications, wireless communications, or combinations thereof. Further, the communication network can include multiple sub communication networks such as data networks, wireless networks, telephony networks, etc. Such networks can include, for example, a public data network such as the Internet, local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), cable networks, fiber optic networks, combinations thereof, or the like. In certain examples, wireless networks may include cellular networks, satellite communications, wireless LANs, etc. Further, the communication network can be in the form of a direct network link between devices. Various communications structures and infrastructure can be utilized to implement the communication network(s).
  • By way of example, devices communicate with each other and other components with access to communication networks via a communication protocol or multiple protocols. A protocol can be a set of rules that defines how nodes of the communication network interact with other nodes. Further, communications between network nodes can be implemented by exchanging discrete packets of data or sending messages. Packets can include header information associated with a protocol (e.g., information on the location of the network node(s) to contact) as well as payload information.
  • The BMC 114 can include hardware and/or combinations of hardware and programming to perform functions provided herein. As noted, the BMC 114 can provide so-called "lights-out" functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC 114 can run on auxiliary power; thus, the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC 114 may provide management and so-called "out-of-band" services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC 114 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC 114 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 114. Moreover, as described herein, the BMC 114 may be capable of receiving error log information and deconfiguring FRUs 112.
  • A processor, such as a central processing unit (CPU) 110 or a microprocessor suitable for retrieval and execution of instructions, and/or electronic circuits can be configured to perform the functionality for the computing device 102 separately from the BMC 114.
  • FIG. 3 is a flowchart of a method for deconfiguring field replaceable units by a baseboard management controller according to a deep learning model, according to an example. FIG. 4 is a block diagram of a baseboard management controller capable of deconfiguring field replaceable units according to a deep learning model in response to an error condition, according to an example. Although execution of method 300 is described below with reference to BMC 400, other suitable components for execution of method 300 can be utilized (e.g., computing device 102). Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 420, and/or in the form of electronic circuitry.
  • The BMC 400 may be part of a computing device with multiple FRUs. As noted above, BMC 400 can provide so-called "lights-out" functionality for computing devices. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the computing device even if an operating system is not installed or not functional on the computing device. Moreover, in one example, the BMC 400 can run on auxiliary power; thus, the computing device need not be powered on to an on state where control of the computing device is handed over to an operating system after boot. As examples, the BMC 400 may provide management and so-called "out-of-band" services, such as remote console access, remote reboot and power management functionality, access to system logs, and the like. As used herein, a BMC 400 has management capabilities for sub-systems of a computing device, and is separate from a processor that executes a main operating system of a computing device. The BMC 400 may comprise an interface, such as a network interface, and/or serial interface that an administrator can use to remotely communicate with the BMC 400. As used herein, an auxiliary state is a state where the BMC 400 is capable of functionality while a main subsystem of the computing device is not capable of functionality (e.g., when the computing device is powered off but plugged in, when the main subsystem is in an error condition state, etc.). In some examples, the BMC 400 may host a web server that allows for communications via the network interface.
  • As noted, the BMC 400 can have access to system logs. In one example, when an error condition occurs on the computing device, the BMC 400 can process system logs to determine a root cause for the error condition based on the deep learning approach. The processing element 410 can execute error condition instructions 422 to determine that an error condition has occurred in the computing device (302). When the error condition affects the computing device, the BMC 400 can be notified (e.g., via an interrupt, a change in status of memory polled by the BMC 400, etc.). The BMC 400 can determine that the error condition is present.
  • The BMC 400 can also receive an error log. At 304, the model processing instructions 424 can be executed by the processing element 410 to process the error log according to a deep learning model. The error log can be a single error log or can include multiple error logs, for example error logs retrieved from or read from various hardware devices (e.g., FRUs or components of the FRUs), an operating system executing on a central processing unit associated with a main subsystem of the computing device, or the like. In one example, the error log may include registers. In the example, each of the registers or other error log information can relate to particular information, for example, relate to a particular FRU, relate to particular components of the FRU, relate to particular errors or conditions of the FRU, etc. The error log may identify the particular register or component as well. This can be used to map the information to the deep learning model. As noted above, the processing can be used to determine a score for each of a number of sets of the FRUs of the computing device. As used herein, a set of FRUs includes one FRU or multiple FRUs. As described above, the scores can relate to a probability to remove the error condition by deconfiguration of the set of FRUs. In some examples, the deep learning model includes updated parameters based on error condition feedback from at least one other device.
  • At 306, the deconfiguration instructions 426 can be executed by processing element 410 to deconfigure a first one of the sets of FRUs based on the score associated with the set (e.g., the set of FRUs scored to have the highest probability to be the root cause of the error condition compared to the other FRUs). The processing element 410 can be caused to deconfigure the FRU. In some examples, the deconfiguration of the FRU can be implemented by disabling the FRU. In one example, the disabling of the FRU can include removing power to the FRU. In another example, disabling of the FRU can include removing communications capabilities from the FRU. In a further example, disabling of the FRU can include putting the FRU in a disconnected hot plug or hot swap state. For example, a configuration parameter associated with the FRU can be set to indicate to the computing device/FRU that the FRU is not to function.
  • Once the set of FRUs selected is deconfigured, the computing device can be rebooted (308). Once reboot has occurred, the BMC 400 can determine whether the error condition persists. In one example, a test can be performed to determine whether the error condition persists. The test can be directed to the FRU or the computing device in general. In one example, the BMC 400 performs the test. In another example, the BMC 400 is communicatively coupled to another processor in the main subsystem of the computing system (e.g., a CPU) that is not deconfigured, which is instructed to perform the test.
  • If the error condition persists, the next most probable set of FRUs can be selected to be deconfigured. In one example, the next most probable FRU is determined by determining a new score relating to the probability of failure for each of the sets of FRUs as part of processing the error log again, but this time with the additional information that the previous attempt failed and the error condition persisted. The selected set of FRUs can include at least one FRU that was not in the original selection. The selected set of FRUs is deconfigured and the computing device can be rebooted and tested again. If the error condition continues to persist, the next set(s) of FRUs can be selected. Various approaches can be used to select the FRU or set of FRUs, for example, Q-learning, Markov Decision Process (MDP), etc.
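The score/deconfigure/reboot/retest loop described above can be sketched as follows. All of the callbacks (`deconfigure`, `reboot`, `error_persists`, `rescore`) are hypothetical stand-ins for BMC operations and model reprocessing, not a real BMC API:

```python
# Illustrative sketch of the fault-isolation loop: deconfigure the
# highest-scoring set of FRUs, reboot, retest, and move to the next
# candidate (after rescoring with the failure feedback) if the error
# condition persists.

def isolate_fault(score_sets, deconfigure, reboot, error_persists, rescore):
    """score_sets maps a frozenset of FRU names to a probability score."""
    tried = []
    scores = dict(score_sets)
    while scores:
        candidate = max(scores, key=scores.get)  # most probable root cause
        deconfigure(candidate)
        reboot()
        tried.append(candidate)
        if not error_persists():
            return tried          # success: report as feedback to the platform
        del scores[candidate]
        scores = rescore(scores, tried)  # reprocess log with failure feedback
    return tried

# Toy demonstration: the true fault is DIMM1, but the NIC scores higher,
# so two deconfigure/reboot cycles are needed.
deconfigured = []
faulty = frozenset({"DIMM1"})
order = isolate_fault(
    {frozenset({"NIC"}): 0.7, frozenset({"DIMM1"}): 0.2},
    deconfigure=deconfigured.append,
    reboot=lambda: None,
    error_persists=lambda: faulty not in deconfigured,
    rescore=lambda scores, tried: scores,
)
```

In a real system the `rescore` step would rerun the deep learning model with the persistence information, and could be driven by Q-learning or an MDP as the text notes.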
  • If the error condition does not persist after a reboot, the error log and the information regarding the deconfiguration of FRUs can be sent to the error analysis platform. This allows for the feedback to be provided to other devices with local BMCs similar to the computing device. The error analysis platform can use the information as part of a new sample set to provide to the deep learning model to update parameters based on the real-world experience of the computing device. The BMC 400 can receive updated parameters for the deep learning model from the error analysis platform that take into consideration the error log information and the information about the set of the FRUs that was deconfigured. The updated parameters may also take into consideration other sets of FRUs deconfigured in response to other error conditions associated with other similar computing devices. The FRUs deconfigured from the other similar computing devices may be considered additional training data from the other computing devices that represent real-life experiences. As noted above, the error analysis platform can update parameters for the deep learning model from the information provided and other training data.
  • Processing element 410 may be one or multiple processing units, one or multiple semiconductor-based microprocessors, other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 420, or combinations thereof. The processing element 410 can be a physical device. Moreover, in one example, the processing element 410 may include multiple cores on a chip, include multiple cores across multiple chips, or combinations thereof. Processing element 410 may fetch, decode, and execute instructions 422, 424, 426 to implement method 300. As an alternative or in addition to retrieving and executing instructions, processing element 410 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 422, 424, 426.
  • Machine-readable storage medium 420 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, machine-readable storage medium may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like. As such, the machine-readable storage medium can be non-transitory. As described in detail herein, machine-readable storage medium 420 may be encoded with a series of executable instructions for implementing method 300.
  • FIG. 5 is a diagram of a deep learning model, according to one example. The deep learning model example includes an LSTM. LSTM can be beneficial for error analysis because of its ability to remember appropriate sequences via gates. An input gate i_t 501 controls the amount of input written into a neuron's memory cell at time step t. In this scenario, the error log can provide input. A forget gate f_t 503 controls the amount of information to be forgotten from a neuron's memory cell at time step t. With this approach, a set of characters can be grouped together to update an output and then cleared. The cell c_t 505 represents the content of the neuron's memory cell at time step t. The output gate o_t 507 controls the amount of information read from the neuron's cell and how much of it contributes to the output at time step t. The output h_t 509 represents the output of the cell to the next layer at time step t. This output is also fed back into the same neuron and used in the following time step t+1. In the example of FIG. 5, x_t can be represented by error log+ 511. Error log+ can be considered the input vector to the gates. The input vector can be the same for each gate as shown in FIG. 5. In some examples, this can include information from the error log plus hidden inputs (e.g., h_t 509 before the end of the processing of the error log).
  • The following equations can be used to implement one example LSTM model:
i_t = σ(W_xi x_t + W_hi h_t−1 + b_i) (Eq. 1)
f_t = σ(W_xf x_t + W_hf h_t−1 + b_f) (Eq. 2)
o_t = σ(W_xo x_t + W_ho h_t−1 + b_o) (Eq. 3)
c_t = f_t ⊙ c_t−1 + i_t ⊙ tanh(W_xc x_t + W_hc h_t−1 + b_c) (Eq. 4)
h_t = o_t ⊙ tanh(c_t) (Eq. 5)
σ(z) = 1/(1 + e^−z) (Eq. 6)
tanh(z) = 2σ(2z) − 1 (Eq. 7)
In the equation set, the b's represent parameter vectors from the deep learning model, the W's represent parameter matrices for the deep learning model, x represents an input vector, h represents an output vector, c represents a cell state vector, and f, i, and o represent gate vectors.
  • Note that equations 1-5 are representative of neurons for an entire layer within FIG. 5. This implies that i_t, f_t, o_t, c_t, h_t, h_t−1, and x_t are vectors. In the example, the W's are matrices. In other words, if a given matrix W is augmented to include the weights for both x and h such that its dimensions become n×m, then each row in W^l for hidden layer l would be mapped to neuron j where j ∈ [1, n]. Moreover, the ⊙ operator denotes element-wise (Hadamard) multiplication.
  • The tanh and σ (sigmoid) activation functions are also outlined in equations 6 and 7 for clarity. These functions are applied as element-wise operations on the resulting vectors. Other gated recurrent models can be implemented, such as a Gated Recurrent Unit. Further, as noted above, other deep learning models may be used.
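A single LSTM time step following equations 1-7 above can be sketched as follows. The hidden size, input size, and the randomly initialized weights are illustrative assumptions; a trained model would supply real parameter matrices W and bias vectors b:

```python
import numpy as np

def sigmoid(z):
    # Eq. 6: sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps gate name -> (W_x*, W_h*) matrices;
    b maps gate name -> bias vector. '*' is element-wise (Hadamard) product."""
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + b["i"])   # Eq. 1
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + b["f"])   # Eq. 2
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + b["o"])   # Eq. 3
    c_t = f_t * c_prev + i_t * np.tanh(
        W["c"][0] @ x_t + W["c"][1] @ h_prev + b["c"])             # Eq. 4
    h_t = o_t * np.tanh(c_t)                                       # Eq. 5
    return h_t, c_t

rng = np.random.default_rng(0)
n, m = 3, 4   # hidden size, input size (e.g., an encoded log character)
W = {g: (rng.normal(size=(n, m)), rng.normal(size=(n, n))) for g in "ifoc"}
b = {g: np.zeros(n) for g in "ifoc"}

x_t = rng.normal(size=m)                       # one input vector (error log+)
h_t, c_t = lstm_step(x_t, np.zeros(n), np.zeros(n), W, b)
```

Processing an error log would repeat `lstm_step` over the character sequence, feeding `h_t` and `c_t` back in at each step as the text describes.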
  • The example operates through the consumption of characters as input vectors. For the purpose of this example, assume characters that are sourced from an MCE log as input. In the example, the BMC can focus on analysis where actions at the system level can be performed. However, the approach can process other log types as long as the model is trained with data in the desired format. The neural network can make delayed predictions as it consumes input vectors (consuming one character at a time) by generating <NOP> tags as output for each time step. As noted, the output, h_t 509, can provide hidden output that can be used as feedback to include in the input vector for the next iteration until a prediction is made. A prediction is eventually made once the BMC processing the log according to the model receives a special <End-of-Log> tag as input. As noted above, in some examples, the prediction can go through a softmax processing layer to determine scores that can be used to deconfigure FRUs.
  • Further, the architecture in the example can be fully connected, with the final stage going through a softmax layer of the form in equation 8, P(y_k|z) = e^(z_k)/Σ_j e^(z_j), in order to obtain confidence levels for replacing each FRU k where k ∈ [1, K] for K replaceable FRUs. As such, the final output is a vector y that has the following format, where T is the transpose operator: [<NOP>, CPU1, . . . CPUp, DIMM1, . . . DIMMd, I/O-slot1, . . . I/O-slots, . . . ]^T.
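The final softmax stage of equation 8 can be sketched as follows. The FRU names and the toy logit values are illustrative assumptions; in practice the logits would come from the fully connected layer after the <End-of-Log> tag:

```python
import numpy as np

def softmax(z):
    # Eq. 8: P(y_k|z) = e^(z_k) / sum_j e^(z_j).
    # Subtracting the max logit first is a standard numerical-stability trick
    # that does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Output vector format from the text: <NOP> plus one entry per replaceable FRU.
frus = ["<NOP>", "CPU1", "DIMM1", "DIMM2", "I/O-slot1"]
logits = np.array([0.1, 0.5, 3.0, 0.5, 0.2])   # hypothetical network outputs

confidence = softmax(logits)                   # sums to 1.0
best = frus[int(np.argmax(confidence))]        # highest-confidence FRU
```

The BMC would then use these confidence levels as the scores for selecting which set of FRUs to deconfigure first.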
  • With the approaches described herein, various parts of debugging an error condition of a computing device can be automated using a BMC. The solution can take into account other error conditions found in the field from similar computing devices. Moreover, the autonomous nature allows for accurate metric reporting on failures in the field while minimizing downtime (e.g., the amount of time it may take to have a technician come out and troubleshoot the computing device). The accurate metric reporting can be fed into the deep learning model to self-improve the automated process. Moreover, the approach allows for reducing the field replacement costs for FRUs that are unnecessarily replaced in customer systems, as well as personnel costs. Though specific examples of deep learning models are provided, other similar deep learning approaches can be implemented for both training and/or execution using deep learning parameters.
  • While certain implementations have been shown and described above, various changes in form and details may be made. For example, some features that have been described in relation to one implementation and/or process can be related to other implementations. In other words, processes, features, components, and/or properties described in relation to one implementation can be useful in other implementations. Furthermore, it should be appreciated that the systems and methods described herein can include various combinations and/or sub-combinations of the components and/or features of the different implementations described. Thus, features described with reference to one or more implementations can be combined with other implementations described herein.

Claims (19)

What is claimed is:
1. A computing device comprising:
a central processing unit element;
a plurality of field replaceable units; and
a baseboard management controller capable of execution while the computing device is powered to an auxiliary state to:
process an error log according to a deep learning model to determine one of the field replaceable units to deconfigure in response to an error condition;
deconfigure the one field replaceable unit;
reboot the computing device; and
in response to the reboot of the computing device, determine whether the error condition persists.
2. The computing device of claim 1, wherein the error condition does not persist; the baseboard management controller further to:
send the error log and information regarding the deconfiguration of the one field replaceable unit to an error analysis platform.
3. The computing device of claim 2, wherein the baseboard management controller is further to:
receive updated parameters for the deep learning model from the error analysis platform based on the error log and information.
4. The computing device of claim 3, wherein the baseboard management controller is further to:
process another error log in response to another error condition to determine another of the field replaceable units to deconfigure in response to the other error condition based on the updated parameters.
5. The computing device of claim 1, wherein the baseboard management controller is further to:
determine a score relating to a probability of failure for each of the field replaceable units as part of the processing of the error log,
wherein the one field replaceable unit is determined based on the score corresponding to the one field replaceable unit.
6. The computing device of claim 5, wherein the baseboard management controller is further to:
determine that the error condition persists after reboot;
determine another score relating to the probability of failure for each of the field replaceable units based on the deep learning model, the error log, and the error condition persistence; and
determine a set of the field replaceable units including another one of the field replaceable units based on the other scores.
7. The computing device of claim 1, wherein the baseboard management controller is further to:
receive a set of updated parameters for the deep learning model from an error analysis platform, wherein the updated parameters are based on feedback from another computing device,
wherein the determination of the one field replaceable unit is based on the updated parameters.
8. The computing device of claim 1, wherein the one field replaceable unit has a highest probability to have caused the error condition compared to the other field replaceable units.
9. The computing device of claim 1, wherein the deep learning model includes a long short-term memory neural network.
10. A method comprising:
determining that an error condition has occurred in a computing device by a baseboard management controller capable of execution while the computing device is powered to an auxiliary state, wherein the computing device includes a plurality of field replaceable units;
processing an error log at the baseboard management controller according to a deep learning model to determine a score for each of a plurality of sets of the field replaceable units,
wherein the scores relate to a probability to remove the error condition by deconfiguration of the set of field replaceable units;
deconfiguring a set of the field replaceable units based on the respective score for the set;
rebooting the computing device; and
in response to the reboot, determining whether the error condition persists.
11. The method of claim 10, wherein the error condition does not persist, the method further comprising:
sending, by the baseboard management controller, information about the error log and information about the set of field replaceable units deconfigured to an error analysis platform.
12. The method of claim 11, further comprising:
receiving, by the baseboard management controller, from the error analysis platform, an updated set of parameters for the deep learning model based on the information about the error log, the information about the set of field replaceable units deconfigured, and additional training data about other computing devices.
13. The method of claim 12, further comprising:
processing, by the baseboard management controller, another error log in response to another error condition to determine another set of the field replaceable units to deconfigure in response to the other error condition based on the updated parameters and the deep learning model.
14. The method of claim 13, further comprising:
determining, at the error analysis platform, the updated set of parameters for the deep learning model based on the information about the error log, the information about the set of field replaceable units deconfigured, and the additional training data received from other computing devices.
15. The method of claim 10, further comprising:
determining, by the baseboard management controller, that the error condition persists after reboot;
determining another score relating to the probability of failure for a plurality of the sets of field replaceable units based on the deep learning model, the error log, and the error condition persistence;
determining a second set of the field replaceable units including another one of the field replaceable units not in the previously deconfigured set of field replaceable units based on the other scores;
deconfiguring the second set of field replaceable units;
performing another reboot of the computing device; and
in response to the other reboot, determining whether the error condition remains persistent.
16. A non-transitory machine-readable storage medium storing instructions that, if executed by a physical baseboard management controller processing element of a device, cause the device to:
determine that an error condition has occurred in the device,
wherein the baseboard management controller element is capable of execution while the device is powered to an auxiliary state,
wherein the device includes a plurality of field replaceable units;
process an error log according to a deep learning model to determine a score for a plurality of sets of the field replaceable units,
wherein the deep learning model includes updated parameters based on error condition feedback from at least one other device, and
wherein the scores relate to a probability to remove the error condition by deconfiguration of the set of field replaceable units;
deconfigure a first one of the sets of the field replaceable units based on the score respective to the first set;
reboot the computing device; and
determine whether the error condition persists after the reboot.
17. The non-transitory machine-readable storage medium of claim 16, further comprising instructions that, if executed by the physical baseboard management controller processing element, cause the device to:
determine that the error condition persists after the reboot;
determine another score relating to the probability of failure for a plurality of the sets of field replaceable units based on the deep learning model, the error log, and the error condition persistence;
determine a second set of the field replaceable units including another one of the field replaceable units not in the first set based on the other scores;
deconfigure the second set of field replaceable units;
perform another reboot of the computing device; and
in response to the other reboot, determine whether the error condition remains persistent.
18. The non-transitory machine-readable storage medium of claim 16, further comprising instructions that, if executed by the physical baseboard management controller processing element, cause the device to:
in response to a determination that the error condition does not remain persistent after the other reboot, provide information about the error log and information about the second set of field replaceable units to an error analysis platform.
19. The non-transitory machine-readable storage medium of claim 16, further comprising instructions that, if executed by the physical baseboard management controller processing element, cause the device to:
receive an updated set of parameters for the deep learning model from the error analysis platform that takes into consideration the error log information and information about the second set of field replaceable units.
US15/463,713 2017-03-20 2017-03-20 Baseboard management controller to deconfigure field replaceable units according to deep learning model Active 2037-07-12 US10552729B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/463,713 US10552729B2 (en) 2017-03-20 2017-03-20 Baseboard management controller to deconfigure field replaceable units according to deep learning model


Publications (2)

Publication Number Publication Date
US20180267858A1 true US20180267858A1 (en) 2018-09-20
US10552729B2 US10552729B2 (en) 2020-02-04

Family

ID=63520096

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/463,713 Active 2037-07-12 US10552729B2 (en) 2017-03-20 2017-03-20 Baseboard management controller to deconfigure field replaceable units according to deep learning model

Country Status (1)

Country Link
US (1) US10552729B2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427371A (en) * 2019-07-19 2019-11-08 苏州浪潮智能科技有限公司 Server FRU field management method, device, equipment and readable storage medium storing program for executing
US20200004625A1 (en) * 2018-06-29 2020-01-02 International Business Machines Corporation Determining when to perform error checking of a storage unit by training a machine learning module
CN110751272A (en) * 2019-10-30 2020-02-04 珠海格力电器股份有限公司 Method, device and storage medium for positioning data in convolutional neural network model
US10891181B2 (en) * 2018-10-25 2021-01-12 International Business Machines Corporation Smart system dump
US20210081238A1 (en) * 2019-09-17 2021-03-18 Western Digital Technologies, Inc. Exception analysis for data storage devices
US11099743B2 (en) 2018-06-29 2021-08-24 International Business Machines Corporation Determining when to replace a storage device using a machine learning module
US11119663B2 (en) 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to perform a data integrity check of copies of a data set by training a machine learning module
CN113536306A (en) * 2020-04-14 2021-10-22 慧与发展有限责任合伙企业 Processing health information to determine whether an exception occurred
US20220066890A1 (en) * 2020-08-25 2022-03-03 Softiron Limited Centralized Server Management For Shadow Nodes
CN114896212A (en) * 2022-04-07 2022-08-12 支付宝(杭州)信息技术有限公司 Log data analysis method and device, storage medium and electronic equipment
US11636004B1 (en) * 2021-10-22 2023-04-25 EMC IP Holding Company LLC Method, electronic device, and computer program product for training failure analysis model
US11748478B2 (en) 2020-08-07 2023-09-05 Softiron Limited Current monitor for security

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226840B2 (en) * 2015-10-08 2022-01-18 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US11221872B2 (en) * 2015-10-08 2022-01-11 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US11403162B2 (en) * 2019-10-17 2022-08-02 Dell Products L.P. System and method for transferring diagnostic data via a framebuffer

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5253184A (en) * 1991-06-19 1993-10-12 Storage Technology Corporation Failure and performance tracking system
DE10244131B4 (en) * 2002-09-23 2006-11-30 Siemens Ag Method for supporting identification of a defective functional unit in a technical installation
US6970804B2 (en) 2002-12-17 2005-11-29 Xerox Corporation Automated self-learning diagnostic system
US20040221198A1 (en) * 2003-04-17 2004-11-04 Vecoven Frederic Louis Ghislain Gabriel Automatic error diagnosis
US8001423B2 (en) * 2008-09-26 2011-08-16 Bae Systems Information And Electronic Systems Integration Inc. Prognostic diagnostic capability tracking system
US8504875B2 (en) * 2009-12-28 2013-08-06 International Business Machines Corporation Debugging module to load error decoding logic from firmware and to execute logic in response to an error
CN102455950A (en) * 2010-10-28 2012-05-16 鸿富锦精密工业(深圳)有限公司 Firmware recovery system and method of base board management controller
CN103914735B (en) 2014-04-17 2017-03-29 北京泰乐德信息技术有限公司 Fault recognition method and system based on neural network self-learning
US10817398B2 (en) * 2015-03-09 2020-10-27 Vapor IO Inc. Data center management via out-of-band, low-pin count, external access to local motherboard monitoring and control
US10339448B2 (en) * 2017-01-09 2019-07-02 Seagate Technology Llc Methods and devices for reducing device test time

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11119660B2 (en) 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to replace a storage device by training a machine learning module
US20200004625A1 (en) * 2018-06-29 2020-01-02 International Business Machines Corporation Determining when to perform error checking of a storage unit by training a machine learning module
US11204827B2 (en) 2018-06-29 2021-12-21 International Business Machines Corporation Using a machine learning module to determine when to perform error checking of a storage unit
US11119850B2 (en) 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to perform error checking of a storage unit by using a machine learning module
US11099743B2 (en) 2018-06-29 2021-08-24 International Business Machines Corporation Determining when to replace a storage device using a machine learning module
US11119663B2 (en) 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to perform a data integrity check of copies of a data set by training a machine learning module
US11119851B2 (en) * 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to perform error checking of a storage unit by training a machine learning module
US11119662B2 (en) 2018-06-29 2021-09-14 International Business Machines Corporation Determining when to perform a data integrity check of copies of a data set using a machine learning module
US10891181B2 (en) * 2018-10-25 2021-01-12 International Business Machines Corporation Smart system dump
CN110427371A (en) * 2019-07-19 2019-11-08 苏州浪潮智能科技有限公司 Server FRU field management method, apparatus, device, and readable storage medium
US20210081238A1 (en) * 2019-09-17 2021-03-18 Western Digital Technologies, Inc. Exception analysis for data storage devices
US11768701B2 (en) * 2019-09-17 2023-09-26 Western Digital Technologies, Inc. Exception analysis for data storage devices
CN110751272A (en) * 2019-10-30 2020-02-04 珠海格力电器股份有限公司 Method, device and storage medium for positioning data in convolutional neural network model
CN113536306A (en) * 2020-04-14 2021-10-22 慧与发展有限责任合伙企业 Processing health information to determine whether an exception occurred
US11755729B2 (en) 2020-08-07 2023-09-12 Softiron Limited Centralized server management for current monitoring for security
US11748478B2 (en) 2020-08-07 2023-09-05 Softiron Limited Current monitor for security
US20220066890A1 (en) * 2020-08-25 2022-03-03 Softiron Limited Centralized Server Management For Shadow Nodes
US12019528B2 (en) * 2020-08-25 2024-06-25 Softiron Limited Centralized server management for shadow nodes
US11636004B1 (en) * 2021-10-22 2023-04-25 EMC IP Holding Company LLC Method, electronic device, and computer program product for training failure analysis model
CN114896212A (en) * 2022-04-07 2022-08-12 支付宝(杭州)信息技术有限公司 Log data analysis method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
US10552729B2 (en) 2020-02-04

Similar Documents

Publication Publication Date Title
US10552729B2 (en) Baseboard management controller to deconfigure field replaceable units according to deep learning model
US10579459B2 (en) Log events for root cause error diagnosis
US11494295B1 (en) Automated software bug discovery and assessment
US10489232B1 (en) Data center diagnostic information
US11860721B2 (en) Utilizing automatic labelling, prioritizing, and root cause analysis machine learning models and dependency graphs to determine recommendations for software products
KR101331935B1 (en) Method and system of fault diagnosis and repair based on tracepoints
US11625315B2 (en) Software regression recovery via automated detection of problem change lists
US11551085B2 (en) Method, device, and computer program product for error evaluation
CN111414268B (en) Fault handling method and apparatus, and server
CN113282461A (en) Alarm identification method and device for transmission network
JP7435799B2 (en) Rule learning device, rule engine, rule learning method, and rule learning program
CN117668706A (en) Method and device for isolating memory faults of server, storage medium and electronic equipment
CN112257745A (en) Hidden Markov-based method and device for predicting health degree of underground coal mine system
CN114691409A (en) Memory fault processing method and device
CN116582414A (en) Fault root cause localization method, apparatus, device, and readable storage medium
US20200127882A1 (en) Identification of cause of failure of computing elements in a computing environment
US20230385048A1 (en) Predictive recycling of computer systems in a cloud environment
CN113839861A (en) Routing engine switching based on health determined by support vector machine
Simeonov et al. Proactive software rejuvenation based on machine learning techniques
US20230161661A1 (en) Utilizing topology-centric monitoring to model a system and correlate low level system anomalies and high level system impacts
US20230161637A1 (en) Automated reasoning for event management in cloud platforms
CN116490857A (en) Method and system for providing maintenance service for recording medium of electronic device
CN116827759B (en) Method and device for processing restarting instruction of converging current divider
US11956117B1 (en) Network monitoring and healing based on a behavior model
CN112996026B (en) Double-backup upgrading method and system for wireless network equipment

Legal Events

Date Code Title Description
AS Assignment

Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BACHA, ANYS;UMAR, DODDYANTO HAMID;SIGNING DATES FROM 20170315 TO 20170316;REEL/FRAME:041647/0016

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4