US20140059390A1 - Use of service processor to retrieve hardware information - Google Patents
- Publication number
- US20140059390A1 (application US14/071,517)
- Authority
- US
- United States
- Prior art keywords
- processing unit
- central processing
- service processor
- information
- processing system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/26—Functional testing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0778—Dumping, i.e. gathering error/state information after a fault for later diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0754—Error or fault detection not based on redundancy by exceeding limits
- G06F11/0757—Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2294—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by remote test
Definitions
- the present disclosure relates generally to diagnosing processing systems and more specifically to use of a service processor to retrieve hardware information.
- when a computer system crashes, an operating system executed on the computer system may dump the contents of main memory at the time of the crash to a file.
- This dump is referred to as a core dump, and the information in the core dump is generally used to debug or analyze errors in computer programs or computer systems.
- Embodiments of the present invention provide various techniques for retrieving information from a central processing unit (CPU).
- information from a central processing unit (CPU) in a processing system can be retrieved, even when an operating system has malfunctioned, in the event of a system crash.
- the processing system uses a service processor to retrieve information about the CPU from the CPU itself.
- in addition to a CPU, a processing system also has a separate service processor that controls the various hardware components of the processing system. Many processing systems include such a service processor in order to offload many hardware specific tasks from the CPU. This offloading of tasks to the service processor provides the CPU with more bandwidth to handle application specific tasks, thereby speeding the execution of applications. It should be appreciated that traditional service processors are not configured to retrieve information used for diagnosing a system crash, but as explained in detail below, embodiments of the present invention provide various techniques for using service processors to retrieve such information directly from the CPU.
- the service processor operates independently from the CPU and from the operating system executed by the CPU. Accordingly, the service processor is still operable in the event that the operating system malfunctions as a result of a CPU stall.
- the service processor can be used to retrieve various information about the CPU and/or about other hardware components of the processing system. The retrieval can be initiated when a stall of the CPU is detected or when a user manually initiates the retrieval. Additionally, the service processor can also be programmed to initiate retrieval at predefined intervals. Once the information is retrieved, it may be used to diagnose the errors that caused, for example, the CPU to stall.
- FIG. 1 depicts a block diagram of a system of processing systems, consistent with one embodiment of the present invention;
- FIG. 2 depicts a high-level block diagram of a storage server, according to at least one embodiment of the present invention;
- FIG. 3 depicts an architectural block diagram of the hardware and software associated with a processing system, in accordance with an embodiment of the present invention;
- FIG. 4 depicts a flow diagram of a general overview of a method, in accordance with an embodiment, for retrieving information from a processing system that has a CPU and a service processor;
- FIG. 5 depicts a flow diagram of a general overview of a method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor;
- FIGS. 6A and 6B depict circuit diagrams illustrating the retrieval of CPU information by a service processor, consistent with different embodiments of the present invention;
- FIG. 7 depicts a circuit diagram of the detailed connections between a service processor and other components of a processing system, according to an embodiment of the present invention; and
- FIG. 8 depicts a flow diagram of a more detailed method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor.
- FIG. 1 depicts a block diagram of a system 100 of processing systems, consistent with one embodiment of the present invention.
- the system 100 includes a storage system 7 and various processing systems (e.g., clients 1 and administrative consoles 5 ) in communication with the storage system 7 through networks 3 and 21 , such as a local area network (LAN) or wide area network (WAN).
- the storage system 7 operates on behalf of the clients 1 to store and manage shared files or other units of data (e.g., blocks) in the set of mass storage devices.
- Each of the clients 1 may be, for example, a conventional personal computer (PC), a workstation, a smart phone, or other processing systems.
- the storage system 7 includes a storage server 20 in communication with a storage subsystem 4 .
- the storage server 20 manages the storage subsystem 4 and receives and responds to various read and write requests from the clients 1 , directed to data stored in, or to be stored in, the storage subsystem 4 .
- the mass storage devices in the storage subsystem 4 may be, for example, conventional magnetic disks, optical disks such as CD-ROM or DVD based storage, magneto-optical (MO) storage, or any other type of non-volatile storage devices suitable for storing large quantities of data.
- the mass storage devices may be organized into one or more volumes of Redundant Array of Inexpensive Disks (RAID).
- the storage server 20 in this configuration includes a communication port (e.g., RS-232) and appropriate software to allow direct communication between the storage server 20 and the local administrative console 5 through a transmission line.
- This configuration enables a network administrator to perform management functions on the storage server 20 .
- the storage server 20 can also be managed through a network 21 from a remote administrative console 5 ′. It should be noted that while network 3 and network 21 are depicted as separate networks in FIG. 1 , they can also be the same network.
- FIG. 2 depicts a high-level block diagram of a machine in the example form of a processing system 200 within which may be executed a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein.
- the processing system 200 may be deployed in the form of, for example, a storage server, a personal computer, a tablet personal computer, a laptop computer, a smart phone, or a variety of other processing systems.
- the processing system 200 is a storage server (e.g., storage server 20 depicted in FIG. 1 )
- the storage server may be, for example, a file server, and more particularly, a network attached storage (NAS) appliance.
- the storage server may be a server that provides clients with access to information organized as data containers, such as individual data blocks, as may be the case in a storage area network (SAN).
- the storage server may be a device that provides clients with access to data at both the file level and the block level.
- the processing system 200 includes one or more CPUs 31 and memory 32 , which are coupled to each other through a chipset 33 .
- the chipset 33 may include, for example, a memory controller hub and input/output hub combination.
- the CPU 31 is the central processing unit of the processing system 200 and may be, for example, one or more programmable general-purpose or special-purpose microprocessors or digital signal processors (DSPs), microcontrollers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or a combination of such devices.
- the memory 32 may be, or may include, any of various forms of read-only memory (ROM), random access memory (RAM), Flash memory, or the like, or a combination of such devices.
- the memory 32 stores, among other things, the operating system of the processing system 200 .
- the processing system 200 also includes one or more internal mass storage devices 34 , a console serial interface 35 , a network adapter 36 , and a storage adapter 37 , which are coupled to the CPU 31 through the chipset 33 .
- the processing system 200 also includes a power supply 38 , as shown.
- the internal mass storage devices 34 may be or include any machine-readable medium for storing large volumes of data in a non-volatile manner, such as one or more magnetic or optical based disks, or for storing one or more sets of data structures and instructions (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein.
- the serial interface 35 allows a direct serial connection with, for example, a local administrative console.
- the storage adapter 37 allows the processing system 200 to access a storage subsystem and may be, for example, a Fibre Channel adapter or a Small Computer System Interface (SCSI) adapter.
- the network adapter 36 , such as an Ethernet adapter, provides the processing system 200 with the ability to communicate with remote devices over a network.
- the processing system 200 further includes a number of sensors 39 and presence detectors 40 .
- the sensors 39 are used to detect changes in the state of various environmental variables or parameters in the processing system 200 , such as temperatures, voltages, binary states, and other parameters.
- the presence detectors 40 are used to detect the presence or absence of various hardware components within the processing system 200 , such as a cooling fan, a particular circuit card, or other hardware components.
- the service processor 42 monitors and/or manages the various hardware components of the processing system 200 . Examples of monitoring and management functionalities are described in more detail below.
- the service processor 42 may be, for example, one or more programmable general-purpose or special-purpose microprocessors or digital signal processors (DSPs), microcontrollers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or a combination of such devices.
- Many processing systems include such a service processor 42 to offload many hardware specific tasks from the CPU 31 . This offloading of tasks by the service processor 42 provides the CPU 31 with more bandwidth to handle application specific tasks, thereby speeding the execution of applications executed by the CPU 31 .
- the service processor 42 is independent and separate from the CPU 31 and, in this example of the processing system 200 , the service processor 42 is coupled to the RMM 41 as well as to the chipset 33 and CPU 31 , and receives input from the sensors 39 and presence detectors 40 . It should be noted that the service processor 42 is independent from the CPU 31 in that the processing of the service processor 42 is not dependent on the CPU 31 . In other words, the service processor 42 can function independently of the CPU 31 and can therefore still function if the CPU 31 stalls or malfunctions. Furthermore, the service processor 42 is physically separate from the CPU 31 in that the internal components of the service processor 42 are separated from the CPU 31 by an intervening barrier or space.
- the service processor 42 may be embodied within a microchip while the CPU 31 may be embodied in a different microchip. As explained in more detail below, the service processor 42 is configured to retrieve various information from the CPU 31 or from other hardware components, and such information may be used in the analysis or diagnosis of errors in the processing system 200 .
- the service processor 42 further includes a remote management module (RMM) 41 that provides a network interface and allows a remote processing system, such as a remote administrative console, to control and/or perform various management functions on the processing system 200 by way of a network.
- the RMM 41 may be in the form of a dedicated circuit card separate from the other hardware components of the processing system 200 .
- the RMM 41 has a network interface that connects to the network and a separate internal interface that connects to one or more hardware components of the processing system.
- the RMM typically includes control circuitry (e.g., a microprocessor or microcontroller) which is programmed or otherwise configured to respond to commands received from a remote administrative console via the network and to perform at least some of the management functions.
- the processing system 200 may include fewer or more components apart from those shown in FIG. 2 .
- the processing system 200 may not include the RMM 41 .
- the processing system 200 may not include the storage adapter 37 .
- FIG. 3 depicts an architectural block diagram of the hardware and software associated with the processing system 200 , in accordance with an embodiment of the present invention.
- the processing system 200 includes a service processor 42 , a service processor operating system 310 , a monitor and management module 309 , and a diagnostic module 301 .
- the service processor 42 executes a service processor operating system 310 that manages various software processes and/or services.
- the service processor operating system 310 controls and schedules execution of processes by the service processor 42 .
- the service processor operating system 310 is separate and independent from the main operating system executed by a CPU. Accordingly, if the main operating system malfunctions, the service processor operating system 310 may continue to function because it is executed on a different hardware component, namely the service processor 42 .
- the software processes and/or other services executed by the service processor 42 include a diagnostic module 301 and a monitor and management module 309 .
- the monitor and management module 309 monitors and/or manages various components of a processing system.
- the diagnostic module 301 is configured to retrieve information from the CPU or other hardware components.
- the diagnostic module 301 may include a detection module 302 , an information retrieval module 304 , a reset module 306 , and a console login module 308 .
- the detection module 302 is configured to detect that a CPU included in the processing system 200 has stalled.
- the information retrieval module 304 is configured to retrieve information directly from the CPU, as also explained in more detail below.
- the reset module 306 is configured to reset the processing system 200 in order to, for example, attempt to place or return the CPU into an operational state.
- the console login module 308 provides a user with access to the processing system 200 such that the user can access or retrieve the information retrieved before the reset.
- the processing system 200 may include fewer or more modules apart from those shown in FIG. 3 .
- the diagnostic module 301 may exclude the console login module 308 and the reset module 306 .
- the functionalities of the reset module 306 and the console login module 308 may be handled by, for example, a different module.
- the modules 302 , 304 , 306 , and 308 are in the form of software that is processed by the service processor 42 .
- the modules 302 , 304 , 306 , and 308 may be in the form of firmware that is processed by Application Specific Integrated Circuits (ASICs), which may be integrated into a circuit board.
- modules 302 , 304 , 306 , and 308 may be in the form of one or more logic blocks included in a programmable logic device (e.g., a field-programmable gate array).
- the described modules may be adapted, and/or additional structures may be provided, to provide alternative or additional functionalities beyond those specifically discussed in reference to FIG. 3 . Examples of such alternative or additional functionalities will be discussed in reference to the flow diagrams discussed below.
- the modifications or additions to the structures described in relation to FIG. 3 to implement these alternative or additional functionalities will be implementable by those skilled in the art, having the benefit of the present specification and teachings.
- FIG. 4 depicts a flow diagram of a general overview of a method 400 , in accordance with an embodiment, for retrieving information from a processing system that has a CPU and a service processor.
- the method 400 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200 .
- the service processor retrieves CPU information directly from the CPU at 404 .
- CPU information refers to information associated with the CPU. Examples of CPU information include a state of the CPU, a CPU event, contents of CPU registers, and other information associated with the CPU.
- the service processor can also retrieve additional information related to other hardware components of the processing system.
- the retrieved CPU information may then be stored in a non-volatile storage device for later retrieval by a user for use in diagnosing, for example, any CPU or other hardware related errors.
- the retrieved CPU information may be transmitted to a different processing system.
- a reset refers to clearing any pending errors or events and bringing a processing system to a normal condition or initial state.
- An example of a reset may be a hard reset where power is removed and subsequently restored to a processing system.
- Another example of a reset may be a soft reset where system software, such as the operating system, is terminated and subsequently executed again in a processing system. Particularly, a soft reset is restarting a processing system under operating system control, without removing power.
- the resetting of the processing system at 406 is optional as not all processing systems need to be reset after the CPU information is retrieved.
- the CPU may be allowed to continue to operate.
- the service processor may modify a state of the CPU based on the retrieved CPU information, and then allow the CPU to continue to operate based on the modified state. For example, the service processor can modify the CPU state by changing the registers of a CPU processing core.
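- The retrieve, store, and optionally reset sequence of method 400 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `retrieve_and_store`, the callables passed to it, and the use of a JSON file as a stand-in for the non-volatile storage device are all assumptions introduced here. The variant in which the service processor modifies the CPU state and lets the CPU continue is not shown.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Optional


def retrieve_and_store(
    read_cpu_info: Callable[[], Dict[str, int]],
    dump_path: str,
    reset_system: Optional[Callable[[], None]] = None,
) -> Dict[str, int]:
    """Retrieve CPU information (404), persist it, and optionally reset (406)."""
    info = read_cpu_info()
    # A plain file stands in for the non-volatile storage device (assumption).
    with open(dump_path, "w", encoding="utf-8") as fh:
        json.dump(info, fh)
    if reset_system is not None:  # the reset step is optional
        reset_system()
    return info


# Example call with hypothetical register contents and no reset.
example_path = os.path.join(tempfile.mkdtemp(), "cpu_info.json")
example_info = retrieve_and_store(lambda: {"state": 3, "r0": 0xDEAD}, example_path)
```

Passing `reset_system=None` models the case where the processing system does not need to be reset after the CPU information is retrieved.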
- FIG. 5 depicts a flow diagram of a general overview of a method 500 , in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor.
- the method 500 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200 .
- the service processor detects at 502 that the CPU has stalled.
- a CPU is a state machine and has various internal components. In order to be able to fully function, a CPU needs all of its components or subsystems to be in a consistent or known state. However, a CPU or a subsystem of the CPU may refuse to continue its current operation if it is in an inconsistent state or if the data it depends on to transition to the next state is not available. Such conditions can “stall” a CPU.
- the detection of the stall at 502 can be based on the receipt of heartbeat messages.
- the CPU can be configured to transmit heartbeat messages to a service processor at predefined intervals. If the CPU has completely stalled, the CPU is not able to transmit these heartbeat messages.
- if the service processor does not receive the heartbeat messages within a predefined interval, the service processor can conclude that the CPU has stalled.
- the detection of the stall can be based on receipt of an event signal from the CPU. Particularly, if the CPU has not completely stalled, a functioning subsystem within the CPU may detect an error condition within other subsystems of the CPU and send an event signal notifying the service processor of the error condition. In other words, a functioning subsystem of the CPU may detect that another subsystem has stalled and accordingly, send an event signal to the service processor notifying it of the stall in at least one of the subsystems.
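- The two detection mechanisms described above (missed heartbeat messages and an event signal from a functioning CPU subsystem) can be sketched as a small watchdog. The class name `HeartbeatWatchdog`, its methods, and the choice to pass timestamps in explicitly are assumptions made for illustration; the disclosure does not specify a data structure.

```python
class HeartbeatWatchdog:
    """Flags a stall when no heartbeat arrives within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_beat = now
        self.event_pending = False

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat message received from the CPU."""
        self.last_beat = now

    def event_signal(self) -> None:
        """Record an error event sent by a still-functioning CPU subsystem."""
        self.event_pending = True

    def stalled(self, now: float) -> bool:
        """True if an event signal arrived or the heartbeat is overdue."""
        return self.event_pending or (now - self.last_beat) > self.timeout_s
```

Timestamps are supplied by the caller so that the logic can be exercised without a real clock; a service processor would likely use its own monotonic timer.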
- after the service processor has detected that the CPU has stalled, the service processor retrieves CPU information directly from the CPU at 504 , as described in detail below. With the CPU information retrieved, the service processor then resets the processing system at 506 . It should be noted that the retrieval of the CPU information or other hardware related information is not necessarily triggered by the detection that a CPU has stalled. In another embodiment, the service processor may be configured to automatically retrieve CPU information at predefined intervals, without a subsequent reset of the processing system. In this case the CPU information may be retrieved even when there is no apparent error in the CPU, because such information may be useful for other CPU related analysis. In yet another embodiment, a user can manually trigger the retrieval of CPU information through use of, for example, a remote administrative console.
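- The three triggers for retrieval (a detected stall, a predefined interval, and a manual request) can be sketched as a single handler. The names `Trigger` and `handle_trigger` are hypothetical. The sketch resets only after a stall; treating a manual retrieval as not followed by a reset is an assumption, since the text states this only for the periodic case.

```python
from enum import Enum
from typing import Callable, Dict, List


class Trigger(Enum):
    STALL = "stall"        # CPU stall detected (502)
    PERIODIC = "periodic"  # predefined interval elapsed
    MANUAL = "manual"      # user request, e.g. from a remote console


def handle_trigger(
    trigger: Trigger,
    read_cpu_info: Callable[[], Dict[str, int]],
    reset_system: Callable[[], None],
    store: List[Dict[str, int]],
) -> None:
    """Retrieve CPU information (504); reset (506) only after a stall."""
    store.append(read_cpu_info())
    if trigger is Trigger.STALL:
        reset_system()
```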
- FIGS. 6A and 6B depict circuit diagrams illustrating the retrieval of CPU information by a service processor, consistent with different embodiments of the present invention.
- a processing system 600 includes a CPU 31 and a service processor 42 .
- the CPU 31 includes a test access interface 602 , which is an interface that is included in many hardware components for use in, for example, testing circuit board assemblies and debugging embedded systems.
- An example of such a test access interface 602 is a Joint Test Action Group (JTAG) interface (IEEE 1149.1).
- A JTAG interface is a specialized four- or five-pin interface added to a hardware component, such as the CPU 31 .
- Another example of the test access interface 602 is a Serial Peripheral Interface Bus (SPI bus), which is a synchronous serial data link that operates in full duplex mode.
- The SPI bus specifies four logic signals: Serial Clock (SCLK); Master Output, Slave Input (MOSI/SIMO); Master Input, Slave Output (MISO/SOMI); and Slave Select (SS).
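- The full-duplex behavior of the SPI bus can be illustrated with a pure software model of a one-byte exchange. The function `spi_transfer_byte` is a hypothetical name, and the most-significant-bit-first ordering is an assumption (SPI devices differ on bit order). No real hardware is driven here.

```python
from typing import Tuple


def spi_transfer_byte(master_out: int, slave_out: int) -> Tuple[int, int]:
    """Exchange one byte in full duplex, MSB first; returns (master_in, slave_in)."""
    master_in = 0
    slave_in = 0
    for bit in range(7, -1, -1):        # one SCLK cycle per bit, SS held active
        mosi = (master_out >> bit) & 1  # master drives MOSI
        miso = (slave_out >> bit) & 1   # slave drives MISO in the same cycle
        slave_in = (slave_in << 1) | mosi
        master_in = (master_in << 1) | miso
    return master_in, slave_in
```

Because both directions shift during the same clock cycles, each side ends up holding the byte the other side sent.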
- Yet another example of the test access interface 602 is a Platform Environment Control Interface (PECI) bus, which is a single-wire interface with a variable data transfer speed.
- the service processor 42 is connected to the test access interface 602 included in the CPU 31 by way of a general purpose I/O port 604 included in the service processor 42 .
- a general purpose I/O port 604 is a port that is available on the service processor 42 and may be used for a variety of different applications.
- a general purpose I/O port 604 may be a four-bit or eight-bit I/O port used to connect to other hardware components for light-emitting diode (LED) driving, monitoring switches, communicating data, or other applications.
- the service processor 42 can retrieve CPU information directly from the CPU 31 by way of the test access interface 602 .
- in an embodiment where the test access interface 602 is a JTAG interface, the service processor 42 can retrieve the CPU information from the Test Data Out (TDO) pin of that interface.
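- Reading a register over a JTAG-style port by toggling general purpose I/O lines can be sketched with a simulated scan chain. This is a deliberately simplified model under stated assumptions: `SimulatedScanChain` and `shift_out` are hypothetical names, the chain stands in for a CPU data register, and the TAP state machine sequencing on TMS that a real IEEE 1149.1 device requires is omitted.

```python
class SimulatedScanChain:
    """Stand-in for a CPU data register reachable over a JTAG-style port."""

    def __init__(self, value: int, width: int) -> None:
        self.value = value & ((1 << width) - 1)
        self.width = width

    def clock(self, tdi: int) -> int:
        """One TCK pulse: present the LSB on TDO, shift TDI in at the top."""
        tdo = self.value & 1
        self.value = (self.value >> 1) | ((tdi & 1) << (self.width - 1))
        return tdo


def shift_out(chain: SimulatedScanChain, nbits: int) -> int:
    """Clock `nbits` times and assemble the bits sampled on TDO, LSB first."""
    result = 0
    for i in range(nbits):
        result |= chain.clock(tdi=0) << i
    return result
```

On real hardware the `clock` call would be replaced by writes to the TCK and TDI lines and a read of the TDO line on the general purpose I/O port.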
- the CPU 31 can be connected to the service processor 42 by way of a debug connection logic 652 .
- This alternate processing system 650 includes the CPU 31 , the debug connection logic 652 , and a service processor 42 , where the debug connection logic 652 is connected to both the CPU 31 and the service processor 42 .
- the debug connection logic 652 functions as a connecting switch between the CPU 31 and the service processor 42 .
- the debug connection logic 652 disconnects the CPU 31 from the service processor 42 such that any errant data or signals cannot be transmitted between the CPU 31 and the service processor 42 . This disconnection is implemented to assure that the service processor 42 cannot inadvertently transmit any signals or data to the CPU 31 that may interfere with the operations of the CPU 31 .
- when the service processor 42 is instructed to retrieve CPU information from the CPU 31 , the service processor 42 transmits a signal by way of connection 654 to the debug connection logic 652 to access the CPU 31 .
- upon receipt of this signal, the debug connection logic 652 connects the service processor 42 to the CPU 31 such that the service processor 42 can directly retrieve the CPU information from the CPU 31 by way of the test access interface 602 .
- the service processor 42 can transmit another signal to the debug connection logic 652 by way of connection 654 to instruct the debug connection logic 652 to disconnect the service processor 42 from the CPU 31 .
- the debug connection logic 652 may include a timer set for a particular predefined time period, and the debug connection logic 652 can be configured to connect the service processor 42 to the CPU 31 for this particular predefined time period. Upon expiration of the time period, the debug connection logic 652 automatically disconnects the service processor 42 from the CPU 31 without any instructions to do so from the service processor 42 .
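- The switching behavior of the debug connection logic, including the automatic disconnect on timer expiry, can be modeled as follows. `DebugConnectionLogic` and its method names are assumptions for illustration, and the sketch combines the explicit disconnect signal and the timer in one class even though the text presents them as separate options.

```python
from typing import Optional


class DebugConnectionLogic:
    """Switch between service processor and CPU with an auto-disconnect timer."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        # None means disconnected, which is the default state.
        self._connected_at: Optional[float] = None

    def request_access(self, now: float) -> None:
        """Signal on connection 654: connect and start the timer."""
        self._connected_at = now

    def release(self) -> None:
        """Explicit disconnect signal from the service processor."""
        self._connected_at = None

    def is_connected(self, now: float) -> bool:
        """Connected only while the predefined time period has not expired."""
        if self._connected_at is None:
            return False
        if now - self._connected_at >= self.window_s:
            self._connected_at = None  # timer expired: disconnect automatically
            return False
        return True
```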
- FIG. 7 depicts a circuit diagram of the detailed connections between a service processor 42 and other components of a processing system 700 , according to an embodiment of the present invention.
- the processing system 700 includes the service processor 42 connected to and in communication with sensors 39 , presence detectors 40 , CPU 31 , chipset 33 , and power supply 38 .
- the sensors 39 are also connected to the CPU 31 and chipset 33 by, for example, an Inter IC bus 81 , which allows communication between hardware components on a circuit board.
- the service processor 42 monitors and/or manages the various hardware components of the processing system 700 .
- such monitoring and management functionalities can be provided by a monitor and management module 309 , as described above in FIG. 3 , that is embodied or executed by the service processor 42 . Examples of such functionalities include data logging, setting platform event traps, keeping a system event log, providing remote access to the processing system 700 , and monitoring various parameters associated with hardware components.
- the service processor 42 can monitor various parameters or variables present in a processing system, such as the temperature, voltage, fan speed, and/or current, through use of various sensors 39 . If the service processor 42 detects that a particular parameter has fallen below or exceeded a certain threshold, then the service processor 42 can log the reading and, as discussed below, transmit messages with the reading to other processing systems by way of the RMM 41 . In another example, as discussed above, the service processor 42 can detect the presence or absence of various hardware components in the processing system 700 by way of the presence detectors 40 .
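- The threshold check on sensor readings can be sketched as a small function. The name `check_thresholds`, the sensor names, and the limit values in the example are hypothetical and chosen only to illustrate the comparison against a lower and an upper bound.

```python
from typing import Dict, List, Tuple


def check_thresholds(
    readings: Dict[str, float],
    limits: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return a log line for every reading outside its (low, high) limits."""
    log = []
    for name, value in readings.items():
        low, high = limits[name]
        if value < low or value > high:
            log.append(f"{name}={value} outside [{low}, {high}]")
    return log


# Hypothetical readings: the temperature exceeds its limit, the voltage does not.
example_log = check_thresholds(
    {"temp_c": 91.0, "volt": 12.1},
    {"temp_c": (5.0, 85.0), "volt": (11.4, 12.6)},
)
```

The returned lines could then be written to the event log or forwarded as alert messages.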
- the service processor 42 also monitors the processing system 700 for changes in system-specified signals that are of interest. When any of these signals change, the service processor 42 captures and logs the state of the signals. For example, the service processor 42 can log system events, such as boot progress, field replaceable unit changes, operating system generated events, and service processor command history.
- the service processor 42 can also be configured to control various hardware components of the processing system 700 , such as the power supply 38 .
- the service processor 42 can provide a control signal CTRL to the power supply 38 to enable or disable the power supply 38 .
- the service processor 42 can collect status information about the power supply 38 with the receipt of the status signal STATUS from the power supply 38 .
- the service processor 42 can also shut down, power-cycle, generate a non-maskable interrupt (NMI), or reboot the processing system 700 , regardless of the state of the CPU 31 and chipset 33 .
- the service processor 42 can also be connected to a local administrative console by way of a serial communication port (not shown). In this connection, a user can log into the service processor 42 using a secure shell client application from the local administrative console.
- the service processor 42 can also be connected to a remote administrative console by way of the RMM 41 that provides a network interface, and can transmit messages to and from the remote administrative console. For example, upon detection of a specified critical event, the service processor 42 can automatically dispatch an alert e-mail or other form of electronic alert message to the remote administrative console.
- FIG. 8 depicts a flow diagram of a more detailed method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor.
- the method 800 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200 .
- the service processor detects that the CPU has stalled at 802 and thereafter transmits an NMI to the CPU at 804 in an attempt to wake the CPU.
- An NMI is a type of CPU interrupt that cannot be ignored by standard interrupt masking techniques.
- the service processor then waits for a time period after transmittal of the NMI and attempts to detect whether the CPU continues to be stalled during this particular time period. If the service processor detects that the CPU has become functional within this time period, then the service processor does not take any further actions.
- If the service processor detects that the CPU is still stalled after this time period, then the service processor is configured to transmit a signal at 808 to a debug connection logic, which connects to both the CPU and the service processor, to access the CPU.
- the debug connection logic connects the service processor to the CPU upon receipt of the signal such that the service processor can retrieve CPU information from the CPU at 810 .
- the service processor has access to or has logged other information associated with other hardware components (e.g., system events and temperature).
- the service processor collects this other information at 812 and may then store all the information retrieved to a non-volatile storage device, such as a hard disk drive, at 814 .
- the service processor may instead transmit the collected information to a different processing system, such as a remote administrative console.
- the processing system may then be reset at 816 .
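By way of a non-limiting illustration, the sequence of method 800 described above might be sketched as follows. The function `run_method_800`, the `FakePlatform` class, its method names, and the default wait period are all hypothetical stand-ins introduced for illustration; they do not correspond to any interface named in the disclosure. The numbers recorded in `steps` follow the reference numerals used above.

```python
def run_method_800(platform, wait_seconds=5.0):
    """Walk through the described steps and return the numerals performed."""
    steps = []
    if not platform.cpu_stalled():
        return steps                      # nothing to do
    steps.append(802)                     # stall detected
    platform.send_nmi()
    steps.append(804)                     # NMI sent to try to wake the CPU
    platform.wait(wait_seconds)
    if not platform.cpu_stalled():
        return steps                      # CPU recovered, no further action
    platform.connect_debug_logic()
    steps.append(808)                     # debug connection logic signaled
    info = {"cpu": platform.read_cpu_info()}
    steps.append(810)                     # CPU information retrieved
    info["other"] = platform.read_other_info()
    steps.append(812)                     # other hardware information collected
    platform.store(info)
    steps.append(814)                     # stored to non-volatile storage
    platform.reset()
    steps.append(816)                     # processing system reset
    return steps


class FakePlatform:
    """Hypothetical test double standing in for real hardware access."""

    def __init__(self, recovers_after_nmi):
        self.recovers_after_nmi = recovers_after_nmi
        self.nmi_sent = False
        self.stored = None
        self.was_reset = False

    def cpu_stalled(self):
        return not (self.nmi_sent and self.recovers_after_nmi)

    def send_nmi(self):
        self.nmi_sent = True

    def wait(self, seconds):
        pass  # a real implementation would block for the time period

    def connect_debug_logic(self):
        pass

    def read_cpu_info(self):
        return {"registers": "placeholder"}

    def read_other_info(self):
        return {"events": [], "temperature_c": None}

    def store(self, info):
        self.stored = info

    def reset(self):
        self.was_reset = True
```

A platform that never recovers passes through every step and is reset, while a platform that recovers after the NMI stops after 804 and is left running.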
- the service processor, in one embodiment, can also be configured to analyze the collected information (including the CPU information retrieved) at 815 and take different actions based on the results of the analysis. For example, the service processor can reset the processing system based on the results of the analysis.
- the service processor can analyze the collected information, remap particular subcomponents based on the analysis, and then reset the processing system.
- the service processor may identify that a particular Peripheral Component Interconnect (PCI) component has malfunctioned, and reboot the processing system without the malfunctioning PCI component.
- the components of a processing system are identified by a range of addresses.
- the service processor may remap the range of addresses assigned to the malfunctioned component to some other address location (e.g., address 0).
- the service processor can identify or map the bad parts of a system memory based on the collected information and reboot the processing system without accessing the bad parts of the system memory.
- the service processor marks the bad sectors as unusable such that the operating system skips them in the future.
- Many system memories include spare sectors, and when a bad sector is found, the logical sector is remapped to a different physical sector.
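By way of a non-limiting illustration, the remapping of a bad logical sector to a spare physical sector, with a fallback of marking the sector unusable, might be sketched as follows. The `SectorMap` class and its method names are hypothetical, and the sketch assumes a simple one-to-one table; an actual system memory may track bad regions quite differently.

```python
class SectorMap:
    """Hypothetical logical-to-physical sector table with a pool of spares."""

    def __init__(self, n_logical, spares):
        self.mapping = {i: i for i in range(n_logical)}  # identity by default
        self.spares = list(spares)                       # unused physical sectors
        self.unusable = set()                            # logical sectors to skip

    def mark_bad(self, logical):
        """Remap to a spare if one is left; otherwise mark the sector unusable."""
        if self.spares:
            self.mapping[logical] = self.spares.pop(0)
            return True
        self.unusable.add(logical)
        return False

    def resolve(self, logical):
        """Return the physical sector, or None if the sector must be skipped."""
        if logical in self.unusable:
            return None
        return self.mapping[logical]
```

With one spare available, the first bad sector is redirected to the spare and remains addressable, while a second bad sector is simply skipped after the next reboot.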
- the service processor also may be configured to identify specific data that is related to a particular error message and to transmit the identified data along with the error message to, for example, a local administrative console.
- Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules.
- a hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner.
- one or more processing systems e.g., the processing system 200 depicted in FIG. 3
- one or more hardware modules of a processing system e.g., the service processor 42 depicted in FIG. 2 or a group of processors
- software e.g., an application or application portion
- a hardware module may be implemented mechanically or electronically.
- a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations.
- a hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within the service processor 42 ) that is temporarily configured by software to perform certain operations.
- the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein.
- hardware modules are temporarily configured (e.g., programmed)
- each of the hardware modules need not be configured or instantiated at any one instance in time.
- the hardware modules comprise a service processor 42 configured using software
- the service processor 42 may be configured as respective different hardware modules at different times.
- Software may accordingly configure a service processor 42 , for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.
- Modules can provide information to, and receive information from, other modules.
- the described modules may be regarded as being communicatively coupled.
- communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the modules.
- communications between such modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple modules have access.
- one module may perform an operation, and store the output of that operation in a memory device to which it is communicatively coupled.
- a further module may then, at a later time, access the memory device to retrieve and process the stored output.
- Modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
- service processors such as the service processor 42 depicted in FIG. 2
- service processors may be temporarily configured (e.g., by software) or permanently configured to perform the relevant operations.
- service processors may constitute “processor-implemented” modules that operate to perform one or more operations or functions.
Abstract
Various techniques and hardware are described for retrieving information in a processing system. In one embodiment, a method is provided for retrieving information in a processing system that includes a central processing unit and a service processor. Here, the service processor retrieves central processing unit information from the central processing unit and resets the processing system after the retrieval of the central processing unit information.
Description
- This application is a Continuation of U.S. application Ser. No. 12/908,764, entitled “USE OF SERVICE PROCESSOR TO RETRIEVE HARDWARE INFORMATION”, filed Oct. 20, 2010; the aforementioned priority application being hereby incorporated by reference in its entirety.
- The present disclosure relates generally to diagnosing processing systems and more specifically to use of a service processor to retrieve hardware information.
- When a computer system crashes, an operating system executed on the computer system may dump contents of main memory at the time of the crash onto a file. This dump is referred to as a core dump, and the information in the core dump is generally used to debug or analyze errors in computer programs or computer systems.
- However, in conventional computer systems, only the operating system generates a core dump. If the operating system also malfunctions in the computer system crash, then a core dump cannot be generated. Instead, many conventional computer systems simply reset themselves in a computer system crash. Without any information being recorded at the time of the crash, it would be difficult to diagnose or analyze the errors that caused the crash.
- Embodiments of the present invention provide various techniques for retrieving information from a central processing unit (CPU). As an example, information from a CPU in a processing system can be retrieved in the event of a system crash, even when an operating system has malfunctioned. Particularly, the processing system uses a service processor to retrieve information about the CPU from the CPU itself.
- It should be appreciated that in addition to a CPU, a processing system also has a separate service processor that controls the various hardware components of the processing system. Many processing systems include such a service processor in order to offload many hardware specific tasks from the CPU. This offloading of tasks by the service processor provides the CPU with more bandwidth to handle application specific tasks, thereby speeding the execution of applications. It should be appreciated that traditional service processors are not configured to retrieve information used for diagnosing a system crash, but as explained in detail below, embodiments of the present invention provide various techniques for using service processors to retrieve such information directly from the CPU.
- The service processor operates independently from the CPU and from the operating system executed by the CPU. Accordingly, the service processor is still operable in the event that the operating system malfunctions as a result of a CPU stall. In one example, the service processor can be used to retrieve various information about the CPU and/or about other hardware components of the processing system. The retrieval can be initiated when a stall of the CPU is detected or when a user manually initiates the retrieval. Additionally, the service processor can also be programmed to initiate retrieval at predefined intervals. Once the information is retrieved, it may be used to diagnose the errors that caused, for example, the CPU to stall.
- The present disclosure is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
- FIG. 1 depicts a block diagram of a system of processing systems, consistent with one embodiment of the present invention;
- FIG. 2 depicts a high-level block diagram of a storage server, according to at least one embodiment of the present invention;
- FIG. 3 depicts an architectural block diagram of the hardware and software associated with a processing system, in accordance with an embodiment of the present invention;
- FIG. 4 depicts a flow diagram of a general overview of a method, in accordance with an embodiment, for retrieving information from a processing system that has a CPU and a service processor;
- FIG. 5 depicts a flow diagram of a general overview of a method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor;
- FIGS. 6A and 6B depict circuit diagrams illustrating the retrieval of CPU information by a service processor, consistent with different embodiments of the present invention;
- FIG. 7 depicts a circuit diagram of the detailed connections between a service processor and other components of a processing system, according to an embodiment of the present invention; and
- FIG. 8 depicts a flow diagram of a more detailed method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor.
- The description that follows includes illustrative systems, methods, techniques, instruction sequences, and computing machine program products that embody the present invention. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to one skilled in the art that embodiments of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures and techniques have not been shown in detail.
- FIG. 1 depicts a block diagram of a system 100 of processing systems, consistent with one embodiment of the present invention. As depicted, the system 100 includes a storage system 7 and various processing systems (e.g., clients 1 and administrative consoles 5) in communication with the storage system 7 through networks storage system 7 operates on behalf of the clients 1 to store and manage shared files or other units of data (e.g., blocks) in the set of mass storage devices. Each of the clients 1 may be, for example, a conventional personal computer (PC), a workstation, a smart phone, or other processing systems. In this example, the storage system 7 includes a storage server 20 in communication with a storage subsystem 4. The storage server 20 manages the storage subsystem 4 and receives and responds to various read and write requests from the clients 1, directed to data stored in, or to be stored in, the storage subsystem 4. The mass storage devices in the storage subsystem 4 may be, for example, conventional magnetic disks, optical disks such as CD-ROM or DVD based storage, magneto-optical (MO) storage, or any other type of non-volatile storage devices suitable for storing large quantities of data. The mass storage devices may be organized into one or more volumes of Redundant Array of Inexpensive Disks (RAID).
- Also depicted in FIG. 1 is a local administrative console 5 in communication with the storage system 7. The storage server 20 in this configuration includes a communication port (e.g., RS-232) and appropriate software to allow direct communication between the storage server 20 and the local administrative console 5 through a transmission line. This configuration enables a network administrator to perform management functions on the storage server 20. The storage server 20 can also be managed through a network 21 from a remote administrative console 5′. It should be noted that while network 3 and network 21 are depicted as separate networks in FIG. 1, they can also be the same network.
- FIG. 2 depicts a high-level block diagram of a machine in the example form of a processing system 200 within which may be executed a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein. The processing system 200 may be deployed in the form of, for example, a storage server, a personal computer, a tablet personal computer, a laptop computer, a smart phone, or a variety of other processing systems. In the embodiment where the processing system 200 is a storage server (e.g., storage server 20 depicted in FIG. 1), the storage server may be, for example, a file server, and more particularly, a network attached storage (NAS) appliance. Alternatively, the storage server may be a server that provides clients with access to information organized as data containers, such as individual data blocks, as may be the case in a storage area network (SAN). In yet another example, the storage server may be a device that provides clients with access to data at both the file level and the block level.
- The processing system 200 includes one or more CPUs 31 and memory 32, which are coupled to each other through a chipset 33. The chipset 33 may include, for example, a memory controller hub and input/output hub combination. The CPU 31 of the processing system 200 may be, for example, one or more programmable general-purpose or special-purpose microprocessors or digital signal processors (DSPs), microcontrollers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or a combination of such devices. The memory 32 may be, or may include, any of various forms of read-only memory (ROM), random access memory (RAM), Flash memory, or the like, or a combination of such devices. The memory 32 stores, among other things, the operating system of the processing system 200.
- The processing system 200 also includes one or more internal mass storage devices 34, a console serial interface 35, a network adapter 36, and a storage adapter 37, which are coupled to the CPU 31 through the chipset 33. The processing system 200 also includes a power supply 38, as shown. The internal mass storage devices 34 may be or include any machine-readable medium for storing large volumes of data in a non-volatile manner, such as one or more magnetic or optical based disks, or for storing one or more sets of data structures and instructions (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. The serial interface 35, such as an RS-232 port or Universal Serial Bus (USB) port, allows a direct serial connection with, for example, a local administrative console. The storage adapter 37 allows the processing system 200 to access a storage subsystem and may be, for example, a Fibre Channel adapter or a Small Computer System Interface (SCSI) adapter. The network adapter 36, such as an Ethernet adapter, provides the processing system 200 with the ability to communicate with remote devices over a network.
- The processing system 200 further includes a number of sensors 39 and presence detectors 40. The sensors 39 are used to detect changes in the state of various environmental variables or parameters in the processing system 200, such as temperatures, voltages, binary states, and other parameters. The presence detectors 40 are used to detect the presence or absence of various hardware components within the processing system 200, such as a cooling fan, a particular circuit card, or other hardware components.
- The service processor 42, at a high level, monitors and/or manages the various hardware components of the processing system 200. Examples of monitoring and management functionalities are described in more detail below. The service processor 42 may be, for example, one or more programmable general-purpose or special-purpose microprocessors or digital signal processors (DSPs), microcontrollers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or a combination of such devices. Many processing systems include such a service processor 42 to offload many hardware specific tasks from the CPU 31. This offloading of tasks by the service processor 42 provides the CPU 31 with more bandwidth to handle application specific tasks, thereby speeding the execution of applications executed by the CPU 31. The service processor 42 is independent and separate from the CPU 31 and, in this example of the processing system 200, the service processor 42 is coupled to the RMM 41 as well as to the chipset 33 and CPU 31, and receives input from the sensors 39 and presence detectors 40. It should be noted that the service processor 42 is independent from the CPU 31 in that the processing of the service processor 42 is not dependent on the CPU 31. In other words, the service processor 42 can function independently of the CPU 31 and therefore the service processor 42 can still function if the CPU 31 stalls or malfunctions. Furthermore, the service processor 42 is physically separate from the CPU 31 in that the internal components of the service processor 42 are separated from the CPU 31 by an intervening barrier or space. For example, the service processor 42 may be embodied within a microchip while the CPU 31 may be embodied in a different microchip. As explained in more detail below, the service processor 42 is configured to retrieve various information from the CPU 31 or from other hardware components, and such information may be used in the analysis or diagnosis of errors in the processing system 200.
- In the embodiment depicted in FIG. 2, the service processor 42 further includes a remote management module (RMM) 41 that provides a network interface and allows a remote processing system, such as a remote administrative console, to control and/or perform various management functions on the processing system 200 by way of a network. The RMM 41 may be in the form of a dedicated circuit card separate from the other hardware components of the processing system 200. The RMM 41 has a network interface that connects to the network and a separate internal interface that connects to one or more hardware components of the processing system. The RMM typically includes control circuitry (e.g., a microprocessor or microcontroller) which is programmed or otherwise configured to respond to commands received from a remote administrative console via the network and to perform at least some of the management functions.
- It should be appreciated that in other embodiments, the processing system 200 may include fewer or more components apart from those shown in FIG. 2. For example, in an alternate embodiment, the processing system 200 may not include the RMM 41. In yet another embodiment, the processing system 200 may not include the storage adapter 37.
- FIG. 3 depicts an architectural block diagram of the hardware and software associated with the processing system 200, in accordance with an embodiment of the present invention. As depicted, the processing system 200 includes a service processor 42, a service processor operating system 310, a monitor and management module 309, and a diagnostic module 301. The service processor 42 executes a service processor operating system 310 that manages various software processes and/or services. For example, the service processor operating system 310 controls and schedules execution of processes by the service processor 42. It should be noted that the service processor operating system 310 is separate and independent from the main operating system executed by a CPU. Accordingly, if the main operating system malfunctions, the service processor operating system 310 may continue to function because it is executed on a different hardware component, namely the service processor 42.
- In this embodiment, the software processes and/or other services executed by the service processor 42 include a diagnostic module 301 and a monitor and management module 309. As described in more detail below, the monitor and management module 309 monitors and/or manages various components of a processing system. The diagnostic module 301 is configured to retrieve information from the CPU or other hardware components. As depicted in FIG. 3, the diagnostic module 301 may include a detection module 302, an information retrieval module 304, a reset module 306, and a console login module 308. As explained in more detail below, the detection module 302 is configured to detect that a CPU included in the processing system 200 has stalled. The information retrieval module 304 is configured to retrieve information directly from the CPU, as also explained in more detail below. After the information is retrieved, the reset module 306 is configured to reset the processing system 200 in order to, for example, attempt to place or return the CPU into an operational state. As also explained in more detail below, the console login module 308 provides a user with access to the processing system 200 such that the user can access or retrieve the information retrieved before the reset.
- In other embodiments, the processing system 200 may include fewer or more modules apart from those shown in FIG. 3. For example, in an alternate embodiment, the diagnostic module 301 may exclude the console login module 308 and the reset module 306. The functionalities of the reset module 306 and the console login module 308 may be handled by, for example, a different module. In the example depicted in FIG. 3, the modules service processor 42. In another example, the modules modules FIG. 3. Examples of such alternative or additional functionalities will be discussed in reference to the flow diagrams discussed below. The modifications or additions to the structures described in relation to FIG. 3 to implement these alternative or additional functionalities will be implementable by those skilled in the art, having the benefit of the present specification and teachings.
- FIG. 4 depicts a flow diagram of a general overview of a method 400, in accordance with an embodiment, for retrieving information from a processing system that has a CPU and a service processor. In one example, the method 400 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200. Referring to FIG. 4, the service processor retrieves CPU information directly from the CPU at 404. As used herein, “CPU information” refers to information associated with the CPU. Examples of CPU information include a state of the CPU, a CPU event, contents of CPU registers, and other information associated with the CPU. As explained in more detail below, the service processor can also retrieve additional information related to other hardware components of the processing system. In one embodiment, the retrieved CPU information may then be stored in a non-volatile storage device for later retrieval by a user for use in diagnosing, for example, any CPU or other hardware related errors. In an alternate embodiment, the retrieved CPU information may be transmitted to a different processing system.
- After the CPU information is retrieved, the service processor then resets the processing system at 406. In general, a reset refers to clearing any pending errors or events and bringing a processing system to normal condition or initial state. An example of a reset may be a hard reset where power is removed and subsequently restored to a processing system. Another example of a reset may be a soft reset where system software, such as the operating system, is terminated and subsequently executed again in a processing system. Particularly, a soft reset is restarting a processing system under operating system control, without removing power.
- It should be appreciated that the resetting of the processing system at 406 is optional as not all processing systems need to be reset after the CPU information is retrieved. In another embodiment, instead of resetting the CPU after information retrieval, the CPU may be allowed to continue to operate. In an alternate embodiment, the service processor may modify a state of the CPU based on the retrieved CPU information, and then allow the CPU to continue to operate based on the modified state. For example, the service processor can modify the CPU state by changing the registers of a CPU processing core.
- FIG. 5 depicts a flow diagram of a general overview of a method 500, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor. In one example, the method 500 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200. Referring to FIG. 5, the service processor detects at 502 that the CPU has stalled. It should be appreciated that a CPU is a state machine and has various internal components. In order to be able to fully function, a CPU needs all of its components or subsystems to be in a consistent or known state. However, a CPU or a subsystem of the CPU may refuse to continue its current operation if it is in an inconsistent state or if the data it depends on to transition to the next state is not available. Such conditions can “stall” a CPU.
- In one example, the detection of the stall at 502 can be based on the receipt of heartbeat messages. In particular, the CPU can be configured to transmit heartbeat messages to a service processor at predefined intervals. If the CPU has completely stalled, the CPU is not able to transmit these heartbeat messages. When the service processor does not receive the heartbeat messages within a predefined interval, the service processor can identify and therefore detect that the CPU has stalled. In another example, the detection of the stall can be based on receipt of an event signal from the CPU. Particularly, if the CPU has not completely stalled, a functioning subsystem within the CPU may detect an error condition within other subsystems of the CPU and send an event signal notifying the service processor of the error condition. In other words, a functioning subsystem of the CPU may detect that another subsystem has stalled and accordingly, send an event signal to the service processor notifying it of the stall in at least one of the subsystems.
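By way of a non-limiting illustration, heartbeat-based stall detection might be sketched as follows. The `HeartbeatMonitor` class, its method names, and the injectable `clock` parameter are hypothetical and are introduced only so that the timeout logic can be exercised without real hardware; the disclosure does not specify how the predefined interval is measured.

```python
import time


class HeartbeatMonitor:
    """Flags a stall when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout = timeout_s
        self._clock = clock          # injectable so tests can control time
        self._last = clock()         # treat construction as the first heartbeat

    def beat(self):
        """Record a heartbeat message received from the CPU."""
        self._last = self._clock()

    def stalled(self):
        """Return True if the predefined interval elapsed without a heartbeat."""
        return self._clock() - self._last > self._timeout
```

In such a sketch, the service processor would call `beat()` whenever a heartbeat message arrives and poll `stalled()` periodically; a `True` result would correspond to the detection at 502.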
- Still referring to FIG. 5, after the service processor has detected that the CPU has stalled, the service processor then retrieves CPU information directly from the CPU at 504, the retrieval of which is described in detail below. With the CPU information retrieved, the service processor then resets the processing system at 506. It should be noted that the retrieval of the CPU information or other hardware related information may not necessarily be triggered based on the detection that a CPU has stalled. In another embodiment, the service processor may be configured to automatically retrieve CPU information at predefined intervals, without a subsequent reset of the processing system. The CPU information may be automatically retrieved when there is no apparent error in the CPU, but such information may be useful for other CPU related analysis. In yet another embodiment, a user can manually trigger the retrieval of CPU information through use of, for example, a remote administrative console.
FIGS. 6A and 6B depict circuit diagrams illustrating the retrieval of CPU information by a service processor, consistent with different embodiments of the present invention. As depicted inFIG. 6A , one embodiment of aprocessing system 600 includes aCPU 31 and aservice processor 42. Here, theCPU 31 includes atest access interface 602, which is an interface that is included in many hardware components for use in, for example, testing circuit board assemblies and debugging embedded systems. An example of such atest access interface 602 is Joint Test Action Group (JTAG) interface (or IEEE 1149.1). The JTAG interface is a specialized four/five-pin interface added to a hardware component, such as theCPU 31. The connector pins are Test Data In (TDI), Test Data Out (TDO), Test Clock (TCK), Test Mode Select (TMS), and Test Reset (TRST). Another example of atest access interface 602 is a Serial Peripheral Interface Bus (SPI bus), which is a synchronous serial data link that operates in full duplex mode. The SPI bus specifies four logic signals, namely Serial Clock (SCLK), Master Output, Slave Input (MOSI/SIMO), Master Input, Slave Output (MISO/SOMI), and Slave Select (SS). Yet another example of atest access interface 602 is a Platform Environment Control Interface (PECI) bus, which allows access to temperature data or other data from chipset components. In particular, the PECI bus is a single-wire interface with a variable data transfer speed. - In the embodiment depicted in
FIG. 6A , theservice processor 42 is connected to thetest access interface 602 included in theCPU 31 by way of a general purpose I/O port 604 included in theservice processor 42. A general purpose I/O port 604 is a port that is available on theservice processor 42 and may be used for a variety of different applications. For example, a general purpose I/O port 604 may be a four-bit or eight-bit I/O port used to connect to other hardware components for light-emitting diode (LED) driving, monitoring switches, communicating data, or other applications. When instructed, theservice processor 42 can retrieve CPU information directly from theCPU 31 by way of thetest access interface 602. As an example, if thetest access interface 602 is a JTAG interface, theservice processor 42 can retrieve the CPU information from the TDO. - In the alternate embodiment depicted in
FIG. 6B, the CPU 31 can be connected to the service processor 42 by way of a debug connection logic 652. This alternate processing system 650 includes the CPU 31, the debug connection logic 652, and a service processor 42, where the debug connection logic 652 is connected to both the CPU 31 and the service processor 42. Generally, the debug connection logic 652 functions as a connecting switch between the CPU 31 and the service processor 42. When the service processor 42 is not instructed to retrieve CPU information from the CPU 31, the debug connection logic 652 disconnects the CPU 31 from the service processor 42 such that any errant data or signals cannot be transmitted between the CPU 31 and the service processor 42. This disconnection is implemented to assure that the service processor 42 cannot inadvertently transmit any signals or data to the CPU 31 that may interfere with the operations of the CPU 31. - However, when the
service processor 42 is instructed to retrieve CPU information from the CPU 31, the service processor 42 transmits a signal by way of connection 654 to the debug connection logic 652 to access the CPU 31. In one embodiment, upon receipt of this signal, the debug connection logic 652 connects the service processor 42 to the CPU 31 such that the service processor 42 can directly retrieve the CPU information from the CPU 31 by way of the test access interface 602. After the CPU information is retrieved, the service processor 42 can transmit another signal to the debug connection logic 652 by way of connection 654 to instruct the debug connection logic 652 to disconnect the service processor 42 from the CPU 31. In an alternate embodiment, the debug connection logic 652 may include a timer set for a particular predefined time period, and the debug connection logic 652 can be configured to connect the service processor 42 to the CPU 31 for this particular predefined time period. Upon expiration of the time period, the debug connection logic 652 automatically disconnects the service processor 42 from the CPU 31 without any instructions to do so from the service processor 42. -
FIG. 7 depicts a circuit diagram of the detailed connections between a service processor 42 and other components of a processing system 700, according to an embodiment of the present invention. Here, the processing system 700 includes the service processor 42 connected to and in communication with sensors 39, presence detectors 40, CPU 31, chipset 33, and power supply 38. The sensors 39 are also connected to the CPU 31 and chipset 33 by, for example, an Inter IC bus 81, which allows communication between hardware components on a circuit board. As discussed above, the service processor 42 monitors and/or manages the various hardware components of the processing system 700. In one example, such monitoring and management functionalities can be provided by a monitor and management module 309, as described above in FIG. 3, that is embodied or executed by the service processor 42. Examples of such functionalities include data logging, setting platform event traps, keeping a system event log, providing remote access to the processing system 700, and monitoring various parameters associated with hardware components. - For example, the
service processor 42 can monitor various parameters or variables present in a processing system, such as the temperature, voltage, fan speed, and/or current, through use of various sensors 39. If the service processor 42 detects that a particular parameter has fallen below or exceeds a certain threshold, then the service processor 42 can log the readings and, as discussed below, transmit messages with the reading to other processing systems by way of the RMM 41. In another example, as discussed above, the service processor 42 can detect the presence or absence of various hardware components in the processing system 700 by way of the presence detectors 40. - The
service processor 42 also monitors the processing system 700 for changes in system-specified signals that are of interest. When any of these signals change, the service processor 42 captures and logs the state of the signals. For example, the service processor 42 can log system events, such as boot progress, field replaceable unit changes, operating system generated events, and service processor command history. - The
service processor 42 can also be configured to control various hardware components of the processing system 700, such as the power supply 38. For example, the service processor 42 can provide a control signal CTRL to the power supply 38 to enable or disable the power supply 38. Additionally, the service processor 42 can collect status information about the power supply 38 with the receipt of the status signal STATUS from the power supply 38. The service processor 42 can also shut down, power-cycle, generate a non-maskable interrupt (NMI), or reboot the processing system 700, regardless of the state of the CPU 31 and chipset 33. - The
service processor 42 can also be connected to a local administrative console by way of a serial communication port (not shown). In this connection, a user can log into the service processor 42 using a secure shell client application from the local administrative console. Alternatively, the service processor 42 can also be connected to a remote administrative console by way of the RMM 41 that provides a network interface, and can transmit messages to and from the remote administrative console. For example, upon detection of a specified critical event, the service processor 42 can automatically dispatch an alert e-mail or other form of electronic alert message to the remote administrative console. -
FIG. 8 depicts a flow diagram of a more detailed method, in accordance with an alternate embodiment, for retrieving information from a processing system that has a CPU and a service processor. In one example, the method 800 may be implemented by the diagnostic module 301 depicted in FIG. 3 and employed in the processing system 200. Referring to FIG. 8, the service processor detects that the CPU has stalled at 802 and thereafter transmits an NMI to the CPU to attempt to wake the CPU at 804. An NMI is a type of CPU interrupt that cannot be ignored by standard interrupt masking techniques. - The service processor then waits for a time period after transmittal of the NMI and attempts to detect whether the CPU continues to be stalled during this particular time period. If the service processor detects that the CPU has become functional within this time period, then the service processor does not take any further actions.
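The stall handling at 802 and 804, followed by the wait and re-check, can be sketched as a small decision routine. The Python below is an illustrative sketch only and is not taken from the specification: the function name `handle_stall`, the callables `cpu_is_stalled` and `send_nmi`, and the timing parameters are hypothetical names and values introduced here for illustration.

```python
import time
from typing import Callable


def handle_stall(cpu_is_stalled: Callable[[], bool],
                 send_nmi: Callable[[], None],
                 wait_seconds: float = 5.0,
                 poll_interval: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Return 'healthy', 'recovered', or 'escalate'.

    All names and default timings are assumptions made for this sketch.
    """
    # Step 802: nothing to do unless a stall is detected.
    if not cpu_is_stalled():
        return "healthy"

    # Step 804: try to wake the CPU with a non-maskable interrupt.
    send_nmi()

    # Wait for the configured period, re-checking the CPU at each poll.
    waited = 0.0
    while waited < wait_seconds:
        sleep(poll_interval)
        waited += poll_interval
        if not cpu_is_stalled():
            # CPU became functional in time: no further action is taken.
            return "recovered"

    # Still stalled after the period: the caller should request CPU access.
    return "escalate"
```

Passing the stall check, the NMI sender, and the sleep function as parameters keeps the sketch independent of any particular hardware interface, so the decision logic can be exercised without a real CPU.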
- However, if the service processor detects that the CPU is still stalled after this time period, then the service processor is configured to transmit a signal to a debug connection logic, which connects to both the CPU and the service processor, to access the CPU at 808. The debug connection logic connects the service processor to the CPU upon receipt of the signal such that the service processor can retrieve CPU information from the CPU at 810.
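The switching behavior of the debug connection logic at 808 and 810 can be modeled in a few lines. The following Python is a hedged sketch under stated assumptions, not the circuit described in the specification: the class name `DebugConnectionLogic`, its methods, the injected `clock` and `read_cpu` callables, and the optional `window_seconds` timer (which mirrors the timer-based alternate embodiment described for FIG. 6B) are all hypothetical.

```python
from typing import Callable, Optional


class DebugConnectionLogic:
    """Models the switch between the service processor and the CPU."""

    def __init__(self, read_cpu: Callable[[], dict],
                 clock: Callable[[], float],
                 window_seconds: Optional[float] = None) -> None:
        self._read_cpu = read_cpu          # hypothetical CPU read hook
        self._clock = clock                # injected time source
        self._window = window_seconds      # None: no automatic disconnect
        self._connected_at: Optional[float] = None

    def connect(self) -> None:
        """First signal: close the switch and start the optional timer."""
        self._connected_at = self._clock()

    def disconnect(self) -> None:
        """Second signal: open the switch again."""
        self._connected_at = None

    def is_connected(self) -> bool:
        if self._connected_at is None:
            return False
        expired = (self._window is not None
                   and self._clock() - self._connected_at >= self._window)
        if expired:
            # Timer ran out: disconnect without any instruction to do so.
            self._connected_at = None
            return False
        return True

    def read(self) -> dict:
        """Pass CPU information through only while the switch is closed."""
        if not self.is_connected():
            raise PermissionError("service processor is not connected to the CPU")
        return self._read_cpu()
```

In this model every read is refused while the switch is open, which reflects the stated purpose of the logic: the service processor cannot reach the CPU unless it has explicitly requested access.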
- Additionally, as described above, the service processor has access to or has logged other information associated with other hardware components (e.g., system events and temperature). The service processor collects this other information at 812 and may then store all the information retrieved to a non-volatile storage device, such as a hard disk drive, at 814. In an alternate embodiment, the service processor may instead transmit the collected information to a different processing system, such as a remote administrative console.
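The collection and storage at 812 and 814 might look like the following in software. This is a minimal sketch that assumes the collected information fits in JSON-serializable dictionaries and that the non-volatile storage device is exposed as an ordinary file path; the function name `store_dump` and the record layout are hypothetical and do not come from the specification.

```python
import json
import os
import tempfile


def store_dump(cpu_info: dict, other_info: dict, path: str) -> None:
    """Write the combined record to `path` so a partial file is never left behind."""
    # Step 812: combine the CPU information with the other collected data.
    record = {"cpu": cpu_info, "other": other_info}

    # Step 814: write to a temporary file in the same directory first.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())   # push the data to the storage device

    # Rename into place so readers see either the old file or the new one.
    os.replace(tmp_path, path)
```

Writing to a temporary file and renaming it is a design choice made for this sketch: a reset that arrives mid-write would then leave the previous dump intact rather than a truncated one.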
- The processing system may then be reset at 816. However, before the processing system is reset in the event of a CPU stall, the service processor, in one embodiment, can also be configured to analyze the collected information (including the CPU information retrieved) at 815 and take different actions based on the results of the analysis. For example, the service processor can reset the processing system based on the results of the analysis. Here, the service processor can analyze the collected information, remap particular subcomponents based on the analysis, and then reset the processing system. As an example, the service processor may identify that a particular Peripheral Component Interconnect (PCI) component has malfunctioned, and reboot the processing system without the malfunctioning PCI component. Particularly, the components of a processing system are identified by a range of addresses. When a particular component has malfunctioned, the service processor may remap the range of addresses assigned to the malfunctioned component to some other address location (e.g., address 0). In another example, the service processor can identify or map the bad parts of a system memory based on the collected information and reboot the processing system without accessing the bad parts of the system memory. In particular, when bad sectors are found, the service processor marks the bad sectors as unusable such that the operating system skips them in the future. Many system memories include spare sectors, and when a bad sector is found, the logical sector is remapped to a different physical sector.
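The spare-sector remapping mentioned above can be illustrated with a small lookup table. The Python below is a sketch under assumptions, not the mechanism of any particular memory device: the class name `SectorMap`, its methods, and the layout that places spare sectors directly after the regular ones are hypothetical choices made for illustration.

```python
class SectorMap:
    """Maps logical sectors to physical sectors, with a pool of spares."""

    def __init__(self, sector_count: int, spare_count: int) -> None:
        # Initially every logical sector maps to the same physical sector.
        self._map = {logical: logical for logical in range(sector_count)}
        # Assumed layout: spares follow the regular sectors.
        self._spares = list(range(sector_count, sector_count + spare_count))
        self.unusable = set()   # physical sectors to be skipped in the future

    def mark_bad(self, logical: int) -> bool:
        """Mark the backing physical sector unusable and remap if a spare exists.

        Returns True when the logical sector was remapped, False when no
        spare was left and the caller must avoid this logical sector.
        """
        self.unusable.add(self._map[logical])
        if not self._spares:
            return False
        self._map[logical] = self._spares.pop(0)
        return True

    def physical(self, logical: int) -> int:
        """Return the physical sector currently backing a logical sector."""
        return self._map[logical]
```

After `mark_bad` succeeds, lookups for that logical sector resolve to the spare, so software above this layer can continue to use the same logical address.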
- The service processor also may be configured to identify specific data that is related to a particular error message and to transmit the identified data along with the error message to, for example, a local administrative console.
- It should be noted that certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more processing systems (e.g., the
processing system 200 depicted in FIG. 3) or one or more hardware modules of a processing system (e.g., the service processor 42 depicted in FIG. 2 or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein. - In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within the service processor 42) that is temporarily configured by software to perform certain operations.
- Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a
service processor 42 configured using software, the service processor 42 may be configured as respective different hardware modules at different times. Software may accordingly configure a service processor 42, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time. - Modules can provide information to, and receive information from, other modules. For example, the described modules may be regarded as being communicatively coupled. Where multiples of such hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the modules. In embodiments in which multiple modules are configured or instantiated at different times, communications between such modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple modules have access. For example, one module may perform an operation, and store the output of that operation in a memory device to which it is communicatively coupled. A further module may then, at a later time, access the memory device to retrieve and process the stored output. Modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
- The various operations of example methods described herein may be performed, at least partially, by one or more service processors, such as the
service processor 42 depicted in FIG. 2, that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such service processors may constitute “processor-implemented” modules that operate to perform one or more operations or functions. - While the embodiments are described with reference to various implementations and exploitations, it will be understood that these embodiments are illustrative and that the scope of the embodiments is not limited to them. In general, techniques for retrieving information from a processing system may be implemented with facilities consistent with any hardware system or hardware systems defined herein. Many variations, modifications, additions, and improvements are possible.
- Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the embodiments. In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the embodiments.
Claims (20)
1. A processing system comprising:
a central processing unit;
a debug connection logic coupled to the central processing unit; and
a service processor coupled to the debug connection logic, the service processor to:
determine that the central processing unit is stalled;
in response to determining that the central processing unit is stalled for a period of time, transmit a first signal to the debug connection logic in order to access the central processing unit for information; and
in response to the debug connection logic enabling the service processor to access the central processing unit for information, retrieve central processing unit information from the central processing unit.
2. The processing system of claim 1, further comprising:
a non-volatile memory resource; and
wherein the service processor further stores the retrieved central processing unit information in the non-volatile memory resource.
3. The processing system of claim 1, wherein the service processor further analyzes the retrieved central processing unit information.
4. The processing system of claim 3, wherein the service processor further resets the processing system based on analyzing the central processing unit information.
5. The processing system of claim 1, wherein the service processor further transmits a second signal to the debug connection logic in response to completing retrieval of the central processing unit information, and wherein the second signal causes the debug connection logic to disable the service processor from accessing the central processing unit for information.
6. The processing system of claim 1, wherein the debug connection logic includes a timer, the timer being set for a predefined time period once the first signal is received by the debug connection logic, and wherein upon expiration of the predefined time period, the debug connection logic disables the service processor from accessing the central processing unit for information.
7. The processing system of claim 1, further comprising:
one or more sensors to detect one or more parameters of the processing system; and
wherein the service processor further receives information corresponding to the one or more parameters from the one or more sensors and determines whether a parameter of the one or more parameters has fallen below or has exceeded a corresponding threshold level.
8. The processing system of claim 7, wherein when a parameter of the one or more parameters has fallen below or has exceeded a corresponding threshold level, the service processor records an event log corresponding to the parameter.
9. The processing system of claim 7, wherein the one or more parameters corresponds to a temperature, a voltage, a current, or a fan speed associated with the processing system.
10. A processing system comprising:
a central processing unit;
a debug connection logic coupled to the central processing unit; and
a service processor coupled to the debug connection logic, the service processor to:
detect that the central processing unit is stalled;
transmit an interrupt signal to the central processing unit;
determine that the central processing unit is still stalled for a period of time after transmitting the interrupt signal;
in response to determining that the central processing unit is still stalled, transmit a first signal to the debug connection logic in order to access the central processing unit for information; and
in response to the debug connection logic enabling the service processor to access the central processing unit for information, retrieve central processing unit information from the central processing unit.
11. The processing system of claim 10, wherein the central processing unit comprises a test access interface and the service processor comprises a general purpose port different than the test access interface, and wherein the service processor retrieves the central processing unit information via the test access interface and the general purpose port.
12. The processing system of claim 10, wherein the service processor further transmits a second signal to the debug connection logic in response to completing retrieval of the central processing unit information, and wherein the second signal causes the debug connection logic to disable the service processor from accessing the central processing unit for information.
13. The processing system of claim 10, wherein the debug connection logic includes a timer, the timer being set for a predefined time period once the first signal is received by the debug connection logic, and wherein upon expiration of the predefined time period, the debug connection logic disables the service processor from accessing the central processing unit for information.
14. The processing system of claim 10, wherein the service processor further resets the central processing unit after completing retrieval of the central processing unit information.
15. A method of retrieving information in a processing system, the method being performed by a service processor and comprising:
determining, by the service processor, that a central processing unit of the processing system is stalled;
in response to determining that the central processing unit is stalled for a period of time, transmitting a first signal from the service processor to a debug connection logic of the processing system in order to access the central processing unit for information, the debug connection logic being connected between the central processing unit and the service processor; and
in response to the debug connection logic enabling the service processor to access the central processing unit for information, retrieving central processing unit information from the central processing unit.
16. The method of claim 15, further comprising:
storing the retrieved central processing unit information in a non-volatile memory resource of the processing system; and
analyzing the retrieved central processing unit information.
17. The method of claim 15, further comprising:
storing the retrieved central processing unit information in a non-volatile memory resource of the processing system; and
transmitting the retrieved central processing unit information to another processing system.
18. The method of claim 15, further comprising:
transmitting the retrieved central processing unit information to an administrative console as part of a message.
19. The method of claim 15, further comprising:
transmitting a second signal to the debug connection logic in response to completing retrieval of the central processing unit information, wherein the second signal causes the debug connection logic to disable the service processor from accessing the central processing unit for information.
20. The method of claim 15, further comprising:
resetting the central processing unit after completing retrieval of the central processing unit information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/071,517 US20140059390A1 (en) | 2010-10-20 | 2013-11-04 | Use of service processor to retrieve hardware information |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/908,764 US8621118B1 (en) | 2010-10-20 | 2010-10-20 | Use of service processor to retrieve hardware information |
US14/071,517 US20140059390A1 (en) | 2010-10-20 | 2013-11-04 | Use of service processor to retrieve hardware information |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/908,764 Continuation US8621118B1 (en) | 2010-10-20 | 2010-10-20 | Use of service processor to retrieve hardware information |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140059390A1 true US20140059390A1 (en) | 2014-02-27 |
Family
ID=49776201
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/908,764 Active US8621118B1 (en) | 2010-10-20 | 2010-10-20 | Use of service processor to retrieve hardware information |
US14/071,517 Abandoned US20140059390A1 (en) | 2010-10-20 | 2013-11-04 | Use of service processor to retrieve hardware information |
Country Status (1)
Country | Link |
---|---|
US (2) | US8621118B1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9910705B1 (en) * | 2015-02-18 | 2018-03-06 | Altera Corporation | Modular offloading for computationally intensive tasks |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9672146B2 (en) * | 2011-03-29 | 2017-06-06 | EMC IP Holding Company LLC | Retrieveing data from data storage systems |
US20140247513A1 (en) * | 2011-10-25 | 2014-09-04 | Michael S. Bunker | Environmental data record |
EP2951698A4 (en) * | 2013-01-31 | 2016-10-05 | Hewlett Packard Entpr Dev Lp | Methods and apparatus for debugging of remote systems |
US9996134B2 (en) * | 2016-04-25 | 2018-06-12 | Zippy Technology Corp. | Method to avoid over-rebooting of power supply device |
US10509656B2 (en) * | 2017-12-01 | 2019-12-17 | American Megatrends International, Llc | Techniques of providing policy options to enable and disable system components |
US11113070B1 (en) * | 2019-07-31 | 2021-09-07 | American Megatrends International, Llc | Automated identification and disablement of system devices in a computing system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060080399A1 (en) * | 2004-10-08 | 2006-04-13 | Agilent Technologies, Inc. | Remote configuration management for data processing units |
US20080215927A1 (en) * | 2005-09-16 | 2008-09-04 | Thales | Method of Monitoring the Correct Operation of a Computer |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2758742B2 (en) * | 1991-07-19 | 1998-05-28 | 日本電気株式会社 | Malfunction detection method |
US20040078681A1 (en) * | 2002-01-24 | 2004-04-22 | Nick Ramirez | Architecture for high availability using system management mode driven monitoring and communications |
US7124328B2 (en) * | 2002-05-14 | 2006-10-17 | Sun Microsystems, Inc. | Capturing system error messages |
US20050240669A1 (en) * | 2004-03-29 | 2005-10-27 | Rahul Khanna | BIOS framework for accommodating multiple service processors on a single server to facilitate distributed/scalable server management |
US7269805B1 (en) * | 2004-04-30 | 2007-09-11 | Xilinx, Inc. | Testing of an integrated circuit having an embedded processor |
JP5163120B2 (en) * | 2005-06-22 | 2013-03-13 | 日本電気株式会社 | Debug system, debugging method, and program |
US8230429B2 (en) * | 2008-05-30 | 2012-07-24 | International Business Machines Corporation | Detecting a deadlock condition by monitoring firmware inactivity during the system IPL process |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9910705B1 (en) * | 2015-02-18 | 2018-03-06 | Altera Corporation | Modular offloading for computationally intensive tasks |
US20180196698A1 (en) * | 2015-02-18 | 2018-07-12 | Altera Corporation | Modular offloading for computationally intensive tasks |
Also Published As
Publication number | Publication date |
---|---|
US8621118B1 (en) | 2013-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NETAPP, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NULKAR, CHAITANYA;REGER, BRAD;KALRA, PRADEEP;AND OTHERS;SIGNING DATES FROM 20101018 TO 20101019;REEL/FRAME:031689/0833 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |