US8090997B2 - Run-time fault resolution from development-time fault and fault resolution path identification - Google Patents
- Publication number
- US8090997B2 (application US12/142,945)
- Authority
- US
- United States
- Prior art keywords
- fault
- computing system
- operator
- resolution
- prompting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0715—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
Definitions
- The present invention relates to the field of fault handling in a computing system and more particularly to automated fault resolution.
- Computing systems continually grow in complexity each year. Managing the operation of a given computing system, whether management relates to a computer program, a supporting operating system, or an intermediate job or job controller, can be a challenging task and oftentimes requires specific expertise.
- Software developers anticipate faults in the computing system and hard code fault handling within the computing system itself.
- Technical writers often prepare assistive documentation addressing potential faults and fault resolution paths to be provided in concert with the operation of the computing system, for example as a help file.
- The recognition of the occurrence of a fault in a computing system, the comprehension of prepared documentation addressing the occurrence of a fault, and the implementation of a fault resolution path recommended by the documentation can vary according to the expertise of the person managing the computing system. Yet, in most instances, the expertise of computing support can vary from installation to installation, so that developing online-accessible documentation can be a balancing act of supporting the most unsophisticated of users with limited but comprehensible information while providing enough detailed information for seasoned users. Further, end users must know how to correlate a fault condition with appropriate portions of the documentation.
- Hard coding a computing system to address an encountered fault condition can be beneficial to the extent that the fault condition can be addressed without consideration for the technical sophistication of the end user.
- Automated fault handling logic, upon detecting a fault, merely rolls back the state of the computing system to a pre-fault state in order to provide a graceful exit.
- Automated fault handling logic can provide notification to an operator of the automated handling of the fault condition so that the operator can effectuate a restart of the computing system once the computing system has completed its graceful exit.
- Embodiments of the present invention address deficiencies of the art in respect to fault handling and provide a novel and non-obvious method, system and computer program product for run-time fault resolution from development time fault and fault resolution path identification.
- A method for run-time fault resolution from development time fault and fault resolution path identification can be provided. The method can include detecting a recoverable fault condition in a computing system, selecting a fault resolution path from among multiple development time specified fault resolution paths to match the recoverable fault condition, prompting an operator with the selected fault resolution path, and resuming operation of the computing system without restart subsequent to the operator performing the selected fault resolution path.
- Detecting a recoverable fault condition in a computing system can include detecting a low resource condition in the computing system.
- Prompting an operator with the selected fault resolution path can include prompting an operator to re-prioritize an existing job in the computing system, to terminate an existing job in the computing system, to pause operation of the computing system while adding new resources to the computing system, or to pause operation of the computing system while reallocating an existing resource to a different job in the computing system.
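The operator prompts listed above for a low resource condition could be modeled as a small data structure. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the names `OperatorAction`, `FaultResolutionPath`, `LOW_RESOURCE_PATH` and the `low_resource` identifier are hypothetical and do not appear in the patent text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class OperatorAction(Enum):
    """Operator prompts the text names for a low resource condition."""
    REPRIORITIZE_JOB = "re-prioritize an existing job"
    TERMINATE_JOB = "terminate an existing job"
    PAUSE_AND_ADD_RESOURCES = "pause operation while adding new resources"
    PAUSE_AND_REALLOCATE = (
        "pause operation while reallocating an existing resource to a different job"
    )


@dataclass(frozen=True)
class FaultResolutionPath:
    """A development-time specified path (hypothetical representation)."""
    fault_id: str                        # illustrative identifier
    actions: Tuple[OperatorAction, ...]  # options offered to the operator

    def prompt_text(self) -> str:
        """Build the text shown to the operator for this path."""
        lines = [f"Recoverable fault '{self.fault_id}'. Suggested actions:"]
        for number, action in enumerate(self.actions, start=1):
            lines.append(f"  {number}. {action.value}")
        return "\n".join(lines)


# Enum members iterate in definition order, so all four prompts are included.
LOW_RESOURCE_PATH = FaultResolutionPath("low_resource", tuple(OperatorAction))

if __name__ == "__main__":
    print(LOW_RESOURCE_PATH.prompt_text())
```

Keeping the path immutable (`frozen=True`) reflects the idea that paths are fixed at development time and only selected, not edited, at run time.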
- A computer data processing system can be configured for run-time fault resolution from development time fault and fault resolution path identification.
- The system can include a host computing platform supporting an operating system and job controller.
- The system also can include a computer program executing in the operating system.
- The computer program can include multiple different jobs managed by the job controller.
- The system can include a recoverable fault handler.
- The recoverable fault handler can include program code enabled to detect a recoverable fault condition in the system, to select a fault resolution path from amongst a plurality of development time specified fault resolution paths to match the recoverable fault condition, to prompt an operator with the selected fault resolution path, and to resume operation of the system without restart subsequent to the operator performing the selected fault resolution path.
- The recoverable fault handler can be coupled to the operating system. In another aspect of the embodiment, the recoverable fault handler can be coupled to the job controller. In yet another aspect of the embodiment, the recoverable fault handler can be coupled to the computer program. In even yet another aspect of the embodiment, the recoverable fault handler can be coupled to at least one of the jobs.
- FIG. 1 is a pictorial illustration of a process for run-time fault resolution from development time fault and fault resolution path identification;
- FIG. 2 is a schematic illustration of a computer data processing system configured for run-time fault resolution from development time fault and fault resolution path identification;
- FIG. 3 is a flow chart illustrating a process for run-time fault resolution from development time fault and fault resolution path identification.
- Embodiments of the present invention provide a method, system and computer program product for run-time fault resolution from development time fault and fault resolution path identification.
- A recoverable fault condition can be detected within a computing system.
- The detected recoverable fault can be compared to a listing of known recoverable faults set forth at development time for the computing system and, if available, a corresponding fault resolution path can be identified.
- The identified fault resolution path can provide a step or sequence of steps requisite to clearing the fault condition without requiring a restart of the computing system.
- The fault resolution path can be presented to an operator through a user interface to the computing system.
- The execution of the computing system can continue without requiring a restart of the computing system.
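The comparison of a detected fault against a development-time listing, with a path identified only "if available", might look like the sketch below. It is illustrative only: `FaultCondition`, `KnownFault`, `LISTING`, the resource names, and the numeric thresholds are all assumptions invented for the example and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class FaultCondition:
    resource: str         # e.g. "memory" or "disk" (illustrative values)
    free_fraction: float  # fraction of the resource still available


@dataclass(frozen=True)
class KnownFault:
    name: str
    matches: Callable[[FaultCondition], bool]  # does this entry describe the fault?
    steps: Tuple[str, ...]                     # step or sequence of steps to clear it


# Hypothetical development-time listing; thresholds are invented for illustration.
LISTING = (
    KnownFault(
        "low_memory",
        lambda c: c.resource == "memory" and c.free_fraction < 0.05,
        ("Pause a job", "Add memory", "Resume the job"),
    ),
    KnownFault(
        "low_disk",
        lambda c: c.resource == "disk" and c.free_fraction < 0.10,
        ("Pause a job", "Add disk space", "Resume the job"),
    ),
)


def identify_resolution_path(condition: FaultCondition) -> Optional[KnownFault]:
    """Return the first listed fault matching the condition, or None if none is available."""
    for known in LISTING:
        if known.matches(condition):
            return known
    return None
```

Returning `None` rather than raising keeps the "no path available" case explicit, so a caller can fall back to ordinary fault handling.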
- FIG. 1 pictorially shows a process for run-time fault resolution from development time fault and fault resolution path identification.
- A computing system 100 can be configured at development time for run-time fault resolution, so that a fault resolution path can be identified for a recoverable fault 120 detected in the computing system 100.
- A set of identifiable fault resolution paths 130 can be compared to the recoverable fault 120 in order to select a suitable fault resolution path 140.
- An operator 150 can be prompted with the selected suitable fault resolution path 140, which in turn can result in the operator 150 providing requisite information 160 to the computing system 100 to overcome the recoverable fault 120. In this way, a costly rollback and restart can be avoided.
- FIG. 2 is a schematic illustration of a computer data processing system configured for run-time fault resolution from development time fault and fault resolution path identification.
- The system can include a host computing platform 210 supporting the operation of an operating system 220.
- The operating system 220 in turn can manage the execution of a computer program 250 that can spawn multiple different jobs 240 to be managed in execution by a job controller 230.
- The job controller 230 can include a configuration for queuing jobs 240 for execution and also for placing jobs 240 in different hosts (whether virtual or actual) for execution in a distributed computing model.
- Recoverable fault handler 300 can be coupled to the operating system 220 either as separate computing logic, for instance in a linked library, or as standalone logic cooperatively executing with the operating system 220 .
- The recoverable fault handler 300 can be coupled to the job controller 230, one or more of the jobs 240, or to the computer program 250.
- The recoverable fault handler 300 can include a set of fault resolution paths 260, each providing a step or sequence of steps requisite to clearing a corresponding fault condition in a computing system without requiring a restart of the computing system, for example the host computing platform 210, the operating system 220, the job controller 230, a given one of the jobs 240, or even the computer program 250.
- The recoverable fault handler 300 can include program code which, when executed in the host computing platform 210, identifies a recoverable fault condition, locates a corresponding one of the fault resolution paths 260, and presents the located one of the fault resolution paths 260 to an external operator, both to provide guidance as to clearing the recoverable fault condition and to solicit additional input from the operator requisite to clearing the recoverable fault condition.
- The additional input can include the operator commanding the re-prioritization of individual ones of the jobs 240, the termination of selected ones of the jobs 240, or the pausing of selected ones of the jobs 240.
- The additional input can include the adding of computing resources like memory or disk space while pausing the processing of one or more of the jobs 240, or the re-allocation of existing computing resources while pausing the processing of one or more of the jobs 240.
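The operator commands described above (re-prioritizing, terminating, or pausing jobs) imply a job controller that can apply them without restarting anything. The sketch below is one hypothetical way to model that in Python; `Job`, `JobController`, the state strings, and the convention that a larger priority number runs first are assumptions for illustration and are not specified in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    name: str
    priority: int           # larger number runs first (an assumption)
    state: str = "running"  # "running", "paused", or "terminated"


@dataclass
class JobController:
    jobs: Dict[str, Job] = field(default_factory=dict)

    def add(self, job: Job) -> None:
        self.jobs[job.name] = job

    def reprioritize(self, name: str, priority: int) -> None:
        self.jobs[name].priority = priority

    def pause(self, name: str) -> None:
        # Only a running job can be paused.
        if self.jobs[name].state == "running":
            self.jobs[name].state = "paused"

    def resume(self, name: str) -> None:
        # Only a paused job can be resumed; terminated jobs stay terminated.
        if self.jobs[name].state == "paused":
            self.jobs[name].state = "running"

    def terminate(self, name: str) -> None:
        self.jobs[name].state = "terminated"

    def run_order(self) -> List[str]:
        """Names of running jobs, highest priority first."""
        running = [j for j in self.jobs.values() if j.state == "running"]
        return [j.name for j in sorted(running, key=lambda j: -j.priority)]
```

For example, an operator responding to a low memory prompt might pause one job while resources are added and then resume it:

```python
controller = JobController()
for job in (Job("report", 1), Job("backup", 5), Job("index", 3)):
    controller.add(job)
controller.pause("backup")          # pause while memory is added
controller.reprioritize("report", 9)
controller.resume("backup")         # processing continues without a restart
```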
- FIG. 3 is a flow chart illustrating a process for run-time fault resolution from development time fault and fault resolution path identification.
- A recoverable fault condition can be detected.
- The fault condition can include, by way of example, excessive disk thrashing or low computing resource conditions such as low memory.
- It can be determined whether or not the fault condition is recoverable, in the sense that a restart can be avoided despite the fault condition. If not, in block 330 ordinary fault handling can be performed, resulting in a restart of the computing system. Otherwise, the process can continue through block 340.
- A fault resolution path can be matched to the recoverable fault condition and, in block 350, an external operator can be prompted with the matched fault resolution path.
- Operator input can be provided consistent with the matched resolution path, such as a suggested re-prioritization of individual jobs, the termination of selected jobs, or the pausing of selected jobs.
- The operator input can be consistent with the adding of computing resources like memory or disk space while pausing the processing of one or more jobs, or the re-allocation of existing computing resources while pausing the processing of one or more of the jobs.
- In decision block 360, if the operator input is provided as prompted, then in block 370 the operation of the computing system can continue. Otherwise, in block 380 rollback and restart can be performed on the computing system.
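The flow of FIG. 3 can be summarized as a single function with three outcomes. This is a hedged sketch rather than the patent's code: `resolve_fault`, its parameters, and the returned strings are hypothetical, and the sketch assumes that a fault counts as recoverable exactly when a matching path exists, which is a simplification the text does not state.

```python
from typing import Callable, Mapping


def resolve_fault(
    fault_id: str,
    resolution_paths: Mapping[str, str],
    prompt_operator: Callable[[str], bool],
) -> str:
    """Return "restart", "continue", or "rollback_and_restart".

    prompt_operator receives the matched path text and returns True when the
    operator provides the input as prompted (hypothetical interface).
    """
    # Assumption: recoverable means a development-time path exists for the fault.
    path = resolution_paths.get(fault_id)
    if path is None:
        # Block 330: ordinary fault handling, resulting in a restart.
        return "restart"

    # Continuing through block 340 with a matched path;
    # block 350: prompt the external operator with it.
    input_provided = prompt_operator(path)

    # Decision block 360: was the operator input provided as prompted?
    if input_provided:
        return "continue"              # block 370: operation continues
    return "rollback_and_restart"      # block 380
```

Passing the prompt as a callable keeps the sketch independent of any particular user interface; a console prompt, a dialog, or a test stub can all be supplied.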
- Embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
- The invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, and the like.
- The invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- A computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
- Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
- A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
- The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices (including but not limited to keyboards, displays, and pointing devices) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Claims (17)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/142,945 US8090997B2 (en) | 2008-06-20 | 2008-06-20 | Run-time fault resolution from development-time fault and fault resolution path identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/142,945 US8090997B2 (en) | 2008-06-20 | 2008-06-20 | Run-time fault resolution from development-time fault and fault resolution path identification |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090319823A1 (en) | 2009-12-24 |
US8090997B2 (en) | 2012-01-03 |
Family
ID=41432495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/142,945 Expired - Fee Related US8090997B2 (en) | 2008-06-20 | 2008-06-20 | Run-time fault resolution from development-time fault and fault resolution path identification |
Country Status (1)
Country | Link |
---|---|
US (1) | US8090997B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10884888B2 (en) | 2019-01-22 | 2021-01-05 | International Business Machines Corporation | Facilitating communication among storage controllers |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9413893B2 (en) | 2012-04-05 | 2016-08-09 | Assurant, Inc. | System, method, apparatus, and computer program product for providing mobile device support services |
US9483344B2 (en) * | 2012-04-05 | 2016-11-01 | Assurant, Inc. | System, method, apparatus, and computer program product for providing mobile device support services |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4922491A (en) * | 1988-08-31 | 1990-05-01 | International Business Machines Corporation | Input/output device service alert function |
US6487677B1 (en) * | 1999-09-30 | 2002-11-26 | Lsi Logic Corporation | Methods and systems for dynamic selection of error recovery procedures in a managed device |
US20030177417A1 (en) * | 2002-03-14 | 2003-09-18 | Sun Microsystems Inc., A Delaware Corporation | System and method for remote performance analysis and optimization of computer systems |
US6718489B1 (en) * | 2000-12-07 | 2004-04-06 | Unisys Corporation | Electronic service request generator for automatic fault management system |
US6751758B1 (en) * | 2001-06-20 | 2004-06-15 | Emc Corporation | Method and system for handling errors in a data storage environment |
US6760869B2 (en) * | 2001-06-29 | 2004-07-06 | Intel Corporation | Reporting hard disk drive failure |
US20060053347A1 (en) * | 2004-09-09 | 2006-03-09 | Microsoft Corporation | Method, system, and apparatus for providing alert synthesis in a data protection system |
US20070174720A1 (en) * | 2006-01-23 | 2007-07-26 | Kubo Robert A | Apparatus, system, and method for predicting storage device failure |
US20080005609A1 (en) * | 2006-06-29 | 2008-01-03 | Zimmer Vincent J | Method and apparatus for OS independent platform recovery |
US20080028264A1 (en) * | 2006-07-27 | 2008-01-31 | Microsoft Corporation | Detection and mitigation of disk failures |
US20080115014A1 (en) * | 2006-11-13 | 2008-05-15 | Kalyanaraman Vaidyanathan | Method and apparatus for detecting degradation in a remote storage device |
US7409586B1 (en) * | 2004-12-09 | 2008-08-05 | Symantec Operating Corporation | System and method for handling a storage resource error condition based on priority information |
US20080270842A1 (en) * | 2007-04-26 | 2008-10-30 | Jenchang Ho | Computer operating system handling of severe hardware errors |
Also Published As
Publication number | Publication date |
---|---|
US20090319823A1 (en) | 2009-12-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HAMPTON, MARK C.;REEL/FRAME:021331/0633; Effective date: 20080613 |
 | ZAAA | Notice of allowance and fees due | Free format text: ORIGINAL CODE: NOA |
 | ZAAB | Notice of allowance mailed | Free format text: ORIGINAL CODE: MN/=. |
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
 | FPAY | Fee payment | Year of fee payment: 4 |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 8 |
 | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
 | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
 | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
 | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20240103 |