US8090997B2 - Run-time fault resolution from development-time fault and fault resolution path identification - Google Patents

Publication number
US8090997B2
Authority
US
United States
Prior art keywords
fault
computing system
operator
resolution
prompting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US12/142,945
Other versions
US20090319823A1 (en
Inventor
Mark C. Hampton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US12/142,945
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION: assignment of assignors interest (see document for details). Assignor: HAMPTON, MARK C.
Publication of US20090319823A1
Application granted
Publication of US8090997B2
Legal status: Expired - Fee Related
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present invention address deficiencies of the art in respect to fault handling and provide a method, system and computer program product for run-time fault resolution from development time fault and fault resolution path identification. In an embodiment of the invention, a method for run-time fault resolution from development time fault and fault resolution path identification can be provided. The method can include detecting a recoverable fault condition in a computing system, selecting a fault resolution path from amongst multiple development time specified fault resolution paths to match the recoverable fault condition, prompting an operator with the selected fault resolution path, and resuming operation of the computing system without restart subsequent to the operator performing the selected fault resolution path.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of fault handling in a computing system and more particularly to automated fault resolution.
2. Description of the Related Art
Computing systems continually grow in complexity each year. Managing the operation of a given computing system whether management relates to a computer program, supporting operating system, or intermediate job or job controller, can be a challenging task and oftentimes requires specific expertise. To provide assistance in addressing the complexity of management of a computing system, software developers anticipate faults in the computing system and hard code fault handling within the computing system itself. Additionally, technical writers often prepare assistive documentation addressing potential faults and fault resolution paths to be provided in concert with the operation of the computing system, for example as a help file.
The recognition of the occurrence of a fault in a computing system, the comprehension of prepared documentation addressing the occurrence of a fault, and the implementation of a fault resolution path recommended by the documentation can vary according to the expertise of one managing the computing system. Yet, in most instances, the expertise of computing support can vary from installation to installation so that developing of online accessible documentation can be a balancing act of supporting the most unsophisticated of users with limited but comprehensible information while providing enough detailed information for seasoned users. Further, end users must know how to correlate a fault condition with appropriate portions of the documentation.
Hard coding a computing system to address an encountered fault condition can be beneficial to the extent that the fault condition can be addressed without consideration for the technical sophistication of the end user. Oftentimes, automated fault handling logic upon detecting a fault merely rolls back the state of the computing system to a pre-fault state in order to provide a graceful exit. Additionally, automated fault handling logic can provide notification to an operator of the automated handling of the fault condition so that the operator can effectuate a restart of the computing system once the computing system has completed its graceful exit.
It is to be recognized, however, that many faults handled in an automated fashion resulting in a costly rollback and graceful exit otherwise can be avoided with a minimum of user interaction. Specifically, computing logic executing to a fault oftentimes can overcome the fault condition with a minimum of user interaction permitting a continuation of execution of the computing logic and avoiding the necessity of a restart. Conventional automated fault handling, however, does not always permit the interactivity of an operator once a fault condition has been addressed. Indeed, when interactivity is permitted, the choice for an operator usually is to confirm the termination of the computing system in the face of the detected fault.
BRIEF SUMMARY OF THE INVENTION
Embodiments of the present invention address deficiencies of the art in respect to fault handling and provide a novel and non-obvious method, system and computer program product for run-time fault resolution from development time fault and fault resolution path identification. In an embodiment of the invention, a method for run-time fault resolution from development time fault and fault resolution path identification can be provided. The method can include detecting a recoverable fault condition in a computing system, selecting a fault resolution path from amongst multiple development time specified fault resolution paths to match the recoverable fault condition, prompting an operator with the selected fault resolution path, and resuming operation of the computing system without restart subsequent to the operator performing the selected fault resolution path.
In one aspect of the embodiment, detecting a recoverable fault condition in a computing system can include detecting a low resource condition in the computing system. In another aspect of the embodiment, prompting an operator with the selected fault resolution path can include prompting an operator to re-prioritize an existing job in the computing system, to terminate an existing job in the computing system, to pause operation of the computing system while adding new resources to the computing system, or to pause operation of the computing system while reallocating an existing resource to a different job in the computing system.
In another embodiment of the invention, a computer data processing system can be configured for run-time fault resolution from development time fault and fault resolution path identification. The system can include a host computing platform supporting an operating system and job controller. The system also can include a computer program executing in the operating system. The computer program can include multiple different jobs managed by the job controller. Finally, the system can include a recoverable fault handler. The recoverable fault handler can include program code enabled to detect a recoverable fault condition in the system, to select a fault resolution path from amongst a plurality of development time specified fault resolution paths to match the recoverable fault condition, to prompt an operator with the selected fault resolution path, and to resume operation of the system without restart subsequent to the operator performing the selected fault resolution path.
In one aspect of the embodiment, the recoverable fault handler can be coupled to the operating system. In another aspect of the embodiment, the recoverable fault handler can be coupled to the job controller. In yet another aspect of the embodiment, the recoverable fault handler can be coupled to the computer program. In even yet another aspect of the embodiment, the recoverable fault handler can be coupled to at least one of the jobs.
Additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The aspects of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention. The embodiments illustrated herein are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown, wherein:
FIG. 1 is a pictorial illustration of a process for run-time fault resolution from development time fault and fault resolution path identification;
FIG. 2 is a schematic illustration of a computer data processing system configured for run-time fault resolution from development time fault and fault resolution path identification; and,
FIG. 3 is a flow chart illustrating a process for run-time fault resolution from development time fault and fault resolution path identification.
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention provide a method, system and computer program product for run-time fault resolution from development time fault and fault resolution path identification. In accordance with an embodiment of the present invention, a recoverable fault condition can be detected within a computing system. The recoverable detected fault can be compared to a listing of known recoverable faults set forth at development time for the computing system and, if available, a corresponding fault resolution path can be identified. The identified fault resolution path can provide a step or sequence of steps requisite to clearing the fault condition without requiring a restart of the computing system. Thereafter, the fault resolution path can be presented to an operator through a user interface to the computing system. Finally, upon satisfaction of the fault resolution path, the execution of the computing system can continue without requiring a restart of the computing system.
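By way of illustration only, the comparison of a detected fault against a development-time listing might be sketched as a simple lookup. The names `ResolutionPath`, `KNOWN_PATHS`, and `select_path`, and the fault codes and steps shown, are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResolutionPath:
    fault_code: str          # hypothetical identifier for a known recoverable fault
    steps: Tuple[str, ...]   # operator-facing steps that clear the fault without a restart


# Listing set forth at development time (entries are illustrative assumptions).
KNOWN_PATHS = {
    "LOW_MEMORY": ResolutionPath(
        "LOW_MEMORY",
        ("Pause job processing", "Add memory or terminate a job", "Resume"),
    ),
    "DISK_THRASHING": ResolutionPath(
        "DISK_THRASHING",
        ("Re-prioritize competing jobs",),
    ),
}


def select_path(fault_code: str) -> Optional[ResolutionPath]:
    """Return the development-time path for a detected fault, or None if none is listed."""
    return KNOWN_PATHS.get(fault_code)
```

Returning `None` for an unlisted fault reflects the "if available" qualification above: a path is identified only when the detected fault appears in the listing.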
In illustration, FIG. 1 pictorially shows a process for run-time fault resolution from development time fault and fault resolution path identification. As shown in FIG. 1, a computing system 110 can be configured for run-time fault resolution at development time to identify a fault resolution path for a detected recoverable fault in the computing system 100. In this regard, upon detecting a recoverable fault 120 in the computing system 110, a set of identifiable fault resolution paths 130 can be compared to the recoverable fault 120 in order to select a suitable fault resolution path 140. Thereafter, an operator 150 can be prompted with the selected suitable fault resolution path 140 which in turn can result in the operator 150 providing requisite information 160 to the computing system 100 to overcome the recoverable fault 120. In this way, a costly rollback and restart can be avoided.
In further illustration, FIG. 2 is a schematic illustration of a computer data processing system configured for run-time fault resolution from development time fault and fault resolution path identification. The system can include a host computing platform 210 supporting the operation of an operating system 220. The operating system 220 in turn can manage the execution of a computer program 250 that can spawn multiple different jobs 240 to be managed in execution by a job controller 230. As will be recognized by the skilled artisan, job controller 230 can include a configuration for queuing jobs 240 for execution and also for placing jobs 240 in different hosts (whether virtual or actual) for execution in a distributed computing model.
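As a rough sketch only, a job controller that queues jobs and places them on hosts might look as follows. The specification does not prescribe any queuing discipline or placement policy; the priority heap, the round-robin placement, and the names `JobController`, `submit`, and `dispatch` are all assumptions made for illustration.

```python
import heapq
from itertools import count, cycle


class JobController:
    """Hypothetical job controller: priority queue plus round-robin host placement."""

    def __init__(self, hosts):
        self._queue = []             # heap of (priority, submission order, job name)
        self._order = count()        # tie-breaker so equal priorities stay first-in, first-out
        self._hosts = cycle(hosts)   # hosts may be virtual or actual

    def submit(self, name, priority):
        # Lower numbers are treated as higher priority (an assumption of this sketch).
        heapq.heappush(self._queue, (priority, next(self._order), name))

    def dispatch(self):
        """Remove the highest-priority queued job and pair it with the next host."""
        _, _, name = heapq.heappop(self._queue)
        return name, next(self._hosts)
```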
Recoverable fault handler 300 can be coupled to the operating system 220 either as separate computing logic, for instance in a linked library, or as standalone logic cooperatively executing with the operating system 220. Alternatively, the recoverable fault handler 300 can be coupled to the job controller 230, one or more of the jobs 240, or to the computer program 250. The recoverable fault handler 300 can include a set of fault resolution paths 260, each providing a step or sequence of steps requisite to clearing a corresponding fault condition in a computing system without requiring a restart of the computing system, for example a host computing platform 210, the operating system 220, the job controller 230, a given one of the jobs 240, or even the computer program 250.
To that end, the recoverable fault handler 300 can include program code which when executed in the host computing platform 210, identifies a recoverable fault condition, locates a corresponding one of the fault resolution paths 260 and presents the located corresponding one of the fault resolution paths 260 to an external operator to provide both guidance as to clearing the recoverable fault condition and also to solicit additional input from the operator requisite to clearing the recoverable fault condition. For instance, the additional input can include the operator commanding the re-prioritization of individual ones of the jobs 240, the termination of selected ones of the jobs 240, or the pausing of selected ones of the jobs 240. Also, the additional input can include the adding of computing resources like memory or disk space while pausing the processing of one or more of the jobs 240, or the re-allocation of existing computing resources while pausing the processing of one or more of the jobs 240.
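The job-level operator input described above (re-prioritization, termination, or pausing of selected jobs) might be modeled as in the following sketch. The `Job` record, the `Action` enumeration, and `apply_operator_input` are hypothetical names introduced for illustration and are not part of the specification.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REPRIORITIZE = "reprioritize"
    TERMINATE = "terminate"
    PAUSE = "pause"


@dataclass
class Job:
    name: str
    priority: int
    state: str = "running"


def apply_operator_input(jobs, action, job_name, new_priority=None):
    """Apply one operator command to the named job and return the updated job."""
    job = jobs[job_name]
    if action is Action.REPRIORITIZE:
        if new_priority is None:
            raise ValueError("re-prioritization needs a new priority")
        job.priority = new_priority
    elif action is Action.TERMINATE:
        job.state = "terminated"
    elif action is Action.PAUSE:
        job.state = "paused"
    return job
```

Resource-level input, such as adding memory or disk space while jobs are paused, is left out of this sketch because it depends on the host computing platform.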
In even yet further illustration of the operation of recoverable fault handler 300, FIG. 3 is a flow chart illustrating a process for run-time fault resolution from development time fault and fault resolution path identification. Beginning in block 310, a recoverable fault condition can be detected. The fault condition can include, by way of example, excessive disk thrashing or low computing resource conditions such as low memory. In block 320, it can be determined whether or not the fault condition is recoverable in the sense that a restart can be avoided in consequence of the fault condition. If not, in block 330 ordinary fault handling can be performed resulting in a restart of the computing system. Otherwise, the process can continue through block 340.
In block 340, a fault resolution path can be matched to the recoverable fault condition and in block 350, an external operator can be prompted with the matched fault resolution path. Operator input can be provided consistent with the matched resolution path, such as a suggested commanding of the re-prioritization of individual jobs, the termination of selected jobs, or the pausing of selected jobs. Also, the operator input can be consistent with the adding of computing resources like memory or disk space while pausing the processing of one or more jobs, or the re-allocation of existing computing resources while pausing the processing of one or more of the jobs. Thereafter, in decision block 360 if the operator input is provided as prompted, in block 370 the operation of the computing system can continue. Otherwise, in block 380 rollback and restart can be performed on the computing system.
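The flow of blocks 310 through 380 might be sketched as a single function, shown below under stated assumptions. The function name `handle_fault`, its parameters, and the string results are hypothetical; in particular, treating a recoverable fault with no listed path as a case for ordinary handling is an assumption of this sketch, not something the flow chart states.

```python
def handle_fault(fault_code, recoverable_codes, paths, prompt_operator):
    """Return the outcome of handling a fault detected in block 310.

    recoverable_codes: set of fault codes for which a restart can be avoided
    paths: mapping of fault code to a development-time fault resolution path
    prompt_operator: callable that presents a path and returns True when the
        operator provides the input as prompted
    """
    # Block 320: decide whether the fault is recoverable.
    if fault_code not in recoverable_codes:
        return "restart"                      # Block 330: ordinary fault handling

    # Block 340: match a fault resolution path to the fault.
    path = paths.get(fault_code)
    if path is None:
        return "restart"                      # Assumption: no listed path, fall back

    # Block 350: prompt the external operator with the matched path.
    performed = prompt_operator(path)

    # Decision block 360: was the operator input provided as prompted?
    if performed:
        return "continue"                     # Block 370: operation continues
    return "rollback_and_restart"             # Block 380
```

Passing `prompt_operator` as a callable keeps the sketch independent of any particular user interface.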
Embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, and the like. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Claims (17)

1. A method for run-time fault resolution from development time fault and fault resolution path identification, the method comprising:
detecting a recoverable fault condition in a computing system;
selecting a fault resolution path from amongst a plurality of development time specified fault resolution paths to match the recoverable fault condition, each path providing at least one step requisite to clearing a corresponding fault condition in a computing system without requiring a restart of the computing system;
prompting an operator with the selected fault resolution path; and,
resuming operation of the computing system without restart subsequent to the operator performing the selected fault resolution path.
2. The method of claim 1, wherein detecting a recoverable fault condition in a computing system, comprises detecting a low resource condition in the computing system.
3. The method of claim 1, wherein prompting an operator with the selected fault resolution path, comprises prompting an operator to re-prioritize an existing job in the computing system.
4. The method of claim 1, wherein prompting an operator with the selected fault resolution path, comprises prompting an operator to terminate an existing job in the computing system.
5. The method of claim 1, wherein prompting an operator with the selected fault resolution path, comprises prompting an operator to pause operation of the computing system while adding new resources to the computing system.
6. The method of claim 1, wherein prompting an operator with the selected fault resolution path, comprises prompting an operator to pause operation of the computing system while reallocating an existing resource to a different job in the computing system.
7. A computer data processing system configured for run-time fault resolution from development time fault and fault resolution path identification, the system comprising:
a memory, said memory comprising:
a host computing platform supporting an operating system and job controller;
a computer program executing in the operating system, the computer program comprising a plurality of jobs managed by the job controller; and,
a recoverable fault handler comprising program code enabled to detect a recoverable fault condition in the system, to select a fault resolution path from amongst a plurality of development time specified fault resolution paths to match the recoverable fault condition, to prompt an operator with the selected fault resolution path, and to resume operation of the system without restart subsequent to the operator performing the selected fault resolution path, each path providing at least one step requisite to clearing a corresponding fault condition in a computing system without requiring a restart of the computing system.
8. The system of claim 7, wherein the recoverable fault handler is coupled to the operating system.
9. The system of claim 7, wherein the recoverable fault handler is coupled to the job controller.
10. The system of claim 7, wherein the recoverable fault handler is coupled to the computer program.
11. The system of claim 7, wherein the recoverable fault handler is coupled to at least one of the jobs.
12. A computer program product comprising a non-transitory computer usable medium embodying computer usable program code for run-time fault resolution from development time fault and fault resolution path identification, the computer program product comprising:
computer usable program code for detecting a recoverable fault condition in a computing system;
computer usable program code for selecting a fault resolution path from amongst a plurality of development time specified fault resolution paths to match the recoverable fault condition, each path providing at least one step requisite to clearing a corresponding fault condition in a computing system without requiring a restart of the computing system;
computer usable program code for prompting an operator with the selected fault resolution path; and,
computer usable program code for resuming operation of the computing system without restart subsequent to the operator performing the selected fault resolution path.
13. The computer program product of claim 12, wherein the computer usable program code for detecting a recoverable fault condition in a computing system, comprises computer usable program code for detecting a low resource condition in the computing system.
14. The computer program product of claim 12, wherein the computer usable program code for prompting an operator with the selected fault resolution path, comprises computer usable program code for prompting an operator to re-prioritize an existing job in the computing system.
15. The computer program product of claim 12, wherein the computer usable program code for prompting an operator with the selected fault resolution path, comprises computer usable program code for prompting an operator to terminate an existing job in the computing system.
16. The computer program product of claim 12, wherein the computer usable program code for prompting an operator with the selected fault resolution path, comprises computer usable program code for prompting an operator to pause operation of the computing system while adding new resources to the computing system.
17. The computer program product of claim 12, wherein the computer usable program code for prompting an operator with the selected fault resolution path, comprises computer usable program code for prompting an operator to pause operation of the computing system while reallocating an existing resource to a different job in the computing system.

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/142,945 US8090997B2 (en) 2008-06-20 2008-06-20 Run-time fault resolution from development-time fault and fault resolution path identification

Publications (2)

Publication Number Publication Date
US20090319823A1 US20090319823A1 (en) 2009-12-24
US8090997B2 true US8090997B2 (en) 2012-01-03

Family

ID=41432495

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/142,945 Expired - Fee Related US8090997B2 (en) 2008-06-20 2008-06-20 Run-time fault resolution from development-time fault and fault resolution path identification

Country Status (1)

Country Link
US (1) US8090997B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10884888B2 (en) 2019-01-22 2021-01-05 International Business Machines Corporation Facilitating communication among storage controllers

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9413893B2 (en) 2012-04-05 2016-08-09 Assurant, Inc. System, method, apparatus, and computer program product for providing mobile device support services
US9483344B2 (en) * 2012-04-05 2016-11-01 Assurant, Inc. System, method, apparatus, and computer program product for providing mobile device support services

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4922491A (en) * 1988-08-31 1990-05-01 International Business Machines Corporation Input/output device service alert function
US6487677B1 (en) * 1999-09-30 2002-11-26 Lsi Logic Corporation Methods and systems for dynamic selection of error recovery procedures in a managed device
US20030177417A1 (en) * 2002-03-14 2003-09-18 Sun Microsystems Inc., A Delaware Corporation System and method for remote performance analysis and optimization of computer systems
US6718489B1 (en) * 2000-12-07 2004-04-06 Unisys Corporation Electronic service request generator for automatic fault management system
US6751758B1 (en) * 2001-06-20 2004-06-15 Emc Corporation Method and system for handling errors in a data storage environment
US6760869B2 (en) * 2001-06-29 2004-07-06 Intel Corporation Reporting hard disk drive failure
US20060053347A1 (en) * 2004-09-09 2006-03-09 Microsoft Corporation Method, system, and apparatus for providing alert synthesis in a data protection system
US20070174720A1 (en) * 2006-01-23 2007-07-26 Kubo Robert A Apparatus, system, and method for predicting storage device failure
US20080005609A1 (en) * 2006-06-29 2008-01-03 Zimmer Vincent J Method and apparatus for OS independent platform recovery
US20080028264A1 (en) * 2006-07-27 2008-01-31 Microsoft Corporation Detection and mitigation of disk failures
US20080115014A1 (en) * 2006-11-13 2008-05-15 Kalyanaraman Vaidyanathan Method and apparatus for detecting degradation in a remote storage device
US7409586B1 (en) * 2004-12-09 2008-08-05 Symantec Operating Corporation System and method for handling a storage resource error condition based on priority information
US20080270842A1 (en) * 2007-04-26 2008-10-30 Jenchang Ho Computer operating system handling of severe hardware errors

Also Published As

Publication number Publication date
US20090319823A1 (en) 2009-12-24

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HAMPTON, MARK C.;REEL/FRAME:021331/0633

Effective date: 20080613

ZAAA Notice of allowance and fees due

Free format text: ORIGINAL CODE: NOA

ZAAB Notice of allowance mailed

Free format text: ORIGINAL CODE: MN/=.

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20240103