A method and system for handling errors in a distributed computer system

Info

Publication number
EP1214655A1
EP1214655A1 EP00928637A EP00928637A EP1214655A1 EP 1214655 A1 EP1214655 A1 EP 1214655A1 EP 00928637 A EP00928637 A EP 00928637A EP 00928637 A EP00928637 A EP 00928637A EP 1214655 A1 EP1214655 A1 EP 1214655A1
Authority
EP
European Patent Office
Prior art keywords
error
errors
resource
resolving
processing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00928637A
Other languages
German (de)
French (fr)
Inventor
Albhy Galuten
Peter Williams
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universal Music Group Inc
Original Assignee
Universal Music Group Inc
Application filed by Universal Music Group Inc

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Definitions

  • the creation and propagation of the error information package and error alert messages may have a significant impact on the perceived and realized customer service. However, the ultimate goal is to resolve the error.
  • the central resource therefore analyzes the error and provides a timely response to the user, even if that response only acts to inform the user of the problem they are experiencing.
  • Analyzing errors involves identification and evaluation of each error individually and/or in combination with other errors. Errors may be identified by the combination of information provided by the error information package. For example, based on the locations and internal state, the central resource may be able to assist in evaluation of the error and increase the likelihood of effective resolution.
  • the system may utilize an error routing server (16) to prioritize the processing of errors.
  • the error routing server identifies those errors that present the most significant threat to the continued operation of an underlying system element.
  • the routing server may take into account that various system elements have varying degrees of relative importance. For example, the operating system or some primary program that manages many other programs is more crucial than the application programs or modules it manages. The decision as to which errors present the most important threats may be dependent on the priority level set beforehand and then evaluated through a series of rules.
  • the routing server may also take into account that some errors may be related and should be handled jointly. Processing errors from various system elements at the central resource creates the ability to aggregate these errors and to provide alerts as to the problems with a primary system element, e.g., the failure of one or more delivery services or crucial pipes that are relied upon for other mission critical infrastructure.
  • the database may contain a history of past errors with suggestions as to resolutions of those errors.
  • the database may contain a compilation of frequently occurring errors or frequently asked questions that may guide the system in resolving the instant error.
  • the FAQ server may utilize common techniques to aggregate the errors and their causes, which may be indexed by both cause and error identification numbers. New FAQs may be created from the Error Resource Server, once the errors have been aggregated or associated with specific problems within the system elements.
  • the Error Resource Server is the repository of all the errors that are produced by the system.
  • the Error Resource Server may hold the representation of the system architecture with each interface of the system elements, and can use these interfaces as the mechanism to categorize the errors received.
  • the errors may be classified as either internal to a system element or external to an element.
  • the definition of errors can include an identification of the system element and the relationship of the error to that system element, or other system elements.
  • Errors can be related to each other in an object model using commonly known Object Modeling techniques, including, but not limited to, inheritance, pre and post conditions and attributes. Further details of such Object Modeling may be found in Meyer, "Object Oriented Software Construction" (Prentice Hall), the contents of which are incorporated herein by reference.
  • the identification of the relationship between errors and the treatment of these as individual objects within a systematic model provides the core of the Error Resource Server.
  • the mapping of the relationships of the errors to the system interface model provides the framework for the errors to be classified and accessed.
  • the Error Resource Server provides the data resource for the rest of the error system, and acts as the repository from which the other system elements obtain their baseline information. This enables other system elements to provide an efficient and timely response to system errors, while at the same time maintaining a contemporaneous error management resource and management system that supports the operations of the systems.
  • the errors that occur become part of the customer care method which enables the efficient operation of the system as a whole. In this way, errors become homogenous within the system operation as a whole.
  • the central resource may be able to identify the underlying problem causing an error or group of errors. Having identified a problem, the resource may proceed to address the problem if possible.
  • the central resource filters errors according to the types of response or remedy required.
  • Such filtering is accomplished by an error filter (14).
  • the filter may separate out those errors that cannot be resolved without some physical change or human intervention. For example, an error caused by insufficient local disk space typically requires the user to delete some files creating available disk space or to add or replace disk space. Some errors may be filtered out and redirected for further processing. For example, an error that requires another system element to take action to resolve the issue may be redirected to the other system element. Another example is where a collection of system elements taken as a whole is dependent on external infrastructure or services provision that encountered a failure. In such an instance, the error may be redirected to the external element.
  • the error information packages generated by the central resource are well suited to importation into network management systems, which may be used for error management, monitoring, escalation and ultimately customer care.
  • the system and method of the invention handle errors by creating error information packages, propagating error alert messages, and resolving the errors.
  • the creation, propagation, and resolution functions may be performed either serially or in parallel, and may be performed by the same module or different modules. Additional functions such as dispatching assistance to handle the error, prioritizing the various errors, and applying the error filter, may similarly be performed in a different order or by one or more different modules, depending on the particular application.
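The prioritizing and filtering behavior listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the element weights, the error type names, and the three response classes ("human", "external", "automatic") are hypothetical and are not taken from the patent text, which leaves the rules to each particular system.

```python
import heapq
import itertools

# Hypothetical relative importance of system elements (lower = more crucial).
ELEMENT_WEIGHT = {"operating_system": 0, "primary_program": 1, "application": 2}

# Hypothetical error types that cannot be resolved automatically.
NEEDS_HUMAN = {"disk_full"}
EXTERNAL = {"upstream_outage"}


def classify(error_type: str) -> str:
    """Play the role of the error filter: pick the kind of response required."""
    if error_type in NEEDS_HUMAN:
        return "human"
    if error_type in EXTERNAL:
        return "external"
    return "automatic"


class ErrorQueue:
    """Orders unresolved errors by priority code, then by element importance."""

    def __init__(self) -> None:
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order

    def push(self, element_kind: str, priority: int, error_type: str) -> None:
        key = (priority, ELEMENT_WEIGHT.get(element_kind, 99), next(self._order))
        heapq.heappush(self._heap, (key, element_kind, error_type))

    def pop(self):
        """Return the most urgent error together with its filter class."""
        _, element_kind, error_type = heapq.heappop(self._heap)
        return element_kind, error_type, classify(error_type)

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch a lower priority number means a more severe error, so a terminal failure in an application is handled before a minor error in the operating system; a real deployment might weigh these two factors differently.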

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A method and system for tracking and processing errors in a distributed computer system. As an application encounters an error, a centralized system intercepts and assumes the processing of that error event. The central error processing may be used with a distributed network connecting the applications running on various user computers. Upon receipt of an error message from an application (22), the system creates an informative error package (30), propagates an appropriate error alert to relevant subsystems (34), and attempts to resolve the error. The error may be resolved in various ways. For example, the system may select and dispatch appropriate help information to the user (32); or the system may locate an alternative resource to substitute for the failed resource (38). The system may prioritize errors when there is more than one error still unresolved at any given time (36). In addition, the system may filter errors that require different levels of response (40), and the system may direct errors to resources capable of assisting in resolving the error.

Description

A METHOD AND SYSTEM FOR HANDLING ERRORS IN A DISTRIBUTED COMPUTER SYSTEM
This application claims the priority of U.S. provisional patent application No. 60/131,412 filed April 28, 1999, which is incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to tracking and responding to errors in a distributed electronic system.
BACKGROUND OF THE INVENTION
Application programs are typically designed to be self-contained, each having its own capacity for handling errors that may occur during the execution of the program. With the growing popularity of operating multiple programs simultaneously, much of the code for and processing of error messages in each program is redundant and therefore inefficient. Furthermore, with the ever-increasing use of the Internet, many applications operating locally use networked resources. Some applications use a central resource to provide automated help to users connected to the Internet. What is needed is a system that handles the error messaging and error processing in an efficient manner for applications executed on distributed systems. The present invention satisfies this and other needs.
SUMMARY OF THE INVENTION
The present invention is a method and system for tracking and processing errors in a distributed computer system in which a centralized error processing utility handles errors generated by one or more applications. Specifically, as an application encounters an error, the present invention intercepts and assumes the processing of that error event. This global error processing is facilitated by the distributed network connecting the applications running on various user computers. Upon receipt of an error message from an application, the system creates an informative error package, propagates an appropriate error alert to relevant subsystems, and attempts to resolve the error. The error may be resolved in various ways. For example, the system may select and dispatch appropriate help information to the user; or the system may locate an alternative resource to substitute for the failed resource. The system may prioritize errors when there is more than one error still unresolved at any given time. In addition, the system may filter errors that require different levels of response, and the system may direct errors to resources capable of assisting in resolving the error.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram showing the preferred embodiment of the present invention; and
Figure 2 is a flow chart showing the method of the preferred embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the preferred embodiment of the present invention, the system creates error messages, propagates alerts and resolves errors that arise in the course of operation of a computer system. The system in accordance with the preferred embodiment may be an independent, self-contained program, operating on errors occurring in other computer programs. Alternatively, the present system may be part of another computer program, typically a large program having many sub-systems. The system is especially suitable for use with a network of computer systems where various applications or sub-systems may be operating simultaneously on different computers across the network, some operating independently and others operating cooperatively. However, the system and method of the present invention are generally applicable to computer systems, ranging from stand-alone computers to larger global computer networks. The term system element is used herein to refer to the broad range of computer programs and sub-systems that may be subject to the present invention, i.e., programs which generate errors. System elements include, for example, application programs, sub-programs, operating systems, communication protocols, and drivers for peripherals. In addition, the term user refers to a party using an application but may also refer to the operator or monitor of one or more system elements.
Typically, in modern programming, each system element is designed to handle exceptional conditions (such as expecting a message from another module, or trying to access a common resource which is unavailable), with an error message that is used in program debugging or is passed to an error handling routine that provides diagnostic information or user feedback. For example, within an application program, the error handling and debugging subsystems generate a specific error message associated with an unpredictable or unstable state within the application. The occurrences of errors are uniquely identified within the application program creating them, usually through a numbering or naming schema. In addition, programs typically log each error to a log file for diagnostic or audit purposes.
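As a small illustration of the conventional per-application handling described above, the sketch below gives each error a unique identifier through a numbering scheme and writes it to a log. The error names, the numeric codes, and the message layout are hypothetical and chosen only for this example.

```python
import logging
from enum import Enum


class AppError(Enum):
    # Hypothetical numbering scheme; the values are illustrative only.
    RESOURCE_UNAVAILABLE = 1001
    MESSAGE_TIMEOUT = 1002


def format_error(error: AppError, detail: str) -> str:
    """Build a message that carries the error's unique identifier."""
    return f"E{error.value} {error.name}: {detail}"


def log_error(logger: logging.Logger, error: AppError, detail: str) -> str:
    """Write the error to the logger and return the logged text."""
    line = format_error(error, detail)
    logger.error(line)
    return line
```

An application would typically attach a file handler to the logger so that each entry lands in a log file for diagnostic or audit purposes.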
There are numerous different types of errors that may occur in a system element. For example, some errors may affect the internal logic of an application program such that the program is unable to undertake the task(s) that were requested and it exits this state in either a stable or unstable form. Other errors affect only the operation of that system element and are reported to the user. Still other errors affect the operation of other system elements, for example, when the application program that experienced the error is in communication with other system elements synchronously or asynchronously. In this case the error may cause a number of system elements to exit the functions being undertaken either in a stable or unstable form.
CREATION
A central resource creates an error information package based on a signal received from a system element indicating the occurrence of an error, e.g., an error message generated by an application program. Referring to Figure 1, the error routing server (16) is a computer or utility designed to be utilized by multiple applications and/or network computers. The error routing server acts as a clearinghouse directing incoming error messages and outgoing responses. As indicated by the arrows, error messages (12) generated by system elements (10) are sent to the error routing server (16). The error routing server (16) may then forward the error message (12) to the error resource server (18), which is a computer or utility designed to implement the central resource that processes errors as described herein. The error resource server (18) may use the error FAQ server (20) to obtain information responsive to the error being processed. Additionally, the error resource server (18) may have access to one or more databases offering a variety of assistance options responsive to errors. In addition, the error routing server (16) may forward incoming error messages (12) to an error filter (14) and escalate these errors. The error filter may separate errors of different types and instruct the error routing server where each error message should be sent for processing.
Finally, these components provide assistance and/or resolve the error by sending, by way of the error routing server (16), an appropriate response or instruction to the system element (10) experiencing the error. The operation of these components is discussed in more detail in connection with Figure 2.
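The relationship between the components of Figure 1 might be sketched as below. This is a minimal sketch, not the patented implementation: the class and method names, the numeric error codes, and the rule that codes of 9000 and above go to a human operator are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ErrorMessage:
    """Error message (12) produced by a system element (10)."""
    element_id: str
    code: int
    text: str


class ErrorFaqServer:
    """Stands in for the error FAQ server (20)."""

    def __init__(self, answers: Dict[int, str]) -> None:
        self._answers = answers

    def lookup(self, code: int) -> str:
        return self._answers.get(code, "No FAQ entry; escalating to support.")


class ErrorResourceServer:
    """Stands in for the error resource server (18)."""

    def __init__(self, faq: ErrorFaqServer) -> None:
        self._faq = faq
        self.received: List[ErrorMessage] = []

    def process(self, msg: ErrorMessage) -> str:
        self.received.append(msg)          # keep a record of every error seen
        return self._faq.lookup(msg.code)  # response sent back to the element


class ErrorRoutingServer:
    """Stands in for the error routing server (16), the clearinghouse."""

    def __init__(self, error_filter: Callable[[ErrorMessage], str],
                 handlers: Dict[str, Callable[[ErrorMessage], str]]) -> None:
        self._filter = error_filter  # error filter (14): message -> destination
        self._handlers = handlers

    def route(self, msg: ErrorMessage) -> str:
        destination = self._filter(msg)
        return self._handlers[destination](msg)


def build_demo_router():
    """Wire the components together with hypothetical rules and answers."""
    resource = ErrorResourceServer(ErrorFaqServer({404: "Check the resource address."}))

    def error_filter(msg: ErrorMessage) -> str:
        return "operator" if msg.code >= 9000 else "resource"

    handlers = {
        "resource": resource.process,
        "operator": lambda m: f"Ticket opened for {m.element_id}",
    }
    return ErrorRoutingServer(error_filter, handlers), resource
```

The routing server holds no resolution logic of its own here; it only asks the filter where a message belongs and relays the response, which mirrors the clearinghouse role described above.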
Referring to Figure 2, in the event of an error during the processing of a system element, the present invention intercepts the element's processing of the error or the system element generates an error message for onward transmission. At step 24, the system element determines whether the user is actively connected to the network. If the user is not actively connected to the network, at step 28, the error message may be sent to a local error management system if present and/or queued for later transmission. If, at step 24, it is determined that the user is online, the process proceeds with step 26. At step 26, the element's error message is transmitted to a central resource for processing. The central resource may reside locally or on another area network computer or the Internet. The error may be formatted in a tamper-resistant or secure format before transmission to the central resource. The central resource may be located remotely and connected via a distributed network such as the Internet. Generally, the error message is transmitted as the user is experiencing the error condition when using a complete network system with many points of failure.
At step 30, the central resource generates an error information package (error pack) based on the received error message. Each error pack may be identified by an error code, which may be a unique number for every occurrence of an error, or may also indicate the type of error. Sufficient additional information may be included in the error package to generate some provision of assistance to the user. For example, each error pack may include an identification of the application and/or subsystem element experiencing the error; a time stamp indicating the time that the error pack was created or the time that the error occurred; and an address indicating the location of the user (e.g., IP address, MAC address, or email address). A priority code may be included to indicate the priority of the error. The priority may range, for example, from terminal, such as a system failure of the specific program, to a service disconnect where the error is a completed function or operation. An indication of the internal state of the program or system element may also be included in the error pack in order to allow other system elements to adjust their response to this state. The internal state indicates the state of the application or subsystem experiencing the error, and enables the external system elements to adapt their responses to this situation.
In addition to generating an error information package, at step 32, the central resource dispatches to the originating application, or user, a help page or other dynamically updated help information. In this manner the user receives timely assistance as to the potential cause of the problem. The help message may direct the user to FAQ type pages associated with the problem at hand. In addition, the help message may generate an automatic help "bot" or wizard that assists the user through a number of scenarios to try to identify or clear the problem. A "bot" (as in robot) is a program used on the Internet that performs repetitive functions such as posting a message to multiple newsgroups or searching for information. These scenarios may be dynamic in that they respond to user input and/or additional error or system messages that are generated within the process.
Error messages received by the central resource may be grouped by their identifying number and processed either automatically or manually to update the knowledge base and associated assistance provided to the user. The error information package may be provided in a secure format and sent to the relevant system resource.
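The creation steps described above (steps 24 through 30) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `ErrorPack` field names, the three `Priority` levels, the `ErrorReporter` class, and its `send` callback are all hypothetical names introduced for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Deque, Dict
import itertools
import time


class Priority(IntEnum):
    # Hypothetical scale: the description only says the priority may range
    # from "terminal" down to a service disconnect.
    SERVICE_DISCONNECT = 1
    DEGRADED = 2
    TERMINAL = 3


_occurrence_ids = itertools.count(1)  # one number per error occurrence


@dataclass
class ErrorPack:
    """Error information package built by the central resource (step 30)."""
    error_type: str        # kind of error, e.g. "TIMEOUT" (assumed label)
    element: str           # application or subsystem experiencing the error
    user_address: str      # e.g. IP address, MAC address, or email address
    priority: Priority
    internal_state: Dict[str, str] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)
    occurrence_id: int = field(default_factory=lambda: next(_occurrence_ids))

    @property
    def error_code(self) -> str:
        # Identifies the individual occurrence and also indicates the type.
        return f"{self.error_type}-{self.occurrence_id}"


class ErrorReporter:
    """System element side: transmit when online, otherwise queue."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send                     # transport to the central resource
        self._pending: Deque[dict] = deque()  # messages awaiting transmission

    def report(self, message: dict, online: bool) -> None:
        if online:
            self.flush()                      # deliver queued messages first
            self._send(message)               # step 26
        else:
            self._pending.append(message)     # step 28

    def flush(self) -> None:
        while self._pending:
            self._send(self._pending.popleft())

    @property
    def pending(self) -> int:
        return len(self._pending)
```

Flushing the queue before sending the new message keeps error messages in the order in which they occurred, which is one possible design choice; the description does not prescribe an ordering.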
PROPAGATION
Having generated an error information package, at step 34, the central resource propagates relevant information to any subsystem or program that may benefit from knowing about the occurrence of an error. The error information package may be sent to a corresponding web based error management resource. In addition, depending on the type of the error, error alert messages may be generated and propagated throughout the system. These messages are designed to create system alerts that indicate the system itself is experiencing a problem, such as a complete element failure or communications outage. Errors such as timeouts from delivery systems may in fact be used to dynamically switch the affected users from the resource that encountered the timeout to another resource, either locally or remotely.
The propagation of error alert messages to additional system elements may also cause the system to respond in a different manner depending on the nature of the error(s). The errors from one system element may cause a different system element to respond differently by potentially resetting another element or providing an instruction to another to act upon. This depends on the circumstances and architecture of each particular system. The error alert propagation provides the basis for integration of error handling into a comprehensive customer care solution that includes the network and supporting infrastructure.
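One way to picture the propagation step is a small publish/subscribe arrangement in which a delivery router reacts to timeout alerts by moving users to another resource. The sketch below is illustrative only; `AlertBus`, `DeliveryRouter`, the alert dictionary keys, and the resource names used in the usage note are assumptions, not part of the described system.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Handler = Callable[[dict], None]


class AlertBus:
    """Propagates error alerts to every subsystem that registered interest."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, error_type: str, handler: Handler) -> None:
        self._subscribers[error_type].append(handler)

    def propagate(self, alert: dict) -> int:
        """Deliver the alert and return how many handlers received it."""
        handlers = self._subscribers.get(alert["type"], [])
        for handler in handlers:
            handler(alert)
        return len(handlers)


class DeliveryRouter:
    """Switches users away from a resource that reported a timeout."""

    def __init__(self, resources: List[str]) -> None:
        self.healthy: List[str] = list(resources)
        self.assignment: Dict[str, str] = {}

    def assign(self, user: str) -> Optional[str]:
        if not self.healthy:
            return None
        self.assignment[user] = self.healthy[0]
        return self.assignment[user]

    def on_timeout(self, alert: dict) -> None:
        failed = alert["resource"]
        if failed in self.healthy:
            self.healthy.remove(failed)
        # Reassign every user that was served by the failed resource.
        for user, resource in list(self.assignment.items()):
            if resource == failed:
                if self.healthy:
                    self.assignment[user] = self.healthy[0]
                else:
                    del self.assignment[user]
```

For example, after `bus.subscribe("TIMEOUT", router.on_timeout)`, propagating `{"type": "TIMEOUT", "resource": "cdn-a"}` moves any user assigned to the hypothetical resource `cdn-a` to the next healthy resource.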
RESOLUTION
The creation and propagation of the error information package and error alert messages may have a significant impact on the perceived and realized customer service. However, the ultimate goal is to resolve the error. The central resource therefore analyzes the error and provides a timely response to the user, even if that response only acts to inform the user of the problem they are experiencing.
Analyzing errors involves identification and evaluation of each error individually and/or in combination with other errors. Errors may be identified by the combination of information provided by the error information package. For example, based on the locations and internal state, the central resource may be able to assist in evaluation of the error and increase the likelihood of effective resolution.
During the course of operation of the underlying system elements, many errors may occur contemporaneously, and for any given error there may be errors that occurred earlier in time that are not yet resolved. To handle the numerous errors that may remain outstanding at any given time, at step 36, the system may utilize an error routing server (16) to prioritize the processing of errors. The error routing server identifies those errors that present the most significant threat to the continued operation of an underlying system element. The routing server may take into account that various system elements have varying degrees of relative importance. For example, the operating system or some primary program that manages many other programs is more crucial than the respective application programs or modules. The decision as to which errors present the most important threats may be dependent on the priority level set beforehand and then evaluated through a series of rules. These rules may be initially defined, though over time they may be automatically updated and modified as a history of errors and failures develops. The routing server may also take into account that some errors may be related and should be handled jointly. Processing errors from various system elements at the central resource creates the ability to aggregate these errors and to provide alerts as to the problems with a primary system element, e.g., the failure of one or more delivery services or crucial pipes that are relied upon for other mission critical infrastructure.
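The prioritization performed by the error routing server can be illustrated with a priority queue. The sketch below assumes a deliberately simple rule, multiplying the error's priority code by a weight for the kind of element that raised it; the rule, the weight table, and the `ErrorRouter` name are hypothetical, and the description leaves the actual rules open.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple

# Hypothetical relative importance of element kinds.
ELEMENT_WEIGHT: Dict[str, int] = {
    "operating_system": 3,
    "manager": 2,
    "application": 1,
}


class ErrorRouter:
    """Orders outstanding errors so the most threatening is handled first."""

    def __init__(self, weights: Dict[str, int]) -> None:
        self._weights = weights
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: earlier errors first

    def submit(self, error_code: str, element_kind: str, priority: int) -> None:
        score = priority * self._weights.get(element_kind, 1)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._seq), error_code))

    def next_error(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Because the weights are passed in rather than hard-coded, they could be revised as a history of errors and failures develops, in the spirit of the automatically updated rules mentioned above.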
One way in which the system evaluates an error is to confer with a database of error related information (step 38). The database may contain a history of past errors with suggestions as to resolutions of those errors. The database may contain a compilation of frequently occurring errors or frequently asked questions that may guide the system in resolving the instant error. The FAQ server may utilize common techniques to aggregate the errors and their causes, which may be indexed by both cause and error identification numbers. New FAQs may be created from the Error Resource Server, once the errors have been aggregated or associated with specific problems within the system elements.
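A database of this kind, indexed by both cause and error identification number, might be sketched as follows. The `Resolution` record, the `KnowledgeBase` class, and the sample identifiers in the usage note are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Optional


@dataclass(frozen=True)
class Resolution:
    error_id: str     # error identification number
    cause: str        # underlying cause associated with the error
    suggestion: str   # suggested resolution or FAQ text


class KnowledgeBase:
    """History of past errors, indexed by error id and by cause."""

    def __init__(self) -> None:
        self._by_error: DefaultDict[str, List[Resolution]] = defaultdict(list)
        self._by_cause: DefaultDict[str, List[Resolution]] = defaultdict(list)

    def add(self, resolution: Resolution) -> None:
        self._by_error[resolution.error_id].append(resolution)
        self._by_cause[resolution.cause].append(resolution)

    def lookup(self, error_id: Optional[str] = None,
               cause: Optional[str] = None) -> List[Resolution]:
        """Return matching entries; error_id takes precedence over cause."""
        if error_id is not None:
            return list(self._by_error.get(error_id, []))
        if cause is not None:
            return list(self._by_cause.get(cause, []))
        return []
```

A call such as `kb.lookup(cause="disk_full")` would return every recorded suggestion for that hypothetical cause, regardless of which error number reported it.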
The Error Resource Server is the repository of all the errors that are produced by the system. The Error Resource Server may hold the representation of the system architecture with each interface of the system elements, and can use these interfaces as the mechanism to categorize the errors received. The errors may be classified as either internal to a system element or external to an element. The definition of errors can include an identification of the system element and the relationship of the error to that system element, or other system elements. Errors can be related to each other in an object model using commonly known Object Modeling techniques, including, but not limited to, inheritance, preconditions and postconditions, and attributes. Further details of such Object Modeling may be found in Meyer, "Object-Oriented Software Construction" (Prentice Hall), the contents of which are incorporated herein by reference. The identification of the relationship between errors and the treatment of these as individual objects within a systematic model provides the core of the Error Resource Server. The mapping of the relationships of the errors to the system interface model provides the framework for the errors to be classified and accessed by the rest of the system.
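As a rough illustration of treating errors as objects related by inheritance and categorized by system interface, consider the sketch below. Only inheritance and attributes are shown; preconditions and postconditions are omitted. All class names, the `classify` function, and the category strings are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List


class ElementError(Exception):
    """Base of the hypothetical error hierarchy."""

    def __init__(self, element: str, detail: str = "") -> None:
        super().__init__(f"{element}: {detail}")
        self.element = element  # system element the error relates to
        self.detail = detail


class InternalError(ElementError):
    """Error internal to a single system element."""


class ExternalError(ElementError):
    """Error arising at an interface between system elements."""

    def __init__(self, element: str, interface: str, detail: str = "") -> None:
        super().__init__(element, detail)
        self.interface = interface


class DiskFull(InternalError):
    """Example internal error."""


class DeliveryTimeout(ExternalError):
    """Example external error."""


def classify(error: ElementError) -> str:
    """Categorize an error as internal or by the interface it relates to."""
    if isinstance(error, ExternalError):
        return f"external:{error.interface}"
    if isinstance(error, InternalError):
        return "internal"
    return "unclassified"


class ErrorResourceServer:
    """Repository of received errors, grouped by category."""

    def __init__(self) -> None:
        self._by_category: DefaultDict[str, List[ElementError]] = defaultdict(list)

    def record(self, error: ElementError) -> str:
        category = classify(error)
        self._by_category[category].append(error)
        return category

    def by_category(self, category: str) -> List[ElementError]:
        return list(self._by_category.get(category, []))
```

Because the subclasses inherit from `InternalError` or `ExternalError`, a newly defined error type is categorized without any change to `classify`.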
The Error Resource Server provides the data resource for the rest of the error system, and acts as the repository from which the other system elements obtain their baseline information. This enables other system elements to provide an efficient and timely response to system errors, while at the same time maintaining a contemporaneous error management resource and management system that supports the operations of the systems. In this model, the errors that occur become part of the customer care method which enables the efficient operation of the system as a whole. In this way, errors become homogenous within the system operation as a whole. By using these resources the central resource may be able to identify the underlying problem causing an error or group of errors. Having identified a problem, the resource may proceed to address the problem if possible.
Since there are many different possible errors and problems, the central resource filters errors according to the types of response or remedy required. Such filtering is accomplished by an error filter (14). At step 40, the filter may separate out those errors that cannot be resolved without some physical change or human intervention. For example, an error caused by insufficient local disk space typically requires the user to delete some files to free disk space, or to add or replace storage. Some errors may be filtered out and redirected for further processing. For example, an error that requires another system element to take action to resolve the issue may be redirected to that system element. Another example is where a collection of system elements taken as a whole is dependent on external infrastructure or service provision that has encountered a failure. In such an instance, the error may be redirected to the external element.
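The filtering step can be pictured as sorting errors into buckets by the kind of remedy they require. The following sketch is an assumption-laden illustration: the `Remedy` categories, the `RULES` table, and the error type labels are invented for the example and do not come from the description.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Remedy(Enum):
    AUTOMATIC = "automatic"                   # central resource can resolve it
    HUMAN = "human"                           # physical change or human intervention
    REDIRECT_ELEMENT = "redirect_element"     # another system element must act
    REDIRECT_EXTERNAL = "redirect_external"   # external infrastructure failed


# Hypothetical mapping from error type to the remedy it requires.
RULES: Dict[str, Remedy] = {
    "DISK_FULL": Remedy.HUMAN,
    "TIMEOUT": Remedy.REDIRECT_ELEMENT,
    "UPSTREAM_OUTAGE": Remedy.REDIRECT_EXTERNAL,
}


def filter_errors(errors: Iterable[dict],
                  rules: Dict[str, Remedy],
                  default: Remedy = Remedy.AUTOMATIC) -> Dict[Remedy, List[dict]]:
    """Group errors by the remedy their type calls for."""
    buckets: Dict[Remedy, List[dict]] = {remedy: [] for remedy in Remedy}
    for error in errors:
        buckets[rules.get(error["type"], default)].append(error)
    return buckets
```

Each bucket can then be handed to a different handler, for instance a user notification for `Remedy.HUMAN` and a forwarding step for the two redirect categories.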
The error information packages generated by the central resource are well suited to importation into network management systems, which may be used for error management, monitoring, escalation and ultimately customer care. In this way, the system and method of the invention handle errors by creating error information packages, propagating error alert messages, and resolving the errors. It should be understood that the creation, propagation, and resolution functions may be performed either serially or in parallel, and may be performed by the same module or different modules. Additional functions, such as dispatching assistance to handle the error, prioritizing the various errors, and applying the error filter, may similarly be performed in a different order or by one or more different modules, depending on the particular application.
While the invention has been particularly shown and described with reference to a preferred embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

What is claimed is:
1. A method for tracking and processing errors in a distributed computer system, the method comprising the following steps: utilizing a centralized error detection system to intercept an error event from one of a plurality of applications; upon the interception of an error message from one of said applications, creating an informative error package; propagating appropriate error alerts to relevant subsystems, and resolving the error.
2. The method of claim 1, wherein the resolving step includes the further steps of selecting and dispatching appropriate help information to the user.
3. The method of claim 1, wherein the resolving step includes the further step of locating an alternative resource to substitute for a failed resource associated with the intercepted error.
4. The method of claim 1, further comprising the step of prioritizing errors when there is more than one error still unresolved at any given time.
5. The method of claim 1, further comprising the step of filtering errors that require different levels of response.
6. The method of claim 1, further comprising the step of directing errors to resources capable of assisting in resolving the error.
EP00928637A 1999-04-28 2000-04-27 A method and system for handling errors in a distributed computer system Withdrawn EP1214655A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13141299P 1999-04-28 1999-04-28
US131412P 1999-04-28
PCT/US2000/011702 WO2000065448A1 (en) 1999-04-28 2000-04-27 A method and system for handling errors in a distributed computer system

Publications (1)

Publication Number Publication Date
EP1214655A1 true EP1214655A1 (en) 2002-06-19

Family

ID=22449358

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00928637A Withdrawn EP1214655A1 (en) 1999-04-28 2000-04-27 A method and system for handling errors in a distributed computer system

Country Status (4)

Country Link
EP (1) EP1214655A1 (en)
JP (1) JP2002543494A (en)
AU (1) AU4684200A (en)
WO (1) WO2000065448A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7712083B2 (en) 2003-08-20 2010-05-04 Igt Method and apparatus for monitoring and updating system software
WO2005076147A1 (en) 2004-02-10 2005-08-18 Ian Andrew Maxwell A content distribution system
EP1734748A4 (en) * 2004-04-06 2008-12-03 Panasonic Corp Program execution device
GB2424086A (en) * 2004-09-14 2006-09-13 Acres Gaming Inc Monitoring computer system software
EP2951706B1 (en) * 2013-01-30 2017-06-21 Hewlett-Packard Enterprise Development LP Controlling error propagation due to fault in computing node of a distributed computing system
US9594622B2 (en) 2015-02-04 2017-03-14 International Business Machines Corporation Contacting remote support (call home) and reporting a catastrophic event with supporting documentation
US10275296B2 (en) * 2017-01-24 2019-04-30 Wipro Limited Method and system for resolving one or more errors in an enterprise storage system
US10817361B2 (en) 2018-05-07 2020-10-27 Hewlett Packard Enterprise Development Lp Controlling error propagation due to fault in computing node of a distributed computing system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0644242B2 (en) * 1988-03-17 1994-06-08 インターナショナル・ビジネス・マシーンズ・コーポレーション How to solve problems in computer systems
JP3675851B2 (en) * 1994-03-15 2005-07-27 富士通株式会社 Computer monitoring method
US5563805A (en) * 1994-08-16 1996-10-08 International Business Machines Corporation Multimedia context-sensitive real-time-help mechanism for use in a data processing system
US5892898A (en) * 1996-10-04 1999-04-06 Honeywell, Inc. Error management system for supporting the identification and logging of error messages
US5941996A (en) * 1997-07-25 1999-08-24 Merrill Lynch & Company, Incorporated Distributed network agents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0065448A1 *

Also Published As

Publication number Publication date
AU4684200A (en) 2000-11-10
WO2000065448A1 (en) 2000-11-02
JP2002543494A (en) 2002-12-17

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20011120

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20040723