WO2009150722A1

WO2009150722A1 - Trace information control device, trace information control method, and program intended for it

Info

Publication number: WO2009150722A1
Application number: PCT/JP2008/060624
Authority: WO
Inventors: 浩一中西
Original assignee: 富士通株式会社; 富士通周辺機株式会社
Priority date: 2008-06-10
Filing date: 2008-06-10
Publication date: 2009-12-17

Abstract

A trace information control device comprises a trace information acquiring means for acquiring trace information on a device control program for investigating the cause of a failure, a trace information storage memory for storing the trace information acquired by the trace information acquiring means, and a control unit for integrally controlling the trace information acquiring means and the trace information storage memory. The trace information includes first trace information indicating the route executed by the device control program and second trace information indicating the values of various parameters and variables related to the device control program. The control unit acquires/stores the first information by reducing the depth of the trace information during the normal operation of the trace information control device and, when detecting the operation leading to a failure of the computer system, controls so as to acquire/store the second information by increasing the depth of the trace information.

Description

Trace information control apparatus, trace information control method, and program therefor

The present invention relates to a trace information control apparatus, a trace information control method, and a program for causing a computer to execute the trace information control method, for use when trace information on the operation history of a computer system control program is acquired and recorded as data for investigating the cause of failures in various computer systems or computer devices.

In general, when the cause of a failure or malfunction in a computer system or computer device built on a general-purpose computer is investigated after the fact, the trace information acquired and recorded (logged) by the computer firmware is used. The trace information includes normal information, which indicates which path was executed by the device control program that controls the computer system as a whole, and detailed information, which indicates the values of various hardware parameters and of the parameters and variables defined by the device control program. Normal information is trace information acquired in a mode in which the amount of information per trace acquisition is small, and detailed information is trace information acquired in a mode in which the amount of information per trace acquisition is large. More specifically, normal information is acquired in a mode that collects only the trace information that is really necessary for predictive monitoring of computer abnormalities (faults), or in a mode in which the amount of information per trace acquisition is small, and is defined as “shallow depth information”. Detailed information, on the other hand, is acquired in a mode that also collects trace information considered useful as a supplement to the really necessary trace information acquired as normal information, or in a mode in which the amount of information per trace acquisition is large, and is defined as “deep information”. Furthermore, “firmware” means the combination of software and hardware necessary to control a computer system.
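
The distinction between shallow and deep trace information can be illustrated with a minimal sketch. The record types, field names, and the size estimate below are hypothetical and are not taken from the disclosure; they only show that a normal record carries the executed path, while a detailed record additionally carries parameter and variable values.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NormalRecord:
    """Shallow record: only which path the control program executed."""
    path_id: str


@dataclass
class DetailedRecord:
    """Deep record: the path plus parameter and variable values."""
    path_id: str
    values: Dict[str, int] = field(default_factory=dict)


def record_size(record) -> int:
    """Rough size estimate: one unit for the path, one per captured value."""
    return 1 + len(getattr(record, "values", {}))


# Hypothetical path name and values, for illustration only.
normal = NormalRecord("init->read_sector")
detailed = DetailedRecord("init->read_sector",
                          {"retry_count": 2, "status_reg": 0x41})
```

Under this assumed size measure, the detailed record is larger per acquisition than the normal record, which is the sense in which its depth is greater.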

Trace information acquired by computer firmware is usually recorded by storing it in a memory or the like mounted on the computer system. In the conventional trace information control method performed on such a computer system, either the amount of trace information that can be recorded is limited by the size of the memory mounted on the computer system, or, even where the memory size is not a constraint, the processing performance inherent to the computer (for example, its processing speed) deteriorates because trace information is recorded excessively. It is therefore difficult to retain a large amount of deep, detailed information as trace information. As a result, the information available for analysis when a failure occurs may be insufficient. In that case, new firmware with enhanced tracing may have to be created and a failure reproduction test performed.

In addition, if the trace information cannot be extracted from the memory immediately after a failure occurs, trace processing (information acquisition/storage processing) continues for the parts that keep operating normally after the failure, and that information is recorded as well. Because the information from the normally operating parts is recorded sequentially over the trace information from the time of the failure, important parts of the failure-time information may be overwritten (erased).
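
The overwriting problem can be reproduced with a small fixed-size buffer. This is only an illustration under assumed conditions: the capacity of four records and the record strings are hypothetical, and a bounded `collections.deque` stands in for the trace memory.

```python
from collections import deque

# Hypothetical trace memory that holds at most four records;
# the oldest record is discarded when a new one arrives.
trace_memory = deque(maxlen=4)

# The failure-time information is recorded first ...
trace_memory.append("FAILURE: parameter dump")

# ... and tracing of the normally operating parts continues afterwards.
for i in range(4):
    trace_memory.append(f"normal path {i}")

# By the time the memory is read out, the failure record is gone.
failure_info_survives = any(r.startswith("FAILURE") for r in trace_memory)
```

In this sketch four later records are enough to push the failure record out of the buffer.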

Here, for reference, the following Patent Documents 1 to 4 related to the conventional trace information control method are presented as prior art documents.

Patent Document 1 discloses a firmware trace data acquisition method in which a monitoring unit that monitors firmware processing in a communication control processing device and a DMA unit that transfers data designated by the monitoring unit to a trace data storage unit are provided, and a label is assigned to each of a plurality of processing modules stored in a firmware storage unit. The DMA unit is controlled so that, when the monitoring unit detects an abnormality, the detailed data of the processing module corresponding to the assigned label is transferred to the trace data storage unit, and, when operation is normal, only the label of the processing module is transferred to the trace data storage unit.

Patent Document 2 discloses a failure information recording method in which the operating environment and a history of minor failure information are normally recorded in a device history information file in the operating system, and, when a critical failure occurs in a device, the failure information that caused the critical failure is combined with the operating environment and minor failure information recorded in the device history information file up to the occurrence of the critical failure, and the combined information is recorded in a nonvolatile memory.

Patent Document 3 discloses a recording medium processing apparatus that can communicate freely with an external device and can store trace data when an abnormality occurs. The apparatus includes a normal trace data storage memory dedicated to ordinary abnormalities, a specific trace data storage memory dedicated to low-frequency abnormalities, and control means that, when a low-frequency abnormality occurs, stores the necessary data in the specific trace data storage memory on the basis of the data type, data amount, and data cut position set by setting means.

Patent Document 4 discloses a trace information management method that uses a plurality of trace areas for storing trace information. Trace information is normally saved by overwriting in a link buffer. When important trace information is acquired, overwriting of the trace area holding it is prohibited, and when that trace area becomes full, saving continues in the next trace area. When the number of trace areas in the overwrite-prohibited state reaches a certain number, the trace information in the trace area that has been overwrite-prohibited the longest is output to a file, and the overwrite prohibition of that trace area is canceled.

However, none of Patent Documents 1 to 4 describes specific measures for the problems of the prior art, namely that the amount of trace information, including the detailed information that is effective for investigating the cause of a failure, is limited by the size of the memory mounted on the computer system, and that the original processing performance of the computer is degraded by excessive recording of trace information including the detailed information. Therefore, none of Patent Documents 1 to 4 can address the problems that occur in the conventional trace information control method.

Patent Document 1: Japanese Patent Laid-Open No. 7-93233
Patent Document 2: JP-A-5-324367
Patent Document 3: JP 2001-93002 A
Patent Document 4: JP 2001-175509 A

An object of the present application is to provide a trace information control apparatus, a trace information control method, and a program therefor that prevent the amount of trace information, including the detailed information effective for investigating the cause of a failure, from being limited by the size of the memory of a computer system or the like, and that prevent the original processing performance of the computer from deteriorating because trace information including the detailed information is recorded excessively.

In order to achieve the above object, the trace information control apparatus comprises trace information acquisition means for acquiring trace information of a device control program for investigating the cause of a failure of a computer system, a trace information storage memory for storing the trace information acquired by the trace information acquisition means, and a control unit for controlling the trace information acquisition means and the trace information storage memory in a centralized manner. The trace information includes first trace information (for example, normal information) indicating which path the device control program has executed, and second trace information (for example, detailed information) indicating the values of various parameters and variables related to the device control program. The control unit performs control so that, during normal operation of the trace information control apparatus, the depth of the trace information is reduced and the first information is acquired and stored, and, when an operation leading to a failure of the computer system is detected, the depth of the trace information is increased and the second information is acquired and stored.

Here, “reducing the depth of the trace information” means, for example, reducing the amount of information per trace acquisition by setting the mode that collects only the trace information that is really necessary for predictive monitoring of computer abnormalities (failures), as described in the section “Background Art” above. Conversely, “increasing the depth of the trace information” means, for example, increasing the amount of information per trace acquisition by setting the mode that also collects trace information considered useful as a supplement to the really necessary trace information acquired as normal information, as explained in the section “Background Art” above.

Preferably, in the trace information control device, when the failure has not occurred by the time a predetermined time has elapsed since the operation leading to the failure was detected, the control unit reduces the depth of the trace information again and acquires and stores the first information.
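
The switching of depth described above, including the return to shallow tracing after a predetermined time, might be sketched as follows. This is not the disclosed implementation: the class name `DepthController`, its methods, the explicit `now` timestamps, and the dictionary record format are all assumptions made for illustration, and the sketch assumes that no failure occurs while deep tracing is active.

```python
class DepthController:
    """Hypothetical controller that selects the trace depth."""

    SHALLOW = "shallow"
    DEEP = "deep"

    def __init__(self, revert_after: float):
        self.revert_after = revert_after  # the predetermined time (assumed unit: seconds)
        self.depth = self.SHALLOW         # normal operation starts shallow
        self._deep_since = None

    def on_precursor(self, now: float) -> None:
        """An operation leading to a failure (e.g. a retry) was detected."""
        self.depth = self.DEEP
        self._deep_since = now

    def tick(self, now: float) -> None:
        """Return to shallow tracing once the predetermined time has elapsed."""
        if self.depth == self.DEEP and now - self._deep_since >= self.revert_after:
            self.depth = self.SHALLOW
            self._deep_since = None

    def trace(self, path_id, variables):
        """Build one trace record at the current depth."""
        if self.depth == self.DEEP:
            return {"path": path_id, "values": dict(variables)}
        return {"path": path_id}


ctl = DepthController(revert_after=10.0)
shallow = ctl.trace("read_sector", {"retry": 1})
ctl.on_precursor(now=0.0)
deep = ctl.trace("read_sector", {"retry": 1})
ctl.tick(now=5.0)
depth_at_5 = ctl.depth
ctl.tick(now=10.0)
depth_at_10 = ctl.depth
```

Passing the time in explicitly keeps the sketch deterministic; real firmware would more likely read a hardware timer.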

The trace information control device according to the first aspect comprises: trace information acquisition means for acquiring trace information of a device control program for investigating the cause of a failure of a computer system; a trace information storage memory for storing the trace information acquired by the trace information acquisition means; a program storage memory for storing the device control program and a trace information tracing program for performing the acquisition/storage processing of the trace information; a program execution memory into which the device control program and the trace information tracing program are loaded from the program storage memory when these programs are executed; and a control unit that reads and executes the device control program and the trace information tracing program loaded in the program execution memory, and controls the trace information acquisition means, the trace information storage memory, the program storage memory, and the program execution memory. The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating the values of various parameters and variables related to the device control program. The control unit performs control so that, after the trace information control device is started up, a first information tracing program for performing the acquisition/storage processing of the first information is loaded from the program storage memory into the program execution memory, and, when an operation leading to a failure is detected, a second information tracing program for performing the acquisition/storage processing of the second information is loaded from the program storage memory into the program execution memory by overwriting, so that the first information tracing program is replaced with the second information tracing program.

Preferably, in the trace information control device according to the first aspect, when the failure has not occurred by the time a predetermined time has elapsed since the operation leading to the failure was detected, the control unit loads the first information tracing program from the program storage memory into the program execution memory by overwriting, so that the second information tracing program is replaced with the first information tracing program.
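
The replacement of one tracing program by the other can be sketched by treating the program execution memory as a single slot that holds whichever tracing routine was loaded last. The names `PROGRAM_STORAGE`, `ExecutionMemory`, `normal_trace`, and `detailed_trace` are hypothetical, and Python callables stand in for the program images that real firmware would copy from nonvolatile memory into RAM.

```python
def normal_trace(path_id, variables):
    """First information tracing program: records the path only."""
    return {"path": path_id}


def detailed_trace(path_id, variables):
    """Second information tracing program: records path and values."""
    return {"path": path_id, "values": dict(variables)}


# Stand-in for the program storage memory (assumed layout).
PROGRAM_STORAGE = {"normal": normal_trace, "detailed": detailed_trace}


class ExecutionMemory:
    """Stand-in for the program execution memory: one tracing-program slot."""

    def __init__(self):
        self.trace_program = None

    def load(self, name):
        # Loading overwrites whatever tracing program was in the slot.
        self.trace_program = PROGRAM_STORAGE[name]

    def run_trace(self, path_id, variables):
        return self.trace_program(path_id, variables)


ram = ExecutionMemory()
ram.load("normal")                       # after start-up
before = ram.run_trace("seek", {"cyl": 7})
ram.load("detailed")                     # operation leading to a failure detected
during = ram.run_trace("seek", {"cyl": 7})
ram.load("normal")                       # no failure within the predetermined time
after = ram.run_trace("seek", {"cyl": 7})
```

Because the caller always invokes whatever occupies the slot, the call site in this sketch does not test a depth flag on every trace call.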

The trace information control device according to the second aspect comprises: trace information acquisition means for acquiring trace information of a device control program for investigating the cause of a failure of a computer system; a trace information storage memory for storing the trace information acquired by the trace information acquisition means; and a control unit that controls the trace information acquisition means and the trace information storage memory in an integrated manner. The trace information includes first trace information (for example, normal information) indicating which path the device control program has executed, and second trace information (for example, detailed information) indicating the values of various parameters and variables related to the device control program. The trace information storage area in the trace information storage memory is divided in advance into a first trace information storage area (for example, a normal information storage area) and a second trace information storage area (for example, a detailed information storage area). The control unit performs control so that, during normal operation of the trace information control device, the first information is stored in the first information storage area, and, when an operation leading to a failure of the computer system is detected, the first information is stored in the first information storage area and the second information is stored in the second information storage area. Furthermore, when the number of times that an operation leading to the failure is detected exceeds a predetermined threshold, the control unit dynamically increases the size of the second information storage area.
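
A divided storage area with a dynamically growing second area might look like the following sketch. Several details are assumptions not stated in the text: each area is modelled as a bounded `deque`, the second area grows by a fixed number of records, the detection counter is reset after each growth, and the total memory budget is not modelled.

```python
from collections import deque


class PartitionedTraceMemory:
    """Hypothetical trace memory split into two separate areas."""

    def __init__(self, normal_size, detailed_size, threshold, growth):
        self.normal_area = deque(maxlen=normal_size)      # first information
        self.detailed_area = deque(maxlen=detailed_size)  # second information
        self.threshold = threshold   # predetermined detection count
        self.growth = growth         # assumed growth step, in records
        self.precursor_count = 0

    def store_normal(self, record):
        """Normal operation: only the first area is written."""
        self.normal_area.append(record)

    def store_on_precursor(self, normal_record, detailed_record):
        """An operation leading to a failure was detected: write both areas."""
        self.precursor_count += 1
        if self.precursor_count > self.threshold:
            self._grow_detailed_area()
            self.precursor_count = 0  # assumption: count again from zero
        self.normal_area.append(normal_record)
        self.detailed_area.append(detailed_record)

    def _grow_detailed_area(self):
        new_size = self.detailed_area.maxlen + self.growth
        # Rebuild the area with a larger capacity, keeping its contents.
        self.detailed_area = deque(self.detailed_area, maxlen=new_size)


mem = PartitionedTraceMemory(normal_size=3, detailed_size=2,
                             threshold=2, growth=2)
for i in range(3):
    mem.store_on_precursor(f"n{i}", f"d{i}")
for i in range(5):
    mem.store_normal(f"later{i}")
```

In this run the third detection exceeds the threshold of two, so the second area grows from two to four records before `d2` is stored, and the later normal records overwrite only the first area.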

The trace information control device according to the third aspect comprises: trace information acquisition means for acquiring trace information of a device control program for investigating the cause of a failure of a computer system; a trace information storage memory for storing the trace information acquired by the trace information acquisition means; a program storage memory (for example, a nonvolatile memory) for storing the device control program and a trace information tracing program for performing the acquisition/storage processing of the trace information; a program execution memory into which the device control program and the trace information tracing program are loaded from the program storage memory when these programs are executed; and a control unit that reads and executes the device control program and the trace information tracing program loaded in the program execution memory, and controls the trace information acquisition means, the trace information storage memory, the program storage memory, and the program execution memory in an integrated manner. The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating the values of various parameters and variables related to the device control program. The trace information storage area in the trace information storage memory is divided in advance into a first trace information storage area and a second trace information storage area.

In the trace information control apparatus according to the third aspect, after the trace information control apparatus is activated, the control unit loads a first information tracing program for performing the acquisition/storage processing of the first information from the program storage memory into the program execution memory, and performs control so that the first information is stored in the first information storage area on the basis of the first information tracing program. When an operation leading to a failure of the computer system is detected, the control unit loads a second information tracing program for performing the acquisition/storage processing of the second information from the program storage memory into the program execution memory by overwriting, and performs control so that the second information is stored in the second information storage area on the basis of the second information tracing program. Furthermore, when the number of times that an operation leading to the failure is detected exceeds a predetermined threshold, the control unit performs control to dynamically increase the size of the second information storage area.

The trace information control method is a method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program for investigating the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating the values of various parameters and variables related to the device control program. The method includes a step of reducing the depth of the trace information and acquiring and storing the first information during normal operation of the trace information control apparatus, and a step of increasing the depth of the trace information and acquiring and storing the second information when an operation leading to a failure of the computer system is detected.

Alternatively, the trace information control method is a method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program for investigating the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating the values of various parameters and variables related to the device control program. The method includes: a step of dividing the trace information storage area in the trace information storage memory in advance into a first trace information storage area and a second trace information storage area; a step of storing the first information in the first information storage area during normal operation of the trace information control apparatus; a step of storing the first information in the first information storage area and the second information in the second information storage area when an operation leading to a failure of the computer system is detected; and a step of dynamically increasing the size of the second information storage area when the number of times that an operation leading to the failure is detected exceeds a predetermined threshold.

A program for causing a computer to execute this trace information control method, when controlling a trace information control device provided with a trace information storage memory for acquiring and storing trace information of a device control program for investigating the cause of a failure of a computer system, causes the computer to reduce the depth of the trace information and acquire and store the first information during normal operation of the trace information control device, and to increase the depth of the trace information and acquire and store the second information when an operation leading to a failure of the computer system is detected.

Alternatively, a program for causing a computer to execute the trace information control method, when controlling a trace information control device provided with a trace information storage memory for acquiring and storing trace information of a device control program for investigating the cause of a failure of a computer system, causes the computer to: divide the trace information storage area in the trace information storage memory in advance into a first trace information storage area and a second trace information storage area; store the first information in the first information storage area during normal operation of the trace information control apparatus; store the first information in the first information storage area and the second information in the second information storage area when an operation leading to a failure of the computer system is detected; and dynamically increase the size of the second information storage area when the number of times that an operation leading to the failure is detected exceeds a predetermined threshold.

In summary, in the disclosed trace information control apparatus, trace information control method, and program therefor, during normal operation of the trace information control apparatus the depth of the trace information is reduced so that only the first information (for example, normal information), which indicates which path the apparatus control program has executed, is acquired and recorded in the trace information storage memory. When the occurrence of a retry (predictive control) or an error leading to an important failure of the computer system is detected, the depth of the trace information is increased automatically so that the second information (for example, detailed information), which indicates the values of various parameters and variables related to the device control program, is acquired and recorded in the trace information storage memory.

As a result, deep second information is not recorded more than necessary, so it is possible to avoid limiting the amount of second information, effective for investigating the cause of the failure, that can be recorded when an important failure occurs. In addition, because the second information is not recorded excessively, the second information effective for investigating the cause of the failure can be recorded efficiently without degrading the original processing performance of the computer.

Furthermore, in the disclosed trace information control apparatus, trace information control method, and program therefor, after activation of the trace information control apparatus, a first information tracing program for performing the acquisition/storage processing of the shallow first information is loaded from the program storage memory (for example, a non-volatile memory) into the program execution memory. When the occurrence of a retry or error leading to an important failure of the computer system is detected, a second information tracing program for performing the acquisition/storage processing of the deep second information is loaded from the program storage memory by overwriting, so that the first information tracing program is automatically replaced with the second information tracing program.

As a result, deep second information is not recorded more than necessary, so it is possible to avoid limiting the amount of second information, effective for investigating the cause of the failure, that can be recorded when an important failure occurs. In addition, because only one of the trace information tracing programs is executed at any time, the processing overhead of frequently checking a flag does not arise, and the second information effective for investigating the cause of the failure can be recorded efficiently without degrading the original processing performance of the computer.

Furthermore, in the disclosed trace information control device, trace information control method, and program therefor, the trace information storage area in the trace information storage memory is divided in advance into the first trace information storage area (for example, a normal information storage area) and the second trace information storage area (for example, a detailed information storage area). During normal operation of the trace information control device, the first information is stored in the first information storage area. When the occurrence of a retry or error leading to an important failure of the computer system is detected, the second information is stored in the second information storage area, and when the number of occurrences of a retry or error leading to an important failure exceeds a predetermined threshold, control is performed to dynamically increase the size of the second information storage area.

As described above, by increasing the size of the second information storage area in accordance with the number of occurrences of retries or errors leading to an important failure, the necessary amount of deep second information is recorded in the second information storage area. The second information effective for investigating the cause of the failure can therefore be recorded efficiently without affecting the original processing performance of the computer.

In addition, because the trace information storage area is divided in advance into a first trace information storage area and a second trace information storage area, even if the second trace information cannot be taken out of the trace information storage memory immediately after the occurrence of a retry or error leading to an important failure of the computer system is detected, the first trace information from the parts that keep operating normally after the important failure is not written over the second trace information. It is thus possible to prevent the second information from the time of the failure from being erased.

The disclosed trace information control apparatus, trace information control method, and the like will be described below with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram showing the overall hardware configuration of a computer system to which the trace information control apparatus according to the embodiment is applied;
FIG. 2 is a schematic diagram showing switching between the normal information tracing program and the detailed information tracing program of FIG. 1;
FIG. 3 is a flowchart for explaining trace execution processing of trace information to be compared with the trace information control method of the present application;
FIG. 4 is a flowchart for explaining trace execution processing of trace information (normal information) in the first embodiment;
FIG. 5 is a flowchart for explaining trace execution processing of trace information (detailed information) in the first embodiment;
FIG. 6 is a schematic diagram showing how the trace information storage area is divided in the second embodiment;
FIG. 7 is a flowchart for explaining trace execution processing of trace information (normal information and detailed information) in the second embodiment;
FIG. 8 is a flowchart for explaining trace execution processing of trace information (normal information) in the third embodiment; and
FIG. 9 is a flowchart for explaining trace execution processing of trace information (detailed information) in the third embodiment.

Hereinafter, the configuration and operation of the trace information control apparatus according to the present embodiment and the trace execution processing by the trace information control method will be described with reference to the attached drawings (FIGS. 1 to 9).

FIG. 1 is a block diagram showing the overall hardware configuration of a computer system to which the trace information control apparatus according to this embodiment is applied, and FIG. 2 is a schematic diagram showing switching between the normal information tracing program and the detailed information tracing program of FIG. 1.

FIG. 1 illustrates the hardware configuration of a computer system 9 that includes the trace information control apparatus according to the present embodiment. FIG. 2 schematically shows the programs stored in the program storage memory 1 and the program execution memory 2, which are main components of the trace information control apparatus according to the present embodiment.

As described above, the trace information used as data for investigating the cause of a failure in the computer system usually includes normal information, which indicates which path the device control program for controlling the computer system has executed, and detailed information, which indicates the values of various hardware parameters and of the parameters and variables defined by the device control program. Hereinafter, components that are the same as those described above are denoted by the same reference numerals.

The computer system (or computer apparatus) 9 of FIG. 1 is provided with a program storage memory 1 for storing various programs related to the operation of the computer system that are used when the cause of a failure of the computer system is investigated. Preferably, the program storage memory 1 is configured by a nonvolatile memory such as a flash memory or a rewritable ROM (read-only memory). The program storage memory 1 has a device control program storage area 10 for storing a device control program 10p (see FIG. 2), a normal information tracing program storage area for storing a normal information tracing program 11p (see FIG. 2) that performs normal information acquisition/storage processing, and a detailed information tracing program storage area 12 for storing a detailed information tracing program 12p (see FIG. 2) that performs detailed information acquisition/storage processing.

Further, the computer system of FIG. 1 is provided with a program execution memory 2 into which the device control program 10p (see FIG. 2), the normal information tracing program 11p (see FIG. 2), and the detailed information tracing program 12p (see FIG. 2) are loaded from the program storage memory 1 so that these programs can be executed. Preferably, the program execution memory 2 is configured by a memory that can be written and read as needed, such as a RAM (Random Access Memory). The program execution memory 2 has a program execution area for storing the device control program, the normal information tracing program, and the detailed information tracing program, and a parameter storage area for storing various parameters necessary for executing these programs.

Further, the computer system 9 of FIG. 1 is provided with trace information acquisition means 5 for acquiring and temporarily holding trace information used as data for investigating the cause of a failure in the computer system, and a trace information storage memory 3 for storing (recording) the trace information held by the trace information acquisition means 5. Preferably, the trace information storage memory 3 is configured by a nonvolatile memory such as a rewritable ROM. Furthermore, the trace information storage memory 3 has a trace information storage area for storing trace information including normal information and detailed information. Preferably, the computer system 9 shown in FIG. 1 is also provided with trace information output means 6 including a display unit that displays the trace information stored in the trace information storage memory 3, or displays the state in which trace information is being recorded by executing the device control program, the normal information tracing program, and the detailed information tracing program.

Further, the computer system 9 of FIG. 1 is provided with a control unit 4 that controls the program storage memory 1, the program execution memory 2, the trace information acquisition means 5, the trace information storage memory 3, and the trace information output means 6. The program storage memory 1, the program execution memory 2, the trace information acquisition means 5, the trace information storage memory 3, the trace information output means 6, and the control unit 4 are connected to each other via a bus B.

Here, the functions of the control unit 4 and the trace information acquisition means 5 are realized by the CPU (central processing unit) of a computer. More specifically, the device control program, the normal information tracing program, and the detailed information tracing program are loaded (copied), for example, from the ROM of the program storage memory 1 to the RAM of the program execution memory 2. The functions of the CPU firmware are realized when the CPU reads out and executes the device control program, the normal information tracing program, the detailed information tracing program, and the various parameters necessary for executing these programs. Instead of the ROM and RAM included in the program storage memory 1 and the program execution memory 2, a ROM or RAM built into the CPU may be used.

In the computer system 9 constituted by the trace information control apparatus according to the present embodiment of FIG. 1, the CPU firmware operates in the following processing flow (1) to (3).

(1) After the trace information control device is started, it is set to the trace state for normal operation: the normal information tracing program is loaded from the program storage memory 1 to the program execution memory 2, and normal information is acquired based on the normal information tracing program and stored in the trace information storage memory 3. In other words, after the trace information control device is started, the depth of the trace information is set to the default shallow depth, so that only normal information with a shallow depth is recorded.
Here, for confirmation, the definitions of “normal information with a shallow depth” and “detailed information with a deep depth” are described again. “Normal information with a shallow depth” is information acquired in a mode that collects only the trace information that is really necessary for predictive monitoring of computer abnormalities (failures), or in a mode in which the amount of information per trace acquisition is small. On the other hand, “detailed information with a deep depth” is information acquired in a mode that also collects trace information considered useful when added to the really necessary trace information acquired as normal information, or in a mode in which the amount of information per trace acquisition is large.

(2) When a retry (predictive control) or an error leading to an important failure of the computer system is detected, the detailed information tracing program is overwritten and loaded from the program storage memory 1 to the program execution memory 2, so that the normal information tracing program is completely replaced with the detailed information tracing program. Detailed information is then acquired based on the detailed information tracing program and stored in the trace information storage memory 3. In other words, because the depth of the trace information is set deeper upon the occurrence of a retry or an error leading to an important failure, only detailed information with a deep depth is selectively recorded (the firmware is automatically switched to firmware for enhanced trace information).

(3) If an important failure does not actually occur after a certain amount of time has elapsed since the retry or error leading to an important failure was detected, the normal information tracing program is overwritten and loaded from the program storage memory 1 to the program execution memory 2, so that the detailed information tracing program is replaced with the normal information tracing program again. Normal information is then acquired based on the normal information tracing program and stored in the trace information storage memory 3. In other words, if an important failure does not occur after a certain period of time has elapsed, the depth of the trace information is set shallow again, and the device automatically returns to recording only normal information with a shallow depth.
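The processing flow (1) to (3) can be read as a small state machine. The following Python sketch is illustrative only: the names `Depth`, `TraceDepthController`, and `revert_after`, as well as the injectable clock, are assumptions introduced here, and the specification does not state the length of the "certain amount of time".

```python
import time
from enum import Enum


class Depth(Enum):
    SHALLOW = "normal information"    # default after start-up, step (1)
    DEEP = "detailed information"     # after a precursor is detected, step (2)


class TraceDepthController:
    """Hypothetical model of the depth switching in steps (1) to (3)."""

    def __init__(self, revert_after, clock=time.monotonic):
        self.revert_after = revert_after  # assumed length of the "certain amount of time"
        self.clock = clock                # injectable so the behaviour can be tested
        self.depth = Depth.SHALLOW        # (1) shallow depth by default
        self.deep_since = None

    def on_precursor(self):
        """(2) A retry or error leading to an important failure was detected."""
        self.depth = Depth.DEEP
        self.deep_since = self.clock()

    def tick(self):
        """(3) Return to shallow depth if no failure followed within the period."""
        if (self.depth is Depth.DEEP
                and self.clock() - self.deep_since >= self.revert_after):
            self.depth = Depth.SHALLOW
            self.deep_since = None
        return self.depth
```

In this sketch a second precursor simply restarts the waiting period, which matches the repeated switching described for the embodiment but is otherwise a design choice of the example.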

According to the trace information control apparatus of the present embodiment, automatically switching between the normal information tracing program and the detailed information tracing program prevents detailed information with a deep depth from being recorded more than necessary, so it is possible to avoid limiting the amount of detailed information that is effective for investigating the cause of a failure when the failure occurs. In addition, because the processing overhead of frequently determining a flag whenever either tracing program is executed does not arise, detailed information that is effective for investigating the cause of a failure can be recorded efficiently without affecting the original processing performance of the computer (for example, its processing speed).

Next, the state of switching between the normal information tracing program 11p and the detailed information tracing program 12p will be described with reference to FIG.

As shown in FIG. 2A, the program storage memory 1 stores a device control program 10p, a normal information tracing program 11p for acquiring and storing normal information, and a detailed information tracing program 12p for acquiring and storing detailed information. After the trace information control device is activated, the device control program 10p and the normal information tracing program 11p are loaded from the program storage memory 1 to the program execution memory 2 (state during normal information tracing). Based on the normal information tracing program 11p, only normal information with a shallow depth is acquired and stored in the trace information storage memory 3 (see FIG. 1).

On the other hand, as shown in FIG. 2B, when a retry or an error leading to an important failure of the computer system is detected, this detection triggers the device control program 10p and the detailed information tracing program 12p to be overwritten and loaded from the program storage memory 1 to the program execution memory 2 (state during detailed information tracing). At this time, in the program execution memory 2, the normal information tracing program 11p is completely replaced with the detailed information tracing program 12p. Based on the detailed information tracing program 12p, detailed information with a deep depth is selectively acquired and stored in the trace information storage memory 3 (see FIG. 1).

Although not shown in FIG. 2, if an important failure does not occur after a certain period of time has elapsed since the retry or error leading to an important failure was detected, the device control program 10p and the normal information tracing program 11p are overwritten and loaded from the program storage memory 1 to the program execution memory 2. At this time, in the program execution memory 2, the detailed information tracing program 12p is replaced with the normal information tracing program 11p again. Based on the normal information tracing program 11p, normal information is acquired and stored in the trace information storage memory 3 (see FIG. 1). If a retry or error leading to an important failure of the computer system is then detected again, the device control program 10p and the detailed information tracing program 12p are again overwritten and loaded from the program storage memory 1 to the program execution memory 2, and the operation shown in FIG. 2B is repeated.
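The switching of FIG. 2 can be pictured as overwriting a single slot in the program execution memory. The sketch below is a simplified illustration, not the firmware of the embodiment: `PROGRAM_STORAGE`, `ProgramExecutionMemory`, and the event dictionary layout are hypothetical, and Python functions stand in for the tracing programs 11p and 12p.

```python
def trace_normal(event):
    """Stand-in for the normal information tracing program 11p: executed path only."""
    return {"path": event["path"]}


def trace_detailed(event):
    """Stand-in for the detailed information tracing program 12p: parameter values."""
    return {"params": dict(event["params"])}


# Program storage memory 1: permanently holds both tracing programs.
PROGRAM_STORAGE = {"11p": trace_normal, "12p": trace_detailed}


class ProgramExecutionMemory:
    """Hypothetical model of program execution memory 2 with one tracer slot."""

    def __init__(self):
        # State during normal information tracing (FIG. 2A).
        self.tracer = PROGRAM_STORAGE["11p"]

    def overwrite_load(self, name):
        # Overwriting the slot replaces the previous tracing program completely.
        self.tracer = PROGRAM_STORAGE[name]

    def trace(self, event):
        # No mode flag is consulted here; whichever program is loaded runs.
        return self.tracer(event)
```

For example, calling `overwrite_load("12p")` on detection of a precursor corresponds to FIG. 2B, and `overwrite_load("11p")` after the waiting period corresponds to the return to normal information tracing.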

FIG. 3 is a flowchart for explaining trace execution processing of trace information that serves as a comparison for the trace information control method of the present application. FIG. 4 is a flowchart for explaining trace execution processing of trace information (normal information) in the first embodiment, and FIG. 5 is a flowchart for explaining trace execution processing of trace information (detailed information) in the first embodiment.

FIG. 3 illustrates trace execution processing by a flag determination method executed by operating the CPU of a computer. FIGS. 4 and 5 illustrate trace execution processing by the full replacement method of the trace information tracing program, likewise executed by operating the CPU of a computer (see the switching between the normal information tracing program and the detailed information tracing program in FIG. 2 described above).

In the flowchart of FIG. 3, when the trace execution processing by the flag determination method is executed, as shown in step S10, the flag set in the device control program is determined in order to decide whether the current mode is the detailed information trace storage mode, in which detailed information trace processing is performed, or the normal information trace storage mode, in which normal information trace processing is performed. If the detailed information trace storage mode is set, the process proceeds to step S11, and processing for acquiring detailed information and storing it in the trace storage memory is executed.

On the other hand, if the normal information trace storage mode is set, the process proceeds to step S12, and the flag set in the device control program is determined again to detect whether an important failure of the computer system has occurred (that is, whether processing for storing detailed information in the trace storage memory is necessary). If it is detected that an important failure has occurred in the computer system, the process proceeds to step S13, the mode is switched from the normal information trace storage mode to the detailed information trace storage mode, and processing for storing detailed information in the trace storage memory is executed.

If it is not detected that a serious failure has occurred in the computer system, the process proceeds to step S14, and the process of storing the normal information in the trace storage memory is executed while the normal information trace storage mode is set.

When the processes in steps S10 to S14 are executed, the program itself is simple, but the flag must be determined every time either the normal information tracing program or the detailed information tracing program is executed. The processing overhead of this frequent flag determination may degrade the original processing performance of the computer.
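To make the overhead of the flag determination method concrete, the following Python sketch walks through steps S10 to S14 and counts every flag determination. It is an illustrative model under stated assumptions: the class `FlagTracer`, the `flag_checks` counter, and the `failure_detected` argument are hypothetical and do not appear in the specification.

```python
class FlagTracer:
    """Hypothetical model of the flag determination method of FIG. 3."""

    def __init__(self):
        self.detailed_mode = False   # flag: detailed vs. normal trace storage mode
        self.flag_checks = 0         # counts flag determinations to expose the overhead
        self.stored = []             # stand-in for the trace storage memory

    def trace(self, event, failure_detected):
        self.flag_checks += 1                          # S10: which mode is set?
        if self.detailed_mode:
            self.stored.append(("detailed", event))    # S11
            return
        self.flag_checks += 1                          # S12: important failure occurred?
        if failure_detected:
            self.detailed_mode = True                  # S13: switch mode, store detailed
            self.stored.append(("detailed", event))
        else:
            self.stored.append(("normal", event))      # S14
```

Every call pays for at least one flag determination, and calls made in the normal information trace storage mode pay for two, which is the cost the full replacement method is intended to avoid.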

The trace execution processing by the full replacement method of the trace information tracing program shown in FIGS. 4 and 5 is presented in order to eliminate the above disadvantage of the trace execution processing by the flag determination method of FIG. 3.

In the flowchart of FIG. 4, when executing the trace execution process by the full replacement method of the trace information trace program, after starting the trace information control apparatus, the trace information control apparatus is set to the state at the time of normal information tracing. In other words, after the trace information control device is activated, the normal information trace storage mode is automatically set. Therefore, it is not necessary to determine whether the detailed information trace storage mode is set or the normal information trace storage mode by determining the flag set on the apparatus control program. Here, since the normal information trace storage mode is set after the trace information control device is activated, the normal information is acquired and stored in the trace information storage memory.

Next, as shown in step S20 of FIG. 4, the flag set in the device control program is determined to detect whether a retry or an error leading to an important failure of the computer system has occurred (that is, whether processing for storing detailed information in the trace storage memory is necessary). If it is detected that a retry or error leading to an important failure of the computer system has occurred, the process proceeds to step S21, the detailed information tracing program is overwritten and loaded from the program storage memory to the program execution memory, and the normal information tracing program is completely replaced with the detailed information tracing program.

Further, in the flowchart of FIG. 5, based on the detailed information tracing program, the trace information control device is set to the state at the time of detailed information tracing. At this point, since the normal information trace storage mode is switched to the detailed information trace storage mode, as shown in step S30, a process of acquiring detailed information and storing it in the trace information storage memory is executed.

On the other hand, if it is not detected that a retry or error leading to an important failure of the computer system has occurred, the process proceeds to step S22 in FIG. 4, and processing for storing normal information in the trace storage memory is executed while the normal information trace storage mode remains set.

When the trace execution processing according to the first embodiment described above is executed, the flag needs to be determined less often when either the normal information tracing program or the detailed information tracing program is executed than in the case of the flowchart of FIG. 3. Therefore, the processing overhead of frequently determining the flag is substantially eliminated, and the original processing performance of the computer is not degraded.

FIG. 6 is a schematic diagram showing how the trace information storage area is divided in the second embodiment. Here, a state in which the trace information storage area in the trace information storage memory 3 is divided in advance into a normal information storage area and a detailed information storage area will be described.

In the second embodiment, as shown on the left side of FIG. 6, the trace information storage area in the trace information storage memory 3 is divided in advance into a normal information storage area for storing the normal information 30 and a detailed information storage area for storing the detailed information 31. Preferably, the normal information storage area is disposed in the lower part of the trace information storage area, and the detailed information storage area is disposed in the upper part of the trace information storage area.

Since the normal information trace storage mode is set after the trace information control device is activated, normal information is acquired and stored in the normal information storage area. When normal information has been stored up to the last address at the bottom of the normal information storage area, the process returns to the top address of the normal information storage area, and the normal information tracing process continues. In this case, the old normal information written previously in the normal information storage area is overwritten.
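The wrap-around behaviour described for the normal information storage area is that of a ring buffer. A minimal Python sketch follows; the class name `NormalInfoArea` and the use of list indices in place of memory addresses are assumptions made for illustration.

```python
class NormalInfoArea:
    """Hypothetical ring buffer standing in for the normal information storage area."""

    def __init__(self, size):
        self.slots = [None] * size
        self.next = 0   # next write position; index 0 plays the role of the top address

    def store(self, record):
        # Once the area is full, this overwrites the oldest normal information.
        self.slots[self.next] = record
        # After the last address, return to the top address.
        self.next = (self.next + 1) % len(self.slots)
```

With an area of three slots, storing a fourth record replaces the first one while the second and third remain in place.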

As shown in the central part of FIG. 6, when a retry or an error leading to an important failure of the computer system is detected, this is taken as the first trigger. In addition to storing normal information in the normal information storage area, detailed information that occurs infrequently but is effective for investigating the cause of the failure is stored in the detailed information storage area. In this case, only normal information is stored in the normal information storage area and only detailed information is stored in the detailed information storage area, so the detailed information of the important part is never erased.

If the number of detected retries or errors leading to an important failure (the number of triggers) exceeds a predefined threshold (here, the threshold is defined as 0), the size of the detailed information storage area is automatically changed by dynamically increasing it, as shown in FIG. 6. When the size of the detailed information storage area is changed, the top address of the normal information storage area is changed, so that newly acquired detailed information can be written over a part of the normal information storage area.

Furthermore, as shown on the right side of FIG. 6, when a retry or an error leading to an important failure of the computer system is detected again, this is taken as the second trigger, and the size of the detailed information storage area is increased further. As a result, further detailed information can be written over a part of the normal information storage area.
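The resizing shown in FIG. 6 amounts to moving the boundary between the two areas. The Python sketch below illustrates one way this could work, under assumptions that are not in the specification: list indices replace addresses, the detailed area occupies indices below `boundary`, at least one slot is always kept for normal information, and all names (`TraceStorageArea`, `grow_detailed`, and so on) are hypothetical.

```python
class TraceStorageArea:
    """Hypothetical layout: detailed area [0, boundary), normal area [boundary, size)."""

    def __init__(self, size, boundary):
        self.slots = [None] * size
        self.boundary = boundary      # top address of the normal information storage area
        self.detail_next = 0          # next free slot in the detailed area
        self.normal_next = boundary   # next write position in the normal area

    def grow_detailed(self, step):
        # Move the top address of the normal area, handing slots to the detailed area.
        # Keeping at least one normal slot is an assumption of this sketch.
        self.boundary = min(self.boundary + step, len(self.slots) - 1)
        if self.normal_next < self.boundary:
            self.normal_next = self.boundary

    def store_detailed(self, record):
        if self.detail_next >= self.boundary:
            raise MemoryError("detailed information storage area is full")
        self.slots[self.detail_next] = record   # may overwrite old normal information
        self.detail_next += 1

    def store_normal(self, record):
        self.slots[self.normal_next] = record
        self.normal_next += 1
        if self.normal_next >= len(self.slots):
            self.normal_next = self.boundary    # wrap within the normal area only
```

Because normal information only ever wraps back to `boundary`, it cannot overwrite anything in the detailed area, while growing the detailed area lets new detailed information replace the oldest part of the normal area.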

Preferably, the size of the detailed information storage area is determined based not only on a predetermined value but also on statistical information about the normal information and detailed information acquired most recently.

On the other hand, when it is determined that the content of the currently acquired detailed information is the same as that of the previously acquired detailed information, the currently acquired detailed information is not stored in the detailed information storage area. As a result, only detailed information that is effective for investigating the cause of a failure is recorded, so detailed information with a low occurrence frequency can be used effectively.
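The suppression of repeated detailed information can be sketched as follows. The specification does not say how sameness is determined, so this example assumes a plain equality comparison against the immediately preceding record; the class name `DetailedInfoArea` is likewise hypothetical.

```python
class DetailedInfoArea:
    """Hypothetical detailed information storage area that skips repeated records."""

    def __init__(self):
        self.records = []
        self._last = None   # previously acquired detailed information

    def store(self, record):
        """Store `record` unless it equals the previously acquired detailed information."""
        if record == self._last:
            return False    # same content as last time: not stored
        self.records.append(record)
        self._last = record
        return True
```

A record that recurs after a different record has intervened is stored again in this sketch, since only the immediately preceding acquisition is compared.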

FIG. 7 is a flowchart for explaining trace execution processing of trace information (normal information and detailed information) in the second embodiment. Here, trace execution processing by the trace information storage area division method, executed by operating the CPU of the computer, is described (see the state in FIG. 6 described above, in which the trace information storage area is divided in advance into a normal information storage area and a detailed information storage area).

In the flowchart of FIG. 7, when executing the trace execution process by the trace information storage area division method, the trace information storage area in the trace information storage memory is divided in advance into a normal information storage area and a detailed information storage area. After starting the trace information control device, a process of acquiring normal information and storing it in the normal information storage area is executed.

When it is detected that a retry or error leading to an important failure of the computer system has occurred, as shown in step S40, processing is executed to acquire normal information and store it in the normal information storage area, and to acquire detailed information and store it in the detailed information storage area. At this time, the number of times that the occurrence of a retry or error leading to an important failure is detected (that is, the number of traces) is counted up by a counter or the like in the computer system.

Furthermore, as shown in step S41, whether it is necessary to change the size of the detailed information storage area is determined by checking whether the number of detected occurrences of retries or errors leading to an important failure has exceeded a predefined threshold. When it is determined that the number of detected occurrences has exceeded the threshold (that is, when it is determined that the size of the detailed information storage area needs to be changed), the process proceeds to step S42, and the size of the detailed information storage area is dynamically increased by changing the top address of the normal information storage area, which becomes the storage location of the trace information (detailed information).

Further, as shown in step S43, the newly acquired detailed information is overwritten in a part of the normal information storage area, thereby executing processing for storing this detailed information in the trace information storage area.

On the other hand, if it is determined that the number of detected occurrences of retries or errors leading to an important failure does not exceed the threshold, the process proceeds to step S43, and the detailed information is stored in the trace information storage area with the size of the detailed information storage area left unchanged.
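The decision flow of steps S40 to S43 can be summarised in a few lines of Python. This is a hedged sketch rather than the flowchart itself: the `Areas` dataclass, `handle_trigger`, and the `grow_step` value are assumptions, capacities are counted in records rather than addresses, and the storing of detailed information in step S40 is folded into step S43 for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Areas:
    """Hypothetical bookkeeping for the divided trace information storage area."""
    detailed_capacity: int
    normal: list = field(default_factory=list)
    detailed: list = field(default_factory=list)
    trigger_count: int = 0


def handle_trigger(areas, normal_rec, detailed_rec, threshold=0, grow_step=1):
    """Process one detected retry or error; returns True if detailed_rec was stored."""
    # S40: store normal information and count up the number of detections.
    areas.normal.append(normal_rec)
    areas.trigger_count += 1
    # S41: has the number of detections exceeded the threshold?
    if areas.trigger_count > threshold:
        # S42: enlarge the detailed information storage area (assumed step size).
        areas.detailed_capacity += grow_step
    # S43: store the detailed information if the (possibly enlarged) area has room.
    if len(areas.detailed) < areas.detailed_capacity:
        areas.detailed.append(detailed_rec)
        return True
    return False
```

With the threshold of 0 mentioned in the description, the very first trigger already enlarges the detailed information storage area.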

When the trace execution processing according to the second embodiment described above is executed, the size of the detailed information storage area is increased according to the number of detected occurrences of retries or errors leading to an important failure, so that only the required amount of detailed information is stored in the detailed information storage area. It therefore becomes possible to efficiently record detailed information that is effective for investigating the cause of a failure without affecting the original processing performance of the computer.

In addition, because the trace information storage area is divided in advance into a normal information storage area and a detailed information storage area, even when the detailed information cannot be extracted from the trace information storage area immediately after the occurrence of a retry or error leading to an important failure is detected, normal information is not written over the detailed information storage area, so the important detailed information from the time of the failure is prevented from being erased.

FIG. 8 is a flowchart for explaining trace execution processing of trace information (normal information) in the third embodiment, and FIG. 9 is a flowchart for explaining trace execution processing of trace information (detailed information) in the third embodiment. Here, trace execution processing executed by a combination of the full replacement method of the trace information tracing program according to the first embodiment and the trace information storage area division method according to the second embodiment will be described.

In the flowchart of FIG. 8, when the trace execution processing is executed by a combination of the full replacement method of the trace information tracing program and the trace information storage area division method, the trace information storage area in the trace information storage memory is divided in advance into a normal information storage area and a detailed information storage area.

After the trace information control device is started, it is set to the state during normal information tracing. Since the normal information trace storage mode is automatically set after the trace information control device is activated, normal information is acquired and stored in the trace information storage memory.

Next, as shown in step S50 of FIG. 8, the flag set in the device control program is determined to detect whether a retry or an error leading to an important failure of the computer system has occurred (that is, whether processing for storing detailed information in the trace storage memory is necessary). If it is detected that a retry or error leading to an important failure of the computer system has occurred, the process proceeds to step S51, the detailed information tracing program is overwritten and loaded from the program storage memory to the program execution memory, and the normal information tracing program is completely replaced with the detailed information tracing program.

On the other hand, if it is not detected that a retry or error leading to an important failure of the computer system has occurred, the process proceeds to step S52, and processing for storing normal information in the trace storage memory is executed while the normal information trace storage mode remains set.

The contents of the processes in steps S50 to S52 are substantially the same as the contents of the processes in steps S20 to S22 in FIG. 4 described above.

Further, when it is detected in the flowchart of FIG. 8 that a retry or an error leading to an important failure of the computer system has occurred, the trace information control device is set, in the flowchart of FIG. 9, to the state during detailed information tracing based on the detailed information tracing program. At this time, since the normal information trace storage mode has been switched to the detailed information trace storage mode, processing for acquiring detailed information and storing it in the detailed information storage area is executed, as shown in step S60 of FIG. 9. In addition, the number of times that the occurrence of a retry or error leading to an important failure is detected (that is, the number of traces) is counted up by a counter or the like in the computer system.

Further, as shown in step S61, whether it is necessary to change the size of the detailed information storage area is determined by checking whether the number of detected occurrences of retries or errors leading to an important failure has exceeded a predefined threshold. When it is determined that the number of detected occurrences has exceeded the threshold (that is, when it is determined that the size of the detailed information storage area needs to be changed), the process proceeds to step S62, and the size of the detailed information storage area is dynamically increased by changing the top address of the normal information storage area, which becomes the storage location of the trace information (detailed information).

Further, as shown in step S63, the newly acquired detailed information is overwritten on a part of the normal information storage area, thereby executing processing for storing this detailed information in the trace information storage area.

On the other hand, if it is determined that the number of detected occurrences of retries or errors leading to an important failure does not exceed the threshold, the process proceeds to step S63, and the detailed information is stored in the trace information storage area with the size of the detailed information storage area left unchanged.

When the trace execution processing according to the third embodiment is executed, the flag needs to be determined less often when either the normal information tracing program or the detailed information tracing program is executed than in the case of the flowchart of FIG. 3, as in the first embodiment described above. Therefore, the processing overhead of frequently determining the flag is substantially eliminated, and the original processing performance of the computer is not degraded.

Further, when the trace execution processing according to the third embodiment is executed, the size of the detailed information storage area is increased according to the number of detected occurrences of retries or errors leading to an important failure, as in the second embodiment described above, so that only the required amount of detailed information is stored in the detailed information storage area. It therefore becomes possible to efficiently record detailed information that is effective for investigating the cause of a failure without affecting the original processing performance of the computer.

Further, when the trace execution processing according to the third embodiment is executed, the trace information storage area is divided in advance into a normal information storage area and a detailed information storage area, as in the second embodiment. Therefore, even if detailed information cannot be retrieved from the trace information storage area immediately after the occurrence of a retry or error leading to an important failure is detected, normal information is not written over the detailed information storage area, and the detailed information of the important part at the time of the failure is prevented from being erased.

Claims

Trace information acquisition means for acquiring trace information of a device control program for investigating the cause of the failure of the computer system;
A trace information storage memory for storing the trace information acquired by the trace information acquisition means;
In a trace information control device comprising a control unit that comprehensively controls the trace information acquisition means and the trace information storage memory,
The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating values of various parameters and variables related to the device control program,
A trace information control apparatus characterized in that, during normal operation of the trace information control device, the control unit reduces the depth of the trace information and performs the acquisition and storage processing of the first information, and, when an operation leading to a failure of the computer system is detected, performs control so as to increase the depth of the trace information and perform the acquisition and storage processing of the second information.
The trace information control apparatus according to claim 1, wherein, when the failure does not occur after a predetermined time has elapsed since the operation leading to the failure was detected, the control unit reduces the depth of the trace information and performs the acquisition and storage processing of the first information.
Trace information acquisition means for acquiring trace information of a device control program for investigating the cause of the failure of the computer system;
A trace information storage memory for storing the trace information acquired by the trace information acquisition means;
A program storage memory for storing the device control program, and a trace information tracing program for acquiring and storing the trace information;
A program execution memory in which the device control program and the trace information tracing program are loaded from the program storage memory when the device control program and the trace information tracing program are executed;
In a trace information control device comprising a control unit that reads and executes the device control program and the trace information tracing program loaded in the program execution memory, and comprehensively controls the trace information acquisition means, the trace information storage memory, the program storage memory, and the program execution memory,
The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating values of various parameters and variables related to the device control program,
After the trace information control device is activated, the control unit loads a first information tracing program, which acquires and stores the first trace information, from the program storage memory into the program execution memory, and, when an operation leading to a failure of the computer system is detected, the control unit loads a second information tracing program, which acquires and stores the second trace information, from the program storage memory into the program execution memory by overwriting, thereby replacing the first information tracing program with the second information tracing program.
The trace information control device according to claim 3, wherein, when the failure has not occurred after a predetermined time has elapsed since the operation leading to the failure was detected, the control unit loads the first information tracing program from the program storage memory into the program execution memory by overwriting, thereby replacing the second information tracing program with the first information tracing program.
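The program replacement recited in claims 3 and 4 can be sketched as follows. This is an illustration only: the program storage memory is modeled as a dictionary and the program execution memory as a single slot, so that loading one tracing program overwrites the other. The names `TraceLoader`, `trace_path`, and `trace_values` are hypothetical.

```python
from typing import Callable, Dict

Tracer = Callable[[str, dict], tuple]


def trace_path(path: str, variables: dict) -> tuple:
    """Stand-in for the first information tracing program (executed path only)."""
    return ("first", path)


def trace_values(path: str, variables: dict) -> tuple:
    """Stand-in for the second information tracing program (variable values)."""
    return ("second", dict(variables))


class TraceLoader:
    def __init__(self) -> None:
        # Program storage memory: both tracing programs are always kept here.
        self.storage: Dict[str, Tracer] = {"first": trace_path, "second": trace_values}
        # Program execution memory: one slot, loaded with the first program
        # after activation.
        self.loaded: Tracer = self.storage["first"]

    def on_precursor(self) -> None:
        # Overwrite the slot: the second program replaces the first.
        self.loaded = self.storage["second"]

    def on_timeout_without_failure(self) -> None:
        # Overwrite the slot again: the first program replaces the second.
        self.loaded = self.storage["first"]

    def trace(self, path: str, variables: dict) -> tuple:
        # Whatever program currently occupies the slot does the tracing.
        return self.loaded(path, variables)
```

Because only one tracing program occupies the execution slot at a time, the sketch reflects the idea that deeper tracing need not enlarge the resident program footprint.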
Trace information acquisition means for acquiring trace information of a device control program for investigating the cause of the failure of the computer system;
A trace information storage memory for storing the trace information acquired by the trace information acquisition means;
In a trace information control device comprising a control unit that comprehensively controls the trace information acquisition means and the trace information storage memory,
The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating values of various parameters and variables related to the device control program,
The trace information storage area in the trace information storage memory is divided in advance into a first trace information storage area and a second trace information storage area,
The control unit stores the first trace information in the first trace information storage area during normal operation of the trace information control device, stores the first trace information in the first trace information storage area and the second trace information in the second trace information storage area when an operation leading to a failure of the computer system is detected, and dynamically increases the size of the second trace information storage area when the number of times that the operation leading to the failure has been detected exceeds a predetermined threshold.
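The divided storage area and its dynamic growth, as recited in claim 5, can be sketched with two bounded buffers. The sketch rests on assumptions not found in the claim: the class name `PartitionedTraceStore`, the fixed growth step `grow_by`, and growth on every detection beyond the threshold are all hypothetical, and the sketch does not model where the additional space is taken from.

```python
from collections import deque
from typing import Optional


class PartitionedTraceStore:
    def __init__(self, first_size: int, second_size: int,
                 threshold: int, grow_by: int) -> None:
        # The storage area is divided in advance into two bounded areas.
        self.first = deque(maxlen=first_size)    # first trace information
        self.second = deque(maxlen=second_size)  # second trace information
        self.threshold = threshold               # assumed "predetermined threshold"
        self.grow_by = grow_by                   # assumed growth step, in records
        self.precursor_count = 0

    def on_precursor(self) -> None:
        # Count detections of an operation leading to a failure.
        self.precursor_count += 1
        if self.precursor_count > self.threshold:
            # deque.maxlen is read-only, so rebuild the buffer with more room.
            self.second = deque(self.second,
                                maxlen=self.second.maxlen + self.grow_by)

    def store(self, path: str, variables: Optional[dict] = None) -> None:
        # The path always goes to the first area; variable values, when
        # supplied (after a detection), go to the second area.
        self.first.append(path)
        if variables is not None:
            self.second.append(dict(variables))
```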
The trace information control device according to claim 5, wherein the size of the second trace information storage area is determined based on a predetermined value, the first trace information acquired immediately before, and statistical information on the second trace information.
The trace information control device according to claim 5, wherein, when it is determined that the content of the currently acquired second trace information is the same as that of the previously acquired second trace information, the currently acquired second trace information is not stored in the second trace information storage area.
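The duplicate suppression recited above can be sketched in a few lines. The sketch assumes that "previously acquired" means the most recently stored record; the claim does not say how far back the comparison reaches, and the function name `store_if_new` is hypothetical.

```python
def store_if_new(area: list, record: dict) -> bool:
    """Append record to the second storage area unless it repeats the last one.

    Returns True when the record was stored and False when it was skipped.
    """
    if area and area[-1] == record:
        return False           # same content as before: do not store it again
    area.append(dict(record))  # copy so that later mutation does not alter the log
    return True


second_area: list = []
store_if_new(second_area, {"retry": 3, "status": 16})
store_if_new(second_area, {"retry": 3, "status": 16})  # skipped as a duplicate
store_if_new(second_area, {"retry": 4, "status": 16})
```

Skipping repeated records leaves more of the second storage area available for distinct states, which appears to be the point of the claim.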
Trace information acquisition means for acquiring trace information of a device control program for investigating the cause of the failure of the computer system;
A trace information storage memory for storing the trace information acquired by the trace information acquisition means;
A program storage memory for storing the device control program and a trace information tracing program for acquiring and storing the trace information;
A program execution memory in which the device control program and the trace information tracing program are loaded from the program storage memory when the device control program and the trace information tracing program are executed;
In a trace information control device comprising a control unit that reads and executes the device control program and the trace information tracing program loaded in the program execution memory, and that comprehensively controls the trace information acquisition means, the trace information storage memory, the program storage memory, and the program execution memory,
The trace information includes first trace information indicating which path the device control program has executed, and second trace information indicating values of various parameters and variables related to the device control program,
The trace information storage area in the trace information storage memory is divided in advance into a first trace information storage area and a second trace information storage area,
After the trace information control device is activated, the control unit loads a first information tracing program for acquiring and storing the first trace information from the program storage memory into the program execution memory, and stores the first trace information in the first trace information storage area based on the first information tracing program;
When the control unit detects an operation leading to a failure of the computer system, the control unit loads a second information tracing program for acquiring and storing the second trace information from the program storage memory into the program execution memory by overwriting, stores the second trace information in the second trace information storage area based on the second information tracing program, and dynamically increases the size of the second trace information storage area when the number of times that the operation leading to the failure has been detected exceeds a predetermined threshold.
The trace information control device according to claim 8, wherein the size of the second trace information storage area is determined based on a predetermined value, the first trace information acquired immediately before, and statistical information on the second trace information.
The trace information control device according to claim 8, wherein, when it is determined that the content of the currently acquired second trace information is the same as that of the previously acquired second trace information, the currently acquired second trace information is not stored in the second trace information storage area.
A trace information control method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, the method comprising:
A step of acquiring and storing the first trace information by reducing the depth of the trace information during normal operation of the trace information control device;
And a step of acquiring and storing the second trace information by increasing the depth of the trace information when an operation leading to a failure of the computer system is detected.
The trace information control method according to claim 11, further comprising a step of acquiring and storing the first trace information by reducing the depth of the trace information when the failure has not occurred after a predetermined time has elapsed since the operation leading to the failure was detected.
A trace information control method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, a program storage memory for storing the device control program and a trace information tracing program for acquiring and storing the trace information, and a program execution memory into which the device control program and the trace information tracing program are loaded from the program storage memory when they are executed, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, the method comprising:
A step of, after the trace information control device is activated, loading a first information tracing program for acquiring and storing the first trace information from the program storage memory into the program execution memory, and acquiring and storing the first trace information based on the first information tracing program;
And a step of, when an operation leading to a failure of the computer system is detected, loading a second information tracing program for acquiring and storing the second trace information from the program storage memory into the program execution memory by overwriting, thereby replacing the first information tracing program with the second information tracing program, and then acquiring and storing the second trace information based on the second information tracing program.
The trace information control method according to claim 13, further comprising a step of, when the failure has not occurred after a predetermined time has elapsed since the operation leading to the failure was detected, loading the first information tracing program from the program storage memory into the program execution memory by overwriting, thereby replacing the second information tracing program with the first information tracing program, and then acquiring and storing the first trace information based on the first information tracing program.
A trace information control method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, the method comprising:
A step of dividing a trace information storage area in the trace information storage memory in advance into a first trace information storage area and a second trace information storage area;
A step of storing the first trace information in the first trace information storage area during normal operation of the trace information control device;
A step of storing the first trace information in the first trace information storage area and storing the second trace information in the second trace information storage area when an operation leading to a failure of the computer system is detected;
And a step of dynamically increasing the size of the second trace information storage area when the number of times that the operation leading to the failure has been detected exceeds a predetermined threshold.
The trace information control method according to claim 15, further comprising a step of not storing the currently acquired second trace information in the second trace information storage area when it is determined that the content of the currently acquired second trace information is the same as that of the previously acquired second trace information.
A trace information control method for controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, a program storage memory for storing the device control program and a trace information tracing program for acquiring and storing the trace information, and a program execution memory into which the device control program and the trace information tracing program are loaded from the program storage memory when they are executed, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, the method comprising:
A step of dividing a trace information storage area in the trace information storage memory in advance into a first trace information storage area and a second trace information storage area;
A step of, after the trace information control device is activated, loading a first information tracing program for acquiring and storing the first trace information from the program storage memory into the program execution memory, and storing the first trace information in the first trace information storage area based on the first information tracing program;
A step of, when an operation leading to a failure of the computer system is detected, loading a second information tracing program for acquiring and storing the second trace information from the program storage memory into the program execution memory by overwriting, and storing the second trace information in the second trace information storage area based on the second information tracing program;
And a step of dynamically increasing the size of the second trace information storage area when the number of times that the operation leading to the failure has been detected exceeds a predetermined threshold.
The trace information control method according to claim 17, further comprising a step of not storing the currently acquired second trace information in the second trace information storage area when it is determined that the content of the currently acquired second trace information is the same as that of the previously acquired second trace information.
A program that, when controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, causes a computer to execute:
A process of acquiring and storing the first trace information by reducing the depth of the trace information during normal operation of the trace information control device;
And a process of acquiring and storing the second trace information by increasing the depth of the trace information when an operation leading to a failure of the computer system is detected.
A program that, when controlling a trace information control device that includes a trace information storage memory for acquiring and storing trace information of a device control program in order to investigate the cause of a failure of a computer system, the trace information including first trace information indicating which path the device control program has executed and second trace information indicating values of various parameters and variables related to the device control program, causes a computer to execute:
A process of dividing a trace information storage area in the trace information storage memory in advance into a first trace information storage area and a second trace information storage area;
A process of storing the first trace information in the first trace information storage area during normal operation of the trace information control device;
A process of storing the first trace information in the first trace information storage area and storing the second trace information in the second trace information storage area when an operation leading to a failure of the computer system is detected;
And a process of dynamically increasing the size of the second trace information storage area when the number of times that the operation leading to the failure has been detected exceeds a predetermined threshold.