JP2013101548A - Computer system and recovery method - Google Patents

Computer system and recovery method

Info

Publication number
JP2013101548A
Authority
JP
Japan
Prior art keywords
recovery
failure
system
user system
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2011245620A
Other languages
Japanese (ja)
Inventor
Kentaro Otonashi
健太郎 音無
Original Assignee
Hitachi Systems Ltd
株式会社日立システムズ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Systems Ltd (株式会社日立システムズ)
Priority to JP2011245620A
Publication of JP2013101548A
Application status is Pending

Abstract

PROBLEM TO BE SOLVED: To recover from a fault that occurs in an application running on a computer by determining the cause of the fault and acquiring recovery information registered beforehand.
SOLUTION: A recovery system 108 is started from a user system 102 in which a fault has occurred, and receives the input data at the time of the fault and an error code sent from the user system 102. Recovery correspondence DB retrieval means 111 searches a recovery correspondence DB 114 and, when the response necessity indicates that handling is required, acquires the complementary data corresponding to the content of the fault. Recovery means 112 repairs the input data at the time of the fault sent from the user system and transmits the repaired data to the user system 102, which then continues the subsequent processing after the fault.

Description

  The present invention relates to a computer system and a recovery method, and in particular to a computer system and a recovery method that, when a failure occurs in an open computer system, can automatically identify the cause of the failure and recover from it.

  For example, the technique described in Patent Document 1 is known as a conventional technique for identifying the cause of a failure in a computer device and recovering the failed part. In this prior art, when a failure occurs in a computer device, a portable information terminal that is carried by a maintenance person and is capable of infrared communication requests the failed computer device to transmit device information, including log information indicating the operation history of that device. The computer device that received the request transmits the requested device information to the portable information terminal, and the portable information terminal passes the device information to a server device with which it can communicate. The server device stores in advance maintenance information that indicates the content of a failure in association with log information. When it is notified of the device information including the log information from the portable information terminal, it selects the maintenance information corresponding to the received device information and transmits it to the portable information terminal. As a result, the maintenance person can quickly investigate the cause of the failure of the computer device and restore it.

Patent Document 1: JP 2006-139572 A

  Generally, when a failure occurs in an application running on a computer device, execution of the program is interrupted, and the first step is to identify the cause of the failure. Once the cause has been identified, the second step is to repair the input data and re-execute the program. Identifying the cause requires a great deal of manpower and time to investigate the location where the failure was detected and the input data, which delays recovery. In addition, because repairing the input data and re-executing the program depend on human judgment, an error may occur in the recovery process and lead to a secondary failure.

  The conventional technique described above cannot solve these problems: the cause must still be identified manually, the input data repaired manually, and the program re-executed manually, which requires a great deal of manpower and time and carries the risk of a recovery error and a secondary failure.

  An object of the present invention is to solve the above problems of the prior art by providing a computer system and a recovery method that, when a failure occurs in an application running on a computer, mechanically determine the cause of the failure, acquire recovery information registered in advance, and automatically perform the series of processing up to repairing the input data and re-executing the program.

  According to the present invention, the above object is achieved by a computer system comprising a user system and a recovery system that, when a failure occurs in the user system, identifies the cause of the failure and recovers from it. The recovery system includes a recovery correspondence DB that stores an error code received from the user system, a response necessity indicating whether handling by the recovery system is required, and complementary data; recovery correspondence DB search means for searching the recovery correspondence DB; and recovery means. When a failure occurs in the user system, the recovery system is started from the user system and receives the input data at the time of the failure and the error code sent from the user system. The recovery correspondence DB search means searches the recovery correspondence DB and, when the response necessity indicates that handling is required, acquires the complementary data corresponding to the content of the failure. The recovery means repairs the input data at the time of the failure sent from the user system, transmits the repaired data to the user system, and causes the subsequent processing after the failure to continue.

  According to the present invention, a program in which a failure has occurred can be recovered without requiring manpower and time, and because no human judgment is involved, the risk of a secondary failure caused by a recovery work error can be suppressed.

FIG. 1 is a block diagram showing a configuration example of a computer system according to one embodiment of the present invention.
FIG. 2 is a flowchart explaining the processing operation of the code conversion function.
FIG. 3 is a flowchart explaining the processing operation of the journal conversion function.
FIG. 4 is a flowchart explaining the processing operation of the recovery correspondence DB search function.
FIG. 5 is a flowchart explaining the processing operation of the recovery function.
FIG. 6 is a diagram explaining the structure of the layout definition DB.
FIG. 7 is a diagram explaining the structure of the recovery correspondence DB.
FIG. 8 is a diagram explaining the structure of the user journal (before conversion) held in the SAM file.
FIG. 9 is a diagram explaining the structure of the user journal (after conversion) held in the SAM file.
FIG. 10 is a diagram explaining the structure of the error code and character code parameters.

  Embodiments of a computer system according to the present invention will be described below in detail with reference to the drawings.

  FIG. 1 is a block diagram showing a configuration example of a computer system according to an embodiment of the present invention. In the following description, a component may be given different reference numerals in different figures, but components with the same name are the same component.

  A computer system according to an embodiment of the present invention is configured by building a user system 102 and an automatic recovery system 108 so that both operate on a server machine 101. Although not shown, the server machine 101 includes a main memory, a storage device such as an HDD, a CPU that controls the entire machine, and interfaces for the input/output devices, such as a keyboard, a mouse, and a display device, that a user of the user system 102 uses.

  The user system 102 and the automatic recovery system 108 are constructed by loading programs stored in the storage device of the server machine 101 into the main memory and executing them on the CPU. The processing of the embodiment described below is likewise performed by loading the programs stored in the storage device of the server machine 101 into the main memory and executing them on the CPU.

  In the configuration described above, when the failure detection unit 103 detects a failure that has occurred during execution of a batch or online program, the user system 102 calls the automatic recovery system 108.

  When the automatic recovery system calling unit 104 calls the automatic recovery system 108, it acquires the input data that caused the failure as a user journal and outputs it to the SAM file 105 (the configuration will be described later with reference to FIG. 8). The user journal can be acquired by issuing a command provided by the middleware. For example, in the case of OpenTP1, which is one of Hitachi's open middleware products, a user journal can be acquired by issuing commands such as journal swap and journal edit.

  The user system 102 calls the automatic recovery system 108 by issuing a call request whose parameters are the user journal in the SAM file 105, the error code 106 corresponding to the content of the failure that occurred, and the character code 107 handled by the user system.

  The automatic recovery system 108 includes four processing units, a code conversion function 109, a journal conversion function 110, a recovery correspondence DB search function 111, and a recovery function 112, and further includes a layout definition DB 113 and a recovery correspondence DB 114. Necessary data is registered in advance in the layout definition DB 113 and the recovery correspondence DB 114 and is referred to by the journal conversion function 110 and the recovery correspondence DB search function 111.

  In the automatic recovery system 108, the four processing units of the code conversion function 109, the journal conversion function 110, the recovery correspondence DB search function 111, and the recovery function 112 execute processing in order to repair the information that caused the failure. The recovered information is returned to the user system 102, so that the user system 102 can automatically continue the subsequent processing 115 after the occurrence of the failure without human intervention.

  FIG. 2 is a flowchart for explaining the processing operation of the code conversion function 109, which will be described next. The code conversion function 109 includes three processing units: a user journal input process 201, a character code conversion process 202, and a post-conversion data output process 203. Hereinafter, a processing operation of the code conversion function 109 will be described as a processing operation in these three processing units.

(1) The user journal input process 201 reads the user journal in the SAM file 204 and expands the user journal data before conversion on the buffer memory inside the program.

(2) Next, the character code conversion process 202 uniformly converts the various character codes in the input file into SJIS code. The character code before conversion is registered in the character code parameter 205 (the configuration will be described later with reference to FIG. 10), and the conversion follows that setting.

(3) Next, the post-conversion data output process 203 outputs the user journal data converted into SJIS code by the character code conversion process 202 to the SAM file 206.
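The three steps of the code conversion function can be sketched as follows. This is only a minimal illustration, not the implementation described in the application: the function name `convert_to_sjis` and the mapping from the character code parameter to Python codec names are assumptions. In particular, `cp037` is only one of several EBCDIC code pages, `utf-8` is only one possible encoding for the UNICODE entry, and the parameter value 3 is left unmapped because the source does not make its encoding clear.

```python
# Hypothetical mapping from the character code parameter (FIG. 10) to Python
# codec names. The numeric values follow the data example in the description;
# the codec chosen for each value is an assumption made for this sketch.
SOURCE_CODECS = {
    0: "shift_jis",  # SJIS: already in the target code
    1: "cp037",      # EBCDIC (assumed code page)
    2: "utf-8",      # UNICODE (assumed encoding form)
}


def convert_to_sjis(raw: bytes, char_code: int) -> bytes:
    """Decode user journal bytes with the registered source code and
    re-encode them uniformly as SJIS."""
    try:
        codec = SOURCE_CODECS[char_code]
    except KeyError:
        raise ValueError(f"unsupported character code parameter: {char_code}") from None
    return raw.decode(codec).encode("shift_jis")
```

In a fuller version, the bytes would be read from the input SAM file and the result written to the output SAM file; both are omitted here to keep the conversion step itself visible.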

  FIG. 3 is a flowchart for explaining the processing operation of the journal conversion function 110, which will be described next. The journal conversion function 110 includes three processing units: a layout definition acquisition process 301, a text format conversion process 302, and a post-conversion data output process 303. Hereinafter, the processing operation of the journal conversion function 110 will be described as the processing operation of these three processing units.

(1) The layout definition acquisition process 301 acquires the layout definition information of the user journal from the layout definition DB 304 (which will be described later with reference to FIG. 6). Layout definitions can be registered in the layout definition DB 304 for each execution environment of the user program; for example, the layout definition of the user journal acquired in execution environment A and that of the user journal acquired in execution environment B can each be registered.

(2) Next, the text format conversion process 302 takes the user journal data before conversion in the SAM file 305 as input, refers to the layout definition whose valid flag is turned on among the layout definition information registered in the layout definition DB 304, and converts the data into text format accordingly. As shown as user journal 307 in FIG. 3A, each data record in the SAM file 305 has a hexadecimal representation on the left and a text representation on the right (the part surrounded by an ellipse in the figure). The text is laid out in a format specific to the dump, for example with a blank inserted every 4 bytes and a line break every 16 bytes. The text format conversion process 302 extracts the text part on the right, shown as 308 in FIG. 3B, removes the blanks and line breaks that serve as standard delimiters, and converts the result into a single text string, restoring the original data exactly as it was input, as indicated by 309 in FIG. 3.

(3) Next, the post-conversion data output process 303 outputs the converted user journal data 309 produced by the text format conversion process 302 to the SAM file 306 (the configuration will be described later with reference to FIG. 9).
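The text format conversion in steps (1) to (3) can be sketched as follows. This is a hedged illustration under stated assumptions, not the method of the application: the class `LayoutDefinition`, its field names, the 0-based column convention, and the functions `select_valid_layout` and `restore_text` are all hypothetical, and the sketch assumes exactly one delimiter blank between groups. Characters are extracted by position rather than by stripping every blank, because blanks that are part of the original data (such as an empty required item) have to survive the conversion.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LayoutDefinition:
    """One record of the layout definition DB (field names are assumptions)."""
    pattern_id: str       # ID assigned per execution environment
    text_start: int       # 0-based column where the text part of a dump line begins
    group_len: int        # characters written in succession before a delimiter blank
    chars_per_line: int   # data characters on one dump line before the line break
    valid: bool           # on for the current execution environment


def select_valid_layout(definitions: Iterable[LayoutDefinition]) -> LayoutDefinition:
    """Return the first definition whose valid flag is on."""
    for definition in definitions:
        if definition.valid:
            return definition
    raise LookupError("no layout definition has its valid flag turned on")


def restore_text(dump_lines: Iterable[str], layout: LayoutDefinition) -> str:
    """Rebuild the original data from the text part of each dump line,
    dropping the delimiter blank after each group and the line breaks."""
    groups_per_line = layout.chars_per_line // layout.group_len
    pieces: List[str] = []
    for line in dump_lines:
        text = line.rstrip("\n")[layout.text_start:]
        for g in range(groups_per_line):
            start = g * (layout.group_len + 1)  # +1 skips one delimiter blank
            pieces.append(text[start:start + layout.group_len])
    return "".join(pieces)
```

For a dump line whose text part holds the groups `ABCD`, four blanks, `9999`, and `EFGH`, the blank group is kept as data while the single blanks between groups are discarded.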

  FIG. 4 is a flowchart for explaining the processing operation of the recovery correspondence DB search function 111, which will be described next. The recovery correspondence DB search function 111 includes two processing units: a recovery correspondence DB search process 401 and a correspondence content and necessity acquisition process 402. Hereinafter, the processing operation of the recovery correspondence DB search function 111 will be described as the processing operation of these two processing units.

(1) The recovery correspondence DB search process 401 searches the recovery correspondence DB 404 (the configuration will be described later with reference to FIG. 7), using as the key the error code received as the error code parameter 403 (the configuration will be described later with reference to FIG. 10). A recovery method corresponding to each error code is registered in advance in the recovery correspondence DB 404.

(2) The correspondence content and necessity acquisition process 402 expands, in the buffer memory inside the program, the response necessity of the recovery processing and the correspondence content acquired by the recovery correspondence DB search process 401.
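The search by error code can be sketched with an in-memory table standing in for the recovery correspondence DB. Everything named here is hypothetical: the class `RecoveryEntry`, its field names, the functions `search_recovery_db` and `needs_recovery`, and the sample record contents. The source does not say what happens for an error code that is not registered, so treating it as "no handling required" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RecoveryEntry:
    """One record of the recovery correspondence DB (field names are assumptions)."""
    error_code: str
    needs_handling: bool      # response necessity in the recovery system
    complement_data: bytes    # data used to restore the damaged item
    complement_start: int     # 1-based byte position where complementing starts


# Sample records for illustration only; the values are not taken from FIG. 7.
RECOVERY_DB: Dict[str, RecoveryEntry] = {
    "XXX": RecoveryEntry("XXX", True, b"9999", 5),
    "YYY": RecoveryEntry("YYY", False, b"", 1),
}


def search_recovery_db(error_code: str) -> Optional[RecoveryEntry]:
    """Look up the record registered for the error code, or None if absent."""
    return RECOVERY_DB.get(error_code)


def needs_recovery(entry: Optional[RecoveryEntry]) -> bool:
    """An unregistered code is treated as needing no handling (assumption)."""
    return entry is not None and entry.needs_handling
```

Keeping the lookup and the necessity check as separate functions mirrors the split between the search process and the later determination of whether recovery is required.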

  FIG. 5 is a flowchart for explaining the processing operation of the recovery function 112, which will be described next. The recovery function 112 includes two processing units, a determination process 501 for necessity of handling and an error recovery process 502. Hereinafter, the processing operation of the recovery function 112 will be described as a processing operation in these two processing units.

(1) The necessity determination process 501 determines whether recovery processing is required, based on the response necessity and the correspondence content acquired by the recovery correspondence DB search process 401. If no recovery is required, the processing ends here without doing anything.

(2) If the necessity determination process 501 determines that recovery is required, the error recovery process 502 reads the user journal that has been converted back to the original data from the SAM file 503 and corrects the data according to the correspondence content acquired from the recovery correspondence DB 404. The complementary data and the byte position information of the complement point are registered in advance in the recovery correspondence DB 404. The complement point specifies from which byte, and for how many bytes, the user journal data is to be corrected, and the complementary data is the data used to restore the data at that point.

  Next, an example of the correspondence content will be described. In the data of the SAM file 503, shown as pre-correction data 505 in FIG. 5A, the 4 bytes starting from the fifth byte are blank (each blank byte is indicated by Δ). An error was detected because the required item in that blank area was judged not to have been entered.

  The recovery correspondence DB 404 anticipates in advance the case where a required item has not been entered. As shown as recovery correspondence DB 506 in FIG. 5B, the registered failure repair content is to complement the item with the data 9999 when it has not been entered. In the processing of the flow shown in FIG. 5, 9999 is written into the 4 bytes starting from the fifth byte of the pre-correction data 505, and the corrected user journal data becomes as shown by 507 in FIG. 5.

  As described above, the user journal data corrected by the error recovery process 502 is output to the SAM file 504, and the user system 102 takes in the corrected data as input information and executes the subsequent processing. As a result, the user system 102 can automatically continue processing without interruption when a failure occurs.
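The correction in this example amounts to overwriting a fixed byte range with the complementary data. A minimal sketch follows; the function name `apply_complement` and the sample record contents other than the blank item and 9999 are hypothetical, and the start position is taken to be 1-based so that "the fifth byte" maps to offset 4.

```python
def apply_complement(record: bytes, start_byte: int, complement: bytes) -> bytes:
    """Overwrite len(complement) bytes of record, starting at the 1-based
    position start_byte, with the complementary data."""
    offset = start_byte - 1
    end = offset + len(complement)
    if offset < 0 or end > len(record):
        raise ValueError("complement range lies outside the record")
    return record[:offset] + complement + record[end:]


# Pre-correction data: the 4 bytes from the fifth byte are blank.
before = b"ABCD    EFGH"
after = apply_complement(before, 5, b"9999")
print(after)  # b'ABCD9999EFGH'
```

The record length is unchanged by the correction, so the fields that follow the repaired item keep their byte positions.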

  FIG. 6 is a diagram for explaining the configuration of the layout definition DB. As shown in FIG. 6A, the layout definition DB stores one record for each execution environment of the user program. Each record is a set of data consisting of a pattern ID assigned to each execution environment of the user program, a text start position indicating where acquisition of the text format starts, a continuous character count indicating the number of characters written in succession, a characters-per-line count indicating the number of characters up to a line break in the text format, and a valid flag that is turned on for the current execution environment. When the layout definitions for execution environments A and B, shown as data examples in FIGS. 6B and 6C, are stored in the layout definition DB, they are stored as records in the form shown in FIG. 6.

  FIG. 7 is a diagram for explaining the configuration of the recovery correspondence DB. As shown in FIG. 7A, the recovery correspondence DB stores one record for each error code. Each record is a set of data consisting of the error code received from the user program, a response necessity indicating whether handling by the automatic recovery system is required, the complementary data, and a complement point indicating the position at which complementing starts. As data examples, FIG. 7B shows the records stored for error code XXX (required item not entered), error code YYY (invalid numeric item), and error code ZZZ (invalid classification value). With these records, the pre-correction data shown in FIG. 7C is corrected to the corrected data shown in FIG. 7D.

  FIG. 8 is a diagram for explaining the configuration of the user journal (before conversion) held in the SAM file. As shown in FIG. 8A, the SAM file stores a plurality of records, each of which is journal data, that is, one record of the journal file data that is the input data that caused the failure. When the input data is the data shown in FIG. 8B as user journal information from the mainframe, the SAM file stores it as the journal data before conversion shown in FIG. 8.

  FIG. 9 is a diagram for explaining the configuration of the user journal (after conversion) held in the SAM file. As shown in FIG. 9A, the SAM file stores a plurality of records, each of which is journal data, that is, one record of the journal file data that is the input data that caused the failure. As in the pre-conversion case shown in FIG. 8, when the input data is the data shown in FIG. 9B as user journal information from the mainframe, the SAM file stores it as the journal data after conversion shown in FIG. 9C.

  FIG. 10 is a diagram for explaining the configuration of the error code and character code parameters. As shown in FIG. 10A, the error code is a code that identifies the content of an error. Data examples include error code XXX for a required item that has not been entered, error code YYY for an invalid numeric item, and error code ZZZ for an invalid classification value. As shown in FIG. 10B, the character code is a character code handled by the automatic recovery system 108. In the data example, SJIS is shown as 0, EBCDIC as 1, UNICODE as 2, and EIG as 3.

  According to the embodiment of the present invention described above, in an open computer system, repair data for failures that may occur during program processing is registered in advance, and when a failure occurs during program processing, the recovery system repairs the data mechanically using the registered repair data. The program in which the failure occurred can therefore be recovered without requiring manpower and time, and because no human judgment is involved, the risk of a secondary failure caused by a recovery work error can be suppressed.

101 Server machine
102 User system
103 Failure detection means
104 Automatic recovery system calling means
105 SAM file
108 Automatic recovery system
109 Code conversion function
110 Journal conversion function
111 Recovery correspondence DB search function
112 Recovery function
113 Layout definition DB
114 Recovery correspondence DB

Claims (4)

  1. A computer system comprising a user system and a recovery system that, when a failure occurs in the user system, identifies the cause of the failure and recovers from it, wherein
    the recovery system includes a recovery correspondence DB that stores an error code received from the user system, a response necessity indicating whether handling by the recovery system is required, and complementary data; recovery correspondence DB search means for searching the recovery correspondence DB; and recovery means, and
    when a failure occurs in the user system, the recovery system is started from the user system and receives the input data at the time of the failure and the error code sent from the user system; the recovery correspondence DB search means searches the recovery correspondence DB and, if the response necessity indicates that handling is required, acquires the complementary data corresponding to the content of the failure; and the recovery means repairs the input data at the time of the failure sent from the user system, transmits the repaired data to the user system, and causes the subsequent processing after the failure to continue.
  2. The computer system according to claim 1, wherein the recovery system includes code conversion means for the input data at the time of the failure sent from the user system, and the code conversion means converts the character code used in the user system into the character code used in the recovery system itself.
  3. The computer system according to claim 2, wherein the input data at the time of the failure sent from the user system is a user journal, the recovery system further includes journal conversion means, and the journal conversion means converts the user journal into a text format.
  4. A recovery method for a failure in a computer system comprising a user system and a recovery system that, when a failure occurs in the user system, identifies the cause of the failure and recovers from it, wherein
    the recovery system includes a recovery correspondence DB that stores an error code received from the user system, a response necessity indicating whether handling by the recovery system is required, and complementary data; recovery correspondence DB search means for searching the recovery correspondence DB; and recovery means, and
    when a failure occurs in the user system, the recovery system is started from the user system and receives the input data at the time of the failure and the error code sent from the user system; the recovery correspondence DB search means searches the recovery correspondence DB and, if the response necessity indicates that handling is required, acquires the complementary data corresponding to the content of the failure; and the recovery means repairs the input data at the time of the failure sent from the user system, transmits the repaired data to the user system, and causes the subsequent processing after the failure to continue.
JP2011245620A 2011-11-09 2011-11-09 Computer system and recovery method Pending JP2013101548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011245620A JP2013101548A (en) 2011-11-09 2011-11-09 Computer system and recovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011245620A JP2013101548A (en) 2011-11-09 2011-11-09 Computer system and recovery method

Publications (1)

Publication Number Publication Date
JP2013101548A true JP2013101548A (en) 2013-05-23

Family

ID=48622101

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011245620A Pending JP2013101548A (en) 2011-11-09 2011-11-09 Computer system and recovery method

Country Status (1)

Country Link
JP (1) JP2013101548A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04178843A (en) * 1990-11-14 1992-06-25 Tohoku Nippon Denki Software Kk Automatic handling system for program abort
JPH04217033A (en) * 1990-12-19 1992-08-07 Nec Software Kansai Ltd Automatic parameter correcting and reexecuting system
JP2004295364A (en) * 2003-03-26 2004-10-21 Ntt Comware Corp Database access system and method, database access server, and computer program
JP2004334869A (en) * 2003-05-07 2004-11-25 Microsoft Corp Diagnosis and solution of computer problem by program, and automatic report and updating thereof
JP2008210047A (en) * 2007-02-23 2008-09-11 Mitsubishi Electric Corp Business event data supplementation device and business event data supplementation program

Similar Documents

Publication Publication Date Title
US20180365264A1 (en) Telemetry system for a cloud synchronization system
JP2013520746A (en) System and method for failing over non-cluster aware applications in a cluster system
CN101060436A (en) A fault analyzing method and device for communication equipment
KR20150033711A (en) Run-time error repairing method, device and system
JP5075736B2 (en) System failure recovery method and system for virtual server
JP5119935B2 (en) Management program, management apparatus, and management method
JP5976221B2 (en) Information backup method and apparatus
US9146839B2 (en) Method for pre-testing software compatibility and system thereof
TWI582616B (en) Formatting data by example
CN103916482B (en) Sqlite one kind of data based on synchronous transmission
US6966014B2 (en) Method for system obstacle correspondence support
JPWO2006117833A1 (en) Monitoring simulation apparatus, method and program thereof
US9753954B2 (en) Data node fencing in a distributed file system
CN102750283A (en) Massive data synchronization system and method
EP1675007A1 (en) Fault management system in multistage copy configuration
JP2009015476A (en) Journal management method in cdp remote configuration
CN101777014A (en) Backup processing method and device
WO2013140608A1 (en) Method and system that assist analysis of event root cause
CN103458086B (en) An intelligent phones and fault detection method
US20130124914A1 (en) Method and Device for Detecting Data Reliability
JP2005258501A (en) Obstacle influence extent analyzing system, obstacle influence extent analyzing method and program
TWI608344B (en) Robust hardware fault management system, method and framework for enterprise devices
CN103699548A (en) Method and equipment for recovering database data by using logs
US20150213100A1 (en) Data synchronization method and system
CN104239476A (en) Method, device and system for synchronizing databases

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20141105

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20150630

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20150714

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20151208