CN107301115B

CN107301115B - Application program exception monitoring and recovery method and device

Info

Publication number: CN107301115B
Application number: CN201710494314.5A
Authority: CN
Inventors: 汪晓臣; 夏德春; 张铭; 杜呈欣; 陈栋; 阚庭明; 吴卉; 孙同庆; 黄志威; 田源; 赵伟慧; 孟宇坤; 王志飞; 韦登荣; 蔡晓蕾
Original assignee: China Academy of Railway Sciences Corp Ltd CARS; Institute of Computing Technologies of CARS; Beijing Jingwei Information Technology Co Ltd
Current assignee: China Academy of Railway Sciences Corp Ltd CARS; Institute of Computing Technologies of CARS; Beijing Jingwei Information Technology Co Ltd
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2020-12-18
Anticipated expiration: 2037-06-26
Also published as: CN107301115A

Abstract

The invention provides a method and device for monitoring application program exceptions and recovering from them, addressing the shortcomings of traditional watchdog programs. The method comprises the following steps: extracting information from the application program log file according to a preset extraction rule, and recording the information in a data table; analyzing the information in the data table to determine whether the log data matches a preset exception rule; and, if the log data matches the preset exception rule, generating a fault recovery instruction according to the analysis result and a preset recovery strategy. The method monitors the application program by identifying keywords in the log that the program outputs, so no code needs to be embedded in the application program. Compared with modes based on message interaction, keep-alive reporting and the like, the monitored program does not need to carry monitoring code unrelated to its business logic, which reduces the coupling and instability of the program.

Description

Application program exception monitoring and recovery method and device

Technical Field

The invention relates to computer technology, and in particular to a method and device for monitoring and recovering from application program exceptions.

Background

When an application program is deployed and run in a production environment, a watchdog program is generally set up to improve the reliability of the program and to ensure that it can be recovered promptly when an exception occurs. The core idea of a traditional watchdog program is to periodically check the running state of the monitored process and to restart the monitored program as soon as that state becomes abnormal. The main monitoring modes are as follows:

(1) message interaction: the monitoring process and the monitored process communicate with each other, and the monitored process periodically sends messages to the monitoring process;

(2) keep-alive reporting: the monitored process and the monitoring process agree on a keep-alive instruction for their communication and interact by exchanging that instruction periodically;

(3) PID monitoring: the monitoring process checks the PID of the monitored process in the operating system, and if the PID exists, the program is considered normal.
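As an illustration of the third mode only, a PID existence check might look like the following minimal sketch. It assumes a POSIX system, where sending signal 0 performs an existence check without delivering a signal; the function name pid_is_alive is hypothetical and does not come from the patent.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 delivers nothing; the call only checks that the PID exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A check of this kind only reports that the process exists; it says nothing about whether the process is still handling its business data correctly.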

The conventional watchdog program has the following disadvantages:

(1) on one hand, whether the monitoring function is implemented through message interaction or through keep-alive reporting, monitoring code unrelated to the business logic must be embedded in the monitored program, which increases the coupling and instability of the code; moreover, if a plurality of application programs are monitored, monitoring code must be added to each of them, so extensibility is poor;

(2) on the other hand, application program errors are varied and complex. Some error types are anticipated when the program is written, but unanticipated exceptions are difficult for existing watchdog programs to detect. For example, in some application programs, once an error outside the design agreed by the two communicating sides occurs at the application program interface, such as a data message error or an instruction error, the process still appears to be in a normal state, yet it can no longer process subsequent message data correctly.

Disclosure of Invention

In view of the above, the present invention proposes an application program exception monitoring and recovery method and device that overcome, or at least partially solve, the above-mentioned problems.

To this end, in a first aspect, the present invention provides an application exception monitoring and recovery method, including:

extracting information in the application program log file according to a preset extraction rule, and recording the information into a data table;

analyzing the information in the data table to determine whether the log data matches a preset exception rule;

and, if the log data matches the preset exception rule, generating a fault recovery instruction according to the analysis result and a preset recovery strategy.

Optionally, the fault information generated by the monitored application program is recorded according to the analysis result, and the user is notified of the fault through one or more of a webpage pop-up, a short message, or an e-mail.

Optionally, the preset exception rule includes an illegal keyword rule and a legal keyword rule. The illegal keyword rule is a first exception rule, formed by a single illegal communication keyword and/or a plurality of illegal communication keywords, and is used for identifying an abnormal state;

the preset exception rule comprises at least one first exception rule;

the preset exception rule also comprises a second exception rule, formed by a single legal communication keyword and/or any plurality of legal communication keywords, which is used for identifying a normal state. The preset exception rule comprises at least one second exception rule.

Optionally, generating the fault recovery instruction includes:

generating a thread-level recovery instruction for recovering the thread in which the exception occurred;

and generating a program-level recovery instruction, namely a program restart instruction, for an exception that can be recovered only by restarting the program.
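The two levels of recovery instruction can be sketched as follows. This is only an illustration under stated assumptions: the fault names, the POLICY table, the RecoveryInstruction type, and the fallback to a program restart for unknown faults are all hypothetical and are not specified by the patent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    THREAD = "thread"
    PROGRAM = "program"


@dataclass(frozen=True)
class RecoveryInstruction:
    level: Level
    target: str
    action: str


# Hypothetical recovery policy: fault name -> level at which it can be recovered.
POLICY = {
    "worker_stalled": Level.THREAD,
    "heap_exhausted": Level.PROGRAM,
}


def generate_instruction(fault: str, app: str, thread: Optional[str] = None) -> RecoveryInstruction:
    """Choose a thread-level or program-level instruction for a detected fault."""
    # Assumption: faults missing from the policy fall back to a program restart.
    level = POLICY.get(fault, Level.PROGRAM)
    if level is Level.THREAD and thread is not None:
        return RecoveryInstruction(Level.THREAD, f"{app}:{thread}", "restart_thread")
    return RecoveryInstruction(Level.PROGRAM, app, "restart_program")
```

For instance, generate_instruction("worker_stalled", "gateway", thread="rx-1") yields a thread-level instruction, while a fault that the policy marks as program-level yields a restart instruction for the whole program.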

In a second aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as set forth in any one of the above.

In a third aspect, the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when executing the program.

According to the above technical scheme, no code needs to be embedded in the application program; the application program is monitored by identifying keywords in the log that it outputs. Compared with message interaction, keep-alive reporting and similar modes, the monitored program does not need to carry monitoring code unrelated to its business logic, which reduces the coupling and instability of the program. In addition, the method of the invention can monitor a plurality of programs at the same time, so the monitoring service is simple to apply and highly extensible. Compared with a watchdog system that monitors the process PID, it can also detect the class of failures in which the application program process appears normal but its business processing is abnormal.

The foregoing is a brief summary that provides an understanding of some aspects of the invention. This section is neither an extensive nor an exhaustive overview of the invention and its various embodiments. It is intended neither to identify key or critical features of the invention nor to delineate the scope of the invention, but rather to present selected principles of the invention in a simplified form as a brief introduction to the more detailed description presented below. It is to be understood that other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of the method in one embodiment of the present invention.

Detailed Description

The invention will be described in connection with an exemplary system.

As shown in FIG. 1, the present invention provides an application program exception monitoring and recovery method, including:

S101, extracting information from an application program log file according to a preset extraction rule, and recording the information in a data table;

S102, analyzing the information in the data table to determine whether the log data matches a preset exception rule;

S103, if the log data matches the preset exception rule, generating a fault recovery instruction according to the analysis result and a preset recovery strategy.
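The three steps above can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the log layout, the keywords, the fault names, and the contents of EXTRACTION_RULE, EXCEPTION_RULES and RECOVERY_POLICY are all assumptions made for the example, and the data table is modelled as an in-memory list.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Record:
    timestamp: str
    keyword: str
    line: str


# Hypothetical extraction rule: a timestamp followed by one watched keyword.
EXTRACTION_RULE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?(?P<kw>TIMEOUT|CONNECTED|ERROR)"
)
# Hypothetical exception rules (keyword -> fault) and recovery policy (fault -> instruction).
EXCEPTION_RULES = {"TIMEOUT": "link_timeout", "ERROR": "generic_error"}
RECOVERY_POLICY = {"link_timeout": "restart_thread", "generic_error": "restart_program"}


def extract(lines: Iterable[str]) -> List[Record]:
    """Step one: extract keyword records from log lines into a data table."""
    table = []
    for line in lines:
        m = EXTRACTION_RULE.search(line)
        if m:
            table.append(Record(m["ts"], m["kw"], line))
    return table


def analyze(table: List[Record]) -> List[str]:
    """Step two: return the faults whose exception rule matches a record."""
    return [EXCEPTION_RULES[r.keyword] for r in table if r.keyword in EXCEPTION_RULES]


def generate_instructions(faults: List[str]) -> List[str]:
    """Step three: map each detected fault to a preset recovery instruction."""
    return [RECOVERY_POLICY[f] for f in faults]
```

With two log lines, one containing CONNECTED and one containing TIMEOUT, extract returns two records, analyze reports only the timeout fault, and generate_instructions maps it to the thread restart chosen in the assumed policy.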

For example, while a Java application program is running, some exceptions can be caught and handled with try { } / catch () blocks, whereas others can be caught this way but cannot be handled, in which case a log entry is output after the exception is caught. The method of the present invention is designed around the computing logic of such application programs and the logs they output. Exceptions of this kind eventually cause the program to run abnormally, in a way that deviates from what the designer intended; by detecting them and restarting the program promptly, normal operation can be restored in time.

In some embodiments of the invention, the keywords in the log may be extracted as follows: the running log is recorded with a tool such as log4j; a collection module collects the communication keywords, the illegal communication keywords and the log timestamp data; and the collected data is recorded in a data table. The collection module collects keywords according to the extraction rule, which is designed according to the functions and operating logic of the application program and the characteristics of its log.
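A collection step of this kind might be sketched as below. The sketch assumes log lines that begin with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp and a level, and it uses an in-memory SQLite table as the data table; the keyword names, the table name log_records and the function collect are hypothetical.

```python
import re
import sqlite3

# Assumed line layout: timestamp with milliseconds, log level, then the message.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
# Hypothetical communication keywords and illegal communication keywords.
KEYWORDS = ("REQ_SENT", "RESP_OK", "BAD_FRAME")


def collect(lines, conn):
    """Store (timestamp, keyword, content) for every keyword found; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_records (ts TEXT, keyword TEXT, content TEXT)"
    )
    rows = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # lines that do not follow the assumed layout are skipped
        for kw in KEYWORDS:
            if kw in m["msg"]:
                rows.append((m["ts"], kw, m["msg"]))
    conn.executemany("INSERT INTO log_records VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Keeping the timestamp next to each keyword is what later allows rules that depend on order or elapsed time to be evaluated against the table.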

It is understood that the extraction rule, the exception rule and the recovery strategy need to be preset before the method of the present application is executed; this may be done when the configuration parameters are configured.
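One possible way to preset these three items is a single configuration document that is validated at load time. The JSON layout, the section names and the rule names below are assumptions made for illustration; the patent does not prescribe a configuration format.

```python
import json

# Hypothetical configuration holding the three preset items.
CONFIG_TEXT = """
{
  "extraction_rules": ["CONNECTED", "TIMEOUT"],
  "exception_rules": [
    {"name": "link_timeout", "illegal_keywords": ["TIMEOUT"]}
  ],
  "recovery_policy": {"link_timeout": "restart_program"}
}
"""

REQUIRED_SECTIONS = ("extraction_rules", "exception_rules", "recovery_policy")


def load_config(text):
    """Parse the configuration and check that every exception rule has a recovery policy."""
    cfg = json.loads(text)
    missing = [key for key in REQUIRED_SECTIONS if key not in cfg]
    if missing:
        raise ValueError(f"missing configuration sections: {missing}")
    for rule in cfg["exception_rules"]:
        if rule["name"] not in cfg["recovery_policy"]:
            raise ValueError(f"no recovery policy for rule {rule['name']!r}")
    return cfg
```

Validating at load time means that a rule without a matching recovery strategy is reported before monitoring starts rather than when the fault first occurs.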

If the log data matches the preset exception rule, the application program in which the error occurred can be identified through analysis, and the fault recovery instruction is used to recover that application program. Recovery may consist of a restart, or of collecting the error data and then executing recovery steps according to that data.

The recovery instructions are preset, and the corresponding recovery instruction is selected and invoked according to the analysis result. For example, when the method of the invention is used to monitor a wireless network module, the information in the log output by the wireless network module is extracted according to the preset rule and recorded in the data table; the extracted information may be a record formed by combining a keyword and a timestamp. The information in the data table is then analyzed to determine whether the log data matches a preset exception rule.

For example, in an embodiment of the present invention, a log entry that includes CONNECTED should also include the name of the connected wireless network (i.e., the SSID). That is, when the CONNECTED keyword is found in the log, the content and timestamp of that log entry are stored in the data table, and the entry is analyzed to determine whether it includes a wireless network name. If the entry contains CONNECTED but no wireless network name, an anomaly is identified. In that case, if the preset recovery strategy is to restart the wireless network, an instruction for restarting the wireless network is generated.
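The single-entry check in this example can be sketched as follows. The sketch assumes that the network name appears in the log as "ssid=<name>"; that field format, the function name wireless_instruction and the instruction string are hypothetical.

```python
import re

CONNECTED = re.compile(r"\bCONNECTED\b")
SSID_FIELD = re.compile(r"\bssid=(\S+)")  # assumed way the network name is logged


def wireless_instruction(line):
    """Return a restart instruction if CONNECTED is logged without a network name."""
    if not CONNECTED.search(line):
        return None  # the rule does not apply to this log entry
    if SSID_FIELD.search(line):
        return None  # a network name is present, so the entry is normal
    return "restart_wireless_network"
```

The word boundaries keep an entry such as DISCONNECTED from being mistaken for the CONNECTED keyword.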

It is understood that the above embodiment only illustrates identifying an anomaly from a single log entry. In actual monitoring, an anomaly may be identified from multiple log entries; for example, anomaly Z is identified when keyword X appears 3 times repeatedly and keyword Y then appears in the log after 1 minute. As a further example, when an application program works normally, it usually needs to issue instruction requests periodically and obtain response data, and the request/response keywords are recorded in the log file. Given the business requirements of the application program, some request/response pairs appear regularly, and a response appears only after the corresponding request has been sent. Regular expressions describing the request/response keywords that appear in the log can therefore be abstracted, and the keywords and keyword regular expressions are loaded at startup by the configuration parameter loading module. The log analysis module matches the collected log data against the keyword regular expressions and analyzes the strings with a regular expression matching algorithm, so that it can accurately monitor whether the data communication state of the program is normal.
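One way to express the request/response regularity as a regular expression is to encode each extracted keyword as a single character and match the whole sequence. The keywords REQ and RESP, the one-letter encoding, and the decision to tolerate one pending request at the end of the sequence are assumptions made for this sketch.

```python
import re

# Hypothetical keywords, each encoded as one character.
TOKENS = {"REQ": "Q", "RESP": "R"}
# Normal traffic: request/response pairs, optionally ending with one pending request.
NORMAL_PATTERN = re.compile(r"(?:QR)*Q?")


def communication_is_normal(keywords):
    """Return True if the keyword sequence matches the expected request/response pattern."""
    encoded = "".join(TOKENS[k] for k in keywords if k in TOKENS)
    return NORMAL_PATTERN.fullmatch(encoded) is not None
```

Two requests in a row, or a response with no preceding request, fail the full match and would be treated as an abnormal communication state.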

The method of the invention monitors the application program by identifying keywords in the log that the program outputs, so no code needs to be embedded in the application program. Compared with modes based on message interaction, keep-alive reporting and the like, the monitored program does not need to carry monitoring code unrelated to its business logic, which reduces the coupling and instability of the program.

In addition, the method of the invention can monitor a plurality of programs at the same time, so the monitoring service is simple to apply and highly extensible. Compared with a watchdog system that monitors the process PID, it can also detect the class of failures in which the application program process appears normal but its business processing is abnormal.

In some embodiments of the present invention, the fault information generated by the monitored application program is recorded according to the analysis result, and the user is notified of the fault through one or more of a webpage pop-up, a short message, or an e-mail, so that the user can check and analyze the log in time.
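Recording a fault and notifying the user over several channels could be organised as in the sketch below. The channels are injected as plain callables so that the example stays self-contained; the class FaultNotifier, the channel names and the message format are hypothetical, and no real mail or short message API is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultNotifier:
    # Hypothetical channel registry: channel name -> function that delivers a message.
    channels: Dict[str, Callable[[str], None]]
    history: List[str] = field(default_factory=list)

    def report(self, app: str, fault: str, enabled: List[str]) -> List[str]:
        """Record the fault, send it over each enabled channel, and return the channels used."""
        message = f"[{app}] fault detected: {fault}"
        self.history.append(message)  # keep a record of the fault information
        sent = []
        for name in enabled:
            send = self.channels.get(name)
            if send is not None:  # channels that are not configured are skipped
                send(message)
                sent.append(name)
        return sent
```

In a deployment, each callable would wrap the actual pop-up, short message or e-mail mechanism in use.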

Optionally, the preset exception rule includes an illegal keyword rule and a legal keyword rule. The illegal keyword rule is a first exception rule, formed by a single illegal communication keyword and/or a plurality of illegal communication keywords, and is used for identifying an abnormal state; it is also referred to as an illegal state rule. That is, if the first exception rule is satisfied, the monitored application program is considered to be in an abnormal state.

The preset exception rule also comprises a second exception rule, formed by a single legal communication keyword and/or a plurality of legal communication keywords, which is used for identifying a normal state. The second exception rule is also called a legal state rule; that is, if the second exception rule is satisfied, the monitored application program is considered to be in a normal state.
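The distinction between the two kinds of rule can be sketched as follows. The rule names and keywords are hypothetical, and checking illegal state rules before legal state rules is an assumption of this sketch rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class KeywordRule:
    name: str
    keywords: Tuple[str, ...]  # a single keyword or several keywords

    def matches(self, seen: Set[str]) -> bool:
        # The rule is satisfied when every one of its keywords was observed.
        return all(k in seen for k in self.keywords)


def classify(
    observed: Iterable[str],
    illegal_rules: Iterable[KeywordRule],
    legal_rules: Iterable[KeywordRule],
) -> Tuple[str, Optional[str]]:
    """Return ("abnormal" | "normal" | "unknown", name of the matching rule)."""
    seen = set(observed)
    for rule in illegal_rules:  # assumption: illegal state rules take precedence
        if rule.matches(seen):
            return ("abnormal", rule.name)
    for rule in legal_rules:
        if rule.matches(seen):
            return ("normal", rule.name)
    return ("unknown", None)
```

Returning "unknown" when neither kind of rule matches leaves room for a separate decision about entries that the configured rules do not cover.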

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

A single keyword is a continuous character string; the word "single" here denotes quantity. A single keyword may also simply be called a keyword. A keyword may be taken from the information that the running program outputs to the log; for example, it may be a directory name, a script name, a class name in the program code, and the like. An exception rule is composed of keywords and the relations between them, and it can be understood that the exception rules also include fault-tolerant rules. In some embodiments, an exception rule may be represented by a state machine: keywords are extracted from the log according to the preset extraction rule, and the state of the state machine is determined from the extracted keywords.
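A rule represented as a state machine might look like the following sketch. The states, the keywords and the transition table are hypothetical; keywords with no transition from the current state are ignored, which is one simple way to make the rule tolerant of unrelated log output.

```python
# Hypothetical transition table: (current state, keyword) -> next state.
TRANSITIONS = {
    ("down", "SCAN"): "joining",
    ("joining", "CONNECTED"): "up",
    ("joining", "AUTH_FAIL"): "abnormal",
    ("up", "LINK_LOST"): "abnormal",
}


def run_state_machine(keywords, start="down"):
    """Feed extracted keywords to the state machine and return the final state."""
    state = start
    for kw in keywords:
        # A keyword with no transition from the current state leaves the state unchanged.
        state = TRANSITIONS.get((state, kw), state)
        if state == "abnormal":
            break  # the abnormal state is terminal in this sketch
    return state
```

A final state of "abnormal" would then be handed to the recovery strategy, while any other state means that no exception rule has fired yet.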

A combination of single keywords herein includes a combination formed by a plurality of keywords in sequence; that is, if the keywords appear in the log in that order, the rule is considered to be met. Whether the monitored application program is in an abnormal state or a normal state is identified according to whether the combination of keywords belongs to the first exception rule or the second exception rule. It is understood that a combination of a plurality of keywords also includes a combination with time-series characteristics. For example, one embodiment of the present invention includes an exception rule formed by a combination of keywords, namely the rule: keyword X appears 3 times repeatedly, and keyword Y then appears in the log after 1 minute.

In other embodiments of the present invention, keywords are extracted from the log, and a rule formed by a preset single keyword or a combination of keywords can further be used to identify whether the monitored application program is in a normal state. A rule formed in this way by a single keyword or a combination of keywords is also called a fault-tolerant rule.
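The time-series rule in this example can be sketched as below. The wording of the rule leaves some room for interpretation; this sketch assumes that X must appear three times consecutively among the extracted keywords and that Y must then appear at least one minute after the third X. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def matches_temporal_rule(records, x="X", y="Y", repeats=3, delay=timedelta(minutes=1)):
    """records: (timestamp, keyword) pairs in time order, as read from the data table."""
    run = 0  # length of the current run of consecutive x keywords
    last_x_at = None  # timestamp at which the run of x reached `repeats`
    for ts, kw in records:
        if last_x_at is None:
            run = run + 1 if kw == x else 0
            if run == repeats:
                last_x_at = ts
        elif kw == y and ts - last_x_at >= delay:
            return True
    return False
```

A Y that arrives before the delay has elapsed is ignored in this reading, and the function keeps waiting for a later Y.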

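Optionally, generating the fault recovery instruction includes: generating a thread-level recovery instruction for recovering the thread in which the exception occurred; and generating a program-level recovery instruction, namely a program restart instruction, for an exception that can be recovered only by restarting the program.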

The invention provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth in any of the above.

The invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method as described in any one of the above when executing the program.

As used herein, "monitoring" includes any type of function associated with observing, recording or detecting with an instrument that does not have any effect on the operation or status of the element or group of elements being monitored.

As used herein, "at least one," "one or more," and "and/or" are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions "at least one of A, B and C," "at least one of A, B or C," "one or more of A, B and C," and "one or more of A, B or C" means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.

The term "a" or "an" entity refers to one or more of that entity. Thus the terms "a", "an", "one or more" and "at least one" are used interchangeably herein. It should also be noted that the terms "comprising," "including," and "having" are also used interchangeably.

The term "automated" and variations thereof, as used herein, refer to any process or operation that is completed without substantial human input when the process or operation is performed. However, a process or operation may be automated even if its performance uses substantial or insubstantial human input received before the process or operation is performed. Human input is considered substantial if it affects how the process or operation will proceed. Human input that does not affect the process or operation is not considered substantial.

The term "computer-readable medium" as used herein refers to any tangible storage device and/or transmission medium that participates in providing instructions to a processor for execution. The computer-readable medium may be a serial set of instructions encoded in a network transport (e.g., SOAP) over an IP network. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, NVRAM or magnetic or optical disks. Volatile media include dynamic memory, such as main memory (e.g., RAM). Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a magneto-optical medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, a solid state medium such as a memory card, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read. Digital file attachments to e-mail or other self-contained information archives or sets of archives are considered distribution media equivalent to tangible storage media. When the computer-readable medium is configured as a database, it should be understood that the database may be any type of database, such as a relational database, a hierarchical database, an object-oriented database, and the like. Accordingly, the present invention is considered to include a tangible storage or distribution medium, together with equivalents recognized in the prior art and media developed in the future, in which a software implementation of the present invention is stored.

The terms "determine," "calculate," and "compute," and variations thereof, as used herein, are used interchangeably and include any type of methodology, process, mathematical operation or technique. More specifically, such terms may include interpreted rules or rule languages such as BPEL, where logic is not hard coded but represented in a rule file that can be read, interpreted, compiled, and executed.

The term "module" or "tool" as used herein refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and software that is capable of performing the functionality associated with that element. Additionally, while the invention has been described with reference to exemplary embodiments, it should be understood that aspects of the invention may be separately claimed.

It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising …" or "including …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the stated number; the terms "above", "below", "within" and the like are understood to include the stated number.

Although the embodiments have been described, those skilled in the art can make other variations and modifications to them once the basic inventive concept is understood. The above embodiments are therefore only examples of the present invention and are not intended to limit its scope; all equivalent structures or equivalent processes derived from the contents of the present specification and drawings, whether applied directly or indirectly in this or any other related technical field, are included in the scope of the present invention.

Claims

1. An application program exception monitoring and recovery method, characterized by comprising the following steps:

if the log data matches a preset exception rule, generating a fault recovery instruction according to the analysis result and a preset recovery strategy; wherein the exception rules include fault-tolerant rules;

the preset exception rule comprises a first exception rule, which is composed of a single illegal communication keyword and/or a plurality of illegal communication keywords and is used for identifying an abnormal state; the preset exception rule comprises at least one first exception rule;

the preset exception rule also comprises a second exception rule, which is composed of a single legal communication keyword and/or a plurality of legal communication keywords and is used for identifying a normal state;

the preset exception rule comprises at least one second exception rule.

2. The method of claim 1, wherein the fault information generated by the monitored application program is recorded according to the analysis result, and the user is notified of the fault through one or more of a webpage pop-up, a short message, or an e-mail.

3. The method of claim 1, wherein generating the fault recovery instruction comprises:

4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.

5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 3 are implemented when the program is executed by the processor.