CN112749038A - Method and system for realizing software watchdog in software system

Info

Publication number: CN112749038A
Application number: CN202110104876.0A
Authority: CN
Inventors: 赵康; 瞿洪桂
Original assignee: Beijing Sinonet Science and Technology Co Ltd
Current assignee: Beijing Sinonet Science and Technology Co Ltd
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2021-05-04
Anticipated expiration: 2041-01-26
Also published as: CN112749038B

Abstract

The invention discloses a method and a system for implementing a software watchdog in a software system. The method comprises: S1, starting a monitoring process, which loads a configuration file and then enters S2; if no configuration file exists, a default configuration file is generated automatically before entering S2; S2, loading the configuration file into memory and, according to the configuration file, loading all monitored service processes into a process linked list in memory; the monitoring process then starts the service processes one by one according to the process linked list, registers watchdog information for each service process after it is started, and adds the watchdog information one by one to the watchdog information linked list corresponding to the process linked list. The advantage is that, by periodically checking the process file state in the virtual file system instead of relying on inter-process message communication, the method solves the problem of false restarts caused by a busy service process failing to send heartbeat messages to the monitoring process.

Description

Method and system for realizing software watchdog in software system

Technical Field

The invention relates to the technical field of software service monitoring, in particular to a method and a system for realizing a software watchdog in a software system.

Background

Currently, watchdog systems are divided into hardware watchdog systems and software watchdog systems. A hardware watchdog system generates an interrupt and restarts the system when the system enters an unrecoverable error state, and is mainly used in embedded systems. A hardware watchdog system is expensive to manufacture and has a single function, and a system restart terminates other processes that are running normally. A software watchdog system is, in most cases, implemented with Linux inter-process communication, which carries messages between a monitoring process and the service processes: each service process periodically sends a heartbeat to the monitoring process to prove that it is running normally. When the monitoring process finds that a process has not sent a heartbeat message for a long time, it judges that the process has gone down and restarts it so that the system returns to normal. However, a process may be so busy with its normal work that it has no time to send a heartbeat message to the monitoring process; in that case the monitoring process mistakenly concludes that a normally running program has exited abnormally, which causes unnecessary fault recovery.

Disclosure of Invention

The present invention aims to provide a method and a system for implementing a software watchdog in a software system, so as to solve the aforementioned problems in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A method for implementing a software watchdog in a software system, comprising the following steps:

S1, starting a monitoring process, which loads a configuration file and then enters S2; if no configuration file exists, a default configuration file is generated automatically before entering S2;

S2, loading the configuration file into memory and, according to the configuration file, loading all monitored service processes into a process linked list in memory; the monitoring process starts the service processes one by one according to the process linked list, registers watchdog information for each service process after it is started, and adds the watchdog information one by one to the watchdog information linked list corresponding to the process linked list;

S3, periodically traversing the process linked list, clearing or accumulating the watchdog timer of each service process according to its survival state, restarting the corresponding service process once the timeout is reached, and registering watchdog information for that service process again;

S4, cyclically traversing each item of watchdog information in the watchdog information linked list and, by comparing the timer in the watchdog information with the timeout in the same watchdog information, determining whether the corresponding service process has been restarted and how many times, so that different coping strategies can be adopted.

Preferably, the configuration file includes a name of the service process, whether the service process needs to be started, a parameter for starting the service process, and a delay time of the service process.
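The configuration handling described in S1 and in the paragraph above can be sketched as follows. This is only an illustration: the patent does not specify a file format, so JSON, the key names (`processes`, `name`, `active`, `start_args`, `delay_seconds`), the empty default configuration, and the function name `load_or_create_config` are all assumptions.

```python
import json
import os

# Assumed default: an empty list of monitored processes.
DEFAULT_CONFIG = {"processes": []}


def load_or_create_config(path):
    """Load the configuration; write a default file first if none exists (S1)."""
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(DEFAULT_CONFIG, f, indent=2)
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    entries = []
    for item in raw.get("processes", []):
        entries.append({
            "name": item["name"],                               # service process name
            "active": bool(item.get("active", True)),           # whether it needs to be started
            "start_args": item.get("start_args", ""),           # start parameters
            "delay_seconds": int(item.get("delay_seconds", 0)),  # delay time (unit assumed)
        })
    return entries
```

Calling `load_or_create_config("watchdog.json")` on a fresh directory would create the default file and return an empty list; entries missing optional keys are filled with the assumed defaults.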

Preferably, the watchdog information of the service process includes a process name, a process number, a timeout time, a survival flag, and a number of restarts of the service process.
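As a rough illustration of the two records described above, the sketch below models one entry of the process linked list and one entry of the watchdog information linked list. The field names follow the flags named later in the embodiment (active, isStarted, timeTick) where possible; the types, the defaults, and the use of plain Python lists in place of linked lists are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ProcessEntry:
    """One monitored service process, as described by the configuration file."""
    name: str
    active: bool = True        # whether the process needs to be started
    start_args: str = ""       # parameters used to start the process
    delay_seconds: int = 0     # delay before starting (unit assumed)
    is_started: bool = False   # set once the monitoring process has started it


@dataclass
class WatchdogInfo:
    """Watchdog information registered after a service process is started."""
    name: str
    pid: int                   # process number
    timeout: int               # timeout, in traversal ticks (unit assumed)
    alive: bool = True         # survival flag
    time_tick: int = 0         # watchdog timer
    restart_count: int = 0     # number of restarts


# The two "linked lists" are modelled here as ordinary lists.
process_list = [ProcessEntry(name="sdkdevds", delay_seconds=2)]
watchdog_list = [WatchdogInfo(name="sdkdevds", pid=1234, timeout=5)]
```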

Preferably, step S3 specifically includes the following steps,

S31, judging whether the current service process needs to be started; if not, entering S34; if yes, entering S32;

S32, judging whether the current service process has been started; if not, waiting for the delay time given in the configuration file, starting the service process, registering it in the watchdog information linked list, and entering S34; if yes, entering S33;

S33, acquiring the survival state of the service process according to its process number; if the survival state is alive, resetting the watchdog timer of the service process and entering S34; if the survival state is dead, setting the flag bit indicating that the service process has been started to false, and entering S34;

S34, judging whether the service process is the last one in the process linked list; if yes, ending this traversal and waiting for the next traversal of the process linked list to begin; if not, proceeding to the judgment of the next service process in the process linked list.
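One pass of steps S31 to S34 can be sketched as below. This is an illustrative reading, not the patented implementation: the `Proc` record, the function name `traverse_once`, and the injected `is_alive`, `start`, and `sleep` callables are hypothetical, and a Python list stands in for the process linked list.

```python
import time
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    active: bool = True
    is_started: bool = False
    delay_seconds: int = 0
    pid: int = 0
    time_tick: int = 0


def traverse_once(procs, is_alive, start, sleep=time.sleep):
    """Run S31-S34 once over the whole process list."""
    for p in procs:                      # S34: continue until the last entry
        if not p.active:                 # S31: process does not need to be started
            continue
        if not p.is_started:             # S32: not started yet
            sleep(p.delay_seconds)       # wait for the configured delay
            p.pid = start(p)             # start the process, remember its pid
            p.is_started = True
            p.time_tick = 0              # (re)register watchdog information
            continue
        if is_alive(p.pid, p.name):      # S33: query the survival state
            p.time_tick = 0              # alive: reset the watchdog timer
        else:
            p.is_started = False         # dead: restarted on the next pass
```

Because a dead process only has its started flag cleared in S33, the actual restart happens in S32 of the following traversal, which is how this sketch interprets the flow described above.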

Preferably, the specific process of acquiring the survival status of the service process according to the process number of the service process in step S33 is,

S331, obtaining the full paths of the cmdline file and the stat file of the service process in the virtual file system according to the process number of the service process;

S332, reading the stat file of the service process into memory; judging whether the process name in the stat file is the same as the name of the service process and whether the process state in the stat file is the zombie state; if the process name in the stat file is the same as the name of the service process and the process state in the stat file is not the zombie state, entering S333; otherwise, returning the survival state of the service process as dead;

S333, reading the cmdline file of the service process into memory; judging whether the cmdline file contains the name string of the service process; if so, returning the survival state of the service process as alive; if not, returning the survival state of the service process as dead.
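Steps S331 to S333 can be sketched as a single function that reads the two files under /proc. The function name `is_alive` and the `proc_root` parameter (added so the check can be exercised against a fake directory tree) are assumptions. One detail goes beyond the text: on Linux the name in the stat file is truncated to 15 characters, so the sketch compares against the truncated service name.

```python
import os


def is_alive(pid, name, proc_root="/proc"):
    """Return True if process `pid` is running under `name` and is not a zombie."""
    base = os.path.join(proc_root, str(pid))           # S331: build the file paths
    try:
        with open(os.path.join(base, "stat"), encoding="utf-8", errors="replace") as f:
            stat = f.read()
        # S332: stat looks like "pid (comm) state ...". The comm field may itself
        # contain spaces or parentheses, so split on the LAST ')'.
        left, _, right = stat.rpartition(")")
        comm = left.partition("(")[2]
        fields = right.split()
        state = fields[0] if fields else ""
        if comm != name[:15] or state == "Z":          # wrong name or zombie
            return False
        # S333: cmdline holds the NUL-separated command line.
        with open(os.path.join(base, "cmdline"), "rb") as f:
            cmdline = f.read().replace(b"\0", b" ").decode("utf-8", "replace")
    except OSError:                                    # no such pid directory
        return False
    return name in cmdline
```

A missing /proc/[pid] directory is treated as dead here, which matches the intent of the steps above even though they do not mention that case explicitly.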

Preferably, step S4 specifically includes the following steps,

S41, judging whether the watchdog timer of the service process corresponding to the current watchdog information is greater than the timeout in the current watchdog information; if so, the service process has been restarted, so the watchdog timer is reset, the restart count is incremented by 1, and the flow enters S42; if not, the watchdog timer is incremented and the flow enters S42;

S42, judging whether the current watchdog information is the last item in the watchdog information linked list; if so, ending this traversal and waiting for the next traversal of the watchdog information linked list to begin; if not, proceeding to the judgment of the next item of watchdog information in the watchdog information linked list.

Preferably, in step S41, when the restart count of the service process exceeds the preset number of restarts, either the system may be restarted, so that the service process does not keep failing to restart, or the starting of the service process may be stopped.
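The traversal in S41 and S42 reduces to a compare-and-count loop, sketched below under stated assumptions: the `WatchdogInfo` record and the function name `tick_watchdogs` are hypothetical, a Python list replaces the linked list, and the timer is counted in traversal ticks.

```python
from dataclasses import dataclass


@dataclass
class WatchdogInfo:
    name: str
    timeout: int            # timeout, in ticks (unit assumed)
    time_tick: int = 0      # watchdog timer
    restart_count: int = 0  # number of restarts


def tick_watchdogs(infos):
    """Run S41-S42 once; return the names whose timer exceeded the timeout."""
    restarted = []
    for info in infos:                       # S42: continue until the last entry
        if info.time_tick > info.timeout:    # S41: timer exceeded the timeout
            info.time_tick = 0               # reset the watchdog timer
            info.restart_count += 1          # count the restart
            restarted.append(info.name)
        else:
            info.time_tick += 1              # otherwise increment the timer
    return restarted
```

In this sketch a live process never reaches the timeout, because the liveness traversal keeps resetting its timer; only a process that stays dead accumulates ticks until the comparison in S41 succeeds.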

The invention also aims to provide a system for realizing the software watchdog in the software system, which is used for realizing any one of the above methods for realizing the software watchdog, and the system for realizing the software watchdog comprises,

a dynamic configuration module, used for having the started monitoring process load the configuration file and, according to the configuration file, load all monitored service processes into a process linked list in memory; the monitoring process starts the service processes one by one according to the process linked list, registers watchdog information for each service process after it is started, and adds the watchdog information one by one to the watchdog information linked list corresponding to the process linked list;

a timing detection module, used for periodically traversing the process linked list, clearing or accumulating the watchdog timer of each service process according to its survival state, restarting the corresponding service process once the timeout is reached, and registering watchdog information for that service process again; and for cyclically traversing each item of watchdog information in the watchdog information linked list and, by comparing the timer in the watchdog information with the timeout in the same watchdog information, determining whether the corresponding service process has been restarted and how many times, so that different coping strategies can be adopted;

a process state query module, used for acquiring the survival state of a service process according to its process number.

Preferably, the dynamic configuration module includes two interfaces, used respectively for dynamically enabling and disabling the monitoring of a given service process: a SetProcessActive interface and a SetProcessInactive interface;

the SetProcessActive interface is used for dynamically activating a service process, and takes the name of the service process, its start parameters, and its delay time; the interface first searches the process linked list for a service process with the same name; if none exists, the service process is added to the process linked list, is started in the first traversal after the delay time has elapsed, and has its watchdog information registered;

the SetProcessInactive interface is used for dynamically removing a service process from monitoring, and takes the name of the service process; the interface searches the process linked list and the watchdog information linked list for that service process; if it exists, the corresponding flag bit of the service process is set to false and the survival state of the service process is no longer checked.
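The behaviour of the two interfaces can be illustrated with the minimal sketch below. The class name `DynamicConfig`, the snake_case method names, and the dictionary layout are hypothetical; a single Python list stands in for both linked lists, and re-activating an entry that already exists is an assumption, since the text only describes the case where the name is not found.

```python
class DynamicConfig:
    """Toy stand-in for the dynamic configuration module."""

    def __init__(self):
        self.processes = []   # stands in for the process linked list

    def _find(self, name):
        for entry in self.processes:
            if entry["name"] == name:
                return entry
        return None

    def set_process_active(self, name, start_args="", delay_seconds=0):
        """SetProcessActive: add the process if absent so the next traversal starts it."""
        entry = self._find(name)
        if entry is None:
            entry = {"name": name, "start_args": start_args,
                     "delay_seconds": delay_seconds,
                     "active": True, "is_started": False}
            self.processes.append(entry)
        else:
            entry["active"] = True    # assumption: re-enable an existing entry
        return entry

    def set_process_inactive(self, name):
        """SetProcessInactive: clear the flag so the process is no longer checked."""
        entry = self._find(name)
        if entry is None:
            return False
        entry["active"] = False
        return True
```

For example, `cfg = DynamicConfig(); cfg.set_process_active("sdkdevds", delay_seconds=2)` queues the process for the next traversal, and `cfg.set_process_inactive("sdkdevds")` takes it out of monitoring without removing its entry.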

Preferably, when the restart count of the service process exceeds the preset number of restarts, either the system may be restarted, so that the service process does not keep failing to restart, or the starting of the service process is stopped through the SetProcessInactive interface of the dynamic configuration module.

The invention has the following beneficial effects: 1. By periodically checking the process file state in the virtual file system (/proc) instead of relying on inter-process message communication, it solves the problem of false restarts caused by a busy service process failing to send heartbeat messages to the monitoring process. 2. While solving the false-restart problem, it also reduces system overhead and the coupling between programs.

Drawings

FIG. 1 is a schematic flow chart of a method for implementing a watchdog in an embodiment of the present invention;

FIG. 2 is a schematic flow chart of loop detection of the survival status of each business process in the embodiment of the present invention;

FIG. 3 is a flowchart illustrating an embodiment of obtaining the survival status of a business process according to the process number of the business process;

FIG. 4 is a schematic flowchart of the cyclic detection of the watchdog information of each service process in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Example one

In this embodiment, as shown in FIG. 1, a method for implementing a software watchdog in a software system is provided, comprising the following steps,

In this embodiment, the configuration file includes a name of the service process, whether the service process needs to be started, a parameter for starting the service process, and a delay time of the service process.

In this embodiment, assume that the monitoring process starts M service processes in sequence. After each service process is started, watchdog information must be registered for it; each item of watchdog information contains the process name, process number (pid), timeout, survival flag, and restart count of the corresponding service process. One of the M service processes can be chosen as a sub-daemon process: while the monitoring process monitors the service processes, the sub-daemon process and the other processes guard one another, so that the software watchdog can be recovered quickly if it becomes abnormal, which ensures the reliability of the software watchdog itself.

As shown in FIG. 2, in this embodiment, step S3 specifically includes the following contents,

S31, judging whether the current service process needs to be started (that is, checking the active flag bit); if not, entering S34; if yes, entering S32;

S32, judging whether the current service process has been started (that is, checking the isStarted flag); if not, waiting for the delay time given in the configuration file, starting the service process, and registering it in the watchdog information linked list; if yes, entering S33;

S33, acquiring the survival state of the service process according to its pid; if the survival state is alive, clearing the watchdog timer (timeTick) of the service process; if the survival state is dead, setting the flag bit indicating that the service process has been started (the isStarted flag) to false;

As shown in fig. 3, in this embodiment, the specific process of acquiring the survival status of the service process according to the pid of the service process in step S33 is,

In this embodiment, the /proc directory on a Linux system is a file system, namely the proc file system. Unlike ordinary file systems, /proc is a pseudo file system (i.e., a virtual file system) that holds a set of special files reflecting the current running state of the kernel. Under the /proc directory there is a directory whose name matches the PID of each process, and this directory contains all the information about that process; the state of the process can be obtained by querying the information in this directory. Two files in it are used here: cmdline, the complete command used to start the current process (for a zombie process this file contains no information); and stat, the state information of the current process.

As shown in fig. 4, in this embodiment, step S4 specifically includes the following contents,

In step S41, when the restart count of the service process exceeds the preset number of restarts, either the system may be restarted, so that the service process does not keep failing to restart, or the starting of the service process may be stopped. For example, repeated restarts caused by insufficient system memory can be resolved by restarting the system, whereas repeated restarts caused by a fault in the service process itself can be handled by stopping the service process if it is not critical.

Consider a service process that restarts continuously until its restart count exceeds the preset number of restarts, for example ten consecutive restarts. There are two typical reasons for such continuous restarting: 1. system memory is insufficient, so the service process goes down after it is started; in this case the system can be restarted so that the service process does not keep failing to restart; 2. the service process itself has a fault; in this case the starting of the service process may be stopped if it is not a particularly important service process. The causes of process restarts are not limited to these two, and different strategies can be adopted for different causes. The preset number of restarts can be set according to actual conditions so as to better meet actual requirements.
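The two example strategies above can be written as a small decision function. Everything named here is hypothetical: the patent does not say how low memory is detected or how a process is marked as important, so `low_memory` and `critical` are plain inputs, and continuing to restart a critical, faulty process is an assumption.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                    # keep the normal restart behaviour
    REBOOT_SYSTEM = "reboot_system"  # restart the whole system
    STOP_PROCESS = "stop_process"    # stop starting this service process


def choose_action(restart_count, max_restarts, low_memory, critical):
    """Pick a coping strategy once the restart count exceeds the preset limit."""
    if restart_count <= max_restarts:
        return Action.NONE
    if low_memory:                   # cause 1: insufficient system memory
        return Action.REBOOT_SYSTEM
    if not critical:                 # cause 2: fault in a non-critical process
        return Action.STOP_PROCESS
    return Action.NONE               # assumption: keep restarting critical processes
```

Other causes of repeated restarts would need further branches; the function only covers the two cases named in the text.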

Example two

In this embodiment, a system for implementing a software watchdog in a software system is provided, where the system for implementing a software watchdog is used to implement the above method for implementing a software watchdog, and the system for implementing a software watchdog includes,

a process state query module; for obtaining its survival status according to the pid (process number) of the business process.

In this embodiment, the dynamic configuration module includes two interfaces respectively used for dynamically opening or closing monitoring on a certain service process, which are a SetProcessActive interface and a SetProcessInactive interface;

In this embodiment, consider a service process whose restart count exceeds the preset number of restarts, for example after ten consecutive restarts. There are two typical reasons for such continuous restarting: 1. system memory is insufficient, so the service process goes down after it is started; in this case the system can be restarted so that the service process does not keep failing to restart; 2. the service process itself has a fault; in this case, if it is not a particularly important service process, its starting is stopped through the SetProcessInactive interface of the dynamic configuration module, without affecting the work of the other service modules. The causes of process restarts are not limited to these two, and different strategies are adopted for different causes. The preset number of restarts can be set according to actual conditions so as to better meet actual requirements.

In this embodiment, on many Unix-like computer systems the process file system is a pseudo file system that is generated dynamically at startup and is used to access process information through the kernel. This file system is usually mounted at the /proc directory; because /proc is not a real file system, it occupies no storage space and only a limited amount of memory. The system for implementing the software watchdog provided by the invention does not rely on inter-process communication (IPC); instead, it obtains the survival state of the process whose process number is pid by reading the stat and cmdline files under the /proc/[pid]/ directory. The system solves the false-restart problem while reducing system overhead and the coupling between programs.

In this embodiment, the sdkdevds process is taken as an example. sdkdevds is a process for acquiring device capability. Once the sdkdevds process is started, a directory named after the process number of sdkdevds exists under the /proc/ directory; when the session monitoring process detects that sdkdevds has gone down within a certain time, it restarts sdkdevds and registers it in the watchdog information linked list.

By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:

The invention provides a method and a system for implementing a software watchdog in a software system. By periodically checking the process file state in the virtual file system (/proc) instead of relying on inter-process message communication, it solves the problem of false restarts caused by a busy service process failing to send heartbeat messages to the monitoring process. While solving the false-restart problem, it also reduces system overhead and the coupling between programs.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims

1. A method for implementing a software watchdog in a software system, characterized by comprising the following steps,

2. The method of claim 1 for implementing a software watchdog in a software system, wherein: the configuration file comprises the name of the business process, whether the business process needs to be started or not, the starting parameter of the business process and the delay time of the business process.

3. The method of claim 2 for implementing a software watchdog in a software system, wherein: the watchdog information of the business process comprises a process name, a process number, timeout time, a survival flag and restart times of the business process.

4. A method for implementing a software watchdog in a software system according to claim 3, characterized in that: the step S3 specifically includes the following contents,

5. The method of claim 4 for implementing a software watchdog in a software system, wherein: the specific process of acquiring the survival status according to the process number of the service process in step S33 is,

S333, reading the cmdline file of the service process into memory; judging whether the cmdline file contains the name string of the service process; if so, returning the survival state of the service process as alive; if not, returning the survival state of the service process as dead.

6. The method of claim 5 for implementing a software watchdog in a software system, wherein: the step S4 specifically includes the following contents,

7. The method of claim 6 for implementing a software watchdog in a software system, wherein: in step S41, when the restart count of the service process exceeds the preset number of restarts, either the system may be restarted, so that the service process does not keep failing to restart, or the starting of the service process may be stopped.

8. A system for implementing a software watchdog in a software system, characterized in that: the system is used to implement the method for implementing a software watchdog according to any one of claims 1 to 7, and the system for implementing a software watchdog comprises,

9. The system for implementing a software watchdog in a software system according to claim 8, characterized in that: the dynamic configuration module includes two interfaces, used respectively for dynamically enabling and disabling the monitoring of a given service process, namely a SetProcessActive interface and a SetProcessInactive interface;

10. The system for implementing a software watchdog in a software system according to claim 8, characterized in that: when the restart count of the service process exceeds the preset number of restarts, either the system may be restarted, so that the service process does not keep failing to restart, or the starting of the service process is stopped through the SetProcessInactive interface of the dynamic configuration module.