CN113064762A

CN113064762A - Service self-recovery method based on multiple detection

Info

Publication number: CN113064762A
Application number: CN202110384903.4A
Authority: CN
Inventors: 程永新; 宋辉; 苏树昌
Original assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Current assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2021-07-02
Anticipated expiration: 2041-04-09
Also published as: CN113064762B

Abstract

The invention discloses a service self-recovery method based on multiple detection, comprising the following steps: S1: a monitoring script is configured in advance on the monitored server; S2: the server's built-in scheduled-task facility runs the monitoring script automatically at regular intervals; S3: the monitoring script judges the availability of the service from the state of the monitoring items and, when any monitoring item is found to be abnormal, acquires and stores the fault environment information and then restarts the service. The method combines CURL page-response detection and process online-state detection with extraction of the faulty service's runtime environment information by the Jstack and Jmap commands. While the states of the service process and the test page are monitored, once a configured monitoring item is found to be offline or to return an abnormal response, the network connection state and the related Java virtual machine information of the Java service are saved first, the latter through the Jstack and Jmap commands, and the service is then restarted for self-recovery. The fault state of the program is thus preserved while the service is recovered, providing relevant information for subsequent fault analysis.

Description

Service self-recovery method based on multiple detection

Technical Field

The present invention relates to a service self-recovery method, and more particularly to a service self-recovery method based on multiple detection.

Background

Service monitoring and automatic recovery are now basic requirements for the high availability of a business system, and inspecting service states is one of the routine tasks of operation and maintenance personnel. As business requirements grow more complex and concurrency increases explosively, the demands on service availability rise accordingly. The service must be recovered as quickly as possible while relevant analysis information is extracted for troubleshooting, which places great pressure on operation and maintenance personnel. Conventional means clearly cannot meet these requirements, and their maintenance efficiency is low. Comprehensive self-detection, automated execution and automated information collection are the direction in which service self-recovery is developing.

Conventional operation and maintenance self-monitoring usually monitors only the service process, through its listening port, and the state of a test page. Once the listening port or the service is found to be offline, or the test page returns an abnormal response, the service is recovered by restarting it directly. As a result, fault environment information such as the network connection state and the JVM information of the Java service is not preserved, which makes troubleshooting very difficult. The prior art therefore still needs to be improved.

Disclosure of Invention

The invention provides a service self-recovery method based on multiple detection, which monitors the availability of a service by combining the service process state with the response status code of a test page. When either monitoring item is abnormal, the current network connection state of the server and the Java virtual machine information of the Java service are saved, and the service is then restarted for self-recovery, so that the relevant fault environment information is preserved while the service is recovered.

The technical solution adopted by the invention to solve the above technical problem is a service self-recovery method based on multiple detection, comprising the following steps: S1: a monitoring script is configured in advance on the monitored server; S2: the server's built-in scheduled-task facility runs the monitoring script automatically at regular intervals; S3: the monitoring script judges the availability of the service from the state of the monitoring items and, when any monitoring item is found to be abnormal, acquires and stores the fault environment information and then restarts the service.

Further, the monitoring items in step S3 include service process state monitoring and test page response state monitoring; whether the service process is online is judged by detecting whether a service process ID exists, and whether the test page responds normally is judged by detecting the return value of the test page.

Further, the monitoring script queries the service process ID by service name; if a service process ID is returned, the service process is online, and if none is returned, the service process is in an offline abnormal state.

Further, the monitoring script accesses the test page through a CURL command and obtains its return value; if the return value is normal, the test page is responding normally and the service is in a normal state; if there is no return value, the test page response is abnormal and the service is in a hung ("false death") state in which it does not respond.

Further, when the monitoring script is configured in step S1, a service variable and a test page address are defined in the monitoring script.

Further, the fault environment information includes network connection state information and Java virtual machine information related to the service; the monitoring script uses the netstat console command to collect the network connection state information and the current number of concurrent connections, the network connection state information including the routing table, the actual network connections and the state information of each network interface device.

Further, the Java virtual machine information includes Java stack information and Java heap memory information; the monitoring script calls the Jstack stack tracing tool to obtain Java stack information according to the service process ID and to generate a snapshot of the current threads of the Java virtual machine, the thread snapshot including the stack information of each thread; the monitoring script calls the Jmap heap memory tracing tool to obtain the memory map or heap memory information of the Java process, reflecting the memory image of the Java heap in use, the memory image including system information, virtual machine attributes, a complete thread dump and the state information of all classes and objects.

Compared with the prior art, the invention has the following beneficial effects: the service self-recovery method based on multiple detection combines CURL page-response detection and process online-state detection with extraction of the faulty service's runtime environment information by the Jstack and Jmap commands. While the states of the service process and the test page are monitored, once a configured monitoring item is found to be offline or to return an abnormal response, the network connection state and the related Java virtual machine information of the Java service are saved first, the latter through the Jstack and Jmap commands, and the service is then restarted for self-recovery. The fault state of the program is thus preserved while the service is recovered, providing relevant information for subsequent fault analysis and post-incident review.

Drawings

FIG. 1 is a flow chart of a service self-recovery method based on multiple detection according to an embodiment of the present invention;

FIG. 2 is a flow chart of a monitoring script according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the figures and examples.

FIG. 1 is a flow chart of a method for self-recovery of services based on multiple probing according to an embodiment of the present invention; FIG. 2 is a flow chart of a monitoring script according to an embodiment of the present invention.

Referring to FIG. 1, the service self-recovery method based on multiple detection according to the embodiment of the present invention includes the following steps:

S1: a monitoring script is configured in advance on the monitored server; when the monitoring script is configured, a service variable and a test page address are defined in it;

S2: the server's built-in scheduled-task facility runs the monitoring script automatically at regular intervals;

S3: the monitoring script judges the availability of the service from the state of the monitoring items and, when any monitoring item is found to be abnormal, acquires and stores the fault environment information and then restarts the service.
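By way of non-limiting illustration, the control flow of step S3 can be sketched in Python as follows. The function name `run_monitor` and its three parameters are hypothetical and do not come from the disclosed script; the sketch only shows the ordering in which every monitoring item is evaluated and, if any item is abnormal, the fault environment information is collected before the service is restarted.

```python
from typing import Callable, Dict, List


def run_monitor(
    checks: Dict[str, Callable[[], bool]],
    collect_fault_info: Callable[[List[str]], None],
    restart_service: Callable[[], None],
) -> List[str]:
    """Evaluate every monitoring item once.

    Each check returns True when its item is normal. If any item is
    abnormal, the fault environment is saved first and the service is
    restarted afterwards. Returns the names of the abnormal items.
    """
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        # Order matters: a restart destroys the evidence, so collect first.
        collect_fault_info(failed)
        restart_service()
    return failed


if __name__ == "__main__":
    # Hypothetical stand-ins for the real checks and actions.
    events: List[str] = []
    failed = run_monitor(
        {"process": lambda: True, "test_page": lambda: False},
        lambda items: events.append("collect"),
        lambda: events.append("restart"),
    )
    print(failed, events)  # ['test_page'] ['collect', 'restart']
```

Collecting before restarting is the essential design choice here: a restart discards the thread stacks, heap contents and connection states that the collection step is meant to preserve.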

Referring to FIG. 2, in the service self-recovery method based on multiple detection according to the embodiment of the present invention, the monitoring items in step S3 include service process state monitoring and test page response state monitoring; whether the service process is online is judged by detecting whether a service process ID exists, and whether the test page responds normally is judged by detecting the return value of the test page.

The monitoring script queries the service process ID by service name; if a service process ID is returned, the service process is online, and if none is returned, the service process is in an offline abnormal state.
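One possible way to perform this lookup is sketched below in Python. The embodiment does not state which command the script uses to query the process ID, so the sketch makes an assumption: a Linux host on which the `pgrep` utility is installed, with `pgrep -f` matching the service name against full command lines. The helper names `parse_pids` and `find_service_pids` are hypothetical.

```python
import subprocess
from typing import List


def parse_pids(pgrep_output: str) -> List[int]:
    """Turn pgrep's whitespace-separated output into a list of PIDs."""
    return [int(token) for token in pgrep_output.split() if token.isdigit()]


def find_service_pids(service_name: str) -> List[int]:
    """Return the PIDs whose command line contains service_name.

    An empty list means the service process is offline. Assumes pgrep is
    on the PATH; FileNotFoundError is raised if it is not installed.
    """
    result = subprocess.run(
        ["pgrep", "-f", service_name],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits with status 1 when nothing matches
    )
    return parse_pids(result.stdout)
```

Because `-f` matches any command line that contains the name, the service name should be specific enough that it does not also match the monitoring script itself.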

The monitoring script accesses the test page through a CURL command and obtains its return value; if the return value is normal, the test page is responding normally and the service is in a normal state; if there is no return value, the test page response is abnormal and the service is in a hung ("false death") state in which it does not respond.

CURL is a powerful command-line network tool that accesses a test URL by issuing a web request and then retrieves and extracts the data, displaying it on standard output. By testing the return value of a web page with CURL in the script, the running state of the web service can be conveniently monitored at regular intervals, so that the case in which the service is hung and unresponsive is caught. A CURL command can construct an HTTP request message and parse the HTTP response returned by the server; it also supports cookies, can perform the basic functions of a web browser, and supports protocols such as HTTPS, FTP, FTPS, TELNET and LDAP. Files can be downloaded and uploaded over HTTP, FTP and similar protocols.
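The test page probe described above might look like the following Python sketch, which invokes the `curl` command line. It rests on assumptions that the embodiment does not state: that a "normal" return value is the HTTP status code 200, and that a five-second limit is acceptable. The names `interpret_status` and `probe_test_page` are hypothetical. With `-w "%{http_code}"`, curl prints `000` when no HTTP response was received, which the sketch treats as abnormal.

```python
import subprocess


def interpret_status(curl_output: str, expected: str = "200") -> bool:
    """Return True only when curl reported the expected HTTP status code."""
    return curl_output.strip() == expected


def probe_test_page(url: str, timeout_s: int = 5) -> bool:
    """Fetch the test page with curl and judge whether it responded normally.

    A timeout, a missing curl binary or any status other than the expected
    one is treated as an abnormal (hung or unavailable) service.
    """
    command = [
        "curl",
        "-s",                    # silent: no progress meter
        "-o", "/dev/null",       # discard the response body
        "-w", "%{http_code}",    # print only the HTTP status code
        "--max-time", str(timeout_s),
        url,
    ]
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout_s + 5,  # safety net in case curl itself stalls
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return interpret_status(result.stdout)
```

Bounding the request time matters for this use: a hung service often accepts the connection and never answers, so without a limit the probe would hang together with the service.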

Specifically, the fault environment information in step S3 includes network connection state information and Java virtual machine information related to the service; the monitoring script uses the netstat console command to collect the network connection state information and the current number of concurrent connections, the network connection state information including the routing table, the actual network connections and the state information of each network interface device.

The netstat command is a console tool that is very useful for monitoring TCP/IP networks; it can display the routing table, the actual network connections and the state information of each network interface device. It displays statistics for the IP, TCP, UDP and ICMP protocols and is generally used to check the network connections on each port of the machine. In the script, netstat is used to print and count information such as the network connection state and the current number of concurrent connections, providing a picture of the network connections for subsequent troubleshooting.
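A minimal Python sketch of the connection statistics step is given below. It assumes the Linux net-tools form of the command, `netstat -ant`, whose TCP lines carry the connection state in the sixth column; other platforms format the output differently. The function names `count_tcp_states` and `save_network_state` and the output file layout are hypothetical.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state (LISTEN, ESTABLISHED, ...)."""
    states: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


def save_network_state(out_path: str) -> Counter:
    """Save the raw netstat listing plus a per-state summary to out_path."""
    result = subprocess.run(
        ["netstat", "-ant"], capture_output=True, text=True, check=False
    )
    summary = count_tcp_states(result.stdout)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
        fh.write("\n# connections by state\n")
        for state, count in summary.most_common():
            fh.write(f"{state} {count}\n")
    return summary
```

The count of `ESTABLISHED` entries gives an approximation of the current connection concurrency, while the raw listing is kept so that individual peers can still be examined later.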

The Java virtual machine information includes Java stack information and Java heap memory information; the monitoring script calls the Jstack stack tracing tool to obtain Java stack information according to the service process ID and to generate a snapshot of the current threads of the Java virtual machine, the thread snapshot including the stack information of each thread; the monitoring script calls the Jmap heap memory tracing tool to obtain the memory map or heap memory information of the Java process, reflecting the memory image of the Java heap in use, the memory image including system information, virtual machine attributes, a complete thread dump and the state information of all classes and objects.

Jstack is the stack tracing tool of the Java virtual machine. It prints the Java stack information of a given Java process ID, core file or remote debugging service and generates a snapshot of the virtual machine's current threads, including the stack information of each thread. The command is usually used to locate the cause of a thread stall: when a thread stalls, the stack of each thread can be inspected with Jstack and the cause analysed. If a Java program crashes and produces a core file, the Jstack tool can be used to obtain the Java stack and native stack information from the core file, making it easy to see how the program crashed and where the problem occurred. In addition, the Jstack tool can attach to a running Java program and show its Java stack and native stack information, which is very useful when the running program is hung.
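As a hedged illustration of this step, the Python sketch below saves a thread dump with `jstack -l <pid>` and counts the threads in each state. It assumes that the JDK's `jstack` is on the PATH and that the script runs as the same user as the Java process; the names `summarize_thread_states` and `save_thread_dump` are hypothetical. The summary relies on the `java.lang.Thread.State:` lines that jstack prints for each Java thread.

```python
import re
import subprocess
from collections import Counter

# jstack prints one such line per Java thread, for example:
#    java.lang.Thread.State: BLOCKED (on object monitor)
_STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def summarize_thread_states(dump_text: str) -> Counter:
    """Count threads per state (RUNNABLE, BLOCKED, WAITING, ...)."""
    return Counter(_STATE_RE.findall(dump_text))


def save_thread_dump(pid: int, out_path: str, timeout_s: int = 60) -> Counter:
    """Write the thread dump of the JVM with the given PID to out_path.

    Returns the per-state thread counts. subprocess.TimeoutExpired is
    raised if jstack does not finish within timeout_s seconds.
    """
    result = subprocess.run(
        ["jstack", "-l", str(pid)],  # -l adds lock information
        capture_output=True,
        text=True,
        check=False,
        timeout=timeout_s,
    )
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
    return summarize_thread_states(result.stdout)
```

A large number of `BLOCKED` threads in the summary is a quick hint that the stall is caused by lock contention, after which the saved dump can be read in full.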

Jmap is the heap memory tracing tool shipped with the Java virtual machine. It is mainly used to print the memory map of a Java process or the details of its heap memory (for example which objects have been created and how many), that is, to produce a heap dump file. It is mainly used to look for memory leaks and large objects that seriously affect performance, to check which objects are created most in the system, and to analyse the size occupied by each kind of object. A dump file is a memory copy of a process; a heap dump is a memory image reflecting the usage of the Java heap and mainly includes system information, virtual machine attributes, a complete thread dump and the states of all classes and objects. A memory leak is generally suspected in situations such as insufficient memory or abnormal GC behaviour, and a heap dump can then be taken to examine the specific conditions and analyse the cause.
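The heap dump step could be sketched in Python as follows, using the standard `jmap -dump:format=b,file=<path> <pid>` invocation. The sketch assumes that the JDK's `jmap` is on the PATH and that the script runs as the same user as the Java process; the helper names `heap_dump_command` and `save_heap_dump` and the five-minute limit are hypothetical.

```python
import subprocess
from typing import List


def heap_dump_command(pid: int, hprof_path: str) -> List[str]:
    """Build the jmap command that writes a binary heap dump to hprof_path."""
    return ["jmap", f"-dump:format=b,file={hprof_path}", str(pid)]


def save_heap_dump(pid: int, hprof_path: str, timeout_s: int = 300) -> bool:
    """Dump the heap of the JVM with the given PID; return True on success.

    A timeout or a missing jmap binary is reported as failure so that the
    caller can still go on to restart the service.
    """
    try:
        result = subprocess.run(
            heap_dump_command(pid, hprof_path),
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

A heap dump pauses the JVM while it is written and the file can be as large as the heap itself. In this method that cost is acceptable because the service is already judged abnormal and is about to be restarted, but the target directory needs enough free space.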

In summary, the service self-recovery method based on multiple detection of the embodiment of the present invention combines CURL page-response detection and process online-state detection with extraction of the faulty service's runtime environment information by the Jstack and Jmap commands. While the states of the service process and the test page are monitored, once a configured monitoring item is found to be offline or to return an abnormal response, the network connection state and the related Java virtual machine information of the Java service are saved first, the latter through the Jstack and Jmap commands, and the service is then restarted for self-recovery. The fault state of the program is thus preserved while the service is recovered, providing relevant information for subsequent fault analysis and post-incident review.

Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

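1. A service self-recovery method based on multiple detection, characterized by comprising the following steps:

S1: a monitoring script is configured in advance on the monitored server;

S2: the server's built-in scheduled-task facility runs the monitoring script automatically at regular intervals;

S3: the monitoring script judges the availability of the service from the state of the monitoring items and, when any monitoring item is found to be abnormal, acquires and stores the fault environment information and then restarts the service.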

2. The service self-recovery method based on multiple detection of claim 1, wherein the monitoring items in step S3 include service process state monitoring and test page response state monitoring; whether the service process is online is judged by detecting whether a service process ID exists, and whether the test page responds normally is judged by detecting the return value of the test page.

3. The service self-recovery method based on multiple detection of claim 2, wherein the monitoring script queries the service process ID by service name; if a service process ID is returned, the service process is online, and if none is returned, the service process is in an offline abnormal state.

4. The service self-recovery method based on multiple detection of claim 2, wherein the monitoring script accesses the test page through a CURL command and obtains its return value; if the return value is normal, the test page is responding normally and the service is in a normal state; if there is no return value, the test page response is abnormal and the service is in a hung ("false death") state in which it does not respond.

5. The service self-recovery method based on multiple detection of claim 1, wherein when the monitoring script is configured in step S1, a service variable and a test page address are defined in the monitoring script.

6. The service self-recovery method based on multiple detection of claim 1, wherein the fault environment information includes network connection state information and Java virtual machine information related to the service; the monitoring script uses the netstat console command to collect the network connection state information and the current number of concurrent connections, the network connection state information including the routing table, the actual network connections and the state information of each network interface device.

7. The service self-recovery method based on multiple detection of claim 6, wherein the Java virtual machine information includes Java stack information and Java heap memory information; the monitoring script calls the Jstack stack tracing tool to obtain Java stack information according to the service process ID and to generate a snapshot of the current threads of the Java virtual machine, the thread snapshot including the stack information of each thread; the monitoring script calls the Jmap heap memory tracing tool to obtain the memory map or heap memory information of the Java process, reflecting the memory image of the Java heap in use, the memory image including system information, virtual machine attributes, a complete thread dump and the state information of all classes and objects.