CN116737514A

CN116737514A - Automatic operation and maintenance method based on log and probe analysis

Info

Publication number: CN116737514A
Application number: CN202311024041.XA
Authority: CN
Inventors: 徐振权
Original assignee: Nanjing Guorui Xinwei Software Co ltd
Current assignee: Nanjing Guorui Xinwei Software Co ltd
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2023-09-12
Anticipated expiration: 2043-08-15
Also published as: CN116737514B

Abstract

The invention discloses an automatic operation and maintenance method based on log and probe analysis, which belongs to the technical field of software development operation and maintenance. A JavaAgent writes the code-level execution details into Elasticsearch (ES); FileBeat collects, in real time, the logs of the system and the third-party application's own log information, and LogStash preprocesses the data in real time and stores it in ES in real time. The invention displays server portraits, client portraits, log portraits, end-to-end complete application links and service flow charts in real time, realizing a one-stop, full-dimension monitoring and application operation and maintenance solution.

Description

Automatic operation and maintenance method based on log and probe analysis

Technical Field

The invention relates to an automatic operation and maintenance method based on log and probe analysis, and belongs to the technical field of software development operation and maintenance.

Background

With the rapid development of technologies such as the mobile internet, cloud computing, big data and the internet of things, new business applications keep emerging and the complexity of IT applications is growing explosively. Highly real-time data acquisition, rapid iteration of business requirements and immediate delivery of products and services are all expected. These demands place a heavier responsibility on development and operation and maintenance teams: research and development engineers and operation and maintenance engineers must ensure the reliability and stability of services and products, optimize services, locate faults quickly and improve the user experience, while also providing data support for business decisions and thereby driving business innovation. However, after a service system goes online, the following problems often arise:

1. a certain module responds slowly, but it is not known why it is slow or where exactly the slowdown occurs;

2. operation and maintenance monitors only the resource indicators of the server, and the cause of the slowdown cannot be determined from server resources alone;

3. the system frequently behaves abnormally, but the specific underlying cause is not known;

4. when a performance problem occurs in a distributed service system, the system logs of N servers must be analyzed; as the number of component logs keeps growing, there is no clear place to start the analysis.

Disclosure of Invention

In order to solve the above problems, the invention discloses an automatic operation and maintenance method based on log and probe analysis, with the following specific technical scheme:

An automatic operation and maintenance method based on log and probe analysis comprises the following steps:

step S1: the Java heap and GC, and the Class, Object, Method and Field meta-information of a third-party program are monitored through the Pinpoint distributed system performance monitoring tool;

step S2: the JavaAgent intercepts the third-party Jar package before its classes are loaded and inserts monitoring bytecode;

step S3: the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework and stores records of the function calls in the third-party jar in the Elasticsearch (ES) database;

step S4: the FileBeat engine collects, in real time, the app.log file generated by the third-party Jar package;

step S5: the FileBeat engine collects, in real time, the logs of the operating system itself;

step S6: the log information collected by FileBeat in steps S4 and S5 is synchronized to the LogStash component in real time;

step S7: the LogStash component filters and preprocesses the data and stores it in Elasticsearch in real time;

step S8: Prometheus monitors the resource information of all server nodes on which the third-party service is deployed, the server node resource information comprising CPU, memory and hard-disk storage;

step S9: Grafana connects to the ES data store, queries ES, and fuses the Prometheus monitoring data to display service portraits, client portraits, log portraits, end-to-end complete application links and service flow charts in real time.
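Steps S4 and S5 depend on the collector deciding which log lines to ship. As a rough illustration only, the Python sketch below mimics regular-expression line selection in the style of Filebeat's `include_lines` and `exclude_lines` options, where include patterns are applied first and exclude patterns second. The function name, patterns and sample lines are hypothetical; the method itself configures FileBeat rather than running code like this.

```python
import re

def select_lines(lines, include=(), exclude=()):
    """Keep lines matching any include pattern, then drop lines matching any exclude pattern."""
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for line in lines:
        # With no include patterns, every line is a candidate.
        if include_res and not any(r.search(line) for r in include_res):
            continue
        if any(r.search(line) for r in exclude_res):
            continue
        kept.append(line)
    return kept

# Hypothetical app.log content.
sample = [
    "2023-08-15 10:00:01 INFO  heartbeat ok",
    "2023-08-15 10:00:02 WARN  slow query took 2300 ms",
    "2023-08-15 10:00:03 ERROR payment failed",
    "2023-08-15 10:00:04 WARN  heartbeat delayed",
]
shipped = select_lines(sample, include=[r"\bWARN\b", r"\bERROR\b"], exclude=[r"heartbeat"])
print(shipped)
```

Filtering at the collector keeps noise out of LogStash and ES, at the cost of losing lines that later turn out to be relevant.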

Further, in step S3, the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework as follows: when the third-party jar is started, the JavaAgent records which interfaces of the jar are called and whether each call state is normal or abnormal, and the call state is recorded in the ES database in the form of an error report.
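The recording behavior described for step S3 can be illustrated with a short Python analogue. This is not the Java probe itself, which the method implements with a JavaAgent and bytecode instrumentation; it only sketches what one call record might contain. The decorator name, the record fields and the in-memory `CALL_RECORDS` list (standing in for the ES database) are all assumptions.

```python
import os
import time
from functools import wraps

# Stand-in for the ES index; a real probe would send these documents to Elasticsearch.
CALL_RECORDS = []

def record_call(interface_name):
    """Wrap a function so that every call is recorded as normal or abnormal."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "pid": os.getpid(),            # process number of the monitored program
                "interface": interface_name,   # hypothetical interface label
                "start_ms": int(time.time() * 1000),
                "status": "normal",
                "error": None,
            }
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Abnormal calls carry the error report; the exception still propagates.
                record["status"] = "abnormal"
                record["error"] = f"{type(exc).__name__}: {exc}"
                raise
            finally:
                record["elapsed_ms"] = int(time.time() * 1000) - record["start_ms"]
                CALL_RECORDS.append(record)
        return wrapper
    return decorator

@record_call("OrderService.divide")
def divide(a, b):
    return a / b
```

Calling `divide(6, 3)` appends a record with status `normal`, while `divide(1, 0)` appends one with status `abnormal` and still raises `ZeroDivisionError` to the caller, so the monitored program's behavior is unchanged.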

Further, the logs of the operating system itself in step S5 include logs of the CentOS system, specifically:

kernel boot log: /var/log/dmesg,

system log: /var/log/messages,

log of IPs that recently logged on to the server: /var/log/lastlog.
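Entries in /var/log/messages normally follow the traditional syslog line layout (timestamp, host, program, optional process ID, message). The sketch below shows, under that assumption, the kind of structuring that the filtering preprocessing of step S7 might perform before a line is stored in ES. The regular expression, function name and output field names are illustrative and are not taken from the method.

```python
import re

# Traditional syslog layout: "Aug 15 10:00:01 host1 sshd[1234]: message text"
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

def parse_syslog_line(line):
    """Return a dict of fields for a syslog line, or None if the line does not match."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    # The process ID is optional (kernel messages have none).
    doc["pid"] = int(doc["pid"]) if doc["pid"] else None
    return doc

print(parse_syslog_line("Aug 15 10:00:01 host1 sshd[1234]: Accepted password for root"))
print(parse_syslog_line("Aug  5 09:01:02 host1 kernel: usb 1-1: new device"))
```

Lines that do not match return `None`, so a caller can route them to a separate index instead of dropping them silently.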

Further, the ES query in step S9 is executed as follows: using the process number under which the monitored third-party jar was started as the unique identifier, the data stored in ES is queried across multiple tables through ES SQL statements, and the full-link portrait is displayed.
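As a minimal sketch of such a query, the Python snippet below builds request bodies for Elasticsearch's SQL REST endpoint (`POST /_sql`), one per data source, all keyed by the same process number. The index names and the `pid` field are hypothetical; the text does not specify how the stored documents are laid out.

```python
import json

def build_full_link_query(index, pid, limit=100):
    """Build the JSON body for an Elasticsearch SQL request filtered by process number."""
    # Index names cannot be bound as parameters, so the index is quoted inline;
    # the process number is passed as a bound parameter.
    if '"' in index:
        raise ValueError("index name must not contain double quotes")
    return {
        "query": f'SELECT * FROM "{index}" WHERE pid = ? LIMIT {int(limit)}',
        "params": [pid],
    }

# One query per data source, all using the same process number as the identifier.
INDICES = ["call-records", "app-logs", "system-logs"]  # hypothetical index names
bodies = {name: build_full_link_query(name, 4321) for name in INDICES}
print(json.dumps(bodies["app-logs"], indent=2))
```

Binding the process number as a parameter instead of formatting it into the statement avoids quoting mistakes when the identifier comes from user input such as a dashboard variable.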

Further, the service portrait comprises a service topology diagram, which visually displays the calling relations between the applications in the whole system. Grafana monitors the data in Elasticsearch in real time on a large screen; users can also log in through a browser to view it as needed, and the data is displayed in real time while being viewed. All deployed third-party node servers are monitored through Prometheus and the data is displayed through Grafana. Clicking a service node displays the detailed information of that node, which includes the CPU usage, memory usage, hard-disk usage and network throughput of the current node state, and the number of requests to the node's interfaces.
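The node details listed above could be fetched from Prometheus with instant queries against its HTTP API (`/api/v1/query`). The sketch below only builds the query URLs. It assumes the nodes are scraped through node_exporter, which the text does not state, and the base URL, instance label and dictionary keys are placeholders.

```python
from urllib.parse import urlencode

# PromQL templates over node_exporter metrics (assumed exporter); {i} is the instance label.
NODE_QUERIES = {
    "cpu_used_percent":
        '100 - avg by (instance) (rate(node_cpu_seconds_total{{mode="idle",instance="{i}"}}[5m])) * 100',
    "memory_used_percent":
        '100 * (1 - node_memory_MemAvailable_bytes{{instance="{i}"}}'
        ' / node_memory_MemTotal_bytes{{instance="{i}"}})',
    "disk_used_percent":
        '100 * (1 - node_filesystem_avail_bytes{{instance="{i}",mountpoint="/"}}'
        ' / node_filesystem_size_bytes{{instance="{i}",mountpoint="/"}})',
    "network_receive_bytes_per_s":
        'rate(node_network_receive_bytes_total{{instance="{i}"}}[5m])',
}

def node_detail_urls(base_url, instance):
    """Return one Prometheus instant-query URL per node metric."""
    urls = {}
    for name, template in NODE_QUERIES.items():
        promql = template.format(i=instance)
        urls[name] = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return urls

for name, url in node_detail_urls("http://prometheus.example:9090", "10.0.0.5:9100").items():
    print(name, url)
```

In a Grafana deployment these expressions would normally live in dashboard panels rather than in code; the URLs are shown only to make the data path concrete.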

Further, the client portrait comprises a real-time active thread map, which monitors the execution of active threads in the application so that the execution performance of the application's threads can be understood intuitively. The methods for monitoring the active threads in the application include JavaAgent, Pinpoint and Grafana JVM; each method corresponds to different scenarios, so that programs developed in different languages are accommodated.

Further, the log portrait comprises a request response scatter diagram, which displays request counts and response times along the time dimension. The statistics collected by the system are synchronized to the ES database in real time at second-level granularity, and Grafana polls the data in ES and displays it in the interface at a 5-minute time granularity. Dragging on the diagram selects the corresponding requests, whose execution details can then be viewed.
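The step from second-level records to a 5-minute display granularity amounts to time bucketing. A minimal sketch follows, assuming each record is a pair of epoch seconds and response time in milliseconds; the record layout and the chosen statistics are illustrative, not taken from the method.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60  # 5-minute display granularity, as stated in the text

def bucket_requests(records):
    """Group (epoch_seconds, response_ms) pairs into 5-minute buckets with summary statistics."""
    buckets = defaultdict(list)
    for ts, response_ms in records:
        bucket_start = ts - ts % BUCKET_SECONDS
        buckets[bucket_start].append(response_ms)
    return {
        start: {"count": len(values), "avg_ms": sum(values) / len(values), "max_ms": max(values)}
        for start, values in sorted(buckets.items())
    }

# Hypothetical per-request records.
records = [(0, 120), (299, 80), (300, 400), (610, 50)]
print(bucket_requests(records))
```

Keeping the raw second-level records in ES and aggregating only at display time is what allows a dragged selection on the diagram to resolve back to individual requests.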

Further, the end-to-end complete application link comprises a call stack view, which provides code-level visibility for each request in the distributed environment. The code-level execution details of a request can be viewed in the page, which helps to find the bottlenecks and fault causes of the request.
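One way to picture a call stack view is as a tree rebuilt from flat call records. The sketch below assumes each record carries an `id`, a `parent_id`, a method name, a start offset and an elapsed time; these field names and the sample methods are hypothetical and do not describe the stored format used by the method.

```python
def build_call_tree(spans):
    """Turn flat call records into nested trees, with children ordered by start time."""
    nodes = {s["id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in nodes.values():
        parent = nodes.get(node["parent_id"])
        (parent["children"] if parent else roots).append(node)
    for node in nodes.values():
        node["children"].sort(key=lambda n: n["start_ms"])
    return sorted(roots, key=lambda n: n["start_ms"])

def render(node, depth=0):
    """Render one tree as indented text lines, one line per call."""
    lines = [f'{"  " * depth}{node["method"]} ({node["elapsed_ms"]} ms)']
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical records for a single request.
spans = [
    {"id": 1, "parent_id": None, "method": "OrderController.create", "start_ms": 0, "elapsed_ms": 180},
    {"id": 3, "parent_id": 1, "method": "PaymentClient.charge", "start_ms": 40, "elapsed_ms": 130},
    {"id": 2, "parent_id": 1, "method": "OrderDao.insert", "start_ms": 5, "elapsed_ms": 20},
]
print("\n".join(render(build_call_tree(spans)[0])))
```

Reading the rendered tree top-down shows where the time of a request goes: in this made-up example most of the 180 ms is spent in the nested payment call.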

Further, the service flow chart includes application state and machine state checking: by monitoring the process name and detecting the server node information in real time through Prometheus, the detailed information of the related application programs can be checked;

the monitored process name and server node information include whether the program is abnormal and the node CPU; the detailed information of the related applications includes CPU usage, memory status, garbage collection status, TPS and JVM information parameters.

Further, the specific process of call stack viewing is as follows:

(1) when the Grafana interface raises an alarm, click the alarm information;

(2) the server on which the alarm occurred can then be viewed;

(3) click the ongoing alarm prompt on the corresponding server;

(4) view the JVM stack information of the process and the Linux server system log;

(5) view the log information within a 3-minute range;

(6) click to view details and display the complete log information.
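Item (5) narrows the logs to a 3-minute range around the alarm. A minimal sketch of that selection follows, under the assumption that the range is symmetric around the alarm time (the text does not say whether it extends before, after, or on both sides). The entry layout and sample messages are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=3)  # 3-minute range from the text; symmetric window is an assumption

def logs_near_alert(entries, alert_time, window=WINDOW):
    """Return log entries whose timestamp lies within `window` of the alert, oldest first."""
    selected = [e for e in entries if abs(e["time"] - alert_time) <= window]
    return sorted(selected, key=lambda e: e["time"])

alert = datetime(2023, 8, 15, 10, 0, 0)
entries = [
    {"time": datetime(2023, 8, 15, 9, 55, 0), "message": "startup complete"},
    {"time": datetime(2023, 8, 15, 10, 1, 0), "message": "OutOfMemoryError"},
    {"time": datetime(2023, 8, 15, 9, 58, 30), "message": "GC pause 2.1s"},
    {"time": datetime(2023, 8, 15, 10, 4, 0), "message": "restart"},
]
for entry in logs_near_alert(entries, alert):
    print(entry["time"].isoformat(), entry["message"])
```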

The working principle of the invention is as follows:

The invention uses the open-source Pinpoint distributed system performance monitoring tool to monitor functions of the third-party service such as the real-time active thread map, the request response scatter diagram and the call stack view. To cover the points that Pinpoint cannot monitor, a JavaAgent probe technology is added: the third-party Jar package is intercepted before its classes are loaded, and monitoring bytecode is inserted. The JavaAgent parses the bytecode information through the Byte Buddy framework and writes the code-level execution details into the Elasticsearch database (ES for short). FileBeat collects the system-level messages log information and the app.log information of the third-party application and synchronizes them to the LogStash component in real time. LogStash preprocesses the data and stores it in ES in real time. The method also uses Prometheus to monitor the server nodes on which the third-party service is deployed. Finally, Grafana connects to the ES data store and uses the multidimensional queries of ES, fused with the Prometheus monitoring data, to display server portraits, client portraits, log portraits, end-to-end complete application links and service flow charts in real time.

The beneficial effects of the invention are as follows:

1. Based on data analysis in the field of system operation and maintenance, the invention can draw accurate service portraits, client portraits, log portraits, end-to-end complete application links and service flow charts.

2. Customers can quickly learn whether the system has a problem, and simple problems can be handled by following the system's prompts, so that normal operation of the third-party system is restored quickly and the losses caused by system problems are reduced.

3. Operation and maintenance personnel of different systems can locate program problems faster, greatly reducing the time spent on locating problems.

4. Developers of different systems can quickly learn of bugs in the program and repair them quickly.

5. Advantages in use: normally, when a problem occurs in the system, the customer reports it to the system's developer, and the developer guides the customer remotely through troubleshooting. If the customer is not a professional and cannot solve the problem, the vendor first assigns operation and maintenance engineers; if they still cannot solve it after a lot of time spent on investigation, development engineers must also be assigned. The whole process can take several days just to locate the problem, during which the customer's loss from being unable to use the system is immeasurable and the developer's operation and maintenance cost rises sharply. With this method, the time needed to locate problems and fix bugs is greatly reduced.

Drawings

Fig. 1 is a block diagram of the present invention.

Description of the embodiments

The invention is further elucidated below in connection with the drawings and the detailed description. It should be understood that the following detailed description is merely illustrative of the invention and is not intended to limit the scope of the invention.

The following is a Chinese name matching explanation for the program names referred to in this patent:

As can be seen with reference to Fig. 1, the invention is implemented as follows:

Step S1: the Java heap and GC, and the Class, Object, Method and Field meta-information of the third-party program are monitored through the Pinpoint distributed system performance monitoring tool. This covers functions such as the real-time active thread map, the request response scatter diagram and the call stack view of the third-party service; the JavaAgent probe technology is added to cover the points that Pinpoint cannot monitor.

Step S2: the JavaAgent intercepts the third-party Jar package before its classes are loaded and inserts monitoring bytecode; the monitoring bytecode makes the execution flow of the program clearly visible.

Step S3: the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework and stores records of the function calls in the third-party jar in the ES database.

The specific process by which the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework is as follows: when the third-party jar is started, the JavaAgent records which interfaces of the jar are called and whether each call state is normal or abnormal, and the call state is recorded in the ES database in the form of an error report. By analogy, consider a security guard patrolling a park: if the guard is required to swipe an access card at every checkpoint in the park, the system knows where the guard currently is and which checkpoints the guard has passed.


Step S4: the FileBeat engine collects, in real time, the app.log file generated by the third-party Jar package. FileBeat is an open-source tool that supports log data collection based on regular expressions.

Step S5: FileBeat collects, in real time, the logs of the operating system itself, such as the following logs of the CentOS system:

Kernel boot log: /var/log/dmesg

System log: /var/log/messages

Log of IPs that recently logged on to the server: /var/log/lastlog.

Step S6: the log information collected by FileBeat is synchronized to the LogStash component in real time. LogStash is a heavyweight open-source log processing tool.

Step S7: LogStash filters and preprocesses the data and stores it in Elasticsearch (ES for short) in real time.

Step S8: Prometheus monitors the resource information of all server nodes on which the third-party service is deployed, such as CPU, memory and hard-disk storage.

Step S9: Grafana connects to the ES data store and, using the process number under which the monitored third-party jar was started as the unique identifier, queries the data in multiple tables through ES SQL statements. It fuses the Prometheus monitoring data and displays service portraits, client portraits, log portraits, end-to-end complete application links and service flow charts in real time across the full link. The data stored in ES thus presents a full-link portrait.
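The fusion in step S9 can be pictured as merging several record streams for one process into a single timeline. The Python sketch below is a simplified in-memory illustration under stated assumptions: every record carries a `pid` and a numeric `ts` field, and each stream is already sorted by `ts`. These field names and the sample records are hypothetical, and in the method the joining is done by ES queries and Grafana rather than by code like this.

```python
import heapq

def full_link_timeline(pid, *sources):
    """Merge several time-sorted record streams into one timeline for a single process."""
    # Keep only the records of the requested process; the order inside each stream is preserved.
    filtered = [[r for r in source if r["pid"] == pid] for source in sources]
    return list(heapq.merge(*filtered, key=lambda r: r["ts"]))

# Hypothetical records from three data sources.
call_records = [
    {"pid": 4321, "ts": 10, "kind": "call", "text": "OrderService.create abnormal"},
    {"pid": 9999, "ts": 11, "kind": "call", "text": "other process"},
]
app_logs = [
    {"pid": 4321, "ts": 9, "kind": "app", "text": "creating order"},
    {"pid": 4321, "ts": 12, "kind": "app", "text": "order failed"},
]
system_logs = [
    {"pid": 4321, "ts": 11, "kind": "system", "text": "out of memory"},
]

for record in full_link_timeline(4321, call_records, app_logs, system_logs):
    print(record["ts"], record["kind"], record["text"])
```

Because the process number is the only join key, records from different sources line up without any shared request identifier, which matches the role the text gives to the process number as the unique identifier.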

The following describes the implementation effect of this patent in detail:

The service portrait comprises a service topology diagram, which visually displays the calling relations between the applications in the whole system. Grafana monitors the data in Elasticsearch in real time on a large screen; users can also log in through a browser to view it as needed, and the data is displayed in real time while being viewed. All deployed third-party node servers are monitored through Prometheus and the data is displayed through Grafana. Clicking a node displays the detailed information of that node, which includes the CPU usage, memory usage, hard-disk usage and network throughput of the current node state, and the number of requests to the node's interfaces.

The client portrait comprises a real-time active thread map, which monitors the execution of active threads in the application so that the execution performance of the application's threads can be understood intuitively. The methods for monitoring the active threads in the application include JavaAgent, Pinpoint and Grafana JVM; each method corresponds to different scenarios, so that programs developed in different languages are accommodated.
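An active thread map of this kind is commonly drawn as a histogram of how long each in-flight thread has been running. The sketch below classifies active threads into elapsed-time buckets; the bucket boundaries of 1, 3 and 5 seconds, the labels and the record layout are illustrative assumptions and are not specified by the method.

```python
# Upper bounds in milliseconds for each bucket; anything above the last bound counts as "slow".
BUCKETS = [("<1s", 1000), ("<3s", 3000), ("<5s", 5000)]

def active_thread_histogram(active_threads, now_ms):
    """Count active threads per elapsed-time bucket."""
    counts = {label: 0 for label, _ in BUCKETS}
    counts["slow"] = 0
    for thread in active_threads:
        elapsed = now_ms - thread["start_ms"]
        for label, limit in BUCKETS:
            if elapsed < limit:
                counts[label] += 1
                break
        else:
            # No bucket matched: the thread has been running longer than every bound.
            counts["slow"] += 1
    return counts

now = 100_000
threads = [
    {"name": "http-1", "start_ms": now - 500},
    {"name": "http-2", "start_ms": now - 2_500},
    {"name": "http-3", "start_ms": now - 4_000},
    {"name": "http-4", "start_ms": now - 9_000},
]
print(active_thread_histogram(threads, now))
```

A growing "slow" count while the faster buckets stay flat is the kind of pattern that would prompt a look at the call stack view for the affected requests.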

The log portrait comprises a request response scatter diagram, which displays request counts and response times along the time dimension. The statistics collected by the system are synchronized to the ES database in real time at second-level granularity, and Grafana polls the data in ES and displays it in the interface at a 5-minute time granularity. Dragging on the diagram selects the corresponding requests, whose execution details can then be viewed.

The end-to-end complete application link comprises a call stack view, which provides code-level visibility for each request in the distributed environment. The code-level execution details of a request can be viewed in the page, which helps to find the bottlenecks and fault causes of the request.

The specific process of call stack viewing is as follows:

(1) when the Grafana interface raises an alarm, click the alarm information;

(2) the server on which the alarm occurred can then be viewed;

(3) click the ongoing alarm prompt on the corresponding server;

(4) view the JVM stack information of the process and the Linux server system log;

(5) view the log information within a 3-minute range;

(6) click to view details and display the complete log information.

The service flow chart comprises application state and machine state checking: by monitoring the process name and detecting the server node information in real time through Prometheus, the detailed information of the related application programs can be checked.

The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, and also include technical schemes formed by any combination of the above technical features.

With the above preferred embodiments of the present invention as guidance, persons skilled in the relevant art can make various changes and modifications without departing from the scope of the technical idea of the present invention. The technical scope of the present invention is not limited to the content of the description and must be determined according to the scope of the claims.

Claims

1. An automatic operation and maintenance method based on log and probe analysis, characterized by comprising the following steps:

step S2: the JavaAgent intercepts the third-party Jar package before its classes are loaded and inserts monitoring bytecode, so that the running flow of the program is clearly known;

step S3: the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework and stores records of the function calls in the third-party jar in the Elasticsearch database;

step S5: the FileBeat engine collects, in real time, the logs of the operating system itself;

step S9: Grafana connects to the Elasticsearch data store, queries Elasticsearch, and fuses the Prometheus monitoring data to display service portraits, client portraits, log portraits, end-to-end complete application links and service flow charts in real time.

2. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the specific process by which the JavaAgent parses the monitoring bytecode information through the Byte Buddy framework in step S3 is as follows: when the third-party jar is started, the JavaAgent records which interfaces of the jar are called and whether each call state is normal or abnormal, and the call state is recorded in the Elasticsearch database in the form of an error report.

3. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the logs of the operating system itself in step S5 include logs of the CentOS system, specifically:

kernel boot log: /var/log/dmesg,

system log: /var/log/messages,

log of IPs that recently logged on to the server: /var/log/lastlog.

4. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the Elasticsearch query in step S9 is executed as follows: using the process number under which the monitored third-party jar was started as the unique identifier, the data stored in Elasticsearch is queried through Elasticsearch SQL statements, and the full-link portrait is displayed.

5. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the service portrait comprises a service topology diagram, which visually displays the calling relations between the applications in the whole system; Grafana monitors the data in Elasticsearch in real time on a large screen, users can log in through a browser to view it as needed, and the data is displayed in real time while being viewed; all deployed third-party node servers are monitored through Prometheus and the data is displayed through Grafana; clicking a service node displays the detailed information of that node, which includes the CPU usage, memory usage, hard-disk usage and network throughput of the current node state, and the number of requests to the node's interfaces.

6. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the client portrait comprises a real-time active thread map, which monitors the execution of active threads in the application so that the execution performance of the application's threads can be understood intuitively; the methods for monitoring the active threads in the application include JavaAgent, Pinpoint and Grafana JVM, each method corresponding to different scenarios so that programs developed in different languages are accommodated.

7. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the log portrait comprises a request response scatter diagram, which displays request counts and response times along the time dimension; the statistics collected by the system are synchronized to the Elasticsearch database in real time at second-level granularity; Grafana polls the data in Elasticsearch and displays it in the interface at a 5-minute time granularity; and dragging on the diagram selects the corresponding requests, whose execution details can then be viewed.

8. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the end-to-end complete application link comprises a call stack view, which provides code-level visibility for each request in the distributed environment; the code-level execution details of a request can be viewed in the page, which helps to find the bottlenecks and fault causes of the request.

9. The automatic operation and maintenance method based on log and probe analysis according to claim 1, wherein the service flow chart includes application state and machine state checking: by monitoring the process name and detecting the server node information in real time through Prometheus, the detailed information of the related application programs can be checked;

10. The automatic operation and maintenance method based on log and probe analysis according to claim 8, wherein the specific process of call stack viewing is as follows:

(1) when the Grafana interface raises an alarm, click the alarm information;

(2) the server on which the alarm occurred can then be viewed;

(3) click the ongoing alarm prompt on the corresponding server;

(5) view the log information within a 3-minute range;

(6) click to view details and display the complete log information.