CN116991661A

CN116991661A - Problem alarm system and method for software system

Info

Publication number: CN116991661A
Application number: CN202310895691.5A
Authority: CN
Inventors: 刘华; 于泳洋; 刘晓明
Original assignee: Beijing Zhiketong Technology Co ltd
Current assignee: Beijing Zhiketong Technology Co ltd
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-11-03

Abstract

The embodiment of the invention discloses a problem alarm system and method for a software system. Log data is collected by a Filebeat installed at a client; the log data is distributed to a Storm data analysis cluster through a first Kafka data distribution cluster; the Storm data analysis cluster performs stream computation processing on the received log data to obtain processed log data; the processed log data is distributed through a second Kafka data distribution cluster to a document type storage engine for data storage; the stored processed log data is processed graphically to judge whether it contains abnormal values, and if so, system problem information is acquired based on the abnormal values and the log data; and alarm prompt information is sent out based on the system problem information. The problem alarm method of the software system solves the problem that the prior art cannot quickly discover, locate and solve faults occurring while the software system is running.

Description

Problem alarm system and method for software system

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a problem alarm system and method for a software system, an electronic device, and a storage medium.

Background

A series of problems can occur while a software system is running online, and huge business losses can result if those problems are not detected in time. The monitoring services of some existing software systems are limited in function: they mostly monitor hardware aspects of the software system, such as CPU, memory and network, and lack comprehensive capabilities such as interface performance monitoring, abnormality monitoring, alarming and log tracking. As a result, problems cannot be quickly found and located, which makes them hard to solve quickly.

There is a need for a software business monitoring method that can quickly discover and locate problems, thereby facilitating quick solutions to the problems.

Disclosure of Invention

The embodiment of the invention aims to provide a problem alarm system and method for a software system, an electronic device, and a storage medium, which solve the problem in the prior art that faults occurring while a software system is running cannot be found, located and solved quickly.

In order to achieve the above object, an embodiment of the present invention provides a method for alarming a problem in a software system, the method specifically includes:

collecting log data through a Filebeat installed at a client;

the log data are transmitted to a first Kafka data distribution cluster, and the log data are distributed to a Storm data analysis cluster through the first Kafka data distribution cluster;

the Storm data analysis cluster performs stream computation processing on the received log data to obtain processed log data;

transmitting the processed log data to a second Kafka data distribution cluster, and distributing the processed log data to a document type storage engine for data storage through the second Kafka data distribution cluster;

carrying out graphical processing on the stored processed log data, judging whether the processed log data has an abnormal value, and if so, acquiring system problem information based on the abnormal value and the log data;

and sending out alarm prompt information based on the system problem information.

Based on the technical scheme, the invention can also be improved as follows:

further, the collecting log data through the Filebeat installed at the client includes:

acquiring user information, and writing the user information and the log collection parameters configured by the user into the Filebeat default configuration file;

and when the Filebeat is installed on the client, verifying the user information.

Further, the collecting log data through the Filebeat installed at the client further includes:

and after the Filebeat is successfully started, carrying the user information to interact with the first Kafka data distribution cluster so as to carry out data transmission.

classifying the log data based on the application scenario, wherein the log data comprises application log data and performance log data;

recording service application information through the application log data, and monitoring service abnormality based on the service application information;

and recording interface access performance information through the performance log data, and monitoring system abnormality based on the performance information.

Further, the graphically processing the stored processed log data, judging whether the processed log data has an abnormal value, and if so, acquiring system problem information based on the abnormal value and the log data further includes:

and determining the abnormal code corresponding to each abnormal type, and monitoring the abnormal condition corresponding to each abnormal code to determine the abnormal type corresponding to the system problem.

Further, the sending the alarm prompt information based on the system problem information includes:

configuring alarm rules;

the alarm rules include: an alarm is triggered when the current per-minute request volume is larger than a first preset value;

an early warning is triggered when the current system abnormality rate is larger than a second preset value;

an early warning is triggered when the current business abnormality rate is larger than a third preset value;

an early warning is triggered when the current average execution time is larger than a fourth preset value;

an alarm is triggered when the current response time is larger than a fifth preset value;

an alarm is triggered when the current average growth rate of the per-minute request volume is larger than a sixth preset value;

an alarm is triggered when the period-over-period growth rate of the current response time is larger than a seventh preset value;

and an alarm is triggered when the period-over-period growth rate of the current per-minute request volume is larger than an eighth preset value.

Further, the sending of the alarm prompt information based on the system problem information further includes:

configuring a sending channel of alarm prompt information, wherein the sending channel comprises a short message prompt, a mail prompt and a WeChat prompt;

the alarm prompt information comprises alarm product line information, alarm application name information, alarm method information, alarm value information, alarm description information and trigger time information.

A problem alert system for a software system, comprising:

the Filebeat module is installed at the client and used for collecting log data;

a first Kafka data distribution cluster for distributing the log data to a Storm data analysis cluster;

the Storm data analysis cluster is used for carrying out stream computation processing on the received log data to obtain processed log data;

the second Kafka data distribution cluster is used for distributing the processed log data to a document type storage engine for data storage;

the abnormal value acquisition module is used for graphically processing the stored processed log data, judging whether the processed log data has abnormal values, and if so, acquiring system problem information based on the abnormal values and the log data;

and the alarm prompt module is used for sending alarm prompt information based on the system problem information.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when the computer program is executed.

A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.

The embodiment of the invention has the following advantages:

according to the problem alarm method of the software system, log data are collected through a Filebeat installed at a client; the log data are transmitted to a first Kafka data distribution cluster, and the log data are distributed to a Storm data analysis cluster through the first Kafka data distribution cluster; the Storm data analysis cluster performs stream computation processing on the received log data to obtain processed log data; the processed log data are transmitted to a second Kafka data distribution cluster, and are distributed to a document type storage engine for data storage through the second Kafka data distribution cluster; the stored processed log data are processed graphically to judge whether the processed log data have an abnormal value, and if so, system problem information is acquired based on the abnormal value and the log data; and alarm prompt information is sent out based on the system problem information. This solves the problem in the prior art that faults occurring while a software system is running cannot be found, located and solved quickly.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.

The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.

FIG. 1 is a flow chart of a problem alert method of a software system of the present invention;

FIG. 2 is a first architecture diagram of a problem alert system of the software system of the present invention;

FIG. 3 is a performance comparison graph of the software system of the present invention;

FIG. 4 is a flow monitoring diagram of a software system of the present invention;

FIG. 5 is an anomaly monitoring graph of the software system of the present invention;

FIG. 6 is a subdivision anomaly monitoring graph of the software system of the present invention;

FIG. 7 is a schematic diagram of the physical structure of an electronic device according to the present invention.

Wherein the reference numerals are as follows:

a Filebeat module 10, a first Kafka data distribution cluster 20, a Storm data analysis cluster 30, a second Kafka data distribution cluster 40, an abnormal value acquisition module 50, an alarm prompting module 60, an electronic device 70, a processor 701, a memory 702 and a bus 703.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description of specific embodiments. The embodiments described are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.

Examples

Fig. 1 is a flowchart of an embodiment of a problem alarm method of a software system according to the present invention, as shown in fig. 1, and the problem alarm method of a software system according to the embodiment of the present invention includes the following steps:

S101, collecting log data through a Filebeat installed at a client;

specifically, Filebeat is a lightweight shipper for forwarding and centralizing log data. Filebeat monitors the log files or locations that you specify, collects log events, and forwards them to Elasticsearch or Logstash for indexing.

Filebeat works as follows: when Filebeat starts, it starts one or more inputs that look in the locations specified for log data. For each log that Filebeat locates, it starts a harvester. Each harvester reads a single log for new content and sends the new log data to libbeat, which aggregates the events and sends the aggregated data to the output configured for Filebeat.
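The harvester behaviour described above can be illustrated with a minimal sketch. It is not Filebeat's implementation; the names `Harvester` and `collect` are hypothetical, and the sketch only assumes that each log file gets its own reader which remembers how far it has read and that the new events are then aggregated into one batch.

```python
class Harvester:
    """Reads new content from a single log file, remembering the read offset."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte offset of the first unread byte

    def read_new_lines(self):
        """Return the complete lines appended since the previous call."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Consume only up to the last newline so a half-written line
        # is picked up on a later call instead of being emitted in pieces.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return [line.decode("utf-8") for line in data[:end].splitlines()]


def collect(paths, harvesters):
    """Start one harvester per discovered log and aggregate the new events."""
    events = []
    for path in paths:
        if path not in harvesters:
            harvesters[path] = Harvester(path)
        for line in harvesters[path].read_new_lines():
            events.append({"source": path, "message": line})
    return events
```

Calling `collect` repeatedly with the same `harvesters` dictionary returns only the lines written since the previous call, which is the property that lets a collector forward each log event once.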

After the Filebeat is successfully started, it interacts with the first Kafka data distribution cluster 20 to perform data transmission.

recording service application information through the application log data, and monitoring service abnormality based on the service application information; this helps developers investigate problems and trace logs, and the parameter information and execution flow recorded in the log make it possible to locate and find problems quickly;

and recording interface access performance information through the performance log data, and monitoring system abnormality based on the performance information, which facilitates subsequent performance analysis.

The logs use a unified log processing framework. Application logs and performance logs are processed automatically within the framework, so service developers need not attend to implementation details, and the framework is non-invasive to the service. A developer only needs to print application logs using a fixed method.

Each application service records its own logs, including applicationlog, standard, biglog, performancelog and nginx logs, under a specific log path on the server, so that the Filebeat on each server can monitor and collect them.

Log path specification: /home/eyelog/{service_name}. Each service creates its own root directory, and 3 subdirectories are placed under each service root directory. applicationlog: the new-version application log, corresponding to the applicationlog-* index in Kibana; biglog: the new-version gateway log, corresponding to the biglog-* index in Kibana; performancelog: the Qianyan performance log, corresponding to the performancelog-* index, which has been closed;

the nginx access log is located under /home/wwlogs/ on each server, and the corresponding index in Kibana is nginx-*; the old PHP and Node.js logs, under the directory /home/nodeLogs/, fall into 2 categories: the ordinary log is recorded as *-out-0.log, and the corresponding index in Kibana is standard-out-*;

the PHP and Node.js error log is recorded as *-err-0.log, and the corresponding index in Kibana is standard-errlog-*;

the Java framework of the method already encapsulates a utility class for Qianyan-related logs. The corresponding indices in Kibana are respectively:

applicationlog-*；standard-out-*；standard-errlog-*；

the access logs of each site's nginx are recorded, and these logs are stored in the nginx index;

request logs are generally recorded at the gateway layer; at present they mainly comprise the gateway log and the gateway log of the Java mobile API, and these logs are stored in the biglog index;

when searching logs, first determine which kind of log to search and locate the index in which that log is stored. Development and testing share one set of indices (test), and pre-release and production share another; select the corresponding index, keep the query time range as short as possible, and search by keyword: { filename }: "keyword". When a query is slow, narrow it to a specific index: for example, each index name follows the rule applicationlog-{product_line}-{app_name}-{yyyy.mm.dd}.log, so selecting the index for the corresponding product_line and app_name greatly reduces the query scope; other dimensions of the query scope, such as time and server host, should likewise be kept as small as possible.
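As an illustration of narrowing a query to specific indices, the following sketch builds index names from the naming rule given above. The function names are hypothetical, the date is assumed to be formatted as four-digit year, month and day separated by dots, and because it is not fully clear from the description whether the trailing ".log" belongs to the index name, the suffix is left as a parameter.

```python
from datetime import date, timedelta


def application_log_index(product_line, app_name, day, suffix=".log"):
    """Build one index name for a product line, application and day."""
    day_part = day.strftime("%Y.%m.%d")
    return f"applicationlog-{product_line}-{app_name}-{day_part}{suffix}"


def indices_for_range(product_line, app_name, start, end, suffix=".log"):
    """List the index names covering every day from start to end inclusive."""
    days = (end - start).days
    return [
        application_log_index(product_line, app_name,
                              start + timedelta(days=i), suffix)
        for i in range(days + 1)
    ]
```

Querying only the indices returned for a short date range, rather than a wildcard over all application logs, is one way to apply the advice about keeping the query scope small.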

S102, log data are transmitted to the first Kafka data distribution cluster 20, and the log data are distributed to the Storm data analysis cluster 30 through the first Kafka data distribution cluster 20;

specifically, Kafka is a high-throughput distributed publish-subscribe messaging system (message engine system) that can handle all the action stream data of consumers on a website. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled by processing logs and log aggregation. For systems such as Hadoop that work with log data and offline analysis but also require real-time processing, this is a viable solution. The purpose of Kafka is to unify online and offline message processing through the parallel loading mechanism of Hadoop, and also to provide real-time messages through the cluster.

System A sends a message to Kafka (the message engine system), and system B reads the message sent by A from Kafka; Kafka acts as the intermediary.

A messaging system is responsible for transferring data from one application to another, so an application only needs to focus on the data and not on how the data is transferred between two or more applications. Distributed messaging is based on reliable message queues and transfers messages asynchronously between client applications and the messaging system. There are two main messaging modes: the point-to-point mode and the publish-subscribe mode. Most messaging systems use the publish-subscribe mode, and Kafka is one of them.
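The publish-subscribe mode can be sketched with a small in-memory broker. This is only an illustration of the pattern and does not use or imitate the Kafka client API; the class `Broker` and its methods are hypothetical. Each topic is kept as an append-only list, and each subscriber has its own read position, so several subscribers can read the same messages independently.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish-subscribe message broker."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only message list
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index

    def publish(self, topic, message):
        """A producer (system A) appends a message to a topic."""
        self._logs[topic].append(message)

    def poll(self, topic, subscriber):
        """A subscriber (system B) reads the messages it has not seen yet."""
        log = self._logs[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]
```

Because the producer only calls `publish` and each consumer only calls `poll`, neither side needs to know how or when the other side runs, which is the decoupling that the intermediary provides.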

S103, the Storm data analysis cluster 30 performs stream calculation processing on the received log data to obtain processed log data;

specifically, Storm is an open source distributed computing system for processing real-time data streams. Analyzing data in Storm mainly involves the following steps:

Define the data sources (Spouts): a Spout is the source of a data stream in Storm and can read from any data source, such as Kafka or RabbitMQ. A Spout needs to be defined to read data from the data source.

Define the data processing units (Bolts): Bolts are the main units that process data in Storm. One or more Bolts can be defined to process the data received from Spouts. A Bolt can perform any required operation, such as filtering, functions, aggregation, joins, and database interaction.

Define the topology: a topology is a network of Spouts and Bolts that defines how data flows in the system. The topology specifies which Bolt receives data from which Spout and how data passes between Bolts.

Deploy and execute the topology: once defined, the topology can be deployed and executed on a Storm cluster. Storm automatically distributes the data and processes it.

Store or visualize the results: the processing results are stored in a database as required, or visualized on a real-time dashboard for further analysis.
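The steps above can be sketched as a chain of plain Python generators. This is a conceptual illustration only and does not use the Storm API; the function names and the four-field log line format (timestamp, interface, elapsed milliseconds, status) are assumptions made for the example.

```python
from collections import defaultdict


def log_spout(lines):
    """Spout: emits raw log lines one at a time."""
    for line in lines:
        yield line


def parse_bolt(stream):
    """Bolt: parses each line into a record and drops malformed lines."""
    for line in stream:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue
        timestamp, interface, elapsed_ms, status = parts
        yield {
            "minute": timestamp[:16],  # e.g. "2023-07-20T02:06"
            "interface": interface,
            "elapsed_ms": int(elapsed_ms),
            "ok": status == "OK",
        }


def aggregate_bolt(stream):
    """Bolt: aggregates request count, average time and error rate per minute."""
    totals = defaultdict(lambda: {"requests": 0, "elapsed_ms": 0, "errors": 0})
    for record in stream:
        bucket = totals[(record["minute"], record["interface"])]
        bucket["requests"] += 1
        bucket["elapsed_ms"] += record["elapsed_ms"]
        bucket["errors"] += 0 if record["ok"] else 1
    return {
        key: {
            "requests": b["requests"],
            "avg_ms": b["elapsed_ms"] / b["requests"],
            "error_rate": b["errors"] / b["requests"],
        }
        for key, b in totals.items()
    }


def run_topology(lines):
    """Topology: wires the spout to the parse bolt and then the aggregate bolt."""
    return aggregate_bolt(parse_bolt(log_spout(lines)))
```

In a real deployment the aggregation would run continuously over windows of an unbounded stream; the sketch processes a finite list so that the data flow between the units is easy to follow.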

S104, the processed log data is transmitted to the second Kafka data distribution cluster 40, and the processed log data is distributed to the document type storage engine for data storage through the second Kafka data distribution cluster 40.

S105, carrying out graphical processing on the stored processed log data, judging whether the processed log data has abnormal values, and if so, acquiring system problem information based on the abnormal values and the log data.

Specifically, after the computed performance data is persisted, it can be compared graphically. Showing performance changes visually makes it possible to quickly find the points at which performance changed, which helps drive optimization.

As shown in FIG. 3, the initial performance on June 18 is better than on June 16. At the first red circle, at 2:00, the time consumed increases suddenly, indicating that an event at this point must have reduced the performance of the service. At the second red circle, at 3:30, the time consumed recovers. It can thus be concluded that an event affecting performance occurred between 2:00 and 3:30.

As shown in FIG. 4, flow monitoring supports comparison across multiple dates. The graphical interface makes flow changes visible at a glance, makes it possible to quickly find flow peaks and valleys, and provides a flow-dimension reference for locating problems. It can also be used for flow prediction, providing data support during large-scale activities so that service capacity can be estimated.

The system provides monitoring of outliers for finding outlier variations. The abnormality is classified into a business abnormality and a system abnormality, and the business abnormality refers to an abnormality which needs to be monitored on a business, such as insufficient inventory, frequent login and the like. System anomalies refer to system-level anomalies, such as network anomalies, service unavailability anomalies, and so forth.

By means of the anomaly monitoring, anomaly changes within a period of time can be quickly found, and by means of anomaly values and combination with logs, system problems can be quickly located.

As shown in fig. 5, it can be found that both system anomalies and business anomalies suddenly increased during the 2:06 to 3:36 period and lasted for 1.5 hours.
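The description does not state how an abnormal period is identified from the monitored values. One simple possibility, shown here purely as an assumption, is to report each contiguous run of samples whose abnormality count exceeds a fixed threshold; the function name and the threshold are hypothetical.

```python
def abnormal_windows(series, threshold):
    """Return (start, end) labels of contiguous runs with value > threshold.

    series is an ordered list of (label, value) pairs, for example one
    pair per sampling interval.
    """
    windows = []
    start = None
    last = None
    for label, value in series:
        if value > threshold:
            if start is None:
                start = label  # a new abnormal run begins here
            last = label
        elif start is not None:
            windows.append((start, last))  # the run ended at the previous sample
            start = None
    if start is not None:
        windows.append((start, last))  # the run lasted until the final sample
    return windows
```

The reported window gives the time range whose logs should be examined when locating the system problem.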

As shown in fig. 6, system and business anomalies can be found by anomaly monitoring, but it is not possible to see what type of anomaly is in particular. Then it is necessary to refine the anomaly type to facilitate finer granularity of anomaly point discovery. Thus providing monitoring of abnormal subdivision. The abnormal distinction can be made according to the abnormal codes, so that the abnormal condition corresponding to each abnormal code is monitored.
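Subdividing abnormalities by abnormal code can be sketched as a simple count per code. The codes and their assignment to business or system abnormalities below are invented for illustration; the description does not list concrete codes.

```python
from collections import Counter

# Hypothetical abnormal codes mapped to the abnormality category they belong to.
CODE_TYPES = {
    "B1001": "business",  # e.g. insufficient inventory
    "B1002": "business",  # e.g. frequent login
    "S5001": "system",    # e.g. network abnormality
    "S5002": "system",    # e.g. service unavailable
}


def subdivide_abnormalities(events, code_types=CODE_TYPES):
    """Count abnormal events per (category, code) pair."""
    counts = Counter()
    for event in events:
        code = event["code"]
        category = code_types.get(code, "unknown")
        counts[(category, code)] += 1
    return counts
```

Plotting these counts per time interval gives one curve per abnormal code, so a sudden increase can be attributed to a specific abnormality type rather than only to "system" or "business" as a whole.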

S106, sending out alarm prompt information based on the system problem information;

specifically, a sending channel of alarm prompt information is configured, wherein the sending channel comprises a short message prompt, a mail prompt and a WeChat prompt;

configuring alarm rules;

the alarm rules include: an alarm is triggered when the current per-minute request volume is larger than a first preset value; an early warning is triggered when the current system abnormality rate is larger than a second preset value; an early warning is triggered when the current business abnormality rate is larger than a third preset value; an early warning is triggered when the current average execution time is larger than a fourth preset value; an alarm is triggered when the current response time is larger than a fifth preset value; an alarm is triggered when the current average growth rate of the per-minute request volume is larger than a sixth preset value; an alarm is triggered when the period-over-period growth rate of the current response time is larger than a seventh preset value; and an alarm is triggered when the period-over-period growth rate of the current per-minute request volume is larger than an eighth preset value. Preferably, the first to eighth preset values are 150%.
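Threshold rules of this kind can be sketched as a comparison of each metric with its configured preset value. The metric names and threshold numbers in the usage note are hypothetical, and the growth rate is assumed to be computed as (current - previous) / previous, with the previous period as the baseline.

```python
def growth_rate(current, previous):
    """Period-over-period growth rate; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous


def triggered_rules(metrics, thresholds):
    """Return the names of the rules whose metric exceeds its preset value."""
    fired = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # A missing or undefined metric never triggers a rule.
        if value is not None and value > limit:
            fired.append(name)
    return fired
```

For example, with a threshold of 1.5 (150%) on `response_time_growth`, a response time that rises from 100 ms to 500 ms gives a growth rate of 4.0 and triggers the rule, while a rise from 100 ms to 200 ms gives 1.0 and does not.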

The monitoring interface makes changes and comparisons of flow, abnormalities and the like directly visible; however, an alarm capability is needed so that abnormalities are noticed quickly when they occur. The scheme provides multi-dimensional monitoring alarm rules covering the per-minute request volume, response time, system abnormality rate, business abnormality rate, and 500 and 404 abnormality rates, together with their week-over-week and period-over-period comparisons, and supports flexible configuration of rules and notification modes. The notification modes include mail, Enterprise WeChat, short message and the like.
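A configurable notification step might look like the following sketch. The message fields follow the alarm prompt information listed in the disclosure (product line, application name, method, value, description and trigger time); the class, the rendering format and the channel registry are hypothetical, and real mail, short message or WeChat senders would be plugged in as the callables.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AlarmMessage:
    product_line: str
    app_name: str
    method: str
    value: float
    description: str
    trigger_time: datetime

    def render(self):
        """Format the alarm as a single line of text."""
        return (
            f"[{self.product_line}/{self.app_name}] {self.method}: "
            f"{self.description} (value={self.value}, "
            f"time={self.trigger_time:%Y-%m-%d %H:%M:%S})"
        )


def dispatch(message, channels):
    """Send the rendered alarm through every configured channel.

    channels maps a channel name to a callable that accepts the text.
    Returns the names of the channels that were used.
    """
    text = message.render()
    delivered = []
    for name, send in channels.items():
        send(text)
        delivered.append(name)
    return delivered
```

Keeping the channels in a dictionary of callables means that adding or removing a notification mode is a configuration change and does not affect the rule evaluation.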

According to the problem alarm method of the software system, log data are collected through a Filebeat installed at a client; the log data are transmitted to a first Kafka data distribution cluster 20, and the log data are distributed to a Storm data analysis cluster 30 through the first Kafka data distribution cluster 20; the Storm data analysis cluster 30 performs stream computation processing on the received log data to obtain processed log data; the processed log data are transmitted to a second Kafka data distribution cluster 40, and are distributed to a document type storage engine for data storage through the second Kafka data distribution cluster 40; the stored processed log data are processed graphically to judge whether the processed log data have an abnormal value, and if so, system problem information is acquired based on the abnormal value and the log data; and alarm prompt information is sent out based on the system problem information. The method solves the problem in the prior art that faults occurring while a software system is running cannot be found, located and solved quickly.

FIG. 2 is an architecture diagram of an embodiment of the problem alarm system of the software system of the present invention; as shown in FIG. 2, the problem alarm system of a software system provided by the embodiment of the invention includes:

the Filebeat module 10 is installed at the client and used for collecting log data;

a first Kafka data distribution cluster 20 for distributing the log data to a Storm data analysis cluster 30;

a Storm data analysis cluster 30 for performing stream computation processing on the received log data to obtain processed log data;

a second Kafka data distribution cluster 40 for distributing the processed log data to a document type storage engine for data storage;

an abnormal value acquisition module 50, configured to graphically process the stored processed log data, judge whether the processed log data has an abnormal value, and if so, acquire system problem information based on the abnormal value and the log data;

the alarm prompting module 60 is configured to issue alarm prompting information based on the system problem information.

The Filebeat module 10 is further configured to:

The alarm prompting module 60 is further configured to:

configuring alarm rules;

The alarm prompting module 60 is further configured to:

According to the problem alarm system of the software system, log data are collected through a Filebeat module 10 installed at a client; the log data are distributed to a Storm data analysis cluster 30 by a first Kafka data distribution cluster 20; stream computation processing is performed on the received log data by the Storm data analysis cluster 30 to obtain processed log data; the processed log data are distributed to a document type storage engine through a second Kafka data distribution cluster 40 for data storage; the stored processed log data are processed graphically by an abnormal value acquisition module 50 to judge whether the processed log data have an abnormal value, and if so, system problem information is acquired based on the abnormal value and the log data; and alarm prompt information is sent out by the alarm prompting module 60 based on the system problem information. The problem alarm system of the software system solves the problem in the prior art that faults occurring while a software system is running cannot be found, located and solved quickly.

Fig. 7 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention, as shown in fig. 7, an electronic device 70 includes: a processor 701, a memory 702, and a bus 703;

wherein, the processor 701 and the memory 702 complete communication with each other through the bus 703;

the processor 701 is configured to invoke program instructions in the memory 702 to perform the methods provided by the above-described method embodiments, for example, including: collecting log data through a Filebeat installed at a client; transmitting the log data to a first Kafka data distribution cluster 20, and distributing the log data to a Storm data analysis cluster 30 through the first Kafka data distribution cluster 20; the Storm data analysis cluster 30 performing stream computation processing on the received log data to obtain processed log data; transmitting the processed log data to a second Kafka data distribution cluster 40, and distributing the processed log data to a document type storage engine for data storage through the second Kafka data distribution cluster 40; graphically processing the stored processed log data, judging whether the processed log data has an abnormal value, and if so, acquiring system problem information based on the abnormal value and the log data; and sending out alarm prompt information based on the system problem information.

The present embodiment provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: collecting log data through a Filebeat installed at a client; transmitting the log data to a first Kafka data distribution cluster, and distributing the log data to a Storm data analysis cluster through the first Kafka data distribution cluster; the Storm data analysis cluster performing stream computation processing on the received log data to obtain processed log data; transmitting the processed log data to a second Kafka data distribution cluster, and distributing the processed log data to a document type storage engine for data storage through the second Kafka data distribution cluster; graphically processing the stored processed log data, judging whether the processed log data has an abnormal value, and if so, acquiring system problem information based on the abnormal value and the log data; and sending out alarm prompt information based on the system problem information.

Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.

The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the embodiments or the methods of some parts of the embodiments.

While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims

1. The problem alarming method of the software system is characterized by comprising the following steps:

collecting log data through a Filebeat installed at a client;

2. The method of claim 1, wherein the collecting log data through a Filebeat installed at a client comprises:

3. The method for alarming problems in a software system according to claim 1, wherein the collecting log data through a Filebeat installed at a client further comprises:

4. The method for alarming problems in a software system according to claim 1, wherein the collecting log data through a Filebeat installed at a client further comprises:

5. The method of claim 4, wherein graphically processing the stored processed log data to determine whether an outlier exists in the processed log data, and if so, obtaining system problem information based on the outlier and the log data, comprises:

6. The method of claim 1, wherein the sending an alarm prompt message based on the system problem information comprises:

configuring alarm rules;

7. The method of claim 6, wherein the sending an alarm prompt message based on the system problem information, further comprises:

8. A problem alert system for a software system, comprising:

the Filebeat module is installed at the client and used for collecting log data;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when the computer program is executed.

10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1 to 7.