CN107992398B

CN107992398B - Monitoring method and monitoring system of service system

Info

Publication number: CN107992398B
Application number: CN201711404987.3A
Authority: CN
Inventors: 张现双; 薛菲
Original assignee: Yiren Hengye Technology Development Beijing Co ltd
Current assignee: Yiren Hengye Technology Development Beijing Co ltd
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2021-04-27
Anticipated expiration: 2037-12-22
Also published as: CN107992398A

Abstract

The invention provides a monitoring method for a service system, comprising the following steps: collecting log data of the service system; filtering and standardizing the collected data as preprocessing; screening, from the preprocessed service system data, the data originating from a specific IP address and/or a specific host and/or a specific system, and standardizing and verifying that data; judging whether a reprocessed service system log is an abnormal unit according to a keyword preset by the monitoring item corresponding to the monitored service; generating a monitoring abnormality alarm according to a monitoring rule; and selecting a corresponding monitoring strategy to send alarm information through one or more media. The method thereby achieves zero intrusion into the monitored system, greatly reduces the modification cost of the monitored system, and achieves technical effects such as real-time performance and a high arrival rate. In addition, the embodiment of the invention also provides a monitoring system, an electronic device and a computer readable storage medium.

Description

Monitoring method and monitoring system of service system

Technical Field

The present invention relates to the field of computers, and in particular, to a monitoring method, a monitoring system, an electronic device, and a computer-readable storage medium for a business system.

Background

With the rapid development of company business, function iteration is usually performed on the basis of the existing business system so that the market can be occupied rapidly. In this process, technicians usually have no time to perfect real-time monitoring of each branch of the service system, so that infrastructure such as a monitoring platform is lacking, which restricts the further growth of subsequent services. In order to make up for the deficiency of the monitoring system, existing solutions usually add monitoring of middleware, machines, disks, CPUs and the like afterwards. These monitoring methods are highly invasive to the monitored system and require considerable modification of the system, which invisibly adds cost. For example, an evaluation of a loan transaction system showed that modifying more than 300 subsystems would require 2400 people, which is a huge investment; moreover, the quality of the modification cannot be guaranteed, service continuity cannot be guaranteed, and the system needs to be shut down.

At present, related monitoring products in the industry are designed and implemented from the perspective of application performance monitoring (APM), for example the distributed monitoring product CAT open-sourced by Dianping. Such a system is highly invasive to the monitored system and causes a certain loss of performance.

However, as the reliability and availability of software and hardware have greatly improved, application performance has become more stable and the probability of failure lower. Taking the loan transaction system as an example, among the failures reported each year, system failures found by application performance monitoring, such as memory overflow, deadlock, excessive CPU utilization and memory alarms, account for a smaller proportion than business system abnormalities, such as user withdrawal failures, user transaction failures, and a short message arrival rate that is 5% too low. Therefore, monitoring and alarming on service abnormalities of the service system becomes important.

On the other hand, if an alarm is generated without control for every abnormality of the monitored service system, the pressure on the alarm system is huge. For example, for abnormal monitoring items that are not system faults, such as service abnormalities caused by user misoperation, giving an alarm every time the abnormality occurs also wastes the time of background maintenance personnel. Therefore, it is desirable to provide a flexible and configurable monitoring and alarm method that enables maintenance personnel to receive alarm reminders at any time and in any place through a variety of transmission methods.

The foregoing description is not admitted to be prior art by inclusion in the background.

Disclosure of Invention

In order to solve the technical problems that the prior art cannot monitor the service abnormality of the system in real time, the intrusiveness of the application program performance monitoring on the monitored system is strong, the performance loss is large, the alarming is too frequent and the mode is single, the invention mainly aims to provide a monitoring method and system of the service system, electronic equipment and a computer readable storage medium.

In order to achieve the above object, the present invention provides a method for monitoring a plurality of service systems, which is used for monitoring a plurality of service systems simultaneously, and comprises:

collecting and preprocessing service system data, wherein the service system data comprises service system logs;

acquiring and reprocessing the preprocessed service system data;

analyzing monitoring items based on the reprocessed service system data, and judging an abnormal unit;

generating a monitoring abnormity alarm according to the monitoring rule;

and sending alarm information by adopting a corresponding monitoring strategy.

Optionally, the service system data further includes application performance monitoring data.

Optionally, the preprocessing the business system data includes filtering and normalizing the business system data.

Optionally, the preprocessing the service system data further includes performing distribution control on the service system data.

Optionally, the reprocessing is to screen the preprocessed service system data for data originating from a specific IP address and/or a specific host and/or a specific system.

Optionally, the reprocessing further includes normalizing and verifying the preprocessed business system data.

Optionally, the monitoring item analysis determines whether the reprocessed service system log is the abnormal unit according to a keyword preset by the monitoring item corresponding to the monitored service.

Optionally, the monitoring rule includes a system-level exception merge and/or an application-level exception merge, where the system-level exception merge defines a threshold of the number of monitoring exception alarms, and the application-level exception merge defines a priority of the user-definable monitoring exception merge.

Optionally, the monitoring policy includes any one or more of a node monitoring policy, a node threshold ratio monitoring policy, a fixed-point data threshold monitoring policy, a standard anomaly monitoring policy, and a standard APM monitoring policy.

Optionally, the sending of the alarm information is implemented by one or more media of a mail, a short message or a WeChat.

In addition, to achieve the above object, the present invention further provides a monitoring system for monitoring a plurality of service systems simultaneously, including:

the data acquisition and preprocessing module is used for acquiring and preprocessing service system data, wherein the service system data comprises service system logs;

a monitoring module, comprising: the reprocessing unit is used for acquiring the preprocessed service system data and reprocessing the preprocessed service system data; the judging unit is used for analyzing the monitoring item based on the reprocessed service system data and judging the abnormal unit; the alarm information generating unit is used for generating monitoring abnormal alarm according to the monitoring rule;

the alarm information sending module is used for sending alarm information by adopting a corresponding monitoring strategy;

and the monitoring configuration module is used for configuring any one or more of a monitoring item, a monitoring strategy, a preset alarm mode, a person on duty, system configuration and alarm information query of the visual interface.

Optionally, the sending of the alarm information may be implemented by one or more media of a mail, a short message, or a WeChat.

In addition, to achieve the above object, the present invention also provides an electronic device including:

a memory for storing a computer program;

a processor for executing a computer program stored in the memory, which when executed performs the steps of the monitoring method as described above.

Furthermore, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the monitoring method as described above.

According to the technical scheme provided by the invention, the acquired data is filtered and standardized and preprocessed by collecting the log data of the service system, the data from a specific IP address and/or a specific host and/or a specific system is screened from the preprocessed service system data, standardization and verification are carried out, whether the reprocessed service system log is an abnormal unit or not is judged according to a keyword preset by a monitoring item corresponding to the monitored service, a monitoring abnormal alarm is generated according to a monitoring rule, and a monitoring module selects a corresponding monitoring strategy to send alarm information through one or more media.

Compared with the prior art, the invention collects log data from the IT service system so as to monitor, in real time, the service abnormality data that has a higher probability of occurring, and judges whether a reprocessed service system log is an abnormal unit through keywords preset for the monitoring items of the monitored service. It achieves zero intrusion into the monitored system and zero modification of the application program, gives the whole service system strong scalability, and greatly reduces the modification cost of the monitored system. It also provides a plurality of selectable monitoring strategies and a plurality of flexible and configurable alarm modes, and achieves technical effects such as real-time performance, low cost and a high arrival rate.

Drawings

Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 schematically shows a schematic structural diagram of a monitoring system according to an embodiment of the invention;

FIG. 2 schematically illustrates a flow diagram of access layer data acquisition pre-processing in accordance with an embodiment of the present invention;

FIG. 3 schematically illustrates a monitoring module workflow diagram according to an embodiment of the invention;

FIG. 4 schematically illustrates a monitoring item and monitored service entity relationship diagram according to an embodiment of the invention;

FIG. 5 is a schematic diagram of the internal structure of a data acquisition preprocessing module according to an embodiment of the invention;

FIG. 6 schematically illustrates a monitoring module according to an embodiment of the invention;

FIG. 7 schematically illustrates a monitoring & alarm holistic model diagram according to an embodiment of the invention;

FIG. 8 schematically illustrates a monitoring configuration module user visualization interface diagram according to an embodiment of the invention;

FIG. 9 schematically illustrates a monitoring configuration module user visualization interface diagram according to an embodiment of the invention;

FIG. 10 schematically shows a structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

The principles and spirit of the present invention will be described with reference to several specific embodiments. It is understood that these examples are only for the purpose of enabling those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way.

The invention provides a monitoring method of a business system, a monitoring system, electronic equipment and a computer readable storage medium. The invention may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

As shown in fig. 1, fig. 1 is a schematic structural diagram of a monitoring system according to an embodiment of the present invention. The data acquisition preprocessing module 101 performs filtering and standardization preprocessing on acquired data in the access layer 1013 by acquiring service system log data 1011, the reprocessing unit 1021 in the monitoring module 102 screens out data from the preprocessed service system data, which are originated from a specific IP address and/or a specific host and/or a specific system, and performs standardization and verification, the judging unit 1022 judges whether the reprocessed service system log is the abnormal unit according to a keyword preset by a monitoring item corresponding to a monitored service, the alarm generating unit 1023 generates a monitoring abnormal alarm according to a monitoring rule, and the alarm sending module 103 selects a corresponding monitoring strategy and sends alarm information through one or more media.

The technical scheme provided by the embodiment of the invention uses an open-source stream computing platform as a tool and, combined with stream computing technology, takes the logs of each service system as the monitoring object. It analyzes and computes the relevant monitoring points in the logs in real time using various monitoring strategies and alarm strategies, assisted by measures such as distributed synchronization, distributed locking, stream computing time windows and pipelining, thereby performing monitoring and alarming.

An embodiment of the present invention provides a monitoring method for service systems, which is used for monitoring a plurality of service systems simultaneously, and includes:

collecting and preprocessing service system data, wherein the service system data comprises service system logs;

acquiring and reprocessing the preprocessed service system data;

analyzing monitoring items based on the reprocessed service system data, and judging an abnormal unit;

generating a monitoring abnormity alarm according to the monitoring rule;

and sending alarm information by adopting a corresponding monitoring strategy.

Further, collecting the service system data, which includes service system logs and application performance monitoring (APM) data 1012, proceeds as follows, starting from the service system log collection service:

Based on the log collection component (Logstash) of an open source data collection tool, a data collector (agent) of the log collection component collects the logs of the served system and pushes them to a unified log access layer. When pushing logs, the log collection component (Logstash) needs to carry admission authentication information provided by the monitoring system: the communication protocol header carries the unique identifier ARGUS_ID that the monitoring system allocates to the served system. When the logs are accessed, the monitoring system checks whether the ID carried in the header is consistent with the host address (host) configured in the system; if so, the log packets go on to filtering processing, and if not, the log packets are discarded.
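
The admission check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LogPacket` type, the `REGISTERED_HOSTS` table and the sample values are hypothetical, and only the idea of comparing the ARGUS_ID in the header with the configured host address comes from the text.

```python
from dataclasses import dataclass

# Hypothetical registry: ARGUS_ID -> host address configured in the monitoring system.
REGISTERED_HOSTS = {"argus-001": "10.0.0.5"}


@dataclass
class LogPacket:
    headers: dict      # protocol headers pushed by the log collector
    source_host: str   # host address the packet came from
    body: str          # raw log text


def admit(packet: LogPacket, registry: dict = REGISTERED_HOSTS) -> bool:
    """Return True when the ARGUS_ID in the header matches the configured host."""
    argus_id = packet.headers.get("ARGUS_ID")
    if argus_id is None:
        return False  # no admission information: discard
    return registry.get(argus_id) == packet.source_host
```

A packet for which `admit` returns False would be discarded; an admitted packet would continue to the filtering step.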

Further, as shown in fig. 2, the collecting and preprocessing the service system data includes filtering and normalizing the service system data. Fig. 2 is a flow chart of access layer data acquisition preprocessing, where the log preprocessing is performed as follows:

S201: the monitoring system provides a unified HTTP access service address; a plurality of HTTP service nodes are load-balanced (LB) by a reverse proxy server (nginx); the log collection component (Logstash) pushes log information to the address, and the reverse proxy server (nginx) forwards the log information to the access layer; ARGUS_ID verification is intercepted and handled on the reverse proxy server (nginx) so as to reduce the traffic pressure on the back-end control module;

S202: setting a white list rule for monitoring admission verification; the white list is set according to the IP address of the accessing server, because the ARGUS_ID is transmitted in plaintext and a service could intercept it and forge an identity to access the server;

if the IP address of the accessing server is not in the white list, executing step S203: directly rejecting the request;

if the IP address of the accessing server is in the white list, the verification passes and step S204 is executed;

S204: configuring a filtering rule; after the access layer acquires log data, the data is filtered, for example by filtering out debug logs or mybatis logs, so as to reduce as much as possible the volume of logs flowing into the back-end system; specific filtering rules are determined in combination with the specific logs, and several categories are listed as follows:

S205: standardizing the log; in combination with the structure of actual logs, the log is stored using a unified model so that the various information contained in the log text, such as log level, timestamp, calling class, package name and specific basic information, can be processed, wherein the model is defined as follows:

Because different access systems print logs under different rules, fields such as LogId are defined flexibly so as to adapt to the parsing rules of each case: a field is filled in if it can be parsed, and is left empty if it cannot be obtained.

S206: distributing the log unified model entity information, for example, distributing the log unified model entity information to a corresponding processing queue, processing the log unified model entity information by a downstream system, and monitoring the state of the queue;

if the queue is full, step S207 is executed: sending out an alarm;

S208: degrading the access service, by adopting a packet discarding strategy and rejecting service, or by scheduling downstream processing system resources, so as to protect the monitoring system from being flooded;

S210: ending the log data processing;

if the queue is not full, step S209 is performed: distributing and enqueuing the log data;

S211: ending the log data processing.
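
The access-layer flow of steps S202 to S211 can be sketched as a single function. This is a simplified sketch under stated assumptions: the white list contents, the two filter patterns, the log line layout in `LINE_FORMAT` and the returned status strings are all hypothetical, and an in-process `queue.Queue` stands in for the real processing queue.

```python
import queue
import re

WHITELIST = {"10.0.0.5"}  # hypothetical allowed source IPs (S202)
DROP_PATTERNS = [         # hypothetical filter rules (S204)
    re.compile(r"\bDEBUG\b"),
    re.compile(r"mybatis", re.IGNORECASE),
]
# Assumed line layout: "<date> <time> <LEVEL> <message>"
LINE_FORMAT = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def normalize(line: str) -> dict:
    """S205: map a raw line onto a unified model; unparsed fields stay empty."""
    match = LINE_FORMAT.match(line)
    if match is None:
        return {"timestamp": None, "level": None, "message": line}
    return match.groupdict()


def preprocess(source_ip: str, line: str, out_queue: queue.Queue, alarms: list) -> str:
    if source_ip not in WHITELIST:
        return "rejected"                      # S203: reject directly
    if any(p.search(line) for p in DROP_PATTERNS):
        return "filtered"                      # S204: drop noisy logs
    record = normalize(line)                   # S205
    try:
        out_queue.put_nowait(record)           # S209: distribute and enqueue
    except queue.Full:
        alarms.append("queue full")            # S207: raise an alarm
        return "degraded"                      # S208: discard the packet
    return "enqueued"
```

Here the degradation step simply drops the packet; scheduling additional downstream resources, the other option named in S208, is outside the scope of the sketch.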

Further, the step S206 of controlling the distribution of the log data specifically includes:

after the logs are standardized, pushing the logs into a log queue for subsequent tasks to analyze and process;

the access layer acquires the server configuration of all cache servers (redis) from the monitoring configuration system and initializes it to nodes of the distributed application coordination service (zookeeper);

after a log is enqueued, the queue is promptly registered under the corresponding cache server (redis) node, and the data extractor (spout) is notified to extract the log stream from the queue; the data extractor (spout) of the monitoring system thereby acquires the queue list;

each access system is partitioned by host address (HOST), each HOST uses one queue, and the queues are sharded across a plurality of cache servers;

ARGUS_ID is used as the identification object, and a consistent hashing algorithm determines which shard a log enters;

when a redis node goes down, the assignment is dynamically readjusted by applying the consistent hashing algorithm again.
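
A consistent hash ring of the kind referred to above can be sketched as follows. The text only says that a consistent hashing algorithm keyed on the ARGUS_ID selects the shard; the use of MD5, the virtual-node count of 100 and the class and node names are assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to cache nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes, replicas: int = 100):
        self.replicas = replicas   # virtual nodes per physical node (assumed value)
        self._ring = []            # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        # Called when a node goes down; the remaining ring stays sorted.
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, argus_id: str) -> str:
        if not self._ring:
            raise LookupError("no cache nodes available")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(argus_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

With this layout, removing a failed node reassigns only the ARGUS_IDs that were mapped to it; logs for every other served system keep flowing to the same shard.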

Further, as shown in fig. 3, which is a workflow diagram of the monitoring module, after the logs have been cleaned, processed and standardized by the access layer, the preprocessed service system data is acquired and reprocessed, which includes step S301:

a plurality of data extractors (spouts) acquire all queues of all shards distributed over the cache servers;

the data extractor (spout) initially acquires the shard list and the queue list under each shard from the distributed application coordination service (zookeeper), stores them in a local cache, and registers a listener node on the distributed application coordination service (zookeeper);

if the node has not changed, each subsequent extraction obtains the queue list from the local cache, extracts the logs and emits them;

when the registration node changes, for example because of access layer registration or removal or a maintenance action triggered from the monitoring configuration module, the latest queue list is actively acquired before the log stream is extracted.
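
The caching behaviour of the data extractor in step S301 can be illustrated with a small sketch. No real coordination-service client is used here: `fetch_queue_list` is a hypothetical callable standing in for a read from the coordination service, and `on_node_changed` stands in for the callback that a registered listener would invoke.

```python
class QueueListCache:
    """Keeps a local copy of the queue list and refreshes it only on change events."""

    def __init__(self, fetch_queue_list):
        self._fetch = fetch_queue_list            # stand-in for a coordination-service read
        self._cached = list(fetch_queue_list())   # initial load into the local cache
        self.fetch_count = 1                      # counts remote reads, for illustration

    def on_node_changed(self) -> None:
        """Listener callback: the registration node changed, so reload the list."""
        self._cached = list(self._fetch())
        self.fetch_count += 1

    def queues(self) -> list:
        """Used on every extraction round; never touches the remote service."""
        return list(self._cached)
```

The design choice being illustrated is that routine extraction reads only the local cache, so the coordination service is contacted once at start-up and again only when a change notification arrives.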

Further, the reprocessing is to screen out data originated from a specific IP address and/or a specific host and/or a specific system from the preprocessed service system data, and specifically includes step S302:

loading the monitoring configuration: before the log stream undergoes monitoring processing, the target monitoring items are configured and run-time rules are generated for the system to parse and apply;

filtering by monitoring item: identification and filtering of the monitored objects are managed uniformly through the monitoring configuration information; a target system is a monitoring object of the monitoring system, the related information collection service runs on the target system, and a unique identifier is formed from the service together with the IP address or domain name of the machine on which the service is located; whenever a monitored system is newly accessed, it registers with the monitoring system to generate a unique identifier, which must be consistent with that of the information collection service.
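
The screening in step S302 can be sketched as a lookup against the registered targets. The record keys `service` and `address` and the `MonitoredTarget` type are hypothetical; the text only states that a target is identified by its service together with the IP address or domain name of its machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredTarget:
    service: str   # name of the monitored service
    address: str   # IP address or domain name of the machine running it


def screen(records: list, targets: list) -> list:
    """Keep only records that originate from a registered (service, address) pair."""
    wanted = {(t.service, t.address) for t in targets}
    return [r for r in records if (r["service"], r["address"]) in wanted]
```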

Further, the monitoring item analysis judges whether a reprocessed service system log is an abnormal unit according to a keyword preset by the monitoring item corresponding to the monitored service, and whether alarm information is generated for an abnormal unit is decided only after the double filtering of the monitoring rule and the alarm rule, specifically including:

step S303, identifying the target logs to be alarmed by defining monitoring items, where a preset keyword serves as the monitoring item, for example the keyword 'cash deposit failure' appearing in the log;

if the keyword 'cash deposit failure' appears, executing step S304, in which a service abnormality of the service system is defined as an abnormal unit: the configured monitoring items are acquired according to information of the log packet such as ARGUS_ID and HOST, the monitoring items are matched by search or by regular expression, abnormal units are generated, and the message body of each abnormal unit is assembled and emitted;

the message format of the exception unit is as follows:

if there is no service abnormality, step S305 is executed: ending.
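
Steps S303 to S305 can be sketched as keyword or regular-expression matching against the monitoring items configured for the log's ARGUS_ID. The field names of `MonitorItem` and `AbnormalUnit` below are hypothetical and do not reproduce the message format of the abnormal unit, which is not shown in the text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorItem:
    item_id: str
    argus_id: str   # served system the item belongs to
    pattern: str    # plain keyword or regular expression


@dataclass(frozen=True)
class AbnormalUnit:
    item_id: str
    argus_id: str
    host: str
    message: str


def judge(log: dict, items: list) -> list:
    """Return one abnormal unit per monitoring item whose pattern occurs in the log."""
    units = []
    for item in items:
        if item.argus_id != log["argus_id"]:
            continue  # item is configured for a different served system
        if re.search(item.pattern, log["message"]):
            units.append(
                AbnormalUnit(item.item_id, log["argus_id"], log["host"], log["message"])
            )
    return units
```

An empty result corresponds to the case with no service abnormality, where processing of the log simply ends.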

Further, each monitoring item must correspond to a monitored service during configuration, the entity relationship between the monitoring item and the monitored service is as shown in fig. 4, each monitoring item uniquely belongs to one monitored service, and each monitored service may have multiple monitoring items.

Further, step S306 is executed: a corresponding monitoring exception merging rule is selected so as to determine the number of monitoring exceptions generated. The monitoring rule includes system-level exception merging and/or application-level exception merging, where the system-level exception merging defines a threshold on the number of monitoring exception alarms, and the application-level exception merging defines a user-customizable priority of monitoring exception merging, so as to generate a monitoring exception alarm, specifically including:

the system-level exception merging rule is automatically applied to all served systems and sets, for the whole monitoring system, an upper threshold on the number of monitoring exceptions that need to be alarmed. For example, if a certain monitoring item of a certain application system alarms more than 1000 times within 10 seconds, the details of each individual alarm are essentially unimportant; what matters is the fact that a large number of alarms were generated, so it is more reasonable to merge the alarms before sending them. The downstream alarm system, which would otherwise have to process 1000 pieces of alarm information, then needs to send only 1 alarm, which greatly reduces resource occupation;

the application-level exception merging rule is a custom exception merging rule provided per application when the monitoring strategy is configured, and its threshold is lower than that of the system-level exception merging rule;

the merging objects are all host addresses (HOST), and the generated monitoring exception objects enter a message queue (MQ).
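
The merging rule can be sketched with fixed time windows. The 10-second window and the threshold of 1000 follow the example in the text; grouping by fixed (tumbling) windows, the event tuple layout and the output dictionary keys are assumptions made for illustration.

```python
from collections import defaultdict


def merge_exceptions(events, window_seconds: float = 10, threshold: int = 1000) -> list:
    """Merge exceptions per (host, item, window) once their count exceeds the threshold.

    events: iterable of (timestamp_seconds, host, item_id) tuples.
    """
    buckets = defaultdict(list)
    for ts, host, item_id in events:
        window = int(ts // window_seconds)
        buckets[(host, item_id, window)].append(ts)

    merged = []
    for (host, item_id, _window), stamps in sorted(buckets.items()):
        if len(stamps) > threshold:
            # Too many to be useful individually: forward one merged exception.
            merged.append({"host": host, "item_id": item_id,
                           "count": len(stamps), "merged": True})
        else:
            merged.extend({"host": host, "item_id": item_id,
                           "count": 1, "merged": False} for _ in stamps)
    return merged
```

An application-level rule would call the same function with a lower `threshold` for the application concerned, in line with the statement that its threshold is below the system-level one.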

Further, step S308 is executed: the alarm generating module processes and generates alarm information according to the selected monitoring strategy. The monitoring strategy includes any one or more of a node monitoring strategy, a node threshold proportion monitoring strategy, a fixed point data threshold monitoring strategy, a standard abnormality monitoring strategy and a standard APM monitoring strategy; multiple monitoring strategies are provided for selection according to the requirements of the relevant service scenarios and system monitoring scenarios, specifically as follows:

node monitoring strategy: this strategy suits scenarios in which generation of an alarm unit is triggered once an expected node keyword appears in the monitored log. For example, during system log monitoring, an alarm is given if the keyword 'transaction failure' appears; regular expression matching is also supported, for example 'transaction failure and bank card quota' or 'transaction failure and payment password error'. This monitoring strategy can be used for counting as well as alarming: for example, counting the number of certain operations or requests and, combined with a time window definition, outputting a frequency value, such as the page views (PV) of login requests or the queries per second (QPS) of transaction requests;

node threshold monitoring: the number of occurrences of a certain monitored type within a certain time window is compared with a configured threshold, and an alarm is given when a certain condition is reached; for example, an alarm is given when the number of user registrations from a certain channel within half an hour is less than 100w (1,000,000), or when a certain user makes fewer than 10 purchases within 5 days;

node threshold ratio monitoring: the ratio of the number of occurrences of a certain monitored type to the number of occurrences of all types is compared with a preset threshold, and an alarm is given when a certain condition is reached; for example, user registration triggers short message sending: if the system generates 1000 short message sending requests within 10 minutes and only 900 are sent successfully, the sending success rate is 90%, and an alarm is given if the success rate is below 95%; or, if within 12 hours the system processes 1000 transaction requests in total and only 900 return success, the payment success rate is 90%, and an alarm is given if the success rate is below 95%;

fixed point data threshold monitoring: monitoring a particular piece of data in a monitoring data source. For example, in log monitoring, the system returns data in response to a front-end request; if data in JSON format is returned, as in the following table:

as with the JSON formatted data in the above table, if the minMobileAmt value in the JSON object whose productDetailId is "e5a05b6d03ee45589f5704ba7210e3dd" is greater than 1000, an alarm is generated;

the system can automatically parse different types of data. In the configuration, one piece of anchor data is configured together with the key whose value the anchor data represents, such as productDetailId; through these two pieces of data the system locates a data object, namely a JSON object, and then obtains the value of minMobileAmt in that object. Whatever the data format, the application system front end parses it according to certain rules, and the data in the log follows the same rules, so the monitoring system parses according to those rules. The system supports the common JSON and key-value representations;

standard abnormality monitoring: this monitoring strategy aims to monitor standard exceptions in the log source, so that a user can select an exception instead of typing it in, such as NullPointerException, ClassCastException and the like. A list of exceptions can be maintained and distinguished by language type;

standard APM monitoring: this monitoring strategy aims to monitor parameters such as those of the Java virtual machine (JVM), the run-time parameters of a database, and the state of a redis instance; for example, the middleware may be configured to print GC logs periodically, and an alarm is given when Full GC appears in the log; or JMX periodically prints the state of each parameter of the current JVM instance, some of the parameters in the log are monitored, and an alarm is raised according to the comparison of a parameter value with its threshold; or, for redis, the output of the info command is periodically acquired and printed in the log, and some of the parameter values in the log are monitored.
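
Two of the strategies above lend themselves to a short sketch: the node threshold ratio check and the fixed point data threshold check. The function names and parameters are hypothetical, and the traversal of arbitrarily nested JSON is an assumption; the text only says that an anchor key and anchor value locate a JSON object whose field is then compared with a threshold.

```python
import json


def ratio_alarm(success: int, total: int, min_ratio: float) -> bool:
    """Node threshold ratio: alarm when the success ratio falls below min_ratio."""
    if total == 0:
        return False  # nothing observed in the window, nothing to compare
    return success / total < min_ratio


def fixed_point_alarm(payload: str, anchor_key: str, anchor_value: str,
                      field: str, limit: float) -> bool:
    """Fixed point data threshold: find the object whose anchor_key equals
    anchor_value and alarm when its field is greater than limit."""
    stack = [json.loads(payload)]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if node.get(anchor_key) == anchor_value and field in node:
                return float(node[field]) > limit
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return False  # anchor object not present in this payload
```

With the figures from the text, `ratio_alarm(900, 1000, 0.95)` evaluates to True, since a 90% success rate is below the 95% threshold. For the fixed point check, the anchor key would be productDetailId and the field minMobileAmt, with a limit of 1000.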

Further, step S309 is executed to send out alarm information, where the sending out alarm information is implemented by one or more message notification mediums of an email, a short message, or a WeChat.

Further, alarms should not be generated from monitoring abnormalities and pushed to the relevant people directly without processing. For example, in the extreme case where a server on which some services depend goes down, monitoring abnormalities flood in; if they are not processed, the pressure on the alarm system becomes huge, the load on the short message system and the mail system increases excessively, and an avalanche effect follows. Different alarm strategies therefore need to be selected for sending alarm information according to application needs. For example, the monitoring abnormality 'transaction failure' can be defined to send a short message after it occurs 3 times within 10 s; it can also be defined to alarm at most 2 times within 1 minute and at most 100 times within 1 day, and, after 2 alarms, to send a further reminder of the number of alarms that occurred during a 900 s interval. Although the alarm information is merged, each alarm is still recorded in the data store; for example, if 300 alarms occur within 900 s, only one short message is sent, but the background still records 300 alarm records, that is, step S310.
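
An alarm strategy of the kind described above can be sketched with two sliding windows. The defaults follow the example in the text (trigger after 3 occurrences within 10 s, at most 2 messages within 1 minute); the class name, the choice to reset the occurrence window after each send, and the omission of the daily cap and the 900 s reminder are simplifications made for illustration.

```python
from collections import deque


class AlarmPolicy:
    """Decides, per occurrence, whether a message is actually sent."""

    def __init__(self, trigger_count: int = 3, trigger_window: float = 10.0,
                 max_sends: int = 2, send_window: float = 60.0):
        self.trigger_count = trigger_count
        self.trigger_window = trigger_window
        self.max_sends = max_sends
        self.send_window = send_window
        self._occurrences = deque()  # recent occurrence timestamps
        self._sends = deque()        # recent send timestamps
        self.records = []            # every occurrence is recorded, sent or not

    def on_exception(self, ts: float) -> bool:
        """Record one occurrence at time ts; return True if a message is sent."""
        self.records.append(ts)
        self._occurrences.append(ts)
        while self._occurrences and ts - self._occurrences[0] >= self.trigger_window:
            self._occurrences.popleft()
        if len(self._occurrences) < self.trigger_count:
            return False  # not enough occurrences within the trigger window
        while self._sends and ts - self._sends[0] >= self.send_window:
            self._sends.popleft()
        if len(self._sends) >= self.max_sends:
            return False  # rate limit reached; the occurrence is still recorded
        self._sends.append(ts)
        self._occurrences.clear()
        return True
```

Feeding one occurrence per second for twelve seconds into a default `AlarmPolicy` produces two sent messages while `records` holds all twelve occurrences, which mirrors the point that merged alarms are still recorded in full.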

Depending on the definition of the alarm strategy and the problem to be solved, several basic alarm strategies are now presented:

Further, the embodiment of the present invention also supports alarm shielding, which turns off the sending of predictable alarms. The shielding strategy is as follows:

time dimension: a continuous time period, for example no alarm is sent for 1 continuous hour; a fixed time period, for example no alarm from 0:00 to 6:00 every day; no alarm on holidays; no alarm on weekends; and so on;

shielding granularity: a certain monitoring item does not generate alarms, or a specific application service does not generate alarms; complete shutdown is not supported, and the coarsest granularity is the application level;

alarm object: certain alarm recipients are shielded.
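The three shielding dimensions above can be combined in a single rule object, as in the following sketch. The class ShieldRule, its field names, and the choice to reject a rule with no scope (reflecting that complete shutdown is not supported) are assumptions for illustration; holidays are omitted because they would need an external calendar.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class ShieldRule:
    """Hypothetical shielding rule; a field left as None matches anything."""
    app: Optional[str] = None    # application-level granularity
    item: Optional[str] = None   # monitoring-item granularity
    start_hour: int = 0          # daily shielding window [start_hour, end_hour)
    end_hour: int = 24
    weekends: bool = False       # also shield all day on Saturday and Sunday
    recipients: Set[str] = field(default_factory=set)  # empty set = everyone

    def __post_init__(self):
        # Complete shutdown is not supported: a rule must be scoped.
        if self.app is None and self.item is None and not self.recipients:
            raise ValueError("a shield rule must name an app, an item, or recipients")

    def shields(self, app, item, recipient, when):
        """Return True if an alarm for (app, item, recipient) at `when` is shielded."""
        if self.app is not None and self.app != app:
            return False
        if self.item is not None and self.item != item:
            return False
        if self.recipients and recipient not in self.recipients:
            return False
        if self.weekends and when.weekday() >= 5:
            return True
        return self.start_hour <= when.hour < self.end_hour
```

For instance, ShieldRule(app="loan", start_hour=0, end_hour=6) would suppress alarms of the hypothetical "loan" application between 0:00 and 6:00 every day.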

In addition, an embodiment of the present invention further provides a monitoring system. As shown in fig. 1, the monitoring system according to the embodiment of the present invention is composed of four modules, namely a data acquisition preprocessing module 101, a monitoring module 102, an alarm information sending module 103, and a monitoring configuration module 104. The overall architecture is as follows:

the data acquisition and preprocessing module 101 is configured to acquire and preprocess service system data, where the service system data includes service system log data 1011;

a monitoring module 102, comprising: a reprocessing unit 1021, configured to obtain the preprocessed service system data and perform reprocessing; a determining unit 1022, configured to perform monitoring item analysis based on the reprocessed service system data, and determine an abnormal unit; the alarm information generating unit 1023 is used for generating monitoring abnormity alarm according to the monitoring rule;

the alarm information sending module 103 is used for sending alarm information by adopting a corresponding monitoring strategy;

and the monitoring configuration module 104 is configured to configure any one or more of a monitoring item of the visual interface, a monitoring policy, a predetermined alarm mode, a person on duty, system configuration, and alarm information query.

Further, the business system data also includes application performance monitoring data 1012 (APM).

Further, fig. 5 is a schematic diagram of the internal structure of the data acquisition preprocessing module 101 according to an embodiment of the present invention. Because online logs are analyzed in real time, the data volume is large, and feeding the logs directly into the monitoring system would overload it. The access layer 1013 (buffer) therefore serves as a buffer before data enters the monitoring system, and a certain data discarding strategy is allowed, such as discarding logs generated by non-com.

Performing preliminary filtering and cleaning on the log, for example, filtering out meaningless texts and supporting filtering of rule configuration;

and carrying out standardized processing on the logs, and carrying out standardization according to a log entity model predefined by the system.
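The filtering and standardization steps can be sketched as below. The raw line layout, the noise rules, and the LogEntity fields are assumptions chosen for illustration; the patent text does not specify the log entity model.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed raw layout: "<timestamp> <host> <app> <LEVEL> <message>"
LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<app>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
# Configurable filter rules (hypothetical examples): blank lines and heartbeats.
NOISE = [re.compile(r"^\s*$"), re.compile(r"heartbeat", re.IGNORECASE)]


@dataclass(frozen=True)
class LogEntity:
    """Hypothetical unified log entity model."""
    timestamp: str
    host: str
    app: str
    level: str
    message: str


def preprocess(raw: str) -> Optional[LogEntity]:
    """Filter out meaningless text, then normalize the line into a LogEntity."""
    if any(rule.search(raw) for rule in NOISE):
        return None  # dropped by a filter rule
    match = LINE.match(raw.strip())
    if match is None:
        return None  # does not fit the assumed layout
    return LogEntity(
        match["ts"], match["host"], match["app"], match["level"], match["msg"].strip()
    )
```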

Further, the preprocessing the service system data further includes performing distribution control on the service system data, and specifically includes:

distributing the entity information of the log unified model, for example, distributing the entity information to a corresponding processing queue, and processing the entity information by a downstream system;

monitoring the state of the queue, and if the queue is full, sending an alarm and degrading the access service;

and adopting a packet discarding strategy and denial of service, or scheduling the resources of a downstream processing system so as to protect the monitoring system from flooding.
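The distribution control described above can be approximated with bounded in-process queues. In this sketch the class name Distributor, the routing key, and the in-memory alerts list are hypothetical; a production system would presumably use a distributed message queue and a real alerting channel.

```python
import queue


class Distributor:
    """Route log entities to bounded per-key queues and shed load when full."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self.queues = {}   # routing key -> bounded queue read by a downstream system
        self.dropped = 0   # number of entities discarded under overload
        self.alerts = []   # stand-in for alarms about the queue state

    def dispatch(self, key, entity):
        """Enqueue entity under key; return False if it had to be dropped."""
        q = self.queues.get(key)
        if q is None:
            q = self.queues[key] = queue.Queue(maxsize=self.maxsize)
        try:
            q.put_nowait(entity)
            return True
        except queue.Full:
            # Queue is full: raise an alarm and degrade by discarding the entity.
            self.dropped += 1
            self.alerts.append(f"queue {key} is full, entity dropped")
            return False
```

Dropping is only one of the options named in the text; scheduling more downstream resources would be the alternative when data loss is not acceptable.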

Further, fig. 6 is a schematic structural diagram of the monitoring module according to an embodiment of the present invention. The core of the monitoring system according to the embodiment of the present invention is the monitoring module 102. According to the preset monitoring items, it analyzes the acquired logs in real time and, for each triggered monitoring item, applies a monitoring policy to determine whether to generate an alarm, thereby providing an extensible and flexibly configurable service. The generated alarm information is, on the one hand, stored persistently and, on the other hand, pushed to the alarm system, which sends it to a preset person on duty.

Further, the reprocessing is to screen out data originated from a specific IP address and/or a specific host and/or a specific system from the preprocessed service system data, and specifically includes:

the target system is the object monitored by the monitoring system; the related information collection service runs on the target system, and the target system is uniquely identified by the service together with the IP address or domain name of the machine on which the service runs;

whenever a monitored system is newly accessed, it registers with the monitoring system to generate a unique identifier, which must be consistent with the one used by the information collection service.
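The registration and screening logic can be sketched as follows. The identifier format "service@address" and the class name TargetRegistry are assumptions; the text only requires that the identifier be unique and consistent with the one used by the information collection service.

```python
class TargetRegistry:
    """Registry of monitored target systems, keyed by service and address."""

    def __init__(self):
        self._targets = set()

    @staticmethod
    def make_id(service, address):
        # Assumed identifier format: service name plus IP address or domain name.
        return f"{service}@{address}"

    def register(self, service, address):
        """Register a newly accessed monitored system and return its identifier."""
        ident = self.make_id(service, address)
        self._targets.add(ident)
        return ident

    def accept(self, service, address):
        """Screening step: keep only data that originates from a registered target."""
        return self.make_id(service, address) in self._targets
```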

Further, the reprocessing further comprises standardizing and verifying the preprocessed business system data.

Further, the monitoring item analysis judges whether a reprocessed service system log is an abnormal unit according to the keyword preset for the monitoring item of the monitored service, which specifically includes:

defining a monitoring item according to a preset keyword so as to identify the target logs to be alarmed on; for example, if the keyword "cash deposit failure" appears in a log, the service system is determined to be abnormal, the log is defined as an abnormal unit, and "cash deposit failure" is the monitoring item;

each monitoring item uniquely belongs to one monitored service, and each monitored service can have a plurality of monitoring items.
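A minimal sketch of keyword-based monitoring item analysis follows. The MonitoringItem fields and the plain substring match are assumptions for illustration; the text does not state how keywords are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringItem:
    """A monitoring item belongs to exactly one monitored service."""
    name: str
    service: str
    keyword: str


def find_abnormal_units(items, service, message):
    """Return the names of the monitoring items that a log message hits.

    A log that hits at least one item is treated as an abnormal unit.
    """
    return [
        item.name
        for item in items
        if item.service == service and item.keyword in message
    ]
```

A monitored service may own several items, so one log line can hit more than one of them.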

The monitoring items have the following basic properties:

Further, the monitoring rules include system-level exception merging and/or application-level exception merging; system-level exception merging defines a threshold on the number of monitoring exception alarms, and application-level exception merging allows the user to define the priority of monitoring exception merging when generating the monitoring exception alarms.

Further, the monitoring strategy refers to any one or more of a node monitoring strategy, a node threshold ratio monitoring strategy, a fixed point data threshold monitoring strategy, a standard anomaly monitoring strategy and a standard APM monitoring strategy.

Furthermore, a monitoring exception is not generated directly when an exception unit hits a monitoring item, but is generated through a certain strategy. For example, a monitoring exception is generated only when the "transaction failure" monitoring item triggers an exception unit three times within 10 s; this allows strategies to be defined more flexibly and at finer granularity for service exceptions that occur with small probability.

Further, a monitoring exception is generated only if the exception units triggered by the monitoring item satisfy the predetermined monitoring policy. The logical relationship among the monitoring item, the exception unit, the monitoring policy, and the monitoring exception is shown in fig. 7, which is the overall monitoring and alarm model.
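The "three times within 10 s" example can be expressed as a sliding-window count, as in the sketch below. The class name CountInWindowPolicy and the decision to reset the window after a monitoring exception fires are assumptions, not details given in the text.

```python
from collections import deque


class CountInWindowPolicy:
    """Raise a monitoring exception after `count` abnormal units within `window_s`."""

    def __init__(self, count=3, window_s=10.0):
        self.count = count
        self.window_s = window_s
        self._hits = deque()  # timestamps of recent abnormal units

    def on_abnormal_unit(self, ts):
        """Record one abnormal unit; return True if a monitoring exception fires."""
        self._hits.append(ts)
        # Discard hits that have fallen out of the time window.
        while self._hits and ts - self._hits[0] > self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.count:
            self._hits.clear()  # assumed: start counting afresh after firing
            return True
        return False
```

One policy object would be kept per monitoring item, so that hits on different items are counted separately.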

Further, the alarm information sending module 103 is configured to send out alarm information by using a corresponding monitoring policy, which specifically includes:

the monitoring module 102 and the alarm information sending module 103 are decoupled through a distributed double-ended queue, and a distributed lock is adopted for synchronization;

each monitoring policy of the monitoring module 102 corresponds to one queue;

the alarm information generated by the alarm information generating unit 1023 enters the queue, and the alarm information sending module 103 maintains a thread pool;

the alarm information sending module 103 acquires a queue list in real time, and dynamically initiates a processing thread to process alarm information whenever a new queue is generated;

if the queue is idle for a long time, the thread is terminated to save resources;

receiving the alarm information generated by the alarm information generating unit 1023 of the monitoring module 102, applying a preset monitoring strategy, sending an alarm to a person on duty, and supporting multiple alarm modes;

the alarm information sending module 103 is provided with a retry mechanism.
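The queue-per-policy arrangement with dynamically started workers, idle termination, and retry can be sketched in a single process as follows. This is only an approximation: the class AlarmDispatcher, its parameters, and the send callback are hypothetical, a threading.Lock stands in for the distributed lock, and queue.Queue stands in for the distributed double-ended queue.

```python
import queue
import threading


class AlarmDispatcher:
    """One queue and one worker thread per monitoring policy."""

    def __init__(self, send, idle_timeout_s=0.2, max_retries=3):
        self._send = send              # callable(policy, alarm); raises on failure
        self._idle = idle_timeout_s    # idle time after which a worker exits
        self._retries = max_retries
        self._lock = threading.Lock()  # stand-in for the distributed lock
        self._queues = {}              # policy -> queue of pending alarms
        self._workers = {}             # policy -> live worker thread

    def publish(self, policy, alarm):
        """Called by the monitoring side to hand over one alarm."""
        with self._lock:
            q = self._queues.get(policy)
            if q is None:
                q = self._queues[policy] = queue.Queue()
            q.put(alarm)
            # Start a processing thread whenever the queue has no live worker.
            if policy not in self._workers:
                worker = threading.Thread(
                    target=self._run, args=(policy, q), daemon=True
                )
                self._workers[policy] = worker
                worker.start()

    def _run(self, policy, q):
        while True:
            try:
                alarm = q.get(timeout=self._idle)
            except queue.Empty:
                with self._lock:
                    if q.empty():
                        # Idle for too long: terminate to save resources.
                        del self._workers[policy]
                        return
                continue
            # Retry mechanism: attempt delivery up to max_retries times,
            # then give up on this alarm (a simplification).
            for _ in range(self._retries):
                try:
                    self._send(policy, alarm)
                    break
                except Exception:
                    continue
            q.task_done()

    def join(self):
        """Block until every published alarm has been processed."""
        with self._lock:
            queues = list(self._queues.values())
        for q in queues:
            q.join()
```

Holding the lock both when publishing and when a worker decides to exit avoids the case where an alarm is enqueued just as its worker terminates.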

Further, the sending of the alarm information may be implemented by one or more media of short message 1031, mail 1032, or WeChat 1033.

Further, fig. 8 and fig. 9 are schematic diagrams of the user visual interface of the monitoring configuration module according to an embodiment of the present invention. The monitoring configuration module 104 is configured to configure, through the visual interface, any one or more of a monitoring item 1041, a monitoring policy 1042, an alarm query 1043, a person on duty 1044, an alarm mode 1045, a statistical analysis 1046, and other system configurations, which specifically includes:

the monitoring configuration module 104 serves as the management and configuration system of the whole platform, monitors some parameters, and monitors the running state of the whole platform;

the user sets and selects the corresponding functions and parameters through the module according to the user's own requirements, including:

configuration of a monitoring item 1041 and a monitoring strategy 1042;

alarm query 1043: alarms are viewed on a Web page; a short URL is attached to the alarm short message and mail and jumps to the page for checking the alarm details;

configuration of a person on duty 1044 and the like;

configuring an alarm mode 1045, including alarm and alarm strategy configuration;

system parameter configuration, queue and storm running state monitoring;

and maintaining and loading the queue list, and accessing the new monitored application during running.

Further, the monitoring configuration module is further configured to perform application-level information statistical analysis 1046, which specifically includes:

performing QPS and TPS statistics of a system, and real-time statistics of PV and UV;

access path and response time statistics;

monitoring the access traces of specific users, request tracking, and the like;

other related business application scenarios that involve log data.

Further, the monitoring system further comprises a storage module 105, which persistently stores the generated alarm information for query and statistical analysis. When alarm merging is performed, for example when a certain monitoring item generates 10 alarms, the alarm system merges the 10 alarms into one alarm short message, but 10 alarm records are still stored.

In addition, an embodiment of the present invention further provides an electronic device 40. As shown in fig. 10, the electronic device 40 of this embodiment may include, but is not limited to: one or more processors 401, a memory 402, a bus 403 connecting the different components (including the memory 402 and the processor 401), and an I/O interface 405 for data transfer with external devices 404 (such as a keyboard, pointing device, display, etc.). The electronic device 40 of this embodiment may be a PC (personal computer), a tablet computer, a portable computer, or other display device with a processor.

Further, the processor 401 executes various functional applications and data processing by executing computer programs stored in the memory 402, for example, executes instructions for implementing the steps in the above-described method embodiments; in particular, the processor 401 may execute a computer program stored in the memory 402, and when the computer program is executed, the following instructions are executed: collecting and preprocessing service system data, wherein the service system data comprises service system logs;

acquiring and reprocessing the preprocessed service system data;

generating a monitoring abnormity alarm according to the monitoring rule;

and sending alarm information by adopting a corresponding monitoring strategy.

Further, the memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 4021 and/or cache memory 4022; the electronic device 40 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM 4023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, commonly referred to as a "hard drive"). Although not shown in FIG. 10, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 403 by one or more data media interfaces.

Further, the memory 402 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of embodiments of the present invention. A program/utility 4025 having a set (at least one) of program modules 4024 may be stored in, for example, the memory 402; such program modules 4024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules 4024 generally perform the functions and/or methods of the embodiments described herein.

As shown in FIG. 10, the network adapter 406 communicates with other modules of the electronic device 40, such as the processor 401, via the bus 403.

Furthermore, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the steps described in the method embodiments above, for example:

acquiring and reprocessing the preprocessed service system data;

generating a monitoring abnormity alarm according to the monitoring rule;

and sending alarm information by adopting a corresponding monitoring strategy.

Further, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the invention have been described with reference to the above specific embodiments, it is to be understood that the invention is not limited to the specific embodiments disclosed, nor does the division into aspects imply that the features in these aspects cannot be combined to advantage; this division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A monitoring method of a service system is used for monitoring a plurality of service systems simultaneously, and is characterized by comprising the following steps:

collecting and preprocessing service system data, wherein the service system data comprises service system logs; the preprocessing comprises the steps of setting a white list rule to carry out monitoring admission verification and filtering and standardizing service system data;

acquiring and reprocessing the preprocessed service system data;

analyzing monitoring items based on the reprocessed service system data, and judging an abnormal unit; wherein, the monitoring item analysis adopts distributed synchronization, distributed lock technology, flow type computation time window and pipeline technology to perform real-time analysis computation;

generating a monitoring abnormity alarm according to the monitoring rule; the monitoring rule includes system-level exception merging and/or application-level exception merging, and specifically includes: the system-level abnormal merging rule can be automatically applied to all served systems, and an upper limit threshold value of the number of monitoring abnormalities needing to be alarmed is set for the whole monitoring system; the application-level exception merge rule is a custom exception merge rule provided when the monitoring policy is configured according to the application, and the threshold value of the rule is lower than that of the system-level exception merge rule;

and sending alarm information by adopting a corresponding monitoring strategy.

2. The monitoring method of claim 1, wherein the business system data further comprises application performance monitoring data.

3. The monitoring method of claim 1, wherein preprocessing the business system data further comprises performing distribution control on the business system data.

4. A monitoring method according to any of claims 1-3, characterized in that the reprocessing is screening, from the preprocessed service system data, data originating from a specific IP address and/or a specific host and/or a specific system.

5. A monitoring method in accordance with claim 4, wherein the reprocessing further comprises normalizing and validating the preprocessed business system data.

6. A monitoring method according to any one of claims 1-3, wherein the monitoring item analysis includes determining whether the reprocessed service system log is the abnormal unit according to a keyword preset by the monitoring item corresponding to the monitored service.

7. The monitoring method of any one of claims 1-3, wherein the monitoring policy includes any one or more of a node monitoring policy, a node threshold ratio monitoring policy, a fixed point data threshold monitoring policy, a standard anomaly monitoring policy, and a standard application performance monitoring data monitoring policy.

8. The monitoring method of claim 1, wherein the sending alarm information is implemented by one or more media of mail, short message or WeChat.

9. A monitoring system for simultaneously monitoring a plurality of business systems, comprising:

the data acquisition and preprocessing module is used for acquiring and preprocessing service system data, wherein the service system data comprises service system logs; the preprocessing comprises setting a white list rule to carry out monitoring admission verification and filtering and standardizing service system data;

the monitoring configuration module is used for configuring any one or more of a monitoring item, a monitoring strategy, a preset alarm mode, a person on duty, system configuration and alarm information query of the visual interface;

the judging unit analyzes and calculates the monitoring items in real time by adopting a distributed synchronization technology, a distributed lock technology, a streaming calculation time window and a pipeline technology;

the monitoring rule includes system-level exception merging and/or application-level exception merging, and specifically includes: the system-level abnormal merging rule can be automatically applied to all served systems, and an upper limit threshold value of the number of monitoring abnormalities needing to be alarmed is set for the whole monitoring system; the application-level exception merge rule is a custom exception merge rule provided when the monitoring policy is configured according to the application, and the threshold value of the rule is lower than that of the system-level exception merge rule.

10. The monitoring system of claim 9, wherein the business system data further comprises application performance monitoring data.

11. The monitoring system of claim 9, wherein the preprocessing the business system data further comprises distribution control of the business system data.

12. A monitoring system according to any of claims 9-11, characterized in that the reprocessing is screening, from the preprocessed service system data, data originating from a specific IP address and/or a specific host and/or a specific system.

13. The monitoring system of claim 12, wherein the reprocessing further comprises normalizing and validating the preprocessed business system data.

14. The monitoring system according to any one of claims 9-11, wherein the monitoring item analysis includes determining whether the reprocessed service system log is the abnormal unit according to a keyword preset by the monitoring item corresponding to the monitored service.

15. The monitoring system of any one of claims 9-11, wherein the monitoring policy includes any one or more of a node monitoring policy, a node threshold ratio monitoring policy, a fixed point data threshold monitoring policy, a standard anomaly monitoring policy, and a standard application performance monitoring data monitoring policy.

16. The monitoring system of claim 9, wherein the sending alarm information is implemented by one or more media of mail, text message or WeChat.

17. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing a computer program stored in the memory, wherein the computer program when executed performs the steps of the monitoring method of any one of claims 1-8.

18. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the monitoring method as claimed in any one of the preceding claims 1 to 8.