CN117336145A - Processing system for event monitoring - Google Patents


Info

Publication number
CN117336145A
Authority
CN
China
Prior art keywords
service
event
resource
query
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311346062.3A
Other languages
Chinese (zh)
Inventor
管海鹏
张洁儒
杨洋
张新博
于得水
袁飞
傅宇晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shumi Intelligent Technology Co ltd
Shenzhen Showmac Network Technology Co ltd
Guangdong Shumi Technology Co ltd
Original Assignee
Beijing Shumi Intelligent Technology Co ltd
Shenzhen Showmac Network Technology Co ltd
Guangdong Shumi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shumi Intelligent Technology Co ltd, Shenzhen Showmac Network Technology Co ltd, Guangdong Shumi Technology Co ltd
Priority to CN202311346062.3A
Publication of CN117336145A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements

Abstract

An embodiment of the invention relates to a processing system for event monitoring. The system comprises a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module. A first acquisition module is preset on each first monitoring object and is connected with the first main control module; the first main control module is also connected with the first storage module, the first query module and the first early warning module respectively. The system can further improve the running stability of the service system, reduce the workload of operation and maintenance personnel and improve working efficiency.

Description

Processing system for event monitoring
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a processing system for event monitoring.
Background
Internet of Things (IoT) operation and maintenance services use a service system to manage all IoT devices. At present, to improve system stability while the service system is running, the usage state of the resources (such as CPU, GPU, memory, disk, bandwidth, load and disk I/O) of each node object (such as an IoT device, a network device, a server, a database or a virtual machine) in the system is monitored, and abnormal resource events are identified and reported according to the monitoring results. However, after long-term system operation we have found that this conventional, resource-only event monitoring approach has some drawbacks: 1) the services called by the node objects are not monitored, so abnormal service events in the system cannot be identified or reported, which reduces system stability; 2) no flexible query interface is provided for historical monitoring data, so operation and maintenance personnel must do extra work when constructing monitoring-data-versus-time curves and performing year-over-year or period-over-period comparisons based on historical monitoring data or abnormal events.
Disclosure of Invention
The object of the present invention is to address the deficiencies of the prior art by providing a processing system for event monitoring. The system comprises a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module. The first monitoring objects are the node objects (such as IoT devices, network devices, servers, databases and virtual machines) in the service system, and a first acquisition module is preset on each first monitoring object for data acquisition. The first main control module is connected with the first acquisition module of each first monitoring object and with the first storage module, the first query module and the first early warning module. The system adds a service event monitoring mode while remaining compatible with the conventional resource event monitoring mode, so it can monitor the services called by the node objects and identify and report abnormal service events occurring in the system. The system further provides eight types of query instructions (a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction), so that the time-series data required for a resource acquisition data-time curve, an abnormal resource event-time curve, a service acquisition data-time curve and an abnormal service event-time curve can be retrieved automatically, and the comparison results for average utilization, total number of abnormal resource events, total number of service calls and total number of abnormal service events between two designated time periods (two year-over-year periods or two consecutive periods) can be calculated automatically. The system can thus not only further improve the running stability of the service system, but also reduce the workload of operation and maintenance personnel and improve working efficiency.
To achieve the above object, an embodiment of the present invention provides a processing system for event monitoring, the system comprising: a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module;
a first acquisition module is preset on each first monitoring object; the first acquisition module is connected with the first main control module; the first main control module is also connected with the first storage module, the first query module and the first early warning module respectively;
the first acquisition module is used for periodically acquiring the resource data of the first monitoring object according to a preset resource monitoring frequency to obtain a corresponding first acquisition data packet and sending the corresponding first acquisition data packet to the first main control module;
the first acquisition module is further used for carrying out appointed service call monitoring on the first monitoring object according to a preset monitoring service set, acquiring service data monitored each time to generate corresponding first service acquisition data and sending the corresponding first service acquisition data to the first main control module; the monitoring service set comprises a plurality of monitoring service records; the monitoring service record comprises a monitoring service name and a monitoring service calling interface; the monitoring service names comprise multiple classes of appointed service names;
The first main control module is used for storing the first acquired data packet into the first storage module when the first acquired data packet is received; carrying out abnormal resource event identification processing according to the first acquired data packet to obtain a corresponding first abnormal event set; when the first abnormal event set is not empty, storing the first abnormal event set into the first storage module, and sending the first abnormal event set to the first early warning module;
the first main control module is used for carrying out abnormal service event identification processing according to the first service acquisition data to obtain a corresponding second abnormal event when receiving the first service acquisition data; when the second abnormal event is not empty, storing the second abnormal event into the first storage module, and sending the second abnormal event to the first early warning module;
the first main control module is also used for receiving a first query instruction sent by the first query module; performing data query processing on the first storage module according to the first query instruction to generate corresponding first query feedback and sending the corresponding first query feedback back to the first query module;
the first query module is used for sending the first query instruction input by the user to the first main control module; and feeding back the first query feedback sent back by the first main control module to a user;
The first early warning module is used for carrying out resource abnormal event early warning processing according to the first abnormal event set when the first abnormal event set is received; and the first early warning module is also used for carrying out early warning processing on the service abnormal event according to the second abnormal event when the second abnormal event is received.
Preferably, the first collection data packet is composed of a plurality of first resource collection data; the first resource acquisition data comprises a first acquisition time stamp, a first object identifier, a first resource type and first resource data; the first resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O; when the first resource type is CPU/GPU, the corresponding first resource data comprises the total number of CPU/GPU cores and the utilization rate of the CPU/GPU cores; when the first resource type is memory/disk/bandwidth/load, the corresponding first resource data comprises total memory/disk/bandwidth/load and memory/disk/bandwidth/load utilization; when the first resource type is disk I/O, the corresponding first resource data comprises disk reading rate and disk writing rate;
The first service acquisition data comprises a second acquisition time stamp, a second object identifier, a first service name, a first service cumulative value and a first service return state; the first service return state includes a success state and a failure state;
the first set of abnormal events includes one or more first abnormal events; the first abnormal event comprises a first event time stamp, a first event object identification, a first event resource type and a first event abnormal level; the first event resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O; the first event anomaly level includes a middle level and a high level;
the second abnormal event comprises a second event time stamp, a second event object identifier and a first event service name;
the first query instruction comprises a first instruction head and a first instruction parameter; the first instruction head comprises a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction;
when the first instruction head is a resource monitoring query instruction, the corresponding first instruction parameters comprise a first query resource type and a first resource query period; the first query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction head is a resource abnormality query instruction, the corresponding first instruction parameters comprise a second query resource type and a second resource query period; the second query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction head is a resource monitoring comparison instruction, the corresponding first instruction parameters comprise a third query resource type, a first-one period and a first-two period; the third query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction head is a resource abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query resource type, a first-three period and a first-four period; the fourth query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction header is a service monitoring query instruction, the corresponding first instruction parameters comprise a first query service name and a first service query period;
when the first instruction head is a service abnormal inquiry instruction, the corresponding first instruction parameters comprise a second inquiry service name and a second service inquiry period;
when the first instruction head is a service monitoring comparison instruction, the corresponding first instruction parameters comprise a third query service name, a second-one period and a second-two period;
when the first instruction head is a service abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query service name, a second-three period and a second-four period.
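The record layouts listed above can be sketched as plain data classes. This is only an illustrative sketch: the class names, field names, enum values and the representation of a period as a `(start, end)` pair are assumptions made for the example, not identifiers from the patent.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Tuple


class ResourceType(Enum):
    """The seven resource types named in the text."""
    CPU = auto()
    GPU = auto()
    MEMORY = auto()
    DISK = auto()
    BANDWIDTH = auto()
    LOAD = auto()
    DISK_IO = auto()


class InstructionHead(Enum):
    """The eight query instruction types named in the text."""
    RESOURCE_MONITORING_QUERY = auto()
    RESOURCE_ABNORMALITY_QUERY = auto()
    RESOURCE_MONITORING_COMPARISON = auto()
    RESOURCE_ABNORMALITY_COMPARISON = auto()
    SERVICE_MONITORING_QUERY = auto()
    SERVICE_ABNORMALITY_QUERY = auto()
    SERVICE_MONITORING_COMPARISON = auto()
    SERVICE_ABNORMALITY_COMPARISON = auto()


@dataclass(frozen=True)
class ResourceAcquisitionData:
    timestamp: float                 # first acquisition time stamp
    object_id: str                   # first object identifier
    resource_type: ResourceType      # first resource type
    resource_data: Dict[str, float]  # e.g. total + utilization, or read/write rate


@dataclass(frozen=True)
class ServiceAcquisitionData:
    timestamp: float        # second acquisition time stamp
    object_id: str          # second object identifier
    service_name: str       # first service name
    cumulative_value: int   # first service cumulative value
    succeeded: bool         # first service return state (success / failure)


@dataclass(frozen=True)
class ResourceAbnormalEvent:
    timestamp: float
    object_id: str
    resource_type: ResourceType
    level: str              # "middle" or "high"


@dataclass(frozen=True)
class ServiceAbnormalEvent:
    timestamp: float
    object_id: str
    service_name: str


@dataclass(frozen=True)
class QueryInstruction:
    head: InstructionHead
    target: str                              # resource type or service name (assumed encoding)
    periods: Tuple[Tuple[float, float], ...]  # one period for queries, two for comparisons
```

Query instructions carry one period and comparison instructions carry two, so a single tuple-of-periods field covers all eight instruction heads in this sketch.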
Preferably, when periodically acquiring the resource data of the first monitoring object according to the preset resource monitoring frequency to obtain the corresponding first acquisition data packet and sending it to the first main control module, the first acquisition module is specifically configured to: periodically, according to the resource monitoring frequency, acquire the CPU/GPU resource configuration and utilization of the first monitoring object to generate the corresponding total number of CPU/GPU cores and CPU/GPU core utilization rate; acquire the memory/disk/bandwidth/load resource configuration and utilization of the first monitoring object to generate the corresponding total memory/disk/bandwidth/load and memory/disk/bandwidth/load utilization rate; acquire the disk read/write rates of the first monitoring object to generate the corresponding disk reading rate and disk writing rate; take the current time as the corresponding first acquisition time stamp; and take the object identifier of the first monitoring object as the corresponding first object identifier;
When the total number of the CPU/GPU cores is not empty, forming corresponding first resource data by the total number of the CPU/GPU cores and the corresponding CPU/GPU core utilization rate, setting the corresponding first resource type as the corresponding CPU/GPU, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
when the total memory/disk/bandwidth/load is not empty, forming corresponding first resource data by the total memory/disk/bandwidth/load and the corresponding memory/disk/bandwidth/load utilization rate, setting the corresponding first resource type as the corresponding memory/disk/bandwidth/load, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
when the disk reading rate or the disk writing rate is not null, forming corresponding first resource data by the disk reading rate and the disk writing rate, setting the corresponding first resource type as disk I/O, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
And the corresponding first acquisition data packet formed by all the obtained first resource acquisition data is sent to the first main control module.
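The packet assembly described above can be sketched as follows. The shape of the `readings` mapping, the dictionary keys and the function name `build_packet` are assumptions made for illustration; how the readings are actually sampled from the node is outside the sketch.

```python
import time
from typing import Dict, List, Optional

# Resource types that report a total plus a utilization rate.
UTILIZATION_TYPES = ("cpu", "gpu", "memory", "disk", "bandwidth", "load")


def build_packet(object_id: str,
                 readings: Dict[str, Dict[str, Optional[float]]],
                 now: Optional[float] = None) -> List[dict]:
    """Assemble one acquisition data packet from raw readings (hypothetical format)."""
    timestamp = time.time() if now is None else now
    packet: List[dict] = []

    for rtype in UTILIZATION_TYPES:
        reading = readings.get(rtype)
        # A record is only produced when the total is not empty.
        if reading is not None and reading.get("total") is not None:
            packet.append({
                "timestamp": timestamp,
                "object_id": object_id,
                "resource_type": rtype,
                "resource_data": {
                    "total": reading["total"],
                    "utilization": reading.get("utilization"),
                },
            })

    io = readings.get("disk_io")
    # Disk I/O is recorded when either the read rate or the write rate is present.
    if io is not None and (io.get("read_rate") is not None
                           or io.get("write_rate") is not None):
        packet.append({
            "timestamp": timestamp,
            "object_id": object_id,
            "resource_type": "disk_io",
            "resource_data": {
                "read_rate": io.get("read_rate"),
                "write_rate": io.get("write_rate"),
            },
        })
    return packet
```

Every record in one packet shares the same time stamp and object identifier, which matches the text's use of a single "current time" per acquisition cycle.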
Preferably, when performing specified service call monitoring on the first monitoring object according to the preset monitoring service set, acquiring the service data monitored each time to generate the corresponding first service acquisition data and sending it to the first main control module, the first acquisition module is specifically configured to: set, on the first monitoring object, a counter for each monitoring service record of the monitoring service set as the corresponding monitoring service counter, and periodically clear each monitoring service counter according to a preset counter reset frequency;
each time the first monitoring object calls any service interface, taking the currently called service interface as the corresponding current service interface; taking the monitoring service record in the monitoring service set whose monitoring service calling interface matches the current service interface as the corresponding current record; and, when the current record is not empty, adding 1 to the monitoring service counter corresponding to the current record, extracting the count value of the monitoring service counter after the increment as the corresponding first service cumulative value, taking the current time as the corresponding second acquisition time stamp, taking the object identifier of the first monitoring object as the corresponding second object identifier, taking the monitoring service name of the current record as the corresponding first service name, taking the service success/failure state returned by the current service interface as the corresponding first service return state, and forming the corresponding first service acquisition data from the obtained second acquisition time stamp, second object identifier, first service name, first service cumulative value and first service return state and sending it to the first main control module.
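The per-service counters and the per-call record described above can be sketched like this. The class name `ServiceCallMonitor`, the `send` callback and the dictionary keys are hypothetical; scheduling the periodic counter reset (for example with a timer) is left to the caller.

```python
import time
from typing import Callable, Dict, Optional


class ServiceCallMonitor:
    """Counts calls to monitored service interfaces and emits one record per call."""

    def __init__(self, object_id: str,
                 monitored: Dict[str, str],
                 send: Callable[[dict], None]) -> None:
        # `monitored` maps a monitoring service calling interface
        # to its monitoring service name (assumed representation).
        self._object_id = object_id
        self._monitored = dict(monitored)
        self._counters = {name: 0 for name in self._monitored.values()}
        self._send = send

    def reset_counters(self) -> None:
        """Clear every counter; intended to run at the counter reset frequency."""
        for name in self._counters:
            self._counters[name] = 0

    def on_call(self, interface: str, succeeded: bool,
                now: Optional[float] = None) -> Optional[dict]:
        """Handle one service call; interfaces outside the monitoring set are ignored."""
        name = self._monitored.get(interface)
        if name is None:          # no matching monitoring service record
            return None
        self._counters[name] += 1
        record = {
            "timestamp": time.time() if now is None else now,
            "object_id": self._object_id,
            "service_name": name,
            "cumulative_value": self._counters[name],
            "return_state": "success" if succeeded else "failure",
        }
        self._send(record)        # forward to the main control module
        return record
```

A short usage example: passing `sent.append` as `send` collects the records in a list, which is convenient for testing before wiring in a real transport.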
Preferably, the first main control module is specifically configured to identify the first resource type of each first resource acquisition data of the first acquisition data packet when the first abnormal event set is obtained by performing abnormal resource event identification processing according to the first acquisition data packet;
if the first resource type is CPU/GPU, when the CPU/GPU core utilization rate of the corresponding first resource data falls within a preset medium-risk-level CPU/GPU utilization rate range or a preset high-risk-level CPU/GPU utilization rate range, setting the corresponding first event anomaly level to the middle level or the high level respectively, taking the corresponding first acquisition time stamp, first object identifier and first resource type as the corresponding first event time stamp, first event object identifier and first event resource type, and forming a corresponding first abnormal event from the obtained first event time stamp, first event object identifier, first event resource type and first event anomaly level;
if the first resource type is memory/disk/bandwidth/load, when the memory/disk/bandwidth/load utilization rate of the corresponding first resource data falls within a preset medium-risk-level memory/disk/bandwidth/load utilization rate range or a preset high-risk-level memory/disk/bandwidth/load utilization rate range, setting the corresponding first event anomaly level to the middle level or the high level respectively, taking the corresponding first acquisition time stamp, first object identifier and first resource type as the corresponding first event time stamp, first event object identifier and first event resource type, and forming a corresponding first abnormal event from the obtained first event time stamp, first event object identifier, first event resource type and first event anomaly level;
if the first resource type is disk I/O, when the disk read/write rate of the corresponding first resource data falls within a preset medium-risk-level read/write rate range or a preset high-risk-level read/write rate range, setting the corresponding first event anomaly level to the middle level or the high level respectively, taking the corresponding first acquisition time stamp, first object identifier and first resource type as the corresponding first event time stamp, first event object identifier and first event resource type, and forming a corresponding first abnormal event from the obtained first event time stamp, first event object identifier, first event resource type and first event anomaly level;
and forming a corresponding first abnormal event set by all the obtained first abnormal events.
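The range-based classification described above can be sketched as below. The numeric ranges in `THRESHOLDS`, the half-open interval convention and the choice to test the larger of the read and write rates for disk I/O are all assumptions for the example; the text only says the ranges are preset.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]  # half-open interval [low, high)

# Hypothetical preset ranges; real deployments would configure these.
THRESHOLDS: Dict[str, Dict[str, Range]] = {
    "utilization": {"middle": (0.70, 0.90), "high": (0.90, 1.01)},
    "disk_io": {"middle": (200.0, 400.0), "high": (400.0, float("inf"))},
}


def classify(value: float, ranges: Dict[str, Range]) -> Optional[str]:
    """Return 'high' or 'middle' when value falls in a risk range, else None."""
    for level in ("high", "middle"):
        low, high = ranges[level]
        if low <= value < high:
            return level
    return None


def identify_resource_events(packet: List[dict]) -> List[dict]:
    """Build the abnormal event set for one acquisition data packet."""
    events: List[dict] = []
    for item in packet:
        data = item["resource_data"]
        level: Optional[str] = None
        if item["resource_type"] == "disk_io":
            rates = [v for v in (data.get("read_rate"), data.get("write_rate"))
                     if v is not None]
            if rates:
                # Assumption: the larger of the two rates decides the level.
                level = classify(max(rates), THRESHOLDS["disk_io"])
        else:
            utilization = data.get("utilization")
            if utilization is not None:
                level = classify(utilization, THRESHOLDS["utilization"])
        if level is not None:
            events.append({
                "timestamp": item["timestamp"],
                "object_id": item["object_id"],
                "resource_type": item["resource_type"],
                "level": level,
            })
    return events
```

An empty result corresponds to the "first abnormal event set is empty" case, in which nothing is stored or forwarded to the early warning module.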
Preferably, when performing abnormal service event identification processing according to the first service acquisition data to obtain the corresponding second abnormal event, the first main control module is specifically configured to identify whether the first service return state of the first service acquisition data is the failure state; if so, the second acquisition time stamp, the second object identifier and the first service name of the first service acquisition data are taken as the corresponding second event time stamp, second event object identifier and first event service name to form the corresponding second abnormal event.
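The failure check described above reduces to a small function. The dictionary keys and the `"failure"` string are assumptions for this sketch; returning `None` stands for an empty second abnormal event.

```python
from typing import Optional


def identify_service_event(record: dict) -> Optional[dict]:
    """Return a service abnormal event when the call failed, else None."""
    if record["return_state"] != "failure":
        return None
    # Copy the time stamp, object identifier and service name into the event.
    return {
        "timestamp": record["timestamp"],
        "object_id": record["object_id"],
        "service_name": record["service_name"],
    }
```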
Preferably, when performing data query processing on the first storage module according to the first query instruction to generate the corresponding first query feedback and sending it back to the first query module, the first main control module is specifically configured to: extract the corresponding first instruction head and first instruction parameter from the first query instruction; and identify the first instruction head;
if the first instruction head is a resource monitoring query instruction, extracting the corresponding first query resource type and the first resource query period from the first instruction parameter; extracting all the first resource acquisition data in the first storage module, wherein the first resource type is matched with the first query resource type, and the first acquisition time stamp is matched with the first resource query period to form corresponding first query feedback;
if the first instruction head is a resource abnormality query instruction, extracting the corresponding second query resource type and second resource query period from the first instruction parameter; extracting all the first abnormal events in the first storage module whose first event resource type matches the second query resource type and whose first event time stamp falls within the second resource query period, to form the corresponding first query feedback;
if the first instruction head is a resource monitoring comparison instruction, extracting the corresponding third query resource type, first-one period and first-two period from the first instruction parameter; averaging the CPU core/GPU core/memory/disk/bandwidth/load utilization rates of the first resource data of all the first resource acquisition data in the first storage module whose first resource type matches the third query resource type and whose first acquisition time stamp falls within the first-one period, to obtain a corresponding first utilization rate average value; averaging the CPU core/GPU core/memory/disk/bandwidth/load utilization rates of the first resource data of all the first resource acquisition data in the first storage module whose first resource type matches the third query resource type and whose first acquisition time stamp falls within the first-two period, to obtain a corresponding second utilization rate average value; calculating a corresponding first comparison value from the first and second utilization rate average values, where first comparison value = (first utilization rate average value - second utilization rate average value) / second utilization rate average value; and taking the obtained first comparison value as the corresponding first query feedback;
if the first instruction head is a resource abnormality comparison instruction, extracting the corresponding fourth query resource type, first-three period and first-four period from the first instruction parameter; counting the first abnormal events in the first storage module whose first event resource type matches the fourth query resource type and whose first event time stamp falls within the first-three period, to obtain a corresponding first abnormal number; counting the first abnormal events in the first storage module whose first event resource type matches the fourth query resource type and whose first event time stamp falls within the first-four period, to obtain a corresponding second abnormal number; calculating a corresponding second comparison value from the first and second abnormal numbers, where second comparison value = (first abnormal number - second abnormal number) / second abnormal number; and taking the obtained second comparison value as the corresponding first query feedback;
if the first instruction head is a service monitoring query instruction, extracting the corresponding first query service name and first service query period from the first instruction parameter; extracting all the first service acquisition data in the first storage module whose first service name matches the first query service name and whose second acquisition time stamp falls within the first service query period, to form the corresponding first query feedback;
if the first instruction head is a service abnormality query instruction, extracting the corresponding second query service name and second service query period from the first instruction parameter; extracting all the second abnormal events in the first storage module whose first event service name matches the second query service name and whose second event time stamp falls within the second service query period, to form the corresponding first query feedback;
if the first instruction head is a service monitoring comparison instruction, extracting the corresponding third query service name, second-one period and second-two period from the first instruction parameter; summing the first service cumulative values of all the first service acquisition data in the first storage module whose first service name matches the third query service name and whose second acquisition time stamp falls within the second-one period, to obtain a corresponding first sum; summing the first service cumulative values of all the first service acquisition data in the first storage module whose first service name matches the third query service name and whose second acquisition time stamp falls within the second-two period, to obtain a corresponding second sum; calculating a corresponding third comparison value from the first sum and the second sum, where third comparison value = (first sum - second sum) / second sum; and taking the obtained third comparison value as the corresponding first query feedback;
if the first instruction head is a service abnormality comparison instruction, extracting the corresponding fourth query service name, second-three period and second-four period from the first instruction parameter; counting the second abnormal events in the first storage module whose first event service name matches the fourth query service name and whose second event time stamp falls within the second-three period, to obtain a corresponding third abnormal number; counting the second abnormal events in the first storage module whose first event service name matches the fourth query service name and whose second event time stamp falls within the second-four period, to obtain a corresponding fourth abnormal number; calculating a corresponding fourth comparison value from the third and fourth abnormal numbers, where fourth comparison value = (third abnormal number - fourth abnormal number) / fourth abnormal number; and taking the obtained fourth comparison value as the corresponding first query feedback;
and sending the obtained first query feedback back to the first query module.
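All four comparison instructions above share the same relative-change formula, (first value - second value) / second value. The sketch below shows that formula applied to the utilization average and to abnormal event counts over an in-memory list standing in for the storage module. The function names, the half-open period convention and returning `None` when a period has no data or the baseline is zero are assumptions; the text does not say how those cases are handled.

```python
from statistics import mean
from typing import List, Optional, Tuple

Period = Tuple[float, float]  # (start, end), treated as half-open [start, end)


def in_period(timestamp: float, period: Period) -> bool:
    return period[0] <= timestamp < period[1]


def relative_change(current: float, baseline: float) -> Optional[float]:
    """(current - baseline) / baseline; None when the baseline is zero (assumption)."""
    if baseline == 0:
        return None
    return (current - baseline) / baseline


def resource_monitoring_query(store: List[dict], resource_type: str,
                              period: Period) -> List[dict]:
    """Records of one resource type whose time stamp falls inside the period."""
    return [r for r in store
            if r["resource_type"] == resource_type
            and in_period(r["timestamp"], period)]


def resource_monitoring_comparison(store: List[dict], resource_type: str,
                                   period_a: Period,
                                   period_b: Period) -> Optional[float]:
    """Relative change of the average utilization between two periods."""
    a = [r["resource_data"]["utilization"]
         for r in resource_monitoring_query(store, resource_type, period_a)]
    b = [r["resource_data"]["utilization"]
         for r in resource_monitoring_query(store, resource_type, period_b)]
    if not a or not b:
        return None
    return relative_change(mean(a), mean(b))


def abnormal_count_comparison(events: List[dict], key: str, value: str,
                              period_a: Period,
                              period_b: Period) -> Optional[float]:
    """Relative change of the abnormal event count between two periods."""
    count_a = sum(1 for e in events
                  if e[key] == value and in_period(e["timestamp"], period_a))
    count_b = sum(1 for e in events
                  if e[key] == value and in_period(e["timestamp"], period_b))
    return relative_change(count_a, count_b)
```

Passing `key="resource_type"` covers the resource abnormality comparison and `key="service_name"` covers the service abnormality comparison, so one helper serves both event kinds in this sketch.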
Preferably, the first early warning module is specifically configured to, when performing resource abnormal event early warning processing according to the first abnormal event set, bring the first abnormal event set into a preset resource abnormal early warning message template to perform early warning message packaging to obtain a corresponding first resource abnormal early warning message; and sending the first resource abnormality early warning message to a preset early warning message notification interface.
Preferably, the first early warning module is specifically configured to, when performing service abnormal event early warning processing according to the second abnormal event, bring the second abnormal event into a preset service abnormal early warning message template to perform early warning message encapsulation to obtain a corresponding first service abnormal early warning message; and sending the first service abnormality early warning message to a preset early warning message notification interface.
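The early warning message packaging described in the two preceding paragraphs can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the template text, the field names (`object_id`, `resource_type`, `level`, `timestamp`) and the `notify` callable that stands in for the preset early warning message notification interface. The source only states that a preset template and a preset interface exist.

```python
from string import Template

# Hypothetical template text; the source does not disclose the real template.
RESOURCE_ALERT_TEMPLATE = Template(
    "[resource anomaly] object=$object_id type=$resource_type "
    "level=$level time=$timestamp"
)


def package_resource_alert(abnormal_events):
    """Fill every first abnormal event into the template, one line per event."""
    return "\n".join(RESOURCE_ALERT_TEMPLATE.substitute(e) for e in abnormal_events)


def send_alert(message, notify=print):
    """Hand the packaged message to the notification interface (assumed callable)."""
    notify(message)
    return message


events = [
    {"timestamp": 1700000000, "object_id": "node-1",
     "resource_type": "CPU", "level": "high"},
]
message = package_resource_alert(events)
```

A service abnormality early warning message could be packaged the same way with a second template that carries the service name instead of the resource type and level.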
The embodiment of the invention provides a processing system for event monitoring, which comprises: a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module. Each first monitoring object is a node object in the service system (such as an Internet of Things device, a network device, a server, a database or a virtual machine). A first acquisition module is preset on each first monitoring object; it periodically acquires the resource information of the monitoring object and sends the resulting first acquisition data packet to the first main control module for analysis, and, each time the monitoring object calls any designated service interface, it acquires the information of the current service interface once and sends the resulting first service acquisition data to the first main control module for analysis. The first main control module stores the received first acquisition data packet and first service acquisition data in the first storage module for subsequent query and performs abnormal resource event identification and abnormal service event identification on them; when an abnormal event is identified, it stores the resulting abnormal resource event set (namely the first abnormal event set) or abnormal service event (namely the second abnormal event) in the first storage module for subsequent query, and at the same time sends the first abnormal event set or the second abnormal event to the first early warning module for the corresponding abnormal resource event early warning or abnormal service event early warning. In addition, the first main control module provides operation and maintenance personnel with eight types of query instructions (a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction), and query interaction with the operation and maintenance personnel is completed through the first query module. The system of the invention adds a service event monitoring mode while remaining compatible with the conventional resource event monitoring mode, so that it can monitor the services called by node objects and identify and give early warning of abnormal service events occurring in the system. With the eight types of query instructions, the system can automatically retrieve the time-series data required for a resource acquisition data-time curve, an abnormal resource event-time curve, a service acquisition data-time curve and an abnormal service event-time curve, and can automatically calculate, for two designated periods (two same-ratio periods or two ring-ratio periods), the comparison results of the average resource utilization rate, the total number of abnormal resource events, the total number of service calls and the total number of abnormal service events. The system of the invention thus further improves the operational stability of the service system, reduces the workload of operation and maintenance personnel and improves working efficiency.
Drawings
Fig. 1 is a schematic block diagram of a processing system for event monitoring according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
An embodiment of the present invention provides a processing system for event monitoring, as shown in fig. 1, which is a schematic block diagram of a processing system for event monitoring according to an embodiment of the present invention, where the system includes: the system comprises a plurality of first monitoring objects 1, a first main control module 2, a first storage module 3, a first query module 4 and a first early warning module 5.
A first acquisition module 11 is preset on each first monitoring object 1; the first acquisition module 11 is connected with the first main control module 2; the first main control module 2 is also respectively connected with the first storage module 3, the first query module 4 and the first early warning module 5.
Here, the first monitoring object 1 in the embodiment of the present invention is a platform node object in a service system, such as an internet of things device, a network device, a server, a database, a virtual machine, and the like; the first acquisition module 11 is a software or hardware module preset on the first monitored object 1, and is used for data acquisition. The first main control module 2 is a software or hardware module, device, equipment, server, system or platform for implementing the functions of the following main control modules. The first storage module 3 is a software or hardware module, device, apparatus, server, database, system or platform that implements the functions of the following storage modules. The first query module 4 is a software or hardware module, device, apparatus, server, system or platform that implements the functionality of the following query modules. The first early warning module 5 is a software or hardware module, device, equipment, server, system or platform that implements the functions of the following early warning modules.
(I) The first acquisition module 11 on the first monitoring object 1:
The first acquisition module 11 is configured to periodically acquire resource data of the first monitoring object 1 according to a preset resource monitoring frequency to obtain a corresponding first acquisition data packet, and to send the first acquisition data packet to the first main control module 2. Here, the resource monitoring frequency is a preset time frequency parameter that can be adjusted according to specific monitoring requirements.
The first acquisition data packet consists of a plurality of first resource acquisition data; the first resource acquisition data comprises a first acquisition time stamp, a first object identifier, a first resource type and first resource data; the first resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O; when the first resource type is CPU/GPU, the corresponding first resource data comprises the total number of CPU/GPU cores and the utilization rate of the CPU/GPU cores; when the first resource type is memory/disk/bandwidth/load, the corresponding first resource data comprises total memory/disk/bandwidth/load and utilization rate of memory/disk/bandwidth/load; when the first resource type is disk I/O, the corresponding first resource data comprises disk reading rate and disk writing rate.
The first acquisition module 11 is further configured to perform specified service call monitoring on the first monitored object 1 according to a preset monitoring service set, acquire service data monitored each time, generate corresponding first service acquisition data, and send the first service acquisition data to the first main control module 2.
Wherein the monitoring service set comprises a plurality of monitoring service records; each monitoring service record comprises a monitoring service name and a monitoring service calling interface; the monitoring service names cover multiple classes of specified service names, for example a waiting short message buffer service, an uplink short message service, a downlink short message service and a callback short message service; monitoring service records can be added to or removed from the monitoring service set according to specific monitoring requirements.
The first service acquisition data comprises a second acquisition time stamp, a second object identifier, a first service name, a first service cumulative value and a first service return state; the first service return state includes a success state and a failure state.
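The two record layouts described above can be written down as plain data types. The following Python sketch is illustrative only: the class names, field names and the `"success"`/`"failure"` string encoding are assumptions, and the source does not prescribe any serialization or in-memory format.

```python
from dataclasses import dataclass
from typing import Dict

RESOURCE_TYPES = ("CPU", "GPU", "memory", "disk", "bandwidth", "load", "disk I/O")


@dataclass(frozen=True)
class ResourceAcquisitionData:
    timestamp: float                 # first acquisition timestamp
    object_id: str                   # first object identifier
    resource_type: str               # first resource type, one of RESOURCE_TYPES
    resource_data: Dict[str, float]  # e.g. {"total": 8, "utilization": 0.42}

    def __post_init__(self):
        if self.resource_type not in RESOURCE_TYPES:
            raise ValueError(f"unknown resource type: {self.resource_type}")


@dataclass(frozen=True)
class ServiceAcquisitionData:
    timestamp: float       # second acquisition timestamp
    object_id: str         # second object identifier
    service_name: str      # first service name
    cumulative_value: int  # first service cumulative value
    return_state: str      # first service return state: "success" or "failure"

    def __post_init__(self):
        if self.return_state not in ("success", "failure"):
            raise ValueError(f"unknown return state: {self.return_state}")
```

A first acquisition data packet would then simply be a list of `ResourceAcquisitionData` records that share one timestamp and one object identifier.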
In a specific implementation manner of the embodiment of the present invention, when periodically acquiring resource data of the first monitoring object 1 according to the preset resource monitoring frequency to obtain a corresponding first acquisition data packet and sending it to the first main control module 2, the first acquisition module 11 is specifically configured to perform the following steps:
step A1, carrying out data acquisition on CPU/GPU resource configuration and utilization rate of a first monitoring object 1 according to resource monitoring frequency to generate corresponding total number of CPU/GPU cores and CPU/GPU core utilization rate; the data acquisition is carried out on the memory/disk/bandwidth/load resource allocation and the utilization rate of the first monitoring object 1 to generate corresponding total memory/disk/bandwidth/load and utilization rate of the memory/disk/bandwidth/load; data acquisition is carried out on the disk read/write rate of the first monitoring object 1 to generate a corresponding disk read rate and disk write rate; the current time is used as a corresponding first acquisition time stamp; the object identifier of the first monitoring object 1 is used as a corresponding first object identifier;
step A2, when the total number of CPU/GPU cores is not empty, forming corresponding first resource data from the total number of CPU/GPU cores and the corresponding CPU/GPU core utilization rate, setting the corresponding first resource type to CPU/GPU accordingly, and forming corresponding first resource acquisition data from the obtained first acquisition timestamp, first object identifier, first resource type and first resource data;
step A3, when the total amount of memory/disk/bandwidth/load is not empty, forming corresponding first resource data from the total amount of memory/disk/bandwidth/load and the corresponding memory/disk/bandwidth/load utilization rate, setting the corresponding first resource type to memory/disk/bandwidth/load accordingly, and forming corresponding first resource acquisition data from the obtained first acquisition timestamp, first object identifier, first resource type and first resource data;
step A4, when the disk reading rate or the disk writing rate is not null, forming corresponding first resource data by the disk reading rate and the disk writing rate, setting the corresponding first resource type as disk I/O, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
step A5, forming a corresponding first acquisition data packet from all the obtained first resource acquisition data, and sending it to the first main control module 2.
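Steps A1 to A5 can be sketched as one packet-building function. This is a minimal Python sketch under stated assumptions: the actual probing of CPU, memory and disk figures is left to the caller, who passes them in as the hypothetical `totals` and `utilization` mappings; the dictionary keys are illustrative; and "empty" is modeled as a missing or `None` value.

```python
import time

UTILIZATION_TYPES = ("CPU", "GPU", "memory", "disk", "bandwidth", "load")


def build_acquisition_packet(object_id, totals, utilization,
                             disk_read=None, disk_write=None, now=None):
    """Assemble a first acquisition data packet from already-sampled figures."""
    timestamp = time.time() if now is None else now      # step A1: shared timestamp
    packet = []
    for rtype in UTILIZATION_TYPES:                      # steps A2 and A3
        total = totals.get(rtype)
        if total is None:                                # empty total: skip this type
            continue
        packet.append({
            "timestamp": timestamp,
            "object_id": object_id,
            "resource_type": rtype,
            "resource_data": {"total": total,
                              "utilization": utilization.get(rtype)},
        })
    if disk_read is not None or disk_write is not None:  # step A4
        packet.append({
            "timestamp": timestamp,
            "object_id": object_id,
            "resource_type": "disk I/O",
            "resource_data": {"read_rate": disk_read, "write_rate": disk_write},
        })
    return packet                                        # step A5: caller sends it


packet = build_acquisition_packet(
    "node-1",
    totals={"CPU": 8, "memory": 16},
    utilization={"CPU": 0.5, "memory": 0.25},
    disk_read=120.0,
    now=100.0,
)
```

Running the function once per tick of the resource monitoring frequency, and sending the returned list to the main control module, would reproduce the periodic behavior described in the text.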
In another specific implementation manner of the embodiment of the present invention, when monitoring specified service calls on the first monitoring object 1 according to the preset monitoring service set, acquiring the service data of each monitored call to generate corresponding first service acquisition data and sending it to the first main control module 2, the first acquisition module 11 is specifically configured to perform the following steps:
step B1, setting a corresponding counter for each monitoring service record of a monitoring service set on a first monitoring object 1 to be a corresponding monitoring service counter, and resetting each monitoring service counter periodically according to preset counter resetting frequency;
step B2, each time the first monitoring object 1 calls any service interface, taking the currently called service interface as the corresponding current service interface; taking the monitoring service record in the monitoring service set whose monitoring service calling interface matches the current service interface as the corresponding current record; and, when the current record is not empty, adding 1 to the monitoring service counter corresponding to the current record, taking the count value of that counter after the increment as the corresponding first service cumulative value, taking the current time as the corresponding second acquisition timestamp, taking the object identifier of the first monitoring object 1 as the corresponding second object identifier, taking the monitoring service name of the current record as the corresponding first service name, taking the success/failure state returned by the current service interface as the corresponding first service return state, and forming corresponding first service acquisition data from the obtained second acquisition timestamp, second object identifier, first service name, first service cumulative value and first service return state and sending it to the first main control module 2.
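Steps B1 and B2 amount to a per-service counter that is incremented on every monitored call and cleared on a timer. The Python sketch below assumes a hypothetical `ServiceCallMonitor` class, a mapping from calling interface to monitoring service name, and illustrative interface paths; how calls are intercepted and how the reset timer is scheduled are outside the sketch.

```python
import time


class ServiceCallMonitor:
    def __init__(self, object_id, monitored_services):
        # monitored_services: {monitoring service calling interface: monitoring service name}
        self.object_id = object_id
        self.monitored_services = dict(monitored_services)
        # Step B1: one counter per monitoring service record.
        self.counters = {name: 0 for name in self.monitored_services.values()}

    def reset_counters(self):
        """Step B1: to be invoked at the preset counter reset frequency."""
        for name in self.counters:
            self.counters[name] = 0

    def on_call(self, interface, succeeded, now=None):
        """Step B2: build first service acquisition data, or None if unmonitored."""
        name = self.monitored_services.get(interface)
        if name is None:                      # current record is empty
            return None
        self.counters[name] += 1
        return {
            "timestamp": time.time() if now is None else now,
            "object_id": self.object_id,
            "service_name": name,
            "cumulative_value": self.counters[name],
            "return_state": "success" if succeeded else "failure",
        }


monitor = ServiceCallMonitor("node-1", {"/sms/uplink": "uplink short message service"})
first = monitor.on_call("/sms/uplink", succeeded=True, now=1.0)
second = monitor.on_call("/sms/uplink", succeeded=False, now=2.0)
```

Each non-`None` return value corresponds to one first service acquisition data record that would be sent to the first main control module 2.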
(II) The first main control module 2:
the first main control module 2 is used for storing the first acquired data packet into the first storage module 3 when the first acquired data packet is received; carrying out abnormal resource event identification processing according to the first acquisition data packet to obtain a corresponding first abnormal event set; and when the first abnormal event set is not empty, storing the first abnormal event set into the first storage module 3, and sending the first abnormal event set to the first early warning module 5.
Wherein the first set of abnormal events includes one or more first abnormal events; the first exception event includes a first event timestamp, a first event object identification, a first event resource type, and a first event exception level; the first event resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O; the first event exception level includes a medium level and a high level.
The first main control module 2 is used for carrying out abnormal service event identification processing according to the first service acquisition data to obtain a corresponding second abnormal event when the first service acquisition data is received; and when the second abnormal event is not empty, the second abnormal event is stored in the first storage module 3, and the second abnormal event is sent to the first early warning module 5.
Wherein the second exception event includes a second event timestamp, a second event object identification, and a first event service name.
The first main control module 2 is further configured to receive a first query instruction sent by the first query module 4; and performs data query processing on the first storage module 3 according to the first query instruction to generate corresponding first query feedback and send the corresponding first query feedback back to the first query module 4.
The first query instruction comprises a first instruction head and a first instruction parameter;
the first instruction head comprises a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction; the instruction parameters of the eight inquiry instructions are:
1) When the first instruction head is a resource monitoring query instruction, the corresponding first instruction parameters comprise a first query resource type and a first resource query period; the first query resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O;
here, the resource monitoring query instruction is used for querying a time sequence data sequence of the resource acquisition data-time curve;
2) When the first instruction head is a resource abnormality query instruction, the corresponding first instruction parameters comprise a second query resource type and a second resource query period; the second query resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O;
Here, the resource anomaly query instruction is used for querying a time sequence data sequence of an anomaly resource event-time curve;
3) When the first instruction head is a resource monitoring comparison instruction, the corresponding first instruction parameters comprise a third query resource type, a first first period and a first second period; the third query resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O;
here, the resource monitoring comparison instruction is used for querying the comparison result of the average resource utilization rate of two designated periods (two same-ratio periods or two ring-ratio periods);
4) When the first instruction head is a resource abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query resource type, a first third period and a first fourth period; the fourth query resource type comprises CPU, GPU, memory, disk, bandwidth, load and disk I/O;
here, the resource abnormality comparison instruction is used for querying the comparison result of the total number of abnormal resource events of two designated periods (two same-ratio periods or two ring-ratio periods);
5) When the first instruction header is a service monitoring query instruction, the corresponding first instruction parameters comprise a first query service name and a first service query period;
here, the service monitoring query instruction is used for querying a time sequence data sequence of the service acquisition data-time curve;
6) When the first instruction head is a service abnormality query instruction, the corresponding first instruction parameters comprise a second query service name and a second service query period;
here, the service anomaly query instruction is used for querying a time sequence data sequence of an anomaly service event-time curve;
7) When the first instruction head is a service monitoring comparison instruction, the corresponding first instruction parameters comprise a third query service name, a second first period and a second second period;
here, the service monitoring comparison instruction is used for inquiring the service call total comparison result of two designated time periods (two same comparison time periods or two ring comparison time periods);
8) When the first instruction header is a service abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query service name, a second third period and a second fourth period.
Here, the service anomaly comparison instruction is used for querying the total quantity comparison result of the anomaly service events in two specified periods (two same-ratio periods or two ring-ratio periods).
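The eight instruction heads and their parameters follow a regular pattern: a resource type or a service name, plus one period for a query or two periods for a comparison. The Python sketch below captures that pattern; the enum member names and the parameter keys (`resource_type`, `service_name`, `period`, `period_1`, `period_2`) are illustrative assumptions, not names defined by the source.

```python
from enum import Enum


class InstructionHead(Enum):
    RESOURCE_MONITORING_QUERY = 1
    RESOURCE_ABNORMALITY_QUERY = 2
    RESOURCE_MONITORING_COMPARISON = 3
    RESOURCE_ABNORMALITY_COMPARISON = 4
    SERVICE_MONITORING_QUERY = 5
    SERVICE_ABNORMALITY_QUERY = 6
    SERVICE_MONITORING_COMPARISON = 7
    SERVICE_ABNORMALITY_COMPARISON = 8


def expected_parameters(head):
    """Parameter names that a first query instruction with this head must carry."""
    subject = "resource_type" if head.name.startswith("RESOURCE") else "service_name"
    if head.name.endswith("COMPARISON"):
        return (subject, "period_1", "period_2")   # two periods to compare
    return (subject, "period")                     # one query period


def validate_instruction(head, params):
    """Raise ValueError when a required first instruction parameter is missing."""
    missing = [key for key in expected_parameters(head) if key not in params]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return True
```

For example, `expected_parameters(InstructionHead.SERVICE_MONITORING_COMPARISON)` yields the service name and two periods, matching item 7) above.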
In another specific implementation manner of the embodiment of the present invention, the first main control module 2 is specifically configured to, when performing abnormal resource event identification processing according to the first collected data packet to obtain a corresponding first abnormal event set:
Step C1, identifying a first resource type of each first resource acquisition data of a first acquisition data packet;
step C2, if the first resource type is CPU/GPU, when the CPU/GPU core utilization rate of the corresponding first resource data falls within the preset medium-risk-level CPU/GPU utilization rate range or the preset high-risk-level CPU/GPU utilization rate range, setting the corresponding first event exception level to the medium level or the high level respectively; using the corresponding first acquisition timestamp, first object identifier and first resource type as the corresponding first event timestamp, first event object identifier and first event resource type; and forming a corresponding first abnormal event from the obtained first event timestamp, first event object identifier, first event resource type and first event exception level;
step C3, if the first resource type is memory/disk/bandwidth/load, when the memory/disk/bandwidth/load utilization rate of the corresponding first resource data falls within the preset medium-risk-level memory/disk/bandwidth/load utilization rate range or the preset high-risk-level memory/disk/bandwidth/load utilization rate range, setting the corresponding first event exception level to the medium level or the high level respectively; using the corresponding first acquisition timestamp, first object identifier and first resource type as the corresponding first event timestamp, first event object identifier and first event resource type; and forming a corresponding first abnormal event from the obtained first event timestamp, first event object identifier, first event resource type and first event exception level;
Here, the first main control module 2 in the embodiment of the present invention locally presets 6 medium-risk-level utilization rate ranges and 6 high-risk-level utilization rate ranges for the risk level identification in steps C2 and C3; the 6 medium-risk-level utilization rate ranges are the medium-risk-level CPU, GPU, memory, disk, bandwidth and load utilization rate ranges; the 6 high-risk-level utilization rate ranges are the high-risk-level CPU, GPU, memory, disk, bandwidth and load utilization rate ranges;
step C4, if the first resource type is disk I/O, when the disk read/write rate of the corresponding first resource data falls within the preset medium-risk-level read/write rate range or the preset high-risk-level read/write rate range, setting the corresponding first event exception level to the medium level or the high level respectively; using the corresponding first acquisition timestamp, first object identifier and first resource type as the corresponding first event timestamp, first event object identifier and first event resource type; and forming a corresponding first abnormal event from the obtained first event timestamp, first event object identifier, first event resource type and first event exception level;
here, the first main control module 2 in the embodiment of the present invention locally presets 1 medium-risk-level disk read rate range, 1 medium-risk-level disk write rate range, 1 high-risk-level disk read rate range and 1 high-risk-level disk write rate range for the risk level identification in step C4;
And step C5, forming a corresponding first abnormal event set by all the obtained first abnormal events.
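Steps C1 to C5 can be sketched as a range lookup per resource type. The Python sketch below rests on several assumptions that the source leaves open: ranges are modeled as half-open `(low, high)` intervals, the high-risk range is checked before the medium-risk range, the range tables are keyed by resource type (with hypothetical keys `"disk read"` and `"disk write"` for disk I/O), and for disk I/O the more severe of the read and write levels is reported.

```python
def classify(value, medium_range, high_range):
    """Return "high", "medium" or None for a value against two (low, high) ranges."""
    def inside(bounds):
        low, high = bounds
        return value is not None and low <= value < high
    if inside(high_range):
        return "high"
    if inside(medium_range):
        return "medium"
    return None


def identify_resource_anomalies(packet, medium_ranges, high_ranges):
    """Build the first abnormal event set from a first acquisition data packet."""
    events = []
    for item in packet:                                   # step C1
        rtype, data = item["resource_type"], item["resource_data"]
        if rtype == "disk I/O":                           # step C4
            levels = [
                classify(data.get("read_rate"),
                         medium_ranges["disk read"], high_ranges["disk read"]),
                classify(data.get("write_rate"),
                         medium_ranges["disk write"], high_ranges["disk write"]),
            ]
            # Assumption: report the more severe of the read and write levels.
            level = "high" if "high" in levels else (
                "medium" if "medium" in levels else None)
        else:                                             # steps C2 and C3
            level = classify(data.get("utilization"),
                             medium_ranges[rtype], high_ranges[rtype])
        if level is not None:
            events.append({"timestamp": item["timestamp"],
                           "object_id": item["object_id"],
                           "resource_type": rtype,
                           "level": level})
    return events                                         # step C5
```

An empty returned list corresponds to the case where the first abnormal event set is empty and nothing is stored or forwarded to the early warning module.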
In another specific implementation manner of the embodiment of the present invention, the first main control module 2 is specifically configured to identify, when an abnormal service event identification process is performed according to the first service collected data to obtain a corresponding second abnormal event, whether a first service return state of the first service collected data is a failure state; if yes, a second collection time stamp, a second object identifier and a first service name of the first service collection data are used as a corresponding second event time stamp, a second event object identifier and a first event service name to form a corresponding second abnormal event.
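The abnormal service event identification described above reduces to a single check on the return state. A minimal Python sketch, assuming dictionary records with illustrative keys and a `"failure"` string for the failure state:

```python
def identify_service_anomaly(service_data):
    """Return a second abnormal event for a failed call, otherwise None."""
    if service_data["return_state"] != "failure":
        return None
    return {
        "timestamp": service_data["timestamp"],        # second event timestamp
        "object_id": service_data["object_id"],        # second event object identifier
        "service_name": service_data["service_name"],  # first event service name
    }
```

A `None` result models an empty second abnormal event, in which case nothing is stored or forwarded to the early warning module.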
In another specific implementation manner of the embodiment of the present invention, the first main control module 2 is specifically configured to, when performing data query processing on the first storage module 3 according to the first query instruction to generate a corresponding first query feedback and send the corresponding first query feedback back to the first query module 4:
step D1, extracting the corresponding first instruction head and first instruction parameter from the first query instruction; identifying the first instruction head;
step D2, if the first instruction head is a resource monitoring query instruction, extracting a corresponding first query resource type and a first resource query period from the first instruction parameter; extracting all first resource acquisition data of which the first resource type is matched with the first query resource type and the first acquisition time stamp is matched with the first resource query time period in the first storage module 3 to form corresponding first query feedback;
Here, when the first instruction header in the embodiment of the present invention is a resource monitoring query instruction, the first query feedback correspondingly output is actually a time sequence required for constructing a resource acquisition data-time curve;
step D3, if the first instruction head is a resource abnormal inquiry instruction, extracting a corresponding second inquiry resource type and a second resource inquiry period from the first instruction parameter; extracting all first abnormal events in the first storage module 3, wherein the first event resource type is matched with the second query resource type, and the first event time stamp is matched with the second resource query period to form corresponding first query feedback;
here, when the first instruction header in the embodiment of the present invention is a resource abnormal query instruction, the first query feedback correspondingly output is actually a time sequence data sequence required for constructing an abnormal resource event-time curve;
step D4, if the first instruction head is a resource monitoring comparison instruction, extracting a corresponding third query resource type, first first period and first second period from the first instruction parameter; calculating the average of the CPU core/GPU core/memory/disk/bandwidth/load utilization rates in the first resource data of all first resource acquisition data in the first storage module 3 whose first resource type matches the third query resource type and whose first acquisition timestamp matches the first first period, to obtain a corresponding first utilization rate average value; calculating the average of the CPU core/GPU core/memory/disk/bandwidth/load utilization rates in the first resource data of all first resource acquisition data in the first storage module 3 whose first resource type matches the third query resource type and whose first acquisition timestamp matches the first second period, to obtain a corresponding second utilization rate average value; calculating a corresponding first comparison value from the first and second utilization rate average values, where first comparison value = (first utilization rate average value - second utilization rate average value) / second utilization rate average value; and taking the obtained first comparison value as the corresponding first query feedback;
Here, when the first instruction head in the embodiment of the present invention is a resource monitoring comparison instruction, the corresponding first query feedback is in effect the comparison result of the average resource utilization rates of two periods: if the first first period and the first second period are set to the same period in two years, the first query feedback is same-ratio (year-over-year) data; if they are set to two adjacent periods of equal length, the first query feedback is ring-ratio (period-over-period) data;
step D5, if the first instruction head is a resource abnormality comparison instruction, extracting a corresponding fourth query resource type, first third period and first fourth period from the first instruction parameter; counting the first abnormal events in the first storage module 3 whose first event resource type matches the fourth query resource type and whose first event timestamp matches the first third period, to obtain a corresponding first abnormal number; counting the first abnormal events in the first storage module 3 whose first event resource type matches the fourth query resource type and whose first event timestamp matches the first fourth period, to obtain a corresponding second abnormal number; calculating a corresponding second comparison value from the first and second abnormal numbers, where second comparison value = (first abnormal number - second abnormal number) / second abnormal number; and taking the obtained second comparison value as the corresponding first query feedback;
Here, when the first instruction head in the embodiment of the present invention is a resource abnormality comparison instruction, the corresponding first query feedback is in effect the comparison result of the total numbers of abnormal resource events of two periods: if the first third period and the first fourth period are set to the same period in two years, the first query feedback is same-ratio (year-over-year) data; if they are set to two adjacent periods of equal length, the first query feedback is ring-ratio (period-over-period) data;
step D6, if the first instruction head is a service monitoring query instruction, extracting a corresponding first query service name and first service query period from the first instruction parameter; extracting all first service acquisition data in the first storage module 3 whose first service name matches the first query service name and whose second acquisition timestamp matches the first service query period, to form the corresponding first query feedback;
here, when the first instruction header in the embodiment of the present invention is a service monitoring query instruction, the first query feedback correspondingly output is actually a time sequence required for constructing a service acquisition data-time curve;
step D7, if the first instruction head is a service abnormal inquiry instruction, extracting a corresponding second inquiry service name and a second service inquiry period from the first instruction parameter; and all second abnormal events in the first storage module 3, wherein the first event service name is matched with the second query service name, and the second event time stamp is matched with the second service query period are taken out to form corresponding first query feedback;
Here, when the first instruction header in the embodiment of the present invention is a service exception query instruction, the first query feedback correspondingly output is actually a time sequence data sequence required for constructing an exception service event-time curve;
step D8, if the first instruction head is a service monitoring comparison instruction, extracting a corresponding third query service name, a second first period and a second period from the first instruction parameter; and performing sum calculation on first service cumulative values of all first service acquired data of which the first service names are matched with the third query service names and the second acquisition time stamps are matched with the second time period in the first storage module 3 to obtain corresponding first sum; and performing sum calculation on first service cumulative values of all first service acquired data of which the first service names are matched with the third query service names and the second acquisition time stamps are matched with the second time periods in the first storage module 3 to obtain corresponding second sum; and calculating a corresponding third comparison value according to the first sum and the second sum, wherein the third comparison value is = (first sum-second sum)/second sum; and taking the obtained third comparison value as corresponding first query feedback;
Here, when the first instruction header in the embodiment of the present invention is a service monitoring comparison instruction, the first query feedback that is output is the comparison result of the total service call volume over two periods: if the second-one period and the second-two period are set to the same period in two different years, the first query feedback is year-over-year data; if they are set to two adjacent periods of equal length, the first query feedback is period-over-period (ring-ratio) data;
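The calculation in step D8 can be sketched as follows. This is a minimal sketch, not the patented implementation: the dictionary keys `service`, `ts` and `cumulative`, the half-open period convention, and the choice to return `None` when the second sum is zero are all assumptions, since the embodiment does not say how a zero denominator is handled.

```python
from datetime import datetime
from typing import Iterable, Mapping, Optional, Tuple

Period = Tuple[datetime, datetime]


def sum_cumulative(records: Iterable[Mapping], service: str, period: Period) -> float:
    """Sum the first service cumulative values for one service in one period."""
    start, end = period  # assumed half-open interval [start, end)
    return sum(
        r["cumulative"] for r in records
        if r["service"] == service and start <= r["ts"] < end
    )


def comparison_value(first_sum: float, second_sum: float) -> Optional[float]:
    """(first sum - second sum) / second sum, or None if undefined."""
    if second_sum == 0:
        return None  # assumption: the source does not define this case
    return (first_sum - second_sum) / second_sum
```

For example, a first sum of 150 against a second sum of 100 gives a comparison value of 0.5, that is, a 50% increase relative to the second period.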
Step D9, if the first instruction header is a service abnormality comparison instruction, extracting the corresponding fourth query service name, second-three period and second-four period from the first instruction parameter; counting the second abnormal events in the first storage module 3 whose first event service name matches the fourth query service name and whose second event time stamp matches the second-three period, to obtain the corresponding third abnormal number; counting the second abnormal events in the first storage module 3 whose first event service name matches the fourth query service name and whose second event time stamp matches the second-four period, to obtain the corresponding fourth abnormal number; calculating the corresponding fourth comparison value from the third and fourth abnormal numbers, where fourth comparison value = (third abnormal number - fourth abnormal number) / fourth abnormal number; and taking the obtained fourth comparison value as the corresponding first query feedback;
Here, when the first instruction header in the embodiment of the present invention is a service abnormality comparison instruction, the first query feedback that is output is the comparison result of the total number of abnormal service events over two periods: if the second-three period and the second-four period are set to the same period in two different years, the first query feedback is year-over-year data; if they are set to two adjacent periods of equal length, the first query feedback is period-over-period (ring-ratio) data;
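The two ways of choosing the compared periods can be made concrete with a small helper that derives the reference period from a given period. The function names, the (start, end) tuple representation, and the handling of 29 February are hypothetical; in the embodiment the two periods are simply supplied by the user in the instruction parameter.

```python
from datetime import datetime
from typing import Tuple

Period = Tuple[datetime, datetime]


def previous_adjacent_period(period: Period) -> Period:
    """Equal-length period that ends where the given one starts."""
    start, end = period
    return (start - (end - start), start)


def same_period_previous_year(period: Period) -> Period:
    """The same calendar period one year earlier."""
    def shift(d: datetime) -> datetime:
        try:
            return d.replace(year=d.year - 1)
        except ValueError:
            # 29 February has no counterpart in a non-leap year;
            # falling back to 28 February is an assumption.
            return d.replace(year=d.year - 1, day=28)

    start, end = period
    return (shift(start), shift(end))
```

Passing a period together with the result of `previous_adjacent_period` would yield a period-over-period comparison, while pairing it with `same_period_previous_year` would yield a year-over-year comparison.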
Step D10, sending the obtained first query feedback back to the first query module 4.
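The branching on the first instruction header in the steps above amounts to a dispatch on the instruction type. One possible sketch uses a lookup table of handlers; the dictionary-based instruction format with `head` and `params` keys, the string header names, and the choice to raise `ValueError` for an unknown header are assumptions for illustration and are not specified by the embodiment.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Any]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[Dict[str, Any]], Any]:
    """Build a function that routes a query instruction to its handler."""
    def dispatch(instruction: Dict[str, Any]) -> Any:
        head = instruction["head"]      # first instruction header
        params = instruction["params"]  # first instruction parameter
        try:
            handler = handlers[head]
        except KeyError:
            # Assumption: unknown headers are rejected outright.
            raise ValueError(f"unknown instruction head: {head!r}") from None
        return handler(params)          # result is the first query feedback

    return dispatch
```

In such a design, each of the eight instruction types would register one handler, and the value returned by `dispatch` would be what is sent back to the query module.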
(III), a first storage module 3:
the first storage module 3 is configured to store all first resource acquisition data of all first acquisition data packets. The first storage module 3 is further configured to store the first abnormal events of all the first abnormal event sets. The first storage module 3 is further configured to store all first service acquisition data. The first storage module 3 is also used for storing all second abnormal events.
(IV), a first query module 4:
the first query module 4 is configured to send a first query instruction input by a user to the first main control module 2, and to feed the first query feedback sent back by the first main control module 2 back to the user.
(V), a first early warning module 5:
the first early warning module 5 is configured to perform resource abnormal event early warning processing according to the first abnormal event set when the first abnormal event set is received.
In another specific implementation of the embodiment of the present invention, when performing resource abnormal event early warning processing according to the first abnormal event set, the first early warning module 5 is specifically configured to fill the first abnormal event set into a preset resource abnormality early warning message template to encapsulate the early warning message and obtain the corresponding first resource abnormality early warning message; and to send the first resource abnormality early warning message to a preset early warning message notification interface.
Here, the resource abnormality early warning message template in the embodiment of the invention is a preset text message template; the early warning message notification interface in the embodiment of the invention is a preset message sending service interface, which may be a mail interface, a short message interface or an instant messaging software interface, where common instant messaging software interfaces include a DingTalk office software interface and a WeChat interface.
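A minimal sketch of the template-based encapsulation, assuming a one-line text template per abnormal event. The template wording, the field names and the `notify` callback standing in for the preset notification interface are all hypothetical; a real mail, short message or instant messaging interface is not modeled here.

```python
from string import Template
from typing import Callable, Iterable, Mapping

# Hypothetical text template for one first abnormal event.
RESOURCE_ALERT_TEMPLATE = Template(
    "[$level] $resource_type anomaly on $object_id at $timestamp"
)


def build_resource_alert(events: Iterable[Mapping[str, str]]) -> str:
    """Fill every event of an abnormal event set into the template."""
    return "\n".join(RESOURCE_ALERT_TEMPLATE.substitute(e) for e in events)


def send_alert(message: str, notify: Callable[[str], None]) -> None:
    """Hand the encapsulated message to a notification callback."""
    notify(message)
```

Keeping the sender behind a callback means the same encapsulated message could be routed to any of the interface types without changing the template step.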
The first early warning module 5 is further configured to perform service abnormal event early warning processing according to the second abnormal event when the second abnormal event is received.
In another specific implementation of the embodiment of the present invention, when performing service abnormal event early warning processing according to the second abnormal event, the first early warning module 5 is specifically configured to fill the second abnormal event into a preset service abnormality early warning message template to encapsulate the early warning message and obtain the corresponding first service abnormality early warning message; and to send the first service abnormality early warning message to a preset early warning message notification interface.
Here, the service abnormality early warning message template in the embodiment of the invention is a preset text message template.
The embodiment of the invention provides a processing system for event monitoring, comprising a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module. Each first monitoring object is a node object in the service system (such as an Internet of Things device, a network device, a server, a database or a virtual machine). A first acquisition module is preset on each first monitoring object; it periodically acquires the resource information of the monitoring object and sends the resulting first acquisition data packet to the first main control module for analysis, and, each time the monitoring object calls any designated service interface, it also acquires the information of the current service interface once and sends the resulting first service acquisition data to the first main control module for analysis. The first main control module stores the received first acquisition data packets and first service acquisition data in the first storage module for subsequent query and identifies the abnormal resource events and abnormal service events corresponding to them; when an abnormal event is identified, it stores the resulting abnormal resource event set (namely the first abnormal event set) or abnormal service event (namely the second abnormal event) in the first storage module for subsequent query and at the same time sends it to the first early warning module for the corresponding abnormal resource event early warning or abnormal service event early warning. In addition, the first main control module provides eight types of query instructions for operation and maintenance personnel (a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction), and query interaction with the operation and maintenance personnel is completed through the first query module. The system of the invention adds a service event monitoring mode while remaining compatible with the conventional resource event monitoring mode, so that it can monitor the services called by node objects and identify and give early warning of abnormal service events occurring in the system. With the eight types of query instructions, the system can automatically retrieve the time-series data required for a resource acquisition data-time curve, an abnormal resource event-time curve, a service acquisition data-time curve and an abnormal service event-time curve, and can automatically calculate, for two designated periods (two year-over-year periods or two period-over-period periods), the comparison result of the average utilization rate, of the total number of abnormal resource events, of the total service call volume and of the total number of abnormal service events. The system of the invention thus further improves the operational stability of the service system, reduces the workload of operation and maintenance personnel and improves working efficiency.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention or to restrict the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A processing system for event monitoring, characterized in that the system comprises: a plurality of first monitoring objects, a first main control module, a first storage module, a first query module and a first early warning module;
a first acquisition module is preset on each first monitoring object; the first acquisition module is connected with the first main control module; the first main control module is also respectively connected with the first storage module, the first query module and the first early warning module;
the first acquisition module is used for periodically acquiring the resource data of the first monitoring object according to a preset resource monitoring frequency to obtain a corresponding first acquisition data packet and sending the corresponding first acquisition data packet to the first main control module;
the first acquisition module is further used for carrying out appointed service call monitoring on the first monitoring object according to a preset monitoring service set, acquiring service data monitored each time to generate corresponding first service acquisition data and sending the corresponding first service acquisition data to the first main control module; the monitoring service set comprises a plurality of monitoring service records; the monitoring service record comprises a monitoring service name and a monitoring service calling interface; the monitoring service names comprise multiple classes of appointed service names;
The first main control module is used for storing the first acquired data packet into the first storage module when the first acquired data packet is received; carrying out abnormal resource event identification processing according to the first acquired data packet to obtain a corresponding first abnormal event set; when the first abnormal event set is not empty, storing the first abnormal event set into the first storage module, and sending the first abnormal event set to the first early warning module;
the first main control module is used for carrying out abnormal service event identification processing according to the first service acquisition data to obtain a corresponding second abnormal event when receiving the first service acquisition data; when the second abnormal event is not empty, storing the second abnormal event into the first storage module, and sending the second abnormal event to the first early warning module;
the first main control module is also used for receiving a first query instruction sent by the first query module; performing data query processing on the first storage module according to the first query instruction to generate corresponding first query feedback and sending the corresponding first query feedback back to the first query module;
the first query module is used for sending the first query instruction input by the user to the first main control module; and feeding back the first query feedback sent back by the first main control module to a user;
The first early warning module is used for carrying out resource abnormal event early warning processing according to the first abnormal event set when the first abnormal event set is received; and the first early warning module is also used for carrying out early warning processing on the service abnormal event according to the second abnormal event when the second abnormal event is received.
2. The processing system for event monitoring as set forth in claim 1, wherein,
the first acquisition data packet consists of a plurality of first resource acquisition data; the first resource acquisition data comprises a first acquisition time stamp, a first object identifier, a first resource type and first resource data; the first resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O; when the first resource type is CPU/GPU, the corresponding first resource data comprises the total number of CPU/GPU cores and the utilization rate of the CPU/GPU cores; when the first resource type is memory/disk/bandwidth/load, the corresponding first resource data comprises total memory/disk/bandwidth/load and memory/disk/bandwidth/load utilization; when the first resource type is disk I/O, the corresponding first resource data comprises disk reading rate and disk writing rate;
The first service acquisition data comprises a second acquisition time stamp, a second object identifier, a first service name, a first service cumulative value and a first service return state; the first service return state includes a success state and a failure state;
the first set of abnormal events includes one or more first abnormal events; the first abnormal event comprises a first event time stamp, a first event object identification, a first event resource type and a first event abnormal level; the first event resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O; the first event anomaly level includes a middle level and a high level;
the second abnormal event comprises a second event time stamp, a second event object identifier and a first event service name;
the first query instruction comprises a first instruction head and a first instruction parameter; the first instruction head comprises a resource monitoring query instruction, a resource abnormality query instruction, a resource monitoring comparison instruction, a resource abnormality comparison instruction, a service monitoring query instruction, a service abnormality query instruction, a service monitoring comparison instruction and a service abnormality comparison instruction;
when the first instruction head is a resource monitoring query instruction, the corresponding first instruction parameters comprise a first query resource type and a first resource query period; the first query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
When the first instruction head is a resource abnormality query instruction, the corresponding first instruction parameters comprise a second query resource type and a second resource query period; the second query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction head is a resource monitoring comparison instruction, the corresponding first instruction parameters comprise a third query resource type, a first-one period and a first-two period; the third query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction head is a resource abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query resource type, a first-three period and a first-four period; the fourth query resource type comprises a CPU, a GPU, a memory, a disk, a bandwidth, a load and a disk I/O;
when the first instruction header is a service monitoring query instruction, the corresponding first instruction parameters comprise a first query service name and a first service query period;
when the first instruction head is a service abnormal inquiry instruction, the corresponding first instruction parameters comprise a second inquiry service name and a second service inquiry period;
When the first instruction head is a service monitoring comparison instruction, the corresponding first instruction parameters comprise a third query service name, a second-one period and a second-two period;
when the first instruction header is a service abnormality comparison instruction, the corresponding first instruction parameters comprise a fourth query service name, a second third period and a second fourth period.
3. The processing system for event monitoring as set forth in claim 2, wherein,
the first acquisition module is specifically configured to, when periodically acquiring the resource data of the first monitoring object according to the preset resource monitoring frequency to obtain the corresponding first acquisition data packet and sending it to the first main control module: periodically acquire, according to the preset resource monitoring frequency, the CPU/GPU resource configuration and utilization rate of the first monitoring object to generate the corresponding total number of CPU/GPU cores and CPU/GPU core utilization rate; acquire the memory/disk/bandwidth/load resource configuration and utilization rate of the first monitoring object to generate the corresponding total memory/disk/bandwidth/load and memory/disk/bandwidth/load utilization rate; acquire the disk read/write rate of the first monitoring object to generate the corresponding disk reading rate and disk writing rate; take the current time as the corresponding first acquisition time stamp; and take the object identifier of the first monitoring object as the corresponding first object identifier;
When the total number of the CPU/GPU cores is not empty, forming corresponding first resource data by the total number of the CPU/GPU cores and the corresponding CPU/GPU core utilization rate, setting the corresponding first resource type as the corresponding CPU/GPU, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
when the total memory/disk/bandwidth/load is not empty, forming corresponding first resource data by the total memory/disk/bandwidth/load and the corresponding memory/disk/bandwidth/load utilization rate, setting the corresponding first resource type as the corresponding memory/disk/bandwidth/load, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
when the disk reading rate or the disk writing rate is not null, forming corresponding first resource data by the disk reading rate and the disk writing rate, setting the corresponding first resource type as disk I/O, and forming corresponding first resource acquisition data by the obtained first acquisition time stamp, the first object identifier, the first resource type and the first resource data;
and forming the corresponding first acquisition data packet from all the obtained first resource acquisition data and sending it to the first main control module.
4. The processing system for event monitoring as set forth in claim 2, wherein,
the first acquisition module is specifically configured to, when performing designated service call monitoring on the first monitoring object according to the preset monitoring service set and acquiring the service data monitored each time to generate the corresponding first service acquisition data and send it to the first main control module: set, on the first monitoring object, a counter for each monitoring service record of the monitoring service set as the corresponding monitoring service counter, and periodically clear each monitoring service counter according to a preset counter reset frequency;
each time the first monitoring object calls any service interface, take the currently called service interface as the corresponding current service interface; take the monitoring service record in the monitoring service set whose monitoring service calling interface matches the current service interface as the corresponding current record; and, when the current record is not empty, add 1 to the monitoring service counter corresponding to the current record, take the count value of the monitoring service counter after the addition as the corresponding first service cumulative value, take the current time as the corresponding second acquisition time stamp, take the object identifier of the first monitoring object as the corresponding second object identifier, take the monitoring service name of the current record as the corresponding first service name, take the service success/failure state returned by the current service interface as the corresponding first service return state, and form the corresponding first service acquisition data from the obtained second acquisition time stamp, second object identifier, first service name, first service cumulative value and first service return state and send it to the first main control module.
5. The processing system for event monitoring as set forth in claim 2, wherein,
the first main control module is specifically configured to identify the first resource type of each first resource acquisition data of the first acquisition data packet when the first abnormal event set is obtained by performing abnormal resource event identification processing according to the first acquisition data packet;
if the first resource type is a CPU/GPU, setting the corresponding first event abnormal level to a middle level or a high level, respectively, when the CPU/GPU core utilization rate of the corresponding first resource data falls within a preset medium-risk-level CPU/GPU utilization rate range or a preset high-risk-level CPU/GPU utilization rate range, taking the corresponding first acquisition time stamp, the first object identifier and the first resource type as the corresponding first event time stamp, the first event object identifier and the first event resource type, and forming a corresponding first abnormal event by the obtained first event time stamp, the first event object identifier, the first event resource type and the first event abnormal level;
if the first resource type is memory/disk/bandwidth/load, setting the corresponding first event abnormal level to a middle level or a high level, respectively, when the memory/disk/bandwidth/load utilization rate of the corresponding first resource data falls within a preset medium-risk-level memory/disk/bandwidth/load utilization rate range or a preset high-risk-level memory/disk/bandwidth/load utilization rate range, taking the corresponding first acquisition time stamp, the first object identifier and the first resource type as the corresponding first event time stamp, the first event object identifier and the first event resource type, and forming a corresponding first abnormal event by the obtained first event time stamp, the first event object identifier, the first event resource type and the first event abnormal level;
If the first resource type is a disk I/O, setting the corresponding first event abnormal level to a middle level or a high level, respectively, when the disk read/write rate of the corresponding first resource data falls within a preset medium-risk-level read/write rate range or a preset high-risk-level read/write rate range, taking the corresponding first acquisition time stamp, the first object identifier and the first resource type as the corresponding first event time stamp, the first event object identifier and the first event resource type, and forming a corresponding first abnormal event by the obtained first event time stamp, the first event object identifier, the first event resource type and the first event abnormal level;
and forming a corresponding first abnormal event set by all the obtained first abnormal events.
6. The processing system for event monitoring as set forth in claim 2, wherein,
the first main control module is specifically configured to identify whether the first service return state of the first service acquisition data is a failure state when the first service acquisition data is subjected to abnormal service event identification processing to obtain a corresponding second abnormal event; if yes, the second collection time stamp, the second object identifier and the first service name of the first service collection data serve as the corresponding second event time stamp, the second event object identifier and the first event service name to form a corresponding second abnormal event.
7. The processing system for event monitoring as set forth in claim 2, wherein,
the first main control module is specifically configured to, when performing data query processing on the first storage module according to the first query instruction to generate the corresponding first query feedback and send it back to the first query module: extract the corresponding first instruction head and first instruction parameter from the first query instruction; and identify the first instruction head;
if the first instruction head is a resource monitoring query instruction, extracting the corresponding first query resource type and the first resource query period from the first instruction parameter; extracting all the first resource acquisition data in the first storage module, wherein the first resource type is matched with the first query resource type, and the first acquisition time stamp is matched with the first resource query period to form corresponding first query feedback;
if the first instruction head is a resource abnormality query instruction, extracting the corresponding second query resource type and the second resource query period from the first instruction parameter; extracting all first abnormal events in the first storage module whose first event resource type matches the second query resource type and whose first event time stamp matches the second resource query period, to form corresponding first query feedback;
If the first instruction head is a resource monitoring comparison instruction, extracting the corresponding third query resource type, the first-one period and the first-two period from the first instruction parameter; averaging the CPU core/GPU core/memory/disk/bandwidth/load utilization rates of the first resource data of all the first resource acquisition data in the first storage module whose first resource type matches the third query resource type and whose first acquisition time stamp matches the first-one period, to obtain a corresponding first utilization rate average value; averaging the CPU core/GPU core/memory/disk/bandwidth/load utilization rates of the first resource data of all the first resource acquisition data in the first storage module whose first resource type matches the third query resource type and whose first acquisition time stamp matches the first-two period, to obtain a corresponding second utilization rate average value; calculating a corresponding first comparison value from the first and second utilization rate average values, wherein first comparison value = (first utilization rate average value - second utilization rate average value) / second utilization rate average value; and taking the obtained first comparison value as the corresponding first query feedback;
If the first instruction head is a resource abnormality comparison instruction, extracting the corresponding fourth query resource type, the first-three period and the first-four period from the first instruction parameter; counting the number of the first abnormal events in the first storage module whose first event resource type matches the fourth query resource type and whose first event time stamp matches the first-three period, to obtain a corresponding first abnormal number; counting the number of the first abnormal events in the first storage module whose first event resource type matches the fourth query resource type and whose first event time stamp matches the first-four period, to obtain a corresponding second abnormal number; calculating a corresponding second comparison value from the first abnormal number and the second abnormal number, wherein second comparison value = (first abnormal number - second abnormal number) / second abnormal number; and taking the obtained second comparison value as the corresponding first query feedback;
if the first instruction head is a service monitoring query instruction, extracting the corresponding first query service name and the first service query period from the first instruction parameter; extracting from the first storage module all the first service acquisition data whose first service name matches the first query service name and whose second acquisition time stamp matches the first service query period, to form corresponding first query feedback;
If the first instruction head is a service abnormality query instruction, extracting the corresponding second query service name and the second service query period from the first instruction parameter; and extracting from the first storage module all the second abnormal events whose first event service name matches the second query service name and whose second event time stamp matches the second service query period, to form corresponding first query feedback;
if the first instruction head is a service monitoring comparison instruction, extracting the corresponding third query service name, the second-one period and the second-two period from the first instruction parameter; summing the first service cumulative values of all the first service acquisition data in the first storage module whose first service name matches the third query service name and whose second acquisition time stamp matches the second-one period, to obtain a corresponding first sum; summing the first service cumulative values of all the first service acquisition data in the first storage module whose first service name matches the third query service name and whose second acquisition time stamp matches the second-two period, to obtain a corresponding second sum; calculating a corresponding third comparison value from the first sum and the second sum, wherein third comparison value = (first sum - second sum) / second sum; and taking the obtained third comparison value as corresponding first query feedback;
if the first instruction head is a service abnormality comparison instruction, extracting the corresponding fourth query service name, second third period and second fourth period from the first instruction parameter; counting the second abnormal events in the first storage module whose first event service name matches the fourth query service name and whose second event time stamp matches the second third period, to obtain a corresponding third abnormal number; counting the second abnormal events in the first storage module whose first event service name matches the fourth query service name and whose second event time stamp matches the second fourth period, to obtain a corresponding fourth abnormal number; calculating a corresponding fourth comparison value from the third abnormal number and the fourth abnormal number, where fourth comparison value = (third abnormal number - fourth abnormal number) / fourth abnormal number; and taking the obtained fourth comparison value as the corresponding first query feedback;
and sending the obtained first query feedback back to the first query module.
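The four service-related branches above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the record classes, the instruction head strings, the parameter keys, the inclusive reading of "matches the period", and the choice to return None when the denominator is zero are all assumptions made for illustration, since the claim does not specify them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# (start, end) time stamps; treating both bounds as inclusive is an assumption.
Period = Tuple[float, float]


@dataclass(frozen=True)
class ServiceSample:
    """Hypothetical stand-in for one item of first service acquisition data."""
    service_name: str
    timestamp: float
    cumulative_value: float


@dataclass(frozen=True)
class ServiceAnomaly:
    """Hypothetical stand-in for one second abnormal event."""
    service_name: str
    timestamp: float


def select(records: Sequence, name: str, period: Period) -> list:
    """Keep records whose service name matches and whose time stamp lies in the period."""
    start, end = period
    return [r for r in records
            if r.service_name == name and start <= r.timestamp <= end]


def relative_change(a: float, b: float) -> Optional[float]:
    """Comparison value (a - b) / b; None for b == 0 is an assumption, not from the claim."""
    if b == 0:
        return None
    return (a - b) / b


def handle_service_query(head: str, params: dict,
                         samples: Sequence[ServiceSample],
                         anomalies: Sequence[ServiceAnomaly]):
    """Dispatch on a (hypothetical) instruction head string and build the query feedback."""
    name = params["service_name"]
    if head == "SERVICE_MONITOR_QUERY":
        return select(samples, name, params["period"])
    if head == "SERVICE_ANOMALY_QUERY":
        return select(anomalies, name, params["period"])
    if head == "SERVICE_MONITOR_COMPARE":
        first_sum = sum(s.cumulative_value
                        for s in select(samples, name, params["period_a"]))
        second_sum = sum(s.cumulative_value
                         for s in select(samples, name, params["period_b"]))
        return relative_change(first_sum, second_sum)
    if head == "SERVICE_ANOMALY_COMPARE":
        third_count = len(select(anomalies, name, params["period_a"]))
        fourth_count = len(select(anomalies, name, params["period_b"]))
        return relative_change(third_count, fourth_count)
    raise ValueError(f"unsupported instruction head: {head}")
```

In this sketch the storage module is represented by two in-memory sequences; a real system would presumably push the name and period filters into the storage query rather than scanning all records.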
8. The processing system for event monitoring as set forth in claim 2, wherein,
the first early warning module is specifically configured to, when performing resource abnormal event early warning processing according to the first abnormal event set, fill the first abnormal event set into a preset resource abnormality early warning message template to encapsulate an early warning message, thereby obtaining a corresponding first resource abnormality early warning message; and send the first resource abnormality early warning message to a preset early warning message notification interface.
9. The processing system for event monitoring as set forth in claim 2, wherein,
the first early warning module is specifically configured to, when performing service abnormal event early warning processing according to the second abnormal event, fill the second abnormal event into a preset service abnormality early warning message template to encapsulate an early warning message, thereby obtaining a corresponding first service abnormality early warning message; and send the first service abnormality early warning message to a preset early warning message notification interface.
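The template-based encapsulation described in claims 8 and 9 might look roughly like the following Python sketch. The template wording, the event field names (`service_name`, `timestamp`, `detail`) and the callable used as the notification interface are hypothetical; the claims only state that a preset template is filled and the resulting message is sent to a preset interface.

```python
import json
from string import Template
from typing import Callable, Mapping, Sequence

# Hypothetical preset templates; the claims do not disclose their content.
RESOURCE_WARNING_TEMPLATE = Template("[RESOURCE ANOMALY] count=$count events=$events")
SERVICE_WARNING_TEMPLATE = Template(
    "[SERVICE ANOMALY] service=$service_name time=$timestamp detail=$detail"
)


def build_resource_warning(events: Sequence[Mapping[str, object]]) -> str:
    """Fill a whole abnormal event set into the resource warning template."""
    return RESOURCE_WARNING_TEMPLATE.substitute(
        count=len(events),
        events=json.dumps([dict(e) for e in events], sort_keys=True),
    )


def build_service_warning(event: Mapping[str, object]) -> str:
    """Fill a single abnormal event into the service warning template."""
    # substitute() raises KeyError if a placeholder has no matching field.
    return SERVICE_WARNING_TEMPLATE.substitute(event)


def send_warning(message: str, notify: Callable[[str], None]) -> None:
    """Hand the encapsulated message to the (assumed) notification interface."""
    notify(message)
```

Passing the notification interface as a callable keeps the sketch independent of any particular channel; for example, `send_warning(msg, print)` writes the message to standard output.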
CN202311346062.3A 2023-10-17 2023-10-17 Processing system for event monitoring Pending CN117336145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311346062.3A CN117336145A (en) 2023-10-17 2023-10-17 Processing system for event monitoring

Publications (1)

Publication Number Publication Date
CN117336145A (en) 2024-01-02

Family ID=89295132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311346062.3A Pending CN117336145A (en) 2023-10-17 2023-10-17 Processing system for event monitoring

Country Status (1)

Country Link
CN (1) CN117336145A (en)

Similar Documents

Publication Publication Date Title
CN107992398B (en) Monitoring method and monitoring system of service system
US6658367B2 (en) System for time-bucketing of baselined data collector data
CN106548402B (en) Resource transfer monitoring method and device
CN108459939A (en) A kind of log collecting method, device, terminal device and storage medium
CN109039817B (en) Information processing method, device, equipment and medium for flow monitoring
CN108182139B (en) Early warning method, device and system
CN111538563A (en) Event analysis method and device for Kubernetes
CN110297746A (en) A kind of data processing method and system
CN102404760B (en) Method and device for real-time measurement of system performance
CN115934774A (en) Flow control method, engine and medium for high-concurrency multi-dimensional distributed transaction system
CN103034733A (en) Data monitoring statistical method for call center
CN111431733B (en) Service alarm coverage information evaluation method and device
CN114401158A (en) Flow charging method and device, electronic equipment and storage medium
CN117336145A (en) Processing system for event monitoring
CN111694721A (en) Fault monitoring method and device for microservice
CN106304122B (en) Business data analysis method and system
CN111401874A (en) Self-service transaction system monitoring method and device
CN110633191A (en) Method and system for monitoring service health degree of software system in real time
CN113472881B (en) Statistical method and device for online terminal equipment
CN112965793B (en) Identification analysis data-oriented data warehouse task scheduling method and system
CN115344633A (en) Data processing method, device, equipment and storage medium
CN116109322A (en) Data acquisition method, data acquisition device, and computer-readable storage medium
CN115185710A (en) Transaction interface time consumption counting and early warning method
CN112910684B (en) Method and terminal for monitoring key data through real-time streaming platform
EP2590366A1 (en) Method and system for monitoring message objects sent from a client to invoke operations on a server in a distributed computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination