CN110287081A - Service monitoring system and method - Google Patents

Service monitoring system and method

Info

Publication number
CN110287081A
CN110287081A
Authority
CN
China
Prior art keywords
data
module
failure
service
monitored results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910543000.9A
Other languages
Chinese (zh)
Inventor
朱龙云
陈阳
姚建明
元长才
袁文頔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Chengdu Co Ltd
Original Assignee
Tencent Technology Chengdu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Chengdu Co Ltd
Priority to CN201910543000.9A
Publication of CN110287081A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/302 Monitoring arrangements where the computing system component being monitored is a software system
    • G06F 11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F 11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F 11/324 Display of status information
    • G06F 11/327 Alarm or error message display
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3466 Performance evaluation by tracing or monitoring
    • G06F 11/3476 Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application disclose a service monitoring system and method. The system comprises a data collection module, a monitoring data processing module, a log data processing module and a computing engine. The data collection module obtains basic service operation data and separates monitoring data and log data from it. The monitoring data processing module obtains the monitoring data and aggregates it according to preset statistical indicators to obtain the corresponding monitoring results. The log data processing module obtains the log data and generates, from it, a topology graph corresponding to a single service request; the topology graph characterizes the call relations among the background modules that respond to that single service request. The computing engine obtains the monitoring results and the topology graph, determines whether any background module is faulty according to the monitoring results and the alarm thresholds corresponding to the statistical indicators, and, when a fault exists, accurately locates the cause of the fault according to the monitoring results and the topology graph.

Description

Service monitoring system and method
Technical field
This application relates to the field of information processing, and in particular to a service monitoring system and method.
Background art
With the rapid development of Internet technology, user-facing network services of all kinds keep emerging. The normal operation of such a service usually depends on a large number of background modules, and a single service call made by a user at the front end commonly triggers dozens or even hundreds of calls among these background modules.
To guarantee service quality, a monitoring system is usually deployed to monitor, in real time, each background module that supports the operation of a service. Specifically, the monitoring system collects the abnormal data reported by each background module supporting the service, processes the abnormal data reported by each module separately, and, when a processing result exceeds a preset alarm threshold, determines that the corresponding background module is faulty and prompts the responsible staff to repair that module.
In practice, a single front-end service call usually triggers the cooperation of several background modules. When one or more of these modules fail, the other cooperating modules may also produce abnormal data. Because the monitoring system described above monitors each background module independently, it tends to mark every module that produces abnormal data as faulty, and it cannot tell whether a module produces abnormal data because of its own fault or because it is affected by another faulty background module. It can be seen that service monitoring systems in the related art cannot yet accurately locate the cause of a fault.
Summary of the invention
The embodiments of the present application provide a service monitoring system and method that can accurately locate the cause of a fault.
In view of this, a first aspect of the present application provides a service monitoring system, the system comprising: a data collection module, a monitoring data processing module, a log data processing module and a computing engine;
the data collection module is configured to obtain basic service operation data and to separate monitoring data and log data from the basic service operation data;
the monitoring data processing module is configured to obtain the monitoring data and to aggregate the monitoring data according to a preset statistical indicator, obtaining a monitoring result corresponding to the statistical indicator;
the log data processing module is configured to obtain the log data and to generate, from the log data, a topology graph corresponding to a single service request; the topology graph characterizes the call relations among the background modules that respond to the single service request;
the computing engine is configured to obtain the monitoring result and the topology graph, to determine whether any background module is faulty according to the monitoring result and the alarm threshold corresponding to the statistical indicator, and, when a fault exists, to determine the cause of the fault according to the monitoring result and the topology graph.
A second aspect of the present application provides a service monitoring method, the method comprising:
obtaining basic service operation data, and separating monitoring data and log data from the basic service operation data;
aggregating the monitoring data according to a preset statistical indicator to obtain a monitoring result corresponding to the statistical indicator;
generating, from the log data, a topology graph corresponding to a single service request, the topology graph characterizing the call relations among the background modules that respond to the single service request;
determining whether any background module is faulty according to the monitoring result and the alarm threshold corresponding to the statistical indicator;
when a fault exists, determining the cause of the fault according to the monitoring result and the topology graph.
A third aspect of the present application provides a server, the server comprising a processor and a memory:
the memory is configured to store a computer program and to transfer the computer program to the processor;
the processor is configured to execute, according to the computer program, the steps of the service monitoring method described in the second aspect above.
A fourth aspect of the present application provides a computer-readable storage medium configured to store a computer program, the computer program being used to execute the steps of the service monitoring method described in the second aspect above.
A fifth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the steps of the service monitoring method described in the second aspect above.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:
The embodiments of the present application provide a service monitoring system that collects basic service operation data in real time and monitors, based on the collected data, each background module that supports the operation of a service. Specifically, the data collection module of the service monitoring system separates monitoring data and log data from the basic service operation data it collects; the monitoring data processing module then aggregates the monitoring data according to a preset statistical indicator to obtain the corresponding monitoring result, while the log data processing module generates, from the log data, a topology graph corresponding to a single service request, the topology graph characterizing the call relations among the background modules responding to that request; the computing engine then determines whether any background module is faulty according to the monitoring result and the alarm threshold corresponding to the statistical indicator, and, when a fault exists, locates the cause of the fault according to the monitoring data, the monitoring result and the topology graph. When monitoring the background modules, this service monitoring system refers not only to the relevant monitoring data but also to the log data, from which it generates the topology graph characterizing the call relations among the background modules responding to a single service request. Accordingly, when determining the cause of a fault, the computing engine can use the call relations in the topology graph to distinguish which background modules produce abnormal data because of their own faults and which produce abnormal data because they are affected by other faulty modules, thereby accurately locating the background module that is truly faulty, so that the responsible staff can maintain and adjust that module according to the located cause.
Brief description of the drawings
Fig. 1 is a working architecture diagram of a service monitoring system provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of a fault data variation chart provided by an embodiment of the present application;
Fig. 3 is a structural schematic diagram of a service monitoring system provided by an embodiment of the present application;
Fig. 4 is a flow diagram of a service monitoring method provided by an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a server provided by an embodiment of the present application.
Detailed description of embodiments
To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The terms "first", "second", "third", "fourth" and so on (if present) in the description, claims and accompanying drawings of this application are used to distinguish similar objects and are not intended to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product or device.
To address the technical problem that service monitoring systems in the related art cannot accurately locate the cause of a fault, the embodiments of the present application provide a new service monitoring system. When locating the cause of a fault, this system additionally uses a topology graph generated from log data to locate the background module that is truly faulty, thereby ensuring that the cause of the fault is located accurately.
Specifically, the service monitoring system provided by the embodiments of the present application includes: a data collection module, a monitoring data processing module, a log data processing module and a computing engine. The data collection module obtains basic service operation data and separates monitoring data and log data from it. The monitoring data processing module obtains the monitoring data and aggregates it according to a preset statistical indicator to obtain the corresponding monitoring result. The log data processing module obtains the log data and generates, from it, a topology graph corresponding to a single service request, the topology graph characterizing the call relations among the background modules responding to that request. The computing engine obtains the monitoring result and the topology graph, determines, according to the monitoring result and the alarm threshold corresponding to the statistical indicator, whether any of the background modules supporting the service is faulty, and, when a fault exists, further locates the cause of the fault according to the monitoring result and the topology graph.
The above service monitoring system uses the log data processing module to determine, from the log data, the call relations among the background modules responding to a service request. When locating the cause of a fault, it can refer to the topology graph that characterizes these call relations to locate the cause more precisely, distinguishing which background modules produce abnormal data because of their own faults and which produce abnormal data because they are affected by other faulty background modules.
It should be noted that the above service monitoring system can be used to monitor user-facing services, which are usually supported by the cooperation of multiple background modules; when the service monitoring system provided by this application runs, what it specifically monitors is whether the multiple background modules supporting the service are faulty.
It should be understood that in practical applications, the service monitoring system provided by this application can be deployed on a server; it may be deployed on an independent server or on a server cluster. When deployed on a cluster, different servers may each support different modules, or the same server may support several modules; no limitation is placed here on the deployment of the service monitoring system provided by this application.
The service monitoring system provided by this application is introduced below by way of embodiments.
Referring to Fig. 1, Fig. 1 is a working architecture diagram of a service monitoring system provided by an embodiment of the present application. As shown in Fig. 1, the service monitoring system includes: a data collection module 110, a monitoring data processing module 120, a log data processing module 130 and a computing engine 140.
The data collection module 110 is configured to obtain basic service operation data and to separate monitoring data and log data from the basic service operation data.
The basic service operation data is the data produced while the monitored service is running. It may be the data generated at run time by each module supporting the service, the data imported into the service by users, or data related to the hardware facilities supporting the service; the types of data included in the basic service operation data are not specifically limited here.
In one possible implementation, the data collection module 110 may include: a version change information acquiring unit, an infrastructure information acquiring unit, an active probing unit and a business information acquiring unit.
The version change information acquiring unit obtains service release change information, that is, version change data related to the monitored service, such as service version update data, capacity expansion data and capacity reduction data. The infrastructure information acquiring unit obtains information about the underlying hardware facilities supporting the service; the hardware supporting the service here means the devices on which the background modules run, and the underlying hardware facility information covers data that may affect the operation of those devices, for example network outage information, power outage information, device hardware fault data, device central processing unit (CPU) load data, device memory consumption data and device network interface traffic data. The active probing unit actively probes, according to an agreed protocol, the running state of the service, that is, the running state of each background module supporting the service, such as background module liveness information, abnormal input data and boundary input data. The business information acquiring unit obtains the service operation information reported by the business, that is, the operation data reported by each background module, such as downstream abnormal data, the current request volume and business-logic abnormal data.
In this way, through the version change information acquiring unit, the infrastructure information acquiring unit, the active probing unit and the business information acquiring unit in the data collection module 110, the service monitoring system obtains reference data for monitoring the service from multiple different dimensions, i.e. collects the basic service operation data, which ensures that the system can monitor the operation of the service from multiple dimensions and therefore monitor the service comprehensively.
It should be understood that in practical applications, besides obtaining service release change information, underlying hardware facility information, actively probed running state information and service operation information as the basic service operation data through the units above, the data collection module 110 may also obtain other kinds of information by other means as basic service operation data; neither the way the basic service operation data is collected nor the types of the collected data are limited here.
After the data collection module 110 obtains the basic service operation data, it further separates monitoring data and log data from it. Specifically, the basic service operation data obtained by the data collection module 110 usually contains two classes of data, namely monitoring data and log data. Monitoring data is generally coarse-grained data produced while the service runs, for example the fact that background module A called background module B at a certain point in time; log data is the fine-grained data produced while the service runs, for example that background module A sent a request to background module B at a certain point in time, or that background module B performed a specific operation in response to a received request; the monitoring data can usually be reconstructed from the log data. Since monitoring data and log data have different formats, the data collection module 110 can distinguish them directly based on the format of the collected basic service operation data.
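By way of a non-limiting editorial illustration of this format-based split (not part of the claimed solution; the record layout and field names such as "caller", "callee" and "request_id" are assumptions), the separation could be sketched in Python as follows:

    # Minimal sketch: split collected records into monitoring data and log data
    # by their format. The record layout is an assumption made for illustration;
    # the text only requires that the two classes be distinguishable by format.
    def split_basic_data(records):
        monitoring_data, log_data = [], []
        for record in records:
            # Log records are assumed to carry fine-grained call details
            # (caller, callee, request id); monitoring records carry only
            # coarse-grained counters or events.
            if "caller" in record and "callee" in record and "request_id" in record:
                log_data.append(record)
            else:
                monitoring_data.append(record)
        return monitoring_data, log_data

    if __name__ == "__main__":
        collected = [
            {"module": "A", "metric": "timeout_count", "value": 3, "ts": 1000},
            {"caller": "A", "callee": "B", "request_id": "req-1", "ts": 1001},
        ]
        monitoring, logs = split_basic_data(collected)
        print(len(monitoring), len(logs))  # prints: 1 1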
Optionally, before separating the monitoring data and the log data, the data collection module 110 may also clean the collected basic service operation data, filtering out the data that has no reference value for monitoring the service and discarding it.
Optionally, when the data collection module 110 detects a surge in the volume of basic service operation data, it can also perform rate limiting and peak clipping without affecting monitoring quality. Specifically, the data collection module 110 can detect in real time the volume of the collected basic service operation data and, when the increase of the data volume within a preset period exceeds a preset amplitude, merge normal data items whose business logic is identical or similar, thereby achieving rate limiting and peak clipping.
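The rate-limiting and peak-clipping idea can be pictured with the following editorial sketch, which merges records sharing the same business-logic key once a surge is detected (the surge criterion and the merge key are assumptions, not a prescription of the text):

    # Minimal sketch of peak clipping: when the record volume in the current
    # window exceeds a preset multiple of the previous window, records with the
    # same (module, metric) key are merged into one aggregated record.
    from collections import defaultdict

    def clip_peak(records, previous_count, surge_factor=3.0):
        if previous_count == 0 or len(records) <= surge_factor * previous_count:
            return records  # no surge detected, keep the records untouched
        merged = defaultdict(lambda: {"value": 0, "count": 0})
        for record in records:
            key = (record["module"], record["metric"])
            merged[key]["value"] += record.get("value", 0)
            merged[key]["count"] += 1
        return [
            {"module": module, "metric": metric,
             "value": item["value"], "merged": item["count"]}
            for (module, metric), item in merged.items()
        ]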
The monitoring data processing module 120 is configured to obtain the monitoring data and to aggregate the obtained monitoring data according to a preset statistical indicator, so as to obtain the monitoring result corresponding to the statistical indicator.
Specifically, the monitoring data processing module 120 obtains the monitoring data from the data collection module 110 and then performs statistics on the obtained monitoring data according to the preset statistical indicator, for example counting the number of response timeouts within each preset period, to obtain the monitoring result corresponding to the statistical indicator.
It should be noted that the preset statistical indicator is usually chosen according to the focus of the service monitoring. For example, when the focus is the number of service timeouts, the number of timeouts of each background module within a preset period can be taken as the statistical indicator. Accordingly, when the preset statistical indicator is the number of timeouts of each background module within a preset period, the monitoring data processing module 120 can determine the corresponding monitoring result, namely the number of timeouts of each background module within the period, from the monitoring data characterizing background module response timeouts.
It should be understood that in practical applications, besides the number of timeouts, the number of business-logic errors, the number of CPU overload events and so on can also be set as statistical indicators; no limitation is placed on the preset statistical indicator here. In addition, the monitoring data processing module 120 may perform statistics for only one statistical indicator and determine its corresponding monitoring result, or perform statistics for multiple statistical indicators and determine the monitoring result corresponding to each of them.
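As an editorial illustration of how a monitoring result could be computed for the timeout indicator mentioned above (the record field names are assumptions made for illustration only):

    # Minimal sketch: aggregate monitoring data according to one statistical
    # indicator, here "number of response timeouts per module within a period".
    from collections import Counter

    def count_timeouts(monitoring_data, period_start, period_end):
        counts = Counter()
        for record in monitoring_data:
            if (record.get("metric") == "response_timeout"
                    and period_start <= record.get("ts", -1) < period_end):
                counts[record["module"]] += 1
        # The monitoring result: {background module -> timeout count in the period}
        return dict(counts)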
In one possible implementation, the monitoring data processing module 120 may include: a data middleware, a real-time statistics unit and a time-series database.
The data middleware receives the monitoring data transmitted by the data collection module and, acting as a buffer for the monitoring data, provides the data source for the real-time statistics unit; the data middleware also makes it convenient to replay monitoring data. Specifically, the data middleware can set a retention period for the monitoring data it receives: while the retention time of a piece of monitoring data in the middleware has not reached the corresponding retention period, the real-time statistics unit can fetch that data from the middleware; once the retention time reaches the retention period, the data is deleted from the middleware. The real-time statistics unit aggregates the monitoring data according to the preset statistical indicator to obtain the corresponding monitoring result; each monitoring result is then stored in the time-series database, i.e. the time-series database stores the monitoring results computed by the real-time statistics unit.
It should be understood that in practical applications, besides computing the monitoring results through a data middleware, a real-time statistics unit and a time-series database, the monitoring data processing module 120 may compute them based on other units; the internal processing structure of the monitoring data processing module 120 is not specifically limited here.
The log data processing module 130 is configured to obtain the log data and to generate, from the log data, a topology graph corresponding to a single service request, the topology graph characterizing the call relations among the background modules that respond to the single service request.
Specifically, based on the generation time of each log record and the data content recorded in each log record, the log data processing module 130 can generate, by way of log retrieval, the topology graph corresponding to a single service request; this topology graph characterizes the call relations among the background modules responding to that request. In general, a single front-end service request triggers mutual calls and cooperation among several background modules. From the log data it obtains, the log data processing module 130 can determine the call relations among the background modules that respond to a given service request and draw, according to those call relations, the topology graph corresponding to that request. Furthermore, the topology graph can also reflect the hardware devices on which the background modules called by that request depend.
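A hedged sketch of how such a per-request topology graph might be reconstructed from log data, assuming each log record carries a request identifier together with the calling and called background modules (an assumption for illustration; the text only requires that the call relations be recoverable from the logs):

    # Minimal sketch: rebuild the call topology of a single service request
    # from log data by collecting caller -> callee edges for that request.
    from collections import defaultdict

    def build_topology(log_data, request_id):
        topology = defaultdict(set)  # caller -> set of called background modules
        for record in sorted(log_data, key=lambda r: r["ts"]):
            if record["request_id"] == request_id:
                topology[record["caller"]].add(record["callee"])
        return {caller: sorted(callees) for caller, callees in topology.items()}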
Optionally, since log data can reflect that a background module produced abnormal data, after determining the topology graph corresponding to a single service request, the log data processing module 130 can further determine, according to the log data reflecting that background modules produced abnormal data, a partial abnormal topology graph within that topology graph; that is, within the overall topology graph corresponding to the single service request it further determines the partial abnormal topology graph composed of the background modules that behave abnormally.
In one possible implementation, the log data processing module 130 may include: a log storage unit, a topology generating unit and an abnormal topology generating unit.
The log storage unit acts as a buffer for the log data and provides the topology generating unit with the log data needed to generate topology graphs. The topology generating unit determines, from the log data stored in the log storage unit, the call relations among the multiple background modules responding to a single service request and generates the topology graph corresponding to that request based on those call relations. On the basis of the topology graph generated by the topology generating unit, the abnormal topology generating unit further determines, according to the log data reflecting background module anomalies, the partial abnormal topology graph within the topology graph, i.e. the topology graph characterizing the call relations among the abnormal background modules.
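Under the same assumptions, the partial abnormal topology graph can be sketched as the restriction of the full topology to the modules whose log records indicate anomalies (again an editorial illustration, not the implementation prescribed by the text):

    # Minimal sketch: keep only the edges of the topology whose caller and
    # callee both belong to the set of modules marked abnormal in the logs.
    def abnormal_subtopology(topology, abnormal_modules):
        sub = {}
        for caller, callees in topology.items():
            if caller not in abnormal_modules:
                continue
            kept = [callee for callee in callees if callee in abnormal_modules]
            if kept:
                sub[caller] = kept
        return sub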
It should be understood that in practical applications, besides generating the topology graph and the partial abnormal topology graph through a log storage unit, a topology generating unit and an abnormal topology generating unit, the log data processing module 130 may generate them based on other units; the internal processing structure of the log data processing module 130 is not specifically limited here.
The computing engine 140 is configured to obtain the monitoring results generated by the monitoring data processing module 120 and the topology graph generated by the log data processing module 130, to determine whether any background module is faulty according to the monitoring results and the alarm threshold corresponding to the statistical indicator, and, when a fault exists, to further determine the cause of the fault according to the monitoring results and the topology graph.
In a concrete implementation, the computing engine 140 may first determine whether any background module is faulty according to the monitoring result corresponding to the statistical indicator and the alarm threshold corresponding to that indicator; specifically, when the monitoring result exceeds the alarm threshold, the computing engine 140 determines that a background module is currently faulty. For example, suppose that for the statistical indicator "number of service response timeouts" the monitoring result is 50 timeouts and the alarm threshold corresponding to the indicator is 30; the monitoring result then exceeds the alarm threshold, so it is determined that the monitored service is currently faulty, i.e. that some background module supporting the service is currently faulty.
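The threshold comparison in the example above can be written down directly; the per-module variant is an editorial generalization of the same check, not an implementation prescribed by the text:

    # Minimal sketch: compare monitoring results with alarm thresholds.
    def exceeds_threshold(monitoring_result, alarm_threshold):
        return monitoring_result > alarm_threshold

    def possibly_faulty_modules(per_module_results, per_module_thresholds):
        # Modules whose monitoring result exceeds their own alarm threshold.
        return [module for module, value in per_module_results.items()
                if value > per_module_thresholds.get(module, float("inf"))]

    # Example from the text: 50 observed timeouts against a threshold of 30.
    print(exceeds_threshold(50, 30))  # prints: True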
When it has been determined that the monitored service is currently faulty, the computing engine 140 can further determine the background modules associated with the current fault, obtain the monitoring results of these background modules under the statistical indicator, and then determine, from those monitoring results, the background modules that may be abnormal. For example, suppose that the background modules associated with the service response timeout fault include background module A, background module B and background module C; the computing engine then obtains the number of timeouts of background module A, of background module B and of background module C respectively, and further judges, according to the alarm threshold corresponding to each background module, whether background modules A, B and C may be abnormal.
It should be noted that in practical applications, the computing engine 140 may be able to determine the possibly abnormal background modules with a single judgment operation, or it may need several judgment operations to determine them; the number of judgment operations the computing engine goes through when determining the possibly abnormal background modules is not limited here.
In addition, if the computing engine 140 cannot determine the possibly abnormal background modules from the monitoring results, it can also go back to the monitoring data that the monitoring data processing module 120 obtained from the data collection module 110, i.e. determine the possibly abnormal background modules from the raw monitoring data.
A background module that the computing engine 140 determines through the above operations to be possibly abnormal may really be faulty itself, or it may merely produce abnormal data because it is affected by other faulty modules. To locate the cause of the fault more precisely, the computing engine 140 can determine the background modules that may truly be faulty in combination with the topology graph generated by the log data processing module 130. Specifically, since that topology graph reflects the call relations among the background modules responding to a service request, the computing engine 140 can use it to determine the call relations among the possibly abnormal background modules and, based on the principle that a fault in a downstream background module can affect the data produced by its upstream background modules, determine, among the possibly abnormal background modules, the most downstream ones as the truly faulty background modules, i.e. as the cause of the fault.
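The downstream-first principle described above can be sketched as follows: among the possibly abnormal modules, those that call no other abnormal module are treated as the likely true fault points (an editorial simplification of the reasoning, under the same assumed data structures):

    # Minimal sketch: given the call topology and the set of possibly abnormal
    # modules, pick as root-cause candidates the abnormal modules that do not
    # call any other abnormal module, i.e. the most downstream abnormal ones,
    # since a downstream fault can propagate abnormal data to its callers.
    def locate_root_causes(topology, abnormal_modules):
        candidates = []
        for module in abnormal_modules:
            callees = topology.get(module, [])
            if not any(callee in abnormal_modules for callee in callees):
                candidates.append(module)
        return candidates

    if __name__ == "__main__":
        topology = {"A": ["B"], "B": ["C"], "C": []}
        print(locate_root_causes(topology, {"A", "B", "C"}))  # prints: ['C']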
It should be noted that if the log data processing module 130 can also determine the partial abnormal topology graph within the topology graph according to the log data, the computing engine 140 can also obtain that partial abnormal topology graph and determine the cause of the fault according to the partial abnormal topology graph and the monitoring results.
In other words, the log data processing module 130 can provide the computing engine 140 with a small-scale topology graph that exhibits the anomalies, so that the computing engine 140 can quickly find the possibly abnormal background modules in this small-scale partial abnormal topology graph and determine the upstream-downstream relations among them, further improving the efficiency of determining the cause of the fault.
Optionally, to ensure that whether a background module is faulty can be determined accurately, the monitoring system can also verify, based on other verification logic, whether a background module is currently faulty, and then combine the verification results obtained from the different verification logics to judge whether a background module is currently faulty.
That is, the service monitoring system may further include a bypass verification module. The bypass verification module obtains the monitoring data and/or the monitoring results and generates a bypass verification result according to them; the bypass verification result characterizes whether a background module is faulty. The computing engine 140 then obtains the bypass verification result and determines whether a background module is faulty according to the monitoring results, the alarm threshold corresponding to the statistical indicator and the bypass verification result.
Specifically, the bypass verification module can judge, according to the monitoring data and/or the monitoring results, whether the monitored service is currently faulty, and send this judgment (the bypass verification result) to the computing engine 140, so that the computing engine 140 can take it into account when determining whether a background module is currently faulty. For example, suppose the monitored service is a ticketing service. The bypass verification module can obtain the total one-day sales of the ticketing service from the server providing that service and also compute the one-day sales of the ticketing service from the monitoring data; if the two figures agree, the bypass verification result can be that no background module is currently faulty, whereas if they disagree, the bypass verification result can be that a background module is currently faulty.
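The ticketing example can be restated as a simple bypass verification rule that compares a figure reconstructed from monitoring data with the figure reported by the business side (the parameter names and the tolerance are assumptions made for illustration):

    # Minimal sketch of bypass verification: the daily sales total computed
    # from monitoring data is compared with the total reported by the ticketing
    # service itself; a mismatch suggests that some background module is faulty.
    def bypass_verify(sales_from_monitoring, sales_reported_by_service, tolerance=0.0):
        difference = abs(sales_from_monitoring - sales_reported_by_service)
        # True  -> the figures agree, this check indicates no fault
        # False -> the figures disagree, this check indicates a fault
        return difference <= tolerance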
It should be understood that the above processing logic is only one illustrative logic the bypass verification module may use; in practical applications, the bypass verification module may verify whether a background module is faulty based on other processing logic, and the processing logic of the bypass verification module is not specifically limited here. In addition, the service monitoring system may include only one bypass verification module or several bypass verification modules, with different bypass verification modules verifying whether a background module is faulty based on different processing logics.
It should be noted that besides locating the cause of a fault, the computing engine 140 can also determine information such as the failure level and the scope affected by the failure. For example, the computing engine 140 can determine the failure level corresponding to a fault cause in combination with a preset failure level list, which records the correspondence between combinations of monitoring results and fault causes on the one hand and failure levels on the other. As another example, the computing engine 140 can obtain the raw monitoring data and determine the scope affected by the current fault from its content.
Optionally, to further improve the accuracy of the determined fault cause, the computing engine 140 can also take relevant historical service operation data into account when determining whether a background module is faulty and what the concrete fault cause is.
The service monitoring system provided by the embodiments of the present application may further include a data warehouse and an intelligent analysis module. The data warehouse stores historical service operation data. The intelligent analysis module obtains the historical service operation data and determines the historical change trend of the service data from it. Accordingly, the computing engine 140 is further configured to obtain the historical service operation data and the historical change trend of the service data from the intelligent analysis module, to determine whether a background module is faulty according to the monitoring results, the alarm threshold corresponding to the statistical indicator and the historical change trend of the service data, and, when a fault exists, to determine the cause of the fault according to the monitoring results, the topology graph and the obtained historical service operation data.
Specifically, the intelligent analysis module can obtain from the data warehouse the historical service operation data related to the statistical indicator, which may concretely be the historical monitoring results corresponding to the statistical indicator; the intelligent analysis module then arranges the obtained historical data in time order to obtain the historical change trend of the service data corresponding to that indicator. When the computing engine 140 determines whether a background module is faulty, it can additionally take this historical change trend into account. For example, suppose the computing engine 140 detects that the number of response timeouts of a service exceeds the preset alarm threshold within a certain period, but it finds, from the historical change trend of that service's response timeout count, that the service exceeds the alarm threshold within the same period every day; the computing engine 140 can then consider the timeout phenomenon to be part of the service's normal running pattern rather than an indication that the service is really faulty at the moment.
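One hedged way to realize the historical-trend check above is to compare the current monitoring result with the results observed in the same daily window on previous days and to suppress the alarm when exceeding the threshold is part of the service's regular pattern (the criterion below is an editorial assumption):

    # Minimal sketch: decide whether a threshold breach is a real fault by
    # checking how often the same indicator breached the threshold in the same
    # window according to the historical change trend of the service data.
    def is_real_fault(current_value, alarm_threshold, historical_values,
                      regularity_ratio=0.8):
        if current_value <= alarm_threshold:
            return False
        if not historical_values:
            return True
        breaches = sum(1 for value in historical_values if value > alarm_threshold)
        # If the indicator breaches the threshold in most historical windows,
        # treat the breach as the service's normal running pattern, not a fault.
        return breaches / len(historical_values) < regularity_ratio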
If the computing engine 140 judges that a background module is currently faulty, it can further locate the cause of the fault in combination with the historical service operation data. Specifically, the computing engine 140 can obtain the historical service operation data related to each possibly abnormal background module and then judge, in combination with this data, whether each possibly abnormal background module is really faulty. For example, suppose the obtained historical service operation data indicates that faults of the current kind were mostly caused by background module A; the computing engine 140 can then refer to this information and increase the probability that background module A is faulty.
It should be understood that the above ways of judging whether a background module is faulty in combination with the historical change trend of the service data, and of determining the cause of the fault in combination with the historical service operation data, are only illustrative; in practical applications the computing engine 140 may also use other ways to do so, and the concrete implementation is not limited here.
Optionally, since the alarm threshold corresponding to a statistical indicator may also affect the accuracy of fault judgment, in practical applications the alarm threshold corresponding to the statistical indicator can be continuously updated in combination with the historical service operation data, so as to keep the alarm threshold reasonable, i.e. to ensure that faults are judged accurately based on the configured alarm threshold.
That is, the intelligent analysis module is further configured to determine the alarm threshold corresponding to the statistical indicator according to the obtained historical service operation data. Specifically, since the historical service operation data stored in the data warehouse may include the feedback data uploaded after the user repairs a fault, the intelligent analysis module can obtain from the data warehouse the historical monitoring results corresponding to the statistical indicator and the feedback data corresponding to those historical monitoring results, and then redetermine the alarm threshold corresponding to the statistical indicator from the obtained historical monitoring results and feedback data.
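As an editorial illustration of how the intelligent analysis module might recompute an alarm threshold from historical monitoring results and repair feedback (the update rule below is an assumption, not a prescription of the text):

    # Minimal sketch: recompute the alarm threshold of one statistical indicator
    # from (historical monitoring result, confirmed as a real fault) pairs, so
    # that confirmed faults stay above the threshold and confirmed false alarms
    # stay below it.
    def update_threshold(history, current_threshold):
        real_faults = [value for value, confirmed in history if confirmed]
        false_alarms = [value for value, confirmed in history if not confirmed]
        if real_faults and false_alarms:
            # Place the threshold between the largest false alarm and the
            # smallest confirmed fault.
            return (max(false_alarms) + min(real_faults)) / 2.0
        if false_alarms:
            return max(current_threshold, max(false_alarms) + 1)
        return current_threshold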
It should be understood that in practical applications, the intelligent analysis module can update the alarm threshold according to a preset update period, i.e. update it once per update period, to keep the alarm threshold reasonable.
Optionally, in the case where the computing engine 140 also needs to determine the failure level, the intelligent analysis module can continuously update the failure level list in combination with the historical service operation data. Specifically, since different failure levels correspond to different alarm notification styles, in order to ensure that a reasonable notification style is used to remind the responsible staff for each fault cause, the intelligent analysis module can continuously update the failure level list according to the historical service operation data it obtains and then send the updated list to the computing engine 140, so that the computing engine 140 can determine the failure level of a fault cause according to that list.
Specifically, since the historical service operation data stored in the data warehouse also includes the feedback data uploaded after the responsible staff repair a fault, the intelligent analysis module can determine, based on this feedback data, the failure level corresponding to each combination of fault cause and monitoring results, i.e. the severity of the fault corresponding to each such combination, and then record in the failure level list the correspondence between the combinations of fault cause and monitoring results and the failure levels. Accordingly, when the computing engine 140 determines a failure level, it can directly determine, according to the failure level list, the failure level corresponding to the current combination of fault cause and monitoring results.
It should be understood that in practical applications, the intelligent analysis module can update the failure level list according to a preset update period, i.e. update it once per update period, to keep the failure level list reasonable.
It should be noted that when the service monitoring system includes the data warehouse and the intelligent analysis module, the computing engine 140 can also further determine the estimated fault recovery time, a monitoring data change curve and a historical trend comparison curve. For example, the computing engine 140 can obtain historical service operation data from the intelligent analysis module and determine the estimated recovery time of the fault from the feedback data uploaded after the responsible staff repaired previous faults. As another example, the computing engine 140 can obtain the historical monitoring results of a short period (e.g. one or two hours) and draw the monitoring data change curve from the current monitoring result and the obtained short-period historical monitoring results. As a further example, the computing engine 140 can obtain the historical monitoring results of several preset monitoring periods (e.g. one day or one week) and draw the historical trend comparison curve from the current monitoring result and the obtained historical monitoring results.
It should be noted that when the service monitoring system includes the data warehouse, the monitoring data processing module 120 also needs to transmit the monitoring results it computes to the data warehouse, so that the monitoring results are stored in the data warehouse.
Optionally, in one possible implementation, the service monitoring system can actively inform the responsible staff that a background module is currently faulty and of the cause of the fault. Specifically, the service monitoring system may further include a report engine and a work-order pushing module. The report engine obtains the monitoring results and the historical service operation data from the computing engine and generates a fault data variation chart from them. The work-order pushing module obtains the fault data variation chart, the fault cause and the failure level, looks up the contact information of the person responsible for the fault cause, and notifies that person of the fault cause in the reminder style corresponding to the failure level.
Specifically, after the computing engine 140 determines the fault cause and the failure level, it can transmit the monitoring results and the related historical service operation data from which the fault cause was determined to the report engine; after receiving the data transmitted by the computing engine 140, the report engine draws the fault data variation chart from the received monitoring results and related historical service operation data, the fault data variation chart reflecting the change trend of the data related to the current fault. The work-order pushing module obtains the generated fault data variation chart from the report engine and the fault cause and failure level from the computing engine 140; it then looks up, in a table recording the mapping between fault causes and responsible persons, the person responsible for the current fault cause, obtains that person's contact information and, based on the obtained contact information, notifies the responsible person of the current fault and its cause in the reminder style corresponding to the failure level.
It should be noted that the fault data variation chart may be as shown in Fig. 2, i.e. the chart shows the change trend of the data related to the fault on the day before the alarm and the change trend of the same data on the day of the alarm. Of course, the fault data variation chart shown in Fig. 2 is only an example; in practical applications, the fault data variation chart can take other forms and show other data content, such as the change trend of sub-health data; neither the form of the fault data variation chart nor the data content it shows is specifically limited here.
It should be noted that in practical applications, the service monitoring system can preset different reminder styles for different failure levels. In order to keep fault handling timely while disturbing the responsible staff as little as possible (i.e. reducing the reminders sent to them when the fault is not serious), the service monitoring system can divide the failure levels finely in advance. For example, the service monitoring system can set failure levels 0 to 5 in order of severity from high to low: for a level-0 fault, the responsible person can be reminded by an immediate telephone notification that the service is currently faulty and of the corresponding fault cause; for a level-1 fault, the responsible person can be notified by an instant message and required to complete the handling of the fault within a specified time; for a level-2 fault, the responsible person can be reminded by an instant message that the service is currently faulty and of the corresponding fault cause, but without setting a handling deadline; for a level-3 fault, reminder messages can be sent to the responsible person on a per-minute basis to remind them of the current fault and its cause; for a level-4 fault, reminder mails can be sent to the responsible person on a per-hour basis to remind them of the current fault and its cause; and for a level-5 fault, the responsible person can be reminded of the current fault and its cause by way of a daily report summary (i.e. a report sent to the responsible person once a day).
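The level-to-reminder mapping described above can be captured in a small configuration table, sketched below for illustration (the channel identifiers are assumptions; the text only fixes the levels 0 to 5 and the general reminder styles):

    # Minimal sketch: map each failure level (0 = most severe, 5 = least severe)
    # to the reminder policy described above. Channel identifiers such as
    # "phone_call" are illustrative assumptions.
    REMINDER_POLICY = {
        0: {"channel": "phone_call"},                          # immediate phone reminder
        1: {"channel": "instant_message", "deadline": True},   # fix within a specified time
        2: {"channel": "instant_message", "deadline": False},  # no handling deadline
        3: {"channel": "instant_message", "period": "minute"}, # reminders every minute
        4: {"channel": "email", "period": "hour"},             # reminder mail every hour
        5: {"channel": "daily_report", "period": "day"},       # daily summary report
    }

    def notify(failure_level, owner_contact):
        policy = REMINDER_POLICY[failure_level]
        return "notify {} via {}".format(owner_contact, policy["channel"])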
It should be understood that the above division of failure levels and the reminder styles corresponding to the failure levels are only examples; in practical applications, the failure levels can be divided in other ways and corresponding reminder styles can be set for the different failure levels accordingly.
It should be noted that after the responsible staff have finished handling a fault, they can also feed back the related handling information through the work-order pushing module; after the work-order pushing module receives the handling information fed back by the user, it further writes this information into the data warehouse, so that the intelligent analysis module can subsequently update parameters such as the alarm threshold and the failure levels based on this handling information.
Optionally, in another possible implementation, the service monitoring system can show the fault data variation chart and the fault cause to the user in response to a fault data query operation triggered by the user. Specifically, the service monitoring system may further include a report engine and a visual management module. The report engine obtains the monitoring results and the historical service operation data from the computing engine and generates the fault data variation chart from them. The visual management module, in response to a fault data viewing operation triggered by the user, obtains the fault cause and the fault data variation chart and displays the fault cause and the fault data variation chart.
Specifically, after the computing engine 140 determines the fault cause and the failure level, it can transmit the monitoring results and the related historical service operation data from which the fault cause was determined to the report engine; after receiving the data transmitted by the computing engine 140, the report engine draws the fault data variation chart from the received monitoring results and related historical service operation data, the chart reflecting the change trend of the data related to the current fault. After the visual management module detects the fault data viewing operation triggered by the user, it obtains the determined fault cause from the computing engine 140 and the fault data variation chart from the report engine, and presents the obtained fault cause and fault data variation chart to the user through the relevant display interface.
Optionally, in some cases, a meta-monitoring system may further be provided to monitor the service monitoring system of the present application. This meta-monitoring system is a system independent of the service monitoring system of the present application, and can monitor in real time the operating state of each operational module in the service monitoring system, for example, monitoring whether each operational module in the service monitoring system is currently operating normally. In turn, when an abnormally operating module is detected in the service monitoring system, the relevant responsible person is informed that an operational module in the service monitoring system is abnormal.
When the above service monitoring system monitors the background modules, in addition to referring to the associated monitoring data, it further refers to the log data and generates, based on the log data, a topological diagram for characterizing the call relations between background modules in response to a single service request. Correspondingly, when determining the failure cause, the computing engine can combine the call relations between background modules in the topological diagram to distinguish which background modules generate abnormal data because of their own failures and which background modules generate abnormal data because they are affected by other faulty background modules; that is, it can accurately locate the background modules in which a failure truly exists, so that the relevant responsible person can maintain and adjust the background modules accordingly based on the failure cause located in this way.
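For illustration only, the following sketch shows one way such call relations could be used to separate root-cause modules from modules that merely receive abnormal data from a faulty callee; the rule applied here (an abnormal module none of whose callees is abnormal is treated as a root cause) is an assumption and not necessarily the exact logic of the computing engine.

# Minimal sketch of distinguishing root-cause modules from modules that are only
# affected downstream, using the per-request call topology. The "abnormal module
# with no abnormal callees" rule is an assumption used for illustration.
from typing import Dict, List, Set

def locate_root_causes(call_graph: Dict[str, List[str]],
                       abnormal: Set[str]) -> Set[str]:
    """call_graph maps caller -> list of callees observed for one service request."""
    root_causes = set()
    for module in abnormal:
        callees = call_graph.get(module, [])
        # If none of the modules this one depends on is abnormal,
        # its abnormal data is likely produced by its own failure.
        if not any(callee in abnormal for callee in callees):
            root_causes.add(module)
    return root_causes

# Example: gateway -> order -> inventory; inventory is the true failure, while
# order and gateway only look abnormal because their callee failed.
graph = {"gateway": ["order"], "order": ["inventory"], "inventory": []}
print(locate_root_causes(graph, {"gateway", "order", "inventory"}))  # {'inventory'}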
To facilitate a further understanding of the service monitoring system provided by the embodiments of the present application, an overall introduction to the service monitoring system provided by the present application is given below with reference to Fig. 3.
Referring to Fig. 3, Fig. 3 is a structural schematic diagram of another service monitoring system provided by an embodiment of the present application. As shown in Fig. 3, the service monitoring system includes: a data acquisition module 3100, a monitoring data processing module 3200, a log data processing module 3300, a data warehouse 3400, an intelligent analysis module 3500, a bypass verification module 3600, a computing engine 3700, a report engine 3800, a work order pushing module 3900 and a visual management module 3110. The data acquisition module 3100 includes a version change information acquisition unit 3101, an infrastructure information acquisition unit 3102, an active probe unit 3103 and a business information acquisition unit 3104; the monitoring data processing module 3200 includes a data middleware 3201, a real-time statistics unit 3202 and a time series database 3203; the log data processing module 3300 includes a log storage unit 3301, a topology generation unit 3302 and an abnormal topology generation unit 3303.
When the service monitoring system runs, the data acquisition module 3100 obtains service operation basic data. Specifically, the version change information acquisition unit 3101 obtains service release change information, the infrastructure information acquisition unit 3102 obtains information on the underlying hardware facilities supporting service operation, the active probe unit 3103 actively probes the running state information of the service, and the business information acquisition unit 3104 obtains the service operation information reported by the business. After obtaining the above service operation basic data, the data acquisition module 3100 divides it into monitoring data and log data, transmits the monitoring data to the data middleware 3201 in the monitoring data processing module 3200, and transmits the log data to the log storage unit 3301 in the log data processing module 3300.
In the monitoring data processing module 3200, the data middleware 3201 serves as a buffer for the monitoring data and provides data sources for the real-time statistics unit 3202. The real-time statistics unit 3202 retrieves the monitoring data from the data middleware 3201 and counts the monitoring data according to the preset statistical indicators, thereby obtaining the monitored results corresponding to the statistical indicators; in turn, the real-time statistics unit 3202 transmits the monitored results to the time series database 3203.
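Purely as an illustration of the counting step performed by the real-time statistics unit, the sketch below aggregates raw monitoring records into per-window monitored results; the indicator and field names used are assumptions.

# Minimal sketch of the real-time statistics step: monitoring records are aggregated
# per statistical indicator over one fixed time window. The indicator names
# ("request_count", "error_rate", "avg_latency_ms") are assumptions for illustration.
from collections import defaultdict
from typing import Dict, Iterable

def aggregate_window(records: Iterable[dict]) -> Dict[str, float]:
    stats = defaultdict(float)
    latencies = []
    for record in records:                      # each record is one monitoring data point
        stats["request_count"] += 1
        if record.get("status") != "ok":
            stats["error_count"] += 1
        latencies.append(record.get("latency_ms", 0.0))
    if stats["request_count"]:
        stats["error_rate"] = stats["error_count"] / stats["request_count"]
        stats["avg_latency_ms"] = sum(latencies) / len(latencies)
    return dict(stats)                          # monitored results keyed by statistical indicator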
The bypass verification module 3600 obtains the monitoring data from the data middleware 3201 and/or obtains the monitored results from the time series database 3203, and generates a bypass verification result based on the obtained monitoring data and/or monitored results; this bypass verification result can characterize whether a background module currently has a failure.
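As an illustrative sketch of what a bypass verification rule might look like (the concrete rule is an assumption, not taken from the embodiments), a simple check applied independently to recent active-probe results could be:

# Minimal sketch of a bypass verification check: an independent, simpler rule applied
# directly to raw monitoring data, used to cross-check the main alarm path.
# The "three consecutive failed probes" rule is an assumption for illustration.
from typing import Sequence

def bypass_verify(recent_probe_ok: Sequence[bool]) -> bool:
    """Returns True if the background module is judged faulty by the bypass check."""
    window = list(recent_probe_ok)[-3:]
    return len(window) == 3 and not any(window)

# Usage: feed the last few active-probe results for one background module.
faulty = bypass_verify([True, False, False, False])   # True -> report a failure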
After the time series database 3203 receives the monitored results, it further stores the monitored results into the data warehouse 3400, in which a large amount of service operation historical data is stored. The intelligent analysis module 3500 can determine the historical variation trend of the service data according to the service operation historical data stored in the data warehouse 3400, and update parameters such as the alarm thresholds corresponding to the statistical indicators and the failure level list.
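The following sketch illustrates one possible way the intelligent analysis step could derive an alarm threshold from historical data; the mean-plus-k-standard-deviations rule is an assumption, since the embodiments only state that thresholds are updated from the service operation historical data.

# Minimal sketch of deriving an alarm threshold for one statistical indicator
# from service operation history (assumed rule: mean + k standard deviations).
from statistics import mean, stdev
from typing import Sequence

def alarm_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """history: past values of one indicator, e.g. per-minute error rate."""
    if len(history) < 2:
        return float("inf")        # not enough history, never alarm
    return mean(history) + k * stdev(history)

# Usage: recompute periodically and push the new threshold to the computing engine.
threshold = alarm_threshold([0.01, 0.02, 0.015, 0.018, 0.02])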
In the log data processing module 3300, the log storage unit 3301 serves as a buffer for the log data and provides the topology generation unit 3302 with the data source needed to generate topological diagrams. The topology generation unit 3302 retrieves the log data from the log storage unit 3301, determines from the retrieved log data the call relations between the multiple background modules responding to a single service request, and draws the topological diagram accordingly based on those call relations. The abnormal topology generation unit 3303 further determines, on the basis of the topological diagram generated by the topology generation unit 3302, the partial abnormal topological diagram within it.
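As an illustration of how the topology generation unit could reconstruct call relations from log data, the sketch below assumes each log entry carries a request identifier together with caller and callee module names; these field names are assumptions.

# Minimal sketch of building the per-request call topology from log data.
# The "request_id", "caller" and "callee" fields are assumed for illustration.
from collections import defaultdict
from typing import Dict, Iterable, List

def build_topology(log_entries: Iterable[dict], request_id: str) -> Dict[str, List[str]]:
    graph: Dict[str, List[str]] = defaultdict(list)
    for entry in log_entries:
        if entry.get("request_id") != request_id:
            continue
        caller, callee = entry.get("caller"), entry.get("callee")
        if caller and callee and callee not in graph[caller]:
            graph[caller].append(callee)   # one edge per observed call
    return dict(graph)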
The computing engine 3700 obtains the monitored results from the time series database 3203; obtains the service operation historical data, the historical variation trend of the service data, the alarm thresholds corresponding to the statistical indicators and the failure level list from the intelligent analysis module 3500; obtains the bypass verification result from the bypass verification module 3600; and obtains the partial abnormal topological diagram from the abnormal topology generation unit 3303. It then determines, according to the obtained monitored results, the bypass verification result, the alarm thresholds corresponding to the statistical indicators and the historical variation trend of the service data, whether a background module currently has a failure. When it determines that a failure exists, it determines the failure cause according to the monitored results, the partial abnormal topological diagram and the service operation historical data, and determines the failure level corresponding to the current failure cause according to the monitored results, the failure cause and the failure level list.
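Purely for illustration, the fault-existence decision described above can be sketched as a combination of three signals; the specific combination rule and the 1.5x trend factor below are assumptions rather than the embodiments' exact logic.

# Minimal sketch of the fault-existence decision: an indicator is treated as faulty
# when its monitored result exceeds the alarm threshold, deviates strongly from the
# historical trend, or the bypass verification result already flags a failure.
def has_failure(monitored: float,
                threshold: float,
                historical_trend: float,
                bypass_says_faulty: bool) -> bool:
    exceeds_threshold = monitored > threshold
    deviates_from_trend = historical_trend > 0 and monitored > 1.5 * historical_trend
    return exceeds_threshold or deviates_from_trend or bypass_says_faulty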
Optionally, the computing engine 3700 may also determine the failure influence range according to the monitored results, determine the estimated failure recovery time according to the service operation historical data, and draw a monitoring data variation curve and a historical trend comparison curve according to the monitored results and the service operation historical data.
The report engine 3800 obtains the monitored results and the service operation historical data from the computing engine 3700, and draws the fault data variation chart according to the obtained monitored results and service operation historical data. In turn, the work order pushing module 3900 obtains the fault data variation chart from the report engine 3800, obtains the failure cause and the failure level from the computing engine 3700, looks up the contact information of the responsible person corresponding to the failure cause, and notifies the relevant responsible person, in the prompting mode corresponding to the failure level, that a failure currently exists in a background module together with the specific failure cause. In addition, after completing the maintenance for the failure, the user may also write related feedback information into the work order pushing module 3900, and the work order pushing module 3900 stores the related feedback information into the data warehouse 3400.
When the user triggers a fault data viewing operation through the visual management module 3110, the visual management module 3110 correspondingly obtains the failure cause from the computing engine 3700 and obtains the fault data variation chart from the report engine 3800; it then shows the obtained fault data variation chart and failure cause to the user through a display interface.
Furthermore, a meta-monitoring system 3120 independent of the service monitoring system may also be provided, and the meta-monitoring system 3120 is used to monitor the working condition of each operational module in the service monitoring system; if the meta-monitoring system 3120 detects that an operational module in the service monitoring system is abnormal, it correspondingly notifies the relevant responsible person.
For the service monitoring system described in the above embodiments, the embodiments of the present application further provide a corresponding service monitoring method. The service monitoring method provided by the embodiments of the present application is introduced below.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of the service monitoring method provided by an embodiment of the present application. It should be noted that the execution subject of the service monitoring method is usually a server. As shown in Fig. 4, the service monitoring method includes the following steps:
Step 401: obtain service operation basic data, and divide the service operation basic data into monitoring data and log data.
Step 402: count the monitoring data according to a preset statistical indicator to obtain the monitored results corresponding to the statistical indicator.
Step 403: generate, according to the log data, a topological diagram corresponding to a single service request; the topological diagram is used to characterize the call relations between background modules responding to the single service request.
Step 404: determine, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether a background module has a failure.
Step 405: when a failure exists, determine the failure cause according to the monitored results and the topological diagram.
Step 401 may be executed by the data acquisition module 110 in the service monitoring system shown in Fig. 1, step 402 may be executed by the monitoring data processing module 120 in the service monitoring system shown in Fig. 1, step 403 may be executed by the log data processing module 130 in the service monitoring system shown in Fig. 1, and steps 404 and 405 may be executed by the computing engine 140 in the service monitoring system shown in Fig. 1.
It should be understood that, in practical applications, the above steps 401 to 405 are not limited to being executed by the above modules; several of the above steps may also be executed by the same module, and no limitation is imposed herein on the execution subject of the above steps 401 to 405.
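For readers who prefer a concrete walk-through, the sketch below strings steps 401 to 405 together on toy in-memory data; all field names and the simple latency-threshold rule are assumptions made only for illustration.

# Minimal end-to-end sketch of steps 401-405 on toy data (illustrative only).
basic_data = [
    {"kind": "monitor", "module": "order", "latency_ms": 900.0},
    {"kind": "log", "request_id": "r1", "caller": "gateway", "callee": "order"},
]

# Step 401: split the basic data into monitoring data and log data.
monitoring = [d for d in basic_data if d["kind"] == "monitor"]
logs = [d for d in basic_data if d["kind"] == "log"]

# Step 402: count the monitoring data by indicator (here: one latency reading per module).
latency_by_module = {m["module"]: m["latency_ms"] for m in monitoring}

# Step 403: build the call topology for one service request from the log data.
topology = {(e["caller"], e["callee"]) for e in logs if e["request_id"] == "r1"}

# Steps 404-405: compare against an alarm threshold and name the faulty module.
THRESHOLD_MS = 500.0
faulty = [mod for mod, latency in latency_by_module.items() if latency > THRESHOLD_MS]
if faulty:
    print(f"module {faulty[0]} exceeds the latency threshold; call path: {topology}")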
Optionally, the service monitoring method further includes:
obtaining service operation historical data, and determining the historical variation trend of the service data according to the obtained service operation historical data;
then the determining, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether the background module has a failure comprises:
determining, according to the monitored results, the alarm threshold corresponding to the statistical indicator and the historical variation trend of the service data, whether the background module has a failure;
and the determining the failure cause according to the monitored results and the topological diagram comprises:
determining the failure cause according to the monitored results, the topological diagram and the obtained service operation historical data.
Optionally, the service monitoring method further includes:
determining the alarm threshold corresponding to the statistical indicator according to the obtained service operation historical data.
Optionally, the service monitoring method further includes:
determining, according to the log data, the partial abnormal topological diagram within the topological diagram;
then the determining the failure cause according to the monitored results and the topological diagram comprises:
determining the failure cause according to the monitored results and the partial abnormal topological diagram.
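As an illustrative sketch of deriving the partial abnormal topological diagram, one possible (assumed) approach keeps only the call edges whose two endpoints are both flagged abnormal:

# Minimal sketch of extracting the partial abnormal topology from the full call graph.
# The abnormality criterion (which modules go into abnormal_modules) is an assumption.
from typing import Dict, List, Set, Tuple

def abnormal_subgraph(call_graph: Dict[str, List[str]],
                      abnormal_modules: Set[str]) -> List[Tuple[str, str]]:
    edges = []
    for caller, callees in call_graph.items():
        for callee in callees:
            if caller in abnormal_modules and callee in abnormal_modules:
                edges.append((caller, callee))   # edge inside the abnormal region
    return edges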
Optionally, the service monitoring method further includes:
obtaining the monitoring data and/or the monitored results, and generating a bypass verification result according to the monitoring data and/or the monitored results, the bypass verification result being used to characterize whether the background module has a failure;
then the determining, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether the background module has a failure comprises:
determining, according to the monitored results, the alarm threshold corresponding to the statistical indicator and the bypass verification result, whether the background module has a failure.
Optionally, the service monitoring method further includes:
determining a failure level list according to the service operation historical data, the failure level list recording the correspondence between combinations of monitored results and failure causes on the one hand and failure levels on the other;
determining, according to the monitored results, the failure cause and the failure level list, the failure level corresponding to the failure cause.
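For illustration, the failure level lookup can be sketched as a table keyed by a coarse monitored-result bucket and the failure cause; the bucketing rule and the example entries below are assumptions.

# Minimal sketch of the failure level lookup. The bucket names, example causes and
# thresholds are assumptions introduced only to show the table-driven mapping.
from typing import Dict, Tuple

FAILURE_LEVEL_LIST: Dict[Tuple[str, str], int] = {
    ("error_rate_high", "downstream_timeout"): 1,
    ("error_rate_high", "config_change"): 2,
    ("latency_high", "downstream_timeout"): 3,
}

def failure_level(error_rate: float, latency_ms: float, cause: str) -> int:
    if error_rate > 0.05:
        bucket = "error_rate_high"
    elif latency_ms > 500:
        bucket = "latency_high"
    else:
        bucket = "normal"
    return FAILURE_LEVEL_LIST.get((bucket, cause), 5)   # default to the least severe level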
Optionally, the service monitoring method further includes:
generating a fault data variation chart according to the monitored results and the service operation historical data;
looking up the contact information of the responsible person corresponding to the failure cause, and notifying the responsible person of the failure cause and the fault data variation chart in the prompting mode corresponding to the failure level.
Optionally, the service monitoring method further includes:
obtaining the monitored results and the service operation historical data, and generating a fault data variation chart according to the monitored results and the service operation historical data;
in response to a fault data viewing operation triggered by a user, showing the failure cause and the fault data variation chart.
Optionally, the service operation basic data includes: service release change information, underlying hardware facilities information, running state information of the service obtained by active probing, and service operation information reported by the business.
When the above service monitoring method monitors the background modules, in addition to referring to the associated monitoring data, it further refers to the log data and generates, based on the log data, a topological diagram for characterizing the call relations between background modules in response to a single service request. When determining the failure cause, the call relations between background modules in the topological diagram can correspondingly be combined to distinguish which background modules generate abnormal data because of their own failures and which background modules generate abnormal data because they are affected by other faulty background modules; that is, the background modules in which a failure truly exists are accurately located, so that the relevant responsible person can maintain and adjust the background modules accordingly based on the failure cause located in this way.
The embodiments of the present application further provide a server supporting the operation of the service monitoring system. Fig. 5 is a structural schematic diagram of a server provided by an embodiment of the present application. The server 500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 542 or data 544. The memory 532 and the storage medium 530 may provide transient storage or persistent storage. The programs stored on the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may further include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 5.
The CPU 522 is configured to execute the following steps:
obtaining service operation basic data, and dividing the service operation basic data into monitoring data and log data;
counting the monitoring data according to a preset statistical indicator to obtain the monitored results corresponding to the statistical indicator;
generating, according to the log data, a topological diagram corresponding to a single service request, the topological diagram being used to characterize the call relations between background modules responding to the single service request;
determining, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether a background module has a failure;
when a failure exists, determining the failure cause according to the monitored results and the topological diagram.
Optionally, the CPU 522 may also be configured to execute the steps of any implementation of the service monitoring method in the embodiments of the present application.
The embodiments of the present application further provide a computer-readable storage medium for storing a computer program, the computer program being used to execute any one of the implementations of the service monitoring method described in the foregoing embodiments.
The embodiments of the present application further provide a computer program product comprising instructions which, when run on a computer, cause the computer to execute any one of the implementations of the service monitoring method described in the foregoing embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary: the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through certain interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements to some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A service monitoring system, characterized in that the system comprises: a data acquisition module, a monitoring data processing module, a log data processing module and a computing engine;
the data acquisition module is configured to obtain service operation basic data, and divide the service operation basic data into monitoring data and log data;
the monitoring data processing module is configured to obtain the monitoring data, and count the monitoring data according to a preset statistical indicator to obtain the monitored results corresponding to the statistical indicator;
the log data processing module is configured to obtain the log data, and generate, according to the log data, a topological diagram corresponding to a single service request, the topological diagram being used to characterize the call relations of the background modules responding to the single service request;
the computing engine is configured to obtain the monitored results and the topological diagram, determine, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether the background module has a failure, and, when a failure exists, determine the failure cause according to the monitored results and the topological diagram.
2. The system according to claim 1, characterized in that the system further comprises: a data warehouse and an intelligent analysis module;
the data warehouse is configured to store service operation historical data;
the intelligent analysis module is configured to obtain the service operation historical data, and determine the historical variation trend of the service data according to the obtained service operation historical data;
the computing engine is further configured to obtain the service operation historical data and the historical variation trend of the service data from the intelligent analysis module;
the computing engine is specifically configured to determine, according to the monitored results, the alarm threshold corresponding to the statistical indicator and the historical variation trend of the service data, whether the background module has a failure, and, when a failure exists, determine the failure cause according to the monitored results, the topological diagram and the obtained service operation historical data.
3. The system according to claim 2, characterized in that the intelligent analysis module is further configured to determine the alarm threshold corresponding to the statistical indicator according to the obtained service operation historical data.
4. The system according to claim 1, characterized in that the log data processing module is further configured to determine, according to the log data, the partial abnormal topological diagram within the topological diagram;
the computing engine is further configured to obtain the partial abnormal topological diagram;
the computing engine is specifically configured to determine the failure cause according to the monitored results and the partial abnormal topological diagram.
5. The system according to claim 1, characterized in that the system further comprises: a bypass verification module;
the bypass verification module is configured to obtain the monitoring data and/or the monitored results, and generate a bypass verification result according to the monitoring data and/or the monitored results, the bypass verification result being used to characterize whether the background module has a failure;
the computing engine is further configured to obtain the bypass verification result;
the computing engine is specifically configured to determine, according to the monitored results, the alarm threshold corresponding to the statistical indicator and the bypass verification result, whether the background module has a failure.
6. The system according to claim 2, characterized in that the intelligent analysis module is further configured to determine a failure level list according to the obtained service operation historical data, the failure level list recording the correspondence between combinations of monitored results and failure causes and the failure levels;
the computing engine is further configured to obtain the failure level list, and determine, according to the monitored results, the failure cause and the failure level list, the failure level corresponding to the failure cause.
7. The system according to claim 6, characterized in that the system further comprises: a report engine and a work order pushing module;
the report engine is configured to obtain the monitored results and the service operation historical data from the computing engine, and generate a fault data variation chart according to the monitored results and the service operation historical data;
the work order pushing module is configured to obtain the fault data variation chart, the failure cause and the failure level, look up the contact information of the responsible person corresponding to the failure cause, and notify the responsible person of the failure cause in the prompting mode corresponding to the failure level.
8. The system according to claim 2, characterized in that the system further comprises: a report engine and a visual management module;
the report engine is configured to obtain the monitored results and the service operation historical data from the computing engine, and generate a fault data variation chart according to the monitored results and the service operation historical data;
the visual management module is configured to, in response to a fault data viewing operation triggered by a user, obtain the failure cause and the fault data variation chart, and show the failure cause and the fault data variation chart.
9. The system according to claim 1, characterized in that the data acquisition module comprises: a version change information acquisition unit, an infrastructure information acquisition unit, an active probe unit and a business information acquisition unit;
the version change information acquisition unit is configured to obtain service release change information;
the infrastructure information acquisition unit is configured to obtain information on the underlying hardware facilities supporting service operation;
the active probe unit is configured to actively probe the running state information of the service;
the business information acquisition unit is configured to obtain the service operation information reported by the business.
10. A service monitoring method, characterized in that the method comprises:
obtaining service operation basic data, and dividing the service operation basic data into monitoring data and log data;
counting the monitoring data according to a preset statistical indicator to obtain the monitored results corresponding to the statistical indicator;
generating, according to the log data, a topological diagram corresponding to a single service request, the topological diagram being used to characterize the call relations of the background modules responding to the single service request;
determining, according to the monitored results and the alarm threshold corresponding to the statistical indicator, whether the background module has a failure;
when a failure exists, determining the failure cause according to the monitored results and the topological diagram.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910543000.9A CN110287081A (en) 2019-06-21 2019-06-21 A kind of service monitoring system and method

Publications (1)

Publication Number Publication Date
CN110287081A true CN110287081A (en) 2019-09-27

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103812699A (en) * 2014-02-17 2014-05-21 无锡华云数据技术服务有限公司 Monitoring management system based on cloud computing
CN105207806A (en) * 2015-08-20 2015-12-30 百度在线网络技术(北京)有限公司 Monitoring method and apparatus of distributed service
CN105406991A (en) * 2015-10-26 2016-03-16 上海华讯网络系统有限公司 Method and system for generating service threshold by historical data based on network monitoring indexes
CN108234168A (en) * 2016-12-15 2018-06-29 腾讯科技(深圳)有限公司 A kind of method for exhibiting data and system based on service topology
US20190097872A1 (en) * 2017-09-25 2019-03-28 Dasan Zhone Solutions, Inc. System and method for remote maintenance
CN107864058A (en) * 2017-11-09 2018-03-30 凌云天博光电科技股份有限公司 Fault judgment method and device
CN107943668A (en) * 2017-12-15 2018-04-20 江苏神威云数据科技有限公司 Computer server cluster daily record monitoring method and monitor supervision platform
CN108322351A (en) * 2018-03-05 2018-07-24 北京奇艺世纪科技有限公司 Generate method and apparatus, fault determination method and the device of topological diagram
CN109687992A (en) * 2018-09-07 2019-04-26 平安科技(深圳)有限公司 Monitoring method, device, equipment and the computer readable storage medium of service node
CN109766247A (en) * 2018-12-19 2019-05-17 平安科技(深圳)有限公司 Alarm setting method and system based on system data monitoring

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Xiaodan: "Ideas on Building an IT Operation and Maintenance Monitoring System Oriented to Business Application Transactions", China Financial Computer *
Nie Jing: "Detection Algorithm for Server Node Failures in Cloud Computing Systems", Journal of Inner Mongolia Normal University (Natural Science Edition) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10805151B2 (en) * 2018-05-31 2020-10-13 Beijing Baidu Netcom Science Technology Co., Ltd. Method, apparatus, and storage medium for diagnosing failure based on a service monitoring indicator of a server by clustering servers with similar degrees of abnormal fluctuation
US20190372832A1 (en) * 2018-05-31 2019-12-05 Beijing Baidu Netcom Science Technology Co., Ltd. Method, apparatus and storage medium for diagnosing failure based on a service monitoring indicator
CN110990433A (en) * 2019-11-21 2020-04-10 深圳马可孛罗科技有限公司 Real-time service monitoring and early warning method and early warning device
CN110855503A (en) * 2019-11-22 2020-02-28 叶晓斌 Fault cause determining method and system based on network protocol hierarchy dependency relationship
CN110855502A (en) * 2019-11-22 2020-02-28 叶晓斌 Fault cause determination method and system based on time-space analysis log
CN111464390A (en) * 2020-03-31 2020-07-28 中国建设银行股份有限公司 Network application system monitoring and early warning method and system
CN111724489A (en) * 2020-06-08 2020-09-29 上海申通地铁集团有限公司 General software system of ticket selling and checking machine
CN111724489B (en) * 2020-06-08 2024-01-23 上海申通地铁集团有限公司 General software system of ticket vending and checking machine
CN111882289A (en) * 2020-07-01 2020-11-03 国网河北省电力有限公司经济技术研究院 Device and method for measuring and calculating item data audit index interval
CN111882289B (en) * 2020-07-01 2023-11-14 国网河北省电力有限公司经济技术研究院 Device and method for measuring and calculating project data auditing index interval
CN112003747A (en) * 2020-08-21 2020-11-27 中国建设银行股份有限公司 Fault positioning method of cloud virtual gateway
EP4207645A4 (en) * 2020-10-10 2024-02-28 Huawei Tech Co Ltd Network intent monitoring method, network intent monitoring system and storage medium
WO2022073406A1 (en) * 2020-10-10 2022-04-14 华为技术有限公司 Network intent monitoring method, network intent monitoring system and storage medium
CN112491622B (en) * 2020-11-30 2023-06-09 苏宁金融科技(南京)有限公司 Method and system for locating fault root cause of service system
CN112491622A (en) * 2020-11-30 2021-03-12 苏宁金融科技(南京)有限公司 Method and system for positioning fault root cause of business system
CN112783720A (en) * 2021-01-05 2021-05-11 广州品唯软件有限公司 Topological structure diagram generation method and device, computer equipment and display system
CN113656252B (en) * 2021-08-24 2023-07-25 北京百度网讯科技有限公司 Fault positioning method, device, electronic equipment and storage medium
CN113656252A (en) * 2021-08-24 2021-11-16 北京百度网讯科技有限公司 Fault positioning method and device, electronic equipment and storage medium
CN116016201A (en) * 2021-11-04 2023-04-25 贵州电网有限责任公司 Abnormal early warning method based on business backtracking
CN115083166B (en) * 2022-07-14 2022-11-11 深圳市维力谷无线技术股份有限公司 Vehicle-road communication testing system and method based on 5G technology
CN115083166A (en) * 2022-07-14 2022-09-20 深圳市维力谷无线技术股份有限公司 Vehicle-road communication testing system and method based on 5G technology
CN115955385A (en) * 2022-09-29 2023-04-11 中国联合网络通信集团有限公司 Fault diagnosis method and device for Internet of things service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190927