CN105320585A - Method and device for achieving application fault diagnosis - Google Patents

Info

Publication number
CN105320585A
CN105320585A CN201410324069.XA CN201410324069A CN105320585A CN 105320585 A CN105320585 A CN 105320585A CN 201410324069 A CN201410324069 A CN 201410324069A CN 105320585 A CN105320585 A CN 105320585A
Authority
CN
China
Prior art keywords
data
application
service
diagnosis
relevant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410324069.XA
Other languages
Chinese (zh)
Other versions
CN105320585B (en)
Inventor
谌颐
胡盛华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Venus Information Security Technology Co Ltd
Venus Info Tech Inc
Beijing Venus Information Technology Co Ltd
Original Assignee
Beijing Venus Information Security Technology Co Ltd
Beijing Venus Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Venus Information Security Technology Co Ltd, Beijing Venus Information Technology Co Ltd
Priority to CN201410324069.XA
Publication of CN105320585A
Application granted
Publication of CN105320585B
Legal status: Active
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and device for application fault diagnosis. The method comprises: collecting multidimensional application data; when a business application becomes abnormal, obtaining from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly; and comparing the obtained associated diagnosis data with the historical diagnosis data for each item to determine the application fault type. Because the business application anomaly is diagnosed using multidimensional application data, the one-sided results that arise when faults are diagnosed from a single kind of data are avoided, the business fault is determined more comprehensively, and the business anomaly is resolved.

Description

A method and device for application fault diagnosis
Technical field
The present invention relates to the field of computer applications, and in particular to a method and device for application fault diagnosis.
Background
With the development of IT applications, the business processes of enterprises are ever more closely tied to Internet technology, and application information systems made up of servers, databases, middleware and the like are becoming increasingly complex. Even as the skill requirements for technicians are gradually raised, troubleshooting keeps getting harder. The running quality of a business application (its ability to complete business, its speed and its stability) directly determines the level of service an enterprise can offer its users. Monitoring the performance of mission-critical applications, and analyzing and diagnosing performance problems promptly and effectively, is an urgent requirement for improving the availability of customer business applications.
At present, performance monitoring and management of business applications mainly covers the following aspects: 1. monitoring access to the application; 2. when a performance anomaly occurs in the business application, determining whether it is caused by abnormal network or system performance; 3. when an access anomaly occurs in the business application, determining whether it is caused by an attack on the network or the application. Diagnosing business application faults effectively helps technicians restore the application promptly.
Existing fault diagnosis for business applications mainly analyzes faults from a single kind of data, such as traffic data or monitoring data (for example, application logs). Because the data used for the analysis is of a single kind, the diagnosis results tend to be one-sided or insufficient, so considerable manual involvement is needed to complete the diagnosis.
Summary of the invention
To solve the above technical problem, the invention provides a method and device for application fault diagnosis that can comprehensively diagnose business faults from multidimensional data and reduce manual involvement.
To achieve this object, the invention discloses a method for application fault diagnosis, comprising:
collecting multidimensional application data;
when a business application becomes abnormal, obtaining from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly;
comparing the obtained associated diagnosis data with the historical diagnosis data for each item of associated diagnosis data, and determining the application fault type.
Further, the multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
Further, the monitoring data comprises at least: IP address and/or monitoring time and/or CPU utilization and/or disk utilization and/or disk input/output (io) and/or memory-related information and/or swap-space-related information and/or network-interface-related information and/or database response time and/or swap memory swapped in from disk to memory (si) and/or swap memory swapped out from memory to disk (so) and/or the amount written from memory to disk (bo) and/or the amount read from disk into memory (bi) and/or service state.
Further, the traffic data is organized into sessions, each uniquely identified by a five-tuple, and comprises at least: acquisition time and/or source/destination address and/or source/destination port and/or protocol and/or number of SYN packets sent (SYN being the handshake signal used when a TCP/IP connection is established) and/or number of FIN packets sent (FIN being a flag bit in the TCP header) and/or TCP-related information and/or number of RST packets sent and/or abnormal total traffic accessing a specified service per unit time.
Further, the application performance data comprises at least: source/destination address and/or destination port and/or request time and/or server response time and/or load time and/or page-related information and/or HTTP-related information and/or abnormal Tomcat global access rate and/or abnormal database access volume per unit time and/or abnormal number of current WebLogic sessions;
the application performance data is collected from HTTP protocol performance data and/or Oracle database service performance data and/or MySQL database server performance data.
Further, comparing the obtained associated diagnosis data with the historical diagnosis data for each item and determining the application fault type specifically comprises:
comparing each item of the obtained associated diagnosis data with its historical diagnosis data by means of a periodic baseline or a moving-window baseline, and determining the application fault type according to the preset threshold range of each item of associated diagnosis data.
Further, the historical diagnosis data is: the monitoring data within a first preset duration; the traffic data within a second preset duration; and the real-time application performance data.
Further, when the fault diagnosis yields no result, the method also comprises: storing the multidimensional data related to the anomaly and, after the historical data has been updated, attempting again to determine the application fault type.
Further, the method also comprises: providing fault recovery suggestions from the historical diagnosis data according to the determined application fault type.
In another aspect, the application also provides a device for application fault diagnosis, comprising a collecting unit, an acquiring unit and a fault diagnosis unit, wherein:
the collecting unit is configured to collect multidimensional application data;
the acquiring unit is configured to, when a business application becomes abnormal, obtain from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly;
the fault diagnosis unit is configured to compare the obtained associated diagnosis data with the historical diagnosis data for each item of associated diagnosis data and determine the application fault type.
Further, the multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
Further, the monitoring data comprises at least: IP address and/or monitoring time and/or CPU utilization and/or disk utilization and/or disk input/output (io) and/or memory-related information and/or swap-space-related information and/or network-interface-related information and/or database response time and/or swap memory swapped in from disk to memory (si) and/or swap memory swapped out from memory to disk (so) and/or the amount written from memory to disk (bo) and/or the amount read from disk into memory (bi) and/or service state.
Further, the traffic data is organized into sessions, each uniquely identified by a five-tuple, and comprises at least: acquisition time and/or source/destination address and/or source/destination port and/or protocol and/or number of SYN packets sent (SYN being the handshake signal used when a TCP/IP connection is established) and/or number of FIN packets sent (FIN being a flag bit in the TCP header) and/or TCP-related information and/or number of RST packets sent and/or abnormal total traffic accessing a specified service per unit time.
Further, the application performance data comprises at least: source/destination address and/or destination port and/or request time and/or server response time and/or load time and/or page-related information and/or HTTP-related information and/or abnormal Tomcat global access rate and/or abnormal database access volume per unit time and/or abnormal number of current WebLogic sessions;
the application performance data is collected from HTTP protocol performance data and/or Oracle database service performance data and/or MySQL database server performance data.
Further, the fault diagnosis unit is specifically configured to compare each item of the obtained associated diagnosis data with its historical diagnosis data by means of a periodic baseline or a moving-window baseline, and to determine the application fault type according to the preset threshold range of each item of associated diagnosis data.
Further, the historical diagnosis data is: the monitoring data within a first preset duration; the traffic data within a second preset duration; and the real-time application performance data.
Further, the device also comprises a follow-up diagnosis unit, configured to store the multidimensional data related to the anomaly when the fault diagnosis yields no result and, after the historical data has been updated, to attempt again to determine the application fault type.
Further, the device also comprises a recovery suggestion unit, configured to provide fault recovery suggestions from the historical diagnosis data according to the determined application fault type.
The technical solution comprises: collecting multidimensional application data; when a business application becomes abnormal, obtaining from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly; comparing the obtained associated diagnosis data with the historical diagnosis data for each item, determining the application fault type, and analyzing the cause of the fault. By diagnosing business application anomalies from multidimensional application data, the invention avoids the one-sided results that arise when faults are diagnosed from a single kind of data, determines business faults more comprehensively, and resolves the business anomaly.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for application fault diagnosis according to the invention;
Fig. 2 is a structural block diagram of the device for application fault diagnosis according to the invention.
Detailed description
Fig. 1 is a flowchart of a method for application fault diagnosis. As shown in Fig. 1, the method comprises:
Step 100: collect multidimensional application data.
In this step, the collected multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
Further, the monitoring data comprises at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or swap memory swapped in from disk to memory (si), and/or swap memory swapped out from memory to disk (so), and/or the amount written from memory to disk (bo), and/or the amount read from disk into memory (bi), and/or service state.
The traffic data is organized into sessions, each uniquely identified by a five-tuple, and comprises at least: acquisition time and/or source/destination address and/or source/destination port and/or protocol and/or number of SYN packets sent (SYN being the handshake signal used when a TCP/IP connection is established) and/or number of FIN packets sent (FIN being a flag bit in the TCP header) and/or TCP-related information and/or number of RST packets sent and/or abnormal total traffic accessing a specified service per unit time. Here, the TCP-related information includes: the number of TCP retransmissions, the number of TCP checksum errors, the number of abnormally closed TCP connections, and so on.
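As an illustration of traffic data organized by five-tuple, the following is a minimal Python sketch, not the patented implementation. It assumes packets arrive as dictionaries; the field names (`src_ip`, `flags`, `length` and so on) and the function `aggregate_sessions` are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveTuple:
    """Session key: source/destination address, source/destination port, protocol."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_sessions(packets):
    """Group packet records by five-tuple and count SYN/FIN/RST packets and bytes.

    The packet dictionary layout is an assumption made for this sketch.
    """
    sessions = defaultdict(lambda: {"syn": 0, "fin": 0, "rst": 0, "bytes": 0})
    for pkt in packets:
        key = FiveTuple(pkt["src_ip"], pkt["dst_ip"],
                        pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        stats = sessions[key]
        for flag in ("syn", "fin", "rst"):
            if flag in pkt.get("flags", ()):
                stats[flag] += 1
        stats["bytes"] += pkt.get("length", 0)
    return dict(sessions)


packets = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 40000,
     "dst_port": 80, "protocol": "tcp", "flags": ("syn",), "length": 60},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 40000,
     "dst_port": 80, "protocol": "tcp", "flags": ("fin",), "length": 52},
    {"src_ip": "10.0.0.6", "dst_ip": "10.0.0.9", "src_port": 40001,
     "dst_port": 80, "protocol": "tcp", "flags": ("syn",), "length": 60},
]
sessions = aggregate_sessions(packets)
```

Keying on a frozen dataclass keeps each session hashable and makes the five fields explicit; a plain tuple would work equally well.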
The application performance data comprises at least: source/destination address and/or destination port and/or request time and/or server response time and/or load time and/or page-related information and/or HTTP-related information and/or abnormal Tomcat global access rate and/or abnormal database access volume per unit time and/or abnormal number of current WebLogic sessions.
Here, Tomcat is an existing web application server, and WebLogic is web middleware for Java applications.
The application performance data is collected from HTTP protocol performance data and/or Oracle database service performance data and/or MySQL database server performance data.
Here, the page-related information includes: page download time, the proportion of pages that have slowed down, and so on.
The HTTP-related information includes: HTTP access rate, HTTP error rate, abnormal number of HTTP requests per unit time, and so on.
Step 101: when a business application becomes abnormal, obtain from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly.
It should be noted that the temporal and spatial association of a business anomaly means the following: taking the time at which the anomaly occurred, associated diagnosis data are obtained from the multidimensional data according to the time information that can be determined in it; and the related associated diagnosis data are obtained from the information of the protocol layers involved.
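The association step can be sketched as a filter over the collected records. This is only a sketch under stated assumptions: the server IP address stands in for the spatial association (as in Embodiment 1), the symmetric five-minute window is an arbitrary choice, and the record layout and the function name `associated_records` are hypothetical.

```python
from datetime import datetime, timedelta


def associated_records(records, anomaly_time, server_ip,
                       window=timedelta(minutes=5)):
    """Keep records close in time to the anomaly that involve the given server.

    Temporal association: record time within +/- `window` of the anomaly.
    Spatial association (assumed here): the server IP appears in the record.
    """
    start, end = anomaly_time - window, anomaly_time + window
    return [
        r for r in records
        if start <= r["time"] <= end
        and server_ip in (r.get("ip"), r.get("src_ip"), r.get("dst_ip"))
    ]


anomaly_time = datetime(2014, 7, 1, 10, 0)
records = [
    {"time": datetime(2014, 7, 1, 9, 58), "ip": "10.0.0.9", "cpu": 35},
    {"time": datetime(2014, 7, 1, 9, 30), "ip": "10.0.0.9", "cpu": 20},
    {"time": datetime(2014, 7, 1, 10, 2), "src_ip": "10.0.0.5",
     "dst_ip": "10.0.0.9", "syn": 3},
    {"time": datetime(2014, 7, 1, 10, 1), "ip": "10.0.0.7", "cpu": 90},
]
selected = associated_records(records, anomaly_time, "10.0.0.9")
```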
Because business application anomalies are complex, those skilled in the art will understand that they cannot be enumerated exhaustively. To describe the invention clearly, several common business application anomalies are illustrated here, together with a brief account of some of the associated diagnosis data involved.
It should be noted that the business anomaly types are a summary of the kinds of business anomaly that those skilled in the art have arrived at from experience. The common business anomaly types and the associated diagnosis data they involve are as follows:
1. Business application service availability anomaly, including anomaly diagnosis for the availability of hosts, databases, middleware, service access and the like. The associated diagnosis data mainly include: service state (started/stopped), CPU utilization, disk utilization, memory utilization parameters and so on. The data for this kind of anomaly come mainly from the monitoring data.
2. Business application server response anomaly. The associated diagnosis data mainly include: application request time, application page download time, proportion of pages that have slowed down, HTTP access rate, HTTP error rate (s), server response time, database response time, swap memory swapped in from disk to memory (si), swap memory swapped out from memory to disk (so), free memory, the amount written from memory to disk (bo), the amount read from disk into memory (bi), CPU utilization and so on. Of these indicators, the first six are application performance data and the remainder are monitoring data.
3. Business application service access anomaly. The associated diagnosis data mainly include: abnormal total traffic accessing a specified service per unit time, abnormal number of HTTP requests per unit time, abnormal Tomcat global access rate, abnormal database access volume per unit time, and abnormal number of current WebLogic sessions. Of these diagnosis indicators, the first comes from the traffic collector and the others come from the application collector.
4. Business application traffic anomaly. The associated diagnosis data mainly include: abnormal protocol proportions (TCP/UDP/ICMP/IGMP) and abnormal traffic (bps, pps, sessions). These indicators come mainly from the traffic collector.
5. Business application service performance anomaly. The associated diagnosis data mainly include: service performance monitoring anomalies.
6. Business application service state anomaly. The associated diagnosis data mainly include: service state (started/stopped) and service state monitoring anomalies.
7. Business application anomaly caused by a network attack. The associated diagnosis data mainly include: abnormal number of SYN packets sent per unit time, abnormal average packet length, and worm event alarms on the line: CodeRed, Hard Disk Killer, SqlSlammer, Shock Wave (Blaster), Shock Wave Killer, Sasser, mail worms, WinNuke attacks, UdpFragmentFlood. These indicators come mainly from the traffic collector.
8. Business application line anomaly. The associated diagnosis data mainly include: abnormal layer-2 data traffic, TCP packet retransmission rate, TCP checksum error rate, number of abnormally closed TCP connections and so on. These indicators come from the traffic collector and the application collector.
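The correspondence between anomaly types and associated diagnosis data listed above lends itself to a lookup table. The sketch below covers only four of the eight types, and every identifier in it (`ANOMALY_METRICS`, the type keys and the metric names) is a hypothetical label chosen for illustration; the patent names the metrics only in prose.

```python
# Hypothetical identifiers; only part of the list above is encoded here.
ANOMALY_METRICS = {
    "service_availability": [
        "service_state", "cpu_utilization", "disk_utilization",
        "memory_utilization",
    ],
    "service_access": [
        "service_traffic_per_unit_time", "http_requests_per_unit_time",
        "tomcat_global_access_rate", "db_accesses_per_unit_time",
        "weblogic_current_sessions",
    ],
    "network_attack": [
        "syn_packets_per_unit_time", "average_packet_length",
        "worm_event_alerts",
    ],
    "line": [
        "layer2_traffic", "tcp_retransmission_rate",
        "tcp_checksum_error_rate", "tcp_abnormal_close_count",
    ],
}


def select_associated_metrics(anomaly_type, sample):
    """Pick from `sample` only the metrics associated with `anomaly_type`."""
    wanted = ANOMALY_METRICS.get(anomaly_type, [])
    return {name: sample[name] for name in wanted if name in sample}


sample = {
    "service_state": "started",
    "cpu_utilization": 92.0,
    "tcp_retransmission_rate": 0.04,
    "average_packet_length": 410,
}
availability_view = select_associated_metrics("service_availability", sample)
line_view = select_associated_metrics("line", sample)
```

An unknown anomaly type simply yields an empty selection, which matches the patent's remark that the list of types cannot be exhaustive.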
Step 102: compare the obtained associated diagnosis data with the historical diagnosis data for each item of associated diagnosis data, and determine the application fault type.
Specifically, each item of the obtained associated diagnosis data is compared with its historical diagnosis data by means of a periodic baseline or a moving-window baseline, and the application fault type is determined according to the preset threshold range of each item of associated diagnosis data.
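A minimal sketch of the threshold-range check follows. The patent does not state how out-of-range metrics map to a fault type, so the rule table here is an assumption, as are the names `out_of_range` and `classify`, the threshold values and the fault type labels.

```python
def out_of_range(current, thresholds):
    """Return the metrics whose current value lies outside its preset (low, high) range."""
    flagged = {}
    for name, value in current.items():
        low, high = thresholds[name]
        if not low <= value <= high:
            flagged[name] = value
    return flagged


def classify(flagged, rules):
    """Return the first fault type whose required metrics are all flagged, else None.

    `rules` maps a hypothetical fault type to the set of metrics that must be
    out of range; this mapping is an assumption, not part of the patent text.
    """
    for fault_type, required in rules.items():
        if required.issubset(flagged):
            return fault_type
    return None


# Hypothetical threshold ranges, as if derived from a baseline.
thresholds = {"cpu_utilization": (0, 80), "disk_utilization": (0, 90)}
current = {"cpu_utilization": 95, "disk_utilization": 40}
rules = {
    "disk_saturation": {"disk_utilization"},
    "host_overload": {"cpu_utilization"},
}

flagged = out_of_range(current, thresholds)
fault_type = classify(flagged, rules)
```

Returning `None` when no rule matches corresponds to the case where the diagnosis yields no result.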
In this step, the historical diagnosis data is: the monitoring data within a first preset duration; the traffic data within a second preset duration; and the real-time application performance data.
Here, because the monitoring data used consists mainly of intraday data of a log-like nature, the first preset duration generally refers to the monitoring data of the several most recent cycles. The cycle of the monitoring data depends on the type of monitoring data designed for the actual fault conditions, and is generally measured in minutes as the smallest unit.
For the traffic data, anomalies are determined by comparing short-term traffic parameters, so the second preset duration generally refers to a duration of about 20 s.
Of course, the first and second preset durations can be adjusted according to the actual application and requirements.
When the fault diagnosis yields no result, the method also comprises: storing the multidimensional data related to the anomaly and, after the historical data has been updated, attempting again to determine the application fault type.
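Storing undiagnosed cases and retrying them later might look like the following sketch. The class `PendingDiagnoses` and its methods are hypothetical names, and the diagnosis function is assumed to return `None` when it reaches no result.

```python
class PendingDiagnoses:
    """Holds anomaly data whose diagnosis produced no result, for a later retry."""

    def __init__(self):
        self._pending = []

    def defer(self, anomaly_data):
        """Store the multidimensional data of an undiagnosed anomaly."""
        self._pending.append(anomaly_data)

    def retry(self, diagnose):
        """Re-run `diagnose` on stored cases, e.g. after the historical data is updated.

        Cases that still return None stay pending; the rest are returned
        as (case, fault_type) pairs.
        """
        resolved, still_pending = [], []
        for case in self._pending:
            result = diagnose(case)
            if result is None:
                still_pending.append(case)
            else:
                resolved.append((case, result))
        self._pending = still_pending
        return resolved

    def __len__(self):
        return len(self._pending)


queue = PendingDiagnoses()
queue.defer({"id": 1, "swap_in": 800})
queue.defer({"id": 2, "swap_in": 50})

# Hypothetical diagnosis function using a baseline that has since been updated.
resolved = queue.retry(
    lambda case: "memory_pressure" if case["swap_in"] > 120 else None
)
```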
The method also comprises: providing fault recovery suggestions from the historical diagnosis data according to the determined application fault type and cause.
Fig. 2 is a structural block diagram of the device for application fault diagnosis according to the invention. As shown in Fig. 2, the device comprises:
a collecting unit, an acquiring unit and a fault diagnosis unit, wherein:
the collecting unit is configured to collect multidimensional application data.
Here, the multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
The monitoring data comprises at least: IP address, and/or monitoring time, and/or CPU utilization, and/or disk utilization, and/or disk input/output (io), and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or swap memory swapped in from disk to memory (si), and/or swap memory swapped out from memory to disk (so), and/or the amount written from memory to disk (bo), and/or the amount read from disk into memory (bi), and/or service state.
The traffic data is organized into sessions, each uniquely identified by a five-tuple, and comprises at least: acquisition time and/or source/destination address and/or source/destination port and/or protocol and/or number of SYN packets sent (SYN being the handshake signal used when a TCP/IP connection is established) and/or number of FIN packets sent (FIN being a flag bit in the TCP header) and/or TCP-related information and/or number of RST packets sent and/or abnormal total traffic accessing a specified service per unit time.
The application performance data comprises at least: source/destination address and/or destination port and/or request time and/or server response time and/or load time and/or page (URL) related information and/or HTTP-related information and/or abnormal Tomcat global access rate and/or abnormal database access volume per unit time and/or abnormal number of current WebLogic sessions.
The application performance data is collected from HTTP protocol performance data and/or Oracle database service performance data and/or MySQL database server performance data.
The acquiring unit is configured to, when a business application becomes abnormal, obtain from the collected multidimensional application data, according to the type of the business anomaly and its temporal and spatial associations, the associated diagnosis data related to the anomaly.
The fault diagnosis unit is configured to compare the obtained associated diagnosis data with the historical diagnosis data for each item of associated diagnosis data and determine the application fault type.
The fault diagnosis unit is specifically configured to compare each item of the obtained associated diagnosis data with its historical diagnosis data by means of a periodic baseline or a moving-window baseline, and to determine the application fault type according to the preset threshold range of each item of associated diagnosis data.
The historical diagnosis data is: the monitoring data within a first preset duration; the traffic data within a second preset duration; and the real-time application performance data.
The device also comprises a follow-up diagnosis unit, configured to store the multidimensional data related to the anomaly when the fault diagnosis yields no result and, after the historical data has been updated, to attempt again to determine the application fault type.
The device also comprises a recovery suggestion unit, configured to provide fault recovery suggestions from the historical diagnosis data according to the determined application fault type.
The invention is described clearly and in detail below by way of a specific embodiment. The embodiment serves only to illustrate the invention clearly and does not limit its scope of protection.
Embodiment 1
A business application system had long been running stably online. Over a period of time, data operations in one business data module were found to become sporadically and progressively slower, and the slowdown gradually spread to other modules (though to a lesser degree). The cause of the fault was unknown.
The following describes the traditional method of application fault diagnosis, which diagnoses the system fault step by step, mainly from application logs:
First, the application logs were checked, the state and configuration of the switches and routers used by the application were checked, and device data such as packet loss rate and packet error rate were checked; the network devices were found to behave normally. It was also found that other applications showed no obvious slowness, which ruled out a network problem.
Because the application fault type cannot be diagnosed from the application logs alone, the existing method has to fall back on the following manual steps:
The CPU, memory, system cache and disk I/O of the host running the application were checked from the command line, and all of these parameters were found to be normal. Since no anomaly was found, operations staff then used the command line to check the CPU, memory, system cache and disk I/O of the host running the problem application's database. After repeated checks and comparisons, disk I/O during the slow periods was found to be frequent, clearly higher than when the system was normal, and this was listed as a suspicious item.
Operations staff checked the communication between the application and the database devices, capturing packets continuously and analyzing them with a packet analysis tool. The volume of communication data was found to increase about 20-40 minutes before the system slowed down, and this was also listed as a suspicious item.
Having identified these two suspicious items, operations staff suspected that the slowdown was related to the application and notified the application developers to investigate on site.
To pin down the fault, the application operation logs were reviewed, a code walk-through was carried out, and the operating parameters of the application host, the database host and the database were monitored continuously. The code walk-through revealed a possible problem with reading raw data when reports covering long time intervals were run, and the application fault was resolved on that basis.
The above process shows that a single kind of data does not support effective fault diagnosis; the diagnosis was achieved only through extensive manual involvement.
Using the application fault diagnosis system of the invention, the associated diagnosis data for the first 5 minutes after the system slows down are analyzed. Here it is assumed, based on the working experience of those skilled in the art, that the collection cycle of the monitoring data is 1 minute, so the monitoring data of 5 consecutive cycles are analyzed. In general, when this cycle is set, it can also be used as the alarm cycle for system fault anomalies.
Taking the time at which the slow-response fault occurred and the business system IP as the temporal and spatial association, monitoring data are extracted, including the following memory-related and other indicators:
Wherein, monitor data comprises: the virtual memory utilization rate in internal memory relevant information is greater than 70%, and the historical context data of virtual memory utilization rate are for being less than 10%.
The work numerical value of calling in the exchange memory use of internal memory from disk is greater than 800, and the historical context data that the exchange memory that internal memory called in by disk uses are about 0-120.
The work numerical value of calling in the exchange memory use of disk from internal memory is greater than 900, and the historical context data of calling in the exchange memory use of disk from internal memory are about 0-100.
Idle physical memory is about 80-140M, and historical context data are 400-500M.
Often be greater than 600 from the size of internal memory write disk, and historical context data are 20-100.
Often exceed 600 from the size of disk write memory, and historical context data are 40-70.
In the system slack-off stage, in the unit interval, database access amount obviously rises.Access rate in Http relevant information is then without significant change.
When system starts slack-off, url significantly slack-off in Http relevant information is relevant to certain business (through inquiry system url list, can know that this URL is the Report Operations page) operation pages, these pages server response time the response time taper to more than 3500ms subsequently by the 50-200ms of historical context data;
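The comparison of each indicator with its historical associated data can be sketched as follows. This is a minimal illustration, not the patented implementation. The historical ranges are the ones quoted in this embodiment. The metric names and the observed values are hypothetical, chosen only to be consistent with statements such as "greater than 800".

```python
# Historical ranges (low, high) quoted in the embodiment; metric names are hypothetical.
HISTORY = {
    "virtual_mem_pct": (0, 10),
    "swap_in": (0, 120),
    "swap_out": (0, 100),
    "free_mem_mb": (400, 500),
    "mem_to_disk": (20, 100),
    "disk_to_mem": (40, 70),
    "report_page_ms": (50, 200),
}

# Illustrative observed values, not measurements from the patent.
observed = {
    "virtual_mem_pct": 72,
    "swap_in": 850,
    "swap_out": 950,
    "free_mem_mb": 110,
    "mem_to_disk": 650,
    "disk_to_mem": 640,
    "report_page_ms": 3600,
}


def deviations(observed, history):
    """Map each metric outside its historical range to 'below' or 'above'."""
    result = {}
    for name, value in observed.items():
        low, high = history[name]
        if value < low:
            result[name] = "below"
        elif value > high:
            result[name] = "above"
    return result


for name, direction in sorted(deviations(observed, HISTORY).items()):
    print(name, direction)
```

Recording the direction of each deviation matters because free physical memory is abnormal when it falls below its historical range, whereas the other indicators are abnormal when they rise above theirs.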
All of the historical associated data above in this embodiment are periodic window baseline values.
The moving window baseline is the mean response time over the most recent short period; the periodic baseline refers to the data for the same moment in a unit period (a working day, a week, a month).
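The two baselines can be sketched in Python as shown below. This is a hedged illustration: the function names are hypothetical, and averaging the same time slot across several earlier periods is an assumption, since the text only says that the periodic baseline uses the data for the same moment in a unit period.

```python
from statistics import mean


def moving_window_baseline(values, window):
    """Mean of the most recent `window` values, e.g. recent response times in ms."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return mean(values[-window:])


def periodic_baseline(history, slot):
    """Mean of the value at the same slot in each earlier period.

    `history` is a list of periods (for example working days), each a list
    of per-slot values. Averaging across periods is an assumption.
    """
    return mean(period[slot] for period in history)


# Hypothetical response times in ms.
recent = [100, 120, 140, 160]
print(moving_window_baseline(recent, 2))  # 150

# Three earlier periods with two time slots each.
previous_days = [[100, 180], [120, 200], [110, 190]]
print(periodic_baseline(previous_days, 1))  # 190
```

A moving window follows short-term drift, while a periodic baseline captures load that recurs at the same time of day or week, so the two can disagree for a workload with regular peaks.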
After the system slow-down is determined from the above data, the response times of the pages of other businesses are obtained from the application performance data; their page response times change to about 1500 ms.
The causes of the application fault are determined to include:
1. Frequent operations on a large amount of disk data.
2. The disk cache is too small or too fragmented.
3. Physical memory is too small, so physical memory usage is too high, which affects data reading.
4. An occasional anomaly in the URL page associated with the business system, or an anomaly caused by improper use by operation and maintenance personnel. (The system has catalogued the URLs, so a URL access can be mapped to the corresponding application operation, such as a report operation.)
Fault diagnosis suggestions:
1. Reduce the frequency of operations on disk data.
2. Enlarge the disk cache or defragment the disk.
3. Add physical memory to reduce physical memory usage.
4. Determine whether the interference is related to a particular type of operation, and adjust the item causing the interference.
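The pairing of each determined cause with its suggestion can be sketched as a simple lookup. This is only an illustration under stated assumptions: the fault-type keys are hypothetical identifiers, and the patent does not specify how fault types are encoded or how suggestions are stored. The suggestion texts paraphrase the four items listed above.

```python
# Hypothetical fault-type keys mapped to the suggestions listed in this embodiment.
RECOVERY_ADVICE = {
    "frequent_disk_operations": "Reduce the frequency of operations on disk data.",
    "disk_cache_too_small_or_fragmented": "Enlarge the disk cache or defragment the disk.",
    "physical_memory_too_small": "Add physical memory to reduce physical memory usage.",
    "url_page_anomaly": (
        "Determine whether the interference is related to a particular type "
        "of operation, and adjust the item causing the interference."
    ),
}


def advise(fault_types):
    """Return the suggestions for the determined fault types, skipping unknown ones."""
    return [RECOVERY_ADVICE[f] for f in fault_types if f in RECOVERY_ADVICE]


for line in advise(["physical_memory_too_small", "url_page_anomaly"]):
    print(line)
```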
As the diagnostic results above show, if fault diagnosis is performed with existing methods, monitoring data alone can only diagnose the memory and disk anomalies, and performance data alone can only diagnose the occasional anomaly of the URL and its associated page. With existing methods the diagnostic result is one-sided, which prevents the business application from recovering from the anomaly in time.
Although the embodiments disclosed in this application are as described above, the content is only an embodiment adopted to make this application easier to understand and is not intended to limit it. Any person skilled in the art to which this application belongs may make any modification or change in the form and details of implementation without departing from the spirit and scope disclosed by this application, but the scope of patent protection of this application shall still be as defined by the appended claims.

Claims (18)

1. A method for realizing application fault diagnosis, characterized by comprising:
collecting multidimensional application data;
when a business application becomes abnormal, obtaining, from the collected multidimensional application data and according to the type of the business anomaly, the associated diagnostic data involved in the business anomaly, based on the time and space association with the business anomaly;
comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
2. The method according to claim 1, characterized in that the multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
3. The method according to claim 2, characterized in that the monitoring data at least comprises: IP address, and/or monitoring time, and/or CPU utilization, and/or disk utilization, and/or disk input/output io, and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or the swap memory paged in from disk to memory si, and/or the swap memory paged out from memory to disk so, and/or the size written from memory to disk bo, and/or the size written from disk to memory bi, and/or service status.
4. The method according to claim 2, characterized in that the traffic data is a session uniquely identified by the same five-tuple and at least comprises: collection time, and/or source/destination address, and/or source/destination port, and/or protocol, and/or the number of handshake SYN packets sent when a TCP/IP connection is used, and/or the number of packets sent with the FIN code bit field of the TCP header, and/or TCP-related information, and/or the number of RSTs sent, and/or the total traffic accessing the specified service per unit time.
5. The method according to claim 2, characterized in that the application performance data at least comprises: source/destination address, and/or destination port, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or Http-related information, and/or an abnormal drop in the tomcat global access rate, and/or an abnormal database access volume per unit time, and/or an abnormal number of Weblogic current sessions;
the application performance data is collected from performance data of the HTTP protocol, and/or performance data of the ORACLE database service, and/or performance data of the MYSQL database server.
6. The method according to claim 1, characterized in that comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type specifically comprises:
comparing the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, by means of a periodic baseline or a moving window baseline, and determining the application fault type according to the preset threshold range of each item of associated diagnostic data.
7. The method according to claims 1 to 6, characterized in that the historical diagnostic data is: monitoring data within a first preset duration; traffic data within a second preset duration and real-time application performance data.
8. The method according to claim 1, characterized in that, when the fault diagnosis does not yield a result, the method further comprises: storing the multidimensional data involved in the anomaly, and determining the application fault type again after the historical data is updated.
9. The method according to claims 1 to 8, characterized in that the method further comprises: providing fault recovery suggestions from the historical diagnostic data according to the determined application fault type.
10. A device for realizing application fault diagnosis, characterized by comprising: a collecting unit, an acquiring unit, and a fault diagnosis unit; wherein,
the collecting unit is configured to collect multidimensional application data;
the acquiring unit is configured to, when a business application becomes abnormal, obtain, from the collected multidimensional application data and according to the type of the business anomaly, the associated diagnostic data involved in the business anomaly, based on the time and space association with the business anomaly;
the fault diagnosis unit is configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, to determine the application fault type.
11. The device according to claim 10, characterized in that the multidimensional application data comprises: monitoring data extracted according to the business application server IP, traffic data extracted according to the business application server IP and destination address, and application performance data extracted according to the business application server IP and destination address.
12. The device according to claim 10, characterized in that the monitoring data at least comprises: IP address, and/or monitoring time, and/or CPU utilization, and/or disk utilization, and/or disk input/output io, and/or memory-related information, and/or swap-space-related information, and/or network-interface-related information, and/or database response time, and/or the swap memory paged in from disk to memory si, and/or the swap memory paged out from memory to disk so, and/or the size written from memory to disk bo, and/or the size written from disk to memory bi, and/or service status.
13. The device according to claim 10, characterized in that the traffic data is a session uniquely identified by the same five-tuple and at least comprises: collection time, and/or source/destination address, and/or source/destination port, and/or protocol, and/or the number of handshake SYN packets sent when a TCP/IP connection is used, and/or the number of packets sent with the FIN code bit field of the TCP header, and/or TCP-related information, and/or the number of RSTs sent, and/or the total traffic accessing the specified service per unit time.
14. The device according to claim 10, characterized in that the application performance data at least comprises: source/destination address, and/or destination port, and/or request time, and/or server response time, and/or load time, and/or page-related information, and/or Http-related information, and/or an abnormal drop in the tomcat global access rate, and/or an abnormal database access volume per unit time, and/or an abnormal number of Weblogic current sessions;
the application performance data is collected from performance data of the HTTP protocol, and/or performance data of the ORACLE database service, and/or performance data of the MYSQL database server.
15. The device according to claim 10, characterized in that the fault diagnosis unit is specifically configured to compare the obtained associated diagnostic data involved in the business anomaly with the historical diagnostic data of each item of associated diagnostic data, respectively, by means of a periodic baseline or a moving window baseline, and to determine the application fault type according to the preset threshold range of each item of associated diagnostic data.
16. The device according to claims 10 to 15, characterized in that the historical diagnostic data is: monitoring data within a first preset duration; traffic data within a second preset duration and real-time application performance data.
17. The device according to claim 10, characterized in that the device further comprises a follow-up diagnosis unit, configured to store the multidimensional data involved in the anomaly when the fault diagnosis does not yield a result, and to determine the application fault type again after the historical data is updated.
18. The device according to claims 10 to 17, characterized in that the device further comprises a recovery suggestion unit, configured to provide fault recovery suggestions from the historical diagnostic data according to the determined application fault type.
CN201410324069.XA 2014-07-08 2014-07-08 A kind of method and device for realizing application failure diagnosis Active CN105320585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410324069.XA CN105320585B (en) 2014-07-08 2014-07-08 A kind of method and device for realizing application failure diagnosis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410324069.XA CN105320585B (en) 2014-07-08 2014-07-08 A kind of method and device for realizing application failure diagnosis

Publications (2)

Publication Number Publication Date
CN105320585A true CN105320585A (en) 2016-02-10
CN105320585B CN105320585B (en) 2019-04-02

Family

ID=55248005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410324069.XA Active CN105320585B (en) 2014-07-08 2014-07-08 A kind of method and device for realizing application failure diagnosis

Country Status (1)

Country Link
CN (1) CN105320585B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105871638A (en) * 2016-06-03 2016-08-17 北京启明星辰信息安全技术有限公司 Network security control method and device
CN106130786A (en) * 2016-07-26 2016-11-16 腾讯科技(深圳)有限公司 The detection method of a kind of network failure and device
CN106452941A (en) * 2016-08-24 2017-02-22 重庆大学 Network anomaly detection method and device
CN106484555A (en) * 2016-09-29 2017-03-08 广东欧珀移动通信有限公司 Abnormality detection and the method recovered and mobile terminal
CN107342891A (en) * 2017-06-07 2017-11-10 厦门金龙旅行车有限公司 A kind of method of remote collection vehicle trouble data
CN107995056A (en) * 2016-10-27 2018-05-04 中国移动通信集团公司 The method and device of fire wall recessiveness NAT breakdown judges
CN108183821A (en) * 2017-12-26 2018-06-19 国网山东省电力公司信息通信公司 A kind of application performance acquisition methods and device towards electrical network business
CN108508874A (en) * 2018-05-08 2018-09-07 网宿科技股份有限公司 A kind of method and apparatus of monitoring equipment fault
CN108923952A (en) * 2018-05-31 2018-11-30 北京百度网讯科技有限公司 Method for diagnosing faults, equipment and storage medium based on service monitoring index
CN108920326A (en) * 2018-06-14 2018-11-30 阿里巴巴集团控股有限公司 Determine system time-consuming abnormal method, apparatus and electronic equipment
CN109002261A (en) * 2018-07-11 2018-12-14 佛山市云端容灾信息技术有限公司 Difference block big data analysis method, apparatus, storage medium and server
CN109491844A (en) * 2018-09-21 2019-03-19 国网技术学院 A kind of computer system identifying exception information
CN109787816A (en) * 2018-12-28 2019-05-21 北京奇安信科技有限公司 Traffic failure localization method, device, equipment and medium
CN109828863A (en) * 2019-01-10 2019-05-31 网联清算有限公司 Data disaster tolerance method, apparatus, storage medium and computer equipment
CN109857431A (en) * 2019-01-11 2019-06-07 平安科技(深圳)有限公司 Code revision method and device, computer-readable medium and electronic equipment
CN110362442A (en) * 2018-04-09 2019-10-22 阿里巴巴集团控股有限公司 A kind of data monitoring method, device and equipment
CN111193609A (en) * 2019-11-20 2020-05-22 腾讯科技(深圳)有限公司 Application abnormity feedback method and device and application abnormity monitoring system
CN111371623A (en) * 2020-03-13 2020-07-03 杨磊 Service performance and safety monitoring method and device, storage medium and electronic equipment
CN112783718A (en) * 2020-12-31 2021-05-11 航天信息股份有限公司 Management system and method for system abnormity
CN112887354A (en) * 2019-11-29 2021-06-01 贵州白山云科技股份有限公司 Method and device for acquiring performance information
CN113064762A (en) * 2021-04-09 2021-07-02 上海新炬网络信息技术股份有限公司 Service self-recovery method based on multiple detection
CN113691405A (en) * 2021-08-25 2021-11-23 北京知道创宇信息技术股份有限公司 Access abnormity diagnosis method and device, storage medium and electronic equipment
CN113722142A (en) * 2021-09-02 2021-11-30 北京天融信网络安全技术有限公司 Method and device for analyzing reasons of insufficient memory, electronic equipment and storage medium
WO2022063242A1 (en) * 2020-09-27 2022-03-31 中兴通讯股份有限公司 Two-layer service state detection method, communication device, and storage medium
CN115225462A (en) * 2022-07-21 2022-10-21 北京天融信网络安全技术有限公司 Network fault diagnosis method and device
CN115696444A (en) * 2022-09-23 2023-02-03 中兴通讯股份有限公司 Time delay detection method and device, data analysis platform and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101848477A (en) * 2009-03-24 2010-09-29 亚信科技(中国)有限公司 Method and system for diagnosing fault
CN102081623A (en) * 2009-11-30 2011-06-01 中国移动通信集团浙江有限公司 Method and system for detecting database abnormality
CN102340415A (en) * 2011-06-23 2012-02-01 北京新媒传信科技有限公司 Server cluster system and monitoring method thereof
CN102761448A (en) * 2012-08-07 2012-10-31 中国石油大学(华东) Cluster monitoring and early warning method
WO2013086996A1 (en) * 2011-12-13 2013-06-20 华为技术有限公司 Failure processing method, device and system
CN103412805A (en) * 2013-07-31 2013-11-27 交通银行股份有限公司 IT (information technology) fault source diagnosis method and IT fault source diagnosis system
CN103532940A (en) * 2013-09-30 2014-01-22 广东电网公司电力调度控制中心 Network security detection method and device
CN103532776A (en) * 2013-09-30 2014-01-22 广东电网公司电力调度控制中心 Service flow detection method and system
CN103595584A (en) * 2013-11-13 2014-02-19 德科仕通信(上海)有限公司 Method and system for diagnosing Web application performance problem

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105871638A (en) * 2016-06-03 2016-08-17 北京启明星辰信息安全技术有限公司 Network security control method and device
CN106130786B (en) * 2016-07-26 2019-05-07 腾讯科技(深圳)有限公司 A kind of detection method and device of network failure
CN106130786A (en) * 2016-07-26 2016-11-16 腾讯科技(深圳)有限公司 The detection method of a kind of network failure and device
CN106452941A (en) * 2016-08-24 2017-02-22 重庆大学 Network anomaly detection method and device
CN106484555A (en) * 2016-09-29 2017-03-08 广东欧珀移动通信有限公司 Abnormality detection and the method recovered and mobile terminal
CN106484555B (en) * 2016-09-29 2019-05-17 Oppo广东移动通信有限公司 The method and mobile terminal of abnormality detection and recovery
CN107995056A (en) * 2016-10-27 2018-05-04 中国移动通信集团公司 The method and device of fire wall recessiveness NAT breakdown judges
CN107995056B (en) * 2016-10-27 2021-04-13 中国移动通信集团公司 Method and device for judging hidden NAT fault of firewall
CN107342891A (en) * 2017-06-07 2017-11-10 厦门金龙旅行车有限公司 A kind of method of remote collection vehicle trouble data
CN108183821A (en) * 2017-12-26 2018-06-19 国网山东省电力公司信息通信公司 A kind of application performance acquisition methods and device towards electrical network business
CN108183821B (en) * 2017-12-26 2021-03-30 国网山东省电力公司信息通信公司 Application performance obtaining method and device for power grid service
CN110362442A (en) * 2018-04-09 2019-10-22 阿里巴巴集团控股有限公司 A kind of data monitoring method, device and equipment
CN110362442B (en) * 2018-04-09 2023-09-22 创新先进技术有限公司 Data monitoring method, device and equipment
EP3591485A4 (en) * 2018-05-08 2020-04-29 Wangsu Science & Technology Co., Ltd. Method and device for monitoring for equipment failure
CN108508874A (en) * 2018-05-08 2018-09-07 网宿科技股份有限公司 A kind of method and apparatus of monitoring equipment fault
CN108923952B (en) * 2018-05-31 2021-11-30 北京百度网讯科技有限公司 Fault diagnosis method, equipment and storage medium based on service monitoring index
CN108923952A (en) * 2018-05-31 2018-11-30 北京百度网讯科技有限公司 Method for diagnosing faults, equipment and storage medium based on service monitoring index
CN108920326A (en) * 2018-06-14 2018-11-30 阿里巴巴集团控股有限公司 Determine system time-consuming abnormal method, apparatus and electronic equipment
CN109002261B (en) * 2018-07-11 2022-03-22 佛山市云端容灾信息技术有限公司 Method and device for analyzing big data of difference block, storage medium and server
CN109002261A (en) * 2018-07-11 2018-12-14 佛山市云端容灾信息技术有限公司 Difference block big data analysis method, apparatus, storage medium and server
CN109491844B (en) * 2018-09-21 2022-03-04 国网技术学院 Computer system for identifying abnormal information
CN109491844A (en) * 2018-09-21 2019-03-19 国网技术学院 A kind of computer system identifying exception information
CN109787816A (en) * 2018-12-28 2019-05-21 北京奇安信科技有限公司 Traffic failure localization method, device, equipment and medium
CN109828863A (en) * 2019-01-10 2019-05-31 网联清算有限公司 Data disaster tolerance method, apparatus, storage medium and computer equipment
CN109857431A (en) * 2019-01-11 2019-06-07 平安科技(深圳)有限公司 Code revision method and device, computer-readable medium and electronic equipment
CN109857431B (en) * 2019-01-11 2022-06-03 平安科技(深圳)有限公司 Code modification method and device, computer readable medium and electronic equipment
CN111193609B (en) * 2019-11-20 2021-09-28 腾讯科技(深圳)有限公司 Application abnormity feedback method and device and application abnormity monitoring system
CN111193609A (en) * 2019-11-20 2020-05-22 腾讯科技(深圳)有限公司 Application abnormity feedback method and device and application abnormity monitoring system
CN112887354A (en) * 2019-11-29 2021-06-01 贵州白山云科技股份有限公司 Method and device for acquiring performance information
CN111371623B (en) * 2020-03-13 2023-02-28 杨磊 Service performance and safety monitoring method and device, storage medium and electronic equipment
CN111371623A (en) * 2020-03-13 2020-07-03 杨磊 Service performance and safety monitoring method and device, storage medium and electronic equipment
WO2022063242A1 (en) * 2020-09-27 2022-03-31 中兴通讯股份有限公司 Two-layer service state detection method, communication device, and storage medium
CN112783718A (en) * 2020-12-31 2021-05-11 航天信息股份有限公司 Management system and method for system abnormity
CN113064762A (en) * 2021-04-09 2021-07-02 上海新炬网络信息技术股份有限公司 Service self-recovery method based on multiple detection
CN113064762B (en) * 2021-04-09 2024-02-23 上海新炬网络信息技术股份有限公司 Service self-recovery method based on various detection
CN113691405A (en) * 2021-08-25 2021-11-23 北京知道创宇信息技术股份有限公司 Access abnormity diagnosis method and device, storage medium and electronic equipment
CN113691405B (en) * 2021-08-25 2023-12-01 北京知道创宇信息技术股份有限公司 Access abnormality diagnosis method and device, storage medium and electronic equipment
CN113722142A (en) * 2021-09-02 2021-11-30 北京天融信网络安全技术有限公司 Method and device for analyzing reasons of insufficient memory, electronic equipment and storage medium
CN113722142B (en) * 2021-09-02 2023-08-25 北京天融信网络安全技术有限公司 Method and device for analyzing reasons of insufficient memory, electronic equipment and storage medium
CN115225462A (en) * 2022-07-21 2022-10-21 北京天融信网络安全技术有限公司 Network fault diagnosis method and device
CN115225462B (en) * 2022-07-21 2024-02-02 北京天融信网络安全技术有限公司 Network fault diagnosis method and device
CN115696444A (en) * 2022-09-23 2023-02-03 中兴通讯股份有限公司 Time delay detection method and device, data analysis platform and readable storage medium
CN115696444B (en) * 2022-09-23 2023-09-12 中兴通讯股份有限公司 Time delay detection method, device, data analysis platform and readable storage medium

Also Published As

Publication number Publication date
CN105320585B (en) 2019-04-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant