JP2004348640A - Method and system for managing network - Google Patents

Method and system for managing network

Info

Publication number
JP2004348640A
Authority
JP
Japan
Prior art keywords
operation information
monitored
monitored machine
status
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2003147663A
Other languages
Japanese (ja)
Inventor
Hajime Hirose
肇 広瀬
Original Assignee
Hitachi Ltd
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd (株式会社日立製作所)
Priority to JP2003147663A
Publication of JP2004348640A

Abstract

An object of the present invention is to provide a network management system capable of accurately identifying the cause of a trouble even when the operation information is partially missing.
The system includes: operation information acquisition units (121, 131, 141) that acquire operation information of each monitored machine (150, 160, 170, 180); an operation information database 108 that stores, for each monitored machine, the operation information acquired by the operation information acquisition units; a status database 109 that stores, for each monitored machine, status information indicating whether operation information is missing; and an operation information analysis unit 101 that, based on the operation information stored in the operation information database and the status information stored in the status database, identifies by correlation analysis and displays candidates for the monitored machine causing a change in the operation rate of a specific monitored machine.
[Selection diagram] Fig. 1

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a network management system and a network management method, and more particularly to a network management system and a network management method capable of specifying a cause of a trouble that has occurred in a computer network.
[0002]
[Prior art]
In recent years, stable operation of Web systems and the like built on the Internet has been desired. A computer network system that constitutes a Web system or the like generally includes a plurality of network devices. As shown in Patent Document 1, for example, the state of these devices is monitored by acquiring operation information and the like using various kinds of dedicated software. When, for example, the response of a Web page degrades, the administrator manually investigates the cause based on the operation information.
[0003]
Patent Document 2 discloses a method that uses traffic information obtainable as standard in a multimedia network so that degradation in the performance of a specific application provided by a specific server can be analyzed while minimizing the load imposed on the operating server.
[0004]
Patent Document 3 discloses that quantifying the relationships between the components of the managed system based on the operation information makes it possible to narrow down and identify at an early stage the components causing a performance bottleneck or a failure.
[0005]
[Patent Document 1]
JP 2001-144761 A
[0006]
[Patent Document 2]
JP 2001-195285 A
[Patent Document 3]
JP-A-2002-342182
[Problems to be solved by the invention]
According to the method of Patent Document 1, when a trouble such as a degraded Web page response occurs, the elements that may be the root cause of the problem must be investigated manually, based on operation information stored in advance for a large number of network devices. Such a task is difficult without a skilled network administrator.
[0009]
Further, with the methods described in Patent Documents 2 and 3, it is difficult to accurately identify the cause of the trouble when the acquired information, such as operation information, is partially missing.
[0010]
The present invention has been made in view of these problems, and provides a network management system and a network management method capable of accurately specifying the cause of a trouble even when operation information has a defect.
[0011]
[Means for Solving the Problems]
The present invention employs the following means in order to solve the above problems.
[0012]
The system includes: an operation information acquisition unit that acquires operation information of each monitored machine; an operation information database that stores, for each monitored machine, the operation information acquired by the operation information acquisition unit; a status database that stores, for each monitored machine, status information indicating whether operation information is missing; and an operation information analysis unit that, based on the operation information stored in the operation information database and the status information stored in the status database, identifies by correlation analysis and displays candidates for the monitored machine causing a change in the operation rate of a specific monitored machine.
[0013]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. FIG. 1 is a diagram illustrating a network management system according to an embodiment of the present invention. The operation information analysis computer 100 obtains the operation information (CPU utilization, memory utilization, Web page response time, etc.) collected by the operation information collection computer 120 and, by performing correlation analysis and the like on that information, searches for the cause of troubles (problems) in the computer system.
[0014]
The operation information analysis computer 100 includes an analysis unit 101, an operation information collection unit 105, a screen display unit 107, an operation information database 108, and a status database 109. The analysis unit 101 is the unit that actually performs the analysis, and includes a correlation analysis unit 102, a risk degree calculation unit 103, and a cause degree calculation unit 104. The operation information collection unit 105 periodically obtains operation information from the operation information collection computer 120. It also includes a status generation unit 106, which generates a status indicating that operation information could not be obtained.
[0015]
The screen display unit 107 performs display corresponding to various processes such as selection of an analysis target, display of an analysis result, and narrowing of an analysis range. The operation information database 108 is storage means for storing operation information periodically acquired from the operation information collection computer 120. The status database 109 is a storage unit that stores status information generated when the operation information extracted from the operation information collection computer 120 has a partial loss. The operation information analysis computer 100 can acquire operation information from an arbitrary number of operation information collection computers 120.
[0016]
The operation information collection computer 120 collects operation information of the monitored machines 150 that are actually monitored on the computer network, and has a function of transmitting the collected operation information in response to an operation information acquisition request from the operation information analysis computer 100. The operation information collection computer 120 includes an operation information acquisition unit 121 and an operation information collection tool 122. The operation information acquisition unit 121 acquires the operation information collected by the operation information collection tool 122 and transmits it to the operation information analysis computer 100. The operation information collection tool 122 is a general, commercially available network management tool, and collects operation information from a plurality of monitored machines 150.
[0017]
The monitored machine 150 is a network device constituting a computer network, and generally corresponds to a router, a hub, a switch, a workstation, a PC, or the like.
[0018]
FIG. 2 is a diagram illustrating an example of a computer network to which the network management system of the present invention is applied. This example is an example of a typical Web system constructed when implementing a Web shopping mall or the like.
[0019]
As shown in the figure, a client PC 220 and a Web system 200 are connected via a network (WAN) 210. The Web system 200 includes network devices such as a firewall 201, a router 202, a Web server 203, AP (application) servers 204 and 205, and DB (database) servers 206, 207, and 208.
[0020]
The operation information of each network device is generally collected by a plurality of network management applications. In the case of the example shown in the figure, the machine (router 202, server 204, etc.) on which the network management application is installed is the operation information collection computer 120.
[0021]
FIG. 3 is a diagram illustrating the status generation processing. The status is a mechanism introduced to make up for a shortcoming of correlation analysis. Correlation analysis is a statistical method that compares two different pieces of operation information and checks whether their time-series data are correlated (that is, whether a causal relationship may exist).
[0022]
Correlation analysis cannot produce accurate results if part of the target data is missing. If a network device constituting the network is temporarily stopped because of a trouble, no operation information is collected during the stop period. In this case, the stopped network device or server cannot be a target of the correlation analysis even though it is highly likely to be the cause of the network trouble.
[0023]
The present invention introduces the concept of status in order to solve this problem. The status is assigned, for example, "1" for a period in which operation information is collected and "0" for a period in which it is not collected. The status is assigned to every network device and accumulated separately from the operation information. When a correlation analysis is performed, the status is referred to in addition to the operation information. This makes it possible to perform correlation analysis on network devices for which operation information could not be collected (network devices stopped due to a trouble) by referring to the status instead of the operation information.
[0024]
The status generation processing is performed by the status generation unit 106 of the operation information collection unit 105 of the operation information analysis computer 100. First, in step 300, the operation information collection unit 105 acquires operation information from the operation information acquisition unit 121 of the operation information collection computer 120. In step 301, the collected operation information is examined in chronological order, and it is determined for each time slot whether operation information was acquired. If the operation information was acquired, the status is set to 1 in step 302 and stored in the status database 109. At the same time, the operation information itself is stored in the operation information database 108 in step 304. If the operation information was not acquired, the status is set to 0 in step 303 and stored in the status database 109. This process is repeated until the collected operation information is exhausted.
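The loop of steps 300 to 304 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: the function name `generate_status`, the use of `None` to mark a time slot with no acquired value, and the plain lists standing in for the status database 109 and the operation information database 108 are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def generate_status(
    samples: List[Optional[float]],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, float]]]:
    """Walk the samples in time order and build status and operation records.

    `samples[i]` is the value collected in time slot i, or None if nothing
    was acquired in that slot (hypothetical representation of missing data).
    """
    status_db: List[Tuple[int, int]] = []       # stand-in for status database 109
    operation_db: List[Tuple[int, float]] = []  # stand-in for operation information database 108

    for slot, value in enumerate(samples):
        if value is None:
            status_db.append((slot, 0))         # step 303: not acquired, status 0
        else:
            status_db.append((slot, 1))         # step 302: acquired, status 1
            operation_db.append((slot, value))  # step 304: store the value itself
    return status_db, operation_db


if __name__ == "__main__":
    # A device that stopped during slots 2 and 3.
    status, operation = generate_status([12.0, 15.5, None, None, 14.0])
    print(status)
    print(operation)
```

Every slot receives a status entry, while only slots with an acquired value contribute to the operation records, which mirrors the two branches of step 301.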
[0025]
FIG. 4 is a diagram illustrating an image of the status information. When operation information 400 with gaps caused by the stoppage of a network device or server is acquired, applying the status generation process shown in FIG. 3 yields the status data 401.
[0026]
As described above, within the monitoring target period, "1" is assigned to a period during which operation information can be collected, and "0" is assigned to a period during which operation information cannot be collected.
[0027]
As shown in the figure, when a change in the operation information 402 of the network device or the like being analyzed is caused by the stoppage of another network device or the like, the cause of the change in the operation information 402 can be identified by analyzing the correlation between the status data 401 and the operation information 402.
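To make the idea concrete, the sketch below correlates a 0/1 status series with a response-time series. The specification does not say which correlation statistic is used; the Pearson coefficient here is an assumption, as are the function name `pearson` and the sample values. Because a stopped device (status 0) coinciding with a slow response gives a negative coefficient, the example takes the absolute value to obtain a figure between 0 and 1, which is also an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series.

    Returns 0.0 when either series is constant, since the coefficient
    is undefined in that case (a choice made for this sketch).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: a device is down in slots 2 and 3 (status data 401),
    # and the analyzed Web page response time (operation information 402)
    # rises in exactly those slots.
    status_401 = [1, 1, 0, 0, 1, 1]
    response_402 = [0.2, 0.2, 1.5, 1.5, 0.2, 0.2]
    print(abs(pearson(status_401, response_402)))
```

In this constructed case the two series move in exact opposition, so the magnitude of the coefficient is at its maximum and the stopped device stands out as a cause candidate.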
[0028]
FIG. 5 is a diagram illustrating a process of identifying a monitored machine candidate (cause candidate) that causes a change in the operation rate by correlation analysis.
[0029]
First, in step 500, the operation information of the monitored machine to be analyzed is selected, and the analysis period (time range) is determined. This is done by the network administrator. As described above, in the case of a Web system, the response time of a Web page is generally the analysis target. Next, in step 501, the range of network devices to be investigated as cause candidates is determined. This operation is also performed by the network administrator. Elements that are certain not to be cause candidates are excluded from the scope of investigation at this point.
[0030]
Step 502 and the subsequent steps are performed automatically by the operation information analysis computer 100. First, in step 502, a correlation analysis against the analysis target is performed on the operation information of each network device under investigation as a cause candidate, and a correlation coefficient in the range of 0 to 1 is calculated. The larger the correlation coefficient, the stronger the correlation with the analysis target. If the operation information has missing data, then in step 503 the status information is acquired from the status database, a correlation analysis is performed between the operation rate to be analyzed and the status information, and a correlation coefficient is calculated. In step 504, the operation information with the highest correlation coefficients (a number designated in advance, for example the top 10) is determined to be the cause candidates that have affected the operation information to be analyzed.
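Steps 502 to 504 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the absolute Pearson coefficient stands in for the unspecified "correlation coefficient in the range of 0 to 1", missing values are represented as `None`, the analysis target is assumed to be complete, and the names `abs_pearson` and `rank_cause_candidates` as well as the device names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def abs_pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Absolute Pearson coefficient, 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / math.sqrt(var_x * var_y))


def rank_cause_candidates(
    target: Sequence[float],
    candidates: Dict[str, List[Optional[float]]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score each candidate series against the target and keep the top N."""
    scored = []
    for name, series in candidates.items():
        if any(value is None for value in series):
            # Step 503: data is missing, so correlate the 0/1 status instead.
            status = [0.0 if value is None else 1.0 for value in series]
            coeff = abs_pearson(status, target)
        else:
            # Step 502: data is complete, so correlate the values directly.
            coeff = abs_pearson(series, target)
        scored.append((name, coeff))
    # Step 504: candidates with the highest coefficients come first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    response_time = [0.2, 0.2, 1.5, 1.5, 0.2, 0.2]
    candidates = {
        "db_cpu": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0],
        "ap_server": [30.0, 31.0, None, None, 29.0, 30.0],  # stopped in slots 2-3
        "router": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    }
    for name, coeff in rank_cause_candidates(response_time, candidates, top_n=3):
        print(f"{name}: {coeff:.3f}")
```

Falling back to the status series is what lets the stopped `ap_server` be ranked at all; with its raw values alone, the gap in slots 2 and 3 would have excluded it from the analysis.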
[0031]
FIG. 6 is a diagram illustrating a process of calculating the degree of cause of a cause candidate. First, in step 600, the threshold value for the operation information of each network device that became a cause candidate through the processing shown in FIG. 5 is read. The threshold value is set appropriately in advance by the network administrator. In step 601, the length of time for which the operation information of the cause-candidate network device exceeded the threshold value during the analysis target period is determined, and a risk ratio in the range of 0 to 1 is calculated. For example, if the operation information exceeds the threshold for 15 minutes during one hour, the risk ratio of the operation information is 0.25. In step 602, the degree of cause is calculated from the correlation coefficient and the risk ratio. The degree of cause is calculated as ((correlation coefficient × α) + (risk ratio × (100 − α))) / 100, and lies in the range of 0 to 1. Here, α is a weighting coefficient for which any value from 0 to 100 can be designated. Increasing α gives more weight to the correlation coefficient, and decreasing α gives more weight to the risk ratio.
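The two calculations of FIG. 6 are simple enough to write out directly. The sketch below assumes evenly spaced, complete samples so that the fraction of samples above the threshold equals the fraction of time above it; the function names `risk_ratio` and `cause_degree` and the sample values are hypothetical.

```python
from typing import Sequence


def risk_ratio(samples: Sequence[float], threshold: float) -> float:
    """Fraction of the analysis period in which the value exceeds the threshold.

    Assumes evenly spaced samples with no gaps (an assumption of this sketch).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    exceeded = sum(1 for value in samples if value > threshold)
    return exceeded / len(samples)


def cause_degree(correlation: float, risk: float, alpha: float) -> float:
    """((correlation x alpha) + (risk x (100 - alpha))) / 100, as in step 602."""
    if not 0 <= alpha <= 100:
        raise ValueError("alpha must be between 0 and 100")
    return (correlation * alpha + risk * (100 - alpha)) / 100


if __name__ == "__main__":
    # One hour of one-minute CPU samples, 15 of which exceed an 80% threshold.
    cpu = [90.0] * 15 + [50.0] * 45
    risk = risk_ratio(cpu, threshold=80.0)
    print(risk)
    # With alpha = 60 the correlation coefficient carries 60% of the weight.
    print(cause_degree(correlation=0.8, risk=risk, alpha=60))
```

Setting α to 100 reduces the degree of cause to the correlation coefficient alone, and setting it to 0 reduces it to the risk ratio alone, which matches the weighting behavior described in the text.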
[0032]
FIG. 7 is a diagram illustrating the configuration of the operation information database 108. The operation information database stores operation information of each network device at predetermined time intervals. As shown in the figure, Web page response time, line usage rate, CPU usage rate, cache hit rate, and the like are stored as operation information. When the operation information has not been acquired, there is no value.
[0033]
FIG. 8 is a diagram illustrating the configuration of the status database 109. The status database stores the status of each network device. Only one status exists for each network device, and a value of 0 or 1 is stored at regular intervals.
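One possible in-memory layout for the two databases is sketched below. The figures themselves are not reproduced here, so this structure is an assumption: the class name `MonitoringStore`, the nested dictionaries, the use of `None` for "no value", and the rule that a device's status is 1 when at least one metric was acquired in the interval are all choices made for the example. The sketch also assumes that the same metric names are recorded for a device at every interval.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MonitoringStore:
    # device name -> metric name -> one value per interval (None = no value)
    operation_info: Dict[str, Dict[str, List[Optional[float]]]] = field(default_factory=dict)
    # device name -> one 0/1 status per interval (a single status per device)
    status: Dict[str, List[int]] = field(default_factory=dict)

    def record(self, device: str, metrics: Dict[str, Optional[float]]) -> None:
        """Append one interval of metrics for a device and derive its status."""
        device_metrics = self.operation_info.setdefault(device, {})
        for name, value in metrics.items():
            device_metrics.setdefault(name, []).append(value)
        # Assumed rule: status 1 if anything was acquired in this interval.
        acquired = any(value is not None for value in metrics.values())
        self.status.setdefault(device, []).append(1 if acquired else 0)


if __name__ == "__main__":
    store = MonitoringStore()
    store.record("web01", {"cpu_usage": 35.0, "response_time": 0.2})
    store.record("web01", {"cpu_usage": None, "response_time": None})  # device stopped
    print(store.status["web01"])
    print(store.operation_info["web01"]["cpu_usage"])
```

Keeping the status in its own per-device series, rather than inferring it from the metric columns at analysis time, follows the description that the status is accumulated separately from the operation information.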
[0034]
FIG. 9 is a diagram illustrating an example of an image of the analysis result list screen 900. The analysis result list screen 900 is displayed on the screen display unit 107 of the operation information analysis computer 100. The analysis result list screen 900 includes an analysis time view 901, an analysis target view 902, an analysis result view 903, and a graph display button 904. The analysis time view 901 displays the analysis target range (period) as times. The analysis target view 902 displays the operation information designated as the analysis target and the machine name of the network device from which that operation information was acquired. The analysis result view 903 displays the final analysis result as a list of cause candidates in descending order of likelihood (that is, in descending order of degree of cause). The operation information at the top of the list is considered highly likely to belong to the machine causing the response time of the analyzed Web page to deteriorate. The analysis result view 903 displays a rank, a cause candidate name, a degree of cause, a correlation coefficient, and a risk ratio. When the graph display button 904 is pressed, an analysis result graph screen is displayed.
[0035]
FIG. 10 is a diagram illustrating an example of an image of the analysis result graph screen 1000. When the graph display button 904 of the analysis result list screen 900 is pressed, the analysis result graph screen 1000 is displayed on the screen display unit 107 of the operation information analysis computer 100. The analysis result graph screen 1000 includes a graph view 1001, a graph element view 1002, and an analysis time view 1003. The graph view 1001 simultaneously displays a graph of operation information as an analysis target and a candidate for a cause. The graph element view 1002 displays which operation information each of the displayed graphs belongs to. The analysis time view 1003 displays the analysis target range by time.
[0036]
As described above, according to the present embodiment, when a trouble occurs in a computer system, the cause of the trouble can be accurately identified even if the operation information is partially missing, and the network devices that may be the cause can be presented to the network administrator. As a result, the network administrator can quickly perform the primary isolation of the problem and the recovery processing. The functions of the operation information analysis computer 100 and the operation information collection computer 120 described in this specification can be realized by software.
[0037]
Further, as described above, the monitoring target period is divided into predetermined periods, and the correlation analysis is performed using status data in which "1" is assigned to each divided period in which operation information was collected and "0" to each period in which it was not. For this reason, a network device or the like that was temporarily stopped due to a trouble (and is therefore highly likely to be a cause of a decrease in the operation rate) can be included in the correlation analysis.
[0038]
[Effects of the Invention]
As described above, according to the present invention, it is possible to provide a network management system capable of accurately specifying the cause of a trouble even when operation information has a defect.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a network management system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a computer network to which the network management system of the present invention is applied.
FIG. 3 is a diagram illustrating a status generation process.
FIG. 4 is a diagram illustrating an image of status information.
FIG. 5 is a diagram illustrating a process of identifying a monitored machine candidate that causes a change in the operation rate by correlation analysis.
FIG. 6 is a diagram illustrating a process of calculating the degree of cause of a cause candidate.
FIG. 7 is a diagram illustrating a configuration of an operation information database.
FIG. 8 is a diagram illustrating a configuration of a status database 109.
FIG. 9 is a diagram illustrating an example of an image of an analysis result list screen 900.
FIG. 10 is a diagram illustrating an example of an image of an analysis result graph screen 1000.
[Explanation of symbols]
100 operation information analysis computer
101 analysis unit
102 correlation analysis unit
103 risk degree calculation unit
104 cause degree calculation unit
105 operation information collection unit
106 status generation unit
107 screen display unit
108 operation information database
109 status database
110 network
120, 130, 140 operation information collection computer
121, 131, 141 operation information acquisition unit
122, 132, 142 operation information collection tool
150, 160, 170, 180 monitored machine
200 Web system
201 firewall
202 router
203 Web server
204, 205 AP server
206, 207, 208 DB server
210 network
220 client PC

Claims (5)

  1. A network management system comprising:
    an operation information acquisition unit that acquires operation information of each monitored machine;
    an operation information database that stores, for each monitored machine, the operation information acquired by the operation information acquisition unit;
    a status database that stores, for each monitored machine, status information indicating whether operation information is missing; and
    an operation information analysis unit that, based on the operation information stored in the operation information database and the status information stored in the status database, identifies by correlation analysis and displays candidates for the monitored machine causing a change in the operation rate of a specific monitored machine.
  2. The network management system according to claim 1,
    wherein the analysis unit acquires operation information of a machine that is a candidate for the monitored machine causing the change in the operation rate of the specific monitored machine, and calculates a degree of risk based on the period during which the operation rate indicated by that operation information exceeds a predetermined threshold.
  3. The network management system according to claim 2,
    wherein the analysis unit calculates a degree of cause based on the correlation coefficient obtained by the correlation analysis and the degree of risk.
  4. A network management method comprising:
    a step of acquiring operation information of each monitored machine;
    a step of storing the acquired operation information in an operation information database for each monitored machine;
    a step of storing, in a status database for each monitored machine, status information indicating whether operation information is missing; and
    a step of identifying by correlation analysis and displaying, based on the operation information stored in the operation information database and the status information stored in the status database, candidates for the monitored machine causing a change in the operation rate of a specific monitored machine.
  5. The network management method according to claim 4,
    further comprising acquiring operation information of a machine that is a candidate for the monitored machine causing the change in the operation rate of the specific monitored machine, and calculating a degree of risk based on the period during which the operation rate indicated by that operation information exceeds a predetermined threshold.
JP2003147663A 2003-05-26 2003-05-26 Method and system for managing network Pending JP2004348640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003147663A JP2004348640A (en) 2003-05-26 2003-05-26 Method and system for managing network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003147663A JP2004348640A (en) 2003-05-26 2003-05-26 Method and system for managing network

Publications (1)

Publication Number Publication Date
JP2004348640A true JP2004348640A (en) 2004-12-09

Family

ID=33534134

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003147663A Pending JP2004348640A (en) 2003-05-26 2003-05-26 Method and system for managing network

Country Status (1)

Country Link
JP (1) JP2004348640A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100447790C (en) * 2005-07-26 2008-12-31 兄弟工业株式会社 Information management system, information processing device, and program
CN100435151C (en) * 2005-09-30 2008-11-19 兄弟工业株式会社 Information management device, information management system, and computer usable medium
US8359378B2 (en) 2005-11-24 2013-01-22 Hewlett-Packard Development Company, L.P. Network system and method of administrating networks
WO2007060721A1 (en) * 2005-11-24 2007-05-31 Hewlett-Packard Development Company, L.P. Network administrating device and method of administrating network
US8479048B2 (en) 2008-09-30 2013-07-02 Hitachi, Ltd. Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained
WO2010038327A1 (en) * 2008-09-30 2010-04-08 株式会社 日立製作所 Root cause analysis method targeting information technology (it) device not to acquire event information, device and program
CN101981546A (en) * 2008-09-30 2011-02-23 株式会社日立制作所 Root cause analysis method targeting information technology (IT) device not to acquire event information, device and program
US8020045B2 (en) 2008-09-30 2011-09-13 Hitachi, Ltd. Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained
CN101981546B (en) * 2008-09-30 2015-04-01 株式会社日立制作所 Root cause analysis method targeting information technology (IT) device not to acquire event information, device and program
JP2010086115A (en) * 2008-09-30 2010-04-15 Hitachi Ltd Root cause analysis method targeting information technology (it) device not to acquire event information, device and program
JP2010191738A (en) * 2009-02-19 2010-09-02 Hitachi Ltd Failure analysis support system
JP2011258057A (en) * 2010-06-10 2011-12-22 Fujitsu Ltd Analysis program, analysis method, and analyzer
WO2012046293A1 (en) * 2010-10-04 2012-04-12 富士通株式会社 Fault monitoring device, fault monitoring method and program
JP2016006567A (en) * 2014-06-20 2016-01-14 富士通株式会社 Output program, output device and output method
JP2018530803A (en) * 2015-07-14 2018-10-18 サイオス テクノロジー コーポレーションSios Technology Corporation Apparatus and method for utilizing machine learning principles for root cause analysis and repair in a computer environment
JP2016197450A (en) * 2016-07-25 2016-11-24 日本電気株式会社 Operation management device, operation management system, information processing method, and operation management program

Similar Documents

Publication Publication Date Title
US9807116B2 (en) Methods and apparatus to identify priorities of compliance assessment results of a virtual computing environment
US9015316B2 (en) Correlation of asynchronous business transactions
US20200153714A1 (en) Systems and methods for displaying adjustable metrics on real-time data in a computing environment
US9893963B2 (en) Dynamic baseline determination for distributed transaction
EP2590081B1 (en) Method, computer program, and information processing apparatus for analyzing performance of computer system
US9003230B2 (en) Method and apparatus for cause analysis involving configuration changes
US9652317B2 (en) Remedying identified frustration events in a computer system
US7992040B2 (en) Root cause analysis by correlating symptoms with asynchronous changes
US7984007B2 (en) Proactive problem resolution system, method of proactive problem resolution and program product therefor
US7953847B2 (en) Monitoring and management of distributing information systems
US7673291B2 (en) Automatic database diagnostic monitor architecture
DE69925557T2 (en) Monitoring the throughput of a computer system and a network
US20190279098A1 (en) Behavior Analysis and Visualization for a Computer Infrastructure
Yuan et al. Automated known problem diagnosis with event traces
US7133805B1 (en) Load test monitoring system
JP4872945B2 (en) Operation management apparatus, operation management system, information processing method, and operation management program
US7676706B2 (en) Baselining backend component response time to determine application performance
US6845474B2 (en) Problem detector and method
JP4626852B2 (en) Communication network failure detection system, communication network failure detection method, and failure detection program
US7912947B2 (en) Monitoring asynchronous transactions within service oriented architecture
JP5468837B2 (en) Anomaly detection method, apparatus, and program
US8645769B2 (en) Operation management apparatus, operation management method, and program storage medium
US7444263B2 (en) Performance metric collection and automated analysis
US8595176B2 (en) System and method for network security event modeling and prediction
US7984334B2 (en) Call-stack pattern matching for problem resolution within software