WO2017134758A1

WO2017134758A1 - Management computer and method for managing computer to be managed

Info

Publication number: WO2017134758A1
Application number: PCT/JP2016/053126
Authority: WO
Inventors: 小林　恵美子; 峰義増田
Original assignee: 株式会社日立製作所
Priority date: 2016-02-03
Filing date: 2016-02-03
Publication date: 2017-08-10
Also published as: JP6674481B2; US10909016B2; US20180210803A1; JPWO2017134758A1

Abstract

In a system in which resources are used in a varying manner, when performances are monitored on the basis of past behavior detected when the system was operating normally, if performance behavior different from when the system was operating normally is detected, it is difficult to determine whether the detected performance behavior results from resources being used differently than when the system was operating normally. According to the present invention, monitoring accuracy is improved by determining whether performance behavior results from change in characteristics relating to the usage of the system, using: a means for measuring performances when the system is in operation, and detecting a performance (if any) different from when the system was operating normally; a means for measuring characteristics relating to the usage of the system, and detecting whether the characteristics are different from when the system was operating normally; and a means for comparing performance information with characteristic information relating to the usage of the system.

Description

Management computer and managed computer management method

The present invention relates to a management computer and a management method for monitoring the operating status of the system.

In system operation management, the system processing status (hereinafter referred to as operation performance) is monitored to prevent the user from being affected by a decrease in system processing capacity during system operation. It is necessary to notice abnormalities in operating performance at an early stage and take appropriate measures in advance. As indexes for grasping the operation performance, there are resource usage, throughput, response time, and the like.

Here, a resource is something the system needs in order to perform processing, and includes servers, storage devices, and network devices, as well as the CPU, memory, input/output devices, secondary storage devices, and the like of each device. A conventional method for monitoring the operation performance of a system creates a reference from measurement data of a performance index taken while the system is normal, and detects as an abnormality any case where the performance index behaves differently from that reference.

[Patent Document 1] discloses a method of creating a time-series standard for monitoring items of a single performance index as a method of creating a standard from measurement data of a performance index of a normal system. Further, [Patent Document 2] discloses a method of creating a reference by combining monitoring items of a plurality of performance indexes and classifying them by measurement data vector positions.

Patent Document 1: JP 2001-142746
Patent Document 2: JP 2012-242985 A

When a system is provided to a user, as in a cloud environment, the way system resources are used (hereinafter referred to as usage characteristics), such as the behavior of the applications that use those resources, may change. Even if the usage characteristics do not change, the operating performance of the system resources may fluctuate because of problems with the resources themselves. The appropriate measures differ depending on whether the fluctuation in operating performance is caused by a problem in the system resource itself or by a change in usage characteristics. The above prior art, however, does not take changes in usage characteristics into consideration. When a measured value of a performance index during operation falls outside the range covered by the data used to create the reference and is detected as abnormal, the administrator must first analyze whether the operating performance is abnormal because of a problem in the system itself or because of a change in usage characteristics. As a result, a large amount of work is required before an appropriate countermeasure can be taken, and a quick response is not possible.
It is an object of the present invention to reduce the work burden of investigating the cause and taking countermeasures for an administrator by making an appropriate determination in consideration of changes in usage (usage characteristics) of system resources in operation performance monitoring.

The management computer includes a processor and manages a first managed computer that is accessed from a first application program. The processor acquires a first operation performance, which is a value related to the resource performance of the first managed computer, and a first usage characteristic, which is a value related to access to the first managed computer from the first application program. The processor then calculates the degree of abnormality of the first operation performance and the degree of abnormality of the first usage characteristic, and notifies the operating status of the first managed computer based on the two calculated degrees of abnormality.

According to the present invention, in monitoring operation performance, appropriately determining whether a change in the usage (usage characteristics) of a resource is the cause reduces the administrator's burden of cause analysis and allows appropriate measures to be determined quickly.

FIG. 1 is a diagram showing the concept of the system in Example 1 of the present invention.
FIG. 2 is a diagram showing the hardware configuration of the management server in Example 1 of the present invention.
FIG. 3 is a diagram showing the functional module configuration of the performance monitoring program in Example 1 of the present invention.
FIG. 4 is a flowchart of the performance monitoring program in Example 1 of the present invention.
FIG. 5 is a diagram showing the structure of the operation performance monitoring item management table, which manages the monitoring items for operation performance in Example 1 of the present invention.
FIG. 6 is a diagram showing the structure of the usage characteristic monitoring item management table, which manages the monitoring items for usage characteristics in Example 1 of the present invention.
FIG. 7 is a diagram showing the mechanism for managing the reference data in Example 1 of the present invention.
FIG. 8 is a flowchart of the operation status diagnosis process of the performance monitoring program in Example 1 of the present invention.
FIG. 9 is a diagram showing the structure of the monitoring data management table, which manages the monitored data in time series in Example 1 of the present invention.
FIG. 10 is a diagram showing the mechanism of the determination method in Example 1 of the present invention.
FIG. 11 is a diagram showing the mechanism of the determination method using data for a fixed period in Example 1 of the present invention.
FIG. 12 is a diagram showing the structure of the notification management table, which manages the notification content for each determination result in Example 1 of the present invention.
FIG. 13 is a diagram showing an example of the output screen in Example 1 of the present invention.
FIG. 14 is a diagram showing an outline of the system in Example 2 of the present invention.
FIG. 15 is a diagram showing the concept of the system in Example 3 of the present invention.
FIG. 16 is a diagram showing an example of the degree of abnormality for a single measurement in Example 3 of the present invention.

[First embodiment]
FIG. 1 is a conceptual diagram of a computer system for implementing the present invention. The system includes one or more user computers 100, one or more servers 102, a network device 103, a storage device 104, and a management server 101 for managing the system. The application program 106 runs on the one or more user computers 100, and each of the one or more servers 102 is connected to a network. In FIG. 1 the server 102 and the storage device 104 are connected via the network device 103, but they may be connected directly. The management server 101 is connected to each device via a management network (not shown). On the server, middleware 105 such as a database (DB) execution platform (hereinafter referred to as a DB server) or an application execution platform operates, and the application program 106 accesses the middleware via the Internet or a local network. The application program may run on the same server as the middleware.

FIG. 2 shows the hardware configuration of the management server 101. The management server includes one or more central processing units (CPU) 201, a memory 202, a secondary storage device 203 such as a hard disk, an input/output interface 204 that controls input from a keyboard and a mouse and output to a display, and a network interface 205 connected to a network. The hardware configuration of the other server 102 is the same. The server and the management server may be virtual servers.

The performance monitoring program 206 is loaded on the memory 202 of the management server 101 and executed by the CPU 201. The secondary storage device 203 stores data of the table 207 used by the performance monitoring program. In the application server 102, an application program is loaded on the memory and executed by the CPU.

Each server may be implemented as a virtual machine instead of a physical machine.

FIG. 3 shows the functional module configuration of the performance monitoring program. The program includes: an operation performance information collection unit 301 that collects operating information such as resource usage, which is a performance index of servers, storage devices, and the like; a usage information collection unit 302 that collects information on access to the middleware on the server; a monitoring standard creation unit 303 that creates monitoring standards using information collected for a certain period while the system is operating normally; a monitoring standard management table 304 that manages the created monitoring standards; an operation performance abnormality degree calculation unit 305 that calculates a degree of abnormality by comparing periodic measurement data of the operation performance information with the standard; a usage characteristic abnormality degree calculation unit 306 that calculates a degree of abnormality by comparing periodic measurement data of the usage information with the standard; a situation determination unit 307 that determines the situation from the calculated degrees of abnormality; and an output unit 308 that outputs the determination result.

FIG. 4 is a flowchart of the performance monitoring program and shows the processing flow of the present embodiment. Each step is executed by the central processing unit (CPU) 201. In the monitoring standard creation step S401, a monitoring standard is created for each of the operation performance information and the usage characteristic information collected for a certain period. The operation performance monitoring standard is created based on the operation performance monitoring item management table of FIG. 5, and the usage characteristic monitoring standard is created based on the usage characteristic monitoring item management table of FIG. 6. Therefore, the operation performance monitoring item management table 500 in FIG. 5 and the usage characteristic monitoring item management table 600 in FIG. 6 will be described first.

FIG. 5 is a diagram showing the configuration of the operation performance monitoring item management table. The table consists of a vector ID field 501 for managing a vector that combines a plurality of monitoring items, a target field 502 indicating the type of apparatus or software from which data is collected, and a monitoring item field 503 indicating the name of the monitored item. A monitoring item is an item for which information is periodically collected from a target device or software, such as a server, middleware on the server, a storage device, or a network device, and a vector is managed as a combination of one or more monitoring items. These monitoring items are values related to the resource performance of the monitored device and indicate how much of its processing capability the resource is currently using. In the example of FIG. 5, the vector with vector ID 1 manages the CPU usage rate and the memory usage rate as monitoring items, with a server as the collection target. The combinations for each vector ID 501 may be defined in advance by the system, or may be set by a user of the management server by addition or deletion. The combinations shown in FIG. 5 are merely examples.

FIG. 6 is a diagram showing the configuration of the usage characteristic monitoring item management table. The table consists of a vector ID field 601 for managing a vector that combines a plurality of monitoring items, a target field 602 indicating the type of apparatus or software from which data is collected, and a monitoring item field 603. In the example of FIG. 6, the vector with vector ID 1 manages the number of sessions and the number of transactions as monitoring items, with a server as the collection target. These monitoring items are values related to access to the server from an application program running on the user computer. The monitoring items are set in advance by the system, or set by a user of the management server by addition or deletion, and are managed as one or more vectors. The combinations for each vector ID 601 shown in FIG. 6 are merely examples.

We now return to the creation of the standard in the monitoring standard creation step S401 in FIG. 4. For each monitoring item in the monitoring item field 503 of the operation performance monitoring item management table of FIG. 5, information is collected for a certain period. The period may be fixed in advance in the system, or may be set by an administrator who uses the management server. Data of the monitoring items measured at the same monitoring time, or within the monitoring time error, are treated as one sample and expressed as a multi-dimensional vector value with each monitoring item as an axis.

For example, when the monitoring time error is less than 1 minute, the CPU usage rate data x1 measured at 10:00:00 and the memory usage rate data y1 measured at 10:00:10, both items of the monitoring item field 503, form one vector data (x1, y1). The data for the period, expressed as vector values, is classified into one or more groups. The classification method is, for example, the K-means method, which classifies values that are close to one another into a plurality of circles (in two dimensions; spheres in three or more dimensions) and extracts the center coordinates and radius of each group. Such a group is called a cluster. The classification result is stored in the monitoring reference management table in FIG. 7, which will be described later.
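The reference creation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the sample values, the explicit initial centers, and the choice of radius as the distance from the center to the farthest member are all assumptions, since the text only says that a center and a radius are extracted for each cluster.

```python
import math


def assign(points, centers):
    """Group each point with its nearest center (Euclidean distance)."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def create_monitoring_reference(points, initial_centers, max_iterations=50):
    """Plain K-means. Returns one (center, radius) tuple per cluster."""
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    for _ in range(max_iterations):
        groups = assign(points, centers)
        # Move each center to the mean of its members; keep it if the group is empty.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else center
            for center, members in zip(centers, groups)
        ]
        if updated == centers:
            break
        centers = updated
    groups = assign(points, centers)
    # Assumption: radius = distance from the center to the farthest member.
    return [
        (center, max((math.dist(center, p) for p in members), default=0.0))
        for center, members in zip(centers, groups)
    ]


# Hypothetical (CPU usage %, memory usage %) samples from normal operation.
samples = [(10, 20), (12, 22), (11, 21), (80, 70), (82, 72), (81, 71)]
reference = create_monitoring_reference(samples, initial_centers=[(10, 20), (80, 70)])
```

Each entry of `reference` corresponds to one row of the monitoring reference management table: a center coordinate per axis and a radius.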

Also, for the monitoring items managed in the usage characteristic monitoring item management table of FIG. 6, information is collected and classified in the same manner.

Next, an operation status diagnosis is performed using the created monitoring standard (S402). The operation diagnosis process will be described later using the detailed flow of FIG. 8.

After the operation diagnosis process, it is determined from the result whether notification is necessary (S403). Notification is performed (S404) when the determination result of any vector, based on the measurement data for one time, contains an abnormal state. Alternatively, notification may be performed when an abnormal state has occurred n or more times in the results of the past m measurements. The values of m and n are defined by the system or specified by the user.

The operation diagnosis process is performed each time measurement data is collected, but it may instead be performed on data for a certain period. In that case, whether notification is necessary may be determined by aggregating the determination results of the data for that period for each vector and outputting only the notification for the state that occurs most often.

Further, it is determined whether the monitoring standard needs to be re-created (S405). If the degree of abnormality of the usage characteristic data is equal to or greater than the threshold, it is checked whether the degree of abnormality of the past m usage characteristic data was equal to or greater than the threshold n or more times. If so, it is determined that the monitoring standard needs to be re-created, and the monitoring standards are created again for both operation performance and usage characteristics.
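The "n or more times out of the past m measurements" check can be sketched with a fixed-length window. The class name is hypothetical, and whether the window of m measurements includes the current one is an assumption (it does here).

```python
from collections import deque


class RecreationCheck:
    """Tracks whether the last m degrees of abnormality met the threshold."""

    def __init__(self, m, n, threshold):
        self.window = deque(maxlen=m)  # oldest entry drops out automatically
        self.n = n
        self.threshold = threshold

    def add(self, degree):
        """Record one measurement; return True if re-creation is needed."""
        exceeded = degree >= self.threshold
        self.window.append(exceeded)
        # Only checked when the current measurement itself meets the threshold.
        return exceeded and sum(self.window) >= self.n


check = RecreationCheck(m=5, n=3, threshold=1.0)
decisions = [check.add(d) for d in [0.5, 1.2, 1.5, 0.9, 1.1, 0.2, 0.3, 0.4]]
```

The same window could serve the notification decision in S403 by recording abnormal states instead of usage characteristic exceedances.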

FIG. 7 shows the mechanism for managing the created clusters that serve as the monitoring reference. FIG. 7A is a diagram showing the configuration of the monitoring reference management table that manages the created monitoring reference. The table consists of a cluster ID field 701 that identifies a cluster extracted by the monitoring reference creation process, a center coordinate field 702 that manages the center coordinates of each cluster as a numerical value for each axis of the vector, and a radius field 703 that manages the radius of the cluster circle (a sphere in three or more dimensions). FIG. 7 shows an example of a two-dimensional vector based on two monitoring items; for three or more dimensions, the number of axis fields of the center coordinates is adjusted to the number of dimensions. FIG. 7B is a diagram illustrating, on a two-dimensional graph, an example of reference clusters created from two monitoring items, the CPU usage rate and the memory usage rate. Here, four clusters are created and given IDs.

FIG. 8 is a flowchart showing the flow of the operation status diagnosis process. Each step is executed by the central processing unit (CPU) 201. For the measured usage characteristic data, a numerical value indicating the degree of deviation from the monitoring standard (hereinafter referred to as the degree of abnormality) is calculated (S801). The cluster whose center coordinates are closest to the measurement data is identified, the radius of that cluster is normalized to 1, and the degree of abnormality is calculated from the distance between the measurement data and the center coordinates. The farther the measurement data is from the cluster, the greater the degree of abnormality.
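One way to read this calculation is as the distance to the nearest cluster center divided by that cluster's radius, so that a value of 1 lies on the cluster boundary. The sketch below follows that reading; the function name and sample clusters are hypothetical, and a nonzero radius is assumed.

```python
import math


def abnormality_degree(measurement, clusters):
    """clusters: list of (center, radius) pairs with radius > 0 (assumed)."""
    # Identify the cluster whose center is closest to the measurement.
    center, radius = min(clusters, key=lambda c: math.dist(measurement, c[0]))
    # Dividing by the radius normalizes the cluster radius to 1.
    return math.dist(measurement, center) / radius


# Hypothetical reference clusters: (center, radius).
clusters = [((0.0, 0.0), 2.0), ((10.0, 10.0), 1.0)]
outside = abnormality_degree((3.0, 4.0), clusters)
inside = abnormality_degree((10.0, 10.5), clusters)
```

With these values, `outside` is 2.5 (distance 5 from a cluster of radius 2) and `inside` is 0.5.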

The management server manages the threshold for the degree of abnormality as a criterion for notifying the user. The threshold value may be the same value or different value between the operation performance monitoring vector and the usage characteristic monitoring vector. The threshold value may be defined in advance by the system, or may be set by the user.

Next, as with the usage characteristics, the degree of abnormality is calculated from the measurement data for each operation performance monitoring vector (S802).
In FIG. 8, the degree of abnormality of the usage characteristic data is first compared with the threshold (S803), and then the degree of abnormality of the operation performance data is compared with the threshold (S804, S805). The state is determined from the result (S806 to S809). Here, the operating status of the resource is defined by the following conditions.

- Normal state: the usage characteristics are less than the threshold and the operation performance is less than the threshold
- Warning state: the usage characteristics are less than the threshold and the operation performance is equal to or greater than the threshold
- Attention-required state: the usage characteristics are equal to or greater than the threshold and the operation performance is equal to or greater than the threshold
- Caution (low risk) state: the usage characteristics are equal to or greater than the threshold and the operation performance is less than the threshold

The state determination by comparison with the threshold is repeated for all operation performance monitoring vectors (S810, S811).
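The four conditions above amount to a two-by-two decision on the two degrees of abnormality. A minimal sketch, with hypothetical names and with the values compared being the degrees of abnormality rather than raw measurements:

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    ATTENTION_REQUIRED = "attention required"
    CAUTION_LOW_RISK = "caution (low risk)"


def determine_state(usage_degree, performance_degree,
                    usage_threshold, performance_threshold):
    """Map the two degrees of abnormality to an operating state."""
    usage_changed = usage_degree >= usage_threshold                    # S803
    performance_abnormal = performance_degree >= performance_threshold  # S804, S805
    if not usage_changed:
        return State.WARNING if performance_abnormal else State.NORMAL
    return State.ATTENTION_REQUIRED if performance_abnormal else State.CAUTION_LOW_RISK
```

Separate thresholds are passed in because the text allows the operation performance and usage characteristic vectors to use different values.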

FIG. 9 shows a configuration of a monitoring data management table for managing measurement data of monitoring items. For each of the usage characteristic vector and the operation performance monitoring vector, the measured data and the calculated degree of abnormality are managed for each time. Here, the measurement data with respect to the time is regarded as data at the time within the monitoring time error. For example, assuming that the monitoring time error is less than ± 30 seconds, the measurement data at 10:00 is assumed to be data having a monitoring time of 09:59:31 to 10:00:30.
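The mapping from a raw timestamp to its monitoring time in this example can be sketched as below. The function name is hypothetical, timestamps are assumed to have whole-second resolution, and the window of 09:59:31 to 10:00:30 is taken directly from the example above.

```python
from datetime import datetime, timedelta


def monitoring_time(timestamp):
    """Map a whole-second timestamp to its monitoring minute.

    Seconds 31..59 of the previous minute and 00..30 of the current
    minute belong to the same monitoring time.
    """
    shifted = timestamp + timedelta(seconds=29)
    return shifted.replace(second=0, microsecond=0)


# Hypothetical date; only the time of day matters for the illustration.
early = monitoring_time(datetime(2016, 2, 3, 9, 59, 31))
late = monitoring_time(datetime(2016, 2, 3, 10, 0, 30))
next_slot = monitoring_time(datetime(2016, 2, 3, 10, 0, 31))
```

Here `early` and `late` both fall on 10:00, while `next_slot` falls on 10:01.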

FIG. 10 is a diagram showing the mechanism of operation status diagnosis. FIG. 10A is an example of a usage characteristic monitoring vector. As monitoring items, the number of transactions indicating the use of the DB server is on the x-axis and the number of sessions is on the y-axis. Each circle indicates a cluster that serves as the monitoring standard. FIG. 10B is an example of an operation performance monitoring vector. As monitoring items, the CPU usage rate is on the x-axis and the memory usage rate is on the y-axis. The circles labeled #1 to #4 on each vector graph indicate measurement data, and circles with the same number are data at the same time. For example, from the data managed in FIG. 9, when the measurement data at time T1 is #1, the data #1 (1001) in FIG. 10A has a degree of abnormality of a1, which is less than the threshold.

In FIG. 10B, the data #1 (1002) has a degree of abnormality of a11, which is equal to or greater than the threshold, and lies outside the cluster circle. FIG. 10C plots the degree of abnormality of the usage characteristics on the x-axis and the degree of abnormality of the operation performance on the y-axis. The degrees of abnormality (a1, a11) at time T1 are plotted at the position of the data 1003. Since this position is in the warning range 1004, the state is determined to be a warning. Similarly, when #2 to #4 are plotted based on their degrees of abnormality, they fall in the normal range 1005, the attention-required range 1006, and the caution (low risk) range 1007, respectively, and each state is determined accordingly.

FIG. 11 is a diagram showing an example of the result of monitoring data for a certain period for one operation performance vector. When data for a certain period is used in determining whether notification is necessary (S403) in the flowchart of FIG. 4, the number of data determined to be in each of the normal, warning, attention-required, and caution (low risk) states within that period is counted, and the state with the largest number of data is notified. For example, as shown in FIG. 11, when the most data lies in the range (1101) in which the degree of abnormality is equal to or greater than the threshold, the state is determined to be attention required and a notification is output.
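Counting the per-measurement states over a period and picking the most frequent one can be sketched as follows. The function name and state strings are hypothetical; the tie-breaking rule is an assumption, since the text does not say what happens when two states are equally frequent (here the state seen first wins), and the input is assumed to be non-empty.

```python
from collections import Counter


def state_to_notify(states):
    """Return the most frequent state among the determinations for a period."""
    counts = Counter(states)
    # most_common keeps first-seen order among equal counts.
    state, _count = counts.most_common(1)[0]
    return state


period = ["normal", "attention required", "warning",
          "attention required", "attention required", "normal"]
selected = state_to_notify(period)
```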

Further, in determining whether the reference needs to be re-created (S405), if a certain ratio or more of the data lies in the ranges (1101 and 1102) in which the degree of abnormality of the usage characteristics in FIG. 11 is equal to or greater than the threshold, it is determined that the criteria for the usage characteristics need to be re-created. When re-creation is determined to be necessary, the management server 101 re-creates both the operation performance criteria and the usage characteristic criteria stored in the secondary storage device 203. Specifically, the monitoring reference management table in FIG. 7 is updated.

FIG. 12 is a table for managing the notification contents to be output according to the state. It includes a target field 1201 indicating the resource to be monitored, a status field 1202 indicating the determined state, a type field 1203 indicating the message type corresponding to the status field 1202, and a message field 1204 for managing the message contents. In this example, the normal state is managed with no message (null), and nothing is output for it. The message may include the monitoring items of the target vector or the target device.

FIG. 13 is a diagram showing an example of an output screen in the present embodiment. The upper level (1301) displays the degree of abnormality of the monitored performance monitoring vector in time series, and the lower level (1302) displays the output notification as an event list. In the event list, a message proposing an appropriate countermeasure may be displayed together with the type of notification. Even when the operating performance exceeds the threshold on the upper graph, the notifications are of different types such as warning (1303) and caution (1304). With these displays, the administrator can identify appropriate countermeasures to be taken according to different notifications, and prompt actions can be taken.

Also, some administrators may want to preferentially respond to, for example, a warning (1303) notification among many notifications. Therefore, on the output screen shown in FIG. 13, the administrator may accept selection of a notification type to be notified, and display only the notification of the selected notification type. Thereby, when many notifications are generated, only the notifications currently required by the administrator are displayed, thereby improving the management efficiency of the administrator. FIG. 13 is merely an example of an output screen. For example, the screen of FIG. 11 may be output.

As described above, in resource monitoring, notifications can be differentiated by appropriately determining whether a change in the characteristics of the resource user is the cause, which reduces the administrator's burden of isolating the cause.

In one specific example, a DB server is provided to a user's application program as PaaS (Platform as a Service) in a cloud environment. Suppose that monitoring of the provided system detects that the CPU usage rate of the server running the DB server differs from usual. If the CPU usage rose because the DB server used an inappropriate execution plan, the cause is on the provided resource side, that is, an abnormality of the resource itself. If instead the number of transactions is larger than usual, the CPU usage rose because of a change in the usage (usage characteristics) of the resource, such as an increase in input. The prior art cannot distinguish these cases, so the administrator must analyze which one applies, and appropriate measures cannot be taken promptly.

According to the present invention, however, when a change in the CPU usage rate is detected and notified, a different notification is issued depending on whether the number of transactions has changed, so appropriate measures can be taken quickly.
[Second Embodiment]
As a modification of the first embodiment of the present invention, an embodiment in which the middleware used by an application program is distributed over a plurality of servers will be described. The first embodiment monitors the operation status of one device and monitors the usage characteristics of the application program with one vector, whereas this embodiment differs in that there are a plurality of devices and middleware instances whose operation status is monitored.

FIG. 14 shows an outline of the system targeted by the present invention in the second embodiment. The same middleware operates on a plurality of servers, and the application program and the servers are connected via the load balancer 1401. Access from an application program is distributed by the load balancer to the plurality of middleware instances and processed. The distribution may instead be executed by the user computer 100 or the server 102 having load distribution software, or by a device other than the user computer 100 or the server 102 that has load distribution software.

Here, as in the first embodiment, a DB server is used as an example of the middleware. In a configuration in which a plurality of DB servers perform distributed processing, the DB servers share data by sharing the storage device.

In this embodiment, the usage characteristics of the application program are acquired from the OS and DB server of each server that is the access destination. Further, a value obtained by adding the measurement data of the monitoring items related to the usage information acquired from each server at the same time is calculated. Note that data whose monitoring time is within a certain error is regarded as measurement data at the same time.
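The summation across servers can be sketched as an element-wise sum of the per-server usage vectors for one monitoring time. The function name, server names, and values are hypothetical, and each vector is assumed to hold (number of transactions, number of sessions) as in the FIG. 9 example.

```python
def total_usage(per_server):
    """per_server: dict mapping server name to a usage vector for one time.

    Returns the element-wise sum across all servers.
    """
    return tuple(sum(axis) for axis in zip(*per_server.values()))


# Hypothetical measurements regarded as taken at the same monitoring time.
at_t1 = {"db-server-1": (120, 30), "db-server-2": (80, 25)}
total_at_t1 = total_usage(at_t1)
```

The resulting total vector is then monitored against its own reference, alongside the per-server vectors.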

For the usage characteristic monitoring vector, a vector for monitoring each of the plurality of DB servers and a vector for monitoring the total value are provided. In the monitoring data management table shown in FIG. 9, columns for usage characteristic monitoring (number of transactions, number of sessions, degree of abnormality in the example of FIG. 9) are provided for each server. Furthermore, for each of the number of transactions and the number of sessions, a column for managing the total value of all servers in a distributed configuration and the degree of abnormality in the total value is provided.

As in the first embodiment, the operation performance is collected from servers, storage devices, etc., and an operation performance monitoring vector is provided for each device for monitoring.

The summation of measurement data is performed in the monitoring reference creation process (S401) and the operation diagnosis process (S402) in the flowchart of FIG. 4.

Two kinds of monitoring standard are created for monitoring usage characteristics: a standard for each DB server to which access is distributed, and a standard for each application program based on the total value of the access distributed to the DB servers.
In the operation diagnosis process shown in the flowchart of FIG. 8, the step of calculating the degree of abnormality from the usage characteristic data (S801) calculates both the degree of abnormality of the usage characteristics for each server and the degree of abnormality of the total value of the usage characteristics of the servers. The degree of abnormality of the usage characteristics for each server is calculated from the usage characteristics of that server and the standard for that DB server. The degree of abnormality of the total value is calculated from the total of the usage characteristics of the servers and the standard for the total.

The calculation method is the same as in the first embodiment. In the step of comparing the degrees of abnormality and determining the situation, when the degree of abnormality of the usage characteristics is compared with the degree of abnormality of the server operation performance, the determination is first made, as in the first embodiment, by comparing, for each server, the degree of abnormality of that server's usage characteristics with the degree of abnormality of that server's operation performance. That is, in step S801 of FIG. 8, the degree of abnormality of the usage characteristics for each server is calculated from the usage characteristics of that server and the standard for that DB server. In step S803, it is determined whether the degree of abnormality of the usage characteristics for each server is less than the threshold. The other steps in FIG. 8 and the display to the user after the determination are the same as in the first embodiment.

This determination makes it possible to issue notifications that determine the situation while taking into account how the access from the application program distributed to each server affects the operation performance of that server.

In addition to the above determination, the present embodiment adds a process that compares the degree of abnormality of each operation performance of each server with the degree of abnormality of the usage characteristics of the application program, based on the total value of the distributed access. That is, in step S801 of FIG. 8, the degree of abnormality of the total value of the usage characteristics is calculated from the total value of the usage characteristics of the servers and the standard for each application program, which is based on the total value of the distribution to the DB servers. Step S803 determines whether the degree of abnormality of the total value is less than a threshold value. The other steps in FIG. 8 and the display to the user after the determination are the same as in the first embodiment.

The determination described earlier, which uses the degree of abnormality of the usage characteristics of each server, cannot tell whether the distribution ratio has changed or the usage characteristics of the application program itself have changed. By also comparing the degree of abnormality of the total value, it is possible to notify whether the usage characteristics of the application program differ from normal.
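As a rough illustration of the per-server comparison and the total-value comparison described above, the following Python sketch flags each DB server and the total separately. It is only a sketch under stated assumptions: the function names, the (mean, standard deviation) form of each standard, and the deviation measure are hypothetical stand-ins, since this embodiment calculates the degree of abnormality in the same way as the first embodiment.

```python
def degree_of_abnormality(value, standard):
    """Assumed placeholder measure: distance from the standard's mean,
    expressed in standard deviations. Not the document's own formula."""
    mean, std = standard
    return abs(value - mean) / std


def diagnose_usage(per_server, server_standards, app_standard, threshold):
    """Return (per-server flags, total flag); True means 'differs from normal'."""
    per_server_flags = {
        name: degree_of_abnormality(value, server_standards[name]) >= threshold
        for name, value in per_server.items()
    }
    # The application-level standard is compared against the sum over all servers.
    total = sum(per_server.values())
    total_flag = degree_of_abnormality(total, app_standard) >= threshold
    return per_server_flags, total_flag


# Hypothetical access counts: the split between servers shifted, the total did not.
flags, total_flag = diagnose_usage(
    per_server={"db1": 150.0, "db2": 50.0},
    server_standards={"db1": (100.0, 10.0), "db2": (100.0, 10.0)},
    app_standard=(200.0, 20.0),
    threshold=3.0,
)
```

In this made-up case both servers are flagged while the total is not, which is the situation where the per-server result alone could not say whether the application program itself had changed.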

Furthermore, when the degree of abnormality of the usage characteristics is compared with the degree of abnormality of the storage device operation performance, the storage device is shared by the distributed servers, so the degree of abnormality of the total usage characteristics is used and is compared with the degree of abnormality of each operation performance of the storage device. That is, in step S801 of FIG. 8, the degree of abnormality of the total value of the usage characteristics is calculated from the total value of the usage characteristics of the servers and the standard for each application program, which is based on the total value of the distribution to the DB servers. Step S803 determines whether the degree of abnormality of the total value is less than a threshold value. The other steps in FIG. 8 and the display to the user after the determination are the same as in the first embodiment.

By this determination, even when the operation performance of the storage device differs from normal (the degree of abnormality is equal to or greater than the threshold), an appropriate notification can be made that reflects whether the usage characteristics of the application program differ from normal. Further, in the present embodiment, the determination process (S405) in the flowchart of FIG. 4, which determines whether the standard needs to be re-created, uses the degree of abnormality calculated from the total value of the usage characteristic data of the servers in order to determine whether the usage status has changed for each application program. When the degree of abnormality of the total usage characteristic data is equal to or greater than the threshold, and a certain number or more of the past data within a certain period are also equal to or greater than the threshold, it is determined that the monitoring standards need to be re-created for the usage characteristic vector and each operation performance vector.
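The re-creation check in S405 described above can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the function name, the list-based history, and the parameter names are assumptions made for the example.

```python
def needs_recreation(recent_abnormalities, threshold, min_count):
    """Decide whether the monitoring standards should be re-created.

    recent_abnormalities: degrees of abnormality of the total usage
    characteristic over the look-back period, oldest first (assumed layout).
    """
    # The check only applies when the latest value is at or above the threshold.
    if not recent_abnormalities or recent_abnormalities[-1] < threshold:
        return False
    # Count how many samples in the period are at or above the threshold.
    over = sum(1 for a in recent_abnormalities if a >= threshold)
    return over >= min_count


# Hypothetical history: three of the four samples are at or above 3.0.
recreate = needs_recreation([0.5, 3.2, 3.5, 4.0], threshold=3.0, min_count=3)
```

Requiring a minimum count within the period keeps a single outlier from discarding a standard that is still valid.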

As described above, even in a system with a distributed processing configuration, comparing the operation performance of the distributed resources with the usage characteristic data of the application program that uses those resources makes it possible to determine the state of the resources appropriately and notify accordingly.
[Third embodiment]
As a modification of the first embodiment of the present invention, an embodiment in which a plurality of pieces of software use the resources of one device will be described. In the first embodiment, the operation performance of one device and the usage characteristics of the application program are each monitored with one vector. The present embodiment differs in that there are a plurality of usage characteristic vectors for the operation performance of one device.
Here, a server virtualization environment is taken as an example. FIG. 15 shows an outline of a system targeted by the present invention in the third embodiment. In this configuration, the resources of one physical server 1501 are virtualized by a hypervisor 1502 that is virtualization platform software and used by a plurality of virtual machines 1503. In the cloud environment, an IaaS (Infrastructure as a Service) form that provides virtual machines to customers is assumed.

The physical server 1501 is connected to the storage device 104 via the network device 103 as in FIG. 1, but may be directly connected. The management server 101 is connected to each device shown in FIG. 15 via a management network (not shown). Application programs run on the virtual machines 1503, but individual application programs are not monitored here; instead, information on the use of physical server resources by each virtual machine is acquired as the usage characteristic monitoring vector information. The usage characteristic monitoring item management table of FIG. 6 targets virtual machines and manages the combination of CPU usage rate and memory usage rate, which are the monitoring items of a virtual machine.

As for the operation performance information, information on operation is collected from the device as in the first embodiment. Here, the hypervisor of the physical server is targeted, and items related to resource competition and the like are collected as information on the operation performance monitoring vector. For example, a value indicating the percentage of time when execution of the virtual machine could not be scheduled by the CPU, memory swap usage, and the like are set. In the operation performance monitoring item management table of FIG. 5, the target is a hypervisor and these items are managed in combination.

Measured data is managed by providing a column for usage characteristic monitoring for each virtual machine in the monitoring data management table of FIG. The operational performance monitoring column is managed by the monitoring item column for the hypervisor.

In the flowchart of FIG. 4, the monitoring standard creation process (S401) is the same as in the first embodiment. A monitoring standard is created from past measurement data for the usage characteristic data of each virtual machine and for the operation performance data of the hypervisor.

In the operation status diagnosis process of FIG. 8, the degree of abnormality is calculated for the usage characteristic data of each virtual machine and for the operation performance data of the hypervisor. In the processing that judges the situation by comparing the data, the first embodiment compares the degree of abnormality of each operation performance data with the degree of abnormality of one usage characteristic data in the same time period. The present embodiment differs in that the degrees of abnormality of a plurality of usage characteristic data in the same time period are compared with the degree of abnormality of one operation performance data.

FIG. 16 is a diagram showing the mechanism for judging the data at a certain time in the present embodiment. The degree of abnormality of the operation performance data of the hypervisor is represented on the y axis, and the degree of abnormality of the usage characteristic data of each virtual machine is represented on the x axis. Each circle indicates the coordinates that represent the degrees of abnormality at a certain time as a vector. Points for the same time share the same degree of abnormality of the operation performance data. FIG. 16 shows data 1601 at time T1 and data 1602 at time T2.

In the determination in the present embodiment, when the operation performance data of the hypervisor differs from normal (the degree of abnormality is equal to or greater than the threshold) and all the virtual machines show the same usage characteristics as normal (the degree of abnormality is less than the threshold) (1601), the operating status of the hypervisor is determined to be a warning state.

When the operation performance data of the hypervisor differs from normal (the degree of abnormality is equal to or greater than the threshold) and the usage characteristic data of some virtual machines differs from normal (the degree of abnormality is equal to or greater than the threshold) (1602), the behavior of the hypervisor is attributed to the change in usage characteristics, and the state is determined to be one requiring caution. In this case, the caution state may be determined only when the virtual machines whose degree of abnormality of the usage characteristic data is equal to or greater than the threshold account for a specific ratio or more of the total number, or it may be determined even when only one virtual machine has a degree of abnormality equal to or greater than the threshold. The judgment condition on the proportion of virtual machines included in each range is defined in advance by the system or the administrator. When the operation performance data of the hypervisor is less than the threshold, whether the state is normal or caution (low risk) is likewise judged from the degree of abnormality of the usage characteristic data of each virtual machine and the proportion of virtual machines included in each range.
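The determination described above can be sketched in Python as follows. This is an illustrative sketch only: the function name, the state labels, and the `min_ratio` parameter are assumptions introduced for the example, and treating a ratio below the configured minimum as "usage characteristics unchanged" is one possible reading of the judgment condition.

```python
def classify(perf_abnormality, vm_abnormalities, perf_threshold,
             usage_threshold, min_ratio=0.0):
    """Return (state, virtual machines whose usage characteristics changed).

    vm_abnormalities maps a VM name to the degree of abnormality of its
    usage characteristic data. min_ratio=0.0 means a single changed VM is
    enough; a larger value requires that share of all VMs (assumption).
    """
    changed = [vm for vm, a in vm_abnormalities.items() if a >= usage_threshold]
    ratio = len(changed) / len(vm_abnormalities) if vm_abnormalities else 0.0
    usage_changed = bool(changed) and ratio >= min_ratio

    if perf_abnormality >= perf_threshold:
        # Hypervisor behaves abnormally: caution if usage explains it, else warning.
        state = "caution" if usage_changed else "warning"
    else:
        state = "caution (low risk)" if usage_changed else "normal"
    return state, changed


# Hypothetical degrees of abnormality at one point in time.
vms = {"VM1": 4.0, "VM2": 0.7, "VM3": 0.2}
state, changed = classify(5.0, vms, perf_threshold=3.0, usage_threshold=3.0)
```

With the default `min_ratio`, the single changed virtual machine is enough to classify the abnormal hypervisor behavior as caution rather than warning.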

The notification messages are the same as in FIG. 12 of the first embodiment. They are managed for each state with the hypervisor as the target, and the message matching the determination is notified.

As for notification, besides notifying whenever the result of a single determination is other than normal, the state that accounts for the most determination results over a fixed period may be notified. For example, if the determination results from time T1 to T10 are a warning at T1 and caution from T2 to T10, the caution state is notified after the determination at T10.
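Notifying the most frequent state over a fixed period can be sketched as below. The function name and the string labels are hypothetical, and how ties are resolved is an assumption (here the state seen first among the tied ones wins), since the text does not specify it.

```python
from collections import Counter


def state_to_notify(results):
    """Return the state that occurs most often in the period, or None if empty."""
    if not results:
        return None
    # most_common(1) yields the (state, count) pair with the highest count.
    return Counter(results).most_common(1)[0][0]


# Hypothetical determination results collected over one period.
period_state = state_to_notify(
    ["normal", "caution", "caution", "warning", "caution"]
)
```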

Further, in this embodiment, the notification of the determined state includes virtual machine information. For example, when the determined state is a warning, the degree of abnormality of the usage characteristic data of every virtual machine is less than the threshold, so the notification states that "no virtual machine affects the operation performance". When the determined state is caution, there is a virtual machine whose degree of abnormality of the usage characteristic data is equal to or greater than the threshold, so information such as "the virtual machines whose usage characteristics have changed are VM1, VM2, VM3" is added to the notification.
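A small Python sketch of how the virtual machine information might be attached to the notification follows. The function name, the state labels, and the exact message wording are illustrative assumptions, not the messages of FIG. 12.

```python
def notification_detail(state, changed_vms):
    """Build the virtual machine part of a notification (hypothetical wording)."""
    if state == "warning":
        # Warning: every VM's usage characteristic is below the threshold.
        return "No virtual machine affects the operation performance"
    if state.startswith("caution") and changed_vms:
        # Caution: list the VMs at or above the usage threshold.
        return ("Virtual machines whose usage characteristics have changed: "
                + ", ".join(changed_vms))
    return ""


detail = notification_detail("caution", ["VM1", "VM2", "VM3"])
```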
The display to the user is not limited to the notification shown in FIG. 12. For example, the management computer may display the screen shown in FIG. 16 and show the user which virtual machine, such as VM1, VM2, or VM3, corresponds to each data point on that screen. As a result, the user can grasp which virtual machines affect the operation performance and the degree of abnormality of their usage characteristics.

In the present embodiment, there are a plurality of usage characteristic data, one for each virtual machine. The determination process (S405) in the flowchart of FIG. 4 therefore determines that the standard needs to be re-created when the virtual machines whose degree of abnormality is equal to or greater than the threshold reach or exceed a specific ratio defined by the system. The monitoring standards are then re-created for the usage characteristic data of each virtual machine and for the operation performance data of the hypervisor.

As described above, by comparing the operation performance of the hypervisor, which is the resource provider, with the usage characteristics of each virtual machine that uses the resources, it is possible to determine whether there is a problem with the resources of the hypervisor or whether a virtual machine whose usage characteristics have changed is having an influence, and to notify appropriately. In addition, when the operation performance of the hypervisor differs from normal, the administrator can easily identify the virtual machine that is having an influence.

Further, the present embodiment is not limited to the configuration of FIG. 15, and can be applied to the case where a plurality of user computers 100 exist in FIG. 1 and the server 102 is accessed from a plurality of application programs 106.

In this case, a value related to access to the server 102 from the application program 106, shown in FIG. 6, is managed for each of the plurality of application programs 106. The information acquired as usage characteristic monitoring vector information is not information on the use of the physical server by each virtual machine but a value related to access to the server 102 for each of the plurality of application programs 106.

The operation performance information is the same as that in the first embodiment, and the measurement data is managed by providing a column for use characteristic monitoring for each application program (AP) 106 in the monitoring data management table of FIG.

In the flowchart of FIG. 4, the monitoring reference creation process (S401) is the same as that of the first embodiment. A monitoring reference is created from past measurement data for the usage characteristic data for each application program 106 and the performance data for the monitoring items shown in FIG.

In the operation status diagnosis process of FIG. 8, the degree of abnormality is calculated for the usage characteristic data of each application program 106 and for the operation performance data of the server 102 and the storage device 104. In the process of judging the situation by comparing the data, the degrees of abnormality of a plurality of usage characteristic data in the same time period are compared with the degree of abnormality of one operation performance data, as in the configuration of FIG. 15.

FIG. 16 in this case differs from the configuration of FIG. 15 in that the degree of abnormality of the operation performance data of the server 102 or the storage device 104 is represented on the y axis and the degree of abnormality of the usage characteristic data of each application program 106 is represented on the x axis.

In the determination for the configuration of FIG. 1, when the operation performance data of the server 102 or the storage device 104 differs from normal (the degree of abnormality is equal to or greater than the threshold) and all of the plurality of usage characteristics are the same as normal (the degree of abnormality is less than the threshold) (1601), the operating status of the server 102 or the storage device 104 is determined to be a warning state.

When the operation performance data of the server 102 or the storage device 104 differs from normal (the degree of abnormality is equal to or greater than the threshold) and some of the usage characteristic data differs from normal (the degree of abnormality is equal to or greater than the threshold) (1602), the behavior of the server 102 or the storage device 104 is attributed to the change in usage characteristics, and the state is determined to be one requiring caution. Here, the caution state may be determined only when the application programs 106 whose degree of abnormality of the usage characteristic data is equal to or greater than the threshold account for a specific ratio or more of the total number, or it may be determined even when there is only one such application program. The judgment condition on the proportion of application programs 106 included in each range is defined in advance by the system or the administrator. When the operation performance data of the server 102 or the storage device 104 is less than the threshold, whether the state is normal or caution (low risk) is likewise judged from the degree of abnormality of the usage characteristic data of each application program 106 and the proportion of application programs 106 included in each range.

As for the message at the time of notification, messages are managed for each state with the server 102 or the storage device 104 as the target, and the message matching the determination is notified.

Further, in the present embodiment with the configuration of FIG. 1, the notification of the determined state includes information on the application programs 106. For example, when the determined state is a warning, the degree of abnormality of the usage characteristic data of every application program 106 is less than the threshold, so the notification states that "there is no application program 106 that affects the operation performance". When the determined state is caution, there is an application program 106 whose degree of abnormality of the usage characteristic data is equal to or greater than the threshold, so information such as "the application programs 106 whose usage characteristics have changed are AP1, AP2, AP3" is added to the notification.

The display to the user is not limited to the notification shown in FIG. 12. For example, the management computer may display the screen shown in FIG. 16 and show the user which application program 106, such as AP1, AP2, or AP3, corresponds to each data point on that screen. As a result, the user can grasp which application programs 106 affect the operation performance and the degree of abnormality of their usage characteristics.

In the present embodiment with the configuration of FIG. 1, there are a plurality of usage characteristic data, one for each application program 106. The determination process (S405) in the flowchart of FIG. 4 therefore determines that the standard needs to be re-created when the application programs 106 whose degree of abnormality is equal to or greater than the threshold exceed a specific ratio defined by the system. The monitoring standards are then re-created for the usage characteristic data of each application program 106 and for the operation performance data of the server 102 and the storage device 104.

As described above, by comparing the operation performance of the server 102 and the storage device 104, which are the resource providers, with the usage characteristics, which are values related to access to the server 102 from each application program 106, it is possible to determine whether there is a problem with the resources of the server 102 and the storage device 104 or whether an application program 106 whose usage characteristics have changed is having an influence, and to notify appropriately. In addition, when the operation performance differs from normal, the administrator can easily identify which application program 106 is having an influence.

100: user computer 101: management server 102: server 103: network device 104: storage device 105: execution platform software 106: application program

Claims

A management computer that manages a first managed computer accessed from a first application program, the management computer comprising a processor, wherein
the processor:
acquires a first operating performance, which is a value related to resource performance of the first managed computer, and a first usage characteristic, which is a value related to access to the first managed computer from the first application program;
calculates a degree of abnormality of the first operating performance and a degree of abnormality of the first usage characteristic; and
notifies an operating status of the first managed computer based on the calculated degree of abnormality of the first operating performance and the calculated degree of abnormality of the first usage characteristic.
The management computer according to claim 1, wherein
the processor:
makes a first determination that compares the degree of abnormality of the first operating performance with a first threshold, and a second determination that compares the degree of abnormality of the first usage characteristic with a second threshold; and
notifies the operating status of the first managed computer based on results of the first determination and the second determination.
The management computer according to claim 2, further comprising a storage device, wherein
the storage device stores a reference value of operating performance and a reference value of usage characteristics, and
the processor:
calculates a degree of deviation of the acquired first operating performance from the reference value of the operating performance as the degree of abnormality of the first operating performance; and
calculates a degree of deviation of the acquired first usage characteristic from the reference value of the usage characteristic as the degree of abnormality of the first usage characteristic.
The management computer according to claim 3, wherein
the management computer:
displays a first notification when, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold; and
displays a second notification when, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold.
The management computer according to claim 3, wherein
the processor, within a predetermined period, counts:
a first number of times that, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold;
a second number of times that, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold;
a third number of times that, in the first determination, the degree of abnormality of the first operating performance is less than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold; and
a fourth number of times that, in the first determination, the degree of abnormality of the first operating performance is less than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold,
the processor identifies the largest of the first number, the second number, the third number, and the fourth number, and
the management computer displays a different notification depending on which of the first number, the second number, the third number, and the fourth number is the largest.
The management computer according to claim 3, wherein
the processor creates the reference value of the operating performance from the acquired first operating performance and creates the reference value of the usage characteristic from the acquired first usage characteristic, and
the storage device stores the created reference value of the operating performance and the created reference value of the usage characteristic.
The management computer according to claim 6, wherein
the processor re-creates the reference value of the operating performance and the reference value of the usage characteristic when, within a predetermined period, the proportion of the degrees of abnormality of the first usage characteristic that exceed the second threshold exceeds a predetermined value, and
the storage device stores the re-created reference value of the operating performance and the re-created reference value of the usage characteristic.
The management computer according to claim 1, wherein
the management computer further manages a second managed computer, and
the processor:
acquires a second operating performance, which is a value related to resource performance of the second managed computer, and a second usage characteristic, which is a value related to access to the second managed computer from the first application program;
calculates a degree of abnormality of the second operating performance and a degree of abnormality of the second usage characteristic;
calculates a total usage characteristic that is the sum of the first usage characteristic and the second usage characteristic in the previous period;
calculates a degree of abnormality of the total usage characteristic; and
notifies the operating status of the first managed computer and the second managed computer based on the calculated degree of abnormality of the first operating performance, the calculated degree of abnormality of the first usage characteristic, the calculated degree of abnormality of the second operating performance, the calculated degree of abnormality of the second usage characteristic, and the calculated degree of abnormality of the total usage characteristic.
The management computer according to claim 1, wherein
the management computer is further accessed from a second application program, and
the processor further:
acquires a third usage characteristic, which is a value related to access to the second managed computer from the second application program;
calculates a degree of abnormality of the third usage characteristic; and
notifies the operating status of the first managed computer based on the calculated degree of abnormality of the first operating performance, the calculated degree of abnormality of the first usage characteristic, and the calculated degree of abnormality of the third usage characteristic.
A management method of a first managed computer accessed from a first application program, the method comprising:
acquiring a first operating performance, which is a value related to resource performance of the first managed computer, and a first usage characteristic, which is a value related to access to the first managed computer from the first application program;
calculating a degree of abnormality of the first operating performance and a degree of abnormality of the first usage characteristic; and
notifying an operating status of the first managed computer based on the calculated degree of abnormality of the first operating performance and the calculated degree of abnormality of the first usage characteristic.
The management method of a managed computer according to claim 10, further comprising:
making a first determination that compares the degree of abnormality of the first operating performance with a first threshold;
making a second determination that compares the degree of abnormality of the first usage characteristic with a second threshold; and
notifying the operating status of the first managed computer based on results of the first determination and the second determination.
The management method of a managed computer according to claim 11, wherein
the management computer further includes a storage device,
the storage device stores a reference value of operating performance and a reference value of usage characteristics, and
the method further comprises:
calculating a degree of deviation of the acquired first operating performance from the reference value of the operating performance as the degree of abnormality of the first operating performance; and
calculating a degree of deviation of the acquired first usage characteristic from the reference value of the usage characteristic as the degree of abnormality of the first usage characteristic.
The management method of a managed computer according to claim 12, further comprising:
displaying a first notification when, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold; and
displaying a second notification when, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold.
The management method of a managed computer according to claim 12, further comprising:
counting, within a predetermined period,
a first number of times that, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold,
a second number of times that, in the first determination, the degree of abnormality of the first operating performance is equal to or greater than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold,
a third number of times that, in the first determination, the degree of abnormality of the first operating performance is less than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is equal to or greater than the second threshold, and
a fourth number of times that, in the first determination, the degree of abnormality of the first operating performance is less than the first threshold and, in the second determination, the degree of abnormality of the first usage characteristic is less than the second threshold;
identifying the largest of the first number, the second number, the third number, and the fourth number; and
making a different notification depending on which of the first number, the second number, the third number, and the fourth number is the largest.
The management method of a managed computer according to claim 12, further comprising:
creating the reference value of the operating performance from the acquired first operating performance;
creating the reference value of the usage characteristic from the acquired first usage characteristic; and
storing the created reference value of the operating performance and the created reference value of the usage characteristic.
The management method of a managed computer according to claim 15, further comprising:
re-creating the reference value of the operating performance and the reference value of the usage characteristic when, within a predetermined period, the proportion of the degrees of abnormality of the first usage characteristic that exceed the second threshold exceeds a predetermined value; and
storing the re-created reference value of the operating performance and the re-created reference value of the usage characteristic.
The management method of a managed computer according to claim 10, wherein
the management computer further manages a second managed computer, and
the method further comprises:
acquiring a second operating performance, which is a value related to resource performance of the second managed computer, and a second usage characteristic, which is a value related to access to the second managed computer from the first application program;
calculating a degree of abnormality of the second operating performance and a degree of abnormality of the second usage characteristic;
calculating a total usage characteristic that is the sum of the first usage characteristic and the second usage characteristic in the previous period;
calculating a degree of abnormality of the total usage characteristic; and
notifying the operating status of the first managed computer and the second managed computer based on the calculated degree of abnormality of the first operating performance, the calculated degree of abnormality of the first usage characteristic, the calculated degree of abnormality of the second operating performance, the calculated degree of abnormality of the second usage characteristic, and the calculated degree of abnormality of the total usage characteristic.
The management method of a managed computer according to claim 10, wherein
the management computer is further accessed from a second application program, and
the method further comprises:
acquiring a third usage characteristic, which is a value related to access to the second managed computer from the second application program;
calculating a degree of abnormality of the third usage characteristic; and
notifying the operating status of the first managed computer based on the calculated degree of abnormality of the first operating performance, the calculated degree of abnormality of the first usage characteristic, and the calculated degree of abnormality of the third usage characteristic.