CN108337100B - Cloud platform monitoring method and device

Info

Publication number: CN108337100B
Application number: CN201710043469.7A
Authority: CN
Inventors: 龚国成; 舒忠玲; 刘强; 余永华; 张伟
Original assignee: China Mobile Communications Group Co Ltd; China Mobile IoT Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile IoT Co Ltd
Priority date: 2017-01-19
Filing date: 2017-01-19
Publication date: 2021-07-09
Anticipated expiration: 2037-01-19
Also published as: CN108337100A

Abstract

An embodiment of the invention discloses a cloud platform monitoring method comprising: establishing a corresponding hierarchical structure for at least one virtual device to be monitored on a cloud platform; collecting operation data of the virtual device; when the collected operation data is determined to be abnormal operation data, determining that a fault event has occurred on the cloud platform and marking the associated information of the abnormal operation data; and, after determining that a fault event has occurred on the cloud platform, tracing the source of the fault event according to the associated information. An embodiment of the invention also discloses a cloud platform monitoring device.

Description

Cloud platform monitoring method and device

Technical Field

The invention relates to the technical field of cloud computing, in particular to a method and a device for monitoring a cloud platform.

Background

A cloud platform is large-scale, virtualized, dynamic and real-time, so a cloud platform monitoring system must be able to monitor large-scale resources, monitor virtual and dynamic resources, display monitoring reports in real time, monitor the scalability of services, and so on. Existing cloud platform monitoring systems have the following defects: when a fault event is detected on the cloud platform, it is difficult to determine which monitored device caused it, that is, tracing the source of the fault event is difficult; and the functions of the monitoring systems on many cloud platforms are relatively fixed, so dynamic expansion of system functions is difficult to achieve.

Disclosure of Invention

To solve the above technical problems, embodiments of the present invention aim to provide a cloud platform monitoring method and device that enable the source of a fault event to be traced.

The technical scheme of the invention is realized as follows:

An embodiment of the invention provides a cloud platform monitoring method, which comprises the following steps:

establishing a corresponding hierarchical structure for at least one virtual device to be monitored on a cloud platform; the hierarchical structure comprises, in order from top to bottom: identification information of the virtual device at a top layer and at least one function group of the virtual device at a second layer; each function group represents one class of functions of the virtual device during operation;

collecting operation data of the virtual device;

when the collected operation data is determined to be abnormal operation data, determining that a fault event has occurred on the cloud platform, and marking the associated information of the abnormal operation data; the associated information of the abnormal operation data comprises: the function group of the virtual device associated with the abnormal operation data in the second layer of the hierarchical structure, and the virtual device identification information associated with the abnormal operation data in the top layer of the hierarchical structure;

and after determining that a fault event has occurred on the cloud platform, tracing the source of the fault event according to the associated information.

In the above scheme, the hierarchical structure further includes: at a third layer, the IP address of at least one server used by the at least one function group of the virtual device; at a fourth layer, at least one monitoring item corresponding to the IP address of the at least one server; and at the bottom layer, the operation data of the virtual device corresponding to the at least one monitoring item.

In the above scheme, the associated information of the abnormal operation data further includes: the monitoring item associated with the abnormal operation data in the fourth layer of the hierarchical structure and the server IP address associated with the abnormal operation data in the third layer of the hierarchical structure;

marking the associated information of the abnormal operation data comprises: when the collected operation data is determined to be abnormal operation data, marking the fourth-layer associated information of the abnormal operation data according to the abnormal operation data in the bottom layer; marking the third-layer associated information according to the marked fourth-layer associated information; marking the second-layer associated information according to the marked third-layer associated information; and marking the top-layer associated information according to the marked second-layer associated information.

In the above scheme, tracing the source of the fault event according to the associated information after determining that a fault event has occurred on the cloud platform comprises: after determining that a fault event has occurred on the cloud platform, querying, according to the associated information marked in the hierarchical structure, the associated information of at least one layer other than the bottom layer, thereby tracing the source of the fault event.

In the above scheme, the method further comprises: after establishing the corresponding hierarchical structure for the at least one virtual device, adding or deleting at least one function group in the second layer of the hierarchical structure.

An embodiment of the invention also provides a cloud platform monitoring device, which comprises an establishing module, an acquisition module, a processing module and a positioning module, wherein:

the establishing module is configured to establish a corresponding hierarchical structure for at least one virtual device to be monitored on a cloud platform; the hierarchical structure comprises, in order from top to bottom: identification information of the virtual device at a top layer and at least one function group of the virtual device at a second layer; each function group represents one class of functions of the virtual device during operation;

the acquisition module is configured to collect operation data of the virtual device;

the processing module is configured to, when the collected operation data is determined to be abnormal operation data, determine that a fault event has occurred on the cloud platform and mark the associated information of the abnormal operation data; the associated information of the abnormal operation data comprises: the function group of the virtual device associated with the abnormal operation data in the second layer of the hierarchical structure, and the virtual device identification information associated with the abnormal operation data in the top layer of the hierarchical structure;

and the positioning module is configured to trace the source of the fault event according to the associated information when it is determined that a fault event has occurred on the cloud platform.

the processing module is specifically configured to mark, according to the abnormal operation data in the bottom layer, the associated information of the fourth layer in the associated information of the abnormal operation data when it is determined that the acquired operation data is the abnormal operation data; according to the marked associated information of the fourth layer, marking the associated information of the third layer in the associated information of the abnormal operation data; according to the marked associated information of the third layer, marking the associated information of the second layer in the associated information of the abnormal operation data; and marking the top-level associated information in the associated information of the abnormal operation data according to the marked associated information of the second level.

In the above scheme, the positioning module is specifically configured to, when it is determined that the cloud platform has a fault event, query, according to the associated information marked in the hierarchical structure, associated information of at least one other layer, except for a bottom layer, in the associated information of the abnormal operation data, so as to implement a tracing operation on the fault event.

In the foregoing solution, the establishing module is further configured to add or delete at least one function group in the second layer of the hierarchical structure after establishing the corresponding hierarchical structure for the at least one virtual device.

In the embodiment of the invention, a corresponding hierarchical structure is established for at least one virtual device to be monitored on a cloud platform; operation data of the virtual device is collected; when the collected operation data is determined to be abnormal operation data, it is determined that a fault event has occurred on the cloud platform, and the associated information of the abnormal operation data is marked; and after it is determined that a fault event has occurred on the cloud platform, the source of the fault event is traced according to the associated information. Because the marked associated information links the abnormal data to its position in the hierarchical structure, the fault event can be traced to its source.

Drawings

Fig. 1 is a flowchart of a first embodiment of a cloud platform monitoring method according to the present invention;

Fig. 2 is a schematic diagram of a first hierarchical structure of a virtual device to be monitored according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a first component structure of a cloud platform monitoring device according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a second hierarchical structure of a virtual device to be monitored according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a second component structure of the cloud platform monitoring device according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

First embodiment

Fig. 1 is a flowchart of a cloud platform monitoring method according to a first embodiment of the present invention. As shown in Fig. 1, the method includes:

Step 100: Establish a corresponding hierarchical structure for at least one virtual device to be monitored on the cloud platform.

In this step, the hierarchical structure comprises, in order from top to bottom: identification information of the virtual device at a top layer and at least one function group of the virtual device at a second layer; each function group represents one class of functions of the virtual device during operation.

Preferably, the hierarchical structure further comprises: at a third layer, the Internet Protocol (IP) address of at least one server used by the at least one function group of the virtual device; at a fourth layer, at least one monitoring item corresponding to the IP address of the at least one server; and at the bottom layer, the operation data of the virtual device corresponding to the at least one monitoring item.

In the embodiment of the invention, the key host of each system to be monitored is abstracted into a virtual device, and the attribute information of the virtual device serves as identification information that distinguishes different virtual devices. Here, the attribute information of the virtual device includes: the device short name, the device description, the name of the project or product to which the device belongs, the function group to which it belongs, the IP address, and the operating system and its version.

Preferably, the embodiment of the invention establishes the hierarchical relationship of the virtual device to be monitored in the order product, function group, server address, monitoring item, monitoring content.

Fig. 2 is a schematic diagram of a first hierarchical structure of a virtual device to be monitored in the embodiment of the present invention. As shown in Fig. 2, the hierarchical structure of the virtual device includes: product identification information at the top layer; N function groups of the virtual device at the second layer, where N is an integer greater than 0; all IP addresses used by each function group at the third layer; the monitoring items corresponding to each IP address at the fourth layer; and the operation data of the virtual device corresponding to each monitoring item at the bottom layer.
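The five-layer structure described above can be modeled as nested records. The following is a minimal Python sketch under stated assumptions, not the implementation of the embodiment: the class names, field names, the product name "product-A", the group name "web" and the address 10.0.0.1 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MonitoringItem:
    """Fourth layer: a monitoring item (e.g. "host") plus its bottom-layer data."""
    name: str
    # Bottom layer: metric name -> most recent sample (hypothetical layout).
    operation_data: Dict[str, float] = field(default_factory=dict)


@dataclass
class Server:
    """Third layer: a server IP address used by a function group."""
    ip: str
    items: Dict[str, MonitoringItem] = field(default_factory=dict)


@dataclass
class FunctionGroup:
    """Second layer: one class of functions of the virtual device."""
    name: str
    servers: Dict[str, Server] = field(default_factory=dict)


@dataclass
class VirtualDevice:
    """Top layer: product identification information."""
    product_id: str
    groups: Dict[str, FunctionGroup] = field(default_factory=dict)

    def paths(self) -> List[Tuple[str, str, str, str, str]]:
        """List every (product, group, ip, item, metric) path, top to bottom."""
        return [
            (self.product_id, g.name, s.ip, i.name, metric)
            for g in self.groups.values()
            for s in g.servers.values()
            for i in s.items.values()
            for metric in i.operation_data
        ]


# Build one path through all five layers.
device = VirtualDevice("product-A")
group = device.groups.setdefault("web", FunctionGroup("web"))
server = group.servers.setdefault("10.0.0.1", Server("10.0.0.1"))
host = server.items.setdefault("host", MonitoringItem("host"))
host.operation_data["cpu_utilization"] = 42.0
```

Keying each layer by name keeps lookups direct, and every bottom-layer sample has exactly one chain of ancestors, which is the property the later marking and tracing steps rely on.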

By way of example, monitoring items may include: hosts, applications, services, user behavior, middleware or databases, and the like. When the monitoring item is a host, the corresponding monitoring content may include: user-mode central processing unit (CPU) utilization, kernel-mode CPU utilization, interrupt CPU utilization, remaining hard disk space, hard disk utilization, average disk I/O frequency, average disk I/O throughput, physical memory utilization, swap memory utilization, network uplink rate, network downlink rate, and the like;

when the monitoring item is an application, the corresponding monitoring content may include: the operation data and access records of certain key applications, from which the availability and quality of the applications are judged, for example the number of calls to a key API and its response status;

when the monitoring item is a service, the corresponding monitoring content may include: the running state of major service software, for example the cumulative number of Nginx requests, Nginx requests per second, Nginx active connections, Nginx dropped connections, and the running states of Tomcat, MySQL, Apache, and the like;

when the monitoring item is user behavior, the corresponding monitoring content includes: access monitoring, Uniform Resource Locator (URL) monitoring, and content monitoring. Access monitoring obtains the user's access speed; URL monitoring covers response time and failure rate, so as to learn the real-time access state of the service; and content monitoring tracks changes to the elements of a web page;

when the monitoring item is middleware or a database, the corresponding monitoring content includes: I/O throughput, CPU utilization, disk occupancy, and the like.

In this step, management operations on the hierarchical structure may also be performed. Specifically: after the corresponding hierarchical structure is established for the at least one virtual device, at least one function group in the second layer of the hierarchical structure is added or deleted.
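Adding or deleting a second-layer function group can be sketched as two small operations on a mapping. This is only an illustrative Python sketch: the layout (function group name mapped to a list of server IP addresses), the function names and the sample values are assumptions, not part of the embodiment.

```python
from typing import Dict, List

# Hypothetical layout: function group name -> server IP addresses it uses.
Hierarchy = Dict[str, List[str]]


def add_function_group(groups: Hierarchy, name: str, ips: List[str]) -> None:
    """Add a second-layer function group; refuse to overwrite an existing one."""
    if name in groups:
        raise ValueError(f"function group {name!r} already exists")
    groups[name] = list(ips)


def delete_function_group(groups: Hierarchy, name: str) -> List[str]:
    """Remove a function group and return the server IPs that belonged to it."""
    return groups.pop(name)


groups: Hierarchy = {"web": ["10.0.0.1"]}
add_function_group(groups, "database", ["10.0.0.7", "10.0.0.8"])
removed = delete_function_group(groups, "web")
```

Returning the removed server addresses lets the caller decide what happens to the lower layers that hung off the deleted group.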

Step 101: Collect the operation data of the virtual device.

Optionally, the operation data of the items to be monitored of the virtual device is collected as monitoring data, the monitoring data is parsed through protocol adaptation, and the parsed data is cached in a message queue. Depending on service requirements, the parsed data can be stored in the database directly according to a preset storage rule, or processed first and then stored in the database according to the preset storage rule.
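The collect, parse and cache flow can be illustrated with a small adapter table in front of a queue. The sketch below is an assumption-laden illustration: the text does not name the protocols or the queue technology, so the "json" and "kv" adapters, the record fields and the in-process `queue.Queue` are hypothetical stand-ins.

```python
import json
import queue
from typing import Callable, Dict


def parse_json(raw: bytes) -> dict:
    """Adapter for JSON-encoded samples."""
    return json.loads(raw.decode("utf-8"))


def parse_kv(raw: bytes) -> dict:
    """Adapter for 'key=value;key=value' text; values are kept as strings."""
    pairs = (p.split("=", 1) for p in raw.decode("utf-8").split(";") if p)
    return {k.strip(): v.strip() for k, v in pairs}


# Protocol adaptation: protocol name -> parser (names are hypothetical).
ADAPTERS: Dict[str, Callable[[bytes], dict]] = {"json": parse_json, "kv": parse_kv}

# In-process stand-in for the message queue that caches parsed data.
message_queue = queue.Queue()


def ingest(protocol: str, raw: bytes) -> None:
    """Parse one raw sample with the matching adapter and cache the result."""
    record = ADAPTERS[protocol](raw)
    message_queue.put(record)


ingest("json", b'{"metric": "cpu_utilization", "value": 93.5}')
ingest("kv", b"metric=disk_usage;value=71")
```

Supporting a new collection protocol then only requires registering another parser in `ADAPTERS`; the downstream consumers of the queue are unaffected.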

Processing the parsed data before storage comprises: first, stream processing of the parsed data, completing calculation, alarming and similar processing according to the configured policies; and second, training a data processing model on historical data and then processing the parsed data with that model.

Here, the storage rule is set so that data is classified and stored in separate databases. By way of example, the storage rule may be: a key-value database stores metadata; a relational database stores data such as user information, processing result information, configuration information, historical alarms and historical statistics; and a non-relational (NoSQL) database persistently stores historical data.
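The example storage rule amounts to a lookup from data category to database type. A minimal Python sketch follows; the category labels and the `Store` enum are hypothetical names for the data kinds and database types listed in the text, and no real database client is involved.

```python
from enum import Enum


class Store(Enum):
    KEY_VALUE = "key-value"
    RELATIONAL = "relational"
    NOSQL = "nosql"


# Category labels are hypothetical names for the data kinds in the text.
STORAGE_RULES = {
    "metadata": Store.KEY_VALUE,
    "user_info": Store.RELATIONAL,
    "processing_result": Store.RELATIONAL,
    "configuration": Store.RELATIONAL,
    "historical_alarm": Store.RELATIONAL,
    "historical_statistics": Store.RELATIONAL,
    "historical_data": Store.NOSQL,
}


def select_store(category: str) -> Store:
    """Return the database type that the storage rule assigns to a category."""
    try:
        return STORAGE_RULES[category]
    except KeyError:
        raise ValueError(f"no storage rule for category {category!r}") from None
```

For instance, `select_store("metadata")` yields `Store.KEY_VALUE`, while an unknown category raises an error instead of silently falling back to a default database.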

Step 102: When the collected operation data is determined to be abnormal operation data, determine that a fault event has occurred on the cloud platform, and mark the associated information of the abnormal operation data. The associated information of the abnormal operation data comprises: the function group of the virtual device associated with the abnormal operation data in the second layer of the hierarchical structure, and the virtual device identification information associated with the abnormal operation data in the top layer of the hierarchical structure.

In actual implementation, the cloud platform monitors the operation data in real time according to the configured policies, and when the operation data satisfies a preset alarm policy, the collected operation data is determined to be abnormal operation data.

Preferably, the association information of the abnormal operation data may further include: monitoring items associated with abnormal operation data in the fourth layer of the hierarchical structure and server IP addresses associated with abnormal operation data in the third layer of the hierarchical structure;

correspondingly, when the acquired operation data are determined to be abnormal operation data, marking the associated information of the fourth layer in the associated information of the abnormal operation data (namely monitoring items associated with the abnormal operation data in the fourth layer of the hierarchical structure) according to the abnormal operation data in the bottom layer; according to the marked associated information of the fourth layer, marking the associated information of the third layer in the associated information of the abnormal operation data (namely the IP address of the server associated with the abnormal operation data in the third layer of the hierarchical structure); according to the marked associated information of the third layer, marking the associated information of the second layer in the associated information of the abnormal operation data (namely the function group associated with the abnormal operation data in the second layer of the hierarchical structure); and marking the associated information of the top layer in the associated information of the abnormal operation data (namely the virtual equipment identification information associated with the abnormal operation data in the top layer of the hierarchical structure) according to the marked associated information of the second layer.

When multiple items of abnormal operation data appear on the cloud platform at the same time, the associated information is marked by searching the hierarchical structure layer by layer from bottom to top, determining the monitoring item, IP address, function group and product identification information corresponding to each item of abnormal operation data. Optionally, to distinguish the associated information belonging to different abnormal operation data, identification information of the respective abnormal operation data may be added when the associated information is marked. For example, abnormal operation data 1 to abnormal operation data X correspond to identification information 1 to identification information X respectively, so that checking the identification information reveals which abnormal operation data a piece of associated information in the hierarchical structure belongs to.
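The bottom-up marking with per-anomaly identification information can be sketched as a walk from a bottom-layer entry to the top layer. The Python below is a minimal illustration under assumptions: each node is represented by its path from the top layer, marks are stored as sets of anomaly IDs, and all names and addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A node is identified by its path from the top layer, for example
# (product, function group, server IP, monitoring item, metric).
Path = Tuple[str, ...]


def mark_bottom_up(marks: Dict[Path, Set[int]], leaf: Path, anomaly_id: int) -> List[Path]:
    """Mark the bottom-layer node, then each ancestor in turn up to the top layer."""
    visited = []
    node = leaf
    while node:
        marks[node].add(anomaly_id)  # tag the node with this anomaly's ID
        visited.append(node)
        node = node[:-1]  # parent is the same path minus its last component
    return visited


marks: Dict[Path, Set[int]] = defaultdict(set)
visited = mark_bottom_up(
    marks, ("product-A", "web", "10.0.0.1", "host", "cpu_utilization"), 1
)
mark_bottom_up(
    marks, ("product-A", "web", "10.0.0.2", "service", "nginx_active_connections"), 2
)
```

After both calls, the shared ancestors ("product-A" and its "web" group) carry both IDs, while each server node carries only the ID of the anomaly beneath it, which is how the identification information keeps simultaneous anomalies apart.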

Step 103: After determining that a fault event has occurred on the cloud platform, trace the source of the fault event according to the associated information.

In this step, after it is determined that a fault event has occurred on the cloud platform, the associated information of at least one layer other than the bottom layer is queried, according to the associated information marked in the hierarchical structure, so as to trace the source of the fault event.

For example, suppose the CPU utilization of the host of virtual device A is monitored. When the monitored CPU utilization exceeds a threshold, the associated information of the host CPU utilization is marked. Here, the associated information includes: the bottom-layer monitoring content in the hierarchical structure of virtual device A, namely the CPU utilization; the fourth-layer monitoring item, namely host monitoring; the third-layer server address, namely IP address 1; function group 1 in the second layer; and the product identification in the top layer. When the current CPU utilization is found to be abnormal, marking the hierarchical structure determines all the associated information of the abnormal operation data, and the abnormal operation data is traced by looking up the associated information marked in the hierarchical structure of virtual device A.

It can be understood that multiple virtual devices to be monitored may exist on the cloud platform. When a fault event occurs and the affected virtual device needs to be located quickly, the top-layer product identification information in the hierarchical structures of all virtual devices can be queried; the virtual device in which the fault event occurred is determined by checking whether the top-layer product identification information in each hierarchical structure is marked as associated information of the fault event, which improves the efficiency of locating fault events.
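The two-stage lookup described here, first checking only top-layer marks and then following the marks downward, can be sketched as follows. This is an illustrative Python sketch: the `marks` table, its path-tuple keys, the anomaly IDs and the product names are hypothetical sample data, not values from the embodiment.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]

# Hypothetical marked state: node path -> IDs of the anomalies marked on it.
marks: Dict[Path, Set[int]] = {
    ("product-A",): {1},
    ("product-A", "function-group-1"): {1},
    ("product-A", "function-group-1", "10.0.0.1"): {1},
    ("product-A", "function-group-1", "10.0.0.1", "host"): {1},
    ("product-A", "function-group-1", "10.0.0.1", "host", "cpu_utilization"): {1},
    ("product-B",): set(),
}


def faulty_devices(marks: Dict[Path, Set[int]]) -> List[str]:
    """Inspect only top-layer nodes and return the products that carry a mark."""
    return sorted(p[0] for p, ids in marks.items() if len(p) == 1 and ids)


def trace(marks: Dict[Path, Set[int]], product: str, anomaly_id: int) -> Path:
    """Return the deepest marked path for one anomaly under one product.

    Raises ValueError if nothing under the product is marked with this ID.
    """
    candidates = [
        p for p, ids in marks.items() if p[0] == product and anomaly_id in ids
    ]
    return max(candidates, key=len)


suspects = faulty_devices(marks)
source = trace(marks, suspects[0], 1)
```

Checking only the top layer first means the number of nodes examined grows with the number of virtual devices rather than with the total number of metrics, which is the efficiency gain the paragraph describes.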

The embodiment of the invention may also include dynamic management of cloud platform system functions, where management includes adding or deleting at least one system function. The system functions include: operation monitoring, device details, configuration management, historical alarms, predictive analysis, user management, data export, and the like. In actual implementation, dynamic expansion of system functions can be achieved through a RESTful API, which meets the monitoring and management requirements of the cloud platform and overcomes the defect that the monitoring functions of existing cloud platforms are fixed.

Operation monitoring function: based on a tree list built from the hierarchical structure of the virtual devices, the state of each level is viewed through a three-level directory of product, function group and server IP address. The running state of a virtual device is indicated by color, for example: green indicates normal, yellow indicates that an alarm event has occurred, and red indicates that an abnormal event has occurred; the current monitored value is displayed as well.
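The color coding can be expressed as a small enumeration. The sketch below additionally assumes that a directory level shows the most severe status found beneath it; that roll-up rule is not stated in the text and is introduced here only for illustration, as are all the names.

```python
from enum import Enum
from typing import Iterable


class Status(Enum):
    NORMAL = "green"    # running normally
    ALARM = "yellow"    # an alarm event has occurred
    ABNORMAL = "red"    # an abnormal event has occurred


# Ordered from least to most severe.
SEVERITY = [Status.NORMAL, Status.ALARM, Status.ABNORMAL]


def roll_up(children: Iterable[Status]) -> Status:
    """Assumed rule: a tree level displays the worst status among its children."""
    return max(children, key=SEVERITY.index, default=Status.NORMAL)


server_statuses = [Status.NORMAL, Status.ALARM, Status.NORMAL]
group_status = roll_up(server_statuses)
```

With this rule a function group containing one alarming server is shown in yellow, so a problem stays visible at the product and function-group levels of the directory.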

Device detail viewing function: a device can be located through the tree directory and its detailed running condition viewed, for example: which applications, middleware and databases the device hosts, how many processes are running, and the resources those processes use, including I/O throughput, CPU utilization, memory utilization, disk utilization and other data. Graphical display of the data is supported.

Configuration management function: identifies all resources of the cloud platform, including hardware resources, server groups, process resources, port resources, IP resources, services, software and other information.

The configured policies include an alarm policy and a failure recovery policy. Here, the alarm policy includes: the alarm trigger condition, alarm object, alarm recipient, alarm delivery method, and so on; an alarm policy may be associated with a product and a policy type. For example, the alarm trigger condition may be: when the monitored value of a device in a product exceeds the alarm threshold, alarms of different levels are generated. The alarm trigger condition may also be a simple conditional expression, such as A ≥ C, A ≤ C, A > C or A < C, where A is the monitored value and C is the alarm threshold. The alarm threshold can be user-defined, and a threshold applies mainly to a single item of monitoring data.
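The simple conditional expressions over a monitored value A and a threshold C can be evaluated with a small policy record. The following Python is a sketch under assumptions: the `AlarmPolicy` fields, the level names, the metric names and the thresholds are hypothetical, and only the four comparison forms named in the text are supported.

```python
import operator
from dataclasses import dataclass
from typing import List

# The four comparison forms named in the text, applied as A <op> C.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


@dataclass(frozen=True)
class AlarmPolicy:
    metric: str       # which monitoring data the policy applies to
    op: str           # one of the keys of OPERATORS
    threshold: float  # C, the user-defined alarm threshold
    level: str        # hypothetical alarm level label

    def triggered(self, value: float) -> bool:
        """Return True when the monitored value A satisfies A <op> C."""
        return OPERATORS[self.op](value, self.threshold)


def evaluate(policies: List[AlarmPolicy], metric: str, value: float) -> List[str]:
    """Return the levels of every policy that the monitored value triggers."""
    return [p.level for p in policies if p.metric == metric and p.triggered(value)]


policies = [
    AlarmPolicy("cpu_utilization", ">=", 80.0, "warning"),
    AlarmPolicy("cpu_utilization", ">=", 95.0, "critical"),
    AlarmPolicy("disk_free_gb", "<", 5.0, "critical"),
]
```

Defining several policies on the same metric with different thresholds is one way to obtain the alarms of different levels mentioned above: a CPU utilization of 90 triggers only the "warning" policy, while 97 triggers both.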

The failure recovery policy: for common failures, the system can execute a failure recovery policy. For example, when a disk is full, the system automatically deletes junk files and expands capacity; when CPU or memory utilization is too high, "invalid" processes are killed.

Historical alarm query function: historical alarm information can be queried by conditions such as alarm time, alarm policy type and alarm level, and the handling status of each alarm can be viewed.

Predictive analysis function: provides data processing model management, prediction, and prediction result query. Model management mainly manages the data processing models that the data analysis engine trains on historical data. The prediction function mainly uses a model to perform predictive analysis on monitoring data for hosts, services, applications, users and the like; it can monitor online data in real time and also process historical data in offline batches, which improves the efficiency of predictive analysis. Prediction result query displays the relevant predictive analysis results for hosts, services, applications and users according to the query conditions.

User management function: two levels of user are supported, super administrator and ordinary user. The super administrator can assign, view and modify the information of ordinary users; ordinary users can only view data and modify basic information.

Data export function: by configuring data export rules, monitoring data, processing results, alarm information and the like are exported for convenient viewing. The export rules may include: the export time, the amount of data to export, the format of the export file, the storage location of the exported data, and so on. For example, a background task automatically exports data at fixed intervals in formats such as Excel or PDF.

In the embodiment of the invention, monitoring data is processed online by stream processing, which improves data processing efficiency; for data query and analysis, online data is processed in real time or historical data is processed in offline batches, and a data preprocessing mechanism allows processing results to be read directly from the database for display, which optimizes queries; classified storage of data in separate databases is supported, which improves data reading efficiency; a hierarchical structure of monitoring items is established, which supports flexible expansion and management of cloud platform monitoring items; and functions are presented and operated through a RESTful API, which facilitates dynamic expansion of system functions.

In the embodiment of the invention, a corresponding hierarchical structure is established for at least one virtual device to be monitored on a cloud platform; operation data of the virtual device is collected; when the collected operation data is determined to be abnormal operation data, it is determined that a fault event has occurred on the cloud platform, and the associated information of the abnormal operation data is marked; and after it is determined that a fault event has occurred on the cloud platform, the source of the fault event is traced according to the associated information. The fault event can therefore be traced to its source.

Second embodiment

Fig. 3 is a schematic diagram of a first component structure of a cloud platform monitoring device according to an embodiment of the present invention. As shown in Fig. 3, the device includes an establishing module 300, an acquisition module 301, a processing module 302 and a positioning module 303, wherein:

the establishing module 300 is configured to establish a corresponding hierarchical structure for at least one to-be-monitored virtual device on the cloud platform; the hierarchical structure sequentially comprises from top to bottom: identification information of the virtual device located at a top layer and at least one function group of the virtual device located at a second layer; each function group is used for representing a type of functions of the virtual equipment during operation;

the acquisition module 301 is configured to acquire operation data of the virtual device;

the processing module 302 is configured to determine that a fault event occurs in the cloud platform and mark associated information of the abnormal operation data when it is determined that the acquired operation data is the abnormal operation data; the associated information of the abnormal operation data comprises: a function group of a virtual device associated with abnormal operation data in a second layer of the hierarchy, and virtual device identification information associated with abnormal operation data in a top layer of the hierarchy;

and the positioning module 303 is configured to, when it is determined that the cloud platform has a fault event, implement a tracing operation on the fault event according to the association information.

Preferably, the hierarchical structure may further include: at a third layer, the IP address of at least one server used by the at least one function group of the virtual device; at a fourth layer, at least one monitoring item corresponding to the IP address of the at least one server; and at the bottom layer, the operation data of the virtual device corresponding to the at least one monitoring item.

Fig. 4 is a schematic diagram of a second hierarchical structure of a virtual device to be monitored according to an embodiment of the present invention. As shown in Fig. 4, the hierarchical structure of the virtual device is established in the order product, function group, IP address, monitoring item (including applications, middleware, databases, etc.), monitoring content (I/O port, CPU, hard disk, etc.). The servers of a product can be divided into several function groups by function; one function group can be implemented by several servers; middleware, databases and the like can be deployed on the servers; and various indicators of hosts, services and applications are monitored through processes. This hierarchical structure makes flexible expansion and management of monitoring items convenient and helps in tracing fault events.

Preferably, the associated information of the abnormal operation data further includes: monitoring items associated with abnormal operation data in a fourth layer of the hierarchy and server IP addresses associated with abnormal operation data in a third layer of the hierarchy;

the processing module 302 is specifically configured to mark, according to the abnormal operation data in the bottom layer, the associated information of the fourth layer in the associated information of the abnormal operation data when it is determined that the acquired operation data is the abnormal operation data; according to the marked associated information of the fourth layer, marking the associated information of the third layer in the associated information of the abnormal operation data; according to the marked associated information of the third layer, marking the associated information of the second layer in the associated information of the abnormal operation data; and marking the top-level associated information in the associated information of the abnormal operation data according to the marked associated information of the second level.
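The layer-by-layer marking described above may be sketched as follows, under the assumption that each entry of the hierarchical structure keeps a reference to its parent entry one layer up. The class `Node`, the function `mark_upward` and all sample values are hypothetical and serve only as an illustration.

```python
class Node:
    """One entry in the hierarchy; `parent` points one layer up (None at the top)."""

    def __init__(self, layer, name, parent=None):
        self.layer = layer
        self.name = name
        self.parent = parent
        self.marked = False


def mark_upward(data_node):
    """Mark the fourth, third, second and top layers, in that order.

    Each step uses the entry marked in the previous step to find the next one.
    """
    marked = []
    node = data_node.parent
    while node is not None:
        node.marked = True
        marked.append((node.layer, node.name))
        node = node.parent
    return marked


# Illustrative chain from the top layer down to one abnormal data entry.
product = Node("product", "product-A")
group = Node("function_group", "web-group", product)
server = Node("server_ip", "10.0.0.1", group)
item = Node("monitoring_item", "database", server)
data = Node("operation_data", "cpu_percent", item)

associated = mark_upward(data)
```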

The positioning module 303 is specifically configured to, when it is determined that the cloud platform has a fault event, query, according to the associated information marked in the hierarchical structure, associated information of at least one other layer, except for the bottom layer, in the associated information of the abnormal operation data, so as to implement a tracing operation on the fault event.
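As a minimal sketch of such a query, assuming the marked associated information of one fault event is available as a flat record keyed by layer name, the tracing operation may look as follows. The layer names, the function `trace_fault` and the sample record are hypothetical.

```python
# Layers from the top down; the bottom layer (operation data) is excluded.
TRACE_LAYERS = ("product", "function_group", "server_ip", "monitoring_item")


def trace_fault(marked_info, layers=TRACE_LAYERS):
    """Return the marked entries of the requested layers as a readable path."""
    parts = [f"{layer}={marked_info[layer]}" for layer in layers if layer in marked_info]
    return " / ".join(parts)


# Hypothetical marked associated information for one fault event.
marked_info = {
    "product": "product-A",
    "function_group": "web-group",
    "server_ip": "10.0.0.1",
    "monitoring_item": "database",
    "operation_data": {"cpu_percent": 99.0},
}

full_path = trace_fault(marked_info)
server_only = trace_fault(marked_info, layers=("server_ip",))
```

Querying all four layers locates the fault from product down to monitoring item, while querying a single layer answers a narrower question such as which server is affected.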

The establishing module 300 is further configured to add or delete at least one function group in the second layer of the hierarchical structure after establishing the corresponding hierarchical structure for the at least one virtual device.
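Adding or deleting a function group in the second layer may be sketched as below, assuming the hierarchical structure is held as a nested mapping of product to function group to lower layers. The function names and sample values are hypothetical.

```python
def add_function_group(hierarchy, product, group):
    """Add an empty function group under a product; return False if it already exists."""
    groups = hierarchy.setdefault(product, {})
    if group in groups:
        return False
    groups[group] = {}
    return True


def delete_function_group(hierarchy, product, group):
    """Delete a function group and everything below it; return False if absent."""
    groups = hierarchy.get(product, {})
    if group not in groups:
        return False
    del groups[group]
    return True


hierarchy = {"product-A": {"web-group": {"10.0.0.1": {}}}}
added = add_function_group(hierarchy, "product-A", "db-group")
deleted = delete_function_group(hierarchy, "product-A", "web-group")
```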

In practical applications, the establishing module 300, the collecting module 301, the processing module 302 and the positioning module 303 can be implemented by a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), etc. located in the terminal device.

Third embodiment

Fig. 5 is a schematic diagram of a second component structure of a device for cloud platform monitoring according to an embodiment of the present invention. As shown in Fig. 5, the device includes: an acquisition module, a data analysis module, a data storage module, a management module and a Web end.

The acquisition module includes: a data acquisition unit, a protocol adaptation unit and a message buffer unit. Wherein:

the data acquisition unit is used for collecting the running logs of the virtual device, generating data points from them, and uploading the data points periodically. It should be noted that, when a data point is uploaded, the data packet format may follow REST API conventions, and the data point value is constructed as a single-layer JavaScript Object Notation (JSON) object, which facilitates cross-platform and cross-language data use and interaction.

The protocol adaptation unit is used for realizing protocol adaptation and analysis of the monitoring data;

and the message buffer unit is used for buffering the parsed data into a message queue so that the data analysis module can read and process the data.
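The cooperation of the three units may be sketched as follows. This is a minimal illustration that assumes an in-process queue in place of a real message queue; the field names of the data point, the queue capacity and all function names are hypothetical and are not specified by the embodiment.

```python
import json
import queue
import time

message_queue = queue.Queue(maxsize=1000)  # hypothetical capacity


def build_data_point(product, group, ip, item, metric, value, timestamp=None):
    """Data acquisition unit: build one single-layer JSON data point as bytes."""
    point = {
        "product": product,
        "function_group": group,
        "server_ip": ip,
        "monitoring_item": item,
        "metric": metric,
        "value": value,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    return json.dumps(point).encode("utf-8")


def parse_and_buffer(payload):
    """Protocol adaptation and message buffer units: parse, validate, enqueue."""
    point = json.loads(payload.decode("utf-8"))
    # Reject anything that is not a flat (single-layer) JSON object.
    if not isinstance(point, dict) or any(
        isinstance(v, (dict, list)) for v in point.values()
    ):
        raise ValueError("data point must be a single-layer JSON object")
    message_queue.put_nowait(point)
    return point


def read_batch(max_items):
    """Data analysis side: drain up to max_items parsed points from the queue."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(message_queue.get_nowait())
        except queue.Empty:
            break
    return batch


payload = build_data_point(
    "product-A", "web-group", "10.0.0.1", "database", "cpu_percent", 97.5, timestamp=1
)
parse_and_buffer(payload)
batch = read_batch(10)
```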

The data analysis module is used for reading the parsed monitoring data from the message queue and performing online processing or offline batch processing of the monitoring data; judging whether the processed data is abnormal operation data; when the collected operation data is determined to be abnormal operation data, determining that a fault event occurs on the cloud platform and marking all associated information of the abnormal operation data in the hierarchical structure; and when the collected operation data is determined to be normal operation data, continuing monitoring.

Specifically, the main functions of the data analysis module include: firstly, processing the real-time data stream, that is, reading the parsed data according to a configured policy and completing processing such as calculation and alarming; secondly, training models with historical data, where a model can be customized according to the service and is used for calculation and predictive analysis; and thirdly, using the model to monitor online data in real time or to process historical data in offline batches, thereby improving the efficiency of predictive analysis.
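The first function, real-time processing according to a configured policy, may be sketched as below. The embodiment does not specify how abnormal operation data is judged; the fixed upper limits used here are an assumption made purely for illustration, as are the metric names and function names.

```python
# Hypothetical configuration policy: an upper limit per metric.
POLICY = {"cpu_percent": 90.0, "disk_percent": 85.0}


def is_abnormal(point, policy=POLICY):
    """Treat a data point as abnormal when its value exceeds the configured limit."""
    limit = policy.get(point["metric"])
    return limit is not None and point["value"] > limit


def process_stream(points, policy=POLICY):
    """Return one alarm record for each abnormal data point in the stream."""
    alarms = []
    for point in points:
        if is_abnormal(point, policy):
            alarms.append(
                {
                    "server_ip": point["server_ip"],
                    "metric": point["metric"],
                    "value": point["value"],
                    "limit": policy[point["metric"]],
                }
            )
    return alarms


stream = [
    {"server_ip": "10.0.0.1", "metric": "cpu_percent", "value": 97.5},
    {"server_ip": "10.0.0.1", "metric": "disk_percent", "value": 40.0},
    {"server_ip": "10.0.0.2", "metric": "io_wait", "value": 99.0},  # no limit configured
]
alarms = process_stream(stream)
```

A trained model, as mentioned in the second and third functions, could replace `is_abnormal` without changing the surrounding flow.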

The data storage module is used for classifying the data and storing it in separate databases. Here, the storage rule may be: a Key-Value database is used for storing metadata; a relational database is used for storing data such as user information, processing result information, configuration information, historical alarms and historical statistics; a non-relational (NoSQL) database is used for persistently storing historical data.
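The storage rule above may be expressed as a simple lookup from data category to database type. The category keys and the function `select_store` are hypothetical names chosen for this sketch; no particular database product is implied.

```python
# Data category -> database type, following the storage rule described above.
STORAGE_RULES = {
    "metadata": "key_value",
    "user_info": "relational",
    "processing_result": "relational",
    "configuration": "relational",
    "historical_alarm": "relational",
    "historical_statistics": "relational",
    "historical_data": "nosql",
}


def select_store(category):
    """Return the database type for a data category, or raise if no rule exists."""
    try:
        return STORAGE_RULES[category]
    except KeyError:
        raise ValueError(f"no storage rule for category {category!r}") from None
```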

And the Web end is used for realizing the management of system functions and the access operation of monitoring events.

Specifically, the Web end can dynamically expand the system functions through a RESTful API, which meets the monitoring and management requirements of the cloud platform and overcomes the fixed monitoring functions of existing cloud service platforms.

The Web end can also provide flexible access to monitoring data, processing results, alarm information and the like through a RESTful API. When a fault event occurs on a monitored virtual device, all associated information of the abnormal operation data is presented through the Web end, which realizes the tracing operation of the fault event and improves the efficiency of handling it.

When monitoring data is queried in the embodiment of the invention, the following three cases may apply:

firstly, aiming at monitoring data/processing results with high query request frequency, a data storage module firstly stores the monitoring data/processing results, and a Web end directly reads and displays the monitoring data/processing results in the data storage module;

secondly, for the query request needing to be processed in real time, the data analysis module carries out online processing on the monitoring data according to a preset configuration strategy and sends the processed monitoring data to a Web end, and the Web end displays the processed monitoring data; meanwhile, the processed monitoring data is stored in a data storage module;

thirdly, for a query request covering all the monitoring data, the data analysis module carries out batch processing on the offline monitoring data, the processed monitoring data are stored in the data storage module, and the Web end reads and displays the monitoring data in the data storage module.
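The three cases above may be sketched as a single dispatch function. This is an illustration under stated assumptions: the data storage module is modelled as a plain dictionary, the online and batch processing of the data analysis module are passed in as callables, and the case labels `"frequent"`, `"realtime"` and `"all"` are hypothetical.

```python
def handle_query(kind, storage, process_online, process_batch, key=None):
    """Serve a query according to which of the three cases applies."""
    if kind == "frequent":
        # Case 1: the result was stored in advance; read it directly.
        return storage[key]
    if kind == "realtime":
        # Case 2: process online, return the result and also store it.
        result = process_online(key)
        storage[key] = result
        return result
    if kind == "all":
        # Case 3: batch-process offline data, store it, then read from storage.
        storage.update(process_batch())
        return dict(storage)
    raise ValueError(f"unknown query kind {kind!r}")


storage = {"cpu": 1}
first = handle_query("frequent", storage, lambda k: 0, lambda: {}, key="cpu")
second = handle_query("realtime", storage, lambda k: 42, lambda: {}, key="mem")
third = handle_query("all", storage, lambda k: 0, lambda: {"disk": 7})
```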

The management module comprises: a heartbeat management unit, a configuration management unit, a user management unit and an upgrading unit. Wherein:

and the heartbeat management unit is used for monitoring the state of the host.

The configuration management unit is used for identifying all resources of the cloud platform. The configuration content comprises hardware resource information, server grouping information, monitoring policy information, alarm policies, fault recovery policies and the like, and the unit supports adding, deleting, modifying and querying resources and policies.

And the user management unit is used for managing the user authority and the basic information.

And the upgrading unit is used for providing upgrading service for the cloud platform.
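As an illustration of how the heartbeat management unit might monitor the state of a host, the sketch below records the time of the last heartbeat per host and reports a host as down once no heartbeat has arrived within a timeout. The timeout value, the state names and the class `HeartbeatMonitor` are assumptions; the embodiment does not specify the mechanism.

```python
class HeartbeatMonitor:
    """Track the last heartbeat time per host and derive a host state from it."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds  # hypothetical default
        self.last_seen = {}

    def beat(self, host, now):
        """Record a heartbeat received from `host` at time `now` (in seconds)."""
        self.last_seen[host] = now

    def state(self, host, now):
        """Return 'alive', 'down', or 'unknown' if the host never reported."""
        seen = self.last_seen.get(host)
        if seen is None:
            return "unknown"
        return "alive" if now - seen <= self.timeout_seconds else "down"


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.beat("10.0.0.1", now=100.0)
```

Time is passed in explicitly rather than read from the system clock so that the behaviour is easy to verify.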

In the embodiment of the invention, the corresponding hierarchical structure is established by utilizing the product identification, the function group, the IP address and the like of the virtual equipment, so that the source tracing operation of the fault event is facilitated.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method of cloud platform monitoring, the method comprising:

collecting operation data of the virtual equipment;

after the cloud platform is determined to have a fault event, according to the associated information, tracing operation of the fault event is achieved;

the hierarchy further comprises: the IP address of at least one server used by at least one function group of the virtual equipment positioned on the third layer, at least one monitoring item corresponding to the IP address of at least one server positioned on the fourth layer and the running data of the virtual equipment corresponding to at least one monitoring item positioned on the bottom layer;

the associated information of the abnormal operation data further comprises: monitoring items associated with abnormal operation data in a fourth layer of the hierarchy and server IP addresses associated with abnormal operation data in a third layer of the hierarchy;

2. The method according to claim 1, wherein after determining that the cloud platform has the failure event, implementing, according to the association information, a tracing operation on the failure event includes: after the cloud platform is determined to have the fault event, inquiring the associated information of at least one other layer except the bottom layer in the associated information of the abnormal operation data according to the associated information marked in the hierarchical structure, and realizing the source tracing operation of the fault event.

3. The method of claim 1, further comprising: after establishing a corresponding hierarchy for the at least one virtual device, at least one functional group in the second level of the hierarchy is added or deleted.

4. An apparatus for cloud platform monitoring, the apparatus comprising: an establishing module, an acquisition module, a processing module and a positioning module; wherein,

the processing module is further specifically configured to mark, according to the abnormal operation data in the bottom layer, associated information of a fourth layer in the associated information of the abnormal operation data when it is determined that the acquired operation data is the abnormal operation data; according to the marked associated information of the fourth layer, marking the associated information of the third layer in the associated information of the abnormal operation data; according to the marked associated information of the third layer, marking the associated information of the second layer in the associated information of the abnormal operation data; according to the marked associated information of the second layer, marking the associated information of the top layer in the associated information of the abnormal operation data;

5. The apparatus according to claim 4, wherein the location module is specifically configured to, when it is determined that the cloud platform has the failure event, query, according to the association information marked in the hierarchical structure, association information of at least one other layer, except for a bottom layer, in the association information of the abnormal operation data, so as to implement a tracing operation on the failure event.

6. The apparatus of claim 4, wherein the establishing module is further configured to add or delete at least one function group in the second layer of the hierarchy after establishing the corresponding hierarchy for the at least one virtual device.