WO2015187001A2

WO2015187001A2 - System and method for managing resources failure using fast cause and effect analysis in a cloud computing system

Info

Publication number: WO2015187001A2
Application number: PCT/MY2015/050042
Authority: WO
Inventors: Binti Hasan SALIZA; Bin Wijee NAZARUDIN; Bin Alli MOHAMAD ZAKARIA; Hong Hoe ONG; Tulasi Raju MORAMPUDI RAMA
Original assignee: Mimos Berhad
Priority date: 2014-06-04
Filing date: 2015-05-29
Publication date: 2015-12-10
Also published as: WO2015187001A3

Abstract

There is disclosed a cloud computing managing system, the system comprising at least one computing resource (10) managed by a Log Manager (11) adapted and configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output. The system further comprises at least one log collector (12) interconnected at the computing resource clusters (100) and configured to collect and gather information related to the failure, store the information in a database (40) and forward the information to the Log Manager (11). The system further provides an analytical business intelligence module (13) with a dashboard (14) for use as an interface and to display the output from the Log Manager (11). A method thereof is also provided.

Description

SYSTEM AND METHOD FOR MANAGING RESOURCES FAILURE USING FAST CAUSE AND EFFECT ANALYSIS IN A CLOUD COMPUTING SYSTEM

FIELD OF INVENTION

[0001] The present invention generally relates to cloud computing, and more particularly a method and system for managing cloud resources over a network.

BACKGROUND OF INVENTION

[0002] Cloud computing is an emerging technology and is beneficial for mid-size or large companies owing to its efficiencies in maintenance, deployment and upgrading when adding capacity or capabilities to deliver a certain service to users or customers. It is continuously evolving and various industries are adapting to this technology.

[0003] Cloud computing typically encompasses a plurality of computing resources configured to deliver cloud-based services over a network. With the installation of various resources, managing these resources becomes crucial in order to fulfil service delivery requirements. With conventional cloud computing systems, cloud providers are often unable to identify and troubleshoot computing resource failures, which consequently incurs penalty fees. The types of failures commonly experienced include hardware failure, software malfunction and dependencies, and abnormal activities between the resources, among others. Further, the conventional systems are typically based on a best-effort approach rather than a risk-aware approach when accepting a Service Level Agreement (SLA). As seen in FIGURE 1, in the conventional systems, the cloud computing resources are managed manually with user intervention, whereby the administrator manually performs identification of failures, step-by-step troubleshooting and cause-effect analysis. Such a manual process is highly time consuming. These drawbacks can be effectively addressed if all these tasks, especially fixing the failure, are performed autonomously upon detecting the cause of failure within the computing system.

[0004] Thus, there is clearly a considerable need for systems and methods that can conveniently address the above-discussed shortcomings of managing failures in a cloud computing system.

SUMMARY

[0005] In one aspect of the present invention and broadly defined, there is provided a cloud computing managing system comprising: at least one computing resource; a Log Manager configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output; at least one log collector configured to collect and gather information related to the failure, store the information in a database and forward the information to the Log Manager; and an analytical business intelligence module with a dashboard for use as an interface and to display the output from the Log Manager.

[0006] In one embodiment, the Log Manager further comprises a log parser, a correlation engine, a log analyser, and a data visualizer.

[0007] In a further embodiment, the log parser is configured to extract the log files from the log collector and remove all unnecessary information from the log and reduce said information into small pieces or chunks.

[0008] In yet a further embodiment, the correlation engine is configured to obtain a summarized version of log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.

[0009] In another embodiment, the log analyser is configured to analyse the structured information from the correlation engine, conduct semantic cause-effect analysis in accordance with a predefined taxonomy provided by a storage device, and rank the cause-effect based on the highest relevancy of the failures detected.

[0010] In a further embodiment, the data visualizer is configured to parse the findings into a specific language to be used by the dashboard.

[0011] In yet a further embodiment, the system further comprises a cloud controller and an accelerator library for accelerating the processes performed by the log parser and correlation engine.

[0012] In yet another embodiment, the log manager comprises at least one knowledge base database, a processed database and an Online Analytical Processing (OLAP) database.

[0013] In another aspect of the present invention, there is disclosed a method for managing resources in a cloud computing system comprising the steps of: monitoring a failure for a period of time; gathering data and log files from at least one resource; storing said gathered data and log files in a database; retrieving the log files and data from the database and reducing the data and log files into chunks of data; arranging and organizing the chunks of data into a multidimensional analysis data format and storing said multidimensional data format in a database; scanning at least one database to identify failure similarity data within the same failure domain; correlating the identified similarity data by tagging, compressing and storing it in a database; performing a semantic search and comparing the correlated data with predefined relationships contained within at least one database; performing a root-cause analysis; generating a conclusion upon applying a solution for the failure; identifying the frequency of failure occurrences; and ranking the occurrences of failures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The invention will be better understood by reference to the description below taken in conjunction with the accompanying drawings herein:

[0015] FIGURE 1 shows a conventional cloud computing system;

[0016] FIGURE 2 shows an overall view of the cloud computing managing system in accordance with an embodiment of the present invention;

[0017] FIGURE 3 depicts the functionalities of the Log Manager in accordance with an embodiment of the present invention;

[0018] FIGURE 4 illustrates the Log Manager in accordance with an embodiment of the present invention;

[0019] FIGURE 5 provides a schematic flowchart showing the processes performed by the log collector in accordance with an embodiment of the present invention;

[0020] FIGURE 6 shows the processes involved and performed by the log parser in accordance with an embodiment of the present invention;

[0021] FIGURE 7A shows the processes involved and performed by the correlation engine in accordance with an embodiment of the present invention;

[0022] FIGURE 7B shows the examples of errors, which may be captured from the cloud storage controller and the cloud node controller, respectively;

[0023] FIGURE 7C shows an example of the final output generated upon completion of the correlation process by the correlation engine;

[0024] FIGURE 8 shows the processes involved and performed by the log analyser in accordance with an embodiment of the present invention;

[0025] FIGURE 9 shows the analytical and intelligence business module and dashboard; and

[0026] FIGURE 10 illustrates an example of the implementation of the system and method in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0027] In line with the above summary, the following description of a number of specific and alternative embodiments is provided to understand the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practiced without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.

[0028] The present invention provides a method and system that is configured to detect failures, and then correlate, collect, and analyze cloud resources information including possible factors that cause the failure, whereby the outcome of rectification of such failure can be visualized prior to actual rectification. The latter feature thereby enables the user to anticipate the outcome prior to performing actual fixing. For the purpose of clear description and avoidance of doubt, the term "failure" in this specification is used to mean any form of technical failures or errors or glitches or any form of technical intervention that causes technical interruption of the cloud computing system.

[0029] With reference to FIGURE 2, the overall system in accordance with an embodiment of the present invention will now be described. Generally, the system for managing cloud computing contemplated in accordance with an embodiment of the present invention comprises at least one computing resource 100, a Log Manager 11, at least one log collector 12, an analytical business intelligence module 13 and a dashboard 14. FIGURE 3 depicts the functionalities of the Log Manager 11 in accordance with an embodiment of the present invention. The Log Manager 11 is configured to be in communication with the assigned clusters 100 (cloud computing resources), the cloud controller 15A, the accelerator library 16, the analytical and business intelligence module 13 and the dashboard 14. In one embodiment, at least one log collector 12 is disposed at each cluster of the computing resources 100, and at least one log collector 12 is disposed at each of the cloud node controller 15 and the cloud controller 15A. In one embodiment, the Log Manager 11 comprises a log parser 11a, a correlation engine 11b, a log analyser 11c and a data visualizer 11d.

[0030] In accordance with one embodiment of the present invention, the log collector 12 can be a distributed agent-based framework, and is suitably configured to collect the new log files on a local server. It is further adapted to be a lightweight and non-intrusive data collector. In the event of a failure, the log collector 12 is configured to push the log information to the Log Manager 11 for further analysis. The collector 12 is further equipped with a database 40 for storing all data and information associated with the log files.

[0031] Referring to FIGURE 4, the Log Manager 11 is generally a platform for parsing, analysing and visualizing information in the event of system failures, which may include, but are not limited to, network, application and hardware failures. The Log Manager 11 is further configured to rank the failure according to the relevancy of the error identified and to suggest possible remedies to the respective system administrator. The log parser 11a is configured to extract the log files from the log collector 12, remove all unnecessary information from the log and reduce said information into small pieces or chunks. Examples of log files include, but are not limited to, system log, network log, storage log, web server log, applications log, scripting log, java log and service log. It is further configured to arrange and organize the chunks of the log information into multidimensional analysis data in Online Analytical Processing (OLAP). The latter process of organizing and arranging the chunks of information is assisted by the accelerator library 16 with accelerator API 16a (as seen in FIGURE 4), which accelerates the process.

[0032] Still referring to FIGURE 4, the correlation engine 11b is configured to obtain summarized log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information. In this process, the accelerator library 16, which is equipped with an accelerator API 16a, is in communication with the correlation engine 11b and thereby aids in accelerating the process. The data visualizer 11d is configured to parse the findings into a specific language, for example XML format, in order to be used by the dashboard 14. The user may view the output of findings via the dashboard 14.
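By way of illustration only, the consolidation of related messages into structured information may be sketched as follows. The sketch is not taken from the specification: the field names `domain` and `timestamp`, the function `correlate` and the five-minute window are hypothetical, and "similarity within the failure domain" is approximated here by a shared domain label combined with closeness in time.

```python
from datetime import datetime, timedelta


def correlate(messages, window=timedelta(minutes=5)):
    """Group messages of the same failure domain that occur close together.

    Each message is a dict with hypothetical keys "timestamp" and "domain".
    A message joins the most recent bucket of its domain if it falls within
    `window` of that bucket's first message; otherwise it opens a new bucket.
    """
    buckets = []
    latest = {}  # domain -> index of the most recent bucket for that domain
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        idx = latest.get(msg["domain"])
        if idx is not None and msg["timestamp"] - buckets[idx]["start"] <= window:
            buckets[idx]["messages"].append(msg)
        else:
            buckets.append(
                {"domain": msg["domain"], "start": msg["timestamp"], "messages": [msg]}
            )
            latest[msg["domain"]] = len(buckets) - 1
    return buckets
```

Each returned bucket carries its domain, its start time and the member messages, which is one possible shape for the structured information handed on for analysis.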

[0033] The Log Manager 11 further comprises a log analyser 11c. The log analyser 11c in accordance with an embodiment of the present invention is configured to analyse the structured information from the correlation engine 11b and conduct semantic cause-effect analysis according to the pre-defined taxonomy. The pre-defined taxonomy can be retrieved from a database 51, which may be in the form of a Knowledge Base repository or database. Accordingly, the database is in communication with the log analyser 11c. The log analyser 11c is also configured to rank the cause-effect based on the highest relevancy of the failures detected. There is further provided a processed database 52 configured to be in communication with the Log Manager 11 so as to store all processed data by the Log Manager 11.

[0034] A system incorporating the operational method in accordance with an embodiment of the present invention will now be described based on the steps or process stages performed by each component of the Log Manager 11 in the event that there is a failure within the system. FIGURE 5 shows a process flow adapted by the log collector 12 in accordance with an embodiment of the present invention. In the event that a failure is identified, the log collector 12 is triggered at 500 and thus it monitors 501 the event failure for n period of time. Such feature is to ensure that the failure event is genuine and reproducible. Upon completion of the n period of time, the log collector 12 collects and gathers raw log files from various sources at 502 and 502A. As discussed earlier, examples of raw files may include system log, network log, storage log, web server log, applications log, scripting log, java log and service log. All gathered data and information associated to log files are stored or saved within the log database at 503.
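By way of illustration only, the monitoring and gathering stages of FIGURE 5 may be sketched as follows. The sketch rests on assumptions not stated in the specification: a failure is treated as genuine only if it is still observed at every check during the period, and the names `confirm_failure`, `collect_logs`, `is_failing` and the in-memory record list standing in for the log database are hypothetical.

```python
import time
from typing import Callable, Dict, List


def confirm_failure(
    is_failing: Callable[[], bool],
    period_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Watch a reported failure for period_s seconds (steps 500 and 501).

    Returns True only if the failure is observed at every check, which is
    this sketch's reading of a "genuine and reproducible" failure.
    """
    checks = max(1, int(period_s / interval_s))
    for _ in range(checks):
        if not is_failing():
            return False  # the failure went away: treat it as transient
        sleep(interval_s)
    return True


def collect_logs(sources: Dict[str, Callable[[], List[str]]]) -> List[dict]:
    """Gather raw lines from every source (steps 502 and 502A).

    The returned list stands in for the records saved at step 503.
    """
    records = []
    for name, read in sources.items():
        for line in read():
            records.append({"source": name, "raw": line})
    return records
```

Passing the `sleep` function as a parameter keeps the waiting behaviour replaceable, for example when exercising the logic without real delays.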

[0035] FIGURE 6 shows the process adapted by the log parser 11a in accordance with an embodiment of the present invention. The log parser 11a reads all log files being stored in the log database of the log collector 12 at 600. It should be noted that each log file may have a different log format. Accordingly, at 601, the log parser 11a then formats and chunks raw log files into a structured and summarized format. In this process, the log parser 11a engages the accelerator library 16 to swiftly scan the log files, identify the formats and then break the log messages into several chunks of data. Each chunk contains important information that represents the message, for example, but not limited to, date, time, server type, program, error/warning messages, nouns/objects, verbs/methods and attributes/options. Next, upon completion of summarization and chunk conversion, the log parser 11a arranges and organizes the chunks of data into multidimensional analysis data in Online Analytical Processing (OLAP) at 602. This process is also assisted by the accelerator library 16 to accelerate the arranging and organizing of the chunks of information. The new format is saved at 603 and then into an OLAP database at 603A.
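By way of illustration only, the chunking at 601 and the multidimensional arrangement at 602 may be sketched as follows. The log line layout, the regular expression `LINE_RE` and the functions `chunk_line` and `to_cube` are hypothetical and do not appear in the specification; a dictionary of counts keyed by (server, program, level) is used as a toy stand-in for an OLAP structure.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumed line layout: "<date> <time> <server> <program>: <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) (?P<program>[^:\s]+): "
    r"(?P<level>ERROR|WARNING|INFO) (?P<message>.*)$"
)


def chunk_line(line: str) -> Optional[dict]:
    """Break one raw log line into named chunks; None if the format is unknown."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def to_cube(chunks: Iterable[dict]) -> Dict[Tuple[str, str, str], int]:
    """Count messages per (server, program, level) cell."""
    cube: Dict[Tuple[str, str, str], int] = {}
    for chunk in chunks:
        key = (chunk["server"], chunk["program"], chunk["level"])
        cube[key] = cube.get(key, 0) + 1
    return cube
```

Lines that do not match the assumed layout are returned as `None` rather than guessed at, since each log file may have a different format.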

[0036] FIGURE 7A shows the process adapted by the correlation engine 11b in accordance with an embodiment of the present invention. Firstly, at 700, the correlation engine 11b scans the log messages stored within the log database 40 to find and identify problem similarities within the same failure domain. This process is further accelerated with assistance from the accelerator library 16. Upon completion of the scanning, the correlation engine 11b correlates the identified logs into one bucket by tagging the output, compressing the output and storing it in the log database 40, at 701. This process is also further accelerated with the assistance of the accelerator library 16. Tagging the output at 701 may include details associated with date, time, server, program name, errors/warnings, nouns/objects, verbs/methods and attributes/options. It should be noted that OLAP makes data access significantly quicker by using the multidimensional data model. FIGURE 7B shows examples of errors which may be captured from the cloud storage controller (SC) and the cloud node controller (NC) 15, respectively. FIGURE 7C shows an example of the final output flags generated upon completion of the correlation process by the correlation engine 11b. In one embodiment, the final output may contain information such as time and data on the identified error or failure.

[0037] FIGURE 8 shows the process adapted by the log analyser 11c in accordance with an embodiment of the present invention. Firstly, the log analyser 11c reads the log from the bucket and performs a semantic search using the Knowledge Base database 51 at 800. If the search result is a "success", the analyser 11c proceeds to compare the "correlated log" with the predefined relationships and other available information in the Knowledge Base database at 801. If the search fails, the log analyser 11c updates the Knowledge Base database 51 at 801B with the current log info at 801A. Upon completion of the comparison, the log analyser 11c then proceeds to perform a Root Cause Analysis (RCA) against the log in order to determine whether the identified log originates from the resource, the environment, human intervention or another source, at 802. The Root Cause Analysis (RCA) is performed to identify the root factors that are causing the failure. A conclusion is established at 803 when an identified solution that may exist in the Knowledge Base database 51 is applied. In the event that the solution is not found in the Knowledge Base database 51, the system then proceeds to establish a new solution by considering the flaw factor relationships of the problems and solutions, whereby the system identifies the frequency of repeated root-causes by performing a semantic analysis against the semantic unit in the database, at 804. In the next step at 805, a rank is assigned to each root-cause, which may be based on the severity in the log, the depth of problem correlation and also the simplicity of solution implementation. Further, once the conclusion is generated and the solution deployment is completed, the system proceeds to update a processed database at 805A.
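By way of illustration only, the lookup against the Knowledge Base, the counting of repeated root-causes at 804 and the ranking at 805 may be sketched as follows. Several simplifications are assumed: an exact dictionary lookup stands in for the semantic search, each correlated log is reduced to a hypothetical signature string, and the rank considers severity first and frequency second. The function `rank_root_causes` and the `severity` table are not part of the specification.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_root_causes(
    signatures: List[str],
    knowledge_base: Dict[str, str],
    severity: Dict[str, int],
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Map correlated-log signatures to root causes and rank the causes.

    Returns (ranked, unknown): ranked is a list of (root_cause, frequency)
    ordered by descending severity, then descending frequency; unknown holds
    signatures missing from the knowledge base, i.e. candidates for the
    Knowledge Base update described at 801A and 801B.
    """
    counts: Counter = Counter()
    unknown: List[str] = []
    for signature in signatures:
        cause = knowledge_base.get(signature)  # exact match, not semantic search
        if cause is None:
            unknown.append(signature)
        else:
            counts[cause] += 1
    ranked = sorted(
        counts.items(),
        key=lambda item: (-severity.get(item[0], 0), -item[1], item[0]),
    )
    return ranked, unknown
```

Ordering by severity before frequency is only one possible weighting; the description leaves the exact combination of severity, correlation depth and solution simplicity open.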

[0038] FIGURE 9 shows the analytical and intelligence business module 13 and dashboard 14. The analytical and intelligence business module 13 and dashboard 14 are configured to serve as interfaces for the user. The user may input data associated with the output report. The user chooses the required reports and the module 13 accepts the user inputs, if any, whereby the end result based on the input and instructions given by the user is visualized on a browser. A web container accepts the user's request at 901 and sends back the response of the backend system. Next, a report builder analyses the user requests at 902, retrieves the required information from the processed database 52 and sends the information back to the analytical and intelligence business module 13 and dashboard 14 for visualization.

[0039] In one example, the system in one embodiment of the present invention may be adapted as shown in FIGURE 10. The cloud computing system comprises at least one device connected to a public network 101, at least one device connected to a private network 102, a Log Manager 11 of the present invention coupled with an accelerator library 16, a cloud controller 17, a cloud node controller 15 and at least one storage means 103, which may be adapted to accommodate at least one database described in the preceding paragraphs.

[0040] As would be apparent to a person having ordinary skill in the art, the afore-described methods and components may be provided in many variations, modifications or alternatives to existing testing systems. The principles and concepts disclosed herein may also be implemented in various manners which may not have been specifically described herein but which are to be understood as encompassed within the scope and letter of the following claims.

Claims

1. A cloud computing managing system comprising: at least one computing resource (100); a Log Manager (11) being in communication with the computing resource (100) and configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output; at least one log collector (12) being in communication with the Log Manager (11) and the computing resource (100) and configured to collect and gather information related to the failure, store the information in a database (40) and forward the information to the Log Manager (11); and an analytical business intelligence module (13) being in communication with the Log Manager (11) and configured to review the output from the Log Manager (11), having a dashboard (14) adapted as an interface and to display the output from the Log Manager (11).

2. The cloud computing managing system as claimed in Claim 1, wherein the Log Manager (11) further comprises a log parser (11a), a correlation engine (11b), a log analyser (11c) and a data visualizer (11d).

3. The cloud computing managing system as claimed in Claim 2 wherein the log parser (11a) is configured to extract the log files from the log collector (12) and remove all unnecessary information from the log and reduce said information into small pieces or chunks.

4. The cloud computing managing system as claimed in Claim 2 wherein the correlation engine (11b) is configured to obtain a summarized version of log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.

5. The cloud computing managing system as claimed in Claim 2 wherein the log analyser (11c) is configured to analyse the structured information from the correlation engine (11b) and conduct semantic cause-effect analysis in accordance with a predefined taxonomy provided by a storage device (51), and rank the cause-effect based on the highest relevancy of the failures detected.

6. The cloud computing managing system as claimed in Claim 2 wherein the data visualizer (11d) is configured to parse the findings into a specific language to be used by the dashboard (14).

7. The cloud computing managing system as claimed in Claim 1 wherein the system further comprises a cloud controller (15) being in communication with the Log Manager (11) and an accelerator library (16) connected to the Log Manager (11).

8. The cloud computing managing system as claimed in Claim 7 wherein the accelerator library (16) is configured to accelerate processes performed by the log parser (11a) and correlation engine (11b).

9. The cloud computing system as claimed in Claim 1 wherein the Log Manager (11) comprises at least one knowledge base database (51) for storing pre-defined taxonomy, a processed database (52) for storing all processed data from the Log Manager (11) and an Online Analytical Processing (OLAP) database (50) for storing log information as multidimensional analysis data from the Log Manager (11).

10. A method for managing resources in a cloud computing system comprising:

detecting and monitoring a failure for a period of time 501;

gathering data and log files from at least one resource 502/502A;

storing said gathered data and log files in a database 503;

retrieving the log files and data from the database and reducing the data and log files into chunks of data 600, 601;

arranging and organizing the chunks of data to a multidimensional analysis data format and storing said multidimensional data format 602, 603 in database;

scanning at least one database to identify failure similarities data within the same failure domain 700;

correlating the identified similarities data by tagging, compressing and storing it in a database 701;

performing a semantic search and comparing the correlated data with predefined relationships contained within at least one database 800, 801;

performing a root-cause analysis 802;

generating a conclusion upon applying a solution for the failure 803;

identifying the frequency of failure occurrences 804;

ranking the root-cause occurrences of failures 805.