WO2015187001A2 - System and method for managing resources failure using fast cause and effect analysis in a cloud computing system - Google Patents


Info

Publication number
WO2015187001A2
Authority
WO
WIPO (PCT)
Prior art keywords
log
data
cloud computing
database
failure
Prior art date
Application number
PCT/MY2015/050042
Other languages
French (fr)
Other versions
WO2015187001A3 (en)
Inventor
Binti Hasan SALIZA
Bin Wijee NAZARUDIN
Bin Alli MOHAMAD ZAKARIA
Hong Hoe ONG
Tulasi Raju MORAMPUDI RAMA
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015187001A2
Publication of WO2015187001A3

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0751 - Error or fault detection not based on redundancy
    • G06F11/079 - Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0766 - Error or fault reporting or storing
    • G06F11/0769 - Readable error formats, e.g. cross-platform generic formats, human understandable formats

Definitions

  • the present invention generally relates to cloud computing, and more particularly to a method and system for managing cloud resources over a network.
  • Cloud computing is an emerging technology that benefits mid-size and large companies through efficiencies in maintenance, deployment and upgrading when capacity or capabilities must be added to deliver a certain service to users or customers. It is continuously evolving and various industries are adapting to this technology.
  • Cloud computing typically encompasses a plurality of computing resources configured to deliver cloud-based services over a network. With the installation of various resources, managing these resources becomes crucial in order to fulfil service delivery requirements. With conventional cloud computing systems, cloud providers are often unable to identify and troubleshoot computing resource failures, and consequently incur penalty fees. The types of failures commonly experienced include hardware failure, software malfunction and dependencies, and abnormal activities between the resources. Further, conventional systems are typically based on a best-effort approach rather than a risk-aware approach when accepting a Service Level Agreement (SLA). As seen in FIGURE 1, in conventional systems the cloud computing resources are managed manually with user intervention, whereby the administrator performs identification of failures, step-by-step troubleshooting and cause-effect analysis manually.
  • SLA: Service Level Agreement
  • a cloud computing managing system comprising: at least one computing resource; a Log Manager configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output; at least one log collector configured to collect and gather information related to the failure, store the information in a database and forward the information to the Log Manager; and an analytical business intelligence module with a dashboard for use as an interface and to display the output from the Log Manager.
  • the Log Manager further comprises a log parser, a correlation engine, a log analyser, and a data visualizer.
  • the log parser is configured to extract the log files from the log collector and remove all unnecessary information from the log and reduce said information into small pieces or chunks.
  • the correlation engine is configured to obtain a summarized version of log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.
  • the log analyser is configured to analyse the structured information from the correlation engine, conduct semantic cause-effect analysis in accordance with a predefined taxonomy provided by a storage device, and rank the cause-effect based on the highest relevancy of the failures detected.
  • the data visualizer is configured to parse the findings into a specific language to be used by the dashboard.
  • the system further comprises a cloud controller and an accelerator library for accelerating the processes performed by the log parser and correlation engine.
  • the log manager comprises at least one knowledge base database, a processed database and an Online Analytical Processing (OLAP) database.
  • OLAP: Online Analytical Processing
  • a method for managing resources in a cloud computing system comprising the steps of: monitoring a failure for a period of time; gathering data and log files from at least one resource; storing said gathered data and log files in a database; retrieving the log files and data from the database and reducing the data and log files into chunks of data; arranging and organizing the chunks of data into a multidimensional analysis data format and storing said multidimensional data format in a database; scanning at least one database to identify failure similarity data within the same failure domain; correlating the identified similarity data by tagging, compressing and storing it in a database; performing a semantic search and comparing the correlated data with predefined relationships contained within at least one database; performing a root-cause analysis; generating a conclusion upon applying a solution for the failure; and identifying the frequency of failure occurrences and ranking the occurrences of failures.
  • FIGURE 1 shows a conventional cloud computing system
  • FIGURE 2 shows an overall view of the cloud computing managing system in accordance with an embodiment of the present invention
  • FIGURE 3 depicts the functionalities of the Log Manager in accordance with an embodiment of the present invention
  • FIGURE 4 illustrates the Log Manager in accordance with an embodiment of the present invention
  • FIGURE 5 provides a schematic flowchart showing the processes performed by the log collector in accordance with an embodiment of the present invention
  • FIGURE 6 shows the processes involved and performed by the log parser in accordance with an embodiment of the present invention
  • FIGURE 7A shows the processes involved and performed by the correlation engine in accordance with an embodiment of the present invention
  • FIGURE 7B shows the examples of errors, which may be captured from the cloud storage controller and the cloud node controller, respectively;
  • FIGURE 7C shows an example of the final output generated upon completion of the correlation process by the correlation engine
  • FIGURE 8 shows the processes involved and performed by the log analyser in accordance with an embodiment of the present invention
  • FIGURE 9 shows the analytical and intelligence business module and dashboard
  • FIGURE 10 illustrates an example of the implementation of the system and method in accordance with an embodiment of the present invention.
  • the present invention provides a method and system that is configured to detect failures, and then correlate, collect, and analyze cloud resources information including possible factors that cause the failure, whereby the outcome of rectification of such failure can be visualized prior to actual rectification.
  • the latter feature thereby enables the user to anticipate the outcome prior to performing actual fixing.
  • the term "failure" in this specification is used to mean any form of technical failures or errors or glitches or any form of technical intervention that causes technical interruption of the cloud computing system.
  • the system for managing cloud computing contemplated in accordance with an embodiment of the present invention comprises at least one computing resource 100, a Log Manager 11, at least one log collector 12, an analytical business intelligence module 13 and a dashboard 14.
  • FIGURE 3 depicts the functionalities of the Log Manager 11 in accordance with an embodiment of the present invention.
  • the Log Manager 11 is configured to be in communication with the assigned clusters 100 (cloud computing resources), the cloud controller 15A, accelerator library 16, analytical and business intelligence module 13 and the dashboard 14.
  • at least one log collector 12 is disposed at each cluster of the computing resources 100, and at least one log collector 12 is disposed at the cloud node controller 15 and at the cloud controller 15A respectively.
  • the Log Manager 11 comprises a log parser 11a, a correlation engine 11b, a log analyser 11c and a data visualizer 11d.
  • the log collector 12 can be a distributed agent-based framework, and is suitably configured to collect the new log files on a local server. It is further adapted to be a lightweight and non-intrusive data collector. In the event of a failure, the log collector 12 is configured to push the log information to the Log Manager 11 for further analysis. The collector 12 is further equipped with a database 40 for storing all data and information associated with the log files. Referring to FIGURE 4, the Log Manager 11 is generally a platform for parsing, analysing and visualizing information in the event of system failures, which may include, but are not limited to, network, application and hardware failures. The Log Manager 11 is further configured to rank the failure according to the relevancy of the error identified and to suggest possible remedies to the respective system administrator.
  • the log parser 11a is configured to extract the log files from the log collector 12 and remove all unnecessary information from the log and reduce said information into small pieces or chunks.
  • log files include, but are not limited to, system log, network log, storage log, web server log, applications log, scripting log, java log and service log. The log parser 11a is further configured to arrange and organize the chunks of the log information into multidimensional analysis data for Online Analytical Processing (OLAP).
  • the correlation engine 11b is configured to obtain summarized log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.
  • the accelerator library 16, which is equipped with an accelerator API 16a, is in communication with the correlation engine 11b and thereby helps accelerate the process.
  • the data visualizer 11d is configured to parse the findings into a specific language, for example XML format, in order to be used by the dashboard 14. The user may view the output of findings via the dashboard 14.
  • the Log Manager 11 further comprises a log analyser 11c.
  • the log analyser 11c in accordance with an embodiment of the present invention is configured to analyse the structured information from the correlation engine 11b and conduct semantic cause-effect analysis according to the pre-defined taxonomy.
  • the pre-defined taxonomy can be retrieved from a database 51, which may be in the form of a Knowledge Base repository or database. Accordingly, the database is in communication with the log analyser 11c.
  • the log analyser 11c is also configured to rank the cause-effect based on the highest relevancy of the failures detected.
  • There is further provided a processed database 52 configured to be in communication with the Log Manager 11 so as to store all processed data by the Log Manager 11.
  • FIGURE 5 shows a process flow adapted by the log collector 12 in accordance with an embodiment of the present invention.
  • the log collector 12 is triggered at 500 and then monitors the failure event at 501 for a period of time n. This is to ensure that the failure event is genuine and reproducible.
  • the log collector 12 collects and gathers raw log files from various sources at 502 and 502A.
  • examples of raw files may include system log, network log, storage log, web server log, applications log, scripting log, java log and service log. All gathered data and information associated to log files are stored or saved within the log database at 503.
  • FIGURE 6 shows the process adapted by the log parser 11a in accordance with an embodiment of the present invention.
  • the log parser 11a reads all log files stored in the log database of the log collector 12 at 600. It should be noted that each log file may have a different log format. Accordingly, at 601, the log parser 11a formats and chunks the raw log files into a structured and summarized format. In this process, the log parser 11a engages the accelerator library 16 to swiftly scan the log files, identify their formats and then break the log messages into several chunks of data.
  • Each chunk contains important information that represents the message, for example, but not limited to, date, time, server type, program, error/warning messages, nouns/objects, verbs/methods and attributes/options.
  • the log parser 11a arranges and organizes the chunks of data into multidimensional analysis data for Online Analytical Processing (OLAP) at 602. This process is also assisted by the accelerator library 16 to accelerate the arranging and organizing of the chunks of information.
  • FIGURE 7A shows the process adopted by the correlation engine 11b in accordance with an embodiment of the present invention.
  • the correlation engine 11b scans the log messages stored within the log database 40 to find and identify problem similarities within the same failure domain. This process is further accelerated with assistance from the accelerator library 16.
  • the correlation engine 11b correlates the identified logs into one bucket by tagging the output, compressing the output and storing it in the log database 40, at 701. This process is also accelerated with the assistance of the accelerator library 16.
  • Tagging the output at 701 may include details associated with date, time, server, program name, errors/warnings, nouns/objects, verbs/methods and attributes/options.
  • FIGURE 7B shows the examples of errors, which may be captured from the cloud storage controller (SC) and the cloud node controller (NC) 15, respectively.
  • FIGURE 7C shows an example of the final output flags generated upon completion of the correlation process by the correlation engine 11b. In one embodiment, the final output may contain information such as time and data on the identified error or failure.
  • FIGURE 8 shows the process adapted by the log analyser 11c in accordance with an embodiment of the present invention. Firstly, the log analyser 11c reads log from the bucket and performs semantic search using the Knowledge Base database 51 at 800.
  • the analyser 11c proceeds to compare the "correlated log" with the predefined relationships and other available information in the Knowledge Base database at 801. If the search fails, the log analyser 11c updates the Knowledge Base database 51 at 801B with the current log info at 801A. Upon completion of the comparison, the log analyser 11c proceeds to perform a Root Cause Analysis (RCA) against the log in order to determine whether the identified log originates from the resource, the environment, human intervention or another source, at 802. The RCA is performed to identify the root factors that cause the failure. A conclusion is established at 803 when an identified solution that may exist in the Knowledge Base database 51 is applied.
  • RCA: Root Cause Analysis
  • the system then proceeds to establish a new solution by considering the flaw-factor relationships of the problems and solutions, whereby the system identifies the frequency of repeated root causes by performing a semantic analysis against the semantic unit in the database, at 804.
  • a rank is assigned to each root cause, which may be based on the severity in the log/depth of problem correlation and also the simplicity of solution implementation. Further, once the conclusion is generated and solution deployment is complete, the system proceeds to update a processed database at 805A.
  • FIGURE 9 shows the analytical and intelligence business module 13 and dashboard 14.
  • the analytical and intelligence business module 13 and dashboard 14 are configured to serve as interfaces for the user.
  • the user may input data associated to the output report.
  • the user chooses the required reports and the module 13 accepts the user inputs if any; whereby the end result based on the input and instructions prompted by the user are visualized on a browser.
  • a web container accepts the user's request at 901 and sends back the response of the backend system.
  • a report builder analyses the user requests at 902, retrieves the required information from the processed database 52 and sends the information back to the analytical and intelligence business module 13 and dashboard 14 for visualization.
  • the system in one embodiment of the present invention may be adapted as FIGURE 10.
  • the cloud computing system comprises at least one device being connected to a public network 101, at least one device being connected to a private network 102, a Log Manager 11 of the present invention, coupled with an accelerator library 16, a cloud controller 17, a cloud node controller 15 and at least one storage means 103, said storage means 103 which may be adapted to accommodate at least one database described in the preceding paragraphs.
  • the afore-described methods and components may be provided in many variations, modifications or alternatives to existing testing systems.
  • the principles and concepts disclosed herein may also be implemented in various manner which may not have been specifically described herein but which are to be understood as encompassed within the scope and letter of the following claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

There is disclosed a cloud computing managing system comprising at least one computing resource (10) managed by a Log Manager (11) adapted and configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output. The system further comprises at least one log collector (12) interconnected at the computing resource clusters (100) and configured to collect and gather information related to the failure, store the information in a database (40) and forward the information to the Log Manager (11). The system further provides an analytical business intelligence module (13) with a dashboard (14) for use as an interface and to display the output from the Log Manager (11). A method thereof is also provided.

Description

SYSTEM AND METHOD FOR MANAGING RESOURCES FAILURE USING FAST CAUSE AND EFFECT ANALYSIS IN A CLOUD COMPUTING SYSTEM
FIELD OF INVENTION
[0001] The present invention generally relates to cloud computing, and more particularly to a method and system for managing cloud resources over a network.
BACKGROUND OF INVENTION
[0002] Cloud computing is an emerging technology that benefits mid-size and large companies through efficiencies in maintenance, deployment and upgrading when capacity or capabilities must be added to deliver a certain service to users or customers. It is continuously evolving and various industries are adapting to this technology.
[0003] Cloud computing typically encompasses a plurality of computing resources configured to deliver cloud-based services over a network. With the installation of various resources, managing these resources becomes crucial in order to fulfil service delivery requirements. With conventional cloud computing systems, cloud providers are often unable to identify and troubleshoot computing resource failures, and consequently incur penalty fees. The types of failures commonly experienced include hardware failure, software malfunction and dependencies, and abnormal activities between the resources. Further, conventional systems are typically based on a best-effort approach rather than a risk-aware approach when accepting a Service Level Agreement (SLA). As seen in FIGURE 1, in conventional systems the cloud computing resources are managed manually with user intervention, whereby the administrator performs identification of failures, step-by-step troubleshooting and cause-effect analysis manually. Such a manual process is highly time-consuming. These drawbacks can be effectively addressed if all these tasks, especially fixing the failure, are performed autonomously upon detecting the cause of failure within the computing system.
[0004] Thus, there is a considerable need for systems and methods that address the above-discussed shortcomings of managing failures in a cloud computing system.
SUMMARY
[0005] In one aspect of the present invention and broadly defined, there is provided a cloud computing managing system comprising: at least one computing resource; a Log Manager configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output; at least one log collector configured to collect and gather information related to the failure, store the information in a database and forward the information to the Log Manager; and an analytical business intelligence module with a dashboard for use as an interface and to display the output from the Log Manager.
[0006] In one embodiment, the Log Manager further comprises a log parser, a correlation engine, a log analyser, and a data visualizer.
[0007] In a further embodiment, the log parser is configured to extract the log files from the log collector and remove all unnecessary information from the log and reduce said information into small pieces or chunks.
[0008] In yet a further embodiment, the correlation engine is configured to obtain a summarized version of log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.
[0009] In another embodiment, the log analyser is configured to analyse the structured information from the correlation engine, conduct semantic cause-effect analysis in accordance with a predefined taxonomy provided by a storage device, and rank the cause-effect based on the highest relevancy of the failures detected.
[0010] In a further embodiment, the data visualizer is configured to parse the findings into a specific language to be used by the dashboard.
[0011] In yet a further embodiment, the system further comprises a cloud controller and an accelerator library for accelerating the processes performed by the log parser and correlation engine.
[0012] In yet another embodiment, the log manager comprises at least one knowledge base database, a processed database and an Online Analytical Processing (OLAP) database.
[0013] In another aspect of the present invention, there is disclosed a method for managing resources in a cloud computing system comprising the steps of: monitoring a failure for a period of time; gathering data and log files from at least one resource; storing said gathered data and log files in a database; retrieving the log files and data from the database and reducing the data and log files into chunks of data; arranging and organizing the chunks of data into a multidimensional analysis data format and storing said multidimensional data format in a database; scanning at least one database to identify failure similarity data within the same failure domain; correlating the identified similarity data by tagging, compressing and storing it in a database; performing a semantic search and comparing the correlated data with predefined relationships contained within at least one database; performing a root-cause analysis; generating a conclusion upon applying a solution for the failure; and identifying the frequency of failure occurrences and ranking the occurrences of failures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The invention will be more understood by reference to the description below taken in conjunction with the accompanying drawings herein:
[0015] FIGURE 1 shows a conventional cloud computing system;
[0016] FIGURE 2 shows an overall view of the cloud computing managing system in accordance with an embodiment of the present invention;
[0017] FIGURE 3 depicts the functionalities of the Log Manager in accordance with an embodiment of the present invention;
[0018] FIGURE 4 illustrates the Log Manager in accordance with an embodiment of the present invention;
[0019] FIGURE 5 provides a schematic flowchart showing the processes performed by the log collector in accordance with an embodiment of the present invention;
[0020] FIGURE 6 shows the processes involved and performed by the log parser in accordance with an embodiment of the present invention;
[0021] FIGURE 7A shows the processes involved and performed by the correlation engine in accordance with an embodiment of the present invention;
[0022] FIGURE 7B shows the examples of errors, which may be captured from the cloud storage controller and the cloud node controller, respectively;
[0023] FIGURE 7C shows an example of the final output generated upon completion of the correlation process by the correlation engine;
[0024] FIGURE 8 shows the processes involved and performed by the log analyser in accordance with an embodiment of the present invention;
[0025] FIGURE 9 shows the analytical and intelligence business module and dashboard; and
[0026] FIGURE 10 illustrates an example of the implementation of the system and method in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0027] In line with the above summary, the following description of a number of specific and alternative embodiments is provided to understand the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practiced without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.
[0028] The present invention provides a method and system that is configured to detect failures, and then correlate, collect, and analyze cloud resources information including possible factors that cause the failure, whereby the outcome of rectification of such failure can be visualized prior to actual rectification. The latter feature thereby enables the user to anticipate the outcome prior to performing actual fixing. For the purpose of clear description and avoidance of doubt, the term "failure" in this specification is used to mean any form of technical failures or errors or glitches or any form of technical intervention that causes technical interruption of the cloud computing system.
[0029] With reference to FIGURE 2, the overall system in accordance with an embodiment of the present invention will now be described. Generally, the system for managing cloud computing contemplated in accordance with an embodiment of the present invention comprises at least one computing resource 100, a Log Manager 11, at least one log collector 12, an analytical business intelligence module 13 and a dashboard 14. FIGURE 3 depicts the functionalities of the Log Manager 11 in accordance with an embodiment of the present invention. The Log Manager 11 is configured to be in communication with the assigned clusters 100 (cloud computing resources), the cloud controller 15A, accelerator library 16, analytical and business intelligence module 13 and the dashboard 14. In one embodiment, at least one log collector 12 is disposed at each cluster of the computing resources 100, and at least one log collector 12 is disposed at the cloud node controller 15 and at the cloud controller 15A respectively. In one embodiment, the Log Manager 11 comprises a log parser 11a, a correlation engine 11b, a log analyser 11c and a data visualizer 11d.
[0030] In accordance with one embodiment of the present invention, the log collector 12 can be a distributed agent-based framework, and is suitably configured to collect the new log files on a local server. It is further adapted to be a lightweight and non-intrusive data collector. In the event of a failure, the log collector 12 is configured to push the log information to the Log Manager 11 for further analysis. The collector 12 is further equipped with a database 40 for storing all data and information associated with the log files.
[0031] Referring to FIGURE 4, the Log Manager 11 is generally a platform for parsing, analysing and visualizing information in the event of system failures, which may include, but are not limited to, network, application and hardware failures. The Log Manager 11 is further configured to rank the failure according to the relevancy of the error identified and to suggest possible remedies to the respective system administrator. The log parser 11a is configured to extract the log files from the log collector 12, remove all unnecessary information from the log and reduce said information into small pieces or chunks. Examples of log files include, but are not limited to, system log, network log, storage log, web server log, applications log, scripting log, java log and service log. The log parser 11a is further configured to arrange and organize the chunks of the log information into multidimensional analysis data for Online Analytical Processing (OLAP). This organizing and arranging of the chunks of information is assisted by the accelerator library 16 with accelerator API 16a (as seen in FIGURE 4), which helps accelerate the process.
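The parsing and arranging steps of paragraph [0031] can be pictured with a minimal sketch. The log line layout, the field names and the helper names `parse_line` and `build_cube` are assumptions made for illustration only; the specification does not define a log format or an OLAP schema, and the counter below merely stands in for multidimensional analysis data.

```python
import re
from collections import Counter

# Hypothetical syslog-like layout; the specification does not fix a format.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) (?P<program>[^:\s]+): "
    r"(?P<level>ERROR|WARN|INFO) (?P<message>.*)$"
)

def parse_line(line):
    """Reduce one raw log line to a chunk (dict of fields), or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None

def build_cube(chunks, dimensions=("server", "program", "level")):
    """Count chunks per combination of dimension values (a tiny OLAP-style cube)."""
    cube = Counter()
    for chunk in chunks:
        cube[tuple(chunk[d] for d in dimensions)] += 1
    return cube

raw_lines = [
    "2015-01-05 10:00:01 nc01 libvirtd: ERROR cannot attach volume vol-01",
    "2015-01-05 10:00:02 sc01 storage: ERROR volume vol-01 not found",
    "2015-01-05 10:00:03 nc01 libvirtd: INFO heartbeat ok",
    "not a log line",
]

# Drop unparsable lines and purely informational entries ("unnecessary information").
chunks = [c for c in map(parse_line, raw_lines) if c and c["level"] != "INFO"]
cube = build_cube(chunks)
```

Treating INFO entries as unnecessary is a choice made for this sketch; a real parser would apply whatever filtering rules the deployment defines.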
[0032] Still referring to FIGURE 4, the correlation engine 11b is configured to obtain summarized log messages, create logical relationships between the messages that have similarities within the failure domain, and consolidate them into suitable structured information. In this process, the accelerator library 16, which is equipped with an accelerator API 16a, is in communication with the correlation engine 11b and thereby accelerates the process. The data visualizer 11d is configured to parse the findings into a specific language, for example XML format, in order to be used by the dashboard 14. The user may view the output of the findings via the dashboard 14.
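The correlation step described above can be illustrated with a minimal sketch. The field names `domain` and `text`, the similarity threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the specification does not state how similarity is computed, and the accelerator library 16 is not modelled here.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def correlate(messages, threshold=0.6):
    """Group summarized log messages into buckets of similar messages.

    Messages are only compared within the same failure domain.
    `messages` is a list of dicts with hypothetical keys "domain" and "text".
    Returns {domain: [bucket, ...]}, where each bucket is a list of messages.
    """
    buckets = defaultdict(list)
    for msg in messages:
        domain_buckets = buckets[msg["domain"]]
        for bucket in domain_buckets:
            # Compare against the first message of the bucket (its representative).
            ratio = SequenceMatcher(None, bucket[0]["text"], msg["text"]).ratio()
            if ratio >= threshold:
                bucket.append(msg)
                break
        else:
            # No similar bucket found in this domain: open a new one.
            domain_buckets.append([msg])
    return dict(buckets)
```

Comparing each message only against a bucket's first member keeps the sketch short; a production correlation engine would likely use a more robust representative or clustering method.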
[0033] The Log Manager 11 further comprises a log analyser 11c. The log analyser 11c in accordance with an embodiment of the present invention is configured to analyse the structured information from the correlation engine 11b and conduct semantic cause-effect analysis according to the pre-defined taxonomy. The pre-defined taxonomy can be retrieved from a database 51, which may be in the form of a Knowledge Base repository or database. Accordingly, the database 51 is in communication with the log analyser 11c. The log analyser 11c is also configured to rank the cause-effect based on the highest relevancy of the failures detected. There is further provided a processed database 52 configured to be in communication with the Log Manager 11 so as to store all data processed by the Log Manager 11.
[0034] A system incorporating the operational method in accordance with an embodiment of the present invention will now be described based on the steps or process stages performed by each component of the Log Manager 11 in the event that there is a failure within the system. FIGURE 5 shows a process flow adapted by the log collector 12 in accordance with an embodiment of the present invention. In the event that a failure is identified, the log collector 12 is triggered at 500 and then monitors the failure event at 501 for a period of time n. This monitoring ensures that the failure event is genuine and reproducible. Upon completion of the period n, the log collector 12 collects and gathers raw log files from various sources at 502 and 502A. As discussed earlier, examples of raw files may include system log, network log, storage log, web server log, applications log, scripting log, java log and service log. All gathered data and information associated with the log files are stored within the log database at 503.
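The collector flow of FIGURE 5 (trigger, monitor for a period, gather, store) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `probe` callable that reports whether the failure is still present, the polling interval, and the use of a plain list as a stand-in for the log database are all hypothetical and are not taken from the specification.

```python
import time


def failure_persists(probe, period_s, interval_s=1.0, sleep=time.sleep):
    """Return True only if probe() reports a failure on every check
    made during the monitoring period (step 501)."""
    deadline = time.monotonic() + period_s
    while True:
        if not probe():
            return False  # failure disappeared: treat it as not genuine
        if time.monotonic() >= deadline:
            return True
        sleep(interval_s)


def collect_logs(sources):
    """Gather raw log lines from each named source (steps 502 and 502A).

    `sources` maps a source name (e.g. "system", "network") to a
    zero-argument callable returning an iterable of raw lines.
    """
    records = []
    for name, read_lines in sources.items():
        for line in read_lines():
            records.append({"source": name, "raw": line})
    return records


def on_failure(probe, sources, period_s, log_db, **kwargs):
    """Triggered at step 500; stores gathered logs in log_db (step 503)
    only if the failure persists for the whole monitoring period."""
    if not failure_persists(probe, period_s, **kwargs):
        return False
    log_db.extend(collect_logs(sources))
    return True
```

Passing `sleep` as a parameter makes the monitoring loop easy to exercise without real delays.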
[0035] FIGURE 6 shows the process adapted by the log parser 11a in accordance with an embodiment of the present invention. The log parser 11a reads all log files stored in the log database of the log collector 12 at 600. It should be noted that each log file may have a different log format. Accordingly, at 601, the log parser 11a formats and chunks the raw log files into a structured and summarized format. In this process, the log parser 11a engages the accelerator library 16 to swiftly scan the log files, identify the formats and then break the log messages into several chunks of data. Each chunk contains important information that represents the message, for example, but not limited to, date, time, server type, program, error/warning messages, nouns/objects, verbs/methods and attributes/options. Next, upon completion of the summarization and chunk conversion, the log parser 11a arranges and organizes the chunks of data into multidimensional analysis data in Online Analytical Processing (OLAP) at 602. This process is also assisted by the accelerator library 16 to accelerate the arranging and organizing of the chunks of information. The new format is saved at 603 and then stored in an OLAP database at 603A.
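The chunking and multidimensional arrangement performed by the log parser 11a might look like the sketch below. The log line layout, the regular expression, the chosen dimensions and the helper names are assumptions for illustration only; real log files differ in format, as the paragraph above notes, and a `Counter` keyed by dimension tuples is merely a small stand-in for an OLAP store.

```python
import re
from collections import Counter

# Hypothetical line layout: "<date> <time> <server> <program>: <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) (?P<program>[^:\s]+): "
    r"(?P<level>ERROR|WARNING|INFO) (?P<message>.*)$"
)

DIMS = ("date", "server", "program", "level")


def parse_line(line):
    """Break one raw log line into named chunks, or return None if the
    line does not match the assumed format (step 601)."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def build_cube(chunks, dims=DIMS):
    """Count messages per combination of dimension values (step 602)."""
    cube = Counter()
    for chunk in chunks:
        cube[tuple(chunk[d] for d in dims)] += 1
    return cube


def slice_cube(cube, dims=DIMS, **fixed):
    """Sum the counts of all cells whose dimensions match `fixed`."""
    total = 0
    for key, count in cube.items():
        cell = dict(zip(dims, key))
        if all(cell[name] == value for name, value in fixed.items()):
            total += count
    return total
```

For example, `slice_cube(cube, level="ERROR")` returns the number of error messages across all dates, servers and programs, which is the kind of aggregate query a multidimensional model is meant to answer quickly.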
[0036] FIGURE 7A shows the process adapted by the correlation engine 11b in accordance with an embodiment of the present invention. Firstly, at 700, the correlation engine 11b scans the log messages stored within the log database 40 to find and identify problem similarities within the same failure domain. This process is further accelerated with assistance from the accelerator library 16. Upon completion of the scanning, the correlation engine 11b correlates the identified logs into one bucket by tagging the output, compressing the output and storing it in the log database 40, at 701. This process is also accelerated with the assistance of the accelerator library 16. Tagging the output at 701 may include details associated with date, time, server, program name, errors/warnings, nouns/objects, verbs/methods and attributes/options. It should be noted that OLAP makes data access significantly quicker by using the multidimensional data model. FIGURE 7B shows examples of errors which may be captured from the cloud storage controller (SC) and the cloud node controller (NC) 15, respectively. FIGURE 7C shows an example of the final output flags generated upon completion of the correlation process by the correlation engine 11b. In one embodiment, the final output may contain information such as time and data on the identified error or failure.
[0037] FIGURE 8 shows the process adapted by the log analyser 11c in accordance with an embodiment of the present invention. Firstly, the log analyser 11c reads the log from the bucket and performs a semantic search using the Knowledge Base database 51 at 800. If the search result is a "success", the analyser 11c proceeds to compare the "correlated log" with the predefined relationships and other available information in the Knowledge Base database at 801. If the search fails, the log analyser 11c updates the Knowledge Base database 51 at 801B with the current log information at 801A. Upon completion of the comparison, the log analyser 11c proceeds to perform a Root Cause Analysis (RCA) against the log in order to determine whether the identified log originates from the resource, the environment, human intervention or another source, at 802. The Root Cause Analysis (RCA) is performed to identify the root factors that are causing the failure. A conclusion is established at 803 when an identified solution that may exist in the Knowledge Base database 51 is applied. In the event that the solution is not found in the Knowledge Base database 51, the system proceeds to establish a new solution by considering the flaw-factor relationships of the problems and solutions, whereby the system identifies the frequency of repeated root causes by performing a semantic analysis against the semantic unit in the database, at 804. In the next step, at 805, a rank is assigned to each root cause, which may be based on the severity in the log, the depth of problem correlation and the simplicity of implementing the solution. Further, once the conclusion is generated and the solution has been deployed, the system proceeds to update a processed database at 805A.
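The analyser flow of FIGURE 8 (knowledge-base lookup, frequency of repeated root causes, ranking) can be sketched as follows. Several simplifications are assumed here and are not taken from the specification: a plain substring match stands in for the semantic search, the `KnownCause` record and its fields are hypothetical, and the ranking order (severity first, then correlation depth, then implementation effort) is just one possible reading of the criteria listed at 805.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class KnownCause:
    keyword: str    # text fragment that identifies this cause (stand-in for semantic search)
    cause: str      # root-cause label
    category: str   # e.g. "resource", "environment", "human"
    solution: str   # remedy suggested to the administrator
    effort: int     # 1 = simple to implement, higher = harder


def analyse(buckets, knowledge_base):
    """Match correlated buckets against the knowledge base and rank causes.

    `buckets` is a list of buckets; each bucket is a list of dicts with
    hypothetical keys "text" and "severity" (higher = more severe).
    Returns (ranked_findings, unknown_buckets).
    """
    findings, unknown = [], []
    for bucket in buckets:
        match = next(
            (k for k in knowledge_base
             if any(k.keyword in m["text"] for m in bucket)),
            None,
        )
        if match is None:
            unknown.append(bucket)  # candidates for updating the knowledge base
            continue
        findings.append({
            "cause": match.cause,
            "category": match.category,
            "solution": match.solution,
            "severity": max(m["severity"] for m in bucket),
            "depth": len(bucket),  # number of correlated messages
            "effort": match.effort,
        })
    # Frequency of repeated root causes across buckets (step 804).
    frequency = Counter(f["cause"] for f in findings)
    for f in findings:
        f["frequency"] = frequency[f["cause"]]
    # Rank: most severe first, then deepest correlation, then simplest fix (step 805).
    findings.sort(key=lambda f: (-f["severity"], -f["depth"], f["effort"]))
    for rank, f in enumerate(findings, start=1):
        f["rank"] = rank
    return findings, unknown
```

Buckets returned in `unknown` correspond to the failed-search branch, where the current log information would be added to the knowledge base.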
[0038] FIGURE 9 shows the analytical and business intelligence module 13 and dashboard 14. The analytical and business intelligence module 13 and dashboard 14 are configured to serve as interfaces for the user. The user may input data associated with the output report. The user chooses the required reports and the module 13 accepts the user inputs, if any, whereby the end result based on the input and instructions given by the user is visualized on a browser. A web container accepts the user's request at 901 and sends back the response of the backend system. Next, a report builder analyses the user request at 902, retrieves the required information from the processed database 52 and sends the information back to the analytical and business intelligence module 13 and dashboard 14 for visualization.
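As described earlier for the data visualizer, findings may be expressed in a format such as XML before the dashboard renders them. The sketch below shows one way this could be done with Python's standard `xml.etree.ElementTree` module. The element names, the `rank` attribute and the shape of the input dictionaries are assumptions for illustration; the specification does not define an XML schema.

```python
import xml.etree.ElementTree as ET


def findings_to_xml(findings):
    """Serialize ranked findings into an XML string for a dashboard.

    Each finding is a dict with hypothetical keys "rank", "cause"
    and "solution".
    """
    root = ET.Element("findings")
    for finding in findings:
        item = ET.SubElement(root, "finding", rank=str(finding["rank"]))
        ET.SubElement(item, "cause").text = finding["cause"]
        ET.SubElement(item, "solution").text = finding["solution"]
    return ET.tostring(root, encoding="unicode")
```

A dashboard component could parse this string with any standard XML parser and display the findings in rank order.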
[0039] In one example, the system in one embodiment of the present invention may be adapted as shown in FIGURE 10. The cloud computing system comprises at least one device connected to a public network 101, at least one device connected to a private network 102, a Log Manager 11 of the present invention coupled with an accelerator library 16, a cloud controller 17, a cloud node controller 15 and at least one storage means 103, said storage means 103 being adaptable to accommodate at least one database described in the preceding paragraphs.
[0040] As would be apparent to a person having ordinary skill in the art, the afore-described methods and components may be provided in many variations, modifications or alternatives to existing testing systems. The principles and concepts disclosed herein may also be implemented in various manners which may not have been specifically described herein but which are to be understood as encompassed within the scope and letter of the following claims.

Claims

1. A cloud computing managing system comprising: at least one computing resource (100); a Log Manager (11) being in communication with the computing resource (100) and configured to collect, parse, analyse and visualize information based on a failure within the cloud computing system and to generate an output; at least one log collector (12) being in communication with the Log Manager (11) and the computing resource (100) and configured to collect and gather information related to the failure, store the information in a database (40) and forward the information to the Log Manager (11); and an analytical business intelligence module (13) being in communication with the Log Manager (11) and configured to review the output from the Log Manager (11), having a dashboard (14) adapted as an interface and to display the output from the Log Manager (11).
2. The cloud computing managing system as claimed in Claim 1, wherein the Log Manager (11) further comprises a log parser (11a), a correlation engine (11b), a log analyser (11c) and a data visualizer (11d).
3. The cloud computing managing system as claimed in Claim 2, wherein the log parser (11a) is configured to extract the log files from the log collector (12), remove all unnecessary information from the log and reduce said information into small pieces or chunks.
4. The cloud computing managing system as claimed in Claim 2, wherein the correlation engine (11b) is configured to obtain a summarized version of log messages, create logical relationships between the messages that have similarities within the failure domain and consolidate them into suitable structured information.
5. The cloud computing managing system as claimed in Claim 2, wherein the log analyser (11c) is configured to analyse the structured information from the correlation engine (11b), conduct semantic cause-effect analysis in accordance with a predefined taxonomy provided by a storage device (51), and rank the cause-effect based on the highest relevancy of the failures detected.
6. The cloud computing managing system as claimed in Claim 2, wherein the data visualizer (11d) is configured to parse the findings into a specific language to be used by the dashboard (14).
7. The cloud computing managing system as claimed in Claim 1, wherein the system further comprises a cloud controller (15) being in communication with the Log Manager (11) and an accelerator library (16) connected to the Log Manager (11).
8. The cloud computing managing system as claimed in Claim 7, wherein the accelerator library (16) is configured to accelerate processes performed by the log parser (11a) and the correlation engine (11b).
9. The cloud computing system as claimed in Claim 1, wherein the Log Manager (11) comprises at least one knowledge base database (51) for storing a pre-defined taxonomy, a processed database (52) for storing all processed data from the Log Manager (11) and an Online Analytical Processing (OLAP) database (50) for storing log information as multidimensional analysis data from the Log Manager (11).
10. A method for managing resources in a cloud computing system comprising:
detecting and monitoring a failure for a period of time (501);
gathering data and log files from at least one resource (502, 502A);
storing said gathered data and log files in a database (503);
retrieving the log files and data from the database and reducing the data and log files into chunks of data (600, 601);
arranging and organizing the chunks of data into a multidimensional analysis data format and storing said multidimensional analysis data in a database (602, 603);
scanning at least one database to identify failure similarity data within the same failure domain (700);
correlating the identified similarity data by tagging, compressing and storing it in a database (701);
performing a semantic search and comparing the correlated data with predefined relationships contained within at least one database (800, 801);
performing a root-cause analysis (802);
generating a conclusion upon applying a solution for the failure (803);
identifying the frequency of failure occurrences (804); and
ranking the root-cause occurrences of failures (805).

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MY2014001625 2014-06-04
MY2014001625 2014-06-04

Publications (2)

Publication Number Publication Date
WO2015187001A2 (en) 2015-12-10
WO2015187001A3 (en) 2016-01-28

Family

ID=54767527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2015/050042 WO2015187001A2 (en) 2014-06-04 2015-05-29 System and method for managing resources failure using fast cause and effect analysis in a cloud computing system

Country Status (1)

Country Link
WO (1) WO2015187001A2 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15759955; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15759955; Country of ref document: EP; Kind code of ref document: A2)