CN118119927A

CN118119927A - Method and system for recommending an operation manual for a detected event

Info

Publication number: CN118119927A
Application number: CN202280068628.8A
Authority: CN
Inventors: R·H·R·帕蒂; C·A·罗伊; A·M·H·麦卡洛姆; M·古斯瓦米; J·K·科尔科; S·雷迪
Original assignee: Oracle International Corp
Current assignee: Oracle International Corp
Priority date: 2021-08-24
Filing date: 2022-08-05
Publication date: 2024-05-31

Abstract

Techniques for selecting an operation manual to recommend for remedying a detected event are disclosed. When the system detects an event, the system obtains metadata associated with the event. The metadata provides information about the event and about the topology of the system in which the event occurred. The system generates a recommendation of an operation manual for remedying the event based on one or both of the characteristics of the event and the characteristics of the topology in which the event occurred. The system compares the system topology to system topologies associated with previously executed operation manuals. The system recommends one of the previously executed operation manuals to remedy the detected event based on determining that the topology associated with that operation manual is similar to the topology of the system in which the event occurred.

Description

Method and system for recommending an operation manual for a detected event

Technical Field

The present disclosure relates to recommending relevant operation manuals during an operation manual selection process. In particular, the present disclosure relates to event-aware, topology-aware operation manual selection for remediating events.

Background

Modern information technology systems include a large number of different types of components. For example, there may be database systems, network systems, computer applications, and the like. Each such system may be managed and/or monitored by a specialized Information Technology (IT) professional.

During normal operation, a computer system may produce or encounter behaviors or results that are unexpected or undesired by an operator monitoring the system. Such behaviors or results may generate event records (e.g., a process running slowly or a process stopping). Upon encountering an event log or incident message, the user may wish to solve the problem by performing one or more remedial tasks. The user may perform remedial tasks defined by an operation manual. An operation manual is a guideline that a user can follow to perform a series of tasks to achieve a desired result, such as remediation of unexpected or undesired results in a system.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Thus, unless otherwise indicated, any approaches described in this section are not to be construed so as to qualify as prior art merely by virtue of their inclusion in this section.

Drawings

Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. It should be noted that references to "an embodiment" or "one embodiment" in this disclosure are not necessarily to the same embodiment, and they mean at least one. In the drawings:

FIG. 1A illustrates a system in accordance with one or more embodiments;

FIG. 1B illustrates an example system topology in accordance with one or more embodiments;

FIG. 2 illustrates an example set of operations for recommending an operation manual to be performed to remediate an event based on system topology in accordance with one or more embodiments;

FIG. 3 illustrates an example set of operations for recommending an operation manual to be performed for diagnosing and/or remediating an event based on topology data associated with the operation manual in accordance with one or more embodiments;

FIG. 4 illustrates an example set of operations for recommending an operation manual for diagnosing and/or remediating an event based on event attributes in accordance with one or more embodiments;

FIG. 5 illustrates an example set of operations for previewing operation manual operations to recommend an operation manual for diagnosing/remediating an event in accordance with one or more embodiments;

FIG. 6 illustrates an example embodiment of recommending an operation manual for diagnosing/remediating an event;

FIG. 7 illustrates another example embodiment of recommending an operation manual for diagnosing/remediating an event;

FIGS. 8A and 8B illustrate an example embodiment of a Graphical User Interface (GUI);

FIG. 9 shows a block diagram illustrating a computer system in accordance with one or more embodiments.

Detailed Description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding. One or more embodiments may be practiced without these specific details. Features described in one embodiment may be combined with features described in another embodiment. In some instances, well-known structures and devices are described with reference to block diagram form in order to avoid unnecessarily obscuring the present invention.

1. General overview

2. System architecture

3. Example System topology

4. Topology-based recommendation of an operation manual for event diagnosis and remediation

5. Event-based recommendation of an operation manual for event diagnosis and remediation

6. Previewing an operation manual operation to recommend candidate operation manuals

7. Example embodiment

8. Computer network and cloud network

9. Miscellaneous; extensions

10. Hardware overview

1. General overview

An operation manual defines a collection of independently executable operations that a user can perform to diagnose and remedy problems occurring in a system. For example, upon detecting that communication with a data communication device has been interrupted, an Information Technology (IT) technician may open a particular operation manual to diagnose the cause of the data communication failure. The operation manual may further include steps for restoring access to the data communication device. Complex systems may have hundreds or thousands of operation manuals to help technicians diagnose the cause of an event and remedy the event. Thus, it may be difficult for a technician to identify the particular operation manual that should be performed for a particular detected event.

One or more embodiments select a particular operation manual to recommend for diagnosing the cause of, and remediating, an event detected in a system, based on system topology data and/or event attributes associated with the detected event. The system obtains a system topology associated with the event. The system topology specifies the relationships between components in the system. The system identifies another topology that is similar to the obtained system topology. The other topology may be selected from a stored set of topologies. If the stored topology and the system topology are sufficiently similar, the system recommends an operation manual associated with the stored topology to remedy the detected event. For example, the operation manual may be an operation manual previously executed with the stored topology to successfully remedy the detected event.

For example, the system may detect a communication failure in a server. The system identifies additional components, including another server, a database, and a gateway, in communication with the server in which the failure was detected. The system identifies a stored topology with a similar arrangement, including a server in communication with another server and a gateway. The system selects an operation manual associated with the stored topology to recommend for remediating the server failure based on determining that the stored topology is similar to the topology associated with the server failure. The system may determine whether the stored topology is similar to the system topology in which the event was detected based on the number of components in common between the two topologies.
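The comparison in the server example above can be sketched in a few lines. This is a minimal illustration only: the text says similarity may depend on the number of components in common, and the Jaccard ratio, the 0.5 threshold, and every name here (`StoredTopology`, `component_overlap`, `recommend_by_topology`, the manual names) are assumptions made for the example rather than details taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredTopology:
    """Hypothetical record of a stored topology and the manual run against it."""
    name: str
    components: frozenset
    operation_manual: str


def component_overlap(a, b):
    """Assumed metric: shared components divided by all distinct components."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_by_topology(event_components, stored, threshold=0.5):
    """Return (operation_manual, score) pairs at or above the threshold, best first."""
    scored = [
        (t.operation_manual, component_overlap(event_components, t.components))
        for t in stored
    ]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Topology around the failed server, as in the example above.
event_topology = {"server", "peer-server", "database", "gateway"}
stored = [
    StoredTopology("T1", frozenset({"server", "peer-server", "gateway"}), "restore-server-link"),
    StoredTopology("T2", frozenset({"router", "firewall"}), "reset-firewall"),
]
recommendations = recommend_by_topology(event_topology, stored)
print(recommendations)
```

Here the first stored topology shares three of four distinct components with the event topology and is recommended, while the second shares none and is filtered out.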

One or more embodiments present an operation manual selection interface that allows a user to select an event. The system collects metadata associated with the selected event. The metadata may include event data such as the time of the event, the device on which the event occurred, the program that was running when the event occurred, and the user affected by the event. The metadata may also include topology data, including the topological relationships of the system components in the system where the event occurred. Based on the topology attributes and event attributes, the system presents, in the operation manual selection interface, an operation manual to execute to remedy the selected event.

One or more embodiments recommend an operation manual to remedy a detected event based on a relationship between a target component associated with the event and one or more other components associated with operations of the operation manual. For example, the system may identify operations in an operation manual that require a user to interact with four system components. The system may determine, based on the system topology, that the four components are closely related to the target component associated with the event. Based on determining that the components associated with the operation manual are closely related to the target component associated with the event, the system recommends the operation manual to remedy the event.
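One way to make "closely related" concrete is hop distance in the topology graph. The sketch below assumes that interpretation; the text does not define the relationship measure, and the two-hop limit, the adjacency data, and the function names are all hypothetical.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search: number of hops from start to each reachable component."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closely_related(adjacency, target, manual_components, max_hops=2):
    """True when every component the manual touches is within max_hops of the target."""
    dist = hop_distances(adjacency, target)
    return all(dist.get(c, float("inf")) <= max_hops for c in manual_components)


# Hypothetical topology, written as an undirected adjacency list.
adjacency = {
    "server-1": ["database", "gateway", "server-2"],
    "database": ["server-1"],
    "gateway": ["server-1", "router"],
    "server-2": ["server-1"],
    "router": ["gateway", "printer"],
    "printer": ["router"],
}

four_components = ["database", "gateway", "server-2", "router"]
print(closely_related(adjacency, "server-1", four_components))
print(closely_related(adjacency, "server-1", ["printer"]))
```

A manual whose operations touch only the four nearby components would qualify for recommendation under this rule, while one that touches the printer, three hops away, would not.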

One or more embodiments recommend an operation manual to diagnose a cause of an event and/or remedy the detected event based on a similarity between the detected event and an event associated with the operation manual. The system identifies attributes associated with the event, such as the time of the event, the device at which the event occurred, the program that was run when the event occurred, and the user affected by the event. The system identifies attributes associated with the stored event. If the stored event and the detected event have predefined similarities, the system recommends an operation manual for remedying the stored event to remedy the detected event.

One example embodiment detects a user selection of an operation manual for remediating an event. The system analyzes event attributes associated with the detected event, with a second event associated with the selected operation manual, and with a third event associated with a third operation manual. Based on determining that the third event is more similar to the detected event than the second event is, the system recommends the third operation manual as an alternative to the selected operation manual to remedy the detected event.
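The attribute comparison described in the two preceding paragraphs might look like the following sketch. It assumes that similarity is the fraction of attributes with equal values; the text speaks only of "predefined similarities", so the metric, the attribute names, and the manual names are illustrative assumptions.

```python
def attribute_similarity(detected, stored):
    """Assumed metric: fraction of attribute keys whose values match in both events."""
    keys = set(detected) | set(stored)
    if not keys:
        return 0.0
    matches = sum(
        1 for k in keys
        if k in detected and k in stored and detected[k] == stored[k]
    )
    return matches / len(keys)


def suggest_alternative(detected, selected_manual, history):
    """Return a manual whose stored event is more similar than the selected one's, else None."""
    best_manual = selected_manual
    best_score = attribute_similarity(detected, history[selected_manual])
    for manual, event in history.items():
        score = attribute_similarity(detected, event)
        if score > best_score:
            best_manual, best_score = manual, score
    return None if best_manual == selected_manual else best_manual


# Hypothetical detected event and stored events keyed by the manual that remedied them.
detected = {"device": "server-a", "program": "billing", "user": "tenant-1", "hour": 14}
history = {
    "restart-service": {"device": "server-b", "program": "billing", "user": "tenant-2", "hour": 3},
    "clear-queue": {"device": "server-a", "program": "billing", "user": "tenant-1", "hour": 9},
}

print(suggest_alternative(detected, "restart-service", history))
print(suggest_alternative(detected, "clear-queue", history))
```

When the user selects `restart-service`, the stored event for `clear-queue` matches three of four attributes against one of four, so `clear-queue` is offered as the alternative. When the user has already picked the closest match, nothing is suggested.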

One or more embodiments allow a system administrator to associate a particular operation manual with a particular event. For example, a system administrator may specify via an operation manual selection interface that a particular operation manual should be presented when a particular target system component is offline or when a particular operation threshold is met. Additionally or alternatively, a system administrator may generate a set of tags or labels associated with an operation manual to describe the topology associated with the operation manual. The system may compare the topology associated with the operation manual to the topology associated with the detected event to select an operation manual to remedy the detected event. The system administrator may also arrange for an operation manual to be displayed in another event management system that performs particular actions upon detection of an event. For example, the event management system may detect that a particular server is unexpectedly offline. The event management system may perform a set of actions in response to the server going offline, including generating a notification to an administrator, rerouting data requests to a backup server, and presenting a specific operation manual for remediating the event. The system may recommend the administrator-specified operation manual and one or more additional operation manuals to remedy the event. The system may identify the additional operation manuals based on similarities between the topology associated with the detected event and the topologies associated with the additional operation manuals.

One or more embodiments receive user feedback in response to presenting an operation manual recommendation. The user may interact with user interface elements of the GUI to indicate (a) that the recommendation is helpful in diagnosing the cause of and remediating the event, or (b) that the recommendation is not helpful. For example, the system may recommend two operation manuals based on a detected event. One operation manual may include operations associated with applications running in the monitored system. The other operation manual may include operations associated with applications that are not running in the system. The user may select an icon representing positive feedback to indicate that the former operation manual is helpful in remedying the event. The user may select another icon representing negative feedback to indicate that the latter operation manual does not help remediate the event because it is for applications that are not running in the system. The system may update the operation manual ratings based on the user feedback. The next time the same event occurs, the system may avoid recommending the latter operation manual to remedy the event based on the reduced rating associated with that operation manual.
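The feedback loop above can be illustrated with a small rating store. This is a sketch under stated assumptions: the text does not specify how ratings are stored or updated, so the per-event rating key, the fixed step size, the cut-off value, and the class and manual names are all hypothetical.

```python
class RatingStore:
    """Hypothetical per-event ratings for operation manuals, kept in the range [0, 1]."""

    def __init__(self, initial=0.5, step=0.25):
        self.initial = initial
        self.step = step
        self._ratings = {}

    def rating(self, event_type, manual):
        return self._ratings.get((event_type, manual), self.initial)

    def record_feedback(self, event_type, manual, helpful):
        """Raise the rating on positive feedback and lower it on negative feedback."""
        delta = self.step if helpful else -self.step
        new = min(1.0, max(0.0, self.rating(event_type, manual) + delta))
        self._ratings[(event_type, manual)] = new
        return new

    def recommend(self, event_type, candidates, min_rating=0.4):
        """Drop candidates rated below min_rating and order the rest best first."""
        kept = [m for m in candidates if self.rating(event_type, m) >= min_rating]
        return sorted(kept, key=lambda m: self.rating(event_type, m), reverse=True)


store = RatingStore()
event = "application crash"
store.record_feedback(event, "manual-running-app", helpful=True)
store.record_feedback(event, "manual-absent-app", helpful=False)

# The next time the same event occurs, the down-rated manual is no longer offered.
next_time = store.recommend(event, ["manual-absent-app", "manual-running-app"])
print(next_time)
```

Keying ratings by event type as well as manual keeps negative feedback for one kind of event from suppressing the same manual for unrelated events; that is a design choice of this sketch, not something the text requires.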

One or more embodiments preview one or more operation manual operations, without requiring the user to initiate the operations, to identify interesting operation results. The system performs one or more operations associated with an operation manual operation without the user initiating the operation manual operation. The system may perform the operation manual operation itself, or an operation that is not an operation manual operation but that provides insight into the operation manual operation. The system analyzes the results of the operation to determine whether the operation produces interesting results. For example, the system may determine whether the operation result is related to the detected event. The system recommends candidate operation manuals based on the relevance of the operation results associated with the candidate operation manuals to the diagnosis and/or remediation of the event. The system determines the relevance of a result to the diagnosis and/or remediation of the event based on (a) the result itself or (b) information associated with the result. The results may include data generated by executing one or more of the independently executable operations of the operation manual. The information associated with the results may include, for example, the data used to calculate the results, the type of data included in the results, the source of the results, the level of detail included in the results, and software/hardware component information included in the results. The information about the results may indicate whether the type of problem identified by the results is a problem detected in the target software/hardware component or environment affected by the event.

The system may generate information about the results by analyzing the results generated by executing the operation(s) of the candidate operation manual before the user selects the operation manual for remediating the event. Alternatively or additionally, the system may obtain information about the results by performing a query that returns information from a database (rather than obtaining the results themselves by performing the operation(s) of the candidate operation manual).

For example, the system may identify the event "communication lost with router". The system may identify candidate operation manuals associated with the event "communication lost with router". The system may preview one or more operation manual operations by either of the following: (a) performing an operation specified in a step of the operation manual, or (b) performing an operation that is not specified in any step of the operation manual but that provides information about an operation specified in the operation manual. If an operation manual step specifies the operation "check power connection to network router," the system may preview the operation by checking power to the network router before the user performs the operation. Alternatively, the system may preview the operation by performing an operation that provides information about the power connection to the network router, such as transmitting a status check signal to the router. If the router returns a status response, the system may determine that the router is powered on. Because the router is powered on, the system may determine that the router's power is unlikely to be contributing to the detected event, and thus that the operation is not "interesting" in the context of the event. Accordingly, the system may reduce the relevance score of the operation manual.

According to another example, if the step specifies the operation "check network router status in status screen", the system may preview the operation manual operation by analyzing the data transfer log between the router and adjacent components in the network. If the router is transmitting the expected amount of data, the system may determine that the "check network router status in status screen" step is unlikely to contribute to remediating the event and is therefore not "interesting". Accordingly, the system may reduce the relevance score of the associated operation manual.

As another example, the system may identify an event marked as "data loss exceeds a threshold". The system may identify a set of candidate operation manuals associated with the event type "data loss exceeds a threshold". Before recommending any operation manual for remediating the event, the system previews the operation manuals by performing one or more operations associated with the candidate operation manuals. For example, the system may perform a step specified in an operation manual to "check the data transmission queue log of server A". The system may compare data associated with the data transmission queue log to threshold data prior to recommending an operation manual. For example, the system may determine whether the data transmission queue is full. Based on the results of the performed operations, the system determines whether the particular operation manual includes operations having "interesting" results related to the detected event, and determines whether to recommend the operation manual on that basis. If the system determines that the result of a performed operation is likely to lead to remediation of the identified event (e.g., "data loss exceeds a threshold"), the system increases the relevance score of the associated operation manual. The system may rank a plurality of operation manuals. The system may recommend the one or more operation manuals with the highest relevance scores to remedy the event.
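The preview-and-score flow in the router and data loss examples above can be sketched as follows. The sketch assumes each operation exposes a side-effect-free preview that reports whether its result is "interesting", and that the relevance score moves by a fixed amount per operation; those mechanics, the sample state, and all class and manual names are illustrative assumptions, not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreviewableOperation:
    description: str
    preview: Callable[[], bool]  # assumed contract: True when the result is "interesting"


@dataclass
class CandidateManual:
    name: str
    operations: List[PreviewableOperation]
    relevance: float = 0.0


def score_by_preview(candidates, reward=1.0, penalty=1.0):
    """Run each preview, raise or lower relevance, and return the manuals best first."""
    for manual in candidates:
        for op in manual.operations:
            manual.relevance += reward if op.preview() else -penalty
    return sorted(candidates, key=lambda m: m.relevance, reverse=True)


# Hypothetical observed state: the router answers status checks and the queue is full.
router = {"responds_to_status_check": True}
queue = {"depth": 100, "capacity": 100}

candidates = [
    CandidateManual("check-router-power", [
        PreviewableOperation(
            "check power connection to network router",
            lambda: not router["responds_to_status_check"],  # powered on, so not interesting
        ),
    ]),
    CandidateManual("drain-transmission-queue", [
        PreviewableOperation(
            "check the data transmission queue log of server A",
            lambda: queue["depth"] >= queue["capacity"],  # full queue is interesting
        ),
    ]),
]

ranked = score_by_preview(candidates)
print([(m.name, m.relevance) for m in ranked])
```

Because the router responds to the status check, the power check is scored down, while the full transmission queue pushes the second manual to the top of the ranking.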

In one or more embodiments, the system presents an operation manual selection interface to allow a user to select events for which the user wants to execute an operation manual. When a user selects an interface element associated with an event in the operation manual selection interface, the system may perform operations associated with steps of one or more candidate operation manuals associated with the event.

One or more embodiments described in the specification and/or recited in the claims may not be included in this general overview section.

2. System architecture

FIG. 1A illustrates a system 100 in accordance with one or more embodiments. As shown in FIG. 1A, system 100 includes a monitored system 110, an event remediation platform 120, and a data repository 130. In one or more embodiments, the system 100 may include more or fewer components than those shown in FIG. 1A. The components shown in FIG. 1A may be located locally or remotely relative to each other. The components shown in FIG. 1A may be implemented in software and/or hardware. Each component may be distributed across multiple applications and/or machines. Multiple components may be combined into one application and/or machine. Operations described with respect to one component may instead be performed by another component.

Additional embodiments and/or examples related to computer networks are described below in Section 8, entitled "Computer network and cloud network".

In one or more embodiments, data repository 130 is any type of storage unit and/or device (e.g., a file system, a database, a collection of tables, or any other storage mechanism) for storing data. In addition, the data repository 130 may include a plurality of different storage units and/or devices. The plurality of different storage units and/or devices may or may not be of the same type or located at the same physical site. In addition, the data repository 130 may be implemented or executed on the same computing system as the event remediation platform 120. Alternatively or additionally, the data repository 130 may be implemented or executed on a computing system separate from the event remediation platform 120. The data repository 130 may be communicatively coupled to the event remediation platform 120 via a direct connection or via a network.

Information describing the system topology 131, system data 132, operation manuals 133, historical system topologies 134, and historical system events 135 may be implemented across any component within the system 100. However, for purposes of clarity and explanation, this information is shown within data repository 130.

In one or more embodiments, the event remediation platform 120 refers to hardware and/or software configured to perform the operations described herein for recommending an operation manual for diagnosing the cause of a detected event and remediating the detected event. Examples of operations for recommending an operation manual for diagnosing the cause of a detected event and remedying the detected event are described below with reference to FIGS. 2 to 4. An example of operations for previewing operation manual operations to identify interesting operations, in order to recommend an operation manual for diagnosing and/or remediating an event, is described below with reference to FIG. 5.

In an embodiment, the event remediation platform 120 is implemented on one or more digital devices. The term "digital device" refers to any hardware device that includes a processor. A digital device may refer to a physical device or virtual machine that executes an application. Examples of digital devices include computers, tablet computers, laptop computers, desktops, netbooks, servers, web servers, network policy servers, proxy servers, general-purpose machines, function-specific hardware devices, hardware routers, hardware switches, hardware firewalls, hardware Network Address Translators (NATs), hardware load balancers, mainframes, televisions, content receivers, set-top boxes, printers, mobile phones, smart phones, personal digital assistants ("PDAs"), wireless receivers and/or transmitters, base stations, communication management devices, routers, switches, controllers, access points, and/or client devices.

The event remediation platform 120 includes a data collection engine 121. The data collection engine 121 collects data from the monitored system 110, such as log data, sensor data, analog and digital device status data, and program status data. The data collection engine 121 may also obtain system data 132 from the data repository 130. The system data 132 may include log data, sensor data, and metric values of system performance metrics generated by the monitored system 110.

The event detection engine 122 monitors the data obtained by the data collection engine 121 to detect events in the monitored system 110. For example, the event detection engine 122 may monitor activity logs generated by one or more applications running in the system 110, as well as data from sensors that generate output values based on characteristics of devices in the system 110, to detect faults of one or more components in the system 110. Examples of events include: a computing device failing or operating below a defined threshold, an application failing or operating below a defined threshold, access to a device or application by an unauthorized entity, a data transfer rate falling below a defined threshold, data latency along a communication channel rising above a defined threshold, data loss along a communication channel rising above a defined threshold, and a sensor level of a system monitoring component exceeding or failing to meet a defined threshold.
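Several of the event types listed above are threshold comparisons on collected metrics, which can be sketched as a small rule table. The metric names, threshold values, and rule structure below are hypothetical and chosen only to illustrate the idea; the text does not specify how the event detection engine 122 represents its thresholds.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ThresholdRule:
    metric: str
    compare: Callable[[float, float], bool]  # e.g. operator.lt or operator.gt
    threshold: float
    event_name: str


# Hypothetical rules mirroring three of the event types listed above.
RULES = [
    ThresholdRule("transfer_rate_mbps", operator.lt, 10.0, "data transfer rate below threshold"),
    ThresholdRule("latency_ms", operator.gt, 200.0, "data latency above threshold"),
    ThresholdRule("loss_pct", operator.gt, 1.0, "data loss exceeds a threshold"),
]


def detect_events(sample, rules=RULES):
    """Return the names of all events whose rule fires for this metric sample."""
    return [
        rule.event_name
        for rule in rules
        if rule.metric in sample and rule.compare(sample[rule.metric], rule.threshold)
    ]


sample = {"transfer_rate_mbps": 42.0, "latency_ms": 350.0, "loss_pct": 2.5}
events = detect_events(sample)
print(events)
```

Metrics missing from a sample are skipped rather than treated as faults, so a partial sample only triggers the rules it has data for.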

Topology identification engine 123 identifies the topology of monitored system 110. The topology includes physical components and interconnections between physical components. Examples of physical components that make up the system topology include computing devices, communication channels connecting computing devices, power supplies, power channels, device storage fixtures, cooling components, and system monitoring components. The topology also includes applications running on the physical components, configurations of sub-components and software on the physical components, and data stored by the physical components. For example, topology identification engine 123 can identify, as part of the topology of system 110, a database storing data type A and data type B associated with two different tenants, a server connected to the database to allow access to the database, an application running on a virtual machine hosted by the server to perform queries to the database, a communication channel between the server and the database, and a power channel from a power supply to the server and the database. Topology identification engine 123 identifies components of system topology 131 based on one or both of (a) user input via user interface 125 and (b) detection of attributes of the system components without user input. For example, in a cloud-based system, a user may select components to include in a computing environment. Topology identification engine 123 can identify the user selections and the physical devices maintained by the cloud that are associated with the user selections (e.g., remote devices maintained by the cloud environment management entity). As another example, the system may detect, via a communication protocol and without receiving user input, when a new device is added to the cloud environment. For example, the cloud environment management entity may connect a firewall device to a server associated with the user selection. The topology identification engine 123 may identify characteristics of the firewall device, such as port information, applications running on the firewall device, and cloud devices connected to the firewall device, without user prompting. According to one embodiment, topology identification engine 123 adds newly detected devices, applications running on the devices, and other detected hardware to system topology 131.
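A topology of the kind described above could be held in a simple graph structure. The sketch below is one possible in-memory form, assuming undirected links between named components; the class names, fields, and sample components are hypothetical, and the disclosure does not prescribe any particular data structure for system topology 131.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    kind: str  # e.g. "server", "database", "firewall"
    applications: list = field(default_factory=list)


@dataclass
class SystemTopology:
    components: dict = field(default_factory=dict)
    links: set = field(default_factory=set)  # each link is a frozenset of two names

    def add_component(self, component, connected_to=()):
        """Register a newly detected component and its links to known components."""
        for other in connected_to:
            if other not in self.components:
                raise KeyError(f"unknown component: {other}")
        self.components[component.name] = component
        for other in connected_to:
            self.links.add(frozenset({component.name, other}))

    def neighbors(self, name):
        """Names of the components directly linked to the given component."""
        return sorted(n for link in self.links if name in link for n in link if n != name)


topology = SystemTopology()
topology.add_component(Component("server", "server", ["query-app"]))
topology.add_component(Component("database", "database"), connected_to=["server"])
# A firewall device detected later is added without user prompting.
topology.add_component(Component("firewall", "firewall"), connected_to=["server"])
print(topology.neighbors("server"))
```

Links are validated before anything is stored, so a failed call leaves the topology unchanged.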

According to one or more embodiments, topology identification engine 123 includes user interface elements, such as display elements displayed on user interface 125, to allow an administrator to specify topology elements of monitored system 110. For example, an administrator may provide descriptions of components such as "primary server," "backup server," "client A server," "client B server," and "database - client A and client B." Topology identification engine 123 can store topology data provided by an administrator in system topology 131.

The operation manual selection engine 124 performs operations to select an operation manual for diagnosing the cause of the detected event and/or remedying the detected event. For example, the operation manual selection engine 124 may compare the system topology 131 associated with the detected event to the historical system topology 134 of the same system 110 or other systems to identify other operation manuals associated with the historical system topology 134 having particular similarities to the system topology 131. The operation manual selection engine 124 may recommend one or more operation manuals associated with the detected event for execution based on the similarity between the system topology 131 and the historical system topology 134. According to another example, the operation manual selection engine 124 may identify similarities between the system topology 131 and system topology components associated with particular steps of different operation manuals 133. The operation manual selection engine 124 may identify system components associated with the steps of the operation manual 133. The operation manual selection engine 124 may recommend one or more operation manuals 133 for execution based on the similarity between the system components of the operation manual 133 and the system components of the system topology 131.

The operation manual selection engine 124 performs operations to select an operation manual for diagnosing and/or remediating the detected event. The operation manual selection engine 124 identifies candidate operation manuals associated with the detected event from among the stored operation manuals 133. For example, if an event includes a description of an "application crash," then the operation manual selection engine 124 may identify ten different operation manuals 133 that include a description of an "application crash." The operation manual selection engine 124 analyzes the attributes associated with the ten different candidate operation manuals 133 to select one or more operation manuals to recommend for diagnosing and/or remedying the application crash.
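The candidate identification step can be illustrated with a plain description match. This sketch assumes case-insensitive substring matching between the event description and each manual's description; the text does not state how descriptions are compared, and the manual names and descriptions are invented for the example.

```python
def find_candidates(event_description, manuals):
    """Return the names of manuals whose description mentions the event description."""
    needle = event_description.casefold()
    return [
        name for name, description in manuals.items()
        if needle in description.casefold()
    ]


# Hypothetical stored operation manuals: name -> description.
manuals = {
    "restart-app": "Steps to recover from an application crash",
    "rotate-logs": "Steps to rotate full log volumes",
    "collect-dump": "Collect diagnostics after an Application Crash",
}

candidates = find_candidates("application crash", manuals)
print(candidates)
```

The resulting candidate list is only the first filter; the attribute analysis described above would then narrow it to the manuals actually recommended.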

In accordance with one or more embodiments, the operation manual selection engine 124 compares the system topology 131 associated with the detected event to the historical system topology 134 of the same system 110 or other systems to identify other operation manuals associated with the historical system topology 134 having particular similarities to the system topology 131. The operation manual selection engine 124 may recommend one or more operation manuals associated with the detected event for execution based on the similarity between the system topology 131 and the historical system topology 134. According to another example, the operation manual selection engine 124 may identify similarities between the system topology 131 and system topology components associated with particular steps of different operation manuals 133. The operation manual selection engine 124 may identify system components associated with the steps of the operation manual 133. The operation manual selection engine 124 may recommend one or more operation manuals 133 for execution based on the similarity between the system components of the operation manual 133 and the system components of the system topology 131.

According to another example, the system may recommend an operation manual for execution based on the similarity between the detected event and one or more historical system events 135. For example, the operation manual selection engine 124 may identify attributes associated with the detected event. Attributes may include, for example, log values, sensor values, and topology characteristics. The operation manual selection engine 124 may identify similarities between the attributes of the detected event and the attributes of one or more historical system events 135. The operation manual selection engine 124 may identify operation manuals that were applied to the historical system events 135. Based on the similarity between the attributes of the historical system events 135 and those of the currently detected event, the operation manual selection engine 124 may select one or more of the operation manuals applied to the historical system events 135 to apply to the currently detected event.

According to another example, the operation manual selection engine 124 previews one or more operations of the candidate operation manuals to identify interesting operation results. The operation manual selection engine 124 applies relevance scores to candidate operation manuals based on whether the operations generated interesting results. Specifically, the operation manual selection engine 124 performs one or more operations associated with the candidate operation manuals to determine which of the candidate operation manuals to recommend for execution to remedy the event. The operation manual selection engine 124 may execute one or more of the independently executable operations that constitute an operation manual to obtain an operation result. Alternatively, the operation manual selection engine 124 may perform an operation that is not explicitly specified in any independently executable operation, but that provides information about the relevance of the operation to remedying the detected event.

For example, the candidate operation manual may include an operation of "check port configuration" among a set of independently executable operations. According to one embodiment, the operation manual selection engine 124 may perform the operation of checking the port configuration prior to presenting the candidate operation manual as a recommendation for remedying the event. Specifically, the operation manual selection engine 124 may compare data that is stored in memory and describes the port configuration with the expected port configuration. The operation manual selection engine 124 analyzes the results of the operation with respect to the detected event to determine whether the results are of interest. Based on the analysis, the operation manual selection engine 124 may assign or adjust a relevance score for the candidate operation manual. If the operation manual selection engine 124 determines that the operation does not provide interesting results, the operation manual selection engine 124 may decrease the relevance score of the candidate operation manual. For example, if the port configuration is consistent with the expected configuration, the operation manual selection engine 124 may determine that checking the port configuration is unlikely to remedy the detected event, and thus that the results are not interesting with respect to the detected event. The system may correspondingly decrease the relevance score of the associated candidate operation manual. On the other hand, if the operation manual selection engine 124 determines that the port configuration values are inconsistent with the expected configuration, the operation manual selection engine 124 may determine that the results are interesting with respect to the event. The operation manual selection engine 124 may increase the relevance score of the candidate operation manual, indicating that the operation of checking the port configuration is likely to remedy the detected event.
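
As a non-limiting illustration, the port-configuration preview described above could be sketched as follows. The function name, the dictionary layout of the configuration data, the clamping to the range 0 to 1, and the step size of 0.1 are assumptions made only for this sketch and are not specified by the present disclosure.

```python
def preview_port_check(stored_config, expected_config, score, step=0.1):
    """Adjust a candidate operation manual's relevance score after a preview.

    A mismatch between the stored and expected port configuration is treated
    as an "interesting" result and raises the score; a match lowers it.
    The step size and the [0, 1] score range are illustrative assumptions.
    """
    # Collect every expected setting whose stored value differs.
    mismatched = {
        key: (stored_config.get(key), expected)
        for key, expected in expected_config.items()
        if stored_config.get(key) != expected
    }
    interesting = bool(mismatched)
    delta = step if interesting else -step
    # Clamp so repeated previews cannot push the score out of range.
    new_score = min(1.0, max(0.0, score + delta))
    return new_score, mismatched


expected = {"port": 443, "protocol": "tcp", "state": "open"}
matching = {"port": 443, "protocol": "tcp", "state": "open"}
drifted = {"port": 443, "protocol": "tcp", "state": "closed"}

lowered, _ = preview_port_check(matching, expected, score=0.5)
raised, diff = preview_port_check(drifted, expected, score=0.5)
```

In this sketch, the matching configuration lowers the score and the drifted configuration raises it, and `diff` records which setting deviated from the expected configuration.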

According to another embodiment, the operation manual selection engine 124 may perform an operation associated with the candidate operation manual before the user selects the candidate operation manual, where the operation is not one of the specific independently executable operations defined by the candidate operation manual. For example, if the candidate operation manual includes an operation to "check port configuration," the operation manual selection engine 124 may perform an operation to check a transmission log associated with the port of the target component. If the operation manual selection engine 124 determines that the result of the operation is not interesting with respect to the detected event, the operation manual selection engine 124 may decrease the relevance score of the candidate operation manual. For example, if the operation manual selection engine 124 determines that the transmission log indicates normal data transmission to and from the data port when the detected event occurred, the operation manual selection engine 124 may determine that the results are not interesting with respect to the detected event. The operation manual selection engine 124 assigns a low relevance score to the candidate operation manual. Conversely, if the operation manual selection engine 124 determines that the transmission log indicates an interruption of data transmission to and from the data port, the operation manual selection engine 124 may increase the relevance score assigned to the candidate operation manual, indicating that the candidate operation manual's operation of checking the port configuration is likely to remedy the detected event.

According to yet another example, the system may consider a combination of criteria to adjust the relevance score of the candidate operating manual. The criteria may include, for example: (a) whether the operation of the operation manual generates interesting results, (b) whether the attributes of the detected event are similar to the attributes of the event associated with the candidate operation manual, and (c) whether the topology associated with the detected event is similar to the topology associated with the candidate operation manual. The operation manual selection engine 124 may adjust the relevance score of the candidate operation manual based on the similarity between the detected event and one or more historical system events 135. For example, the operation manual selection engine 124 may identify similarities between the attributes of the detected event and the attributes of one or more historical system events 135. The operation manual selection engine 124 may identify an operation manual that is applied to the historical system events 135. The operation manual selection engine 124 may increase the relevance score of one or more candidate operation manuals based on the similarity between the historical system events 135 associated with the candidate operation manuals and the attributes of the currently detected events.
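
As a non-limiting illustration, criteria (a), (b), and (c) could be combined into a single relevance score as sketched below. The weighted-sum form, the weights, the field names, and the example manual names are assumptions made only for this sketch; the present disclosure does not prescribe a particular formula.

```python
from dataclasses import dataclass


@dataclass
class CandidateSignals:
    name: str
    interesting_preview: bool    # criterion (a)
    event_similarity: float      # criterion (b), in the range 0..1
    topology_similarity: float   # criterion (c), in the range 0..1


# Hypothetical weights chosen for illustration only.
WEIGHTS = {"preview": 0.4, "event": 0.3, "topology": 0.3}


def relevance_score(signals):
    """Combine the three criteria into one score using a weighted sum."""
    preview = 1.0 if signals.interesting_preview else 0.0
    return (
        WEIGHTS["preview"] * preview
        + WEIGHTS["event"] * signals.event_similarity
        + WEIGHTS["topology"] * signals.topology_similarity
    )


candidates = [
    CandidateSignals("restart gateway", True, 0.5, 0.5),
    CandidateSignals("rotate credentials", False, 0.9, 0.9),
    CandidateSignals("check port configuration", True, 0.8, 0.6),
]
# Highest relevance first.
ranked = sorted(candidates, key=relevance_score, reverse=True)
```

With these assumed weights, a candidate whose preview produced an interesting result can outrank a candidate with higher similarity values but no interesting preview result.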

According to one or more embodiments, the operation manual selection engine 124 selects one or more operations to perform from among the independently executable operations of the candidate operation manual based on one or both of event data associated with the detected event and topology data associated with the detected event. For example, if a detected event is assigned the name "communication lost with a server," the system may analyze the event data to identify the time of the event, the ID of the server with which communication was lost, and log data identifying the application running on the server. The system may analyze the system topology 131 to identify components in the monitored system 110 that communicate with the server. The system may identify a set of candidate operating manuals associated with an event type "communication lost with the server". Based on the event data and topology data, the operation manual selection engine 124 may select a set of preview operations to perform with respect to one or more of the candidate operation manuals. For example, the operation manual selection engine 124 may perform operations associated with the independently executable operations of the candidate operation manual to "check for application login on the server" based on the obtained event data. According to another example, the operation manual selection engine 124 may perform operations associated with independently executable operations of another candidate operation manual to "check the status of a gateway in communication with a server" based on the obtained system topology data. 
The operation manual selection engine 124 selects one or more of the candidate operation manuals to recommend for diagnosing and/or remedying the "communication lost with the server" event based on (a) whether the operation of the operation manual generates interesting results, (b) whether the attributes of the detected event are similar to the attributes of the event associated with the candidate operation manual, and (c) whether the topology associated with the detected event is similar to the topology associated with the candidate operation manual.

In accordance with one or more embodiments, the operation manual selection engine 124 includes a Graphical User Interface (GUI) generator to display a GUI on the user interface 125. The GUI may include an operation manual selection interface. The operation manual selection interface may display an event and one or more recommended operation manuals for remedying the event. The GUI may display interface elements to allow the user to provide feedback regarding recommended manuals.

The event remediation platform 120 may display one or more operation manuals for selection by the user via the user interface 125. Further, the user may perform the steps of the selected operation manual via the user interface 125. In one or more embodiments, the interface 125 refers to hardware and/or software configured to facilitate communication between a user and the event remediation platform 120. Interface 125 renders user interface elements and receives input via the user interface elements. Examples of interfaces include a Graphical User Interface (GUI), a Command Line Interface (CLI), a haptic interface, and a voice command interface. Examples of user interface elements include check boxes, radio buttons, drop down lists, list boxes, buttons, switches, text fields, date and time selectors, command lines, sliders, pages, and forms.

In an embodiment, different components of interface 125 are specified in different languages. The behavior of the user interface element is specified in a dynamic programming language (such as JavaScript). The content of the user interface element is specified in a markup language, such as hypertext markup language (HTML) or XML user interface language (XUL). The layout of the user interface elements is specified in a style sheet language, such as Cascading Style Sheets (CSS). Alternatively, the interface 125 may be specified in one or more other languages (such as Java, C, or C++).

3. Example System Topology

FIG. 1B illustrates an example of a system topology 131 in accordance with one or more embodiments. "System topology" refers to the overall architecture, arrangement, types, dependencies, and/or usage of resources in the monitored system 110.

In accordance with one or more embodiments, the system uses topology metadata to generate a system topology 131. The topology metadata includes information describing the types of target components deployed and involved in the execution of an application. Example target types may include, but are not limited to, cloud services, syndication services, and other types of software services, clusters, groups, hosts, Java Virtual Machines (JVMs), JVM pools, applications, servers, database instances, OS services, Central Processing Units (CPUs), network ports, memory pools, and any other classification of software or hardware resources.

In one or more embodiments, the topology metadata includes information describing dependencies and/or other relationships between the targets. For example, the topology graph may show that one node (corresponding to a target resource) is connected to another node (corresponding to a different target resource), indicating that the two nodes and their corresponding target resources have a relationship with each other. If one target resource is "connected" to another target resource in the topology, then the two resources are determined to be functionally associated with each other. In various embodiments, a relationship may indicate more than a connection between two nodes, such as a functionality and/or a direction associated with the connection. For example, functionality is expressed in the relationships "A runs on B," "A stores on B," and "A uses B as a load balancer." Direction is expressed in the relationships "A uses B," "B uses A," or even "B uses A and A uses B." The topology graph can be traversed to determine which resources are functionally dependent on other resources and/or to obtain other relationship information. For example, in the context of an application server, a topology graph may have a node corresponding to the application server connected to several applications, indicating that the server is "connected" to each application. The topology graph may also indicate that each application is functionally dependent on the application server.
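
As a non-limiting illustration, a topology graph with directed, labeled relationships could be traversed as sketched below to find every resource that functionally depends on a given resource. The edge list, the resource names, and the function name are hypothetical and are used only for this sketch.

```python
from collections import defaultdict, deque

# Each edge is (source, relationship, target), read as "source <relationship> target".
# All resource names below are hypothetical.
EDGES = [
    ("app-1", "runs on", "app-server"),
    ("app-2", "runs on", "app-server"),
    ("app-server", "runs on", "host-1"),
    ("app-1", "stores on", "db-1"),
]


def dependents_of(resource, edges):
    """Return every node that directly or transitively depends on `resource`."""
    # Invert the edges so each target maps to the sources that rely on it.
    reverse = defaultdict(set)
    for source, _relationship, target in edges:
        reverse[target].add(source)

    seen, queue = set(), deque([resource])
    while queue:  # breadth-first traversal over the inverted edges
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


affected_by_host = dependents_of("host-1", EDGES)
affected_by_db = dependents_of("db-1", EDGES)
```

In this sketch, a failure of the hypothetical `host-1` affects the application server and both applications, while a failure of `db-1` affects only `app-1`.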

Complex software applications often include multiple tiers or layers. Each "tier" or "layer" of a multi-tier application represents a different logical and/or physical element responsible for a different set of functions. The number and configuration of the tiers within the multi-tier architecture may vary depending on the particular implementation. For example, a three-tier system may include: a presentation tier comprising logic for displaying and/or receiving information; an application tier including logic for implementing application-specific functions; and a data tier including logic for storing and retrieving data. In other examples, the multi-tier architecture may include, in addition to or instead of the previously listed tiers, a web tier including logic for processing web requests and/or a middleware tier including logic for connecting other tiers within the architecture, and/or any other tier including one or more software and/or one or more hardware components. Topology metadata may describe relationships between target resources in the same tier and in different tiers, including the types of targets deployed at each tier.

In a clustered environment, topology metadata can capture which software components are deployed across multiple physical and/or virtual hosts. For example, the topology metadata may indicate that a first instance of an application is executing on a first server/host, that a second instance of the application is executing on a second server/host, and so on. In this example, the first instance of the application is functionally dependent on the server and host on which it is executing, but not on the servers and hosts of other nodes in the cluster. However, if one of the other nodes in the cluster becomes inoperable, traffic on the first node may increase.

The system topology 131 may include a physical topology and a virtual topology. For example, the physical topology may include general purpose computing machines 141, 142, 143, and 144. A general purpose computing machine may be, for example, a server. The physical topology may include hardware routers 145 and hardware firewall devices 146. The physical topology of system topology 131 may include more or fewer digital devices than those shown in FIG. 1B. Each digital device is represented as a box. Each digital device may be connected to any number of other digital devices within the physical topology. The digital devices may be located in a single geographic location or distributed across various geographic locations. The physical devices may include physical ports 147 and 148. Physical ports 147 and 148 may connect physical devices via wires. Additionally, or alternatively, one or more of devices 141-146 may communicate wirelessly.

In an embodiment, system topology 131 may correspond to a cloud network. The digital devices shown in system topology 131 can be shared among multiple client devices and/or tenants. A particular digital device may perform the same function for different client devices and/or tenants. A particular digital device may perform different functions for different client devices and/or tenants.

According to one or more embodiments, the system topology 131 includes a virtual topology instantiated on the physical topology. Referring to FIG. 1B, elements of the virtual topology include nodes 149 and 150, virtual machines 151 and 152, a firewall 155, and a virtual router 156. In one embodiment of the invention, a node is a representation of a managed entity or application type. A node may represent hardware (such as a managed host) or software (such as an application). In the example shown in FIG. 1B, node 149 includes an application 153 and node 150 includes an application 154. In one embodiment of the invention, for the application view type, a node corresponding to the application type is also generated. According to one embodiment, when the system instantiates a node, the system populates the node with data from the managed entity table and the dynamic state table. The system may set the tier of a node based on the type of entity the node represents. In one embodiment of the invention, the managed entities may be organized into one of three tiers based on the type of entity: a web tier, a middleware tier, and a database tier.

There are a number of ways to instantiate, on a physical topology, a virtual topology described by a given virtual topology specification. Instantiation of a virtual topology on a physical topology includes mapping Virtual Topology Entities (VTEs) described in the virtual topology specification to digital devices of the physical topology.

Each VTE is associated with one or more functions. Examples of functions include data routing, data filtering, data checking, data storage, and/or any other type of data processing function.

The virtual topology is instantiated on a physical topology based on a virtual topology specification. During instantiation, the VTEs of the virtual topology specification are mapped to the digital devices of the physical topology. A VTE may correspond to the digital device itself or be a virtual component executing on the digital device. A single VTE may map to multiple digital devices. Conversely, multiple VTEs may be mapped to a single digital device. A particular digital device mapped to a particular VTE implements a function corresponding to the particular VTE. The virtual topology specification may, but need not, include any reference to the physical topology or digital devices therein. The virtual topology specification may, but need not, specify which digital devices of the physical topology perform which functions of which VTEs.

Multiple computer networks implemented according to respective virtual topologies may be instantiated on a single physical topology. As an example, multiple tenants may share a collection of digital devices arranged according to a physical topology. Each tenant may have a different desired arrangement of VTEs. Each arrangement of VTEs corresponds to a different virtual topology. Each virtual topology of the respective tenant may be instantiated on a physical topology.

The VTEs in the virtual topology may be executed in an overlay network. The overlay network is implemented on top of an underlay network corresponding to the physical topology. Each VTE is associated with two addresses: (a) an overlay address corresponding to the VTE and (b) an underlay address corresponding to the digital device on which the VTE is instantiated. An address may be fixed (e.g., entered by a network administrator). Additionally or alternatively, an address may be dynamically assigned (e.g., via Dynamic Host Configuration Protocol (DHCP) and/or another application). Data is transported between VTEs in the virtual topology by tunneling through the underlay network.
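
As a non-limiting illustration, the two addresses of a VTE and the tunneling of overlay data through the underlying network could be sketched as follows. The class name, the dictionary-based packet layout, and the example addresses are assumptions made only for this sketch; no particular tunneling protocol is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VTE:
    name: str
    overlay_addr: str   # address of the VTE in the overlay network
    underlay_addr: str  # address of the digital device hosting the VTE


def encapsulate(payload, src, dst):
    """Wrap an overlay packet in an outer header addressed on the underlay."""
    inner = {"src": src.overlay_addr, "dst": dst.overlay_addr, "payload": payload}
    return {"src": src.underlay_addr, "dst": dst.underlay_addr, "inner": inner}


def decapsulate(packet):
    """Strip the outer header at the receiving device and return the overlay packet."""
    return packet["inner"]


# Example addresses are hypothetical.
node_149 = VTE("node 149", overlay_addr="10.0.0.149", underlay_addr="192.0.2.10")
vm_151 = VTE("virtual machine 151", overlay_addr="10.0.0.151", underlay_addr="192.0.2.20")

packet = encapsulate(b"hello", node_149, vm_151)
delivered = decapsulate(packet)
```

The outer header carries only device addresses, so the underlying network can forward the packet without any knowledge of the overlay addresses carried inside it.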

The system topology 131 specifies how data should traverse the VTEs. Data should traverse the VTEs according to the connections linking the VTEs. For example, data may be transferred from node 149 to virtual machine 151 by passing through firewall 155 and router 156. At the firewall 155, the data may be processed to perform firewall functionality associated with the firewall 155. Based on the firewall functionality, the data may be checked to determine if the data is allowed to pass. Further, at router 156, the data may be processed to perform routing functionality of router 156. Based on the routing functionality, the next hop of data may be identified as virtual machine 151. Router 156 may forward the data to virtual machine 151.
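
As a non-limiting illustration, the traversal from node 149 to virtual machine 151 could be derived from the connections of the virtual topology as sketched below. The link table is an assumption chosen to be consistent with the example above; the actual connections are those shown in FIG. 1B.

```python
from collections import deque

# Assumed connections between VTEs, for illustration only.
LINKS = {
    "node 149": ["firewall 155"],
    "firewall 155": ["router 156"],
    "router 156": ["virtual machine 151", "virtual machine 152"],
}


def find_path(source, destination, links):
    """Return the shortest sequence of VTEs from source to destination, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for next_hop in links.get(path[-1], []):
            if next_hop not in visited:
                visited.add(next_hop)
                queue.append(path + [next_hop])
    return None


route = find_path("node 149", "virtual machine 151", LINKS)
```

Each VTE on the returned route would apply its own function to the data, such as filtering at the firewall and next-hop selection at the router.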

FIG. 2 illustrates an example set of operations for recommending an operation manual for diagnosing and/or remediating an event based on system topology in accordance with one or more embodiments. One or more of the operations shown in FIG. 2 may be modified, rearranged, or omitted entirely. Thus, the particular order of operations illustrated in FIG. 2 should not be construed as limiting the scope of one or more embodiments.

The system may include one or more monitoring devices and monitoring applications. The system detects events associated with the monitored system through the monitoring devices and/or applications (operation 202). For example, the system may identify a value or event in an event log associated with the monitoring application. The event log may identify: sensor values, data throughput values, data storage values, application states (e.g., "OK" and "no response"), and access events, such as a request from an entity to access an application or device. According to one example embodiment, the system identifies abnormal sensor values or abnormal application states that are outside of a threshold for a proper operating state. According to another example embodiment, the system detects a request to access a system component from an entity that is not authorized to access that component. In one example embodiment, the system may detect a successful access attempt by an unauthorized or unrecognized entity.
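
As a non-limiting illustration, operation 202 could be sketched as a scan of event log entries for sensor values outside a threshold range and for abnormal application states. The entry layout, the metric names, and the threshold values are hypothetical and are used only for this sketch.

```python
# Hypothetical acceptable ranges (low, high) per sensor metric.
THRESHOLDS = {"cpu_temp_c": (10.0, 85.0), "throughput_mbps": (50.0, 1000.0)}
# Application states treated as abnormal in this sketch.
ABNORMAL_STATES = {"no response"}


def detect_events(log_entries):
    """Return (component, kind, value) for every entry that indicates an event."""
    events = []
    for entry in log_entries:
        kind, value = entry["kind"], entry["value"]
        if kind == "app_state":
            if value in ABNORMAL_STATES:
                events.append((entry["component"], kind, value))
        elif kind in THRESHOLDS:
            low, high = THRESHOLDS[kind]
            if not (low <= value <= high):
                events.append((entry["component"], kind, value))
    return events


LOG = [
    {"component": "server-1", "kind": "cpu_temp_c", "value": 72.0},
    {"component": "server-2", "kind": "cpu_temp_c", "value": 97.5},
    {"component": "app-1", "kind": "app_state", "value": "OK"},
    {"component": "app-2", "kind": "app_state", "value": "no response"},
]
events = detect_events(LOG)
```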

In accordance with one or more embodiments, detecting events includes analyzing predictions generated by a machine learning model that monitors sensor data and other system data. According to one or more alternative embodiments, detecting an event includes analyzing user-generated event entries, such as a "ticket" generated by one user that notifies one or more additional users, such as a technician trained to maintain the system, of an exception. The ticket may be based on user complaints (e.g., "computer is not working"). Alternatively, the ticket may be based on data detected by the system (e.g., a sensor monitoring a component indicates that the component is operating outside a threshold range of values). In accordance with one or more embodiments, detecting an event includes comparing a log entry to a known or predicted exception. For example, a failure to log into an application or device may be detected based on log entries generated by the application or device. The system may analyze the log entries, identify events (e.g., login failures) based on the log entries, and collect system metadata associated with the events. The system metadata may include related log entries (e.g., previous login failures originating from the same device), user information, application information, device information, communication protocols, and security protocols.

The system obtains the system topology of the monitored system (operation 204). The system topology identifies system components such as computing devices, communication channels connecting the computing devices, power channels supplying power to the computing devices, applications running on the computing devices, and data stored in the computing devices. The topology may also identify access permissions for particular system components, such as which entities have permission to access different system hardware and software. The system topology may be stored in a data repository. The stored system topology may be updated based on changes in the system topology. According to alternative embodiments, the system may detect the system topology in real time or on demand as events are detected in the monitored system.

According to an example embodiment, the system may detect an event, such as a login failure in a device. The system may identify a topology associated with the device, including: the application that the user is attempting to log in to, other applications running on the device, the type of data stored on the device, a communication channel between the device and one or more additional devices, hardware and software that facilitate communication between the device and one or more additional devices, a communication path (e.g., a communication channel, a device, and a communication layer program) between the device and another device from which the login attempt was initiated, an additional device connected to the target device, a program running on the additional device, and data stored in the additional device. For example, the system may determine that a user is attempting to log in from a terminal external to the cloud computing environment to an application running on a computing node of the cloud computing environment. The system may identify a topology of the cloud computing environment, including additional computing nodes, intermediate tier nodes, databases, and security hardware (such as firewall devices). The system may also identify applications running on the additional devices.

The system identifies candidate system topologies (operation 206). The system may store a variety of different system topologies. A candidate system topology may be based on an actual system topology. For example, the candidate system topology may correspond to the configuration of the monitored system at a particular point in time in the past. Alternatively, the candidate system topology may correspond to the topology of a different system. Different system topologies may be categorized by system components and system functions. For example, if the monitored system includes a database and one or more servers running virtual machines, the system may select a candidate topology associated with a system that includes a database or a system that runs virtual machines. Examples of classifications of candidate topologies include characteristics of the system being monitored, such as: load distribution among servers, specific applications running on the system, a specific number of computing nodes in the cloud environment, and the presence of additional types of nodes (e.g., elastic nodes, intermediate tier nodes) in the cloud environment. The candidate topology may be associated with another entity associated with the same enterprise, such as a different department of a large corporation. Alternatively, the candidate topologies may be associated with the same type of division in different enterprises, such as two cloud computing environments associated with two different manufacturing companies or two different software companies. The candidate topology may be a topology template that is not based on the historical topology of an actual organization. For example, the set of candidate topologies may include a topology template associated with a lower data security level and another topology template associated with a higher data security level.

According to one or more embodiments, the candidate topology includes user-defined topology elements. For example, a user may generate an operation manual to remedy an event. In generating the operation manual, the user may generate a set of tags that specify event characteristics and topology elements associated with the operation manual. The tags may include the name of the event, the system components associated with the operations of the operation manual, the system applications associated with the operations of the operation manual, and the relationships between the components. The system may identify candidate topologies defined by user-generated tags associated with user-generated operation manuals. In addition, the user may generate tags to specify event characteristics and topology elements associated with an existing operation manual. For example, if a gateway device in the system is replaced with a newer model, the system administrator may delete or modify the tags of the operation manual associated with the previous model of the gateway device to ensure that the system presents an operation manual for remedying events associated with the newer model of the gateway device.

In accordance with one or more embodiments, the system may combine user-generated tags with system-generated tags or data. For example, a user may generate a pair of tags associated with a pair of system components. The system can identify relationships between the components without receiving user input directing the system to identify the relationships. The system may generate a new tag based on an identified relationship. For example, the user may generate the tags "server A" and "database A". The system may identify the database as the target of queries generated by the server. The system generates a tag to specify that the database communicates with the server. The system may generate another tag to specify that the database stores data accessed by queries from the server. The system may identify candidate topologies based on a combination of user-generated tags and system-generated tags.
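
As a non-limiting illustration, the combination of user-generated tags and system-generated tags could be sketched as follows. The function name, the representation of observed queries as (source, target) pairs, and the wording of the generated tags are assumptions made only for this sketch.

```python
def derive_tags(user_tags, observed_queries):
    """Return the user tags plus relationship tags inferred from observed queries.

    `observed_queries` is a list of (source, target) pairs. A relationship tag
    is generated only when both endpoints were tagged by the user.
    """
    tags = set(user_tags)
    for source, target in observed_queries:
        if source in user_tags and target in user_tags:
            tags.add(f"{target} communicates with {source}")
            tags.add(f"{target} stores data queried by {source}")
    return tags


user_tags = {"server A", "database A"}
# "server B" was not tagged by the user, so its query generates no tag.
observed = [("server A", "database A"), ("server B", "database A")]
combined = derive_tags(user_tags, observed)
```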

The system compares the system topology of the monitored system to the candidate topology to determine if a similarity criterion is met (operation 208). The similarity criteria may include, for example, determining that a threshold percentage of topology elements between the system topology and the candidate topology are the same. The system may perform a comparison on a predetermined number of candidate topologies to identify the candidate topology having the highest similarity to the system topology. For example, the system may identify three different candidate topologies among a set of hundreds of candidate topologies that meet a threshold similarity criterion with the system topology. The system may identify a candidate topology having the highest similarity to the system topology among the three candidate topologies.
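
As a non-limiting illustration, operation 208 could be sketched as follows. The sketch measures similarity as the fraction of shared topology elements (the Jaccard index), which is only one possible way to implement a threshold on identical topology elements. The function names, the threshold of 0.5, and the example topologies are hypothetical.

```python
def topology_similarity(system_elements, candidate_elements):
    """Return the fraction of elements shared by the two topologies (Jaccard index)."""
    a, b = set(system_elements), set(candidate_elements)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_candidate(system_elements, candidates, threshold=0.5):
    """Return (name of most similar candidate meeting the threshold, all scores)."""
    scores = {
        name: topology_similarity(system_elements, elements)
        for name, elements in candidates.items()
    }
    passing = {name: score for name, score in scores.items() if score >= threshold}
    if not passing:
        return None, scores
    return max(passing, key=passing.get), scores


system = {"database", "server", "virtual machine", "firewall"}
candidates = {
    "dept-A": {"database", "server", "virtual machine", "router"},
    "dept-B": {"database", "server", "virtual machine", "firewall", "gateway"},
    "template-low-security": {"server"},
}
best, scores = best_candidate(system, candidates)
```

In this sketch, two of the three candidates meet the threshold, and the candidate sharing four of five distinct elements with the system topology is selected.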

Based on determining that the candidate topology meets the similarity criteria with respect to the system topology, the system identifies an operation manual associated with the candidate topology (operation 210). For example, the system may determine that the operation manual was previously executed in a system having a topology similar to the system topology. The system may also determine that the operation manual was executed in the candidate system in connection with an event of the same type as the event detected in the monitored system.

The system presents an operation manual to diagnose the cause of the event and/or remedy the event detected in the monitored system (operation 212). For example, the system may provide user interface elements on a graphical user interface to allow a user to select an operation manual. Selecting the operation manual may cause the display of one or more user interface elements associated with the independently executable operations corresponding to the steps of the operation manual. In accordance with one or more embodiments, the system may identify a plurality of candidate topologies that satisfy the similarity criterion. The system may present the operation manual associated with the candidate topology having the highest similarity to the system topology. The system may rank a plurality of operation manuals associated with a plurality of candidate topologies based on the degree of similarity between each candidate topology and the system topology, and may present a predefined number of the ranked candidate operation manuals for remedying the event via the GUI. For example, the system may display the candidate operation manual associated with the topology having the highest similarity ranking above the candidate operation manual associated with the topology having the next highest similarity ranking.
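
As a non-limiting illustration, the ranking and presentation of a predefined number of candidate operation manuals could be sketched as follows. The function name, the rule of keeping the highest similarity when a manual is associated with several candidate topologies, and the example values are assumptions made only for this sketch.

```python
def top_manuals(similarities, manuals_by_topology, limit=3):
    """Rank operation manuals by the similarity of their candidate topologies.

    A manual associated with several candidate topologies keeps the highest
    similarity among them. Ties are broken alphabetically by manual name.
    """
    best = {}
    for topology, similarity in similarities.items():
        for manual in manuals_by_topology.get(topology, []):
            best[manual] = max(similarity, best.get(manual, 0.0))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


similarities = {"topo-1": 0.8, "topo-2": 0.6, "topo-3": 0.7}
manuals_by_topology = {
    "topo-1": ["restart gateway"],
    "topo-2": ["restart gateway", "check port configuration"],
    "topo-3": ["rotate credentials"],
}
shown = top_manuals(similarities, manuals_by_topology, limit=2)
```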

According to one or more embodiments, presenting an operation manual for diagnosing a cause of an event and/or remedying an event may include displaying information regarding why the candidate operation manual satisfies the similarity criteria. For example, the system may display information about the candidate topology, such as "candidate topology includes nodes A, B and C connected to device X." The system may display representations of components of the system topology and/or components of the candidate topology via text or via visual elements (without text).

In addition to determining whether similarity criteria are met between the candidate topology and the system topology, the system may include additional criteria to determine whether to present a particular operating manual for diagnosing the cause of the event and/or remedying the event, according to one or more embodiments. For example, the system may determine whether similarity criteria are met between the detected event and the event associated with the particular operation manual. In other words, the system may select an operation manual to present for remedying an event based on both (a) the similarity of the detected event and the event associated with the operation manual and (b) the similarity of the topology associated with the detected event and the topology associated with the operation manual. In accordance with one or more embodiments, the system identifies topology and event-based relationships by collecting metadata associated with detected events. The metadata includes, for example, a user ID, a time, a device ID, an application type, a port number associated with the event, a power source associated with the device, a communication channel type, a communication protocol, an encryption type, a data type, and data content (e.g., whether the data associated with the event is associated with an Operating System (OS) or an application running on the OS, whether the data associated with the event is associated with a particular tenant of a cloud-based environment, etc.).

FIG. 3 illustrates another example set of operations for recommending an operation manual for diagnosing a cause of an event and/or remedying the event based on a relationship between a target component in a system topology and one or more additional components in the system topology, in accordance with one or more embodiments. One or more of the operations shown in FIG. 3 may be modified, rearranged, or omitted entirely. Thus, the particular order of operations illustrated in FIG. 3 should not be construed as limiting the scope of one or more embodiments.

The system may include one or more monitoring devices and monitoring applications. The system detects, by the monitoring device and/or application, an event associated with a target component in the monitored system (operation 302). For example, the system may identify values or events in an event log associated with a component being monitored by the monitoring application. The event log may identify: sensor values, data throughput values, data storage values, application states, and access events (such as identifying a request from an entity to access an application or device). According to one example embodiment, the system identifies an abnormal sensor value or abnormal application state that is outside of a threshold for a proper operating state of the target component. According to another example embodiment, a system detects an access request from an entity that is not authorized to access a system component to access the system component. In one example embodiment, the system may detect a successful access attempt by an unauthorized or unrecognized entity.

The system identifies a topological relationship between the target component and one or more other components in the system topology (operation 304). The system topology describes physical system components such as a computing device, a communication channel connecting the computing device, a power channel supplying power to the computing device, applications running on the computing device, and data stored in the computing device. The system topology describes the relationships between different components, such as how components are connected by communication channels, which components are directly connected to each other and which components are indirectly connected along the same communication path, which components depend on other components to function, and which software type components can run on hardware type components.

Identifying a topological relationship between a target component and one or more other components in the system topology may include identifying the type of component, the functionality of the component, and the communication connection of the component with the other components in the system. For example, the system may identify a topological relationship between the database application and the hardware on which the database application is running. As another example, the system may identify a topological relationship between two virtual machines connected to different client terminals and running different applications from the same server. The topology may also identify access grants for particular system components, such as which entities have grants to access different system hardware and software. The system topology may be stored in a data repository. The stored system topology may be updated based on changes in the system topology. According to alternative embodiments, the system may detect the system topology in real time or on demand as events are detected in the monitored system.

The system selects candidate operation manuals associated with one or more components in the system topology (operation 306). The candidate operating manual may be selected from a set of stored operating manuals. For example, the system may maintain tens, hundreds, or thousands of runbooks associated with different routine processes associated with events in the system.

In accordance with one or more embodiments, the system gathers metadata associated with an event to identify a target component and one or more additional components in the system topology. The system may select a candidate operation manual based on the collected metadata. For example, metadata collected by the system may include a user ID of a user associated with the target component, an organization name or tenant associated with one or more components, a device ID of one or more devices associated with the target component, application information (such as an application name), an application running state (such as whether the application is executing correctly, stopped, or unresponsive), a process executed by the application when the event occurs, an application type, a port number associated with the event, a power source associated with the event, a communication channel type of a communication channel associated with the event, a communication protocol, an encryption type, a data type, and data content (e.g., whether data associated with the event is associated with an Operating System (OS) associated with the event or an application running on the OS).

The system determines whether the selected candidate operation manual meets a relevance threshold level based on the topological relationship between the target component and one or more components associated with the candidate operation manual (operation 308). According to one embodiment, the system may determine how closely one or more components communicate with the target component. For example, if the system detects an event "server crash," the system topology may identify a first set of system components that are in communication with the server and a second set of system components that operate independently of the server and are not affected by the server crash. Identifying a topological relationship between the target component and one or more other components includes determining whether one or more components associated with the candidate operation manual belong to the first set of system components (in communication with the server) or the second set of system components (not in communication with the server). If the one or more components belong to the first set of system components, the system may determine that the candidate operation manual satisfies a threshold level of relevance to the target component. Conversely, if the one or more components belong to the second set of system components, the system may determine that the candidate operation manual fails to meet the threshold relevance level.

In accordance with one or more embodiments, the system may determine whether the candidate operation manual satisfies a threshold level of relevance based on a degree of communication between the target component and one or more other components associated with the candidate operation manual. For example, in the above example, where the system detects an event "server crash," the system may determine whether a component associated with an operation in the candidate operation manual is among a first set of components that directly communicate with the server (a first degree of communication), among a second set of components that directly communicate with a component in the first set of components (a second degree of communication), among a third set of components that directly communicate with a component in the second set of components (a third degree of communication), and so forth. The system can assign a highest relevance score to the candidate operation manual associated with the first set of components, a next highest relevance score to the candidate operation manual associated with the second set of components, and a lowest relevance score to the candidate operation manual associated with the third set of components.
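One way to picture the degree of communication is as a hop count in a graph of the system topology. The sketch below uses a breadth-first search for that purpose. The adjacency map, component names, manual names, and the `1 / degree` scoring rule are assumptions chosen for illustration; the disclosure only requires that closer components yield higher relevance scores.

```python
from collections import deque

def communication_degrees(topology, target):
    """Breadth-first search: number of hops from the target to each reachable component."""
    degrees = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, ()):
            if neighbor not in degrees:
                degrees[neighbor] = degrees[node] + 1
                queue.append(neighbor)
    return degrees

def relevance(degree):
    """Hypothetical scoring: 1.0 at the first degree, lower when farther, 0.0 if unreachable."""
    if degree is None:
        return 0.0
    return 1.0 / max(degree, 1)

# Hypothetical topology: each component maps to the components it communicates with directly.
topology = {
    "server": ["gateway", "db"],
    "gateway": ["server", "router"],
    "db": ["server"],
    "router": ["gateway", "terminal"],
    "terminal": ["router"],
    "printer": [],  # operates independently of the server
}
degrees = communication_degrees(topology, "server")

# Hypothetical mapping from candidate operation manuals to the component they act on.
manual_components = {
    "check-db": "db",
    "check-router": "router",
    "check-terminal": "terminal",
    "check-printer": "printer",
}
scores = {m: relevance(degrees.get(c)) for m, c in manual_components.items()}
```

A component that the search never reaches plays the role of the second set of components in the "server crash" example, so its operation manual receives a score of zero and would fail any positive relevance threshold.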

Based on determining that the candidate operation manual meets the relevance threshold level, the system presents an operation manual for diagnosing the cause of the event and/or remedying the event detected in the monitored system (operation 310). For example, the system may provide user interface elements on a graphical user interface to allow a user to select an operation manual. Selecting the operation manual may cause one or more user interface elements associated with the independently executable operations corresponding to the steps of the operation manual to be displayed.

In accordance with one or more embodiments, the system can identify a plurality of candidate operation manuals that satisfy a threshold relevance to an event. The system may present the candidate operation manual having the highest relevance score based on a topological relationship between the target component and one or more components associated with operations of the candidate operation manual. The system may rank the plurality of candidate operation manuals based on the relevance scores of the candidate operation manuals. The system may present a predefined number of candidate operation manuals for remedying the event via the GUI. The system may display the candidate operation manuals in order of their ranking. For example, the system may display the candidate operation manual with the highest relevance score above the candidate operation manual with the next highest relevance score.

In accordance with one or more embodiments, presenting a candidate operation manual for remedying an event may include displaying information regarding a relationship, between a target component and one or more components associated with operations of the operation manual, that contributed to the relevance score. The system may display "there are X components in the system topology associated with the operation of the candidate operation manual" via text or via visual elements (no text) representing components of the system topology. Alternatively, the system may display "X components associated with the operation of the candidate operation manual and the target component associated with the event have a Y degree of communication" via text or via visual elements (with or without text) representing the components of the system topology.

FIG. 4 illustrates an example set of operations for recommending an operation manual for diagnosing a cause of an event and/or remedying an event based on the operation manual associated with another event in accordance with one or more embodiments. One or more of the operations illustrated in FIG. 4 may be modified, rearranged, or omitted entirely. Thus, the particular order of operations illustrated in FIG. 4 should not be construed as limiting the scope of one or more embodiments.

The system may include one or more monitoring devices and monitoring applications. The system detects events associated with the monitored system through a monitoring device and/or application (operation 402). For example, the system may identify a value or event in an event log associated with the monitoring application. The event log may identify: sensor values, data throughput values, data storage values, application states, and access events (such as identifying a request from an entity to access an application or device). According to one example embodiment, the system identifies abnormal sensor values or abnormal application states that are outside of a threshold for proper operating state. According to another example embodiment, a system detects an access request from an entity that is not authorized to access a system component to access the system component. In one example embodiment, the system may detect a successful access attempt by an unauthorized or unrecognized entity.

The system identifies event attributes associated with the event (operation 404). Identifying event attributes may include collecting and analyzing metadata associated with the event. For example, in embodiments in which an event is detected when a user generates a work order describing a particular component in the system, the system may review log data associated with the component. The log data can identify users, applications, and states associated with the component within a predefined period of time. According to another example embodiment where the system detects an event by monitoring sensor data or state data of a system component, identifying an event attribute may include identifying recent historical sensor data or state data, identifying an application associated with a target component, and identifying a user associated with the target component.

According to an example embodiment, metadata collected by the system may include a user ID of a user associated with the event, an organization name or tenant associated with the event, a timestamp, a device ID of one or more devices associated with the event, application information (such as an application name), an application running state (such as whether the application is executing correctly, stopped, or unresponsive), a process executed by the application when the event occurs, an application type, a port number associated with the event, a power source associated with the event, a communication channel type of a communication channel associated with the event, a communication protocol, an encryption type, a data type, and data content (e.g., whether data associated with the event is associated with an Operating System (OS) associated with the event or an application running on the OS).

The system identifies candidate historical events and candidate operation manuals associated with the candidate historical events (operation 406). Candidate historical events may be selected from a stored set of historical events. For example, each time a user creates a work order reporting an event, the system may store the event and metadata associated with the event. Similarly, each time the system detects an event based on sensor data, the system may store the event and metadata associated with the event. According to another example embodiment, the system may store metadata associated with the detected event each time the system detects a particular component state (such as an unresponsive application or a component operating outside of a threshold operating range). Further, each time an operation manual is applied to an event, the system may store the operation manual in association with the event or event type. According to one example embodiment, the system may obtain feedback from a user of the operation manual to determine whether the operation manual successfully resolved the event. If the operation manual successfully resolved the event, the system may store the operation manual in association with the event. If the operation manual did not successfully resolve the event, the system may refrain from storing the operation manual in association with the event. According to alternative embodiments, the system may store a plurality of operation manuals in association with the event. The system may increase the ranking of one of the plurality of operation manuals based on (a) the operation manual being selected to remedy the event and/or (b) the operation manual successfully remedying the event. The system may decrease the ranking of one of the plurality of operation manuals based on (a) the operation manual not being selected to remedy the event and/or (b) the operation manual being selected but unsuccessful in remedying the event.
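The feedback-driven ranking adjustment can be sketched with a simple per-event-type counter. The class name, event type, manual names, and the step size of one are hypothetical. The sketch also makes one simplifying assumption: a manual's ranking increases only when it was both selected and successful, and decreases otherwise.

```python
from collections import defaultdict

class ManualRanking:
    """Tracks a ranking score per (event type, operation manual) pair."""

    def __init__(self):
        self._scores = defaultdict(int)

    def record(self, event_type, manual, selected, succeeded):
        # Assumed rule: selected and successful raises the ranking;
        # not selected, or selected but unsuccessful, lowers it.
        if selected and succeeded:
            self._scores[(event_type, manual)] += 1
        else:
            self._scores[(event_type, manual)] -= 1

    def ranked(self, event_type):
        """Operation manuals stored for this event type, highest ranking first."""
        entries = [(m, s) for (e, m), s in self._scores.items() if e == event_type]
        entries.sort(key=lambda item: item[1], reverse=True)
        return [m for m, _ in entries]

ranking = ManualRanking()
ranking.record("server-unreachable", "restart-network-service", True, True)
ranking.record("server-unreachable", "restart-network-service", True, True)
ranking.record("server-unreachable", "replace-power-supply", True, False)
ranking.record("server-unreachable", "check-firewall-rules", False, False)
ranking.record("server-unreachable", "check-firewall-rules", True, True)
order = ranking.ranked("server-unreachable")
```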

The system determines whether the candidate historical event meets a similarity criterion with the detected event (operation 408). For example, the system may compare metadata associated with the detected event to metadata associated with the candidate event.

The similarity criteria may include, for example, determining that a threshold percentage of metadata elements are the same between the detected event and the candidate event. The system may perform a comparison on a predetermined number of candidate events to identify candidate events having the highest similarity to the detected event. For example, the system may identify, among a set of hundreds of candidate events, three different candidate events that satisfy a threshold similarity criterion with the detected event. The system may identify the candidate event having the highest similarity to the detected event among the three candidate events.
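A threshold-percentage comparison of metadata elements might look like the following sketch. The metadata keys, values, event names, and the 50% threshold are hypothetical; the disclosure does not fix a particular similarity measure, and here similarity is assumed to be the fraction of the detected event's metadata elements that the candidate event matches exactly.

```python
def metadata_similarity(detected, candidate):
    """Fraction of the detected event's metadata elements matched by the candidate event."""
    if not detected:
        return 0.0
    matches = sum(1 for key, value in detected.items() if candidate.get(key) == value)
    return matches / len(detected)

detected_event = {"application": "orders-db", "port": 1521, "protocol": "tcp", "tenant": "acme"}

# Hypothetical stored historical events and their metadata.
historical_events = {
    "server-outage": {"application": "billing", "port": 1521, "protocol": "tcp", "tenant": "other"},
    "misconfigured-port": {"application": "orders-db", "port": 1521, "protocol": "tcp", "tenant": "other"},
    "channel-failure": {"application": "billing", "port": 22, "protocol": "tcp", "tenant": "other"},
}

THRESHOLD = 0.5  # assumed threshold percentage of matching metadata elements
similarities = {
    name: metadata_similarity(detected_event, metadata)
    for name, metadata in historical_events.items()
}
qualifying = {name: s for name, s in similarities.items() if s >= THRESHOLD}
best_match = max(qualifying, key=qualifying.get) if qualifying else None
```

With these values, two historical events meet the threshold and the one sharing the same application is the closest match.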

For example, the system may detect an event associated with the name "communication lost with the server". The system may identify three candidate events that satisfy the similarity criteria. The first event may involve a specific server outage. The second event may relate to a communication port that was erroneously programmed by an application running on the server. The third event may relate to a communication failure of a device along a communication channel between the server and the user terminal. The system may analyze metadata associated with the detected event, including: the power level of the server, the applications running on the server, and the presence and/or status of devices along the communication channel between the server and the user terminal. The system may select a second candidate event having the highest similarity to the detected event based on determining from the metadata associated with the event that the same application was running on the target server associated with the detected event and the server associated with the candidate event.

Based on determining that the candidate event meets the similarity criteria with the detected event, the system presents the candidate operation manual associated with the candidate event for diagnosing the cause of the detected event and/or remedying the detected event (operation 410). For example, the system may provide user interface elements on a graphical user interface to allow a user to select an operation manual. Selecting the operation manual may cause one or more user interface elements associated with the independently executable operations corresponding to the steps of the operation manual to be displayed.

While FIGS. 2-4 illustrate example operations for selecting an operation manual to present to remedy a detected event based on the system topology and attributes of the detected event, one or more embodiments combine the elements of FIGS. 2-4 to select an operation manual to present to remedy a detected event. For example, one or more embodiments include selecting an operation manual to present to remedy an event based on any combination of: (a) similarity between the topology of the system associated with the detected event and a candidate topology, (b) similarity between topology components associated with operations of the candidate operation manual and the topology associated with the detected event, and (c) similarity between the detected event and the event associated with the particular operation manual.

FIG. 5 illustrates an example set of operations for recommending candidate operation manuals based on correlation of results of operation manual operations with remediation of an event in accordance with one or more embodiments. One or more of the operations shown in FIG. 5 may be modified, rearranged, or omitted entirely. Thus, the particular order of operations illustrated in FIG. 5 should not be construed as limiting the scope of one or more embodiments.

The system may include one or more monitoring devices and monitoring applications. The system detects events associated with the monitored system through a monitoring device and/or application (operation 502). For example, the system may identify a value or event in an event log associated with the monitoring application. The event log may identify: sensor values, data throughput values, data storage values, application states (e.g., "OK" and "no response"), and access events, such as identifying a request from an entity to access an application or device. According to one example embodiment, the system identifies abnormal sensor values or abnormal application states that are outside of a threshold for proper operating state. According to another example embodiment, a system detects an access request from an entity that is not authorized to access a system component to access the system component. In one example embodiment, the system may detect a successful access attempt by an unauthorized or unrecognized entity.

In accordance with one or more embodiments, detecting events includes analyzing predictions generated by a machine learning model that monitors sensor data and other system data. According to one or more alternative embodiments, detecting an event includes analyzing user-generated event entries, such as "worksheets" generated by one user to notify one or more additional users (such as a technician trained to maintain the system) of an anomaly. The worksheet may be based on user complaints (e.g., "computer is not working"). Alternatively, the worksheet may be based on data detected by the system (e.g., a sensor monitoring a component indicates that the component is operating outside a threshold range of values). In accordance with one or more embodiments, detecting an event includes comparing a log entry to a known or predicted exception. For example, a login failure of an application or device may be detected based on log entries generated by the application or device. The system may analyze the log entries, identify events (e.g., login failures) based on the log entries, and collect system metadata associated with the events. The system metadata may include related log entries (e.g., previous login failures originating from the same device), user information, application information, device information, communication protocols, and security protocols.

The system identifies candidate operation manuals (operation 504). The candidate operating manual may be selected from a set of stored operating manuals. For example, the system may maintain tens, hundreds, or thousands of runbooks associated with different routine processes associated with events in the system. According to one embodiment, the system identifies attributes associated with an event or system topology to identify an operation manual. For example, the system may identify a name associated with the event, such as a name entered by an operator for reporting the event. Alternatively, the system may identify system components associated with the event in the system topology. For example, if an event includes a particular state of a component, then a candidate operating manual may be associated with that component or another component in the system topology.

In accordance with one or more embodiments, the system collects metadata associated with an event. The system may identify candidate operation manuals based on the collected metadata. For example, metadata collected by the system may include a user ID of a user associated with the event, an organization name or tenant associated with the event, a timestamp, a device ID of one or more devices associated with the event, application information (such as an application name), an application running state (such as whether the application is executing correctly, stopped, or unresponsive), a process executed by the application when the event occurs, an application type, a port number associated with the event, a power source associated with the event, a communication channel type of a communication channel associated with the event, a communication protocol, an encryption type, a data type, and data content (e.g., whether data associated with the event is associated with an Operating System (OS) associated with the event or an application running on the OS).

The system previews one or more operations of the candidate operation manual prior to presenting the candidate operation manual for diagnosing and/or remedying the event. The system performs one or more operations associated with the steps of the candidate operation manual to obtain a set of results (operation 506). The one or more operations may be (a) operations specified in a step of the candidate operation manual, or (b) operations that are not specified in any step of the candidate operation manual but that provide results associated with a step of the candidate operation manual.

For example, the candidate operation manual may include a step describing the operation of "checking the data transfer log". The system may perform the operation specified in the operation manual to examine the data transfer log before presenting the candidate operation manual as a recommendation for remedying the event. The system may compare the values described in the data transfer log with expected values to obtain results associated with the operation. Additionally or alternatively, the system may perform operations related to one of the steps of the candidate operation manual but not specified in any of the steps of the candidate operation manual. For example, if the candidate operation manual includes an operation of "checking a data transmission log of the server", the system may perform an operation of checking a state of the data routing device between the server and the user terminal. The system may obtain a result of checking the status of the data routing device between the server and the user terminal, such as whether the device was transmitting data during the detected event or whether data transmission was interrupted during the detected event.

According to one or more embodiments, the system selects one or more operations to execute from among the independently executable operations of the candidate operation manual based on one or both of event data associated with the detected event and topology data associated with the detected event. For example, if a detected event is assigned the name "login attempt failed," the system may analyze the event data to identify the time of the event, the ID of the server running the application making the login attempt, and the application running on the server. The system may analyze the system topology to identify components in the monitored system that communicate with the server running the application that made the login attempt. The system may identify a set of candidate operation manuals associated with the event type "login attempt failed". The system may select operations to be performed with respect to one or more of the candidate operation manuals based on the event data and the topology data. For example, the system may perform an operation associated with an independently executable operation of a candidate operation manual to "check the authorization level required by the application" based on the obtained event data. According to another example, the system may perform an operation associated with an independently executable operation of another candidate operation manual to "check the status of a gateway in communication with the server" based on the obtained system topology data.

In accordance with one or more embodiments, the system identifies candidate operations to preview based on analyzing the metadata. For example, the system may store metadata for each operation of the operation manual that indicates whether a particular operation is eligible to be performed to preview the operation results. The system may designate operations as eligible for execution to preview the operation manual results based on: (a) The operation itself may be performed, or (b) another operation different from the operation manual operation may be performed to provide insight into the operation manual operation. An example of an operation that may be performed to preview the result of an operation is an operation that may be performed by a computer without user input. In contrast, an example of an operation that may not qualify for previewing the operation results is an operation that requires user input to perform. Metadata associated with the operation manual operation may also identify one or more related operations that may be performed to preview the operation manual operation. For example, if an operation directs a user to check for a power connection of a server, the metadata may specify that the computer may preview the operation by determining whether the server is actively communicating with another server.

According to another example embodiment, the system may designate an operation as ineligible for execution to preview the operation manual operation result because executing the operation would change the state of the system. For example, an operation manual operation may direct a user to initiate a data backup, restore backup data, initiate a series of requests to a server, or reset a virtual firewall. Performing such an operation to preview its result would change the system state. Thus, the system may store metadata associated with the operation manual operation indicating that the operation is not eligible for execution for the purpose of obtaining a preview of the operation results.
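The eligibility rules for previewing can be sketched as per-operation metadata plus a small lookup. The field names and operation names below are hypothetical. The sketch assumes that any related preview operation recorded in the metadata is itself read-only and needs no user input.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OperationMetadata:
    name: str
    requires_user_input: bool = False
    changes_system_state: bool = False
    related_preview: Optional[str] = None  # assumed to be a safe, automated operation

def preview_target(op):
    """Return the operation to execute for a preview, or None if nothing is eligible."""
    if not op.requires_user_input and not op.changes_system_state:
        return op.name          # (a) the operation itself may be performed
    return op.related_preview   # (b) a different, related operation, if one is recorded

operations = [
    OperationMetadata("check data transfer log"),
    OperationMetadata(
        "check server power connection",
        requires_user_input=True,
        related_preview="check whether server is communicating with another server",
    ),
    OperationMetadata("restore backup data", changes_system_state=True),
]
previews = {op.name: preview_target(op) for op in operations}
```

In this sketch the state-changing operation has no recorded alternative, so it is skipped during preview.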

Based on the results of the performed operations, the system determines whether the performed operations generate interesting results (operation 508). The system uses the relevance score to indicate whether a particular operation outcome is of interest. The system determines the relevance of the operation of the candidate operation manual to diagnosing and/or remediating the detected event and assigns a corresponding relevance score to the operation. In accordance with one or more embodiments, the system analyzes information associated with results of a set of one or more operations associated with an operation manual to calculate a relevance score for the operation manual. Examples of information that the system may analyze to generate a relevance score include: the data set used to perform the operation specified in the operation manual, the type of data included in the result set, the source of the operation result, the software and/or hardware components analyzed in the operation result, and an indication of whether the type of problem identified by the operation result is a problem detected in the event-affected target component or environment.

For example, the candidate operation manual may include a step describing the operation of "checking the data transfer log". The system analyzes the data transfer log by comparing the actual values to the expected values to obtain a set of results. Based on the result of the comparison, the system assigns a relevance score to the candidate operation manual or adjusts the relevance score of the candidate operation manual. If the system determines that the result of the operation does not provide a useful result for remedying the detected event, the system may decrease the relevance score of the candidate operation manual. For example, if the data transfer log shows a continuous data transfer within a threshold range, the system may determine that checking the data transfer log is unlikely to help remedy the detected event. Thus, the system reduces the relevance score of the associated candidate operation manual. On the other hand, if the system determines that the data transfer log value is not within the expected threshold range of values, the system may increase the relevance score of the candidate operation manual, indicating that the operation of checking the data transfer log is likely to help remedy the detected event. The system may determine a relevance score for the candidate operation manual based on a plurality of relevance scores associated with a plurality of operations associated with the steps of the candidate operation manual. For example, if three operations associated with steps of the candidate operation manual are determined to be relevant to remedying the event, and if one operation associated with a step of the candidate operation manual is determined to be not relevant to remedying the event, the system may generate a relevance score of 0.75 (on a scale ranging from 0 to 1).
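The 0.75 score in the example above is consistent with taking the fraction of previewed operations whose results fall outside the expected range. The sketch below reproduces that arithmetic. The metric names, observed values, and expected ranges are hypothetical, and a simple fraction is only one assumed way of combining per-operation scores.

```python
def out_of_range(value, low, high):
    """A previewed result is treated as relevant when the observed value is unexpected."""
    return not (low <= value <= high)

def manual_relevance(operation_flags):
    """Fraction of previewed operations judged relevant, on a scale from 0 to 1."""
    if not operation_flags:
        return 0.0
    return sum(operation_flags) / len(operation_flags)

# Hypothetical preview results: (metric, observed value, expected low, expected high).
preview_results = [
    ("data transfer rate (MB/s)", 2.0, 50.0, 200.0),
    ("transfer error count", 37, 0, 5),
    ("queue depth", 900, 0, 100),
    ("cpu utilization (%)", 40, 0, 85),
]
flags = [out_of_range(value, low, high) for _, value, low, high in preview_results]
score = manual_relevance(flags)
print(score)
```

Three of the four checks are out of range, so the candidate operation manual receives three quarters of the maximum score.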

According to another example, if the candidate operation manual includes an operation of "checking a data transmission log of the server", the system may perform an operation of checking a state of the data routing device between the server and the user terminal. The system may obtain a result of checking the status of the data routing device between the server and the user terminal, such as whether the device is transmitting data during a detected event or whether data transmission is interrupted during a detected event. If the system determines that the data routing device is transmitting and receiving data according to an expected data transfer rate, the system may decrease the relevance score of the candidate operation manual, indicating that it is unlikely that checking the data transfer log of the server will help remedy the detected event. Conversely, if the system determines that the data routing device is not transmitting data to and receiving data from the server according to the expected data transfer rate, the system may increase the relevance score of the candidate operation manual, indicating that checking the data transfer log of the server is likely to be helpful in remedying the detected event.

The system may obtain expected results for comparison with currently generated results based on previously collected system data, published specifications, previously executed operating manual results, and user provided expected values. For example, a cloud computing system may include a monitoring platform for monitoring system attributes. The monitoring system may identify operating parameters corresponding to the expected operating state, such as data transfer rate, bandwidth utilization, memory utilization, and running applications. For example, the monitoring system may determine that a data transfer rate between two servers that is less than a particular value corresponds to an expected data transfer range. A data transfer rate exceeding the specific value may correspond to an outlier. According to one example embodiment, a monitoring platform includes a machine learning model trained to receive system metrics and generate one or more predictions that indicate whether the system metrics are within an expected range or abnormal. Alternatively, the user may assign a label or tag to a particular value or set of values, indicating whether the values are abnormal or within an expected range. Additionally or alternatively, the system may detect that the user has previously executed an operating manual to diagnose and/or remedy an event based on detecting a particular set of outliers.
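As a rough illustration of deriving an expected range from previously collected system data, the sketch below substitutes a simple statistical rule (mean plus or minus k standard deviations) for the monitoring platform or machine learning model described above. The rule, the value of `k`, and the names `expected_range` and `is_outlier` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def expected_range(history, k=3.0):
    """Derive an expected range from previously collected metric values.

    Assumption: values within k population standard deviations of the
    historical mean count as expected.
    """
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def is_outlier(value, history, k=3.0):
    """Return True when `value` falls outside the expected range."""
    low, high = expected_range(history, k)
    return not (low <= value <= high)


# Hypothetical data transfer rates (MB/s) observed during normal operation.
history = [100, 102, 98, 101, 99]
print(is_outlier(103, history))  # within the expected range
print(is_outlier(140, history))  # outside the expected range
```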

The system may set a relevance score for the candidate operation manual based on the results of one or more operations. For example, if the system performs one operation associated with a step of the candidate operation manual, the system sets the relevance score of the candidate operation manual based on the result of that one operation. Alternatively, the system may combine multiple relevance scores for multiple operations associated with steps of the candidate operation manual to determine a composite relevance score for the entire candidate operation manual. The system may assign a relevance score between 0 and 1 to each operation. The system may average the relevance scores associated with the plurality of operations to determine a relevance score for the candidate operation manual. Alternatively, the system may assign different weights to the relevance scores of different operations associated with the operation manual steps to determine the composite relevance score of the candidate operation manual.
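The averaging and weighting alternatives described above might be sketched as follows. The function name `composite_score` and the example weights are hypothetical; the disclosure does not specify how weights are chosen.

```python
def composite_score(scores, weights=None):
    """Combine per-operation relevance scores (each in 0..1) into one score.

    With no weights this is a plain average; with weights it is a
    weighted average. Both are illustrative assumptions.
    """
    if not scores:
        raise ValueError("at least one operation score is required")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


plain = composite_score([0.2, 0.8])                 # simple average
weighted = composite_score([0.2, 0.8], [1.0, 3.0])  # second operation counts more
print(plain, weighted)
```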

The system may generate a relevance score associated with an operation result based on (a) whether the result is different from the expected range and (b) a relationship between the result and an objective associated with the operation manual recommendation. The system may assign a higher relevance score to operations that are more closely related to the objective of the recommendation than to operations that are less closely related to that objective. For example, the system may identify a target component in which the detected event occurred. The system may identify diagnosing the event detected in the target component as the objective of the operation manual recommendation. In other words, the purpose of the recommendation is to provide the user with an operation manual that is likely to diagnose and/or remedy the detected event. The system can analyze two operation manual operations. The first operation may be directed to analyzing data stored in the target component. The second operation may be directed to analyzing data stored in another component of the system that is not connected to the target component. The system may determine, based on topology data associated with the two operations, that the first operation is more closely related to the objective of the operation manual recommendation than the second operation. The system may assign a higher relevance score to the result associated with the first operation than to the result associated with the second operation.

According to one or more embodiments, the system identifies how closely the results of an operation manual operation are associated with the objective of the operation manual recommendation based on one or both of (a) event data and (b) topology data associated with the detected event. For example, if the system determines that the objective of the operation manual recommendation is to diagnose the cause of an application crash, the system may determine that operation manual operations that analyze data related to the state of the application at the time of the crash are of greater interest than operation manual operations that analyze the state of the application at a time further from the time of the crash. The system assigns a higher relevance score to an operation manual operation that generates data more closely related to the event data than to an operation manual operation that generates data less related to the event data. Likewise, the system assigns a higher relevance score to an operation manual operation that analyzes the target component in which the event occurred than to an operation manual operation that analyzes a component in the system topology that is remote from the target component.

For example, if the system performs two operations associated with steps of a candidate operation manual, the system may assign a relatively low relevance score to the first operation and a relatively high relevance score to the second operation based on determining that: (a) the results of the first operation are within an expected range (corresponding to a low relevance score) and the first operation is associated with a component in the system topology that is not closely related to the target component (corresponding to a low relevance score), and (b) the results of the second operation are outside of an expected range (corresponding to a high relevance score) and the second operation is associated with a component in the system topology that is closely related to the target component (corresponding to a high relevance score).
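One way to picture the two factors in this example is the sketch below. It assumes that topological closeness is measured as a hop count to the target component and that the two factors are blended with equal weight; neither the hop metric, the 50/50 blend, nor the name `operation_relevance` comes from the disclosure.

```python
def operation_relevance(result, expected_low, expected_high, hops_to_target):
    """Score one operation from (a) range deviation and (b) topology proximity.

    Assumptions: a result outside [expected_low, expected_high] contributes
    1.0, otherwise 0.0; proximity decays as 1 / (1 + hops); the two
    components are averaged.
    """
    deviation = 0.0 if expected_low <= result <= expected_high else 1.0
    proximity = 1.0 / (1 + hops_to_target)
    return 0.5 * deviation + 0.5 * proximity


# First operation: result within range, component three hops from the target.
first = operation_relevance(50, 40, 60, hops_to_target=3)
# Second operation: result out of range, component is the target itself.
second = operation_relevance(95, 40, 60, hops_to_target=0)
print(first, second)
```

Under these assumptions the first operation receives a low score and the second a high score, which matches the ordering described in the example.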

In accordance with one or more embodiments, the system may generate a relevance score based on results of operations performed by the system associated with one or more steps of the candidate operation manual and one or more of: (a) similarity of the detected event to an event associated with the candidate operation manual, and (b) similarity of the topology associated with the detected event to the topology associated with the candidate operation manual. In accordance with one or more embodiments, the system identifies topological relationships and event-based relationships by collecting metadata associated with detected events. The metadata includes, for example, a user ID, a time, a device ID, an application type, a port number associated with the event, a power source associated with the device, a communication channel type, a communication protocol, an encryption type, a data type, and data content (e.g., whether the data associated with the event is associated with an Operating System (OS) or an application running on the OS, whether the data associated with the event is associated with a particular tenant of a cloud-based environment, etc.).

The system determines whether a relevance score associated with the candidate operation manual meets a threshold (operation 510). For example, the system may set the threshold relevance score to 0.6 on a scale from 0 to 1. If the relevance score of the candidate operation manual is 0.6 or greater, the system may determine that the candidate operation manual meets the threshold relevance score. Although a range of values between 0 and 1 for the relevance score is provided by way of example, embodiments contemplate any range of values, such as 1-10, 0-100, A-E, or any other range of values. Alternatively, the relevance score may be binary, either 0 or 1, instead of a range of values.

In accordance with one or more embodiments, the system repeats operations 504-510 for each of a set of candidate operation manuals. For example, if a detected event is assigned the label "terminal power down" and if ten stored operation manuals are associated with the label "terminal power down", the system may repeat operations 504-510 for each of the ten stored operation manuals. The system assigns relevance scores to the respective candidate runbooks.

Based on determining that the candidate operation manual meets the relevance threshold, the system presents the candidate operation manual for remedying the detected event (operation 512). For example, the system may provide user interface elements on a graphical user interface to allow a user to select an operation manual. Selecting the operation manual may cause one or more user interface elements associated with the independently executable operations corresponding to the steps of the operation manual to be displayed.

In accordance with one or more embodiments, the system may identify a plurality of candidate operation manuals that satisfy the relevance criteria. The system may present the candidate operation manual with the highest relevance score. The system may rank the plurality of candidate operation manuals based on their respective relevance scores. The system may present a predefined number of candidate operation manuals via the GUI to diagnose and/or remedy the event. The system may display the candidate operation manuals in an order based on their ranking. For example, the system may display the candidate operation manual with the highest relevance score at a location on the display above the candidate operation manual with the next highest relevance score. The system may refrain from displaying candidate operation manuals that meet the threshold relevance if those candidate operation manuals are not among the predetermined number of highest ranked candidate operation manuals. For example, if the system determines that five candidate operation manuals meet the threshold relevance, the system may display the three candidate operation manuals with the highest relevance scores as recommendations for remedying the event. The system may refrain from displaying the other two candidate operation manuals.
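The threshold, ranking, and display-limit behavior described above can be sketched in a few lines. The function name `recommend`, the manual labels, and the example scores are hypothetical; the 0.6 threshold and the limit of three follow the examples in the text.

```python
def recommend(scores, threshold=0.6, max_shown=3):
    """Return names of candidate manuals to display, highest score first.

    `scores` maps a manual name to its relevance score (0..1). Candidates
    below the threshold are dropped; at most `max_shown` are returned.
    """
    eligible = [(name, s) for name, s in scores.items() if s >= threshold]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible[:max_shown]]


# Five of six candidates meet the threshold; only the top three are shown.
scores = {"A": 0.9, "B": 0.65, "C": 0.8, "D": 0.7, "E": 0.6, "F": 0.4}
shown = recommend(scores)
print(shown)
```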

In accordance with one or more embodiments, presenting a candidate operation manual for remedying an event may include displaying information regarding why the candidate operation manual meets a relevance threshold. For example, as described above, the system may display a plurality of candidate operation manuals according to the extent to which the operations performed by the system, which are associated with the steps of the candidate operation manuals, indicate a high likelihood of providing information relevant to remedying the detected event. The system may indicate, via text or via non-text visual elements representing steps of the candidate operation manual, that one or more steps defined in the candidate operation manual are likely to provide information related to remedying the detected event.

7. Example embodiment

For clarity, detailed examples are described below. The components and/or operations described below should be understood as one specific example that may not be applicable to certain embodiments. Accordingly, the components and/or operations described below should not be construed as limiting the scope of any claims.

Fig. 6 illustrates a monitored system 601 monitored by an event remediation platform 610. The monitored system 601 includes nodes 602-606 and a database 607. User terminals 608 and 609 access nodes 602-606 and database 607 via a network. Event remediation platform 610 monitors system 601 to identify events associated with system 601.

In the example embodiment shown in fig. 6, event remediation platform 610 detects a login failure and a loss of communication with a server. For example, a user may interact with the user terminal 608 to access an application running on the node 604. Node 604 may communicate with user terminal 608 via node 602. Nodes 602 and 604 may comprise hardware servers. The user may attempt to log in to the application on node 604. The system may detect a predefined number of login failures, such as three consecutive login failures from the user terminal 608, and generate an event notification 612. The system displays the detected event 613 and recommended operation manuals 614 and 615 for remedying the event on the display screen 611. User interface elements 614 and 615, which represent the operation manuals, may be user selectable elements. Selection of element 614 or 615 may cause the system to modify the displayed data to include steps of the corresponding operation manual 614 or 615.

Further, in the example embodiment shown in fig. 6, the operator may generate a work order indicating that communication with the server represented by node 606 is lost. For example, an operator attempting to access database 607 via node 606 may determine that the request was not answered. Based on the operator-generated work order, the system generates event notification 616. The system displays the detected event 617 and recommended operation manuals 618 and 619 for remedying the event on the display 611. User interface elements 618 and 619, which represent the operation manuals, may be user selectable elements. Selecting element 618 or 619 may cause the system to modify the displayed data to include steps of the corresponding operation manual 618 or 619.

The system identifies the operation manuals 614, 615, 618, and 619 to display based on determining that one or both of (a) the event and (b) the topology associated with the event satisfy a threshold similarity to a candidate historical event and/or a candidate topology. For example, the system may identify the type of operating system and the application that the user is attempting to access in node 604. The system may recommend operation manual 614 based on determining that the operation manual is associated with a topology that includes the same OS, the same application, and intermediate nodes that are similar to node 602. Based on determining that the topology associated with operation manual 615 includes the same OS and application, but intermediate nodes of a different type than node 602, the system may recommend operation manual 615 with a lower similarity value than operation manual 614 (i.e., 90%).

The system may identify attributes associated with event 617 including the time at which the event occurred, the power supply specification of the power provided to node 606, the communication port configuration of node 606, and the communication channel protocol. The system may recommend an operation manual 618 based on determining that the operation manual 618 is associated with another event that includes the same port configuration, communication channel protocol, and power configuration as the node 606. Based on determining that the event includes the same port configuration and communication channel protocol as node 606 but a different power configuration, the system may recommend an operation manual 619 having a lower similarity value (i.e., 85%).

In an example embodiment, the system may store hundreds of operating manuals associated with tens of different events. Based on detecting and displaying event 617 "communication lost with server", the system can identify ten candidate operation manuals associated with event type "communication lost with server" from hundreds of stored operation manuals.

The system may analyze event attributes and/or topological characteristics associated with the candidate operation manuals to identify which operation manuals to recommend for diagnosing and/or remediating the currently displayed event. For example, the system may determine that four of the ten candidate operation manuals are associated with events in which the server runs a different communication protocol than the server associated with the currently detected event. The system may also determine that two of the remaining candidate operation manuals relate to system components that are not present in the topology associated with the currently detected event. Among the remaining four candidate operation manuals, the system may determine that three satisfy a similarity threshold for the currently detected event. The system may identify one of these candidate operation manuals as having a high effectiveness rating for successfully remedying events, based on historical data tracking the success rates of the operation manuals. Thus, the system may recommend a particular operation manual that meets the similarity threshold and has a high effectiveness rating for remedying the currently detected event.
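The narrowing from ten candidates to a single recommendation might be sketched as a sequence of filters. The `Candidate` fields, the protocol and component names, the 0.8 similarity threshold, and all numeric values below are hypothetical illustrations rather than values from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    protocol: str          # protocol of the server in the historical event
    components: frozenset  # components the manual's steps refer to
    similarity: float      # similarity to the current event (0..1)
    success_rate: float    # historical remediation success rate (0..1)


def select_manual(candidates, event_protocol, event_components, min_similarity=0.8):
    """Filter by protocol, topology, and similarity, then pick the most effective."""
    same_protocol = [c for c in candidates if c.protocol == event_protocol]
    in_topology = [c for c in same_protocol if c.components <= event_components]
    similar = [c for c in in_topology if c.similarity >= min_similarity]
    return max(similar, key=lambda c: c.success_rate, default=None)


topology = frozenset({"server", "gateway", "database"})
candidates = (
    # Four manuals tied to a different protocol.
    [Candidate(f"udp-{i}", "UDP", frozenset({"server"}), 0.9, 0.9) for i in range(4)]
    # Two manuals that reference a component absent from the topology.
    + [Candidate(f"lb-{i}", "TCP", frozenset({"server", "load_balancer"}), 0.9, 0.9)
       for i in range(2)]
    # Four remaining manuals; three meet the similarity threshold.
    + [
        Candidate("m7", "TCP", frozenset({"server"}), 0.70, 0.95),
        Candidate("m8", "TCP", frozenset({"server", "gateway"}), 0.85, 0.60),
        Candidate("m9", "TCP", frozenset({"server"}), 0.90, 0.92),
        Candidate("m10", "TCP", frozenset({"server", "database"}), 0.88, 0.75),
    ]
)
best = select_manual(candidates, "TCP", topology)
print(best.name)
```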

FIG. 7 illustrates a monitored system 701 monitored by an event remediation platform 710. The monitored system 701 includes nodes 702-706 and a database 707. User terminals 708 and 709 access nodes 702-706 and database 707 via a network. Event remediation platform 710 monitors system 701 to identify events associated with system 701.

In the example embodiment shown in fig. 7, event remediation platform 710 detects a loss of communication with a server. For example, the operator may generate a work order indicating that communication with the server represented by node 706 is lost. For instance, an operator attempting to access database 707 via node 706 may determine that the request was not answered. Based on the operator-generated work order, the system identifies candidate operation manuals 714 and 716 associated with the event type "communication lost with server". The system may identify candidate operation manuals 714 and 716 based on comparing the event data and/or topology data to one or both of (a) event data associated with events to which operation manuals were previously applied and (b) topologies associated with those events. For example, the system may analyze event data associated with the event type "communication lost with server" to identify the server as a server running a virtual machine. The system may identify, from among a set of candidate operation manuals associated with the event type "communication lost with server", the operation manuals 714 and 716 as having previously been applied to events in which communication was lost with a server running a virtual machine.

The operation manual 714 includes the following steps, each associated with one or more independently executable operations: step A: check the power of the server (715a); step B: check the physical port connection (715b); step C: confirm the status lights (715c); and so on through step n (715n). The operation manual 716 includes the following steps, each associated with one or more independently executable operations: step A: check the power of the server (717a); step B: check the port status on the status screen (717b); step C: check the application license (717c); and so on through step p (717p).

Before the operation manual recommendation engine 713 generates a graphical user interface to recommend an operation manual 714 or 716 for remedying the detected event, the operation execution engine 712 performs operations associated with one or more of the steps 715a-715n and 717a-717p. The operation execution engine 712 executes an operation associated with step 715b of the operation manual 714. The operation that the operation execution engine 712 performs is different from the physical port connection check indicated in step 715b. Instead, the operation execution engine 712 performs an operation that indicates whether step 715b is likely to be helpful in resolving the identified event. Specifically, the operation execution engine 712 analyzes the data connection from the target server to the gateway in communication with the target server. Based on determining that the data transfer between the target server and the gateway is within the threshold range, the event remediation platform 710 reduces the relevance score of the operation manual 714 (such as from 60/100 to 50/100), indicating that checking the physical port connection in step 715b is unlikely to help resolve the communication failure with the server.

The operation execution engine 712 also executes an operation associated with step 717b of the operation manual 716. The operation execution engine 712 checks the port configuration of the data port of the target server. Based on determining that the port configuration does not match the expected port configuration, event remediation platform 710 increases the relevance score of operation manual 716 (such as from 60/100 to 75/100), indicating that checking the port status on the status screen, as specified in step 717b, may help resolve the communication failure with the server. The operation execution engine 712 also executes an operation associated with step 717c of the operation manual 716. The operation execution engine 712 checks the application license of the application running on the target server. Based on determining that the application license may limit communication with external devices, event remediation platform 710 increases the relevance score of operation manual 716 (such as from 75/100 to 85/100), indicating that checking the application license, as specified in step 717c, may help resolve the communication failure with the server.

The operation manual recommendation engine 713 presents one or more operation manuals as recommendations to remedy the detected event. In the example shown in FIG. 7, the operation manual recommendation engine 713 presents the operation manual 716 as a recommendation to remedy the server communication failure based on determining that its relevance score (85/100) exceeds the threshold relevance score (60/100). The operation manual recommendation engine 713 also presents the operation manual 716 as a recommendation based on determining that its relevance score (85/100) is higher than the relevance score of any other candidate operation manual. Based on determining that the reduced relevance score (50/100) of the operation manual 714 does not meet the threshold relevance score (60/100), the operation manual recommendation engine 713 refrains from presenting the operation manual 714 as a recommendation to remedy the server communication failure.
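The score adjustments walked through for FIG. 7 can be traced with a short sketch. The starting scores, the adjustment sizes, and the threshold of 60 come from the example above; representing the adjustments as a list of deltas and clamping scores to the 0-100 scale are illustrative assumptions.

```python
THRESHOLD = 60  # threshold relevance score on a 0-100 scale

# Both candidate manuals start at 60/100 in the example.
scores = {"714": 60, "716": 60}

# (manual, delta) pairs produced by the pre-executed operations.
adjustments = [
    ("714", -10),  # gateway data transfer within range: step 715b unlikely to help
    ("716", +15),  # port configuration mismatch: step 717b likely to help
    ("716", +10),  # license may limit communication: step 717c likely to help
]

for manual, delta in adjustments:
    # Clamp to the 0-100 scale (an assumption, not stated in the text).
    scores[manual] = max(0, min(100, scores[manual] + delta))

# Keep manuals that meet the threshold, highest score first.
recommended = sorted(
    (m for m, s in scores.items() if s >= THRESHOLD),
    key=scores.get,
    reverse=True,
)
print(scores, recommended)
```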

In an example embodiment, the system may store hundreds of operation manuals associated with tens of different events. Based on detecting the event "communication lost with the server", the system may identify ten candidate operation manuals associated with the event type "communication lost with the server" from among the hundreds of stored operation manuals. The system may analyze event attributes and/or topological characteristics associated with the candidate operation manuals to identify a subset of the candidate operation manuals for which the operation execution engine 712 may perform one or more operations. The system analyzes the results generated by the one or more operations performed by the operation execution engine 712 to identify which of the subset of operation manuals to display on the operation manual selection interface as recommendations to remedy the detected event.

Fig. 8A and 8B illustrate detailed examples of Graphical User Interfaces (GUIs) for presenting one or more operation manuals as recommendations for remedial events. The components and/or operations described below should be understood as one specific example that may not be applicable to certain embodiments. Accordingly, the components and/or operations described below should not be construed as limiting the scope of any claims.

In the example embodiment shown in fig. 8A, the event remediation platform 810 includes a Graphical User Interface (GUI) 811 that displays an event 817 "communication lost with the server". The event may be displayed based on detecting a metric value from the system monitoring platform that is outside a threshold range of metric values. Alternatively, events may be displayed based on a user-generated work order. For example, a user may experience a loss of communication with a server and generate a work order to report the event. Although one event 817 is illustrated in fig. 8A, the GUI may display any number of detected events.

Based on the user's selection of the user interface element associated with event 817, the event remediation platform 810 initiates a process for selecting one or more operation manuals to recommend for remediating the event. In particular, the event remediation platform 810 identifies a set of candidate operation manuals. The event remediation platform 810 performs one or more sub-operations associated with one or more steps of the candidate operation manuals. Based on the results, the event remediation platform 810 determines whether to recommend a candidate operation manual to remedy the event.

As shown in fig. 8B, based on identifying the operation manual A 818 and the operation manual B 819 as having relevance scores meeting the relevance criteria, the event remediation platform 810 displays the interface elements 818 and 819 as recommendations for execution to remedy the identified event. The user may select interface element 818 to display a set of steps for remedying the identified event.

8. Computer network and cloud network

In one or more embodiments, a computer network provides connectivity between a set of nodes. Nodes may be local to each other and/or remote from each other. The nodes are connected by a set of links. Examples of links include coaxial cables, unshielded twisted pair wires, copper cables, optical fibers, and virtual links.

A subset of nodes implements the computer network. Examples of such nodes include switches, routers, firewalls, and Network Address Translators (NATs). Another subset of nodes uses the computer network. Such nodes (also referred to as "hosts") may execute client processes and/or server processes. A client process makes a request for a computing service, such as execution of a particular application and/or storage of a particular amount of data. A server process responds by executing the requested service and/or returning corresponding data.

The computer network may be a physical network comprising physical nodes connected by physical links. A physical node is any digital device. The physical nodes may be function-specific hardware devices such as hardware switches, hardware routers, hardware firewalls, and hardware NATs. Additionally or alternatively, the physical nodes may be general-purpose machines configured to execute various virtual machines and/or applications that perform corresponding functions. A physical link is a physical medium that connects two or more physical nodes. Examples of physical links include coaxial cables, unshielded twisted pair cables, copper cables, and optical fibers.

The computer network may be an overlay network. An overlay network is a logical network implemented over another network, such as a physical network. Each node in the overlay network corresponds to a respective node in the underlay network. Thus, each node in the overlay network is associated with both an overlay address (addressed to the overlay node) and an underlay address (addressed to the underlay node implementing the overlay node). The overlay nodes may be digital devices and/or software processes (such as virtual machines, application instances, or threads). The links connecting the overlay nodes are implemented as tunnels through the underlying network. The overlay nodes at either end of the tunnel treat the underlying multi-hop path between them as a single logical link. Tunneling is performed by encapsulation and decapsulation.

In embodiments, the client may be located locally to the computer network and/or remotely from the computer network. Clients may access a computer network through other computer networks, such as a private network or the internet. The client may transmit the request to the computer network using a communication protocol, such as the hypertext transfer protocol (HTTP). The request is transmitted through an interface such as a client interface (such as a web browser), a program interface, or an Application Programming Interface (API).

In an embodiment, a computer network provides a connection between a client and a network resource. The network resources include hardware and/or software configured to execute server processes. Examples of network resources include processors, data storage, virtual machines, containers, and/or software applications. Network resources are shared among multiple clients. The clients request computing services from the computer network independently of each other. Network resources are dynamically allocated to the requests and/or clients as needed. The network resources allocated to each request and/or client may be scaled up or down based on, for example, (a) computing services requested by a particular client, (b) aggregated computing services requested by a particular tenant, and/or (c) aggregated computing services requested of the computer network. Such a computer network may be referred to as a "cloud network".

In an embodiment, a service provider provides a cloud network to one or more end users. The cloud network may implement various service models including, but not limited to, software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS). In SaaS, a service provider provides end users with the ability to use applications of the service provider that are executing on network resources. In PaaS, service providers provide end users with the ability to deploy custom applications onto network resources. Custom applications may be created using programming languages, libraries, services, and tools supported by a service provider. In IaaS, service providers provide end users with the ability to provision processing, storage, networking, and other basic computing resources provided by network resources. Any arbitrary application may be deployed on the network resources, including the operating system.

In an embodiment, a computer network may implement various deployment models including, but not limited to, private clouds, public clouds, and hybrid clouds. In a private cloud, network resources are exclusively used by a particular group of one or more entities (the term "entity" as used herein refers to an enterprise, organization, individual, or other entity). The network resources may be local to and/or remote from the premises of the particular entity group. In a public cloud, cloud resources are supplied to multiple entities (also referred to as "tenants" or "customers") that are independent of each other. The computer network and its network resources are accessed by clients corresponding to different tenants. Such a computer network may be referred to as a "multi-tenant computer network". Several tenants may use the same particular network resources at different times and/or at the same time. The network resources may be local to the premises of the tenant and/or remote from the premises of the tenant. In a hybrid cloud, a computer network includes a private cloud and a public cloud. The interface between the private cloud and the public cloud allows portability of data and applications. Data stored at the private cloud and data stored at the public cloud may be exchanged through the interface. An application implemented at the private cloud and an application implemented at the public cloud may have dependencies on each other. Calls from applications at the private cloud to applications at the public cloud (and vice versa) may be performed through the interface.

In an embodiment, tenants of the multi-tenant computer network are independent of each other. For example, the business or operation of one tenant may be separate from the business or operation of another tenant. Different tenants may have different network requirements for the computer network. Examples of network requirements include processing speed, data storage, security requirements, performance requirements, throughput requirements, latency requirements, resilience requirements, quality of service (QoS) requirements, tenant isolation, and/or consistency. The same computer network may need to fulfill different network requirements as required by different tenants.

In one or more embodiments, tenant isolation is implemented in a multi-tenant computer network to ensure that applications and/or data of different tenants are not shared with each other. Various tenant isolation methods may be used.

In an embodiment, each tenant is associated with a tenant ID. Each network resource of the multi-tenant computer network is labeled with a tenant ID. A tenant is allowed to access a particular network resource only if the tenant and the particular network resource are associated with the same tenant ID.

In an embodiment, each tenant is associated with a tenant ID. Each application implemented by the computer network is labeled with a tenant ID. Additionally or alternatively, each data structure and/or data set stored by the computer network is labeled with a tenant ID. A tenant is only allowed to access a particular application, data structure, and/or data set if the tenant and the particular application, data structure, and/or data set are associated with the same tenant ID.

As an example, each database implemented by a multi-tenant computer network may be labeled with a tenant ID. Only the tenant associated with the corresponding tenant ID may access the data of the particular database. As another example, each entry in a database implemented by a multi-tenant computer network may be labeled with a tenant ID. Only the tenant associated with the corresponding tenant ID may access the data of the particular entry. Multiple tenants may share the database.
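The entry-level labeling described above can be illustrated with a minimal sketch. It assumes an in-memory list stands in for the shared database; the names `Entry`, `SharedTable`, `insert`, and `read` are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    tenant_id: str  # label attached to every stored entry
    payload: str


@dataclass
class SharedTable:
    """One table shared by several tenants; each entry carries a tenant ID."""
    entries: list = field(default_factory=list)

    def insert(self, tenant_id: str, payload: str) -> None:
        self.entries.append(Entry(tenant_id, payload))

    def read(self, requesting_tenant_id: str) -> list:
        # Return only entries whose label matches the requester's tenant ID.
        return [e.payload for e in self.entries
                if e.tenant_id == requesting_tenant_id]


# Hypothetical tenants and data, for illustration only.
table = SharedTable()
table.insert("tenant-a", "invoice 1")
table.insert("tenant-b", "invoice 2")
table.insert("tenant-a", "invoice 3")
visible_to_a = table.read("tenant-a")
visible_to_c = table.read("tenant-c")
```

In this sketch the filter is applied on every read, so a tenant with no matching entries simply sees an empty result rather than an error.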

In an embodiment, the subscription list indicates which tenants have access to which applications. For each application, a list of tenant IDs of tenants authorized to access the application is stored. A tenant is allowed to access a particular application only if its tenant ID is contained in a subscription list corresponding to the particular application.
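A subscription-list check of this kind might look like the following sketch. The mapping `SUBSCRIPTIONS`, the application names, and the function `may_access` are assumptions made for illustration; the disclosure does not prescribe a data structure.

```python
# Hypothetical subscription lists: application name -> authorized tenant IDs.
SUBSCRIPTIONS = {
    "billing-app": {"tenant-a", "tenant-b"},
    "analytics-app": {"tenant-b"},
}


def may_access(tenant_id: str, application: str,
               subscriptions: dict = SUBSCRIPTIONS) -> bool:
    """Allow access only if the tenant ID is in the application's list."""
    # An application with no stored list is treated as having an empty one,
    # so access is denied by default.
    return tenant_id in subscriptions.get(application, set())
```

Denying access when no list exists is a design choice of this sketch, not a requirement stated in the text.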

In an embodiment, network resources (such as digital devices, virtual machines, application instances, and threads) corresponding to different tenants are isolated to tenant-specific overlay networks maintained by the multi-tenant computer network. As an example, a data packet from any source device in the tenant overlay network may be sent only to other devices within the same tenant overlay network. Encapsulation tunnels are used to prohibit any transmission from a source device on the tenant overlay network to devices in other tenant overlay networks. In particular, packets received from the source device are encapsulated within external packets. External data packets are sent from a first encapsulation tunnel endpoint (in communication with a source device in the tenant overlay network) to a second encapsulation tunnel endpoint (in communication with a destination device in the tenant overlay network). The second encapsulation tunnel endpoint decapsulates the external data packet to obtain the original data packet sent by the source device. The original data packet is sent from the second encapsulation tunnel endpoint to the destination device in the same particular overlay network.
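The encapsulation and decapsulation steps can be sketched as follows. This is a simplified model under stated assumptions: the `Packet` and `OuterPacket` types, the `OVERLAY_OF` and `ENDPOINT_OF` lookup tables, and the device and endpoint names are all hypothetical, and no real tunneling protocol is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    data: bytes


@dataclass(frozen=True)
class OuterPacket:
    src_endpoint: str
    dst_endpoint: str
    overlay_id: str  # identifies the tenant overlay network
    inner: Packet    # original packet, carried unchanged


# Hypothetical lookups: device -> overlay network, device -> tunnel endpoint.
OVERLAY_OF = {"vm-a1": "overlay-a", "vm-a2": "overlay-a", "vm-b1": "overlay-b"}
ENDPOINT_OF = {"vm-a1": "tep-1", "vm-a2": "tep-2", "vm-b1": "tep-2"}


def encapsulate(packet: Packet) -> OuterPacket:
    """First tunnel endpoint: wrap the packet, refusing cross-overlay traffic."""
    src_overlay = OVERLAY_OF[packet.src]
    if OVERLAY_OF.get(packet.dst) != src_overlay:
        raise PermissionError("destination is outside the tenant overlay network")
    return OuterPacket(ENDPOINT_OF[packet.src], ENDPOINT_OF[packet.dst],
                       src_overlay, packet)


def decapsulate(outer: OuterPacket) -> Packet:
    """Second tunnel endpoint: recover the original packet for delivery."""
    return outer.inner


original = Packet("vm-a1", "vm-a2", b"hello")
delivered = decapsulate(encapsulate(original))
```

In this sketch, a packet addressed from `vm-a1` to `vm-b1` is rejected at the first endpoint because the two devices belong to different overlay networks.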

9. Miscellaneous; extensions

Embodiments are directed to a system having one or more devices including a hardware processor and configured to perform any of the operations described herein and/or in any of the following claims.

In an embodiment, a non-transitory computer-readable storage medium includes instructions that when executed by one or more hardware processors cause performance of any of the operations described and/or claimed herein.

Any combination of the features and functions described herein may be used in accordance with one or more embodiments. In the foregoing specification, embodiments have been described with reference to numerous specific details that vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the application, and what the applicant expects to be the scope of the application, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

10. Hardware overview

According to one embodiment, the techniques described herein are implemented by one or more special purpose computing devices. The special purpose computing device may be hardwired to perform the present techniques, or may include a digital electronic device, such as one or more Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or Network Processing Units (NPUs), or may include one or more general purpose hardware processors programmed to perform the present techniques in accordance with program instructions in firmware, memory, other storage, or a combination. Such special purpose computing devices may also combine custom hardwired logic, ASICs, FPGAs, or NPUs with custom programming to implement the present techniques. The special purpose computing device may be a desktop computer system, portable computer system, handheld device, networking device, or any other device that incorporates hardwired and/or program logic to implement the techniques.

For example, FIG. 9 is a block diagram that illustrates a computer system 900 upon which an embodiment of the invention may be implemented. Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information. The hardware processor 904 may be, for example, a general purpose microprocessor.

Computer system 900 also includes a main memory 906, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904. Such instructions, when stored in a non-transitory storage medium accessible to the processor 904, cause the computer system 900 to be a special purpose machine that is customized to perform the operations specified in the instructions.

Computer system 900 also includes a Read Only Memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.

Computer system 900 may be coupled via bus 902 to a display 912, such as a Cathode Ray Tube (CRT), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. Such input devices typically have two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), to allow the device to specify positions in a plane.

Computer system 900 may implement the techniques described herein using custom hardwired logic, one or more ASICs or FPGAs, firmware, and/or program logic that, in combination with the computer system, causes or programs computer system 900 to be a special purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hardwired circuitry may be used in place of or in combination with software instructions.

The term "storage medium" as used herein refers to any non-transitory medium that stores data and/or instructions that cause a machine to operate in a specific manner. Such storage media may include non-volatile media and/or volatile media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 910. Volatile media include dynamic memory, such as main memory 906. Common forms of storage media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, a Content Addressable Memory (CAM), and a Ternary Content Addressable Memory (TCAM).

Storage media are different from, but may be used in conjunction with, transmission media. Transmission media participate in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.

Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920, wherein network link 920 is connected to a local network 922. For example, communication interface 918 may be an Integrated Services Digital Network (ISDN) card, a cable modem, a satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a Local Area Network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 928. Local network 922 and internet 928 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 920 and through communication interface 918, which carry the digital data to computer system 900 or from computer system 900, are exemplary forms of transmission media.

Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920 and communication interface 918. In the internet example, a server 930 might transmit a requested code for an application program through internet 928, ISP 926, local network 922 and communication interface 918.

The received code may be executed by processor 904 as it is received, and/or stored in storage device 910, or other non-volatile storage for later execution.

In the foregoing specification, embodiments of the application have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the application, and what the applicant expects to be the scope of the application, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

Claims

1. A non-transitory computer-readable medium comprising instructions that, when executed by one or more hardware processors, cause performance of operations comprising:

detecting an occurrence of an event to be remediated, the event corresponding to a target component;

obtaining first topology data indicative of a first set of one or more topological relationships between the target component and a first set of one or more other components;

obtaining second topology data indicative of a second set of one or more topological relationships between a second component and a second set of one or more other components;

identifying an operation manual previously executed in association with the second component, the operation manual defining a list of independently executable operations; and

presenting the operation manual as a recommendation for remedying the event, based on determining that the first set of one or more topological relationships and the second set of one or more topological relationships satisfy a similarity criterion.

2. The non-transitory computer readable medium of claim 1, wherein the first topology data and the second topology data are obtained in response to:

presenting an operation manual selection interface comprising functionality for selecting at least one of a plurality of operation manuals for execution; and

selection of the event to be remediated.

3. The non-transitory computer-readable medium of claim 1, wherein the operations further comprise:

collecting metadata associated with the event,

wherein the first topology data is obtained based on the metadata associated with the event.

4. The non-transitory computer-readable medium of claim 1, wherein determining that the first set of one or more topological relationships and the second set of one or more topological relationships satisfy a similarity criterion comprises:

Determining that (a) a percentage of components that are identical in the first topology data and the second topology data and (b) a percentage of communication channels that are identical in the first topology data and the second topology data satisfy a threshold percentage.

5. The non-transitory computer-readable medium of claim 1, wherein determining that the first set of one or more topological relationships and the second set of one or more topological relationships satisfy a similarity criterion comprises:

identifying a first set of components in the first topology data having a particular topological relationship with the target component; and

determining that a second set of components in the second topology data has the same particular topological relationship with a particular component in the second topology data.

6. The non-transitory computer-readable medium of claim 1, wherein the first set of one or more topological relationships comprises a particular component being communicatively connected to the target component within a predefined degree of separation, the degree of separation being defined by a number of intermediary components along a communication path between the particular component and the target component.

7. The non-transitory computer readable medium of claim 1, wherein the second topology data is obtained from a repository of a plurality of topologies, each topology including a plurality of components and connections between the plurality of components,

wherein each topology of the plurality of topologies is associated in the repository with at least one operation manual previously executed with respect to at least one component in the respective topology.

8. The non-transitory computer-readable medium of claim 1, wherein presenting the operation manual as a recommendation for remedying the event is further based on:

identifying a particular event that the operation manual was previously applied to remedy; and

determining that the event associated with the first topology data and the particular event satisfy a threshold level of similarity to each other.

9. The non-transitory computer-readable medium of claim 1, wherein the operation manual includes at least one user-generated data tag that specifies at least one topological relationship in the second set of one or more topological relationships.

10. A non-transitory computer-readable medium comprising instructions that, when executed by one or more hardware processors, cause performance of operations comprising:

detecting an occurrence of an event to be remediated, the event being associated with a target component in a system of components;

obtaining first topology data for the system of components, the first topology data indicating one or more topological relationships between the target component and a first set of one or more other components in the system of components;

based on the one or more topological relationships, selecting a candidate operation manual, associated with the first set of one or more other components, for remedying the event associated with the target component, wherein the candidate operation manual defines a set of independently executable operations; and

recommending the candidate operation manual for remedying the event.

11. The non-transitory computer readable medium of claim 10, wherein the first topology data is obtained in response to:

selection of the event to be remediated.

12. The non-transitory computer-readable medium of claim 10, wherein the operations further comprise:

collecting metadata related to the event,

13. The non-transitory computer-readable medium of claim 10, wherein the operations further comprise:

determining a relevance score for the candidate operation manual based on the one or more topological relationships between the target component and the first set of one or more other components; and

in response to determining that the relevance score of the candidate operation manual meets a relevance threshold level, selecting the candidate operation manual for recommendation to remedy the event.

14. The non-transitory computer-readable medium of claim 13, wherein the operations further comprise:

obtaining second topology data indicative of one or more topological relationships between a second set of components associated with the set of independently executable operations of the candidate operation manual; and

determining the relevance score for the candidate operation manual based on a similarity between the first topology data and the second topology data.

15. The non-transitory computer readable medium of claim 14, wherein determining that the relevance score of the candidate operation manual meets the relevance threshold level comprises:

determining that a first topological relationship between the second set of components in the second topology data and a particular component in the second topology data is the same as a second topological relationship between the one or more other components in the first topology data and the target component.

16. The non-transitory computer readable medium of claim 13, wherein determining the relevance score for the candidate operation manual based on the one or more topological relationships between the target component and the first set of one or more other components comprises:

determining that a first particular component among the one or more other components is associated with the event;

determining that at least one operation among the set of independently executable operations is performed on a second particular component; and

determining that the second particular component corresponds to the first particular component.

17. A non-transitory computer-readable medium comprising instructions that, when executed by one or more hardware processors, cause performance of operations comprising:

detecting an occurrence of a first event to be remediated;

identifying one or more attributes corresponding to the first event;

identifying a second event based on a similarity between the first event and the second event meeting a threshold criterion;

determining that a particular operation manual was executed to remedy the second event; and

presenting the particular operation manual as a recommendation to remedy the first event.

18. The non-transitory computer-readable medium of claim 17, wherein identifying the one or more attributes corresponding to the first event is performed in response to presenting an operation manual selection interface that includes functionality for selecting at least one operation manual of a plurality of operation manuals for execution.

19. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:

collecting first metadata associated with the first event,

wherein identifying the second event comprises:

identifying second metadata associated with the second event; and

comparing the first metadata to the second metadata to determine the similarity between the first event and the second event.

20. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:

detecting, before presenting the particular operation manual, that a user has selected a second operation manual to execute to remedy the first event; and

determining that the particular operation manual is more effective than the second operation manual at remedying events of an event type corresponding to the first event,

wherein presenting the particular operation manual as a recommendation for remedying the first event is in response to determining that the particular operation manual is more effective than the second operation manual.

21. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:

determining that the first event is associated with a first component;

obtaining topology data associated with the first component, wherein the topology data indicates one or more topological relationships between the first component and one or more other components;

determining, based on the topology data, that the first component is related to a second component; and

based on the determination, selecting a third operation manual to execute with respect to the second component.

22. The non-transitory computer-readable medium of claim 17, wherein the particular operation manual is presented as a recommendation to remedy the first event in response to determining that the particular operation manual was executed to remedy the second event.

23. A method comprising the operations of any one of claims 1-22.

24. A system comprising a hardware processor and configured to perform operations according to any of claims 1-22.

25. A system comprising means for performing the operations of any one of claims 1-22.
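
As an illustration of the similarity criterion recited in claim 4, the following is a minimal sketch and not the claimed implementation. It assumes that components are compared by type, that communication channels are undirected pairs of component types, and that each percentage is measured relative to the first topology. The `Topology` type, the helper functions, the example component names, and the default threshold of 75 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    components: frozenset  # component types, e.g. "database"
    channels: frozenset    # undirected links as frozenset({type_a, type_b})


def shared_percentage(first: frozenset, second: frozenset) -> float:
    """Percentage of items in `first` that also appear in `second`.

    Measuring against `first` is an assumption of this sketch.
    """
    if not first:
        return 0.0
    return 100.0 * len(first & second) / len(first)


def is_similar(first: Topology, second: Topology,
               threshold: float = 75.0) -> bool:
    """True when both component and channel percentages meet the threshold."""
    return (
        shared_percentage(first.components, second.components) >= threshold
        and shared_percentage(first.channels, second.channels) >= threshold
    )


def link(a: str, b: str) -> frozenset:
    """Build an undirected channel between two component types."""
    return frozenset({a, b})


# Topology in which the event occurred (hypothetical example data).
event_topology = Topology(
    frozenset({"web", "app", "database", "cache"}),
    frozenset({link("web", "app"), link("app", "database"), link("app", "cache")}),
)
# Topology associated with a previously executed operation manual.
past_topology = Topology(
    frozenset({"web", "app", "database", "queue"}),
    frozenset({link("web", "app"), link("app", "database"), link("app", "queue")}),
)
```

Here three of the four component types and two of the three channels are shared, so the pair passes a 60 percent threshold but fails the sketch's default of 75 percent because of the channel percentage.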