CN117632666B - Alarm method, equipment and storage medium - Google Patents

Publication number
CN117632666B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410112376.5A
Other languages
Chinese (zh)
Other versions
CN117632666A (en
Inventor
洪元东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the application provide an alarm method, an alarm device and a storage medium. When an alarm triggering event occurs, a request path is determined for each abnormal IO request, and the abnormal IO requests are clustered based on the request paths to generate at least one abnormal IO request group, wherein abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group; alarm information is then output in units of abnormal IO request groups. In this way, from the perspective of a single abnormal physical node, the abnormal IO requests caused by that node all have request paths passing through it and can therefore be clustered into the same abnormal IO request group. Abnormal IO requests arising from the same cause are thus reported in a single piece of alarm information, which avoids repeated alarms for the same cause, reduces the amount of alarm information, allows alarms to be handled in time, and improves alarm processing efficiency.

Description

Alarm method, equipment and storage medium
Technical Field
The present application relates to the field of cloud storage technologies, and in particular, to an alarm method, an alarm device, and a storage medium.
Background
Cloud storage can be understood as a model of online storage on a network. With the development of cloud storage technology, a storage-compute separation architecture has also been proposed on the basis of it. Cloud storage technology may be used to implement the storage layer of the storage-compute separation architecture. The architecture also includes a computing layer; the computing layer and the storage layer are decoupled and communicate over a network, and each may be implemented as an independent distributed system. Each computing node in the computing layer can access the storage layer by means of IO requests to read data from and write data to the storage nodes in the storage layer.
At present, a full-link anomaly monitoring system is generally deployed in a storage-compute separation architecture to discover and automatically diagnose abnormal IO requests, and alarms are raised in units of abnormal IO requests.
However, as the number of IO requests grows by orders of magnitude, the number of alarms grows with it. The massive volume of alarm information causes alarms to pile up, puts great pressure on alarm processing, and prevents alarms from being handled in time.
Disclosure of Invention
Aspects of the present application provide an alarm method, apparatus, and storage medium for improving the processing efficiency of alarms.
The embodiment of the application provides an alarm method, which comprises the following steps:
when an alarm triggering event occurs, determining the request path corresponding to each abnormal IO request, wherein a single request path characterizes the communication relationship between the physical nodes through which the corresponding abnormal IO request passes;
clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group;
and outputting alarm information in units of abnormal IO request groups.
Further, determining the request paths corresponding to the abnormal IO requests respectively includes:
acquiring the anomaly monitoring information corresponding to each abnormal IO request;
and parsing, from the anomaly monitoring information, the identification information and passing order of the physical nodes through which the corresponding abnormal IO request passes, so as to determine the request path corresponding to each abnormal IO request.
Further, acquiring the anomaly monitoring information corresponding to each abnormal IO request includes:
sending an anomaly monitoring information acquisition request to an anomaly monitoring system, wherein the acquisition request carries a target anomaly type, and the anomaly monitoring system has diagnosed the anomaly type corresponding to each abnormal IO request;
and receiving the anomaly monitoring information, returned by the anomaly monitoring system, that corresponds to the abnormal IO requests diagnosed as the target anomaly type.
Further, the anomaly monitoring information takes the form of trace information, which corresponds one-to-one with IO requests. A piece of trace information comprises a plurality of span items with a sequential relationship; the span items correspond one-to-one with the physical nodes through which the IO request passes, and each span item contains the identification information of its corresponding physical node. The sequential relationship among the span items in the trace information characterizes the order in which the IO request passes through the physical nodes.
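As an illustration of the trace and span relationship described above, the following is a minimal sketch that recovers a request path from one piece of trace information. The `Span` and `Trace` classes and their field names (`node_id`, `order`, `request_id`) are hypothetical and are not taken from any particular monitoring system; the sketch assumes each span carries an explicit position within its trace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    node_id: str  # identification information of the physical node (hypothetical field)
    order: int    # position of the span within the trace (hypothetical field)


@dataclass
class Trace:
    request_id: str    # one trace per IO request
    spans: List[Span]  # one span per physical node the request passed through


def request_path(trace: Trace) -> List[str]:
    """Return the node identifiers of one IO request in passing order."""
    return [span.node_id for span in sorted(trace.spans, key=lambda s: s.order)]


# Spans may arrive unordered; sorting by `order` restores the passing sequence.
trace = Trace("io-1", [Span("10.0.0.2", 1), Span("10.0.0.1", 0), Span("10.0.0.3", 2)])
path = request_path(trace)
```

Here `path` lists the three node identifiers in the order the request passed through them.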
Further, clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, including:
clustering abnormal IO requests corresponding to each request path capable of being communicated through physical nodes into an abnormal IO request group, or
Inquiring a physical node capable of communicating a plurality of request paths as a clustering node; and clustering the abnormal IO requests corresponding to each request path which can be communicated by the single clustering node into an abnormal IO request group.
Further, clustering abnormal IO requests corresponding to each request path capable of being communicated through the physical node into an abnormal IO request group includes:
Taking physical nodes on each request path as vertexes, and taking a communication relation among the physical nodes on each request path as edges to construct an undirected graph;
Searching for connected components from the undirected graph;
and clustering the abnormal IO requests corresponding to the request paths contained in the single connected component into an abnormal IO request group.
Further, searching for connected components from the undirected graph includes:
searching a connected component of a target vertex when traversing to the target vertex in the undirected graph;
deleting a connected component of the target vertex from the undirected graph;
continuing to determine the next target vertex from the residual vertices in the undirected graph, and searching and deleting the corresponding connected components until the residual vertices do not exist in the undirected graph;
outputting the searched connected component.
Further, with the abnormal IO request group as a unit, outputting alarm information, including:
After the physical nodes passed by each abnormal IO request in the target abnormal IO request group are de-duplicated, the rest physical nodes are determined as target nodes;
based on the identification information of the target node, the identification information of the cluster to which the target node belongs and/or the abnormality type related to the target abnormal IO request group, outputting alarm information for the target abnormal IO request group;
wherein the target abnormal IO request group is any abnormal IO request group.
Further, after de-duplicating the physical nodes through which each abnormal IO request passes in the target abnormal IO request group, determining the remaining physical nodes as target nodes, including:
And if an undirected graph is constructed based on each request path and connected components are searched from the undirected graph to cluster out the target abnormal IO request group, taking physical nodes represented by all vertexes contained in the connected components corresponding to the target abnormal IO request group as target nodes.
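The alarm output step above can be sketched as follows, assuming a request group is represented simply as a list of request paths. The function name `build_alarm`, the `cluster_of` lookup table and the dictionary layout of the alarm are hypothetical choices made for illustration only.

```python
from typing import Dict, List


def build_alarm(group_paths: List[List[str]],
                cluster_of: Dict[str, str],
                anomaly_type: str) -> dict:
    """Assemble one alarm for one abnormal IO request group."""
    # De-duplicate the physical nodes across every request path in the group.
    target_nodes = sorted({node for path in group_paths for node in path})
    # Look up the cluster of each target node (nodes without an entry are skipped).
    clusters = sorted({cluster_of[n] for n in target_nodes if n in cluster_of})
    return {
        "anomaly_type": anomaly_type,
        "target_nodes": target_nodes,
        "clusters": clusters,
    }


alarm = build_alarm(
    [["A", "B"], ["A", "C"]],
    {"A": "compute-1", "B": "storage-1", "C": "storage-1"},
    "IO unavailable",
)
```

In this example node A appears on both paths but is listed once in the resulting alarm.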
Further, the physical node includes at least a computing node and a storage node, and after outputting the alarm information, the method further includes:
responding to an alarm processing instruction, and analyzing a communication structure formed between a computing node and a storage node under a target abnormal IO request group corresponding to target alarm information;
inferring, based on the correspondence between communication structures and abnormal nodes, the abnormal node that caused the abnormality in the target abnormal IO request group.
Further, inferring, based on the correspondence between communication structures and abnormal nodes, the abnormal node that caused the abnormality in the target abnormal IO request group includes:
if a first type of communication structure exists in the target abnormal IO request group, inferring that the computing node in the first type of communication structure is the abnormal node, the first type of communication structure being one computing node communicating with a plurality of storage nodes; or
if a second type of communication structure exists in the target abnormal IO request group, inferring that the storage node in the second type of communication structure is the abnormal node, the second type of communication structure being one storage node communicating with a plurality of computing nodes; or
if a third type of communication structure exists in the target abnormal IO request group, inferring that the intermediate node in the third type of communication structure is the abnormal node, the third type of communication structure being a plurality of computing nodes communicating with a plurality of storage nodes through an intermediate node.
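The three communication structures above can be sketched as a simple heuristic. This is only an illustration under strong assumptions that are not stated in the source: the first node of each request path is treated as the computing node, the last as the storage node, and any node in between as an intermediate node. The function name `suspect_nodes` is hypothetical.

```python
from collections import defaultdict
from typing import List, Set


def suspect_nodes(paths: List[List[str]]) -> Set[str]:
    """Return nodes matching one of the three communication structures."""
    storage_of = defaultdict(set)   # computing node -> storage nodes it reached
    compute_of = defaultdict(set)   # storage node -> computing nodes reaching it
    via_compute = defaultdict(set)  # intermediate node -> computing nodes using it
    via_storage = defaultdict(set)  # intermediate node -> storage nodes behind it
    for path in paths:
        src, dst = path[0], path[-1]  # assumed roles: initiator and responder
        storage_of[src].add(dst)
        compute_of[dst].add(src)
        for mid in path[1:-1]:
            via_compute[mid].add(src)
            via_storage[mid].add(dst)
    suspects = set()
    # First type: one computing node communicates with several storage nodes.
    suspects |= {c for c, s in storage_of.items() if len(s) > 1}
    # Second type: one storage node communicates with several computing nodes.
    suspects |= {s for s, c in compute_of.items() if len(c) > 1}
    # Third type: several computing and storage nodes meet at an intermediate node.
    suspects |= {m for m in via_compute
                 if len(via_compute[m]) > 1 and len(via_storage[m]) > 1}
    return suspects


first = suspect_nodes([["C1", "S1"], ["C1", "S2"]])
second = suspect_nodes([["C1", "S1"], ["C2", "S1"]])
third = suspect_nodes([["C1", "M", "S1"], ["C2", "M", "S2"]])
```

When more than one structure is present in a group, this sketch returns every matching node and leaves the final judgment to the operator.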
Further, the target exception type comprises an IO unavailable class or an IO damaged class, the physical node comprises a computing node in a computing system, a storage node in a storage system and/or an intermediate node for network connection, and the abnormal IO request is an IO request which is initiated by the computing node in the computing system to the storage node in the storage system and has abnormality.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor;
the memory is used for storing one or more computer instructions;
The processor is coupled to the memory for executing the one or more computer instructions for performing the alert method described previously.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned alarm method.
Embodiments of the present application also provide a computer program product comprising a computer program/instructions, wherein the computer program, when executed by a processor, causes the processor to implement the alert method described above.
In the embodiment of the application, under the condition of an alarm triggering event, request paths are respectively determined for the abnormal IO requests, the abnormal IO requests are clustered based on the request paths so as to generate at least one abnormal IO request group, and the abnormal IO requests corresponding to the request paths passing through the same physical node are positioned in the same abnormal IO request group; moreover, it is proposed to output alarm information in units of abnormal IO request groups. In this way, from the perspective of a single physical node with an exception, request paths of the exception IO requests caused by the physical node are all routed through the physical node, so that the exception IO requests corresponding to the request paths can be clustered in the same exception IO request group, therefore, the exception IO requests caused by the same exception cause can be ensured to realize alarming in one piece of alarming information, repeated alarming on the same exception cause can be avoided, the number of alarming information is reduced, the alarming information can be timely processed, and the processing efficiency of alarming is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alarm method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary request path for an abnormal IO request in accordance with an exemplary embodiment of the present application;
FIG. 3 is a logical schematic diagram of an exemplary clustering scheme provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of another alarm method according to an exemplary embodiment of the present application;
FIG. 5 is a flow chart of yet another alert method according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of several exemplary communication structures provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to another exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As mentioned in the background art, in the current alarm scheme for IO requests, the alarm information is usually output in units of IO requests, so that in the case that the number of IO requests is continuously increased, the resulting alarm information is also increased in mass. The continuously piled alarm information brings great pressure to alarm processing work, and the alarms cannot be processed in time, so that the alarm processing efficiency is poor.
Therefore, the embodiment of the application provides a novel alarm method whose basic concept is to cluster abnormal IO requests into abnormal IO request groups and to output alarm information in units of abnormal IO request groups. This effectively reduces the amount of alarm information and ensures that alarm information is handled in time, thereby improving alarm processing efficiency.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of an alarm method according to an exemplary embodiment of the present application, where the method may be performed by an alarm device, and the alarm device may be implemented as software, hardware, or a combination of software and hardware, and the alarm device may be integrated in an electronic device. Referring to fig. 1, the method may include:
step 100, under the condition of an alarm triggering event, determining request paths corresponding to the abnormal IO requests respectively, wherein a single request path characterizes a communication relationship between physical nodes through which the corresponding abnormal IO requests pass;
step 101, clustering abnormal IO requests based on request paths to generate at least one abnormal IO request group, wherein the abnormal IO requests corresponding to the request paths passing through the same physical node are located in the same abnormal IO request group;
step 102, outputting alarm information in units of abnormal IO request groups.
The alarm method provided by the embodiment can be applied to various scenes in which abnormal alarm is required to be carried out for IO requests. In different scenarios, there may be a difference between the initiator and the recipient to which the IO request corresponds. For example, in the cloud storage scenario mentioned in the background, the initiator of an IO request is typically a compute node in a computing system and the recipient is typically a storage node in a storage system. For IO requests in other scenarios, no more examples of initiator and receiver are made here. It should be understood that, the application scenario is not limited in this embodiment, and the alarm method provided in this embodiment may be used in an application scenario where an IO request passes through multiple physical nodes, so as to improve the processing efficiency of the alarm.
Referring to fig. 1, in step 100, an alarm triggering event may be that an alarm triggering instruction is received, or a preset alarm period is reached, or any other type of event is also possible, and the event type of the alarm triggering event is not limited in this embodiment, and accordingly, the starting timing of the alarm method is not limited.
The abnormal IO request in step 100 refers to an IO request that has been diagnosed as abnormal. Regarding the abnormality diagnosis link of the IO request, the present embodiment is not limited. In this embodiment, the existing or future available IO request abnormality monitoring means is supported to perform abnormality monitoring on the IO request, and the abnormal IO request is monitored in time. For example, in a cloud storage scenario, an anomaly monitoring system based on full link tracking has been deployed, and such anomaly monitoring system can generate tracking trace information in units of IO requests and automatically diagnose anomaly types corresponding to the IO requests. Of course, this is merely exemplary, and other anomaly monitoring means are also supported in the present embodiment to discover an anomalous IO request in advance.
In this embodiment, abnormal IO requests are the requests that need to be alarmed on, so in step 100 the abnormal IO requests may be screened out as the processing objects of this embodiment.
On this basis, in step 100, the request paths corresponding to the abnormal IO requests may be determined. The request path is used for representing the communication relation between the physical nodes through which the corresponding abnormal IO request passes. It should be understood that the request path includes at least a physical node that initiates the abnormal IO request and a physical node that responds to the abnormal IO request, and of course, the request path may also include a physical node for intermediately forwarding the abnormal IO request.
FIG. 2 is a schematic diagram of an exemplary request path for an abnormal IO request in accordance with an exemplary embodiment of the present application. Referring to FIG. 2, a storage-compute separation architecture may include a computing system, which may include computing nodes, and a storage system, which may include storage nodes. The storage systems provided by different cloud storage products may use different node organization structures. For example, as shown in FIG. 2, the storage system provided by a block storage product may deploy at least two types of storage nodes: the first type of storage node is provided with a block server (blockserver) for scheduling storage resources, and the second type of storage node is provided with a chunk server (chunkserver) for storing data. That is, the data involved in IO requests is stored on the second type of storage node, while the first type of storage node is mainly responsible for scheduling and managing the storage resources on the second type of storage node and does not itself store data. Of course, the storage-compute separation architecture shown in FIG. 2 is merely exemplary; for instance, the storage system may not include the first type of storage node, which is not limited herein.
With continued reference to fig. 2, the abnormal IO request shown in fig. 2 is issued by computing node a in the computing system, is forwarded by storage node 1 in the storage system, and finally reaches storage node 2 in the storage system, which completes the response. Based on this, the request path corresponding to the abnormal IO request may be determined as: computing node a - storage node 1 - storage node 2. It will be appreciated that the request path characterizes not only the physical nodes through which the abnormal IO request passes, but also the communication relationship between those physical nodes. Referring to fig. 2, the request path corresponding to the abnormal IO request characterizes that computing node a communicates with storage node 1, and that storage node 1 communicates with storage node 2.
The inventors found in the course of research that the request paths corresponding to different abnormal IO requests may pass through the same physical node, and that such request paths can be regarded as connected through that physical node.
On this basis, the present embodiment proposes that in step 101, the abnormal IO requests may be clustered based on the request paths to generate at least one abnormal IO request group. The abnormal IO requests corresponding to request paths passing through the same physical node are located in the same abnormal IO request group. It should be appreciated that the abnormal IO requests are clustered in step 101 so that at least one abnormal IO request group may be generated. The basis for clustering is the connectivity between the request paths.
As mentioned above, the request paths may be communicated based on the same physical node passing through, so that the request paths communicated based on the same physical node may be clustered together, which may ensure that the abnormal IO requests passing through the same physical node can be clustered within the same abnormal IO request group. Thus, based on the connection relationship between the request paths, each request path determined in step 100 will be classified under each clustered abnormal IO request group.
The inventors found in the course of research that a first request path may connect to multiple other request paths, and that these other request paths may connect to the first request path through different physical nodes. In a preferred implementation, in step 101, the abnormal IO requests corresponding to all request paths that can be connected through physical nodes are clustered into one abnormal IO request group. In this preferred implementation, multiple other request paths that connect to the first request path through different physical nodes can be clustered into the same abnormal IO request group. The first request path may be any request path determined in step 100.
For example, suppose the first request path is A-B. Through physical node A, the other request paths that connect to the first request path are A-C and A-D-F; through physical node B, the other request path that connects to the first request path is B-G. Thus, although request path B-G and request path A-C do not pass through any common physical node, both can be clustered into the same abnormal IO request group because each connects to the first request path.
In this preferred implementation, a variety of clustering schemes may be employed to implement clustering of abnormal IO requests. Fig. 3 is a logical schematic diagram of an exemplary clustering scheme provided by an exemplary embodiment of the present application. Referring to fig. 3, in this exemplary clustering scheme: physical nodes on each request path are taken as vertexes, and communication relations among the physical nodes on each request path are taken as edges, so that an undirected graph is constructed; searching for connected components from the undirected graph; and clustering the abnormal IO requests corresponding to the request paths contained in the single connected component into an abnormal IO request group.
In this exemplary clustering scheme, a data structure, the undirected graph, is introduced to carry the request paths. A graph is a nonlinear structure made up of two kinds of elements, vertices and edges, where an edge represents a connection between two vertices. If the edges of a graph have no direction, the graph is called an undirected graph. In this exemplary clustering scheme, the undirected graph characterizes the communication relationship between the physical nodes on the request paths, and also characterizes whether request paths are connected to one another. Based on this, the scheme proposes searching for connected components in the constructed undirected graph. A connected component is a maximal connected subgraph of the undirected graph, that is, a reachable path exists between any two vertices in the connected component. Fig. 3 shows an undirected graph constructed in this exemplary clustering scheme. It should be understood that not every pair of vertices in the undirected graph has a reachable path between them; for example, there is no reachable path between vertex 6 and vertex 7. Fig. 3 also shows the three connected components found in the undirected graph.
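The graph construction described above can be sketched as follows, assuming each request path is a list of node identifiers in passing order and the graph is held as an adjacency map. The function name `build_undirected_graph` is hypothetical, and the sample paths reuse the A-B, A-C, A-D-F and B-G example plus one extra, made-up path X-Y that shares no node with the others.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_undirected_graph(paths: List[List[str]]) -> Dict[str, Set[str]]:
    """Vertices are physical nodes; edges link consecutive nodes on a path."""
    graph = defaultdict(set)
    for path in paths:
        for node in path:
            graph[node]  # touching the key registers the vertex, even without edges
        for u, v in zip(path, path[1:]):
            graph[u].add(v)  # undirected: record the edge in both directions
            graph[v].add(u)
    return dict(graph)


paths = [["A", "B"], ["A", "C"], ["A", "D", "F"], ["B", "G"], ["X", "Y"]]
graph = build_undirected_graph(paths)
```

In the resulting map, vertex A is adjacent to B, C and D, while X and Y are adjacent only to each other, so no path leads from X to A.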
In this exemplary clustering scheme, the implementation logic for searching for connected components may be:
searching a connected component of a target vertex when traversing to the target vertex in the undirected graph;
Deleting the connected component of the target vertex from the undirected graph;
Continuing to determine the next target vertex from the residual vertices in the undirected graph, and searching and deleting the corresponding connected components until the residual vertices do not exist in the undirected graph;
outputting the searched connected component.
In practical application, the IP address of the physical node may be used as the identification information, so in the undirected graph, each vertex may be marked as an IP address, based on this, each IP address in the undirected graph may be traversed, and when traversing to the target IP address, a connected component where the target vertex corresponding to the target IP address is located may be searched. Here, the search algorithm employed in searching for the connected component where the target vertex is located is not limited, and for example, a depth-first search (DFS) algorithm or the like may be employed, and the search principle will not be described in detail here. The searched connected components can then be deleted from the undirected graph and the next target IP address can be determined from the remaining IP addresses, and in this way, all connected components in the undirected graph can be searched.
In this way, the connectivity among the request paths can be accurately represented by the undirected graph, and the problem of clustering request paths is converted into the problem of searching for connected components in the undirected graph, which effectively improves the efficiency of clustering request paths. Clustering the request paths is, in substance, clustering the abnormal IO requests.
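The search-and-delete procedure above can be sketched with an iterative depth-first search. This is one possible reading under stated assumptions: request paths are given as a mapping from a request identifier to a non-empty list of node identifiers, and the names `connected_components` and `cluster_requests` are hypothetical.

```python
from typing import Dict, List, Set


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Repeatedly pick a target vertex, search its component, then delete it."""
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    components = []
    while remaining:
        target = next(iter(remaining))  # next target vertex among remaining ones
        stack, seen = [target], {target}
        while stack:  # iterative depth-first search
            v = stack.pop()
            for w in remaining[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        for v in seen:  # delete the whole component from the graph
            del remaining[v]
        components.append(seen)
    return components


def cluster_requests(request_paths: Dict[str, List[str]]) -> List[List[str]]:
    """Group request ids whose paths fall in the same connected component."""
    graph: Dict[str, Set[str]] = {}
    for path in request_paths.values():
        for node in path:
            graph.setdefault(node, set())
        for u, v in zip(path, path[1:]):
            graph[u].add(v)
            graph[v].add(u)
    # A path lies entirely in one component, so checking its first node suffices.
    return [sorted(r for r, p in request_paths.items() if p[0] in comp)
            for comp in connected_components(graph)]


groups = cluster_requests({
    "io1": ["A", "B"], "io2": ["A", "C"], "io3": ["A", "D", "F"],
    "io4": ["B", "G"], "io5": ["X", "Y"],
})
```

With these sample paths, io1 through io4 end up in one group and io5 in another, so two alarms would be output instead of five.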
It should be noted that the clustering scheme provided in fig. 3 is merely exemplary, and other clustering schemes may be used in the present embodiment to implement the clustering of abnormal IO requests in step 101. For example, the request paths may be traversed one by one. When a target request path is reached, the other request paths that share a physical node with it are searched for and the connections between the target request path and those other request paths are recorded; traversal then continues with the next request path, so that for every request path the set of request paths connected to it is obtained. On this basis, one request path is selected as a starting path, and the starting path and the request paths connected to it are added to a path group. Each request path connected to the starting path is then taken as a new starting path in turn, and the request paths connected to it are added to the path group, skipping any request path already in the group. Repeating this cycle clusters all request paths that can be connected into the same path group. This clustering scheme achieves a clustering result consistent with that of fig. 3. No further clustering schemes are illustrated here, and the present embodiment is not limited thereto.
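The path-traversal scheme just described can be sketched as follows. The sketch first records, for every request path, which other paths share a physical node with it, and then grows each path group breadth-first from a starting path. The function name `cluster_paths` and the use of list indices to identify paths are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import List


def cluster_paths(paths: List[List[str]]) -> List[List[int]]:
    """Group path indices that are directly or indirectly connected."""
    by_node = defaultdict(set)  # physical node -> indices of paths through it
    for i, path in enumerate(paths):
        for node in path:
            by_node[node].add(i)
    linked = {i: set() for i in range(len(paths))}  # path -> connected paths
    for members in by_node.values():
        for i in members:
            linked[i] |= members - {i}
    groups, assigned = [], set()
    for start in range(len(paths)):
        if start in assigned:
            continue  # already placed in an earlier path group
        group, queue = {start}, deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in linked[cur] - group:  # skip paths already in the group
                group.add(nxt)
                queue.append(nxt)
        assigned |= group
        groups.append(sorted(group))
    return groups


path_groups = cluster_paths(
    [["A", "B"], ["A", "C"], ["A", "D", "F"], ["B", "G"], ["X", "Y"]]
)
```

On the sample paths this yields the same two groups as a connected-component search over the node graph would.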
In the preferred implementation manner, the abnormal IO requests corresponding to the directly or indirectly connected request paths can be clustered into the same abnormal IO request group, so that not only can the abnormal IO requests passing through the same physical node be clustered in the same abnormal IO request group, but also each abnormal IO request can be ensured to only appear in one abnormal IO request group, and therefore repeated analysis of the same abnormal IO request can be avoided, and the number of clustered abnormal IO request groups can be reduced better.
In addition to the preferred implementation provided above, in this embodiment, other implementations may be used to cluster out the abnormal IO request group in step 101.
For example, in another alternative implementation: a physical node capable of connecting a plurality of request paths is queried as a clustering node; and the abnormal IO requests corresponding to the request paths connected by a single clustering node are clustered into one abnormal IO request group. This implementation can also ensure that the abnormal IO requests passing through the same physical node are clustered in the same abnormal IO request group, so that the subsequently output alarm information can point to a single abnormality cause, which makes alarm processing more convenient. However, the number of clustered abnormal IO request groups will be larger than in the preferred implementation described previously. In addition, in this alternative implementation, the abnormal IO request groups may also be clustered by using the undirected graph approach described above, illustratively: when traversing to a target vertex in the undirected graph, if a plurality of request paths pass through the target vertex, clustering the abnormal IO requests corresponding to the plurality of request paths passing through the target vertex into one abnormal IO request group; deleting the plurality of request paths passing through the target vertex from the undirected graph; and continuing to determine the next target vertex from the remaining vertices in the undirected graph, and searching for and deleting the plurality of request paths passing through that target vertex, until no vertices remain in the undirected graph, so as to obtain at least one abnormal IO request group.
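As a rough illustration of the first variant of this alternative (querying clustering nodes and grouping the request paths each one connects), a minimal sketch might look as follows. The function name `cluster_by_clustering_node` and the input shape are assumptions for the example; leaving out requests whose paths share no node with any other path is also an assumption of this sketch, since the text does not specify how such requests are handled.

```python
from collections import defaultdict


def cluster_by_clustering_node(paths):
    """Return {clustering node id: [request ids whose paths pass through it]}.

    paths: dict mapping a request id to the list of physical node ids on its
    request path (hypothetical input shape, assumed for this sketch).
    """
    through = defaultdict(list)
    for rid, nodes in paths.items():
        # dict.fromkeys removes duplicate nodes on a path while keeping order.
        for node in dict.fromkeys(nodes):
            through[node].append(rid)
    # A clustering node is a physical node that connects several request paths.
    return {node: rids for node, rids in through.items() if len(rids) > 1}
```

With this grouping, one abnormal IO request may appear in more than one group when its path passes through several clustering nodes, which is consistent with the statement that this alternative yields more groups than the preferred implementation.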
It will be appreciated that, in this embodiment, at least one abnormal IO request group may be clustered in step 101, and from the perspective of a single physical node, abnormal IO requests routed through the same physical node may be clustered in the same abnormal IO request group. The inventor found during research that, when an abnormality occurs in a physical node, an IO request that needs to pass through that physical node becomes abnormal with high probability. Since step 101 of this embodiment clusters the abnormal IO requests passing through the same physical node into the same abnormal IO request group, this can be understood as clustering the abnormal IO requests caused by the same abnormality cause into the same abnormal IO request group.
On this basis, referring to fig. 1, in the present embodiment, in step 102, it is proposed to output alarm information in units of abnormal IO request groups. That is, an abnormal IO request group outputs an alarm message. In this embodiment, the alarm content included in the alarm information is not limited, and necessary content required for performing alarm processing under the abnormal IO request group may be provided. In this embodiment, the content fields in the alarm information are supported to be configured as required, and in step 102, corresponding alarm content may be generated based on the content fields required in the alarm information, and the generated alarm content may be encapsulated in the corresponding content fields in the alarm information, so as to generate and output the alarm information. The construction scheme of the alarm information will be described in the following by way of example, and will not be described in detail here.
In the step 102, the abnormal IO requests caused by the same abnormal cause can be alerted in one piece of alert information, which can avoid repeated alerts for the same abnormal cause. The inventors found during the research that the number of abnormal IO request groups clustered based on step 101 in this embodiment is much smaller than the number of abnormal IO requests, and therefore the number of alarm information output in step 102 will be much smaller than the number of alarm information output in units of IO requests in the conventional art.
In summary, in this embodiment, it is proposed that, in the case of an alarm triggering event, request paths are determined for the abnormal IO requests respectively, and the abnormal IO requests are clustered based on the request paths, so as to generate at least one abnormal IO request group, where the abnormal IO requests corresponding to the request paths routed through the same physical node are located in the same abnormal IO request group; moreover, it is proposed to output alarm information in units of abnormal IO request groups. In this way, from the perspective of a single physical node with an exception, request paths of the exception IO requests caused by the physical node are all routed through the physical node, so that the exception IO requests corresponding to the request paths can be clustered in the same exception IO request group, therefore, the exception IO requests caused by the same exception cause can be ensured to realize alarming in one piece of alarming information, repeated alarming on the same exception cause can be avoided, the number of alarming information is reduced, the alarming information can be timely processed, and the processing efficiency of alarming is improved.
Fig. 4 is a flow chart of another alarm method according to an exemplary embodiment of the present application, and referring to fig. 4, the method may include:
Step 400, under the condition of an alarm triggering event, acquiring the abnormal monitoring information corresponding to each abnormal IO request;
Step 401, analyzing identification information and routing sequence of physical nodes through which the corresponding abnormal IO requests pass from the abnormal monitoring information to determine request paths corresponding to the abnormal IO requests respectively;
Step 402, clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein the abnormal IO requests corresponding to the request paths passing through the same physical node are located in the same abnormal IO request group;
Step 403, outputting alarm information by taking the abnormal IO request group as a unit.
Step 402 and step 403 may refer to the related descriptions in the foregoing embodiments, and are not repeated here. In this embodiment, an alternative implementation manner of determining a request path corresponding to an abnormal IO request is provided based on step 400 and step 401. This alternative implementation can be combined with the implementation provided for the other steps in the above or in the embodiments described below to create new solutions.
Referring to fig. 4, in this alternative implementation, anomaly monitoring information corresponding to each of the anomaly IO requests may be obtained. As mentioned above, various anomaly monitoring methods for IO requests are supported in the present embodiment, and anomaly monitoring information can be generated in these anomaly monitoring methods. Preferably, an anomaly monitoring means capable of producing anomaly monitoring information in units of IO requests may be employed in the present embodiment. Of course, this is only preferable, and in the case where abnormality monitoring data is produced in other units, it is supported in the present embodiment that these abnormality monitoring data are collated into abnormality monitoring information in units of IO requests.
In one exemplary scenario: the anomaly monitoring information adopts tracking trace information, the trace information corresponds to the IO request one by one, the trace information comprises a plurality of span items with sequence relations, the span items correspond to physical nodes through which the IO request passes one by one, the span items comprise identification information of the corresponding physical nodes, and the sequence relations among the span items contained in the trace information are used for representing the passing sequence among the physical nodes through which the IO request passes. In this exemplary scenario, in step 400, trace information corresponding to each of the abnormal IO requests may be obtained from an abnormal monitoring system based on full link tracking, and as mentioned above, such an abnormal monitoring system may generate trace information in units of IO requests and automatically diagnose the type of abnormality corresponding to the IO request. In addition, the anomaly monitoring system respectively constructs span items for all physical nodes through which the IO request passes, is used for describing the IO processing process on the physical nodes, has a sequence relation between the span items contained in trace information, and can be used as a basis for determining the passing sequence between the physical nodes through which the IO request passes. It should be understood that this is merely exemplary, and that other types of anomaly monitoring information may also be employed in the present embodiment, and is not limited thereto.
Based on this, referring to fig. 4, it is proposed in step 401 that the identification information and the routing order of the physical nodes through which the corresponding abnormal IO requests pass may be parsed from the anomaly monitoring information, so as to determine the request paths corresponding to the abnormal IO requests. The inventor found during research that the anomaly monitoring information generally records description information of the IO processing process on each physical node through which the abnormal IO request passes. This description information generally includes the identification information of the physical node, the identification information of the cluster to which the physical node belongs, the time consumed by IO processing on the physical node, the identification information of the previous hop node of the physical node, the identification information of the next hop node of the physical node, the type of IO processing operation executed on the physical node, and so on. For example, the span items mentioned above record this IO processing description information for the corresponding physical nodes.
Based on this, in step 401, the identification information and the routing order of the physical node through which the corresponding abnormal IO request is routed may be analyzed from the abnormal monitoring information, and further, the request path corresponding to the abnormal IO request may be determined.
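The parsing in step 401 can be illustrated with the following minimal sketch. The trace layout used here (a `spans` list whose items carry a `seq` ordering field and a `node_id` field) is a hypothetical representation chosen for the example; real trace formats produced by an anomaly monitoring system may name and order span items differently.

```python
def request_path_from_trace(trace):
    """Derive a request path from trace information for one abnormal IO request.

    trace: {"request_id": str, "spans": [{"seq": int, "node_id": str}, ...]}
    (hypothetical layout; the field names are assumptions for this sketch).
    """
    # The sequence relation between span items gives the routing order of the
    # physical nodes that the IO request passes through.
    ordered_spans = sorted(trace["spans"], key=lambda span: span["seq"])
    return [span["node_id"] for span in ordered_spans]


def request_paths(traces):
    """Map each abnormal IO request id to its request path."""
    return {t["request_id"]: request_path_from_trace(t) for t in traces}
```

The resulting mapping from request identifier to ordered node identifiers is the kind of input that a path-based clustering step could then consume.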
Further, the inventors found during research that the exception types corresponding to different abnormal IO requests may not be exactly the same. In practice, the exception types of IO requests may include, but are not limited to, an IO unavailable class, an IO impaired class, and the like. The IO unavailable class usually means that the response to the IO request was not completed, and the IO impaired class usually means that the response to the IO request was completed but too slowly. It should be understood that the exception types provided here are merely exemplary and that the present embodiment is not limited thereto. Furthermore, the inventors found that the exception type is an important reference in the subsequent alarm handling link.
For this reason, in this alternative implementation, an exemplary acquisition scheme is proposed for a process of acquiring the anomaly monitoring information corresponding to each of the anomaly IO requests: an abnormality monitoring information acquisition request can be sent to an abnormality monitoring system deployed for the storage system, wherein the acquisition request carries a target abnormality type, and the abnormality monitoring system has diagnosed the abnormality type corresponding to each abnormality IO request; and receiving the abnormality monitoring information corresponding to the abnormality IO request which is returned by the abnormality monitoring system and is diagnosed as the target abnormality type.
In the exemplary acquisition scheme, the target anomaly type is carried in the anomaly monitoring information acquisition request sent to the anomaly monitoring system. Because the anomaly monitoring system has already diagnosed the anomaly type corresponding to each abnormal IO request, it can screen out the abnormal IO requests diagnosed as the target anomaly type and return the anomaly monitoring information corresponding to the screened abnormal IO requests.
In this way, based on the exemplary acquisition scheme, in the alarm scheme provided in this embodiment, first, a layer of clustering is performed on the abnormal IO requests from the dimension of the abnormal type, and the abnormal IO requests diagnosed as the same abnormal type are clustered together; next, according to the concept of clustering based on request paths provided in step 402, re-clustering may be performed for the abnormal IO requests under different abnormal types, respectively. In this way, the abnormal IO request groups may be clustered under different abnormal types, that is, the abnormal types corresponding to the abnormal IO requests in the same abnormal IO request group will be consistent, and further, only one abnormal type will be involved in the single alarm information output in step 403. This may provide a reference for subsequent alarm handling links with respect to the type of anomaly.
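The two-layer clustering described above can be sketched as follows. This is only a sketch under stated assumptions: the input shape (request id mapped to a tuple of anomaly type and request path) and the function names are hypothetical, every path is assumed to be non-empty, and a union-find structure is used here as a convenient stand-in for the connected-component search, not as the method of the embodiment.

```python
from collections import defaultdict


def connected_groups(paths):
    """Group request ids whose (non-empty) paths are directly or indirectly connected."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge all physical nodes that lie on the same request path.
    for nodes in paths.values():
        for node in nodes[1:]:
            parent[find(nodes[0])] = find(node)
    groups = defaultdict(list)
    for rid, nodes in paths.items():
        groups[find(nodes[0])].append(rid)
    return sorted(sorted(group) for group in groups.values())


def cluster_by_type_then_path(requests):
    """requests: {request id: (anomaly type, request path)} (hypothetical shape)."""
    # First layer: separate the abnormal IO requests by diagnosed anomaly type.
    by_type = defaultdict(dict)
    for rid, (anomaly_type, nodes) in requests.items():
        by_type[anomaly_type][rid] = nodes
    # Second layer: cluster by request path within each anomaly type.
    return {t: connected_groups(p) for t, p in by_type.items()}
```

Because the path-based clustering runs separately per anomaly type, every resulting group involves exactly one anomaly type, even when requests of different types pass through the same physical node.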
Of course, in this alternative implementation, other exemplary schemes may also be employed to support the showing of anomaly types in the alert information. For example, the anomaly monitoring data corresponding to all the anomaly IO requests can be obtained from the anomaly monitoring system, and the anomaly monitoring data are processed in a unified manner according to steps 401-403. However, in step 403, the clustering may be performed under a single abnormal IO request group according to the abnormal type, and the abnormal IO requests involved under different abnormal types may be recorded in the alarm information. This may also provide a reference for subsequent alert processing links regarding the type of anomaly, and the present embodiment is not limited thereto, as no further examples are given herein for schemes that can support the type of anomaly shown in the alert information.
In summary, in this embodiment, the anomaly monitoring information corresponding to each of the abnormal IO requests may be obtained, and the request paths corresponding to each of the abnormal IO requests may be accurately determined based on the anomaly monitoring information, so as to provide an accurate basis for clustering of the abnormal IO requests. In addition, it is also proposed that the abnormal IO requests can be clustered in a layer from the dimension of the abnormal type, so that reasonable display of the abnormal type in the alarm information is supported, reference basis is provided for the subsequent alarm processing link, and the alarm processing efficiency can be further improved.
In the above or below embodiments, various implementation manners may be used to implement the construction of the alarm information. Because the logic of constructing the alarm information under each abnormal IO request group is consistent, for convenience of description, the alarm information construction scheme is described below by taking the target abnormal IO request group as an example. It should be appreciated that the target abnormal IO request group may be any clustered abnormal IO request group.
In an alternative implementation: the physical nodes which are passed by each abnormal IO request in the target abnormal IO request group can be de-duplicated, and the rest physical nodes are determined as target nodes; and outputting alarm information for the target abnormal IO request group based on the identification information of the target node, the identification information of the cluster to which the target node belongs and/or the abnormal type related to the target abnormal IO request group.
As mentioned before, different request paths may be routed through the same physical node, and for this reason it is proposed in this alternative implementation to de-duplicate the physical nodes. Through the de-duplication processing, the physical nodes involved in the target abnormal IO request group can be determined. In one exemplary de-duplication scheme: if an undirected graph is constructed based on each request path and connected components are searched from the undirected graph to cluster out the target abnormal IO request group, the physical nodes represented by all vertices contained in the connected component corresponding to the target abnormal IO request group are taken as the target nodes.
It may be appreciated that in this alternative implementation, the alert information may include at least a content field for recording identification information of the target node, content fields for recording an anomaly type related to the target anomalous IO request group, and/or content fields for recording identification information of a cluster to which the target node belongs. Of course, these content fields are merely examples, and the present embodiment is not limited thereto.
The foregoing identification information of the target node and the identification information of the cluster to which the target node belongs may be obtained from the anomaly monitoring information corresponding to each of the anomaly IO requests included in the target anomaly IO request group, for example, may be obtained from the span item mentioned above. For the exception type related to the target exception IO request, the foregoing exemplary scheme may be referred to, and before the exception IO requests are clustered based on the request path, the exception IO requests are clustered based on the exception type in a first layer, where in this case, each exception IO request clustered based on the request path under the target exception type may be marked as the target exception type, and then the alarm information may be carried as the exception type marked by the target exception IO request group. If clustering is performed on the abnormal IO requests based on the request path and then re-clustering is performed in the target abnormal IO request group based on the abnormal types, the clustered abnormal types can be marked for the target abnormal IO request group, the abnormal IO requests under different abnormal types can be recorded respectively, and the abnormal types marked for the target abnormal IO request group and the abnormal IO requests under different abnormal types are carried in the alarm information.
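A minimal sketch of this alarm information construction is given below for illustration. The function name `build_alarm`, the dictionary field names, and the `node_cluster` lookup table are all assumptions made for the example; the embodiment leaves the content fields configurable.

```python
def build_alarm(group_paths, node_cluster, anomaly_type):
    """Build one piece of alarm information for a target abnormal IO request group.

    group_paths: list of request paths (lists of node ids) in the group.
    node_cluster: dict mapping node id to cluster id (hypothetical lookup,
        e.g. filled from span items).
    anomaly_type: anomaly type marked for the group.
    """
    # De-duplicate the physical nodes passed by the requests in the group;
    # dict.fromkeys keeps the first-seen order.
    target_nodes = list(dict.fromkeys(n for path in group_paths for n in path))
    clusters = list(dict.fromkeys(node_cluster[n] for n in target_nodes))
    return {
        "anomaly_type": anomaly_type,
        "clusters": clusters,
        "nodes": target_nodes,
    }
```

One such record is produced per abnormal IO request group, so the number of alarm records follows the number of groups rather than the number of abnormal IO requests.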
It should be understood that other implementation manners may be used in the present embodiment to construct the alarm information, and the content fields contained in the alarm information are not limited to the foregoing exemplary content fields. Based on the alarm information construction scheme provided by this embodiment, the alarm information can indicate which clusters and/or which nodes produced which IO abnormality. It can be seen that, in this embodiment, the alarm information no longer reports the abnormality in units of IO requests, but instead reports it from dimensions such as the abnormality type, the cluster, and the node, which makes abnormality positioning more convenient in the subsequent alarm processing and can therefore further improve the alarm processing efficiency.
Fig. 5 is a flowchart of another alarm method according to an exemplary embodiment of the present application. Referring to fig. 5, the method may include:
Step 500, under the condition of an alarm triggering event, determining request paths corresponding to the abnormal IO requests respectively, wherein a single request path characterizes a communication relationship between physical nodes through which the corresponding abnormal IO requests pass;
Step 501, clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein the abnormal IO requests corresponding to the request paths passing through the same physical node are located in the same abnormal IO request group;
Step 502, outputting alarm information by taking the abnormal IO request group as a unit;
Step 503, responding to the alarm processing instruction, and analyzing a communication structure formed between the computing node and the storage node under the target abnormal IO request group corresponding to the target alarm information;
Step 504, based on the pointing relationship between the communication structure and the abnormal node, the abnormal node causing the abnormality is presumed under the target abnormal IO request group.
For steps 500-502, reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here. In this embodiment, an alternative implementation after the alarm information output is provided based on steps 503 and 504. This alternative implementation can be combined with the implementations provided for the other steps in the embodiments described above or below to create new solutions.
Referring to fig. 5, after outputting the alarm information, alarm processing logic for each piece of alarm information output may be started in response to the alarm processing instruction. Because the alarm processing logic implemented for different alarm information is consistent, for convenience of description, in this embodiment, the alarm processing logic is described by taking the target alarm information as an example. It should be understood that the target alarm information may be any piece of alarm information output in step 502 in this embodiment.
Referring to fig. 5, in the present embodiment, an IO request can in essence be understood as a data read-write request, so there is at least a physical node used for data storage, and such a node is described as a storage node in the present embodiment; data is generally read and written for the purpose of computing, so in this embodiment a physical node that initiates an IO request is described as a computing node. Thus, in this embodiment, the physical nodes in the request path may include at least a computing node and a storage node. As mentioned before, the request path may also include an intermediate node for intermediate forwarding of the IO request, etc. In an architecture in which storage and computing are separated, the computing nodes may be located in a computing system, while the storage nodes may be located in a storage system. Of course, in other scenarios the computing nodes and the storage nodes may be deployed differently, and their deployment locations are not limited herein.
Based on this, it is proposed in step 503 that the communication structure formed between the computing node and the storage node may be analyzed under the target abnormal IO request group corresponding to the target alarm information. The connection structure in this embodiment may be understood as a structure formed based on each request path connected by one physical node, where the physical node may be a computing node, a storage node, or an intermediate node for forwarding an IO request. The inventors found during the course of the study that under a single abnormal IO request group, multiple connected structures may be analyzed.
Fig. 6 is a schematic diagram of several exemplary communication structures provided by an exemplary embodiment of the present application. Referring to fig. 6, in this embodiment, there may be at least three types of communication structures:
The first type of communication structure is used for connecting one computing node with a plurality of storage nodes;
the second type of communication structure is that one storage node is communicated with a plurality of computing nodes;
the third type of communication structure is that a plurality of computing nodes and a plurality of storage nodes are communicated through an intermediate node. Wherein the intermediate node may be a network device or the like for network transit.
To this end, it is proposed in step 504 that the pointing relationship between the connectivity structure and the abnormal node may be preconfigured. The directional relationship may be used to guide the location of the abnormal node under different connectivity structures. Thus, in step 504, the exception node that caused the exception may be presumed under the target exception IO request group based on the directed relationship. If a plurality of communication structures are analyzed under the target abnormal IO request group, abnormal nodes can be respectively presumed according to the pointing relation under a plurality of communication structures.
In one exemplary speculation scheme:
If a first type of communication structure exists under the target abnormal IO request group, the computing node in the first type of communication structure is presumed to be an abnormal node; or
If a second type of communication structure exists under the target abnormal IO request group, the storage node in the second type of communication structure is presumed to be an abnormal node; or
If a third type of communication structure exists under the target abnormal IO request group, the intermediate node in the third type of communication structure is presumed to be an abnormal node.
Referring to fig. 6, for the first type of communication structure, a plurality of storage nodes are connected to the same computing node, so that an abnormality cause can be initially located on the computing node, and the abnormality usually occurs in the computing node, so that the IO requests corresponding to the plurality of storage nodes connected to the computing node are all abnormal. Similarly, for a second type of connectivity structure, the cause of the anomaly may be initially located at a storage node in such connectivity structure. For the third type of communication structure, since the IO requests between the plurality of computing nodes and the plurality of storage nodes are abnormal, the reasons of the abnormality can be primarily located on the intermediate nodes for communicating the plurality of computing nodes and the plurality of storage nodes, and the abnormality is usually caused by the abnormality of the intermediate nodes, so that the abnormality is caused to the plurality of IO requests passing through the intermediate nodes.
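The exemplary speculation scheme can be sketched as follows. This is a simplified sketch and not the alarm processing logic of the embodiment: it assumes each request path is ordered with the computing node first and the storage node last, with any intermediate nodes in between, and the function name `infer_abnormal_nodes` is hypothetical.

```python
def infer_abnormal_nodes(paths):
    """Presume abnormal nodes from the communication structures in one group.

    paths: list of request paths, each assumed (for this sketch) to be ordered
    as computing node, optional intermediate nodes, storage node.
    Returns a list of (node id, role) pairs for the presumed abnormal nodes.
    """
    suspects = []
    all_nodes = list(dict.fromkeys(n for p in paths for n in p))
    for node in all_nodes:
        through = [p for p in paths if node in p]
        computes = {p[0] for p in through}
        storages = {p[-1] for p in through}
        if computes == {node} and len(storages) > 1:
            # First type: one computing node connected to several storage nodes.
            suspects.append((node, "compute"))
        elif storages == {node} and len(computes) > 1:
            # Second type: one storage node connected to several computing nodes.
            suspects.append((node, "storage"))
        elif (node not in computes | storages
              and len(computes) > 1 and len(storages) > 1):
            # Third type: several computing and storage nodes connected
            # through one intermediate node.
            suspects.append((node, "intermediate"))
    return suspects
```

The result is only a preliminary presumption that can serve as an operation and maintenance reference; several structures may be found within one group, in which case the sketch returns one suspect per structure.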
It should be noted that the alarm processing logic provided in this embodiment is only exemplary. Based on this exemplary alarm processing logic, the abnormal node can be presumed, and the presumed abnormal node can be used as an operation and maintenance reference. In practical applications, more speculation dimensions can be introduced to further correct the speculation result provided in this embodiment, and processing links such as manual analysis can of course be added to ensure that the cause of the abnormality is accurately located. This embodiment does not limit such other speculation dimensions or manual analysis logic.
In summary, in this embodiment, after the alarm information is output, an alarm processing logic is further provided, where the alarm processing logic can make full use of the data such as the request path and the abnormal IO request group generated in the alarm process in this embodiment, as an analysis basis in the alarm processing logic. Based on the method, the communication structure can be analyzed under each alarm information, so that the abnormal node can be estimated based on the communication structure, and a reference basis is provided for operation and maintenance, so that alarm processing can be completed faster and more accurately, and the alarm processing efficiency can be further improved.
It should be noted that some of the flows described in the above embodiments and the drawings include a plurality of operations appearing in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or performed in parallel. The sequence numbers of the operations, such as 101 and 102, are merely used to distinguish the operations from one another and do not themselves represent any execution order. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should also be noted that descriptions such as "first" and "second" herein are used to distinguish different communication structures; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
Fig. 7 is a schematic structural diagram of an electronic device according to another exemplary embodiment of the present application. As shown in fig. 7, the electronic device may include: a memory 70 and a processor 71.
A processor 71 coupled to the memory 70 for executing a computer program in the memory 70 for:
Under the condition of an alarm triggering event, determining request paths corresponding to the abnormal IO requests respectively, wherein a single request path characterizes a communication relationship between physical nodes through which the corresponding abnormal IO requests pass;
clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein the abnormal IO requests corresponding to the request paths passing through the same physical node are located in the same abnormal IO request group;
And outputting alarm information by taking the abnormal IO request group as a unit.
In an alternative embodiment, when determining the request paths corresponding to the abnormal IO requests, the processor 71 may be specifically configured to:
Acquiring the abnormality monitoring information corresponding to each abnormal IO request;
And analyzing the identification information and the passing sequence of the physical nodes passed by the corresponding abnormal IO requests from the abnormal monitoring information to determine the request paths corresponding to the abnormal IO requests.
In an alternative embodiment, when obtaining the anomaly monitoring information corresponding to each of the anomaly IO requests, the processor 71 may be specifically configured to:
An abnormal monitoring information acquisition request is sent to an abnormal monitoring system for carrying out IO abnormal monitoring, wherein the acquisition request carries a target abnormal type, and the abnormal monitoring system has diagnosed the abnormal type corresponding to each abnormal IO request;
and receiving the abnormality monitoring information corresponding to the abnormality IO request which is returned by the abnormality monitoring system and is diagnosed as the target abnormality type.
In an optional embodiment, the anomaly monitoring information is trace information, the trace information corresponds to the IO request one by one, the trace information includes a plurality of span items with sequence relations, the span items correspond to physical nodes through which the IO request passes one by one, the span items include identification information of the corresponding physical nodes, and the sequence relations among the span items included in the trace information are used for representing the passing sequence among the physical nodes through which the IO request passes.
In an alternative embodiment, when clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, the processor 71 may be specifically configured to:
clustering abnormal IO requests corresponding to each request path capable of being communicated through physical nodes into an abnormal IO request group, or
Inquiring a physical node capable of communicating a plurality of request paths as a clustering node; and clustering the abnormal IO requests corresponding to each request path which can be communicated by the single clustering node into an abnormal IO request group.
In an alternative embodiment, when clustering the abnormal IO requests corresponding to the request paths that can be connected through physical nodes into one abnormal IO request group, the processor 71 may be specifically configured to:
construct an undirected graph by taking the physical nodes on each request path as vertices and the communication relationships between the physical nodes on each request path as edges;
search for connected components in the undirected graph; and
cluster the abnormal IO requests corresponding to the request paths contained in a single connected component into one abnormal IO request group.
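The three steps above can be sketched compactly in Python. This example swaps in a union-find (disjoint-set) structure as one possible way to obtain connected components; the application does not prescribe that technique. The function name and input layout are assumptions, and every path is assumed to be non-empty.

```python
def cluster_requests(paths):
    """paths maps request id -> non-empty list of node ids; returns sorted request groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Consecutive nodes on a request path are joined by an undirected edge.
    for nodes in paths.values():
        find(nodes[0])                      # registers single-node paths as well
        for a, b in zip(nodes, nodes[1:]):
            union(a, b)

    # Requests whose paths fall in the same component share a root.
    groups = {}
    for request_id, nodes in paths.items():
        groups.setdefault(find(nodes[0]), []).append(request_id)
    return sorted(sorted(group) for group in groups.values())


paths = {
    "io-1": ["c1", "sw1", "s1"],
    "io-2": ["c2", "sw1", "s2"],
    "io-3": ["c3", "sw2", "s3"],
}
request_groups = cluster_requests(paths)
```

Here `io-1` and `io-2` end up in one group because both paths pass through `sw1`, while `io-3` forms a group of its own.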
In an alternative embodiment, when searching for connected components in the undirected graph, the processor 71 may be specifically configured to:
upon traversing to a target vertex in the undirected graph, search for the connected component containing the target vertex;
delete that connected component from the undirected graph;
continue to select the next target vertex from the vertices remaining in the undirected graph, and search for and delete its connected component, until no vertices remain in the undirected graph; and
output the connected components found.
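The search-and-delete loop described above could look like the following Python sketch. Breadth-first search is used to find each component, which is an assumption since the text does not name a traversal method. The adjacency map is assumed to be symmetric, with every neighbour also present as a key.

```python
from collections import deque


def connected_components(adjacency):
    """adjacency maps vertex -> set of neighbours (undirected); returns a list of vertex sets."""
    remaining = {v: set(n) for v, n in adjacency.items()}   # working copy of the graph
    components = []
    while remaining:                                        # until no vertices remain
        target = next(iter(remaining))                      # next target vertex
        component = {target}
        queue = deque([target])
        while queue:                                        # breadth-first search
            vertex = queue.popleft()
            for neighbour in remaining[vertex]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        for vertex in component:                            # delete the component from the graph
            del remaining[vertex]
        components.append(component)
    return components


graph = {
    "c1": {"sw1"},
    "sw1": {"c1", "s1"},
    "s1": {"sw1"},
    "c2": {"s2"},
    "s2": {"c2"},
}
components = connected_components(graph)
normalized = sorted(sorted(c) for c in components)
```

Deleting each component as soon as it is found means every vertex is visited once, so no separate "visited" bookkeeping across components is needed.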
In an alternative embodiment, when outputting the alarm information in units of abnormal IO request groups, the processor 71 may be specifically configured to:
de-duplicate the physical nodes traversed by the abnormal IO requests in a target abnormal IO request group, and determine the physical nodes remaining after de-duplication as target nodes; and
output alarm information for the target abnormal IO request group based on the identification information of the target nodes, the identification information of the clusters to which the target nodes belong, and/or the anomaly type associated with the target abnormal IO request group;
wherein the target abnormal IO request group is any one of the abnormal IO request groups.
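A minimal Python sketch of assembling one piece of alarm information for a group is given below. The dictionary keys of the alarm, the `cluster_of` lookup table, and the cluster names are hypothetical; the application only lists the kinds of information the alarm may be based on.

```python
def build_alarm(group_paths, cluster_of, anomaly_type):
    """Build one alarm for a request group.

    group_paths maps request id -> list of node ids for the requests in the group;
    cluster_of maps node id -> cluster id (nodes without an entry are skipped).
    """
    target_nodes = []
    seen = set()
    for nodes in group_paths.values():
        for node in nodes:
            if node not in seen:            # de-duplicate while keeping first-seen order
                seen.add(node)
                target_nodes.append(node)
    clusters = sorted({cluster_of[n] for n in target_nodes if n in cluster_of})
    return {
        "anomaly_type": anomaly_type,
        "target_nodes": target_nodes,
        "clusters": clusters,
    }


group_paths = {
    "io-1": ["c1", "sw1", "s1"],
    "io-2": ["c2", "sw1", "s2"],
}
cluster_of = {
    "c1": "compute-cluster-a",
    "c2": "compute-cluster-a",
    "s1": "storage-cluster-b",
    "s2": "storage-cluster-b",
}
alarm = build_alarm(group_paths, cluster_of, "io_unavailable")
```

The shared node `sw1` appears only once in the alarm even though both requests pass through it, so a single alarm covers the whole group.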
In an alternative embodiment, when de-duplicating the physical nodes traversed by the abnormal IO requests in the target abnormal IO request group and determining the remaining physical nodes as target nodes, the processor 71 may be specifically configured to:
if the target abnormal IO request group was obtained by constructing an undirected graph based on the request paths and searching for connected components in the undirected graph, take the physical nodes represented by all vertices contained in the connected component corresponding to the target abnormal IO request group as the target nodes.
In an alternative embodiment, the physical nodes include at least computing nodes and storage nodes, and after outputting the alarm information, the processor 71 is further configured to:
in response to an alarm processing instruction, analyze the communication structure formed between the computing nodes and the storage nodes under the target abnormal IO request group corresponding to the target alarm information; and
infer the abnormal node that caused the abnormality under the target abnormal IO request group, based on the pointing relationship between communication structures and abnormal nodes (that is, which node a given communication structure indicates as abnormal).
In an alternative embodiment, when inferring the abnormal node that caused the abnormality under the target abnormal IO request group based on the pointing relationship between communication structures and abnormal nodes, the processor 71 may be specifically configured to:
if a first type of communication structure exists under the target abnormal IO request group, infer that the computing node in the first type of communication structure is the abnormal node, wherein the first type of communication structure is one computing node communicating with a plurality of storage nodes; or
if a second type of communication structure exists under the target abnormal IO request group, infer that the storage node in the second type of communication structure is the abnormal node, wherein the second type of communication structure is one storage node communicating with a plurality of computing nodes; or
if a third type of communication structure exists under the target abnormal IO request group, infer that the intermediate node in the third type of communication structure is the abnormal node, wherein the third type of communication structure is a plurality of computing nodes communicating with a plurality of storage nodes through an intermediate node.
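The three inference rules can be illustrated with the Python sketch below. It is only one reading of the rules: the role labels (`"compute"`, `"storage"`, `"intermediate"`), the function name, and the decision to report every node that matches any rule are assumptions, since the text does not say how overlapping matches should be resolved.

```python
from collections import defaultdict


def infer_abnormal_nodes(paths, role_of):
    """Map each suspected node to the structure type ("first", "second", "third") pointing to it."""
    storages_by_compute = defaultdict(set)
    computes_by_storage = defaultdict(set)
    ends_by_intermediate = defaultdict(lambda: (set(), set()))

    for nodes in paths:
        computes = [n for n in nodes if role_of[n] == "compute"]
        storages = [n for n in nodes if role_of[n] == "storage"]
        for c in computes:
            storages_by_compute[c].update(storages)
        for s in storages:
            computes_by_storage[s].update(computes)
        for n in nodes:
            if role_of[n] == "intermediate":
                ends_by_intermediate[n][0].update(computes)
                ends_by_intermediate[n][1].update(storages)

    suspects = {}
    for c, ss in storages_by_compute.items():
        if len(ss) > 1:                      # one computing node, several storage nodes
            suspects[c] = "first"
    for s, cs in computes_by_storage.items():
        if len(cs) > 1:                      # one storage node, several computing nodes
            suspects[s] = "second"
    for m, (cs, ss) in ends_by_intermediate.items():
        if len(cs) > 1 and len(ss) > 1:      # many-to-many through an intermediate node
            suspects[m] = "third"
    return suspects


role_of = {
    "c1": "compute", "c2": "compute",
    "s1": "storage", "s2": "storage",
    "sw1": "intermediate",
}
first = infer_abnormal_nodes([["c1", "s1"], ["c1", "s2"]], role_of)
second = infer_abnormal_nodes([["c1", "s1"], ["c2", "s1"]], role_of)
third = infer_abnormal_nodes([["c1", "sw1", "s1"], ["c2", "sw1", "s2"]], role_of)
```

The intuition is that the node common to all failing paths in a structure is the most likely cause, for example a single computing node whose requests fail against several otherwise unrelated storage nodes.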
In an alternative embodiment, the target anomaly type includes an IO unavailable class or an IO damaged class; the physical nodes include computing nodes in a computing system, storage nodes in a storage system, and/or intermediate nodes for network connection; and an abnormal IO request is an IO request that was initiated by a computing node in the computing system to a storage node in the storage system and in which an abnormality occurred.
Further, as shown in fig. 7, the electronic device further includes: communication component 72, and power component 73. Only some of the components are schematically shown in fig. 7, which does not mean that the electronic device only comprises the components shown in fig. 7.
It should be noted that, for the technical details of the embodiments of the electronic device, reference may be made to the related descriptions of the embodiments of the method described above, which are not repeated herein for the sake of brevity, but should not cause a loss of protection scope of the present application.
Accordingly, the present application also provides a computer-readable storage medium storing a computer program which, when executed, implements the steps of the above method embodiments.
Accordingly, the present application also provides a computer program product comprising a computer program which, when executed, implements the steps of the above method embodiments.
The memory of FIG. 7 described above is used to store a computer program and may be configured to store various other data to support operations on a computing platform. Examples of such data include instructions for any application or method operating on a computing platform, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component of fig. 7 is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a 2G, 3G, 4G/LTE, or 5G mobile communication network, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power component of fig. 7 provides power to the various components of the device in which it is located. The power component may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions; and corresponding operation entries are provided for the user to choose to grant or refuse authorization.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. An alarm method, comprising:
when an alarm triggering event occurs, acquiring anomaly monitoring information corresponding to each abnormal IO request; and parsing, from the anomaly monitoring information, the identification information and the passing order of the physical nodes traversed by the corresponding abnormal IO request, so as to determine the request path corresponding to each abnormal IO request, wherein a single request path characterizes the communication relationship between the physical nodes traversed by the corresponding abnormal IO request;
clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein the abnormal IO requests corresponding to the request paths passing through the same physical node are located in the same abnormal IO request group;
outputting alarm information by taking the abnormal IO request group as a unit;
wherein the anomaly monitoring information is trace information, and the order relationship among the span items contained in the trace information represents the order in which the IO request passes through the physical nodes.
2. The method of claim 1, wherein acquiring the anomaly monitoring information corresponding to each abnormal IO request comprises:
sending an acquisition request for anomaly monitoring information to an anomaly monitoring system that performs IO anomaly monitoring, wherein the acquisition request carries a target anomaly type, and the anomaly monitoring system has already diagnosed the anomaly type of each abnormal IO request; and
receiving the anomaly monitoring information, returned by the anomaly monitoring system, for the abnormal IO requests diagnosed as being of the target anomaly type.
3. The method of claim 1, wherein clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group comprises:
clustering the abnormal IO requests corresponding to the request paths that can be connected to one another through physical nodes into one abnormal IO request group; or
identifying a physical node that connects a plurality of request paths as a clustering node, and clustering the abnormal IO requests corresponding to the request paths connected by a single clustering node into one abnormal IO request group.
4. The method of claim 3, wherein clustering the abnormal IO requests corresponding to the request paths that can be connected through physical nodes into one abnormal IO request group comprises:
constructing an undirected graph by taking the physical nodes on each request path as vertices and the communication relationships between the physical nodes on each request path as edges;
searching for connected components in the undirected graph; and
clustering the abnormal IO requests corresponding to the request paths contained in a single connected component into one abnormal IO request group.
5. The method of claim 4, wherein searching for connected components in the undirected graph comprises:
upon traversing to a target vertex in the undirected graph, searching for the connected component containing the target vertex;
deleting that connected component from the undirected graph;
continuing to select the next target vertex from the vertices remaining in the undirected graph, and searching for and deleting its connected component, until no vertices remain in the undirected graph; and
outputting the connected components found.
6. The method of claim 1, wherein outputting the alarm information in units of abnormal IO request groups comprises:
de-duplicating the physical nodes traversed by the abnormal IO requests in a target abnormal IO request group, and determining the physical nodes remaining after de-duplication as target nodes; and
outputting alarm information for the target abnormal IO request group based on the identification information of the target nodes, the identification information of the clusters to which the target nodes belong, and/or the anomaly type associated with the target abnormal IO request group;
wherein the target abnormal IO request group is any one of the abnormal IO request groups.
7. The method of claim 6, wherein de-duplicating the physical nodes traversed by the abnormal IO requests in the target abnormal IO request group and determining the remaining physical nodes as target nodes comprises:
if the target abnormal IO request group was obtained by constructing an undirected graph based on the request paths and searching for connected components in the undirected graph, taking the physical nodes represented by all vertices contained in the connected component corresponding to the target abnormal IO request group as the target nodes.
8. The method of claim 1, wherein the physical nodes comprise at least computing nodes and storage nodes, and wherein after outputting the alarm information, the method further comprises:
in response to an alarm processing instruction, analyzing the communication structure formed between the computing nodes and the storage nodes under the target abnormal IO request group corresponding to the target alarm information; and
inferring the abnormal node that caused the abnormality under the target abnormal IO request group, based on the pointing relationship between communication structures and abnormal nodes.
9. The method of claim 8, wherein inferring the abnormal node that caused the abnormality under the target abnormal IO request group based on the pointing relationship between communication structures and abnormal nodes comprises:
if a first type of communication structure exists under the target abnormal IO request group, inferring that the computing node in the first type of communication structure is the abnormal node, wherein the first type of communication structure is one computing node communicating with a plurality of storage nodes; or
if a second type of communication structure exists under the target abnormal IO request group, inferring that the storage node in the second type of communication structure is the abnormal node, wherein the second type of communication structure is one storage node communicating with a plurality of computing nodes; or
if a third type of communication structure exists under the target abnormal IO request group, inferring that the intermediate node in the third type of communication structure is the abnormal node, wherein the third type of communication structure is a plurality of computing nodes communicating with a plurality of storage nodes through an intermediate node.
10. The method of claim 2, wherein the target anomaly type comprises an IO unavailable class and/or an IO damaged class; the physical nodes comprise computing nodes in a computing system, storage nodes in a storage system, and/or intermediate nodes for network connection; and an abnormal IO request is an IO request that was initiated by a computing node in the computing system to a storage node in the storage system and in which an abnormality occurred.
11. An electronic device comprising a memory and a processor;
the memory is used for storing one or more computer instructions;
the processor is coupled to the memory and is configured to execute the one or more computer instructions to perform the alarm method of any one of claims 1-10.
12. A computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the alarm method of any one of claims 1-10.
13. A computer program product comprising computer programs/instructions which, when executed by a processor, cause the processor to implement the alarm method of any one of claims 1-10.
CN202410112376.5A 2024-01-25 2024-01-25 Alarm method, equipment and storage medium Active CN117632666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410112376.5A CN117632666B (en) 2024-01-25 2024-01-25 Alarm method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117632666A CN117632666A (en) 2024-03-01
CN117632666B true CN117632666B (en) 2024-05-07

Family

ID=90036059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410112376.5A Active CN117632666B (en) 2024-01-25 2024-01-25 Alarm method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117632666B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant