CN112540832B

CN112540832B - Cloud native system fault analysis method based on knowledge graph

Info

Publication number: CN112540832B
Application number: CN202011554734.6A
Authority: CN
Inventors: 陈鹏飞; 陈彩琳; 郑子彬
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2022-01-28
Anticipated expiration: 2040-12-24
Also published as: CN112540832A

Abstract

The application discloses a cloud native system fault analysis method based on a knowledge graph, which comprises the following steps: acquiring original data in a cloud native system, and constructing a knowledge graph based on the original data to obtain graph data; carrying out anomaly detection on the graph data through an anomaly detection model to obtain an abnormal node; and calculating the similarity between the abnormal node and the replica node corresponding to the abnormal node, and locating the fault root cause based on the similarity, wherein the replica node corresponding to the abnormal node is a node of the same type as the abnormal node. The method solves the technical problems that the prior art ignores the interaction relationships between entities, can only locate the faulty entity, and has difficulty locating the fault root cause of a cloud native system quickly and accurately.

Description

Cloud native system fault analysis method based on knowledge graph

Technical Field

The application relates to the technical field of fault analysis, in particular to a cloud native system fault analysis method based on a knowledge graph.

Background

With the development of technologies such as containerization and virtualization, more and more software systems are migrated into a cloud environment, and a cloud native system has become a mainstream solution for application development and deployment. The cloud native system is oriented to the micro-services, and application programs are decoupled into a plurality of services in the modes of container packaging and micro-service deployment. However, in the architecture of such microservices, one application may generate hundreds or even thousands of microservices, and these microservices often have intricate and complex interaction relationships. The huge micro-service architecture and the massive alarm and index data of the cloud native system bring great challenge and pressure to operation and maintenance work. Once a problem occurs, a significant business impact is brought to the enterprise, and a huge business loss is caused.

The cloud native system comprises massive entities such as micro-services, containers, processes and the like, an intricate and complex interaction relationship exists among the entities, and when the cloud native system fails, the failure can be transmitted along an interaction network among the entities, so that the failure positioning difficulty of the cloud native system is higher. However, the existing fault location method ignores the interaction relationship between entities, can only locate the entity with a fault, and is difficult to quickly and accurately locate the fault root cause of the cloud native system.

Disclosure of Invention

The application provides a cloud native system fault analysis method based on a knowledge graph, which is used for solving the technical problems that the prior art ignores the interaction relationships between entities, can only locate the faulty entity, and has difficulty locating the fault root cause of a cloud native system quickly and accurately.

In view of the above, a first aspect of the present application provides a method for analyzing a failure of a cloud native system based on a knowledge graph, including:

acquiring original data in a cloud native system, and constructing a knowledge graph based on the original data to obtain graph data;

carrying out anomaly detection on the graph data through an anomaly detection model to obtain an abnormal node;

and calculating the similarity of the abnormal node and a replica node corresponding to the abnormal node, and performing fault root cause positioning based on the similarity, wherein the replica node corresponding to the abnormal node is the same type of node as the abnormal node.

Optionally, the original data includes entity information and network connection data;

the acquiring of the raw data in the cloud native system includes:

acquiring the entity information in a cloud native system, wherein the entity at least comprises a container;

entering a name space of the container through an nsenter tool, and mounting a directory where a netstat file of the host computer is located on a file system of the container;

and executing the nsenter command in the container to acquire the network connection data.

Optionally, the constructing a knowledge graph based on the original data to obtain graph data includes:

performing entity extraction, entity relationship extraction and entity attribute extraction on the original data in sequence, wherein the entity attributes comprise static attributes and dynamic attributes;

and constructing a knowledge graph based on the extracted entities, the entity relations and the entity attributes to obtain graph data.

Optionally, the calculating the similarity between the abnormal node and the replica node corresponding to the abnormal node, and performing fault root cause positioning based on the similarity includes:

calculating the dynamic attribute similarity and the static attribute similarity of the abnormal node and the replica node corresponding to the abnormal node to respectively obtain a dynamic attribute similarity score and a static attribute similarity score;

determining an abnormal attribute based on the dynamic attribute similarity score and the static attribute similarity score.

Optionally, calculating the dynamic attribute similarity between the abnormal node and the replica node corresponding to the abnormal node to obtain a dynamic attribute similarity score, including:

acquiring the dynamic attributes of the abnormal node and of the replica node corresponding to the abnormal node in a preset time period, and generating an abnormal node dynamic attribute vector and a replica node dynamic attribute vector;

and calculating the cosine similarity of the abnormal node dynamic attribute vector and the replica node dynamic attribute vector to obtain a dynamic attribute similarity score.
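As a minimal sketch of the dynamic attribute comparison described above, the following Python computes the cosine similarity of two metric vectors sampled over the same time window. The sample values and the convention of returning 0.0 for an all-zero vector are assumptions made for illustration and do not come from the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical CPU utilisation samples for the same preset time period.
abnormal_cpu = [0.21, 0.24, 0.90, 0.95, 0.97]
replica_cpu = [0.20, 0.22, 0.23, 0.21, 0.24]

score = cosine_similarity(abnormal_cpu, replica_cpu)
```

Cosine similarity compares the shape of the two series rather than their magnitude, so two series that differ only by a constant scale factor score 1.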

Optionally, calculating the static attribute similarity of the abnormal node and the replica node corresponding to the abnormal node to obtain a static attribute similarity score, including:

performing word segmentation, stop word filtering and one-hot encoding in sequence on the static attribute of the abnormal node and the static attribute of the replica node corresponding to the abnormal node, to obtain an abnormal node text vector and a replica node text vector respectively;

and calculating the cosine similarity of the abnormal node text vector and the replica node text vector to obtain a static attribute similarity score.
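The static attribute comparison can be sketched in the same way. The example below is only an illustration under stated assumptions: the tokenizer (splitting on non-alphanumeric characters), the stop-word list and the sample configuration strings are hypothetical, and the one-hot step is realised as a binary presence vector over the joint vocabulary of the two texts.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # hypothetical stop-word list


def tokenize(text):
    """Word segmentation: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def static_similarity(text_a, text_b):
    """Cosine similarity of binary (one-hot style) word-presence vectors."""
    words_a = {t for t in tokenize(text_a) if t not in STOP_WORDS}
    words_b = {t for t in tokenize(text_b) if t not in STOP_WORDS}
    vocab = sorted(words_a | words_b)
    vec_a = [1 if w in words_a else 0 for w in vocab]
    vec_b = [1 if w in words_b else 0 for w in vocab]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(vec_a)) * math.sqrt(sum(vec_b))
    return dot / norm if norm else 0.0


# Hypothetical environment-variable attributes of two replicas.
abnormal_env = "DB_HOST=mysql DB_PORT=3307"
replica_env = "DB_HOST=mysql DB_PORT=3306"

score = static_similarity(abnormal_env, replica_env)
```

In this example the two texts share four of six vocabulary words, so the score drops below 1 and points at the port value as the differing static attribute.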

Optionally, the performing anomaly detection on the graph data through an anomaly detection model to obtain an abnormal node includes:

extracting the directed topology relation of each node in the graph data to obtain an adjacency matrix, and extracting the attribute of each node in the graph data to obtain an attribute matrix;

and carrying out anomaly detection on the adjacency matrix and the attribute matrix through an anomaly detection model to obtain an abnormal node.

Optionally, the anomaly detection model comprises an encoder and a decoder;

the performing anomaly detection on the adjacency matrix and the attribute matrix through an anomaly detection model to obtain an abnormal node includes:

encoding the adjacency matrix and the attribute matrix through the encoder in the anomaly detection model to obtain an encoded feature vector, and inputting the encoded feature vector and the adjacency matrix into the decoder;

performing decoding through the decoder in the anomaly detection model by combining the adjacency matrix and the encoded feature vector to obtain a reconstructed adjacency matrix and a reconstructed attribute matrix;

and calculating a reconstruction error based on the adjacency matrix, the attribute matrix, the reconstructed adjacency matrix and the reconstructed attribute matrix through the anomaly detection model, and determining an abnormal node based on the reconstruction error.

Optionally, the encoder includes a graph convolution network and a recurrent neural network connected in sequence;

the encoding the adjacency matrix and the attribute matrix by the encoder in the anomaly detection model to obtain an encoded feature vector includes:

mapping the adjacency matrix and the attribute matrix to a low-dimensional space through the graph convolution network to obtain a low-dimensional feature vector;

and performing feature extraction on the low-dimensional feature vector through the recurrent neural network to obtain a coding feature vector.

Optionally, the decoder includes a topology structure reconstruction module and an attribute information reconstruction module, where the attribute information reconstruction module includes a recurrent neural network and a graph convolution network that are connected in sequence;

the performing decoding through the decoder in the anomaly detection model by combining the adjacency matrix and the encoded feature vector to obtain a reconstructed adjacency matrix and a reconstructed attribute matrix includes:

decoding the encoded feature vector through the topology structure reconstruction module to obtain a reconstructed adjacency matrix;

decoding the coding feature vector through the recurrent neural network in the attribute information reconstruction module to obtain an output result;

and decoding the output result based on the adjacency matrix through the graph convolution network in the attribute information reconstruction module to obtain a reconstructed attribute matrix.

According to the technical scheme, the method has the following advantages:

The application provides a cloud native system fault analysis method based on a knowledge graph, which comprises the following steps: acquiring original data in a cloud native system, and constructing a knowledge graph based on the original data to obtain graph data; carrying out anomaly detection on the graph data through an anomaly detection model to obtain an abnormal node; and calculating the similarity between the abnormal node and the replica node corresponding to the abnormal node, and locating the fault root cause based on the similarity, wherein the replica node corresponding to the abnormal node is a node of the same type as the abnormal node.

In this method, the original data in the cloud native system are acquired, and a knowledge graph is constructed according to the attributes and dependency relationships of the entities in the original data to obtain graph data. Anomaly detection is then performed, through the anomaly detection model, on the graph data carrying the attribute information and dependency relationships to obtain abnormal nodes. Because the relationships among the entities in the cloud native system are taken into account, faults can be located quickly and accurately. In addition, the method further locates the fault root cause by calculating the similarity between the abnormal node and nodes of the same type, so the fault root cause can be located at a finer granularity and more accurately. This solves the technical problems that the prior art ignores the interaction relationships between entities, can only locate the faulty entity, and has difficulty locating the fault root cause of a cloud native system quickly and accurately.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flowchart of a method for analyzing a failure of a cloud native system based on a knowledge graph according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an entity dependency model according to an embodiment of the present application;

FIG. 3 is a schematic diagram of graph data provided by an embodiment of the present application;

FIG. 4 is a diagram illustrating a graph data construction process according to an embodiment of the present disclosure;

FIG. 5 is a schematic overall architecture diagram of a cloud native system fault analysis method according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For ease of understanding, referring to FIG. 1, a method for analyzing a failure of a cloud native system based on a knowledge graph according to an embodiment of the present application includes:

Step 101, obtaining original data in a cloud native system, and constructing a knowledge graph based on the original data to obtain graph data.

Cloud native applications are managed using an elastic infrastructure. At present, the mainstream elastic infrastructure is Kubernetes, which has become the de facto cloud native platform thanks to its complete container orchestration and cluster management capabilities. The knowledge graph of the Kubernetes cloud native system is constructed by acquiring the entities in the cloud native system and the dependency relationships among them, and by screening out the main attributes from the many attributes of each entity.

Unstructured data in the cloud native system, such as configuration files of specific resource objects (entities), TCP connection data between micro services, and the like, are acquired to obtain original data.

Further, when data of a cloud native system is collected in the prior art, agents need to be installed or the software needs to be instrumented in order to obtain the data, particularly the dependency relationships and call relationships of the cloud native system. Such approaches are highly intrusive to the cloud native system, offer low security and reliability, and may even affect the operation of the cloud native system. To solve this problem, the embodiment of the present application adopts a non-intrusive data acquisition method, which needs to obtain the information of each resource entity and the network connection data between micro-services or Pods.

Specifically, the original data includes entity information and network connection data, and the specific process of acquiring the original data in the cloud native system includes two parts: entity information acquisition and network connection data acquisition.

(1) Entity information acquisition

In the embodiment of the application, a tool for capturing Kubernetes entity information is written in Python for the entity information acquisition stage. Capturing the data mainly relies on three third-party libraries: the Kubernetes client-python, Docker-py and Paramiko. The ways of acquiring the various entities (resource objects) fall into three types; the specific acquisition mode of each entity is given in Table 1.

Table 1 data acquisition mode of each entity

Resource objects above the Container level are mainly acquired by means of the Kubernetes client-python module. To access the APIServer of the Kubernetes cluster with this module, a serviceAccount with Admin authority needs to be created on a MasterNode of the cluster, and the address and Token of the APIServer are acquired as the credentials for remotely accessing the APIServer. The Kubernetes client-python module provides a convenient API and Model; for example, all Pod entity information of a specified namespace can be obtained by using the following script:

kubernetes.client.CoreV1Api().list_namespaced_pod(namespace)

The result of the script is the set of all Pod entities under the specified namespace in the system. Each Pod entity can be taken out in turn with an iterator, and the required fields can be read according to the Model design rules of the module. Therefore, a data acquisition tool written with this script is suitable for clusters of different sizes. Other resource objects are obtained in a similar manner, and are not described here again.

The minimum scheduling unit of the Kubernetes cluster is the Pod, so detailed information about Containers and Images cannot be obtained through the Kubernetes API. Since most containers are currently Docker containers, the embodiment of the present application uses the Docker Remote API to obtain Container and Image information. The Docker Remote API is an interface for obtaining data from dockerd, and the Docker-py module is a wrapper around the Docker Remote API. Docker-py also provides a convenient API and Model that can obtain all the Containers or Images on a given host at once, or the Container or Image with a given ID.

Information about Process and File entities needs to be acquired with Linux commands. The Paramiko module implements the SSH2 protocol and supports encrypted connections to a remote server. With the Paramiko module, a command can be executed on the remote server and its result returned to the local client. For example, all files opened by the process with a specified PID are fetched using the following script:

command = 'lsof -p ' + str(con['pid'])

stdin, stdout, stderr = ssh_client.exec_command(command)

(2) Network connection data acquisition

Existing methods for capturing network connection data between Pod entities need to enter a Container and mount a file directory of the host before the Container is started. However, the Containers in a cloud native system are already running when the network connection data is captured, and for security reasons a running Container does not allow a file directory to be mounted dynamically, so the traditional mounting method cannot mount a host file directory into a running Container.

To solve the above problem, the embodiment of the present application dynamically mounts the host file directory into the running Container by means of the nsenter (namespace enter) tool. nsenter is a small utility integrated into Linux that can enter the namespaces of a container. The namespaces of the container (including the process, file and network namespaces) are entered through the nsenter tool, the directory where the netstat file of the host is located is mounted on the file system of the container, and the nsenter command is executed in the container to acquire the network connection data. In the embodiment of the application, the nsenter tool is only used to enter the namespaces of the container and read the relevant file-space information in order to perform operations such as file mounting; the source code of the container is not modified and the influence on the container is almost negligible, so the approach is non-intrusive and safe.

Specifically, the mounting the directory where the netstat file of the host is located in the Container and acquiring the information may include the following steps:

S1, entering the namespace of the container through the nsenter tool, and mounting the directory where the netstat file of the host is located on the file system of the container;

S2, executing the nsenter command in the Container to obtain TCP connection data;

S3, processing the TCP connection data, and removing repeated, finished or failed connections;

S4, extracting the local address IP and the foreign address IP of each connection;

S5, finding the Pod corresponding to each IP address, and establishing TCP connections between the Pods.
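Steps S3 to S5 can be sketched as a small parser over netstat output. The following is an illustration under stated assumptions rather than the implementation of the application: it assumes the usual netstat column layout (Proto, Recv-Q, Send-Q, local address, foreign address, state), treats every connection that is not ESTABLISHED as finished or failed, drops connections of a Pod to itself, and uses a hypothetical ip_to_pod mapping built from the Pod information collected earlier.

```python
def pod_tcp_edges(netstat_output, ip_to_pod):
    """Return the set of undirected Pod-to-Pod TCP connections."""
    edges = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        # S3: keep only live connections (assumption: state ESTABLISHED).
        if fields[5] != "ESTABLISHED":
            continue
        # S4: strip the port to get the local and foreign IP addresses.
        local_ip = fields[3].rsplit(":", 1)[0]
        foreign_ip = fields[4].rsplit(":", 1)[0]
        # S5: map both ends to Pods; ignore addresses outside the cluster.
        src, dst = ip_to_pod.get(local_ip), ip_to_pod.get(foreign_ip)
        if src is None or dst is None or src == dst:
            continue
        # frozenset makes the edge undirected and removes duplicates (S3).
        edges.add(frozenset((src, dst)))
    return edges


# Hypothetical sample data.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.244.1.5:8080         10.244.2.7:51234        ESTABLISHED
tcp        0      0 10.244.1.5:8080         10.244.2.7:51240        ESTABLISHED
tcp        0      0 10.244.1.5:44321        10.244.3.9:3306         TIME_WAIT
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
"""
PODS = {"10.244.1.5": "orders", "10.244.2.7": "frontend", "10.244.3.9": "mysql"}

edges = pod_tcp_edges(SAMPLE, PODS)
```

With the sample data, the two ESTABLISHED rows collapse into a single edge between the orders and frontend Pods, while the TIME_WAIT and LISTEN rows are discarded.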

For example, to mount the /bin directory of the host to the /mnt directory of the container with id 60c2e42f1e80, the PID of the corresponding process is first obtained from the container id, for example PID 5882; the nsenter tool finds the corresponding process according to PID 5882 and enters the namespace of that process. The /bin directory of the host is then mounted into the /mnt directory of the container, the container is entered, and the netstat instruction is executed under the /mnt directory to acquire the internal network connection data. The form of the obtained TCP connections is shown in Table 2. The TCP connection data is processed, the IP addresses of each connection are extracted, the two Pod entities corresponding to the IP addresses are found, and a connection relation between the two Pod entities is added to the entity dependency graph.
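The command lines involved in this example might be assembled as below. This is a simplified variant and not the procedure of the application: instead of mounting the host directory into the container, it enters only the network namespace of the container process, so the netstat binary of the host remains visible and reports the connections of the container. The helper names are hypothetical; the docker inspect format string and the nsenter options --target and --net are standard options of those tools. The commands are only built here, not executed, since running them requires root privileges on the host.

```python
def pid_lookup_command(container_id):
    """Command that prints the host PID of a container's main process."""
    return ["docker", "inspect", "--format", "{{.State.Pid}}", container_id]


def netstat_in_container_command(host_pid):
    """Command that runs the host's netstat inside the container's
    network namespace (simplified variant: no directory is mounted)."""
    return ["nsenter", "--target", str(int(host_pid)), "--net",
            "netstat", "-ant"]


# Usage on the host (not executed here):
#   import subprocess
#   pid = subprocess.run(pid_lookup_command("60c2e42f1e80"),
#                        capture_output=True, text=True, check=True).stdout.strip()
#   out = subprocess.run(netstat_in_container_command(pid),
#                        capture_output=True, text=True, check=True).stdout
```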

Table 2 TCP connection data

After the original data of the cloud native system are obtained, a knowledge graph is constructed based on the original data, and graph data are obtained. In the process of constructing the knowledge graph, firstly, data extraction is performed on the original data, and specifically, entity extraction, entity relationship extraction and entity attribute extraction are sequentially performed on the original data.

1. Entity extraction

The entity extraction stage mainly extracts the application components at the deployment level and below, including node, namespace, deployment, replicaset, pod, endpoint, service, container, image, process and file.

2. Entity relationship extraction

The entity relationship extraction mainly extracts the relationship among all entities and the description mode of all the relationships according to experience knowledge. The embodiment of the application extracts entity relationships for extracted entities, specifically:

1) Association between Process and File: on the host, one process can open multiple files, and the same file can be opened by multiple processes, so a many-to-many relationship exists between the two. The relationship of the two can thus be described as: file-exposed → process.

2) Connection between Container and Image: the Container in Kubernetes is created using Image. The relationship between Image and Container is one-to-many: the same Image can be dynamically run as multiple different containers, but each Container can only correspond to one Image. The relationship between the two can be described as: image-spawn → container.

3) Connection between Container and Process: each dynamically running Container is a Process on the host, and the Container and the Process can be associated by looking at the host Process Pid corresponding to each Container. These two are in a one-to-one relationship, which can be described as: process-mapping → container.

4) Association between Pod and Container: the smallest orchestration and scheduling unit in a Kubernetes cluster is the Pod. A one-to-many relationship exists between Pod and Container; one or more Containers can run inside each Pod, and these Containers share the resources of the same Pod. The relationship can therefore be described as: container-running → pod.

5) Association between ReplicaSet and Pod: the ReplicaSet is a management tool for Pods; it essentially defines a desired state (the required number of Pod replicas, the label selector for screening the target Pods, the template for creating Pods, etc.). A Pod can be deployed by a ReplicaSet or a DaemonSet. Each Pod records information about its owner in its configuration file, which can be used to associate the Pod with the ReplicaSet. Pod and ReplicaSet are in a many-to-one relationship, which can be described as: pod-replica → replicaSet.

6) Association between Deployment and ReplicaSet: the Deployment achieves better orchestration of Pods by creating ReplicaSets. Owner-related information is also recorded in the definition file of the ReplicaSet, and this information can be used to link the ReplicaSet with the Deployment. ReplicaSet and Deployment are in a many-to-one relationship, which can be described as: replicaset-complex → deployment.

7) Association between Deployment and Namespace: the Namespace mainly implements resource isolation within the cluster; a user can divide the resource objects in the cluster among different Namespaces, thereby isolating resources among multiple tenants, different groups and multiple projects. Each Deployment can belong to only one Namespace in the cluster, and all object entities deployed by the Deployment, such as Pods, belong to that Namespace. The configuration file of the Deployment records the Namespace it belongs to. Deployment and Namespace are in a many-to-one relationship, which can be described as: deployment-deployed → namespace.

8) Association between Node and Namespace: a Node generally corresponds to a physical machine or a virtual machine, and is a host for running the entities in a cluster. Namespace and Node are not directly related logically, but the two can be associated through Pods. The Pods in each Namespace are scheduled onto different Nodes, so each Namespace has multiple Nodes associated with it, and each Node can also be shared by multiple Namespaces. The two are in a many-to-many relationship, which can be described as: namespace-host → node.

9) Association between Pod and Endpoint, Service: in the Kubernetes cluster, a Service is an abstract concept corresponding to a real application in the cluster. Each Service is backed by a series of Pods. The Service is associated with Pods through a label selector. For a Service configured with a label selector, Kubernetes creates a corresponding Endpoint resource object (mainly containing Pod IP and Pod information) and stores it in etcd; through the Endpoint, the Service can be associated with all of its corresponding Pods. The relationship between the three can be described as: service-binding → endpoint-register → pod.

10) Association between Pod and Pod: the Pods in the cluster can communicate over the network, whether they are on the same host or on different hosts. In the embodiment of the application, the TCP connections in a cluster are acquired by entering a Container. The main data of each connection comprises three fields: the local address IP, the foreign address IP and the state. The local address IP and the foreign address IP are extracted here; the two IPs are the IP addresses allocated by the cluster to two Pod entities, and the connection has no direction, so the association between Pods can be described as: pod ← tcp → pod.

In summary, the dependencies between these entities on the Kubernetes cluster can be combined into the entity dependency model shown in FIG. 2.
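The ten associations listed above can be written down as a small schema of (source type, relation, target type) triples, which is convenient for validating edges before they are written to the graph. The sketch below is illustrative: the relation names are taken from the descriptions above, while the lowercase type names, the set representation and the helper is_valid_edge are assumptions.

```python
# (source entity type, relation name, target entity type)
RELATIONS = {
    ("file", "exposed", "process"),
    ("image", "spawn", "container"),
    ("process", "mapping", "container"),
    ("container", "running", "pod"),
    ("pod", "replica", "replicaset"),
    ("replicaset", "complex", "deployment"),
    ("deployment", "deployed", "namespace"),
    ("namespace", "host", "node"),
    ("service", "binding", "endpoint"),
    ("endpoint", "register", "pod"),
    ("pod", "tcp", "pod"),
}

# Relations without a direction: an edge is valid in either orientation.
UNDIRECTED = {"tcp"}


def is_valid_edge(src_type, relation, dst_type):
    """Check an edge against the entity dependency schema."""
    if (src_type, relation, dst_type) in RELATIONS:
        return True
    return relation in UNDIRECTED and (dst_type, relation, src_type) in RELATIONS
```

For instance, an image-spawn → container edge passes the check, whereas the reversed container-spawn → image edge is rejected.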

3. Entity attribute extraction

The entity attributes extracted in the embodiment of the application mainly comprise static attributes and dynamic attributes. The static attributes mainly include information that uniquely identifies a component and some configuration information, such as identification information (name, id, etc.) and configuration information (environment variables, ip, port, etc.). This information does not change as the component runs and is generally stable; once it is changed, faults may result, so it can be used as a key indicator for fault detection.

The dynamic attribute refers to a characteristic expressed by the IT component in the operation process, and is mainly a key index of the operation condition of the IT component or the application program, such as service delay, throughput and system resources (such as CPU, memory and network utilization rate).

After the data is extracted, the entities, entity relationships and entity attributes are obtained, and an entity dependency model can be constructed from them. In the embodiment of the application, the entity dependency model is constructed in a top-down manner, layer by layer: from the node to the namespace node, from the namespace node to the deployment node, and so on. The constructed entity dependency model can be represented by graph data; please refer to FIG. 3. Each entity corresponds to one node in the graph data, and a node has a plurality of attributes. The relationships between entities are mapped to edges in the graph data, and each edge can have a plurality of attribute values. The attributes and attribute values are stored as key-value pairs. The modeled graph data is stored in a Neo4j graph database or another database, and efficient queries can be carried out with the built-in Cypher query language. The process of constructing the graph data is shown in FIG. 4.
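As an illustration of how nodes and edges with key-value attributes might be written to Neo4j, the sketch below builds parameterised Cypher statements. It is an assumption-laden example rather than the storage code of the application: the property name uid, the helper names and the identifier check are hypothetical. Labels and relationship types cannot be passed as Cypher parameters, so they are validated and formatted into the query text, while all values travel as parameters.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(identifier):
    """Allow only plain identifiers as labels or relationship types."""
    if not _IDENTIFIER.match(identifier):
        raise ValueError("invalid label or relationship type: %r" % identifier)
    return identifier


def node_statement(label, uid, props):
    """Cypher that creates or updates one entity node with its attributes."""
    query = "MERGE (n:%s {uid: $uid}) SET n += $props" % _checked(label)
    return query, {"uid": uid, "props": props}


def edge_statement(src_label, src_uid, rel_type, dst_label, dst_uid, props=None):
    """Cypher that creates or updates one relationship between two nodes."""
    query = (
        "MATCH (a:%s {uid: $src}), (b:%s {uid: $dst}) "
        "MERGE (a)-[r:%s]->(b) SET r += $props"
    ) % (_checked(src_label), _checked(dst_label), _checked(rel_type))
    return query, {"src": src_uid, "dst": dst_uid, "props": props or {}}


# Usage with the official neo4j Python driver (not executed here):
#   query, params = node_statement("Pod", "shop/web-1", {"ip": "10.244.1.5"})
#   session.run(query, params)
```

Using MERGE rather than CREATE keeps repeated collection runs from duplicating nodes and edges.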

Step 102, carrying out anomaly detection on the graph data through an anomaly detection model to obtain an abnormal node.

The graph data of the cloud native system can be represented by G = (V, E, X), where V is the set of nodes, E is the set of all connecting edges, X is the attribute matrix of all nodes, and X_i is the attribute vector of the i-th node.

In the embodiment of the present application, the input data of the anomaly detection model are an adjacency matrix and an attribute matrix. The directed topological relation of each node in the graph data G = (V, E, X) is extracted to obtain an adjacency matrix A, where A_ij = 1 indicates that there is a connecting edge e_ij ∈ E between entity i and entity j (i, j ∈ V), and A_ij = 0 indicates that there is no connecting edge between entity i and entity j. The attributes of each node in the graph data are extracted to obtain an attribute matrix X, which is formed by splicing together the attributes of each entity. The selected attributes are dynamic attributes with time series characteristics, such as CPU utilization over a period of time, network transmit and receive bandwidth, and latency, so that abnormal nodes can be diagnosed. Anomaly detection is then carried out on the adjacency matrix and the attribute matrix through the anomaly detection model to obtain an abnormal node.
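A minimal sketch of assembling A and X from graph data is given below. It uses plain Python lists to stay dependency-free; the node names, metric names and sample values are hypothetical, and concatenating the metric series of a node in a fixed metric order is one possible reading of "splicing" the attributes.

```python
def build_matrices(nodes, edges, metrics, series):
    """Build the adjacency matrix A and attribute matrix X.

    nodes:   ordered list of node ids
    edges:   iterable of directed (src, dst) pairs
    metrics: ordered list of metric names
    series:  series[node][metric] is a list of samples over the time window
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for src, dst in edges:
        adjacency[index[src]][index[dst]] = 1  # directed: only src -> dst is set

    attribute_matrix = []
    for node in nodes:
        row = []
        for metric in metrics:
            row.extend(series[node][metric])  # concatenate the time series
        attribute_matrix.append(row)
    return adjacency, attribute_matrix


# Hypothetical graph with three Pods and two metrics sampled twice.
nodes = ["pod-a", "pod-b", "pod-c"]
edges = [("pod-a", "pod-b"), ("pod-b", "pod-c")]
metrics = ["cpu", "latency_ms"]
series = {
    "pod-a": {"cpu": [0.20, 0.22], "latency_ms": [12.0, 11.0]},
    "pod-b": {"cpu": [0.35, 0.90], "latency_ms": [15.0, 80.0]},
    "pod-c": {"cpu": [0.10, 0.11], "latency_ms": [9.0, 9.5]},
}

A, X = build_matrices(nodes, edges, metrics, series)
```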

Further, the anomaly detection model in the embodiment of the application includes an encoder and a decoder, the encoder encodes and maps input data into a low-dimensional feature vector, the decoder reconstructs a knowledge graph by using the feature vector output by the encoder, and the anomaly detection model measures the anomaly degree of a node through the reconstruction error. Specifically, the performing anomaly detection on the adjacency matrix and the attribute matrix through an anomaly detection model to obtain an abnormal node includes:

encoding the adjacency matrix and the attribute matrix through the encoder in the anomaly detection model to obtain a coding feature vector, and inputting the coding feature vector and the adjacency matrix into the decoder; performing decoding through the decoder in the anomaly detection model by combining the adjacency matrix and the coding feature vector to obtain a reconstructed adjacency matrix and a reconstructed attribute matrix; and calculating, through the anomaly detection model, a reconstruction error based on the adjacency matrix, the attribute matrix, the reconstructed adjacency matrix and the reconstructed attribute matrix, and determining an abnormal node based on the reconstruction error.

(1) Encoder

The encoder in the embodiment of the application comprises a Graph Convolutional Network (GCN) and a Recurrent Neural Network (RNN) connected in sequence. The GCN captures the topological dependencies among entities in the knowledge graph: it uses the Laplacian matrix to reduce the dimensionality of the node attributes and map them into a low-dimensional space. At each step, the RNN carries the output of the previous time step into the hidden layer of the next time step, so it can model the relationship between index data at adjacent times. By combining the GCN and the RNN, the information in the knowledge graph of the cloud native system can therefore be mined effectively in both the temporal and the spatial dimension, and abnormal nodes can be found quickly. It is understood that an LSTM (Long Short-Term Memory) network may be used instead of the RNN.

After the adjacency matrix A and the attribute matrix X are input into the anomaly detection model, they are encoded by the encoder in the anomaly detection model to obtain a coding feature vector. Specifically, the adjacency matrix A and the attribute matrix X are mapped to a low-dimensional space through the graph convolution network to obtain a low-dimensional feature vector, and feature extraction is performed on the low-dimensional feature vector through the recurrent neural network to obtain the coding feature vector Z.

The first part of the encoder is a graph convolution network. The graph convolution network performs feature extraction through formula (1), mapping the adjacency matrix A and the attribute matrix X to a low-dimensional space:

$$X^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}\right) \tag{1}$$

where X^(l+1) is the low-dimensional feature vector output by layer l+1, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, W^(l) is the weight matrix of the l-th layer, and σ(·) is an activation function such as ReLU or sigmoid. In the embodiment of the present application, the ReLU activation function is preferably used, and its calculation formula is as follows:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$

The graph convolution network in the embodiment of the application preferably has 2 layers. For the input adjacency matrix A and attribute matrix X, the graph convolution network maps the input data to the low-dimensional feature vector X^(2), and the calculation process is as follows:

$$X^{(0)} = X \tag{3}$$

$$X^{(1)} = \mathrm{ReLU}\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(0)} W^{(0)}\right) \tag{4}$$

$$X^{(2)} = \mathrm{ReLU}\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(1)} W^{(1)}\right) \tag{5}$$
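The two-layer graph convolution can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the trained model of the embodiment: it assumes the common symmetric normalization with self-loops (Ã = A + I, degree taken as the row sum of Ã), uses ReLU in both layers, and takes the weight matrices `W0` and `W1` as given instead of learning them. All function names are hypothetical.

```python
import math


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def normalized_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2), with D the row-degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # always >= 1 because of the self-loop
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def gcn_forward(A, X, W0, W1):
    """Two graph-convolution layers: X1 = ReLU(A_hat X W0), X2 = ReLU(A_hat X1 W1)."""
    A_hat = normalized_adjacency(A)
    X1 = relu(matmul(matmul(A_hat, X), W0))
    X2 = relu(matmul(matmul(A_hat, X1), W1))
    return X2


# Tiny worked example with two mutually connected nodes.
A = [[0, 1], [1, 0]]
X = [[1.0, 2.0], [3.0, 4.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]  # identity, keeps both features
W1 = [[1.0], [1.0]]            # sums the two features into one
X2 = gcn_forward(A, X, W0, W1)  # -> [[5.0], [5.0]]
```

Because both nodes are neighbours of each other, the normalized adjacency averages their attributes, so both rows of the output coincide.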

The low-dimensional feature vector X^(2) output by the GCN is input into the RNN for feature extraction to obtain the coding feature vector Z. In the RNN input X^(2), the attribute sequence corresponding to each node is an attribute vector with time-series characteristics. The specific calculation process of the RNN model is as follows:

$$x(t) = w(t) + s(t-1) \tag{6}$$

$$s_j(t) = f\left(\sum_i x_i(t)\, u_{ji}\right) \tag{7}$$

$$y_k(t) = g\left(\sum_j s_j(t)\, v_{kj}\right) \tag{8}$$

where x(t) is the input at time t, s(t) is the hidden layer state at time t, s_j(t) is the state of the j-th neuron of the hidden layer, y(t) is the output at time t, y_k(t) is the output of the k-th neuron of the output layer, u is the weight matrix between the input layer and the hidden layer, v is the weight matrix between the hidden layer and the output layer, and f(·) and g(·) are activation functions, namely a sigmoid activation function and a softmax activation function respectively, calculated as follows:

$$f(z) = \frac{1}{1 + e^{-z}} \tag{9}$$

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \tag{10}$$
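The recurrence in formulas (6) to (8) can be sketched as a single step function applied along a time series. This sketch reads the "+" in formula (6) as vector concatenation of the current input w(t) with the previous hidden state s(t-1), which is an assumption (an element-wise sum would require both vectors to have the same dimension). The function names and the weight values are hypothetical, and the weights are fixed rather than trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(w_t, s_prev, U, V):
    """One time step. U has one row per hidden neuron, V one row per output neuron."""
    x_t = list(w_t) + list(s_prev)  # formula (6), '+' read as concatenation (assumption)
    s_t = [sigmoid(sum(x * u for x, u in zip(x_t, row))) for row in U]  # formula (7)
    y_t = softmax([sum(s * v for s, v in zip(s_t, row)) for row in V])  # formula (8)
    return s_t, y_t


def run_rnn(sequence, hidden_size, U, V):
    """Feed a sequence of input vectors through the recurrence, starting from s = 0."""
    s = [0.0] * hidden_size
    outputs = []
    for w_t in sequence:
        s, y = rnn_step(w_t, s, U, V)
        outputs.append(y)
    return s, outputs


# Hypothetical example: one feature per time step, two hidden and two output neurons.
sequence = [[0.2], [0.4], [0.9]]
U = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.2]]  # 2 x (1 input + 2 hidden)
V = [[0.5, -0.5], [0.1, 0.2]]            # 2 x 2
final_state, outputs = run_rnn(sequence, 2, U, V)
```

The final hidden state plays the role of a compact summary of the series; swapping `rnn_step` for an LSTM cell would leave the surrounding loop unchanged.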

(2) Decoder

The decoder decodes the coding feature vector Z and reconstructs the input adjacency matrix A and attribute matrix X to obtain a reconstructed adjacency matrix Â and a reconstructed attribute matrix X̂. The decoding process includes two parts: one part is the decoding of the adjacency matrix, i.e. the reconstruction of the topology; the other part is the decoding of the attribute matrix, i.e. the reconstruction of the attribute information. Therefore, the decoder comprises a topological structure reconstruction module and an attribute information reconstruction module, wherein the attribute information reconstruction module comprises a recurrent neural network and a graph convolution network connected in sequence.

The specific decoding process of the decoder is as follows:

S1, decoding the coding feature vector Z through the topological structure reconstruction module to obtain the reconstructed adjacency matrix Â. The specific calculation formula is as follows:

$$\hat{A} = \sigma\left(Z Z^{T}\right) \tag{11}$$

where σ(·) is an activation function, preferably a sigmoid activation function.
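A minimal sketch of the topology reconstruction step is given below. It assumes the inner-product form that is common for graph autoencoders, in which entry (i, j) of the reconstructed adjacency matrix is the sigmoid of the dot product of the coding feature vectors of nodes i and j; the function name `reconstruct_adjacency` and the example values are hypothetical.

```python
import math


def stable_sigmoid(v):
    """Sigmoid that avoids overflow for large negative inputs."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def reconstruct_adjacency(Z):
    """Z holds one coding feature vector per node; returns sigmoid(Z Z^T)."""
    return [[stable_sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z]
            for zi in Z]


# Hypothetical embeddings for three nodes.
Z = [[0.0, 0.0], [1.0, 0.0], [2.0, -1.0]]
A_rec = reconstruct_adjacency(Z)
```

Each entry can be read as the estimated probability of an edge between two nodes. Note that this form is symmetric by construction, so it scores the presence of a connection rather than its direction.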

S2, decoding the coding feature vector through the recurrent neural network in the attribute information reconstruction module to obtain an output result; and decoding the output result, based on the adjacency matrix, through the graph convolution network in the attribute information reconstruction module to obtain the reconstructed attribute matrix.

The decoding framework adopted by the attribute information reconstruction module is an RNN-plus-GCN structure that mirrors the encoder. The input of the attribute information reconstruction module comprises two parts: the coding feature vector Z and the adjacency matrix A, because reconstructing the attribute matrix requires the adjacency matrix that was input to the encoder. It is understood that an LSTM may also be used in place of the RNN.

First, the coding feature vector Z is input into the RNN, which is calculated in a manner similar to the encoding process described above. The output of the RNN is an attribute matrix X', which is used, together with the initial adjacency matrix A, as the input of the GCN; the output of the GCN is the reconstructed attribute matrix X̂.

The errors between the reconstructed attribute matrix and reconstructed adjacency matrix on the one hand and the original input data on the other are the key to the subsequent detection of abnormal nodes.

The training process of the anomaly detection model is to minimize the reconstruction error between the input data and the reconstruction result, and the calculation formula of the reconstruction error is as follows:

The anomaly detection model measures the degree of abnormality of a node by the size of its reconstruction error. The reconstruction error of node i is measured by the Euclidean norm, and the calculation formula is as follows:

$$L_i = \lVert a_i - \hat{a}_i \rVert_2 + \lVert x_i - \hat{x}_i \rVert_2 \tag{13}$$

where a_i is the adjacency vector corresponding to the i-th node, â_i is the reconstructed adjacency vector corresponding to the i-th node, x_i is the attribute vector corresponding to the i-th node, x̂_i is the reconstructed attribute vector corresponding to the i-th node, and L_i is the reconstruction error corresponding to the i-th node.

A node with a larger reconstruction error is more likely to be an abnormal node, and a node whose reconstruction error reaches or exceeds a preset threshold θ is determined to be an abnormal node, namely:

Anomaly(v₁,v₂,…,v_n)＝{v_i|L_i≥θ,θ＜Max(L₁,L₂,…,L_n),i∈[1,n]} (14)
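The thresholding rule of formula (14) can be sketched as follows. The per-node error is assumed here to be the unweighted sum of the Euclidean distances between the original and reconstructed adjacency vectors and between the original and reconstructed attribute vectors; the function names and the example matrices are hypothetical.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def node_errors(A, A_hat, X, X_hat):
    """Per-node reconstruction error: structure term plus attribute term (assumed form)."""
    return [l2_distance(a, ah) + l2_distance(x, xh)
            for a, ah, x, xh in zip(A, A_hat, X, X_hat)]


def abnormal_nodes(errors, theta):
    """Indices i with L_i >= theta; formula (14) also requires theta < max(L)."""
    if theta >= max(errors):
        raise ValueError("theta must be smaller than the largest reconstruction error")
    return [i for i, err in enumerate(errors) if err >= theta]


# Hypothetical example: node 1 has a badly reconstructed attribute vector.
A = [[0, 1], [1, 0]]
A_hat = [[0, 1], [1, 0]]
X = [[1.0, 1.0], [1.0, 1.0]]
X_hat = [[1.0, 1.0], [4.0, 5.0]]
errors = node_errors(A, A_hat, X, X_hat)  # -> [0.0, 5.0]
suspects = abnormal_nodes(errors, theta=1.0)  # -> [1]
```

The constraint θ < Max(L_1, …, L_n) guarantees that at least one node is always returned, so the choice of θ controls how many candidates are passed on to root cause localization.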

Step 103, calculating the similarity between the abnormal node and the replica node corresponding to the abnormal node, and performing fault root cause localization based on the similarity, wherein the replica node corresponding to the abnormal node is a node of the same type as the abnormal node.

The anomaly detection model identifies the specific nodes in which an anomaly occurs; the embodiment of the application further analyzes the root cause of the anomaly. Generally, replicas of a pod entity that provide the same service have the same configuration and performance, so the fault root cause can be found by comparing the similarity between replicas. The embodiment of the application adopts a local comparison method: the abnormal node is compared with its corresponding replica node, the dynamic attribute similarity and the static attribute similarity between them are calculated to obtain a dynamic attribute similarity score and a static attribute similarity score respectively, and an abnormal attribute is determined based on the two scores. The replica node corresponding to the abnormal node is a node of the same type as the abnormal node, that is, the abnormal node and its replica node provide the same service externally.

The specific process of calculating the dynamic attribute similarity of the abnormal node and the replica node corresponding to the abnormal node to obtain the dynamic attribute similarity score may be as follows: acquiring dynamic attributes of abnormal nodes and copy nodes corresponding to the abnormal nodes in a preset time period, and generating abnormal node dynamic attribute vectors and copy node dynamic attribute vectors; and calculating cosine similarity of the abnormal node dynamic attribute vector and the replica node dynamic attribute vector to obtain a dynamic attribute similarity score. Specifically, the similarity of the dynamic attributes such as the usage rate of the CPU of the abnormal node and the replica node corresponding to the abnormal node, the transmission and reception bandwidth of the network, and the memory usage rate may be calculated.
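The dynamic attribute comparison can be sketched as a cosine similarity computed per attribute over the same time window. The function names, attribute names and sample values below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero series
    return dot / (norm_u * norm_v)


def dynamic_similarity_scores(abnormal, replica):
    """Both arguments map an attribute name to its samples in the same time window."""
    return {name: cosine_similarity(series, replica[name])
            for name, series in abnormal.items() if name in replica}


# Hypothetical samples for an abnormal pod and one healthy replica.
abnormal_pod = {
    "cpu_usage": [0.30, 0.31, 0.29, 0.30],
    "memory_usage": [0.40, 0.75, 0.10, 0.98],
}
replica_pod = {
    "cpu_usage": [0.29, 0.30, 0.30, 0.31],
    "memory_usage": [0.41, 0.40, 0.42, 0.41],
}
scores = dynamic_similarity_scores(abnormal_pod, replica_pod)
```

In this example the CPU series of the two pods are nearly parallel, so the CPU score is close to 1, while the erratic memory series yields a visibly lower score.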

Anomalies in the cloud native system may also be caused by changes in static attributes, such as missing configuration files or tampered environment variables. Replicas of a pod instance that provide the same service externally generally have the same configuration, so the root cause can be discovered by comparing the similarity between replicas. Static attributes are generally of text type, so text similarity can be used for the comparison; other similarity calculation methods can also be used and are not listed here. The embodiment of the application preferably adopts cosine similarity for the comparison, and the specific process is as follows:

performing word segmentation, stop word filtering and one-hot encoding in sequence on the static attribute of the abnormal node and on the static attribute of the replica node corresponding to the abnormal node to obtain an abnormal node text vector and a replica node text vector respectively; and calculating the cosine similarity of the abnormal node text vector and the replica node text vector to obtain a static attribute similarity score.
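The static attribute pipeline can be sketched as follows. The regular-expression tokenizer and the small stop-word set stand in for a real word segmentation tool and stop-word list, so they are assumptions, as are the function names and the example environment strings.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # hypothetical stop-word list


def tokenize(text):
    """Lower-case the text, split it into tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9_./-]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def one_hot_vectors(tokens_a, tokens_b):
    """Encode both token lists as 0/1 vectors over their shared vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    vocab = sorted(set_a | set_b)
    vec_a = [1 if term in set_a else 0 for term in vocab]
    vec_b = [1 if term in set_b else 0 for term in vocab]
    return vec_a, vec_b


def static_similarity(text_a, text_b):
    """Cosine similarity of the one-hot text vectors of two static attributes."""
    vec_a, vec_b = one_hot_vectors(tokenize(text_a), tokenize(text_b))
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(vec_a))  # entries are 0/1, so the sum equals the squared norm
    norm_b = math.sqrt(sum(vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical environment settings of an abnormal pod and its replica.
abnormal_env = "LOG_LEVEL=debug PORT=8080"
replica_env = "LOG_LEVEL=info PORT=8080"
score = static_similarity(abnormal_env, replica_env)  # 3 shared of 4 tokens each -> 0.75
```

A score below 1 signals that the two replicas differ in this static attribute, here because one environment variable was changed.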

The attributes are then sorted by their dynamic attribute similarity scores and static attribute similarity scores. The lower the similarity score of an attribute, the more likely it is to be the fault root cause, so the specific fault root cause can be located. The overall architecture of the anomaly detection model and root cause localization is shown in Fig. 5.
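The ranking step can be sketched as a sort of all scored attributes in ascending order of similarity. The function name `rank_root_causes`, the attribute names and the score values are hypothetical, and merging dynamic and static scores into a single list is one possible reading of the sorting described above.

```python
def rank_root_causes(dynamic_scores, static_scores):
    """Return (attribute, score, kind) tuples, least similar attribute first."""
    candidates = [(name, score, "dynamic") for name, score in dynamic_scores.items()]
    candidates += [(name, score, "static") for name, score in static_scores.items()]
    return sorted(candidates, key=lambda item: item[1])


# Hypothetical similarity scores between an abnormal pod and its replica.
dynamic_scores = {"cpu_usage": 0.99, "memory_usage": 0.42, "network_rx": 0.97}
static_scores = {"environment": 0.75, "config_file": 1.0}
ranking = rank_root_causes(dynamic_scores, static_scores)
most_likely = ranking[0][0]  # -> "memory_usage"
```

The head of the list is the attribute that deviates most from the healthy replica and is therefore the first candidate to inspect.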

In the embodiment of the application, original data in the cloud native system is obtained, and a knowledge graph is constructed from the attributes and dependency relationships of the entities in the original data to obtain graph data. The anomaly detection model then performs anomaly detection on the graph data, which carries both attribute information and dependency relationships, to obtain abnormal nodes. The relationships among the entities in the cloud native system are thereby taken into account, and the fault can be located quickly and accurately. In addition, the fault root cause is located by calculating the similarity between an abnormal node and nodes of the same type, so the root cause can be located at a finer granularity and more accurately. This solves the technical problems of the prior art, in which the interaction relationships between entities are ignored, only the faulty entity can be located, and the fault root cause of the cloud native system is difficult to locate quickly and accurately.

Furthermore, the original data in the cloud native system is obtained in a non-invasive and lightweight manner, without code instrumentation or agent installation, so the impact on the cloud native system is almost negligible and the method is safer and more reliable. The method considers the index data and the topological relationships of the components in the cloud native system at the same time, so the fault root cause can be located more accurately. The dynamic attributes and static attributes of each abnormal component are examined in both the temporal and the spatial dimension, so the fault root cause can be located at a finer granularity.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A cloud native system fault analysis method based on a knowledge graph is characterized by comprising the following steps:

the method comprises the steps of obtaining original data in a cloud native system, constructing a knowledge graph based on the original data, and obtaining graph data, wherein the original data comprises entity information and network connection data, and the obtaining of the original data of the cloud native system comprises the following steps:

acquiring the entity information in the cloud native system, wherein the entity comprises a container, a Process and a File; entering a name space of the container through an nsenter tool, and mounting a directory where a netstat file of the host computer is located on a file system of the container; executing the nsenter command in the container to acquire the network connection data;

the constructing of the knowledge graph based on the original data to obtain graph data comprises:

performing entity extraction, entity relationship extraction and entity attribute extraction on the original data in sequence, wherein the entity attributes comprise static attributes and dynamic attributes; constructing a knowledge graph based on the extracted entities, the entity relations and the entity attributes to obtain graph data;

2. The method for analyzing the failure of the cloud native system based on the knowledge-graph according to claim 1, wherein the calculating the similarity between the abnormal node and the replica node corresponding to the abnormal node and performing failure root location based on the similarity comprises:

3. The method for analyzing the failure of the cloud native system based on the knowledge-graph according to claim 2, wherein calculating the similarity of the dynamic attributes of the abnormal node and the replica node corresponding to the abnormal node to obtain a similarity score of the dynamic attributes comprises:

4. The method for analyzing the failure of the cloud native system based on the knowledge-graph according to claim 2, wherein the step of calculating the static attribute similarity of the abnormal node and the replica node corresponding to the abnormal node to obtain a static attribute similarity score comprises the steps of:

5. The method for analyzing the failure of the cloud native system based on the knowledge-graph according to claim 1, wherein the performing anomaly detection on the graph data through an anomaly detection model to obtain an anomalous node comprises:

6. The method for analyzing the cloud native system fault based on the knowledge graph according to claim 5, wherein the anomaly detection model comprises an encoder and a decoder;

7. The method of cloud-native system fault analysis based on knowledge-graph according to claim 6, wherein the encoder comprises a graph convolution network and a recurrent neural network connected in sequence;

8. The method for analyzing the cloud-native system fault based on the knowledge-graph according to claim 6, wherein the decoder comprises a topology reconstruction module and an attribute information reconstruction module, wherein the attribute information reconstruction module comprises a recurrent neural network and a graph convolution network which are connected in sequence;