CN114265758A - Full link monitoring method and device based on software and hardware integrated architecture - Google Patents


Info

Publication number
CN114265758A
CN114265758A (application CN202111481950.7A)
Authority
CN
China
Prior art keywords
index data
server
data
database
container
Prior art date
Legal status
Pending
Application number
CN202111481950.7A
Other languages
Chinese (zh)
Inventor
来海龙
吴腾雷
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111481950.7A
Publication of CN114265758A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

Embodiments of this specification provide a full link monitoring method and device based on a software and hardware integrated architecture. The method includes: acquiring multiple pieces of performance index data produced while a target service request is processed, where the data are obtained by collecting indexes from the several server applications, several containers, and several physical machines involved in the target service request within the software and hardware integrated architecture; determining, for a first piece of index data among the performance index data, the problem category to which it belongs in the service dimension; and, in response to the first index data meeting one of several preset alarm conditions, generating problem description information of the first index data and filing that information into a target knowledge base, where the problem description information includes the first index data and the problem category.

Description

Full link monitoring method and device based on software and hardware integrated architecture
Technical Field
The embodiment of the specification relates to the field of application performance management, in particular to a full link monitoring method and device based on a software and hardware integrated architecture.
Background
Although existing APM (Application Performance Management) technology provides monitoring functions such as active monitoring, passive monitoring, and full link monitoring, it is mainly oriented toward products such as traditional mobile applications (APPs) and personal computers (PCs), and is not suited to integrated software and hardware solutions in scenarios such as AI (Artificial Intelligence) and edge computing.
A reasonable and reliable scheme is therefore urgently needed that can perform full link monitoring for integrated software and hardware solutions.
Disclosure of Invention
The embodiment of the specification provides a full link monitoring method and device based on a software and hardware integrated architecture, and full link monitoring can be performed facing to a software and hardware integrated scheme.
In a first aspect, an embodiment of this specification provides a full link monitoring method based on a software and hardware integrated architecture, including: acquiring multiple pieces of performance index data produced during the processing of a target service request, where the data are obtained by collecting indexes from the several server applications, several containers, and several physical machines involved in the target service request within the software and hardware integrated architecture; determining, for a first piece of index data among the performance index data, the problem category to which it belongs in the service dimension; and, in response to the first index data meeting one of several preset alarm conditions, generating problem description information of the first index data and filing it into a target knowledge base, where the problem description information includes the first index data and the problem category.
In some embodiments, when the first index data satisfies one of the plurality of alarm conditions, the method further includes sending an alarm notification message related to the first index data to an alarm system.
In some embodiments, before generating the problem description information of the first index data, the method further includes determining the technology category to which the first index data belongs in the technology dimension; the problem description information then also includes the technology category.
In some embodiments, the technology category is application, container, or physical machine.
In some embodiments, after acquiring the plurality of performance index data produced during the processing of the target service request, the method further includes acquiring a tracking identifier corresponding to the target service request; the problem description information then also includes the tracking identifier.
In some embodiments, after filing the problem description information into the target knowledge base, the method further includes: receiving a data viewing request that includes the tracking identifier; acquiring, from the target knowledge base, problem description information that includes the tracking identifier; and returning the acquired problem description information.
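The data viewing flow described in the embodiment above can be sketched as follows. This is a minimal, hypothetical Python model of the knowledge base; all class and field names are invented for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemDescription:
    trace_id: str          # tracking identifier of the service request
    index_data: dict       # the performance index data that triggered the alarm
    problem_category: str  # problem category under the service dimension

@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)

    def add(self, desc: ProblemDescription) -> None:
        # File a problem description into the knowledge base.
        self.entries.append(desc)

    def query(self, trace_id: str) -> list:
        # Handle a data viewing request: return every problem
        # description whose tracking identifier matches.
        return [e for e in self.entries if e.trace_id == trace_id]

kb = KnowledgeBase()
kb.add(ProblemDescription("trace-001", {"interface_time_ms": 4200}, "long interface time"))
kb.add(ProblemDescription("trace-002", {"gpu_util": 0.99}, "abnormal interface call"))
print(len(kb.query("trace-001")))  # → 1
```

A request carrying an unknown tracking identifier simply returns an empty list, matching the patent's description that only stored problem descriptions containing the identifier are returned.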
In some embodiments, the plurality of server applications includes a first server application for processing the target service request; the plurality of containers includes a first container in which the first server application is located; and the plurality of physical machines includes a first server cluster where the first container is located and a network switch corresponding to the first server cluster.
In some embodiments, the first server cluster is an AI server cluster or an edge server cluster.
In some embodiments, the plurality of server applications further includes a first middleware and/or a first database that the first server application passes through when processing the target service request; the plurality of containers further includes a second container in which the first middleware is located and/or a third container in which the first database is located; and the plurality of physical machines further includes a second server cluster where the second container is located and/or a third server cluster where the third container is located.
In some embodiments, the software and hardware integrated architecture is divided into an application service layer, a container layer and a hardware layer, the server applications are located in the application service layer, the containers are located in the container layer, and the physical machines are located in the hardware layer.
In some embodiments, the application service layer further includes a second database and a third database, each deployed in a container of the container layer. The second database stores a first set of performance index data collected from the plurality of server applications during the processing of the target service request and a second set collected from the plurality of containers; the third database stores a third set collected from the plurality of physical machines during the same processing. Acquiring the plurality of performance index data then includes obtaining the first and second sets from the second database and the third set from the third database.
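As a rough illustration of this split storage, the following hypothetical Python sketch models the second and third databases as plain dictionaries. These are stand-ins, not real Elasticsearch or Prometheus clients, and every key name is invented.

```python
# Stand-in for the second database (the patent suggests an ES database):
# holds the application and container index-data sets.
app_and_container_db = {
    "applications": [{"metric": "interface_time_ms", "value": 350}],
    "containers":   [{"metric": "container_cpu_util", "value": 0.62}],
}

# Stand-in for the third database (the patent suggests a Prometheus database):
# holds the physical-machine index-data set.
physical_machine_db = {
    "machines": [{"metric": "cluster_gpu_util", "value": 0.88}],
}

def collect_all_index_data():
    # First and second sets come from the second database,
    # the third set from the third database.
    first_set = app_and_container_db["applications"]
    second_set = app_and_container_db["containers"]
    third_set = physical_machine_db["machines"]
    return first_set + second_set + third_set

print(len(collect_all_index_data()))  # → 3
```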
In some embodiments, the server applications, the containers, and the physical machines are respectively deployed with agents, and the first set of performance indicator data, the second set of performance indicator data, and the third set of performance indicator data are uploaded to the corresponding databases by the corresponding agents.
In a second aspect, an embodiment of this specification provides a full link monitoring device based on a software and hardware integrated architecture, including an acquisition unit, a classification unit, and a processing unit. The acquisition unit is configured to acquire multiple pieces of performance index data produced during the processing of a target service request, the data being obtained by collecting indexes from the several server applications, several containers, and several physical machines involved in the target service request within the software and hardware integrated architecture. The classification unit is configured to determine, for a first piece of index data among the performance index data, the problem category to which it belongs in the service dimension. The processing unit is configured to, in response to the first index data meeting one of several preset alarm conditions, generate problem description information of the first index data and file it into a target knowledge base, the information including the first index data and the problem category.
In some embodiments, the software and hardware integrated architecture is divided into an application service layer, a container layer and a hardware layer, the server applications are located in the application service layer, the containers are located in the container layer, and the physical machines are located in the hardware layer.
In a third aspect, the present specification provides a computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to execute the method described in any implementation manner in the first aspect.
In a fourth aspect, the present specification provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any implementation manner of the first aspect.
In a fifth aspect, the present specification provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the method described in any implementation manner of the first aspect.
The above embodiments of this specification provide a full link monitoring method and apparatus based on a software and hardware integrated architecture. For a target service request under that architecture, such as an AI computation request or a data viewing request, multiple pieces of performance index data produced during its processing may be acquired, the data having been obtained by collecting indexes from the several server applications, several containers, and several physical machines involved in the request. For a first piece of index data among them, the problem category to which it belongs in the service dimension may be determined. Then, in response to the first index data meeting one of several preset alarm conditions, problem description information of the first index data is generated and filed into a target knowledge base, the information including the first index data and the problem category. Full link monitoring oriented toward an integrated software and hardware solution can thus be realized.
It should be noted that the problem category in the problem description information may be regarded as a problem or fault, and the first index data in that information as the cause of the problem or fault. By filing problem description information into the target knowledge base, a user can quickly learn, by viewing the information there, what problems an application has (for example, the first server application that processes the target service request), why they arose, where its performance bottlenecks lie, and so on.
Drawings
To illustrate the technical solutions of the embodiments disclosed in this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments disclosed in this specification, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a software and hardware integrated architecture;
FIG. 2 is an exemplary system architecture diagram to which some embodiments of the present description may be applied;
FIG. 3 is a diagram of one embodiment of a full link monitoring method based on a software and hardware integrated architecture;
fig. 4 is a schematic structural diagram of a full-link monitoring device based on a software and hardware integrated architecture.
Detailed Description
The present specification is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the relevant invention and do not limit it. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
It should be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings. The embodiments in this specification, and the features within them, may be combined with each other where no conflict arises. In addition, terms such as "first", "second", and "third" in the embodiments of this specification are used only to distinguish information and impose no limitation.
As described above, although existing APM technology has monitoring functions, it is mainly oriented toward products such as traditional APPs and PCs, and is not suitable for integrated software and hardware solutions in scenarios such as AI and edge computing.
Based on this, some embodiments of this specification provide a full link monitoring method based on a software and hardware integrated architecture, through which full link monitoring can be performed for an integrated software and hardware solution.
Specifically, fig. 1 shows a schematic diagram of a software and hardware integrated architecture to which the full link monitoring method provided in the embodiment of the present specification is oriented.
In practice, the software and hardware integrated architecture may include a plurality of physical machines; a container may be built on at least some of them, and a server application may be deployed in the container. Any server application may be developed based on Java, Go, or Node.js; in other words, any server application may be a Java application, a Go application, or a Node.js application.
Further, as shown in fig. 1, the software and hardware integrated architecture may be divided into multiple layers, such as a hardware layer, a container layer, and an application service layer. The plurality of physical machines may be located in a hardware layer, the container built on at least part of the physical machines may be located in a container layer, and the server application deployed in the container may be located in an application service layer.
The plurality of physical machines may include, for example, several server clusters, such as a server cluster for AI scenarios (an AI server cluster for short), a server cluster for edge scenarios (an edge server cluster for short), and other server clusters besides these two, which may be referred to as normal server clusters.
In general, an edge server cluster may be made up of a plurality of edge devices, for example several network-enabled cameras or several edge boxes; this is not specifically limited here.
The at least some physical machines may be, for example, the several server clusters, on which containers, such as containers based on K8s (Kubernetes), may be built. In practice, K8s is an open-source container orchestration engine supporting automated deployment, scaling, and management of containerized applications.
As an example, a server application providing an AI service in an AI scenario may be deployed in one or more containers built on the AI server cluster; such an application may be called an AI server application. A server application providing an edge service in an edge scenario may be deployed in one or more containers built on the edge server cluster; such an application may be called an edge server application. Server applications serving as middleware and/or databases may be deployed in one or more containers built on the normal server cluster. The AI server application and the edge server application may be collectively referred to as business server applications.
Middleware generally has functions of data transmission, data search, data visualization, and/or data analysis. Taking data transmission as an example, the middleware may receive data sent by one party (e.g., a certain service server application or other middleware, etc.) and send the data to another party (e.g., a database, etc.); alternatively, the middleware may obtain data from one party (e.g., a database, etc.) based on the data obtaining request and return the data to the sender of the data obtaining request.
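The middleware's receive-and-forward role described above can be sketched minimally in Python. This is not a real Kafka or Logstash API, just an illustrative stand-in; the class and method names are invented.

```python
import queue

class Middleware:
    """Hypothetical relay: receives data from one party, forwards it to another."""

    def __init__(self):
        self._buffer = queue.Queue()

    def receive(self, data):
        # Data sent by a producer, e.g. a business server application
        # or another piece of middleware.
        self._buffer.put(data)

    def forward_all(self, sink):
        # Deliver all buffered data to a consumer, e.g. a database.
        while not self._buffer.empty():
            sink.append(self._buffer.get())

mw = Middleware()
mw.receive({"metric": "interface_time_ms", "value": 420})
db = []               # stand-in for the destination database
mw.forward_all(db)
print(db[0]["value"])  # → 420
```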
In practice, there may be various middleware within the application service layer, such as the Kafka message middleware, Logstash, APM Server, Grafana, and Kibana. Kafka is a high-throughput distributed publish-subscribe messaging system. Logstash is an open-source data collection engine with real-time pipeline functionality that dynamically unifies data from different sources and normalizes it to a specified location. The APM Server transmits received data to a specified location. Grafana is an open-source, feature-rich data visualization platform often used to visualize time-series data. Kibana can be used for efficient searching, visualization, and analysis of logs.
In addition, there may be various databases in the application service layer, such as a relational database (e.g., a MySQL database), a non-relational database (e.g., an ES database), and a time-series database (e.g., a Prometheus-based database, which may be called a Prometheus database). In practice, ES (Elasticsearch) is a distributed, highly scalable, near-real-time search and data analysis engine. Prometheus is an open-source monitoring, alerting, and time-series database toolkit.
Since Kafka, Logstash, APM Server, Grafana, Kibana, MySQL, ES, and Prometheus are known technologies in the art, they are not described in detail here.
Optionally, the plurality of physical machines may further include network switches respectively corresponding to the plurality of server clusters, for example, RDMA (Remote Direct Memory Access) network switches.
An exemplary system architecture diagram suitable for use with embodiments of this specification is further described below in conjunction with fig. 2. As shown in fig. 2, the system architecture may include a software and hardware integrated architecture as previously described, a tracking system (also referred to as a tracking center), and a target knowledge base.
It should be noted that, for the sake of clarity in describing the monitoring process, only a simplified version of the software and hardware integration architecture is shown in fig. 2. For a detailed description of the architecture, reference is made to the relevant description hereinbefore.
In practice, probe burying (instrumentation) may be performed in the software and hardware integrated architecture in advance, for example in its server applications, containers, and physical machines. When the architecture is divided into a hardware layer, a container layer, and an application service layer, probe burying may accordingly be performed in each of these layers in advance.
Specifically, when probes are buried in the hardware layer, some or all of the physical machines there may be instrumented so that indexes can be collected from them. For example, by instrumenting a server cluster, performance index data for several of its performance indexes may be collected, such as the CPU (Central Processing Unit) utilization and/or the GPU (Graphics Processing Unit) utilization of the cluster. By instrumenting an RDMA network switch, performance index data such as RDMA network throughput may be collected.
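As an illustration of what a hardware-layer sample might look like, the following hypothetical Python sketch builds one record. Every field name and value here is invented; the patent does not specify a sample format.

```python
import time

def sample_cluster_metrics():
    # Hypothetical shape of one hardware-layer sample produced by a probe.
    return {
        "timestamp": time.time(),
        "cluster": "ai-cluster-1",         # invented cluster name
        "cpu_util": 0.71,                  # CPU utilization of the server cluster
        "gpu_util": 0.88,                  # GPU utilization of the server cluster
        "rdma_throughput_gbps": 92.5,      # throughput seen at the RDMA network switch
    }

s = sample_cluster_metrics()
assert 0.0 <= s["cpu_util"] <= 1.0 and 0.0 <= s["gpu_util"] <= 1.0
```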
When probes are buried in the container layer, some or all of the containers there may be instrumented so that indexes can be collected from them. For example, by instrumenting a container, performance index data such as container utilization, the container's memory utilization, CPU utilization, and/or GPU utilization may be collected.
When probes are buried in the application service layer, some or all of the server applications there may be instrumented so that indexes can be collected from them. For example, by instrumenting a server application, performance index data such as interface time consumption may be collected.
It should be understood that the performance indexes listed above are merely examples; in practice, the indexes to be collected can be set according to actual requirements and are not specifically limited here.
In addition, corresponding agents can be deployed in the instrumented server applications, containers, and physical machines. When the software and hardware integrated architecture is divided into a hardware layer, a container layer, and an application service layer, corresponding agents can be deployed in each layer. An agent collects performance index data and uploads it to the corresponding database.
Specifically, in the hardware layer, an agent such as OneAgent, RDMA Exporter, or Node Exporter may be deployed on the instrumented physical machines, for example on the instrumented network switches and server clusters (e.g., the AI server cluster shown in fig. 2). In the container layer, an agent such as Filebeat may be deployed in the instrumented containers. In the application service layer, an agent such as an APM Agent may be deployed in the instrumented server applications, for example the instrumented business server applications, middleware, and databases. The agents listed here are all currently known and are not described in detail.
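The agents' collect-and-upload behavior can be sketched as follows. This hypothetical Python model is not how Filebeat, OneAgent, or the APM Agent are actually implemented; it only illustrates the layer-to-database routing described above, with invented names.

```python
class Agent:
    """Hypothetical agent: tags each sample with its layer and uploads it."""

    def __init__(self, layer, database):
        self.layer = layer        # "application", "container", or "hardware"
        self.database = database  # destination store for this layer's metrics

    def upload(self, samples):
        for s in samples:
            # Tag the sample with the layer it came from, then upload.
            self.database.append({"layer": self.layer, **s})

# Stand-ins for the ES database (application/container data)
# and the Prometheus database (physical-machine data).
es_db, prometheus_db = [], []

Agent("application", es_db).upload([{"metric": "interface_time_ms", "value": 350}])
Agent("hardware", prometheus_db).upload([{"metric": "cluster_cpu_util", "value": 0.7}])
print(es_db[0]["layer"])  # → application
```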
Note that fig. 2 shows only three agents: OneAgent, Filebeat, and APM Agent. They may be replaced by other agents according to actual requirements and are not specifically limited here.
In practice, a business server application in the software and hardware integrated architecture, for example the AI server application shown in fig. 2, may receive a target service request sent by a user, such as an AI computation request or a data viewing request. Processing the request generally involves several server applications, several containers, and several physical machines in the architecture: when the architecture is divided into layers, several server applications in the application service layer, several containers in the container layer, and several physical machines in the hardware layer. The server applications may include, but are not limited to, the business server application; the containers may include, but are not limited to, the container in which the business server application is located (hereinafter the first container); and the physical machines may include, but are not limited to, the server cluster where the first container is located (hereinafter the first server cluster).
Provided that the business server application, the first container, and the first server cluster have all been instrumented, indexes can be collected from all three while the target service request is processed.
Taking the business server application to be an AI server application and the first server cluster to be an AI server cluster, as shown in fig. 2, the APM Agent in the AI server application may gather the performance index data collected for that application during the processing of the target service request and upload it, via several pieces of middleware such as the APM Server, to the corresponding database. That database may be a non-relational database, such as an ES database.
The Filebeat agent in the container where the AI server application is located may gather the performance index data collected for that container during the processing of the target service request and upload it, via middleware such as the Kafka message middleware and Logstash, to the corresponding database, which may likewise be the ES database mentioned above.
The OneAgent instances in the AI server cluster may gather the performance index data collected for that cluster during the processing of the target service request and upload it to the corresponding database, which may be a time-series database such as a Prometheus database.
Subsequently, the tracking system may acquire the multiple pieces of performance index data produced during the processing of the target service request, specifically by fetching them from the ES database and the Prometheus database.
Next, the tracking system may classify each piece of performance index data, that is, determine the problem category to which it belongs. In practice, several problem categories may be preset in the tracking system, for example normal, long interface time, abnormal interface call, cyclic function method reference, and so on. The tracking system determines, among these preset categories, the category to which the performance index data belongs.
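A minimal, hypothetical classification routine might look like this. The thresholds, metric names, and category strings are invented for the sketch; the patent only names the categories, not the rules behind them.

```python
def classify(index_data):
    # Map one piece of performance index data to a preset problem category.
    # The 1000 ms threshold is an illustrative assumption, not from the patent.
    if index_data.get("metric") == "interface_time_ms":
        if index_data["value"] > 1000:
            return "long interface time"
        return "normal"
    if index_data.get("metric") == "interface_error":
        return "abnormal interface call"
    return "normal"

print(classify({"metric": "interface_time_ms", "value": 4200}))  # → long interface time
```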
Then, for at least the performance index data exhibiting an abnormal condition, i.e., data whose problem category is abnormal, the tracking system may generate problem description information for each such piece of data and file it into the target knowledge base. Each problem description includes the corresponding performance index data and the problem category to which it belongs. The target knowledge base may be, for example, the knowledge base corresponding to the first server application that processes the target service request.
In some embodiments, the system architecture may further include an alarm system (also referred to as an alarm center); the tracking system may send the alarm system alarm notification messages related to the respective pieces of abnormal performance index data, so that the alarm system is made aware of problems occurring in the application.
Through the monitoring process described above, the tracking system can perform full link monitoring of the software and hardware integrated architecture: the server applications, containers, and physical machines can all be monitored, from the upper application service layer through the middle container layer down to the bottom hardware layer. The data a user wants can thus be collected from a full-link perspective, ultimately helping the user locate and analyze problems quickly.
The following describes specific implementation steps of the above method with reference to specific examples.
Referring to fig. 3, a schematic diagram of an embodiment of a full link monitoring method based on a software and hardware integrated architecture is shown. The method comprises the following steps:
step 302, a tracking system acquires a plurality of performance index data generated in a target service request processing process, wherein the performance index data are acquired by respectively acquiring indexes of a plurality of server applications, a plurality of containers and a plurality of physical machines related to a target service request in a software and hardware integrated framework;
step 304, the tracking system determines a problem category to which a first index data belongs under a service dimension for the first index data in the multiple performance index data;
step 306, the tracking system determines whether the first index data meets one of a plurality of preset alarm conditions;
step 310, the tracking system responds to the first index data meeting one of a plurality of preset alarm conditions, and generates problem description information of the first index data, wherein the problem description information comprises the first index data and a problem category to which the first index data belongs;
at step 312, the tracking system includes the problem description information in a target knowledge base.
In this embodiment, the software and hardware integrated architecture may include a plurality of physical machines, at least some of the physical machines may have a container built thereon, and the container may have a server application deployed therein. In particular, the software and hardware integrated architecture may be divided into multiple layers, such as a hardware layer, a container layer, and an application services layer. The plurality of physical machines may be located at a hardware layer, the container built on at least part of the physical machines may be located at a container layer, and the server application deployed in the container may be located at an application service layer.
The plurality of physical machines may include, for example, several server clusters. The several server clusters may include, for example, an AI server cluster, an edge server cluster, and a common server cluster. Further, the plurality of physical machines may further include network switches respectively corresponding to the several server clusters, such as RDMA network switches.
The at least part of the physical machines may be the plurality of server clusters. Containers, such as containers based on K8s (Kubernetes), may be built on the several server clusters. As an example, an AI server application may be deployed in one or more containers built on the AI server cluster. The edge server application can be deployed in one or more containers built on the edge server cluster. In one or more containers built on the common server cluster, a server application serving as middleware and/or a server application serving as a database may be deployed. It should be noted that the AI server application and the edge server application may be collectively referred to as service server applications.
In addition, the software and hardware integrated framework can be instrumented with probe burying points. In particular, probe burying can be performed in the server applications, the containers and the physical machines in the software and hardware integrated architecture. It is understood that when the software and hardware integrated architecture is divided into a hardware layer, a container layer and an application service layer, probe burying can be performed in multiple layers of the software and hardware integrated architecture, such as the hardware layer, the container layer and the application service layer.
It should be understood that the purpose of probe burying is to collect, during the target service request processing process, performance index data of certain performance indexes of the plurality of server applications, the plurality of containers and the plurality of physical machines involved in the target service request within the software and hardware integrated architecture. Furthermore, when the software and hardware integrated architecture is divided into a hardware layer, a container layer and an application service layer, the purpose of probe burying is to collect, during the target service request processing process, performance index data of certain performance indexes of the plurality of server applications located in the application service layer, the plurality of containers located in the container layer and the plurality of physical machines located in the hardware layer that are related to the target service request.
For a more detailed explanation of the software and hardware integrated architecture, reference may be made to the related description in the foregoing, and details are not repeated here.
The above steps are further explained below.
In step 302, a tracking system (e.g., the tracking system shown in fig. 1) may obtain a plurality of performance index data generated in a process of processing a target service request, where the performance index data may be obtained by respectively performing index collection on a plurality of server applications, a plurality of containers, and a plurality of physical machines related to the target service request in a software and hardware integrated architecture.
It should be noted that, when the software and hardware integrated architecture is divided into a hardware layer, a container layer, and an application service layer, the plurality of server applications may be located in the application service layer, the plurality of containers may be located in the container layer, and the plurality of physical machines may be located in the hardware layer. Based on this, the performance index data can be obtained by respectively performing index collection on the server applications located in the application service layer, the containers located in the container layer and the physical machines located in the hardware layer, which are related to the target service request.
The target service request may be, for example, an AI calculation request or a data viewing request, and is not limited in this respect. In addition, in the software and hardware integrated architecture, for example, in an application service layer of the software and hardware integrated architecture, there is a service server application corresponding to the service to which the target service request belongs. It will be appreciated that the business server application is a server application for processing a target service request. For convenience of description, the service end application is hereinafter referred to as a first service end application.
The plurality of server applications may include, for example, a first server application, the plurality of containers may include, for example, a first container in which the first server application is located, and the plurality of physical machines may include, for example, a first server cluster in which the first container is located. The first server cluster may be, for example, an AI server cluster or an edge server cluster.
The plurality of performance index data may include: performance index data of several performance indexes collected from the first server application during the target service request processing process, where the several performance indexes may include, for example, interface elapsed time; performance index data of several performance indexes collected from the first container, where the several performance indexes may include, for example, container utilization rate, container memory utilization rate, container CPU utilization rate, and/or container GPU utilization rate; and performance index data of several performance indexes collected from the first server cluster, where the several performance indexes may include, for example, CPU utilization of the server cluster and/or GPU utilization of the server cluster.
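As a concrete, non-authoritative illustration of the index data described above, the records collected from the three layers might be modelled as follows. The field names and metric names here are assumptions for illustration only, not part of the scheme itself:

```python
# Minimal sketch of a performance-index-data record spanning the
# application service layer, container layer and hardware layer.
from dataclasses import dataclass

@dataclass
class IndexData:
    source: str        # e.g. "server_application", "container", "physical_machine"
    metric: str        # e.g. "interface_elapsed_ms", "cpu_utilization"
    value: float
    trace_id: str = "" # tracking identifier of the service request, if known

# Example records produced while processing one target service request:
samples = [
    IndexData("server_application", "interface_elapsed_ms", 1240.0),
    IndexData("container", "container_cpu_utilization", 0.62),
    IndexData("physical_machine", "cluster_gpu_utilization", 0.88),
]
```

One record per (source, metric) pair keeps the three sets of indicator data uniform, so the later classification and alarm steps can treat them identically.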
Further, the physical machines may further include a network switch corresponding to the first server cluster, for example, an RDMA network switch. Based on this, the performance index data may further include performance index data of several performance indexes collected from the network switch in the target service request processing process, where the several performance indexes may include, for example, network throughput and the like.
In some embodiments, if the target service request processing procedure involves a call to middleware and/or a database, the plurality of server applications may further include the middleware and/or the database. For convenience of description, hereinafter, the middleware will be referred to as a first middleware, and the database will be referred to as a first database. In other words, the plurality of server applications may further include a first middleware and/or a first database through which the first server application passes in the process of processing the target service request. The first middleware may be middleware for data transmission, data search, and/or data analysis, among others. The first database may be, for example, a MySQL database or an ES database, etc.
Optionally, the plurality of containers may further include a second container in which the first middleware is located, and/or a third container in which the first database is located. Optionally, the plurality of physical machines may further include a second server cluster where the second container is located, and/or a third server cluster where the third container is located.
Typically, the tracking system may obtain the plurality of performance indicator data from a specified database. By way of example, the specified database may be contained in a software and hardware integrated architecture. Further, the designated database may include a second database and a third database in a software and hardware integrated architecture. Further, when the software and hardware integrated architecture is divided into a hardware layer, a container layer and an application service layer, the specified databases may include, for example, a second database (e.g., the ES database shown in fig. 2) and a third database (e.g., the Prometheus database shown in fig. 2) located at the application service layer, which may be respectively deployed in a container at the container layer.
The second database may store a plurality of performance indicator data (hereinafter, referred to as a first set of performance indicator data) respectively collected from the plurality of server applications during the processing of the target service request, and a plurality of performance indicator data (hereinafter, referred to as a second set of performance indicator data) respectively collected from the plurality of containers. The third database may store a plurality of performance indicator data (hereinafter referred to as a third set of performance indicator data) respectively collected from the plurality of physical machines during the processing of the target service request.
Based on this, the tracking system may obtain a first set of performance indicator data and a second set of performance indicator data from a second database, and a third set of performance indicator data from a third database.
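As a sketch of this retrieval step: the `_search` route below is the standard Elasticsearch search endpoint and `/api/v1/query` is the standard Prometheus HTTP-API instant-query endpoint, but the host addresses, index name and label names are illustrative assumptions. The helpers only build the requests, leaving the actual HTTP call to the caller:

```python
# Build the queries the tracking system might issue against the second
# database (ES) and the third database (Prometheus).
def es_search_request(es_host: str, index: str, trace_id: str):
    """Query for the first/second sets of indicator data by tracking identifier."""
    url = f"{es_host}/{index}/_search"
    body = {"query": {"term": {"trace.id": trace_id}}}
    return url, body

def prom_query_request(prom_host: str, metric: str, cluster: str):
    """Query for the third set of indicator data (physical-machine metrics)."""
    url = f"{prom_host}/api/v1/query"
    params = {"query": f'{metric}{{cluster="{cluster}"}}'}
    return url, params
```

The caller would then send these with any HTTP client and merge the three result sets before classification.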
In some embodiments, the server applications, the containers, and the physical machines may be respectively deployed with agents, and the first set of performance indicator data, the second set of performance indicator data, and the third set of performance indicator data may be uploaded to the corresponding databases by the corresponding agents.
As an example, the several server applications may each be deployed with an agent program such as APM Agent, and the several server applications may upload their respective performance index data to the second database via their respective agent programs. The several containers may each be deployed with an agent program such as Filebeat, and the several containers may upload their respective performance index data to the second database via their respective agent programs. The several physical machines may each be deployed with an agent program such as OneAgent, RDMA Exporter, or Node Exporter, and the several physical machines may upload their respective performance index data to the third database via their respective agent programs.
In some embodiments, the application services layer may be subdivided into an application layer, as well as a middleware and database layer. The middleware and the database layer may be referred to as a middleware layer for short. The service server application in the software and hardware integrated architecture can be located in an application layer, and the middleware and the database in the software and hardware integrated architecture can be located in a middleware layer.
As an example, the application layer may introduce an APM Agent; that is, the service server application in the application layer may be deployed with the APM Agent, and the APM Agent may upload the collected performance index data to the second database. A Telegraf agent can be introduced into the middleware layer; that is, the middleware and/or the database in the middleware layer can be deployed with the Telegraf agent, and the Telegraf agent can upload the collected performance index data to the third database. In practice, Telegraf is a plugin-driven server agent that collects and reports metrics.
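As a non-authoritative sketch of such an agent, a Telegraf configuration collecting host metrics might look like the following. The input and output plugin names are real Telegraf plugins; the listen port is an assumption, and one common way of getting the metrics to the Prometheus side is the `prometheus_client` output, which exposes them for scraping:

```toml
# Collect basic host metrics (input plugins shipped with Telegraf).
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

# Expose the collected metrics on an HTTP endpoint in Prometheus
# exposition format, from which the third database can obtain them.
[[outputs.prometheus_client]]
  listen = ":9273"
```

The classification and alarm logic downstream is indifferent to which agent produced a given metric, as long as the records land in the designated databases.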
It should be noted that an agent program may upload the performance index data to the corresponding database either directly or via several pieces of middleware. For example, as shown in FIG. 2, the APM Agent may upload performance index data to the ES database via the APM Server middleware. The Filebeat agent program can upload the performance index data to the ES database through the Kafka message middleware and the Logstash middleware in sequence. The OneAgent agent program can upload the performance index data to the Prometheus database directly.
Alternatively, if the Telegraf agent is deployed in the middleware and/or database, the Telegraf agent may, for example, upload the performance indicator data directly to the Prometheus database.
For convenience of description, the performance index data among the above-described plurality of performance index data may be referred to as first index data. Next, in step 304, the tracking system may determine a problem category to which the first metric data belongs in a service dimension (also referred to as a business dimension).
Specifically, a plurality of problem categories may be preset in the service dimension, and the problem categories may include, for example, normal, excessively long interface response time, abnormal interface call, circular function/method reference, and the like. In addition, the plurality of problem categories may each correspond to a classification rule. For a problem category of the plurality of problem categories, the tracking system may determine whether the first index data belongs to that problem category according to the classification rule corresponding to the problem category.
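A minimal sketch of such rule-based classification follows; the category names mirror the examples above, while the metric names and thresholds are illustrative assumptions rather than values prescribed by the scheme:

```python
# Map each metric to (problem category, classification rule).
RULES = {
    "interface_elapsed_ms": ("interface time too long", lambda v: v > 1000),
    "interface_error_count": ("abnormal interface call", lambda v: v > 0),
    "call_depth": ("circular function/method reference", lambda v: v > 100),
}

def classify(metric: str, value: float) -> list:
    """Return the problem categories the index data belongs to ('normal' if none)."""
    categories = []
    rule = RULES.get(metric)
    if rule is not None:
        category, predicate = rule
        if predicate(value):
            categories.append(category)
    return categories or ["normal"]
```

Because a rule is evaluated per metric, one piece of index data can be attributed to several categories simply by registering several rules for the same metric.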
It is noted that the first index data may be attributed to one or more problem categories. Generally, an abnormal problem category may be considered a problem or fault, and the performance index data attributed to that problem category may be considered the cause of the problem or fault. Taking a fault as an example, when the first index data belongs to a plurality of problem categories, the plurality of problem categories may reflect both a direct fault caused by the condition indicated by the first index data and an indirect fault, where the indirect fault is a fault induced by the direct fault.
In some embodiments, to facilitate a technician locating a problem, in step 304, a technology category to which the first metric data belongs in the technology dimension may also be determined.
In one example, the technology classes may be divided according to a plurality of layers as previously described. Based on this, a single technology class may be an application service layer, a container layer, or a hardware layer. For example, when the first index data originates from an application service layer, the technology class to which the first index data belongs may be the application service layer. When the first index data is derived from the container layer, the technology class to which the first index data belongs may be the container layer. When the first index data originates from a hardware layer, the technology class to which the first index data belongs may be the hardware layer. Further, when the application service layer is subdivided into an application layer and a middleware layer, a single technology class may be the application layer, the middleware layer, the container layer, or the hardware layer.
In another example, technology classes may be divided according to specific sources of performance indicator data (e.g., server applications, containers, or physical machines). Based on this, a single technology class may be, for example, an application, a container, or a physical machine. Alternatively, a single technology class may be, for example, a business server application, middleware, a database, a container, or a physical machine.
It should be understood that the problem category and the technical category can be set according to actual requirements, and are not specifically limited herein.
Next, to facilitate recording some abnormal performance index data for real-time viewing, the tracking system may execute step 306 to determine whether the first index data satisfies one of a plurality of preset alarm conditions. When the first index data satisfies one of the plurality of alarm conditions, the tracking system may perform step 310 to generate problem description information for the first index data, and then perform step 312 to classify the problem description information into the target knowledge base. The target knowledge base may be, for example, a knowledge base corresponding to the first server application for processing the target service request.
In practice, when the performance index data meets a certain alarm condition, it may indicate that the performance index data is abnormal and needs to be paid special attention, and the performance index data needs to be recorded.
The plurality of alarm conditions may be set based on the performance index as described above. For example, the plurality of alarm conditions may include CPU utilization being greater than a first threshold, GPU utilization being greater than a second threshold, interface elapsed time being greater than an elapsed time threshold, network throughput being below a throughput threshold, and so on. It should be understood that the specific contents of the above alarm conditions may be set according to actual requirements, and are not specifically limited herein.
The problem description information may include first index data corresponding to the problem description information, and a problem category to which the first index data belongs. Further, if the technology category to which the first index data belongs is also determined in step 304, the problem description information may further include the technology category to which the first index data belongs.
Taking as an example that the first index data is a CPU utilization rate of 85% for the AI server cluster, that the plurality of alarm conditions include CPU utilization greater than 80%, and that the problem category to which the first index data belongs includes CPU overload: the first index data clearly satisfies the alarm condition that the CPU utilization rate is greater than 80%, and problem description information of the first index data such as "CPU overload: the CPU utilization rate of the AI server cluster is 85%" can be generated.
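A sketch combining steps 306 and 310 follows: check the preset alarm conditions and, when one is satisfied, generate the problem description information. The thresholds follow the examples in the text, while the metric and field names are assumptions for illustration:

```python
# Preset alarm conditions keyed by metric name.
ALARM_CONDITIONS = {
    "cpu_utilization": lambda v: v > 0.80,       # CPU utilization > 80%
    "gpu_utilization": lambda v: v > 0.90,       # assumed GPU threshold
    "interface_elapsed_ms": lambda v: v > 1000,  # assumed elapsed-time threshold
}

def check_and_describe(source: str, metric: str, value: float, category: str):
    """Return problem description information, or None if no alarm condition is met."""
    condition = ALARM_CONDITIONS.get(metric)
    if condition is None or not condition(value):
        return None
    shown = f"{value:.0%}" if metric.endswith("utilization") else str(value)
    return {
        "index_data": {"source": source, "metric": metric, "value": value},
        "problem_category": category,
        "text": f"{category}: the {metric} of the {source} is {shown}",
    }
```

For the example in the text, `check_and_describe("AI server cluster", "cpu_utilization", 0.85, "CPU overload")` yields a record whose text reads "CPU overload: the cpu_utilization of the AI server cluster is 85%", which can then be classified into the target knowledge base.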
In some embodiments, in order to facilitate the user to quickly know the problem existing in the whole link, after step 302, and before step 310, the tracking system may also obtain a tracking identifier (which may be called a traceid) corresponding to the target service request. Based on this, the issue description information may also include the tracking identification.
The trace identification may be generated from an identification (which may be referred to as a span) of each invocation object (e.g., the first service-side application, the first container, the first server cluster, etc., as previously described) in the invocation chain of the target service request, and the trace identification is globally unique. In particular, the tracking identifier may be generated by the first service-side application, and the tracking system may obtain the tracking identifier from the first service-side application, for example.
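The text says only that the tracking identifier is generated from the spans of the call chain and is globally unique; the exact scheme is not specified. One illustrative way to satisfy both properties, as an assumption, is to derive the traceid deterministically by hashing the ordered span identifications:

```python
# Hypothetical traceid derivation: hash the ordered span ids of the
# call chain (first server application, first container, first server
# cluster, ...) into a short, globally unique-enough identifier.
import hashlib

def make_trace_id(span_ids: list) -> str:
    joined = "/".join(span_ids)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:16]
```

Determinism means every piece of problem description information produced along the same call chain carries the same traceid, which is what makes the later full-link lookup possible.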
Subsequently, the tracking system may, for example, receive a data viewing request including the tracking identifier, obtain the problem description information including the tracking identifier from the target knowledge base, and return the obtained problem description information. In this way, the user sending the data viewing request can view the problem description information for the whole link, and, based on it, quickly learn the problems existing across the whole link, the causes of those problems, and the performance bottlenecks of the application.
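Serving such a data viewing request can be sketched as a lookup of all problem description entries carrying the requested traceid; here the target knowledge base is modelled, as an assumption, as a plain list of dict records:

```python
# Look up every problem description entry for one tracking identifier.
def view_by_trace_id(knowledge_base: list, trace_id: str) -> list:
    return [entry for entry in knowledge_base if entry.get("trace_id") == trace_id]

# Example knowledge-base contents (illustrative):
kb = [
    {"trace_id": "abc", "problem_category": "CPU overload"},
    {"trace_id": "abc", "problem_category": "interface time too long"},
    {"trace_id": "def", "problem_category": "normal"},
]
```

Because all entries along one call chain share a traceid, a single lookup returns the full-link view of the problems for that service request.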
In some embodiments, to facilitate the alert system being immediately aware of a problem with the user application, after step 306 is performed, or before or while step 310 is performed, the tracking system may perform step 308 to send an alert notification message associated with the first indicator data to the alert system in response to the first indicator data satisfying one of the plurality of alert conditions.
As an example, the alarm notification message may be generated according to the problem category to which the first index data belongs. Continuing with the example in which the first index data is a CPU utilization rate of 85% for the AI server cluster, the plurality of alarm conditions include CPU utilization greater than 80%, and the problem category to which the first index data belongs includes CPU overload: the first index data clearly satisfies the alarm condition that the CPU utilization rate is greater than 80%, and an alarm notification message related to the first index data, such as "CPU overload on the AI server cluster", may be sent to the alarm system.
It should be understood that both the problem description information and the alarm notification message may be generated according to a preset generation rule, and the generation rule may be set according to an actual requirement, and is not specifically limited herein.
In the full link monitoring method based on the software and hardware integrated architecture provided in the embodiment corresponding to FIG. 3, by performing probe burying on the software and hardware integrated architecture in advance, index collection can be performed, during the target service request processing process, on the plurality of server applications, the plurality of containers and the plurality of physical machines related to the target service request in the software and hardware integrated architecture. For example, when the software and hardware integrated architecture is divided into a hardware layer, a container layer and an application service layer, performance index collection is performed on the plurality of server applications located in the application service layer, the plurality of containers located in the container layer and the plurality of physical machines located in the hardware layer that are related to the target service request. The collected performance index data can then be classified, and when a piece of performance index data is abnormal, the problem description information formed from that performance index data and its category can be classified into the target knowledge base. This makes it convenient for a user to quickly learn, by viewing the problem description information in the target knowledge base, the problems existing in the application, the causes of those problems, the performance bottlenecks, and the like. In this way, full link monitoring oriented to the software and hardware integration scheme can be realized.
In addition, the full link monitoring scheme provided in the embodiments of the present specification is an underlying, foundational technology that can well support various upper-layer products, such as AI training and inference related products, a virtualized open computing platform, and storage hyper-convergence related products. It can collect the performance index data a user needs from the view angle of the full link, and can ultimately help the user locate and analyze problems quickly.
In addition, as described above, the various performance indexes are indexes of particular concern to users in scenarios such as AI and edge computing. The full link monitoring scheme provided in the embodiments of the present specification realizes acquisition of these performance indexes by combining the application service layer, the container layer and the hardware layer, and performs processing such as classification based on the acquired performance index data, so that deep customization for customer requirements in scenarios such as AI and edge computing can be well realized.
With further reference to fig. 4, the present specification provides one embodiment of a full link monitoring apparatus based on a software and hardware integrated architecture, which may be applied to a tracking system (e.g., the tracking system shown in fig. 2).
As shown in fig. 4, the full-link monitoring apparatus 400 based on the software and hardware integrated architecture of the present embodiment includes: an acquisition unit 401, a classification unit 402 and a processing unit 403. The obtaining unit 401 is configured to obtain a plurality of performance index data generated in a target service request processing process, where the performance index data is obtained by performing index collection on a plurality of server applications, a plurality of containers, and a plurality of physical machines related to the target service request in a software and hardware integrated architecture, respectively; the classification unit 402 is configured to, for a first index data of the plurality of performance index data, determine a problem category to which the first index data belongs in the service dimension; the processing unit 403 is configured to generate problem description information of the first index data in response to the first index data satisfying one of a plurality of preset alarm conditions, and to classify the problem description information into a target knowledge base, the problem description information including the first index data and the problem category.
In some embodiments, the software and hardware integrated architecture may be divided into an application service layer, a container layer, and a hardware layer, the server applications may be located in the application service layer, the containers may be located in the container layer, and the physical machines may be located in the hardware layer.
In some embodiments, the processing unit 403 may also be configured to: and sending an alarm notification message related to the first index data to an alarm system when the first index data meets one of the plurality of alarm conditions.
In some embodiments, the classification unit 402 may also be configured to: determining a technology class to which the first index data belongs in a technology dimension; the above-mentioned problem description information may also include the technical category.
In some embodiments, the technology classes described above may be an application service layer, a container layer, or a hardware layer.
In some embodiments, the technology classes described above may be applications, containers, or physical machines.

In some embodiments, the obtaining unit 401 may be further configured to: acquire a tracking identifier corresponding to the target service request; the problem description information may also include the tracking identifier.
In some embodiments, the apparatus 400 may further include: a receiving unit (not shown in the figure) and a transmitting unit (not shown in the figure); the receiving unit may be configured to receive a data viewing request including the tracking identifier after the processing unit 403 puts the problem description information into the target knowledge base; the obtaining unit 401 may be further configured to obtain the problem description information including the tracking identifier from the target knowledge base; the sending unit may be configured to return the problem description information acquired by the acquisition unit 401.
In some embodiments, the plurality of server applications may include a first server application for processing the target service request; the plurality of containers may include a first container in which the first service application is located; the plurality of physical machines may include a first server cluster in which the first container is located. Further, the plurality of physical machines may further include a network switch corresponding to the first server cluster.
In some embodiments, the first server cluster may be an AI server cluster or an edge server cluster.
In some embodiments, the plurality of server applications may further include a first middleware and/or a first database through which the first server application passes in the process of processing the target service request; the plurality of containers can also comprise a second container where the first middleware is located and/or a third container where the first database is located; the plurality of physical machines may further include a second server cluster where the second container is located, and/or a third server cluster where the third container is located.
In some embodiments, the software and hardware integrated architecture may further include a second database and a third database. Further, when the software and hardware integrated architecture is divided into a hardware layer, a container layer and an application service layer, the application service layer may further include a second database and a third database, and the second database and the third database may be respectively deployed in a container of the container layer. The second database may store a first set of performance indicator data collected from the plurality of server applications, respectively, and a second set of performance indicator data collected from the plurality of containers, respectively, during processing of the target service request. The third database may store a third set of performance indicator data collected from the plurality of physical machines during processing of the target service request. The obtaining unit 401 may be further configured to: the first and second sets of performance indicator data are obtained from a second database, and a third set of performance indicator data is obtained from a third database.
In some embodiments, the server applications, the containers, and the physical machines are respectively deployed with agents, and the first set of performance indicator data, the second set of performance indicator data, and the third set of performance indicator data may be uploaded to corresponding databases by corresponding agents.
In the embodiment of the apparatus corresponding to fig. 4, the detailed processing of each unit and the technical effect thereof can refer to the related description in the method embodiment in the foregoing, and are not repeated herein.
The embodiments of the present specification further provide a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the full link monitoring method based on the software and hardware integrated architecture described in the foregoing method embodiments.
The embodiments of the present specification further provide a computing device including a memory and a processor, where the memory stores executable code, and the processor, when executing the executable code, implements the full link monitoring method based on the software and hardware integrated architecture described in the foregoing method embodiments.
Embodiments of the present disclosure also provide a computer program which, when executed in a computer, causes the computer to perform the full link monitoring method based on the software and hardware integrated architecture described in the foregoing method embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The foregoing further describes in detail the objects, technical solutions, and advantages of the embodiments disclosed in this specification. It should be understood that the above are only specific embodiments and are not intended to limit the scope of the disclosed embodiments; any modifications, equivalent substitutions, improvements, and the like made on the basis of these technical solutions shall fall within that scope.
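The core method loop — classify each index datum under the service dimension, check it against preset alarm conditions, and file problem description information into a knowledge base — can be sketched as follows. The metric names, thresholds, and category mapping are hypothetical placeholders, not values from the patent.

```python
# Hypothetical alarm conditions: metric name -> predicate on its value.
ALARM_CONDITIONS = {
    "rt_ms": lambda v: v > 500,
    "cpu_pct": lambda v: v > 90,
}

# Hypothetical service-dimension problem categories per metric.
PROBLEM_CATEGORIES = {
    "rt_ms": "slow-response",
    "cpu_pct": "resource-saturation",
}

def monitor(index_data, knowledge_base):
    """For each performance index datum, determine its problem category
    and, if it satisfies an alarm condition, classify a problem
    description (datum + category) into the knowledge base."""
    for datum in index_data:
        name, value = datum["metric"], datum["value"]
        category = PROBLEM_CATEGORIES.get(name, "uncategorized")
        condition = ALARM_CONDITIONS.get(name)
        if condition and condition(value):
            knowledge_base.append({
                "index_data": datum,
                "problem_category": category,
            })

kb = []
monitor([{"metric": "rt_ms", "value": 800},    # trips the alarm
         {"metric": "cpu_pct", "value": 40}],  # within threshold
        kb)
```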

Claims (15)

1. A full link monitoring method based on a software and hardware integrated architecture comprises the following steps:
acquiring a plurality of performance index data generated during processing of a target service request, wherein the performance index data are obtained by respectively performing index collection on a plurality of server applications, a plurality of containers, and a plurality of physical machines involved with the target service request in a software and hardware integrated architecture;
for a first index data of the plurality of performance index data, determining a problem category to which the first index data belongs in a service dimension;
in response to the first index data satisfying one of a plurality of preset alarm conditions, generating problem description information of the first index data, and classifying the problem description information into a target knowledge base, wherein the problem description information includes the first index data and the problem category.
2. The method of claim 1, further comprising, when the first index data satisfies the one of the plurality of alarm conditions:
sending an alarm notification message related to the first index data to an alarm system.
3. The method of claim 1, wherein before said generating problem description information of the first index data, the method further comprises:
determining a technology category to which the first index data belongs in a technology dimension;
the problem description information further includes the technology category.
4. The method of claim 3, wherein the technology category is an application, a container, or a physical machine.
5. The method of claim 1, wherein after said acquiring a plurality of performance index data generated during processing of the target service request, the method further comprises:
acquiring a tracking identifier corresponding to the target service request;
the problem description information further includes the tracking identifier.
6. The method of claim 5, wherein after said classifying the problem description information into the target knowledge base, the method further comprises:
receiving a data viewing request, wherein the data viewing request includes the tracking identifier;
acquiring, from the target knowledge base, problem description information that includes the tracking identifier; and
returning the acquired problem description information.
7. The method of claim 1, wherein,
the plurality of server applications comprise a first server application used for processing the target service request;
the plurality of containers comprise a first container in which the first server application is located;
the plurality of physical machines comprise a first server cluster where the first container is located and a network switch corresponding to the first server cluster.
8. The method of claim 7, wherein the first server cluster is an AI server cluster or an edge server cluster.
9. The method of claim 7, wherein,
the plurality of server applications further comprise first middleware and/or a first database traversed by the first server application in the process of processing the target service request;
the plurality of containers further comprise a second container where the first middleware is located and/or a third container where the first database is located; and
the plurality of physical machines further comprise a second server cluster where the second container is located and/or a third server cluster where the third container is located.
10. The method according to one of claims 1 to 9, wherein the software and hardware integrated architecture is divided into an application service layer, a container layer and a hardware layer, the plurality of server applications are located in the application service layer, the plurality of containers are located in the container layer, and the plurality of physical machines are located in the hardware layer.
11. The method of claim 10, wherein the application service layer further comprises a second database and a third database, the second database and the third database being respectively deployed in containers of the container layer, the second database storing a first set of performance index data respectively collected from the plurality of server applications and a second set of performance index data respectively collected from the plurality of containers during processing of the target service request, and the third database storing a third set of performance index data respectively collected from the plurality of physical machines during processing of the target service request; and
the acquiring of the plurality of performance index data generated during processing of the target service request includes:
obtaining the first set of performance index data and the second set of performance index data from the second database, and obtaining the third set of performance index data from the third database.
12. The method of claim 11, wherein the plurality of server applications, the plurality of containers, and the plurality of physical machines are respectively deployed with agents, and the first set of performance index data, the second set of performance index data, and the third set of performance index data are uploaded to the respective databases by the respective agents.
13. A full link monitoring device based on a software and hardware integrated architecture, comprising:
an obtaining unit configured to acquire a plurality of performance index data generated during processing of a target service request, wherein the performance index data are obtained by respectively performing index collection on a plurality of server applications, a plurality of containers, and a plurality of physical machines involved with the target service request in a software and hardware integrated architecture;
a classification unit configured to determine, for a first index data of the plurality of performance index data, a problem category to which the first index data belongs under a service dimension;
a processing unit configured to, in response to the first index data satisfying one of a plurality of preset alarm conditions, generate problem description information of the first index data and classify the problem description information into a target knowledge base, wherein the problem description information includes the first index data and the problem category.
14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-12.
15. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-12.
CN202111481950.7A 2021-12-06 2021-12-06 Full link monitoring method and device based on software and hardware integrated architecture Pending CN114265758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111481950.7A CN114265758A (en) 2021-12-06 2021-12-06 Full link monitoring method and device based on software and hardware integrated architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111481950.7A CN114265758A (en) 2021-12-06 2021-12-06 Full link monitoring method and device based on software and hardware integrated architecture

Publications (1)

Publication Number Publication Date
CN114265758A true CN114265758A (en) 2022-04-01

Family

ID=80826407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111481950.7A Pending CN114265758A (en) 2021-12-06 2021-12-06 Full link monitoring method and device based on software and hardware integrated architecture

Country Status (1)

Country Link
CN (1) CN114265758A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016149A (en) * 2023-03-27 2023-04-25 天津布尔科技有限公司 Internet of vehicles data acquisition full-link monitoring architecture
CN116929781A (en) * 2023-06-12 2023-10-24 广州汽车集团股份有限公司 Vehicle evaluation method, cloud platform, vehicle and storage medium
CN116739668A (en) * 2023-06-26 2023-09-12 云洞(上海)科技股份有限公司 Advertisement delivery analysis method based on full link
CN116739668B (en) * 2023-06-26 2024-03-12 紫灿科技(上海)股份有限公司 Advertisement delivery analysis method based on full link
CN117632660A (en) * 2023-12-12 2024-03-01 北京衡石科技有限公司 Application detection method, device, electronic equipment and computer readable medium
CN117632660B (en) * 2023-12-12 2024-05-28 北京衡石科技有限公司 Application detection method, device, electronic equipment and computer readable medium
CN117714327A (en) * 2024-02-05 2024-03-15 神州灵云(北京)科技有限公司 Method, system, equipment and medium for tracking performance index of full-link service request

Similar Documents

Publication Publication Date Title
CN114265758A (en) Full link monitoring method and device based on software and hardware integrated architecture
US8819701B2 (en) Cloud computing monitoring and management system
CN111294217B (en) Alarm analysis method, device, system and storage medium
US8972787B2 (en) Dynamic collection of instrumentation data
CN111309550A (en) Data acquisition method, system, equipment and storage medium of application program
CN111585840B (en) Service resource monitoring method, device and equipment
CN111124830B (en) Micro-service monitoring method and device
CN112313627B (en) Mapping mechanism of event to serverless function workflow instance
US10372572B1 (en) Prediction model testing framework
CN112737800A (en) Service node fault positioning method, call chain generation method and server
CN114745295A (en) Data acquisition method, device, equipment and readable storage medium
CN114172949A (en) Micro-service link monitoring and tracking method and system
US10789307B2 (en) Cloud-based discovery and inventory
CN114090378A (en) Custom monitoring and alarming method based on Kapacitor
CN113672500A (en) Deep learning algorithm testing method and device, electronic device and storage medium
CN112100239A (en) Portrait generation method and apparatus for vehicle detection device, server and readable storage medium
JP2015225473A (en) On-vehicle information system and information processing method therefor
CN111124891B (en) Method and device for detecting access state, storage medium and electronic device
US10848371B2 (en) User interface for an application performance management system
CN110245120B (en) Stream type computing system and log data processing method thereof
CN110336889B (en) Intelligent monitoring platform and monitoring method for operation in numerical weather forecast mode
CN117370053A (en) Information system service operation-oriented panoramic monitoring method and system
CN111324583B (en) Service log classification method and device
CN108334524A (en) A kind of storm daily records error analysis methodology and device
WO2023224764A1 (en) Multi-modality root cause localization for cloud computing systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination