CN116909831A - Machine learning for rule recommendation - Google Patents

Machine learning for rule recommendation

Info

Publication number
CN116909831A
Authority
CN
China
Prior art keywords
rule
alert
metric
rules
temporal
Prior art date
Legal status
Pending
Application number
CN202211705865.9A
Other languages
Chinese (zh)
Inventor
拉贾·科穆拉
加内什·比亚戈蒂·马塔德·桑卡达
蒂姆纳万·斯里达
劳伦斯·克罗登·洛博
拉杰·亚瓦特卡尔
Current Assignee
Juniper Networks Inc
Original Assignee
Juniper Networks Inc
Priority date
Filing date
Publication date
Priority claimed from US17/810,185 external-priority patent/US20230333956A1/en
Application filed by Juniper Networks Inc filed Critical Juniper Networks Inc
Publication of CN116909831A publication Critical patent/CN116909831A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3089Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3089Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3096Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents wherein the means or processing minimize the use of computing system or of computing system component resources, e.g. non-intrusive monitoring which minimizes the probe effect: sniffing, intercepting, indirectly deriving the monitored data from other directly available data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The present disclosure relates to machine learning for rule recommendation. A performance monitoring system includes a metrics collector configured to receive, from a metrics exporter, telemetry data including metrics related to a network of computing devices. A metrics time-series database stores the collected metrics. An alert rule evaluator service is configured to evaluate rules using the stored metrics. The performance monitoring system may include a machine learning module and may be configured to automatically determine recommended alert rules.

Description

Machine learning for rule recommendation
RELATED APPLICATIONS
The present application claims priority from U.S. application Ser. No. 17/810,185, filed June 30, 2022, and from Indian provisional patent application Ser. No. 202241022566, filed April 16, 2022, both of which are incorporated herein by reference in their entireties.
Technical Field
The present disclosure relates to computer networks, and more particularly to improving the acquisition and evaluation of telemetry data in computer networks.
Background
In a typical cloud data center environment, there is a large collection of interconnected servers that provide computing and/or storage capacity to run various applications. For example, a data center includes facilities that host applications and services for subscribers (i.e., clients of the data center). The data center may, for example, host all infrastructure equipment, such as networking and storage systems, redundant power supplies, and environmental controls. In a typical data center, clusters of storage servers and application servers (computer nodes) are interconnected via a high-speed switching fabric provided by one or more layers of physical network switches and routers. More sophisticated data centers provide infrastructure spread throughout the world, with subscriber support equipment located in various physical hosting facilities.
Connectivity between the servers and the switching fabric occurs at hardware modules known as Network Interface Cards (NICs). Conventional NICs include Application Specific Integrated Circuits (ASICs) that provide some basic layer 2/layer 3 (L2/L3) functionality to perform packet forwarding. In a conventional NIC, packet processing, policing, and other advanced functions, known as the "data path," are performed by the host CPU, i.e., the CPU of the server that includes the NIC. As a result, the CPU resources in the server are shared by the applications running on this server as well as by the data path processing. For example, in a 4-core x86 server, one of these cores may be reserved for the data path, leaving 3 cores (or 75% of the CPU) for the applications and the host operating system.
A performance monitoring system is capable of monitoring data center performance. Telemetry data includes a variety of metrics about network elements/nodes that can be transmitted to a metrics collector of a typical centralized remote performance monitoring system for evaluation according to various rules. This enables a user (such as a network administrator) to measure and evaluate many different performance measurements for the network, such as CPU usage, memory usage, totals of network devices and applications, link and node utilization, network congestion, and the like.
Some NIC vendors have begun to incorporate additional processing units into the NIC itself to offload some of the data path processing from the host CPU to the NIC. The processing units in the NIC may be, for example, multi-core ARM processors, with some hardware acceleration provided by a Data Processing Unit (DPU), a Field Programmable Gate Array (FPGA), and/or an ASIC. NICs that include such enhanced data path processing capabilities are often referred to as smart NICs and can provide additional processing capacity that facilitates the transmission of telemetry data.
Disclosure of Invention
In general, techniques are described for computing infrastructure performance monitoring systems that use machine learning to provide improved metric acquisition sampling intervals, improved rule evaluation intervals, and/or ongoing rule recommendations, in order to conserve network resources and surface more relevant data for a better understanding of the network.
The performance monitoring system comprises a collector for acquiring telemetry data (metrics) relating to compute nodes in the network, and an alert rule evaluator service for analyzing the telemetry data according to alert rules that determine whether an alert should be generated based on the telemetry data. Telemetry data is collected at predetermined acquisition sampling intervals, and rules are evaluated at predetermined rule evaluation intervals.
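For illustration only, and not as part of the disclosed embodiments, the following Python sketch shows one way such a metrics store and alert rule evaluator could be modeled; the class names, fields, and threshold semantics are hypothetical.

    # Hypothetical sketch of a metrics store plus alert rule evaluation (names are illustrative).
    import time
    from dataclasses import dataclass, field

    @dataclass
    class AlertRule:
        name: str
        metric: str            # metric the rule is evaluated against
        threshold: float       # rule "hits" when the latest sample exceeds this value
        eval_interval: float   # seconds between rule evaluations

    @dataclass
    class MetricStore:
        samples: dict = field(default_factory=dict)   # metric name -> list of (timestamp, value)

        def append(self, metric: str, value: float) -> None:
            self.samples.setdefault(metric, []).append((time.time(), value))

        def latest(self, metric: str):
            series = self.samples.get(metric)
            return series[-1][1] if series else None

    def evaluate(rule: AlertRule, store: MetricStore) -> bool:
        """Return True (a 'hit') when the rule condition is satisfied."""
        value = store.latest(rule.metric)
        return value is not None and value > rule.threshold

    store = MetricStore()
    store.append("cpu_util_percent", 93.0)
    rule = AlertRule("high_cpu", "cpu_util_percent", threshold=90.0, eval_interval=60.0)
    print(evaluate(rule, store))   # True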
Rather than acquiring telemetry data at a fixed static rate (where the collector uses static sampling intervals to acquire metrics), the performance monitoring system described in this disclosure may reduce the storage space required to store metrics and avoid unnecessary acquisition of metric data that is not relevant to a given use case and/or context. That is, collectors that use static sampling intervals to collect metrics may have disadvantages related to the amount of storage space required to store metrics, especially in large data centers that include many compute nodes and many alert rules. Additional disadvantages may include collecting metrics that are not relevant to the user, requiring significant computing power to search through metrics that may not be relevant to the user, and collecting redundant samples when metric values do not change much over time. Thus, when more relevant metrics are acquired, the metrics collector may stop acquiring, at the same sampling interval, many of the less useful metrics. The above-listed problems may become more apparent and more troublesome as metrics are collected in a scaled network environment.
In an example, a performance monitoring system implementing aspects of the techniques described in this disclosure may utilize machine learning to determine one or more metric attribute correlations that represent the usefulness of a metric to a user, and to predict a metric weight and an optimal acquisition sampling rate for that metric. In this respect, rather than acquiring metrics at a fixed static time interval, the performance monitoring system of the present disclosure can efficiently identify and optimize the acquisition sampling rate of each metric.
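The following Python fragment is a minimal sketch of this idea, assuming hypothetical metric attributes (number of referencing rules, recent variance, user query count) and a simple weighted score standing in for the trained machine learning model; it is not the claimed method.

    # Illustrative only: predict a metric weight (0..1) and map it to a sampling interval.
    # The attribute set and the linear scoring below are assumptions, not the patented technique.
    def predict_metric_weight(rule_references: int, recent_variance: float, user_queries: int) -> float:
        # Stand-in for a trained ML model; a real system might use any regressor here.
        score = (0.5 * min(rule_references, 10) / 10
                 + 0.3 * min(recent_variance, 1.0)
                 + 0.2 * min(user_queries, 10) / 10)
        return max(0.0, min(1.0, score))

    def sampling_interval(weight: float, min_interval: float = 10.0, max_interval: float = 600.0) -> float:
        # Higher weight (more useful metric) -> shorter interval (sampled more often).
        return max_interval - weight * (max_interval - min_interval)

    weight = predict_metric_weight(rule_references=4, recent_variance=0.6, user_queries=2)
    print(round(sampling_interval(weight), 1))   # seconds between samples for this metric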
Further, the periodic rule evaluation process involves computationally intensive tasks such as querying massive amounts of telemetry data, aggregating the telemetry data, and comparing the aggregated data to multiple thresholds. When a large number of rules are configured in a computing-resource-constrained environment, a rule evaluation process that uses static evaluation intervals may have difficulty evaluating the rules properly. Furthermore, computing resources may be wasted when too many rules are handled by an overloaded system. To circumvent these scaling issues, administrators typically limit the number of rules they configure or increase the rule evaluation interval (i.e., decrease the rule evaluation rate).
By employing optimized rule evaluation intervals (i.e., using different rule evaluation intervals) so that rules may be evaluated at different frequencies, for example based on their past evaluation success or failure (hit or miss) rates, the performance monitoring system of the present disclosure may avoid the above-described problems of static evaluation intervals. When a rule has missed for a long time, a solution based on a fixed evaluation rate wastes resources, because the probability that an upcoming evaluation will hit is low.
In contrast, the performance monitoring system provides many advantages: it enables a machine learning based, intelligent process for rule evaluation, wherein alert rules are periodically evaluated at optimized rule evaluation intervals that may change over time as network conditions change. The evaluation interval of a rule may be assigned based on a determined weight of the rule. The determined weight of a rule may indicate the priority of the rule and may be inversely proportional to the desired rule evaluation interval. In other words, when the rule weight is higher, the corresponding evaluation interval is smaller; when the rule weight is lower, the corresponding evaluation interval is larger. A machine learning model and past rule evaluation data may be used to predict the rule weights.
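A minimal sketch of such an inverse mapping is shown below, with a smoothed hit rate standing in for the machine-learning-predicted rule weight; the interval bounds are illustrative assumptions.

    # Illustrative sketch: derive a rule weight from past hit/miss history and
    # assign an evaluation interval inversely proportional to that weight.
    def rule_weight(hits: int, misses: int, smoothing: float = 1.0) -> float:
        # Laplace-smoothed hit rate stands in for the ML-predicted weight.
        return (hits + smoothing) / (hits + misses + 2 * smoothing)

    def evaluation_interval(weight: float, base_interval: float = 60.0,
                            min_interval: float = 15.0, max_interval: float = 900.0) -> float:
        # Inverse relationship: larger weight -> smaller interval (evaluated more often).
        return min(max_interval, max(min_interval, base_interval / max(weight, 1e-6)))

    frequent_hitter = rule_weight(hits=40, misses=10)    # ~0.79 -> ~76 s interval
    rare_hitter = rule_weight(hits=1, misses=200)        # ~0.01 -> capped at 900 s
    print(evaluation_interval(frequent_hitter), evaluation_interval(rare_hitter))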
Optimizing the sampling intervals at which metrics are acquired and the rule evaluation intervals at which rules are evaluated enables both the compute nodes of the monitored network and the performance monitoring system itself to operate with reduced computing resource consumption (such as processing cycles, memory, and memory bus bandwidth) and correspondingly reduced power consumption.
Furthermore, it is advantageous to automatically create alert rules. For example, if a network system has a problem, such as high system CPU usage, an administrator will typically look for the application or module in the system that consumes the most CPU resources or performs the most CPU-intensive operations. After such analysis is performed, the administrator typically creates one or more alert rules, with associated metrics, to capture the high-CPU problem before it reoccurs and possibly take action to prevent the system CPU usage from becoming too high. Such manual creation of alert rules may be time consuming and may require the administrator to analyze the metric data and attempt to identify suspicious metrics that may be related to the problem that the administrator is trying to diagnose. This becomes more difficult when the amount of telemetry data is large. The process of manually creating a set of appropriate alert rules to diagnose a problem may be time consuming, inefficient, and in some cases unsuccessful due to the time delay in implementing the alert rules that the user manually creates. For example, by the time an administrator starts an investigation, or by the time a new rule is added, the fault/problem may no longer be present.
A performance monitoring system according to the disclosed techniques, using a machine learning based intelligent alert rule creation method, can automatically find relevant metrics related to the metrics of existing rules and recommend additional alert rules for future analysis of problems. The recommended alert rules may be implemented automatically or may require user approval, and they provide a way to ease the burden of manual rule creation while conserving network resources by providing alert rules that are relevant and that provide meaningful information about the network. In this way, the efficiency and operation of the performance monitoring system improve, in that problems can be discovered and resolved more quickly and computing resources used for determining network problems are conserved. This may be the case both for the performance monitoring system itself and for the network being monitored, since relevant rules with associated relevant metrics mean that irrelevant metrics are neither exported by the network nor collected by the performance monitoring system.
In one example, the present disclosure describes a method for recommending alert rules for a performance monitoring system, the method comprising: receiving, by the performance monitoring system, user-created alert rules; collecting, by the performance monitoring system, telemetry data including a plurality of metrics related to the user-created rules; reading, by the performance monitoring system, a first alert rule of the user-created alert rules, wherein the first alert rule is associated with a first metric of the plurality of metrics; determining, by the performance monitoring system, at least one relevant metric related to the first metric; creating, by the performance monitoring system, a set of temporary correlation rules using the at least one relevant metric; evaluating, by the performance monitoring system, each temporary correlation rule of the set of temporary correlation rules using the telemetry data to determine a corresponding temporary rule evaluation attribute; determining, by the performance monitoring system, for each temporary correlation rule in the set of temporary correlation rules, a corresponding correlation rule weight based on the corresponding temporary rule evaluation attribute; and determining, by the performance monitoring system, whether to recommend each temporary correlation rule based on the corresponding correlation rule weight.
In another example, the present disclosure describes a performance monitoring system including processing circuitry coupled to a storage device, the storage device and processing circuitry configured to: receive user-created alert rules; collect telemetry data comprising a plurality of metrics related to the user-created rules; read a first alert rule of the user-created alert rules, wherein the first alert rule is associated with a first metric of the plurality of metrics; determine at least one relevant metric related to the first metric; create a set of temporary correlation rules using the at least one relevant metric; evaluate each temporary correlation rule of the set of temporary correlation rules using the telemetry data to determine a corresponding temporary rule evaluation attribute; for each temporary correlation rule in the set of temporary correlation rules, determine a corresponding correlation rule weight based on the corresponding temporary rule evaluation attribute; and determine whether to recommend each temporary correlation rule based on the corresponding correlation rule weight.
In another example, the present disclosure describes a non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to execute a performance monitoring system configured to: receive user-created alert rules; collect telemetry data comprising a plurality of metrics related to the user-created rules; read a first alert rule of the user-created alert rules, wherein the first alert rule is associated with a first metric of the plurality of metrics; determine at least one relevant metric related to the first metric; create a set of temporary correlation rules using the at least one relevant metric; evaluate each temporary correlation rule of the set of temporary correlation rules using the telemetry data to determine a corresponding temporary rule evaluation attribute; for each temporary correlation rule in the set of temporary correlation rules, determine a corresponding correlation rule weight based on the corresponding temporary rule evaluation attribute; and determine whether to recommend each temporary correlation rule based on the corresponding correlation rule weight.
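For illustration, the following Python sketch walks through the recommendation steps summarized above, using Pearson correlation (via numpy) to find related metrics, a percentile-based threshold for each temporary correlation rule, and a recent hit-rate weight with a fixed recommendation cutoff; these particular choices are assumptions and not the claimed technique.

    # Sketch of the recommendation flow described above; correlation measure,
    # percentile threshold, and cutoff values are illustrative assumptions.
    import numpy as np

    def correlated_metrics(base: str, series: dict, min_corr: float = 0.8) -> list:
        """Return metric names whose time series correlate with the base metric's series."""
        related = []
        x = np.asarray(series[base], dtype=float)
        for name, values in series.items():
            if name == base:
                continue
            y = np.asarray(values, dtype=float)
            if len(y) != len(x) or np.std(x) == 0 or np.std(y) == 0:
                continue
            if abs(np.corrcoef(x, y)[0, 1]) >= min_corr:
                related.append(name)
        return related

    def recommend_rules(base_metric: str, series: dict,
                        recent: int = 60, recommend_cutoff: float = 0.2) -> list:
        recommendations = []
        for metric in correlated_metrics(base_metric, series):
            values = np.asarray(series[metric], dtype=float)
            threshold = float(np.percentile(values, 95))           # temporary rule: metric > 95th percentile
            recent_hits = int(np.sum(values[-recent:] > threshold))
            weight = recent_hits / min(recent, len(values))        # evaluation attribute -> rule weight
            if weight >= recommend_cutoff:
                recommendations.append({"metric": metric, "threshold": threshold, "weight": weight})
        return recommendations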
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a block diagram illustrating an exemplary network system having a data center in which examples of the techniques described herein may be implemented.
FIG. 2 is a block diagram illustrating an exemplary computing device that uses a network interface card with a separate processing unit to perform edge service controller managed services in accordance with the techniques described herein.
Fig. 3 is a conceptual diagram illustrating a data center having servers, each including a network interface card having a separate processing unit controlled by an edge service controller, according to the techniques of this disclosure.
Fig. 4 is a block diagram illustrating an exemplary performance monitoring service with telemetry services including telemetry acquisition services in a network and/or within a data center in accordance with the techniques described in this disclosure.
FIG. 5 illustrates a performance monitoring system in communication with a metrics exporter for collecting telemetry data and including an alert rule evaluator service for evaluating rules using the telemetry data, in accordance with the techniques of the present disclosure.
FIG. 6 illustrates an example of a performance monitoring system with intelligent collectors in accordance with the techniques described in this disclosure.
FIG. 7 is an exemplary flow chart for determining metric weights and corresponding new sampling intervals for metric acquisition in accordance with the techniques described in this disclosure.
Fig. 8 is an exemplary sequence diagram for determining new sampling intervals for metric acquisition in accordance with the techniques described in this disclosure.
FIG. 9 is an example of a performance monitoring system that adjusts a rule evaluation interval in accordance with the techniques described in this disclosure.
FIG. 10 is an exemplary sequence diagram for the performance monitoring system of FIG. 9, which provides additional details regarding interactions between various components, in accordance with the techniques described in this disclosure.
Fig. 11 is an exemplary flowchart illustrating actions of the alert rule evaluator service of fig. 9 in accordance with the techniques described in this disclosure.
FIG. 12 is an example of a performance monitoring system for recommending rules according to the techniques described in this disclosure.
FIG. 13 is an exemplary sequence diagram for the performance monitoring system of FIG. 12, providing additional details regarding interactions between various components, in accordance with the techniques described in this disclosure.
FIG. 14 is an exemplary flowchart illustrating the actions of the performance monitoring system of FIG. 12 in accordance with the techniques described in this disclosure.
Like reference numerals refer to like elements throughout the specification and drawings.
Detailed Description
FIG. 1 is a block diagram illustrating an exemplary network system 8 with computing infrastructure in which examples of the techniques described herein may be implemented. In general, the data center 10 provides an operating environment for applications and services for one or more customer sites 11 (shown as "customers 11"), the one or more customer sites 11 having one or more customer networks coupled to the data center through a service provider network 7.
The data center 10 may, for example, host all infrastructure equipment such as networking and storage systems, redundant power supplies, and environmental controls. The service provider network 7 is coupled to a public network 4, which may represent one or more networks managed by other providers and may thus form part of a large-scale public network infrastructure (e.g., the Internet). Public network 4 may represent, for example, a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, a Virtual LAN (VLAN), an enterprise LAN, a layer 3 Virtual Private Network (VPN), an Internet Protocol (IP) intranet operated by the service provider that operates service provider network 7, an enterprise IP network, or some combination thereof.
Although the customer sites 11 and public network 4 are primarily shown and described as edge networks of the service provider network 7, in some examples, one or more of the customer sites 11 and public network 4 may be tenant networks within the data center 10 or another data center. For example, the data center 10 may host a plurality of tenants (customers) each associated with one or more Virtual Private Networks (VPNs), each of which may implement one of the customer sites 11.
The service provider network 7 provides packet-based connectivity to attached customer sites 11, data centers 10, and public networks 4. The service provider network 7 may represent a network that a service provider owns and operates to interconnect multiple networks. The service provider network 7 may implement multiprotocol label switching (MPLS) forwarding and, in such instances, may be referred to as an MPLS network or an MPLS backbone network. In some examples, service provider network 7 represents a plurality of interconnected autonomous systems, such as the Internet, serviced by one or more service providers.
In some examples, data center 10 may represent one of many geographically distributed network data centers. As shown in the example of FIG. 1, the data center 10 may be a facility that provides network services to customers. Customers of the service provider may be collective entities, such as enterprises and governments, or individuals. For example, a network data center may host network services for several enterprises and end users. Other exemplary services may include data storage, virtual private networks, traffic engineering, file services, data mining, scientific or supercomputing, and the like. Although illustrated as a separate edge network of the service provider network 7, elements of the data center 10, such as one or more Physical Network Functions (PNFs) or Virtualized Network Functions (VNFs), may be included within the service provider network 7 core.
In this example, the data center 10 includes storage and/or computing servers interconnected via a switching fabric 14 provided by one or more layers of physical network switches and routers, with servers 12A-12X (herein, "servers 12") shown coupled to top-of-rack switches 16A-16N (herein, "TOR switches 16"). The servers 12 may also be referred to herein as "hosts" or "host devices." The data center 10 may include many additional servers coupled to other TOR switches 16 of the data center 10. Each host device in such a data center may execute one or more virtual machines, pods, or other scalable virtual execution elements, which may be referred to as workloads. Clients of a data center typically have access to these workloads and can install applications and perform other operations using such workloads. Workloads that run on different host devices but are accessible to one particular client are organized into a virtual network. Each client typically has at least one virtual network. Those virtual networks are also referred to as overlay networks.
In some cases, clients of a data center may experience network problems such as increased latency, packet loss, low network throughput, or slow workload processing. Troubleshooting these problems may be complicated by the deployment of workloads in large multi-tenant data centers. Telemetry data, such as that provided by telemetry services and analyzed by performance monitoring systems, may be used to help diagnose and resolve problems in the data center.
Edge service controller 28 may include a performance monitoring system (shown in more detail in FIGS. 5, 6, 9, and 12) having a collector for collecting telemetry data and an alert rule evaluator for analyzing the telemetry data according to alert rules to determine whether the telemetry data should or should not generate an alert, as will be explained further below. The performance monitoring system may also include a telemetry service, such as that shown in FIG. 4, which may include a metrics collector and allow a user to create alert rules for network monitoring. The performance monitoring system may include one or more machine learning components and may be configured to provide adaptive sampling intervals for collecting telemetry data, to provide adaptive rule evaluation intervals, and/or to provide recommended alert rules that give a better understanding of the network.
In the example shown, servers 12A and 12X are directly coupled to TOR switch 16, and servers 12B, 12D, and 12C are not directly coupled to the TOR switch. Servers 12B, 12D, and 12C may reach TOR switch 16 and IP fabric 20 via servers 12A or 12X. The switch fabric 14 in the illustrated example includes interconnected top of rack (TOR) (or other "leaf") switches 16A-16N that are coupled to a distribution layer (herein "rack-mounted switch 18") of rack-mounted (or "spine" or "core") switches 18A-18M. Although not shown, the data center 10 may also include, for example, one or more non-edge switches, routers, hubs, gateways, security devices (such as firewalls, intrusion detection and/or prevention devices), servers, computer terminals, laptops, printers, databases, wireless mobile devices (such as cellular telephones or personal digital assistants), wireless access points, bridges, cable modems, application accelerators, or other network devices.
In this example, TOR switches 16 and rack-mounted switches 18 may in some cases provide servers 12 with redundant (multi-homed) connectivity to IP fabric 20 and service provider network 7. The rack-mounted switches 18 aggregate traffic flows and provide connectivity between TOR switches 16. TOR switches 16 may be network devices that provide layer 2 (MAC) and/or layer 3 (e.g., IP) routing and/or switching functionality. The TOR switches 16 and the rack-mounted switches 18 may each include one or more processors and memory and be capable of executing one or more software processes. The rack-mounted switches 18 are coupled to an IP fabric 20, which can perform layer 3 routing to route network traffic between the data center 10 and the customer sites 11 through the service provider network 7. The switching architecture of the data center 10 is merely one example. For example, other switching fabrics may have more or fewer switching layers.
The term "packet flow", "traffic flow" or simply "flow" refers to a group of packets originating from a particular source device or endpoint and sent to a particular destination device or endpoint. A single flow of packets may be identified by a 5-tuple: for example, < source network address, destination network address, source port, destination port, protocol >. This 5-tuple typically identifies the packet stream to which the received packet corresponds. An n-tuple refers to any n entries extracted from a 5-tuple. For example, a 2-tuple for a packet refers to < source network address, destination network address > or a combination of < source network address, source port > for that packet. The source port refers to a transport layer (e.g., TCP/UDP) port. "port" may refer to the physical network interface of the NIC.
Each of the servers 12 may be a computer node, an application server, a storage server, or other type of server. For example, each of the servers 12 may represent a computing device, such as an X86 processor-based server, configured to operate in accordance with the techniques described herein. The server 12 may provide Network Function Virtualization Infrastructure (NFVI) for the NFV architecture.
Server 12 may host endpoints for one or more virtual networks operating on the physical networks represented herein by IP fabric 20 and switch fabric 14. Although a data center based switching network is primarily described, other physical networks, such as the service provider network 7, may underlie the one or more virtual networks. Endpoints may include, for example, virtual machines, containerized applications, or applications executing natively on an operating system or on bare metal.
The servers 12 each include at least one Network Interface Card (NIC) of NICs 13A-13X (collectively, "NICs 13"), each of which includes at least one port that exchanges packets over one or more communication links coupled to the NIC ports. For example, the server 12A includes NIC 13A.
In some examples, each of the NICs 13 provides one or more virtual hardware components for virtualized input/output (I/O). A virtual hardware component for I/O may be a virtualization of the physical NIC 13 (the "physical function"). For example, in Single Root I/O Virtualization (SR-IOV), described in the SR-IOV specification of the Peripheral Component Interface (PCI) Special Interest Group (SIG), the PCIe physical function of a network interface card (or "network adapter") is virtualized to expose one or more virtual network interface cards as "virtual functions" for use by respective endpoints executing on server 12. In this way, virtual network endpoints may share the same PCIe physical hardware resources, and the virtual functions are examples of virtual hardware components. As another example, one or more servers 12 may implement Virtio, a para-virtualization framework available, for example, for the Linux operating system, which provides emulated NIC functionality as a virtual hardware component. As another example, one or more servers 12 may implement Open vSwitch to perform distributed virtual multi-layer switching between one or more virtual NICs (vNICs) of hosted virtual machines, where such vNICs may also represent virtual hardware components. In some examples, the virtual hardware components are virtual I/O (e.g., NIC) components. In some examples, the virtual hardware components are SR-IOV virtual functions and may provide SR-IOV with Data Plane Development Kit (DPDK)-based direct process user space access.
In some examples, including the example shown in FIG. 1, one or more of the NICs 13 may include multiple ports. The NICs 13 may be connected to one another via the ports and communication links of the NICs 13 to form a NIC fabric 23 having a NIC fabric topology. NIC fabric 23 is the collection of NICs 13 that are connected to at least one other NIC 13, together with the communication links coupling the NICs 13 to each other.
The NICs 13 each include a processing unit 25 to offload aspects of the data path. The processing unit in the NIC may be, for example, a multi-core ARM processor with hardware acceleration provided by a Data Processing Unit (DPU), a Field Programmable Gate Array (FPGA), and/or an ASIC. NICs 13 may alternatively be referred to as smart NICs or GeniusNICs.
According to various aspects of the disclosed technology, the edge services platform utilizes the processing unit 25 of the NIC13 to enhance the processing and networking functions of the switching fabric 14 and/or the server 12 including the NIC 13.
In addition, edge service controller 28 may manage API-driven deployment of services 233 on NICs 13; the addition, deletion, and replacement of NICs 13 within the edge services platform; monitoring of services 233 and other resources on NICs 13; and connectivity between the various services 233 running on NICs 13. In addition, edge service controller 28 may include a performance monitoring system 500 and telemetry service 440 (shown in FIG. 3) that may be used to collect metrics from DPUs 25 using pull or push queries. Performance monitoring system 500 may take various forms, such as performance monitoring systems 600, 900, and 1200 shown in FIGS. 6, 9, and 12.
Edge service controller 28 may transmit information describing the services available on NICs 13, the topology of NIC fabric 23, or other information about the edge services platform to an orchestration system or network controller 24 (not shown). Exemplary orchestration systems include OpenStack, vCenter by VMware, or System Center by Microsoft Corporation. Exemplary network controllers 24 include controllers for a Juniper Networks (JUNIPER NETWORKS) control system or for Tungsten Fabric. The network controller 24 may be a network fabric manager. Additional information regarding controller 24 operating in conjunction with the data center 10 or other software-defined networks is found in International Application No. PCT/US2013/044378, entitled "Determining Physical Paths for Virtual Network Packet Flows," filed June 5, 2013, and in U.S. patent application Ser. No. 14/226,509, entitled "Tunneled Packets for Virtual Networks," filed March 26, 2014, each of which is incorporated herein by reference as if fully set forth herein.
In some examples, edge service controller 28 may program processing unit 25 to provide telemetry data when requested. Edge service controller 28 also performs performance monitoring functions, including evaluating metrics, evaluating desired telemetry data according to alert rules, using machine learning to determine optimized telemetry data acquisition rates and rule evaluation rates, and providing alert rule recommendations.
Fig. 2 is a block diagram illustrating an exemplary server 12 that uses a network interface card with a separate processing unit to perform services managed by an edge services platform, in accordance with the techniques described herein. The server 12 of FIG. 2 may represent a real or virtual server and may represent an illustrative instance of any of the servers 12A-12X of FIG. 1. In this example, server 12 includes a bus 242 that couples the hardware components of server 12, such as an SR-IOV-capable Network Interface Card (NIC) 13, a storage disk 246, and a microprocessor 210. In some cases, a front-side bus may couple microprocessor 210 and memory 244. In some examples, bus 242 may couple memory 244, microprocessor 210, and NIC 13. Bus 242 may represent a Peripheral Component Interface (PCI) express (PCIe) bus. In some examples, a Direct Memory Access (DMA) controller may control DMA transfers between components coupled to bus 242. In some examples, components coupled to bus 242 control DMA transfers between components coupled to bus 242.
Microprocessor 210 may include one or more processors, each of which includes a separate execution unit ("processing core") to execute instructions conforming to an instruction set architecture. The execution units may be implemented as separate Integrated Circuits (ICs), or may be incorporated within one or more multi-core processors (or "many-core" processors), each of which is implemented using ICs (i.e., chip multiprocessors).
Disk 246 represents computer-readable storage media, including volatile and/or nonvolatile media, removable and/or non-removable media, and communication media, implemented in any method or technology for storing information such as processor-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, Random Access Memory (RAM), Read Only Memory (ROM), EEPROM, flash memory, CD-ROM, Digital Versatile Discs (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage devices, or any other medium that can be used to store the desired information and that can be accessed by microprocessor 210.
Main memory 244 includes one or more computer-readable storage media, which may include Random Access Memory (RAM), such as various forms of Dynamic RAM (DRAM) (e.g., DDR2/DDR3 SDRAM) or Static RAM (SRAM), flash memory, or any other form of removable storage medium that can be used to carry or store desired program code and program data in the form of instructions or data structures and that can be accessed by a computer. Main memory 244 provides a physical address space composed of addressable memory locations.
The Network Interface Card (NIC) 13 includes one or more interfaces 232 configured to exchange packets using links of an underlying physical network. Interfaces 232 may include a port interface card having one or more network ports. NIC 13 also includes on-card memory 227, for example, to store packet data. Data for direct memory access transfers between NIC 13 and other devices coupled to bus 242 may be read from and/or written to memory 227.
Memory 244, NIC13, storage disk 246, and microprocessor 210 provide an operating environment for a software stack that can execute hypervisor 214 and one or more virtual machines 228 managed by hypervisor 214.
Generally, a virtual machine provides a virtualized/guest operating system for executing applications in an isolated virtual environment. Because the virtual machine is virtualized from the physical hardware of the host server, the executing applications are isolated from both the hardware of the host and other virtual machines.
An alternative to virtual machines is the virtualized container, such as those provided by the open-source Docker container application. Like a virtual machine, each container is virtualized and may remain isolated from the host and from other containers. However, unlike a virtual machine, each container may omit an individual operating system and instead provide only an application suite and application-specific libraries. Containers are executed by the host as isolated user-space instances and may share an operating system and common libraries with other containers executing on the host. Thus, containers may require less processing power, storage, and network resources than virtual machines. As used herein, containers may also be referred to as virtualization engines, virtual private servers, silos, or jails. In some examples, the techniques described herein apply to containers as well as to virtual machines or other virtualized components.
Although the virtual network endpoints in FIG. 2 are shown and described with respect to virtual machines, other operating environments, such as containers (e.g., Docker containers), may execute virtual network endpoints. An operating system kernel (not shown in FIG. 2) may execute in kernel space 243 and may include, for example, Linux, Berkeley Software Distribution (BSD), another Unix-variant kernel, or a Windows server operating system kernel available from Microsoft Corporation.
The server 12 executes the hypervisor 214 to manage the virtual machines 228. Exemplary hypervisors include the Kernel-based Virtual Machine (KVM) for the Linux kernel, Xen, ESXi available from VMware, Windows Hyper-V available from Microsoft, and other open-source and proprietary hypervisors. The hypervisor 214 may represent a Virtual Machine Manager (VMM).
The virtual machine 228 may host one or more application programs, such as virtual network function instances. In some examples, virtual machine 228 may host one or more VNF instances, wherein each of the VNF instances is configured to apply network functions to the packets.
The hypervisor 214 includes a physical driver 225 to use the physical function provided by the network interface card 13. In some cases, the network interface card 13 may also implement SR-IOV, enabling the physical network function (I/O) to be shared among the virtual machines 228. Each port of NIC 13 may be associated with a different physical function. The shared virtual devices (also known as virtual functions) provide dedicated resources, such that each of the virtual machines 228 (and corresponding guest operating systems) can access dedicated resources of the NIC 13, which thus appears to each virtual machine 228 as a dedicated NIC. Virtual functions 217 may be lightweight PCIe functions that share physical resources with the physical function and with other virtual functions. According to the SR-IOV standard, NIC 13 may have thousands of virtual functions available, but for I/O-intensive applications the number of configured virtual functions is typically much smaller.
The virtual machines 228 include corresponding virtual NICs 229 that are presented directly in the guest operating system of each virtual machine 228 for direct communication between the NIC 13 and that virtual machine 228 via bus 242, using the virtual function assigned to the virtual machine. This may reduce the hypervisor 214 overhead involved in software-based, virtio and/or vSwitch implementations, in which the hypervisor 214 memory address space of memory 244 stores packet data, and copying packet data from NIC 13 to the hypervisor 214 memory address space and from the hypervisor 214 memory address space to the virtual machine 228 memory address space consumes cycles of microprocessor 210.
NIC 13 may also include a hardware-based ethernet bridge or embedded switch 234. Ethernet bridge 234 may perform layer 2 forwarding between virtual and physical functions of NIC 13. Thus, in some cases, bridge 234 provides hardware acceleration for packet forwarding between virtual machines 228 via bus 242 and between hypervisor 214 and any of virtual machines 228 that access physical functions via physical driver 225. The embedded switch 234 may be physically separate from the processing unit 25.
The server 12 may be coupled to a physical network switch fabric that includes an overlay network that extends the network fabric from the physical switch to software or "virtual" routers, including virtual router 220, coupled to the physical servers of the switch fabric. The virtual router may be a process, or thread, or component thereof, executed by a physical server (e.g., server 12 of fig. 1) that dynamically creates and manages one or more virtual networks that are available for transmission between virtual network endpoints. In one example, the virtual router implements each virtual network using an overlay network, which provides the ability to decouple the virtual address of an endpoint from the physical address (e.g., IP address) of the server on which the endpoint is executing. Each virtual network may use its own addressing and security scheme and may be considered orthogonal to the physical network and its addressing scheme. Various techniques may be used to transport packets within and across virtual networks through physical networks. At least some of the functions of the virtual router may be performed as one of the services 233.
In the exemplary computing device/server 12 of FIG. 2, the virtual router 220 executes within the hypervisor 214 using the physical function for I/O, but the virtual router 220 may alternatively execute within a hypervisor, a host operating system, a host application, a virtual machine 228, and/or the processing unit 25 of NIC 13.
In general, each virtual machine 228 may be assigned a virtual address for use within a corresponding virtual network, where each of the virtual networks may be associated with a different virtual subnet provided by virtual router 220. A virtual machine 228 may be assigned its own virtual layer 3 (L3) IP address, for example, for sending and receiving communications, but may be unaware of the IP address of the server 12 on which the virtual machine is executing. Thus, a "virtual address" is an address for an application that differs from the logical address of the underlying physical computer system, e.g., server 12.
In one embodiment, server 12 includes a Virtual Network (VN) agent (not shown) that controls the virtual network overlay for server 12 and coordinates the routing of data packets within server 12. Generally, a VN agent communicates with a virtual network controller for the plurality of virtual networks, which generates commands to control the routing of packets. A VN agent may operate as a proxy for control plane messages between the virtual machines 228 and the virtual network controller (such as controller 24 or 28). For example, a virtual machine may request to send a message using its virtual address via the VN agent, and the VN agent may in turn send the message and request that a response to the message be received for the virtual address of the virtual machine that originated the first message. In some cases, a virtual machine 228 may invoke a procedure or function call presented by the application programming interface of the VN agent, and the VN agent may also handle encapsulation of the message, including addressing.
In one example, network packets, e.g., layer 3 (L3) IP packets or layer 2 (L2) Ethernet packets, generated or consumed by instances of applications executed by virtual machines 228 within the virtual network domain may be encapsulated in other packets (e.g., other IP or Ethernet packets) transmitted by the physical network. Packets transmitted within a virtual network may be referred to herein as "inner packets," while the physical network packets may be referred to herein as "outer packets" or "tunnel packets." Encapsulation and/or decapsulation of virtual network packets within physical network packets may be performed by virtual router 220. This functionality is referred to herein as tunneling and may be used to create one or more overlay networks. In addition to IP-in-IP, other exemplary tunneling protocols that may be used include multiprotocol label switching (MPLS) over Generic Routing Encapsulation (GRE) (MPLSoGRE), MPLS over User Datagram Protocol (UDP) (MPLSoUDP), VxLAN, and the like.
As described above, the virtual network controller may provide a logically centralized controller for facilitating the operation of one or more virtual networks. The virtual network controller may, for example, maintain a routing information base, e.g., one or more routing tables that store routing information for the physical network and the one or more overlay networks. Virtual router 220 of hypervisor 214 implements Network Forwarding Tables (NFTs) 222A-222N for the N virtual networks for which virtual router 220 operates as a tunnel endpoint. In general, each NFT 222 stores forwarding information for the corresponding virtual network and identifies where data packets are to be forwarded and whether the packets are to be encapsulated in a tunneling protocol, such as with a tunnel header that may include one or more headers for different layers of the virtual network protocol stack. Each of NFTs 222 may be an NFT for a different routing instance (not shown) implemented by virtual router 220.
According to the techniques described in this disclosure, the edge services platform includes, for example, an edge service controller 28 that enhances the processing and networking functions of the server 12 using the processing unit 25 of the NIC 13. The processing unit 25 includes processing circuitry 231 to execute services orchestrated by the edge service controller 28. The processing circuitry 231 may represent any combination of processing cores, ASICs, FPGAs, or other integrated circuits and programmable hardware. In an example, the processing circuitry may include a System-on-Chip (SoC) having, for example, one or more cores, a network interface for high-speed packet processing, one or more acceleration engines for specialized functions (e.g., security/cryptography, machine learning, storage), programmable logic, integrated circuits, and the like. Such a SoC may be referred to as a Data Processing Unit (DPU). The DPU may be an example of the processing unit 25.
In the exemplary NIC 13, processing unit 25 executes an operating system kernel 237 and a user space 241 for services. The kernel 237 may be a Linux kernel, a Unix or BSD kernel, a real-time OS kernel, or another kernel for managing the hardware resources of the processing unit 25 and managing the user space 241.
Services 233 may include networking, security, storage, data processing, co-processing, machine learning, telemetry (such as telemetry service 233C of FIG. 3), and/or other services. Services 233 and ESP agent 236 may include executable instructions. Processing unit 25 may execute services 233 and Edge Services Platform (ESP) agent 236 as processes and/or within virtual execution elements such as containers or virtual machines. As described elsewhere herein, services 233 may augment the processing power of the host processor (e.g., microprocessor 210), for example by enabling the server 12 to offload packet processing, security, or other operations that would otherwise be performed by the host processor.
The processing unit 25 executes an Edge Services Platform (ESP) agent 236 to exchange data and control data with the edge service controller 28 of the edge services platform. Although shown in user space 241, in some examples ESP agent 236 may be a kernel module of kernel 237.
For example, ESP agent 236 may collect telemetry data generated by services 233 that describes traffic in the network and/or the availability of resources of server 12 and/or processing unit 25 (such as memory or processor/core utilization), and send this telemetry data to the ESP controller (another name for the edge service controller 28 shown in the example of FIG. 1). As another example, ESP agent 236 may receive, from the ESP controller, service code for executing any of services 233, service configurations for configuring any of services 233, packets to be injected into the network, or other data.
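As a rough illustration only, the sketch below models an agent-style reporting loop; the payload fields and the send_to_controller() placeholder are assumptions and do not represent the actual ESP agent interface.

    # Hypothetical sketch of an agent-style telemetry reporting loop.
    import time

    def read_local_utilization() -> dict:
        # Placeholder for reading DPU/NIC counters (e.g., memory, processor/core utilization).
        return {"cpu_util_percent": 12.5, "mem_util_percent": 41.0, "rx_packets": 182_304}

    def send_to_controller(endpoint: str, payload: dict) -> None:
        # Placeholder transport; a real agent might use gRPC or HTTP toward the ESP controller.
        print(f"send to {endpoint}: {payload}")

    def report_loop(endpoint: str, interval_s: float = 30.0, iterations: int = 3) -> None:
        for _ in range(iterations):
            send_to_controller(endpoint, {"timestamp": time.time(), **read_local_utilization()})
            time.sleep(interval_s)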
The edge service controller 28 manages the operation of the processing unit 25 by, for example, orchestrating and configuring the services 233 executed by the processing unit 25; deploying the services 233; managing the addition, deletion, and replacement of NICs 13 within the edge services platform; monitoring services 233 and other resources on NIC 13; and managing connectivity between the various services 233 running on NIC 13. Exemplary resources on NIC 13 include memory 227 and processing circuitry 231.
Fig. 3 is a conceptual diagram illustrating a data center with computing nodes, according to the techniques of this disclosure, in which each of the servers includes a network interface card having a separate processing unit controlled by an edge services platform 300. The edge services platform 300 may include a network automation platform 306 and an orchestrator 304. The racks of computing nodes may correspond to the servers 12 of FIG. 1, and switches 16A/18A and 16B/18B may correspond to the switches 16 and 18 of fabric 14 of FIG. 1. The processing unit 25, shown as a Data Processing Unit (DPU), may include an agent 236 and services (such as services 233 of FIG. 2), which may represent software. Services 233 executed by processing unit 25 may include network service 233A, L4-L7 services 233B, telemetry service 233C, and Linux + SDK (software development kit) service 233D.
As described more fully herein, processing unit 25 may send telemetry data (shown as telemetry data 312) and other information for the NIC that includes the processing unit to the orchestrator 304 of edge services platform 300 via agent 236 and telemetry service 233C. The orchestrator 304 may represent an example of the edge service controller 28 of FIG. 1 and may include a performance monitoring system 500 (shown in more detail in FIG. 4), the performance monitoring system 500 including a telemetry service 440 (shown in more detail in FIG. 4). Performance monitoring system 500 may receive telemetry data, including metrics, from a number of agents 236 associated with a number of hosts (another term for the servers 12) via telemetry service 440.
The network automation platform 306, which may represent an example of the controller 24 of FIG. 1, connects to and manages the network devices (e.g., servers 12 and/or switches 16/18) and the orchestrator 304. The network automation platform 306 may, for example, deploy network devices and configure and manage the network. Performance monitoring system 500 may extract telemetry, analyze it, and provide an indication of network status. Various APIs may provide the network automation platform and/or the performance monitoring system with a user interface to enable, for example, the entry and automatic configuration of intent-based policies regarding network operation and performance.
Fig. 4 illustrates a scalable, microservice-based telemetry service 440 that enables the acquisition of time-series telemetry data from computing devices, such as via the agent 236 of FIG. 3, and enables different consumers to obtain telemetry data through a subscription service. Telemetry service 440 may be part of performance monitoring system 500, or part of controller 28 or controller 24. Consumers of telemetry data may be other shared services included in performance monitoring system 500; see FIG. 5 for details.
An administrator or application can express telemetry acquisition requirements as an "intent" that defines, in a high-level "natural language," how telemetry data is to be acquired. A telemetry intent compiler can receive the telemetry intent and translate the high-level intent into abstract telemetry configuration parameters that provide a generic description of the desired telemetry data, also referred to as metrics or performance measurements. Telemetry service 440 can determine, from the telemetry intent, a set of devices from which to collect telemetry data. For each device, the telemetry service can determine the device's capabilities with respect to telemetry data acquisition. These capabilities may include the telemetry protocols supported by the device. The telemetry service can create protocol-specific device configurations based on the abstract telemetry configuration parameters and the telemetry protocols supported by the devices. Devices in the network system that support a particular telemetry protocol can be assigned, in a distributed fashion, to telemetry collectors (metrics collectors) that support that telemetry protocol.
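The following Python sketch illustrates the flow from intent to abstract configuration to per-protocol collector assignment under assumed field names and protocol labels (e.g., gNMI, SNMP); it is not the telemetry intent compiler itself.

    # Illustrative-only mapping from a telemetry "intent" to abstract configuration
    # and then to per-protocol collector assignments. Field names are assumptions.
    def compile_intent(intent: str) -> dict:
        # Stand-in for the telemetry intent compiler; a real compiler would parse natural language.
        # Example intent: "collect interface counters from all leaf switches every 30 seconds"
        return {"metrics": ["interface_rx_bytes", "interface_tx_bytes"],
                "interval_s": 30,
                "device_filter": {"role": "leaf"}}

    def assign_collectors(abstract_cfg: dict, devices: list) -> dict:
        # Group target devices by the telemetry protocol each supports, so each
        # protocol-specific collector instance receives only compatible devices.
        assignments: dict = {}
        for device in devices:
            if device.get("role") == abstract_cfg["device_filter"]["role"]:
                assignments.setdefault(device["telemetry_protocol"], []).append(device["name"])
        return assignments

    devices = [{"name": "leaf-1", "role": "leaf", "telemetry_protocol": "gNMI"},
               {"name": "leaf-2", "role": "leaf", "telemetry_protocol": "SNMP"}]
    print(assign_collectors(compile_intent("collect interface counters ..."), devices))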
Telemetry service 440 can be implemented as a collection of microservices that can be fault tolerant and scalable. In response to increasing demand for telemetry acquisition services, new instances of the microservices may be created.
In particular, the exemplary data center 400 may include telemetry services 440 in the network 405 and/or within one or more data centers. The data center 400 of fig. 4 may be described as an example or alternative embodiment of the data center 10 of fig. 1. One or more aspects of fig. 4 may be described herein in the context of fig. 1.
Although a data center, such as the data centers shown in fig. 1 and 4, may be operated by any entity, some data centers are operated by service providers whose business model may involve providing computing capacity to customers or clients, typically on a multi-tenant basis. For this reason, data centers typically contain a large number of computing nodes or host devices. For efficient operation, these hosts must be connected to each other and to the outside world, and this capability is provided via physical devices that may be interconnected in a leaf-spine topology. The collection of these physical devices (such as network devices and host devices) forms the underlay network.
In some examples, data center 10 may represent one of many geographically distributed network data centers. In the example of FIG. 4, data center 400 includes a set of storage systems, application servers, computing nodes, or other devices, including devices 410A-410N (collectively, "devices 410," representing any number of devices). The devices 410 may be interconnected via one or more layers of physical network switches and routers that provide the high-speed switching fabric 14 of fig. 1.
The device 410 may represent any of a number of different types of devices (core switches, spine network devices, leaf network devices, edge network devices, or other network devices), but in some examples, one or more of the devices 410 may represent a physical compute node and/or a storage node of a data center. For example, one or more of the devices 410 may provide an operating environment for executing one or more client-specific applications or services. Alternatively, or in addition, one or more of the devices 410 may provide an operating environment for one or more virtual machines or other virtualized instances (such as containers). In some examples, one or more of the devices 410 are alternatively referred to as host computing devices, hosts, or servers. Accordingly, the apparatus 410 may execute one or more virtualized instances, such as virtual machines, containers, or other virtual execution environments for running one or more applications or services, such as Virtualized Network Functions (VNFs).
In general, each of the devices 410 may be any type of device operable on a network and capable of generating data (e.g., connectivity data, flow data, sFlow data, resource utilization data) that is accessed via telemetry or other means, including any type of computing device, sensor, camera, node, monitoring device, or other device. Further, some or all of the devices 410 may represent components of another device, where such components may generate data collected via telemetry or other means. For example, some or all of the devices 410 may represent physical or virtual devices, such as switches, routers, hubs, gateways, and security devices (such as firewalls, intrusion detection, and/or intrusion prevention devices).
Telemetry service 440 can configure devices 410 (and/or other devices) to generate and provide telemetry data related to the operation of those devices. Such data can include process usage data, memory usage data, network usage data, error counts, and the like. Telemetry service 440 may be configured to collect telemetry data from devices 410 using protocols supported by those devices. An application, process, thread, or the like can subscribe to the collected telemetry data to be notified when telemetry data becomes available for one or more of the devices on the network.
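For illustration only, the following is a minimal Python sketch of the kind of publish/subscribe notification interface described above; the class and method names (TelemetrySubscriptionService, subscribe, publish) and the metric name are assumptions for this example, not an API defined by this disclosure.

    from collections import defaultdict
    from typing import Callable, Dict, List

    class TelemetrySubscriptionService:
        """Hypothetical publish/subscribe interface for collected telemetry data."""

        def __init__(self) -> None:
            # Maps a metric name to the callbacks of subscribed consumers.
            self._subscribers: Dict[str, List[Callable[[str, str, float], None]]] = defaultdict(list)

        def subscribe(self, metric_name: str, callback: Callable[[str, str, float], None]) -> None:
            # Register a consumer callback for a given metric name.
            self._subscribers[metric_name].append(callback)

        def publish(self, device: str, metric_name: str, value: float) -> None:
            # Notify all subscribers when a new metric value is collected from a device.
            for callback in self._subscribers[metric_name]:
                callback(device, metric_name, value)

    # Example: an application subscribes to CPU usage telemetry for a device.
    service = TelemetrySubscriptionService()
    service.subscribe("cpu_usage", lambda dev, name, val: print(dev, name, val))
    service.publish("device-410A", "cpu_usage", 72.5)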
The user interface device 429 may be implemented as any suitable device for presenting output and/or accepting user input. For example, the user interface device 429 may include a display. The user interface device 429 may be a computing system, such as a mobile or non-mobile computing device operated by a user and/or administrator 428. In some examples, the user interface device 429 may be physically separate from the controller 24 and/or located in a different location than the controller 24. In such examples, the user interface device 429 may communicate with the controller 24 via a network or other communication means. In other examples, the user interface device 429 may be a local peripheral of the controller 24 or 28, or may be integrated into the controller 24 or 28.
In some aspects, user interface device 429 may communicate with telemetry service 440 or components thereof to configure telemetry service 440, using high-level declarations of intent, to configure devices to provide telemetry data and to receive telemetry data from devices and other components of data center 10 via telemetry service 440. In some aspects, telemetry service 440 may be configured by an application or service that uses telemetry data obtained via telemetry service 440. For example, performance monitoring system 500 of fig. 5 or components thereof may configure telemetry service 440 to collect and provide telemetry data from devices 410, such as at a desired collection rate. In some cases, the telemetry data includes metrics (performance measurements) for different aspects of a host device; each metric may be collected as a series of metric values, the metric values obtained at each of a plurality of specific times according to a specific sampling rate, with the metric values associated with the metric name of the corresponding metric.
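As a concrete illustration of this data model, the following minimal sketch represents a metric as a named series of timestamped values collected at a sampling interval; the class and field names are assumptions for illustration only.

    import time
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class MetricSeries:
        """A metric identified by name, holding (timestamp, value) samples."""
        name: str
        sampling_interval_s: float                      # how often a new value is collected
        samples: List[Tuple[float, float]] = field(default_factory=list)

        def record(self, value: float, timestamp: Optional[float] = None) -> None:
            # Append a sample taken at the given (or current) time.
            self.samples.append((timestamp if timestamp is not None else time.time(), value))

    # Example: CPU usage for one host, collected every 60 seconds.
    cpu = MetricSeries(name="host410A.cpu_usage_percent", sampling_interval_s=60.0)
    cpu.record(41.2)
    cpu.record(43.7)
    print(cpu.samples)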
Telemetry service 440 provides a sharable telemetry data acquisition service to acquire telemetry data from a plurality of devices in a network system according to a protocol supported by the devices. The collected telemetry data can be used to perform anomaly detection and generate alerts to monitor, at cloud scale, a cloud computing infrastructure that can be used by multiple applications and tenants.
The administrator 128 can utilize the UI device 129 to input data that represents telemetry acquisition requirements as "intent" defined in high-level "natural language". Telemetry service 440 is capable of receiving data representing an intent and translating the high-level intent into abstract telemetry configuration parameters that can be programmatically processed by a telemetry controller of telemetry service 440. The telemetry controller is capable of creating a protocol specific telemetry configuration for the device based on the abstract telemetry configuration parameters and a telemetry protocol supported by the device.
As described above, in some cases, clients of a data center may experience network problems, such as increased latency, packet loss, low network traffic, or slow workload processing. Resolving such problems may be complicated when workloads are deployed in a large multi-tenant data center. Telemetry data, such as that provided by telemetry service 440, may be used to help solve problems in the data center.
In the example of fig. 4, network 405 connects telemetry service 440, host device 410A, and host devices 410B-410N. The host devices 410A, 410B-410N may be collectively referred to as "host devices 410" representing any number of host devices 410.
Each of the host devices 410 may be an example of servers 12 of fig. 1, but in the example of fig. 4, each of the host devices 410 is implemented as a server or host device operating as a physical or virtualized compute node or a storage node of a virtualized data center, as opposed to a network device. As further described herein, one or more of the host devices 410 (e.g., host device 410A of fig. 4) may execute multiple virtual compute instances, such as virtual machines 428, and furthermore, one or more of the host devices 410 (e.g., one or more of host devices 410B-410N of fig. 4) may execute applications or services on non-virtualized, single-tenant, and/or bare metal servers. Thus, the example of fig. 4 shows a network system that may include a combination of virtualized server devices and bare metal server devices.
Also connected is a user interface device 129 operable by the administrator 128. In some examples, user interface device 129 may present one or more user interfaces on a display device associated with user interface device 129.
Network 405 may correspond to any of switching fabric 14 and/or service provider network 7 of fig. 1, or alternatively may correspond to a combination of switching fabric 14, service provider network 7, and/or another network. Although not shown in fig. 4, network 405 may also include some of the components of fig. 1, including SDN controller 24 and edge service controller 28.
Shown in network 405 are spine devices 402A and 402B (collectively, "spine devices 402," representing any number of spine devices 402), and leaf devices 403A, 403B, and 403C (collectively, "leaf devices 403," representing any number of leaf devices 403). Although the network 405 is shown with spine devices 402 and leaf devices 403, other types of devices may be included in the network 405, including core switches, edge devices, top-of-rack switches, and other devices (such as those shown in fig. 1).
In general, the network 405 may be the Internet, or may include or represent any public or private communication network or other network. For example, the network 405 may be a cellular, ZigBee, Bluetooth, Near Field Communication (NFC), satellite, enterprise, service provider, and/or other type of network capable of transmitting data between computing systems, servers, and computing devices. One or more of the client devices, server devices, and other devices may send and receive data, commands, control signals, and/or other information over the network 405 using any suitable communication technology. Network 405 may include one or more hubs, network switches, network routers, satellite dishes, or any other network devices. Such devices or components may be operably coupled internally to enable information exchange between computers, devices, or other components (e.g., between one or more client devices or systems and one or more server devices or systems). Each of the devices or systems shown in fig. 4 is operatively coupled to the network 405 using one or more network links. The links coupling such devices or systems to network 405 may be Ethernet, Asynchronous Transfer Mode (ATM), or other types of network connections, and such connections may be wireless and/or wired connections. One or more of the devices or systems shown in fig. 4 or on the network 405 may be located at a remote location relative to one or more other shown devices or systems.
Each of the host devices 410 represents a physical computing device or computing node or storage node that provides an execution environment for virtual hosts, virtual machines, containers, and/or other real or virtualized computing resources. In some examples, each of host devices 410 may be a component of a cloud computing system, a server farm (server farm), and/or a server cluster (or portion thereof) that provides services for client devices and other devices or systems.
Specific aspects of host device 410 are described herein with respect to host device 410A. Other host devices 410 (e.g., host devices 410B-410N) may be described in the same manner and may also include like-numbered components, with like-numbered components representing like, similar, or corresponding components, devices, modules, functions, and/or features. Accordingly, the description herein with respect to host device 410A is correspondingly applicable to one or more other host devices 410 (e.g., host devices 410B-410N).
In the example of fig. 4, host device 410A includes underlying physical computing hardware that includes one or more processors 413, one or more communication units 415, one or more input devices 416, one or more output devices 417, and one or more storage devices 420. In the illustrated example, the storage 420 may include a kernel module 422 and a virtual router module 424. Storage 420 may also include virtual machines 428A-428N (collectively, "virtual machines 428," representing any number of virtual machines 428), which, when present, may execute on top of or be controlled by a hypervisor (not shown). One or more of the devices, modules, storage areas, and other components of the host device 410A may be interconnected to enable inter-component communication (physically, communicatively, and/or operatively). In some examples, such connectivity may be provided by a communication channel (e.g., communication channel 412), a system bus, a network connection, an inter-process communication data structure, or any other method for transmitting data.
Processor 413 can perform functions and/or execute instructions associated with host device 410A. The communication unit 415 may communicate with other devices or systems on behalf of the host device 410A. The one or more input devices 416 and output devices 417 may represent any other input and/or output devices associated with the host device 410A. The storage device 420 may store information for processing during operation of the host device 410A.
Virtual router module 424 may execute multiple routing instances for corresponding virtual networks within data center 10 (fig. 1) and may route packets to the appropriate virtual machines executing within the operating environment provided by apparatus 410. The virtual router module 424 may also be responsible for collecting overlay flow data, such as Contrail flow data, when used with an infrastructure employing Contrail SDN.
Virtual machines 428A-428N may represent example instances of virtual machines 428. Host device 410A may partition the virtual and/or physical address space provided by storage device 420 into user space for running user processes. Host device 410A may also partition the virtual and/or physical address space provided by storage device 420 into kernel space that is protected and not accessible by user processes.
Each of virtual machines 428 may represent tenant virtual machines running client applications, such as web servers, database servers, enterprise applications, or hosted virtualization services for creating a service chain. In some cases, any one or more of host devices 410 or other computing devices directly host the client application, i.e., not act as virtual machines (e.g., one or more of host devices 410B-410N, such as host device 410B and host device 410N). Although one or more aspects of the present disclosure are described in terms of a virtual machine or virtual host, techniques described herein in terms of one or more aspects of the present disclosure with respect to such a virtual machine or virtual host may also be applied to a container, application, process, or other execution unit (virtualized or non-virtualized) executing on host device 410.
In the example of fig. 4, one or more processors 443 may execute telemetry service 440 to perform operations attributed herein to telemetry service 440, which telemetry service 440 may be stored in a memory (such as storage 450). Telemetry service 440 may include one or more communication units 445, one or more input devices 446, and one or more output devices 447. Storage 450 may include intent service 418, telemetry controller 441, telemetry subscription service 408, and telemetry collector 510.
One or more of the devices, modules, storage areas, and other components of telemetry service 440 may be interconnected to enable inter-component communication (physically, communicatively, and/or operatively). In some examples, such connectivity may be provided by one or more of a communication channel (e.g., communication channel 442), a system bus, a network connection, an inter-process communication data structure, and any other method for transmitting data.
The one or more processors 443 may be part of the NIC of fig. 1 and/or may include processing circuitry to perform operations in accordance with one or more aspects of the present disclosure. Examples of the processor 443 include a microprocessor, an application processor, a display controller, an auxiliary processor, one or more sensor hubs, and any other hardware configured to act as a processor, processing unit, or processing device.
One or more communication units 445 of telemetry service 440 may communicate with devices external to telemetry service 440 by sending and/or receiving data and, in some aspects, may operate as an input device and an output device. In some examples, the communication unit 445 may communicate with other devices, such as the orchestrator 304 and the agent 302 shown in the example of fig. 3, through a network.
One or more storage devices 450 within the telemetry service 440 may store information for processing during operation of the service 440. Storage 450 may store program instructions and/or data associated with one or more of the modules according to one or more aspects of the present disclosure. The one or more processors 443 and the one or more storage devices 450 may provide an operating environment or platform for such modules, which may be implemented as software, but in some examples may include any combination of hardware, firmware, and software. The one or more processors 443 may execute instructions and the one or more storage devices 450 may store data for the instructions and/or one or more modules. The combination of processor 443 and storage 450 may retrieve, store, and/or execute instructions and/or data for one or more applications, modules, or software. The processor 443 and/or the storage 450 may also be operatively coupled to one or more other software and/or hardware components, including, but not limited to, one or more of the components of the telemetry service 440 and/or one or more devices or systems shown connected to the telemetry service 440.
In some examples, the one or more storage devices 450 are implemented as temporary memory, which may mean that the primary purpose of the one or more storage devices is not long-term storage. The storage 450 of the telemetry service 440 may be configured as volatile memory for short-term storage of information and therefore does not retain stored content if deactivated. Examples of volatile memory include Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), and other forms of volatile memory known in the art. In some examples, storage 450 also includes one or more computer-readable storage media. Storage 450 may be configured to store larger amounts of information than volatile memory. The storage 450 may also be configured as non-volatile storage space for long-term storage of information that retains information after activation/deactivation cycles. Examples of non-volatile storage elements include magnetic hard disks, optical discs, flash memory, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.
Intent service 418 receives telemetry intent 430, which expresses, at a high level, telemetry requirements for generating and collecting telemetry data. Telemetry intent 430 may be represented in natural language. For example, telemetry intent 430 may be "collect CPU resource usage metrics from all devices at 1 minute intervals." As another example, telemetry intent 430 may be "collect memory resource usage from devices router A, router B, and router C." Intent service 418 may translate telemetry intent 430 into one or more low-level telemetry commands and protocols that can implement telemetry intent 430. In some cases, a device may support more than one telemetry protocol. In this case, the intent service may translate telemetry intent 430 using a protocol selected according to criteria such as a priority assigned to the protocol, the capabilities of the device with respect to the protocol, and the overhead associated with the protocol. Further, in some aspects, intent service 418 may mediate intents for multiple applications requesting telemetry data from the same device. Intent service 418 can send the low-level telemetry commands (conforming to the selected protocol) and an indication of the selected protocol to telemetry controller 441 to update telemetry acquisition for the affected device.
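For illustration, the following is a minimal sketch of how an intent service might translate an intent such as those above into abstract telemetry configuration parameters and select a per-device protocol; the function names, protocol strings, and the priority-based selection rule are simplifying assumptions, not the specific implementation of intent service 418.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class AbstractTelemetryConfig:
        metric: str            # generic description of the desired telemetry data
        devices: List[str]     # target devices
        interval_s: int        # collection interval in seconds

    # Assumed protocol preference order used when a device supports several protocols.
    PROTOCOL_PRIORITY = ["gnmi", "snmp"]

    def translate_intent(metric: str, devices: List[str], interval_s: int,
                         device_protocols: Dict[str, List[str]]) -> Dict[str, Dict]:
        # Translate a high-level intent into per-device, protocol-specific parameters.
        abstract = AbstractTelemetryConfig(metric=metric, devices=devices, interval_s=interval_s)
        commands: Dict[str, Dict] = {}
        for device in abstract.devices:
            supported = device_protocols.get(device, [])
            # Choose the highest-priority protocol the device supports.
            protocol = next((p for p in PROTOCOL_PRIORITY if p in supported), None)
            if protocol is not None:
                commands[device] = {"protocol": protocol,
                                    "metric": abstract.metric,
                                    "interval_s": abstract.interval_s}
        return commands

    # "Collect CPU resource usage metrics from router A and router B at 1 minute intervals."
    print(translate_intent("cpu_usage", ["routerA", "routerB"], 60,
                           {"routerA": ["gnmi", "snmp"], "routerB": ["snmp"]}))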
Telemetry controller 441 is capable of receiving the low-level telemetry commands and indications of the selected protocols. In some aspects, telemetry controller 441 maintains up-to-date telemetry requirements for each device. The telemetry controller 441 can provide the telemetry collector 214 to devices, such as the leaf device 203 and the spine device 202, specified by the telemetry commands and protocols translated from the telemetry intent 430.
Telemetry subscription service 408 receives a request to subscribe to telemetry data generated by a device. In some aspects, in response to receiving the subscription, telemetry controller 441 may provide telemetry collector 510 if the telemetry collector has not been provided to the device.
Telemetry collector 510 collects telemetry data from the device. Telemetry collector 510 is capable of storing the collected data in a cache or database (not shown in fig. 4 for ease of illustration). Telemetry service 440 is capable of providing collected data for applications or services that have subscribed to the data.
FIG. 5 illustrates a performance monitoring system including a collector for collecting telemetry data and an alert rule evaluator service for evaluating rules using the telemetry data, in accordance with the techniques of the present disclosure. As shown in fig. 5, a metric collector of performance monitoring system 500 may collect telemetry data via metric exporters 504. In an example, system 500 may be a consumer of telemetry data collected by telemetry service 440 and may implement services and rules that can be used to obtain and/or subscribe to telemetry data. Performance monitoring system 500 may analyze the telemetry data according to alert rules that determine whether an alert should be generated based on the telemetry data, as will be further described below. Further, performance monitoring system 500 may include one or more machine learning components (such as machine learning component 521) and may be configured to provide adaptive sampling intervals for collecting telemetry data, to provide adaptive rule evaluation intervals, and/or to recommend alert rules that provide a better understanding of the network.
Referring to FIG. 5, applications and services running within a workload cluster (such as cluster 502) are configured to export various metrics for a network to performance monitoring system 500 via one or more metric exporters 504A-504C. Performance monitoring system 500 may be an example of controller 24 and/or edge service controller 28, which may be configured to control a cluster 502 of virtual machines and communicate with a plurality of metric exporters 504. The system 500 may include a metric collector 510 for collecting telemetry data, a metric time series database (TSDB) 508 for storing telemetry data, a metric querier 512 for receiving queries from users about metrics, and an alert rule evaluator service 514 for evaluating alert rules, such as rules created by a user (e.g., a network administrator). A metric collector 510, such as shown in fig. 5, periodically discovers the metric exporters 504 and collects metrics, for example, by using a pull-based method, wherein the performance monitoring system 500 determines the collection time. Applications and services of the network expose their internal metrics through a metrics exporter 504, which may be an agent 236 that performs the export function. The metrics exporter 504 may be embedded in or run alongside an application/service and expose metrics using HTTP endpoints.
More specifically, metric collector 510 may automatically discover metric exporters 504 in a network (such as data center 10) and collect all metrics exposed by those exporters. Metric collector 510 periodically collects metrics at configured time intervals that define corresponding sample rates. Previous systems typically use fixed time intervals to collect metrics. The collected metrics include metric values associated with metric names and may reside in a metric time series database 508, where the metrics are time-stamped. The time series data is typically stored as dense, high-precision data points, which can later be downsampled and deleted. Furthermore, the TSDB may provide features for generating aggregated data series over time. When used to store time-ordered data, some examples of time series databases have the advantage that the time-ordered nature of the data allows it to be compressed, reducing the storage space (e.g., disk or solid state drive) footprint. The time series database may be SQL (relational) or NoSQL (non-relational) in architecture. NoSQL databases may scale better when operated in a cluster.
The metric querier 512 is configured to interact with the time series database 508 to access the collected metrics. The metric data is accessed using a query language provided by the metric querier 512, which may provide an HTTP-based interface for the user. When a problem occurs, a user may manually query the collected metrics through a metric querier interface that supports the query language, thereby enabling the user to construct complex queries and access metric data.
When a user wants to monitor a metric or a set of metrics offline or in the background, the user can automate metric monitoring by creating specific metric evaluation rules, known as alert rules (also referred to as event rules). These alert rules contain various conditions relating the metric to be evaluated to a set of thresholds. Each alert rule may contain the name of the corresponding metric, a threshold value, and a comparison condition. For example, the user may configure an alert rule to alert the user when the CPU usage of the system exceeds 80%. The rules may be evaluated on a periodic basis, at predetermined time intervals, using the collected metric data and the alert rule evaluator service 514; if the comparison condition is met, referred to herein as a hit, an alert is generated for the user. Alert rule evaluator service 514 includes event reporter 516, alert rule evaluator 518, and alert rule database 520. More specifically, alert rule evaluator 518 periodically reads user-created rules from alert rule database 520 and evaluates rule expressions against the metric data accessed by metric querier 512. Evaluation essentially determines whether the metric satisfies the condition specified by the user in the rule; if so, an alert is generated by alert rule evaluator 518 and transmitted to the user and/or stored by event reporter 516.
When the alert rule evaluator 518 evaluates the rule and determines that the comparison condition is satisfied or true, an alert is generated and the rule evaluation is considered a rule hit; and if no alert is generated, the rule evaluation is considered a rule miss.
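The following is a minimal sketch of an alert rule of the kind described above and its hit/miss evaluation; the field names and the set of comparison operators are assumptions for illustration.

    import operator
    from dataclasses import dataclass

    OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

    @dataclass
    class AlertRule:
        metric_name: str      # name of the metric the rule monitors
        comparison: str       # one of the comparison conditions in OPS
        threshold: float      # threshold configured by the user

    def evaluate(rule: AlertRule, metric_value: float) -> bool:
        # Return True (rule hit, generate an alert) or False (rule miss).
        return OPS[rule.comparison](metric_value, rule.threshold)

    # Example: alert when CPU usage of the system exceeds 80%.
    cpu_rule = AlertRule(metric_name="cpu_usage_percent", comparison=">", threshold=80.0)
    print(evaluate(cpu_rule, 91.3))   # True: rule hit, alert generated
    print(evaluate(cpu_rule, 42.0))   # False: rule miss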
Self-learning metric collector
Metric collectors that use static predetermined time intervals rather than dynamic time intervals (variable sampling rates) to obtain metrics generally work well, but suffer from the following drawbacks:
a. More storage space is required, because a data center may include thousands of exporters and thousands of rules.
b. Metrics that are rarely accessed by users are collected frequently.
c. Analyzing metrics requires a large amount of computational power, because finding relevant metrics requires scanning a large amount of metric data.
d. Redundant acquisition of metrics occurs when the metric values do not change much over time.
Thus, as more metrics are acquired, a metric collector that uses the same time interval for every metric continues to acquire many less useful metrics at that rate. The problems listed above become more apparent and problematic as metrics are collected in a scaled network environment.
In an example, a machine-learning-based intelligent approach is employed in which various analytics obtained from telemetry data are used to train a machine learning model, and the machine learning model is then employed to make predictions using additional telemetry data, such as predicted metric weights or predicted weights for rules. Using machine learning, a metric collector (such as metric collector 510) may learn how to identify the usefulness of a metric based on the relevance of that metric to a user, or an alert rule evaluator service (such as alert rule evaluator service 514) may learn how to identify the relevance or weight of an alert rule.
For example, the relevance of a metric to a user may be measured using various metric attributes (which may be aggregated in some manner) to determine a metric relevance value, also referred to herein as a metric weight. In an example, the higher the metric weight, the greater the likelihood that the user is interested in the metric; the lower the metric weight, the less likely the user is interested in the metric. A higher metric weight may then be used to calculate an updated sampling interval, which may cause the collector to sample the metric more frequently (i.e., the higher the metric weight, the higher the sampling frequency and the shorter the sampling interval).
FIG. 6 illustrates an example of a performance monitoring system 600 including an intelligent collector 610 in accordance with the techniques described herein, the intelligent collector 610 utilizing machine learning via a machine learning module 621 to determine an improved metric acquisition rate. The intelligent collector 610 receives metrics from the metric exporters 604A-604C of the cluster 602. The intelligent collector 610 includes a metric sampler 630 for accessing metrics stored in the database 608; a metric metadata synchronizer 632; and a metric metadata inventory database 638, wherein the metric metadata may include metric correlation data and historical correlation data for each evaluated metric.
The intelligent collector 610 also includes a metric group discovery service 636 for discovering one or more corresponding metrics (or rules) related to a metric (or rule); a metric weight predictor 640 for predicting metric weights based on the machine learning model; a metric variance detector 634 for determining how a metric changes over a period of time; a key metric discovery service 642 for determining key metrics (or rules); and a metric access rate calculator 644 for determining the access rates of metrics and related metrics. Initially, the intelligent collector 610 may sample all exported metrics within the network at a predetermined default sampling interval. After each sampling iteration is completed, a learning process may be triggered to learn more suitable sampling intervals for the exported metrics using the model of the machine learning module 621.
More specifically, performance monitoring system 600 may include a metric timing database 608, a metric querier 612, and a query history database 618. The user may access the metrics querier 612 to query metrics stored in the metrics timing database 608. The query history of the metric querier may be stored in a query history database 618. Performance monitoring system 600 also includes an alarm rules evaluator service 614 (operating in a similar manner to alarm rules evaluator service 514 of FIG. 5) and includes an event reporter 616 for reporting alarms/events based on the results of the rule evaluation; an alert rule evaluator 624 for evaluating rules to generate rule evaluation results; an alert rule database 620 for storing alert rules; and an alarm rule history database 622 for storing alarm rule evaluation results including alarms/events.
Fig. 8 is an exemplary sequence diagram for determining new sampling intervals for metric acquisition in accordance with the techniques described in this disclosure. As shown in fig. 8, the intelligent collector 610 may initially collect all metrics from the metric exporters 604 at a default interval (e.g., a predetermined default sampling rate). For each metric, the corresponding metric values with their associated metric name are stored in the metric TSDB 608. Alert rule evaluator service 614 evaluates alert rules using the stored metrics and stores events/alerts in alert history database 622, while query histories are stored in database 618.
The metric intelligent collector 610 is enabled to determine various metric attributes. For example, metric intelligent collector 610 may read the query history for a given metric and associated alert rule, determine a corresponding metric access rate using calculator 644, and store the access rate in metric metadata database 638. The intelligent collector 610 may read the metric event history, determine a metric threshold hit rate, and store the threshold hit rate. The metric group discovery service 636 may discover one or more related metrics related to a given metric, wherein the related metrics define a group, evaluate the related metrics and associated rules in the group, and determine a group access rate based on the evaluation of the rules associated with the related metrics in the group. Other metric attributes relevant to the desired sampling rate for a given metric or group of metrics may also be determined. Using the determined metric attributes, the intelligent collector determines a predicted metric weight for the given metric and its associated alert rules using the metric weight predictor 640. Using the predicted metric weight, an updated acquisition sampling rate may be determined and subsequently used by the metric sampler 630 to acquire additional metric values. In this way, a given metric may be acquired using a sampling rate tailored to the specified metric or set of related metrics.
In an example, metric weights may be determined based on various considerations and metric attributes, such as those expressed in the following heuristics:
a) If the user has accessed the metrics (at a certain frequency) by a query, the user may access the same metrics again in the near future.
b) The user may also be interested in the relevant metrics if the user has accessed the metrics by the query.
The relevant metrics may be identified based on various factors, such as:
1. if the user accesses two metrics frequently or simultaneously during a certain time window, it can be considered that these metrics are relevant.
2. If a user accesses two metrics when a certain system event occurs, it can be considered that these metrics are relevant.
3. If two metrics share a common metric tag (e.g., a CPU-related metric), then it may be considered that these metrics are related.
4. If two metrics are exported by the same or interdependent software components or modules, it can be considered that these metrics are related.
Based on the above considerations, a set of metric-related attributes, such as metric access rate, metric threshold hit rate, related metric set access rate, and metric variance, may be calculated for each metric and used to calculate the weight of the metric, as described below.
a) Metric access rate: may be defined as the ratio between the access rate and the sampling rate, and indicates how many times the metric has been accessed to evaluate alert rules compared to the number of times it was sampled in a given period of time. A higher value indicates that the user is more interested in the metric, while a lower value indicates that the user is less interested in the metric. In other words, the access rate of the metric is the number of times the metric is accessed within a certain fixed duration. The sampling rate of the metric is the number of samples of the metric over that fixed duration. The metric access rate may be determined by the metric access rate calculator 644.
Metric access rate = access rate/sampling rate
b) The metric threshold hit rate may be defined as the number of metric values that cross a threshold set by the user in the corresponding alert rule divided by the number of samples. In general, a user sets thresholds in alert rules for useful metrics to monitor system behavior. The results of the alert rule evaluator service 614 may be used to determine a metric threshold hit rate with the intelligent collector 610.
Metric threshold hit rate = number of metric values that crossed the threshold / number of samples
c) Related metric group access rate: may be defined as the average access rate of the metrics in the set of metrics related to the metric. The related metrics may be identified using the metric tags, origin, and query history. Two metrics may be considered related metrics when they contain a common tag, originate from the same exporter/software component, or are queried together. The group access rate may be calculated using an average of the access rates of the individual metrics in the group. This metric attribute may be determined using the metric group discovery service 636 and the metric access rate calculator 644.
Related metric group access rate = average(access rates of all metrics in the group)
If there is more than one related group, the related metric group access rate may be determined as the average over groups: related metric group access rate = average(access rates of all groups).
d) Key metric label: when a metric is part of a critical event, the metric may be automatically marked as critical. A key metric label may be attached to these types of metrics, and when so attached, the greatest weight may be assigned to the metric, which results in more frequent sampling of the metric. Some examples of metrics that may be labeled this way include packet drops, CRC error counts, and the like. In some cases where metrics are manually labeled as critical metrics or system metrics, a minimum sampling interval value (i.e., a predetermined high sampling rate) is automatically used. This may be determined by the key metric discovery service 642.
e) Metric variance: may be defined as the difference between two metric values sampled within a particular time window. The metric variance indicates how the metric value changes over a period of time. Metrics with lower variance may be given lower weights and will be sampled at longer intervals. The metric variance may be determined by the metric variance detector 634.
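The following is a minimal sketch of how the attributes a) through e) above might be computed for one metric; the helper names, the use of a greater-than comparison for threshold crossings, and the example values are assumptions for illustration.

    from typing import Sequence

    def metric_access_rate(access_count: int, sample_count: int) -> float:
        # a) Metric access rate = access rate / sampling rate over a fixed duration.
        return access_count / sample_count if sample_count else 0.0

    def metric_threshold_hit_rate(values: Sequence[float], threshold: float) -> float:
        # b) Metric threshold hit rate = values crossing the threshold / number of samples.
        return sum(1 for v in values if v > threshold) / len(values) if values else 0.0

    def related_group_access_rate(access_rates: Sequence[float]) -> float:
        # c) Related metric group access rate = average access rate of metrics in the group.
        return sum(access_rates) / len(access_rates) if access_rates else 0.0

    def metric_variance(values: Sequence[float]) -> float:
        # e) Metric variance: difference between two values sampled within a time window.
        return abs(values[-1] - values[0]) if len(values) >= 2 else 0.0

    # Example attribute values for one metric over an observation window.
    values = [62.0, 71.5, 84.0, 79.0]
    attributes = {
        "access_rate": metric_access_rate(access_count=12, sample_count=60),
        "threshold_hit_rate": metric_threshold_hit_rate(values, threshold=80.0),
        "group_access_rate": related_group_access_rate([0.2, 0.35, 0.15]),
        "variance": metric_variance(values),
    }
    print(attributes)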
Using the correlation attributes of a metric, its weight can be predicted using the machine learning model of the metric weight predictor 640 and the machine learning module 621. The machine learning module 621 can read historical data related to metrics, such as the historical correlation attributes of the metrics, and train a metric weight machine learning model. The intelligent collector training process comprises determining the relevant attributes of the metrics; feeding the relevant attributes into the machine learning model; predicting a metric weight; and determining an updated sampling interval (corresponding to the acquisition frequency).
Specifically, the metric weight can be calculated using linear regression over the metric attributes R1, R2, R3, and R4, which are defined as follows:
Metric access rate: R1
Metric threshold hit rate: R2
Related metric group access rate: R3
Metric variance: R4
W1=a+b(R1)
W2=a+b(R2)
W3=a+b(R3)
W4=a+b(R4)
where a = [ (ΣWi)(ΣRi^2) - (ΣRi)(ΣRiWi) ] / [ n(ΣRi^2) - (ΣRi)^2 ] and b = [ n(ΣRiWi) - (ΣRi)(ΣWi) ] / [ n(ΣRi^2) - (ΣRi)^2 ], where n is the number of samples (historical (Ri, Wi) observations) and the sums are taken over those samples. The metric weight may be calculated by taking the average of the predicted weights W1, W2, W3, and W4.
Using the calculated metric weight (or a determination that the metric is critical), an updated sampling interval/sampling rate can be determined. The intelligent collector 610 may then use the new sampling interval to obtain metric values for this metric.
FIG. 7 is an exemplary flow chart of an intelligent collector 610 of a performance monitoring system 600 for determining metric weights and corresponding updated sampling intervals for metric acquisition in accordance with the techniques described in this disclosure. At 702, intelligent collector 610 reads metrics from metrics database 708. At 704, the intelligent collector may determine one or more of: metric access rate, metric variance, whether the metric is a key metric, and metric threshold hit rate. At 708, a determination is made as to whether a query history for other metrics is available. If so, the process proceeds to 712. If not, the process proceeds to 710.
At 712, the query history is fed into the machine learning model, where one or more related metric sets may be determined at 714 using the metric group discovery service 636. At 716, the intelligent collector may use the metric access rate calculator 644 to calculate a metric access rate for the metrics in the one or more groups. At 710, the relevant attributes as calculated at 704 or 716 are persisted; and at 720, this metric correlation data may be stored, such as in database 638, and the process proceeds to 718. At 718, a determination is made as to whether a next metric is to be evaluated. If not, the process ends. If another metric is to be evaluated, the process proceeds to 702 to repeat steps 702-718 for the next metric.
To predict metric weights and calculate updated sampling intervals, at 730, the machine learning module may read historical correlation data for the metrics. At 732, a machine learning model for predicting metric weights and acquisition sampling intervals is trained using the historical correlation attributes. At 734, the correlation attributes of a particular metric are read; and at 736, these correlation attributes are passed to the machine learning model. At 738, the metric weight is predicted; and at 740, a sampling interval is calculated, such as by dividing the default sampling interval by the predicted metric weight or, if the metric is a key metric, by using a predetermined minimum collection sampling interval (maximum frequency). At 742, additional metrics are evaluated using steps 734-740, and after all metrics are evaluated, the process ends.
The following includes exemplary pseudocode implementing the techniques described above:
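A minimal Python sketch is given here, assuming a simple per-attribute least-squares fit as in the formulas above; the function names, the historical training data, and the minimum-interval value are illustrative assumptions rather than a specific implementation of intelligent collector 610.

    from typing import List, Sequence, Tuple

    def fit_linear(r: Sequence[float], w: Sequence[float]) -> Tuple[float, float]:
        # Least-squares fit of w = a + b*r from historical (attribute, weight) pairs.
        n = len(r)
        sum_r, sum_w = sum(r), sum(w)
        sum_rw = sum(ri * wi for ri, wi in zip(r, w))
        sum_r2 = sum(ri * ri for ri in r)
        denom = n * sum_r2 - sum_r ** 2
        a = (sum_w * sum_r2 - sum_r * sum_rw) / denom
        b = (n * sum_rw - sum_r * sum_w) / denom
        return a, b

    def predict_metric_weight(history: List[Tuple[List[float], float]],
                              attributes: List[float]) -> float:
        # Predict a metric weight as the average of per-attribute regression predictions.
        predictions = []
        for i, r_value in enumerate(attributes):          # R1..R4 for this metric
            r_hist = [h[0][i] for h in history]           # historical values of attribute Ri
            w_hist = [h[1] for h in history]              # historical metric weights
            a, b = fit_linear(r_hist, w_hist)
            predictions.append(a + b * r_value)           # Wi = a + b * Ri
        return sum(predictions) / len(predictions)

    def updated_sampling_interval(default_interval_s: float, weight: float,
                                  is_key_metric: bool, min_interval_s: float = 10.0) -> float:
        # Higher weight (or a key metric) gives a shorter sampling interval.
        if is_key_metric:
            return min_interval_s
        return max(min_interval_s, default_interval_s / max(weight, 1e-6))

    # Example: historical (R1..R4, weight) observations and a new attribute vector.
    history = [([0.1, 0.0, 0.2, 1.0], 0.5), ([0.4, 0.2, 0.3, 5.0], 1.5),
               ([0.8, 0.5, 0.6, 9.0], 3.0)]
    weight = predict_metric_weight(history, [0.6, 0.3, 0.4, 4.0])
    print(weight, updated_sampling_interval(60.0, weight, is_key_metric=False))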
Machine learning for telemetry rule evaluation
As previously described, a background service (referred to as an alert rule evaluator service) periodically evaluates alert rules. Some performance monitoring systems use static time intervals to evaluate alert rules. When the alert rule evaluator service generates an alert for the user because the comparison condition is true, the rule evaluation is considered a rule hit; otherwise, if an alert is not generated, it is considered a rule miss. The periodic rule evaluation process involves computationally intensive tasks such as querying massive telemetry data, aggregating telemetry data, and comparing the aggregated data to multiple thresholds.
When a large number of rules are deployed in a computing resource constrained environment, rule evaluation processes using static time intervals cannot properly evaluate the rules. Furthermore, computing resources may be wasted when processing rules in an overloaded system. To circumvent these scaling problems, administrators typically limit the number of rules they configure, or increase the rule evaluation interval.
An optimized rule evaluation period means that rules can be evaluated at different frequencies based on each rule's past evaluation success or failure (hit or miss) rate. This means that when a rule has been missing for a long time, a solution based on a fixed evaluation rate will waste resources, because the likelihood of a successful evaluation in the near term is low.
In an example, the performance monitoring system uses a machine-learning-based intelligent method for rule evaluation. Using this method, rules are periodically evaluated at optimized rule evaluation intervals, and these evaluation intervals may change over time as network conditions change. The evaluation time interval of a rule may be assigned based on the determined weight of the rule. The determined weight of a rule may indicate the priority of the rule and may be inversely proportional to the desired rule evaluation interval. In other words, when the weight of a rule is higher, the corresponding evaluation interval is lower, and vice versa. A machine learning model and past rule evaluation data may be used to predict rule weights.
FIG. 9 illustrates an example of a performance monitoring system 900 according to the techniques described herein, the performance monitoring system 900 utilizing machine learning via a machine learning module 921 to determine improved evaluation intervals for evaluating alert rules. The performance monitoring system includes a metric collector 910, a metric TSDB 908, a metric querier 912, and an alert rule evaluator service 914. As shown in FIG. 9, the alert rule evaluator service includes an alert rule database 920, an alert rule evaluator 924, an event reporter 916, a rule evaluation history database 928, an alert rule history analyzer 926, and a rule weight predictor 922. Alert rule evaluator service 914 may operate in a similar manner as alert rule evaluator services 514, 614 described herein and may include additional capabilities using machine learning. The service 914 may store rule evaluation results in rule evaluation history DB 928, which may be a persistent database; and the machine learning module 921 can use these results to derive analytics related to alert rule evaluation, such as by predicting rule weights. As previously described, when an alert rule evaluation generates an alert (by comparing a metric value to a rule threshold and determining that the rule comparison condition is true), the evaluation is considered a rule hit. When no alert is generated, this means that the rule comparison condition is not true, and the evaluation is considered a rule miss. By analyzing the rules and their corresponding metrics over time, such as by determining hits and misses for a series of collected metric values at a first rule evaluation interval, and predicting rule weights, an updated evaluation interval may be determined based on the predicted rule weights.
In particular, the machine learning module 921 can interact with the alert rule evaluator service 914 or be incorporated into the alert rule evaluator service 914. Alert rule history analyzer 926 may analyze rule evaluation histories and derive rule analytics, as described below. Alert rule history analyzer 926 may provide (or, in other words, communicate, possibly as a reference to a memory location (e.g., an indicator) storing such analytics) the analytics to rule weight predictor 922, and rule weight predictor 922 may process the analytics to determine rule weights and, thus, corresponding post-update evaluation intervals for rule evaluation. For example, the following rule analytics may be used:
a) The Rule Hit Rate (RHR) may be defined as the number of successful evaluations (alerts/hits) out of the total number of evaluations of the rule.
RHR = number of alerts generated by the rule / total number of evaluations of the rule
b) The Relevant Rule Hit Rate (RRHR) may be defined as the evaluation success rate of other alert rules related to a given alert rule. For example, two rules may be considered related rules if they contain metrics from the same source in the network. This analytic provides an indication of the overall health of the system. A higher RRHR value indicates poor system health and a high likelihood of the rule generating an alert. The relevant rule hit rate can be calculated as follows:
RRHR = total number of hits of all relevant rules / total number of evaluations of the relevant rules
c) The Rule Close Miss Rate (RCMR) may be defined as the number of evaluations that fail to generate an alert because the evaluated metric value is less than, but within a small margin of, the rule threshold. In other words, the rule is evaluated with a value that is only slightly less than the rule threshold. The miss margin may be calculated as a percentage relative to the rule threshold.
Miss margin = 100- [ (evaluation value x 100)/rule threshold ]
To mark an evaluation as a near miss, the miss margin value may be compared to an acceptable margin limit, and when the margin falls within the acceptable margin limit, the evaluation is considered a near-miss evaluation. For example, if the rule threshold is 20 and the evaluated value of the rule is 18, the miss margin = 100 - [ (18 x 100)/20 ], that is, 10%. This means that the rule evaluation failed to generate an alert by this 10% margin. Suppose that 20% is used as the miss margin threshold for considering an evaluation to be a near miss. A miss margin of 10% therefore makes this evaluation a near-miss evaluation. The rule close miss rate may be calculated as follows:
RCMR = number of near-miss evaluations / total number of evaluations
d) Rule metric criticality: rule metrics may be considered critical, such as when a tag is attached to a rule, when any of the metrics included within a rule are part of a critical event and are tagged by a user, or when the metric metadata at runtime is tagged as critical. For example, when a user observes a log event for packet loss, this event may be labeled as critical, and all network-related metrics may also be considered critical metrics. When a metric is labeled as critical, the rule weight related to this metric may be set to a maximum value, which may enable the rule evaluator to evaluate these rules at a higher frequency (i.e., using smaller evaluation intervals).
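The following is a minimal sketch of how the rule analytics a) through c) above might be computed; the function names, the 20% margin-limit default, and the example values are assumptions for illustration.

    from typing import Sequence

    def rule_hit_rate(alert_count: int, evaluation_count: int) -> float:
        # a) RHR = alerts generated by the rule / total evaluations of the rule.
        return alert_count / evaluation_count if evaluation_count else 0.0

    def relevant_rule_hit_rate(related_hits: int, related_evaluations: int) -> float:
        # b) RRHR = total hits of all relevant rules / total evaluations of those rules.
        return related_hits / related_evaluations if related_evaluations else 0.0

    def miss_margin_percent(evaluated_value: float, threshold: float) -> float:
        # Miss margin = 100 - (evaluated value * 100 / rule threshold).
        return 100.0 - (evaluated_value * 100.0 / threshold)

    def rule_close_miss_rate(values: Sequence[float], threshold: float,
                             margin_limit_percent: float = 20.0) -> float:
        # c) RCMR = near-miss evaluations / total evaluations.
        near_misses = sum(
            1 for v in values
            if v < threshold and 0.0 <= miss_margin_percent(v, threshold) <= margin_limit_percent
        )
        return near_misses / len(values) if values else 0.0

    # Worked example from the text: threshold 20, evaluated value 18 gives a 10% miss margin.
    print(miss_margin_percent(18, 20))                      # 10.0
    print(rule_close_miss_rate([18, 12, 25, 19.5], 20.0))   # 2 near misses out of 4 -> 0.5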
Using the calculated rule analysis described above, the rule weight predictor 922 may predict weights for rules. Depending on the rule weight, a new evaluation interval may be derived for the rule using an inverse relationship.
The rule analysis may be periodically computed over a predetermined period of time to produce a rule weight prediction, such as using the following:
The rule hit rate is R1
R1 = (alert count / evaluation count)
The relevant rule hit rate is R2
R2 = (relevant rule hit count / relevant rule evaluation count)
The rule close miss rate is R3
R3 = (near-miss evaluation count / evaluation count)
The rule weight can be predicted using a linear regression formula and the individual ratios R1, R2, R3 calculated above.
W1=a+b(R1)
W2=a+b(R2)
W3=a+b(R3)
wherein a = [ (ΣWi)(ΣRi^2) - (ΣRi)(ΣRiWi) ] / [ n(ΣRi^2) - (ΣRi)^2 ] and b = [ n(ΣRiWi) - (ΣRi)(ΣWi) ] / [ n(ΣRi^2) - (ΣRi)^2 ], where n is the number of samples (historical (Ri, Wi) observations).
The average of the predicted weights W1, W2, and W3 may be taken as the rule weight.
The alert rule evaluator service 914 may then calculate a new evaluation interval for the rule based on the rule weights:
post-update evaluation interval= (default or first rule evaluation interval/rule weight).
Alert rule evaluator service 914 can use the updated evaluation interval to evaluate the rule subsequently using the newly acquired metrics. In an example, performance monitoring system 900 may also coordinate the determined new regular evaluation interval with the metric acquisition sampling rate. For example, if the rule evaluates less frequently than before, it may be desirable to slow down the acquisition of the relevant metric at the same time; if the rule evaluates more frequently than before, it may be desirable to increase the acquisition rate (decrease the acquisition sampling interval) of the relevant metric.
FIG. 10 illustrates an exemplary sequence diagram of the performance monitoring system 900 of FIG. 9. Alert rules, such as user-created alert rules, may be created and stored in alert rule database 920. Alert rule evaluator service 914 may read rules from alert rule database 920 and evaluate the rules, using a first evaluation interval, by accessing metric querier 912 to receive the corresponding metric values for the metrics, comparing the metric values to the corresponding thresholds of the rules, and determining hit counts and miss counts for each rule. The evaluation results may be stored in rule evaluation database 928. Alert rule evaluator service 914 can discover relevant metrics and relevant rules for a given rule and calculate relevant rule hit counts and miss counts. The alert rule service may also calculate rule hit rates, relevant rule hit rates, and close miss rates. These rule attributes and the rule history may be used as training data for the rule weight predictor of the machine learning model. The machine learning model may use rule attributes corresponding to the first evaluation interval to predict a rule weight, determine an updated rule evaluation interval based on the predicted rule weight, and then evaluate the rule using the updated rule evaluation interval.
Fig. 11 is a flow chart illustrating actions of the alert rule evaluator service 914 in accordance with the disclosed technology. First, at 1100, the alert rule evaluator service can determine whether a rule history exists. If not, at 1108, rules may be evaluated at a default first evaluation interval, each rule being evaluated multiple times, with the evaluation results recorded in the history database. If a rule history exists, at 1102, the rule history corresponding to the rule may be checked; and at 1104, rule evaluation analytics (rule attributes) may be determined and used as training data to update the rule weight predictor of the machine learning module 921. At 1106, a determination may be made as to whether the rule includes a key metric. If it is determined that the rule has a key metric, then at 1108, a maximum weight may be assigned to the rule; and at 1116, the post-update evaluation interval may be determined as a function of the predicted weight. In an example, a predetermined minimum evaluation interval may be used for key metrics. If it is determined that the rule does not have a key metric, at 1112, a rule weight may be determined for the rule using the rule evaluation analytics/rule attributes; at 1114, the predicted weight may be assigned to the rule; and at 1116, the post-update evaluation interval may be determined as a function of the predicted weight. In an example, the post-update evaluation interval is inversely proportional to the rule's predicted weight. At 1118, the rule may then be evaluated using the updated evaluation interval; and at 1120, these new evaluation results may be stored in rule evaluation history database 928. The actions described in fig. 11 may continue to optimize the rule evaluation interval on an ongoing basis as conditions of the monitored network change.
The following includes exemplary pseudocode implementing the techniques described above:
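A minimal Python sketch is given here, reusing the same per-attribute least-squares fit described above for rule weights; the function names, the historical training data, the maximum weight, and the minimum-interval value are illustrative assumptions rather than a specific implementation of alert rule evaluator service 914.

    from typing import List, Sequence, Tuple

    def fit_linear(r: Sequence[float], w: Sequence[float]) -> Tuple[float, float]:
        # Least-squares fit of w = a + b*r, as in the formulas above.
        n = len(r)
        sum_r, sum_w = sum(r), sum(w)
        sum_rw = sum(ri * wi for ri, wi in zip(r, w))
        sum_r2 = sum(ri * ri for ri in r)
        denom = n * sum_r2 - sum_r ** 2
        return ((sum_w * sum_r2 - sum_r * sum_rw) / denom,
                (n * sum_rw - sum_r * sum_w) / denom)

    def predict_rule_weight(history: List[Tuple[List[float], float]],
                            analytics: List[float], is_key_metric: bool,
                            max_weight: float = 10.0) -> float:
        # Predict a rule weight from R1..R3; key-metric rules get the maximum weight.
        if is_key_metric:
            return max_weight
        predictions = []
        for i, r_value in enumerate(analytics):
            r_hist = [h[0][i] for h in history]           # historical values of analytic Ri
            w_hist = [h[1] for h in history]              # historical rule weights
            a, b = fit_linear(r_hist, w_hist)
            predictions.append(a + b * r_value)           # Wi = a + b * Ri
        return sum(predictions) / len(predictions)

    def updated_evaluation_interval(first_interval_s: float, rule_weight: float,
                                    min_interval_s: float = 10.0) -> float:
        # Post-update evaluation interval = first evaluation interval / rule weight.
        return max(min_interval_s, first_interval_s / max(rule_weight, 1e-6))

    # Example: historical ([RHR, RRHR, RCMR], weight) observations for trained rules.
    history = [([0.0, 0.1, 0.0], 0.5), ([0.3, 0.4, 0.2], 2.0), ([0.7, 0.6, 0.5], 4.0)]
    weight = predict_rule_weight(history, [0.5, 0.45, 0.3], is_key_metric=False)
    print(weight, updated_evaluation_interval(60.0, weight))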
Self-learning telemetry alarm rule recommender
FIG. 12 illustrates an example of a performance monitoring system 1200, the performance monitoring system 1200 providing recommended alert rules for performance monitoring of a network of computing devices, in accordance with the techniques described in this disclosure. The performance monitoring system 1200 may be similar in many respects to the monitoring systems 500, 600, 900 described herein, and may also include various other components of these systems. As shown in fig. 12, the system 1200 may include a metric collector 1210, a metric TSDB 1208, a metric querier 1212, an alert rule evaluator service 1214, and a machine learning module 1222. Alert rule evaluator service 1214 may include an alert rule database 1220, an alert rule evaluator 1218, and an event reporter 1216. The alert rule evaluator service 1214 may store alert rule evaluation results in a persistent database, and the machine learning module 1222 may use this data to derive analytics regarding user-created alert rules and temporary related rules, to automatically generate recommended alert rules that fine-tune the information produced by alerts so that it is more relevant to the user.
For example, if the network system CPU usage is high, an administrator will typically look for the application or module in the system that consumes the most CPU resources or performs the most CPU-intensive operations. After such analysis is performed, the administrator typically creates one or more alert rules using the relevant metrics to capture the high CPU problem before it reoccurs and possibly take action to prevent the system CPU usage from becoming too high.
Such manual creation of alert rules may be time consuming and may require an administrator to analyze the metric data and attempt to identify suspicious metrics that may be related to the problem the administrator is trying to diagnose. This becomes more difficult when the amount of telemetry data is large. The process of manually creating a set of appropriate alert rules to diagnose a problem may be time consuming, inefficient, and in some cases unsuccessful because of the delay before manually created alert rules take effect. For example, by the time an administrator starts an investigation, or by the time a new rule is added, the defect or problem may no longer be present.
A performance monitoring system according to the disclosed technology, using a machine-learning-based intelligent alert rule creation method, can automatically find relevant metrics related to the metrics of existing rules and recommend additional alert rules for future analysis of problems. The recommended alert rules may be implemented automatically or only after user approval, and provide a way to ease the burden of manual rule creation while conserving network resources by supplying alert rules that are relevant and provide meaningful information about the network. The efficiency and operation of the performance monitoring system are thus improved: problems can be discovered and resolved faster, and the computational resources used to determine network problems can be reduced. This benefits both the performance monitoring system itself and the network being monitored, because correlation rules that use relevant associated metrics mean that unrelated metrics need not be exported by the network nor collected by the performance monitoring system 1200.
Alert rules, such as user-created alert rules, may be stored in alert rule database 1220. To evaluate an alert rule, alert rule evaluator service 1214 reads the alert rule and its associated metric name, accesses metric querier 1212 to receive a corresponding metric value from metric TSDB 1208 for the metric name in the rule, compares the metric value to the corresponding rule threshold, may provide an alert via event reporter 1216 when a rule hit occurs, and may record evaluation results, including hits and misses, in a rule evaluation history database (not specifically shown in fig. 12).
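As a non-limiting illustration of this evaluation path, the following Python sketch models a single evaluation pass. The querier, reporter, and history-store interfaces, the AlertRule fields, and the operator table are all assumptions made for the sketch, not the disclosed implementation.

# Illustrative sketch only; class names, fields, and interfaces are assumptions.
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

@dataclass
class AlertRule:
    name: str
    metric_name: str
    comparison: str      # e.g. ">"
    threshold: float

def evaluate_rule(rule: AlertRule, metric_querier, event_reporter, history_store) -> bool:
    """Evaluate one alert rule against the latest metric value and record the result."""
    # The querier stands in for metric querier 1212 backed by the metric TSDB.
    value = metric_querier.query_latest(rule.metric_name)
    hit = OPS[rule.comparison](value, rule.threshold)
    if hit:
        # A rule hit raises an alert through the event-reporter interface.
        event_reporter.report(rule.name, rule.metric_name, value)
    # Hits and misses alike are persisted for later analysis by the machine learning module.
    history_store.record(rule.name, hit=hit, value=value)
    return hit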
FIG. 13 illustrates an exemplary sequence diagram for the performance monitoring system 1200 of FIG. 12. The alert rule evaluator service 1214 may read a user-created alert rule and its associated metric name. The alert rule evaluator service discovers a set of relevant metrics related to the metrics of the user-created alert rule and, as described in more detail below, uses the discovered relevant metrics to automatically create temporary correlation rules. The temporary correlation rules may be stored, for example, in alert rule database 1220. The user-created alert rule is evaluated using the corresponding metrics from the metric TSDB 1208. If a rule miss occurs, the miss count is incremented; if a rule hit occurs, the hit count is incremented and an alert may be generated. The evaluation count (total number of evaluations) of the user-created alert rule may also be tracked. Corresponding metrics from the metric TSDB 1208 may also be used to evaluate the automatically generated temporary correlation rules. For each temporary correlation rule, corresponding rule attributes, such as one or both of a temporary rule hit rate and a temporary rule miss rate, and possibly other rule correlation attributes, may be determined as described below. The machine learning module 1222 may predict the weight of each automatically generated temporary rule. When the predicted weight of a temporary rule is greater than (or equal to) a predetermined acceptable value, the temporary rule may be recommended to the user; when the predicted weight is less than (or equal to) the acceptable value, the temporary rule may be discarded and not provided to the user as a recommended rule.
FIG. 14 is a flowchart illustrating exemplary operations performed by the alert rule evaluator service and machine learning module of the performance monitoring system of FIG. 12 to create and evaluate temporary rules in accordance with the techniques of the present disclosure. At 1402, the alert rule evaluator service 1214 may read a user-created alert rule from a user rule table (which may be the alert rule database 1220). At 1404, the alert rule evaluator service 1214 finds relevant metrics, and at 1406 creates one or more temporary correlation rules. At 1408, the temporary correlation rules may be persisted, such as in a temporary automatic rules table (which may be the alert rule database 1220). At 1410, each created temporary rule may be read and evaluated to determine a hit count, a miss count, and an evaluation count; and at 1412, hit and miss rates for the temporary rule may be calculated. At 1414, a relative hit rate and a relative miss rate for the temporary rule may be calculated. At 1416, this information is maintained, such as in a temporary rule evaluation results table. Each of the temporary rules associated with the user-created rule may be evaluated according to steps 1410-1416. At 1418, if additional user-created rules exist, the one or more temporary correlation rules for each of those rules are determined and evaluated in the same manner as described above.
The machine learning module 1222 may be trained to predict alert rule weights for the temporary rules in order to determine the most relevant temporary rules. For example, as shown in fig. 14, at 1420, the machine learning module may read historical evaluation data for each temporary rule; and at 1422, a rule recommender machine learning model may be trained. At 1424, the relative hit rate and relative miss rate of a temporary rule may be read from memory; and at 1426, the relative hit rates and relative miss rates of the temporary rules may be fed into the machine learning module, which analyzes the data using one or more of the rule correlation attributes described below to determine a predicted weight for each corresponding temporary alert rule. At 1428, a prediction may be made as to whether the rule is among the most relevant. If so, the rule is added to the list of recommended rules. A prediction as to whether the rule is among the least relevant may also be made; if so, the rule is discarded and not recommended. The sets of temporary rules associated with other user-created alert rules may be evaluated in the same manner.
When a user creates an alert rule, a set of related temporary alert rules may be automatically created, and the temporary rules may be evaluated at an interval that is a multiple of the evaluation interval of the associated user-created rule. For example, when the user-created rule's evaluation interval is 30 seconds, the related temporary rules' evaluation interval may be 30 x T seconds, where T may be a predetermined value, a random variable, an exponentially distributed variable, or the like.
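A small Python sketch of one way such an interval could be derived is shown below; the particular choices for T (a fixed factor, a uniform draw, or an exponential draw) are assumptions for illustration only.

# Illustrative only; the factor T and its distribution are assumptions.
import random

def temporary_rule_interval(user_rule_interval_s: float, mode: str = "fixed") -> float:
    """Derive the temporary rules' evaluation interval from the user rule's interval."""
    if mode == "fixed":
        t = 2.0                        # T as a predetermined value
    elif mode == "uniform":
        t = random.uniform(1.0, 4.0)   # T as a random variable
    else:
        t = random.expovariate(0.5)    # T as an exponentially distributed variable
    return user_rule_interval_s * max(t, 1.0)

# Example: with a 30-second user rule interval, "fixed" mode yields 30 x 2 = 60 seconds.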
When a user-created alert rule is evaluated and a hit occurs, the machine learning model identifies the most relevant rule among the temporary rules, and this temporary rule may be converted into a regular rule for future failure analysis.
A set of metrics may be identified as related metrics for each metric in the user-created rule, such as when a related metric originates from the same service, component, or module as the metric in the user-created rule and/or shares a common metric tag. The tags may be used as keywords and may serve as indicators of different types of metrics. The determined related metrics may be translated into a set of temporary rules using a set of metric aggregation and comparison operators.
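The following Python sketch illustrates one possible form of this matching step (same source or a shared label value); the Metric structure, its field names, and the matching policy are assumptions rather than the disclosed design.

# Illustrative sketch; the Metric fields and the matching policy are assumptions.
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    source: str                        # emitting service, component, or module
    labels: dict = field(default_factory=dict)

def find_related_metrics(rule_metric: Metric, candidates: list) -> list:
    """Return metrics that share a source or a label value with the rule's metric."""
    rule_label_values = set(rule_metric.labels.values())
    related = []
    for m in candidates:
        if m.name == rule_metric.name:
            continue                   # skip the rule's own metric
        same_source = (m.source == rule_metric.source)
        shared_label = bool(rule_label_values & set(m.labels.values()))
        if same_source or shared_label:
            related.append(m)
    return related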
For example, suppose a user has created an alert rule to monitor for points in time when the average aggregate value of the metric "system_cpu_usage" is greater than 80%. The metric of that alert rule may originate from a system resource monitoring agent running on the machine and may be labeled "cpu". There may be several other processes running on the same machine, and these processes export their own metrics for CPU usage and CPU-intensive operations (e.g., encryption/decryption counts, etc.). These other metrics are typically labeled "cpu", "cpu_intense_op", and so on.
Assume some of the metrics exported by these processes are as follows:
1. Metric #1: metric name: app_x_cpu_use; metric tag: label1=cpu
2. Metric #2: metric name: app_x_encrypt_op_count; metric tag: label2=cpu_intense_op
3. Metric #3: metric name: app_y_net_if_down_count; metric tag: label1=net_err
4. Metric #4: metric name: app_y_cpu_use; metric tag: label1=cpu
The related metric identification process identifies metrics 1, 2, and 4 as related metrics, because these metrics either originate from the same machine or have a common label. A set of temporary rules may then be created for the user by applying different combinations of aggregation and comparison functions. The thresholds for the temporary alert rules may be calculated based on instrumentation metadata about the metrics. For example, the instrumentation metadata for the metric "app_x_encrypt_op_count" may indicate approximately what percentage of CPU is consumed per operation.
In the above example, temporary rules may be created as follows.
Temporary rule-1: the average value of "app_x_cpu_use" is greater than 80%
Temporary rule-2: the average value of "app_x_cpu_use" is less than 40%
Temporary rule-3: the value of "app_x_encrypt_op_count" is greater than 500
Temporary rule-4: the value of "app_x_encrypt_op_count" is less than 100
Temporary rule-5: the average value of "app_y_cpu_use" is greater than 80%
Temporary rule-6: the average value of "app_y_cpu_use" is less than 40%
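The temporary rules above could be produced by enumerating combinations of aggregation and comparison operators over the related metrics, as in the following Python sketch; the operator lists, threshold sources, and function names are illustrative assumptions, and the thresholds would in practice be derived from instrumentation metadata as described above.

# Illustrative sketch; operator lists, thresholds, and names are assumptions.
from itertools import product

def generate_temporary_rules(related_metrics, thresholds):
    """Create temporary rules as combinations of aggregation and comparison operators.

    `thresholds` maps (metric_name, comparison) to a threshold value, which in the
    disclosure may be derived from instrumentation metadata about the metric.
    """
    aggregations = ["avg", "last"]     # assumed aggregation operators
    comparisons = [">", "<"]           # assumed comparison operators
    rules = []
    for metric, agg, cmp_op in product(related_metrics, aggregations, comparisons):
        threshold = thresholds.get((metric, cmp_op))
        if threshold is not None:
            rules.append({"metric": metric, "agg": agg, "cmp": cmp_op, "threshold": threshold})
    return rules

# Example loosely following temporary rules 3 and 4 above:
rules = generate_temporary_rules(
    ["app_x_encrypt_op_count"],
    {("app_x_encrypt_op_count", ">"): 500, ("app_x_encrypt_op_count", "<"): 100},
)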
When the associated user-created rule hits, the newly created temporary rules may be evaluated. After each evaluation of the temporary rules, a set of evaluation attributes may be calculated and assigned to each temporary rule. These evaluation attributes may indicate how effective the rule is at generating an alert. The evaluation attributes may be calculated as follows.
Temporary rule hit rate (PRHR): it indicates how often the temporary rule satisfies the rule condition (a rule hit). The temporary rule hit rate may be calculated as follows:
PRHR = hit count/number of temporary rule evaluations
Temporary rule miss rate (PRMR): it indicates how often the temporary rule does not meet the rule condition (rule miss). The temporary rule miss rate may be calculated as follows:
PRMR = miss count/number of temporary rule evaluations
This attribute may play a key role in learning which temporary rules are least relevant so that non-relevant rules can be discarded in future evaluations.
Relative temporary rule hit rate (RPRHR): it indicates how often the temporary rule satisfies its rule condition relative to the associated user-created rule. The relative temporary rule hit rate is calculated as follows:
RPRHR = temporary rule hit rate / user-created rule hit rate
Relative temporary rule miss rate (RPRMR): It indicates how often the temporary rule does not satisfy its rule condition relative to the associated user-created rule. The relative temporary rule miss rate is calculated as follows:
RPRMR = temporary rule miss rate/user rule miss rate
This attribute may play a key role in learning the least relevant temporary rules and discarding those rules in future evaluations.
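A brief Python sketch computing these four attributes for a single temporary rule is given below; the function name and argument layout are assumptions made for illustration.

# Illustrative sketch of the attribute calculations defined above; names are assumptions.
def temporary_rule_attributes(hits, misses, evaluations, user_hit_rate, user_miss_rate):
    """Return PRHR, PRMR, RPRHR, and RPRMR for one temporary rule."""
    prhr = hits / evaluations if evaluations else 0.0
    prmr = misses / evaluations if evaluations else 0.0
    rprhr = prhr / user_hit_rate if user_hit_rate else 0.0
    rprmr = prmr / user_miss_rate if user_miss_rate else 0.0
    return {"PRHR": prhr, "PRMR": prmr, "RPRHR": rprhr, "RPRMR": rprmr}

# Example: 6 hits and 4 misses over 10 evaluations of the temporary rule, against a
# user-created rule that hit 80% and missed 20% of its own evaluations.
attrs = temporary_rule_attributes(6, 4, 10, user_hit_rate=0.8, user_miss_rate=0.2)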
The evaluation attributes described above may be used to predict the weight of each temporary rule. The predicted weight indicates how related the temporary rule is to the user-created rule: a higher weight indicates that the rule is more relevant, and vice versa.
A simple linear regression machine learning model may be used to predict the rule weights. For example, a weight is predicted with respect to each correlation attribute, and the average of these weights is taken as the rule weight.
Assume that:
The temporary rule hit rate is R1, where R1 = (hit count / number of rule evaluations).
The temporary rule miss rate is R2, where R2 = (miss count / number of rule evaluations).
The relative temporary rule hit rate is R3, where R3 = (temporary rule hit rate / user rule hit rate).
The relative temporary rule miss rate is R4, where R4 = (temporary rule miss rate / user rule miss rate).
The rule weights are predicted using a linear regression formula and the individual ratios calculated above.
W1=a+b(R1)
W2=a+b(1/R2)
W3=a+b(R3)
W4=a+b(1/R4)
where, for the training pairs (Ri, Wi):
a = ((ΣWi)(ΣRi²) − (ΣRi)(ΣRiWi)) / (n(ΣRi²) − (ΣRi)²)
b = (n(ΣRiWi) − (ΣRi)(ΣWi)) / (n(ΣRi²) − (ΣRi)²)
The average of the predicted weights can be used as a rule weight and compared to a predetermined threshold to determine whether the proposed rule is relevant or irrelevant.
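A hedged Python sketch of this per-attribute regression and averaging step follows. It assumes the training pairs (Ri, Wi) come from historical rule evaluations with already-known weights, and it uses the reciprocal transform for the miss-rate attributes to match W2 = a + b(1/R2) and W4 = a + b(1/R4) above; the threshold value and all identifiers are assumptions.

# Illustrative sketch only; training data, threshold, and identifiers are assumptions.
def fit_simple_linear_regression(r_values, w_values):
    """Return (a, b) for w = a + b * r using ordinary least squares."""
    n = len(r_values)
    sum_r = sum(r_values)
    sum_w = sum(w_values)
    sum_rw = sum(r * w for r, w in zip(r_values, w_values))
    sum_r2 = sum(r * r for r in r_values)
    denom = n * sum_r2 - sum_r ** 2          # assumes the r values are not all identical
    b = (n * sum_rw - sum_r * sum_w) / denom
    a = (sum_w * sum_r2 - sum_r * sum_rw) / denom
    return a, b

def predict_rule_weight(models, r1, r2, r3, r4):
    """Average the per-attribute predictions W1..W4 into a single rule weight.

    `models` maps "R1".."R4" to fitted (a, b) pairs; the miss-rate attributes use a
    reciprocal input, matching W2 = a + b(1/R2) and W4 = a + b(1/R4).
    """
    inputs = {"R1": r1, "R2": 1.0 / r2 if r2 else 0.0,
              "R3": r3, "R4": 1.0 / r4 if r4 else 0.0}
    weights = [a + b * inputs[key] for key, (a, b) in models.items()]
    return sum(weights) / len(weights)

RECOMMENDATION_THRESHOLD = 0.5               # assumed acceptable weight

def is_recommended(rule_weight: float) -> bool:
    return rule_weight >= RECOMMENDATION_THRESHOLD

In practice, an off-the-shelf routine such as scipy.stats.linregress could be substituted for the hand-rolled least-squares fit above.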
Exemplary pseudocode implementing the techniques described above is likewise provided in the disclosure as figure listings.
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof. The various features of the different examples 500, 600, 900, and 1200 of the system may be combined in a single performance monitoring system. The various features described as modules, units, or components may be implemented together in integrated logic devices or may be implemented separately as discrete but interoperable logic devices or other hardware devices. In some cases, various features of the electronic circuit may be implemented as one or more integrated circuit devices, such as an integrated circuit chip or chipset.
If implemented in hardware, the present disclosure may relate to an apparatus, such as a processor or an integrated circuit apparatus (such as an integrated circuit chip or chipset). Alternatively, or in addition, if implemented in software or firmware, the techniques may be implemented, at least in part, by a computer-readable storage medium including instructions; the instructions, when executed, cause the processor to perform one or more of the methods described above. For example, a computer-readable storage medium may store such instructions for execution by a processor.
The computer readable medium may form part of a computer program product, which may include packaging material. The computer-readable medium may include computer data storage media such as Random Access Memory (RAM), read Only Memory (ROM), non-volatile random access memory (NVRAM), electrically Erasable and Programmable Read Only Memory (EEPROM), flash memory, magnetic data storage media, or optical data storage media, among others. In some examples, an article of manufacture may comprise one or more computer-readable storage media.
In some examples, the computer-readable storage medium may include a non-transitory medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in RAM or cache).
The code or instructions may be software and/or firmware executed by a processing circuit that includes one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), or other equivalent discrete or integrated logic circuitry. Thus, the term "processor," as used herein, may refer to any one of the foregoing structures, or any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described in the present disclosure may be provided in software modules or hardware modules.

Claims (20)

1. A performance monitoring method for recommending alert rules for a performance monitoring system, the method comprising:
receiving, by the performance monitoring system, a user-created alert rule;
collecting, by the performance monitoring system, telemetry data comprising a plurality of metrics related to a user-created alert rule;
reading, by the performance monitoring system, a first alert rule of the user-created alert rules, wherein the first alert rule is associated with a first metric of the plurality of metrics;
determining, by the performance monitoring system, at least one relevant metric related to the first metric;
creating, by the performance monitoring system, a set of temporal correlation rules using the at least one relevant metric;
evaluating, by the performance monitoring system, each temporal correlation rule of the set of temporal correlation rules using the telemetry data to determine a corresponding temporal rule evaluation attribute;
determining, by the performance monitoring system, for each temporal correlation rule in the set of temporal correlation rules, a corresponding correlation rule weight based on the corresponding temporal rule evaluation attribute; and
determining, by the performance monitoring system, whether each temporal correlation rule is a recommended correlation rule based on the corresponding correlation rule weight.
2. The method of claim 1, wherein determining whether each temporal correlation rule is a recommended correlation rule comprises: determining whether the corresponding correlation rule weight is greater than a predetermined threshold weight.
3. The method of claim 1, wherein determining whether each temporal correlation rule is a recommended correlation rule comprises: determining whether the corresponding correlation rule weight is less than a predetermined acceptable threshold weight to determine that the temporal correlation rule is uncorrelated.
4. A method according to any one of claims 1 to 3, wherein the corresponding temporal rule evaluation attribute for the temporal correlation rule is at least one of a corresponding temporal rule hit rate and a corresponding temporal rule miss rate.
5. The method of claim 4, further comprising: evaluating the first alert rule using the telemetry data to determine at least one of a first alert rule hit rate and a first alert rule miss rate; and determining a corresponding temporal rule evaluation attribute using at least one of the first alert rule hit rate and the first alert rule miss rate.
6. The method of any of claims 1-3, wherein the corresponding temporal rule evaluation attributes for the temporal correlation rule include a temporal rule hit rate and a temporal rule miss rate; the method further comprises the steps of: evaluating the first alert rule using the telemetry data to determine a first alert rule hit rate and a first alert rule miss rate; and determining additional corresponding temporary rule evaluation attributes for determining corresponding temporary rule weights using the first alert rule hit rate and the first alert rule miss rate.
7. The method of claim 6, wherein the additional corresponding temporary rule evaluation attributes include a relative temporary rule hit rate and a relative temporary rule miss rate.
8. A method according to any one of claims 1 to 3, wherein creating a set of temporal correlation rules using the at least one relevant metric comprises: using at least one of a metric aggregation and a set of comparison operators.
9. A method according to any one of claims 1 to 3, further comprising automatically implementing recommended correlation rules.
10. A method according to any one of claims 1 to 3, further comprising providing the recommended correlation rules to a user for approval prior to implementing the recommended correlation rules.
11. A method according to any one of claims 1 to 3, further comprising determining and evaluating a set of corresponding temporary rules for each of the user-created alert rules.
12. A method according to any one of claims 1 to 3, wherein determining a corresponding correlation rule weight based on the corresponding temporal rule evaluation attribute comprises: using a machine learning model.
13. A method according to any one of claims 1 to 3, wherein determining the at least one relevant metric related to the first metric comprises: matching a source of the first metric, or matching a tag of the first metric.
14. A computer readable storage medium encoded with instructions that cause one or more programmable processors to perform the method recited in any of claims 1-13.
15. A performance monitoring system comprising processing circuitry coupled to a memory device, the memory device and the processing circuitry configured to:
receiving an alert rule created by a user;
collecting telemetry data comprising a plurality of metrics related to the user-created alert rule;
reading a first alert rule of the user-created alert rules, wherein the first alert rule is associated with a first metric of the plurality of metrics;
determining at least one relevant metric related to the first metric;
creating a set of temporal correlation rules using the at least one related metric;
evaluating each temporal correlation rule of the set of temporal correlation rules using the telemetry data to determine a corresponding temporal rule evaluation attribute;
determining, for each temporal correlation rule in the set of temporal correlation rules, a corresponding correlation rule weight based on the corresponding temporal rule evaluation attribute; and
determining, based on the corresponding correlation rule weight, whether each temporal correlation rule is a recommended correlation rule or an uncorrelated rule.
16. The performance monitoring system of claim 15, wherein the memory and the processing circuitry are configured to: determining whether the corresponding correlation rule weight is greater than a predetermined threshold weight.
17. The performance monitoring system of claim 15, wherein the corresponding temporal rule evaluation attribute for the temporal correlation rule is at least one of a corresponding temporal rule hit rate and a corresponding temporal rule miss rate.
18. The performance monitoring system of any one of claims 15-17, wherein the memory and the processing circuitry are configured to: evaluating the first alert rule using the telemetry data to determine at least one of a first alert rule hit rate and a first alert rule miss rate; and determining a corresponding temporal rule evaluation attribute using at least one of the first alert rule hit rate and the first alert rule miss rate.
19. The performance monitoring system of any of claims 15-17, wherein the corresponding temporal rule evaluation attributes for temporal correlation rules include a temporal rule hit rate and a temporal rule miss rate, wherein the memory and the processing circuitry are configured to: evaluating the first alert rule using the telemetry data to determine the first alert rule hit rate and the first alert rule miss rate; and determining additional corresponding temporal rule evaluation attributes for determining corresponding temporal rule weights using the first alert rule hit rate and the first alert rule miss rate.
20. The performance monitoring system of any one of claims 15-17, wherein the memory and the processing circuitry are configured to: for each of the user-created alert rules, a corresponding set of temporary rules is determined and evaluated.
CN202211705865.9A 2022-04-16 2022-12-29 Machine learning for rule recommendation Pending CN116909831A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
IN202241022566 2022-04-16
US17/810,185 US20230333956A1 (en) 2022-04-16 2022-06-30 Machine learning for rule recommendation
US17/810,185 2022-06-30

Publications (1)

Publication Number Publication Date
CN116909831A true CN116909831A (en) 2023-10-20

Family

ID=88353695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705865.9A Pending CN116909831A (en) 2022-04-16 2022-12-29 Machine learning for rule recommendation

Country Status (1)

Country Link
CN (1) CN116909831A (en)

Similar Documents

Publication Publication Date Title
US11902124B2 (en) Round trip time (RTT) measurement based upon sequence number
US11750482B2 (en) Automatic health check and performance monitoring for applications and protocols using deep packet inspection in a datacenter
CN113890826B (en) Method for computer network, network device and storage medium
US10708149B2 (en) Context-aware virtualized control decision support system for providing quality of experience assurance for internet protocol streaming video services
CN110521171B (en) Stream cluster resolution for application performance monitoring and management
US20200167258A1 (en) Resource allocation based on applicable service level agreement
EP3278506B1 (en) Methods and devices for monitoring of network performance for container virtualization
EP4270190A1 (en) Monitoring and policy control of distributed data and control planes for virtual nodes
US11924240B2 (en) Mechanism for identifying differences between network snapshots
US20190123983A1 (en) Data integration and user application framework
US11698976B2 (en) Determining application attack surface for network applications
KR20110130366A (en) Maintaining time series models for information technology system parameters
US11627166B2 (en) Scope discovery and policy generation in an enterprise network
US20230333956A1 (en) Machine learning for rule recommendation
EP4261752A1 (en) Machine learning for rule recommendation
EP4261689A1 (en) Machine learning for rule evaluation
CN116909831A (en) Machine learning for rule recommendation
EP4261751A1 (en) Machine learning for metric collection
CN116911398A (en) Machine learning for metric acquisition
CN116915643A (en) Machine learning for rule evaluation
US11403200B2 (en) Provisioning resources for monitoring hosts based on defined functionalities of hosts
WO2020000409A1 (en) Managing quality of storage service in virtual network
Gogunska Study of the cost of measuring virtualized networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination