EP4303730A1 - Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising microservices - Google Patents

Info

Publication number
EP4303730A1
Authority
EP
European Patent Office
Prior art keywords
containers
better
computer
microservice
microservices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22305998.1A
Other languages
German (de)
French (fr)
Inventor
Abdelhadi AZZOUNI
Valentin LAPPAROV
Rama Rao GANJI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Packetai
Original Assignee
Packetai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Packetai
Priority to EP22305998.1A
Publication of EP4303730A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/815 Virtual
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/83 Indexing scheme relating to error detection, to error correction, and to monitoring the solution involving signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865 Monitoring of software
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Definitions

  • a cloud infrastructure is a complex and dynamic infrastructure that comprises a high number of components and generates a large amount of data.
  • a cloud infrastructure may comprise microservices.
  • each microservice comprises one or more container(s).
  • a container is an atomic compute component that is meant to perform atomic tasks.
  • the deployment of such a cloud infrastructure comprising containers is managed by a container orchestrator, which is software that manages the lifecycle of the containers (creation, resizing and/or destruction) and the allocation of each container, that is, its attribution to a specific microservice.
  • An example of such an orchestrator is KUBERNETES®.
  • the container orchestrator also allows the monitoring of the infrastructure by collecting the containers' data and exposing it via an application programming interface (API).
  • Cloud infrastructures comprising containers are characterized by their dynamism and the ephemerality of containers.
  • Containers are created and destroyed dynamically based on workload and usage. For example, during peak traffic hours, a cloud infrastructure expands by creating more containers to handle more traffic. Similarly, during low traffic hours, for example during the night, the number of containers shrinks in order to save compute power. This automatic creation and destruction of containers is called auto-scaling, which enables more granular management of compute capacity. It also saves the costs that overprovisioned idle resources would otherwise incur.
  • a way of monitoring a cloud infrastructure, and in particular to detect anomalies, is to monitor metrics.
  • Anomaly detection methods based on metrics monitoring are known for IT infrastructures that may be applied to cloud infrastructures comprising microservices.
  • these systems are conceived to work on hand-picked individual metrics. The user needs to define what metrics are interesting to monitor then individually apply some models to them in order to detect anomalies.
  • For example, to monitor a cloud infrastructure, it is possible to set up an anomaly detection model to detect anomalies on a specific container's central processing unit (CPU) usage.
  • the model would learn the normal behaviour of the container's CPU and send alerts whenever there is a deviation from this normal behaviour.
  • This method of manually setting up a detection model for each metric to be monitored is not compatible with a cloud infrastructure comprising microservices.
  • the present invention notably seeks to meet this need.
  • One subject of the invention is a computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising one or more microservices, each microservice comprising one or more containers, each container having a plurality of metrics of different types and relative to the state of said container, the method comprising the following steps:
  • the computer implemented method according to the invention allows detecting anomalies in a microservice deployment and not in each individual container of said microservice.
  • the method is thus more scalable than the known methods of the prior art.
  • the method according to the invention is agnostic to the underlying containers of each microservice and to their life cycles.
  • the evolution of an aggregated metric is monitored instead of monitoring individually metrics of each container.
  • the method is not disturbed by automatic termination or creation of containers that may trigger error in the anomaly detection method.
  • the method according to the invention has good elasticity and may be automatically adapted to cloud infrastructures of any size.
  • the values of the predefined range of values may be expected values computed beforehand.
  • The method according to the invention can be implemented by a computer in the broadest sense, i.e. by any electronic system, which may include a set of several machines and have computer processing capabilities.
  • Classically, a computer comprises a processor, a memory, a human-machine interface, for example a keyboard, a mouse or a touch screen, a screen and a communication module, in particular Ethernet, Wi-Fi, Bluetooth® or using a mobile telephone technology, for example a technology operating with the GSM, GPRS, EDGE, LTE or UMTS protocols.
  • By "cloud infrastructure", we designate a set of components that communicate via a network, which can be the internet or a private intranet.
  • the method according to the invention is mostly dedicated to cloud infrastructure comprising microservices.
  • By "microservice", we designate a component of the cloud architecture that is configured to execute one function.
  • a microservice is usually written in a single programming language and using a single technology.
  • a microservice may also correspond to business services built as microservices.
  • a microservice is developed using one or more container(s).
  • By "container", we designate a component of the cloud architecture that is configured to perform a task that contributes to the realization of the microservice function.
  • the step of grouping containers associated to said first microservice may comprise the following steps:
  • Containers implementing a microservice share the same underlying programming language and/or technology. Grouping containers by programming language and/or by technology provides a simple first sort of the containers according to the microservice they are associated with. There is no need to access external meta-configuration files of the cloud infrastructure. The method according to the invention is therefore easily compatible with any cloud infrastructure.
  • the programming language and/or the technology may be determined by crafting special queries and sending them to open ports of the target process. This technique is based on constructing a database of technology fingerprints and using them for detection. For example, it is possible to perform a banner grab which is performed by crafting and sending an HTTP request to the web server and examining its response header. This can be accomplished using a variety of tools, including telnet for HTTP requests, or openssl for requests over SSL.
  • the programming language and/or the technology may be identified by examining the error response and/or the default error pages of the web servers. One way to compel a server to present an error response and/or a default error page is by crafting and sending intentionally incorrect or malformed requests.
  • the programming language and/or the technology may be determined by parsing configuration files.
  • This method includes first looking for configuration files in default repositories that may match a process name, then parsing the configuration files to confirm the technology or programming language. For example, if the name of the process is mongo- ⁇ , one way to confirm that it is a mongodb database process is to first search for /etc/mongod.conf. If the file exists, then it is parsed to see if it conforms to a default configuration file of a mongodb database. Further information about the exact mongodb version could be extracted from the config file.
  • the step of grouping containers associated to said first microservice may comprise the following steps:
  • a tag identifying which microservice a container is associated to may be a process name and/or one or more container's meta-data.
  • Process name may be obtained by using an OS kernel command or by running one or more fingerprinting tests.
  • Container's meta-data may include one or more of the following: image name, image version, repository name.
  • Determination by using process name and by using meta-data may be used to increase the accuracy of the result.
  • the tag identifying which microservice a container is associated to may not be available, or the tags may be attributed by engineers, which is a source of potential mistakes. When this information is not available, all the containers cannot be sorted by applying this method.
  • It may also be possible that two microservices share the same combination of programming language and technology. In this case, grouping all containers associated to one microservice for each microservice is not possible.
  • the step of grouping containers associated to said first microservice comprises the following steps:
  • the method may include the step of inviting the user to confirm the sorting of the containers.
  • This invitation may be a message printed on the user interface.
  • the step of monitoring the evolution of said at least one aggregated metric may be performed by comparing the value of said at least one aggregated metric to a predetermined threshold, an alert being emitted if said value is greater or smaller than said predetermined threshold.
  • This monitoring method is easy to perform because it only requires predetermining a threshold and monitoring the aggregated value relative to it. This method is economical and easy to implement.
  • the step of monitoring the evolution of said at least one aggregated metric may be performed by applying a statistical algorithm to the values of said at least one aggregated metric.
  • the statistical algorithm may evaluate the similarity of each value of the aggregated metric to the other values; if a value is significantly different, it is declared an anomaly and an alert is emitted.
  • the step of monitoring the evolution of said at least one aggregated metric may be performed by transmitting values of said at least one aggregated metric to a regression model or a model trained beforehand on previously acquired aggregated metrics, an alert being emitted if said values deviate from the values of said previously acquired aggregated metrics.
  • the model used may be a moving average model over a predetermined number of past days.
  • This model may use a sliding window and calculate the average value of all values of the aggregated metric over the past days, for example over the past seven days.
  • the calculated average value is compared with the next value of the aggregated metric. If the value of the next aggregated metric is significantly far from the calculated average value, then the algorithm generates an anomaly alert.
  • the model trained beforehand may be a neural network.
  • a neural network may be trained on previously acquired normal values of the said aggregated metric.
  • the neural network is tasked with predicting the next value of the said aggregated metric based on the learned characteristics of its normal values. If the actual measured aggregated metric value is significantly different from the predicted value, then the model fires an anomaly alert.
  • the model trained beforehand may be configured to emit an alert if an anomaly is detected in the evolution of the value of the aggregated metrics.
  • Using a model trained beforehand allows detecting a deviation of the evolution of the value of an aggregated metric without necessitating a predetermined threshold.
  • the metrics may be incremental metrics or punctual metrics.
  • $\mathrm{metric}(t_{x+1}) = \mathrm{metric}(t_x) + \Delta$, where $\Delta$ is the incremental value that the metric gained.
  • a punctual metric is a metric in which there is no linear relationship between consecutive values.
  • the metrics may be chosen from the following: total CPU usage, CPU usage in user space, CPU usage in kernel space, memory usage, ratio of big memory pages, disk usage, number of write and read operations, volume of traffic entering via network, volume of traffic exiting via network, usage of cache memory, number of tables of a database, waiting time, latency, number of operations per unit of time, number of errors, queue size of a messaging broker, connection number to an API or database, large query number for a database, number of missed cache memory hits.
  • the computer-implemented method allows monitoring a lot of different metrics. It also allows monitoring all chosen metrics simultaneously.
  • the cloud infrastructure may perform at least 5 microservices, better at least 10 microservices, better at least 100 microservices, better at least 1 000 microservices, better at least 10 000 microservices, better at least 100 000 microservices, better at least 1×10^6 microservices, better at least 1×10^9 microservices, better at least 1×10^12 microservices.
  • Each microservice may be executed by one container, better by at least 5 containers, better by at least 10 containers, better by at least 100 containers, better by at least 1 000 containers, better by at least 10 000 containers, better by at least 100 000 containers, better by at least 1×10^6 containers, better by at least 1×10^9 containers.
  • the method according to the invention is adapted to cloud infrastructures comprising a high number of microservices and containers.
  • the method may be performed without accessing external meta-configuration file(s) of the cloud infrastructure.
  • the invention also relates to a system for automatically detecting anomalies in a cloud infrastructure performing one or more microservices, the system being adapted to perform the method according to the invention.
  • the invention also relates to a computer program comprising instructions which, when the program is executed on a computer, cause the computer to carry out the steps of the method according to the invention.
  • FIG. 1 illustrates an example of a cloud infrastructure comprising several microservices 10.
  • Each microservice comprises one or more containers that are not represented.
  • each container comprises a plurality of metrics of different types and relative to its state.
  • the number of containers associated to one microservice may vary over time and therefore the number of metrics that can be evaluated also varies over time.
  • each container comprises 8 metrics.
  • the value of the metrics corresponding to containers that do not exist anymore is set to 0.
  • the method according to the invention allows avoiding this waste.
  • the first step is to identify the programming language and the technology of each container. When it is available, the tag identifying which microservice this container is associated to is also read.
  • containers sharing the same technology, the same programming language and the same tag are grouped. By doing this, the method is able to identify all containers that each microservice comprises.
  • the obtained value is transmitted to a model trained beforehand.
  • each container has only one CPU usage metric.
  • the aggregated CPU usage value of a given microservice is the mean of 20 CPU usage values.
  • the aggregated CPU usage value is the mean of 5 CPU usage values. In either case, there is only one value to monitor.
  • the invention is not limited to the examples that have just been described.
  • different models trained beforehand may be applied to the aggregated metrics according to the invention.
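The moving-average model mentioned in this list (a sliding window over past values of the aggregated metric, compared with the next value) can be sketched as follows. This is only an illustrative sketch: the class name, the window expressed as a number of samples rather than days, and the relative `tolerance` used to decide what "significantly far" means are all assumptions, since the text does not fix them.

```python
from collections import deque


class MovingAverageDetector:
    """Hypothetical sliding-window detector for one aggregated metric."""

    def __init__(self, window, tolerance):
        # Keep only the last `window` values (e.g. seven daily values).
        self.values = deque(maxlen=window)
        # Assumed rule: a value is "significantly far" when it differs from
        # the window average by more than `tolerance` times that average.
        self.tolerance = tolerance

    def observe(self, value):
        """Return True if `value` is anomalous, then add it to the window."""
        anomaly = False
        if len(self.values) == self.values.maxlen:
            average = sum(self.values) / len(self.values)
            anomaly = abs(value - average) > self.tolerance * abs(average)
        self.values.append(value)
        return anomaly


detector = MovingAverageDetector(window=3, tolerance=0.5)
flags = [detector.observe(v) for v in [10, 10, 10, 11, 30]]
```

In this sketch no alert is raised until the window is full, so the first values only serve to build the baseline.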

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising one or more microservices, each microservice comprising one or more containers, each container having a plurality of metrics of different types and relative to the state of said container, the method comprising the following steps:
    a. for a first given microservice, grouping all alive containers associated to said first microservice,
    b. choose a type of metric and aggregate at least one metric relative to each previously grouped container to get a unique aggregated metric of this type of metric,
    c. monitoring the evolution of said at least one aggregated metric, and
    d. emit an alert at least if, while monitoring the evolution of said at least one aggregated metric, a value thereof is determined to be outside a predefined range of values.

Description

    Technical field
  • A cloud infrastructure is a complex and dynamic infrastructure that comprises a high number of components and generates a large amount of data.
  • A cloud infrastructure may comprise microservices. In this kind of infrastructure, each microservice comprises one or more container(s). A container is an atomic compute component that is meant to perform atomic tasks.
  • The deployment of such a cloud infrastructure comprising containers is managed by a container orchestrator, which is software that manages the lifecycle of the containers (creation, resizing and/or destruction) and the allocation of each container, that is, its attribution to a specific microservice. An example of such an orchestrator is KUBERNETES®. The container orchestrator also allows the monitoring of the infrastructure by collecting the containers' data and exposing it via an application programming interface (API).
  • Cloud infrastructures comprising containers are characterized by their dynamism and the ephemerality of containers. Containers are created and destroyed dynamically based on workload and usage. For example, during peak traffic hours, a cloud infrastructure expands by creating more containers to handle more traffic. Similarly, during low traffic hours, for example during the night, the number of containers shrinks in order to save compute power. This automatic creation and destruction of containers is called auto-scaling, which enables more granular management of compute capacity. It also saves the costs that overprovisioned idle resources would otherwise incur.
  • A way of monitoring a cloud infrastructure, and in particular to detect anomalies, is to monitor metrics.
  • Yet, monitoring auto-scaling cloud infrastructures is complex due to their dynamic nature. Indeed, since a container is ephemeral, once a container is destroyed, its data disappears, and it is no longer possible to monitor metrics associated with this container. However, this does not mean that the associated task is done, it only means that the particular container is not needed anymore, and the same related task may well be continued by other containers.
    Prior art
  • Anomaly detection methods based on metrics monitoring are known for IT infrastructures that may be applied to cloud infrastructures comprising microservices. However, these systems are conceived to work on hand-picked individual metrics. The user needs to define what metrics are interesting to monitor then individually apply some models to them in order to detect anomalies.
  • For example, to monitor a cloud infrastructure, it is possible to set up an anomaly detection model to detect anomalies on a specific container's central processing unit (CPU) usage. The model would learn the normal behaviour of the container's CPU and send alerts whenever there is a deviation from this normal behaviour.
  • The drawback of this approach is that it is not scalable. Indeed, when a container is destroyed, the container's metrics disappear, which renders the monitor useless and leads to wasted resources as well as potential false alarms. For example, the value of the container's CPU usage would simply disappear after the container's destruction, either generating an error or causing the monitor to fire false alarms. Even if the monitor is eventually adapted to consider the value 0 as normal, it would still waste compute resources.
  • Another major drawback of this approach is that the number of containers is usually very high, leading to impractical costs for the anomaly detection method.
  • This method of manually setting up a detection model for each metric to be monitored is not compatible with a cloud infrastructure comprising microservices.
  • There is a need to further improve automatic detection of anomalies in a cloud infrastructure comprising microservices. In particular, there is a need for obtaining a more scalable model.
  • The present invention notably seeks to meet this need.
    Disclosure of the invention
  • One subject of the invention is a computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising one or more microservices, each microservice comprising one or more containers, each container having a plurality of metrics of different types and relative to the state of said container, the method comprising the following steps:
    a. for a first given microservice, grouping all alive containers associated to said first microservice,
    b. choose a type of metric and aggregate at least one metric relative to each previously grouped container to get a unique aggregated metric of this type of metric,
    c. monitoring the evolution of said at least one aggregated metric, and
    d. emit an alert at least if, while monitoring the evolution of said at least one aggregated metric, a value thereof is determined to be outside a predefined range of values.
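Steps a to d above can be illustrated with a minimal sketch. The container records (dictionaries with `microservice`, `alive` and `metrics` keys), the function names and the numeric range are hypothetical and chosen only for illustration; the mean is used as the aggregation because it is the aggregation used in the CPU usage example of this document.

```python
from statistics import mean


def aggregate_metric(containers, microservice, metric_type):
    """Steps a and b: group the alive containers of one microservice,
    then average one type of metric over the group."""
    values = [
        c["metrics"][metric_type]
        for c in containers
        if c["alive"] and c["microservice"] == microservice
    ]
    return mean(values) if values else None


def check_range(value, low, high):
    """Steps c and d: return an alert message when the aggregated value
    lies outside the predefined range [low, high], otherwise None."""
    if value is None or low <= value <= high:
        return None
    return f"alert: aggregated value {value:.2f} outside [{low}, {high}]"


# Hypothetical snapshot: one "checkout" container has already been destroyed.
containers = [
    {"microservice": "checkout", "alive": True, "metrics": {"cpu": 0.90}},
    {"microservice": "checkout", "alive": True, "metrics": {"cpu": 0.96}},
    {"microservice": "checkout", "alive": False, "metrics": {"cpu": 0.0}},
    {"microservice": "search", "alive": True, "metrics": {"cpu": 0.20}},
]
cpu = aggregate_metric(containers, "checkout", "cpu")
alert = check_range(cpu, 0.0, 0.8)
```

Whatever the number of alive containers, only one value per microservice and per metric type is checked, and the destroyed container does not contribute a spurious 0.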
  • The computer implemented method according to the invention allows detecting anomalies in a microservice deployment and not in each individual container of said microservice. The method is thus more scalable than the known methods of the prior art. Moreover, the method according to the invention is agnostic to the underlying containers of each microservice and to their life cycles.
  • In the method according to the invention, the evolution of an aggregated metric is monitored instead of individually monitoring the metrics of each container. By monitoring aggregated metrics, the method is not disturbed by the automatic termination or creation of containers, which may otherwise trigger errors in the anomaly detection method. The method according to the invention has good elasticity and may be automatically adapted to cloud infrastructures of any size.
  • The values of the predefined range of values may be expected values computed beforehand.
  • The method according to the invention can be implemented by a computer in the broadest sense, i.e. by any electronic system, which may include a set of several machines and have computer processing capabilities. Classically, a computer comprises a processor, a memory, a human-machine interface, for example a keyboard, a mouse or a touch screen, a screen and a communication module, in particular Ethernet, Wi-Fi, Bluetooth® or using a mobile telephone technology, for example a technology operating with the GSM, GPRS, EDGE, LTE or UMTS protocols.
  • By "cloud infrastructure", we designate a set of components that communicate via a network, which can be the internet or a private intranet. The method according to the invention is mostly dedicated to cloud infrastructure comprising microservices.
  • By "microservice", we designate a component of the cloud architecture that is configured to execute one function. A microservice is usually written in a single programming language and using a single technology. A microservice may also correspond to business services built as microservices. A microservice is developed using one or more container(s).
  • By "container", we designate a component of the cloud architecture that is configured to perform a task that contributes to the realization of the microservice function.
  • By "auto-scaling", we designate the capability of a cloud infrastructure to automatically destroy and/or create containers to adapt the compute power to the need at a given moment.
  • The step of grouping containers associated to said first microservice may comprise the following steps:
    i. for at least a part of the containers, better for all the containers, determine the programming language and/or the technology of each container, and
    ii. group together containers having the same programming language and/or the same technology.
    Preferably, the programming language and/or the technology of all the containers is determined.
  • Containers implementing a microservice share the same underlying programming language and/or technology. Grouping containers by programming language and/or by technology therefore provides a simple first sorting of the containers according to the microservice they are associated to. There is no need to access external meta-configuration files of the cloud infrastructure. The method according to the invention is therefore easily compatible with any cloud infrastructure.
  • In one embodiment, the programming language and/or the technology may be determined by crafting special queries and sending them to open ports of the target process. This technique is based on constructing a database of technology fingerprints and using them for detection. For example, it is possible to perform a banner grab which is performed by crafting and sending an HTTP request to the web server and examining its response header. This can be accomplished using a variety of tools, including telnet for HTTP requests, or openssl for requests over SSL. In another example, the programming language and/or the technology may be identified by examining the error response and/or the default error pages of the web servers. One way to compel a server to present an error response and/or a default error page is by crafting and sending intentionally incorrect or malformed requests.
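  • As an illustration of the banner-grab approach, the following minimal sketch extracts the Server header from a raw HTTP response and looks it up in a small fingerprint table. The table contents and the function names `server_banner` and `identify_technology` are assumptions made for this example; the source does not specify the fingerprint database or how the response is obtained.

```python
from typing import Optional

# Hypothetical fingerprint database (not from the source):
# lowercase substring of the Server header -> technology label.
FINGERPRINTS = {
    "nginx": "nginx",
    "apache": "Apache HTTP Server",
    "gunicorn": "Gunicorn (Python)",
    "jetty": "Jetty (Java)",
}


def server_banner(raw_response: str) -> Optional[str]:
    """Return the value of the Server header of a raw HTTP response, if any."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "server":
            return value.strip()
    return None


def identify_technology(raw_response: str) -> Optional[str]:
    """Match the Server banner against the fingerprint table."""
    banner = server_banner(raw_response)
    if banner is None:
        return None
    lowered = banner.lower()
    for needle, technology in FINGERPRINTS.items():
        if needle in lowered:
            return technology
    return None


# Sample response, as could be returned by a container's open port.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx/1.25.3\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html></html>"
)
print(identify_technology(sample))
```

    The same lookup could be applied to default error pages by matching on the response body instead of the header.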
  • In another embodiment, the programming language and/or the technology may be determined by parsing configuration files. This method includes first looking for configuration files in default repositories that may match a process name, then parsing the configuration files to confirm the technology or programming language. For example, if the name of the process is mongo-, one way to confirm that it is a mongodb database process is to first search for /etc/mongod.conf. If the file exists, then it is parsed to see if it conforms to a default configuration file of a mongodb database. Further information about the exact mongodb version could be extracted from the config file.
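  • The configuration-file check described above can be sketched as follows. The lookup table, the `root` parameter (added so that the check can run against a test directory) and the choice of top-level keys used to recognise a default mongod.conf are assumptions for this example, not details given in the source.

```python
from pathlib import Path
from typing import Optional

# Hypothetical table (not from the source): process-name prefix ->
# (technology, default config path, top-level keys expected in that file).
CANDIDATES = {
    "mongo": ("mongodb", "etc/mongod.conf", ("storage", "systemLog", "net")),
}


def confirm_technology(process_name: str, root: Path = Path("/")) -> Optional[str]:
    """Confirm a technology guessed from the process name by parsing its config file."""
    for prefix, (technology, rel_path, keys) in CANDIDATES.items():
        if not process_name.startswith(prefix):
            continue
        config = root / rel_path
        if not config.is_file():
            continue
        # Collect the unindented "key:" entries of the YAML-style file.
        top_level = {
            line.split(":", 1)[0]
            for line in config.read_text().splitlines()
            if line and not line[0].isspace()
            and not line.startswith("#") and ":" in line
        }
        if all(key in top_level for key in keys):
            return technology
    return None
```

    Calling `confirm_technology("mongod")` on a host would look for /etc/mongod.conf; a version lookup would need additional parsing that is not shown here.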
  • The step of grouping containers associated to said first microservice may comprise the following steps:
    i. for at least a part of the containers, better for all the containers, detect at least one tag identifying which microservice this container is associated to, and
    ii. group together containers having the same tag.
  • By detecting tags identifying which microservice a container is associated to, it is possible to sort containers and to group those that are associated to the same microservice.
  • A tag identifying which microservice a container is associated to may be a process name and/or one or more container's meta-data.
  • The process name may be obtained by using an OS kernel command or by running one or more fingerprinting tests.
  • Container's meta-data may include one or more of the following: image name, image version, repository name.
  • Determination by process name and determination by meta-data may be combined to increase the accuracy of the result.
  • Under certain circumstances, the tag identifying which microservice a container is associated to may not be available, or the tags may have been attributed by engineers, which is a potential source of mistakes. When this information is unavailable or unreliable, it is not possible to sort all the containers by applying this method alone.
  • It may also be possible that two microservices share the same combination of programming language and technology. In this case, grouping by programming language and technology alone cannot separate the containers of each microservice.
  • That is why, in a preferred embodiment, the two grouping methods described above are consecutively performed.
  • Preferably, the step of grouping containers associated to said first microservice comprises the following steps:
    i. for at least a part of the containers, better for all the containers, determine the programming language and the technology of each container,
    ii. for said at least a part of the containers, better for all the containers, detect at least one tag identifying which microservice this container is associated to, and
    iii. group together containers having the same programming language, the same technology and the same tag.
  • By using consecutively two different methods for grouping containers, the risk of not being able to sort all containers and to group them by microservice is reduced.
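  • The combined grouping can be sketched as a dictionary keyed by the triple (programming language, technology, tag). The `Container` record, its field names and the sample values are assumptions for this example; a missing tag is represented by `None`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Container:
    container_id: str
    language: str
    technology: str
    tag: Optional[str]  # e.g. an image name; None when no tag is available


GroupKey = Tuple[str, str, Optional[str]]


def group_containers(containers: List[Container]) -> Dict[GroupKey, List[str]]:
    """Group container ids by (language, technology, tag)."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for c in containers:
        groups[(c.language, c.technology, c.tag)].append(c.container_id)
    return dict(groups)


containers = [
    Container("c1", "python", "flask", "checkout"),
    Container("c2", "python", "flask", "checkout"),
    Container("c3", "python", "flask", "catalog"),  # same stack, other microservice
    Container("c4", "c++", "mongodb", None),        # no tag available
]
groups = group_containers(containers)
for key, ids in groups.items():
    print(key, ids)
```

    In this sketch the tag separates "checkout" from "catalog" even though both share the same language and technology, while the untagged container is still grouped by its language and technology.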
  • In addition, the method may include the step of inviting the user to confirm the sorting of the containers. This invitation may be a message printed on the user interface.
  • The step of monitoring the evolution of said at least one aggregated metric may be performed by comparing the value of said at least one aggregated metric to a predetermined threshold, an alert being emitted if said value is greater or smaller than said predetermined threshold.
  • This monitoring method is easy to perform because it only requires predetermining a threshold and comparing the value of the aggregated metric to it. This method is economical and easy to implement.
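  • A minimal sketch of threshold-based monitoring follows. The function name `check_threshold` and the optional lower and upper bounds are assumptions for this example; the source only states that an alert is emitted when the value is greater or smaller than a predetermined threshold.

```python
from typing import Optional


def check_threshold(
    value: float,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> bool:
    """Return True (alert) if value is below `lower` or above `upper`."""
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


# Aggregated CPU usage of a microservice, expressed as a fraction of capacity.
print(check_threshold(0.93, upper=0.90))              # above the upper bound
print(check_threshold(0.50, lower=0.10, upper=0.90))  # inside the range
```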
  • The step of monitoring the evolution of said at least one aggregated metric may be performed by applying a statistical algorithm to the values of said at least one aggregated metric.
  • For example, the statistical algorithm may evaluate the similarity of a value of the aggregated metric to its other values; if a value is significantly different, it is declared an anomaly and an alert is emitted.
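  • One possible statistical test, given here only as an assumption since the source does not name a specific algorithm, is a z-score: a value is flagged when it lies more than `z_max` standard deviations from the mean of the previous values. The name `is_anomalous` and the default of 3.0 are illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, z_max: float = 3.0) -> bool:
    """Flag `value` if it deviates from `history` by more than z_max standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # All past values are identical: any different value stands out.
        return value != mu
    return abs(value - mu) / sigma > z_max


# Past values of an aggregated metric (hypothetical data).
history = [41.0, 39.5, 40.2, 40.8, 39.9, 40.1]
print(is_anomalous(history, 40.5))
print(is_anomalous(history, 55.0))
```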
  • The step of monitoring the evolution of said at least one aggregated metric may be performed by transmitting values of said at least one aggregated metric to a regression model or a model trained beforehand on previously acquired aggregated metrics, an alert being emitted if said values deviate from the values of said previously acquired aggregated metrics.
  • For example, the model used may be a moving average model over a predetermined number of past days. This model may use a sliding window and calculate the average value of all values of the aggregated metric over the past days, for example over the past seven days. The calculated average value is compared with the next value of the aggregated metric. If the value of the next aggregated metric is significantly far from the calculated average value, then the algorithm generates an anomaly alert.
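  • The moving average model can be sketched with a fixed-size sliding window. In this example the window is counted in samples (seven samples standing in for seven days), and "significantly far" is interpreted as a relative deviation above a `tolerance`; both choices, and the class name `MovingAverageDetector`, are assumptions.

```python
from collections import deque
from typing import Deque


class MovingAverageDetector:
    """Compare each new value with the average of the previous `window` values."""

    def __init__(self, window: int, tolerance: float) -> None:
        self.values: Deque[float] = deque(maxlen=window)
        self.tolerance = tolerance  # maximum allowed relative deviation

    def observe(self, value: float) -> bool:
        """Return True (alert) if `value` is far from the current moving average."""
        alert = False
        if len(self.values) == self.values.maxlen:  # wait until the window is full
            average = sum(self.values) / len(self.values)
            alert = abs(value - average) > self.tolerance * abs(average)
        self.values.append(value)  # the window then slides forward
        return alert


detector = MovingAverageDetector(window=7, tolerance=0.5)
alerts = [detector.observe(v) for v in [10, 11, 9, 10, 10, 11, 9, 30]]
print(alerts)
```

    Here the anomalous value is still appended to the window; a variant could exclude flagged values so that they do not shift the average.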
  • The model trained beforehand may be a neural network.
  • For example, a neural network may be trained on previously acquired normal values of the said aggregated metric. The neural network is tasked with predicting the next value of the said aggregated metric based on the learned characteristics of its normal values. If the actual measured aggregated metric value is significantly different from the predicted value, then the model fires an anomaly alert.
  • The model trained beforehand may be configured to emit an alert if an anomaly is detected in the evolution of the value of the aggregated metrics.
  • Using a model trained beforehand allows detecting a deviation of the evolution of the value of an aggregated metric without necessitating a predetermined threshold.
  • The metrics may be incremental metrics or punctual metrics.
  • An incremental metric is defined as follows:
    metric(t_{x+1}) = metric(t_x) + Δ, where Δ is the increment that the metric gained between t_x and t_{x+1}.
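  • As a small worked example of this definition, the increments Δ of an incremental metric can be recovered as the differences between consecutive samples. The function name `increments` and the sample counter (total bytes sent) are assumptions for this illustration.

```python
from typing import List, Sequence


def increments(samples: Sequence[float]) -> List[float]:
    """Return the increments between consecutive samples of an incremental metric."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical counter sampled at four consecutive instants.
bytes_sent = [1000, 1500, 1500, 2300]
print(increments(bytes_sent))
```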
  • A punctual metric is a metric in which there is no linear relationship between consecutive values.
  • The metrics may be chosen from the following: total CPU usage, CPU usage in user space, CPU usage in kernel space, memory usage, ratio of big memory pages, disk usage, number of write and read operations, volume of traffic entering via network, volume of traffic exiting via network, usage of cache memory, number of tables of a database, waiting time, latency, number of operations per unit of time, number of errors, queue size of a messaging broker, connection number to an API or database, large query number for a database, number of missed cache memory hits.
  • The list of metrics above is not exhaustive. Metrics derived from the ones listed above may also be chosen.
  • The computer-implemented method allows monitoring a lot of different metrics. It also allows monitoring all chosen metrics simultaneously.
  • The cloud infrastructure may perform at least 5 microservices, better at least 10 microservices, better at least 100 microservices, better at least 1 000 microservices, better at least 10 000 microservices, better at least 100 000 microservices, better at least 1×10^6 microservices, better at least 1×10^9 microservices, better at least 1×10^12 microservices.
  • Each microservice may be executed by one container, better by at least 5 containers, better by at least 10 containers, better by at least 100 containers, better by at least 1 000 containers, better by at least 10 000 containers, better by at least 100 000 containers, better by at least 1×10^6 containers, better by at least 1×10^9 containers.
  • The method according to the invention is adapted to cloud infrastructures comprising a high number of microservices and containers.
  • The method may be performed without accessing external meta-configuration file(s) of the cloud infrastructure.
  • Therefore, the set-up of the system used to perform the method is simplified.
  • According to another one of its aspects, the invention also relates to a system for automatically detecting anomalies in a cloud infrastructure performing one or more microservices, the system being adapted to perform the method according to the invention.
  • According to another one of its aspects, the invention also relates to a computer program comprising instructions which, when the program is executed on a computer, cause the computer to carry out the steps of the method according to the invention.
  • Brief description of the drawings
  • The invention may be better understood upon reading the following detailed description of non-limiting implementation examples thereof and on studying the appended drawings, in which:
    • figure 1 is a diagram illustrating the architecture of an example of a cloud infrastructure, and
    • figure 2 is a block diagram illustrating an example of different steps of the method according to the invention.
    Detailed description
  • Figure 1 illustrates an example of a cloud infrastructure comprising several microservices 10. Each microservice comprises one or more containers that are not represented. Moreover, each container comprises a plurality of metrics of different types and relative to its state.
  • The number of containers associated to one microservice may vary over time and therefore the number of metrics that can be evaluated varies also over time.
  • For each container we can store the values of the metrics in a vector. This means that the size of the vector will not be constant over time.
  • For example, consider a microservice comprising 10 containers at an instant T0; the number of containers then increases to 20 at a later instant T1 and decreases to 5 at a later instant T2. We also consider that each container has 8 metrics.
  • Therefore, the size of the vector used as input to an anomaly detection model varies as follows:
    • at T0 the input vector's size has to be at least 10 × 8 = 80,
    • at T1 the input vector's size has to be at least 20 × 8 = 160,
    • at T2 the input vector's size has to be at least 5 × 8 = 40.
  • In the prior art method, the value of the metrics corresponding to containers that do not exist anymore is set to 0. Thus, the anomaly detection models would perform inference on 160 - 80 = 80 time series with a constant value of 0, which is a waste of computing resources.
  • The method according to the invention allows avoiding this waste.
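  • The arithmetic of this example can be reproduced with the short sketch below. It assumes, as one reading of the prior art described above, that the input vector is sized for the peak of 160 values and zero-padded otherwise; the variable names are illustrative. With aggregation, the input holds one value per type of metric whatever the number of containers.

```python
METRICS_PER_CONTAINER = 8
containers_over_time = {"T0": 10, "T1": 20, "T2": 5}

# Per-container monitoring: one input value per metric of each alive container.
input_sizes = {t: n * METRICS_PER_CONTAINER for t, n in containers_over_time.items()}

# Assumption: the vector is sized for the peak and padded with zeros elsewhere.
max_size = max(input_sizes.values())
padded_series = {t: max_size - size for t, size in input_sizes.items()}

# Aggregated monitoring: one value per type of metric, whatever the container count.
aggregated_sizes = {t: METRICS_PER_CONTAINER for t in containers_over_time}

print(input_sizes)
print(padded_series)
print(aggregated_sizes)
```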
  • The main steps of the method according to the invention are illustrated in the block diagram of figure 2.
  • The first step is to identify the programming language and the technology of each container. When it is available, the tag identifying which microservice this container is associated to is also read.
  • Based on that information, containers sharing the same technology, the same programming language and the same tag are grouped. By doing this, the method is able to identify all containers that each microservice comprises.
  • Then, for each microservice to be monitored, the metrics of all the containers associated to that microservice are aggregated to get a unique value. Then, there is only one value for which we need to monitor the evolution.
  • After the aggregation of the metrics, the obtained value is transmitted to a model trained beforehand.
  • Those steps are repeated over time, and the successive values of the aggregated metric are transmitted to the model. If the evolution is abnormal compared to the scheme of evolution the model has previously learnt, an alert is emitted.
  • Those steps can be applied to all metrics we want to monitor. Each metric can be transmitted to a specific model trained beforehand.
  • As an example, we want to monitor the CPU usage metric. We consider that each container has only one CPU usage metric. During daytime, the aggregated CPU usage value of a given microservice is the mean of 20 CPU usage values. During night-time, the aggregated CPU usage value is the mean of 5 CPU usage values. In either case, there is only one value to monitor.
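  • The CPU usage example can be sketched as follows, using the mean as the aggregation function as in the paragraph above. The sample values and the function name `aggregate_cpu` are assumptions for this illustration.

```python
from statistics import mean
from typing import Sequence


def aggregate_cpu(cpu_usages: Sequence[float]) -> float:
    """Aggregate the CPU usage of all alive containers of a microservice into one value."""
    return mean(cpu_usages)


# Hypothetical readings: 20 containers during daytime, 5 during night-time.
daytime = [0.50 + 0.01 * i for i in range(20)]
night = [0.10, 0.12, 0.08, 0.11, 0.09]

# In both cases a single value is passed on to the monitoring step.
print(aggregate_cpu(daytime))
print(aggregate_cpu(night))
```

    The monitoring step therefore receives an input of constant size even though the number of containers changed from 20 to 5.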
  • The invention is not limited to the examples that have just been described. For example, different models trained beforehand may be applied to the aggregated metrics according to the invention.

Claims (14)

  1. Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising one or more microservices, each microservice comprising one or more containers, each container having a plurality of metrics of different types and relative to the state of said container, the method comprising the following steps:
    a. for a first given microservice, grouping all alive containers associated to said first microservice,
    b. choose a type of metric and aggregate at least one metric relative to each previously grouped container to get a unique aggregated metric of this type of metric,
    c. monitoring the evolution of said at least one aggregated metric, and
    d. emit an alert at least if, while monitoring the evolution of said at least one aggregated metric, a value thereof is determined to be outside a predefined range of values.
  2. Computer-implemented method as claimed in claim 1, the step of grouping containers associated to said first microservice comprising the following steps:
    i. for at least a part of the containers, better for all the containers, determine the programming language and/or the technology of each container, and
    ii. group together containers having the same programming language and/or the same technology.
  3. Computer-implemented method as claimed in any one of the preceding claims, the step of grouping containers associated to said first microservice comprising the following steps:
    i. for at least a part of the containers, better for all the containers associated to said first microservice, detect at least one tag identifying which microservice this container is associated to, and
    ii. group together containers having the same tag.
  4. Computer-implemented method as claimed in claim 1, the step of grouping containers associated to said first microservice comprising the following steps:
    i. for at least a part of the containers, better for all the containers, determine the programming language and the technology of each container,
    ii. for said at least a part of the containers, better for all the containers associated, detect at least one tag identifying which microservice this container is associated to, and
    iii. group together containers having the same programming language, the same technology and the same tag.
  5. Computer-implemented method as claimed in any one of the preceding claims, the step of monitoring the evolution of said at least one aggregated metric being performed by comparing the value of said at least one aggregated metric to a predetermined threshold, an alert being emitted if said value is greater or smaller than said predetermined threshold.
  6. Computer-implemented method as claimed in any one of the preceding claims, the step of monitoring the evolution of said at least one aggregated metric being performed by applying a statistical algorithm to the values of said at least one aggregated metric.
  7. Computer-implemented method as claimed in any one of the preceding claims, the step of monitoring the evolution of said at least one aggregated metric being performed by transmitting values of said at least one aggregated metric to a model trained beforehand on previously acquired aggregated metrics, an alert being emitted if said values deviate from the values of said previously acquired aggregated metrics.
  8. Computer-implemented method as claimed in any one of the preceding claims, the metrics being incremental metrics or punctual metrics.
  9. Computer-implemented method as claimed in any one of the preceding claims, the metrics being chosen from the following: total CPU usage, CPU usage in user space, CPU usage in kernel space, Memory usage, ratio of big memory pages, disk usage, number of write and read operations, volume of traffic entering via network, volume of traffic exiting via network, usage of cache memory, number of tables of a database, waiting time, latency, number of operations per unit of time, number of errors, queue size of a messaging broker, connection number to an API or database, large query number for a database, number of missed cache memory hits.
  10. Computer-implemented method as claimed in any one of the preceding claims, the cloud infrastructure performing at least 5 microservices, better at least 10 microservices, better at least 100 microservices, better at least 1 000 microservices, better at least 10 000 microservices, better at least 100 000 microservices, better at least 1×10^6 microservices, better at least 1×10^9 microservices, better at least 1×10^12 microservices.
  11. Computer-implemented method as claimed in any one of the preceding claims, each microservice being executed by one container, better by at least 5 containers, better by at least 10 containers, better by at least 100 containers, better by at least 1 000 containers, better by at least 10 000 containers, better by at least 100 000 containers, better by at least 1×10^6 containers, better by at least 1×10^9 containers.
  12. Computer-implemented method as claimed in any one of the preceding claims, the method being performed without accessing configuration file(s) of the cloud infrastructure.
  13. System for automatically detecting anomalies in a cloud infrastructure performing one or more microservices, the system being adapted to perform the method according to any one of the preceding claims.
  14. A computer program comprising instructions which, when the program is executed on a computer, cause the computer to carry out the steps of the method according to any one of claims 1 to 12.
EP22305998.1A 2022-07-04 2022-07-04 Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising microservices Pending EP4303730A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22305998.1A EP4303730A1 (en) 2022-07-04 2022-07-04 Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising microservices

Publications (1)

Publication Number Publication Date
EP4303730A1 true EP4303730A1 (en) 2024-01-10

Family

ID=83444875

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22305998.1A Pending EP4303730A1 (en) 2022-07-04 2022-07-04 Computer-implemented method for automatically detecting anomalies in a cloud infrastructure comprising microservices

Country Status (1)

Country Link
EP (1) EP4303730A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11347622B1 (en) * 2020-10-06 2022-05-31 Splunk Inc. Generating metrics values for teams of microservices of a microservices-based architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SAMIR AREEG ET AL: "DLA: Detecting and Localizing Anomalies in Containerized Microservice Architectures Using Markov Models", 2019 7TH INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD (FICLOUD), IEEE, 26 August 2019 (2019-08-26), pages 205 - 213, XP033698064, DOI: 10.1109/FICLOUD.2019.00036 *

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR