WO2023185090A1 - Scheduling method and device based on microservice link analysis and reinforcement learning - Google Patents

Scheduling method and device based on microservice link analysis and reinforcement learning

Info

Publication number
WO2023185090A1
WO2023185090A1 (PCT/CN2022/138189, CN2022138189W)
Authority
WO
WIPO (PCT)
Prior art keywords
cloud server
microservice
link
data
reinforcement learning
Prior art date
Application number
PCT/CN2022/138189
Other languages
English (en)
French (fr)
Inventor
徐敏贤
宋承浩
叶可江
须成忠
Original Assignee
深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2023185090A1 publication Critical patent/WO2023185090A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a scheduling method and device based on microservice link analysis and reinforcement learning.
  • the microservice scheduling algorithm of a cloud server has become one of the important indicators for testing the load capacity of cloud servers under highly variable loads.
  • An excellent microservice link scheduling algorithm reduces wasted server resources for cloud server providers. It also greatly improves the experience of cloud server users, improves the stability of cloud servers, makes the resource allocation of microservices in cloud servers more reasonable, and improves the robustness of microservices across the entire cloud server cluster.
  • the mainstream cloud server microservice link delay scheduling algorithms include resource allocation scheduling algorithms and microservice scheduling algorithms based on the longest link. Among them:
  • The resource allocation and scheduling algorithm allocates the preset resources of the cloud server microservices themselves. It achieves good results when resources are sufficient: resource competition between microservices is reduced as much as possible, which improves the stability of the cloud server to a certain extent.
  • The microservice scheduling algorithm based on the longest link allocates resources according to the existing link relationships of the microservices. For the microservices in the cluster, the longest link or links can be obtained by analyzing their link relationships, and resources are allocated on this basis: all microservices on the longest link are allocated resources first, and allocation proceeds from the longest link to the shortest. This method alleviates resource competition between microservices to a certain extent, gives longer links priority for more resources, and also reduces the overall delay of the microservices.
  • Embodiments of the present invention provide a scheduling method and device based on microservice link analysis and reinforcement learning to at least solve the technical problem of microservice delay in existing cloud servers.
  • a scheduling method based on microservice link analysis and reinforcement learning including the following steps:
  • the workload and link analyzer based on the deep learning model analyzes and makes decisions on the links of cloud server microservices, selects the critical paths and key nodes with the longest delay, and obtains cloud server load data;
  • the reinforcement learning algorithm based on Deep Q-Learning is used to train on the cloud server load data, and a deep learning model adapted to different load conditions is trained with this algorithm;
  • the scheduling methods include: horizontal scaling, vertical scaling and brownout control.
  • the workload and link analyzer based on the deep learning model analyzing the links of cloud server microservices, selecting the critical paths and key nodes with the longest delay, and obtaining cloud server load data includes:
  • after collecting the relationship between the number of requests and the workload of the cloud server, using a deep learning algorithm to obtain the data for the entire number of requests required for the experiment;
  • generating requests and sending them to the request processor, which on the one hand sends the requests to the target cluster and on the other hand collects the load situation from the cloud server cluster;
  • judging the key nodes and obtaining the results through a classification method based on a decision tree.
  • obtaining the number of requests includes:
  • using MinMaxScaler to normalize the data set, where the operation of MinMaxScaler is based on the min-max scaling method;
  • the specific formulas are Formula 1 and Formula 2;
  • X represents the set of data to be processed;
  • X_min and X_max are the minimum and maximum values in the set;
  • the final processed data are represented by X_scaled.
  • after collecting the relationship between the number of requests and the workload of the cloud server and using a deep learning algorithm to obtain the data for the entire number of requests required for the experiment, the method also includes:
  • collecting the load situation, which includes: the delay of the microservices themselves, the resource consumption of the cloud server, the request delay of the microservices, and request success rate information.
  • judging the key nodes and obtaining the results through the decision-tree-based classification method includes:
  • obtaining a decision tree model for whether a node is a key node, where the decision tree has three decision results: key node, non-key node and potential key node; among these, each critical path includes at least one key node, while non-key nodes and potential key nodes may be absent from the critical path.
  • training on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and training a deep learning model adapted to different load conditions with this algorithm, includes:
  • after determining the locations of the key nodes, using the trained deep learning model to make scheduling decisions for the cloud server cluster, where the reinforcement learning uses Deep Q-Learning based on deep neural networks;
  • the calculation equation is as shown in Equation 3;
  • the Q-value generated by being in state s and executing action a is the reward r(s,a) plus the highest Q-value generated by the next state s'; γ is the discount factor, which controls the influence of future rewards on the Q-value in the current state;
  • the part of Equation 4 that stores the past Q-value is taken, which is Equation 5.
  • horizontal scaling adjusts resources by adding or deleting replicas of microservices on cloud servers in the cluster;
  • the cloud server cluster reduces its workload by adding replicas of microservices;
  • vertical scaling increases the resources that single or multiple cloud servers in the cluster allocate to single or multiple microservices, thereby reducing the workload of the cloud server cluster;
  • brownout control reduces microservice link latency by dynamically turning optional components on or off;
  • the turned-off optional components are later turned back on under brownout control.
  • a scheduling device based on microservice link analysis and reinforcement learning, including:
  • a load generator, used with the workload and link analyzer based on the deep learning model to analyze and make decisions on the links of cloud server microservices, select the critical paths and key nodes with the longest delay, and obtain cloud server load data;
  • the workload and link analyzer, used to train on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and to train deep learning models adapted to different load conditions with this algorithm;
  • a cluster scheduler, used to perform cluster scheduling on the cloud server cluster with the deep learning model;
  • the scheduling methods include: horizontal scaling, vertical scaling and brownout control.
  • a storage medium storing a program file that implements any one of the above scheduling methods based on microservice link analysis and reinforcement learning;
  • a processor used to run a program that, when running, executes any one of the above scheduling methods based on microservice link analysis and reinforcement learning.
  • the scheduling method and device based on microservice link analysis and reinforcement learning in the embodiments of the present invention schedule the cloud server cluster based on the deep learning model.
  • Three different scheduling methods can be selected for the reinforcement learning: horizontal scaling, vertical scaling and brownout control.
  • the present invention uses the workload and link analyzer based on the deep learning model to analyze and make decisions on the links of microservices, and to select the critical path and key nodes with the longest delay.
  • This invention trains on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and uses the algorithm to train a deep learning model adapted to different load conditions.
  • the present invention uses the maximum-delay link among the microservice links to determine the locations of the key nodes in the link.
  • this invention thus solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; optimizing cloud server resource scheduling on this basis can effectively alleviate the existing microservice link delay problem.
  • Figure 1 is a system model diagram of the scheduling method based on microservice link analysis and reinforcement learning according to the present invention
  • Figure 2 is a diagram of three scheduling strategies based on microservice link analysis and reinforcement learning in this invention.
  • a scheduling method based on microservice link analysis and reinforcement learning including the following steps:
  • the workload and link analyzer based on the deep learning model analyzes and makes decisions on the links of cloud server microservices, selects the critical paths and key nodes with the longest delay, and obtains cloud server load data;
  • the reinforcement learning algorithm based on Deep Q-Learning is used to train on the cloud server load data, and a deep learning model adapted to different load conditions is trained with this algorithm;
  • the scheduling methods include: horizontal scaling, vertical scaling and brownout control.
  • the scheduling method based on microservice link analysis and reinforcement learning in the embodiments of the present invention schedules the cloud server cluster based on the deep learning model.
  • Three different scheduling methods can be selected for the reinforcement learning: horizontal scaling, vertical scaling and brownout control.
  • the present invention uses the workload and link analyzer based on the deep learning model to analyze and make decisions on the links of microservices, and to select the critical path and key nodes with the longest delay.
  • This invention trains on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and uses the algorithm to train a deep learning model adapted to different load conditions.
  • the present invention uses the maximum-delay link among the microservice links to determine the locations of the key nodes in the link.
  • this invention thus solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; optimizing cloud server resource scheduling on this basis can effectively alleviate the existing microservice link delay problem.
  • the workload and link analyzer based on the deep learning model analyzing the links of cloud server microservices, selecting the critical paths and key nodes with the longest delay, and obtaining cloud server load data includes:
  • after collecting the relationship between the number of requests and the workload of the cloud server, using a deep learning algorithm to obtain the data for the entire number of requests required for the experiment;
  • generating requests and sending them to the request processor, which on the one hand sends the requests to the target cluster and on the other hand collects the load situation from the cloud server cluster;
  • judging the key nodes and obtaining the results through a classification method based on a decision tree.
  • obtaining the number of requests includes:
  • using MinMaxScaler to normalize the data set, where the operation of MinMaxScaler is based on the min-max scaling method;
  • the specific formulas are Formula 1 and Formula 2;
  • X represents the set of data to be processed;
  • X_min and X_max are the minimum and maximum values in the set;
  • the final processed data are represented by X_scaled.
  • after collecting the relationship between the number of requests and the workload of the cloud server and using a deep learning algorithm to obtain the data for the entire number of requests required for the experiment, the method also includes:
  • collecting the load situation, which includes: the delay of the microservices themselves, the resource consumption of the cloud server, the request delay of the microservices, and request success rate information.
  • judging the key nodes and obtaining the results through the decision-tree-based classification method includes:
  • obtaining a decision tree model for whether a node is a key node, where the decision tree has three decision results: key node, non-key node and potential key node; among these, each critical path includes at least one key node, while non-key nodes and potential key nodes may be absent from the critical path.
  • training on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and training a deep learning model adapted to different load conditions with this algorithm, includes:
  • after determining the locations of the key nodes, using the trained deep learning model to make scheduling decisions for the cloud server cluster, where the reinforcement learning uses Deep Q-Learning based on deep neural networks;
  • the Q-value generated by being in state s and executing action a is the reward r(s,a) plus the highest Q-value generated by the next state s'; γ is the discount factor, which controls the influence of future rewards on the Q-value in the current state;
  • the part of Equation 4 that stores the past Q-value is taken, which is Equation 5.
  • horizontal scaling adjusts resources by adding or deleting replicas of microservices on cloud servers in the cluster;
  • the cloud server cluster reduces its workload by adding replicas of microservices;
  • vertical scaling increases the resources that single or multiple cloud servers in the cluster allocate to single or multiple microservices, thereby reducing the workload of the cloud server cluster;
  • brownout control reduces microservice link latency by dynamically turning optional components on or off;
  • the turned-off optional components are later turned back on under brownout control.
  • user-oriented cloud computing services provide many conveniences to users.
  • User-facing microservices are usually supported by distributed cloud server clusters. Users are highly sensitive to server link delay when using these services, because the link delay is directly related to the quality of service and thus affects the user experience.
  • Link latency within a microservice cluster is caused by microservices within the server cluster competing with each other for resources.
  • the present invention is based on reinforcement learning and proposes an efficient intelligent scheduling method that adjusts online according to different cloud server load conditions, using methods including horizontal scaling and vertical scaling to reallocate resources within the machines.
  • the purpose of the present invention is to use a server load scheduling algorithm to mitigate the impact of the link delays of the microservices in cloud servers.
  • the present invention uses the maximum-delay link among the microservice links to determine the locations of the key nodes in the link. Compared with the microservice scheduling algorithm based on the longest link, this invention solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; optimizing cloud server resource scheduling on this basis can effectively alleviate the existing microservice link delay problem.
  • the present invention schedules cloud server clusters based on a deep learning model.
  • Reinforcement learning can use three different scheduling methods: horizontal scaling, vertical scaling and brownout control.
  • the present invention uses the workload and link analyzer based on the deep learning model to analyze and make decisions on the links of microservices, and selects the link with the longest delay (the critical path) and its nodes (the key nodes).
  • This invention trains on the cloud server load data with the reinforcement learning algorithm based on Deep Q-Learning, and uses the algorithm to train a deep learning model adapted to different load conditions. Specifically:
  • the first data set used in this invention is cluster-trace-v2018, the 2018 cloud server cluster workload data set from Alibaba.
  • This data set contains workload data spanning 8 days, generated by 4034 homogeneous servers. Each of these servers has 96 CPU cores and the same amount of memory.
  • the present invention selects machine_usage.csv from the data set for analysis.
  • the machine_usage.csv data set includes the machine number machine_id, the timestamp time_stamp, the CPU utilization of the machine cpu_util_percent, the memory occupancy of the machine at the current moment mem_util_percent, and other parameters; its records are sampled at 10-second intervals.
  • the present invention removes reference variables that frequently contain null values, and assigns 0 to the null values of reference variables that only occasionally contain them.
  • SockShop simulates the user-facing part of an e-commerce website that sells socks; its purpose is to help users demonstrate and test microservice and cloud-native technologies. SockShop's microservices are designed with the smallest expected resource allocation, that is, the quota of each microservice is made as small as possible at initialization. The microservices use DNS to find other related microservices. When performing scheduling tests on the cloud server, a load balancer or service router can be inserted into the microservice framework according to the needs of the present invention.
  • the SockShop version used in this invention is the official Kubernetes-based version, which in addition to the basic microservice components also includes a Jaeger-based microservice monitoring component.
  • Another experimental platform used by this invention is train-ticket.
  • This project is a train ticket booking system based on a microservice architecture and contains 41 microservices.
  • the programming languages and frameworks used include Java, Python, Go, etc.
  • train-ticket simulates the front end and back end of an online ticketing website, including functions such as ticket purchase, ticket sales, ticket query, refund and login.
  • It also includes a Jaeger-based microservice tracing component, a control microservice for monitoring system status, and a workload balancer for the overall stability of the service.
  • the system model of the present invention includes the following components: load generator, workload and link analyzer, and cluster scheduler.
  • Historical workload refers to the workload regularly collected from a running cloud server cluster by a specific server or a built-in program. It mainly includes information such as the timestamp, machine number, CPU utilization, and memory occupancy. In this invention, the historical workload comes from Alibaba's data set.
  • the load generator is composed of a series of components, including the workload processor, a Locust-based request generator, and a database that stores historical workloads. These components coordinate with each other to extract the required workload data and features from the historical database and process them, then simulate real user behavior through the load generator, convert it into requests to the server cluster, and send them out.
  • The main task of the workload and link analyzer is to analyze the requests sent from the cloud server and perform link analysis on them based on the link relationships of the microservices, in order to find the key nodes and critical paths, and then to send scheduling requests to the microservices of the cloud server group through the cluster scheduler, which deploys the deep learning model.
  • The cluster scheduler mainly acts on the cloud server cluster and completes the scheduling of microservices on the corresponding cloud servers by accepting scheduling requests from the workload and link analyzer. The scheduling methods are mainly divided into three types: horizontal scaling (Horizontal), vertical scaling (Vertical) and brownout control (Brownout).
  • A critical path is the path among all the longest non-repeating links in the entire system whose sum of microservice delays along the link is the largest; in the same microservice cluster there can be one or more critical paths. Key nodes are identified by testing the resource-utilization behavior of different microservices after the critical path has been found, that is, by limiting their resource consumption under a certain number of requests and observing the stability and latency of the entire system, so as to distinguish key nodes from non-key nodes on the path.
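The critical-path definition above can be sketched as a recursive search over the microservice call graph. The graph shape, service names and delay values below are illustrative assumptions, not data from the patent:

```python
# Sketch: find the critical path (the root-to-leaf path with the largest
# sum of per-microservice delays) in a microservice call graph.

def critical_path(graph, delay, root):
    """Return (total_delay, path) of the maximum-delay path from root."""
    children = graph.get(root, [])
    if not children:                     # leaf microservice
        return delay[root], [root]
    best_total, best_path = max(
        critical_path(graph, delay, c) for c in children
    )
    return delay[root] + best_total, [root] + best_path

# Toy call graph: frontend -> {catalogue, orders}, orders -> payment
graph = {"frontend": ["catalogue", "orders"], "orders": ["payment"]}
delay = {"frontend": 5, "catalogue": 20, "orders": 8, "payment": 30}

total, path = critical_path(graph, delay, "frontend")
print(total, path)   # 43 ['frontend', 'orders', 'payment']
```

Note that the shorter two-hop path through `orders` wins over the longer-named but cheaper `catalogue` branch, which is exactly the mismatch between link length and link delay that the text discusses.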
  • This step corresponds to Step 1 in Figure 1: the preprocessing and data-cleaning part of the workload, which processes the raw data obtained from the actual data sets. Whether the raw data comes from the Alibaba or the Google cloud workload, this invention first removes columns containing empty data, because whether a zero-filling scheme is used or the data is ignored directly, these redundant items would have a negative impact on the predictions of the present invention. After this, the present invention groups the data set by time series and calculates the average value of each parameter with the same timestamp; this can be done with a grouping function (Python's groupby). Next, the present invention normalizes the Alibaba data set. Normalization is a data processing method that removes the influence of differing scales and units.
  • normalization can not only improve the convergence speed of the model, but also improve the accuracy of prediction.
  • the present invention chooses the first normalization method and uses MinMaxScaler to implement it. The operation of MinMaxScaler is based on the min-max scaling method; the specific formulas are shown in Formula 1 and Formula 2 below.
  • This invention uses MinMaxScaler to transform each feature, scaling each feature to a value between 0 and 1.
  • X represents the set of data to be processed.
  • X_min and X_max are the minimum and maximum values in the set. The final processed data are represented by X_scaled.
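The min-max scaling described by Formulas 1 and 2 (written out here in their standard MinMaxScaler form, an assumption since the original figures are not reproduced on this page) can be sketched as:

```python
# Sketch of min-max scaling. The sample values are illustrative.

def min_max_scale(xs, feature_range=(0.0, 1.0)):
    lo, hi = feature_range
    x_min, x_max = min(xs), max(xs)
    # Formula 1: X_std = (X - X_min) / (X_max - X_min)
    # Formula 2: X_scaled = X_std * (hi - lo) + lo
    return [(x - x_min) / (x_max - x_min) * (hi - lo) + lo for x in xs]

cpu_util = [10, 35, 60, 85]     # e.g. cpu_util_percent samples
scaled = min_max_scale(cpu_util)
print(scaled)                   # [0.0, 0.333..., 0.666..., 1.0]
```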
  • Step 2: Converting the workload into request counts based on user behavior
  • After obtaining the processed Alibaba workload, the present invention achieves the same effect as the original data set by simulating request generation in the cluster. After collecting the relationship between the number of requests and the workload of the cloud server, the present invention uses a deep learning algorithm to obtain data on the entire number of requests required for the experiment.
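As a rough sketch of this workload-to-requests conversion, a plain least-squares line stands in below for the learned model described in the text (an assumption, since the patent does not specify the model); the calibration numbers are illustrative:

```python
# Sketch: learn the request-count -> workload relationship from
# calibration samples, then invert it to turn a workload trace into
# request counts.

def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Calibration: requests/s vs observed CPU utilisation (illustrative)
reqs = [100, 200, 300, 400]
cpu = [12.0, 22.0, 32.0, 42.0]
a, b = fit_line(reqs, cpu)               # cpu ~= a * reqs + b

workload_trace = [17.0, 27.0, 37.0]      # cpu_util_percent from the trace
request_trace = [(c - b) / a for c in workload_trace]
print([round(r) for r in request_trace])  # [150, 250, 350]
```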
  • the present invention analyzes user behavior and records the different locations of the website that users visit.
  • the different locations of the website refer here to the microservice-based homepage and its sub-pages, such as the website homepage, index page, category page, login page, etc.
  • the present invention sends the request to the target cluster on the one hand, and collects the load status from the cloud server cluster on the other hand.
  • These load conditions mainly include information such as the delay of the microservices themselves, the resource consumption of the cloud server, the request delay of the microservices, and the request success rate. This information is passed to the workload and link analyzer. Through critical path analysis of the microservices, the link relationships of the microservices in the current state can be obtained; combined with the collected link delay data, the critical path of the microservices can be determined.
  • the present invention makes judgments and obtains results through a classification method based on decision trees.
  • a decision tree model that accurately determines whether a node is a key node is obtained.
  • the decision tree has three decision results: key node, non-key node and potential key node.
  • each critical path includes at least one key node, while non-key nodes and potential key nodes may be absent from the critical path.
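A minimal sketch of the three-way classification follows; the features and the split thresholds are illustrative assumptions, not the patent's learned decision tree:

```python
# Sketch: rule-based stand-in for the decision tree that labels a
# microservice node as key / potential-key / non-key. Features assumed:
# whether the node sits on the critical path, its share of the path
# delay, and how much its delay grows when its resources are limited.

def classify_node(on_critical_path, delay_share, delay_growth):
    if not on_critical_path:
        return "non-key"
    if delay_share > 0.3 or delay_growth > 2.0:   # first split
        return "key"
    if delay_growth > 1.2:                        # second split
        return "potential-key"
    return "non-key"

print(classify_node(True, 0.45, 1.0))   # key
print(classify_node(True, 0.10, 1.5))   # potential-key
print(classify_node(False, 0.50, 3.0))  # non-key
```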
  • After determining the locations of the key nodes, the present invention uses the trained deep learning model to make scheduling decisions for the cloud server cluster.
  • the reinforcement learning uses Deep Q-Learning, a deep reinforcement learning method based on deep neural networks.
  • Deep Q-Learning is a commonly used reinforcement learning method. Building on Q-Learning, the Q-Table generated over multiple iterations is converted into a neural network with all the corresponding parameters, which is then trained. For Q-Learning, the present invention trains the model by calculating the Q-value after a series of actions. The calculation equation is shown in Formula 3:
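Formula 3 itself is not reproduced on this page; in standard Q-Learning notation (an assumption, since the original figure is unavailable) the relation described in the next two sentences is:

```latex
Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s',a')
```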
  • the Q-value generated by being in state s and performing action a is the reward r(s,a) plus the highest Q-value that may be generated by the next state s'.
  • γ is the discount factor, which controls the impact of future rewards on the Q-value in the current state.
  • the present invention iterates the equation in Formula 3 above to obtain Formula 4, the equation under the final convergence condition.
  • the training target of the present invention is the part of Formula 4 above that stores the past Q-value, that is, Formula 5.
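A minimal tabular sketch of the update behind Formulas 3 to 5 follows, writing the standard Q-Learning forms in comments since the patent's figures are not reproduced; the states, actions, learning rate and reward are illustrative assumptions:

```python
# Sketch of the tabular Q-Learning update. Assumed standard forms:
#   Formula 3: Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')
#   Formula 4: Q(s,a) <- Q(s,a) + alpha * [target - Q(s,a)]
#   Formula 5 (training target): r + gamma * max_a' Q(s',a')

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Formula 5: the training target
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Formula 4: move Q(s, a) toward the target by learning rate alpha
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = ["horizontal", "vertical", "brownout"]
Q = {(s, a): 0.0 for s in ["high_load", "normal"] for a in actions}

# One experience: under high load, horizontal scaling earned reward 1.0
q_update(Q, "high_load", "horizontal", 1.0, "normal", actions)
print(Q[("high_load", "horizontal")])   # 0.5
```

In the Deep Q-Learning variant described in the text, the table `Q` is replaced by a neural network that predicts Q-values, but the target in Formula 5 stays the same.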
  • the present invention uses three strategies to schedule the microservices in the cloud server cluster.
  • the three scheduling strategies are shown in Figure 2: horizontal scaling (Horizontal), vertical scaling (Vertical) and brownout control (Brownout).
  • Horizontal scaling is also called scale-out. It adjusts resources by adding or removing replicas of microservices on cloud servers within the cluster, to improve resource usage and system availability. A cloud server cluster can reduce its workload by adding replicas of microservices.
  • Vertical scaling is also called scale-up. It adjusts the processing capacity of the current cloud server by adjusting the amount of CPU, memory or network resources allocated to microservice instances. A cloud server cluster can reduce its workload by increasing the resources that one or more cloud servers in the cluster provide for one or more microservices.
  • Brownout control reduces microservice link delay by dynamically turning optional components on or off. Because it affects the completeness of the overall microservice to some extent, it is generally triggered only in extreme situations, for example when the cloud server cluster remains in an overloaded state. Once the cloud service and cluster transition from the extreme condition back to normal operation, the optional components that were switched off are re-enabled by the brownout controller.
  • a scheduling device based on microservice link analysis and reinforcement learning, comprising:
  • a load generator, used to analyze the links of cloud-server microservices and make decisions with the deep-learning-model-based workload and link analyzer, select the critical path and key nodes with the longest delay, and obtain cloud server load data;
  • a workload and link analyzer, used to train on the cloud server load data with the Deep Q-Learning-based reinforcement learning algorithm and to train deep learning models adapted to different load conditions;
  • a cluster scheduler, used to perform cluster scheduling on the cloud server cluster with the deep learning model.
  • the scheduling methods include: horizontal scaling, vertical scaling, and brownout control.
  • the scheduling device based on microservice link analysis and reinforcement learning in the embodiment of the present invention schedules the cloud server cluster based on a deep learning model.
  • Reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control.
  • the present invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the critical path and key nodes with the longest delay.
  • Based on the Deep Q-Learning reinforcement learning algorithm, the present invention trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions.
  • the present invention uses the maximum-delay link among the microservice links to determine the location of the key nodes in the link.
  • this solves the possible mismatch between link length and microservice delay and yields an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
  • As an important part of IT in the new era, user-oriented cloud computing services provide many conveniences to users.
  • User-facing microservices are usually backed by distributed cloud server clusters. Users are highly sensitive to link delay when using these services, because link delay directly determines the quality of service and therefore the user experience.
  • Link latency within a microservice cluster is caused by microservices in the server cluster competing with one another for resources.
  • To address the high link delay of microservices in cloud servers, the present invention, based on reinforcement learning, proposes an efficient intelligent scheduling device that adjusts online according to different cloud server load conditions, using methods including horizontal scaling and vertical scaling to reallocate resources within the machines.
  • the purpose of the present invention is to use a server load scheduling algorithm to mitigate the impact of the link delays of microservices in cloud servers.
  • the present invention uses the maximum-delay link among the microservice links to locate the key nodes in the link. Compared with the longest-link-based microservice scheduling algorithm, it solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
  • To this end, the present invention schedules the cloud server cluster based on a deep learning model.
  • Reinforcement learning can use three different scheduling methods: horizontal scaling, vertical scaling, and brownout control.
  • the present invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the longest-delay link (the critical path) and its nodes (the key nodes).
  • Based on the Deep Q-Learning reinforcement learning algorithm, the present invention trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. Specifically:
  • the first data set used in this invention is cluster-trace-v2018, the 2018 cloud-server-cluster workload data set from Alibaba.
  • This data set contains workload data spanning 8 days, generated by 4034 homogeneous servers. Each of these servers has 96 CPU cores and the same amount of memory.
  • the present invention selects machine_usage.csv from the data set for analysis.
  • the machine_usage.csv data set includes the machine number machine_id, the timestamp time_stamp, the CPU utilization of the machine cpu_util_percent, the memory occupancy of the machine at the current moment mem_util_percent, and other parameters; its records are sampled at 10-second intervals.
  • the present invention removes some reference variables that often have null values and assigns 0 to the null values of reference variables that occasionally have null values.
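The null-value policy just described (dropping reference variables that are frequently null, zero-filling occasional nulls) might look like the following pandas sketch. The 50% drop threshold and the column `mostly_missing` are illustrative assumptions, not values from the patent:

```python
import pandas as pd
import numpy as np

# Illustrative frame mimicking machine_usage.csv-style columns.
df = pd.DataFrame({
    "machine_id":       [1, 2, 3, 4],
    "cpu_util_percent": [30.0, np.nan, 55.0, 60.0],    # occasional null -> fill 0
    "mostly_missing":   [np.nan, np.nan, np.nan, 7.0], # frequent nulls -> drop
})

NULL_DROP_RATIO = 0.5  # assumed threshold: drop columns with >50% nulls
null_ratio = df.isna().mean()
df = df.drop(columns=null_ratio[null_ratio > NULL_DROP_RATIO].index)

# Remaining occasional nulls are assigned 0, as described in the text.
df = df.fillna(0)
```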
  • The invention uses two experimental platforms; one is a SockShop-based microservice platform. SockShop simulates the user-facing part of an e-commerce website that sells socks and is intended for demonstrating and testing microservices and cloud-native technologies. SockShop's microservices are designed with minimal expected resource allocations, i.e. the quota of each microservice is as small as possible at initialization. Microservices use DNS to find related microservices. When running scheduling tests on the cloud servers, a load balancer or service router can be inserted into the microservice framework according to the needs of the present invention.
  • the SockShop version used in this invention is the official version based on Kubernetes, which in addition to the basic microservice components also includes a Jaeger-based microservice monitoring component.
  • The other experimental platform used by this invention is train-ticket.
  • This project is a train ticket booking system based on a microservice architecture and contains 41 microservices.
  • the programming languages and frameworks used include Java, Python, Go, etc.
  • train-ticket simulates the front end and back end of an online ticketing website, including functions such as ticket purchase, ticket sales, ticket query, refund, and login. It also includes a Jaeger-based microservice tracing component for monitoring system status and a workload balancer that maintains the overall stability of the service.
  • the system model of the present invention includes the following components: load generator, workload and link analyzer, and cluster scheduler.
  • Historical workload refers to the workload regularly collected from a running cloud server cluster by a dedicated server or a built-in program. It mainly includes the timestamp, machine number, CPU utilization, memory usage, and similar information. In this invention, the historical workload comes from the Alibaba data set.
  • the load generator is composed of a series of components, including a workload processor, a Locust-based request generator, and a database storing historical workloads. These components cooperate to extract the required workload data and features from the historical database and process them, then simulate real user behavior through the load generator and convert it into requests sent to the server cluster.
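A minimal sketch of the load-generator idea (historical workload → request counts to replay), without the Locust machinery; the linear mapping from CPU utilization to requests per interval is a made-up assumption for illustration:

```python
# Convert a historical CPU-utilization trace (percent, one sample per
# 10-second interval) into a request count per interval. The linear
# mapping below is purely illustrative; the patent learns this
# relationship with a deep learning algorithm.
def workload_to_requests(cpu_util_trace, requests_per_percent=4):
    return [round(u * requests_per_percent) for u in cpu_util_trace]

# A real generator would now replay these counts against the cluster
# (e.g. via a Locust-based request generator, as the text describes).
trace = [10.0, 25.5, 60.0]          # toy historical trace
plan = workload_to_requests(trace)  # requests to send per interval
```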
  • Workload and link analyzer: its main task is to analyze the requests sent to the cloud servers and, combined with the link relationships of the microservices, perform link analysis on them to find the key nodes and critical paths. It then sends scheduling requests to the microservices of the cloud server group through the cluster scheduler on which the deep learning model is deployed.
  • Cluster scheduler: the cluster scheduler acts mainly on the cloud server cluster and schedules the microservices in the corresponding cloud servers by accepting scheduling requests from the workload and link analyzer. The scheduling methods are of three kinds: horizontal scaling (Horizontal), vertical scaling (Vertical), and brownout control (Brownout).
  • Key nodes and critical path: the critical path is the path in a microservice link graph whose sum of microservice delays is the largest among all the longest non-repeating links in the entire system. In the same microservice cluster there can be one or more critical paths. Key nodes are identified after finding the critical path by testing the resource-utilization behavior of the different microservices, i.e. limiting their resource consumption under a given number of requests and observing the stability and latency of the whole system, so as to distinguish key nodes from non-key nodes on the path.
  • This step covers Step 1 in Figure 1: the preprocessing and data-cleaning part of the workload, which processes the raw data obtained from the actual data sets. Whether the raw cloud workload comes from Alibaba or from Google, the present invention first removes columns containing empty data, because whether a zero-filling scheme is used or the data is ignored outright, these redundant items negatively affect the predictions of the present invention. After this, the invention sorts the data set by time series and computes the average of each parameter sharing the same timestamp, which can be done with a grouping function (the Python Groupby function). Next, the invention normalizes the Alibaba data set. Normalization is a data-processing method that reduces dimensionality.
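The timestamp-grouping step (averaging all samples that share a timestamp via Python's Groupby) can be sketched as follows; the toy values are illustrative:

```python
import pandas as pd

# Toy slice of the trace: two machines sampled at the same timestamps.
df = pd.DataFrame({
    "time_stamp":       [10, 10, 20, 20],
    "cpu_util_percent": [40.0, 60.0, 30.0, 50.0],
    "mem_util_percent": [20.0, 40.0, 10.0, 30.0],
})

# Average every parameter over rows sharing a timestamp (the Groupby step).
averaged = df.groupby("time_stamp", as_index=False).mean()
```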
  • From the perspective of model optimization, normalization not only improves the convergence speed of the model but also improves prediction accuracy.
  • Of the two common forms of normalization (rescaling numbers into decimals between 0 and 1, or converting dimensional expressions into dimensionless scalars), the present invention chooses the first and uses MinMaxScaler to realize it. The operation of MinMaxScaler is based on the min-max scaling method, given by Formulas 1 and 2 below:

X_std = (X − X_min) / (X_max − X_min)    (1)

X_scaled = X_std * (X_max − X_min) + X_min    (2)
  • This invention uses MinMaxScaler to transform each feature and scale it to a value between 0 and 1. X represents the set of data to be processed; X_min and X_max are the minimum and maximum values in the set; the final processed data are denoted X_scaled.
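Formulas 1 and 2 correspond to scikit-learn's MinMaxScaler with the default feature range of (0, 1); a minimal sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [30.0], [50.0]])  # one feature, three samples

# Min-max scaling (Formulas 1-2): each value becomes
# (x - X_min) / (X_max - X_min) for the default feature_range (0, 1).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```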
  • Step 2: converting the workload into request counts based on user behavior
  • After obtaining the processed Alibaba workload, the present invention reproduces the effect of the original data set by simulating request generation in the cluster. After collecting the relationship between the number of requests and the workload of the cloud servers, the invention uses a deep learning algorithm to obtain the request-count data for the entire experiment.
  • the present invention then analyzes user behavior and, based on that behavior, records the different locations of the website that users visit.
  • These locations refer to the microservice-based website home page and its sub-pages, such as the site home page, index page, category page, and login page. By simulating these different requests, the invention can model real-life user behavior more closely.
  • After requests are generated and sent to the request processor, the present invention on the one hand forwards them to the target cluster and on the other hand collects the load conditions from the cloud server cluster.
  • These load conditions mainly include the delay of the microservices themselves, the resource consumption of the cloud servers, the request delay of the microservices, and the request success rate. This information is passed to the workload and link analyzer; through critical-path analysis of the microservices, the link relationships in the current state are obtained, and combined with the collected link-delay data, the critical path of the microservices is derived.
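The critical-path idea (the non-repeating link whose summed microservice delays are largest, rather than the link with the most hops) can be sketched as a longest-path search over a microservice call DAG. The graph and delays below are hypothetical:

```python
# Call graph: service -> downstream services; delays: per-service latency (ms).
graph = {"gateway": ["orders", "catalog"],
         "orders":  ["payment"],
         "catalog": [],
         "payment": []}
delays = {"gateway": 5, "orders": 10, "catalog": 120, "payment": 20}

def critical_path(node):
    """Path from node with the maximum sum of delays (the critical path)."""
    best = max((critical_path(n) for n in graph[node]),
               key=lambda p: sum(delays[s] for s in p),
               default=[])
    return [node] + best

# gateway->catalog (2 hops, 125 ms total) beats gateway->orders->payment
# (3 hops, 35 ms total): link length and link delay need not match,
# which is exactly the mismatch the patent argues against.
path = critical_path("gateway")
```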
  • For the key nodes, the present invention makes the judgment through a decision-tree-based classification method.
  • By training on the dynamically changing link relationships, a decision tree model that accurately determines whether a node is a key node is obtained.
  • the decision tree has three possible outcomes: key node, non-key node, and potential key node.
  • Among these, every critical path includes at least one key node, while non-key nodes and potential key nodes need not lie on the critical path.
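A three-class decision tree of this kind could be sketched with scikit-learn as below. The per-node features, labels, and thresholds are hypothetical illustrations, not the patent's actual training data:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-node features: [delay_ms, cpu_share, on_critical_path].
X = [[120, 0.8, 1], [15, 0.2, 0], [60, 0.5, 1], [10, 0.1, 0], [90, 0.7, 1]]
# Labels: 2 = key node, 1 = potential key node, 0 = non-key node.
y = [2, 0, 1, 0, 2]

# A shallow tree suffices for this toy, linearly separable data.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
pred = tree.predict([[110, 0.75, 1]])[0]  # classify a newly observed node
```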
  • After determining the locations of the key nodes, the present invention uses the trained deep learning model to make scheduling decisions for the cloud server cluster.
  • the reinforcement learning used is Deep Q-Learning, deep reinforcement learning based on a deep neural network.
  • Deep Q-Learning is a commonly used reinforcement learning method. Building on Q-Learning, it replaces the Q-Table that would otherwise be built over many iterations with a neural network holding the corresponding parameters, which is then trained by deep learning. For Q-Learning, the present invention trains the model by computing the Q-value after a series of actions, as shown in Formula 3:

Q(s,a) = r(s,a) + γ max_a' Q(s',a')    (3)

  • the Q-value produced by being in state s and performing action a is the reward r(s,a) plus the highest Q-value that may be produced from the next state s'.
  • γ is the discount, which controls the influence of future rewards on the Q-value in the current state.
  • iterating the equation in Formula 3 above yields Formula 4, the equation at final convergence:

Q*(s,a) = Σ_s' P(s'|s,a)(R(s,a,s') + γ max_a' Q*(s',a'))    (4)

  • the training target of the present invention is the part of Formula 4 above that stores the past Q-value, namely Formula 5:

γ max_a' Q*(s',a')    (5)
  • the present invention uses three strategies to schedule the microservices in the cloud server cluster.
  • the three scheduling strategies are shown in Figure 2: horizontal scaling (Horizontal), vertical scaling (Vertical), and brownout control (Brownout).
  • Horizontal scaling, also known as scaling out, adjusts resources by adding or removing replicas of microservices on the cloud servers within the cluster, improving resource usage and system availability. The cloud server cluster can reduce its workload by adding replicas of microservices.
  • Vertical scaling, also known as scaling up, adjusts the processing capacity of the current cloud server by changing the amount of CPU, memory, or network resources allocated to microservice instances. The cluster can reduce its workload by increasing the resources that one or more cloud servers in the cluster devote to one or more microservices.
  • Brownout control reduces microservice link delay by dynamically turning optional components on or off. Because it affects the completeness of the overall microservice to some extent, it is generally triggered only in extreme situations, for example when the cloud server cluster remains in an overloaded state. Once the cloud service and cluster transition from the extreme condition back to normal operation, the optional components that were switched off are re-enabled by the brownout controller.
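A toy sketch of the brownout behavior just described (disabling optional components under extreme load and re-enabling them on recovery). The thresholds, the hysteresis band, and the component names are illustrative assumptions:

```python
# Optional components that brownout may switch off (illustrative names).
optional_components = {"recommendations": True, "reviews": True}

BROWNOUT_ON = 0.90   # assumed: enter brownout above 90% cluster utilization
BROWNOUT_OFF = 0.60  # assumed: leave brownout below 60% utilization

def brownout_step(cluster_util, components):
    """Dynamically disable or re-enable optional components (Brownout)."""
    if cluster_util > BROWNOUT_ON:
        return {name: False for name in components}  # extreme condition
    if cluster_util < BROWNOUT_OFF:
        return {name: True for name in components}   # back to normal
    return dict(components)                          # hysteresis band: keep

state = brownout_step(0.95, optional_components)  # overload -> switch off
state = brownout_step(0.50, state)                # recovered -> re-enable
```

The two thresholds form a hysteresis band so that components do not flap on and off around a single cutoff.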
  • a storage medium that stores program files capable of implementing any of the above scheduling methods based on microservice link analysis and reinforcement learning.
  • a processor is used to run a program, wherein when the program is running, it executes any one of the above scheduling methods based on microservice link analysis and reinforcement learning.
  • the present invention adopts a reinforcement-learning-based microservice link scheduling model to allocate resources to the longest-delay link of the cloud server. Compared with directly selecting the longest link without analysis, the present invention can solve problems in the field of cloud service scheduling where variability is greater, microservice links are more complex, and link delays are more sensitive.
  • the experiments of this invention use cloud data center workload data sets from Alibaba and Google.
  • the disclosed technical content can be implemented in other ways.
  • the system embodiments described above are only illustrative.
  • the division into units may be merely a logical functional division; in actual implementation there may be other ways of division.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the units or modules may be in electrical or other forms.
  • Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • the technical solution of the present invention, in essence or in the part contributing to the prior art, or the whole or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention.
  • the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program code.


Abstract

The present invention relates to a scheduling method and device based on microservice link analysis and reinforcement learning. The method and device schedule a cloud server cluster based on a deep learning model; reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control. The invention uses a workload and link analyzer based on the deep learning model to analyze the links of microservices and make decisions, selecting the critical path and key nodes with the longest delay. Based on the Deep Q-Learning reinforcement learning algorithm, the invention trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. The invention solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.

Description

Scheduling method and device based on microservice link analysis and reinforcement learning — Technical Field
The present invention relates to a scheduling method and device based on microservice link analysis and reinforcement learning.
Background Art
Microservice scheduling algorithms for cloud servers have become one of the important tests of a cloud server's capacity under highly variable load. An excellent microservice link scheduling algorithm can reduce the waste of server resources for cloud providers, greatly improve the experience of cloud server users, improve the stability of the cloud servers, make the resource allocation of microservices in the cloud servers more reasonable, and improve the robustness of the microservices in the whole cluster. At present, mainstream cloud-server microservice link-delay scheduling algorithms include resource-allocation scheduling algorithms and longest-link-based microservice scheduling algorithms. Among them:
(1) Resource-allocation scheduling algorithm: this algorithm allocates resources using the preset resources of the cloud server microservices themselves. It achieves good results when resources are sufficient. This allocation style minimizes resource competition among microservices and improves the stability of the cloud servers to a certain extent.
(2) Longest-link-based microservice scheduling algorithm: this algorithm allocates resources using the link relationships among microservices. For the microservices in a cluster, analyzing their link relationships yields the longest link or links, and resources are allocated by this criterion: all microservices on the longest microservice link are given resources first, proceeding from the longest link to the shortest. This allocation style relieves resource competition among microservices to some extent, gives longer links more resources first, and can also reduce the overall time delay of the microservices.
However, although many existing prediction methods such as regression can handle simple time-series prediction, they struggle with today's high-dimensional cloud-server workload prediction. Although recurrent neural networks and long short-term memory (LSTM) can predict server load relatively accurately over short horizons, their structure makes predictions over a longer window insufficiently accurate, and LSTM prediction is also slow and poorly timed.
Meanwhile, although many mainstream microservice resource scheduling methods can perform simple scheduling when resources are sufficient, they struggle with resource competition in delay-sensitive cloud server clusters. Under the resource-allocation scheduling algorithm, dozens of containers often run simultaneously on one node in a cloud server cluster, so microservice resources cannot be allocated according to the presets and resource competition arises; this scheduling algorithm does not fit current cloud-server cluster architectures. Under the longest-link-based algorithm, if the key node of a microservice is not on the longest link, resources cannot be correctly allocated to the key node, so the microservice delay problem in the cloud servers cannot be resolved in time.
Summary of the Invention
The embodiments of the present invention provide a scheduling method and device based on microservice link analysis and reinforcement learning, to at least solve the technical problem of microservice delay in existing cloud servers.
According to an embodiment of the present invention, a scheduling method based on microservice link analysis and reinforcement learning is provided, comprising the following steps:
analyzing the links of cloud-server microservices and making decisions with a workload and link analyzer based on a deep learning model, selecting the critical path and key nodes with the longest delay, and obtaining cloud server load data;
training on the cloud server load data with a reinforcement learning algorithm based on Deep Q-Learning, and using this algorithm to train deep learning models adapted to different load conditions;
performing cluster scheduling on the cloud server cluster with the deep learning model, where the scheduling methods include: horizontal scaling, vertical scaling, and brownout control.
Further, analyzing the links of cloud-server microservices with the deep-learning-model-based workload and link analyzer, selecting the critical path and key nodes with the longest delay, and obtaining cloud server load data comprises:
preprocessing and cleaning the workload, and processing the raw data obtained from the actual data set to obtain request counts;
after collecting the relationship between the number of requests and the workload of the cloud servers, obtaining the request-count data for the entire experiment through a deep learning algorithm;
after requests are generated and sent to the request processor, sending the requests to the target cluster on the one hand and collecting the load conditions from the cloud server cluster on the other;
passing the load-condition information to the workload and link analyzer, obtaining the link relationships of the microservices in the current state through critical-path analysis of the microservices, and deriving the critical path by combining the obtained link-delay data;
judging the key nodes through a decision-tree-based classification method and obtaining the results.
Further, preprocessing and cleaning the workload and processing the raw data from the actual data set to obtain request counts comprises:
first deleting columns containing empty data, sorting the data set by time series, and then computing the average of each parameter with the same timestamp using a grouping function;
normalizing the data set with MinMaxScaler; the operation of MinMaxScaler is based on the min-max scaling method, given by Formulas 1 and 2 below:

X_std = (X − X_min) / (X_max − X_min)    (1)

X_scaled = X_std * (X_max − X_min) + X_min    (2)

transforming each feature with MinMaxScaler and scaling it to a value between 0 and 1, where X denotes the set of data to be processed, X_min and X_max are the minimum and maximum values in the set, and the final processed data are denoted X_scaled.
Further, after collecting the relationship between the number of requests and the workload of the cloud servers and obtaining the request-count data for the entire experiment through the deep learning algorithm, the method further comprises:
analyzing user behavior and recording, based on that behavior, the different locations of the website that users visit; by simulating these different requests to the website, real-life user behavior is simulated.
Further, the load conditions include: the delay of the microservices themselves, the resource consumption of the cloud servers, the request delay of the microservices, and the request success rate.
Further, judging the key nodes through the decision-tree-based classification method and obtaining the results comprises:
training on the dynamically changing link relationships to obtain a decision tree model for whether a node is a key node; the decision tree has three outcomes: key node, non-key node, and potential key node; among these, every critical path includes at least one key node, and non-key nodes and potential key nodes do not exist on the critical path.
Further, training on the cloud server load data with the Deep Q-Learning-based reinforcement learning algorithm and training deep learning models adapted to different load conditions comprises:
after determining the locations of the key nodes, making scheduling decisions for the cloud server cluster through the trained deep learning model; the reinforcement learning used is Deep Q-Learning, deep reinforcement learning based on a deep neural network.
Further, the model is trained by computing the Q-value after a series of actions, as shown in Formula 3:

Q(s,a) = r(s,a) + γ max_a′ Q(s′,a′)    (3)

where the Q-value produced by being in state s and performing action a is the reward r(s,a) plus the highest Q-value produced by the next state s′; γ is the discount, which controls the influence of future rewards on the Q-value in the current state;
iterating the equation in Formula 3 yields Formula 4, the equation at final convergence:

Q*(s,a) = Σ_s′ P(s′∣s,a)(R(s,a,s′) + γ max_a′ Q*(s′,a′))    (4)

the part of Formula 4 that stores the past Q-value is obtained as Formula 5:

γ max_a′ Q*(s′,a′)    (5).
Further, horizontal scaling adjusts resources by adding or removing replicas of microservices on the cloud servers within the cluster; the cloud server cluster reduces its workload by adding replicas of microservices;
vertical scaling adjusts the processing capacity of the current cloud server by changing the amount of CPU, memory, or network resources allocated to the microservices; the cluster reduces its workload by increasing the resources that one or more cloud servers in the cluster devote to one or more microservices;
brownout control reduces microservice link delay by dynamically turning optional components on or off; once the cloud service and cluster transition from the extreme condition back to normal operation, the optional components that were switched off are re-enabled by the brownout controller.
According to another embodiment of the present invention, a scheduling device based on microservice link analysis and reinforcement learning is provided, comprising:
a load generator, used to analyze the links of cloud-server microservices and make decisions with the deep-learning-model-based workload and link analyzer, select the critical path and key nodes with the longest delay, and obtain cloud server load data;
a workload and link analyzer, used to train on the cloud server load data with the Deep Q-Learning-based reinforcement learning algorithm and to train deep learning models adapted to different load conditions;
a cluster scheduler, used to perform cluster scheduling on the cloud server cluster with the deep learning model, where the scheduling methods include: horizontal scaling, vertical scaling, and brownout control.
A storage medium is provided that stores a program file capable of implementing any of the above scheduling methods based on microservice link analysis and reinforcement learning.
A processor is provided for running a program, wherein the program, when running, executes any of the above scheduling methods based on microservice link analysis and reinforcement learning.
The scheduling method and device based on microservice link analysis and reinforcement learning in the embodiments of the present invention schedule the cloud server cluster based on a deep learning model; reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control. The invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the critical path and key nodes with the longest delay. Based on the Deep Q-Learning reinforcement learning algorithm, it trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. The invention uses the maximum-delay link among the microservice links to determine the location of the key nodes in the link. Compared with the longest-link-based microservice scheduling algorithm, it solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
Brief Description of the Drawings
The drawings described here provide a further understanding of the present invention and form a part of this application; the exemplary embodiments of the present invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Figure 1 is the system model diagram of the scheduling method based on microservice link analysis and reinforcement learning of the present invention;
Figure 2 is a diagram of the three scheduling strategies in the microservice link analysis and reinforcement learning of the present invention.
Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to such processes, methods, products, or devices.
Embodiment 1
According to an embodiment of the present invention, a scheduling method based on microservice link analysis and reinforcement learning is provided, comprising the following steps:
analyzing the links of cloud-server microservices and making decisions with a workload and link analyzer based on a deep learning model, selecting the critical path and key nodes with the longest delay, and obtaining cloud server load data;
training on the cloud server load data with a reinforcement learning algorithm based on Deep Q-Learning, and using this algorithm to train deep learning models adapted to different load conditions;
performing cluster scheduling on the cloud server cluster with the deep learning model, where the scheduling methods include: horizontal scaling, vertical scaling, and brownout control.
The scheduling method based on microservice link analysis and reinforcement learning in this embodiment schedules the cloud server cluster based on a deep learning model; reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control. The invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the critical path and key nodes with the longest delay. Based on the Deep Q-Learning reinforcement learning algorithm, it trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. The invention uses the maximum-delay link among the microservice links to determine the location of the key nodes in the link. Compared with the longest-link-based microservice scheduling algorithm, it solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
Here, analyzing the links of cloud-server microservices with the deep-learning-model-based workload and link analyzer, selecting the critical path and key nodes with the longest delay, and obtaining cloud server load data comprises:
preprocessing and cleaning the workload, and processing the raw data obtained from the actual data set to obtain request counts;
after collecting the relationship between the number of requests and the workload of the cloud servers, obtaining the request-count data for the entire experiment through a deep learning algorithm;
after requests are generated and sent to the request processor, sending the requests to the target cluster on the one hand and collecting the load conditions from the cloud server cluster on the other;
passing the load-condition information to the workload and link analyzer, obtaining the link relationships of the microservices in the current state through critical-path analysis of the microservices, and deriving the critical path by combining the obtained link-delay data;
judging the key nodes through a decision-tree-based classification method and obtaining the results.
Here, preprocessing and cleaning the workload and processing the raw data from the actual data set to obtain request counts comprises:
first deleting columns containing empty data, sorting the data set by time series, and then computing the average of each parameter with the same timestamp using a grouping function;
normalizing the data set with MinMaxScaler; the operation of MinMaxScaler is based on the min-max scaling method, given by Formulas 1 and 2 below:

X_std = (X − X_min) / (X_max − X_min)    (1)

X_scaled = X_std * (X_max − X_min) + X_min    (2)

transforming each feature with MinMaxScaler and scaling it to a value between 0 and 1, where X denotes the set of data to be processed, X_min and X_max are the minimum and maximum values in the set, and the final processed data are denoted X_scaled.
Here, after collecting the relationship between the number of requests and the workload of the cloud servers and obtaining the request-count data for the entire experiment through the deep learning algorithm, the method further comprises:
analyzing user behavior and recording, based on that behavior, the different locations of the website that users visit; by simulating these different requests to the website, real-life user behavior is simulated.
Here, the load conditions include: the delay of the microservices themselves, the resource consumption of the cloud servers, the request delay of the microservices, and the request success rate.
Here, judging the key nodes through the decision-tree-based classification method and obtaining the results comprises:
training on the dynamically changing link relationships to obtain a decision tree model for whether a node is a key node; the decision tree has three outcomes: key node, non-key node, and potential key node; among these, every critical path includes at least one key node, and non-key nodes and potential key nodes do not exist on the critical path.
Here, training on the cloud server load data with the Deep Q-Learning-based reinforcement learning algorithm and training deep learning models adapted to different load conditions comprises:
after determining the locations of the key nodes, making scheduling decisions for the cloud server cluster through the trained deep learning model; the reinforcement learning used is Deep Q-Learning, deep reinforcement learning based on a deep neural network.
Here, the model is trained by computing the Q-value after a series of actions, as shown in Formula 3:

Q(s,a) = r(s,a) + γ max_a′ Q(s′,a′)    (3)

where the Q-value produced by being in state s and performing action a is the reward r(s,a) plus the highest Q-value produced by the next state s′; γ is the discount, which controls the influence of future rewards on the Q-value in the current state;
iterating the equation in Formula 3 yields Formula 4, the equation at final convergence:

Q*(s,a) = Σ_s′ P(s′∣s,a)(R(s,a,s′) + γ max_a′ Q*(s′,a′))    (4)

the part of Formula 4 that stores the past Q-value is obtained as Formula 5:

γ max_a′ Q*(s′,a′)    (5).
Here, horizontal scaling adjusts resources by adding or removing replicas of microservices on the cloud servers within the cluster; the cloud server cluster reduces its workload by adding replicas of microservices;
vertical scaling adjusts the processing capacity of the current cloud server by changing the amount of CPU, memory, or network resources allocated to the microservices; the cluster reduces its workload by increasing the resources that one or more cloud servers in the cluster devote to one or more microservices;
brownout control reduces microservice link delay by dynamically turning optional components on or off; once the cloud service and cluster transition from the extreme condition back to normal operation, the optional components that were switched off are re-enabled by the brownout controller.
The scheduling method based on microservice link analysis and reinforcement learning of the present invention is described in detail below through specific embodiments:
As an important part of IT in the new era, user-oriented cloud computing services provide many conveniences to users. User-facing microservices are usually backed by distributed cloud server clusters. Users are highly sensitive to link delay when using these services, because link delay directly determines the quality of service and therefore the user experience. Link latency within a microservice cluster is caused by microservices in the server cluster competing with one another for resources. To address the high link delay of microservices in cloud servers, the present invention, based on reinforcement learning, proposes an efficient intelligent scheduling method that adjusts online according to different cloud server load conditions, using methods including horizontal scaling and vertical scaling to reallocate resources within the machines. The purpose of the present invention is to use a server load scheduling algorithm to mitigate the impact of the link delays of microservices in cloud servers. The invention uses the maximum-delay link among the microservice links to locate the key nodes in the link. Compared with the longest-link-based microservice scheduling algorithm, it solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
To this end, the present invention schedules the cloud server cluster based on a deep learning model; reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control. The invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the longest-delay link (the critical path) and its nodes (the key nodes). Based on the Deep Q-Learning reinforcement learning algorithm, it trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. Specifically:
(1) Data sources
The first data set used by the invention is cluster-trace-v2018, the 2018 cloud-server-cluster workload data set from Alibaba. It contains workload data spanning 8 days, generated by 4034 homogeneous servers, each with 96 CPU cores and the same amount of memory. The invention selects machine_usage.csv from this data set for analysis; it includes the machine number machine_id, the timestamp time_stamp, the CPU utilization of the machine cpu_util_percent, the memory occupancy of the machine at the current moment mem_util_percent, and other parameters, recorded at 10-second intervals. The invention removes reference variables that are frequently null and assigns 0 to the occasional null values of the remaining reference variables.
The invention uses two experimental platforms. One is a SockShop-based microservice platform. SockShop simulates the user-facing part of an e-commerce website that sells socks and is intended for demonstrating and testing microservices and cloud-native technologies. SockShop's microservices are designed with minimal expected resource allocations, i.e. the quota of each microservice is as small as possible at initialization. Microservices use DNS to find related microservices. When running scheduling tests on the cloud servers, a load balancer or service router can be inserted into the microservice framework according to the needs of the invention. The SockShop version used is the official Kubernetes-based release, which, in addition to the basic microservice components, includes a Jaeger-based microservice monitoring component.
The other experimental platform used by the invention is train-ticket, a train ticket booking system based on a microservice architecture that contains 41 microservices. The programming languages and frameworks used include Java, Python, Go, etc. train-ticket simulates the front end and back end of an online ticketing website, including functions such as ticket purchase, ticket sales, ticket query, refund, and login; it also includes a Jaeger-based microservice tracing component for monitoring system status and a workload balancer that maintains the overall stability of the service.
(2) Concept definitions
The system model of the invention (shown in Figure 1) comprises the following components: the load generator, the workload and link analyzer, and the cluster scheduler.
Historical workload: the workload regularly collected from a running cloud server cluster by a dedicated server or a built-in program, mainly including the timestamp, machine number, CPU utilization, memory usage, and similar information. In this invention the historical workload comes from the Alibaba data set.
Load generator: in the model of the invention, the load generator is composed of a series of components, including a workload processor, a Locust-based request generator, and a database storing historical workloads. These components cooperate to extract the required workload data and features from the historical database and process them, then simulate real user behavior through the load generator and convert it into requests sent to the server cluster.
Workload and link analyzer: its main task is to analyze the requests sent to the cloud servers and, combined with the link relationships of the microservices, perform link analysis on them to find the key nodes and critical paths; it then sends scheduling requests to the microservices of the cloud server group through the cluster scheduler on which the deep learning model is deployed.
Cluster scheduler: it acts mainly on the cloud server cluster and schedules the microservices in the corresponding cloud servers by accepting scheduling requests from the workload and link analyzer. The scheduling methods are of three kinds: horizontal scaling (Horizontal), vertical scaling (Vertical), and brownout control (Brownout).
Key nodes and critical path: the critical path is the path in a microservice link graph whose sum of microservice delays is the largest among all the longest non-repeating links in the entire system; in the same microservice cluster there can be one or more critical paths. Key nodes are identified after finding the critical path by testing the resource-utilization behavior of the different microservices, i.e. limiting their resource consumption under a given number of requests and observing the stability and latency of the whole system, so as to distinguish key nodes from non-key nodes on the path.
(3) Solution
3.1 Step 1: preprocessing the source data
This step covers Step 1 in Figure 1: the preprocessing and data-cleaning part of the workload, which processes the raw data obtained from the actual data sets. Whether the raw cloud workload comes from Alibaba or from Google, the invention first removes columns containing empty data, because whether a zero-filling scheme is used or the data is ignored outright, these redundant items negatively affect the predictions of the invention. After this, the invention sorts the data set by time series and computes the average of each parameter sharing the same timestamp, which can be done with a grouping function (the Python Groupby function). Next, the invention normalizes the Alibaba data set. Normalization is a data-processing method that reduces dimensionality; from the perspective of model optimization, it not only improves the convergence speed of the model but also improves prediction accuracy. There are two forms of normalization: one rescales numbers into decimals between 0 and 1, the other converts dimensional expressions into dimensionless scalars. The invention chooses the first and uses MinMaxScaler to realize it. The operation of MinMaxScaler is based on the min-max scaling method, given by Formulas 1 and 2 below:

X_std = (X − X_min) / (X_max − X_min)    (1)

X_scaled = X_std * (X_max − X_min) + X_min    (2)

The invention uses MinMaxScaler to transform each feature and scale it to a value between 0 and 1, where X denotes the set of data to be processed, X_min and X_max are the minimum and maximum values in the set, and the final processed data are denoted X_scaled.
3.2 Step 2: converting the workload into request counts based on user behavior
After obtaining the processed Alibaba workload, the invention reproduces the effect of the original data set by simulating request generation in the cluster. After collecting the relationship between the number of requests and the workload of the cloud servers, it uses a deep learning algorithm to obtain the request-count data for the entire experiment.
The invention then analyzes user behavior and, based on that behavior, records the different locations of the website that users visit; these locations refer to the microservice-based website home page and its sub-pages, such as the site home page, index page, category page, and login page. By simulating these different requests, the invention can model real-life user behavior more closely.
3.3 Step 3: workload and microservice link analysis
After requests are generated and sent to the request processor, the invention on the one hand forwards them to the target cluster and on the other hand collects the load conditions from the cloud server cluster. These load conditions mainly include the delay of the microservices themselves, the resource consumption of the cloud servers, the request delay of the microservices, and the request success rate. This information is passed to the workload and link analyzer; through critical-path analysis of the microservices, the link relationships in the current state are obtained, and combined with the collected link-delay data, the critical path of the microservices is derived.
For the key nodes, the invention makes the judgment through a decision-tree-based classification method. Training on the dynamically changing link relationships yields a decision tree model that accurately determines whether a node is a key node. The decision tree has three outcomes: key node, non-key node, and potential key node. Among these, every critical path includes at least one key node, while non-key nodes and potential key nodes need not lie on the critical path.
After determining the locations of the key nodes, the invention uses the trained deep learning model to make scheduling decisions for the cloud server cluster. In this invention, the reinforcement learning used is Deep Q-Learning, deep reinforcement learning based on a deep neural network.
Deep Q-Learning is a commonly used reinforcement learning method. Building on Q-Learning, it replaces the Q-Table that would otherwise be built over many iterations with a neural network holding the corresponding parameters, which is then trained by deep learning. For Q-Learning, the invention trains the model by computing the Q-value after a series of actions, as shown in Formula 3:

Q(s,a) = r(s,a) + γ max_a′ Q(s′,a′)    (3)

From the equation above, the Q-value produced by being in state s and performing action a is the reward r(s,a) plus the highest Q-value that may be produced from the next state s′. Here γ is the discount, which controls the influence of future rewards on the Q-value in the current state.
Iterating the equation in Formula 3 yields Formula 4, the equation at final convergence:

Q*(s,a) = Σ_s′ P(s′∣s,a)(R(s,a,s′) + γ max_a′ Q*(s′,a′))    (4)

For the neural network, the training target of the invention is the part of Formula 4 that stores the past Q-value, namely Formula 5:

γ max_a′ Q*(s′,a′)    (5)
3.4 Step 4: cluster scheduling strategies
After the critical-path and key-node judgment and the model-based scheduling decision, the invention uses three strategies to schedule the microservices in the cloud server cluster, shown in Figure 2: horizontal scaling (Horizontal), vertical scaling (Vertical), and brownout control (Brownout).
Horizontal scaling, also known as scaling out, adjusts resources by adding or removing replicas of microservices on the cloud servers within the cluster, improving resource usage and system availability. The cloud server cluster can reduce its workload by adding replicas of microservices.
Vertical scaling, also known as scaling up, adjusts the processing capacity of the current cloud server by changing the amount of CPU, memory, or network resources allocated to the microservice instances. The cluster can reduce its workload by increasing the resources that one or more cloud servers devote to one or more microservices.
Brownout control reduces microservice link delay by dynamically turning optional components on or off. Because it affects the completeness of the overall microservice to some extent, it is triggered only in extreme situations, for example when the cloud service cluster remains in an overloaded state; once the cloud service and cluster transition from the extreme condition back to normal operation, the optional components that were switched off are re-enabled by the brownout controller.
Embodiment 2
According to another embodiment of the present invention, a scheduling device based on microservice link analysis and reinforcement learning is provided, comprising:
a load generator, used to analyze the links of cloud-server microservices and make decisions with the deep-learning-model-based workload and link analyzer, select the critical path and key nodes with the longest delay, and obtain cloud server load data;
a workload and link analyzer, used to train on the cloud server load data with the Deep Q-Learning-based reinforcement learning algorithm and to train deep learning models adapted to different load conditions;
a cluster scheduler, used to perform cluster scheduling on the cloud server cluster with the deep learning model, where the scheduling methods include: horizontal scaling, vertical scaling, and brownout control.
The scheduling device based on microservice link analysis and reinforcement learning in this embodiment schedules the cloud server cluster based on a deep learning model; reinforcement learning can choose among three different scheduling methods: horizontal scaling, vertical scaling, and brownout control. The invention uses the deep-learning-model-based workload and link analyzer to analyze the links of microservices and make decisions, selecting the critical path and key nodes with the longest delay. Based on the Deep Q-Learning reinforcement learning algorithm, it trains on cloud server load data and uses the algorithm to train deep learning models adapted to different load conditions. The invention uses the maximum-delay link among the microservice links to determine the location of the key nodes in the link. Compared with the longest-link-based microservice scheduling algorithm, it solves the possible mismatch between link length and microservice delay and obtains an optimization target for the delay itself; scheduling cloud server resources on this basis can effectively alleviate the existing microservice link delay problem.
下面以具体实施例,对本发明的基于微服务链路分析和强化学习的调度装置进行详细说明:
作为新时代IT技术的重要组成部分,面向使用者的云计算相关业务为广大用户提供了诸多便利。面向用户的微服务通常由分布式的云服务器集群提供支持。用户在使用这些服务时通常对服务器的链路延迟高度敏感,因为服务器链路延迟直接关系到服务器的服务质量,从而影响到用户的体验。微服务集群内的链路延迟是由服务器集群内的微服务之间互相资源竞争造成的。为了解决云服务器中微服务的链路延迟高这一问题,本发明基于强化学习,提出了一种基于不同的云服务器负载情况进行在线调节的高效智能调度装置,使用了包括横向扩展、纵向扩展等方法来重新分配机器内的资源。本发明的目的是利用服务器负载调度算法,解决云服务器中存在的微服务的链路延迟带来的影响。本发明利用微服务链路中的最大延迟链路来确定链路中的关键节点的位置。相比于基于最长链路的微服务调度算法,本发明解决了可能存在的链路长度与微服务延迟不匹配的问题,能得到一个针对延迟本身的优化目标,依据于此对云服务器进行资源调度能有效地缓解存在的微服务链路延迟问题。
为达到上述目的,本发明基于深度学习模型来对云服务器集群进行调度,强化学习可以选用三种不同的调度方式:横向扩展、纵向扩展与管制。本发明利用基于深度学习模型的工作负载与链路分析器对微服务的链路进行分析与决策,选取最长延迟的链路(关键路径)与节点(关键节点)。本发明基于Deep Q-Learning的强化学习算法,对云服务器负载数据进行训练,利用该算法训练出适应于不同负载状态下的深度学习模型。具体为:
(1)数据源介绍
本发明采用的第一组数据集是来自Alibaba 2018年的云服务器集群工作负载数据集cluster-trace-v2018。这个数据集内包含着跨度为8天的工作负载数据,该项数据是由4034台同构的服务器生成的,其中每一台服务器均具有96个CPU核心,它们的内存大小也是相同的。本发明选取了该数据集中的machine_usage.csv进行分析,该数据集包括了机器号machine_id、时间戳time_stamp、当前机器的CPU利用率cpu_util_percent、当前时刻机器的内存占用mem_util_percent等参数,其记录数据的时间间隔为10秒。本发明去掉了一些常出现空值的参考变量,将偶尔出现空值的参考变量的空值赋0。
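上述"删除常出现空值的列、将偶发空值赋0"的清洗步骤,可以用下面的纯Python草图示意(非本发明源码;列名沿用machine_usage.csv中的字段,net_in列与0.5的删除阈值均为假设):

```python
# 示意性的数据清洗草图:删除空值比例过高的列,其余列的偶发空值置 0。
# 假设每条记录是 {列名: 值} 的字典,None 表示空值。

def clean_records(records, drop_threshold=0.5):
    """删除空值比例超过 drop_threshold 的列,保留列中的空值赋 0。"""
    if not records:
        return []
    columns = list(records[0].keys())
    n = len(records)
    # 统计每一列的空值比例
    null_ratio = {c: sum(1 for r in records if r.get(c) is None) / n for c in columns}
    keep = [c for c in columns if null_ratio[c] <= drop_threshold]
    return [{c: (r[c] if r.get(c) is not None else 0) for c in keep} for r in records]

rows = [
    {"machine_id": "m_1", "cpu_util_percent": 30, "mem_util_percent": None, "net_in": None},
    {"machine_id": "m_1", "cpu_util_percent": 35, "mem_util_percent": 40, "net_in": None},
]
cleaned = clean_records(rows)
# net_in 列全为空被删除;mem_util_percent 的偶发空值被置 0
```

在实际流程中,这一步通常由pandas的dropna/fillna完成,此处仅示意判定逻辑。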
本发明使用的实验平台有两个,其中一个是基于SockShop的微服务平台。SockShop模拟了一个销售袜子的电子商务网站中面向用户的部分,旨在用于演示和测试微服务和云原生技术。SockShop的微服务被设计为具有最小的资源预分配量,即每一个微服务的配额在进行初始化的时候都是尽可能小的。微服务之间使用DNS来寻找其他相关联的微服务。在对云服务器进行调度测试时,可以根据本发明的需要,在整个微服务框架中插入负载均衡器或服务路由器。本发明使用的SockShop版本是基于Kubernetes的正式版本,里面除了基本微服务组件之外还包括了基于Jaeger的微服务监测组件。
本发明使用的另一个实验平台是train-ticket,该项目是一个基于微服务架构的火车票预订系统,包含41个微服务,所使用的编程语言和框架包括了Java、Python、Go等。train-ticket模拟了在线售票网站的前端与后端,包括了购票、售票、查询票、退票、登录等一系列功能,此外还包括了用于监控系统状态的Jaeger微服务检测组件以及控制微服务整体稳定性的工作负载均衡器。
(2)概念定义
本发明的系统模型(如图1所示)包括了以下组成部分:负载生成器、工作负载与链路分析器、集群调度器。
历史工作负载:历史工作负载指的是由特定的服务器或内建的程序定时收集到的来自于运行中的云服务器集群的工作负载,主要包括了时间戳、机器编号、CPU利用率、内存占用大小等信息。在本发明中,历史工作负载来自于Alibaba的数据集。
负载生成器:在本发明的模型中,负载生成器是由一系列组件共同构成的,这些组件包括了工作负载预处理器、基于Locust的请求生成器以及存储着历史工作负载的数据库。这些组件相互协调,完成从历史数据库中提取所需的工作负载数据与特征并加以处理,再由请求生成器模拟真实用户的使用情况,并转化成对服务器集群的请求发送出去。
工作负载与链路分析器:工作负载与链路分析器的主要任务是分析从云服务器发送来的请求,并结合微服务的链路关系对其进行链路分析,找出关键节点与关键路径,在这之后通过部署好深度学习/强化学习模型的集群调度器向云服务器集群的微服务发送调度请求。
集群调度器:集群调度器主要作用于云服务器集群,通过接收来自工作负载与链路分析器的调度请求,完成对相应云服务器中微服务的调度。这种调度的方式主要分为三种:横向扩展Horizontal、纵向扩展Vertical与管制Brownout。
关键节点与关键路径:关键路径指的是在一个微服务链路中,整个系统所有最长不重复链路里,链路上的微服务延迟之和最大的那一条路径。在同一个微服务集群中,可以存在一条或多条关键路径。关键节点指的是在寻找到关键路径之后,通过测试不同的微服务在资源利用率上的表现,即在一定的请求数量下限制其资源消耗,观察整个系统的稳定性与延迟的高低,以此区分路径上的关键节点与非关键节点。
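关键路径的定义可以用如下最小草图说明(非本发明实现;微服务名称与延迟数值均为假设,调用关系视为有向无环图,按递归方式求延迟之和最大的链路):

```python
# 关键路径计算草图:给定微服务调用关系与各微服务延迟,
# 返回延迟之和最大的那一条链路及其总延迟。
from functools import lru_cache

def critical_path(graph, latency, entry):
    """graph: {服务: [下游服务, ...]};latency: {服务: 延迟};返回 (总延迟, 路径)。"""
    @lru_cache(maxsize=None)
    def longest(node):
        children = graph.get(node, [])
        if not children:
            return latency[node], (node,)
        # 在所有下游分支中取延迟之和最大的一条
        best_cost, best_path = max(longest(c) for c in children)
        return latency[node] + best_cost, (node,) + best_path

    cost, path = longest(entry)
    return cost, list(path)

# 假设的调用图与延迟(毫秒)
graph = {"frontend": ["orders", "catalogue"], "orders": ["payment"], "catalogue": [], "payment": []}
latency = {"frontend": 5, "orders": 20, "catalogue": 8, "payment": 15}
cost, path = critical_path(graph, latency, "frontend")
# 最长延迟链路为 frontend→orders→payment,总延迟 40
```

实际系统中链路关系来自Jaeger等追踪组件,这里只示意"最大延迟链路"的选取方式。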
(3)解决方法
3.1步骤一:源数据的预处理
这一步包含了图1中Step1的工作负载预处理部分和数据清洗部分,对从实际数据集中得到的原始数据进行处理。无论是对于来自Alibaba还是来自Google的云工作负载的原始数据,本发明首先要删除包含空数据的列,因为不管是使用零填充方案还是直接忽略这些数据,这些冗余项都会对本发明的预测数据产生负面影响。在这之后,本发明按照时间序列对数据集进行分类,然后计算出具有相同时间戳的每个参数的平均值,利用分组函数(Python的groupby函数)可以做到这一点。接下来,本发明对Alibaba数据集进行归一化处理。归一化是一种降低维度的数据处理方法。从模型优化的角度来看,归一化不仅可以提高模型的收敛速度,还可以提高预测的准确性。归一化方法有两种形式:一种是将数字改为小数,即缩放到0与1之间;另一种是将有量纲的表达式改为无量纲的表达式,即转化为标量。本发明选择了第一种归一化方法,并使用MinMaxScaler来实现这一功能。MinMaxScaler的操作基于min-max缩放法,具体的公式如下公式1与公式2所示。
X_std = (X − X_min) / (X_max − X_min)    (1)
X_scaled = X_std · (X_max − X_min) + X_min    (2)
本发明使用MinMaxScaler对每个特征进行变换,将每个特征缩放为0和1之间的值。其中X代表待处理数据的集合,X_min和X_max是集合中的最小和最大数据,最终处理后的数据用X_scaled表示。
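步骤一中"按时间戳分组求均值、再做min-max归一化"的流程,可以用如下纯Python草图示意(非本发明源码,也未依赖sklearn;min_max_scale对应公式1缩放到[0,1]的情形,此时公式2中的X_scaled与X_std相同):

```python
# 预处理草图:先按时间戳分组求均值(对应 groupby),再做 min-max 归一化。
from collections import defaultdict

def group_mean_by_timestamp(rows):
    """rows: [(time_stamp, 值), ...] -> 按时间戳求均值后的有序列表。"""
    buckets = defaultdict(list)
    for ts, v in rows:
        buckets[ts].append(v)
    return [(ts, sum(vs) / len(vs)) for ts, vs in sorted(buckets.items())]

def min_max_scale(values):
    """公式1:X_std = (X - X_min) / (X_max - X_min),缩放到 [0, 1]。"""
    x_min, x_max = min(values), max(values)
    return [(v - x_min) / (x_max - x_min) for v in values]

rows = [(10, 30.0), (10, 50.0), (20, 60.0), (30, 80.0)]
averaged = group_mean_by_timestamp(rows)          # [(10, 40.0), (20, 60.0), (30, 80.0)]
scaled = min_max_scale([v for _, v in averaged])  # [0.0, 0.5, 1.0]
```

实际实现中这两步分别对应pandas的groupby().mean()与sklearn的MinMaxScaler.fit_transform()。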
3.2步骤二:转化为基于用户行为的请求数量
在获得处理之后的Alibaba的工作负载之后,本发明通过在集群中模拟生成请求来达到与原数据集相同的效果。在收集到请求数量与云服务器的工作负载之间的关系之后,本发明通过深度学习算法来得到整个所需要进行实验的请求数量的数据。
在此之后本发明对用户行为进行分析,并且基于用户的行为将用户访问网站的不同位置进行了记录,在这里用户访问的网站的不同位置指的是访问基于微服务主页的网站及其子网站,例如网站主页、网站索引页、网站分类页、网站登录页等。通过对这些访问网站的不同请求进行模拟,本发明能更加贴近实际的模拟出现实生活中用户的行为。
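基于用户行为的请求生成可以草绘如下(非本发明的Locust实现;子页面路径与访问权重均为假设,仅示意如何把请求总量按用户行为分布拆分到不同页面):

```python
# 用户行为模拟草图:按权重随机选择用户访问的子页面,
# 将目标请求总数转化为各页面的请求序列。
import random

# (路径, 相对访问权重) —— 均为示意值
PAGES = [("/", 5), ("/index", 3), ("/category", 2), ("/login", 1)]

def simulate_requests(total, seed=0):
    rng = random.Random(seed)          # 固定种子,便于复现
    paths = [p for p, _ in PAGES]
    weights = [w for _, w in PAGES]
    return rng.choices(paths, weights=weights, k=total)

reqs = simulate_requests(1000)
# 各页面请求数近似按 5:3:2:1 的权重分布
```

真实实验中这些请求由基于Locust的请求生成器发往SockShop/train-ticket集群,此处只示意请求分布的构造方式。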
3.3步骤三:工作负载与微服务链路分析
在生成请求并发送至请求处理器之后,本发明一方面将请求发送给目标集群,另一方面收集来自云服务器集群的负载情况。这些负载情况主要包括了微服务自身的延迟、云服务器的资源消耗、微服务的请求延迟、请求成功率等信息。将这些信息传递给工作负载与链路分析器,通过微服务的关键路径分析,就可以得到当前状态下微服务的链路关系,结合获得的链路延迟数据,即可得到微服务的关键路径。
对于关键节点,本发明通过基于决策树的分类方式进行判断并得到结果。通过针对动态变化的链路关系进行训练,得到一个能够精准判断是否为关键节点的决策树模型。该决策树有三种决策结果:属于关键节点、不属于关键节点与潜在的关键节点。三种结果中,每一条关键路径中至少包括一个关键节点,非关键节点和潜在关键节点可以不存在于关键路径中。
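三分类的判定逻辑可以用一个手写的决策树草图表达(非本发明训练得到的模型;latency_share、cpu_sensitivity两个特征及其阈值均为假设):

```python
# 关键节点三分类草图:用固定阈值的决策分支模拟
# "属于关键节点 / 潜在关键节点 / 不属于关键节点" 三种决策结果。

def classify_node(latency_share, cpu_sensitivity):
    """latency_share: 该节点延迟占关键路径总延迟的比例(0~1);
    cpu_sensitivity: 限制其资源后整体延迟的恶化程度(0~1)。"""
    if latency_share > 0.3:
        return "critical"        # 属于关键节点
    if cpu_sensitivity > 0.5:
        return "potential"       # 潜在的关键节点
    return "non-critical"        # 不属于关键节点

label = classify_node(0.4, 0.1)  # "critical"
```

实际系统中这棵树由链路数据训练得到(例如sklearn的DecisionTreeClassifier),分支与阈值会随链路关系动态变化。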
在确定关键节点的位置之后,本发明将通过训练好的深度学习模型对云服务器集群进行调度决策。在本发明中,强化学习使用的是基于深度神经网络的深度强化学习Deep Q-Learning。
Deep Q-Learning是强化学习的一种常用方法,它在Q-Learning的基础上,将需要多次迭代生成的Q-Table转化成了具有相对应的全部参数的神经网络,并将其交付给深度学习进行训练。对于Q-Learning而言,本发明需要通过计算一系列action之后的Q-value来训练模型,计算方程如公式3所示:
Q(s,a) = r(s,a) + γ·max_{a′} Q(s′,a′)    (3)
从上面的等式可以看出,处于状态s并执行行动a所产生的Q-value是奖励r(s,a)加上下一个状态s′可能产生的最高Q-value。这里的γ是折扣因子(discount),控制着未来奖励对当前状态下Q-value的影响。
本发明将上述公式3中的方程进行迭代,可以得到公式4,即最终收敛情况下的方程。
Q*(s,a) = ∑_{s′} P(s′∣s,a)·(R(s,a,s′) + γ·max_{a′} Q*(s′,a′))    (4)
对于神经网络而言,本发明训练的目标就是上述方程4中的存储过去Q-value的部分,即公式5。
γ·max_{a′} Q*(s′,a′)    (5)
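公式3的迭代更新可以用一个表格型Q-Learning的最小可运行示例说明(非本发明的Deep Q-Learning网络本身;两状态、两动作的玩具环境为假设):

```python
# 公式3 的表格型示例:Q(s,a) ← r(s,a) + γ·max_a' Q(s',a'),
# 在确定性的两步环境上反复扫描直至收敛。
GAMMA = 0.9

def q_iteration(transitions, states, actions, sweeps=50):
    """transitions: {(s, a): (r, s')},s' 为 None 表示终止状态。"""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for (s, a), (r, s2) in transitions.items():
            future = max(Q[(s2, a2)] for a2 in actions) if s2 is not None else 0.0
            Q[(s, a)] = r + GAMMA * future
    return Q

# 玩具环境:状态为集群负载档位,动作为两种调度选择(均为假设)
transitions = {
    ("s0", "scale_out"): (1.0, "s1"),
    ("s0", "brownout"):  (0.0, "s1"),
    ("s1", "scale_out"): (2.0, None),
    ("s1", "brownout"):  (0.5, None),
}
Q = q_iteration(transitions, ["s0", "s1"], ["scale_out", "brownout"])
# 收敛后 Q(s0, scale_out) = 1 + 0.9×2 = 2.8
```

Deep Q-Learning则把这张Q表替换为神经网络,并以公式5的部分作为训练目标,这里只示意Bellman更新本身。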
3.4步骤四:集群调度策略
在经过关键路径与节点判断以及基于模型的调度决策之后,本发明使用三种算法来对云服务器集群中的微服务进行调度,三种调度策略如图2所示,分别为横向扩展Horizontal、纵向扩展Vertical与管制Brownout。
横向扩展又称水平扩展。横向扩展通过在集群内的云服务器上增加或删除微服务的副本(replicas)来调整资源,以改善资源的使用和系统的可用性。云服务器集群能通过增加微服务的副本来达到降低云服务器集群的工作负载的目的。
纵向扩展又称垂直扩展。纵向扩展通过调节分配给微服务实例的CPU、内存或网络资源的数量来调整当前云服务器的处理服务能力。云服务器集群能通过增加集群内的单个或多个云服务器上单个或多个微服务的资源来达到降低云服务器集群的工作负载的目的。
管制则是通过动态地开启或者关闭可选组件来达到降低微服务链路延迟的目的。由于管制会在一定程度上影响到微服务整体的完整性,一般只有在极端情况下,例如云服务器集群持续处于过载状态,管制机制才会触发并关闭可选组件;当云服务器集群从极端情况过渡到正常工作状态之后,被关闭的可选组件将被管制重新开启。
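三种调度动作落到集群状态上的效果,可以用一个示意性的分发器草绘(非Kubernetes API调用;字典字段与调整幅度均为假设):

```python
# 调度分发器草图:根据强化学习给出的动作,更新目标微服务的资源状态。

def dispatch(action, service, state):
    """state: {"replicas": 副本数, "cpu": CPU配额, "optional_on": 可选组件开关};
    service 在真实调度器中用于定位目标微服务,此处仅作占位。"""
    if action == "horizontal":       # 横向扩展:增删副本
        return {**state, "replicas": state["replicas"] + 1}
    if action == "vertical":         # 纵向扩展:调整 CPU/内存配额
        return {**state, "cpu": state["cpu"] * 1.5}
    if action == "brownout":         # 管制:关闭可选组件
        return {**state, "optional_on": False}
    raise ValueError(f"unknown action: {action}")

s = {"replicas": 2, "cpu": 1.0, "optional_on": True}
s = dispatch("horizontal", "orders", s)   # 副本数 2 → 3
```

真实系统中这三种动作分别对应Kubernetes的副本扩缩、资源配额调整与应用层的可选组件开关。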
实施例3
一种存储介质,存储介质存储有能够实现上述任意一项基于微服务链路分析和强化学习的调度方法的程序文件。
实施例4
一种处理器,处理器用于运行程序,其中,程序运行时执行上述任意一项的基于微服务链路分析和强化学习的调度方法。
相比于现有技术,本发明采用了基于强化学习的微服务链路调度模型,针对云服务器的最长延迟链路进行资源分配。相较于不经分析直接选取最长链路的方法而言,本发明能够解决可变性更大、微服务链路更复杂、链路延迟更敏感的云服务调度领域的问题。
本发明实验使用来自Alibaba与Google的云数据中心工作负载数据集。使用基于Locust的负载生成器与基于深度学习模型的云服务器集群调度器,并对比了目前调度领域常用的几种不同的调度算法,结果证明本发明在云服务器链路延迟分析与负载调度领域优于现有方法。
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。
在本发明的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的技术内容,可通过其它的方式实现。其中,以上所描述的系统实施例仅仅是示意性的, 例如单元的划分,可以为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本发明各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。

Claims (10)

  1. 一种基于微服务链路分析和强化学习的调度方法,其特征在于,包括以下步骤:
    基于深度学习模型的工作负载与链路分析器对云服务器微服务的链路进行分析与决策,选取最长延迟的关键路径与关键节点,获取云服务器负载数据;
    基于Deep Q-Learning的强化学习算法,对云服务器负载数据进行训练,利用基于Deep Q-Learning的强化学习算法训练出适应于不同负载状态下的深度学习模型;
    使用深度学习模型对云服务器集群进行集群调度,其中调度方式包括:横向扩展、纵向扩展与管制。
  2. 根据权利要求1所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述基于深度学习模型的工作负载与链路分析器对云服务器微服务的链路进行分析与决策,选取最长延迟的关键路径与关键节点,获取云服务器负载数据包括:
    对工作负载进行预处理和数据清洗,并对实际的数据集中得到的原始数据进行处理获取请求数量;
    在收集到请求数量与云服务器的工作负载之间的关系之后,通过深度学习算法来得到整个所需要进行实验的请求数量的数据;
    在生成请求并发送至请求处理器之后,一方面将请求发送给目标集群,另一方面收集来自云服务器集群的负载情况;
    将负载情况信息传递给工作负载与链路分析器,通过微服务的关键路径分析,得到当前状态下微服务的链路关系,结合获得的链路延迟数据,得到微服务的关键路径;
    通过基于决策树的分类方式对关键节点进行判断并得到结果。
  3. 根据权利要求2所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述对工作负载进行预处理和数据清洗,并对实际的数据集中得到的原始数据进行处理获取请求数量包括:
    首先删除包含空数据的列,按照时间序列对数据集进行分类,然后利用分组函数计算出具有相同时间戳的每个参数的平均值;
    使用MinMaxScaler来对数据集进行归一化处理;MinMaxScaler的操作是基于min-max缩放法,具体的公式如下公式1与公式2所示;
    X_std = (X − X_min) / (X_max − X_min)  (1)
    X_scaled = X_std · (X_max − X_min) + X_min  (2)
    使用MinMaxScaler对每个特征进行变换,将每个特征缩放为0和1之间的值,X代表待处理数据的集合,X_min和X_max是集合中的最小和最大数据,最终处理后的数据用X_scaled表示。
  4. 根据权利要求2所述的基于微服务链路分析和强化学习的调度方法,其特征在于,在所述在收集到请求数量与云服务器的工作负载之间的关系之后,通过深度学习算法来得到整个所需要进行实验的请求数量的数据之后,所述方法还包括:
    对用户行为进行分析,并且基于用户的行为将用户访问网站的不同位置进行了记录,通过对访问网站的不同请求进行模拟,模拟出现实生活中用户的行为。
  5. 根据权利要求2所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述负载情况包括:微服务自身的延迟、云服务器的资源消耗、微服务的请求延迟、请求成功率信息。
  6. 根据权利要求2所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述通过基于决策树的分类方式对关键节点进行判断并得到结果包括:
    通过训练针对于动态变化的链路关系得到一个针对是否为关键节点的决策树模型;该决策树有三种决策结果:属于关键节点、不属于关键节点与潜在的关键节点;三种结果中,每一条关键路径中至少包括一个关键节点,非关键节点和潜在关键节点不存在于关键路径中。
  7. 根据权利要求1所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述基于Deep Q-Learning的强化学习算法,对云服务器负载数据进行训练,利用基于Deep Q-Learning的强化学习算法训练出适应于不同负载状态下的深度学习模型包括:
    在确定关键节点的位置之后,将通过训练好的深度学习模型来对云服务器集群来进行调度决策;强化学习使用的是基于深度学习神经网络的深度强化学习Deep Q-Learning。
  8. 根据权利要求7所述的基于微服务链路分析和强化学习的调度方法,其特征在于,通过计算一系列action之后的Q-value来训练模型,计算方程如公式3所示:
    Q(s,a) = r(s,a) + γ·max_{a′} Q(s′,a′)  (3)
    其中处于状态s并执行行动a所产生的Q-value是奖励r(s,a)加上下一个状态s′产生的最高Q-value;γ是折扣率,控制着未来奖励对当前状态下Q-value的影响;
    将上述公式3中的方程进行迭代,得到公式4,为最终收敛情况下的方程:
    Q*(s,a) = ∑_{s′} P(s′∣s,a)·(R(s,a,s′) + γ·max_{a′} Q*(s′,a′))  (4)
    得到方程4中的存储过去Q-value的部分,为公式5:
    γ·max_{a′} Q*(s′,a′)  (5)。
  9. 根据权利要求1所述的基于微服务链路分析和强化学习的调度方法,其特征在于,所述横向扩展通过在集群内的云服务器上增加或删除微服务 的副本来调整资源;云服务器集群通过增加微服务的副本降低云服务器集群的工作负载;
    所述纵向扩展通过调节分配给微服务的CPU、内存或网络资源的数量来调整当前云服务器的处理服务能力;云服务器集群通过增加集群内的单个云服务器或多个云服务器关于单个或多个微服务的资源降低云服务器集群的工作负载;
    所述管制通过动态地开启或者关闭可选组件来降低微服务链路延迟,当云服务器集群从极端情况过渡到正常工作状态之后,被关闭的可选组件将被管制重新开启。
  10. 一种基于微服务链路分析和强化学习的调度装置,其特征在于,包括:
    负载生成器,用于基于深度学习模型的工作负载与链路分析器对云服务器微服务的链路进行分析与决策,选取最长延迟的关键路径与关键节点,获取云服务器负载数据;
    工作负载与链路分析器,用于基于Deep Q-Learning的强化学习算法,对云服务器负载数据进行训练,利用基于Deep Q-Learning的强化学习算法训练出适应于不同负载状态下的深度学习模型;
    集群调度器,用于使用深度学习模型对云服务器集群进行集群调度,其中调度方式包括:横向扩展、纵向扩展与管制。
PCT/CN2022/138189 2022-03-30 2022-12-09 基于微服务链路分析和强化学习的调度方法及装置 WO2023185090A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210323609.7 2022-03-30
CN202210323609.7A CN114780233A (zh) 2022-03-30 2022-03-30 基于微服务链路分析和强化学习的调度方法及装置

Publications (1)

Publication Number Publication Date
WO2023185090A1 true WO2023185090A1 (zh) 2023-10-05

Family

ID=82426820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138189 WO2023185090A1 (zh) 2022-03-30 2022-12-09 基于微服务链路分析和强化学习的调度方法及装置

Country Status (2)

Country Link
CN (1) CN114780233A (zh)
WO (1) WO2023185090A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780233A (zh) * 2022-03-30 2022-07-22 深圳先进技术研究院 基于微服务链路分析和强化学习的调度方法及装置
CN116827685B (zh) * 2023-08-28 2023-11-14 成都乐超人科技有限公司 基于深度强化学习的微服务系统动态防御策略方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210304063A1 (en) * 2020-03-30 2021-09-30 International Business Machines Corporation Machine Learning Model For Micro-Service Compliance Requirements
CN113553149A (zh) * 2021-07-02 2021-10-26 深圳先进技术研究院 云服务器集群负载调度方法、系统、终端以及存储介质
CN113553150A (zh) * 2021-07-02 2021-10-26 深圳先进技术研究院 一种云服务器集群负载预测方法、系统、终端以及存储介质
CN114780233A (zh) * 2022-03-30 2022-07-22 深圳先进技术研究院 基于微服务链路分析和强化学习的调度方法及装置


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519948A (zh) * 2023-12-11 2024-02-06 广东筠诚建筑科技有限公司 基于云平台实现建筑施工下的计算资源调整方法及系统
CN117519948B (zh) * 2023-12-11 2024-04-26 广东筠诚建筑科技有限公司 基于云平台实现建筑施工下的计算资源调整方法及系统
CN117640413A (zh) * 2024-01-26 2024-03-01 国网湖北省电力有限公司信息通信公司 雾计算中基于强化学习的微服务和数据库联合部署方法
CN117640413B (zh) * 2024-01-26 2024-04-26 国网湖北省电力有限公司信息通信公司 雾计算中基于强化学习的微服务和数据库联合部署方法
CN118354282A (zh) * 2024-04-15 2024-07-16 启东市恒安防爆通信设备有限公司 一种基于集群通话的防爆终端电话通信方法及系统
CN118337640A (zh) * 2024-06-12 2024-07-12 湖北省楚天云有限公司 一种云与多边缘网络节点协同的微服务部署方法
CN118433239A (zh) * 2024-07-04 2024-08-02 中国电子科技集团公司第五十四研究所 一种轻量化指挥控制网络关键节点智能识别服务构建方法

Also Published As

Publication number Publication date
CN114780233A (zh) 2022-07-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934907

Country of ref document: EP

Kind code of ref document: A1