WO2023207663A1 - Traffic scheduling method - Google Patents

Traffic scheduling method

Info

Publication number
WO2023207663A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
combination
traffic
cloud
action combination
Application number
PCT/CN2023/088860
Other languages
French (fr)
Chinese (zh)
Inventor
彭小新
王晓亮
宋扬
薛蹦蹦
于兴兴
李嘉
徐婷婷
程冬旭
Original Assignee
阿里云计算有限公司
阿里巴巴(中国)有限公司
Application filed by 阿里云计算有限公司, 阿里巴巴(中国)有限公司
Publication of WO2023207663A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1023Server selection for load balancing based on a hash applied to IP addresses or costs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network

Definitions

  • This specification relates to the field of communication technology, and in particular, to a traffic scheduling method.
  • VNF: Virtual Network Function
  • ECS: Elastic Compute Service
  • this specification provides a traffic scheduling method to solve the deficiencies in related technologies.
  • a traffic scheduling method includes:
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • a network function virtualization NFV platform controller, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method described in the first aspect are implemented.
  • dynamic, personalized cloud host resource allocation can be achieved for the tenant traffic to be scheduled. Compared with allocating cloud host resources based on preset fixed weights in related technologies, the scheduling decisions for tenant traffic to be scheduled in this specification can be applied to different actual application scenarios in a timely manner, and tenant traffic is guaranteed to be reasonably scheduled to the corresponding cloud hosts, thereby achieving reasonable allocation of cloud host resources.
  • the trained reinforcement learning model can meet the traffic scheduling purpose expected in this specification. For example, by preferentially scheduling the traffic of the tenant currently to be allocated to cloud hosts that have already been allocated other tenants' traffic, the resource utilization of the corresponding cloud hosts can be improved within a safe range, and the total cost of cloud host resources can be reduced. Moreover, since the number of started cloud hosts can be reduced under the same conditions, the basic resource consumption (system operation, heat dissipation, etc.) necessary for running these cloud hosts can be saved and greenhouse gas emissions reduced, which is conducive to achieving the goals of carbon peaking and carbon neutrality at an early date.
  • Figure 1 is a schematic architectural diagram of a traffic scheduling system according to an exemplary embodiment of this specification
  • Figure 2 is a schematic flowchart of a traffic scheduling method according to an exemplary embodiment of this specification
  • Figure 3 is a schematic diagram of cloud host allocation for tenant traffic according to an exemplary embodiment of this specification
  • Figure 4 is a schematic structural diagram of a network function virtualization NFV platform controller according to an exemplary embodiment of this specification
  • Figure 5 is a schematic structural diagram of a traffic scheduling device according to an exemplary embodiment of this specification.
  • although first, second, third, etc. may be used in this specification to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
  • NFV: Network Functions Virtualization
  • elephant flows in tenant traffic are usually offloaded using a Virtual Distributed Process Unit (VDPU) to reduce the waste of idle resources.
  • VDPU: Virtual Distributed Process Unit
  • for other tenant traffic, the corresponding scheduling plan is usually obtained based on weighted calculation.
  • due to the uncertainty caused by dynamic changes in tenant traffic, values such as the weight values in the above weighted calculation cannot be adapted to different actual application scenarios in a timely manner. Therefore, this specification proposes the following technical solutions to solve the above problems.
  • FIG. 1 is a schematic architectural diagram of a traffic scheduling system according to an exemplary embodiment of this specification. As shown in Figure 1, a reinforcement learning module 11 and an NFV platform controller 12 may be included.
  • the reinforcement learning (Reinforcement Learning, RL) module 11 is used to receive the environmental data of the corresponding NFV platform and generate corresponding traffic scheduling decisions for the tenant traffic to be scheduled on the NFV platform, so that the NFV platform controller 12 schedules the tenant traffic to be scheduled according to the traffic scheduling decision, and the traffic is then processed by the cloud host indicated by the decision.
  • the reinforcement learning module 11 can use any type of reinforcement learning model to achieve the above functions; as an exemplary embodiment, the reinforcement learning model can be a reinforcement learning model based on a neural network algorithm, such as a deep Q-network (DQN) model.
  • the NFV platform controller 12 is used to schedule traffic for the corresponding NFV platform. After obtaining the environmental data corresponding to the NFV platform, the NFV platform controller 12 can send the environmental data to the reinforcement learning module 11 and, according to the traffic scheduling policy returned by the reinforcement learning module 11, schedule the tenant traffic to be scheduled on the NFV platform to the corresponding cloud hosts.
  • the reinforcement learning module 11 and the NFV platform controller 12 can run on the same electronic device, or they can run on different electronic devices respectively, and this specification does not limit this.
  • the above-mentioned electronic device may be, for example, a physical server including an independent host, or a virtual server hosted by a host cluster, which is not limited in this specification.
  • Figure 2 is a schematic flowchart of a traffic scheduling method according to an exemplary embodiment of this specification. As shown in Figure 2, the method may include the following steps:
  • S201: Determine environmental data, which includes resource demand information of tenant traffic to be scheduled and the actual allocation of cloud host resources.
  • tenant traffic to be scheduled can be preferentially scheduled to cloud hosts already in use, thereby reducing costs by improving cloud host resource utilization on the one hand, and on the other hand leaving the remaining cloud hosts ready to handle unpredictable traffic bursts, equipment failures, and other emergencies.
  • tenant traffic to be scheduled can be preferentially scheduled to cloud hosts with relatively low resource utilization, so that the resource utilization of each cloud host is as equal as possible, thereby balancing their service lives.
  • other scheduling purposes can also be formulated based on actual needs, which will not be detailed here.
  • the resources provided by a cloud host can include various types, such as bandwidth resources, CPU resources, GPU resources, memory resources, etc., which are not limited in this specification.
  • Tenant traffic to be scheduled needs to be processed by one or more of the above types of resources provided by the cloud host, depending on the traffic's processing requirements. For example, when the tenant traffic to be scheduled carries computing data, data calculation needs to be performed through the CPU resources, memory resources, etc. provided by the cloud host; in this case, the cloud host is equivalent to a computing node. For another example, when the tenant traffic to be scheduled carries communication data, it needs to be filtered and forwarded through the bandwidth resources, CPU resources, etc. provided by the cloud host; in this case, the cloud host is equivalent to a network element.
  • by inputting this information into the reinforcement learning model as the above-mentioned environmental data for processing, this specification realizes dynamic and personalized cloud host resource allocation for the tenant traffic to be scheduled. Compared with allocating cloud host resources based on preset fixed weights in related technologies, the scheduling decisions for tenant traffic to be scheduled in this specification can be applied to different actual application scenarios in a timely manner and achieve the scheduling purposes mentioned above.
  • the above tenant traffic to be scheduled can be newly received traffic from a new tenant; that is, the triggering condition of the traffic scheduling method in this specification can be receiving traffic from a new tenant. In other words, when traffic from a new tenant is received, it can be regarded as the above tenant traffic to be scheduled, and S201 and its subsequent steps are entered to achieve reasonable scheduling of this traffic, so that it can be processed by reasonable cloud host resources.
  • the above triggering conditions may also include any of the following: the resource utilization of a preset number of cloud hosts reaches a preset expansion threshold, triggering an expansion demand; the resource utilization of a preset number of cloud hosts falls below a preset shrinkage threshold, triggering a shrinkage demand; and so on.
  • in those cases, tenant traffic that has already been scheduled can be treated as the above tenant traffic to be scheduled and re-scheduled through the technical solution in this specification.
  • S202: Input the environmental data into the reinforcement learning model generated by pre-training, and obtain the actions output by the reinforcement learning model and the score of each action, where each action corresponds to one or more cloud hosts.
  • after the above-mentioned environmental data is determined, it can be input into a reinforcement learning model generated by pre-training.
  • the above-mentioned reinforcement learning model can determine the actions corresponding to the environmental data and the score of each action, where each action represents a set of cloud hosts that can be assigned to process the tenant traffic to be scheduled; the set includes one or more cloud hosts, that is, each action can correspond to one or more different cloud hosts.
  • when there are more alternative cloud hosts, there are relatively more combinations among them, so the number of cloud host sets formed from them is usually relatively larger, and vice versa; the number of actions is therefore positively correlated with the number of alternative cloud hosts.
  • under different scheduling purposes, the scheduling results adopted for the tenant traffic to be scheduled may also differ.
  • the technical solution of this specification will be described below with cloud host cost minimization as an exemplary scheduling purpose.
  • Equation (1) represents the minimum of the total cost of all cloud hosts used for traffic processing, corresponding to the cloud host cost minimization purpose listed above: min Σ_{e∈E} k_e · y_e, where k_e represents the cost of using cloud host e, y_e is a binary variable indicating whether cloud host e is used, and E is the set of all available cloud hosts.
  • the Markov decision process can be described as (S, A, P, R, γ), where S is the state space, A is the action space, P is the state transition function, R is the reward function, and γ is a discount coefficient belonging to the interval [0, 1].
  • the NFV platform controller can determine the above-mentioned environmental data as state information S_t and input it into the above-mentioned reinforcement learning model, thereby obtaining the action A_t output by the model; the NFV platform controller can then perform the corresponding traffic scheduling operation according to action A_t.
  • the NFV platform controller can give the above-mentioned reinforcement learning model a corresponding reward R_t for updating the model, thereby optimizing its decision-making strategy.
  • the goal is to find an optimal strategy that maximizes the cumulative reward. It can be seen that, through the reasonable formulation of the reward function in this specification, the reinforcement learning model can be gradually adjusted, as it optimizes based on rewards, to achieve or approach the scheduling purposes mentioned above; a sketch of this interaction loop follows.
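  • As an illustration only, the interaction loop described above can be sketched in Python as follows; the objects and method names (controller, rl_model, get_environment_data, and so on) are assumptions made for the sketch, not interfaces defined in this specification.

```python
def interaction_loop(controller, rl_model, num_steps):
    """Minimal sketch of the Markov decision loop described above."""
    for _ in range(num_steps):
        s_t = controller.get_environment_data()    # state S_t
        a_t = rl_model.select_action_combo(s_t)    # action A_t
        controller.execute_schedule(a_t)           # schedule traffic per A_t
        r_t = controller.compute_reward(s_t, a_t)  # reward R_t
        rl_model.update(s_t, a_t, r_t)             # optimize the decision strategy
```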
  • the above reinforcement learning model can be constructed using different algorithms according to the actual needs of the NFV platform, and this description does not limit this.
  • the above-mentioned reinforcement learning model may include a DQN model.
  • the DQN model uses a neural network to approximate the Q-value table Q(s, a) of the traditional Q-Learning algorithm, representing it as Q(s, a, θ), where θ denotes the neural network's weight parameters. For example, a multi-layer perceptron (MLP) can be used as the neural network to construct the corresponding Q network.
  • the above score for each action can be the Q value output by the Q network in the DQN model; a sketch of such a network follows.
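  • A minimal sketch of such a Q network, assuming PyTorch; the layer sizes are illustrative, since the specification does not fix the architecture. The input width 3N + 3 matches the state vector described later, and with the single-host simplification described below the output width equals the number of cloud hosts N.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP Q network: maps a state vector to one Q value per action."""

    def __init__(self, num_hosts: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * num_hosts + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_hosts),  # Q(s, a, theta) for each host
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.mlp(state)
```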
  • S203: Determine the action combination with the highest score, and schedule the tenant traffic to be scheduled to the cloud host corresponding to the action combination with the highest score, where each action combination corresponds to one or more actions.
  • when the above-mentioned reinforcement learning model outputs the actions corresponding to the above-mentioned environmental data and the score of each action, all actions can be permuted and combined, and the action combination with the highest score can be determined based on the scores; the NFV platform controller can then schedule the tenant traffic to be scheduled to the cloud host corresponding to that action combination.
  • reinforcement learning is particularly suitable for sequential decision-making problems in dynamic environments, such as the traffic scheduling problem (or cloud host allocation problem) in this specification.
  • if each action output by the reinforcement learning model represented a complete cloud host allocation plan selected for the tenant traffic to be allocated, the action combination with the highest score in S203 would simply be the single action with the highest score.
  • however, the number of actions formed in that way would be very large (it may exceed millions, for example), bringing great complexity and computation. Therefore, each action can be simplified by limiting the number of cloud hosts it contains, and in S203 the actions are combined to form action combinations.
  • each action combination contains multiple actions, and each action combination can represent a cloud host allocation plan selected for the tenant traffic to be allocated; keeping the total number of actions controllable in this way helps to efficiently determine the highest-scoring action combination.
  • each action can be simplified to include only one cloud host, at which point the total number of actions is minimal and equal to the total number of cloud hosts.
  • the constraints may include a first constraint used to individually constrain a single cloud host, a second constraint used to constrain all cloud hosts corresponding to an action combination as a whole, or both types of constraints at the same time.
  • the first constraint can be used to filter the actions output by the reinforcement learning model; for each filtered-out action, at least one of its corresponding cloud hosts does not satisfy the first constraint.
  • the first constraint may include: the resource proportion of the cloud host to which the tenant traffic is to be scheduled does not exceed a preset resource threshold. For example, suppose there are two actions A and B and cloud hosts 1 and 2, the available bandwidths of the two cloud hosts are a and b respectively, action A includes cloud host 1, action B includes cloud hosts 1 and 2, and the bandwidth demand of the tenant traffic to be scheduled is X, where a < X ≤ a + b.
  • since the bandwidth demand of the tenant traffic cannot exceed the sum of the maximum available bandwidth corresponding to an action, and X exceeds the available bandwidth a of the cloud host corresponding to action A, action A is filtered out based on the bandwidth requirement, leaving only action B.
  • the action combination with the highest score is determined from the alternative action combinations formed by the remaining actions.
  • the action combination with the highest score is the highest-scoring action combination among all alternative action combinations that satisfy the preset second constraint.
  • the second constraint may include at least one of the following conditions: the number of cloud hosts to which the tenant traffic to be scheduled is allocated remains within a preset interval; and the cloud hosts corresponding to the action combination do not completely overlap with the cloud hosts corresponding to already-scheduled tenant traffic.
  • continuing the example, suppose actions C and D, like action B, are not filtered out by the above first constraint, where action C includes cloud hosts 1, 2, and 3, and action D includes cloud hosts 2 and 3.
  • if the preset interval allows at most 2 cloud hosts, action C is filtered out because it corresponds to 3 cloud hosts as a whole.
  • actions B and D then remain; based on the scores given by the reinforcement learning model for actions B and D, the action combination with the highest score is determined, and the tenant traffic to be scheduled is scheduled to the cloud hosts corresponding to that action combination.
  • in one embodiment, the above remaining actions can be permuted and combined to generate the above alternative action combinations; all alternative action combinations are then filtered according to the above second constraint, the filtered alternative action combinations are sorted by the total score of all the actions they contain, and the top-ranked alternative action combination is determined as the action combination with the highest score.
  • in another embodiment, the above remaining actions are first sorted by the score of each action, and the remaining actions with relatively higher scores are preferentially permuted and combined to generate the above alternative action combinations, until a generated alternative action combination satisfies the above second constraint; that alternative action combination is then determined as the action combination with the highest score.
  • in the latter embodiment, the sorting operation precedes the permutation-and-combination operation. Therefore, when determining the action combination with the highest score, it is not necessary to traverse the scores of all action combinations; only the remaining actions with relatively higher scores are preferentially combined, which increases the chance of identifying the highest-scoring action combination early, as sketched below.
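  • A minimal sketch of this greedy variant, assuming the actions surviving the first constraint and their Q values are given; the function and parameter names are illustrative.

```python
from itertools import combinations

def best_combination(actions, q_values, satisfies_second_constraint,
                     max_actions=4):
    """Greedy search for the highest-scoring feasible action combination.

    Actions are tried in descending Q-value order, so the first
    combination that satisfies the second constraint can be taken
    without traversing the scores of all combinations.
    """
    ranked = sorted(actions, key=lambda a: q_values[a], reverse=True)
    for size in range(1, max_actions + 1):
        for combo in combinations(ranked, size):
            if satisfies_second_constraint(combo):
                return combo
    return None  # no combination satisfies the constraints
```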
  • the reinforcement learning model in this specification can be updated according to the rewards corresponding to the output actions, so as to continuously improve the processing effect of the reinforcement learning model and achieve or approximate the required scheduling purpose.
  • the NFV platform controller can store the historical environment data, the highest-scoring historical action combination corresponding to that data, the reward corresponding to that action combination, and the updated historical environment data formed after the action combination is executed, together as a set of historical data in a preset cache area; one or more sets of historical data are then periodically selected at random from the cache area to update the above reinforcement learning model.
  • the reward corresponding to the above historical action combination can be calculated by a preset reward function based on the historical environment data, the preset cloud host price model, and the historical action combination; the reward is negatively correlated with the total cost of the cloud hosts corresponding to the historical action combination, which allows the above reinforcement learning model to give higher scores to action combinations corresponding to cloud hosts with lower total costs.
  • the above-mentioned preset reward function can be used to guide the reinforcement learning algorithm to find the optimal strategy. Then, through the reasonable formulation of the preset reward function, the optimization direction of the reinforcement learning model can be controlled so that it can achieve or approach the required scheduling purpose after continuous updates.
  • when the scheduling purpose is to minimize the total cost of the cloud hosts, the above reward function can be designed as follows: 1. if an action schedules the tenant traffic to be scheduled to an unused cloud host e (an ECS that has not been assigned to-be-scheduled traffic before), the reward for this action is -k_e; 2. if an action schedules the tenant traffic to be scheduled to a used cloud host e, the reward is 0. After the reward is calculated, it can be normalized so that it falls within the interval [-1, 1], to facilitate subsequent update operations of the reinforcement learning model; one possible realization is sketched below.
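  • A minimal sketch of this reward design; the normalization shown (dividing by the largest host cost) is one possible choice, since the specification does not fix a particular scheme, and the container names are illustrative.

```python
def compute_reward(action_hosts, used_hosts, host_cost):
    """Reward for one action under the cost-minimization design.

    Newly started hosts are penalized by their cost k_e; reusing a
    host that already carries tenant traffic contributes 0.
    """
    r = -sum(host_cost[e] for e in action_hosts if e not in used_hosts)
    max_cost = max(host_cost.values())        # assumes host_cost is non-empty
    return max(-1.0, min(1.0, r / max_cost))  # normalize into [-1, 1]
```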
  • this specification models the scheduling of tenant traffic to be scheduled as a multi-stage sequential decision problem suitable for reinforcement learning, adopts a model corresponding to the reinforcement learning algorithm DQN together with a filtering-and-combination algorithm determined by the constraints, trains the reinforcement learning model on historical environment data, and schedules tenant traffic to the highest-scoring cloud host set that meets actual needs.
  • the above-mentioned multi-stage sequence decision-making problem can be analyzed and calculated based on the dynamically changing characteristics of tenant traffic.
  • the scheduling decisions for the tenant traffic to be scheduled in this specification can be applied to different practical application scenarios in a timely manner, so that the NFV platform can achieve the required scheduling purposes.
  • Figure 3 is a schematic diagram of cloud host allocation for tenant traffic according to an exemplary embodiment of this specification.
  • the NFV platform controller cooperates with the reinforcement learning module, allocating corresponding cloud hosts to tenants based on the allocation policy provided by the reinforcement learning module so that these cloud hosts process the tenants' traffic. Specifically:
  • the NFV platform controller can obtain or maintain tenant traffic information, cloud host resource information, reliability policies, price models, scheduling decisions, etc. respectively. Among them, tenant traffic information and cloud host resource information belong to environmental information.
  • the NFV platform controller can provide these environmental data as status information to the environmental assessment module in the reinforcement learning module.
  • the environmental assessment module further inputs the above status information into the reinforcement learning model.
  • the tenant traffic information and cloud host resource information involve the CPU resources, memory resources, and bandwidth resources provided by the cloud hosts.
  • b_i, c_i, and m_i respectively represent the bandwidth resources, CPU resources, and memory resources required by the traffic of tenant i.
  • the state vector S_i represents the i-th state and encapsulates the resource information of all available cloud hosts together with the resource information required by the i-th tenant; its total length can be 3N + 3 (three resource values for each of the N cloud hosts, plus the three resource demands of the tenant).
  • this specification does not require that the above environmental data and the state information S_t be consistent in format.
  • for example, if the above environmental data is in a string format, it can be converted into an equivalent vector format before being input into the above reinforcement learning model.
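  • As an illustration, the state vector can be assembled as in the following sketch; the flat concatenation order is an assumption, since the specification fixes only the total length 3N + 3.

```python
import numpy as np

def build_state(host_resources, tenant_demand):
    """Assemble the state vector S_i of length 3N + 3.

    host_resources: list of N (bandwidth, cpu, memory) triples, one per
    available cloud host (B_e, C_e, M_e).
    tenant_demand: the (b_i, c_i, m_i) triple required by tenant i.
    """
    flat = [r for host in host_resources for r in host]  # 3N values
    return np.asarray(flat + list(tenant_demand), dtype=np.float32)
```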
  • Reinforcement learning models can be built based on the DQN algorithm.
  • after the DQN model obtains the status information S_i from the environmental assessment module, it uses S_i as its input.
  • the neural network included in the DQN model then calculates and outputs the Q value of each action, where each action corresponds to one cloud host.
  • the DQN model passes the calculated Q values of all actions to the filter combination algorithm.
  • the filter combination algorithm can select the action combination adopted this time based on the Q value.
  • the cloud host corresponding to the action combination will be assigned to the tenant.
  • the filtering combination algorithm can be implemented through the constraints defined in this specification, which can be written as follows. For every cloud host e: Σ_{i∈D_e} b_i ≤ U·B_e (2), Σ_{i∈D_e} c_i ≤ U·C_e (3), Σ_{i∈D_e} m_i ≤ U·M_e (4), where D_e denotes the tenant traffic in D allocated to cloud host e. For every tenant i: G_min ≤ G_i ≤ G_max (5). For any two tenants i ≠ j: |E_i ∩ E_j| ≤ O (6), where E_i is the set of cloud hosts used by tenant i.
  • D is the set of tenant traffic to be scheduled
  • U is the cloud host resource threshold
  • G_i is the number of cloud hosts used by tenant i
  • b_i, c_i, m_i are the bandwidth, CPU, and memory resources occupied by tenant i's to-be-scheduled tenant traffic
  • B_e, C_e, and M_e are the bandwidth, CPU, and memory resources of cloud host e.
  • equations (2), (3), and (4) are used to individually constrain a single cloud host. Specifically, they limit the resource capacity of each cloud host: the percentage of resources used by all to-be-scheduled tenant traffic allocated to a cloud host should not exceed the predetermined threshold U, to ensure that each cloud host retains sufficient resources to cope with emergencies such as traffic surges.
  • Equation (5) is used to constrain the number of allocated cloud hosts: G_i indicates the number of cloud hosts used by tenant i. On the one hand, this value should not be less than a fixed minimum value G_min; by allocating the tenant traffic to be scheduled across multiple cloud hosts, high reliability can be achieved. On the other hand, the value should not exceed a fixed maximum value G_max. The values of G_min and G_max can be determined according to the actual needs of the NFV platform. Equation (6) is used to constrain the overlap of cloud hosts: the number of cloud hosts shared between two tenants should not exceed a certain parameter O, whose actual value is determined by the NFV platform administrator. The purpose of incomplete overlap between different tenants is to prevent dangerous traffic of some tenants from affecting the traffic of other tenants, thereby improving system reliability.
  • the filter combination algorithm can first filter out actions that do not satisfy equations (2), (3), and (4), sort the remaining actions according to their Q values, and then try combinations starting from the remaining actions with relatively higher Q values.
  • in this way, alternative action combinations that satisfy the above equation (5) can be obtained in sequence, with each alternative action combination obtained in order of total Q value from high to low; the constraint checks are sketched below.
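  • A minimal sketch of the two checks, assuming per-host (bandwidth, cpu, memory) triples, a given split of the tenant's demand across the hosts of a combination, and sets of host identifiers; all names are illustrative.

```python
def passes_capacity(share, capacity, allocated, U):
    """First constraint (equations (2)-(4)): for each host e receiving a
    share of the tenant's traffic, usage must stay within U times capacity."""
    for e, need in share.items():
        for n, used, cap in zip(need, allocated[e], capacity[e]):
            if used + n > U * cap:
                return False
    return True

def passes_reliability(combo_hosts, g_min, g_max, other_tenants_hosts, overlap_max):
    """Second constraint (equations (5)-(6)): host count within
    [g_min, g_max] and overlap with every other tenant bounded."""
    if not g_min <= len(combo_hosts) <= g_max:
        return False
    return all(len(combo_hosts & hosts) <= overlap_max
               for hosts in other_tenants_hosts)
```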
  • the above constraints can be defined in the reliability policy maintained by the NFV platform controller, and the relevant content of the reliability policy can be provided by the NFV platform controller to the filtering combination algorithm for generating scheduling decisions; for example, the NFV platform controller may indirectly provide the relevant content of the reliability policy to the filtering combination algorithm through the environmental assessment module.
  • the NFV platform controller can update the constraints in the reliability policy based on actual needs.
  • the filtering combination algorithm can transfer relevant information to the NFV platform controller, so that the NFV platform controller can form a scheduling strategy accordingly.
  • for example, if the cloud host allocation plan selects cloud hosts 1 to 4, corresponding to actions A1 to A4, the corresponding scheduling policy can be: dispatch the tenant traffic to cloud hosts 1 to 4 for processing.
  • the price model maintained by the NFV platform controller may be a cloud host price model.
  • the environmental assessment module can calculate the rewards generated by this cloud host allocation based on the cloud host price model, as well as the aforementioned status information, cloud host allocation plan, and preset reward function.
  • the corresponding scheduling scheme can preferentially schedule the traffic of the tenant currently to be allocated to cloud hosts that have already been allocated other tenants' traffic, improving the resource utilization of those cloud hosts within a safe range and thereby reducing the total cost of cloud host resources.
  • since the number of started cloud hosts can be reduced under the same conditions, the basic resource consumption (system operation, heat dissipation, etc.) necessary for running these cloud hosts can be saved and greenhouse gas emissions reduced, which is conducive to achieving the goals of carbon peaking and carbon neutrality at an early date.
  • the reinforcement learning model can maintain a buffered data set.
  • the above status information, the cloud host allocation plan, the reward, and the subsequent status information formed after executing the cloud host allocation plan can be recorded as a set of data in the above buffered data set.
  • the buffer data set is used to save each set of data formed by each allocation.
  • the reinforcement learning model can periodically select one or more sets of data from the buffered data set for its own model update; the selection can be random. A sketch of such a buffer follows.
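  • A minimal sketch of such a buffered data set, with uniform random sampling; the capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Buffered data set of (state, action_combo, reward, next_state) tuples."""

    def __init__(self, capacity: int = 10000):
        self.data = deque(maxlen=capacity)  # oldest entries drop out first

    def record(self, state, action_combo, reward, next_state):
        self.data.append((state, action_combo, reward, next_state))

    def sample(self, batch_size: int = 32):
        return random.sample(list(self.data), min(batch_size, len(self.data)))
```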
  • the above embodiment describes the inference process performed by the reinforcement learning model after training is complete.
  • this inference process enables the reinforcement learning module to provide the NFV platform controller with a scheduling strategy that meets the scheduling purpose. The above scheduling strategy is essentially the result output by the above reinforcement learning model after the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources are input into it. Therefore, compared with allocating cloud host resources based on preset fixed weights in related technologies, the scheduling strategies obtained in this specification can be applied to different practical application scenarios in a timely manner.
  • before that, the reinforcement learning model needs to be trained iteratively to ensure that the scheduling policy it outputs can meet the scheduling purpose.
  • the training process for the reinforcement learning model is similar to the above inference process and includes: inputting the environmental data in the training samples into the reinforcement learning model, obtaining the actions and scores output by the model, and determining the action combination with the highest score; based on the highest-scoring action combination, it can then be determined whether the preset scheduling purpose has been achieved. If it has not, iterative training of the reinforcement learning model continues until the model can meet the scheduling purpose. Of course, whether to continue iterative training can also be determined based on other conditions, such as whether the number of training iterations has reached a preset number.
  • during training, each group of data cached in the above buffered data set can still be used: by periodically selecting one or more groups of data at random, the parameters of the deep neural network used in the reinforcement learning model are updated, which helps to overcome the correlation and non-stationary distribution of the experience data; one update step is sketched below.
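  • A minimal sketch of one such update step, assuming PyTorch and the common temporal-difference target with a separate target network (a standard DQN choice, not one mandated by this specification); each batch entry is (state, action index, reward, next state), with states already encoded as vectors (for example by build_state above).

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update on a batch sampled from the buffered data set."""
    states, actions, rewards, next_states = zip(*batch)
    s = torch.stack([torch.as_tensor(x) for x in states])
    a = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(rewards, dtype=torch.float32)
    s2 = torch.stack([torch.as_tensor(x) for x in next_states])

    q = q_net(s).gather(1, a).squeeze(1)  # Q(s, a, theta) for taken actions
    with torch.no_grad():                 # TD target from the frozen network
        target = r + gamma * target_net(s2).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```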
  • Figure 4 is a schematic structural diagram of a network function virtualization NFV platform controller according to an exemplary embodiment.
  • the NFV platform controller includes a processor, internal bus, network interface, memory and non-volatile storage, and of course may include other required hardware.
  • the processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it, forming a traffic scheduling device at the logical level.
  • this specification does not exclude other implementation methods, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logical units, and may also be a hardware or logic device.
  • this specification also provides an embodiment of a traffic scheduling device.
  • Figure 5 is a schematic structural diagram of a traffic scheduling device according to an exemplary embodiment.
  • the device may include:
  • the environment data determination unit 501 is used to determine the environment data, which includes the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources;
  • the reinforcement learning model processing unit 502 is used to input the environmental data into the reinforcement learning model generated by pre-training, and obtain the actions output by the reinforcement learning model and the score of each action, where each action corresponds to one or more cloud hosts;
  • the traffic scheduling unit 503 is configured to determine the action combination with the highest score, and schedule the tenant traffic to be scheduled to the cloud host corresponding to the action combination with the highest score; wherein each action combination corresponds to one or more actions.
  • the reinforcement learning model includes a deep Q network model, and the score of each action output by the reinforcement learning model is the Q value of the corresponding action.
  • the device also includes:
  • the action constraint unit 504 is used to filter the actions output by the reinforcement learning model according to a preset first constraint, where the first constraint is used to individually constrain a single cloud host, and at least one cloud host corresponding to each filtered-out action does not satisfy the first constraint;
  • the action combination with the highest score is then determined from the alternative action combinations formed by the remaining actions.
  • the action combination with the highest score is the highest-scoring action combination among all alternative action combinations that satisfy a preset second constraint, where the second constraint is used to impose overall constraints on all cloud hosts corresponding to an action combination.
  • the action constraint unit 504 is specifically used to:
  • sort the remaining actions according to the score of each action, and preferentially permute and combine the remaining actions with relatively higher scores to generate the alternative action combinations, until a generated alternative action combination meets the second constraint, and determine that alternative action combination as the action combination with the highest score; or,
  • permute and combine the remaining actions to generate the alternative action combinations, filter all alternative action combinations according to the second constraint, sort the filtered alternative action combinations by the total score of all the actions they contain, and determine the top-ranked alternative action combination as the action combination with the highest score.
  • the first constraint includes: the resource proportion of the cloud host to which the tenant traffic is to be scheduled does not exceed a preset resource threshold;
  • the second constraint includes at least one of the following: the number of cloud hosts to which the tenant traffic to be scheduled is allocated remains within a preset interval; and the cloud hosts corresponding to the action combination do not completely overlap with the cloud hosts corresponding to already-scheduled tenant traffic.
  • the device also includes:
  • the reinforcement learning model update unit 505 is used to randomly select one or more sets of historical data from the preset cache area to update the reinforcement learning model;
  • Each set of historical data includes: historical environment data, the historical action combination with the highest score corresponding to the historical environment data, the reward corresponding to the historical action combination, and the updated historical environment data formed after the execution of the historical action combination.
  • the reward corresponding to the historical action combination is calculated by a preset reward function based on the historical environment data, the preset cloud host price model, and the historical action combination, where the reward is negatively correlated with the total cost of the cloud hosts corresponding to the historical action combination.
  • the device is executed when any of the following trigger conditions is met:
  • the resource utilization of a preset number of cloud hosts reaches a preset expansion threshold, triggering an expansion demand; or
  • the resource utilization of a preset number of cloud hosts falls below a preset shrinkage threshold, triggering a shrinkage demand.
  • since the device embodiment basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification. Persons of ordinary skill in the art can understand and implement it without creative effort.
  • Embodiments of the subject matter and functional operations described in this specification may be implemented in digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver device for execution by the data processing apparatus.
  • Computer storage media may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows may also be performed by dedicated logic circuits, such as FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the device may also be implemented as a dedicated logic circuit.
  • Computers suitable for executing the computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from read-only memory and/or random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operably coupled to such a mass storage device to receive data from it, transmit data to it, or both.
  • however, the computer is not required to have such a device.
  • the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory may be supplemented by or incorporated into special purpose logic circuitry.
  • a claimed combination may be directed to a subcombination or a variation of a subcombination.

Abstract

The present description provides a traffic scheduling method. The method comprises: determining environmental data, the environmental data comprising resource requirement information of tenant traffic to be scheduled and the actual allocation of cloud host resources; inputting the environmental data into a reinforcement learning model generated by pre-training, and obtaining actions output by the reinforcement learning model and a score of each action, wherein each action corresponds to one or more cloud hosts; and determining the action combination having the highest score, and scheduling said tenant traffic to the cloud host corresponding to the action combination having the highest score, wherein each action combination corresponds to one or more actions.

Description

Traffic scheduling method
This application claims priority to the Chinese patent application filed with the China Patent Office on April 29, 2022, with application number 202210476232.9 and entitled "Traffic Scheduling Method", the entire content of which is incorporated into this application by reference.
Technical field
This specification relates to the field of communication technology, and in particular, to a traffic scheduling method.
Background
With the continuous development of cloud technology, virtual network functions (Virtual Network Function, VNF) have begun to be widely used in modern cloud networks. With the rapid increase in enterprises and organizations using cloud services, there are stricter requirements for the scalability, deployment speed, and performance of network functions, which server-based implementations of virtual network functions can hardly meet. In order to deploy and deliver network functions in a more elastic manner, deploying virtual network functions on the cloud hosts (Elastic Compute Service, ECS) provided by cloud networks has become a suitable solution. This cloud-native design is easy to manage, supplies resources on demand, and allocates resources elastically; user network functions can be conveniently deployed on cloud hosts while maintaining high availability.
Summary of the invention
In view of this, this specification provides a traffic scheduling method to solve the deficiencies in related technologies.
Specifically, this specification is implemented through the following technical solutions:
According to a first aspect of the embodiments of this specification, a traffic scheduling method is provided, and the method includes:
determining environmental data, where the environmental data includes resource demand information of tenant traffic to be scheduled and the actual allocation of cloud host resources;
inputting the environmental data into a reinforcement learning model generated by pre-training, and obtaining the actions output by the reinforcement learning model and the score of each action, where each action corresponds to one or more cloud hosts; and
determining the action combination with the highest score, and scheduling the tenant traffic to be scheduled to the cloud host corresponding to the action combination with the highest score, where each action combination corresponds to one or more actions.
According to a second aspect of the embodiments of this specification, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the steps of the method described in the first aspect are implemented.
According to a third aspect of the embodiments of this specification, a network function virtualization (NFV) platform controller is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method described in the first aspect are implemented.
In the technical solution provided in this specification, by inputting the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources into the reinforcement learning model as environmental data, dynamic and personalized cloud host resource allocation can be achieved for that traffic. Compared with allocating cloud host resources based on preset fixed weights in related technologies, the scheduling decisions for tenant traffic to be scheduled in this specification can be applied to different actual application scenarios in a timely manner, and tenant traffic is guaranteed to be reasonably scheduled to the corresponding cloud hosts, thereby achieving reasonable allocation of cloud host resources.
By training the reinforcement learning model, the trained model can meet the traffic scheduling purpose expected in this specification. For example, by preferentially scheduling the traffic of the tenant currently to be allocated to cloud hosts that have already been allocated other tenants' traffic, the resource utilization of the corresponding cloud hosts can be improved within a safe range, and the total cost of cloud host resources can be reduced. Moreover, since the number of started cloud hosts can be reduced under the same conditions, the basic resource consumption (system operation, heat dissipation, etc.) necessary for running these cloud hosts can be saved and greenhouse gas emissions reduced, which is conducive to achieving the goals of carbon peaking and carbon neutrality at an early date.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit this specification.
Brief description of the drawings
In order to explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can also be obtained based on these drawings.
Figure 1 is a schematic architectural diagram of a traffic scheduling system according to an exemplary embodiment of this specification;
Figure 2 is a schematic flowchart of a traffic scheduling method according to an exemplary embodiment of this specification;
Figure 3 is a schematic diagram of cloud host allocation for tenant traffic according to an exemplary embodiment of this specification;
Figure 4 is a schematic structural diagram of a network function virtualization NFV platform controller according to an exemplary embodiment of this specification;
Figure 5 is a schematic structural diagram of a traffic scheduling device according to an exemplary embodiment of this specification.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatus and methods consistent with some aspects of this specification, as detailed in the appended claims.
The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
In research on related technologies, the applicant found that service providers usually manage the deployment and traffic scheduling of many different network functions from tenants (i.e., the above-mentioned enterprises and organizations using cloud services) based on a Network Functions Virtualization (NFV) management and control platform. Among tenant traffic, elephant flows are usually offloaded using a Virtual Distributed Process Unit (VDPU) to reduce the waste of idle resources. For other tenant traffic, the corresponding scheduling plan is usually obtained based on weighted calculation. However, due to the uncertainty caused by dynamic changes in tenant traffic, values such as the weight values in the above weighted calculation cannot be adapted to different actual application scenarios in a timely manner. Therefore, this specification proposes the following technical solutions to solve the above problems.
图1是本说明书一示例性实施例示出的流量调度系统的架构示意图。如图1所示,可以包括强化学习模块11和NFV平台控制器12。Figure 1 is a schematic architectural diagram of a traffic scheduling system according to an exemplary embodiment of this specification. As shown in Figure 1, a reinforcement learning module 11 and an NFV platform controller 12 may be included.
强化学习(Reinforcement Learning,RL)模块11用于接收对应的NFV平台的环境数据,并针对上述NFV平台中的待调度租户流量生成相应的流量调度决策,使NFV平台控制器12根据该流量调度决策对待调度租户流量进行调度,从而通过决策所指示的云主机对待调度租户流量进行处理。其中,强化学习模块11可以采用任意类型的强化学习模型来实现上述功能;作为一示例性实施例,该强化学习模型可以为基于神经网络算法的强化学习模型,比如深度Q网络(Deep Q-Learning,DQN)模型等。The reinforcement learning (Reinforcement Learning, RL) module 11 is used to receive the environmental data of the corresponding NFV platform, and generate corresponding traffic scheduling decisions for the tenant traffic to be scheduled in the above-mentioned NFV platform, so that the NFV platform controller 12 makes the traffic scheduling decisions according to the traffic scheduling decisions. The unscheduled tenant traffic is scheduled so that the unscheduled tenant traffic is processed by the cloud host indicated by the decision. Among them, the reinforcement learning module 11 can use any type of reinforcement learning model to achieve the above functions; as an exemplary embodiment, the reinforcement learning model can be a reinforcement learning model based on a neural network algorithm, such as Deep Q-Learning (Deep Q-Learning). , DQN) model, etc.
The NFV platform controller 12 is configured to perform traffic scheduling for the corresponding NFV platform. After obtaining the environment data corresponding to the NFV platform, the NFV platform controller 12 may send the environment data to the reinforcement learning module 11 and, according to the traffic scheduling policy returned by the reinforcement learning module 11, schedule the tenant traffic to be scheduled on the NFV platform to the corresponding cloud hosts. The reinforcement learning module 11 and the NFV platform controller 12 may run on the same electronic device, or may run on different electronic devices; this specification places no limitation on this. The electronic device may be, for example, a physical server comprising an independent host, or a virtual server carried by a host cluster, which is likewise not limited by this specification.
The technical solution of this specification is described below with reference to the embodiment shown in Figure 2. Figure 2 is a schematic flowchart of a traffic scheduling method according to an exemplary embodiment of this specification. As shown in Figure 2, the method may include the following steps:
S201: determine environment data, where the environment data includes resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources.
In the technical solution of this specification, by determining the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources, a reasonable match between tenant traffic and cloud hosts can be achieved, thereby attaining the desired scheduling purpose. For example, for effective control of cloud host operating costs and stability of traffic processing, the tenant traffic to be scheduled may be preferentially scheduled to cloud hosts already in use; this reduces costs by raising the resource utilization of those cloud hosts, while the remaining cloud hosts stay ready to handle unpredictable traffic bursts, device failures, and other emergencies at any time. As another example, to balance the service life of all cloud hosts, the tenant traffic to be scheduled may be preferentially scheduled to cloud hosts with relatively low resource utilization, so that the resource utilization of the cloud hosts is kept as equal as possible. Similarly, other scheduling purposes may be formulated based on actual needs, and these are not enumerated here.
The resources provided by a cloud host may include multiple types, such as bandwidth resources, CPU resources, GPU resources, and memory resources, which this specification does not limit. The tenant traffic to be scheduled needs to be processed by one or more of the above resource types provided by the cloud host, depending on the processing requirements of the traffic. For example, when the tenant traffic to be scheduled carries computing data, data computation needs to be performed with the CPU resources, memory resources, and the like provided by the cloud host, in which case the cloud host acts as a computing node; as another example, when the tenant traffic to be scheduled carries communication data, it needs to be filtered and forwarded using the bandwidth resources, CPU resources, and the like provided by the cloud host, in which case the cloud host acts as a network element.
Because the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources are highly real-time in nature, this specification inputs them as the above environment data into the reinforcement learning model for processing, so that dynamic and personalized cloud host resource allocation can be realized for the tenant traffic to be scheduled. Compared with allocating cloud host resources based on preset fixed weights in the related art, the scheduling decisions for the tenant traffic to be scheduled in this specification can be adapted to different practical application scenarios in a timely manner and achieve the scheduling purposes described above.
The tenant traffic to be scheduled may be traffic received from a new tenant; that is, the triggering condition of the traffic scheduling method of this specification may be the reception of traffic from a new tenant. In other words, upon receiving traffic from a new tenant, that traffic may be taken as the tenant traffic to be scheduled, and S201 and its subsequent steps are entered to achieve reasonable scheduling of the traffic, so that it is processed with reasonable cloud host resources. Of course, the triggering condition may also include either of the following: the resource usage of a preset number of cloud hosts reaches a preset scale-out threshold, triggering a scale-out demand; or the resource utilization of a preset number of cloud hosts falls below a preset scale-in threshold, triggering a scale-in demand. In the event of scaling out or scaling in, at least part of the tenant traffic that has already been scheduled may be taken as the tenant traffic to be scheduled and rescheduled through the technical solution of this specification.
S202: input the environment data into a pre-trained reinforcement learning model, and obtain the actions output by the reinforcement learning model and the score of each action, where each action corresponds to one or more cloud hosts.
After the environment data is determined, it may be input into the pre-trained reinforcement learning model, and the reinforcement learning model may determine the actions corresponding to the environment data and the score of each action. Each action represents a set of cloud hosts that may be allocated to process the tenant traffic to be scheduled, and this set contains one or more cloud hosts; that is, each action may correspond to one or more different cloud hosts. The larger the number of candidate cloud hosts, the more combinations exist among them, and hence the larger the number of cloud host sets they can form; the converse also holds, so the number of actions is positively correlated with the number of candidate cloud hosts.
As mentioned above, because scheduling objectives differ, the scheduling results adopted for the tenant traffic to be scheduled may also differ. For ease of understanding, the technical solution of this specification is described below with cloud host cost minimization as an exemplary scheduling purpose.
Equation (1) expresses the minimization of the total cost of all cloud hosts used for traffic processing, which corresponds to the exemplary scheduling purpose of cloud host cost minimization listed above:

$$\min \sum_{e \in E} k_e\, y_e \tag{1}$$

where $k_e$ denotes the cost of using cloud host $e$, $y_e$ is a binary variable indicating whether cloud host $e$ is used, and $E$ is the set of all available cloud hosts.
In a multi-tenant scenario, when the traffic of each tenant is scheduled, it is easy to see that once any tenant's traffic is scheduled to the corresponding cloud hosts for processing, the actual allocation of cloud host resources changes; each tenant's traffic scheduling plan is therefore affected by the scheduling plans of all preceding tenants. The technical solution of this specification thus addresses a sequential decision problem involving multi-stage processing, which can be described as a Markov Decision Process (MDP); accordingly, this specification solves the above traffic scheduling problem with a reinforcement learning model based on the Markov decision process.
A Markov decision process can be described as a tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ is the state transition function, $R$ is the reward function, and $\gamma$ is a discount factor taking a value in the interval $[0, 1]$. In this specification, the NFV platform controller may determine the above environment data as state information $S_t$ and input it into the reinforcement learning model, then obtain the action $A_t$ output by the reinforcement learning model, and perform the corresponding traffic scheduling operation according to the action $A_t$. Further, the NFV platform controller may give the reinforcement learning model a corresponding reward $R_t$ for updating the model and thereby optimizing its decision policy; the goal is to find an optimal policy that maximizes the cumulative reward. By formulating the reward function appropriately, the reinforcement learning model, after being optimized on the basis of rewards, can be adjusted step by step to reach or approach the scheduling purpose described above.
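To make this interaction concrete, the following is a minimal sketch (in Python, not part of the original disclosure) of the controller-model-environment loop of the MDP; the `env` and `policy` interfaces are assumptions introduced purely for illustration.

```python
def scheduling_episode(env, policy, gamma=0.9):
    """Sketch of one episode of the (S, A, P, R, gamma) process above:
    observe state S_t, let the model pick action A_t, apply it to the
    environment to obtain reward R_t and the next state, and accumulate
    the discounted cumulative reward that the policy tries to maximize."""
    state = env.reset()
    cumulative, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # A_t from the model
        state, reward, done = env.step(action)   # transition P and reward R
        cumulative += discount * reward
        discount *= gamma                        # apply the discount factor
    return cumulative
```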
The reinforcement learning model may be built with different algorithms according to the actual needs of the NFV platform, which this specification does not limit. In an embodiment, the reinforcement learning model may include a DQN model. The DQN model uses a neural network to approximate the Q-value table $Q(s, a)$ of the traditional Q-learning algorithm, turning it into $Q(s, a, \theta)$ represented by the neural network weight parameters $\theta$; a multi-layer perceptron (MLP) may be used to build the corresponding Q-network. Correspondingly, when a DQN model is used as the reinforcement learning model of this specification, the score of each action may be the Q value output by the Q-network of the DQN model.
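As a hedged illustration of such an MLP-based Q-network, a minimal PyTorch sketch could look as follows; the layer sizes and class name are assumptions and do not come from the original disclosure.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal MLP approximating Q(s, a, theta): it maps a state vector
    to one Q value per action (here, one action per cloud host)."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```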
S203: determine the action combination with the highest score, and schedule the tenant traffic to be scheduled to the cloud hosts corresponding to the highest-scoring action combination, where each action combination corresponds to one or more actions.
After the reinforcement learning model outputs the actions corresponding to the environment data and the score of each action, all actions may be permuted and combined, the action combination with the highest score may be determined from the scores, and the NFV platform controller may then schedule the tenant traffic to be scheduled to the cloud hosts corresponding to the highest-scoring action combination.
As mentioned above, reinforcement learning is particularly suitable for sequential decision problems in dynamic environments, such as the traffic scheduling problem (also called the cloud host allocation problem) of this specification. Each action output by the reinforcement learning model can represent one cloud host allocation plan selected for the tenant traffic to be allocated, in which case the highest-scoring action combination in S203 is in fact simply the highest-scoring action. However, if the total number of cloud hosts is large, the number of possible actions becomes enormous, possibly exceeding a million, which would bring great complexity and computational load. Therefore, the number of cloud hosts contained in each action can be reduced, and in S203 the actions are combined into action combinations, each containing multiple actions; each action combination then represents one cloud host allocation plan selected for the tenant traffic to be allocated. This keeps the total number of actions controllable and helps determine the highest-scoring action combination efficiently. For example, each action can be simplified to contain only one cloud host, in which case the total number of actions is minimal and equal to the total number of cloud hosts.
For considerations such as cloud host operational stability and system reliability, in addition to selecting action combinations by score, it is also necessary to ensure that the cloud hosts involved in the allocation plan satisfy predefined constraints. The constraints may include a first constraint that individually constrains a single cloud host, a second constraint that constrains all cloud hosts corresponding to an action combination as a whole, or both types of constraint at the same time.
The first constraint may be used to filter the actions output by the reinforcement learning model, where at least one cloud host corresponding to each filtered-out action fails to satisfy the first constraint. The first constraint may include: the resource proportion of the cloud hosts to which the tenant traffic to be scheduled is assigned does not exceed a preset resource threshold. For example, suppose there are two actions A and B and cloud hosts 1 and 2, where the available bandwidths of the two cloud hosts are a and b respectively, action A contains cloud host 1, action B contains cloud hosts 1 and 2, and the bandwidth demand of the tenant traffic to be scheduled is X with a < X < a + b. If the first constraint includes the requirement that the bandwidth demand of the tenant traffic to be scheduled must not exceed the sum of the maximum available bandwidths corresponding to an action, then action A is filtered out because the bandwidth a available from its cloud host 1 is less than the bandwidth demand of the tenant traffic to be scheduled, leaving only action B.
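A minimal sketch of this first-constraint bandwidth check, mirroring the worked example above; the function name, the data layout, and the concrete numbers are illustrative assumptions.

```python
def passes_first_constraint(action_hosts, demand_bw, avail_bw):
    """Keep an action only if the total bandwidth still available on its
    cloud hosts can cover the demand of the tenant traffic to be scheduled.

    action_hosts: list of host ids contained in this action
    demand_bw:    bandwidth required by the tenant traffic
    avail_bw:     dict mapping host id -> remaining available bandwidth
    """
    return sum(avail_bw[e] for e in action_hosts) >= demand_bw

# The example above: hosts 1 and 2 offer bandwidths a and b, and the
# demand X satisfies a < X < a + b (numbers assumed for illustration).
a, b, X = 5.0, 8.0, 10.0
avail = {1: a, 2: b}
print(passes_first_constraint([1], X, avail))     # action A -> False
print(passes_first_constraint([1, 2], X, avail))  # action B -> True
```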
The second constraint may be used to constrain all cloud hosts corresponding to an action combination as a whole. In the scheme described above, the highest-scoring action combination is determined from the candidate action combinations formed by the remaining actions; the highest-scoring action combination is the candidate combination with the highest score among all candidate combinations that satisfy the preset second constraint. The second constraint may include at least one of the following conditions: the number of cloud hosts to which the tenant traffic to be scheduled is assigned remains within a preset interval; and the cloud hosts corresponding to the action combination do not completely overlap with the cloud hosts corresponding to already-scheduled tenant traffic. Continuing the preceding example, suppose actions C and D, like action B, are not filtered out by the first constraint, where action C contains cloud hosts 1, 2, and 3, and action D contains cloud hosts 2 and 3. If the second constraint includes the requirement that the number of cloud hosts to which the tenant traffic to be scheduled is assigned must not exceed 2, then action C is filtered out because it corresponds to more than 2 cloud hosts overall. Only actions B and D then remain, so the highest-scoring action combination is determined from the scores the reinforcement learning model gives to actions B and D, and the tenant traffic to be scheduled is scheduled to the cloud hosts corresponding to that highest-scoring combination.
In an embodiment, the remaining actions may be permuted and combined to generate the candidate action combinations; all candidate action combinations are then screened according to the second constraint, the screened candidate combinations are sorted by the total score of all the actions they contain, and the candidate combination ranked first is determined to be the highest-scoring action combination.
In another embodiment, the remaining actions are sorted by the score of each action, and the remaining actions with relatively higher scores are permuted and combined first to generate candidate action combinations, until a generated candidate combination satisfies the second constraint; that candidate combination is then determined to be the highest-scoring action combination. Those skilled in the art will appreciate that sorting the same large-scale data is clearly more efficient than enumerating its permutations and combinations, and that in this embodiment, unlike the previous one, the sorting operation precedes the combination operation. Determining the highest-scoring action combination here therefore does not require traversing the scores of all action combinations; only the remaining actions with relatively higher scores are combined first, which increases the chance of determining the highest-scoring action combination early.
As mentioned above, the reinforcement learning model of this specification can be updated according to the rewards corresponding to the output actions, so as to continuously improve its processing effect and reach or approach the desired scheduling purpose. In an embodiment, the NFV platform controller may store the historical environment data, the highest-scoring historical action combination corresponding to that data, the reward corresponding to that historical action combination, and the updated historical environment data formed after the historical action combination is executed into a preset buffer, and periodically select one or more groups of historical data at random from the preset buffer to update the reinforcement learning model. The reward corresponding to a historical action combination may be calculated by a preset reward function from the historical environment data, a preset cloud host price model, and the historical action combination, where the reward magnitude is negatively correlated with the total cost of the cloud hosts corresponding to the historical action combination; this enables the reinforcement learning model to subsequently give higher scores to action combinations corresponding to cloud hosts with a lower total cost.
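A minimal sketch of such a preset buffer with periodic random sampling, using only Python's standard library; the class and field names are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action_combination, reward, next_state) transitions
    and supports the periodic random sampling used to update the model."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def store(self, state, action_combo, reward, next_state):
        self.buffer.append((state, action_combo, reward, next_state))

    def sample(self, batch_size: int):
        # Random sampling breaks the temporal correlation of the
        # experience data collected during scheduling.
        return random.sample(list(self.buffer),
                             min(batch_size, len(self.buffer)))
```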
It can be seen that the preset reward function can guide the reinforcement learning algorithm toward the optimal policy. By formulating the preset reward function appropriately, the optimization direction of the reinforcement learning model can be controlled so that, after continual updating, it reaches or approaches the desired scheduling purpose. For example, when the scheduling purpose is to minimize the total cost of the cloud hosts, the reward function may be designed as follows: (1) if an action schedules the tenant traffic to be scheduled to a previously unused cloud host $e$ (an ECS instance to which no to-be-scheduled traffic has been allocated before), the reward for that action is $-k_e$; (2) if an action schedules the tenant traffic to be scheduled to an already-used cloud host $e$, the reward is 0. After the reward is computed, it may be regularized to fall within the interval $[-1, 1]$ to facilitate subsequent update operations of the reinforcement learning model.
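The cost-minimizing reward design above might be sketched as follows; dividing by the largest per-host cost is only one assumed way to regularize the reward into $[-1, 1]$, and is not taken from the original disclosure.

```python
def reward_for_action(host_e, used_hosts, cost, k_max):
    """Reward sketch for the cost-minimization purpose: -k_e if the
    action starts a previously unused cloud host e, 0 if host e has
    already been allocated to-be-scheduled traffic. The raw value is
    then scaled into [-1, 1] (the normalization scheme is assumed).

    host_e:     id of the cloud host chosen by the action
    used_hosts: set of hosts already serving to-be-scheduled traffic
    cost:       dict mapping host id -> cost k_e of using that host
    k_max:      largest per-host cost, so the result lies in [-1, 0]
    """
    raw = -cost[host_e] if host_e not in used_hosts else 0.0
    return raw / k_max
```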
It can be seen from the above embodiments that this specification models the scheduling problem of tenant traffic to be scheduled as a multi-stage sequential decision problem suitable for reinforcement learning and, by combining a model corresponding to the reinforcement learning algorithm DQN with a filter-combination algorithm determined from the constraints, trains the reinforcement learning model with historical environment data and schedules tenant traffic to the set of cloud hosts with the highest score that meets actual needs. This multi-stage sequential decision formulation allows targeted analysis and computation of the dynamically changing characteristics of tenant traffic; compared with allocating cloud host resources based on preset fixed weights in the related art, the scheduling decisions for the tenant traffic to be scheduled in this specification can be adapted to different practical application scenarios in a timely manner, so that the NFV platform achieves the desired scheduling purposes.
The technical solution of this specification is further described below with reference to the embodiment shown in Figure 3. Figure 3 is a schematic diagram of cloud host allocation for tenant traffic according to an exemplary embodiment of this specification. As shown in Figure 3, the NFV platform controller cooperates with the reinforcement learning module to allocate the corresponding cloud hosts to a tenant according to the policy provided by the reinforcement learning module, so that those cloud hosts process the tenant's traffic. Specifically:
The NFV platform controller may obtain or maintain, respectively, tenant traffic information, cloud host resource information, a reliability policy, a price model, and scheduling decisions. The tenant traffic information and cloud host resource information constitute environment information; the NFV platform controller may provide this environment data as state information to the environment evaluation module within the reinforcement learning module, which in turn inputs the state information into the reinforcement learning model.
Assume the tenant traffic information and cloud host resource information concern the CPU resources, memory resources, and bandwidth resources provided by the cloud hosts. The state information can then be characterized as $S_i = \langle B, C, M, b_i, c_i, m_i \rangle$, where, assuming the number of all available cloud hosts is $N$, $B$, $C$, and $M$ are each vectors of length $N$ representing the remaining available bandwidth, CPU, and memory capacity on the $N$ ECS instances, and $b_i$, $c_i$, $m_i$ represent the bandwidth, CPU, and memory resources required by the traffic of tenant $i$. The state vector $S_i$ for the $i$-th state thus encapsulates the resource information of all available cloud hosts together with the resource information required by the $i$-th tenant, and its total length may be $3N + 3$. Of course, this specification does not require the environment data and the state information $S_t$ to share the same format; for example, when the environment data is in a format such as a character string, it may be converted into an equivalent vector format before being input into the reinforcement learning model.
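A sketch of how such a state vector could be assembled (NumPy-based; the concrete numbers are illustrative assumptions):

```python
import numpy as np

def build_state(B, C, M, b_i, c_i, m_i):
    """Assemble S_i = <B, C, M, b_i, c_i, m_i>: the remaining bandwidth,
    CPU, and memory capacities of the N candidate cloud hosts, followed
    by the resources required by tenant i. Total length is 3N + 3."""
    return np.concatenate([B, C, M, [b_i, c_i, m_i]]).astype(np.float32)

# Example with N = 2 cloud hosts (numbers are illustrative):
state = build_state([100, 80], [16, 8], [64, 32], b_i=20, c_i=4, m_i=16)
assert state.shape == (3 * 2 + 3,)
```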
The reinforcement learning model may be built on the DQN algorithm. After obtaining the state information $S_i$ from the environment evaluation module, the DQN model takes it as input, and the neural network contained in the DQN model computes and outputs the Q value of each action, where each action corresponds to one cloud host. The DQN model then passes the computed Q values of all actions to the filter-combination algorithm, which selects the action combination to adopt this time based on the Q values; the cloud hosts corresponding to that action combination are allocated to the tenant. In processing the actions, the filter-combination algorithm may operate with the following formulas defined in this specification:

$$\sum_{i \in D} x_{i,e}\, b_i \le U \cdot B_e, \quad \forall e \in E \tag{2}$$

$$\sum_{i \in D} x_{i,e}\, c_i \le U \cdot C_e, \quad \forall e \in E \tag{3}$$

$$\sum_{i \in D} x_{i,e}\, m_i \le U \cdot M_e, \quad \forall e \in E \tag{4}$$

$$G_{\min} \le G_i \le G_{\max}, \quad \forall i \in D \tag{5}$$

$$\sum_{e \in E} x_{i,e}\, x_{j,e} \le O, \quad \forall i, j \in D,\ i \ne j \tag{6}$$
where $D$ is the set of tenant traffic to be scheduled; $U$ is the cloud host resource threshold; $G_i$ is the number of cloud hosts used by tenant $i$; $x_{i,e}$ is a binary variable indicating whether the to-be-scheduled traffic of tenant $i$ is allocated on cloud host $e$; $b_i$, $c_i$, $m_i$ are the bandwidth, CPU, and memory resources occupied by the to-be-scheduled traffic of tenant $i$; and $B_e$, $C_e$, $M_e$ are the bandwidth, CPU, and memory resources of cloud host $e$.
Equations (2), (3), and (4) above individually constrain a single cloud host. Specifically, they limit the resource capacity of each cloud host: the percentage of resources used by all to-be-scheduled tenant traffic allocated to any cloud host should not exceed the predetermined threshold $U$, ensuring that every cloud host retains sufficient resources to cope with emergencies such as traffic surges.
Equations (5) and (6) above constrain all cloud hosts corresponding to an action combination as a whole. Specifically, equation (5) constrains the number of allocated cloud hosts: $G_i$ indicates the number of cloud hosts used by tenant $i$, and this value should not be less than a fixed minimum $G_{\min}$, since distributing the to-be-scheduled tenant traffic across multiple cloud hosts achieves high reliability; on the other hand, the value should not exceed a fixed maximum $G_{\max}$. The values of $G_{\min}$ and $G_{\max}$ can be determined according to the actual needs of the NFV platform. Equation (6) constrains the overlap of cloud hosts: the number of cloud hosts shared between any two tenants should not exceed a parameter $O$, whose actual value is decided by the NFV platform administrator. The purpose of preventing complete overlap between different tenants is to stop the dangerous traffic of some tenants from affecting the traffic of other tenants, thereby improving system reliability.
The filter-combination algorithm may first filter out the actions that do not satisfy equations (2), (3), and (4), then sort all remaining actions by their Q values, and then try permutations and combinations of actions starting from the remaining actions with relatively higher Q values. During this trial process, candidate action combinations satisfying equation (5) are obtained one after another, and the candidate combinations should be obtained in descending order of their total Q values.
For example, assume the remaining actions and their Q values are: action A1 with Q value 100, action A2 with Q value 98, action A3 with Q value 95, action A4 with Q value 90, action A5 with Q value 88, and so on. Then, assuming the number of cloud hosts indicated by equation (5) is 3 to 4, the four actions A1-A4 with the highest Q values should be selected first, and it is determined whether this action combination (A1, A2, A3, A4) satisfies equation (6): if it does, the combination (A1, A2, A3, A4) is determined to be the final action combination with the highest total Q value, and the cloud host allocation plan is generated accordingly; if it does not satisfy equation (6), another four actions are selected, until equation (6) is satisfied.
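Putting equations (2) through (6) together, the filter-combination procedure described above might be sketched as follows. For simplicity, this sketch checks each candidate host against the tenant's full demand rather than the sum over all tenants, and it tries the top-ranked actions first rather than strictly ordering every combination by total Q value; all names and parameter values are illustrative assumptions.

```python
from itertools import combinations

def select_action_combination(q_values, capacity, demand, used_by_others,
                              U=0.8, G_min=3, G_max=4, O=2):
    """Filter-combination sketch: drop actions whose host violates the
    capacity constraints (2)-(4), rank the rest by Q value, then try
    combinations of G_min..G_max actions until one also satisfies the
    overlap constraint (6); return the first feasible combination.

    q_values:       dict action/host id -> Q value (one host per action)
    capacity:       dict host id -> (B_e, C_e, M_e) remaining capacity
    demand:         (b_i, c_i, m_i) required by the tenant traffic
    used_by_others: dict tenant id -> set of host ids already serving it
    """
    b_i, c_i, m_i = demand
    # Equations (2)-(4): drop hosts that cannot take the demand under U.
    feasible = [e for e, (B, C, M) in capacity.items()
                if b_i <= U * B and c_i <= U * C and m_i <= U * M]
    feasible.sort(key=lambda e: q_values[e], reverse=True)  # high Q first
    # Equation (5): group sizes within [G_min, G_max]; the first
    # combination tried at each size is the top-ranked one.
    for size in range(G_max, G_min - 1, -1):
        for combo in combinations(feasible, size):
            # Equation (6): bounded host overlap with every other tenant.
            if all(len(set(combo) & hosts) <= O
                   for hosts in used_by_others.values()):
                return combo
    return None  # no combination satisfies all constraints
```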
It can be seen that determining the candidate action combinations in sequence in this way, compared with first generating all candidate action combinations and only then considering the constraints, can greatly improve the efficiency of generating the cloud host allocation plan.
The above constraints, such as those shown in equations (2) to (6), may be defined in the reliability policy maintained by the NFV platform controller, and the NFV platform controller may provide the relevant content of the reliability policy to the filter-combination algorithm for use in generating scheduling decisions. For example, the NFV platform controller may provide the relevant content of the reliability policy to the filter-combination algorithm indirectly through the environment evaluation module. The NFV platform controller may update the constraints in the reliability policy according to actual needs.
Further, after the filter-combination algorithm determines the cloud host allocation plan described above, it may pass the relevant information to the NFV platform controller, so that the controller can form a scheduling policy accordingly. For example, when the cloud host allocation plan selects cloud hosts 1 to 4 corresponding to actions A1 to A4, the corresponding scheduling policy may be: schedule the tenant traffic to cloud hosts 1 to 4 for processing.
As an exemplary description, when the scheduling purpose is to minimize the total cost of all cloud hosts, the price model maintained by the NFV platform controller may be a cloud host price model. The environment evaluation module may calculate the reward produced by the current cloud host allocation from this cloud host price model together with the aforementioned state information, the cloud host allocation plan, and the preset reward function. As stated earlier, by setting the scheduling purpose to minimizing the total cost of all cloud hosts, the corresponding scheduling scheme can preferentially schedule the traffic of the tenant currently to be allocated to cloud hosts that have already been allocated other tenants' traffic, raising the resource utilization of those cloud hosts within a safe range and reducing the total cost of cloud host resources. Moreover, since the number of started cloud hosts can be reduced under the same conditions, the basic resource consumption required to run those hosts, such as system operation and heat dissipation, can be saved and greenhouse gas emissions reduced, which is conducive to reaching the goals of carbon peaking and carbon neutrality at an early date.
The reinforcement learning model may maintain a buffer dataset. The above state information, cloud host allocation plan, reward, and the next state information formed after the allocation plan is executed may be recorded as one group of data in this buffer dataset; in effect, the buffer dataset stores the groups of data formed by every allocation. The reinforcement learning model may periodically select one or more groups of data from the buffer dataset for its own model update, and the selection may be random.
Those skilled in the art will understand the following:
First, the above embodiments describe the inference process performed with a reinforcement learning model whose training has been completed. This inference process enables the reinforcement learning module to provide the NFV platform controller with a scheduling policy that satisfies the scheduling purpose. Because the scheduling policy is essentially the output of the reinforcement learning model after the resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources are input into the model for processing, the scheduling policies obtained in this specification, compared with allocating cloud host resources based on preset fixed weights in the related art, can be adapted to different practical application scenarios in a timely manner.
Second, before the above inference process, the reinforcement learning model must be trained through many repeated iterations to ensure that the scheduling policy it outputs satisfies the scheduling purpose. The training process is similar to the inference process described above: the environment data in a training sample is input into the reinforcement learning model to obtain the actions and scores output by the model, and the highest-scoring action combination is determined; from that combination, it is determined whether the preset scheduling purpose has been achieved, and if not, iterative training of the reinforcement learning model continues until the model satisfies the scheduling purpose. Of course, whether to continue iterative training may also be determined by other conditions, such as whether the number of training iterations has reached a preset number.
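One iteration of such training could look like the following sketch; the separate target network and the Huber loss are common DQN practices assumed here for illustration, not details taken from the original disclosure.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.9):
    """One DQN update step on a sampled batch of transitions
    (states, action indices, rewards, next_states); gamma is the
    discount factor from the MDP tuple described earlier."""
    states, actions, rewards, next_states = batch
    # Q values predicted for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: reward plus discounted best next Q value.
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next
    loss = F.smooth_l1_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```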
In addition, as described above, even for a reinforcement learning model whose training has been completed, the groups of data cached in the buffer dataset can still be used during actual inference: by periodically selecting one or more groups of data at random, the parameters of the deep neural network used by the reinforcement learning model are updated, which helps overcome the correlation and non-stationary distribution of the experience data.
Figure 4 is a schematic structural diagram of a network functions virtualization (NFV) platform controller in an exemplary embodiment. Referring to Figure 4, at the hardware level the NFV platform controller includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may of course also include other hardware as needed. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, forming a traffic scheduling apparatus at the logical level. Of course, besides the software implementation, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logical units and may also be hardware or logic devices.
Corresponding to the foregoing embodiments of the traffic scheduling method, this specification also provides embodiments of a traffic scheduling apparatus.
Referring to Figure 5, Figure 5 is a schematic structural diagram of a traffic scheduling apparatus according to an exemplary embodiment. As shown in Figure 5, in a software implementation the apparatus may include:
an environment data determination unit 501, configured to determine environment data, where the environment data includes resource demand information of the tenant traffic to be scheduled and the actual allocation of cloud host resources;
a reinforcement learning model processing unit 502, configured to input the environment data into a pre-trained reinforcement learning model and obtain the actions output by the reinforcement learning model and the score of each action, where each action corresponds to one or more cloud hosts;
a traffic scheduling unit 503, configured to determine the action combination with the highest score and schedule the tenant traffic to be scheduled to the cloud hosts corresponding to the highest-scoring action combination, where each action combination corresponds to one or more actions.
Optionally, the reinforcement learning model includes a deep Q-network model, and the score of each action output by the reinforcement learning model is the Q value of the corresponding action.
Optionally, the apparatus further includes:
an action constraint unit 504, configured to filter the actions output by the reinforcement learning model according to a preset first constraint, where the first constraint is used to individually constrain a single cloud host, and at least one cloud host corresponding to each filtered-out action does not satisfy the first constraint;
and to determine the highest-scoring action combination from the candidate action combinations formed by the remaining actions, the highest-scoring action combination being the candidate action combination with the highest score among all candidate action combinations that satisfy a preset second constraint, where the second constraint is used to constrain all cloud hosts corresponding to an action combination as a whole.
Optionally, the action constraint unit 504 is specifically configured to:
sort the remaining actions by the score of each action, and preferentially permute and combine the remaining actions with relatively higher scores to generate the candidate action combinations, until a generated candidate action combination satisfies the second constraint, and determine that candidate action combination to be the highest-scoring action combination; or
permute and combine the remaining actions to generate the candidate action combinations, screen all candidate action combinations according to the second constraint, sort the screened candidate action combinations by the total score of all the actions they contain, and determine the candidate action combination ranked first to be the highest-scoring action combination.
Optionally, the first constraint includes: the resource proportion of the cloud hosts to which the tenant traffic to be scheduled is assigned does not exceed a preset resource threshold;
and the second constraint includes at least one of the following: the number of cloud hosts to which the tenant traffic to be scheduled is assigned remains within a preset interval; and the cloud hosts corresponding to the action combination do not completely overlap with the cloud hosts corresponding to already-scheduled tenant traffic.
Optionally, the apparatus further includes:
a reinforcement learning model update unit 505, configured to periodically select one or more groups of historical data at random from a preset buffer to update the reinforcement learning model;
where each group of historical data includes: historical environment data, the highest-scoring historical action combination corresponding to the historical environment data, the reward corresponding to the historical action combination, and the updated historical environment data formed after the historical action combination is executed.
Optionally, the reward corresponding to the historical action combination is calculated by a preset reward function from the historical environment data, a preset cloud host price model, and the historical action combination, where the reward magnitude is negatively correlated with the total cost of the cloud hosts corresponding to the historical action combination.
Optionally, the apparatus is executed when any of the following triggering conditions is satisfied:
traffic from a new tenant is received;
the resource usage of a preset number of cloud hosts reaches a preset scale-out threshold, triggering a scale-out demand; or
the resource utilization of a preset number of cloud hosts falls below a preset scale-in threshold, triggering a scale-in demand.
For the implementation of the functions and roles of each unit in the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the relevant parts of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this specification. Persons of ordinary skill in the art can understand and implement it without creative effort.
Embodiments of the subject matter and the functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks; however, a computer need not have such devices. Moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as describing features of particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
The above are only preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (10)

  1. A traffic scheduling method, characterized in that the method comprises:
    determining environment data, the environment data comprising resource demand information of tenant traffic to be scheduled and the actual allocation of cloud host resources;
    inputting the environment data into a pre-trained reinforcement learning model, and obtaining the actions output by the reinforcement learning model and a score for each action, wherein each action corresponds to one or more cloud hosts;
    determining the action combination with the highest score, and scheduling the tenant traffic to be scheduled to the cloud hosts corresponding to the action combination with the highest score, wherein each action combination corresponds to one or more actions.
  2. The method according to claim 1, characterized in that the reinforcement learning model comprises a deep Q-network model, wherein the score of each action output by the reinforcement learning model is the Q value of the corresponding action.
  3. The method according to claim 1, characterized in that determining the action combination with the highest score comprises:
    filtering the actions output by the reinforcement learning model according to a preset first constraint condition, wherein the first constraint condition is used to individually constrain a single cloud host, and at least one cloud host corresponding to each filtered-out action does not satisfy the first constraint condition;
    determining the action combination with the highest score from candidate action combinations formed by the remaining actions, the action combination with the highest score being the candidate action combination with the highest score among all candidate action combinations that satisfy a preset second constraint condition, wherein the second constraint condition is used to impose an overall constraint on all cloud hosts corresponding to an action combination.
  4. The method according to claim 3, characterized in that determining the action combination with the highest score from the candidate action combinations formed by the remaining actions comprises:
    sorting the remaining actions by their scores, and preferentially permuting and combining the remaining actions with relatively higher scores to generate the candidate action combinations, until a generated candidate action combination satisfies the second constraint condition, and determining that candidate action combination as the action combination with the highest score; or
    permuting and combining the remaining actions to generate the candidate action combinations, screening all candidate action combinations according to the second constraint condition, sorting the screened candidate action combinations by the total score of all the actions they contain, and determining the candidate action combination ranked first as the action combination with the highest score.
  5. The method according to claim 3, characterized in that:
    the first constraint condition comprises: the resource proportion of each cloud host to which the tenant traffic to be scheduled is scheduled does not exceed a preset resource threshold;
    the second constraint condition comprises at least one of the following: the number of cloud hosts to which the tenant traffic to be scheduled is scheduled remains within a preset interval; and the cloud hosts corresponding to the action combination do not completely overlap the cloud hosts corresponding to already-scheduled tenant traffic.
  6. The method according to claim 1, further comprising:
    periodically selecting one or more groups of historical data at random from a preset buffer to update the reinforcement learning model;
    wherein each group of historical data comprises: historical environment data, the historical action combination with the highest score corresponding to the historical environment data, the reward corresponding to the historical action combination, and the updated historical environment data formed after the historical action combination is executed.
  7. The method according to claim 6, characterized in that the reward corresponding to the historical action combination is calculated by a preset reward function from the historical environment data, a preset cloud host price model, and the historical action combination, wherein the magnitude of the reward is negatively correlated with the total cost of the cloud hosts corresponding to the historical action combination.
  8. The method according to claim 1, characterized in that the method is executed when any one of the following trigger conditions is met:
    traffic from a new tenant is received;
    the resource usage of a preset number of cloud hosts reaches a preset expansion threshold, triggering a scale-out requirement;
    the resource utilization of a preset number of cloud hosts falls below a preset shrinkage threshold, triggering a scale-in requirement.
  9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.
  10. A network function virtualization (NFV) platform controller, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the method according to any one of claims 1 to 8 are implemented.
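
To make the decision flow of claims 1 to 5 concrete, the following is a minimal Python sketch; it is an illustration only and not part of the claimed subject matter. It assumes the model's scored actions have already been obtained, uses a hypothetical per-host projected_usage mapping, and follows the first alternative of claim 4: actions violating the per-host first constraint are filtered out, the remainder are sorted by score, and combinations are tried in descending-score order until one satisfies the combination-level second constraint.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Action:
    hosts: frozenset   # cloud host IDs this action maps to
    score: float       # Q value output by the model for this action

def pick_best_combination(actions, projected_usage, used_hosts,
                          resource_threshold=0.8, host_range=(1, 4)):
    # First constraint (per host): drop every action that would push
    # at least one of its cloud hosts above the resource threshold.
    feasible = [a for a in actions
                if all(projected_usage[h] <= resource_threshold
                       for h in a.hosts)]
    # Try higher-scoring actions first (first alternative of claim 4).
    feasible.sort(key=lambda a: a.score, reverse=True)

    lo, hi = host_range
    for k in range(1, len(feasible) + 1):
        for combo in combinations(feasible, k):
            hosts = frozenset().union(*(a.hosts for a in combo))
            # Second constraint (whole combination): host count within
            # the preset interval, and the hosts must not completely
            # overlap those already serving scheduled tenant traffic
            # (read here as: not all contained in the used set).
            if lo <= len(hosts) <= hi and not hosts <= used_hosts:
                return combo, hosts
    return None, frozenset()

# Toy usage with made-up hosts, usages, and Q values:
acts = [Action(frozenset({"ecs-1"}), 0.9),
        Action(frozenset({"ecs-2", "ecs-3"}), 0.7)]
combo, hosts = pick_best_combination(
    acts,
    projected_usage={"ecs-1": 0.6, "ecs-2": 0.5, "ecs-3": 0.95},
    used_hosts=frozenset({"ecs-2"}))
print(sorted(hosts))  # ['ecs-1']: the ecs-3 action breaks the first constraint
```

Because the candidates are tried in descending score order, the first combination that passes the second constraint is taken as the highest-scoring feasible one; an exhaustive variant corresponding to the second alternative of claim 4 would instead enumerate all passing combinations and rank them by total score.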
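Claims 6 and 7 describe an experience-replay update and a cost-based reward. The sketch below is likewise only an illustration under stated assumptions: price_model is assumed to be a dict-like mapping from each cloud host to its price, model.train_on_batch is an assumed, model-specific training API, and the reward simplifies claim 7 to the negated total cost of the hosts in the executed combination, so that cheaper combinations earn higher rewards.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)   # the preset buffer of claim 6

def reward(env_data, price_model, action_combo):
    # Negatively correlated with the total cost of the cloud hosts
    # corresponding to the executed action combination (claim 7);
    # env_data is kept in the signature to mirror the claim, though
    # this simplified sketch derives cost from the combination alone.
    total_cost = sum(price_model[h] for a in action_combo for h in a.hosts)
    return -total_cost

def record_transition(env, combo, r, next_env):
    # One group of historical data: environment data, the executed
    # highest-scoring action combination, its reward, and the updated
    # environment data observed after execution.
    replay_buffer.append((env, combo, r, next_env))

def update_model(model, batch_size=32):
    # Periodically sample random groups of historical data from the
    # buffer and use them to refresh the reinforcement learning model.
    if len(replay_buffer) >= batch_size:
        batch = random.sample(replay_buffer, batch_size)
        model.train_on_batch(batch)   # assumed training interface
```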
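The trigger conditions of claim 8 can be checked along the following lines; the event label, the preset host count, and both thresholds are illustrative placeholders rather than values taken from the disclosure.

```python
def should_schedule(event, host_usages, preset_count=3,
                    expand_threshold=0.85, shrink_threshold=0.25):
    """True when any trigger condition of claim 8 holds."""
    if event == "new_tenant_traffic":   # traffic received from a new tenant
        return True
    over = sum(u >= expand_threshold for u in host_usages)
    under = sum(u < shrink_threshold for u in host_usages)
    # A preset number of hosts crossing either threshold triggers a
    # scale-out or scale-in scheduling round, respectively.
    return over >= preset_count or under >= preset_count

print(should_schedule(None, [0.9, 0.88, 0.86, 0.4]))  # True: three hosts hit the expansion threshold
```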
PCT/CN2023/088860 2022-04-29 2023-04-18 Traffic scheduling method WO2023207663A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210476232.9 2022-04-29
CN202210476232.9A CN114745392A (en) 2022-04-29 2022-04-29 Flow scheduling method

Publications (1)

Publication Number Publication Date
WO2023207663A1 true WO2023207663A1 (en) 2023-11-02

Family

ID=82285412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/088860 WO2023207663A1 (en) 2022-04-29 2023-04-18 Traffic scheduling method

Country Status (2)

Country Link
CN (1) CN114745392A (en)
WO (1) WO2023207663A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114745392A (en) * 2022-04-29 2022-07-12 阿里云计算有限公司 Flow scheduling method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170517A (en) * 2018-01-08 2018-06-15 武汉斗鱼网络科技有限公司 A kind of container allocation method, apparatus, server and medium
KR102154446B1 (en) * 2019-11-14 2020-09-09 한국전자기술연구원 Method for fast scheduling for resource balanced allocation on distributed and collaborative container platform environment
US11416296B2 (en) * 2019-11-26 2022-08-16 International Business Machines Corporation Selecting an optimal combination of cloud resources within budget constraints
CN112052071B (en) * 2020-09-08 2023-07-04 福州大学 Cloud software service resource allocation method combining reinforcement learning and machine learning
CN113747450B (en) * 2021-07-27 2022-12-09 清华大学 Service deployment method and device in mobile network and electronic equipment
CN113886006A (en) * 2021-09-13 2022-01-04 中铁信弘远(北京)软件科技有限责任公司 Resource scheduling method, device and equipment and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492774A (en) * 2018-11-06 2019-03-19 北京工业大学 A kind of cloud resource dispatching method based on deep learning
CN112311578A (en) * 2019-07-31 2021-02-02 中国移动通信集团浙江有限公司 VNF scheduling method and device based on deep reinforcement learning
CN111143036A (en) * 2019-12-31 2020-05-12 广东省电信规划设计院有限公司 Virtual machine resource scheduling method based on reinforcement learning
US11206221B1 (en) * 2021-06-04 2021-12-21 National University Of Defense Technology Online task dispatching and scheduling system and method thereof
CN114745392A (en) * 2022-04-29 2022-07-12 阿里云计算有限公司 Flow scheduling method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407718A (en) * 2023-12-15 2024-01-16 杭州宇谷科技股份有限公司 Training method, application method and system of battery replacement prediction model
CN117407718B (en) * 2023-12-15 2024-03-26 杭州宇谷科技股份有限公司 Training method, application method and system of battery replacement prediction model

Also Published As

Publication number Publication date
CN114745392A (en) 2022-07-12

Similar Documents

Publication Publication Date Title
WO2023207663A1 (en) Traffic scheduling method
CN110381541B (en) Smart grid slice distribution method and device based on reinforcement learning
Ben Alla et al. A novel task scheduling approach based on dynamic queues and hybrid meta-heuristic algorithms for cloud computing environment
CN104572307B (en) The method that a kind of pair of virtual resource carries out flexible scheduling
CN108958916B (en) Workflow unloading optimization method under mobile edge environment
WO2023184939A1 (en) Deep-reinforcement-learning-based adaptive efficient resource allocation method for cloud data center
US20180314971A1 (en) Training Machine Learning Models On A Large-Scale Distributed System Using A Job Server
CN114257599A (en) Adaptive finite duration edge resource management
US20140250440A1 (en) System and method for managing storage input/output for a compute environment
CN109788046B (en) Multi-strategy edge computing resource scheduling method based on improved bee colony algorithm
CN113032120B (en) Industrial field big data task cooperative scheduling method based on edge calculation
CN104168318A (en) Resource service system and resource distribution method thereof
CN115037749B (en) Large-scale micro-service intelligent multi-resource collaborative scheduling method and system
Fan et al. Multi-objective optimization of container-based microservice scheduling in edge computing
WO2024060571A1 (en) Heterogeneous computing power-oriented multi-policy intelligent scheduling method and apparatus
CN104580447A (en) Spatio-temporal data service scheduling method based on access heat
CN113806018A (en) Kubernetes cluster resource hybrid scheduling method based on neural network and distributed cache
US20220156633A1 (en) System and method for adaptive compression in federated learning
CN114443249A (en) Container cluster resource scheduling method and system based on deep reinforcement learning
CN114675975B (en) Job scheduling method, device and equipment based on reinforcement learning
CN116614394A (en) Service function chain placement method based on multi-target deep reinforcement learning
Zhang et al. Employ AI to improve AI services: Q-learning based holistic traffic control for distributed co-inference in deep learning
CN114691372A (en) Group intelligent control method of multimedia end edge cloud system
CN115809148B (en) Load balancing task scheduling method and device for edge computing
CN114968402A (en) Edge calculation task processing method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23795096

Country of ref document: EP

Kind code of ref document: A1