CN113254200A - Resource arrangement method and intelligent agent - Google Patents

Resource arrangement method and intelligent agent

Info

Publication number
CN113254200A
Authority
CN
China
Prior art keywords
resource
local environment
information
state information
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110520783.6A
Other languages
Chinese (zh)
Other versions
CN113254200B (en)
Inventor
刘晶
徐雷
毋涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202110520783.6A
Publication of CN113254200A
Application granted
Publication of CN113254200B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request

Abstract

The invention discloses a resource orchestration method and an agent, and relates to the field of computer technology. The scheme includes the following steps: acquiring global reward information and local environment state information, where the global reward information is obtained based on a preset global environment, the global environment corresponds to one or more agents, and the local environment state information is obtained from the local environment corresponding to the current agent; updating an orchestration policy according to the global reward information and the local environment state information; and, when a first resource orchestration request is received, orchestrating the resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy. Because the orchestration policy is updated from the global reward information together with the local environment state information, the correlation between the orchestration policies of different agents is reduced and the orchestration policy can be updated promptly as the environment changes, yielding a more reasonable and accurate orchestration policy and improving resource utilization.

Description

Resource arrangement method and intelligent agent
Technical Field
The invention relates to the field of computer technology, and in particular to a resource orchestration method and an agent.
Background
An agent is one of the important concepts in the field of artificial intelligence. It refers to a computing entity that can act continuously and autonomously in a given environment and that has characteristics such as residency, reactivity, sociality, and proactivity. In practical applications, a policy may be preset for the agent, and the agent performs corresponding actions based on the preset policy. However, the preset policy is usually relatively fixed, and multiple agents that update their policies based on the same environment and the same reward are correlated with one another, so the agents cannot update their policies reasonably and accurately as the environment changes.
Disclosure of Invention
Therefore, the present invention provides a resource orchestration method and an agent, aiming to solve the problem that an agent cannot reasonably and accurately update its policy as the environment changes.
In order to achieve the above object, a first aspect of the present invention provides a resource arranging method, including:
acquiring global reward information and local environment state information, wherein the global reward information is information acquired based on a preset global environment, the global environment corresponds to one or more agents, and the local environment state information is information acquired according to a local environment corresponding to a current agent;
updating an arrangement strategy according to the global reward information and the local environment state information;
in the event a first resource orchestration request is received, orchestrating resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy.
Further, before obtaining the global reward information and the local environment status information, the method further includes:
receiving a second resource scheduling request sent by the user terminal;
and arranging the resources in the historical local environment corresponding to the second resource arranging request according to the second resource arranging request, the historical local environment state information corresponding to the second resource arranging request and the historical arranging strategy corresponding to the second resource arranging request.
Further, the obtaining of the global reward information and the local environment status information includes:
and acquiring the global reward information and the local environment state information according to the resources in the historical local environment corresponding to the second resource arranging request after arrangement.
Further, the first resource orchestration request and the second resource orchestration request comprise a resource orchestration type and a resource demand.
Further, the global reward information is generated by batch processing the historical local environment state information and the local environment state information corresponding to the second resource orchestration requests of all agents in the global environment to obtain a resource balance rate and a request acceptance rate of the global environment, and then computing the reward from the resource balance rate and the request acceptance rate according to a preset reward mechanism.
Further, the orchestration policy comprises an action policy;
the updating of the orchestration policy according to the global reward information and the local environment state information includes:
and inputting the global reward information and the local environment state information into a preset action strategy prediction model so that the action strategy prediction model executes action strategy prediction operation and outputs an updated action strategy.
Further, the action policy includes one or more of a path deployment sub-policy and a routing sub-policy.
Further, said orchestrating resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy if the first resource orchestration request is received comprises:
and sending the first resource arranging request, the local environment state information and the updated arranging strategy to a resource manager, and arranging the resources in the local environment by the resource manager.
Further, the resource manager is configured to configure and orchestrate various types of resources in the local environment.
In order to achieve the above object, a second aspect of the present invention provides an agent comprising:
a first obtaining module, configured to obtain global reward information and local environment state information, wherein the global reward information is obtained based on a preset global environment, and the global environment corresponds to one or more agents;
the second acquisition module is used for acquiring local environment state information, wherein the local environment state information is acquired according to a local environment corresponding to the current agent;
the updating module is used for updating the arranging strategy according to the global reward information and the local environment state information;
and the arranging module is used for arranging the resources in the local environment based on the first resource arranging request, the local environment state information and the updated arranging strategy under the condition of receiving the first resource arranging request.
The invention has the following advantages:
according to the resource arranging method provided by the invention, each intelligent agent updates the arranging strategy according to the global reward information and the local environment state information, so that the correlation of the arranging strategies among the intelligent agents can be reduced, the arranging strategy can be updated in time according to the environment change, a more reasonable and accurate arranging strategy can be obtained, when a resource arranging request is received, the resource in the local environment corresponding to each intelligent agent is arranged according to the resource arranging request, the local environment state information and the updated arranging strategy, and the utilization rate of the resource is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a flowchart of a resource orchestration method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a resource orchestration method according to a second embodiment of the present invention;
Fig. 3 is a block diagram of an agent according to a third embodiment of the present invention;
Fig. 4 is a block diagram of a resource orchestration system according to a fourth embodiment of the present invention;
Fig. 5 is a block diagram of a resource orchestration system according to a fifth embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only and are not intended to limit the present invention.
A first aspect of the present application provides a resource orchestration method. Fig. 1 is a flowchart of a resource orchestration method according to a first embodiment of the present application. As shown in fig. 1, the resource orchestration method includes the following steps:
step S101, obtaining global reward information and local environment state information.
The global reward information is information obtained based on a preset global environment, the global environment corresponds to one or more intelligent agents, and the local environment state information is information obtained according to a local environment corresponding to the current intelligent agent.
In some embodiments, the global reward information is generated by batch processing the local environment states of all agents in the global environment to obtain a resource balance rate and a request acceptance rate of the global environment, and then computing the reward from the resource balance rate and the request acceptance rate according to a preset reward mechanism. The batch processing may be executed after a preset time period elapses, or after the number of processed resource orchestration requests reaches a preset threshold; those skilled in the art may set this flexibly according to actual requirements, and the present application is not limited in this respect.
For example, if, after the agent performs resource orchestration, both the resource balance rate and the request acceptance rate calculated from the global environment improve, the global reward information should be positive feedback. The global reward information can be a signed numerical value, where the sign indicates whether the feedback is positive or negative and the magnitude indicates the degree of positive or negative feedback.
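As an illustration only, the following sketch shows one way such a signed global reward could be derived from the two global metrics; the weighting, the comparison against the previous batch, and the helper names (resource_balance_rate, request_acceptance_rate, global_reward) are assumptions made for this example and are not specified by the patent.

    from typing import Sequence

    def resource_balance_rate(utilizations: Sequence[float]) -> float:
        """Assumed metric: 1 minus the average spread of per-resource utilization,
        so evenly loaded resources give a value close to 1."""
        if not utilizations:
            return 1.0
        mean = sum(utilizations) / len(utilizations)
        spread = sum(abs(u - mean) for u in utilizations) / len(utilizations)
        return max(0.0, 1.0 - spread)

    def request_acceptance_rate(accepted: int, received: int) -> float:
        """Fraction of resource orchestration requests that could be served."""
        return accepted / received if received else 1.0

    def global_reward(balance: float, acceptance: float,
                      prev_balance: float, prev_acceptance: float,
                      w_balance: float = 0.5, w_acceptance: float = 0.5) -> float:
        """Signed reward: positive when the weighted metrics improved over the
        previous batch, negative when they degraded; magnitude reflects how much."""
        current = w_balance * balance + w_acceptance * acceptance
        previous = w_balance * prev_balance + w_acceptance * prev_acceptance
        return current - previous

    # Example: both metrics improved after orchestration, so the feedback is positive.
    r = global_reward(balance=0.82, acceptance=0.95, prev_balance=0.70, prev_acceptance=0.90)
    print(r)  # 0.085, i.e. positive feedback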
In some embodiments, the agent obtains the local environment state information through a device with an information acquisition function; the local environment state information includes, for example, the resource types, resource occupancy, and resource idle amounts in the local environment.
It should be noted that the above local environment state information is only an example, and may be specifically set according to actual needs, and other non-described local environment state information is also within the protection scope of the present application, and is not described herein again.
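Purely as an illustration, such local environment state information could be represented by a simple structure like the following; the field names are assumptions for the example and are not terms defined by the present application.

    from dataclasses import dataclass

    @dataclass
    class LocalEnvironmentState:
        """Assumed shape of the state an agent collects from its local environment."""
        resource_type: str       # e.g. "compute", "storage", "bandwidth"
        occupied_amount: float   # amount of the resource currently in use
        idle_amount: float       # amount of the resource still available

    local_state = [
        LocalEnvironmentState("compute", occupied_amount=48.0, idle_amount=16.0),
        LocalEnvironmentState("bandwidth", occupied_amount=3.2, idle_amount=6.8),
    ]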
It should be further noted that the preset reward mechanism itself is also an iterative update mechanism, so that more reasonable and accurate global reward information can be obtained, and accordingly, the update of the orchestration strategy is more reasonable and accurate.
And step S102, updating the arrangement strategy according to the global reward information and the local environment state information.
The arrangement strategy refers to a strategy for arranging various resources in a local environment.
In some embodiments, the orchestration policy comprises an action policy comprising one or more of a path deployment sub-policy and a routing sub-policy. Accordingly, the actions of the agent include path deployment and routing configuration.
In some embodiments, the global reward information and the local environment state information are input into a preset action policy prediction model, so that the action policy prediction model performs an action policy prediction operation and outputs an updated action policy. The action policy prediction model is a model constructed based on the Actor-Critic algorithm. The Actor-Critic algorithm combines two types of reinforcement learning algorithms, the policy-based Policy Gradient and the value-based Q-Learning: the Actor selects an action according to a probability distribution, the Critic scores the action chosen by the Actor, and the Actor then adjusts the action probabilities according to the Critic's score. On this basis, the Actor-Critic algorithm can effectively handle the selection of continuous actions and also supports single-step updates.
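The patent does not give an implementation, but a minimal single-step Actor-Critic update of the kind described above could look like the following sketch; PyTorch, the network sizes, the learning rate, and the discrete action space are all assumptions made here for illustration.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class ActorCritic(nn.Module):
        """Actor outputs a probability distribution over actions; Critic outputs a state value."""
        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.actor = nn.Linear(hidden, n_actions)   # action logits
            self.critic = nn.Linear(hidden, 1)          # state value V(s)

        def forward(self, state: torch.Tensor):
            h = self.shared(state)
            return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

    def single_step_update(model, optimizer, state, action, reward, next_state, gamma=0.99):
        """One-step TD update: the Critic's TD error scores the Actor's action,
        and the Actor shifts probability toward or away from that action."""
        dist, value = model(state)
        with torch.no_grad():
            _, next_value = model(next_state)
            td_target = reward + gamma * next_value
        td_error = td_target - value                       # Critic's evaluation of the action
        actor_loss = -dist.log_prob(action) * td_error.detach()
        critic_loss = td_error.pow(2)
        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()

    # Example with made-up dimensions: 6 state features, 4 candidate orchestration actions
    # (e.g. path deployment and routing configuration choices).
    model = ActorCritic(state_dim=6, n_actions=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    s = torch.randn(6)
    dist, _ = model(s)
    a = dist.sample()
    single_step_update(model, optimizer, s, a, reward=torch.tensor(0.085), next_state=torch.randn(6))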
Step S103, in the case of receiving the first resource arranging request, arranging the resource in the local environment based on the first resource arranging request, the local environment state information and the updated arranging strategy.
In some embodiments, the global environment includes various types of servers, storage devices, network devices, and so on, and the resources of these devices are virtualized into various types of virtual resources. The global environment corresponds to a plurality of agents, the actions of each agent include VNF (Virtual Network Function) multipath deployment and traffic flow routing, and the corresponding action policy includes a path policy and a routing policy.
Assume that the first resource orchestration request comprises a resource orchestration type and a resource demand. After receiving the first resource arranging request, the agent arranges various virtual resources in the local environment corresponding to the agent according to the resource arranging type and the resource demand amount corresponding to the first resource arranging request, the current local environment state information of the agent and the updated arranging strategy.
In some embodiments, the first resource orchestration request, the local environment state information, and the updated orchestration policy are sent to a resource manager, and the resource manager orchestrates the resources in the local environment. The resource manager is used to configure and orchestrate various types of resources in the local environment. For example, the resource manager includes an SFC (Service Function Chaining) orchestrator, an SDN (Software Defined Networking) controller, an edge computing orchestrator, an NFV (Network Functions Virtualization) orchestrator, and the like.
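As a sketch only, the hand-off to a resource manager could be organized along the following lines; the ResourceManager interface, the handle_request helper, and the reuse of the ActorCritic model from the earlier sketch are assumptions, not interfaces defined by the patent.

    from typing import Protocol, Sequence
    import torch

    class ResourceManager(Protocol):
        """Assumed interface for whatever applies the decision locally, e.g. an adapter
        around an SFC orchestrator, SDN controller, or NFV orchestrator."""
        def orchestrate(self, request: dict, local_state: dict, action: int) -> None: ...

    def handle_request(model, resource_manager: ResourceManager, request: dict,
                       state_vector: Sequence[float], local_state: dict) -> int:
        """On receiving a resource orchestration request, choose an action with the
        updated policy and hand the request, state, and decision to the resource manager."""
        dist, _ = model(torch.as_tensor(state_vector, dtype=torch.float32))
        action = int(dist.sample())
        resource_manager.orchestrate(request, local_state, action)
        return action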
It should be noted that, after receiving the first resource orchestration request, the agent may also receive other resource orchestration requests, and in the case of receiving other resource orchestration requests, the agent iteratively orchestrates resources according to the current resource orchestration method and updates the orchestration policy.
According to the resource orchestration method provided by this embodiment, each agent updates its orchestration policy according to the global reward information and the local environment state information. This reduces the correlation between the orchestration policies of different agents and allows each policy to be updated promptly as the environment changes, yielding a more reasonable and accurate orchestration policy. When a resource orchestration request is received, the resources in the local environment corresponding to each agent are orchestrated according to the resource orchestration request, the local environment state information, and the updated orchestration policy, which improves resource utilization.
Fig. 2 is a flowchart of a resource orchestration method according to a second embodiment of the present application. As shown in fig. 2, the resource arranging method includes the following steps:
step S201, receiving a second resource scheduling request sent by the user terminal.
Updating the orchestration policy is an iterative process for the agent. Each time the agent receives a resource orchestration request, it performs resource orchestration according to the current orchestration policy; the local environment state information then changes and new global reward information is generated accordingly. Based on this, the agent updates the orchestration policy according to the new global reward information and the changed local environment state information to obtain a new orchestration policy, and this process is repeated until a preset stopping condition is met. The stopping condition may be, for example, that the number of iterations reaches a preset iteration threshold; in practical applications the stopping condition can be set flexibly according to requirements, and the present application is not limited in this respect.
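A minimal sketch of this iterate-orchestrate-update loop could look like the following; it assumes the ActorCritic model and single_step_update helper from the earlier sketch, and the env.state(), env.apply(), and reward_feed.latest_global_reward() calls are placeholders for the local environment and the shared global reward source.

    import torch

    def run_agent(env, model, optimizer, requests, reward_feed, max_iterations=1000):
        """Iteratively orchestrate: act on each incoming request with the current policy,
        observe the changed local state and the new global reward, update the policy,
        and stop once the preset iteration threshold is reached."""
        for i, request in enumerate(requests):
            if i >= max_iterations:          # preset stopping condition
                break
            state = torch.as_tensor(env.state(), dtype=torch.float32)
            dist, _ = model(state)
            action = dist.sample()
            env.apply(request, int(action))  # orchestration changes the local environment
            next_state = torch.as_tensor(env.state(), dtype=torch.float32)
            reward = torch.tensor(reward_feed.latest_global_reward())  # shared global reward
            single_step_update(model, optimizer, state, action, reward, next_state)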
In some embodiments, the agent receives a second resource orchestration request before obtaining the global reward information and the local environment state information corresponding to the first resource orchestration request, wherein the second resource orchestration request comprises a resource orchestration type and a resource demand.
Step S202, arranging the resources in the historical local environment corresponding to the second resource arranging request according to the second resource arranging request, the historical local environment state information corresponding to the second resource arranging request and the historical arranging strategy corresponding to the second resource arranging request.
The historical local environment state information corresponding to the second resource orchestration request refers to the local environment state information of the current agent before the agent executes the first resource orchestration request (the second resource orchestration request precedes the first resource orchestration request in time, so its local environment state information is historical relative to the first resource orchestration request). Similarly, the historical orchestration policy corresponding to the second resource orchestration request refers to the orchestration policy of the current agent before the first resource orchestration request is executed.
The method for arranging the resources in the historical local environment corresponding to the second resource arrangement request by the agent according to the second resource arrangement request, the historical local environment state information corresponding to the second resource arrangement request, and the historical arrangement policy corresponding to the second resource arrangement request is similar to that in step S103 in the first embodiment of the present application, and details are not repeated here.
Step S203, obtaining the global reward information and the local environment state information.
And step S204, updating the arranging strategy according to the global reward information and the local environment state information.
Step S205, in case of receiving the first resource orchestration request, orchestrating resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy.
Steps S203 to S205 in this embodiment are the same as steps S101 to S103 in the first embodiment of the present application, and are not described herein again.
The steps of the above methods are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and all such variants are within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to the algorithms or processes, or introducing insignificant designs, without changing the core design of the algorithms or processes, is also within the protection scope of the patent.
A second aspect of the present application provides an agent. Fig. 3 is a block diagram of an agent according to a third embodiment of the present application. As shown in fig. 3, the agent includes: a first obtaining module 301, a second obtaining module 302, an updating module 303 and an orchestration module 304.
The first obtaining module 301 is configured to obtain global reward information and local environment status information.
The global reward information is information obtained based on a preset global environment, the global environment corresponds to one or more intelligent agents, and the local environment state information is information obtained according to a local environment corresponding to the current intelligent agent.
In some embodiments, the global reward information is generated by batch processing the local environment states of all agents in the global environment to obtain a resource balance rate and a request acceptance rate of the global environment, and then computing the reward from the resource balance rate and the request acceptance rate according to a preset reward mechanism. The batch processing may be executed after a preset time period elapses, or after the number of processed resource orchestration requests reaches a preset threshold; those skilled in the art may set this flexibly according to actual requirements, and the present application is not limited in this respect.
For example, if, after the agent performs resource orchestration, both the resource balance rate and the request acceptance rate calculated from the global environment improve, the global reward information should be positive feedback. The global reward information can be a signed numerical value, where the sign indicates whether the feedback is positive or negative and the magnitude indicates the degree of positive or negative feedback.
It should be noted that the preset reward mechanism itself is also an iterative update mechanism, so that more reasonable and accurate global reward information can be obtained, and accordingly, the update of the orchestration strategy is more reasonable and accurate.
A second obtaining module 302, configured to obtain the local environment status information.
In some embodiments, the agent obtains the local environment state information through the second obtaining module 302 with an information collecting function, where the local environment state information includes information of resource type, resource occupancy amount, resource idle amount, and the like in the local environment.
It should be noted that the above local environment state information is only an example, and may be specifically set according to actual needs, and other non-described local environment state information is also within the protection scope of the present application, and is not described herein again.
And the updating module 303 is configured to update the orchestration policy according to the global reward information and the local environment state information.
The arrangement strategy refers to a strategy for arranging various resources in a local environment.
In some embodiments, the orchestration policy comprises an action policy comprising one or more of a path deployment sub-policy and a routing sub-policy. Accordingly, the actions of the agent include path deployment and routing configuration.
In some embodiments, the updating module 303 updates the orchestration policy according to the global reward information and the local environment state information, including:
and inputting the global reward information and the local environment state information into a preset action strategy prediction model so that the action strategy prediction model executes action strategy prediction operation and outputs an updated action strategy. The action strategy prediction model is a model constructed based on an Actor-Critic algorithm.
The orchestration module 304 is configured to, upon receiving the first resource orchestration request, orchestrate resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy.
In some embodiments, the global environment includes various types of servers, storage devices, network devices, etc., and the resources of these devices are virtualized into various types of virtual resources. The global environment corresponds to a plurality of agents, the action of each agent comprises VNF multi-path deployment and traffic flow routing, and the corresponding action strategy comprises a path strategy and a routing strategy.
Assume that the first resource orchestration request comprises a resource orchestration type and a resource demand. After receiving the first resource arranging request, the agent arranges various virtual resources in the local environment corresponding to the agent according to the resource arranging type and the resource demand amount corresponding to the first resource arranging request, the current local environment state information of the agent and the updated arranging strategy.
In some embodiments, the first resource orchestration request, the local environment state information, and the updated orchestration policy are sent to a resource manager, and the resource manager orchestrates the resources in the local environment. The resource manager is used to configure and orchestrate various types of resources in the local environment. For example, the resource managers include SFC orchestrators, SDN controllers, edge computing orchestrators, NFV orchestrators, and the like.
It should be noted that, after receiving the first resource orchestration request, the agent may also receive other resource orchestration requests, and in the case of receiving other resource orchestration requests, the agent iteratively orchestrates resources according to the current resource orchestration method and updates the orchestration policy.
The agent provided by this embodiment updates its orchestration policy according to the global reward information and the local environment state information. This reduces the correlation between the orchestration policies of different agents and allows each policy to be updated promptly as the environment changes, yielding a more reasonable and accurate orchestration policy. When a resource orchestration request is received, the resources in the local environment corresponding to each agent are orchestrated according to the resource orchestration request, the local environment state information, and the updated orchestration policy, which improves resource utilization.
A third aspect of the present application provides a resource orchestration system. Fig. 4 is a block diagram of a resource arrangement system according to a fourth embodiment of the present application. As shown in FIG. 4, the resource orchestration system 400 comprises: agent 410, reward module 420, environment 430, and resource manager 440.
The environment 430 includes n resources, from the first resource 431 to the nth resource 43n, where n is an integer greater than or equal to 1. The agent 410 includes a preset and updatable orchestration policy 411 and a preset executable action 412. The reward module 420 is configured to determine a request acceptance rate 421 and a resource balance rate 422 according to the environment state information acquired from the environment 430 and a preset reward mechanism, and to determine global reward information according to the request acceptance rate 421 and the resource balance rate 422, so that the agent 410 can refer to the global reward information when updating the orchestration policy. The resource manager 440 is configured to, after receiving a resource orchestration request, execute the orchestration of each resource in the environment 430 according to the orchestration policy 411 of the agent 410, the local environment state information, and the specific request content in the resource orchestration request.
It should be noted that the first resource 431 to the nth resource 43n may be resources of the same type or of different types; this embodiment is merely exemplary, other resource distributions are also within the scope of the present application, and the present application is not limited in this respect.
It should be further noted that the reward information output by the reward module 420 is global in nature: the local environment state information corresponding to all agents in the global environment (only one agent is shown in fig. 4 as an example; other agents are not shown) is batch processed to obtain the resource balance rate and the request acceptance rate of the global environment, and the global reward information is then generated from the resource balance rate and the request acceptance rate according to the preset reward mechanism.
Fig. 5 is a block diagram of a resource orchestration system according to a fifth embodiment of the present application. As shown in fig. 5, the resource orchestration system mainly includes: an Actor-Critic network model 500; m agents, such as a first agent 511, a second agent 521, and an m-th agent 5m1, where m is an integer greater than or equal to 1; a first local environment 512, a second local environment 522, and an m-th local environment 5m2 corresponding to the agents; and an experience pool formed by a first experience 513, a second experience 523, and an m-th experience 5m3. The Actor-Critic network model 500 is a network model constructed based on the Actor-Critic algorithm, and it can obtain global reward information from the set of local environment state information of all agents, so that each agent can refer to the global reward information when updating its orchestration policy.
Each agent corresponds to its own local environment and operates asynchronously based on that local environment, and the set of local environments of all agents constitutes the global environment. Taking the first agent 511 as an example, the first agent 511 obtains first local environment state information S1_t from the first local environment 512, and upon receiving a resource orchestration request, outputs an action a1_t to the first local environment 512 according to the first local environment state information S1_t, the global reward information r_i, and the current orchestration policy. Here i denotes the number of updates of the global reward information, and t denotes the number of changes of the local environment state information of the first agent 511; t is related to the number of resource orchestration requests received by the first agent 511. The second agent 521 through the m-th agent 5m1 have the same functions and use the same resource orchestration method as the first agent 511, which are not described again here (j denotes the number of changes of the second local environment state information of the second agent 521, and k denotes the number of changes of the m-th local environment state information of the m-th agent 5m1).
For example, in the initial state, the first local environment state information of the first agent 511 is S1_0. After receiving a resource orchestration request, the first agent 511 executes action a1_0 according to the content of the resource orchestration request (such as the resource type and resource quantity), S1_0, and an initial orchestration policy, where the initial orchestration policy may be a preset policy. After the resource orchestration operation is performed, the local environment state of the first agent 511 changes from S1_0 to S1_1, and the first agent 511 updates the orchestration policy according to S1_1 and the current global reward information r_1. After receiving a new resource orchestration request, the first agent 511 iteratively performs the resource orchestration and the orchestration policy update according to the method described above.
It should be noted that the global reward information r_i is obtained from the experiences of all agents in the global environment. Specifically, after each agent performs resource orchestration several times, its local environment state information, action information, and global reward information constitute that agent's experience. For example, the first agent 511 corresponds to the first experience 513, the second agent corresponds to the second experience 523, and the m-th agent corresponds to the m-th experience 5m3; these m sets of experiences form the experience pool. When the preset batch processing condition is reached, the Actor-Critic network model 500 batch processes the experiences in the experience pool, obtains new global reward information, and issues the new global reward information to each agent. Each agent then updates its current orchestration policy according to the new global reward information and performs resource orchestration based on the updated orchestration policy.
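As an illustrative sketch only (the pool structure, the batch condition, and the reuse of the global_reward helper from the earlier example are assumptions), the shared experience pool and the periodic batch processing might be organized as follows:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Experience:
        """One agent's record: local states, actions taken, and the global reward it last received."""
        states: List[list] = field(default_factory=list)
        actions: List[int] = field(default_factory=list)
        last_reward: float = 0.0

    class ExperiencePool:
        """Collects the experiences of all m agents and, once a batch condition is met,
        recomputes the shared global reward and pushes it back to every agent."""
        def __init__(self, num_agents: int, batch_size: int = 32):
            self.experiences = [Experience() for _ in range(num_agents)]
            self.batch_size = batch_size
            self._pending = 0
            self._prev_balance, self._prev_acceptance = 1.0, 1.0

        def record(self, agent_id: int, state: list, action: int) -> None:
            exp = self.experiences[agent_id]
            exp.states.append(state)
            exp.actions.append(action)
            self._pending += 1

        def maybe_batch_process(self, balance: float, acceptance: float) -> Optional[float]:
            """When enough orchestration steps have accumulated, compute a new global reward
            from the global metrics and distribute it to all agents."""
            if self._pending < self.batch_size:
                return None
            reward = global_reward(balance, acceptance, self._prev_balance, self._prev_acceptance)
            self._prev_balance, self._prev_acceptance = balance, acceptance
            for exp in self.experiences:
                exp.last_reward = reward
            self._pending = 0
            return reward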
In this embodiment, a plurality of agents run asynchronously, the Actor-Critic network model generates global reward information based on the global environment information, and each agent updates its local orchestration policy according to the shared global reward information while interacting with its local environment, so that the correlation between the orchestration policies of different agents is weakened and more reasonable and accurate orchestration policies can be obtained.
It should be noted that each module referred to in this embodiment is a logical module; in practical applications, a logical unit may be one physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other elements exist in this embodiment.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A method of resource orchestration, comprising:
acquiring global reward information and local environment state information, wherein the global reward information is information acquired based on a preset global environment, the global environment corresponds to one or more agents, and the local environment state information is information acquired according to a local environment corresponding to a current agent;
updating an arrangement strategy according to the global reward information and the local environment state information;
in the event a first resource orchestration request is received, orchestrating resources in the local environment based on the first resource orchestration request, the local environment state information, and the updated orchestration policy.
2. The method of claim 1, wherein prior to obtaining the global reward information and the local environment status information, further comprising:
receiving a second resource scheduling request sent by the user terminal;
and arranging the resources in the historical local environment corresponding to the second resource arranging request according to the second resource arranging request, the historical local environment state information corresponding to the second resource arranging request and the historical arranging strategy corresponding to the second resource arranging request.
3. The method of claim 2, wherein obtaining global reward information and local environment status information comprises:
and acquiring the global reward information and the local environment state information according to the resources in the historical local environment corresponding to the second resource arranging request after arrangement.
4. The resource orchestration method according to claim 2, wherein the first resource orchestration request and the second resource orchestration request comprise a resource orchestration type and a resource demand.
5. The resource orchestration method according to claim 2 or 3, wherein the global reward information is generated by batch processing the historical local environment state information and the local environment state information corresponding to the second resource orchestration requests of all agents in the global environment to obtain a resource balance rate and a request acceptance rate of the global environment, and then computing the reward from the resource balance rate and the request acceptance rate according to a preset reward mechanism.
6. The resource orchestration method according to claim 1, wherein the orchestration policy comprises an action policy;
the updating of the orchestration policy according to the global reward information and the local environment state information includes:
and inputting the global reward information and the local environment state information into a preset action strategy prediction model so that the action strategy prediction model executes action strategy prediction operation and outputs an updated action strategy.
7. The method of claim 6, wherein the action policy comprises one or more of a path deployment sub-policy and a routing sub-policy.
8. The resource orchestration method according to claim 1, wherein orchestrating resources in the local environment based on the first resource orchestration request, the local environment state information, and an updated orchestration policy if the first resource orchestration request is received comprises:
and sending the first resource arranging request, the local environment state information and the updated arranging strategy to a resource manager, and arranging the resources in the local environment by the resource manager.
9. The method of claim 8, wherein the resource manager is configured to configure and orchestrate various types of resources in the local environment.
10. An agent, comprising:
a first obtaining module, configured to obtain global reward information and local environment state information, wherein the global reward information is obtained based on a preset global environment, and the global environment corresponds to one or more agents;
the second acquisition module is used for acquiring local environment state information, wherein the local environment state information is acquired according to a local environment corresponding to the current agent;
the updating module is used for updating the arranging strategy according to the global reward information and the local environment state information;
and the arranging module is used for arranging the resources in the local environment based on the first resource arranging request, the local environment state information and the updated arranging strategy under the condition of receiving the first resource arranging request.
CN202110520783.6A 2021-05-13 2021-05-13 Resource arrangement method and intelligent agent Active CN113254200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110520783.6A CN113254200B (en) 2021-05-13 2021-05-13 Resource arrangement method and intelligent agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110520783.6A CN113254200B (en) 2021-05-13 2021-05-13 Resource arrangement method and intelligent agent

Publications (2)

Publication Number Publication Date
CN113254200A true CN113254200A (en) 2021-08-13
CN113254200B CN113254200B (en) 2023-06-09

Family

ID=77181535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110520783.6A Active CN113254200B (en) 2021-05-13 2021-05-13 Resource arrangement method and intelligent agent

Country Status (1)

Country Link
CN (1) CN113254200B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117499491A (en) * 2023-12-27 2024-02-02 杭州海康威视数字技术股份有限公司 Internet of things service arrangement method and device based on double-agent deep reinforcement learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018212918A1 (en) * 2017-05-18 2018-11-22 Microsoft Technology Licensing, Llc Hybrid reward architecture for reinforcement learning
CN110418416A (en) * 2019-07-26 2019-11-05 东南大学 Resource allocation methods based on multiple agent intensified learning in mobile edge calculations system
US20190347371A1 (en) * 2018-05-09 2019-11-14 Volvo Car Corporation Method and system for orchestrating multi-party services using semi-cooperative nash equilibrium based on artificial intelligence, neural network models,reinforcement learning and finite-state automata
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
US20200090074A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN111585811A (en) * 2020-05-06 2020-08-25 郑州大学 Virtual optical network mapping method based on multi-agent deep reinforcement learning
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN112584347A (en) * 2020-09-28 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) UAV heterogeneous network multi-dimensional resource dynamic management method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018212918A1 (en) * 2017-05-18 2018-11-22 Microsoft Technology Licensing, Llc Hybrid reward architecture for reinforcement learning
US20190347371A1 (en) * 2018-05-09 2019-11-14 Volvo Car Corporation Method and system for orchestrating multi-party services using semi-cooperative nash equilibrium based on artificial intelligence, neural network models,reinforcement learning and finite-state automata
US20200090074A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN110418416A (en) * 2019-07-26 2019-11-05 东南大学 Resource allocation methods based on multiple agent intensified learning in mobile edge calculations system
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111585811A (en) * 2020-05-06 2020-08-25 郑州大学 Virtual optical network mapping method based on multi-agent deep reinforcement learning
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN112584347A (en) * 2020-09-28 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) UAV heterogeneous network multi-dimensional resource dynamic management method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DENG JINSHENG et al.: "Intelligent Agents Using Competitive Bidding Mechanism in Storage Grids", 2008 Fourth International Conference on Semantics, Knowledge and Grid, pages 404-407
NAJAMUL DIN et al.: "Mobility-Aware Resource Allocation in Multi-Access Edge Computing Using Deep Reinforcement Learning", 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pages 202-209
张鸿 et al.: "User Association and Resource Allocation Strategy for Load Balancing in a Converged Time- and Wavelength-Division Multiplexed Passive Optical Network and Cloud Radio Access Network Architecture", Journal of Electronics & Information Technology, vol. 43, no. 9, pages 2672-2679
陆瑾洋: "Research and Implementation of Resource Allocation Mechanisms in Social Networks", China Master's Theses Full-text Database, Information Science and Technology, no. 3, pages 139-120

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117499491A (en) * 2023-12-27 2024-02-02 杭州海康威视数字技术股份有限公司 Internet of things service arrangement method and device based on double-agent deep reinforcement learning
CN117499491B (en) * 2023-12-27 2024-03-26 杭州海康威视数字技术股份有限公司 Internet of things service arrangement method and device based on double-agent deep reinforcement learning

Also Published As

Publication number Publication date
CN113254200B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
Liu et al. A hierarchical framework of cloud resource allocation and power management using deep reinforcement learning
Ananthanarayanan et al. GRASS: Trimming stragglers in approximation analytics
US10108458B2 (en) System and method for scheduling jobs in distributed datacenters
WO2017176333A1 (en) Batching inputs to a machine learning model
CN109117252B (en) Method and system for task processing based on container and container cluster management system
JP2023055853A (en) Learned dynamically controlling combined system
CN111143039B (en) Scheduling method and device of virtual machine and computer storage medium
CN112114973A (en) Data processing method and device
CN111738446A (en) Scheduling method, device, equipment and medium of deep learning inference engine
CN113448728B (en) Cloud resource scheduling method, device, equipment and storage medium
US10755175B2 (en) Early generation of individuals to accelerate genetic algorithms
Maruf et al. Extending resources for avoiding overloads of mixed‐criticality tasks in cyber‐physical systems
CN116467061A (en) Task execution method and device, storage medium and electronic equipment
Tchernykh et al. Mitigating uncertainty in developing and applying scientific applications in an integrated computing environment
CN113254200A (en) Resource arrangement method and intelligent agent
CN116302448B (en) Task scheduling method and system
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN113535346B (en) Method, device, equipment and computer storage medium for adjusting thread number
CN111049900B (en) Internet of things flow calculation scheduling method and device and electronic equipment
CN110795075B (en) Data processing method and device for software programming
CN113177632A (en) Model training method, device and equipment based on pipeline parallelism
CN112306670A (en) Server cluster optimization method under Docker virtualization scene
CN116523030B (en) Method and device for training resources by dynamic scheduling model
US11934870B2 (en) Method for scheduling a set of computing tasks in a supercomputer
US11941421B1 (en) Evaluating and scaling a collection of isolated execution environments at a particular geographic location

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant