CN111629037B - Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning - Google Patents

Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning

Info

Publication number
CN111629037B
CN111629037B CN202010408027.XA CN202010408027A
Authority
CN
China
Prior art keywords
node
packet
nodes
time
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010408027.XA
Other languages
Chinese (zh)
Other versions
CN111629037A (en)
Inventor
陆佃杰
贺明鑫
张桂娟
田杰
刘弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Data Trading Co ltd
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202010408027.XA priority Critical patent/CN111629037B/en
Publication of CN111629037A publication Critical patent/CN111629037A/en
Application granted granted Critical
Publication of CN111629037B publication Critical patent/CN111629037B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/63 Routing a service request depending on the request content or context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/02 Topology update or discovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/74 Address processing for routing
    • H04L45/742 Route cache; Operation thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion

Abstract

The invention discloses a dynamic cloud content distribution network content placement method based on collaborative reinforcement learning, which comprises the following steps: establishing a dynamic cloud content distribution network; taking all cloud proxy servers in a network as nodes, and determining a source node and a destination node set; establishing a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning based on a source node and a destination node; on the basis of CRL-CPM, a time-varying distribution tree is constructed through a time-varying distribution tree construction algorithm based on CRL, and content distribution is carried out by utilizing the distribution tree.

Description

Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning
Technical Field
The invention belongs to the technical field of cloud content distribution networks, and particularly relates to a dynamic cloud content distribution network content placement method based on collaborative reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the increase in the number of internet users and in user network requests, more and more data packets are transmitted in the network, which results in serious network congestion. A Content Delivery Network (CDN) deploys a large number of servers to transmit content from an origin server to CDN edge node servers, so that users can access content nearby; this reduces user waiting time and also alleviates network congestion at the central server. However, traditional content distribution network servers are expensive to deploy and maintain and are difficult to scale. Nowadays, with the rapid development of cloud computing, a content distribution network can lease cloud resources to deploy servers as required, forming a Cloud Content Delivery Network (CCDN). Leased cloud servers are low in cost and can therefore be widely deployed around the world, with many cloud servers connected to and controlled by an origin server. The cloud content distribution network thus has strong scalability, high flexibility and other characteristics, and its deployment cost is reduced.
In a CCDN, a large number of cloud servers are deployed at the "edge" of the internet and are used to strategically store the content of the origin server. Content placement is one of the key technologies of CCDN, and Content Providers (CPs) usually optimize distribution cost by building multicast trees; however, most distribution tree construction algorithms are better suited to static networks whose state does not change. Since new cloud servers may be rented and some servers shut down at night when traffic is low, the number and positions of the cloud proxy servers in a CCDN change, i.e., the network is dynamic. The traditional content placement method only provides a fixed distribution path and is not suitable for a network whose congestion state changes dynamically.
The inventors found in their research that Reinforcement Learning (RL) models the problem as a Markov Decision Process (MDP) and gives a return each time an action is performed; an Agent obtains a long-term optimal return by continuously trying to improve its action scheme in the environment. Reinforcement learning solves the optimization problem of a single Agent, which selects the optimal behavior by exploring the environment, but a single Agent exploring the environment is costly and inefficient.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a dynamic cloud content distribution network content placement method based on collaborative reinforcement learning, and the method can adapt to a dynamically changing network and select a path with low congestion cost.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the dynamic cloud content distribution network content placement method based on collaborative reinforcement learning comprises the following steps:
establishing a dynamic cloud content distribution network;
taking all cloud proxy servers in a network as nodes, and determining a source node and a destination node set;
establishing a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning based on a source node and a destination node;
on the basis of CRL-CPM, a time-varying distribution tree is constructed through a time-varying distribution tree construction algorithm based on CRL, and content distribution is carried out by utilizing the distribution tree.
The further technical scheme is that a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning:
the action of the adjacent nodes is converged by an optimal strategy obtained after the adjacent nodes share the self exploration environment, the node receives the optimal strategy broadcasted by the adjacent node and compares the optimal strategy with the local strategy of the node, and one optimal strategy is selected from the local strategy and the strategy broadcasted by the adjacent node to execute the next action;
the efficiency of exploring the environment is improved by sharing own strategy through adjacent agents in the cooperative reinforcement learning, and the self-adaptive distribution tree established in the CCDN by the CRL method can self-adaptively adjust the path according to the existence condition of the nodes and the congestion condition of the nodes, so that the content can be placed more quickly.
In a further technical scheme, in the collaborative reinforcement learning, the dynamic Agent set N = {n_1, n_2, ..., n_m} corresponds to the nodes in the CCDN, i.e., the set V in the directed graph.
In a further technical scheme, for each Agent n_i there is one dynamic neighbor Agent set M_i for storing the neighbor nodes surrounding Agent n_i, wherein M_i ⊆ N.
In a further technical scheme, S is the set of all states in the CCDN. Each Agent n_i has a unique corresponding state s_i and stores a set of states S_{n_i} ⊆ S. s_i represents the state in which data has been transmitted to the current Agent n_i, while the other states represent states in which the Agent successfully sends data to a neighboring Agent through an executable action; normally, failure to execute an action transfers the Agent back to state s_i.
The further technical scheme is that a time-varying distribution tree construction algorithm is used for constructing a time-varying distribution tree:
constructing a Cache table;
through a node broadcast algorithm, the node broadcasts own strategy value to surrounding neighbor nodes and updates own cooperative reinforcement learning Q value through the neighbor strategy value received by the node;
a path is selected from each destination node to its terminating node through a learning-based Q value updating algorithm, and the Q value is updated;
and establishing a path from the source node to the destination node through the reverse routing information to construct the CRL-TDT.
In a further technical scheme, the Cache table of each node is composed of eight fields: Termination Node is the terminating node of the transmission task; for convenience, only one source node is defined as the terminating node;
Advertisement Node is the Agent that sent the broadcast;
Action is the delegation operation by which Agent n_i transmits data to Agent n_j;
Memory Ratio is the storage remaining rate of Agent n_j;
Load Ratio is the load remaining rate of Agent n_j;
Transmission Time is the difference between the time at which Agent n_i receives the broadcast and the time at which Agent n_j sent it;
V value is the newly broadcast V_j received by Agent n_i from Agent n_j;
Q value is the estimated reward value for selecting this action.
In a further technical scheme, each Agent n_i has an update program: when Agent n_i discovers a new neighbor through the discovery operation, the program adds the newly discovered neighbor Agent to the neighbor set M_i and adds information about the new neighbor to Cache_i;
if the newly broadcast V value of a certain neighbor Agent is not received for a long time, that neighbor is deleted from the set M_i and its cached information is also deleted from Cache_i.
According to the further technical scheme, congestion information obtained by exploring an external environment by a node in the CCDN needs to be broadcast to adjacent nodes, an advertisement packet is defined, and strategy return value information of the current Agent is stored; when an initial node finds an optimal path leading to a termination node, a reverse route is required to be established so that a source node places contents to a destination node through the path, and a path packet is defined and used for storing information such as the path leading to the termination node from the initial node and the like;
after the current node successfully sends the path packet to the next-hop node, the node receiving the path packet returns confirmation information to the current node, and if the reception fails, the node does not return, and a verification packet is defined and used for storing the confirmation information.
In a further technical scheme, each node sends an advertisement packet filled with its own information to its neighbor nodes; nodes other than the source node receive the advertisement packets broadcast by their neighbors and update their Cache tables according to the information in the packet. If the advertisement packet of the source node is received, the return function is updated and the Q value is recalculated to update the Q value; if the advertisement packet of an ordinary node is received, the return function is updated and the recalculated Q value replaces the original Q value. Transmission time is the time difference between the time of receiving the packet and the sending time.
In a further technical scheme, each destination node, acting as a sending node, sends a path packet to a next-hop node: it writes its own node number into the node sequence, sets the termination node number to the source node, selects an action according to its Cache table and P_i(s'|s,a), writes the node corresponding to the action into next node and the sending time into sending time, sends the path packet filled with this information to the next-hop node, and waits for the confirmation packet of the next-hop node; if the confirmation is not received within a time T, the Transmission time of this action in the Cache table is changed to Transmission time + c, the Q value of the action is updated, and a path packet is sent to other nodes;
if the node receiving the packet is the source node, it sends a verification packet to the previous node, constructs a reverse route according to the node sequence in the packet, and does not continue to forward the path packet;
if the node receiving the packet is not the source node, it adds itself to the node sequence, fills in its corresponding information, continues to send a path packet downstream, and sends a verification packet to the previous node.
In a further technical scheme, when a destination node finds a path to the terminating node, the terminating node sends data to the destination node through a reverse route according to the node sequence; when all destination nodes have found their paths to the terminating node, the time-varying distribution tree CRL-TDT is constructed.
The invention discloses a dynamic cloud content distribution network content placement system based on collaborative reinforcement learning, which comprises a server, wherein the server is configured to:
establishing a dynamic cloud content distribution network;
taking all cloud proxy servers in the network as nodes, and determining a source node and a destination node set;
establishing a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning based on a source node and a destination node;
on the basis of CRL-CPM, a time-varying distribution tree is constructed through a time-varying distribution tree construction algorithm based on CRL, and content distribution is carried out by utilizing the distribution tree.
The above one or more technical solutions have the following beneficial effects:
the invention introduces Cooperative Reinforcement Learning (CRL), which utilizes feedback between agents to adapt and optimize system routing behavior.
The invention can enable adjacent agents to communicate with each other about the behaviors they learn. We define a cooperative feedback model and a negative feedback model to achieve this function.
The cooperative feedback model is as follows: the Agent communicates with the neighbor Agent after a specified time interval, and shares the optimal strategy value of the Agent.
A negative feedback model: as time goes by, the Agent will decay the optimal policy value of the neighbor Agent obtained through sharing until receiving a new optimal policy value shared by the neighbor agents.
The cooperative feedback model enables the Agents to share their optimal strategies at intervals and increases the probability that adjacent Agents take the same or related actions, which can generate positive feedback in the routing of a group of Agents in the network. The negative feedback model enables an Agent to attenuate the return of an optimal strategy obtained through sharing, so as to adapt to dynamic changes in network congestion. The positive feedback in the routing process causes convergence between the Agents' routing strategies, and continues until either routing congestion occurs or the decay model generates negative feedback, at which point the Agents adjust their routing strategies. Therefore, the method can adapt to dynamically changing networks and select a path with low congestion cost.
In the cooperative feedback model, an Agent communicates with a neighbor Agent after a specified time interval, and shares the optimal strategy value of the Agent.
In the negative feedback model, as time increases, the Agent can attenuate the optimal strategy value of the neighbor Agent obtained by sharing until receiving a new optimal strategy value shared by the neighbor agents.
The invention provides a CRL-based time-varying distribution tree construction algorithm to construct a distribution tree on the basis of the CRL-CPM. Congestion information and path information are transmitted through advertisement packets and path packets; each cloud proxy server stores this information in a Cache table and updates the congestion information in its own Cache table through the advertisement packets. A node selects the best path with the highest return through the congestion information in the packet, the source node obtains the sequence of nodes the path passes through from the path information, and the time-varying distribution tree is constructed from these sequences.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
FIG. 1 is a diagram of a reinforcement learning decision making process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating selection of reinforcement learning paths in CCDN according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating collaborative reinforcement learning for sharing a multi-Agent policy according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating selection of a cooperative reinforcement learning path in a CCDN according to an embodiment of the present invention;
FIG. 5 is an overall flow chart of an embodiment of the present invention;
FIG. 6 is a diagram illustrating an example of a CRL-TDT building process according to an embodiment of the present invention;
FIGS. 7(a)-7(c) are diagrams of three different network topology configurations according to embodiments of the present invention;
FIGS. 8(a)-8(c) are diagrams comparing the congestion cost for three network topologies according to the embodiment of the present invention;
FIGS. 9(a)-9(c) are graphs comparing congestion costs for three topologies at low load according to an embodiment of the present invention;
FIGS. 10(a)-10(c) are graphs comparing the congestion cost of three topologies under high load according to the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
Example one
The embodiment discloses a dynamic cloud content distribution network content placement method based on collaborative reinforcement learning, which comprises the following steps:
(1) A dynamic cloud content distribution network is input.
(2) All the cloud proxy servers are regarded as nodes, and a source node and a destination node set are determined.
(3) A dynamic CCDN content placement model (CRL-CPM) based on collaborative reinforcement learning is proposed.
(4) On the basis of CRL-CPM, a time-varying distribution tree (CRL-time-varying delivery tree) is constructed through a time-varying distribution tree construction algorithm based on CRL.
In the step (1), the CCDN is defined as a directed graph G = (V, E), where V is the vertex set representing the cloud proxy servers in the network, and E is the edge set representing all possible links in the network.
In the step (2), the cloud servers in the CCDN are divided into three types, including a source server, relay cloud proxy servers and destination cloud proxy servers. The origin server stores the raw data and is responsible for placing the content to the destination cloud proxy servers. A relay cloud proxy server may participate in the content placement process as a relay server. For convenience, only one origin server is considered and is referred to as the source node. Similarly, all the destination cloud proxy servers are called destination nodes, and the intermediate cloud proxy servers are called relay nodes. The destination node set is denoted by D = (D_1, D_2, ..., D_j).
In step (3), some dynamic characteristics of CCDN are summarized as follows:
1) cloud proxy server change: the CCDN content provider can lease or terminate the lease cloud server as needed, and even shut down some nodes at night to save energy consumption, so that the number and location of nodes in the network change.
2) Network congestion changes: the congestion level of a network node in a CCDN varies with the amount of data that is processed by the current node.
In RL, an Agent regards the other Agents around it as the external environment; it takes actions and thereby changes its state, and the environment gives a return value. By repeatedly and continuously exploring the environment, the Agent finds its own optimal strategy to adapt to the environment. A single Agent explores the environment inefficiently, and if the environment changes frequently, the exploration speed of a single Agent is often lower than the speed at which the environment changes, so the strategy obtained by exploration is not optimal. As shown in fig. 1.
The traditional content placement method provides a fixed distribution path between a source node and a destination node, and does not consider the constantly changing network condition; the reinforcement learning method can continuously adjust the strategy according to the dynamic condition of the network, but the efficiency of a single Agent exploration environment in the reinforcement learning is low. As shown in fig. 2, with the RL method, a node can only adapt to a changing network environment by exploring the environment and executing a local policy of the node, and because the exploring environment is not efficient, the distribution efficiency of a distribution path in a local decision cannot be optimal.
In the CRL, the agents can broadcast own V values to adjacent agents by using broadcast operation (discovery operation), each Agent stores the V values of the received neighbors in a local Cache, and own neighbor local views can be updated through the local Cache.
And each Agent is required to broadcast the latest V value to the surrounding neighbor agents at intervals, so that the neighbor agents can update the latest V value in real time. A decay model is also provided, and the V value in the local Cache of the Agent decays along with the increase of time, namely, the return value V of the task completed by the neighbor Agent cached locally by the Agent decreases along with the increase of time. As shown in fig. 3. If the V value of a certain neighbor Agent is not received for a long time, it is indicated that the network congestion cost between the two agents is large, so that the V value broadcasted by the neighbor cannot be transmitted to the Agent, and when the V value in the local cache is smaller and smaller along with the increase of time and exceeds a certain time, the Agent deletes the neighbor node from the own neighbor set.
Therefore, a content placement model (CRL-CPM) based on collaborative reinforcement learning is proposed, actions of neighboring nodes are converged by an optimal policy obtained after the neighboring nodes share their own exploration environment, as shown in fig. 4, the node receives the optimal policy broadcasted by the neighboring nodes and compares the optimal policy with its own local policy, and one optimal policy is selected from the local policy and the neighboring broadcasted policy to execute the next action. The efficiency of exploring the environment is improved by sharing own strategy through adjacent agents in the cooperative reinforcement learning, and the self-adaptive distribution tree established in the CCDN by the CRL method can self-adaptively adjust the path according to the existence condition of the nodes and the congestion condition of the nodes, so that the content can be placed more quickly.
As can be seen from the above two diagrams of fig. 3 and 4, the neighboring agents in the CRL share the environmental information obtained from their own exploration environments, and the agents can update their own policies by the environmental information broadcasted by the neighboring agents, without their own exploration environments, which improves efficiency.
Then, mapping the problem to the collaborative reinforcement learning, and correspondingly defining the state, the action and the return function:
1) The dynamic Agent set N = {n_1, n_2, ..., n_m} corresponds to the nodes in the CCDN, i.e., the set V in the directed graph.
2) For each Agent n_i there is one dynamic neighbor Agent set M_i for storing the neighbor nodes surrounding Agent n_i, where M_i ⊆ N.
3) S is the set of all states in the CCDN. Each Agent n_i has a unique corresponding state s_i and stores a set of states S_{n_i} ⊆ S. s_i represents the state in which data has been transmitted to the current Agent n_i, and the other states represent states in which the Agent successfully sends data to a neighboring Agent through an executable action. Normally, failure to execute an action transfers the Agent back to state s_i.
4) A is the set of all actions in the CCDN, i.e., the set E in the directed graph. Each Agent n_i has a dynamic action set A_{n_i} ⊆ A, which consists of the set of operations that execute the local policy and the set of delegation operations; the discovery operation is used to attempt to discover new neighbors and update the neighbor set M_i of Agent n_i. If a new neighbor n_j is found, A_{n_i} is updated to add a new delegation operation, and the delegation operation from n_j will also be received.
5) In the CCDN, transmission between nodes is wired, so the state transition probability is taken to be 1.
6) For each Agent n_i there is a delay function D(s'|s_i, a), which represents the estimated delay cost for n_i of performing action a in state s and transitioning to state s'; it is measured by the propagation delay of the data, where t_ij denotes the time spent in the transmission between Agent n_i and n_j.
D(s_j|s_i, a_i) = t_ij
There is a storage function M (s ') representing the storage cost of the Agent for the next state s', where cm is the used storage capacity and tm is the total storage capacity. The server with large storage capacity can store more contents, and when the server stores the contents required by the edge node, the edge node does not need to request the contents from a source server which is farther away, so that the time cost is saved.
M(s')=(tm-cm)/tm (1)
There is a load function L (s ') representing the load cost of the Agent corresponding to the next state s', where cl is the current load of the node and tl is the maximum load that the node can bear. The nodes with large load occupancy rates have more data to be queued and forwarded, so that the queuing time is increased, and the nodes with large load idle rates do not need more queuing time. The idle node is selected, so that time cost can be saved, and the load idle rate of the current node is represented by a load function.
L(s')=(tl-cl)/tl (2)
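For illustration only, the three cost terms above can be written as a minimal Python sketch; the function names (delay_cost, storage_cost, load_cost) and the toy values are assumptions introduced here, not part of the patented method.

```python
def delay_cost(t_ij: float) -> float:
    """Delay term D(s_j | s_i, a_i): transmission time between Agent n_i and n_j."""
    return t_ij


def storage_cost(total_mem: float, used_mem: float) -> float:
    """Storage function M(s') = (tm - cm) / tm: storage remaining rate of the next hop, formula (1)."""
    return (total_mem - used_mem) / total_mem


def load_cost(total_load: float, current_load: float) -> float:
    """Load function L(s') = (tl - cl) / tl: load idle rate of the next hop, formula (2)."""
    return (total_load - current_load) / total_load


if __name__ == "__main__":
    # toy example: neighbor with 40% of its memory and 25% of its load capacity in use
    print(delay_cost(0.8), storage_cost(100, 40), load_cost(200, 50))
```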
The return function is defined as:
R(s'|s,a) = x_1/D(s'|s_i,a) + x_2·M(s') + x_3·L(s')   (3)
where x_1, x_2 and x_3 are three constants; the main determinants of the reward function are adjusted by adjusting the sizes of these three constants. The smaller the transmission cost between two Agents, the larger the reward function.
If the next-hop node is the source node, i.e. the terminating node, the return function is:
R(s'|s,a) = x_1/D(s'|s_i,a) + x_2·M(s') + x_3·L(s') + c   (4)
where c is a constant. If the state after executing action a is the terminating-node state, the value of the return function is thus increased.
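A hedged sketch of the return functions (3) and (4): the weights x1, x2, x3 and the bonus c are free parameters (the embodiment later sets x2 = 1, x1 = x3 = 3, c = 10), and the arguments are the already-computed values of D, M and L from the sketch above.

```python
def return_value(t_ij: float, mem_ratio: float, load_ratio: float,
                 x1: float = 3.0, x2: float = 1.0, x3: float = 3.0,
                 c: float = 10.0, next_is_terminating: bool = False) -> float:
    """R(s'|s,a) = x1/D(s'|s_i,a) + x2*M(s') + x3*L(s'), plus c when s' is the terminating node (formula (4))."""
    r = x1 / t_ij + x2 * mem_ratio + x3 * load_ratio
    if next_is_terminating:
        r += c  # extra return for reaching the terminating (source) node
    return r
```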
7) A negative feedback model formula is also defined in the CCDN: as time increases, the V value buffered in Cache_i of Agent n_i (i.e., the V value broadcast from the neighbor Agent) decreases.
Decay(V_j) = V_j · β^td   (5)
where td is the time elapsed since the V value was last received from n_j's broadcast, and β is a scaling factor that sets the decay rate.
8) The collaborative reinforcement learning Q value updating formula based on the distributed model is:
Q_i(s,a)_new = Q_i(s,a)_old + α[R(s'|s,a) + γ·Decay(V_j(s')) − Q_i(s,a)_old]   (6)
where s, s' ∈ S_{n_i}; R(s'|s,a) is the return function and represents the single-step reward obtained by the state transition; γ is the future return attenuation; V_j(s') is the V value in Cache_i; α is the learning efficiency; Q_i(s,a)_old is the Q value before this update.
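A minimal sketch of the decay model (5) and the cooperative Q value update (6), assuming alpha, gamma and beta correspond to the learning efficiency, the future return attenuation and the decay scaling factor defined above; the function names are illustrative only.

```python
def decay_v(v_j: float, td: float, beta: float = 0.9) -> float:
    """Decay(V_j) = V_j * beta**td: the cached neighbor V value decays as time td elapses (formula (5))."""
    return v_j * beta ** td


def update_q(q_old: float, r: float, v_j_cached: float, td: float,
             alpha: float = 0.9, gamma: float = 0.9, beta: float = 0.9) -> float:
    """Q_i(s,a)_new = Q_i(s,a)_old + alpha*(R(s'|s,a) + gamma*Decay(V_j(s')) - Q_i(s,a)_old), formula (6)."""
    return q_old + alpha * (r + gamma * decay_v(v_j_cached, td, beta) - q_old)
```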
9) In order to better adapt to dynamic changes of the network environment and to avoid, as far as possible, congestion of the optimal-strategy path caused by the decision convergence of adjacent Agents, each Agent n_i does not necessarily select the best action; instead there is an action selection probability P_i(s'|s,a), which represents the probability that n_i selects action a in state s and transitions to state s'. The action a_max with the maximum reward value is selected with probability 0.7, the sub-maximum action a_sub with probability 0.2, and a random action (exploration) with probability 0.1.
P_i(s'|s,a_max) + P_i(s''|s,a_sub) + P_i(s'''|s,a_ran) = 1
So the V value function broadcast by Agent n_i is:
V_i(s) = [P_i(s'|s,a_max)·Q_i(s,a_max) + P_i(s''|s,a_sub)·Q_i(s,a_sub) + P_i(s'''|s,a_ran)·Q_i,ave]   (7)
where Q_i(s,a_max) is the optimal-policy return value of Agent n_i, Q_i(s,a_sub) is the sub-optimal-policy return value, and Q_i,ave is the average return value.
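The probabilistic action selection and the broadcast V value (7) can be sketched as follows; the 0.7/0.2/0.1 split follows the text, while the dictionary-of-Q-values interface is an assumption made only for illustration.

```python
import random


def select_action(q_values: dict):
    """Pick a_max with probability 0.7, the sub-maximum action with 0.2, and a random action with 0.1."""
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    r = random.random()
    if r < 0.7 or len(ranked) == 1:
        return ranked[0]              # a_max
    if r < 0.9:
        return ranked[1]              # a_sub
    return random.choice(ranked)      # exploration


def broadcast_v(q_values: dict) -> float:
    """V_i(s) = 0.7*Q_i(s,a_max) + 0.2*Q_i(s,a_sub) + 0.1*Q_i,ave (formula (7))."""
    qs = sorted(q_values.values(), reverse=True)
    q_max = qs[0]
    q_sub = qs[1] if len(qs) > 1 else qs[0]
    q_ave = sum(qs) / len(qs)
    return 0.7 * q_max + 0.2 * q_sub + 0.1 * q_ave
```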
The step (4) is a time-varying distribution tree construction algorithm, and the construction of the time-varying distribution tree CRL-TDT comprises the following steps:
1) and constructing a Cache table and introducing three packets in the network.
2) Through a node broadcast algorithm, the node broadcasts own strategy value to surrounding neighbor nodes and updates own Q value through the neighbor strategy value received by the node.
3) A path is selected from each destination node to its terminating node by the learning-based Q value update algorithm, and the Q value is updated.
4) And establishing a path from the source node to the destination node through the reverse routing information to construct the CRL-TDT.
In the step 1), the Cache table of each node is shown in table 1. In the table, a Cache entry is composed of eight fields: Termination Node is the terminating node of the transmission task (for convenience, only one source node is defined as the terminating node); Advertisement Node is the Agent that sent the broadcast, e.g. Agent n_j; Action is the delegation operation by which Agent n_i transmits data to Agent n_j; Memory Ratio is the storage remaining rate of Agent n_j; Load Ratio is the load remaining rate of Agent n_j; Transmission Time is the difference between the time at which Agent n_i receives the broadcast and the time at which Agent n_j sent it; V value is the newly broadcast V_j received by Agent n_i from Agent n_j; Q value is the estimated reward value for selecting this action. In order to save Cache storage space, each Agent n_i has an update program: when Agent n_i discovers a new neighbor through the discovery operation, the program adds the newly discovered neighbor Agent to the neighbor set M_i and adds information about the new neighbor to Cache_i; if the newly broadcast V value of a certain neighbor Agent is not received for a long time, that neighbor is deleted from the set M_i and its cached information is also deleted from Cache_i.
TABLE 1 (Cache table fields: Termination Node, Advertisement Node, Action, Memory Ratio, Load Ratio, Transmission Time, V value, Q value)
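One possible in-memory representation of a single Cache entry, mirroring the eight fields of Table 1; the dataclass layout and field types are assumptions, not the patent's own data structure.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    termination_node: int      # terminating node of the transmission task (the single source node)
    advertisement_node: int    # neighbor Agent n_j that sent the broadcast
    action: int                # delegation operation: transmit data from n_i to n_j
    memory_ratio: float        # storage remaining rate of n_j
    load_ratio: float          # load remaining rate of n_j
    transmission_time: float   # receive time minus n_j's sending time
    v_value: float             # latest V_j broadcast by n_j
    q_value: float             # estimated reward for selecting this action
```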
Data packets in a network are divided into three types, congestion information obtained by exploring an external environment by a node in the CCDN needs to be broadcast to adjacent nodes, an advertisement packet is defined and used for realizing a feedback model, and information such as a strategy return value of a current Agent is stored; when the starting node finds an optimal path leading to the terminating node, a reverse route is required to be established so that the source node places contents to the destination node through the path, and a path packet is defined and used for storing information such as the path leading to the terminating node from the starting node and the like; after the current node successfully sends the path packet to the next-hop node, the node receiving the path packet returns confirmation information to the current node, and if the reception fails, the node does not return, and a verification packet is defined for storing the confirmation information.
The format of the advertisement packet is as follows:
<node number,termination node number,sending time,memory ratio,load ratio,V value>
where node number is the number of the node that sent the packet, indicating which node the packet came from; termination node number is the terminating node of that node's transmission task; sending time is the time at which the node broadcasts the packet; memory ratio and load ratio are the storage remaining rate and the load remaining rate of the broadcasting node; V value is the best estimated return value for the node to complete the transmission task.
The path packet format is:
<node sequence,next node,termination node number,sending time>
wherein the node sequence is a node sequence passed by the path packet, and the sequence number of the node is added to each node passed by the path packet; the next node is the serial number of the selected next hop node; the termination node number is the node number of the termination node, typically the source node; the sending time is the time the node sends the packet.
The verification packet format is:
<node number,sending time,V value>
wherein the node number is the node number that sent the packet; the sending time is the time when the node sends the packet; and V value is an estimated return value of the current node for completing the transmission task, and is used for updating the Cache of the node for receiving the packet.
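The three packet types can be modelled as plain records; this is a sketch under the assumption that node numbers are integers and times are floats, with field names following the formats above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdvertisementPacket:
    """<node number, termination node number, sending time, memory ratio, load ratio, V value>"""
    node_number: int
    termination_node_number: int
    sending_time: float
    memory_ratio: float
    load_ratio: float
    v_value: float


@dataclass
class PathPacket:
    """<node sequence, next node, termination node number, sending time>"""
    node_sequence: List[int] = field(default_factory=list)
    next_node: int = -1
    termination_node_number: int = -1
    sending_time: float = 0.0


@dataclass
class VerificationPacket:
    """<node number, sending time, V value>"""
    node_number: int
    sending_time: float
    v_value: float
```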
The specific steps of the step 2) are as follows:
each node sends an advertise packet filling self information to a neighbor node, nodes except a source node receive advertise packets broadcasted by the neighbor node, a Cache table of the node is updated according to the information in the packet, if the advertise packet of the source node is received, a return function is updated according to a formula (4) and a Q value is calculated to update a Q value, if the advertise packet of the ordinary node is received, the return function is updated according to a formula (3) and the Q value is calculated to update the original Q value, wherein Transmission time is the time difference between the time of receiving the packet and the sending time.
The specific steps of the step 3) are as follows:
Each destination node, acting as a sending node, sends a path packet to a next-hop node: it writes its own node number into the node sequence, sets the termination node number to the source node, selects an action according to its Cache table and P_i(s'|s,a), writes the node corresponding to the action into next node and the sending time into sending time, sends the path packet filled with this information to the next-hop node, and waits for the confirmation packet of the next-hop node; if the confirmation is not received within a certain time T, the Transmission time of this action in the Cache table is changed to Transmission time + c, the Q value of the action is updated according to formula (6), and a path packet is sent to other nodes. If the node receiving the packet is the source node, it sends a verification packet to the previous node, constructs a reverse route according to the node sequence in the packet, and does not continue to forward the path packet; if the node receiving the packet is not the source node, it adds itself to the node sequence, fills in its corresponding information, continues to send a path packet downstream, and sends a verification packet to the previous-hop node.
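Step 3) for a single hop could look roughly like the sketch below; select_action, send and wait_for_verification are placeholder callables (for example the select_action sketch above), and the timeout and penalty handling follow the text (Transmission time is increased by c and the Q value re-estimated when no confirmation arrives within T). All names and defaults are assumptions.

```python
def forward_path_packet(node_id: int, cache: dict, path_pkt, now: float,
                        select_action, send, wait_for_verification,
                        timeout_T: float = 1.0, penalty_c: float = 10.0,
                        alpha: float = 0.9, gamma: float = 0.9,
                        x1: float = 3.0, x2: float = 1.0, x3: float = 3.0) -> bool:
    """One hop of path-packet forwarding: choose a next hop from the Cache, send, handle a missing confirmation."""
    path_pkt.node_sequence.append(node_id)                           # write this node into the node sequence
    nxt = select_action({n: e.q_value for n, e in cache.items()})    # P_i(s'|s,a)-style choice among cached neighbors
    path_pkt.next_node = nxt
    path_pkt.sending_time = now
    send(nxt, path_pkt)
    if wait_for_verification(nxt, timeout_T):
        return True                                                  # verification (confirmation) packet received
    e = cache[nxt]
    e.transmission_time += penalty_c                                 # Transmission time <- Transmission time + c
    r = x1 / e.transmission_time + x2 * e.memory_ratio + x3 * e.load_ratio
    e.q_value += alpha * (r + gamma * e.v_value - e.q_value)         # re-run the Q update, formula (6)
    return False                                                     # caller retries with another neighbor
```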
In the step 4): when a destination node finds a path to its terminating node, the terminating node sends data to the destination node through a reverse route according to the node sequence. When all destination nodes have found their paths to the terminating node, the time-varying distribution tree CRL-TDT is constructed.
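Putting the steps together, the loop that a destination node might run until its path packet reaches the source node can be sketched as below, reusing the PathPacket record from the earlier sketch; the hop function, the hop limit and all orchestration details are assumptions made for illustration.

```python
def build_path_to_source(dest_id: int, source_id: int, hop_fn, max_hops: int = 64):
    """Repeatedly forward a path packet until the source (terminating) node is reached; the returned node
    sequence is the reverse route along which the source then places the content (step 4))."""
    pkt = PathPacket(termination_node_number=source_id)
    current = dest_id
    for _ in range(max_hops):
        if not hop_fn(current, pkt):        # e.g. forward_path_packet bound to the current node's Cache
            continue                        # no confirmation: the hop function tries another neighbor next time
        if pkt.next_node == source_id:
            return pkt.node_sequence + [source_id]
        current = pkt.next_node
    return None                             # no path found within the hop limit
```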
In a more detailed implementation example, as shown in fig. 5, in the embodiment of the present invention, first, a cloud content distribution network is input, a destination node set is determined, then, a dynamic CCDN content placement model based on a CRL is constructed, then, a time-varying distribution tree based on the CRL is constructed, and finally, content distribution is performed by using the distribution tree.
Fig. 6 shows a node broadcasting process and a time-varying distribution tree building process in the CCDN, where a black square represents a source node, a node 9 is the source node, other squares represent destination nodes, a node 1 is one of the destination nodes, a circle represents other nodes in the network, and a dotted line represents an adaptive path selected by the collaborative reinforcement learning model. Each destination node sends a path packet to find a path reaching the source node, and when all the destination nodes find a path reaching the source node, a time-varying distribution tree is constructed.
Taking node number 11 in fig. 6 as an example:
(1) node 11 broadcasts and sends the advertisement packet to all neighbor nodes.
(2) And after the neighbor node receives the advertisement packet from the node No. 11, updating the Cache table of the neighbor node through the information in the packet.
(3) Meanwhile, node 11 will also receive the advertise packet from different neighbor nodes, and update its Cache table through the information in the packet.
All nodes broadcast their own advertisement packets to the surrounding neighbor nodes at intervals, to inform the neighbors of their own strategy information.
Taking node number 1 in fig. 6 as an example:
(1) and the node 1 is used as a sending node to select a proper neighbor node 3 to send a path packet according to the Cache table of the node 1.
(2) After receiving the path packet, the node 3 sends a confirmation packet back to the node 1, returns confirmation information, the V value of the node itself, and the like.
(3) After receiving the confirmation packet returned by the neighbor node No. 3, the node No. 1 updates the Cache table of the node No. 3 through the information in the packet.
(4) And the node No. 3 selects a proper neighbor node according to the Cache table of the node No. 3 to continuously send the path packet.
When all destination nodes find a path to the source node, the time-varying distribution tree is constructed. Due to dynamic changes of the network environment, the nodes frequently send advertisement packets, path packets and verification packets so as to better adapt to the CCDN.
Finally, comparing the CRL-TDT with the congestion cost of reinforcement learning and data distribution of Dijkstra algorithm, the specific process is as follows:
1. calculating congestion cost
And defining the Total time cost of data distribution as the difference between the sending time of the edge node and the receiving time of the source node.
2. Setting parameter values
Six destination nodes are selected each time. Since only one source node is specified and the impact of node storage capacity on the reward function should be reduced, x_2 is set to 1 and x_1, x_3 are set to 3; c is set to 10; β is set to 0.9; α is set to 0.9; γ is set to 0.9. The present invention uses the communication network topologies employed by Boyan and Littman to test the algorithms of the present application, including an irregular 6 x 6 network and a LATA telephone network with 116 nodes.
3. Congestion cost comparison
Fig. 7(a) -7 (c) are diagrams of three different network topology structures, and fig. 8(a) -8 (c) are graphs comparing total congestion cost of CRL-TDT and reinforcement learning and dijkstra algorithm of the present invention in three network topologies, where the load corresponds to the value of the poisson arrival process parameter of the average number of data packets injected per time unit. It can be seen from the figure that when the network load is low, the congestion cost of the CRL-TDT of the present invention is basically consistent with that of reinforcement learning and dijkstra algorithm, and as the network load increases, the congestion cost of the dijkstra algorithm significantly increases and is far higher than that of the CRL-TDT, and the CRL-TDT algorithm can obtain a lower congestion cost under the condition of a higher load. Therefore, the CRL-TDT can effectively reduce the congestion cost of the network under different network load conditions.
As shown in fig. 9(a) -9 (c), fig. 10(a) -10 (c), comparing the change of the CRL-TDT of the present invention with the reinforcement learning and dijkstra algorithm in the congestion cost increase with time under two different loads, i.e. high and low, under three network topologies, it can be seen from fig. 9(a) -9 (c) that when the network load is low, the CRL-TDT and the reinforcement learning converge to a certain value over a certain time, and the CRL-TDT converges faster, while the dijkstra algorithm keeps the congestion cost at a low value all the time when the load is low. As shown in fig. 10(a) -10 (c), when the load is higher, the congestion cost converges to a certain value after a certain time by CRL-TDT and reinforcement learning, and the convergence of CRL-TDT is faster, while the congestion cost is higher by dijkstra algorithm when the load is higher. Therefore, the Dijkstra algorithm cannot reduce network congestion globally when the load is high, the convergence speed of the reinforcement learning algorithm is slow, and the CRL-TDT can obtain lower congestion cost in long term under high and low load conditions.
Therefore, the CRL-TDT can well adapt to the change of network load, and the congestion cost is saved.
Example two
The present embodiment aims to provide a computing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the following steps, including:
(1) A dynamic cloud content distribution network is input.
(2) All the cloud proxy servers are regarded as nodes, and a source node and a destination node set are determined.
(3) A dynamic CCDN content placement model (CRL-CPM) based on collaborative reinforcement learning is proposed.
(4) On the basis of CRL-CPM, a time-varying distribution tree (CRL-time-varying delivery tree) is constructed through a time-varying distribution tree construction algorithm based on CRL.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the steps of:
(1) A dynamic cloud content distribution network is input.
(2) All the cloud proxy servers are regarded as nodes, and a source node and a destination node set are determined.
(3) A dynamic CCDN content placement model (CRL-CPM) based on collaborative reinforcement learning is proposed.
(4) On the basis of CRL-CPM, a time-varying distribution tree (CRL-time-varying delivery tree) is constructed through a time-varying distribution tree construction algorithm based on CRL.
The steps involved in the apparatuses of the second and third embodiments correspond to those of the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (8)

1. The dynamic cloud content distribution network content placement method based on collaborative reinforcement learning is characterized by comprising the following steps:
establishing a dynamic cloud content distribution network;
taking all cloud proxy servers in a network as nodes, and determining a source node and a destination node set;
establishing a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning based on a source node and a destination node;
on the basis of CRL-CPM, a time-varying distribution tree is constructed through a time-varying distribution tree construction algorithm based on CRL, and content distribution is carried out by utilizing the distribution tree;
the dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning:
the action of the adjacent nodes is converged by an optimal strategy obtained after the adjacent nodes share the self exploration environment, the node receives the optimal strategy broadcasted by the adjacent node and compares the optimal strategy with the local strategy of the node, and one optimal strategy is selected from the local strategy and the strategy broadcasted by the adjacent node to execute the next action;
the efficiency of exploring the environment is improved by sharing own strategy through adjacent cloud proxy servers, and the self-adaptive distribution tree established in the CCDN by the CRL method can be used for self-adaptively adjusting the path according to the existence condition of the nodes and the congestion condition of the nodes, so that the content can be placed more quickly;
a time-varying distribution tree construction algorithm, namely constructing a time-varying distribution tree:
constructing a Cache table;
through a node broadcast algorithm, the node broadcasts own strategy value to surrounding neighbor nodes and updates own cooperative reinforcement learning Q value through the neighbor strategy value received by the node;
selecting a path from each destination node to a termination node thereof through a learning-based Q value updating algorithm, and updating a Q value;
and establishing a path from the source node to the destination node through the reverse routing information to construct the CRL-TDT.
2. The collaborative reinforcement learning-based dynamic cloud content distribution network content placement method according to claim 1, wherein in collaborative reinforcement learning, the dynamic cloud proxy server Agent set N = {n_1, n_2, ..., n_m} corresponds to the nodes in the CCDN, i.e., the set V in the directed graph.
3. The collaborative reinforcement learning-based dynamic cloud content distribution network content placement method according to claim 2, wherein for each cloud proxy server Agent n_i there is one dynamic neighbor cloud proxy server Agent set M_i for storing the neighbor nodes surrounding cloud proxy server Agent n_i, wherein M_i ⊆ N.
4. The collaborative reinforcement learning-based dynamic cloud content distribution network content placement method according to claim 1, wherein each cloud proxy server Agent n_i has an update program; when the cloud proxy server Agent n_i discovers a new neighbor through the discovery operation, the program adds the newly discovered neighbor cloud proxy server Agent to the neighbor set M_i and adds information about the new neighbor to Cache_i;
if the newly broadcast V value of a certain neighbor cloud proxy server Agent is not received for a long time, that neighbor is deleted from the set M_i and its cached information is also deleted from Cache_i.
5. The dynamic cloud content distribution network content placement method based on collaborative reinforcement learning according to claim 1, wherein congestion information obtained by nodes in CCDN exploring an external environment needs to be broadcasted to adjacent nodes, a broadcast packet is defined, and policy return value information of a current cloud proxy server Agent is stored; when the starting node finds an optimal path leading to the terminating node, a reverse route is required to be established so that the source node places contents to the destination node through the path, and a path packet is defined and used for storing path information leading to the terminating node from the starting node;
after the current node sends the path packet to the next hop node successfully, the node receiving the path packet will return the confirmation information to the current node, if the reception fails, the confirmation information will not be returned, and a confirmation packet is defined for storing the confirmation information.
6. The method as claimed in claim 5, wherein each node sends a broadcast packet, which fills its own information, to a neighboring node, and nodes other than the source node receive the broadcast packet of the neighboring broadcast, update its Cache table according to the information in the packet, update the reward function and calculate the Q value to update the Q value if the broadcast packet of the source node is received, update the reward function and calculate the Q value to update the original Q value if the broadcast packet of the normal node is received, wherein the Transmission time is a time difference between the time of receiving the packet and the sending time.
7. The method as claimed in claim 1, wherein each destination node, acting as a sending node, sends a path packet to a next-hop node: it writes its own node number into the node sequence, sets the termination node number to the source node, and selects an action according to its Cache table and the action selection probability P_i(s'|s,a), where P_i(s'|s,a) represents the probability that node n_i selects action a in state s and transitions to state s'; it writes the node corresponding to the action into next node and the sending time into sending time, sends the path packet filled with this information to the next-hop node, and waits for the confirmation packet of the next-hop node; if the confirmation is not received within a certain time T, the Transmission time of this action in the Cache table is changed to Transmission time + c, the Q value of the action is updated, and a path packet is sent to other nodes;
if the node receiving the packet is the source node, sending a verification packet to the previous node, constructing a reverse route according to the node sequence in the packet, and not continuing to forward the path packet;
if the received packet is not the source node, adding the packet into the node sequence, filling information corresponding to the packet, continuing to send a path packet to the downstream, and sending a verification packet to the previous hop node;
when a destination node finds a path to the terminating node, the terminating node sends data to the destination node through a reverse route according to the node sequence, and when all destination nodes have found their paths to the terminating node, the time-varying distribution tree CRL-TDT is constructed.
8. The dynamic cloud content distribution network content placement system based on collaborative reinforcement learning is characterized by comprising a server, wherein the server is configured to:
establishing a dynamic cloud content distribution network;
taking all cloud proxy servers in the network as nodes, and determining a source node and a destination node set;
establishing a dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning based on a source node and a destination node;
on the basis of CRL-CPM, a time-varying distribution tree is constructed through a time-varying distribution tree construction algorithm based on CRL, and content distribution is carried out by utilizing the distribution tree;
the dynamic CCDN content placement model CRL-CPM based on collaborative reinforcement learning:
the action of the adjacent nodes is converged by an optimal strategy obtained after the adjacent nodes share the self exploration environment, the node receives the optimal strategy broadcasted by the adjacent node and compares the optimal strategy with the local strategy of the node, and one optimal strategy is selected from the local strategy and the strategy broadcasted by the adjacent node to execute the next action;
the efficiency of exploring the environment is improved by sharing own strategy through adjacent cloud proxy servers, and the self-adaptive distribution tree established in the CCDN by the CRL method can be used for self-adaptively adjusting the path according to the existence condition of the nodes and the congestion condition of the nodes, so that the content can be placed more quickly;
a time-varying distribution tree construction algorithm, namely constructing a time-varying distribution tree:
constructing a Cache table;
through a node broadcast algorithm, the node broadcasts own strategy value to surrounding neighbor nodes and updates own cooperative reinforcement learning Q value through the neighbor strategy value received by the node;
selecting a path from each destination node to a termination node thereof through a learning-based Q value updating algorithm, and updating a Q value;
and establishing a path from the source node to the destination node through the reverse routing information to construct the CRL-TDT.
CN202010408027.XA 2020-05-14 2020-05-14 Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning Active CN111629037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010408027.XA CN111629037B (en) 2020-05-14 2020-05-14 Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010408027.XA CN111629037B (en) 2020-05-14 2020-05-14 Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning

Publications (2)

Publication Number Publication Date
CN111629037A CN111629037A (en) 2020-09-04
CN111629037B true CN111629037B (en) 2022-05-27

Family

ID=72260591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010408027.XA Active CN111629037B (en) 2020-05-14 2020-05-14 Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning

Country Status (1)

Country Link
CN (1) CN111629037B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103179037B (en) * 2012-12-13 2015-12-09 清华大学 The data transmission method of content-based data center network
US10693947B2 (en) * 2016-09-09 2020-06-23 Microsoft Technology Licensing, Llc Interchangeable retrieval of sensitive content via private content distribution networks
CN107864092B (en) * 2017-10-31 2020-03-27 山东师范大学 Cloud content distribution method and device based on multicast technology
CN109922161B (en) * 2019-04-10 2021-10-19 山东师范大学 Content distribution method, system, device and medium for dynamic cloud content distribution network

Also Published As

Publication number Publication date
CN111629037A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
Li et al. Delay-tolerant data traffic to software-defined vehicular networks with mobile edge computing in smart city
Jimenez et al. On the controller placement for designing a distributed SDN control layer
CN110601973B (en) Route planning method, system, server and storage medium
US7961650B2 (en) Network architecture
CN110986979A (en) SDN multi-path routing planning method based on reinforcement learning
CN108540204B (en) Satellite network topology generation method using fast convergence ant colony algorithm
CN104168620A (en) Route establishing method in wireless multi-hop backhaul network
CN114143264B (en) Flow scheduling method based on reinforcement learning under SRv network
Wang et al. An improved routing algorithm based on social link awareness in delay tolerant networks
JP4611319B2 (en) Network architecture
Wang et al. Reinforcement learning based congestion control in satellite Internet of Things
Zhou et al. Adaptive routing strategy based on improved double Q-learning for satellite Internet of Things
Saleem et al. Ant based self-organized routing protocol for wireless sensor networks
Dowling et al. Building autonomic systems using collaborative reinforcement learning
CN111629037B (en) Dynamic cloud content distribution network content placement method based on collaborative reinforcement learning
CN110113418B (en) Collaborative cache updating method for vehicle-associated information center network
CN116846806A (en) Dynamic self-adaptive routing method and system for mobile network node
CN116828548A (en) Optimal route scheduling method based on reinforcement learning for power wireless network
CN107612980B (en) Adjustable and reliable consistency maintenance method in structured P2P network
WO2022218516A1 (en) Devices and methods for collaborative learning of a transmission policy in wireless networks
Huang et al. Distributed topology control mechanism for mobile ad hoc networks with swarm intelligence
Tekouabou et al. Efficient forwarding strategy in HDRP protocol based Internet of Things
Zhang et al. A fuzzy ranking based buffer replacement strategy for opportunistic networks
CA2558002C (en) Network architecture
Tode et al. Enhanced flooding algorithms introducing the concept of biotic growth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231204

Address after: No. 1823, Building A2-5, Hanyu Jingu, No. 7000 Jingshi East Road, High tech Zone, Jinan City, Shandong Province, 250000

Patentee after: Shandong Data Trading Co.,Ltd.

Address before: 250014 No. 88, Wenhua East Road, Lixia District, Shandong, Ji'nan

Patentee before: SHANDONG NORMAL University