CN111901833B

CN111901833B - Combined service scheduling and content caching method for unreliable channel transmission

Info

Publication number: CN111901833B
Application number: CN202010677841.1A
Authority: CN
Inventors: 罗晶晶; 张琬璐; 聂涛; 高林; 郑福春; 张钦宇
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2020-07-13
Filing date: 2020-07-13
Publication date: 2023-07-18
Anticipated expiration: 2040-07-13
Also published as: CN111901833A

Abstract

The invention provides a joint service scheduling and content caching method for unreliable channel transmission, which comprises the following steps: service scheduling: the base station with higher reliability of the scheduling channel can serve the request of the user, so that the service cost caused by retransmission can be reduced; content caching: and the state information among the intelligent agents is utilized for collaborative caching, and caching decisions can be coordinated among the base stations to realize the maximization of the reduction of service overhead. The beneficial effects of the invention are as follows: simulation results show that the service scheduling step of the invention is superior to the shortest distance priority strategy. And compared with the distributed multi-agent deep Q network strategy, when the content quantity and the local cache capacity are increased, the content caching step provided by the invention can obtain better performance and has better robustness.

Description

Combined service scheduling and content caching method for unreliable channel transmission

Technical Field

The invention relates to the technical field of wireless edge caching, in particular to a combined service scheduling and content caching method for unreliable channel transmission.

Background

With the rapid development of mobile internet and internet of things, data traffic in wireless networks grows exponentially. The latest report of cisco shows that global internet traffic will exceed 4.6ZB in 2023, 321% more than 2017. In a cellular access network mainly comprising a base station, a content request of a user sequentially passes through the base station, an S-GW and a P-GW, then enters the Internet, and is routed and forwarded to a far-end content server. The physical distance of the user from the content server may create network transmission delays. When there are fewer users and the network conditions are better, the latency of the user to the remote server is not significant. However, in a practical scenario, a large number of users often initiate a relatively concentrated transmission request for the popular content, which can cause huge stress on the network, so that the quality of experience of the users is drastically reduced. In addition, repeated transmission of large amounts of popular content (especially mobile high definition video) can result in significant waste of communication resources. In order to meet the ever-increasing data demands of users for online video, web browsing, online gaming, etc., providers are motivated to seek new service technologies that provide a high quality experience. As a key technology for next generation mobile communications, edge caching provides a new solution. The edge cache pre-stores a part of the content in the edge node, so that congestion of a backhaul link can be effectively avoided, and communication resource consumption caused by repeated downloading of the content can be reduced.

There are some related technical achievements of edge cache mechanism design in the current academia. Two common edge caching methods are: and firstly predicting the popularity of the content, then updating the cache, and directly learning the cache strategy. Nevertheless, there are still a number of problems with edge caching that remain to be solved. In a practical scenario, content delivery may fail intermittently or with significant delay due to mobility, fading, communication errors, etc. In this case, the wireless communication channel is often unreliable, and the reliability of the channel is unknown. Thus, the caching strategy on reliable channels cannot be directly applied to unreliable channel scenarios, especially in data intensive applications. In order to bring higher quality of experience to users and reduce service overhead for operators, service scheduling policies and content caching policies need to be redesigned.

Disclosure of Invention

The invention provides a joint service scheduling and content caching method for unreliable channel transmission, which comprises the following steps:

service scheduling: the base station with higher reliability of the scheduling channel can serve the request of the user, so that the service cost caused by retransmission can be reduced; the service scheduling step is a maximum prize priority (Maximal Reward Priority, MRP) policy.

Content caching: the state information among the intelligent agents is utilized for collaborative caching, and caching decisions can be coordinated among the base stations to realize maximization of the reduction of service overhead; the content caching step is a collaborative multi-agent actor evaluation (Collaborative Multi-Agent Actor Critic, CMA-AC) strategy.

As a further improvement of the present invention, in the service scheduling step, the core of the service scheduling step is to always schedule a base station with a higher channel reliability to service a user's request. However, this problem is challenging because the reliability of the channel is unknown. Taking the service user u as an example, we define that the rewards that the base station n can theoretically obtain for providing the content f to the user u can be expressed as

Wherein b _u,f (t) -a service scheduling policy for a request of a user u for a content f at a time slot t;

d _u,f (t) -the number of requests for content f by user u in t time slots;

a _n,f (t-1) -a buffer decision of the content f by the time slot (t-1) base station n;

c _n,u -the base station n serves the service overhead required for the user u to transmit once;

c ₀ -the core network serves the service overhead required by the user;

p _n,u -the degree of reliability of the communication channel between base station n and user u.

In the formula (4), II { b } _u,f (t) =n } =1 indicates that the request of user u for content f is served by base station n,representing a reduction in service overhead due to edge caching compared to directly retrieving content from the core network; further, the average rewards obtained by base station n service u can be obtained:

wherein the method comprises the steps ofRepresenting the total number of times the first t time slot user u is served by the base station n; consider average rewards +.>The reliability of the channel can be reflected to a certain extent, and the service scheduling step of the system is obtained through a greedy algorithm. Therefore, the present invention names this service scheduling step as an MRP service scheduling policy, i.e. a base station with a higher average prize is always selected from the base stations capable of providing services to serve the user's request.

As a further improvement of the present invention, in the service scheduling step, whenAt this point, base station n has not previously served user u; in order to ensure that base station n serves user u at least once, the service scheduling step at this time is denoted b _u,f (t)＝n。

As a further improvement of the present invention, in the service scheduling step, whenThe user's request is serviced according to the following policies:

wherein,,

indicating a base station in which content f is buffered in a neighboring base station of user u in time slot t, l (t) indicating a slaveSelecting a base station n with the largest average prize in a time slot (t-1);

(1) the method comprises the following steps The user u requests the content f in the time slot t, and the adjacent base stations of the user u do not cache the content f, and the request of the user u for the content f is served by the core network;

(2) the method comprises the following steps In the time slot t, the user u requests the content f, and only one adjacent base station n caches the content f, and at the moment, the request of the user u for the content f is provided by the base station n;

(3) the method comprises the following steps The user u requests the content f in the time slot t, and a plurality of adjacent base stations buffer the content f, at this time, the request of the user u for the content f is served by the base station n with the largest average prize of the time slot (t-1).

As a further improvement of the present invention, in the content caching step, each time slot is divided into 3 phases: a content delivery phase, an information exchange phase and a cache update phase; in the content delivery phase, a user initiates a content request to a base station, and the base station simultaneously serves the request of the user according to a service scheduling policy; after the content delivery phase is finished, the system enters an information exchange phase, and at the phase, different base stations exchange request state information and cache decision information with each other; in the cache updating stage, each base station carries out cache updating according to the global state information obtained in the information exchange stage.

As a further improvement of the present invention, in the content caching step, each base station is regarded as an agent, which contains an Actor network and a Critic network, the Actor network being a policy network; given the state s observed by the current base station n _n The Actor network can output the caching decision a of the agent _n The method comprises the steps of carrying out a first treatment on the surface of the The Critic network is an evaluation network for estimating the total rewards that the system can obtain; critic network exchanges information between systemsMapping the global state s obtained in the stage to a cost function; by utilizing the Critic network to direct the Actor network to perform parameter updates, each agent is able to update its own cache in the direction of prize maximization. More specifically, each agent maintains an experience buffer. By randomly sampling and replaying past experiences, it can overcome the correlation of adjacent experiences and learn its own caching strategy based on the past.

As a further improvement of the present invention, in the content caching step, a content delivery phase: each time slot, each agent servicing the user's request according to a service scheduling step after receiving the user's request; taking the example of base station n in time slot τ, after the content delivery phase is completed, the base station can obtain the state s _n (tau-1) taking action a _n (tau-1) obtained prize r _n (tau) while the base station enters the next state s _n (τ); therefore, a group of quadruplets about base station states, rewards and buffer decisions can be obtained and put into the experience pool of each base station, and k samples are taken from the experience pool for training; after the content delivery phase is finished, the system enters an information exchange phase.

As a further development of the invention, in the content caching step, the information exchange phase: exchanging request state information and cache behavior information between base stations; after the information exchange is finished, the system enters a buffer updating stage, and each base station updates the buffer of the base station according to the corresponding buffer decision.

As a further improvement of the present invention, in the content caching step, a cache update phase: the Critic network updates network parameters through the cache state information and the cache decision information about other intelligent agents obtained in the information exchange stage; meanwhile, the Critic network is utilized to guide the Actor network to update parameters, so that each base station performs cache update towards the direction of rewarding maximization.

As a further improvement of the present invention, in the content caching step, we use epsilon for balanced exploration-utilization _τ Updating cache behavior by greedy policy, i.e. with ε _τ Probability random caching of (c)In 1-epsilon _τ The probability of selecting the cache behavior with the largest cost function.

The beneficial effects of the invention are as follows: simulation results show that the service scheduling step of the invention is superior to the shortest distance first (Shortest Distance Priority, SDP) strategy. And, compared with distributed multi-agent deep Q network (Distributed Multi-Agent Deep Q Network, DMA-DQN) strategy, the content caching step proposed by the present invention can obtain better performance and have better robustness when the content number and local caching capacity are increased.

Drawings

FIG. 1 is a system model diagram of the present invention;

FIG. 2 is a schematic diagram of key elements of the present invention across time slots;

fig. 3-5 are performance simulation graphs of the present invention in different scenarios, respectively.

Detailed Description

Aiming at the defects of the existing service scheduling strategy and content caching strategy in the edge caching technology, the invention provides a combined service scheduling and content caching method for unreliable channel transmission, and aims to formulate a multi-agent decision problem so as to minimize the service cost of a system. The effectiveness of a deep reinforcement learning caching strategy is measured by the rewards it gets. While the design of rewards should present the goal of minimizing service overhead. Thus, the present invention defines rewards as a reduction in service overhead due to edge caching compared to directly retrieving content from the core network. The optimization goal is to find the optimal service scheduling policy and content caching policy to maximize the long-term rewards of the system.

In order to achieve the above purpose, the present invention is realized by the following technical scheme:

and (3) system model:

the application scenario considers a cellular network supporting caching, as shown in fig. 1. Wherein the base station may be connected to the core network through a backhaul link. There are N base stations (local cache capacity C) and U users in the area. The base station set and the user set are respectively represented asIs->Let us assume that the service area of the base station is limited, denoted as l _c Users within this range can be served by the base station. We express the neighbouring base stations of user u as set +.>Similarly, the contiguous set of users of base station n is denoted +.>We assume that there is an exchange of information, such as request information and cache decision information, between the different base stations.

Channel model:

it is assumed that the communication channel between the user and the base station is unreliable. For user u, the set of communication channel reliabilities with all neighboring base stations is denoted asWherein p is _n,u The reliability of the communication channel between base station n and user u, i.e. the probability of success in transmitting content from base station n to user u, is indicated. Non-contiguous base station for user uReliability p of the channel because of the inability to establish a communication link between them _n,u =0. In this unreliable channel case, the requested content will be repeatedly transmitted by the base station until user u successfully retrieves the content.

Service model:

we assume that in the network model under consideration there is one content setThe time is divided into discrete time slots, each user independently requests content from the content library at each time slot, and the user preferences are unknown. Assume thatThe content requirements of all users in time slot t are denoted as d (t) = { d _u,f (t)} ^U×F Wherein d is _u,f (t) =1 means that user u requests content f in time slot t, otherwise d _u,f (t) =0. Accordingly, the service scheduling policy at time slot t is expressed as b (t) = { b _u,f (t)} ^U×F Wherein b _u,f (t) ∈ { -1,0,1, …, N } is request d _u,f Service policies of (t). Specifically, b _u,f (t) = -1 represents d _u,f (t) no service is required, b _u,f (t) =0 and b _u,f (t) =1, 2, …, N represents request d _u,f (t) served by the core network and base stations 1,2, …, N, respectively. Meanwhile, we define a (t) = { a ₁ (t)，a ₂ (t)，...，a _n (t)，...，a _N (t) } is a buffer decision of a time slot t system, where a _n And (t) is a buffer decision of the base station n in the time slot t, and the total buffer number is not more than the buffer capacity of the base station.

The invention discloses a joint service scheduling and content caching method for unreliable channel transmission, which comprises a service scheduling step and a content caching step.

Service scheduling: the base station with higher reliability of the scheduling channel can serve the request of the user, so that the service cost caused by retransmission can be reduced; the service scheduling step is MRP strategy.

Content caching: the state information among the intelligent agents is utilized for collaborative caching, and caching decisions can be coordinated among the base stations to realize maximization of the reduction of service overhead; the content caching step is a CMA-AC strategy.

As shown in fig. 2, the present invention divides each slot into 3 phases: a content delivery phase, an information exchange phase and a cache update phase. In the content delivery phase, the user initiates a content request to the base station, while the base station services the user's request according to the service scheduling policy. After the content delivery phase is over, the system enters an information exchange phase. At this stage, different base stations exchange request state information, buffer decision information, and the like with each other. In the cache updating stage, each base station carries out cache updating according to the global state information obtained in the information exchange stage.

In FIG. 2, taking base station n as an example, we can obtain the caching strategy a of base station n at the beginning of time slot t, i.e. after the end of the content placement phase of time slot t-1 _n (t-1). After the end of the content delivery phase of time slot t, base station n obtains the request status g of all users _n (t) and rewards r brought by service users at the current moment _n (t). Rewards r obtained by base station n of time slot t _n (t) received by the base station buffer decision a _n (t-1) service scheduling policy b _n The combined influence of (t-1). Will (a) _n (t-1),g _n (t)) as state s of base station n in t time slot _n (t) caching policy function pi _n Will put state s _n (t) mapping to a _n (t). Thus a _n (t) can also be expressed as pi _n (s _n (t)) given a service policy b, the performance of base station n is measured by a state value function expressed as

Where γ represents a discount factor, between (0, 1). r is (r) _n (s _n (τ),π _n (s _n (τ))|b)＝r _n (t+1), i.e.)

The optimization goal is to design a proper service scheduling policy b and a content caching policy pi _n To maximize the overall rewards of the system, expressed as follows

η and ψ represent all possible solution sets of the service scheduling policy and the content caching policy, respectively.

Service scheduling:

the core of the service scheduling step is to always schedule base stations with a higher degree of channel reliability to service the user's request. However, this problem is challenging because the reliability of the channel is unknown. Taking the service user u as an example, we define that the rewards that the base station n can theoretically obtain for providing the content f to the user u can be expressed as

d _u,f (t) -the number of requests for content f by user u in t time slots;

c ₀ -the core network serves the service overhead required by the user;

In the formula (4), II { b } _u,f (t) =n } =1 indicates that the request of user u for content f is served by base station n,representing a reduction in service overhead due to edge caching compared to directly retrieving content from the core network. Furthermore, we can get the average rewards obtained by the base station n service u:

wherein the method comprises the steps ofIndicating the total number of times the first t time slot user u is served by the base station n. Consider average rewards +.>Can be used forThe reliability of the channel is reflected to a certain extent, and the service scheduling strategy of the system can be obtained through a greedy algorithm. Therefore, we name this service scheduling policy as MRP service scheduling policy, i.e. the base station that will average higher rewards is always selected from the base stations that can provide the service to serve the user's request.

The invention divides the MRP service scheduling strategy into two parts: the first part is when At this point base station n has not previously served user u. To ensure that base station n serves user u at least once, the service scheduling policy at this time can be denoted as b _u,f (t) =n. The second part is-> The user's request is serviced according to the following policies:

wherein,,

indicating a base station having content f buffered in a neighbouring base station of user u in time slot t, l (t) indicating from +.>The base station n with the highest average prize in time slot (t-1) is selected.

Content caching:

after giving the service policy b (t), equation 3 can be reduced to

The aim of the invention is to find an optimal strategy pi ^* To maximize the overall prize. By defining a state transition matrix, the state cost function can be recursively represented by the bellman equation

To describe the execution of the cache behavior in the current state, it will be under policy pi _n In state s _n Lower execution behavior a _n Is defined as a state behavior value functionThe expression is shown in formula 9:

from the above expression we can get A denotes all possible decision sets, but due to transition probability +.>Unknown in advance, so that the state-behavior value function cannot be obtained through strategy iteration. In view of this, the present invention introduces an algorithm based on Q learning to solve this problem.

In the content caching step, each base station is regarded as an agent, and comprises an Actor network and a Critic network, wherein the Actor network is a strategy network; given the state s observed by the current base station n _n The Actor network can output the caching decision a of the agent _n The method comprises the steps of carrying out a first treatment on the surface of the The Critic network is an evaluation network for estimating the total rewards that the system can obtain; the Critic network maps the global state s obtained by the system in the information exchange stage to a cost function; by utilizing the Critic network to direct the Actor network to perform parameter updates, each agent is able to update its own cache in the direction of prize maximization. More specifically, each agent maintains an experience buffer. By randomly sampling and replaying past experiences, it can overcome the correlation of adjacent experiences and learn its own caching strategy based on the past.

The following details regarding the content caching step are:

1. content delivery phase: each time slot, each agent, after receiving the user's request, services the user's request according to a service scheduling policy. Taking the example of base station n in time slot τ, after the content delivery phase is completed, the base station can obtain the state s _n (tau-1) taking action a _n (tau-1) obtained prize r _n (tau) while the base station enters the next state s _n (τ). Thus, we can get a set of information about the base station status, rewards and buffering blocksThe four-element combination is put into the experience pool of each base station, and k samples are taken from the experience pool for training. After the content delivery phase is finished, the system enters an information exchange phase.

2. Information exchange stage: the base stations exchange request state information and cache behavior information. After the information exchange is finished, the system enters a buffer updating stage, and each base station updates the buffer of the base station according to the corresponding buffer decision.

3. Cache update stage: the Critic network updates the network parameters according to equation 10. Meanwhile, the Critic network is utilized to guide the Actor network to update parameters(see equation 11) such that each base station performs a cache update in the direction of the prize maximization.

Wherein,,s＝{s ₁ ,s ₂ ,…,s _N and represents the state of the system.

For equilibrium exploration-utilization we use ε _τ Updating cache behavior by greedy policy, i.e. with ε _τ Is randomly cached with a probability of 1-epsilon _τ The caching behavior with the largest probability selection cost function is shown in the following formula 12:

performance simulation:

to evaluate performance, we combine two service scheduling policies and two content caching policies, respectively, to simulate four schemes.

One of the service scheduling strategies is the MRP strategy provided by the invention, and the other is the SDP strategy. The core of the SDP policy is to select the closest edge node as possible for serving the user. Specifically, for each user within the area, the system may select an edge node closest to the user from among the edge nodes capable of providing services to the user to service the user's request.

One content caching strategy is a CMA-AC strategy provided by the invention, and the other content caching strategy is a DMA-DQN strategy. In the DMA-DQN, each edge node can be seen as an agent, and each agent finds an optimal policy through the DQN algorithm to minimize the service overhead of the system, but the different agents are independent from each other.

The four schemes are respectively as follows:

1.CMA-AC+MRP；

2.CMA-AC+SDP；

3.DMA-DQN+MRP；

4.DMA-DQN+SDP。

we first evaluate the scenario with 10 content and 2 cache capacity, and the simulation result is shown in fig. 3. The four schemes in fig. 3 have similar performance. This is because the buffer capacity of the base stations is small, resulting in a very weak degree of cooperation between the base stations. Therefore, the proposed CMA-AC caching strategy and MRP service scheduling strategy have no significant advantages.

Then, we evaluate the scenario with 15 content and 4 buffer capacity, and the simulation result is shown in fig. 4. It is noted that when the content caching policies are the same, the scheme performance of the service scheduling policy being the MRP policy is better than the scheme performance of the service scheduling policy being the SDP policy. The reason is that the MRP policy selects the channel with the highest average prize to serve the user's request, where a higher average prize means higher reliability. Whereas the SDP policy selects the nearest base station to fulfill the user's request, the nearest base station may not be the most channel-reliable base station. Thus, a scheme with a service scheduling policy being an MRP policy may achieve better performance than a scheme with a service scheduling policy being an SDP policy. In addition, when the service scheduling policies are the same, the scheme performance of the content caching policy, which is the CMA-AC policy, is better than the scheme performance of the content caching policy, which is the DAM-DQN policy. This is because the CMA-AC policy coordinates the collaboration between base stations by using the state information of the other agents, while the DMA-DQN policy updates the caching policy based only on the state information of each agent itself. Therefore, the solution performance of the content caching policy being the CMA-AC policy is better than the solution performance of the content caching policy being the DMA-DQN policy.

Finally, we evaluate the scenario with 20 content and 4 buffer capacity, and the simulation result is shown in fig. 5. In this scenario we find that schemes employing CMA-AC content caching policies still have better performance. Furthermore, the CMA-AC strategy is more robust than the DMA-DQN strategy. This is because the CAM-AC strategy aims to maximize the rewards of the system, while the DMA-DQN strategy aims to maximize the rewards of each agent.

The beneficial effects of the invention are as follows:

1. under the condition that user preference and channel reliability are unknown, the content caching strategy design of a plurality of base stations is researched, and the problem is modeled as a multi-agent deep reinforcement learning problem.

2. The invention proposes a service scheduling step (MRP service scheduling strategy) to solve the service scheduling problem. When a request content is available from a plurality of neighboring base stations, the MRP service scheduling policy may schedule base stations with higher reliability to service the user's request.

3. The content caching step (CMA-AC strategy) proposed by the present invention is used to solve the content caching problem. The CMA-AC strategy can effectively utilize the state information of other multi-agents to coordinate the collaborative caching between the base stations.

4. Simulation results show that the MRP service scheduling strategy provided by the invention has better performance than the SDP strategy, and the CMA-AC strategy provided by the invention is superior to the DMA-DQN strategy when the content quantity and the local cache capacity are increased. Furthermore, the CMA-AC strategy is more robust than the DMA-DQN strategy.

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. An unreliable channel transmission oriented joint service scheduling and content caching method, comprising:

service scheduling: the base station with higher reliability of the scheduling channel can serve the request of the user, so that the service cost caused by retransmission can be reduced;

content caching: the state information among the intelligent agents is utilized for collaborative caching, and caching decisions can be coordinated among the base stations to realize maximization of the reduction of service overhead;

in the service scheduling step, defining that the base station n provides the content f for the user u can theoretically obtain rewards can be expressed as

Wherein b _u，f (t) -a service scheduling policy for a request of a user u for a content f at a time slot t;

d _u，f (t) -the number of requests for content f by user u in t time slots;

a _n，f (t-1) -a buffer decision of the content f by the time slot (t-1) base station n;

c _n，u -the base station n serves the service overhead required for the user u to transmit once;

c ₀ -the core network serves the service overhead required by the user;

p _n，u -the degree of reliability of the communication channel between base station n and user u;

in the formula (4), the amino acid sequence of the compound,indicating that user u requests for content f is served by base station n,/->Representing a reduction in service overhead due to edge caching compared to directly retrieving content from the core network; further, the average rewards obtained by base station n service u can be obtained:

wherein the method comprises the steps ofRepresenting the total number of times the first t time slot user u is served by the base station n; consider average rewards +.>The reliability of the channel can be reflected to a certain extent, and the service scheduling step of the system is obtained through a greedy algorithm;

in the content caching step, each time slot is divided into 3 phases: a content delivery phase, an information exchange phase and a cache update phase; in the content delivery phase, a user initiates a content request to a base station, and the base station simultaneously serves the request of the user according to a service scheduling policy; after the content delivery phase is finished, the system enters an information exchange phase, and at the phase, different base stations exchange request state information and cache decision information with each other; in the cache updating stage, each base station carries out cache updating according to the global state information obtained in the information exchange stage.

2. The federated service scheduling and content caching method as claimed in claim 1, wherein in said service scheduling step, whenAt this time base station n has not been previously takenUser u is crossed; in order to ensure that base station n serves user u at least once, the service scheduling step at this time is denoted b _u，f (t)＝n。

3. The federated service scheduling and content caching method as claimed in claim 1, wherein in said service scheduling step, whenThe user's request is serviced according to the following policies:

wherein,,

indicating a base station having content f buffered in a neighbouring base station of user u in time slot t, l (t) indicating from +.>Selecting a base station n with the largest average prize in a time slot (t-1);

4. The joint service scheduling and content caching method according to claim 1, wherein in the content caching step, each base station is regarded as an agent, and comprises an Actor network and a Critic network, and the Actor network is a policy network; given the state s observed by the current base station n _n The Actor network can output the caching decision a of the agent _n The method comprises the steps of carrying out a first treatment on the surface of the The Critic network is an evaluation network for estimating the total rewards that the system can obtain; the Critic network maps the global state s obtained by the system in the information exchange stage to a cost function; by utilizing the Critic network to direct the Actor network to perform parameter updates, each agent is able to update its own cache in the direction of prize maximization.

5. The syndicated service scheduling and content caching method according to claim 4, wherein in the content caching step, a content delivery phase: each time slot, each agent servicing the user's request according to a service scheduling step after receiving the user's request; taking the example of base station n in time slot τ, after the content delivery phase is completed, the base station can obtain the state s _n (tau-1) taking action a _n (tau-1) obtained prize r _n (tau) while the base station enters the next state s _n (τ); therefore, a group of quadruplets about base station states, rewards and buffer decisions can be obtained and put into the experience pool of each base station, and k samples are taken from the experience pool for training; after the content delivery phase is finished, the system enters an information exchange phase.

6. The federated service scheduling and content caching method of claim 5, wherein in the content caching step, the information exchange phase: exchanging request state information and cache behavior information between base stations; after the information exchange is finished, the system enters a buffer updating stage, and each base station updates the buffer of the base station according to the corresponding buffer decision.

7. The federated service scheduling and content caching method of claim 6, wherein in the content caching step, the cache update phase: the Critic network updates network parameters through the cache state information and the cache decision information about other intelligent agents obtained in the information exchange stage; meanwhile, the Critic network is utilized to guide the Actor network to update parameters, so that each base station performs cache update towards the direction of rewarding maximization.

8. The joint service scheduling and content caching method according to claim 7, wherein in the content caching step, in order to balance exploration-utilization, we use epsilon greedy strategy to update the caching behavior, namely randomly caching with epsilon probability, and selecting the caching behavior with the largest cost function with 1-epsilon probability.