CN116156455A - Internet of vehicles edge content caching decision method based on federal reinforcement learning - Google Patents
- Publication number
- CN116156455A (application CN202211708649.XA)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- edge
- content
- time slot
- cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/16—Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
- H04W28/18—Negotiating wireless communication parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/30—Services specially adapted for particular environments, situations or purposes
- H04W4/40—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
- H04W4/44—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an Internet of Vehicles edge content caching decision method based on federal reinforcement learning, which specifically comprises the following steps: input the Internet of Vehicles environment and initialize the network parameters of each vehicle; in the current time slot, each vehicle interacts with the road side units to obtain observation information; according to the observation information, each vehicle independently decides its action; after the action is executed, each vehicle obtains a reward fed back by the environment, and the sample data are cached in an experience replay pool; when the number of samples is sufficient, each vehicle updates its networks according to a soft actor-critic algorithm; an aggregation center collects the local network parameters for federated aggregation and broadcasts the aggregated parameters back for local training; after the current training round ends, the Internet of Vehicles environment is reset and the next round of training begins. The invention aims to minimize the long-term trade-off between system content transmission delay and edge cache overhead by exploiting a user-centric network architecture in the Internet of Vehicles environment, so that vehicles complete distributed edge caching decisions while protecting privacy.
Description
Technical Field
The invention relates to the technical field of wireless communication, and in particular to an Internet of Vehicles edge content caching decision method based on federal reinforcement learning.
Background
In recent years, driven by sixth-generation mobile communication technology, the Internet of Vehicles, the automobile industry, big data, artificial intelligence, and other new-generation information technologies have become deeply integrated and can provide efficient and reliable communication services. However, as the number of vehicles grows, ever-increasing real-time communication services place higher demands on the ultra-low delay and ultra-high reliability of the Internet of Vehicles. To address these challenges, edge caching techniques reduce content delivery delay by deploying cache resources at edge nodes to complete content distribution locally, avoiding reliance on cloud-centric delivery (Zhang Y, Zhao J, Cao G. Roadcast: a popularity aware content sharing scheme in VANETs. ACM SIGMOBILE Mobile Computing and Communications Review, 2010, 13(4): 1-14.). However, due to the limited storage capacity of edge nodes, the rapidly growing data requests of Internet of Vehicles applications cannot all be cached at the edge; moreover, in view of the high mobility of vehicles, frequently changing content demands, and harsh communication environments, an efficient edge content caching method needs to be designed for Internet of Vehicles scenarios.
Considering that the intelligent edge content caching problem of the Internet of Vehicles is essentially a model-free discrete sequential decision problem, it can be solved by multi-agent reinforcement learning, in which agents complete local decisions by sharing training information. Compared with traditional optimization algorithms, deep reinforcement learning can learn from experience through an agent's interaction with an uncertain environment to solve dynamic decision problems. Even when dynamic environmental changes cannot be predicted in advance, the agent can learn how to act, i.e., how to map acquired information to actions, so as to maximize the system reward. In recent years, researchers at home and abroad have focused on intelligent edge content caching decisions to efficiently utilize edge caching resources in the dynamic wireless transmission environment of the Internet of Vehicles. For example, Qiao et al. use a deep deterministic policy gradient algorithm to learn the dynamics of the Internet of Vehicles wireless environment from the local observations of vehicle users, and propose a collaborative edge caching method to minimize the long-term trade-off between system content transmission delay and caching overhead (Qiao G, Leng S, Maharjan S, et al. Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks. IEEE Internet of Things Journal, 2019, 7(1): 247-257.). However, most existing Internet of Vehicles edge content caching schemes are built on a centralized network architecture and do not fully exploit the deployment characteristics of densely and heterogeneously deployed edge nodes. In addition, given the openness of Internet of Vehicles communication, privacy protection of vehicle users' personal data is also non-negligible.
Therefore, in environments with densely and heterogeneously deployed edge nodes, how to realize high capacity, low cache overhead, and seamless coverage of vehicular communication with a decentralized network architecture, on the basis of privacy protection, still requires further research.
Disclosure of Invention
In order to overcome the problems in the related art, the embodiment of the invention discloses an Internet of Vehicles edge content caching decision method based on federal reinforcement learning. The method comprises the following steps:
step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment, and all sample data are cached in an experience replay pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own actor network and critic network parameters according to the soft actor-critic algorithm;
step 7, the aggregation center collects the actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
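Steps 1-9 can be sketched as the following control-flow skeleton (the classes, the toy policy, and the toy reward are illustrative stand-ins introduced here for clarity; the patent's agents use SAC actor/critic neural networks, not these placeholders):

```python
import random

# Illustrative skeleton of steps 1-9; VehicleAgent is a toy stand-in,
# not the patent's actual actor/critic neural networks.
class VehicleAgent:
    def __init__(self):
        self.actor_weights = [random.random() for _ in range(4)]
        self.replay_pool = []

    def act(self, obs):                      # step 3: independent local decision
        return obs % 2                       # placeholder policy

    def store(self, sample):                 # step 4: experience replay pool
        self.replay_pool.append(sample)

    def update(self):                        # step 6: local SAC update (stub)
        self.actor_weights = [w * 0.99 for w in self.actor_weights]

def federated_round(agents, env_slots=5, min_samples=3):
    for t in range(env_slots):               # step 2: interact per time slot
        for ag in agents:
            obs = t                          # toy observation
            action = ag.act(obs)
            reward = -abs(action - 1)        # toy trade-off reward
            ag.store((obs, action, reward))
    for ag in agents:                        # steps 5-6: update if enough samples
        if len(ag.replay_pool) >= min_samples:
            ag.update()
    # step 7: the aggregation center averages actor weights and broadcasts them
    avg = [sum(ws) / len(agents)
           for ws in zip(*(a.actor_weights for a in agents))]
    for ag in agents:
        ag.actor_weights = list(avg)
    return avg

agents = [VehicleAgent() for _ in range(3)]
aggregated = federated_round(agents)
```

After one round, all agents share the aggregated actor weights, which is the property the federated aggregation of step 7 provides.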
Compared with the prior art, the invention has the following remarkable advantages: (1) To address the link congestion caused by the network-centric architecture of existing Internet of Vehicles caching scenarios, the invention exploits the deployment characteristics of dense edge nodes to design user-centric edge node clusters, realizing high capacity, low cache overhead, and seamless coverage of vehicular communication; (2) To address the heavy information interaction and privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy protection advantage of federated learning, realizing collaborative decisions among vehicle users by sharing local neural network weights, so as to reduce the long-term content transmission delay and edge cache overhead of the system.
The invention is further described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a method for caching and deciding the edge content of the Internet of vehicles based on a federal reinforcement learning framework.
FIG. 2 is a graph showing the convergence of the average trade-off per vehicle user over training rounds in an embodiment of the present invention.
Fig. 3 is a graph showing the convergence of the average transmission delay per vehicle user over training rounds in an embodiment of the present invention.
Fig. 4 is a graph showing the convergence of the average cache overhead per vehicle user over training rounds in an embodiment of the present invention.
FIG. 5 is a graph showing the average trade-off per vehicle user as a function of the maximum number of associations in an embodiment of the present invention.
Detailed Description
The invention provides an Internet of Vehicles edge content caching decision method based on federal reinforcement learning. Specifically, in each unit time slot, every vehicle user is regarded as an agent that obtains observation information within its observation range, such as its distance to each road side unit, the cache state of each road side unit, and the remaining cache capacity of each road side unit, and trains its action policy with a neural network to minimize the trade-off between the system's total content delivery delay and edge cache overhead. With reference to FIGS. 1-2, the method comprises the following steps:
Step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment, and all sample data are cached in an experience replay pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own actor network and critic network parameters according to the soft actor-critic algorithm;
step 7, the aggregation center collects the actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
Step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
As a specific embodiment, the Internet of Vehicles environment input in step 1 specifically includes:
(1) Time slot model: the continuous training time is discretized into multiple time slots, denoted $\mathcal{T} = \{1, 2, \dots, T\}$, each of duration τ; channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles is modeled as a Manhattan grid, with road side units capable of providing communication services uniformly distributed on both sides of the roads. Each road side unit serves as an edge node with dedicated communication resources and limited local storage, and is connected to an edge server through a high-speed wired link; the edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation, and so on. Let the set of road side units be $\mathcal{J}$ and the set of vehicles be $\mathcal{I}$. Vehicle users may travel in all four directions of the road grid (forward, backward, left, and right), and each direction has multiple lanes to ensure the passage of vehicles;
(3) Vehicle mobility model: the speed of each vehicle follows a Gauss-Markov random process. Specifically, for vehicle user i with initial speed $v_i(0)$, its speed in time slot t is determined by its speed in time slot t-1:

$v_i(t) = \eta_i v_i(t-1) + (1 - \eta_i)\bar{v}_i + \sigma_i \sqrt{1 - \eta_i^2}\, z$

where $\bar{v}_i$ and $\sigma_i$ are the asymptotic mean and standard deviation of vehicle user i's speed; the parameter $\eta_i \in [0,1]$ is the memory depth of the previous slot's speed and determines the time dependence of vehicle user i's movement; and z is an uncorrelated standard normal random variable with zero mean and unit variance. Notably, the closer $\eta_i$ is to 1, the more the current slot speed of vehicle user i depends on the speed of the previous slot;
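The Gauss-Markov speed update can be simulated directly; a small sketch with assumed parameter values ($\eta_i = 0.9$, asymptotic mean 15 m/s, $\sigma_i = 2$, none of which are specified in the patent):

```python
import math
import random

def gauss_markov_speed(v_prev, eta, v_mean, sigma, z):
    """One Gauss-Markov update:
    v(t) = eta*v(t-1) + (1-eta)*v_mean + sigma*sqrt(1-eta^2)*z, z ~ N(0,1)."""
    return eta * v_prev + (1 - eta) * v_mean + sigma * math.sqrt(1 - eta * eta) * z

random.seed(0)
v = 20.0                                   # assumed initial speed v_i(0), m/s
trace = [v]
for _ in range(1000):
    v = gauss_markov_speed(v, eta=0.9, v_mean=15.0, sigma=2.0,
                           z=random.gauss(0.0, 1.0))
    trace.append(v)

# With eta close to 1 the speed sticks to its previous value; with eta = 0
# it regresses immediately to the asymptotic mean v_mean.
avg = sum(trace[200:]) / len(trace[200:])
```

After a burn-in, the long-run average hovers near the asymptotic mean, matching the role of $\bar{v}_i$ in the model.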
(4) Content request model: let $\mathcal{F}$ denote the set of all contents that vehicles may request, $s_f$ the size of content f, and $g_f$ the feature value of content f used to distinguish different contents. Assume that vehicle user i generates at most one content request in time slot t, expressed as $q_{i,f}(t) \in \{0,1\}$ with $\sum_{f \in \mathcal{F}} q_{i,f}(t) \le 1$, where $q_{i,f}(t) = 1$ indicates that vehicle user i requests content f in time slot t, and $q_{i,f}(t) = 0$ otherwise. Considering that, in addition to globally popular content, a vehicle user may prefer files similar to content requested in previous time slots, it is assumed that each vehicle user requests content according to global popularity with probability ε, and according to local personal preference with probability 1-ε, where the two are defined as follows:
(1) Global popularity: let $P_f$ denote the global popularity of content f among vehicle users' requests, which follows a Mandelbrot-Zipf distribution, i.e.

$P_f = \dfrac{(I_f + \alpha)^{-\beta}}{\sum_{f' \in \mathcal{F}} (I_{f'} + \alpha)^{-\beta}}$

where $I_f$ denotes the rank of content f in descending order of global popularity, and α and β denote the plateau factor and the skewness factor, respectively. (2) Local personal preference: in this case, each vehicle user requests a file based on its similarity to previously requested content; for example, if the vehicle user requests content f in time slot t, then in time slot t+1 it will request the content with the highest similarity to f. Cosine similarity is adopted here to measure the similarity between contents f and f*:

$\mathrm{sim}(f, f^{*}) = \dfrac{g_f \cdot g_{f^{*}}}{\|g_f\| \, \|g_{f^{*}}\|}$
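The Mandelbrot-Zipf popularity and the cosine-similarity preference can be computed directly; a sketch following the patent's naming (α as plateau factor, β as skewness factor; the numeric values are illustrative assumptions):

```python
import math

def mzipf_probs(num_contents, alpha, beta):
    """Mandelbrot-Zipf popularity: P_f proportional to (rank + alpha)^(-beta),
    normalized over all contents."""
    weights = [(rank + alpha) ** (-beta) for rank in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def cosine_similarity(u, v):
    """Similarity between the feature vectors of two contents f and f*."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

probs = mzipf_probs(num_contents=100, alpha=0.5, beta=1.2)
```

The probabilities decrease with rank, so low-rank (most popular) contents dominate requests, which is what makes edge caching of a few hot files effective.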
(5) User-centric edge caching model: to improve the transmission rate of vehicle users' requested content, a user-centric network architecture is adopted to design the Internet of Vehicles edge content caching framework. Specifically, each vehicle user selects one or more neighboring road side units to construct its own user-centric edge node cluster by observing the information of road side units within its perception range. Let the maximum number of road side units observable by vehicle user i in time slot t be $O_{max}$, and the maximum number it can associate with be $S_{max} \le O_{max}$. Let $\Phi_i(t)$ denote the edge node cluster serving vehicle user i in time slot t; then $|\Phi_i(t)|$ is the number of road side units in the cluster at that moment. The edge association decision of vehicle user i in time slot t is expressed as $x_{i,j}(t) \in \{0,1\}$, where $x_{i,j}(t) = 1$ indicates that road side unit j belongs to the edge node cluster of vehicle user i in time slot t, and $x_{i,j}(t) = 0$ otherwise. Accordingly, the user-centric edge node cluster can be represented as $\Phi_i(t) = \{ j : x_{i,j}(t) = 1 \}$.
Both the road side units and the cloud center have cache capacity: the cloud center's capacity can handle the cache requests of all vehicle users, while each road side unit has a limited cache capacity C. Under the user-centric edge content caching framework, a vehicle user can request that part of its requested files be cached at its associated edge node cluster, which then provides the cached-content service. Thus, the vehicle user must further decide whether to cache the requested content of the current time slot at each road side unit in the edge node cluster. The edge caching decision variable of vehicle user i in time slot t is expressed as $y_{i,j}(t) \in \{0,1\}$, where $y_{i,j}(t) = 1$ indicates that vehicle user i caches the requested content of time slot t at road side unit j, and $y_{i,j}(t) = 0$ otherwise.
(6) Wireless transmission model: assume that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the road side units and the vehicle users are equipped with single antennas. The channel power gain comprises fast fading and slow fading: the fast fading is mainly caused by the multipath effect, i.e., Rayleigh fading, while the slow fading consists of path loss and shadow fading. The channel gain between vehicle user i and road side unit j in time slot t is expressed as

$h_{i,j}(t) = \sqrt{\varphi_{i,j}(t)}\, g_{i,j}(t)$

where $g_{i,j}(t)$ is the fast fading component, which follows a complex Gaussian distribution with zero mean and unit variance, i.e., $g_{i,j}(t) \sim \mathcal{CN}(0,1)$, while the slow fading component between vehicle user i and road side unit j is

$\varphi_{i,j}(t) = A_i \beta_i \left[ d_{i,j}(t) \right]^{-\eta}$

where $A_i$ is a constant of the path loss; $\beta_i$ is the lognormal shadow fading component with standard deviation ζ; $d_{i,j}(t)$ is the distance between vehicle user i and road side unit j in time slot t; η is the attenuation exponent of the path loss component; and $\left[ d_{i,j}(t) \right]^{-\eta}$ captures the path loss of the channel between vehicle user i and road side unit j in time slot t;
Let the transmit power of road side unit j in the edge node cluster $\Phi_i(t)$ at time slot t be $p_j$. The achievable downlink transmission rate of vehicle user i is then expressed as

$r_{i,j}(t) = B \log_2\!\left( 1 + \dfrac{p_j |h_{i,j}(t)|^2}{\sigma^2} \right)$

where B is the channel bandwidth and $\sigma^2$ is the power of the additive white Gaussian noise;
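Under the interference-free assumption, the rate formula can be exercised numerically; the sketch below composes the slow fading (path loss times shadowing) with Rayleigh fast fading, whose power for $g \sim \mathcal{CN}(0,1)$ is exponentially distributed with mean 1 (all parameter values are illustrative assumptions):

```python
import math
import random

def downlink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Achievable rate r = B * log2(1 + p * |h|^2 / sigma^2), valid under the
    model's assumption of interference-free orthogonal resource blocks."""
    return bandwidth_hz * math.log2(1 + tx_power_w * channel_gain / noise_power_w)

def channel_power_gain(pathloss_const, shadowing, distance_m, path_exp, rng):
    """|h|^2 = slow fading (A * shadow * d^-eta) times Rayleigh fast fading;
    the fast-fading power |g|^2 for g ~ CN(0,1) is Exp(1)."""
    slow = pathloss_const * shadowing * distance_m ** (-path_exp)
    fast = rng.expovariate(1.0)
    return slow * fast

rng = random.Random(1)
gain = channel_power_gain(pathloss_const=1e-3, shadowing=1.0,
                          distance_m=100.0, path_exp=3.0, rng=rng)
rate = downlink_rate(bandwidth_hz=10e6, tx_power_w=1.0,
                     channel_gain=gain, noise_power_w=1e-13)
```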
(7) Delay model: let the requested content of each vehicle user have a maximum tolerable delay $T^{max}$. In addition, to make full use of edge cache resources, the edge server automatically refreshes its storage space every period Δ, leaving a short service vacancy for cache replacement in each period. Therefore, if vehicle user i needs to download cached content from the edge server in time slot t, its content delivery must be completed within the tolerable delay and before the next cache-replacement refresh, i.e.

$T_i(t) \le \min\{ T^{max},\ (n+1)\Delta - t \}$

where $(n+1)\Delta - t$ is the number of time slots remaining in the corresponding cache refresh period.
Considering that the content delivery delay of a vehicle user depends on whether the requested content has been cached at the edge server in advance, the cache state of road side unit j in time slot t is defined as $c_{j,f}(t) \in \{0,1\}$, where $c_{j,f}(t) = 1$ indicates that content f is cached at road side unit j in time slot t, and $c_{j,f}(t) = 0$ otherwise. Delivery of the requested content to the vehicle user thus includes two cases, edge caching and cloud downloading:
(1) Edge caching: when any edge node in the cluster $\Phi_i(t)$ has cached the requested content f, vehicle user i can obtain it directly from the edge, with delay

$T_i^{edge}(t) = \dfrac{s_f}{r_{i,j}(t)}$

(2) Cloud downloading: when no edge node in the cluster $\Phi_i(t)$ has cached the requested content f, the edge server must download it from the cloud center, incurring an extra fixed delay $T^{cloud}$.
Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as

$T_i(t) = \mathbb{1}\{\exists j \in \Phi_i(t) : c_{j,f}(t) = 1\}\, T_i^{edge}(t) + \mathbb{1}\{\forall j \in \Phi_i(t) : c_{j,f}(t) = 0\}\, \big( T_i^{edge}(t) + T^{cloud} \big)$

where $\mathbb{1}\{\cdot\}$ equals 1 if the condition inside holds and 0 otherwise.
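The two delivery cases combine into a single delay computation; a sketch under the model's assumptions (the function and variable names are illustrative):

```python
def content_delivery_delay(content_size_bits, rate_bps, cluster_cache_flags,
                           cloud_extra_delay_s):
    """Total delay for one request: edge delay s_f / r if any road side unit
    in the user-centric cluster holds the content, plus a fixed cloud-fetch
    delay otherwise (the indicator-function form of the delay model)."""
    edge_delay = content_size_bits / rate_bps
    if any(cluster_cache_flags):              # edge caching case
        return edge_delay
    return edge_delay + cloud_extra_delay_s   # cloud downloading case
```

For example, a 1 Mb file at 1 Mb/s takes 1 s when a cluster hit occurs, and 1 s plus the fixed cloud delay on a miss.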
As a specific embodiment, modeling the optimization problem in step 1 is specifically:
Taking the minimization of the long-term trade-off between content delivery delay and edge cache overhead as the optimization objective, a user-centric intelligent edge caching scheme is designed. To this end, normalized content-delivery-delay and edge-cache-overhead utilities, denoted $U_i^{delay}(t)$ and $U_i^{cache}(t)$, are first designed.
The optimization problem is thus expressed as:

$\min\ \lim_{T \to \infty} \dfrac{1}{T} \sum_{t=1}^{T} \sum_{i \in \mathcal{I}} \big[ \omega_1 U_i^{delay}(t) + \omega_2 U_i^{cache}(t) \big] \quad \text{s.t. } C_1\text{-}C_4$

where $\omega_1$ and $\omega_2$ are the weights of the content delivery delay utility and the cache overhead utility, respectively; $C_1$ requires that the edge node cluster size of each vehicle user in any time slot not exceed $S_{max}$; $C_2$ requires that each road side unit serve at most a single vehicle user in any time slot; $C_3$ requires that the content cached at each road side unit not exceed its local cache capacity; and $C_4$ requires that the requested-content delivery delay of each vehicle user in any time slot not exceed the maximum tolerable delay.
As a specific embodiment, in step 2, each vehicle agent interacts with the road side units within its observation range to obtain observation information such as its distance to each road side unit, the cache state of each road side unit, and the remaining cache capacity of each road side unit. Specifically: in time slot t, each vehicle user acts as an agent and obtains its own observation state through interaction with the environment. The observation state of vehicle i in time slot t is expressed as

$o_i(t) = \big\{ d_{i,j}(t),\ c_{j,f}(t),\ C_j^{rem}(t),\ \Delta^{rem}(t) \big\}$

where $d_{i,j}(t)$ is the distance between vehicle user i and road side unit j in time slot t; $c_{j,f}(t)$ is the flag indicating whether the requested content f has been cached at road side unit j in time slot t; $C_j^{rem}(t)$ is the remaining cache space of road side unit j in time slot t; and $\Delta^{rem}(t)$ is the number of time slots remaining until the next cache-space refresh.
As a specific embodiment, in step 3, each vehicle agent can independently decide, according to its local observation information, the associated road side units in the edge node cluster and whether to cache the requested content of the current time slot within the cluster. Specifically, the actions taken by each vehicle agent interacting with the environment comprise the edge association decision variables and the edge caching decision variables; the action of vehicle i in time slot t is expressed as

$a_i(t) = \big\{ x_{i,j}(t),\ y_{i,j}(t) \big\}$
As a specific embodiment, after the action decision is executed in step 4, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment. Specifically, once all vehicle agents have executed their actions, the environment feeds back a global reward, defined as the average trade-off per user:

$r(t) = -\dfrac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \big[ \omega_1 U_i^{delay}(t) + \omega_2 U_i^{cache}(t) \big] - \rho_t$

where $U_i^{delay}(t)$ is the normalized content-delivery-delay utility, $U_i^{cache}(t)$ the normalized edge-cache-overhead utility, and $\rho_t$ the penalty term applied when constraints $C_1$-$C_4$ are not satisfied.
As a specific embodiment, in step 6, each vehicle agent updates its own actor network and critic network parameters according to the federated discrete soft actor-critic algorithm, specifically:
(1) Flexibility value function: the flexible actor-critic algorithm optimization objective is to meet entropy maximization in addition to maximizing the cumulative rewards value returned by training, i.e
wherein ,ρπ A trajectory distribution representing a strategy pi; alpha represents an entropy coefficient; to ensure that the agent can continually explore, entropy is introduced to randomize the strategy, expressed asTherefore, when the algorithm makes a decision, the probability of outputting each action can be dispersed as far as possible, so that the learning capacity of the vehicle intelligent body on the environment is improved, the vehicle intelligent body can adaptively adjust the strategy in the vehicle networking environment with continuously changed channel conditions, and further a more reasonable decision is made; based on this, a soft state action value function is defined as follows
wherein ,a cumulative award representing a discount; further, the soft state value function may be expressed as
Wherein action a depends on the probability distribution pi (|s).
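For a discrete action space, the expectation over actions in the soft state value reduces to a dot product with the policy probabilities. A minimal numerical sketch, with illustrative names:

```python
import numpy as np

def soft_state_value(q_values, log_probs, alpha):
    """Discrete-action soft state value
    V(s) = pi(.|s)^T [ Q(s,.) - alpha * log pi(.|s) ].
    q_values and log_probs are vectors over the action space."""
    probs = np.exp(log_probs)
    return float(probs @ (q_values - alpha * log_probs))
```

With a uniform two-action policy and equal Q-values, the entropy bonus adds exactly `alpha * log(2)` to the value.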
(2) Critic network: under the Actor-Critic framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, so as to maximize the long-term trade-off of the Internet of Vehicles system. To judge the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is built for each. The input of these neural networks is the local observation information collected by the vehicle agents in the environment, and the output is the value of each action in the action space. Since a neural network is commonly used to approximate the soft value function in this framework, the weight parameters of online value networks 1 and 2 and their corresponding target networks are defined as $\theta_{\mathrm{main},1}$, $\theta_{\mathrm{target},1}$, $\theta_{\mathrm{main},2}$ and $\theta_{\mathrm{target},2}$, respectively.

To improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is adopted to weaken the correlation among samples. During training, each vehicle agent randomly draws a minibatch of transition tuples $(s_t, a_t, r_t, s_{t+1})$ from its replay pool $\mathcal{D}$, and the soft state value function for discrete actions is rewritten as

$$V_{\mathrm{soft}}(s_t) = \pi(\cdot|s_t)^{\mathsf T}\big[Q_{\mathrm{soft}}(s_t,\cdot) - \alpha\log\pi(\cdot|s_t)\big]$$

where $\pi(\cdot|s_t)^{\mathsf T}$ denotes the transpose of the action probability distribution $\pi(\cdot|s_t)$ at state $s_t$.
The updates of value networks 1 and 2 in the Critic are approximated by constructing a Bellman mean-square-error loss function for each Q network, expressed as

$$J_Q(\theta_{\mathrm{main},i}) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[\big(Q_{\theta_{\mathrm{main},i}}(s_t,a_t) - y_t\big)^2\Big]$$

where $i=1,2$ indexes value networks 1 and 2. To achieve the maximum-entropy objective, the entropy must be incorporated into the reward; the Bellman backup that evaluates the behavior policy then reads

$$y_t = r_t + \gamma\,V_{\mathrm{soft}}(s_{t+1})$$

with $V_{\mathrm{soft}}$ computed from the target networks. In addition, to obtain the optimal approximation of $Q_{\mathrm{soft}}(s,a)$, the parameters $\theta_{\mathrm{main},i}$ are updated along the gradient direction that minimizes the loss:

$$\theta_{\mathrm{main},i} \leftarrow \theta_{\mathrm{main},i} - \eta_c \nabla_{\theta_{\mathrm{main},i}} J_Q(\theta_{\mathrm{main},i})$$

where $\eta_c$ denotes the learning rate of the Critic network and $\nabla$ the gradient of the loss function;
The target value networks do not actively participate in the learning process and are not updated independently; instead, a soft update is adopted, periodically copying the latest parameters of the online value networks in small steps:
θ target,i (t+1)←τθ main,i (t)+(1-τ)θ target,i (t)
where τ controls the degree of the soft update.
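The soft (Polyak) target update above can be sketched elementwise over the network parameters; the list-of-scalars representation is illustrative.

```python
def soft_update(target, online, tau):
    """Polyak update: theta_target <- tau * theta_main + (1 - tau) * theta_target,
    applied elementwise to parameter lists."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(online, target)]
```

With the experiment's value τ = 0.01, the target network drifts toward the online network by 1% per update, which stabilizes the Bellman targets.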
(3) Actor network: the task of the Actor network is to improve the policy based on the value estimates produced by the Critic networks. Its input is the local observation information collected by each vehicle agent in the environment, and its output is the action probability over the action space, denoted $\pi_\phi(\cdot|s)$. The update of the Actor network can be viewed as exploring the policy by maximizing the system reward, with the soft state-action value function guiding the action decision; the Actor loss is therefore defined as

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\pi_\phi(\cdot|s_t)^{\mathsf T}\big[\alpha\log\pi_\phi(\cdot|s_t) - Q_{\mathrm{soft}}(s_t,\cdot)\big]\Big].$$

Similarly, the Actor network updates its parameters $\phi$ along the gradient direction that minimizes this loss:

$$\phi \leftarrow \phi - \eta_a \nabla_\phi J_\pi(\phi)$$

where $\eta_a$ denotes the learning rate of the Actor network and $\nabla$ the gradient of the loss function;
(4) Self-adjusting entropy coefficient: the entropy coefficient $\alpha$ acts as a weight whose value controls the randomness of the action policy. The larger $\alpha$ is, the more dispersed the output action probabilities become when the algorithm makes a decision, so the vehicle agent explores the environment more thoroughly and tries more actions. Since the reward fed back by the system keeps changing during training in the Internet of Vehicles environment, relying on a fixed a-priori $\alpha$ leads to unstable training and degrades the convergence performance of the system. This section therefore adopts an adaptive entropy coefficient: when the vehicle agent explores a new environment at the beginning of training, $\alpha$ is increased so that the agent explores as fully as possible; as the number of training rounds grows and the optimal actions are essentially determined, $\alpha$ is decreased to optimize the long-term reward trade-off of the system as much as possible. The loss function of the entropy coefficient is defined as

$$J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\big[-\alpha\big(\log\pi_t(a_t|s_t) + \bar{\mathcal{H}}\big)\big]$$

where $\bar{\mathcal{H}}$ denotes the target entropy.
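A sketch of the standard discrete-SAC temperature loss and its gradient with respect to $\alpha$, assuming the conventional form $J(\alpha)=\mathbb{E}_{a\sim\pi}[-\alpha(\log\pi(a|s)+\bar{\mathcal H})]$; names and the closed-form gradient are illustrative.

```python
import numpy as np

def alpha_loss_and_grad(log_probs, alpha, target_entropy):
    """Temperature loss J(alpha) = E_{a~pi}[-alpha * (log pi(a|s) + H_bar)]
    for a discrete policy, plus dJ/dalpha. log_probs spans the action space."""
    probs = np.exp(log_probs)
    expected = float(probs @ (log_probs + target_entropy))
    loss = -alpha * expected
    grad = -expected  # gradient w.r.t. alpha; used for gradient descent on alpha
    return loss, grad
```

When the policy entropy equals the target entropy the gradient is zero, so $\alpha$ stops moving; a too-deterministic policy makes the gradient push $\alpha$ up, restoring exploration.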
As a specific implementation, the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation; the aggregated parameters are broadcast back to the vehicle users within a training round for local training, specifically:
First, each vehicle agent trains a DRL model in a distributed manner from its local observation information. Second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center for policy sharing. Finally, the federally aggregated global model parameters are broadcast to the local agents for the next training iteration. This communication scheme only uploads neural network parameters and downloads policies, greatly reducing the communication load; moreover, since a user's local information cannot be directly recovered from the neural network parameters, the privacy of the system is protected. In each training round, the weight parameters of each vehicle agent's Actor network are updated as:
φ(t+1)=ξφ i (t)+(1-ξ)H i (t)
where $H_i(t)=\sum_{k\neq i}\phi_k(t)$ and $\xi$ is a weight coefficient.
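The aggregation rule above can be sketched directly; note that, as written in the patent, the second term is an unnormalized sum over the other agents' parameters, and the function name is illustrative.

```python
import numpy as np

def federated_update(phis, i, xi):
    """Aggregated actor weights for agent i:
    phi(t+1) = xi * phi_i(t) + (1 - xi) * sum_{k != i} phi_k(t)."""
    phis = [np.asarray(p, dtype=float) for p in phis]
    others = sum(p for k, p in enumerate(phis) if k != i)  # H_i(t)
    return xi * phis[i] + (1.0 - xi) * others
```

In a standard FedAvg variant the sum would be divided by the number of contributing agents; here the patent's formula is reproduced as stated.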
The invention is further described in detail below with reference to the accompanying drawings and specific examples.
Examples
This embodiment provides an Internet of Vehicles edge content caching decision method based on federated reinforcement learning, described in detail below:
1. Establishing a car networking system model:
The simulation environment is set according to the Manhattan grid model in the 3GPP TR 36.885 standard. The city map is 500×500 m, the number of vehicles is 4, and the number of road side units is 36. The maximum observation range of a vehicle user is 200 m, and the maximum number of associated road side units is 4. The number of contents a vehicle user can request is 20, with content sizes in [20, 50] Mbit. The plateau factor α of the global popularity of requested content is -0.88 and the skewness factor β is 0.35. At the beginning of training, vehicle positions are initialized; when a vehicle user reaches an intersection, the subsequent driving direction is chosen with equal probability, i.e. 0.25 per direction. The speed of each vehicle follows the Gauss-Markov mobility model, with a given asymptotic mean initial speed, standard deviation σ_i = 0.1 and parameter η_i set to 0.1. The path loss model follows the slow-fading formulation described above. In addition, the duration of a unit time slot is set to 0.1 s, the cloud download delay to 1 s, and the delay constraint to 1.5 s.
2. Establishing a federal reinforcement learning algorithm framework:
The federated reinforcement learning algorithm combines the federated averaging algorithm with the soft actor-critic algorithm. In the soft actor-critic framework, each Critic network is fitted with a fully connected neural network with 2 hidden layers, where the number of hidden-layer neurons is 64 and the activation function is the rectified linear unit f(x) = max(0, x); the Actor networks are fitted with the same fully connected architecture. A virtual cloud server is constructed as the aggregation center, performing federated aggregation of the local Actor network weight parameters.
3. Training phase of algorithm:
First, each vehicle agent trains its edge caching strategy using the soft actor-critic algorithm. For vehicle agent i, the local state input at time slot t comprises the data size of the agent's requested content, the global popularity of the file, the distance between the agent and road side unit j, the flag bit indicating whether road side unit j has cached the requested content f, the size of the remaining cache space, and the number of time slots remaining before the cache space is refreshed. Second, the vehicle agent decides its action based on this local observation state, namely the edge association decision variable and the edge caching decision variable.
After the state is input, the Actor network predicts an action; after interacting with the Internet of Vehicles environment, the vehicle agent obtains the global reward fed back by the system and transitions to the next state. When enough sample data has accumulated, the vehicle agent updates the Actor and Critic networks by gradient descent. At the end of each round, each vehicle agent uploads the weight parameters of its local Actor network to the cloud center for federated averaging, and the aggregated global parameters are broadcast back to the vehicle agents for the next round of local training. 3000 training episodes and 100 test episodes are used. During training, the Critic network learning rate η_c is set to 10e-4, the Actor network learning rate η_a to 10e-4, the discount factor γ to 0.9, the soft update degree to 0.01, the experience replay pool size to 5000, and the minibatch size to 100.
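The experience replay pool used in the training loop can be sketched as a bounded FIFO store with uniform minibatch sampling; class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay pool: stores (s, a, r, s') transitions
    and draws uniform random minibatches for training."""
    def __init__(self, capacity=5000):
        self.pool = deque(maxlen=capacity)  # oldest samples evicted first

    def add(self, transition):
        self.pool.append(transition)

    def sample(self, batch_size=100):
        # Sample without replacement; cap at current pool size.
        return random.sample(list(self.pool), min(batch_size, len(self.pool)))
```

The capacity 5000 and batch size 100 mirror the experiment settings above.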
The present invention compares the proposed solution with the following reference solution:
(1) Edge caching scheme based on the independent soft actor-critic algorithm (ISAC): each vehicle user, acting as an agent, trains its own policy with the SAC algorithm; that is, each agent trains independent Critic and Actor networks and makes decisions in a distributed manner from its own local observations.
(2) Edge caching scheme based on the independent deep Q-network algorithm (Independent Deep Q Network, IDQN): each vehicle user, acting as an agent, trains its own policy with the DQN algorithm; that is, each agent trains an independent Q network and makes decisions in a distributed manner from its own local observations.
(3) Edge caching scheme based on federated DQN (Federated DQN, FedDQN): each vehicle user, acting as an agent, federally trains its own policy with a DQN algorithm that shares the local Q-network weights: the Q-network weights are uploaded to the cloud center for federated averaging, after which each vehicle user downloads the aggregated global parameters and makes decisions in a distributed manner from its own local observations.
As shown in fig. 2, comparing the proposed scheme with the reference schemes, the ISAC-based scheme clearly converges fastest, the proposed scheme is second, and the FedDQN- and IDQN-based schemes converge slowest. The underlying cause is that the proposed scheme and the ISAC-based scheme maximize an entropy objective, which enhances the algorithm's exploration capability and thus promotes model convergence and better convergence performance; the IDQN- and FedDQN-based schemes explore with a random policy and therefore struggle to converge and to balance sample exploration and exploitation when handling the high-dimensional action space of the Internet of Vehicles environment. Furthermore, compared with the ISAC-based scheme, the proposed scheme achieves a better system average trade-off while protecting privacy, for two reasons: first, by sharing the Actor network weight parameters of the local DRL models, each agent shares its experience of interacting with the environment, which promotes better decision policies for every agent; second, the cloud center collects only weight parameters and cannot obtain the local observation information used in training, which reduces communication overhead and protects the privacy of vehicle users.
As shown in fig. 3, the average transmission delay per vehicle user decreases as the number of training rounds increases, which demonstrates the effectiveness of each scheme. Compared with the reference schemes, the proposed scheme clearly outperforms the IDQN- and FedDQN-based schemes in optimizing the average transmission delay, and slightly outperforms the ISAC-based scheme. Two aspects explain this: first, all four schemes optimize the average transmission delay well, so after the utility function is normalized their performance is difficult to distinguish; second, because the dynamic Internet of Vehicles environment is complex and changeable, the convergence results fluctuate, making clear performance advantages difficult to exhibit. As shown in fig. 4, the edge caching overhead of every scheme decreases as the number of training rounds increases; the proposed scheme's edge caching overhead is the smallest, the ISAC-based scheme is second, and the IDQN- and FedDQN-based schemes perform worst. Combining fig. 3 and fig. 4, the proposed scheme greatly reduces the edge caching overhead while still optimizing the transmission delay of each vehicle user.
Fig. 5 shows how the maximum association number parameter used to construct edge node clusters in the Internet of Vehicles affects the average trade-off performance per vehicle user. Specifically, as the number of associated road side units increases, the convergence result of the proposed scheme also improves. The reason is that a user-centric edge node cluster seamlessly adapts to dynamic fluctuations of the network topology, and as the number of road side units composing the cluster grows, its joint transmission technique greatly improves the data transmission rate. However, once the associated road side units reach a certain number, the transmission-rate gain of the edge node cluster saturates, i.e. the average trade-off convergence result per vehicle user at $S_{\max}=4$ differs little from that at $S_{\max}=3$.
In summary, targeting the unknown, highly dynamic topology and channel-state characteristics of the Internet of Vehicles, the invention combines the privacy-protection advantage of federated learning with a user-centric framework and proposes an Internet of Vehicles edge caching scheme based on a federated soft actor-critic algorithm. The scheme achieves collaborative training and decision-making in a multi-agent environment without revealing the vehicle users' local training data, thereby jointly optimizing transmission delay and caching cost while ensuring strong privacy, and it outperforms the benchmark schemes in per-vehicle-user average trade-off and convergence performance.
Claims (8)
1. The internet of vehicles edge content caching decision-making method based on federal reinforcement learning is characterized by comprising the following specific steps:
step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
Step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent acquires the trade-off rewards of the total content delivery delay and the edge cache overhead of the system fed back by the vehicle networking environment, and all sample data are cached to an experience multiplexing pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own Actor network and Critic network parameters according to the soft Actor-Critic algorithm;
step 7, collecting the weight parameters of the Actor network of each vehicle intelligent agent by the aggregation center, and performing federal aggregation, wherein the aggregated parameters are broadcasted to a vehicle user in one training round to perform local training;
step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is achieved, if not, resetting the environment of the Internet of vehicles, and returning to the step 1; and if yes, finishing training and finishing the decision of caching the edge content of the Internet of vehicles.
Compared with the prior art, the invention has notable advantages: (1) To address the link congestion load caused by the network-centric architecture of Internet of Vehicles caching scenarios, the invention designs user-centric edge node clusters that exploit the dense deployment of edge nodes, so as to realize high capacity, low caching overhead and seamless vehicular communication coverage; (2) To address the heavy information exchange and privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy-protection advantage of federated learning and realizes collaborative decision-making among vehicle users by sharing local model neural network weights, so as to reduce the system's long-term content transmission delay and edge caching overhead.
2. The internet of vehicles edge cache decision method based on federal multi-agent reinforcement learning according to claim 1, wherein the inputting the internet of vehicles environment in step 1 specifically comprises:
(1) Time slot model: the continuous training time is discretized into multiple time slots, denoted $\mathcal{T}=\{1,2,\dots,T\}$, where each time slot has duration τ; channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles model is established as a Manhattan grid model, with road side units capable of providing communication services uniformly distributed along both sides of the roads. Each road side unit serves as an edge node with dedicated communication resources and limited local storage resources, and is connected to an edge server through a high-speed wired link. The edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation and the like. Let $\mathcal{J}$ denote the set of road side units and $\mathcal{I}$ the set of vehicles. All vehicle users can travel forward, backward, left and right along the road, and each direction has multiple lanes to ensure traffic flow;
(3) Vehicle movement model: the speed of a vehicle follows a Gauss-Markov random process. Specifically, when vehicle user i travels from an initial speed, its speed at time slot t can be expressed as the weighted sum of its speed at time slot t-1, the asymptotic mean speed, and a random variable:

$$v_i(t) = \eta_i v_i(t-1) + (1-\eta_i)\bar{v}_i + \sigma_i\sqrt{1-\eta_i^2}\, z$$

where $\bar{v}_i$ and $\sigma_i$ are the corresponding asymptotic mean and standard deviation of vehicle user i's speed; the parameter $\eta_i\in[0,1]$ represents the memory depth of the previous slot's speed and determines the time dependence of vehicle user i's movement; and z follows an uncorrelated zero-mean, unit-variance standard normal distribution. Notably, the closer $\eta_i$ is to 1, the more the current slot speed of vehicle user i depends on the speed of the previous slot;
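One step of the Gauss-Markov speed update above can be sketched as follows; the reconstructed update form and the function name are assumptions consistent with the standard model.

```python
import math
import random

def gauss_markov_speed(v_prev, v_mean, sigma, eta, z=None):
    """One Gauss-Markov mobility update:
    v_t = eta*v_{t-1} + (1-eta)*v_mean + sigma*sqrt(1-eta^2)*z, z ~ N(0,1).
    Pass z explicitly for reproducibility."""
    if z is None:
        z = random.gauss(0.0, 1.0)
    return eta * v_prev + (1.0 - eta) * v_mean + sigma * math.sqrt(1.0 - eta * eta) * z
```

At η = 1 the speed is fully memory-dependent (constant given z's coefficient vanishing), while at η = 0 it is memoryless around the asymptotic mean.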
(4) Content request model: let $\mathcal{F}$ denote the set of all contents that vehicles may request, $s_f$ the size of each requested content, and $g_f$ the characteristic value used to distinguish different requested contents. It is assumed that vehicle user i can generate only one content request f in time slot t, expressed as $q_{i,f}^t\in\{0,1\}$,
where $q_{i,f}^t=1$ indicates that vehicle user i requests content f in time slot t, and $q_{i,f}^t=0$ otherwise. Considering that, besides globally popular content, a vehicle user may prefer files similar to content requested in previous time slots, it is assumed that the vehicle user requests according to global popularity with probability ε and according to local personal preference with probability 1-ε, where the two are defined as follows:
① Global popularity: let $p_f$ denote the global popularity with which each vehicle user requests file f, which follows the Mandelbrot-Zipf distribution, i.e.

$$p_f = \frac{(I_f + \alpha)^{-\beta}}{\sum_{f'\in\mathcal{F}} (I_{f'} + \alpha)^{-\beta}}$$

where $I_f$ denotes the rank of content f in descending order of popularity; α and β denote the plateau factor and the skewness factor, respectively;
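The Mandelbrot-Zipf popularity can be sketched numerically; the exact normalized form is an assumption reconstructed from the plateau and skewness factors named in the text.

```python
def mzipf_popularity(num_contents, alpha, beta):
    """Mandelbrot-Zipf probabilities p_f = (I_f + alpha)^(-beta) / Z,
    with I_f the 1-based descending popularity rank of content f."""
    weights = [(rank + alpha) ** (-beta) for rank in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

With the experiment's parameters (20 contents, α = -0.88, β = 0.35), the distribution is heavily skewed toward the top-ranked content.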
② Local personal preference: in this case each vehicle user requests a file based on its similarity to previously requested content; for example, when the vehicle user requests content f in time slot t, the content with the highest similarity to f will be requested in time slot t+1. Cosine similarity is adopted to measure the similarity between contents f and f*:

$$\mathrm{sim}(f, f^*) = \frac{g_f \cdot g_{f^*}}{\|g_f\|\,\|g_{f^*}\|}$$
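Treating the content characteristic values as feature vectors (an assumption, since the original rendering of the formula was lost), cosine similarity can be sketched as:

```python
import math

def cosine_similarity(g1, g2):
    """Cosine similarity between two content characteristic vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)
```

Identical feature directions yield similarity 1, orthogonal features yield 0.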
(5) User-centric edge caching model: to improve the transmission rate of the vehicle users' requested content, a user-centric network architecture is adopted to design the Internet of Vehicles edge content caching framework. Specifically, each vehicle user selects one or more neighboring road side units, based on the observed information of road side units within its sensing range, to construct a user-centric edge node cluster. Let the maximum number of road side units observable by vehicle user i at time slot t be $O_{\max}$, and the maximum number it can associate with be $S_{\max}\le O_{\max}$. Let $\Phi_i^t$ denote the edge node cluster serving vehicle user i at time slot t; $|\Phi_i^t|$ then denotes the number of road side units in the cluster at that moment. The edge association decision of vehicle user i at time slot t is expressed as

$$x_{i,j}^t \in \{0,1\}$$

where $x_{i,j}^t=1$ indicates that road side unit j belongs to the edge node cluster of vehicle user i at time slot t, and $x_{i,j}^t=0$ otherwise. Accordingly, the user-centric edge node cluster can be denoted $\Phi_i^t=\{j : x_{i,j}^t=1\}$.
Both the road side units and the cloud center are equipped with cache capacity: the cloud center's capacity can handle the cache requests of all vehicle users, while each road side unit has a limited cache capacity C. Under the user-centric edge content caching framework, a vehicle user may request that part of its files be cached in the associated edge node cluster, thus providing a cached-content service; the vehicle user therefore further decides whether to cache the requested content of the current time slot at each road side unit in the edge node cluster. The edge caching decision variable of vehicle user i at time slot t is expressed as

$$c_{i,j}^t \in \{0,1\}$$

where $c_{i,j}^t=1$ indicates that vehicle user i caches the requested content of time slot t at road side unit j, and $c_{i,j}^t=0$ otherwise.

(6) Wireless transmission model: assume that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the road side units and the vehicle users are equipped with single antennas. The channel power gain comprises fast fading and slow fading, where the fast fading part is mainly caused by the multipath effect, i.e. Rayleigh fading, and the slow fading part consists of path loss and shadow fading. The channel gain between vehicle user i and road side unit j at time slot t is expressed as

$$h_{i,j}^t = g_{i,j}^t \sqrt{\beta_{i,j}^t}$$

where $g_{i,j}^t$ is the corresponding fast fading component, following a zero-mean, unit-variance complex Gaussian distribution, i.e. $g_{i,j}^t \sim \mathcal{CN}(0,1)$, while the slow fading component between vehicle user i and road side unit j is expressed as

$$\beta_{i,j}^t = A_i \beta_i \big(d_{i,j}^t\big)^{-\eta}$$

where $A_i$ is a constant of the path loss; $\beta_i$ is a log-normal shadow-fading component with standard deviation ζ; $d_{i,j}^t$ denotes the distance between vehicle user i and road side unit j at time slot t; η denotes the attenuation coefficient of the path-loss component; and $(d_{i,j}^t)^{-\eta}$ represents the path loss of the channel between vehicle user i and road side unit j at time slot t;
Let the transmit power of road side unit j in the edge node cluster $\Phi_i^t$ at time slot t be $p_j$. The achievable downlink transmission rate of vehicle user i is expressed as

$$R_i^t = B \log_2\!\Big(1 + \frac{\sum_{j\in\Phi_i^t} p_j \big|h_{i,j}^t\big|^2}{\sigma^2}\Big)$$

where B is the channel bandwidth and $\sigma^2$ the power of the additive white Gaussian noise;
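A numerical sketch of the achievable rate; the summation of received powers over cluster members reflects the joint transmission described later in the text and is an assumption, as are the parameter names.

```python
import math

def downlink_rate(bandwidth_hz, tx_powers, channel_gains, noise_power):
    """Achievable downlink rate with joint transmission from the cluster:
    R = B * log2(1 + sum_j p_j * |h_j|^2 / sigma^2)."""
    snr = sum(p * abs(h) ** 2 for p, h in zip(tx_powers, channel_gains)) / noise_power
    return bandwidth_hz * math.log2(1.0 + snr)
```

Adding road side units to the cluster increases the received signal power and hence the rate, matching the saturation behavior discussed for fig. 5.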
(7) Delay model: let the delay tolerance of each vehicle user's requested content be $\tau^{\max}$. In addition, to make full use of edge cache resources, the edge server automatically refreshes its storage space every period Δ, leaving a short service vacuum in which cached content is replaced. Therefore, if vehicle user i needs to download cached content from the edge server at time slot t, its content delivery must be completed within both the tolerable delay $\tau^{\max}$ and the current cache-replacement refresh period, i.e. within $\min\{\tau^{\max},\,(n+1)\Delta - t\}$ time slots, where $(n+1)\Delta - t$ represents the number of time slots remaining in the corresponding cache refresh period;
Considering that a vehicle user's content delivery delay depends on whether the requested content has been cached at the edge server in advance, the caching state of road side unit j at time slot t is defined as $z_{j,f}^t\in\{0,1\}$,
where $z_{j,f}^t=1$ indicates that content f is cached at road side unit j in time slot t, and $z_{j,f}^t=0$ otherwise. Delivery of a vehicle user's requested content thus covers two cases, edge caching and cloud download:
① Edge caching: when any edge node in the cluster $\Phi_i^t$ has cached the requested content f, vehicle user i can obtain the requested content directly from the edge, with delay

$$d_{i,f}^{\mathrm{edge}} = \frac{s_f}{R_i^t}.$$

② Cloud download: when none of the edge nodes in the cluster $\Phi_i^t$ has cached the requested content f, the edge server must download it from the cloud center, incurring an extra fixed delay $d^{\mathrm{cloud}}$.

Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as

$$d_{i,f}^t = d_{i,f}^{\mathrm{edge}} + \mathbb{1}\Big\{\textstyle\sum_{j\in\Phi_i^t} z_{j,f}^t = 0\Big\}\, d^{\mathrm{cloud}}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, set to 1 when the condition holds and 0 otherwise.
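The two delivery cases can be sketched together; parameter names are illustrative, and the 1 s cloud delay default mirrors the experiment settings.

```python
def delivery_delay(content_bits, rate_bps, cached_in_cluster, cloud_delay=1.0):
    """Total delay to fetch requested content: transmission time from the
    edge, plus a fixed cloud-download delay when no cluster node caches it."""
    delay = content_bits / rate_bps
    if not cached_in_cluster:
        delay += cloud_delay  # indicator term: cloud fetch needed
    return delay
```

A cache hit in the cluster saves exactly the fixed cloud delay, which is what the caching decision trades against storage overhead.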
3. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 2, wherein the modeling of the optimization problem in step 1 is specifically:
A user-centric intelligent edge caching method is designed with the optimization objective of minimizing the long-term trade-off between content delivery delay and edge caching overhead. To this end, the normalized content delivery delay utility and the normalized edge caching overhead utility are first designed;
the optimization problem is then expressed as minimizing the long-term average weighted sum of the two utilities, subject to the following constraints:
where $\omega_1$ and $\omega_2$ denote the weights of the content delivery delay utility and the caching overhead utility, respectively; $C_1$ requires that the edge node cluster size of each vehicle user in any time slot not exceed $S_{\max}$; $C_2$ requires that each road side unit serve at most a single vehicle user in any time slot; $C_3$ requires that the content cached by each road side unit not exceed its local cache capacity; $C_4$ requires that the requested-content delivery delay of each vehicle user in any time slot not exceed the maximum allowable delay.
4. The internet of vehicles edge content caching decision-making method based on federal multi-agent reinforcement learning according to claim 3, wherein in step 2, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as a distance between the vehicle agent and the road side unit, a cache state of the road side unit, and a remaining cache capacity of the road side unit, and the method is specifically as follows:
In time slot t, each vehicle user acts as an agent and obtains its own observation state by interacting with the environment. The observation state of vehicle i at time slot t is expressed as

$$o_i^t = \big\{s_f^t,\; p_f,\; d_{i,j}^t,\; z_{j,f}^t,\; C_j^t,\; \delta_j^t\big\}$$

where $d_{i,j}^t$ represents the distance between vehicle user i and road side unit j at time slot t; $z_{j,f}^t$ is the flag bit indicating whether road side unit j has cached the requested content f at time slot t; $C_j^t$ represents the remaining cache space of road side unit j at time slot t; and $\delta_j^t$ denotes the number of time slots remaining before road side unit j refreshes its cache space.
5. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 4, wherein in step 3, according to the local observation information, each vehicle agent can independently decide which road side units in the edge node cluster to associate with and whether to cache the requested content of the current time slot within the cluster, specifically:
the action taken by each vehicle agent when interacting with the environment comprises the edge association decision variable and the edge caching decision variable; the action of vehicle i at time slot t is expressed as $a_i^t = \{x_i^t,\, c_i^t\}$.
6. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 5, wherein after the action decision is executed in step 4, each vehicle agent obtains the trade-off reward, fed back by the Internet of Vehicles environment, between the system's total content delivery delay and its edge caching overhead, specifically:
once all vehicle agents have executed their actions, the environment feeds back a global reward, defined as the average trade-off per user and expressed as:
7. The internet of vehicles edge content caching decision-making method based on federal multi-agent reinforcement learning according to claim 6, wherein in step 6, each vehicle agent updates its own Actor network and Critic network parameters according to a flexible Actor-critique algorithm, specifically:
(1) Flexibility value function: the flexible actor-critic algorithm optimization objective is to meet entropy maximization in addition to maximizing the cumulative rewards value returned by training, i.e
where ρ_π denotes the trajectory distribution of policy π and α denotes the entropy coefficient. To ensure that the agent keeps exploring, an entropy term is introduced to randomize the policy. When the algorithm makes a decision, the output probabilities of the individual actions are thereby kept as dispersed as possible, which improves the vehicle agent's ability to learn the environment and lets it adaptively adjust its policy under the continuously changing channel conditions of the Internet of Vehicles, yielding more reasonable decisions. On this basis, the soft state-action value function is defined as follows:
where the expectation is taken over the discounted cumulative reward. Further, the soft state value function can be expressed as:
where action a follows the probability distribution π(·|s).
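The objective and soft value functions above appear as images in the original text. A hedged reconstruction following the standard soft actor-critic formulation (discount factor γ, entropy H, and reward r are assumed symbols, not the patent's exact rendering) would be:

```latex
% Maximum-entropy objective over the trajectory distribution rho_pi
J(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}\!\left[ \sum_{t} \gamma^{t}
    \bigl( r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \bigr) \right],
\qquad
\mathcal{H}(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s)

% Soft state-action value and soft state value functions
Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t)
    + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V_{\mathrm{soft}}(s_{t+1}) \right],
\qquad
V_{\mathrm{soft}}(s_t) = \mathbb{E}_{a \sim \pi}\!\left[
    Q_{\mathrm{soft}}(s_t, a) - \alpha \log \pi(a \mid s_t) \right]
```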
(2) Critic network: under the Actor-Critic framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, thereby maximizing the long-term tradeoff reward of the Internet of Vehicles system. To judge the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is built for each of them. The inputs of these neural networks are the local observation information collected by the vehicle agents from the environment, and the outputs are the values corresponding to each action in the action space. Since a neural network is generally used to approximate the soft value function in this algorithm framework, the weight parameters of online value networks 1 and 2 and their corresponding target networks are defined as θ_main,1, θ_target,1, θ_main,2 and θ_target,2, respectively;
To improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is used to weaken the correlation between samples. During training, each vehicle agent randomly draws batch-size sets of experience samples from its own replay pool. The soft state value function is then redefined as:
where π(·|s_t)^T denotes the transpose of the action probability distribution π(·|s_t) in state s_t;
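A minimal sketch of the experience replay pool described above (the class and method names are illustrative, not the patent's code), showing how uniform random mini-batch sampling breaks the correlation between consecutive transitions:

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool: stores (s, a, r, s') transition tuples and
    returns uniformly random mini-batches to decorrelate training samples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples evicted first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling over the whole pool breaks temporal correlation
        return random.sample(list(self.buffer), batch_size)

# each vehicle agent keeps its own pool
pool = ReplayPool(capacity=1000)
for t in range(100):
    pool.store(state=t, action=t % 4, reward=float(t), next_state=t + 1)
batch = pool.sample(32)
```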
The updates to value networks 1 and 2 in the Critic are approximated by constructing a Bellman mean-square-error loss function for each Q network, expressed as:
where i = 1, 2 indexes value networks 1 and 2, respectively. To achieve the maximum-entropy objective, the entropy must be included in the reward; the value function used to judge the behavior policy, obtained from the general Bellman backup formula, is:
In addition, to obtain the optimal approximation of Q_soft(s, a), the parameter θ_main,i is updated along the gradient direction so as to minimize the loss function, expressed as:
where η_c denotes the learning rate of the Critic network and ∇ denotes taking the gradient of the loss function;
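A minimal numerical sketch of the Critic update for a discrete action space (NumPy; `soft_state_value` and the example numbers are illustrative assumptions, not the patent's implementation): the soft Bellman target is formed from the target network's Q-values, then the mean-square error against each online Q network is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma, alpha = 4, 0.95, 0.2

def soft_state_value(q_values, policy_probs):
    """V_soft(s) = pi(.|s)^T (Q(s,.) - alpha * log pi(.|s)) for a discrete policy."""
    return policy_probs @ (q_values - alpha * np.log(policy_probs))

# one sampled transition (s, a, r, s') drawn from the replay pool
q_next_target = rng.normal(size=n_actions)     # target-network Q(s', .)
pi_next = np.full(n_actions, 1.0 / n_actions)  # pi(.|s'), uniform for illustration
r = 1.0

# soft Bellman target: y = r + gamma * V_soft(s')
y = r + gamma * soft_state_value(q_next_target, pi_next)

# Bellman mean-square-error loss for the chosen action a of each online Q network
q_sa = np.array([0.5, 0.7])                    # Q_main,1(s,a), Q_main,2(s,a)
loss = np.mean((q_sa - y) ** 2)
```

In practice the gradient of this loss with respect to θ_main,i is then followed with step size η_c, as in the update formula above.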
The target value networks do not actively participate in the learning process and cannot be updated independently, so a soft update is adopted: at fixed intervals they copy the latest parameters of the online value networks in small steps, expressed as:
θ_target,i(t+1) ← τθ_main,i(t) + (1-τ)θ_target,i(t)
Where τ represents the degree of soft update.
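The soft (Polyak) update above can be sketched in a few lines (NumPy; parameter vectors and τ value are illustrative):

```python
import numpy as np

def soft_update(theta_target, theta_main, tau):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target (Polyak averaging)."""
    return tau * theta_main + (1.0 - tau) * theta_target

theta_main = np.array([1.0, 2.0, 3.0])
theta_target = np.zeros(3)
theta_target = soft_update(theta_target, theta_main, tau=0.1)
```

A small τ makes the target network track the online network slowly, which stabilizes the bootstrapped Bellman targets.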
(3) Actor network: the task of the Actor network is to seek policy improvement based on the value estimates produced by the neural networks. Its input is the local observation information that each vehicle agent collects from the environment, and its output is the action probability over the action dimensions, denoted π_θ(·|s). The update process of the Actor network can be expressed as exploring the policy by maximizing the system reward, with the soft state-action value function guiding the action decision; that is, the loss function of the Actor network is defined as:
Similarly, the Actor network updates the parameter φ along the gradient direction so as to minimize the loss function, expressed as:
where η_a denotes the learning rate of the Actor network and ∇ denotes taking the gradient of the loss function;
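The Actor loss itself is an image in the original; a common discrete-action soft actor-critic form, which this claim appears to follow, is J_π = E_s[π(·|s)^T (α log π(·|s) − Q(s,·))]. A sketch under that assumption (the numbers are illustrative):

```python
import numpy as np

alpha = 0.2  # entropy coefficient

def actor_loss(policy_probs, q_values):
    """Discrete-action SAC actor loss: pi(.|s)^T (alpha * log pi(.|s) - Q(s,.)).
    Minimizing it shifts probability mass toward high soft-value actions
    while the entropy term keeps the policy from collapsing too early."""
    return policy_probs @ (alpha * np.log(policy_probs) - q_values)

q = np.array([1.0, 0.0, 0.0, 0.0])          # action 0 has the highest value
uniform = np.full(4, 0.25)
greedy_ish = np.array([0.7, 0.1, 0.1, 0.1])

# concentrating probability on the high-Q action lowers the loss
assert actor_loss(greedy_ish, q) < actor_loss(uniform, q)
```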
(4) Self-adjusting entropy coefficient: the entropy coefficient α acts as a weight whose value controls the randomness of the action policy. The larger α is, the more dispersed the output action probabilities are when the algorithm makes a decision, so the vehicle agent explores the environment more thoroughly and tries more actions. Since the reward fed back by the system keeps changing during training in the Internet of Vehicles environment, relying on a fixed a priori α can make training unstable and degrade the convergence performance of the system. In view of this, an adaptive entropy coefficient is adopted: when the vehicle agent is exploring a new environment at the start of training, α is increased so that the agent explores as fully as possible; as the number of training rounds grows and the optimal actions are essentially determined, α is decreased so as to optimize the long-term tradeoff reward of the system as far as possible. The loss function of the entropy coefficient is defined as:
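The loss function itself is an image in the original; the standard soft actor-critic self-tuning form (a reconstruction, with an assumed target entropy H̄) reads:

```latex
% Entropy-coefficient loss: alpha grows when policy entropy falls below the
% target entropy \bar{\mathcal{H}}, and shrinks when it exceeds it
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\!\left[
    -\alpha \left( \log \pi_t(a_t \mid s_t) + \bar{\mathcal{H}} \right) \right]
```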
8. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 7, wherein in step 7, the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast back to the vehicle users for local training in the next training round, specifically:
First, each vehicle agent trains a DRL model in a distributed manner according to its local observation information. Second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center so that policies can be shared. Finally, the federally aggregated global model parameters are broadcast back to the local agents for the next training iteration. Since this communication pattern only uploads neural network parameters and downloads policies, the communication load is greatly reduced; moreover, because a user's local information cannot be directly recovered from the neural network parameters, the privacy of the system is protected. In each training round, the update formula of the vehicle agent's Actor network weight parameters is:
φ(t+1) = ξφ_i(t) + (1-ξ)H_i(t)
where H_i(t) = Σ_{k≠i} φ_k(t) and ξ is a weight coefficient.
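The aggregation rule above can be sketched as follows (NumPy; the function name, the value of ξ, and the toy weight vectors are illustrative assumptions):

```python
import numpy as np

xi = 0.5  # weight coefficient xi from the claim (illustrative value)

def federated_update(phis, i):
    """phi_i(t+1) = xi * phi_i(t) + (1 - xi) * H_i(t),
    where H_i(t) = sum over k != i of phi_k(t)."""
    h_i = sum(phi for k, phi in enumerate(phis) if k != i)
    return xi * phis[i] + (1.0 - xi) * h_i

# three vehicle agents' Actor weight vectors uploaded to the aggregation center
phis = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
new_phi_0 = federated_update(phis, 0)
```

Each agent's new parameters thus blend its own weights with the sum of the other agents' weights, sharing policy knowledge without exchanging raw local observations.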
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211708649.XA CN116156455A (en) | 2022-12-29 | 2022-12-29 | Internet of vehicles edge content caching decision method based on federal reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116156455A true CN116156455A (en) | 2023-05-23 |
Family
ID=86361158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211708649.XA Pending CN116156455A (en) | 2022-12-29 | 2022-12-29 | Internet of vehicles edge content caching decision method based on federal reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116156455A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116582840A (en) * | 2023-07-13 | 2023-08-11 | 江南大学 | Level distribution method and device for Internet of vehicles communication, storage medium and electronic equipment |
CN116911480A (en) * | 2023-07-25 | 2023-10-20 | 北京交通大学 | Path prediction method and system based on trust sharing mechanism in Internet of vehicles scene |
CN116709359A (en) * | 2023-08-01 | 2023-09-05 | 南京邮电大学 | Self-adaptive route joint prediction method for flight Ad Hoc network |
CN116709359B (en) * | 2023-08-01 | 2023-10-31 | 南京邮电大学 | Self-adaptive route joint prediction method for flight Ad Hoc network |
CN116761152A (en) * | 2023-08-14 | 2023-09-15 | 合肥工业大学 | Roadside unit edge cache placement and content delivery method |
CN116761152B (en) * | 2023-08-14 | 2023-11-03 | 合肥工业大学 | Roadside unit edge cache placement and content delivery method |
CN117118592A (en) * | 2023-10-25 | 2023-11-24 | 北京航空航天大学 | Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm |
CN117118592B (en) * | 2023-10-25 | 2024-01-09 | 北京航空航天大学 | Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm |
CN117793805A (en) * | 2024-02-27 | 2024-03-29 | 厦门宇树康信息技术有限公司 | Dynamic user random access mobile edge computing resource allocation method and system |
CN117793805B (en) * | 2024-02-27 | 2024-04-26 | 厦门宇树康信息技术有限公司 | Dynamic user random access mobile edge computing resource allocation method and system |
CN117939505A (en) * | 2024-03-22 | 2024-04-26 | 南京邮电大学 | Edge collaborative caching method and system based on excitation mechanism in vehicle edge network |
CN117939505B (en) * | 2024-03-22 | 2024-05-24 | 南京邮电大学 | Edge collaborative caching method and system based on excitation mechanism in vehicle edge network |
CN117979259A (en) * | 2024-04-01 | 2024-05-03 | 华东交通大学 | Asynchronous federation deep learning method and system for mobile edge collaborative caching |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116156455A (en) | Internet of vehicles edge content caching decision method based on federal reinforcement learning | |
Liu et al. | Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system | |
Luo et al. | Self-learning based computation offloading for internet of vehicles: Model and algorithm | |
Arkian et al. | A cluster-based vehicular cloud architecture with learning-based resource management | |
Zhang et al. | Deep reinforcement learning based IRS-assisted mobile edge computing under physical-layer security | |
CN111711666B (en) | Internet of vehicles cloud computing resource optimization method based on reinforcement learning | |
Ren et al. | Blockchain-based VEC network trust management: A DRL algorithm for vehicular service offloading and migration | |
Qin et al. | Collaborative edge computing and caching in vehicular networks | |
CN114827191B (en) | Dynamic task unloading method for fusing NOMA in vehicle-road cooperative system | |
CN112565377B (en) | Content grading optimization caching method for user service experience in Internet of vehicles | |
CN114973673B (en) | Task unloading method combining NOMA and content cache in vehicle-road cooperative system | |
Zheng et al. | Digital twin empowered heterogeneous network selection in vehicular networks with knowledge transfer | |
CN114449482A (en) | Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning | |
CN116390125A (en) | Industrial Internet of things cloud edge cooperative unloading and resource allocation method based on DDPG-D3QN | |
CN117042050A (en) | Multi-user intelligent data unloading method based on distributed hybrid heterogeneous decision | |
CN116321293A (en) | Edge computing unloading and resource allocation method based on multi-agent reinforcement learning | |
CN116321307A (en) | Bidirectional cache placement method based on deep reinforcement learning in non-cellular network | |
Hu et al. | Dynamic task offloading in MEC-enabled IoT networks: A hybrid DDPG-D3QN approach | |
Wang et al. | Joint spectrum access and power control in air-air communications-a deep reinforcement learning based approach | |
CN117354833A (en) | Cognitive Internet of things resource allocation method based on multi-agent reinforcement learning algorithm | |
Lyu et al. | Service-driven resource management in vehicular networks based on deep reinforcement learning | |
CN116137724A (en) | Task unloading and resource allocation method based on mobile edge calculation | |
Wang et al. | Deep Reinforcement Learning-Based Computation Offloading and Power Allocation within Dynamic Platoon Network | |
Gui et al. | Spectrum-Energy-Efficient Mode Selection and Resource Allocation for Heterogeneous V2X Networks: A Federated Multi-Agent Deep Reinforcement Learning Approach | |
CN114531685A (en) | Resource allocation method based on migration reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||