CN116156455A - Internet of Vehicles edge content caching decision method based on federated reinforcement learning - Google Patents

Internet of Vehicles edge content caching decision method based on federated reinforcement learning

Info

Publication number: CN116156455A
Application number: CN202211708649.XA
Authority: CN (China)
Prior art keywords: vehicle, edge, content, time slot, cache
Legal status: Pending
Inventors: 林艳, 包金鸣, 邹骏, 张一晋, 李骏, 束锋
Current and original assignee: Nanjing University of Science and Technology
Application filed by Nanjing University of Science and Technology; priority to CN202211708649.XA

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30: Services specially adapted for particular environments, situations or purposes
    • H04W4/40: Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44: Services specially adapted for particular environments, situations or purposes for vehicles, for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • H04W28/00: Network traffic management; Network resource management
    • H04W28/16: Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
    • H04W28/18: Negotiating wireless communication parameters
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention discloses an Internet of Vehicles edge content caching decision method based on federated reinforcement learning, which specifically comprises the following steps: the Internet of Vehicles environment is input and the network parameters of each vehicle are initialized; in the current time slot, each vehicle interacts with the roadside units to obtain observation information; according to the observation information, each vehicle independently decides its action; after the action is executed, each vehicle obtains the reward fed back by the environment, and the sample data are stored in an experience replay pool; when the number of samples is sufficient, each vehicle updates its networks according to the soft actor-critic algorithm; the aggregation center collects the local network parameters for federated aggregation and broadcasts the aggregated parameters back to the local agents for training; after the current training round ends, the Internet of Vehicles environment is reset and the next round of training begins. Under the Internet of Vehicles environment, the invention uses a user-centric network architecture and aims to minimize the trade-off between the system content transmission delay and the edge caching overhead, so that the vehicles complete distributed edge caching decisions while protecting privacy.

Description

Internet of Vehicles edge content caching decision method based on federated reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, and in particular to an Internet of Vehicles edge content caching decision method based on federated reinforcement learning.
Background
In recent years, driven by sixth-generation mobile communication technology, the Internet of Vehicles has been deeply integrated with the automobile industry and with new-generation information technologies such as big data and artificial intelligence, and can provide efficient and reliable communication services. However, as the number of vehicles grows, the ever-increasing real-time communication services place higher demands on the ultra-low latency and ultra-high reliability of the Internet of Vehicles. To address these challenges, edge caching techniques reduce content delivery delay by deploying caching resources at edge nodes to accomplish local content distribution, avoiding reliance on cloud-centric delivery (Zhang Y., Zhao J., Cao G. Roadcast: a popularity aware content sharing scheme in VANETs. ACM SIGMOBILE Mobile Computing and Communications Review, 2010, 13(4): 1-14.). However, due to the limited storage capacity of edge nodes, the rapidly growing data requests of Internet of Vehicles applications cannot all be cached at the edge nodes; moreover, in view of the high mobility of the Internet of Vehicles, frequently changing content demands, and harsh communication environments, an efficient edge content caching method needs to be designed for Internet of Vehicles scenarios.
Considering that the problem of intelligent edge content caching in the Internet of Vehicles is essentially a model-free discrete sequential decision problem, it can be solved with multi-agent reinforcement learning methods that complete local decisions by sharing training information. Compared with traditional optimization algorithms, deep reinforcement learning can learn from experience through interaction between an agent and an uncertain environment in order to solve dynamic decision problems. Even if dynamic environmental changes cannot be predicted in advance, the agent can learn how to take actions, or how to map acquired information to actions, so as to maximize the system reward. In recent years, researchers at home and abroad have focused on intelligent edge content caching decisions in order to use edge caching resources efficiently in the dynamic wireless transmission environment of the Internet of Vehicles. For example, Qiao et al. use a deep deterministic policy gradient algorithm to learn the variation of the Internet of Vehicles wireless environment from the local observations of vehicle users, and propose a cooperative edge caching method to minimize the long-term trade-off between system content transmission delay and caching overhead (Qiao G., Leng S., Maharjan S., et al. Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks. IEEE Internet of Things Journal, 2019, 7(1): 247-257.). However, most existing Internet of Vehicles edge content caching schemes are built on a centralized network architecture and do not fully exploit the deployment characteristics of densely and heterogeneously deployed edge nodes. In addition, given the openness of Internet of Vehicles communication, privacy protection of the individual data of vehicle users is also non-negligible. Therefore, in environments with densely and heterogeneously deployed edge nodes, how to achieve high capacity, low caching overhead, and seamless coverage of vehicular communication with a decentralized network architecture on the basis of privacy protection still requires further research.
Disclosure of Invention
In order to overcome the problems in the related art, the embodiments of the invention disclose an Internet of Vehicles edge content caching decision method based on federated reinforcement learning. The technical method comprises the following steps:
Step 1: input the Internet of Vehicles environment, have each vehicle agent initialize the parameters of its own actor network and critic network, and model the optimization problem;
Step 2: in the current time slot, each vehicle agent interacts with the roadside units within its observation range to obtain observation information such as the distance between the vehicle agent and each roadside unit, the caching state of the roadside unit, and the remaining caching capacity of the roadside unit;
Step 3: according to its local observation information, each vehicle agent independently decides the roadside units to associate with in its edge node cluster and decides whether to cache the requested content of the current time slot within the cluster;
Step 4: after the action decision is executed, each vehicle agent obtains the trade-off reward of total system content delivery delay and edge caching overhead fed back by the Internet of Vehicles environment, and all sample data are stored in an experience replay pool;
Step 5: judge whether the number of samples is sufficient; if so, go to step 6, otherwise go to step 7;
Step 6: when the number of samples is sufficient, each vehicle agent updates its own Actor network and Critic network parameters according to the soft actor-critic algorithm;
Step 7: the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
Step 8: judge whether the current training round has ended; if not, return to step 2 to start the next round of training; if so, go to step 9;
Step 9: judge whether convergence has been reached; if not, reset the Internet of Vehicles environment and return to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
Compared with the prior art, the invention has the following remarkable advantages: (1) aiming at the link congestion caused by the network-centric architecture in Internet of Vehicles caching scenarios, the invention designs user-centric edge node clusters by exploiting the deployment characteristics of dense edge nodes, so as to achieve high capacity, low caching overhead, and seamless coverage of vehicular communication; (2) aiming at the large amount of information exchange and the privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy-protection advantage of federated learning and realizes collaborative decision-making among vehicle users by sharing the neural network weights of local models, so as to reduce the long-term system content transmission delay and edge caching overhead.
The invention is further described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the Internet of Vehicles edge content caching decision method based on the federated reinforcement learning framework.
FIG. 2 is a graph showing the convergence of the average trade-off per vehicle user versus training round in an embodiment of the invention.
FIG. 3 is a graph showing the convergence of the average transmission delay per vehicle user versus training round in an embodiment of the invention.
FIG. 4 is a graph showing the convergence of the average caching overhead per vehicle user versus training round in an embodiment of the invention.
FIG. 5 is a graph showing the convergence of the average trade-off per vehicle user as a function of the maximum association number in an embodiment of the invention.
Detailed Description
The invention provides an Internet of Vehicles edge content caching decision method based on federated reinforcement learning. Specifically, within a unit time slot each vehicle user is regarded as an agent that obtains observation information within its observation range, such as its distance to each roadside unit, the caching state of the roadside unit, and the remaining caching capacity of the roadside unit, and trains its action policy with a neural network so as to minimize the trade-off between the total system content delivery delay and the edge caching overhead. With reference to FIGS. 1-2, the method comprises the following steps:
Step 1: input the Internet of Vehicles environment, have each vehicle agent initialize the parameters of its own actor network and critic network, and model the optimization problem;
Step 2: in the current time slot, each vehicle agent interacts with the roadside units within its observation range to obtain observation information such as the distance between the vehicle agent and each roadside unit, the caching state of the roadside unit, and the remaining caching capacity of the roadside unit;
Step 3: according to its local observation information, each vehicle agent independently decides the roadside units to associate with in its edge node cluster and decides whether to cache the requested content of the current time slot within the cluster;
Step 4: after the action decision is executed, each vehicle agent obtains the trade-off reward of total system content delivery delay and edge caching overhead fed back by the Internet of Vehicles environment, and all sample data are stored in an experience replay pool;
Step 5: judge whether the number of samples is sufficient; if so, go to step 6, otherwise go to step 7;
Step 6: when the number of samples is sufficient, each vehicle agent updates its own Actor network and Critic network parameters according to the soft actor-critic algorithm;
Step 7: the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
Step 8: judge whether the current training round has ended; if not, return to step 2 to start the next round of training; if so, go to step 9;
Step 9: judge whether convergence has been reached; if not, reset the Internet of Vehicles environment and return to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
As a specific embodiment, the Internet of Vehicles environment input in step 1 specifically includes:
(1) Time slot model: the continuous training time is discretized into time slots, denoted 𝒯 = {1, 2, ..., T}, where each time slot has duration τ; the channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles is modeled as a Manhattan grid, with roadside units that can provide communication services uniformly distributed on both sides of the roads; each roadside unit serves as an edge node with dedicated communication resources and limited local storage resources, and is connected to an edge server through a high-speed wired link; the edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation, and the like; let 𝒥 denote the set of roadside units and 𝓘 the set of vehicle users; all vehicle users can travel along the four directions of the road (forward, backward, left, and right), and each direction has multiple lanes to ensure vehicle passage;
(3) Vehicle movement model: the speed variation of each vehicle follows a Gauss-Markov random process. Specifically, when vehicle user i travels with initial speed v_i(0), its speed v_i(t) at time slot t can be expressed in terms of its speed v_i(t-1) at time slot t-1, its asymptotic speed, and a random variable:

v_i(t) = η_i v_i(t-1) + (1 - η_i) v̄_i + σ_i √(1 - η_i²) z,

where v̄_i and σ_i are the asymptotic mean and standard deviation of vehicle user i's speed, respectively; the parameter η_i ∈ [0,1] represents the memory depth of the previous-slot speed and determines the temporal dependence of vehicle user i's movement; and z follows an uncorrelated zero-mean, unit-variance standard normal distribution. Notably, the closer η_i is to 1, the more the current-slot speed of vehicle user i depends on the speed of the previous slot.
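As an illustration of this mobility model, the following minimal Python sketch implements one Gauss-Markov speed update under the standard form reconstructed above (function and variable names are illustrative, not taken from the filing):

```python
import numpy as np

def gauss_markov_speed(v_prev, v_mean, sigma, eta, rng):
    """One Gauss-Markov speed update: memory term + asymptotic mean + random perturbation."""
    z = rng.standard_normal()  # zero-mean, unit-variance noise
    return eta * v_prev + (1.0 - eta) * v_mean + sigma * np.sqrt(1.0 - eta ** 2) * z

# Example: a small eta makes the speed nearly independent of the previous slot.
rng = np.random.default_rng(0)
v = 15.0
for t in range(5):
    v = gauss_markov_speed(v, v_mean=15.0, sigma=0.1, eta=0.1, rng=rng)
```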
(4) Content request model: let 𝓕 denote the set of all contents that vehicles may request, s_f the size of content f, and g_f the feature vector of content f used to distinguish different contents. It is assumed that vehicle user i can generate only one content request f in time slot t, expressed by the binary indicator q_{i,f}(t) ∈ {0,1}, where q_{i,f}(t) = 1 indicates that vehicle user i requests content f in time slot t and q_{i,f}(t) = 0 otherwise. Considering that a vehicle user may prefer files similar to its previously requested content in addition to globally popular content, it is assumed that the vehicle user requests content according to global popularity with probability ε and according to local personal preference with probability 1-ε, where global popularity and local personal preference are defined as follows:
(1) Global popularity: let P_f denote the global popularity of content f among the vehicle users' request files, which follows a Mandelbrot-Zipf distribution, i.e.

P_f = (I_f + α)^(-β) / Σ_{f'∈𝓕} (I_{f'} + α)^(-β),

where I_f denotes the rank of content f in descending order of global popularity, and α and β denote the plateau factor and the skewness factor, respectively;
(2) Local personal preference: in this case each vehicle user requests a file based on its similarity to previously requested content; for example, when a vehicle user requests content f in time slot t, the content with the highest similarity to f will be requested in time slot t+1. Cosine similarity is adopted here to measure the similarity between contents f and f*:

sim(f, f*) = (g_f · g_{f*}) / (‖g_f‖ ‖g_{f*}‖).
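A small Python sketch of this request model, assuming the Mandelbrot-Zipf and cosine-similarity forms reconstructed above (the feature matrix and parameter names are illustrative):

```python
import numpy as np

def mzipf_popularity(num_contents, alpha, beta):
    """Mandelbrot-Zipf popularity over contents ranked 1..F (alpha: plateau, beta: skewness)."""
    ranks = np.arange(1, num_contents + 1)
    weights = (ranks + alpha) ** (-beta)
    return weights / weights.sum()

def next_request(prev_f, features, popularity, eps, rng):
    """With probability eps request by global popularity, otherwise the content most similar to prev_f."""
    if prev_f is None or rng.random() < eps:
        return int(rng.choice(len(popularity), p=popularity))
    sims = features @ features[prev_f] / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(features[prev_f]) + 1e-12)
    sims[prev_f] = -np.inf  # do not re-request the same content
    return int(np.argmax(sims))

# Example: 20 contents with random feature vectors, using the popularity parameters of the embodiment below.
rng = np.random.default_rng(0)
pop = mzipf_popularity(20, alpha=-0.88, beta=0.35)
feats = rng.standard_normal((20, 8))
f = next_request(None, feats, pop, eps=0.6, rng=rng)
```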
(5) User-centric edge caching model: in order to improve the transmission rate of vehicle users' requested content, a user-centric network architecture is adopted to design the Internet of Vehicles edge content caching framework. Specifically, each vehicle user selects one or more nearby roadside units, based on the roadside unit information observed within its sensing range, to construct its own user-centric edge node cluster. Let the maximum number of roadside units observable by vehicle user i in time slot t be O_max, and let the maximum number of roadside units it can associate with be S_max ≤ O_max. Let Φ_i(t) denote the edge node cluster serving vehicle user i in time slot t, so that |Φ_i(t)| is the number of roadside units in the cluster at that moment. The edge association decision of vehicle user i in time slot t is expressed as the binary variable x_{i,j}(t) ∈ {0,1}, where x_{i,j}(t) = 1 indicates that roadside unit j belongs to the edge node cluster of vehicle user i in time slot t, i.e. j ∈ Φ_i(t), and x_{i,j}(t) = 0 otherwise. Accordingly, the user-centric edge node cluster can be represented as Φ_i(t) = {j : x_{i,j}(t) = 1}.
Both the roadside units and the cloud center are provided with caching capacity: the cloud center can handle the caching requests of all vehicle users, while each roadside unit has a limited caching capacity C. Under the user-centric edge content caching framework, a vehicle user may request that part of its requested files be cached in the associated edge node cluster, which then provides the cached content service. Therefore, the vehicle user further needs to decide whether to cache the requested content of the current time slot at each roadside unit in its edge node cluster. The edge caching decision of vehicle user i in time slot t is expressed as the binary variable y_{i,j}(t) ∈ {0,1}, where y_{i,j}(t) = 1 indicates that vehicle user i caches the requested content of time slot t at roadside unit j, and y_{i,j}(t) = 0 otherwise.
(6) Wireless transmission model: it is assumed that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the roadside units and the vehicle users are equipped with a single antenna. The channel power gain consists of a fast fading part and a slow fading part: the fast fading is mainly caused by the multipath effect, i.e. Rayleigh fading, while the slow fading consists of path loss and shadow fading. Let h_{i,j}(t) denote the channel gain between vehicle user i and roadside unit j in time slot t, expressed as

h_{i,j}(t) = √(L_{i,j}(t)) · h̃_{i,j}(t),

where h̃_{i,j}(t) is the corresponding fast fading part, which follows a zero-mean, unit-variance complex Gaussian distribution, i.e. h̃_{i,j}(t) ~ CN(0,1), and the slow fading part between vehicle user i and roadside unit j is expressed as

L_{i,j}(t) = A_i β_i (d_{i,j}(t))^(-η),

where A_i is the path-loss constant; β_i is the log-normal shadow-fading component with standard deviation ζ; d_{i,j}(t) denotes the distance between vehicle user i and roadside unit j in time slot t; η denotes the path-loss attenuation exponent; and A_i (d_{i,j}(t))^(-η) represents the path loss of the channel between vehicle user i and roadside unit j in time slot t.
Let the transmit power of roadside unit j in the edge node cluster Φ_i(t) in time slot t be p_j; the achievable downlink transmission rate of vehicle user i is then expressed as

r_i(t) = B log₂(1 + Σ_{j∈Φ_i(t)} p_j |h_{i,j}(t)|² / σ²),

where B is the channel bandwidth and σ² is the power of the additive white Gaussian noise.
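For illustration, a minimal Python sketch of the downlink rate computation under the channel model reconstructed above (summing the received powers of the cluster members is an assumption based on the joint transmission described later; names are illustrative):

```python
import numpy as np

def downlink_rate(dists, tx_powers, A, shadow, eta, bandwidth, noise_power, rng):
    """Achievable rate for one vehicle user served by the RSUs in its cluster.

    Slow fading = A * shadow * d^(-eta); fast fading ~ CN(0, 1) (Rayleigh).
    """
    slow = A * shadow * np.asarray(dists, dtype=float) ** (-eta)
    fast = (rng.standard_normal(len(dists)) + 1j * rng.standard_normal(len(dists))) / np.sqrt(2.0)
    gains = slow * np.abs(fast) ** 2              # |h|^2 per cluster member
    snr = float(np.sum(np.asarray(tx_powers) * gains)) / noise_power
    return bandwidth * np.log2(1.0 + snr)

# Example: two RSUs at 50 m and 120 m, 1 W transmit power each, 10 MHz bandwidth.
rng = np.random.default_rng(0)
rate = downlink_rate([50.0, 120.0], [1.0, 1.0], A=1e-3, shadow=1.0, eta=3.0,
                     bandwidth=10e6, noise_power=1e-13, rng=rng)
```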
(7) Delay model: let d_f^max denote the delay that each vehicle user can tolerate for its requested content. In addition, in order to make full use of the edge caching resources, the edge server automatically refreshes its storage space every period Δ, so that in every period Δ there is a short service gap in which the cached contents are replaced. Therefore, if vehicle user i needs to download cached content from the edge server in time slot t, its content delivery must be completed both within the tolerable delay d_f^max and before the next cache-replacement refresh, i.e. within

min{ d_f^max, ((n+1)Δ - t)·τ },

where (n+1)Δ - t denotes the number of time slots remaining in the corresponding cache refresh period, with nΔ ≤ t < (n+1)Δ.
Considering that the content delivery delay of a vehicle user depends on whether the requested content has already been cached at the edge server in advance, the caching state of roadside unit j in time slot t is defined as the binary variable c_{j,f}(t) ∈ {0,1}, where c_{j,f}(t) = 1 indicates that content f is cached at roadside unit j in time slot t, and c_{j,f}(t) = 0 otherwise. Thus, delivery of the content requested by a vehicle user includes two cases, edge caching and cloud downloading:
(1) Edge caching: when any edge node in the cluster Φ_i(t) has already cached the requested content f, vehicle user i can obtain the requested content directly from the edge, with delay

d_i^edge(t) = s_f / r_i(t);

(2) Cloud downloading: when none of the edge nodes in the cluster Φ_i(t) has cached the requested content f, it must be downloaded from the cloud center to the edge server, incurring an additional fixed delay d^cloud.
Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as

d_i(t) = 1{∃ j∈Φ_i(t): c_{j,f}(t)=1} · d_i^edge(t) + 1{∀ j∈Φ_i(t): c_{j,f}(t)=0} · (d^cloud + d_i^edge(t)),

where 1{·} equals 1 if the condition inside the braces is satisfied and 0 otherwise.
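A short Python sketch of the resulting delivery delay, assuming the edge-hit / cloud-fetch decomposition reconstructed above (names and the exact composition are illustrative):

```python
def content_delay(size_bits, rate_bps, cached_in_cluster, cloud_delay_s):
    """Total delivery delay for one request: edge transmission, plus a fixed cloud fetch on a miss."""
    edge_delay = size_bits / rate_bps            # transmission time from the edge
    if cached_in_cluster:                        # some RSU in the user's cluster holds content f
        return edge_delay
    return cloud_delay_s + edge_delay            # extra fixed backhaul delay from the cloud center

# Example: 30 Mbit content at 40 Mbit/s, cache miss, 1 s cloud delay -> 1.75 s total.
d = content_delay(30e6, 40e6, cached_in_cluster=False, cloud_delay_s=1.0)
```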
As a specific embodiment, modeling the optimization problem in step 1 specifically comprises:
The long-term trade-off between content delivery delay and edge caching overhead is minimized as the optimization objective, and a user-centric intelligent edge caching scheme is designed accordingly. To this end, a normalized content delivery delay utility U_i^D(t) and a normalized edge caching overhead utility U_i^C(t) are first defined for each vehicle user i in time slot t. The optimization problem is then expressed as

min_{x,y} lim_{T→∞} (1/T) Σ_{t∈𝒯} Σ_{i∈𝓘} [ ω_1 U_i^D(t) + ω_2 U_i^C(t) ]
s.t. C1, C2, C3, C4,

where ω_1 and ω_2 denote the weights of the content delivery delay utility and the caching overhead utility, respectively; C1 requires that the edge node cluster of each vehicle user in any time slot contains at most S_max roadside units, i.e. |Φ_i(t)| ≤ S_max; C2 requires that each roadside unit serves at most a single vehicle user in any time slot; C3 requires that the content cached at each roadside unit does not exceed its local caching capacity C; and C4 requires that the requested-content delivery delay of each vehicle user in any time slot does not exceed the maximum allowable delay.
As a specific embodiment, in step 2 each vehicle agent interacts with the roadside units within its observation range to obtain observation information such as its distance to each roadside unit, the caching state of the roadside unit, and the remaining caching capacity of the roadside unit, specifically:
In time slot t, each vehicle user acts as an agent and obtains its own observation state through interaction with the environment; the observation state of vehicle i in time slot t is expressed as

o_i(t) = { d_{i,j}(t), c_{j,f}(t), C_j^res(t), t^rem(t) },

where d_{i,j}(t) denotes the distance between vehicle user i and roadside unit j in time slot t; c_{j,f}(t) is the flag indicating whether the requested content f has already been cached at roadside unit j in time slot t; C_j^res(t) denotes the remaining caching space of roadside unit j in time slot t; and t^rem(t) denotes the number of time slots remaining in time slot t until the next refresh of the caching space.
As a specific embodiment, in step 3 each vehicle agent can independently decide, according to its local observation information, the roadside units to associate with in its edge node cluster and whether to cache the requested content of the current time slot within the cluster, specifically:
The action taken by each vehicle agent when interacting with the environment comprises the edge association decision variables and the edge caching decision variables; the action of vehicle i in time slot t is expressed as

a_i(t) = { x_{i,j}(t), y_{i,j}(t) : roadside units j within the observation range of vehicle i }.
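For clarity, a minimal Python sketch of how one agent's observation and action could be laid out (field names are illustrative, not taken from the filing):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Observation:
    """Local observation of one vehicle agent in a time slot."""
    rsu_distance: Dict[int, float]       # d_{i,j}(t) for each observable RSU j
    rsu_has_content: Dict[int, bool]     # c_{j,f}(t) for the currently requested content f
    rsu_free_capacity: Dict[int, float]  # remaining caching space of each RSU
    slots_to_refresh: int                # time slots left until the next cache refresh

@dataclass
class Action:
    """Joint edge-association / edge-caching decision of one vehicle agent."""
    associate: Dict[int, bool]           # x_{i,j}(t): include RSU j in the cluster
    cache: Dict[int, bool]               # y_{i,j}(t): cache this slot's content at RSU j
```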
As a specific embodiment, after the action decision is executed in step 4, each vehicle agent obtains the trade-off reward of total system content delivery delay and edge caching overhead fed back by the Internet of Vehicles environment, specifically:
When all vehicle agents have executed their actions, the environment feeds back a global reward. The average trade-off per user, taken with a negative sign so that maximizing the reward minimizes the trade-off, is defined as the global reward, expressed as

r(t) = - (1/I) Σ_{i∈𝓘} [ ω_1 U_i^D(t) + ω_2 U_i^C(t) ] - ρ_t,

where U_i^D(t) denotes the normalized content delivery delay utility, U_i^C(t) denotes the normalized edge caching overhead utility, I is the number of vehicle users, and ρ_t denotes the penalty term applied when constraints C1-C4 are not satisfied.
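A minimal Python sketch of this global reward, assuming the negative-trade-off sign convention noted above (so that maximizing the reward minimizes the weighted cost):

```python
def global_reward(delay_utils, cache_utils, w1, w2, penalty):
    """Per-user average of the weighted delay/caching utilities, negated, minus a constraint penalty."""
    n = len(delay_utils)
    tradeoff = sum(w1 * d + w2 * c for d, c in zip(delay_utils, cache_utils)) / n
    return -tradeoff - penalty

# Example with 4 vehicle users and no constraint violation.
r = global_reward([0.2, 0.4, 0.1, 0.3], [0.5, 0.2, 0.6, 0.1], w1=0.5, w2=0.5, penalty=0.0)
```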
As a specific embodiment, in step 6 each vehicle agent updates its own Actor network and Critic network parameters according to the federated discrete soft actor-critic algorithm, specifically:
(1) Soft value function: besides maximizing the cumulative reward returned by training, the optimization objective of the soft actor-critic algorithm also maximizes the policy entropy, i.e.

J(π) = Σ_t E_{(s_t,a_t)~ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ],

where ρ_π denotes the trajectory distribution of policy π and α denotes the entropy coefficient. To ensure that the agent keeps exploring, the entropy term is introduced to randomize the policy, expressed as

H(π(·|s_t)) = - Σ_a π(a|s_t) log π(a|s_t).

Therefore, when the algorithm makes decisions the probabilities of the output actions are kept as dispersed as possible, which improves the vehicle agent's ability to learn the environment, allows it to adaptively adjust its policy in an Internet of Vehicles environment with continuously changing channel conditions, and thus leads to more reasonable decisions. On this basis, the soft state-action value function is defined as

Q_soft(s_t, a_t) = E [ r_t + γ V_soft(s_{t+1}) ],

where γ is the discount factor of the cumulative discounted reward; further, the soft state value function can be expressed as

V_soft(s_t) = E_{a~π(·|s_t)} [ Q_soft(s_t, a) - α log π(a|s_t) ],

where the action a follows the probability distribution π(·|s_t).
(2) Critic network: under the Actor-Critic framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, so as to maximize the long-term trade-off objective of the Internet of Vehicles system. To evaluate the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is constructed for each of them; the input of these neural networks is the local observation information collected by the vehicle agents in the environment, and the output is the value corresponding to each action in the action space. Since a neural network is generally adopted to approximate the soft value function in this algorithmic framework, the weight parameters of online value networks 1 and 2 and of the corresponding target networks are defined as θ_main,1, θ_target,1, θ_main,2 and θ_target,2, respectively.
In order to improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is adopted to weaken the correlation between samples. During training, each vehicle agent randomly draws mini-batches of transitions from its replay pool 𝒟, i.e. tuples (o_i(t), a_i(t), r(t), o_i(t+1)), and the soft state value function is redefined for the discrete action space as

V_soft(s_t) = π(·|s_t)^T [ Q_soft(s_t, ·) - α log π(·|s_t) ],

where π(·|s_t)^T denotes the transpose of the action probability distribution π(·|s_t).
the updates to value networks 1 and 2 in the Critic network are approximated as building a bellman mean square error loss function for each Q network, expressed as:
Figure BDA0004025588120000094
wherein i=1, 2 represents value networks 1 and 2, respectively; to achieve the goal of maximum entropy, the entropy needs to be included in rewards, and the functional expression of the judgment behavior strategy value obtained by the general deduction formula of Bellman is as follows:
Figure BDA0004025588120000095
in addition, to obtain Q soft Optimal approximation of (s, α), updating the parameter θ in the gradient direction m,i Expressed as a minimized loss function:
Figure BDA0004025588120000096
wherein ,ηc Represents the learning rate of the Critic network,
Figure BDA0004025588120000097
gradient calculation representing a loss function;
the target value network does not actively participate in the learning process and cannot be updated independently, so that a soft update mode is adopted. It replicates the latest parameters of the online value network at intervals for small-scale updates, expressed as:
θ target,i (t+1)←τθ main,i (t)+(1-τ)θ target,i (t)
where τ represents the degree of soft update.
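The following PyTorch-style Python sketch shows the twin-critic update with soft target updates for the discrete-action formulation above; the network objects, shapes, and hyper-parameters are illustrative assumptions, not the filing's reference implementation.

```python
import torch
import torch.nn.functional as F

def critic_update(q1, q2, q1_targ, q2_targ, actor, batch, alpha, gamma,
                  q1_opt, q2_opt, tau=0.01):
    """One update of both online Q networks, then a soft update of their targets."""
    obs, act, rew, next_obs = batch               # [B, obs_dim], [B], [B], [B, obs_dim]
    with torch.no_grad():
        next_probs = actor(next_obs)              # [B, n_actions], softmax policy output
        next_logp = torch.log(next_probs + 1e-8)
        q_next = torch.min(q1_targ(next_obs), q2_targ(next_obs))
        v_next = (next_probs * (q_next - alpha * next_logp)).sum(dim=1)  # soft state value
        target = rew + gamma * v_next             # Bellman target
    for q, opt in ((q1, q1_opt), (q2, q2_opt)):
        q_sa = q(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_sa, target)
        opt.zero_grad(); loss.backward(); opt.step()
    for q, q_t in ((q1, q1_targ), (q2, q2_targ)):  # soft update of the target networks
        for p, p_t in zip(q.parameters(), q_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```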
(3) Actor network: the task of the Actor network is to seek policy improvement based on the value estimates produced by the Critic networks. Its input is the local observation information collected by each vehicle agent in the environment, and its output is the action probability over the action dimensions, denoted π_φ(·|s). The update of the Actor network can be expressed as exploring the policy that maximizes the system reward, with the soft state-action value function guiding the action decisions; that is, the loss function of the Actor network is defined as

L(φ) = E_{s_t~𝒟} [ π_φ(·|s_t)^T ( α log π_φ(·|s_t) - Q_soft(s_t, ·) ) ].

Similarly, the Actor network updates its parameters φ along the gradient direction that minimizes the loss function:

φ ← φ - η_a ∇_φ L(φ),

where η_a denotes the learning rate of the Actor network and ∇ denotes the gradient of the loss function.
(4) Self-adjusting entropy coefficient: the entropy coefficient α acts as a weight, and the randomness of the action policy is controlled by changing its value. The larger α is, the more dispersed the output action probabilities are when the algorithm makes decisions, so the vehicle agent explores the environment more thoroughly and tries more actions. Considering that the reward fed back by the system keeps changing during training in the Internet of Vehicles environment, relying on a fixed a priori α can make training unstable and degrade the convergence performance of the system. In view of this, an adaptive entropy coefficient is adopted: when the vehicle agent is exploring a new environment at the beginning of training, α is increased so that the agent explores as fully as possible; as the number of training rounds increases and the optimal action is essentially determined, α is decreased to optimize the long-term reward trade-off of the system as much as possible. The loss function of the entropy coefficient is defined as

L(α) = π(·|s_t)^T [ -α ( log π(·|s_t) + H̄ ) ],

where H̄ denotes the target entropy threshold for the current state s_t.
As a specific embodiment, in step 7 the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training, specifically:
First, each vehicle agent trains its DRL model in a distributed manner according to its local observation information; second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center so as to share policies; finally, the federally aggregated global model parameters are broadcast back to the local agents for the next training iteration. Since this communication scheme only requires uploading neural network parameters and downloading the aggregated policy, the communication load is greatly reduced; moreover, because the user's local information cannot be obtained directly from the neural network parameters, the privacy of the system is protected. In each training round, the update formula of the vehicle agent's Actor network weight parameters is

φ_i(t+1) = ξ φ_i(t) + (1 - ξ) H_i(t),

where H_i(t) = Σ_{k≠i} φ_k(t) and ξ is a weighting coefficient.
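A minimal Python sketch of this aggregation step over the agents' Actor state dictionaries, following the weighted form above (the sketch keeps the plain neighbor sum exactly as written; any additional averaging would be an assumption):

```python
def federated_actor_aggregation(actor_states, xi):
    """phi_i <- xi * phi_i + (1 - xi) * sum_{k != i} phi_k, applied parameter-wise."""
    totals = {name: sum(sd[name] for sd in actor_states) for name in actor_states[0]}
    return [{name: xi * sd[name] + (1.0 - xi) * (totals[name] - sd[name]) for name in sd}
            for sd in actor_states]

# Usage (illustrative):
#   states = [agent.actor.state_dict() for agent in agents]
#   for agent, sd in zip(agents, federated_actor_aggregation(states, xi=0.5)):
#       agent.actor.load_state_dict(sd)
```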
The invention is further described in detail below with reference to the accompanying drawings and specific examples.
Examples
This embodiment provides an Internet of Vehicles edge content caching decision method based on federated reinforcement learning, which is specifically described below:
1. Establishing the Internet of Vehicles system model:
The simulation environment is set according to the Manhattan grid model in the 3GPP TR 36.885 standard. The city map is set to 500 × 500 m, the number of vehicles is 4, and the number of roadside units is 36. The maximum observation range of a vehicle user is 200 m, and the maximum association number of roadside units is 4. The number of contents a vehicle user can request is 20, and the content size is set within [20, 50] Mbit. The plateau factor α of the global popularity of the requested content is -0.88, and the skewness factor β is 0.35. At the beginning of training the vehicle positions are initialized, and when a vehicle user reaches an intersection it chooses its subsequent driving direction with equal probability, i.e. each direction has probability 0.25. The speed of each vehicle follows the Gauss-Markov mobility model, with standard deviation σ_i = 0.1 and memory parameter η_i set to 0.1. The path-loss model follows the slow-fading expression given above. In addition, the unit time slot duration is set to 0.1 s, the cloud download delay to 1 s, and the delay constraint d_f^max to 1.5 s.
2. Establishing the federated reinforcement learning algorithm framework:
The federated reinforcement learning algorithm combines the federated averaging algorithm and the soft actor-critic algorithm. In the soft actor-critic framework, each Critic network is fitted with a fully connected neural network with 2 hidden layers, where the number of neurons per hidden layer is 64 and the activation function is the rectified linear unit f(x) = max(0, x); similarly to the Critic networks, the Actor network is fitted with the same fully connected network structure. A virtual cloud server is constructed as the aggregation center and performs federated aggregation on the weight parameters of the local Actor networks.
3. Training phase of the algorithm:
First, each vehicle agent trains its edge caching strategy using the soft actor-critic algorithm. For vehicle agent i, the local state information o_i(t) is defined as the data size of the content requested by the vehicle agent in time slot t, the global popularity of the file, the distance between the vehicle agent and roadside unit j, the flag indicating whether roadside unit j has cached the requested content f, the size of the remaining caching space, and the number of time slots remaining until the caching space is refreshed. Second, the vehicle agent decides its action a_i(t) based on the local observation state, i.e. the edge association decision variables and the edge caching decision variables.
After the state is input to the algorithm and the action predicted by the Actor network is executed through interaction with the Internet of Vehicles environment, the vehicle agent obtains the global reward r(t) fed back by the system and transitions to the next state o_i(t+1). When enough sample data have been collected, the vehicle agent updates the Actor network and the Critic networks according to the gradient descent method. At the end of each round, each vehicle agent uploads the weight parameters of its local Actor network to the cloud center for federated averaging, and the aggregated global parameters are broadcast back to the vehicle agents for the local training of the next round. 3000 training episodes and 100 test episodes are adopted; during training, the Critic network learning rate η_c is set to 10^-4, the Actor network learning rate η_a to 10^-4, the discount factor γ to 0.9, the soft update degree to 0.01, the experience replay pool size to 5000, and the number of samples per training step (batch size) to 100.
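Putting the pieces together, the Python sketch below outlines one possible training loop for this phase; the environment, agent, and cloud-aggregator objects and their methods are hypothetical stand-ins for the components described above.

```python
def train(env, agents, cloud, episodes=3000, batch_size=100, warmup=500, xi=0.5):
    """High-level loop: local soft actor-critic updates plus per-episode federated averaging."""
    for episode in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            actions = [agent.act(o) for agent, o in zip(agents, obs)]   # distributed decisions
            next_obs, reward, done = env.step(actions)                  # shared global reward
            for agent, o, a, o2 in zip(agents, obs, actions, next_obs):
                agent.buffer.add(o, a, reward, o2)
                if len(agent.buffer) >= warmup:
                    agent.sac_update(agent.buffer.sample(batch_size))   # critic/actor/alpha updates
            obs = next_obs
        # End of episode: federated averaging of the local Actor weights.
        states = [agent.actor.state_dict() for agent in agents]
        for agent, sd in zip(agents, cloud.aggregate(states, xi)):
            agent.actor.load_state_dict(sd)
```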
The invention compares the proposed scheme with the following reference schemes:
(1) Edge caching scheme based on the independent soft actor-critic algorithm (Independent SAC, ISAC): each vehicle user, acting as an agent, trains its own policy with the SAC algorithm, i.e. each agent trains its own independent Critic and Actor networks and makes decisions in a distributed manner according to its own local observation information.
(2) Edge caching scheme based on the independent deep Q-network algorithm (Independent Deep Q-Network, IDQN): each vehicle user, acting as an agent, trains its own policy with the DQN algorithm, i.e. each agent trains an independent Q network and makes decisions in a distributed manner according to its own local observation information.
(3) Edge caching scheme based on federated DQN (Federated DQN, FedDQN): each vehicle user, acting as an agent, federally trains its own policy with a DQN algorithm that shares the local Q-network weights, where the Q-network weights are uploaded to the cloud center for federated averaging; each vehicle user then downloads the aggregated global parameters and makes decisions in a distributed manner according to its own local observation information.
As shown in FIG. 2, comparing the proposed scheme with the reference schemes, it is evident that the ISAC-based caching scheme converges fastest, followed by the proposed scheme, while the FedDQN-based and IDQN-based schemes converge slowest. The underlying reasons are as follows: the proposed scheme and the ISAC-based scheme both maximize an entropy objective and thereby enhance the exploration ability of the algorithm, which promotes model convergence and yields better convergence performance; the IDQN-based and FedDQN-based schemes explore with a random policy and therefore face the problems that the model does not converge easily and that sample exploration and exploitation are hard to balance when dealing with the high-dimensional action space of the Internet of Vehicles environment. Furthermore, compared with the ISAC-based scheme, the proposed scheme can achieve a better average system trade-off while protecting privacy. The reasons include two aspects: first, by sharing the weight parameters of the Actor network of the local DRL model, the proposed scheme lets each agent share the experience of interacting with the environment, which helps each agent learn a better decision policy; second, the cloud center only collects weight parameters and cannot obtain local observation information from training, which reduces communication overhead and protects the privacy of vehicle users.
As shown in FIG. 3, as the number of training rounds increases, the average transmission delay per vehicle user shows a decreasing trend, which demonstrates the effectiveness of each scheme. Compared with the reference schemes, the proposed scheme is clearly superior to the IDQN-based and FedDQN-based schemes in optimizing the average transmission delay, and is slightly superior to the ISAC-based scheme. The reasons for this include two aspects: first, all four schemes can optimize the average transmission delay well, and after the utility function is normalized it is difficult to distinguish the algorithms' performance; second, due to the complex and changeable dynamic Internet of Vehicles environment, the convergence results exhibit some randomness, making it difficult to show a clear performance advantage. As shown in FIG. 4, as the number of training rounds increases, the edge caching overhead of each scheme shows a decreasing trend; the edge caching overhead of the proposed scheme is the smallest, followed by the ISAC-based scheme, while the IDQN-based and FedDQN-based schemes perform worst. Combining FIG. 3 and FIG. 4, it can be seen that the proposed scheme can greatly reduce the edge caching overhead while also optimizing the transmission delay of each vehicle user well.
FIG. 5 shows how the maximum association number used when constructing the edge node clusters in the Internet of Vehicles affects the average trade-off performance per vehicle user. Specifically, as the number of associated roadside units increases, the convergence result of the proposed scheme also improves. The reason is that the user-centric edge node cluster can seamlessly adapt to dynamic fluctuations of the network topology, and as the number of roadside units forming the cluster increases, joint transmission can greatly increase the data transmission rate. However, when the number of associated roadside units reaches a certain value, the transmission-rate gain of the edge node cluster saturates, i.e. the average trade-off convergence result per vehicle user for S_max = 4 differs little from that for S_max = 3.
In summary, in view of the unknown, highly dynamic topology and channel state characteristics of the Internet of Vehicles, the invention combines the privacy-protection advantages of federated learning under a user-centric framework and proposes an Internet of Vehicles edge caching scheme based on a federated soft actor-critic algorithm. The scheme realizes collaborative training and decision-making in a multi-agent environment without revealing the local training data of vehicle users, thereby achieving joint optimization of transmission delay and caching overhead while ensuring high privacy, and it outperforms the benchmark schemes in terms of the average trade-off per vehicle user and convergence performance.

Claims (8)

1. The Internet of Vehicles edge content caching decision method based on federated reinforcement learning, characterized by comprising the following specific steps:
Step 1: input the Internet of Vehicles environment, have each vehicle agent initialize the parameters of its own actor network and critic network, and model the optimization problem;
Step 2: in the current time slot, each vehicle agent interacts with the roadside units within its observation range to obtain observation information such as the distance between the vehicle agent and each roadside unit, the caching state of the roadside unit, and the remaining caching capacity of the roadside unit;
Step 3: according to its local observation information, each vehicle agent independently decides the roadside units to associate with in its edge node cluster and decides whether to cache the requested content of the current time slot within the cluster;
Step 4: after the action decision is executed, each vehicle agent obtains the trade-off reward of total system content delivery delay and edge caching overhead fed back by the Internet of Vehicles environment, and all sample data are stored in an experience replay pool;
Step 5: judge whether the number of samples is sufficient; if so, go to step 6, otherwise go to step 7;
Step 6: when the number of samples is sufficient, each vehicle agent updates its own Actor network and Critic network parameters according to the soft actor-critic algorithm;
Step 7: the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
Step 8: judge whether the current training round has ended; if not, return to step 2 to start the next round of training; if so, go to step 9;
Step 9: judge whether convergence has been reached; if not, reset the Internet of Vehicles environment and return to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
Compared with the prior art, the invention has the following remarkable advantages: (1) aiming at the link congestion caused by the network-centric architecture in Internet of Vehicles caching scenarios, the invention designs user-centric edge node clusters by exploiting the deployment characteristics of dense edge nodes, so as to achieve high capacity, low caching overhead, and seamless coverage of vehicular communication; (2) aiming at the large amount of information exchange and the privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy-protection advantage of federated learning and realizes collaborative decision-making among vehicle users by sharing the neural network weights of local models, so as to reduce the long-term system content transmission delay and edge caching overhead.
2. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 1, wherein the Internet of Vehicles environment input in step 1 specifically comprises:
(1) Time slot model: the continuous training time is discretized into time slots, denoted 𝒯 = {1, 2, ..., T}, where each time slot has duration τ; the channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles is modeled as a Manhattan grid, with roadside units that can provide communication services uniformly distributed on both sides of the roads; each roadside unit serves as an edge node with dedicated communication resources and limited local storage resources, and is connected to an edge server through a high-speed wired link; the edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation, and the like; let 𝒥 denote the set of roadside units and 𝓘 the set of vehicle users; all vehicle users can travel along the four directions of the road (forward, backward, left, and right), and each direction has multiple lanes to ensure vehicle passage;
(3) Vehicle movement model: the speed variation of each vehicle follows a Gauss-Markov random process; specifically, when vehicle user i travels with initial speed v_i(0), its speed v_i(t) at time slot t can be expressed in terms of its speed v_i(t-1) at time slot t-1, its asymptotic speed, and a random variable:

v_i(t) = η_i v_i(t-1) + (1 - η_i) v̄_i + σ_i √(1 - η_i²) z,

where v̄_i and σ_i are the asymptotic mean and standard deviation of vehicle user i's speed, respectively; the parameter η_i ∈ [0,1] represents the memory depth of the previous-slot speed and determines the temporal dependence of vehicle user i's movement; z follows an uncorrelated zero-mean, unit-variance standard normal distribution; notably, the closer η_i is to 1, the more the current-slot speed of vehicle user i depends on the speed of the previous slot;
(4) Content request model: let 𝓕 denote the set of all contents that vehicles may request, s_f the size of content f, and g_f the feature vector of content f used to distinguish different contents; it is assumed that vehicle user i can generate only one content request f in time slot t, expressed by the binary indicator q_{i,f}(t) ∈ {0,1}, where q_{i,f}(t) = 1 indicates that vehicle user i requests content f in time slot t and q_{i,f}(t) = 0 otherwise; considering that a vehicle user may prefer files similar to its previously requested content in addition to globally popular content, it is assumed that the vehicle user requests content according to global popularity with probability ε and according to local personal preference with probability 1-ε, where global popularity and local personal preference are defined as follows:
(1) Global popularity: let P_f denote the global popularity of content f among the vehicle users' request files, which follows a Mandelbrot-Zipf distribution, i.e.

P_f = (I_f + α)^(-β) / Σ_{f'∈𝓕} (I_{f'} + α)^(-β),

where I_f denotes the rank of content f in descending order of global popularity, and α and β denote the plateau factor and the skewness factor, respectively;
(2) Local personal preference: in this case each vehicle user requests a file based on its similarity to previously requested content; for example, when a vehicle user requests content f in time slot t, the content with the highest similarity to f will be requested in time slot t+1; cosine similarity is adopted to measure the similarity between contents f and f*:

sim(f, f*) = (g_f · g_{f*}) / (‖g_f‖ ‖g_{f*}‖);
(5) User-centric edge cache model: in order to improve the transmission rate of request content of a vehicle user, a network architecture taking the user as a center is adopted to design a vehicle networking edge content cache frame; specifically, each vehicle user selects a single or a plurality of adjacent road side units to construct each edge node cluster taking the user as the center by observing the information of the road side units in the perception range; let the maximum number of road side units observable by the vehicle user i in the time slot t be O max The maximum number of road side units that it can associate is denoted as S max ≤O max The method comprises the steps of carrying out a first treatment on the surface of the Order the
Figure FDA0004025588110000032
Representing the cluster of edge nodes serving the vehicle user i at time slot t, then
Figure FDA0004025588110000033
Indicating the number of road side units in the cluster at the moment; let the edge association decision of the vehicle user i at the time slot t be expressed as:
Figure FDA0004025588110000034
wherein ,
Figure FDA0004025588110000035
representing road sidesUnit j belongs to the edge node cluster of vehicle user i in time slot t +.>
Figure FDA0004025588110000036
and otherwise
Figure FDA0004025588110000037
Further, the user-centric edge node cluster can be denoted as
Figure FDA0004025588110000038
Both the road side units and the cloud center are equipped with cache capacity, where the cache capacity of the cloud center can handle the cache requests of all vehicle users, while each road side unit has a limited cache capacity C. Under the user-centric edge content caching framework, a vehicle user can request to cache part of its requested files at the associated edge node cluster so as to provide a cached-content service; thus, the vehicle user further needs to decide whether to cache the requested content of the current time slot at each road side unit in the edge node cluster. The edge caching decision variable of vehicle user i in time slot t is expressed as
Figure FDA0004025588110000039
where
Figure FDA00040255881100000310
indicates that vehicle user i caches the requested content of time slot t at road side unit j, and otherwise
Figure FDA00040255881100000311
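The association and caching decision variables above can be illustrated with a small sketch (Python; the RSU identifiers, distances and S_max value are hypothetical) in which a vehicle user forms its user-centric edge node cluster from the nearest observable road side units:

def build_user_centric_cluster(distances, s_max):
    # distances: {rsu_id: distance} for road side units inside the sensing range
    # (at most O_max entries); here the vehicle associates with the s_max nearest.
    cluster = sorted(distances, key=distances.get)[:s_max]
    # Edge association decision: 1 if road side unit j is in the cluster, else 0
    association = {j: int(j in cluster) for j in distances}
    return cluster, association

distances = {0: 120.0, 1: 45.0, 2: 80.0, 3: 200.0}   # hypothetical distances in m
cluster, assoc = build_user_centric_cluster(distances, s_max=2)
print("edge node cluster:", cluster)
print("association decision:", assoc)

The nearest-first rule here is only one possible association policy; in the method itself the association decision is made by the learning agent.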
(6) Wireless transmission model: assume that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the road side units and the vehicle users are equipped with a single antenna. The channel power gain consists of fast fading and slow fading, where the fast fading part is mainly caused by the multipath effect, i.e., Rayleigh fading, and the slow fading part consists of path loss and shadow fading. Let
Figure FDA00040255881100000312
denote the channel gain between vehicle user i and road side unit j in time slot t, expressed as:
Figure FDA00040255881100000313
where
Figure FDA00040255881100000314
is the corresponding fast fading component, which follows a complex Gaussian distribution with zero mean and unit variance, i.e., satisfies
Figure FDA0004025588110000041
while the slow fading component between vehicle user i and road side unit j is expressed as
Figure FDA0004025588110000042
where A_i is a constant in the path loss; β_i is a log-normal shadow-fading component with standard deviation ζ;
Figure FDA0004025588110000043
denotes the distance between vehicle user i and road side unit j in time slot t; η denotes the attenuation exponent of the path loss component; and
Figure FDA0004025588110000044
denotes the path loss of the channel between vehicle user i and road side unit j in time slot t;
Let the transmit power of road side unit j in the edge node cluster
Figure FDA00040255881100000417
at time slot t be p_j; then the achievable downlink transmission rate for vehicle user i is expressed as
Figure FDA0004025588110000045
where B is the channel bandwidth and σ² is the power of the additive white Gaussian noise;
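As a numerical illustration of this transmission model (a sketch under the stated assumptions; the constant A, the path-loss exponent, the shadowing standard deviation and the power values are hypothetical), the achievable downlink rate can be evaluated as B·log2(1 + p_j·g/σ²), with the channel gain g combining path loss, log-normal shadowing and Rayleigh fast fading:

import numpy as np

def downlink_rate(p_j, d, bandwidth, noise_power, A=1.0, path_loss_exp=3.0, rng=None):
    # Slow fading: path loss A * d^(-path_loss_exp) times log-normal shadowing;
    # fast fading: Rayleigh, i.e. |h|^2 drawn from an exponential distribution.
    rng = rng or np.random.default_rng()
    shadowing = rng.lognormal(mean=0.0, sigma=0.5)
    slow = A * d ** (-path_loss_exp) * shadowing
    fast = rng.exponential(1.0)
    snr = p_j * slow * fast / noise_power
    return bandwidth * np.log2(1.0 + snr)

rate = downlink_rate(p_j=1.0, d=60.0, bandwidth=10e6, noise_power=1e-13)
print(f"achievable downlink rate ≈ {rate / 1e6:.1f} Mbit/s")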
(7) Delay model: let the tolerable delay of each vehicle user's requested content be
Figure FDA0004025588110000046
In addition, in order to fully utilize the edge cache resources, each edge server automatically refreshes its storage space every period Δ, so that once per period Δ it has a short service vacancy in which to replace the cached content. Therefore, if vehicle user i needs to download cached content from an edge server in time slot t, its content delivery must be completed within the tolerable delay
Figure FDA0004025588110000047
and within the current cache replacement refresh period, i.e.
Figure FDA0004025588110000048
where (n+1)Δ − t denotes the number of time slots remaining in the corresponding cache refresh period,
Figure FDA0004025588110000049
Considering that the vehicle user's content delivery delay depends on whether the requested content has been cached at the edge server in advance, the cache state of road side unit j in time slot t is defined as
Figure FDA00040255881100000410
where
Figure FDA00040255881100000411
indicates that content f has been cached at road side unit j in time slot t, and otherwise
Figure FDA00040255881100000412
Thus, delivery of the requested content by the vehicle user includes two cases, edge caching and cloud downloading:
(1) Edge caching: when, in the edge node cluster
Figure FDA00040255881100000413
any edge node has already cached the requested content f, vehicle user i can obtain the requested content directly from the edge, and the delay is expressed as
Figure FDA00040255881100000414
(2) Cloud downloading: when, in the edge node cluster
Figure FDA00040255881100000415
none of the edge nodes has cached the requested content f, the edge server needs to download it from the cloud center, resulting in an extra fixed delay
Figure FDA00040255881100000416
Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as
Figure FDA0004025588110000051
where 1{·} equals 1 when the condition inside holds, and 0 otherwise.
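A short sketch of the two delivery cases (Python; the content size, rate and extra cloud delay are hypothetical inputs) mirrors the edge-caching and cloud-downloading delays above:

def delivery_delay(content_size, rate, cached_in_cluster, cloud_extra_delay):
    # Edge caching: the content is fetched directly from an associated road side unit.
    # Cloud downloading: the road side unit first fetches the content from the cloud
    # center, which adds a fixed extra delay on top of the edge transmission time.
    edge_delay = content_size / rate
    return edge_delay if cached_in_cluster else edge_delay + cloud_extra_delay

print(delivery_delay(content_size=50e6, rate=100e6,
                     cached_in_cluster=True, cloud_extra_delay=0.5))   # edge hit
print(delivery_delay(content_size=50e6, rate=100e6,
                     cached_in_cluster=False, cloud_extra_delay=0.5))  # cloud download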
3. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 2, wherein the modeling of the optimization problem in step 1 is specifically:
Taking the long-term trade-off between minimizing content delivery delay and minimizing edge cache overhead as the optimization objective, a user-centric intelligent edge caching method is designed; to this end, the normalized content delivery delay utility and the normalized edge cache overhead utility are first designed, expressed as:
Figure FDA0004025588110000052
Figure FDA0004025588110000053
thus, the optimization problem is expressed as follows:
Figure FDA0004025588110000054
s.t. C1:
Figure FDA0004025588110000055
C2:
Figure FDA0004025588110000056
C3:
Figure FDA0004025588110000057
C4:
Figure FDA0004025588110000058
where ω_1 and ω_2 denote the weights of the content delivery delay utility and the cache overhead utility, respectively; C1 requires that the edge node cluster size of each vehicle user in any time slot does not exceed S_max; C2 means that each road side unit serves at most a single vehicle user in any time slot; C3 indicates that the content cached at each road side unit cannot exceed its local cache capacity; C4 indicates that the requested content delivery delay of each vehicle user in any time slot should not exceed the maximum allowable delay.
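A feasibility-check sketch for constraints C1-C4 follows (Python; the cluster sizes, per-RSU user counts, cache occupancies and delays are hypothetical inputs):

def constraints_satisfied(cluster_sizes, rsu_user_counts, rsu_cache_used,
                          delays, s_max, cache_capacity, max_delay):
    c1 = all(size <= s_max for size in cluster_sizes)            # C1: cluster size bound
    c2 = all(count <= 1 for count in rsu_user_counts)            # C2: one user per RSU
    c3 = all(used <= cache_capacity for used in rsu_cache_used)  # C3: cache capacity
    c4 = all(d <= max_delay for d in delays)                     # C4: delay tolerance
    return c1 and c2 and c3 and c4

print(constraints_satisfied(cluster_sizes=[2, 1], rsu_user_counts=[1, 0, 1],
                            rsu_cache_used=[80, 40], delays=[0.4, 0.7],
                            s_max=2, cache_capacity=100, max_delay=1.0))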
4. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 3, wherein in step 2 each vehicle agent interacts with the road side units within its observation range to obtain observation information such as the distance between the vehicle agent and each road side unit, the cache state of each road side unit, and the remaining cache capacity of each road side unit, specifically:
In time slot t, each vehicle user acts as an agent and obtains its own observation state through interaction with the environment; the observation state of vehicle i in time slot t is expressed as
Figure FDA0004025588110000059
where
Figure FDA00040255881100000510
denotes the distance between vehicle user i and road side unit j in time slot t;
Figure FDA00040255881100000511
is a flag bit indicating whether the requested content f has been cached at road side unit j in time slot t;
Figure FDA0004025588110000061
denotes the remaining cache space of road side unit j in time slot t; and
Figure FDA0004025588110000062
denotes the number of time slots remaining from time slot t until the next cache space refresh.
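A sketch of assembling one vehicle agent's local observation (Python/NumPy; the field values are hypothetical, and the four components follow the list above):

import numpy as np

def build_observation(distances, cache_flags, remaining_capacity, slots_to_refresh):
    # distances: distance to each road side unit in the observation range
    # cache_flags: per-RSU flag, whether the currently requested content is cached there
    # remaining_capacity: remaining cache space of each road side unit
    # slots_to_refresh: time slots left until the next cache-space refresh
    return np.concatenate([np.asarray(distances, dtype=float),
                           np.asarray(cache_flags, dtype=float),
                           np.asarray(remaining_capacity, dtype=float),
                           np.array([float(slots_to_refresh)])])

obs = build_observation([45.0, 80.0], [1, 0], [20.0, 60.0], 3)
print(obs.shape, obs)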
5. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 4, wherein in step 3 each vehicle agent independently determines, according to its local observation information, the road side units to associate with in its edge node cluster and whether to cache the requested content of the current time slot in the cluster, specifically:
The action taken by each vehicle agent when interacting with the environment consists of the edge association decision variables and the edge caching decision variables. The action of vehicle i in time slot t is expressed as:
Figure FDA0004025588110000063
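One possible flat encoding of this composite action is sketched below (Python/NumPy; the bit layout is hypothetical): for each road side unit in the observation range, one bit for the edge association decision and one bit for the edge caching decision:

import numpy as np

def decode_action(action_bits, num_rsus):
    # First num_rsus bits: edge association decision per road side unit;
    # next num_rsus bits: edge caching decision per road side unit.
    a = np.asarray(action_bits, dtype=int)
    return {"associate": a[:num_rsus], "cache": a[num_rsus:2 * num_rsus]}

print(decode_action([1, 0, 1, 0], num_rsus=2))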
6. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 5, wherein after the action decisions are executed in step 4, each vehicle agent obtains, fed back by the internet of vehicles environment, a reward that trades off the system's total content delivery delay against the edge cache overhead, specifically:
After all vehicle agents have executed their actions, the environment feeds back a global reward. The per-user average trade-off is defined as the global reward, expressed as:
Figure FDA0004025588110000064
where
Figure FDA0004025588110000065
denotes the normalized content delivery delay utility, and
Figure FDA00040255881100000610
denotes the normalized edge cache overhead utility; ρ_t denotes the penalty term imposed when constraints C1-C4 are not satisfied.
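A sketch of the global reward computed after all agents act (Python/NumPy; the per-user utility values, weights and penalty are hypothetical, and the sign convention is illustrative only):

import numpy as np

def global_reward(delay_utils, cache_utils, w1, w2, penalty):
    # Per-user weighted trade-off of the normalized delay utility and the
    # normalized edge cache overhead utility, averaged over users, with the
    # penalty term applied when constraints C1-C4 are violated.
    per_user = w1 * np.asarray(delay_utils) + w2 * np.asarray(cache_utils)
    return float(per_user.mean() - penalty)

print(global_reward([0.2, 0.5, 0.3], [0.1, 0.4, 0.2], w1=0.6, w2=0.4, penalty=0.0))
print(global_reward([0.2, 0.5, 0.3], [0.1, 0.4, 0.2], w1=0.6, w2=0.4, penalty=1.0))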
7. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 6, wherein in step 6 each vehicle agent updates its own Actor network and Critic network parameters according to the soft actor-critic algorithm, specifically:
(1) Soft value function: the optimization objective of the soft actor-critic algorithm is to maximize entropy in addition to maximizing the cumulative reward obtained from training, i.e.
Figure FDA0004025588110000066
where ρ_π denotes the trajectory distribution under policy π, and α denotes the entropy coefficient; to ensure that the agent can keep exploring, entropy is introduced to randomize the policy, expressed as
Figure FDA0004025588110000067
Therefore, when the algorithm makes decisions, the output action probabilities are kept as dispersed as possible, which improves the vehicle agent's ability to learn the environment and allows it to adaptively adjust its policy in an internet of vehicles environment with continuously changing channel conditions, thereby making more reasonable decisions. Based on this, the soft state-action value function is defined as follows
Figure FDA0004025588110000068
where
Figure FDA0004025588110000069
denotes the discounted cumulative reward; further, the soft state value function can be expressed as
Figure FDA0004025588110000071
where action a follows the probability distribution π(·|s).
(2) Critic network: under the Actor-Critic network framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, so as to maximize the long-term trade-off of the internet of vehicles system environment. To judge the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is constructed for each of them. The input of these neural networks is the local observation information collected by the vehicle agents in the environment, and the output is the value corresponding to each action in the action space. Since a neural network is generally adopted to evaluate the soft value function in the algorithm framework, the weight parameters of online value networks 1 and 2 and their corresponding target networks are defined as θ_main,1, θ_target,1, θ_main,2 and θ_target,2, respectively.
In order to improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is adopted to reduce the correlation among samples; during training, each vehicle agent draws from the replay pool
Figure FDA0004025588110000072
a randomly sampled mini-batch of transition tuples, i.e.
Figure FDA0004025588110000073
The soft state value function is then rewritten as:
Figure FDA0004025588110000074
where π(·|s_t)^T denotes the transpose of the action probability distribution π(·|s_t) in state s_t;
The updates of value networks 1 and 2 in the Critic network are approximated by constructing a Bellman mean-square-error loss function for each Q network, expressed as:
Figure FDA0004025588110000075
where i = 1, 2 denotes value networks 1 and 2, respectively; to achieve the maximum-entropy objective, the entropy needs to be included in the reward, and the function used to evaluate the value of the behavior policy, obtained from the general Bellman recursion, is expressed as:
Figure FDA0004025588110000076
In addition, to obtain the optimal approximation of Q_soft(s, a), the parameter θ_main,i is updated along the gradient direction so as to minimize the loss function, expressed as:
Figure FDA0004025588110000077
where η_c denotes the learning rate of the Critic network, and
Figure FDA0004025588110000078
denotes the gradient of the loss function;
The target value network does not actively participate in the learning process and cannot be updated independently, so a soft update scheme is adopted: the target network periodically copies the latest parameters of the online value network for a small-scale update, expressed as:
θ_target,i(t+1) ← τ·θ_main,i(t) + (1−τ)·θ_target,i(t)
where τ denotes the soft update coefficient.
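The soft update above corresponds to the following one-liner per parameter array (Python/NumPy sketch; the parameter values and shapes are hypothetical):

import numpy as np

def soft_update(theta_target, theta_main, tau):
    # theta_target <- tau * theta_main + (1 - tau) * theta_target
    return tau * theta_main + (1.0 - tau) * theta_target

theta_main = np.array([0.5, -1.2, 0.3])
theta_target = np.zeros(3)
for _ in range(3):
    theta_target = soft_update(theta_target, theta_main, tau=0.05)
print(theta_target)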
(3) Actor network: the task of the Actor network is to seek policy improvement based on the value estimates generated by the Critic networks. Its input is the local observation information collected by each vehicle agent from the environment, and its output is the action probability over the action dimension, expressed as π_θ(·|s). The update of the Actor network can be viewed as exploring the policy by maximizing the system reward, with action decisions guided by the soft state-action value function; that is, the loss function of the Actor network is defined as:
Figure FDA0004025588110000081
Similarly, the Actor network updates the parameter φ along the gradient direction to minimize the loss function, expressed as:
Figure FDA0004025588110000082
where η_a denotes the learning rate of the Actor network, and
Figure FDA0004025588110000083
denotes the gradient of the loss function;
(4) Self-adjusting entropy coefficient: the entropy coefficient α acts as a weight, and the randomness of the action policy is controlled by changing its value. The larger α is, the more dispersed the output action probabilities are when the algorithm makes decisions, leading the vehicle agent to explore the environment more thoroughly and try more actions. Considering that the reward value fed back by the system keeps changing during training in the internet of vehicles environment, relying on a fixed a-priori α can make training unstable and thus degrade the convergence performance of the system. In view of this, an adaptive entropy coefficient method is adopted: when the vehicle agent explores a new environment at the beginning of training, α is increased so that the agent explores as fully as possible; as the number of training rounds increases and the optimal action is essentially determined, α is decreased so as to optimize the long-term reward trade-off of the system as much as possible. The loss function of the entropy coefficient is defined as:
Figure FDA0004025588110000084
where
Figure FDA0004025588110000085
denotes the entropy threshold in the current state s_t.
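To make the above update steps concrete, the following is a condensed, self-contained sketch of one discrete soft actor-critic update in PyTorch; the network sizes, hyper-parameters, mini-batch data and entropy threshold are hypothetical, and the variable names are illustrative rather than taken from the original. It covers the twin-critic Bellman loss, the Actor loss guided by the soft state-action value, the adaptive entropy-coefficient loss, and the soft target update described above:

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma, tau = 9, 4, 0.95, 0.05
def mlp():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
actor, q1, q2 = mlp(), mlp(), mlp()
q1_t, q2_t = copy.deepcopy(q1), copy.deepcopy(q2)          # target value networks
log_alpha = torch.zeros(1, requires_grad=True)              # adaptive entropy coefficient
target_entropy = 0.6 * torch.log(torch.tensor(float(n_actions)))  # hypothetical threshold
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
opt_alpha = torch.optim.Adam([log_alpha], lr=3e-4)

# A hypothetical mini-batch standing in for samples from the experience replay pool
B = 32
s, s2 = torch.randn(B, obs_dim), torch.randn(B, obs_dim)
a = torch.randint(0, n_actions, (B,))
r, done = torch.randn(B), torch.zeros(B)
alpha = log_alpha.exp()

# Critic update: soft Bellman target using the minimum of the two target Q networks
with torch.no_grad():
    pi2 = F.softmax(actor(s2), dim=-1)
    logpi2 = torch.log(pi2 + 1e-8)
    v2 = (pi2 * (torch.min(q1_t(s2), q2_t(s2)) - alpha * logpi2)).sum(-1)
    y = r + gamma * (1.0 - done) * v2
critic_loss = (F.mse_loss(q1(s).gather(1, a.view(-1, 1)).squeeze(1), y)
               + F.mse_loss(q2(s).gather(1, a.view(-1, 1)).squeeze(1), y))
opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Actor update: minimize E_pi[ alpha * log pi - min Q ]
pi = F.softmax(actor(s), dim=-1)
logpi = torch.log(pi + 1e-8)
q_min = torch.min(q1(s), q2(s)).detach()
actor_loss = (pi * (alpha.detach() * logpi - q_min)).sum(-1).mean()
opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Adaptive entropy coefficient: push the policy entropy toward the target threshold
ent_gap = (pi * logpi).sum(-1).detach() + target_entropy    # E[log pi] + threshold
alpha_loss = -(log_alpha * ent_gap).mean()
opt_alpha.zero_grad(); alpha_loss.backward(); opt_alpha.step()

# Soft update of the target value networks
with torch.no_grad():
    for tgt, src in [(q1_t, q1), (q2_t, q2)]:
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)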
8. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 7, wherein in step 7 the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users for local training in the next training round, specifically:
First, each vehicle agent trains its DRL model in a distributed manner based on local observation information; second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center for policy sharing; finally, the federated-aggregated global model parameters are broadcast back to the local agents for the next training iteration. This communication scheme only requires uploading neural network parameters and downloading policies, which greatly reduces the communication load; in addition, since the users' local information cannot be obtained directly from the neural network parameters, the privacy of the system is protected. In each training round, the update formula of the vehicle agent's Actor network weight parameters is:
φ_i(t+1) = ξ·φ_i(t) + (1−ξ)·H_i(t)
where H_i(t) = Σ_{k≠i} φ_k(t), and ξ is a weight coefficient.
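A sketch of this federated aggregation step (Python/NumPy; the per-agent weight vectors and the value of ξ are hypothetical): each agent's Actor weights are combined with the sum of the other agents' weights according to the update formula above:

import numpy as np

def federated_update(phis, xi):
    # phis: list of per-agent Actor weight vectors phi_i(t)
    # H_i(t) is the sum of the other agents' weights; the new local weights are
    # xi * phi_i(t) + (1 - xi) * H_i(t), as in the update formula above.
    total = np.sum(phis, axis=0)
    return [xi * phi + (1.0 - xi) * (total - phi) for phi in phis]

phis = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.0, 0.4])]
print(federated_update(phis, xi=0.8))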
CN202211708649.XA 2022-12-29 2022-12-29 Internet of vehicles edge content caching decision method based on federal reinforcement learning Pending CN116156455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211708649.XA CN116156455A (en) 2022-12-29 2022-12-29 Internet of vehicles edge content caching decision method based on federal reinforcement learning


Publications (1)

Publication Number Publication Date
CN116156455A true CN116156455A (en) 2023-05-23

Family

ID=86361158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211708649.XA Pending CN116156455A (en) 2022-12-29 2022-12-29 Internet of vehicles edge content caching decision method based on federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN116156455A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116582840A (en) * 2023-07-13 2023-08-11 江南大学 Level distribution method and device for Internet of vehicles communication, storage medium and electronic equipment
CN116911480A (en) * 2023-07-25 2023-10-20 北京交通大学 Path prediction method and system based on trust sharing mechanism in Internet of vehicles scene
CN116709359A (en) * 2023-08-01 2023-09-05 南京邮电大学 Self-adaptive route joint prediction method for flight Ad Hoc network
CN116709359B (en) * 2023-08-01 2023-10-31 南京邮电大学 Self-adaptive route joint prediction method for flight Ad Hoc network
CN116761152A (en) * 2023-08-14 2023-09-15 合肥工业大学 Roadside unit edge cache placement and content delivery method
CN116761152B (en) * 2023-08-14 2023-11-03 合肥工业大学 Roadside unit edge cache placement and content delivery method
CN117118592A (en) * 2023-10-25 2023-11-24 北京航空航天大学 Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm
CN117118592B (en) * 2023-10-25 2024-01-09 北京航空航天大学 Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm
CN117793805A (en) * 2024-02-27 2024-03-29 厦门宇树康信息技术有限公司 Dynamic user random access mobile edge computing resource allocation method and system
CN117793805B (en) * 2024-02-27 2024-04-26 厦门宇树康信息技术有限公司 Dynamic user random access mobile edge computing resource allocation method and system
CN117939505A (en) * 2024-03-22 2024-04-26 南京邮电大学 Edge collaborative caching method and system based on excitation mechanism in vehicle edge network
CN117939505B (en) * 2024-03-22 2024-05-24 南京邮电大学 Edge collaborative caching method and system based on excitation mechanism in vehicle edge network
CN117979259A (en) * 2024-04-01 2024-05-03 华东交通大学 Asynchronous federation deep learning method and system for mobile edge collaborative caching


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination