CN116156455A - Internet of vehicles edge content caching decision method based on federal reinforcement learning - Google Patents
- Publication number
- CN116156455A (application CN202211708649.XA)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- edge
- content
- time slot
- cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/16—Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
- H04W28/18—Negotiating wireless communication parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/30—Services specially adapted for particular environments, situations or purposes
- H04W4/40—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
- H04W4/44—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an Internet of Vehicles edge content caching decision method based on federal reinforcement learning, which specifically comprises the following steps: input the Internet of Vehicles environment and initialize the network parameters of each vehicle; in the current time slot, each vehicle interacts with the road side units to obtain observation information; according to the observation information, each vehicle independently decides its action; after the action is executed, each vehicle obtains a reward fed back by the environment, and the sample data are cached in an experience replay pool; when the number of samples is sufficient, each vehicle updates its networks according to a soft actor-critic algorithm; an aggregation center collects the local network parameters for federated aggregation and broadcasts the aggregated parameters back for local training; after the current training round ends, the Internet of Vehicles environment is reset and the next round of training begins. The invention aims to minimize the long-term trade-off between system content transmission delay and edge cache overhead by exploiting a user-centric network architecture in the Internet of Vehicles environment, so that vehicles complete distributed edge caching decisions while protecting privacy.
Description
Technical Field
The invention relates to the technical field of wireless communication, and in particular to an Internet of Vehicles edge content caching decision method based on federal reinforcement learning.
Background
In recent years, driven by sixth-generation mobile communication technology, the Internet of Vehicles, the automobile industry, big data, artificial intelligence, and other new-generation information technologies have become deeply integrated and can provide efficient and reliable communication services. However, as the number of vehicles grows, ever-increasing real-time communication services place higher demands on the ultra-low delay and ultra-high reliability of the Internet of Vehicles. To address these challenges, edge caching techniques reduce content delivery delay by deploying cache resources at edge nodes to complete content distribution locally, avoiding reliance on cloud-centric delivery (Zhang Y, Zhao J, Cao G. Roadcast: a popularity aware content sharing scheme in VANETs. ACM SIGMOBILE Mobile Computing and Communications Review, 2010, 13(4): 1-14.). However, due to the limited storage capacity of edge nodes, the rapidly growing data requests of Internet of Vehicles applications cannot all be cached at the edge; moreover, in view of the high mobility of vehicles, frequently changing content demands, and harsh communication environments, an efficient edge content caching method needs to be designed for Internet of Vehicles scenarios.
Considering that the intelligent edge content caching problem of the Internet of Vehicles is essentially a model-free discrete sequential decision problem, it can be solved by multi-agent reinforcement learning, in which agents complete local decisions by sharing training information. Compared with traditional optimization algorithms, deep reinforcement learning can learn from experience through an agent's interaction with an uncertain environment to solve dynamic decision problems. Even when dynamic environmental changes cannot be predicted in advance, the agent can learn how to act, i.e., how to map acquired information to actions, so as to maximize the system reward. In recent years, researchers at home and abroad have focused on intelligent edge content caching decisions to efficiently utilize edge caching resources in the dynamic wireless transmission environment of the Internet of Vehicles. For example, Qiao et al. use a deep deterministic policy gradient algorithm to learn the dynamics of the Internet of Vehicles wireless environment from the local observations of vehicle users, and propose a collaborative edge caching method to minimize the long-term trade-off between system content transmission delay and caching overhead (Qiao G, Leng S, Maharjan S, et al. Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks. IEEE Internet of Things Journal, 2019, 7(1): 247-257.). However, most existing Internet of Vehicles edge content caching schemes are built on a centralized network architecture and do not fully exploit the deployment characteristics of densely and heterogeneously deployed edge nodes. In addition, given the openness of Internet of Vehicles communication, privacy protection of vehicle users' personal data is also non-negligible.
Therefore, in environments with densely and heterogeneously deployed edge nodes, how to realize high capacity, low cache overhead, and seamless coverage of vehicular communication with a decentralized network architecture, on the basis of privacy protection, still requires further research.
Disclosure of Invention
In order to overcome the problems in the related art, the embodiment of the invention discloses an Internet of Vehicles edge content caching decision method based on federal reinforcement learning. The method comprises the following steps:
step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment, and all sample data are cached in an experience replay pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own actor network and critic network parameters according to the soft actor-critic algorithm;
step 7, the aggregation center collects the actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
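Steps 1-9 can be sketched as the following control-flow skeleton (the classes, the toy policy, and the toy reward are illustrative stand-ins introduced here for clarity; the patent's agents use SAC actor/critic neural networks, not these placeholders):

```python
import random

# Illustrative skeleton of steps 1-9; VehicleAgent is a toy stand-in,
# not the patent's actual actor/critic neural networks.
class VehicleAgent:
    def __init__(self):
        self.actor_weights = [random.random() for _ in range(4)]
        self.replay_pool = []

    def act(self, obs):                      # step 3: independent local decision
        return obs % 2                       # placeholder policy

    def store(self, sample):                 # step 4: experience replay pool
        self.replay_pool.append(sample)

    def update(self):                        # step 6: local SAC update (stub)
        self.actor_weights = [w * 0.99 for w in self.actor_weights]

def federated_round(agents, env_slots=5, min_samples=3):
    for t in range(env_slots):               # step 2: interact per time slot
        for ag in agents:
            obs = t                          # toy observation
            action = ag.act(obs)
            reward = -abs(action - 1)        # toy trade-off reward
            ag.store((obs, action, reward))
    for ag in agents:                        # steps 5-6: update if enough samples
        if len(ag.replay_pool) >= min_samples:
            ag.update()
    # step 7: the aggregation center averages actor weights and broadcasts them
    avg = [sum(ws) / len(agents)
           for ws in zip(*(a.actor_weights for a in agents))]
    for ag in agents:
        ag.actor_weights = list(avg)
    return avg

agents = [VehicleAgent() for _ in range(3)]
aggregated = federated_round(agents)
```

After one round, all agents share the aggregated actor weights, which is the property the federated aggregation of step 7 provides.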
Compared with the prior art, the invention has the following remarkable advantages: (1) To address the link congestion caused by the network-centric architecture of existing Internet of Vehicles caching scenarios, the invention exploits the deployment characteristics of dense edge nodes to design user-centric edge node clusters, realizing high capacity, low cache overhead, and seamless coverage of vehicular communication; (2) To address the heavy information interaction and privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy protection advantage of federated learning, realizing collaborative decisions among vehicle users by sharing local neural network weights, so as to reduce the long-term content transmission delay and edge cache overhead of the system.
The invention is further described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a method for caching and deciding the edge content of the Internet of vehicles based on a federal reinforcement learning framework.
FIG. 2 is a graph showing the convergence of the average trade-off per vehicle user over training rounds in an embodiment of the present invention.
Fig. 3 is a graph showing the convergence of the average transmission delay per vehicle user over training rounds in an embodiment of the present invention.
Fig. 4 is a graph showing the convergence of the average cache overhead per vehicle user over training rounds in an embodiment of the present invention.
FIG. 5 is a graph showing the average trade-off per vehicle user as a function of the maximum number of associations in an embodiment of the present invention.
Detailed Description
The invention provides an Internet of Vehicles edge content caching decision method based on federal reinforcement learning. Specifically, in each unit time slot, every vehicle user is regarded as an agent that obtains observation information within its observation range, such as its distance to each road side unit, the cache state of each road side unit, and the remaining cache capacity of each road side unit, and trains its action policy with a neural network to minimize the trade-off between the system's total content delivery delay and edge cache overhead. With reference to FIGS. 1-2, the method comprises the following steps:
Step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment, and all sample data are cached in an experience replay pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own actor network and critic network parameters according to the soft actor-critic algorithm;
step 7, the aggregation center collects the actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast to the vehicle users within the training round for local training;
Step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, training ends and the Internet of Vehicles edge content caching decision is complete.
As a specific embodiment, the Internet of Vehicles environment input in step 1 specifically includes:
(1) Time slot model: the continuous training time is discretized into multiple time slots, denoted $\mathcal{T} = \{1, 2, \dots, T\}$, each of duration τ; channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles is modeled as a Manhattan grid, with road side units capable of providing communication services uniformly distributed on both sides of the roads. Each road side unit serves as an edge node with dedicated communication resources and limited local storage, and is connected to an edge server through a high-speed wired link; the edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation, and so on. Let the set of road side units be $\mathcal{J}$ and the set of vehicles be $\mathcal{I}$. Vehicle users may travel in all four directions of the road grid (forward, backward, left, and right), and each direction has multiple lanes to ensure the passage of vehicles;
(3) Vehicle mobility model: the speed of each vehicle follows a Gauss-Markov random process. Specifically, for vehicle user i with initial speed $v_i(0)$, its speed in time slot t is determined by its speed in time slot t-1:

$v_i(t) = \eta_i v_i(t-1) + (1 - \eta_i)\bar{v}_i + \sigma_i \sqrt{1 - \eta_i^2}\, z$

where $\bar{v}_i$ and $\sigma_i$ are the asymptotic mean and standard deviation of vehicle user i's speed; the parameter $\eta_i \in [0,1]$ is the memory depth of the previous slot's speed and determines the time dependence of vehicle user i's movement; and z is an uncorrelated standard normal random variable with zero mean and unit variance. Notably, the closer $\eta_i$ is to 1, the more the current slot speed of vehicle user i depends on the speed of the previous slot;
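The Gauss-Markov speed update can be simulated directly; a small sketch with assumed parameter values ($\eta_i = 0.9$, asymptotic mean 15 m/s, $\sigma_i = 2$, none of which are specified in the patent):

```python
import math
import random

def gauss_markov_speed(v_prev, eta, v_mean, sigma, z):
    """One Gauss-Markov update:
    v(t) = eta*v(t-1) + (1-eta)*v_mean + sigma*sqrt(1-eta^2)*z, z ~ N(0,1)."""
    return eta * v_prev + (1 - eta) * v_mean + sigma * math.sqrt(1 - eta * eta) * z

random.seed(0)
v = 20.0                                   # assumed initial speed v_i(0), m/s
trace = [v]
for _ in range(1000):
    v = gauss_markov_speed(v, eta=0.9, v_mean=15.0, sigma=2.0,
                           z=random.gauss(0.0, 1.0))
    trace.append(v)

# With eta close to 1 the speed sticks to its previous value; with eta = 0
# it regresses immediately to the asymptotic mean v_mean.
avg = sum(trace[200:]) / len(trace[200:])
```

After a burn-in, the long-run average hovers near the asymptotic mean, matching the role of $\bar{v}_i$ in the model.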
(4) Content request model: let $\mathcal{F}$ denote the set of all contents that vehicles may request, $s_f$ the size of content f, and $g_f$ the feature value of content f used to distinguish different contents. Assume that vehicle user i generates at most one content request in time slot t, expressed as $q_{i,f}(t) \in \{0,1\}$ with $\sum_{f \in \mathcal{F}} q_{i,f}(t) \le 1$, where $q_{i,f}(t) = 1$ indicates that vehicle user i requests content f in time slot t, and $q_{i,f}(t) = 0$ otherwise. Considering that, in addition to globally popular content, a vehicle user may prefer files similar to content requested in previous time slots, it is assumed that each vehicle user requests content according to global popularity with probability ε, and according to local personal preference with probability 1-ε, where the two are defined as follows:
(1) Global popularity: let $P_f$ denote the global popularity of content f among vehicle users' requests, which follows a Mandelbrot-Zipf distribution, i.e.

$P_f = \dfrac{(I_f + \alpha)^{-\beta}}{\sum_{f' \in \mathcal{F}} (I_{f'} + \alpha)^{-\beta}}$

where $I_f$ denotes the rank of content f in descending order of global popularity, and α and β denote the plateau factor and the skewness factor, respectively. (2) Local personal preference: in this case, each vehicle user requests a file based on its similarity to previously requested content; for example, if the vehicle user requests content f in time slot t, then in time slot t+1 it will request the content with the highest similarity to f. Cosine similarity is adopted here to measure the similarity between contents f and f*:

$\mathrm{sim}(f, f^{*}) = \dfrac{g_f \cdot g_{f^{*}}}{\|g_f\| \, \|g_{f^{*}}\|}$
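The Mandelbrot-Zipf popularity and the cosine-similarity preference can be computed directly; a sketch following the patent's naming (α as plateau factor, β as skewness factor; the numeric values are illustrative assumptions):

```python
import math

def mzipf_probs(num_contents, alpha, beta):
    """Mandelbrot-Zipf popularity: P_f proportional to (rank + alpha)^(-beta),
    normalized over all contents."""
    weights = [(rank + alpha) ** (-beta) for rank in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def cosine_similarity(u, v):
    """Similarity between the feature vectors of two contents f and f*."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

probs = mzipf_probs(num_contents=100, alpha=0.5, beta=1.2)
```

The probabilities decrease with rank, so low-rank (most popular) contents dominate requests, which is what makes edge caching of a few hot files effective.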
(5) User-centric edge caching model: to improve the transmission rate of vehicle users' requested content, a user-centric network architecture is adopted to design the Internet of Vehicles edge content caching framework. Specifically, each vehicle user selects one or more neighboring road side units to construct its own user-centric edge node cluster by observing the information of road side units within its perception range. Let the maximum number of road side units observable by vehicle user i in time slot t be $O_{max}$, and the maximum number it can associate with be $S_{max} \le O_{max}$. Let $\Phi_i(t)$ denote the edge node cluster serving vehicle user i in time slot t; then $|\Phi_i(t)|$ is the number of road side units in the cluster at that moment. The edge association decision of vehicle user i in time slot t is expressed as $x_{i,j}(t) \in \{0,1\}$, where $x_{i,j}(t) = 1$ indicates that road side unit j belongs to the edge node cluster of vehicle user i in time slot t, and $x_{i,j}(t) = 0$ otherwise. Accordingly, the user-centric edge node cluster can be represented as $\Phi_i(t) = \{ j : x_{i,j}(t) = 1 \}$.
Both the road side units and the cloud center have cache capacity: the cloud center's capacity can handle the cache requests of all vehicle users, while each road side unit has a limited cache capacity C. Under the user-centric edge content caching framework, a vehicle user can request that part of its requested files be cached at its associated edge node cluster, which then provides the cached-content service. Thus, the vehicle user must further decide whether to cache the requested content of the current time slot at each road side unit in the edge node cluster. The edge caching decision variable of vehicle user i in time slot t is expressed as $y_{i,j}(t) \in \{0,1\}$, where $y_{i,j}(t) = 1$ indicates that vehicle user i caches the requested content of time slot t at road side unit j, and $y_{i,j}(t) = 0$ otherwise.
(6) Wireless transmission model: assume that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the road side units and the vehicle users are equipped with single antennas. The channel power gain comprises fast fading and slow fading: the fast fading is mainly caused by the multipath effect, i.e., Rayleigh fading, while the slow fading consists of path loss and shadow fading. The channel gain between vehicle user i and road side unit j in time slot t is expressed as

$h_{i,j}(t) = \sqrt{\varphi_{i,j}(t)}\, g_{i,j}(t)$

where $g_{i,j}(t)$ is the fast fading component, which follows a complex Gaussian distribution with zero mean and unit variance, i.e., $g_{i,j}(t) \sim \mathcal{CN}(0,1)$, while the slow fading component between vehicle user i and road side unit j is

$\varphi_{i,j}(t) = A_i \beta_i \left[ d_{i,j}(t) \right]^{-\eta}$

where $A_i$ is a constant of the path loss; $\beta_i$ is the lognormal shadow fading component with standard deviation ζ; $d_{i,j}(t)$ is the distance between vehicle user i and road side unit j in time slot t; η is the attenuation exponent of the path loss component; and $\left[ d_{i,j}(t) \right]^{-\eta}$ captures the path loss of the channel between vehicle user i and road side unit j in time slot t;
Let the transmit power of road side unit j in the edge node cluster $\Phi_i(t)$ at time slot t be $p_j$. The achievable downlink transmission rate of vehicle user i is then expressed as

$r_{i,j}(t) = B \log_2\!\left( 1 + \dfrac{p_j |h_{i,j}(t)|^2}{\sigma^2} \right)$

where B is the channel bandwidth and $\sigma^2$ is the power of the additive white Gaussian noise;
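Under the interference-free assumption, the rate formula can be exercised numerically; the sketch below composes the slow fading (path loss times shadowing) with Rayleigh fast fading, whose power for $g \sim \mathcal{CN}(0,1)$ is exponentially distributed with mean 1 (all parameter values are illustrative assumptions):

```python
import math
import random

def downlink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Achievable rate r = B * log2(1 + p * |h|^2 / sigma^2), valid under the
    model's assumption of interference-free orthogonal resource blocks."""
    return bandwidth_hz * math.log2(1 + tx_power_w * channel_gain / noise_power_w)

def channel_power_gain(pathloss_const, shadowing, distance_m, path_exp, rng):
    """|h|^2 = slow fading (A * shadow * d^-eta) times Rayleigh fast fading;
    the fast-fading power |g|^2 for g ~ CN(0,1) is Exp(1)."""
    slow = pathloss_const * shadowing * distance_m ** (-path_exp)
    fast = rng.expovariate(1.0)
    return slow * fast

rng = random.Random(1)
gain = channel_power_gain(pathloss_const=1e-3, shadowing=1.0,
                          distance_m=100.0, path_exp=3.0, rng=rng)
rate = downlink_rate(bandwidth_hz=10e6, tx_power_w=1.0,
                     channel_gain=gain, noise_power_w=1e-13)
```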
(7) Delay model: let the requested content of each vehicle user have a maximum tolerable delay $T^{max}$. In addition, to make full use of edge cache resources, the edge server automatically refreshes its storage space every period Δ, leaving a short service vacancy for cache replacement in each period. Therefore, if vehicle user i needs to download cached content from the edge server in time slot t, its content delivery must be completed within the tolerable delay and before the next cache-replacement refresh, i.e.

$T_i(t) \le \min\{ T^{max},\ (n+1)\Delta - t \}$

where $(n+1)\Delta - t$ is the number of time slots remaining in the corresponding cache refresh period.
Considering that the content delivery delay of a vehicle user depends on whether the requested content has been cached at the edge server in advance, the cache state of road side unit j in time slot t is defined as $c_{j,f}(t) \in \{0,1\}$, where $c_{j,f}(t) = 1$ indicates that content f is cached at road side unit j in time slot t, and $c_{j,f}(t) = 0$ otherwise. Delivery of the requested content to the vehicle user thus includes two cases, edge caching and cloud downloading:
(1) Edge caching: when any edge node in the cluster $\Phi_i(t)$ has cached the requested content f, vehicle user i can obtain it directly from the edge, with delay

$T_i^{edge}(t) = \dfrac{s_f}{r_{i,j}(t)}$

(2) Cloud downloading: when no edge node in the cluster $\Phi_i(t)$ has cached the requested content f, the edge server must download it from the cloud center, incurring an extra fixed delay $T^{cloud}$.
Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as

$T_i(t) = \mathbb{1}\{\exists j \in \Phi_i(t) : c_{j,f}(t) = 1\}\, T_i^{edge}(t) + \mathbb{1}\{\forall j \in \Phi_i(t) : c_{j,f}(t) = 0\}\, \big( T_i^{edge}(t) + T^{cloud} \big)$

where $\mathbb{1}\{\cdot\}$ equals 1 if the condition inside holds and 0 otherwise.
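The two delivery cases combine into a single delay computation; a sketch under the model's assumptions (the function and variable names are illustrative):

```python
def content_delivery_delay(content_size_bits, rate_bps, cluster_cache_flags,
                           cloud_extra_delay_s):
    """Total delay for one request: edge delay s_f / r if any road side unit
    in the user-centric cluster holds the content, plus a fixed cloud-fetch
    delay otherwise (the indicator-function form of the delay model)."""
    edge_delay = content_size_bits / rate_bps
    if any(cluster_cache_flags):              # edge caching case
        return edge_delay
    return edge_delay + cloud_extra_delay_s   # cloud downloading case
```

For example, a 1 Mb file at 1 Mb/s takes 1 s when a cluster hit occurs, and 1 s plus the fixed cloud delay on a miss.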
As a specific embodiment, modeling the optimization problem in step 1 is specifically:
Taking the minimization of the long-term trade-off between content delivery delay and edge cache overhead as the optimization objective, a user-centric intelligent edge caching scheme is designed. To this end, normalized content-delivery-delay and edge-cache-overhead utilities, denoted $U_i^{delay}(t)$ and $U_i^{cache}(t)$, are first designed.
The optimization problem is thus expressed as:

$\min\ \lim_{T \to \infty} \dfrac{1}{T} \sum_{t=1}^{T} \sum_{i \in \mathcal{I}} \big[ \omega_1 U_i^{delay}(t) + \omega_2 U_i^{cache}(t) \big] \quad \text{s.t. } C_1\text{-}C_4$

where $\omega_1$ and $\omega_2$ are the weights of the content delivery delay utility and the cache overhead utility, respectively; $C_1$ requires that the edge node cluster size of each vehicle user in any time slot not exceed $S_{max}$; $C_2$ requires that each road side unit serve at most a single vehicle user in any time slot; $C_3$ requires that the content cached at each road side unit not exceed its local cache capacity; and $C_4$ requires that the requested-content delivery delay of each vehicle user in any time slot not exceed the maximum tolerable delay.
As a specific embodiment, in step 2, each vehicle agent interacts with the road side units within its observation range to obtain observation information such as its distance to each road side unit, the cache state of each road side unit, and the remaining cache capacity of each road side unit. Specifically: in time slot t, each vehicle user acts as an agent and obtains its own observation state through interaction with the environment. The observation state of vehicle i in time slot t is expressed as

$o_i(t) = \big\{ d_{i,j}(t),\ c_{j,f}(t),\ C_j^{rem}(t),\ \Delta^{rem}(t) \big\}$

where $d_{i,j}(t)$ is the distance between vehicle user i and road side unit j in time slot t; $c_{j,f}(t)$ is the flag indicating whether the requested content f has been cached at road side unit j in time slot t; $C_j^{rem}(t)$ is the remaining cache space of road side unit j in time slot t; and $\Delta^{rem}(t)$ is the number of time slots remaining until the next cache-space refresh.
As a specific embodiment, in step 3, each vehicle agent can independently decide, according to its local observation information, the associated road side units in the edge node cluster and whether to cache the requested content of the current time slot within the cluster. Specifically, the actions taken by each vehicle agent interacting with the environment comprise the edge association decision variables and the edge caching decision variables; the action of vehicle i in time slot t is expressed as

$a_i(t) = \big\{ x_{i,j}(t),\ y_{i,j}(t) \big\}$
As a specific embodiment, after the action decision is executed in step 4, each vehicle agent obtains the trade-off reward of the system's total content delivery delay and edge cache overhead fed back by the Internet of Vehicles environment. Specifically, once all vehicle agents have executed their actions, the environment feeds back a global reward, defined as the average trade-off per user:

$r(t) = -\dfrac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \big[ \omega_1 U_i^{delay}(t) + \omega_2 U_i^{cache}(t) \big] - \rho_t$

where $U_i^{delay}(t)$ is the normalized content-delivery-delay utility, $U_i^{cache}(t)$ the normalized edge-cache-overhead utility, and $\rho_t$ the penalty term applied when constraints $C_1$-$C_4$ are not satisfied.
As a specific embodiment, in step 6, each vehicle agent updates its own actor network and critic network parameters according to the federated discrete soft actor-critic algorithm, specifically:
(1) Flexibility value function: the flexible actor-critic algorithm optimization objective is to meet entropy maximization in addition to maximizing the cumulative rewards value returned by training, i.e
wherein ,ρπ A trajectory distribution representing a strategy pi; alpha represents an entropy coefficient; to ensure that the agent can continually explore, entropy is introduced to randomize the strategy, expressed asTherefore, when the algorithm makes a decision, the probability of outputting each action can be dispersed as far as possible, so that the learning capacity of the vehicle intelligent body on the environment is improved, the vehicle intelligent body can adaptively adjust the strategy in the vehicle networking environment with continuously changed channel conditions, and further a more reasonable decision is made; based on this, a soft state action value function is defined as follows
wherein ,a cumulative award representing a discount; further, the soft state value function may be expressed as
Wherein action a depends on the probability distribution pi (|s).
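For a discrete action space, the expectation over actions in the soft state value reduces to a dot product with the policy probabilities. A minimal numerical sketch, with illustrative names:

```python
import numpy as np

def soft_state_value(q_values, log_probs, alpha):
    """Discrete-action soft state value
    V(s) = pi(.|s)^T [ Q(s,.) - alpha * log pi(.|s) ].
    q_values and log_probs are vectors over the action space."""
    probs = np.exp(log_probs)
    return float(probs @ (q_values - alpha * log_probs))
```

With a uniform two-action policy and equal Q-values, the entropy bonus adds exactly `alpha * log(2)` to the value.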
(2) Critic network: under the Actor-Critic framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, so as to maximize the long-term trade-off of the Internet of Vehicles system. To judge the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is built for each. The input of these neural networks is the local observation information collected by the vehicle agents in the environment, and the output is the value of each action in the action space. Since a neural network is commonly used to approximate the soft value function in this framework, the weight parameters of online value networks 1 and 2 and their corresponding target networks are defined as $\theta_{\mathrm{main},1}$, $\theta_{\mathrm{target},1}$, $\theta_{\mathrm{main},2}$ and $\theta_{\mathrm{target},2}$, respectively.

To improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is adopted to weaken the correlation among samples. During training, each vehicle agent randomly draws a minibatch of transition tuples $(s_t, a_t, r_t, s_{t+1})$ from its replay pool $\mathcal{D}$, and the soft state value function for discrete actions is rewritten as

$$V_{\mathrm{soft}}(s_t) = \pi(\cdot|s_t)^{\mathsf T}\big[Q_{\mathrm{soft}}(s_t,\cdot) - \alpha\log\pi(\cdot|s_t)\big]$$

where $\pi(\cdot|s_t)^{\mathsf T}$ denotes the transpose of the action probability distribution $\pi(\cdot|s_t)$ at state $s_t$.
The updates of value networks 1 and 2 in the Critic are approximated by constructing a Bellman mean-square-error loss function for each Q network, expressed as

$$J_Q(\theta_{\mathrm{main},i}) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[\big(Q_{\theta_{\mathrm{main},i}}(s_t,a_t) - y_t\big)^2\Big]$$

where $i=1,2$ indexes value networks 1 and 2. To achieve the maximum-entropy objective, the entropy must be incorporated into the reward; the Bellman backup that evaluates the behavior policy then reads

$$y_t = r_t + \gamma\,V_{\mathrm{soft}}(s_{t+1})$$

with $V_{\mathrm{soft}}$ computed from the target networks. In addition, to obtain the optimal approximation of $Q_{\mathrm{soft}}(s,a)$, the parameters $\theta_{\mathrm{main},i}$ are updated along the gradient direction that minimizes the loss:

$$\theta_{\mathrm{main},i} \leftarrow \theta_{\mathrm{main},i} - \eta_c \nabla_{\theta_{\mathrm{main},i}} J_Q(\theta_{\mathrm{main},i})$$

where $\eta_c$ denotes the learning rate of the Critic network and $\nabla$ the gradient of the loss function;
The target value networks do not actively participate in the learning process and are not updated independently; instead, a soft update is adopted, periodically copying the latest parameters of the online value networks in small steps:
θ target,i (t+1)←τθ main,i (t)+(1-τ)θ target,i (t)
where τ controls the degree of the soft update.
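The soft (Polyak) target update above can be sketched elementwise over the network parameters; the list-of-scalars representation is illustrative.

```python
def soft_update(target, online, tau):
    """Polyak update: theta_target <- tau * theta_main + (1 - tau) * theta_target,
    applied elementwise to parameter lists."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(online, target)]
```

With the experiment's value τ = 0.01, the target network drifts toward the online network by 1% per update, which stabilizes the Bellman targets.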
(3) Actor network: the task of the Actor network is to improve the policy based on the value estimates produced by the Critic networks. Its input is the local observation information collected by each vehicle agent in the environment, and its output is the action probability over the action space, denoted $\pi_\phi(\cdot|s)$. The update of the Actor network can be viewed as exploring the policy by maximizing the system reward, with the soft state-action value function guiding the action decision; the Actor loss is therefore defined as

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\pi_\phi(\cdot|s_t)^{\mathsf T}\big[\alpha\log\pi_\phi(\cdot|s_t) - Q_{\mathrm{soft}}(s_t,\cdot)\big]\Big].$$

Similarly, the Actor network updates its parameters $\phi$ along the gradient direction that minimizes this loss:

$$\phi \leftarrow \phi - \eta_a \nabla_\phi J_\pi(\phi)$$

where $\eta_a$ denotes the learning rate of the Actor network and $\nabla$ the gradient of the loss function;
(4) Self-adjusting entropy coefficient: the entropy coefficient $\alpha$ acts as a weight whose value controls the randomness of the action policy. The larger $\alpha$ is, the more dispersed the output action probabilities become when the algorithm makes a decision, so the vehicle agent explores the environment more thoroughly and tries more actions. Since the reward fed back by the system keeps changing during training in the Internet of Vehicles environment, relying on a fixed a-priori $\alpha$ leads to unstable training and degrades the convergence performance of the system. This section therefore adopts an adaptive entropy coefficient: when the vehicle agent explores a new environment at the beginning of training, $\alpha$ is increased so that the agent explores as fully as possible; as the number of training rounds grows and the optimal actions are essentially determined, $\alpha$ is decreased to optimize the long-term reward trade-off of the system as much as possible. The loss function of the entropy coefficient is defined as

$$J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\big[-\alpha\big(\log\pi_t(a_t|s_t) + \bar{\mathcal{H}}\big)\big]$$

where $\bar{\mathcal{H}}$ denotes the target entropy.
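A sketch of the standard discrete-SAC temperature loss and its gradient with respect to $\alpha$, assuming the conventional form $J(\alpha)=\mathbb{E}_{a\sim\pi}[-\alpha(\log\pi(a|s)+\bar{\mathcal H})]$; names and the closed-form gradient are illustrative.

```python
import numpy as np

def alpha_loss_and_grad(log_probs, alpha, target_entropy):
    """Temperature loss J(alpha) = E_{a~pi}[-alpha * (log pi(a|s) + H_bar)]
    for a discrete policy, plus dJ/dalpha. log_probs spans the action space."""
    probs = np.exp(log_probs)
    expected = float(probs @ (log_probs + target_entropy))
    loss = -alpha * expected
    grad = -expected  # gradient w.r.t. alpha; used for gradient descent on alpha
    return loss, grad
```

When the policy entropy equals the target entropy the gradient is zero, so $\alpha$ stops moving; a too-deterministic policy makes the gradient push $\alpha$ up, restoring exploration.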
As a specific implementation, the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation; the aggregated parameters are broadcast back to the vehicle users within a training round for local training, specifically:
First, each vehicle agent trains a DRL model in a distributed manner from its local observation information. Second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center for policy sharing. Finally, the federally aggregated global model parameters are broadcast to the local agents for the next training iteration. This communication scheme only uploads neural network parameters and downloads policies, greatly reducing the communication load; moreover, since a user's local information cannot be directly recovered from the neural network parameters, the privacy of the system is protected. In each training round, the weight parameters of each vehicle agent's Actor network are updated as:
φ(t+1)=ξφ i (t)+(1-ξ)H i (t)
where $H_i(t)=\sum_{k\neq i}\phi_k(t)$ and $\xi$ is a weight coefficient.
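The aggregation rule above can be sketched directly; note that, as written in the patent, the second term is an unnormalized sum over the other agents' parameters, and the function name is illustrative.

```python
import numpy as np

def federated_update(phis, i, xi):
    """Aggregated actor weights for agent i:
    phi(t+1) = xi * phi_i(t) + (1 - xi) * sum_{k != i} phi_k(t)."""
    phis = [np.asarray(p, dtype=float) for p in phis]
    others = sum(p for k, p in enumerate(phis) if k != i)  # H_i(t)
    return xi * phis[i] + (1.0 - xi) * others
```

In a standard FedAvg variant the sum would be divided by the number of contributing agents; here the patent's formula is reproduced as stated.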
The invention is further described in detail below with reference to the accompanying drawings and specific examples.
Examples
This embodiment provides an Internet of Vehicles edge content caching decision method based on federated reinforcement learning, described in detail below:
1. Establishing a car networking system model:
The simulation environment is set according to the Manhattan grid model in the 3GPP TR 36.885 standard. The city map is 500×500 m, the number of vehicles is 4, and the number of road side units is 36. The maximum observation range of a vehicle user is 200 m, and the maximum number of associated road side units is 4. The number of contents a vehicle user can request is 20, with content sizes in [20, 50] Mbit. The plateau factor α of the global popularity of requested content is -0.88 and the skewness factor β is 0.35. At the beginning of training, vehicle positions are initialized; when a vehicle user reaches an intersection, the subsequent driving direction is chosen with equal probability, i.e. 0.25 per direction. The speed of each vehicle follows the Gauss-Markov mobility model, with a given asymptotic mean initial speed, standard deviation σ_i = 0.1 and parameter η_i set to 0.1. The path loss model follows the slow-fading formulation described above. In addition, the duration of a unit time slot is set to 0.1 s, the cloud download delay to 1 s, and the delay constraint to 1.5 s.
2. Establishing a federal reinforcement learning algorithm framework:
The federated reinforcement learning algorithm combines the federated averaging algorithm with the soft actor-critic algorithm. In the soft actor-critic framework, each Critic network is fitted with a fully connected neural network with 2 hidden layers, where the number of hidden-layer neurons is 64 and the activation function is the rectified linear unit f(x) = max(0, x); the Actor networks are fitted with the same fully connected architecture. A virtual cloud server is constructed as the aggregation center, performing federated aggregation of the local Actor network weight parameters.
3. Training phase of algorithm:
First, each vehicle agent trains its edge caching strategy using the soft actor-critic algorithm. For vehicle agent i, the local state input at time slot t comprises the data size of the agent's requested content, the global popularity of the file, the distance between the agent and road side unit j, the flag bit indicating whether road side unit j has cached the requested content f, the size of the remaining cache space, and the number of time slots remaining before the cache space is refreshed. Second, the vehicle agent decides its action based on this local observation state, namely the edge association decision variable and the edge caching decision variable.
After the state is input, the Actor network predicts an action; after interacting with the Internet of Vehicles environment, the vehicle agent obtains the global reward fed back by the system and transitions to the next state. When enough sample data has accumulated, the vehicle agent updates the Actor and Critic networks by gradient descent. At the end of each round, each vehicle agent uploads the weight parameters of its local Actor network to the cloud center for federated averaging, and the aggregated global parameters are broadcast back to the vehicle agents for the next round of local training. 3000 training episodes and 100 test episodes are used. During training, the Critic network learning rate η_c is set to 10e-4, the Actor network learning rate η_a to 10e-4, the discount factor γ to 0.9, the soft update degree to 0.01, the experience replay pool size to 5000, and the minibatch size to 100.
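The experience replay pool used in the training loop can be sketched as a bounded FIFO store with uniform minibatch sampling; class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay pool: stores (s, a, r, s') transitions
    and draws uniform random minibatches for training."""
    def __init__(self, capacity=5000):
        self.pool = deque(maxlen=capacity)  # oldest samples evicted first

    def add(self, transition):
        self.pool.append(transition)

    def sample(self, batch_size=100):
        # Sample without replacement; cap at current pool size.
        return random.sample(list(self.pool), min(batch_size, len(self.pool)))
```

The capacity 5000 and batch size 100 mirror the experiment settings above.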
The present invention compares the proposed solution with the following reference solution:
(1) Edge caching scheme based on the independent soft actor-critic algorithm (ISAC): each vehicle user, acting as an agent, trains its own policy with the SAC algorithm; that is, each agent trains independent Critic and Actor networks and makes decisions in a distributed manner from its own local observations.
(2) Edge caching scheme based on the independent deep Q-network algorithm (Independent Deep Q Network, IDQN): each vehicle user, acting as an agent, trains its own policy with the DQN algorithm; that is, each agent trains an independent Q network and makes decisions in a distributed manner from its own local observations.
(3) Edge caching scheme based on federated DQN (Federated DQN, FedDQN): each vehicle user, acting as an agent, federally trains its own policy with a DQN algorithm that shares the local Q-network weights: the Q-network weights are uploaded to the cloud center for federated averaging, after which each vehicle user downloads the aggregated global parameters and makes decisions in a distributed manner from its own local observations.
As shown in fig. 2, comparing the proposed scheme with the reference schemes, the ISAC-based scheme clearly converges fastest, the proposed scheme is second, and the FedDQN- and IDQN-based schemes converge slowest. The underlying cause is that the proposed scheme and the ISAC-based scheme maximize an entropy objective, which enhances the algorithm's exploration capability and thus promotes model convergence and better convergence performance; the IDQN- and FedDQN-based schemes explore with a random policy and therefore struggle to converge and to balance sample exploration and exploitation when handling the high-dimensional action space of the Internet of Vehicles environment. Furthermore, compared with the ISAC-based scheme, the proposed scheme achieves a better system average trade-off while protecting privacy, for two reasons: first, by sharing the Actor network weight parameters of the local DRL models, each agent shares its experience of interacting with the environment, which promotes better decision policies for every agent; second, the cloud center collects only weight parameters and cannot obtain the local observation information used in training, which reduces communication overhead and protects the privacy of vehicle users.
As shown in fig. 3, the average transmission delay per vehicle user decreases as the number of training rounds increases, which demonstrates the effectiveness of each scheme. Compared with the reference schemes, the proposed scheme clearly outperforms the IDQN- and FedDQN-based schemes in optimizing the average transmission delay, and slightly outperforms the ISAC-based scheme. Two aspects explain this: first, all four schemes optimize the average transmission delay well, so after the utility function is normalized their performance is difficult to distinguish; second, because the dynamic Internet of Vehicles environment is complex and changeable, the convergence results fluctuate, making clear performance advantages difficult to exhibit. As shown in fig. 4, the edge caching overhead of every scheme decreases as the number of training rounds increases; the proposed scheme's edge caching overhead is the smallest, the ISAC-based scheme is second, and the IDQN- and FedDQN-based schemes perform worst. Combining fig. 3 and fig. 4, the proposed scheme greatly reduces the edge caching overhead while still optimizing the transmission delay of each vehicle user.
Fig. 5 shows how the maximum association number parameter used to construct edge node clusters in the Internet of Vehicles affects the average trade-off performance per vehicle user. Specifically, as the number of associated road side units increases, the convergence result of the proposed scheme also improves. The reason is that a user-centric edge node cluster seamlessly adapts to dynamic fluctuations of the network topology, and as the number of road side units composing the cluster grows, its joint transmission technique greatly improves the data transmission rate. However, once the associated road side units reach a certain number, the transmission-rate gain of the edge node cluster saturates, i.e. the average trade-off convergence result per vehicle user at $S_{\max}=4$ differs little from that at $S_{\max}=3$.
In summary, targeting the unknown, highly dynamic topology and channel-state characteristics of the Internet of Vehicles, the invention combines the privacy-protection advantage of federated learning with a user-centric framework and proposes an Internet of Vehicles edge caching scheme based on a federated soft actor-critic algorithm. The scheme achieves collaborative training and decision-making in a multi-agent environment without revealing the vehicle users' local training data, thereby jointly optimizing transmission delay and caching cost while ensuring strong privacy, and it outperforms the benchmark schemes in per-vehicle-user average trade-off and convergence performance.
Claims (8)
1. The internet of vehicles edge content caching decision-making method based on federal reinforcement learning is characterized by comprising the following specific steps:
step 1, inputting a vehicle networking environment, initializing parameters of an own actor network and a critic network by each vehicle agent, and modeling an optimization problem;
step 2, in the current time slot, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as the distance between the vehicle agent and the road side unit, the cache state of the road side unit, the residual cache capacity of the road side unit and the like;
Step 3, according to the local observation information, each vehicle agent can independently decide the associated road side unit in the edge node cluster and decide whether to cache the request content of the current time slot in the cluster;
step 4, after the action decision is executed, each vehicle agent acquires the trade-off rewards of the total content delivery delay and the edge cache overhead of the system fed back by the vehicle networking environment, and all sample data are cached to an experience multiplexing pool;
step 5, judging whether the number of samples is enough, if so, entering a step 6, otherwise, entering a step 7;
step 6, when the number of samples is sufficient, each vehicle agent updates its own Actor network and Critic network parameters according to the soft Actor-Critic algorithm;
step 7, collecting the weight parameters of the Actor network of each vehicle intelligent agent by the aggregation center, and performing federal aggregation, wherein the aggregated parameters are broadcasted to a vehicle user in one training round to perform local training;
step 8, judging whether the current training round is finished, if not, returning to the step 2 to start the training of the next round, and if so, entering the step 9;
step 9, judging whether convergence is achieved, if not, resetting the environment of the Internet of vehicles, and returning to the step 1; and if yes, finishing training and finishing the decision of caching the edge content of the Internet of vehicles.
Compared with the prior art, the invention has notable advantages: (1) To address the link congestion load caused by the network-centric architecture of Internet of Vehicles caching scenarios, the invention designs user-centric edge node clusters that exploit the dense deployment of edge nodes, so as to realize high capacity, low caching overhead and seamless vehicular communication coverage; (2) To address the heavy information exchange and privacy leakage caused by centralized training of intelligent algorithms, the invention exploits the privacy-protection advantage of federated learning and realizes collaborative decision-making among vehicle users by sharing local model neural network weights, so as to reduce the system's long-term content transmission delay and edge caching overhead.
2. The internet of vehicles edge cache decision method based on federal multi-agent reinforcement learning according to claim 1, wherein the inputting the internet of vehicles environment in step 1 specifically comprises:
(1) Time slot model: the continuous training time is discretized into multiple time slots, denoted $\mathcal{T}=\{1,2,\dots,T\}$, where each time slot has duration τ; channel state information and system parameters remain unchanged within a single time slot but may vary randomly between different time slots;
(2) Network model: the Internet of Vehicles model is established as a Manhattan grid model, with road side units capable of providing communication services uniformly distributed along both sides of the roads. Each road side unit serves as an edge node with dedicated communication resources and limited local storage resources, and is connected to an edge server through a high-speed wired link. The edge server is controlled by a software-defined centralized controller and can perform edge association, cache resource allocation and the like. Let $\mathcal{J}$ denote the set of road side units and $\mathcal{I}$ the set of vehicles. All vehicle users can travel forward, backward, left and right along the road, and each direction has multiple lanes to ensure traffic flow;
(3) Vehicle movement model: the speed of a vehicle follows a Gauss-Markov random process. Specifically, when vehicle user i travels from an initial speed, its speed at time slot t can be expressed as the weighted sum of its speed at time slot t-1, the asymptotic mean speed, and a random variable:

$$v_i(t) = \eta_i v_i(t-1) + (1-\eta_i)\bar{v}_i + \sigma_i\sqrt{1-\eta_i^2}\, z$$

where $\bar{v}_i$ and $\sigma_i$ are the corresponding asymptotic mean and standard deviation of vehicle user i's speed; the parameter $\eta_i\in[0,1]$ represents the memory depth of the previous slot's speed and determines the time dependence of vehicle user i's movement; and z follows an uncorrelated zero-mean, unit-variance standard normal distribution. Notably, the closer $\eta_i$ is to 1, the more the current slot speed of vehicle user i depends on the speed of the previous slot;
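One step of the Gauss-Markov speed update above can be sketched as follows; the reconstructed update form and the function name are assumptions consistent with the standard model.

```python
import math
import random

def gauss_markov_speed(v_prev, v_mean, sigma, eta, z=None):
    """One Gauss-Markov mobility update:
    v_t = eta*v_{t-1} + (1-eta)*v_mean + sigma*sqrt(1-eta^2)*z, z ~ N(0,1).
    Pass z explicitly for reproducibility."""
    if z is None:
        z = random.gauss(0.0, 1.0)
    return eta * v_prev + (1.0 - eta) * v_mean + sigma * math.sqrt(1.0 - eta * eta) * z
```

At η = 1 the speed is fully memory-dependent (constant given z's coefficient vanishing), while at η = 0 it is memoryless around the asymptotic mean.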
(4) Content request model: let $\mathcal{F}$ denote the set of all contents that vehicles may request, $s_f$ the size of each requested content, and $g_f$ the characteristic value used to distinguish different requested contents. It is assumed that vehicle user i can generate only one content request f in time slot t, expressed as $q_{i,f}^t\in\{0,1\}$,
where $q_{i,f}^t=1$ indicates that vehicle user i requests content f in time slot t, and $q_{i,f}^t=0$ otherwise. Considering that, besides globally popular content, a vehicle user may prefer files similar to content requested in previous time slots, it is assumed that the vehicle user requests according to global popularity with probability ε and according to local personal preference with probability 1-ε, where the two are defined as follows:
① Global popularity: let $p_f$ denote the global popularity with which each vehicle user requests file f, which follows the Mandelbrot-Zipf distribution, i.e.

$$p_f = \frac{(I_f + \alpha)^{-\beta}}{\sum_{f'\in\mathcal{F}} (I_{f'} + \alpha)^{-\beta}}$$

where $I_f$ denotes the rank of content f in descending order of popularity; α and β denote the plateau factor and the skewness factor, respectively;
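The Mandelbrot-Zipf popularity can be sketched numerically; the exact normalized form is an assumption reconstructed from the plateau and skewness factors named in the text.

```python
def mzipf_popularity(num_contents, alpha, beta):
    """Mandelbrot-Zipf probabilities p_f = (I_f + alpha)^(-beta) / Z,
    with I_f the 1-based descending popularity rank of content f."""
    weights = [(rank + alpha) ** (-beta) for rank in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

With the experiment's parameters (20 contents, α = -0.88, β = 0.35), the distribution is heavily skewed toward the top-ranked content.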
② Local personal preference: in this case each vehicle user requests a file based on its similarity to previously requested content; for example, when the vehicle user requests content f in time slot t, the content with the highest similarity to f will be requested in time slot t+1. Cosine similarity is adopted to measure the similarity between contents f and f*:

$$\mathrm{sim}(f, f^*) = \frac{g_f \cdot g_{f^*}}{\|g_f\|\,\|g_{f^*}\|}$$
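Treating the content characteristic values as feature vectors (an assumption, since the original rendering of the formula was lost), cosine similarity can be sketched as:

```python
import math

def cosine_similarity(g1, g2):
    """Cosine similarity between two content characteristic vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)
```

Identical feature directions yield similarity 1, orthogonal features yield 0.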
(5) User-centric edge caching model: to improve the transmission rate of the vehicle users' requested content, a user-centric network architecture is adopted to design the Internet of Vehicles edge content caching framework. Specifically, each vehicle user selects one or more neighboring road side units, based on the observed information of road side units within its sensing range, to construct a user-centric edge node cluster. Let the maximum number of road side units observable by vehicle user i at time slot t be $O_{\max}$, and the maximum number it can associate with be $S_{\max}\le O_{\max}$. Let $\Phi_i^t$ denote the edge node cluster serving vehicle user i at time slot t; $|\Phi_i^t|$ then denotes the number of road side units in the cluster at that moment. The edge association decision of vehicle user i at time slot t is expressed as

$$x_{i,j}^t \in \{0,1\}$$

where $x_{i,j}^t=1$ indicates that road side unit j belongs to the edge node cluster of vehicle user i at time slot t, and $x_{i,j}^t=0$ otherwise. Accordingly, the user-centric edge node cluster can be denoted $\Phi_i^t=\{j : x_{i,j}^t=1\}$.
Both the road side units and the cloud center are equipped with cache capacity: the cloud center's capacity can handle the cache requests of all vehicle users, while each road side unit has a limited cache capacity C. Under the user-centric edge content caching framework, a vehicle user may request that part of its files be cached in the associated edge node cluster, thus providing a cached-content service; the vehicle user therefore further decides whether to cache the requested content of the current time slot at each road side unit in the edge node cluster. The edge caching decision variable of vehicle user i at time slot t is expressed as

$$c_{i,j}^t \in \{0,1\}$$

where $c_{i,j}^t=1$ indicates that vehicle user i caches the requested content of time slot t at road side unit j, and $c_{i,j}^t=0$ otherwise.

(6) Wireless transmission model: assume that mutual interference between communication links has been eliminated by allocating orthogonal resource blocks, and that both the road side units and the vehicle users are equipped with single antennas. The channel power gain comprises fast fading and slow fading, where the fast fading part is mainly caused by the multipath effect, i.e. Rayleigh fading, and the slow fading part consists of path loss and shadow fading. The channel gain between vehicle user i and road side unit j at time slot t is expressed as

$$h_{i,j}^t = g_{i,j}^t \sqrt{\beta_{i,j}^t}$$

where $g_{i,j}^t$ is the corresponding fast fading component, following a zero-mean, unit-variance complex Gaussian distribution, i.e. $g_{i,j}^t \sim \mathcal{CN}(0,1)$, while the slow fading component between vehicle user i and road side unit j is expressed as

$$\beta_{i,j}^t = A_i \beta_i \big(d_{i,j}^t\big)^{-\eta}$$

where $A_i$ is a constant of the path loss; $\beta_i$ is a log-normal shadow-fading component with standard deviation ζ; $d_{i,j}^t$ denotes the distance between vehicle user i and road side unit j at time slot t; η denotes the attenuation coefficient of the path-loss component; and $(d_{i,j}^t)^{-\eta}$ represents the path loss of the channel between vehicle user i and road side unit j at time slot t;
Let the transmit power of road side unit j in the edge node cluster $\Phi_i^t$ at time slot t be $p_j$. The achievable downlink transmission rate of vehicle user i is expressed as

$$R_i^t = B \log_2\!\Big(1 + \frac{\sum_{j\in\Phi_i^t} p_j \big|h_{i,j}^t\big|^2}{\sigma^2}\Big)$$

where B is the channel bandwidth and $\sigma^2$ the power of the additive white Gaussian noise;
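A numerical sketch of the achievable rate; the summation of received powers over cluster members reflects the joint transmission described later in the text and is an assumption, as are the parameter names.

```python
import math

def downlink_rate(bandwidth_hz, tx_powers, channel_gains, noise_power):
    """Achievable downlink rate with joint transmission from the cluster:
    R = B * log2(1 + sum_j p_j * |h_j|^2 / sigma^2)."""
    snr = sum(p * abs(h) ** 2 for p, h in zip(tx_powers, channel_gains)) / noise_power
    return bandwidth_hz * math.log2(1.0 + snr)
```

Adding road side units to the cluster increases the received signal power and hence the rate, matching the saturation behavior discussed for fig. 5.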
(7) Delay model: let the delay tolerance of each vehicle user's requested content be $\tau^{\max}$. In addition, to make full use of edge cache resources, the edge server automatically refreshes its storage space every period Δ, leaving a short service vacuum in which cached content is replaced. Therefore, if vehicle user i needs to download cached content from the edge server at time slot t, its content delivery must be completed within both the tolerable delay $\tau^{\max}$ and the current cache-replacement refresh period, i.e. within $\min\{\tau^{\max},\,(n+1)\Delta - t\}$ time slots, where $(n+1)\Delta - t$ represents the number of time slots remaining in the corresponding cache refresh period;
Considering that a vehicle user's content delivery delay depends on whether the requested content has been cached at the edge server in advance, the caching state of road side unit j at time slot t is defined as $z_{j,f}^t\in\{0,1\}$,
where $z_{j,f}^t=1$ indicates that content f is cached at road side unit j in time slot t, and $z_{j,f}^t=0$ otherwise. Delivery of a vehicle user's requested content thus covers two cases, edge caching and cloud download:
① Edge caching: when any edge node in the cluster $\Phi_i^t$ has cached the requested content f, vehicle user i can obtain the requested content directly from the edge, with delay

$$d_{i,f}^{\mathrm{edge}} = \frac{s_f}{R_i^t}.$$

② Cloud download: when none of the edge nodes in the cluster $\Phi_i^t$ has cached the requested content f, the edge server must download it from the cloud center, incurring an extra fixed delay $d^{\mathrm{cloud}}$.

Based on the above model, the total delay for vehicle user i to obtain the requested content f can be expressed as

$$d_{i,f}^t = d_{i,f}^{\mathrm{edge}} + \mathbb{1}\Big\{\textstyle\sum_{j\in\Phi_i^t} z_{j,f}^t = 0\Big\}\, d^{\mathrm{cloud}}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, set to 1 when the condition holds and 0 otherwise.
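The two delivery cases can be sketched together; parameter names are illustrative, and the 1 s cloud delay default mirrors the experiment settings.

```python
def delivery_delay(content_bits, rate_bps, cached_in_cluster, cloud_delay=1.0):
    """Total delay to fetch requested content: transmission time from the
    edge, plus a fixed cloud-download delay when no cluster node caches it."""
    delay = content_bits / rate_bps
    if not cached_in_cluster:
        delay += cloud_delay  # indicator term: cloud fetch needed
    return delay
```

A cache hit in the cluster saves exactly the fixed cloud delay, which is what the caching decision trades against storage overhead.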
3. The internet of vehicles edge content caching decision method based on federal multi-agent reinforcement learning according to claim 2, wherein the modeling of the optimization problem in step 1 is specifically:
A user-centric intelligent edge caching method is designed with the optimization objective of minimizing the long-term trade-off between content delivery delay and edge caching overhead. To this end, the normalized content delivery delay utility and the normalized edge caching overhead utility are first designed;
the optimization problem is then expressed as minimizing the long-term average weighted sum of the two utilities, subject to the following constraints:
where $\omega_1$ and $\omega_2$ denote the weights of the content delivery delay utility and the caching overhead utility, respectively; $C_1$ requires that the edge node cluster size of each vehicle user in any time slot not exceed $S_{\max}$; $C_2$ requires that each road side unit serve at most a single vehicle user in any time slot; $C_3$ requires that the content cached by each road side unit not exceed its local cache capacity; $C_4$ requires that the requested-content delivery delay of each vehicle user in any time slot not exceed the maximum allowable delay.
4. The internet of vehicles edge content caching decision-making method based on federal multi-agent reinforcement learning according to claim 3, wherein in step 2, each vehicle agent interacts with a road side unit in an observation range to obtain observation information such as a distance between the vehicle agent and the road side unit, a cache state of the road side unit, and a remaining cache capacity of the road side unit, and the method is specifically as follows:
In time slot t, each vehicle user acts as an agent and obtains its own observation state by interacting with the environment. The observation state of vehicle i at time slot t is expressed as

$$o_i^t = \big\{s_f^t,\; p_f,\; d_{i,j}^t,\; z_{j,f}^t,\; C_j^t,\; \delta_j^t\big\}$$

where $d_{i,j}^t$ represents the distance between vehicle user i and road side unit j at time slot t; $z_{j,f}^t$ is the flag bit indicating whether road side unit j has cached the requested content f at time slot t; $C_j^t$ represents the remaining cache space of road side unit j at time slot t; and $\delta_j^t$ denotes the number of time slots remaining before road side unit j refreshes its cache space.
5. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 4, wherein in step 3, according to the local observation information, each vehicle agent can independently decide which road side units in the edge node cluster to associate with and whether to cache the requested content of the current time slot within the cluster, specifically:
the action taken by each vehicle agent when interacting with the environment comprises the edge association decision variable and the edge caching decision variable; the action of vehicle i at time slot t is expressed as $a_i^t = \{x_i^t,\, c_i^t\}$.
6. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 5, wherein after the action decision is executed in step 4, each vehicle agent obtains the trade-off reward, fed back by the Internet of Vehicles environment, between the system's total content delivery delay and its edge caching overhead, specifically:
once all vehicle agents have executed their actions, the environment feeds back a global reward, defined as the average trade-off per user and expressed as:
7. The internet of vehicles edge content caching decision-making method based on federal multi-agent reinforcement learning according to claim 6, wherein in step 6, each vehicle agent updates its own Actor network and Critic network parameters according to a flexible Actor-critique algorithm, specifically:
(1) Flexibility value function: the flexible actor-critic algorithm optimization objective is to meet entropy maximization in addition to maximizing the cumulative rewards value returned by training, i.e
where ρ_π denotes the trajectory distribution of policy π and α denotes the entropy coefficient. To ensure that the agent keeps exploring, an entropy term is introduced to randomize the policy. When the algorithm makes a decision, the output probabilities of the individual actions are thereby kept as dispersed as possible, which improves the vehicle agent's ability to learn the environment and lets it adaptively adjust its policy under the continuously changing channel conditions of the Internet of Vehicles, yielding more reasonable decisions. On this basis, the soft state-action value function is defined as follows:
where the expectation is taken over the discounted cumulative reward. Further, the soft state value function can be expressed as:
where action a follows the probability distribution π(·|s).
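The objective and soft value functions above appear as images in the original text. A hedged reconstruction following the standard soft actor-critic formulation (discount factor γ, entropy H, and reward r are assumed symbols, not the patent's exact rendering) would be:

```latex
% Maximum-entropy objective over the trajectory distribution rho_pi
J(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}\!\left[ \sum_{t} \gamma^{t}
    \bigl( r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \bigr) \right],
\qquad
\mathcal{H}(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s)

% Soft state-action value and soft state value functions
Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t)
    + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V_{\mathrm{soft}}(s_{t+1}) \right],
\qquad
V_{\mathrm{soft}}(s_t) = \mathbb{E}_{a \sim \pi}\!\left[
    Q_{\mathrm{soft}}(s_t, a) - \alpha \log \pi(a \mid s_t) \right]
```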
(2) Critic network: under the Actor-Critic framework of the algorithm, the whole training process alternates between policy evaluation and policy improvement, thereby maximizing the long-term tradeoff reward of the Internet of Vehicles system. To judge the quality of the action policy, two Critic networks are constructed, denoted online value networks 1 and 2, and a target value network is built for each of them. The inputs of these neural networks are the local observation information collected by the vehicle agents from the environment, and the outputs are the values corresponding to each action in the action space. Since a neural network is generally used to approximate the soft value function in this algorithm framework, the weight parameters of online value networks 1 and 2 and their corresponding target networks are defined as θ_main,1, θ_target,1, θ_main,2 and θ_target,2, respectively;
To improve training stability and avoid divergence of the reinforcement learning algorithm, an experience replay pool is used to weaken the correlation between samples. During training, each vehicle agent randomly draws batch-size sets of experience samples from its own replay pool. The soft state value function is then redefined as:
where π(·|s_t)^T denotes the transpose of the action probability distribution π(·|s_t) in state s_t;
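A minimal sketch of the experience replay pool described above (the class and method names are illustrative, not the patent's code), showing how uniform random mini-batch sampling breaks the correlation between consecutive transitions:

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool: stores (s, a, r, s') transition tuples and
    returns uniformly random mini-batches to decorrelate training samples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples evicted first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling over the whole pool breaks temporal correlation
        return random.sample(list(self.buffer), batch_size)

# each vehicle agent keeps its own pool
pool = ReplayPool(capacity=1000)
for t in range(100):
    pool.store(state=t, action=t % 4, reward=float(t), next_state=t + 1)
batch = pool.sample(32)
```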
The updates to value networks 1 and 2 in the Critic are approximated by constructing a Bellman mean-square-error loss function for each Q network, expressed as:
where i = 1, 2 indexes value networks 1 and 2, respectively. To achieve the maximum-entropy objective, the entropy must be included in the reward; the value function used to judge the behavior policy, obtained from the general Bellman backup formula, is:
In addition, to obtain the optimal approximation of Q_soft(s, a), the parameter θ_main,i is updated along the gradient direction so as to minimize the loss function, expressed as:
where η_c denotes the learning rate of the Critic network and ∇ denotes taking the gradient of the loss function;
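A minimal numerical sketch of the Critic update for a discrete action space (NumPy; `soft_state_value` and the example numbers are illustrative assumptions, not the patent's implementation): the soft Bellman target is formed from the target network's Q-values, then the mean-square error against each online Q network is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma, alpha = 4, 0.95, 0.2

def soft_state_value(q_values, policy_probs):
    """V_soft(s) = pi(.|s)^T (Q(s,.) - alpha * log pi(.|s)) for a discrete policy."""
    return policy_probs @ (q_values - alpha * np.log(policy_probs))

# one sampled transition (s, a, r, s') drawn from the replay pool
q_next_target = rng.normal(size=n_actions)     # target-network Q(s', .)
pi_next = np.full(n_actions, 1.0 / n_actions)  # pi(.|s'), uniform for illustration
r = 1.0

# soft Bellman target: y = r + gamma * V_soft(s')
y = r + gamma * soft_state_value(q_next_target, pi_next)

# Bellman mean-square-error loss for the chosen action a of each online Q network
q_sa = np.array([0.5, 0.7])                    # Q_main,1(s,a), Q_main,2(s,a)
loss = np.mean((q_sa - y) ** 2)
```

In practice the gradient of this loss with respect to θ_main,i is then followed with step size η_c, as in the update formula above.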
The target value networks do not actively participate in the learning process and cannot be updated independently, so a soft update is adopted: at fixed intervals they copy the latest parameters of the online value networks in small steps, expressed as:
θ_target,i(t+1) ← τθ_main,i(t) + (1-τ)θ_target,i(t)
Where τ represents the degree of soft update.
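The soft (Polyak) update above can be sketched in a few lines (NumPy; parameter vectors and τ value are illustrative):

```python
import numpy as np

def soft_update(theta_target, theta_main, tau):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target (Polyak averaging)."""
    return tau * theta_main + (1.0 - tau) * theta_target

theta_main = np.array([1.0, 2.0, 3.0])
theta_target = np.zeros(3)
theta_target = soft_update(theta_target, theta_main, tau=0.1)
```

A small τ makes the target network track the online network slowly, which stabilizes the bootstrapped Bellman targets.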
(3) Actor network: the task of the Actor network is to seek policy improvement based on the value estimates produced by the neural networks. Its input is the local observation information that each vehicle agent collects from the environment, and its output is the action probability over the action dimensions, denoted π_θ(·|s). The update process of the Actor network can be expressed as exploring the policy by maximizing the system reward, with the soft state-action value function guiding the action decision; that is, the loss function of the Actor network is defined as:
Similarly, the Actor network updates the parameter φ along the gradient direction so as to minimize the loss function, expressed as:
where η_a denotes the learning rate of the Actor network and ∇ denotes taking the gradient of the loss function;
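The Actor loss itself is an image in the original; a common discrete-action soft actor-critic form, which this claim appears to follow, is J_π = E_s[π(·|s)^T (α log π(·|s) − Q(s,·))]. A sketch under that assumption (the numbers are illustrative):

```python
import numpy as np

alpha = 0.2  # entropy coefficient

def actor_loss(policy_probs, q_values):
    """Discrete-action SAC actor loss: pi(.|s)^T (alpha * log pi(.|s) - Q(s,.)).
    Minimizing it shifts probability mass toward high soft-value actions
    while the entropy term keeps the policy from collapsing too early."""
    return policy_probs @ (alpha * np.log(policy_probs) - q_values)

q = np.array([1.0, 0.0, 0.0, 0.0])          # action 0 has the highest value
uniform = np.full(4, 0.25)
greedy_ish = np.array([0.7, 0.1, 0.1, 0.1])

# concentrating probability on the high-Q action lowers the loss
assert actor_loss(greedy_ish, q) < actor_loss(uniform, q)
```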
(4) Self-adjusting entropy coefficient: the entropy coefficient α acts as a weight whose value controls the randomness of the action policy. The larger α is, the more dispersed the output action probabilities are when the algorithm makes a decision, so the vehicle agent explores the environment more thoroughly and tries more actions. Since the reward fed back by the system keeps changing during training in the Internet of Vehicles environment, relying on a fixed a priori α can make training unstable and degrade the convergence performance of the system. In view of this, an adaptive entropy coefficient is adopted: when the vehicle agent is exploring a new environment at the start of training, α is increased so that the agent explores as fully as possible; as the number of training rounds grows and the optimal actions are essentially determined, α is decreased so as to optimize the long-term tradeoff reward of the system as far as possible. The loss function of the entropy coefficient is defined as:
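The loss function itself is an image in the original; the standard soft actor-critic self-tuning form (a reconstruction, with an assumed target entropy H̄) reads:

```latex
% Entropy-coefficient loss: alpha grows when policy entropy falls below the
% target entropy \bar{\mathcal{H}}, and shrinks when it exceeds it
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\!\left[
    -\alpha \left( \log \pi_t(a_t \mid s_t) + \bar{\mathcal{H}} \right) \right]
```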
8. The Internet of Vehicles edge content caching decision method based on federated multi-agent reinforcement learning according to claim 7, wherein in step 7, the aggregation center collects the Actor network weight parameters of each vehicle agent and performs federated aggregation, and the aggregated parameters are broadcast back to the vehicle users for local training in the next training round, specifically:
First, each vehicle agent trains a DRL model in a distributed manner according to its local observation information. Second, the trained Actor network weight parameters of the local DRL model are uploaded to the cloud center so that policies can be shared. Finally, the federally aggregated global model parameters are broadcast back to the local agents for the next training iteration. Since this communication pattern only uploads neural network parameters and downloads policies, the communication load is greatly reduced; moreover, because a user's local information cannot be directly recovered from the neural network parameters, the privacy of the system is protected. In each training round, the update formula of the vehicle agent's Actor network weight parameters is:
φ(t+1) = ξφ_i(t) + (1-ξ)H_i(t)
where H_i(t) = Σ_{k≠i} φ_k(t) and ξ is a weight coefficient.
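The aggregation rule above can be sketched as follows (NumPy; the function name, the value of ξ, and the toy weight vectors are illustrative assumptions):

```python
import numpy as np

xi = 0.5  # weight coefficient xi from the claim (illustrative value)

def federated_update(phis, i):
    """phi_i(t+1) = xi * phi_i(t) + (1 - xi) * H_i(t),
    where H_i(t) = sum over k != i of phi_k(t)."""
    h_i = sum(phi for k, phi in enumerate(phis) if k != i)
    return xi * phis[i] + (1.0 - xi) * h_i

# three vehicle agents' Actor weight vectors uploaded to the aggregation center
phis = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
new_phi_0 = federated_update(phis, 0)
```

Each agent's new parameters thus blend its own weights with the sum of the other agents' weights, sharing policy knowledge without exchanging raw local observations.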
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211708649.XA CN116156455A (en) | 2022-12-29 | 2022-12-29 | Internet of vehicles edge content caching decision method based on federal reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116156455A true CN116156455A (en) | 2023-05-23 |
Family
ID=86361158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211708649.XA Pending CN116156455A (en) | 2022-12-29 | 2022-12-29 | Internet of vehicles edge content caching decision method based on federal reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116156455A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116582840A (en) * | 2023-07-13 | 2023-08-11 | 江南大学 | Level distribution method and device for Internet of vehicles communication, storage medium and electronic equipment |
CN116911480A (en) * | 2023-07-25 | 2023-10-20 | 北京交通大学 | Path prediction method and system based on trust sharing mechanism in Internet of vehicles scene |
CN116709359A (en) * | 2023-08-01 | 2023-09-05 | 南京邮电大学 | Self-adaptive route joint prediction method for flight Ad Hoc network |
CN116709359B (en) * | 2023-08-01 | 2023-10-31 | 南京邮电大学 | Self-adaptive route joint prediction method for flight Ad Hoc network |
CN116761152A (en) * | 2023-08-14 | 2023-09-15 | 合肥工业大学 | Roadside unit edge cache placement and content delivery method |
CN116761152B (en) * | 2023-08-14 | 2023-11-03 | 合肥工业大学 | Roadside unit edge cache placement and content delivery method |
CN117118592A (en) * | 2023-10-25 | 2023-11-24 | 北京航空航天大学 | Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm |
CN117118592B (en) * | 2023-10-25 | 2024-01-09 | 北京航空航天大学 | Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm |
CN117793805A (en) * | 2024-02-27 | 2024-03-29 | 厦门宇树康信息技术有限公司 | Dynamic user random access mobile edge computing resource allocation method and system |
CN117793805B (en) * | 2024-02-27 | 2024-04-26 | 厦门宇树康信息技术有限公司 | Dynamic user random access mobile edge computing resource allocation method and system |
CN117939505A (en) * | 2024-03-22 | 2024-04-26 | 南京邮电大学 | Edge collaborative caching method and system based on excitation mechanism in vehicle edge network |
CN117939505B (en) * | 2024-03-22 | 2024-05-24 | 南京邮电大学 | Edge collaborative caching method and system based on excitation mechanism in vehicle edge network |
CN117979259A (en) * | 2024-04-01 | 2024-05-03 | 华东交通大学 | Asynchronous federation deep learning method and system for mobile edge collaborative caching |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116156455A (en) | Internet of vehicles edge content caching decision method based on federal reinforcement learning | |
Liu et al. | Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system | |
Luo et al. | Self-learning based computation offloading for internet of vehicles: Model and algorithm | |
Arkian et al. | A cluster-based vehicular cloud architecture with learning-based resource management | |
Zhang et al. | Deep reinforcement learning based IRS-assisted mobile edge computing under physical-layer security | |
CN111711666B (en) | Internet of vehicles cloud computing resource optimization method based on reinforcement learning | |
Ren et al. | Blockchain-based VEC network trust management: A DRL algorithm for vehicular service offloading and migration | |
Qin et al. | Collaborative edge computing and caching in vehicular networks | |
CN114827191B (en) | Dynamic task unloading method for fusing NOMA in vehicle-road cooperative system | |
CN112565377B (en) | Content grading optimization caching method for user service experience in Internet of vehicles | |
CN114973673B (en) | Task unloading method combining NOMA and content cache in vehicle-road cooperative system | |
Zheng et al. | Digital twin empowered heterogeneous network selection in vehicular networks with knowledge transfer | |
CN114449482A (en) | Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning | |
CN116390125A (en) | Industrial Internet of things cloud edge cooperative unloading and resource allocation method based on DDPG-D3QN | |
CN117042050A (en) | Multi-user intelligent data unloading method based on distributed hybrid heterogeneous decision | |
CN116321293A (en) | Edge computing unloading and resource allocation method based on multi-agent reinforcement learning | |
CN116321307A (en) | Bidirectional cache placement method based on deep reinforcement learning in non-cellular network | |
Hu et al. | Dynamic task offloading in MEC-enabled IoT networks: A hybrid DDPG-D3QN approach | |
Wang et al. | Joint spectrum access and power control in air-air communications-a deep reinforcement learning based approach | |
CN117354833A (en) | Cognitive Internet of things resource allocation method based on multi-agent reinforcement learning algorithm | |
Lyu et al. | Service-driven resource management in vehicular networks based on deep reinforcement learning | |
CN116137724A (en) | Task unloading and resource allocation method based on mobile edge calculation | |
Wang et al. | Deep Reinforcement Learning-Based Computation Offloading and Power Allocation within Dynamic Platoon Network | |
Gui et al. | Spectrum-Energy-Efficient Mode Selection and Resource Allocation for Heterogeneous V2X Networks: A Federated Multi-Agent Deep Reinforcement Learning Approach | |
CN114531685A (en) | Resource allocation method based on migration reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||