CN114980123A - Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning - Google Patents

Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Info

Publication number
CN114980123A
Authority
CN
China
Prior art keywords
vehicle
network
agent
federal
time slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210395450.XA
Other languages
Chinese (zh)
Inventor
包金鸣
林艳
陶奕宇
张一晋
邹骏
李骏
束锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210395450.XA
Publication of CN114980123A
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02 Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10 Dynamic resource partitioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/03 Protecting confidentiality, e.g. by encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30 Services specially adapted for particular environments, situations or purposes
    • H04W4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W52/00 Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/02 Power saving arrangements
    • H04W52/0203 Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W84/00 Network topologies
    • H04W84/02 Hierarchically pre-organised networks, e.g. paging networks, cellular networks, WLAN [Wireless Local Area Network] or WLL [Wireless Local Loop]
    • H04W84/10 Small scale networks; Flat hierarchical networks
    • H04W84/12 WLAN [Wireless Local Area Networks]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which specifically comprises the following steps: the Internet of Vehicles environment is input, the agents' local Q networks and the federal network parameters are initialized, and the optimization problem is modeled; the vehicle agents are divided into α-type and β-type according to whether they can obtain the reward, and in the current time slot the two types of vehicle agents each observe their local state and input it into their Q network; the Q network outputs are encrypted, and the joint action decision of the two types of vehicle agents is output through the federal network; the α-type vehicle agents then obtain the global reward fed back by the system, while the buffer pool stores the sample data of the current time slot; when there are enough samples, the α-type and β-type vehicle agents update the parameters of the local Q networks and the federal network respectively; after the current training round ends, the Internet of Vehicles environment is reset and the next round of training begins. The invention improves the connectivity of the Internet of Vehicles on the premise of privacy protection while reducing handover overhead and energy loss.

Description

Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning.
Background
In recent years, vehicles, as an indispensable means of human travel, have become important carriers for driving rapid economic development and improving people's quality of life. Internet of Vehicles technology, as a product of modern industry and the rapid development of mobile communication, can effectively handle the delay-sensitive services of road safety, traffic efficiency, automatic driving and real-time information interaction in intelligent transportation systems. However, with the explosive growth of the number of vehicle users, it is difficult for traditional single-access methods and communication resource allocation schemes to simultaneously satisfy diverse travel requirements. Since most Internet of Vehicles services occurring around a vehicle only require short-range communication in its vicinity and most tasks are time-sensitive, dedicated short-range communication technology is widely used to support high-speed transmission, mainly in vehicle-to-vehicle and vehicle-to-infrastructure communication (Jameel F, Wyne S, Javed M A, et al. Interference-aided vehicular networks: Future research opportunities and challenges [J]. IEEE Communications Magazine, 2018, 56(10): 36-42.). However, as a typical heterogeneous wireless network with high mobility, it is difficult for the fixed-coverage nodes of the Internet of Vehicles to provide continuous and stable service, which causes expensive handover overhead, may lead to transmission interruption, and can even result in inefficient handling of traffic safety services. Therefore, an efficient and fast user access scheme is essential for realizing a high-efficiency, high-performance Internet of Vehicles.
With the development of artificial intelligence, deep reinforcement learning (DRL) is regarded as an effective solution to complex sequential decision problems: it can learn the optimal policy by interacting directly with the environment, without prior knowledge of any system dynamics (e.g., wireless channel, vehicle position, etc.), avoiding the state transition matrices required when solving Markov decision problems with traditional methods. In recent years, researchers have tried to introduce DRL algorithms to solve the edge resource allocation problem. For example, Khan et al. decompose the task centrally processed by the cloud onto each roadside unit for local training and design a user access scheme using a distributed reinforcement learning framework, which minimizes the network coordination overhead while meeting the transmission rate requirement of each vehicle (Khan H, Elgabli A, Samarakoon S, et al.). However, although existing research designs various user access schemes for Internet of Vehicles scenarios and provides cooperative management strategies for communication and energy resources based on different service requirements, the training of DRL algorithms depends on a large amount of information interaction among vehicle users, which brings serious privacy leakage problems.
Disclosure of Invention
The invention aims to provide a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, so that the connectivity of the Internet of Vehicles is improved and the handover overhead and energy loss are reduced on the premise of privacy protection.
The technical solution for realizing the purpose of the invention is as follows: a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning comprises the following specific steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, dividing the agents into α-type vehicle agents and β-type vehicle agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenate these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going directly to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation (a high-level sketch of this training loop is given after these steps).
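For illustration only, a high-level sketch of this training loop is given below. The env, agent and buffer interfaces (reset, observe, decide, step, update, clear) are assumed names introduced to make the control flow of steps 1-8 concrete; they are not part of the patented method itself.

```python
def train(env, alpha_agents, beta_agents, federal_net, buffer,
          n_rounds=2000, min_samples=32):
    """Control-flow sketch of steps 1-8 of the proposed method."""
    for rnd in range(n_rounds):                                  # step 8: repeat until convergence
        env.reset()                                              # step 1: (re)input the IoV environment
        done = False
        while not done:                                          # step 7: loop over the time slots of one round
            obs = {a: a.observe(env) for a in alpha_agents + beta_agents}   # step 2: local observations
            joint_action = federal_net.decide(obs)               # step 3: encrypted Q outputs -> joint action
            reward, done = env.step(joint_action)                # step 4: only alpha-type agents see the reward
            buffer.push(obs, joint_action, reward)
            if len(buffer) >= min_samples:                       # steps 5-6: update once enough samples exist
                for agent in alpha_agents + beta_agents:
                    agent.update(buffer, federal_net)
                buffer.clear()                                   # empty the buffer pool after the update
```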
Compared with the prior art, the invention has the following remarkable advantages: (1) aiming at the massive information interaction caused by centralized processing in single-agent reinforcement learning algorithms, a multi-agent scenario in the Internet of Vehicles is considered, in which the vehicle users make cooperative decisions, so as to improve the connectivity of the Internet of Vehicles and reduce handover overhead and energy loss; (2) the output values of the agents' local Q networks are privacy-protected by the Gaussian differential method, so that the agents do not share the environmental state information they observe, which greatly improves the security of local data; (3) the federal multi-agent reinforcement learning scheme promotes model training by sharing encrypted training data, improving the performance indicators of the system.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the Federal multi-agent reinforcement learning-based vehicle networking edge resource allocation method of the invention.
Fig. 2 is a schematic model diagram of a car networking system according to an embodiment of the present invention.
FIG. 3 is a graph showing the convergence performance of the Internet of vehicles system according to the embodiment of the invention.
FIG. 4 is a graph showing the average performance of the vehicle users of the Internet of vehicles system as a function of the number of roadside units in the embodiment of the invention.
Fig. 5 is a graph of average performance of vehicle users of an internet of vehicles system as a function of a maximum threshold value of downlink transmission power in an embodiment of the present invention.
Detailed Description
The invention provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning. In each unit time slot t, every vehicle user acts as an agent: by observing the channel state between the vehicle and the roadside units and the position information of the roadside units, a neural network is trained to produce a joint action policy, so as to improve the transmission rate, reduce handover overhead and energy loss, and achieve a trade-off among the three. With reference to FIGS. 1-2, the method comprises the following steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, classifying the agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenate these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation.
As a specific implementation, the Internet of Vehicles environment input in step 1 specifically includes:
(1) Time slot model: the continuous training time is discretized into a number of time slots, denoted by the set 𝒯 = {1, 2, …, T}, where the channel state information and the system parameters remain constant within a single time slot but may vary randomly from slot to slot.
(2) Network model: an urban multi-lane expressway model is established, in which roadside units supporting edge communication are uniformly distributed on both sides of the road and are denoted by the set ℛ = {1, 2, …, R}; vehicles travel toward each other from the two ends of the road and improve their data transmission rate by establishing vehicle-to-infrastructure links, the set of vehicles being denoted by 𝒦 = {1, 2, …, K}.
(3) Vehicle movement model: the speed variation of the vehicle follows the Gauss-Markov random process

v(t) = ξ·v(t−1) + (1−ξ)·v̄ + ζ·√(1−ξ²)·z(t−1),

where v(t) denotes the speed of the vehicle at time slot t, v(t−1) its speed at time slot t−1, v̄ the approximate mean of the speed, ζ the approximate variance of the speed, ξ the degree of memory, and z an uncorrelated zero-mean, unit-variance Gaussian random process.
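For reference, a minimal numerical sketch of this speed update is given below, assuming the standard discrete-time Gauss-Markov form reconstructed above; the initial speed and the mean value used in the example are illustrative, not values taken from the patent.

```python
import numpy as np

def gauss_markov_speed(v_prev, v_mean, zeta, xi, rng):
    """One Gauss-Markov update of the vehicle speed: v(t) from v(t-1).

    v_mean : approximate (asymptotic) mean of the speed
    zeta   : approximate variance parameter of the speed
    xi     : degree of memory, in [0, 1]
    """
    z = rng.standard_normal()  # uncorrelated zero-mean, unit-variance Gaussian sample
    return xi * v_prev + (1.0 - xi) * v_mean + zeta * np.sqrt(1.0 - xi ** 2) * z

rng = np.random.default_rng(0)
v = 20.0                       # illustrative initial speed in m/s
for t in range(5):
    v = gauss_markov_speed(v, v_mean=20.0, zeta=0.1, xi=0.1, rng=rng)
    print(f"slot {t}: v = {v:.3f} m/s")
```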
(4) Handover model: suppose the vehicle can observe the information of the nearby O_max roadside units and adaptively selects the associated roadside unit. The association variables between vehicle k and all roadside units are defined as the vector x_k(t) = [x_{k,1}(t), …, x_{k,R}(t)], where x_{k,r}(t) = 1 indicates that roadside unit r is associated with vehicle k at time slot t, and x_{k,r}(t) = 0 otherwise.
consider that the rsu has limited coverage and a single timeslot can only serve one vehicle. When the road side unit associated with the adjacent time slot vehicle is changed, switching occurs, namely:
Figure BDA0003598707360000048
wherein H k (t) indicates the number of switches made by vehicle k between adjacent time slots, 1 {·} Indicating that 1 is set if the constraint is satisfied and, conversely, 0 is set.
(5) Power allocation model: the transmit power of the downlink roadside unit adopts a discretized level distribution, expressed as P levels within the range [P_min, P_max]. Let p_{k,r}(t) denote the downlink transmission power configured by the roadside unit r associated with vehicle k at time slot t; the power allocation variable of vehicle k for its configured roadside unit is the vector p_k(t) = [p_k^1(t), …, p_k^P(t)] over the P levels, where p_k^p(t) = 1 indicates that vehicle k selects level p as the power of the downlink roadside unit at time slot t, i.e. p_{k,r}(t) = p, and p_k^p(t) = 0 otherwise.
(6) Wireless communication model: it is assumed that the mutual interference between vehicle-to-infrastructure links has been eliminated in the system by an interference cancellation method, and that the channel power gain consists of small-scale fading (Rayleigh fading) and path loss. Let g_{k,r}(t) denote the channel gain between vehicle k and roadside unit r. Given the roadside unit r associated with vehicle user k at time slot t and its configured downlink power, according to the Shannon formula the achievable transmission rate of vehicle k is expressed as

R_k(t) = log₂(1 + p_{k,r}(t)·g_{k,r}(t)/σ²),

where σ² denotes the noise power. Further, the minimum data rate required by every vehicle in each time slot is assumed to be identical and fixed, denoted by R_min.
As a specific implementation, the modeling of the optimization problem in step 1 is specifically: considering the joint optimization of edge access and power allocation, the optimization problem is constructed with the objective of maximizing the trade-off among connection gain, handover overhead and energy loss:

max_{x,p}  Σ_{t∈𝒯} Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ]
s.t.  C1: Σ_{r∈ℛ} x_{k,r}(t) = 1,  ∀k ∈ 𝒦, ∀t ∈ 𝒯
      C2: R_k(t) ≥ R_min,  ∀k ∈ 𝒦, ∀t ∈ 𝒯

where ω₁, ω₂, ω₃ are the weight coefficients of connection gain, handover overhead and energy loss respectively, and ω₁ + ω₂ + ω₃ = 1; constraint C1 ensures seamless connectivity for the vehicle users, and constraint C2 reflects the minimum data rate constraint required by the vehicle users.
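Assuming the weighted-difference form reconstructed above, the per-slot trade-off could be evaluated as in the following sketch; the normalisation of the three terms is an assumption rather than the patent's exact expression.

```python
def trade_off(rates, handovers, powers, w1=0.5, w2=0.25, w3=0.25):
    """Average per-user trade-off of connection gain, handover overhead and
    energy loss; the weights must satisfy w1 + w2 + w3 = 1."""
    K = len(rates)
    return sum(w1 * r - w2 * h - w3 * p
               for r, h, p in zip(rates, handovers, powers)) / K

# Example with two vehicles in one time slot
print(trade_off(rates=[7.5, 9.1], handovers=[1, 0], powers=[0.5, 0.8]))
```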
As a specific implementation, in step 2 the agents are divided into two types, α-type vehicle agents and β-type vehicle agents, according to whether they can obtain the reward.
In view of privacy protection, a vehicle agent can only observe its own state information and cannot obtain the reward fed back by the system accurately and in real time. According to whether an agent can receive the reward, the agents are divided into two categories:
α-type vehicle agent: can observe its own local state and obtain the corresponding system reward timely and accurately;
β-type vehicle agent: can observe its own local state but cannot obtain the system reward due to privacy protection.
As a specific implementation, in step 2 the α-type and β-type vehicle agents respectively obtain, in the current time slot, the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, specifically:
at time slot t, each vehicle user acts as an agent and obtains its own observation through interaction with the environment. The state of vehicle k at time slot t is expressed as

s_k(t) = { g_k(t), l_k(t), l_k^a(t−1) },

where g_k(t) denotes the channel state information between vehicle k and its observable roadside units at time slot t, l_k(t) denotes the positions of the roadside units observable by vehicle k at time slot t, and l_k^a(t−1) denotes the position of the roadside unit associated with vehicle k at time slot t−1.
As a specific implementation, in step 3 the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision are output through the federal network, specifically:
each vehicle agent determines the action it performs when interacting with the environment, including the selection of the associated roadside unit and the selection of the downlink transmission power. At time slot t, the action of vehicle k is expressed as

a_k(t) = { x_k(t), p_k(t) }.
as a specific implementation manner, the α -type vehicle agent in step 4 obtains a balance reward of connection gain, switching overhead and energy loss of system feedback, specifically:
when all vehicle agents have performed the action, the environment will be rewarded with a global reward. Defining an average trade-off per user as a global reward, expressed as:
Figure BDA0003598707360000065
as a specific implementation manner, the initializing of the intelligent local Q network and the federal network in step 1, and the encrypting of the Q network output by using the gaussian difference method in step 3 specifically include:
(1) local Q network: vehicle agent for alpha and beta type and local Q network construction for estimating action value function
Figure BDA0003598707360000066
Wherein
Figure BDA0003598707360000067
Representing a future cumulative prize with a discount factor and gamma representing the discount factor. Thus, the local Q network output value defining alpha and beta type vehicle agents is Q α (·;θ α ) And Q β (·;θ β ) Wherein theta α And theta β Respectively, are weight parameters of the deep neural network.
(2) Gaussian difference: for privacy protection, the Gaussian differential method is adopted: random variables obeying a Gaussian distribution are added to the outputs of the local Q networks of the α-type and β-type vehicle agents respectively, defined as

Q̃_α = Q_α(·; θ_α) + n_α,   Q̃_β = Q_β(·; θ_β) + n_β,

where n_α and n_β are Gaussian-distributed random variables.
(3) Federal network: a multi-layer perceptron network is adopted as the federal network; it takes the encrypted Q network outputs as input and outputs the joint decision to predict the joint action, expressed as

a(t) = MLP([ Q̃_α | Q̃_β ]; θ_MLP),

where MLP(·; θ_MLP) denotes the multi-layer perceptron network and [·|·] denotes the concatenation operation.
When the α-type and β-type vehicle agents update their network models, each treats the Q network output encrypted by the other side as a fixed value during its own update.
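A minimal PyTorch-style sketch of the two local Q networks, the Gaussian perturbation of their outputs and the federal multi-layer perceptron that fuses the encrypted outputs is given below. The use of PyTorch, the layer sizes, the noise standard deviation and the detach-based handling of the other agent's output are illustrative assumptions (the three-layer Q network with 80 hidden neurons is specified only in the embodiment).

```python
import torch
import torch.nn as nn

class LocalQNet(nn.Module):
    """Three-layer fully connected local Q network of one vehicle agent."""
    def __init__(self, state_dim, n_actions, hidden=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, s):
        return self.net(s)

class FederalNet(nn.Module):
    """MLP taking the concatenated encrypted Q outputs and predicting joint-action values."""
    def __init__(self, n_actions, hidden=80):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, q_alpha_enc, q_beta_enc):
        return self.mlp(torch.cat([q_alpha_enc, q_beta_enc], dim=-1))

def encrypt(q_values, noise_std=0.1):
    """Gaussian-difference style perturbation of a local Q output (noise_std is illustrative)."""
    return q_values + noise_std * torch.randn_like(q_values)

state_dim, n_actions = 12, 16
q_alpha, q_beta, federal = LocalQNet(state_dim, n_actions), LocalQNet(state_dim, n_actions), FederalNet(n_actions)
s_alpha, s_beta = torch.randn(1, state_dim), torch.randn(1, state_dim)
# During the alpha-side update the beta-side encrypted output is treated as a constant (detached).
joint_q = federal(encrypt(q_alpha(s_alpha)), encrypt(q_beta(s_beta)).detach())
print(joint_q.shape, joint_q.argmax(dim=-1).item())
```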
as a specific implementation manner, the inputting of the sample data of the current time slot into the cache pool in step 4 specifically includes:
in order to improve the stability of the reinforcement learning algorithm, an experience playback method is adopted; the experience buffer pool is used for storing parameters of each step in the previous learning process, and randomly extracting some samples from the previous samples for learning in the later learning process. Wherein the experience pool is represented as
Figure BDA0003598707360000074
The N groups of samples extracted by the alpha type vehicle intelligent agent are shown as
Figure BDA0003598707360000075
The sample extracted by beta type vehicle intelligent body is expressed as
Figure BDA0003598707360000076
When the number of stored samples reaches an upper limit, the old samples will be removed to reserve space for the new sample storage.
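A minimal experience pool along the lines described above is sketched below: fixed capacity, oldest samples dropped first, random mini-batch draws. The stored tuple layout (and storing reward=None for β-type agents) is an assumption for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest samples are dropped when it is full."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # A beta-type agent may store reward=None, since it cannot observe the global reward.
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.pool, batch_size)

    def clear(self):
        self.pool.clear()

    def __len__(self):
        return len(self.pool)

buf = ReplayBuffer(capacity=1000)
for i in range(100):
    buf.push(state=i, action=0, reward=1.0, next_state=i + 1)
print(len(buf.sample(32)))
```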
As a specific implementation, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively update the parameters of the local Q networks and the federal network as described in step 6, specifically:
in the actual model training process there are an online network and a target network, in order to prevent overly frequent updates and to reduce divergence and oscillation during training. The parameters of the online network are updated continuously and used to train the neural network and compute the Q estimate; the target network keeps its parameters temporarily fixed, updates them once every fixed interval, and computes the Q target value. Considering that the β-type vehicle agents cannot obtain the global reward fed back by the system, the target value of the federal network can only be calculated by the α-type vehicle agents and shared with the β-type agents, expressed as

y_j = r_j + γ·max_{a′} Q^tar(s^{j+1}, a′),

where r_j is the global reward obtained by the α-type vehicle agent, γ denotes the discount factor, and Q^tar denotes the target-network estimate of the federal network.
Accordingly, the loss functions of the networks are expressed as

L_α(θ_α, θ_MLP) = E_j[ (y_j − Q(s_α^j, a^j; θ_α, θ_MLP))² ],
L_β(θ_β, θ_MLP) = E_j[ (y_j − Q(s_β^j, a^j; θ_β, θ_MLP))² ].
In practical calculation, the stochastic gradient descent method is generally used to optimize the loss function. In the training process, first the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then the β-type vehicle agent updates its own Q network and multi-layer perceptron network by gradient descent and transmits the updated parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and the federal network parameters θ_α and θ_MLP according to the received parameters. After all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training step begins.
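A sketch of one such parameter update is given below, reusing the network and encryption interfaces assumed in the earlier sketch. Since the exact target and loss expressions appear only as images in the original, a standard DQN-style target y_j = r_j + γ·max Q and a mean-squared-error loss are used here as assumptions; the other agent class's encrypted Q output is passed in as an already-detached constant.

```python
import torch
import torch.nn.functional as F

def federated_update(batch, q_net, federal_net, target_q_net, target_federal_net,
                     optimizer, encrypt, other_q_enc, other_q_enc_next, gamma=0.9):
    """One DQN-style update of an agent's local Q network and the federal network.

    batch            : (s, a, r, s_next) tensors drawn from the experience pool
    other_q_enc(...) : encrypted Q outputs received from the other agent class,
                       treated as constants during this agent's update
    """
    s, a, r, s_next = batch

    # Target value y_j, computed on the alpha side (only alpha-type agents observe r)
    with torch.no_grad():
        q_next = target_federal_net(encrypt(target_q_net(s_next)), other_q_enc_next)
        y = r + gamma * q_next.max(dim=-1).values

    # Q estimate of the taken action through the online federal network
    q_joint = federal_net(encrypt(q_net(s)), other_q_enc)
    q_taken = q_joint.gather(-1, a.unsqueeze(-1)).squeeze(-1)

    loss = F.mse_loss(q_taken, y)          # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a sketch, the exchange of θ_β and θ_MLP back to the α-type agent would simply amount to copying the updated state of the β-side networks before the α-side performs the same update.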
The invention is further described in detail below with reference to the figures and the specific embodiments.
Examples
The embodiment provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which is described in detail as follows:
1. Establishing the Internet of Vehicles system model:
The scenario of an urban multi-lane expressway is simulated with Python. The total length of the road is set to 1 km; 12 roadside units are uniformly distributed on both sides of the road, each with a service radius of 200 m. A pair of vehicles travels toward each other from the two ends of the road; a vehicle can observe the information of at most the 4 nearest roadside units and selects 1 roadside unit for access. The vehicle movement model follows the Gauss-Markov random process with a fixed initial speed, ξ = 0.1 and ζ = 0.1. The channel between a vehicle and a roadside unit consists of the path-loss model and small-scale fading obeying the Rayleigh distribution. The downlink transmission power lies in [23, 35] dBm and the minimum transmission rate constraint is 8 bit/s/Hz.
2. Establishing the federal multi-agent reinforcement learning algorithm framework:
The federal multi-agent reinforcement learning algorithm framework is derived from the deep Q-network (DQN) algorithm and determines the action policy through value-function estimation. The framework mainly consists of the Q networks and the federal network: a three-layer fully connected neural network is adopted as each agent's local Q network, comprising an input layer, a hidden layer and an output layer, with 80 neurons in the hidden layer; a multi-layer perceptron network is adopted as the federal network, which takes the Gaussian-differentially processed Q network output values as input and outputs the action-value function to predict the joint action.
3. Training the algorithm:
First, the current states s_α(t) and s_β(t) input to the algorithm are defined as above, i.e. the channel states observable by the vehicle agent, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot. Second, the action output by the federal network is defined as the joint action of the vehicle agents, i.e. the edge access decisions of the vehicle agents and the selection of the downlink power.
After the algorithm receives the states, the joint action predicted by the federal network interacts with the environment; the α-type vehicle agent then obtains the global reward r(t), and the system transitions to the next states s_α(t+1) and s_β(t+1). The networks are then updated by minimizing the loss function, which changes at each iteration i. During training, a decaying learning rate η (from 0.01 to 0.001) is used and the discount factor is γ = 0.9. 2000 training episodes and 100 test episodes are used, the batch size is set to 32, and the reward weights are ω₁ = 0.5, ω₂ = 0.25 and ω₃ = 0.25. First, the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then, the β-type vehicle agent updates its own Q network and multi-layer perceptron network and transmits the parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and federal network parameters according to the received parameters. After all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training step begins.
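For reference, the simulation and training parameters stated in this embodiment can be collected as in the following configuration; the field names are illustrative.

```python
SIM_CONFIG = {
    "road_length_m": 1000,           # 1 km urban multi-lane expressway
    "num_rsus": 12,                  # roadside units on both sides of the road
    "rsu_radius_m": 200,             # service range radius
    "observable_rsus": 4,            # a vehicle observes at most the 4 nearest RSUs
    "tx_power_range_dbm": (23, 35),  # downlink transmission power range
    "min_rate_bps_hz": 8,            # minimum transmission rate constraint
    "gauss_markov": {"xi": 0.1, "zeta": 0.1},
}

TRAIN_CONFIG = {
    "lr_start": 0.01, "lr_end": 0.001,    # decaying learning rate
    "gamma": 0.9,                         # discount factor
    "train_episodes": 2000, "test_episodes": 100,
    "batch_size": 32,
    "reward_weights": {"w1": 0.5, "w2": 0.25, "w3": 0.25},
    "q_net_hidden_neurons": 80,           # hidden layer of the local Q network
}
```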
The following baseline scheme is considered: since α-type vehicle agents can train models independently and β-type vehicle agents share a similar system environment, it is possible to train only the α-type vehicle agents' model to obtain the joint action policy of the α-type and β-type vehicle agents, i.e. the DQN_α algorithm (which takes s_α(t) as input and outputs the joint-action Q values of the vehicle agents).
As shown in FIG. 3, first, the federal multi-agent reinforcement learning algorithm proposed in this invention has better convergence performance than the comparison algorithm. This indicates that even though the training data are noise-protected, the β-type vehicle agents can still contribute their otherwise unobservable training information to help the α-type vehicle agents learn the joint policy. Second, because the state information of both types of agents is exploited effectively, the proposed algorithm greatly reduces the number of handovers while guaranteeing a satisfactory transmission rate.
As shown in FIG. 4, as the number of roadside units increases, the average per-user trade-off of both schemes increases, and the transmission rate follows the same trend. The reason is that with more roadside units, a vehicle has more opportunities to connect to a closer roadside unit, thereby increasing the data rate. In addition, thanks to the federal policy learning between the α-type and β-type vehicle agents, the proposed federal multi-agent reinforcement learning algorithm improves the connection gain on the basis of privacy protection and significantly reduces the handover overhead.
As shown in FIG. 5, the influence of the downlink transmit power threshold on the average performance of the vehicle users is further investigated. First, as the power threshold increases, the overall trade-off and the connection gain increase, because a larger transmit power helps to increase the data rate. However, since the power allocation is not directly coupled with the handover overhead, the number of handovers remains almost unchanged. Second, the proposed federal multi-agent reinforcement learning framework shares the encrypted Q values of the α-type and β-type vehicle agents and can promote model training on the premise of privacy protection, thereby enabling the vehicle agents to make more effective joint decisions under different downlink transmit power thresholds.
In conclusion, the invention provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which, in view of the unknown highly dynamic topology and channel state characteristics of the Internet of Vehicles, studies the privacy-preserving joint edge access and power allocation problem of the Internet of Vehicles. The scheme adopts the Gaussian differential method to keep the interaction between vehicle agents private during decision training, and introduces a multi-layer perceptron model to share encrypted training data so that the agents can train the model better. On the premise of protecting the privacy of the vehicle users' local state information, the proposed method can improve the connectivity of the Internet of Vehicles and reduce the handover overhead and energy loss, thereby supporting various ultra-reliable and low-latency communication applications in the Internet of Vehicles.

Claims (10)

1. A vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, characterized by comprising the following specific steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, dividing the agents into α-type vehicle agents and β-type vehicle agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtaining the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenating these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going directly to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation.
2. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 1, wherein the inputting of the Internet of Vehicles environment in step 1 specifically comprises:
(1) Time slot model: the continuous training time is discretized into a number of time slots, denoted by the set 𝒯 = {1, 2, …, T}, where the channel state information and the system parameters remain constant within a single time slot but may vary randomly between different time slots;
(2) Network model: an urban multi-lane expressway model is established, in which roadside units supporting edge communication are uniformly distributed on both sides of the road and are denoted by the set ℛ = {1, 2, …, R}; vehicles travel toward each other from the two ends of the road and improve their data transmission rate by establishing vehicle-to-infrastructure links, the set of vehicles being denoted by 𝒦 = {1, 2, …, K};
(3) Vehicle movement model: the speed variation of the vehicle follows the Gauss-Markov random process

v(t) = ξ·v(t−1) + (1−ξ)·v̄ + ζ·√(1−ξ²)·z(t−1),

where v(t) denotes the speed of the vehicle at time slot t, v(t−1) the speed of the vehicle at time slot t−1, v̄ the approximate mean of the speed, ζ the approximate variance of the speed, ξ the degree of memory, and z an uncorrelated zero-mean, unit-variance Gaussian random process;
(4) Handover model: suppose the vehicle can observe the information of the nearby O_max roadside units and adaptively selects the associated roadside unit; the association variables between vehicle k and all roadside units are defined as the vector x_k(t) = [x_{k,1}(t), …, x_{k,R}(t)], where x_{k,r}(t) = 1 indicates that roadside unit r is associated with vehicle k at time slot t, and x_{k,r}(t) = 0 otherwise; considering that the coverage of a roadside unit is limited and a single time slot can only serve one vehicle, a handover occurs when the roadside unit associated with the vehicle changes between adjacent time slots, namely

H_k(t) = 1{x_k(t) ≠ x_k(t−1)},

where H_k(t) indicates the handover performed by vehicle k between adjacent time slots, and 1{·} is the indicator function, set to 1 if the condition is satisfied and to 0 otherwise;
(5) Power allocation model: the transmission power of the downlink roadside unit adopts a discretized level distribution, expressed as P levels within the range [P_min, P_max]; let p_{k,r}(t) denote the downlink transmission power configured by the roadside unit r associated with vehicle k at time slot t; the power allocation variable of vehicle k for its configured roadside unit is the vector p_k(t) = [p_k^1(t), …, p_k^P(t)] over the P levels, where p_k^p(t) = 1 indicates that vehicle k selects level p as the power of the downlink roadside unit at time slot t, i.e. p_{k,r}(t) = p, and p_k^p(t) = 0 otherwise;
(6) Wireless communication model: it is considered that the mutual interference between vehicle-to-infrastructure links has been eliminated in the system by an interference cancellation method; the channel power gain is assumed to be composed of small-scale fading, namely Rayleigh fading, and path loss; let g_{k,r}(t) denote the channel gain between vehicle k and roadside unit r; given the roadside unit r associated with vehicle user k at time slot t and its configured downlink power, according to the Shannon formula the achievable transmission rate R_k(t) of vehicle k is expressed as

R_k(t) = log₂(1 + p_{k,r}(t)·g_{k,r}(t)/σ²),

where σ² denotes the noise power; further, assuming that the minimum data rate required by all vehicles in each time slot is the same and fixed, it is denoted by R_min.
3. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 2, wherein the modeling of the optimization problem in step 1 is specifically:
considering the joint optimization of edge access and power allocation, the optimization problem is constructed with the objective of maximizing the trade-off among connection gain, handover overhead and energy loss:

max_{x,p}  Σ_{t∈𝒯} Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ]
s.t.  C1: Σ_{r∈ℛ} x_{k,r}(t) = 1,  ∀k ∈ 𝒦, ∀t ∈ 𝒯
      C2: R_k(t) ≥ R_min,  ∀k ∈ 𝒦, ∀t ∈ 𝒯

where ω₁, ω₂, ω₃ are the weight coefficients of connection gain, handover overhead and energy loss respectively, and ω₁ + ω₂ + ω₃ = 1; constraint C1 ensures seamless connectivity for the vehicle users, and constraint C2 reflects the minimum data rate constraint required by the vehicle users.
4. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 3, wherein in step 2 the agents are divided into two types, namely α-type vehicle agents and β-type vehicle agents, according to whether they can obtain the reward;
in consideration of privacy protection, a vehicle agent can only observe its own state information and cannot obtain the reward fed back by the system accurately and in real time; according to whether an agent can receive the reward, the agents are divided into two categories:
α-type vehicle agent: can observe its own local state and obtain the corresponding system reward timely and accurately;
β-type vehicle agent: can observe its own local state but cannot obtain the system reward due to privacy protection.
5. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 4, wherein in step 2 the α-type and β-type vehicle agents respectively obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, specifically:
at time slot t, each vehicle user acts as an agent and obtains its own observation through interaction with the environment; the state of vehicle k at time slot t is expressed as

s_k(t) = { g_k(t), l_k(t), l_k^a(t−1) },

where g_k(t) denotes the channel state information between vehicle k and its observable roadside units at time slot t, l_k(t) denotes the positions of the roadside units observable by vehicle k at time slot t, and l_k^a(t−1) denotes the position of the roadside unit associated with vehicle k at time slot t−1.
6. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 5, wherein in step 3 the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision are output through the federal network, specifically:
each vehicle agent determines the action it performs when interacting with the environment, including the selection of the associated roadside unit and the selection of the downlink transmission power; at time slot t, the action of vehicle k is expressed as

a_k(t) = { x_k(t), p_k(t) }.
7. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 6, wherein in step 4 the α-type vehicle agents obtain the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, specifically:
after all vehicle agents have performed their actions, the environment feeds back the global reward of the system; the average trade-off per user is defined as the global reward, expressed as

r(t) = (1/K)·Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ].
8. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 7, wherein the agents' local Q networks and the federal network are initialized in step 1 and the Q network outputs are encrypted with the Gaussian differential method in step 3, specifically:
(1) Local Q network: local Q networks are constructed for the α-type and β-type vehicle agents respectively to estimate the action-value function, i.e. the expected future cumulative reward with discount factor γ; the local Q network output values of the α-type and β-type vehicle agents are defined as Q_α(·; θ_α) and Q_β(·; θ_β), where θ_α and θ_β are the weight parameters of the respective deep neural networks;
(2) Gaussian difference: based on privacy protection, the Gaussian differential method is adopted: random variables obeying a Gaussian distribution are added to the outputs of the local Q networks of the α-type and β-type vehicle agents respectively, expressed as

Q̃_α = Q_α(·; θ_α) + n_α,   Q̃_β = Q_β(·; θ_β) + n_β,

where n_α and n_β are Gaussian-distributed random variables;
(3) Federal network: a multi-layer perceptron network is adopted as the federal network; it takes the encrypted Q network outputs as input and outputs the joint decision to predict the action, expressed as

a(t) = MLP([ Q̃_α | Q̃_β ]; θ_MLP),

where MLP(·; θ_MLP) denotes the multi-layer perceptron network and [·|·] denotes the concatenation operation;
when the α-type and β-type vehicle agents update their network models, each of the two types of vehicle agents treats the Q network output encrypted by the other side as a constant value.
9. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 8, wherein the sample data of the current time slot are stored in the buffer pool in step 4, specifically:
an experience replay method is adopted to improve the stability of the reinforcement learning algorithm; the experience buffer pool stores the transition of each step of the earlier learning process, and in later learning some samples are drawn at random from these stored samples for training; the experience pool is denoted 𝒟, the N groups of samples drawn by an α-type vehicle agent are denoted {(s_α^j, a^j, r^j, s_α^{j+1})}, j = 1, …, N, and the samples drawn by a β-type vehicle agent are denoted {(s_β^j, a^j, s_β^{j+1})}, j = 1, …, N; when the number of stored samples reaches the upper limit, the old samples are removed to reserve space for storing new samples.
10. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 9, wherein when the number of samples is sufficient, the α-type and β-type vehicle agents respectively update the parameters of the local Q networks and the federal network in step 6, specifically:
an online network and a target network exist in the actual model training process, wherein the online network continuously updates its parameters and is used to train the neural network and compute the Q estimate; the target network temporarily fixes its parameters, updates them once every fixed interval, and computes the Q target value;
considering that the β-type vehicle agents cannot obtain the global reward fed back by the system, the target value of the federal network can only be calculated by the α-type vehicle agents and shared with the β-type agents, expressed as

y_j = r_j + γ·max_{a′} Q^tar(s^{j+1}, a′),

where r_j is the global reward obtained by the α-type vehicle agent, γ denotes the discount factor, and Q^tar denotes the target-network estimate of the federal network;
the loss functions of the networks are therefore expressed as

L_α(θ_α, θ_MLP) = E_j[ (y_j − Q(s_α^j, a^j; θ_α, θ_MLP))² ],
L_β(θ_β, θ_MLP) = E_j[ (y_j − Q(s_β^j, a^j; θ_β, θ_MLP))² ];

in actual calculation, the stochastic gradient descent method is adopted to optimize the loss function; in the training process, first the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then the β-type vehicle agent updates its own Q network and multi-layer perceptron network and transmits the parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and federal network parameters according to the received parameters; after all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training begins.
CN202210395450.XA 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning Pending CN114980123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395450.XA CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210395450.XA CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN114980123A true CN114980123A (en) 2022-08-30

Family

ID=82977311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395450.XA Pending CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN114980123A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116506829A (en) * 2023-04-25 2023-07-28 江南大学 Federal edge learning vehicle selection method based on C-V2X communication
CN116506829B (en) * 2023-04-25 2024-05-10 广东北斗烽火台卫星定位科技有限公司 Federal edge learning vehicle selection method based on C-V2X communication
CN116582840A (en) * 2023-07-13 2023-08-11 江南大学 Level distribution method and device for Internet of vehicles communication, storage medium and electronic equipment
CN118157835A (en) * 2024-02-02 2024-06-07 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Block chain-based intelligent vehicle data acquisition system and privacy protection method

Similar Documents

Publication Publication Date Title
CN114980123A (en) Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning
Xu et al. Service offloading with deep Q-network for digital twinning-empowered internet of vehicles in edge computing
CN109803344B (en) A kind of unmanned plane network topology and routing joint mapping method
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN109862610A (en) A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN106411749A (en) Path selection method for software defined network based on Q learning
CN113867354B (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
CN116156455A (en) Internet of vehicles edge content caching decision method based on federal reinforcement learning
Zhang et al. New computing tasks offloading method for MEC based on prospect theory framework
Daeichian et al. Fuzzy Q-learning-based multi-agent system for intelligent traffic control by a game theory approach
CN114499648B (en) Unmanned aerial vehicle cluster network intelligent multi-hop routing method based on multi-agent cooperation
Wang et al. Collaborative edge computing for social internet of vehicles to alleviate traffic congestion
CN114116047A (en) V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method
CN116030623A (en) Collaborative path planning and scheduling method based on blockchain in cognitive Internet of vehicles scene
CN114449482A (en) Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
Gao et al. Fast adaptive task offloading and resource allocation via multiagent reinforcement learning in heterogeneous vehicular fog computing
CN114629769B (en) Traffic map generation method of self-organizing network
CN116963034A (en) Emergency scene-oriented air-ground network distributed resource scheduling method
Li et al. Collaborative computing in vehicular networks: A deep reinforcement learning approach
Nguyen et al. Multi-agent task assignment in vehicular edge computing: A regret-matching learning-based approach
CN117290071A (en) Fine-grained task scheduling method and service architecture in vehicle edge calculation
CN116896561A (en) Calculation unloading and resource allocation method for vehicle-road cooperative system
CN115173926B (en) Communication method and communication system of star-ground fusion relay network based on auction mechanism
CN115756873A (en) Mobile edge computing unloading method and platform based on federal reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination