CN114980123A - Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning - Google Patents

Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Info

Publication number
CN114980123A
Authority
CN
China
Prior art keywords
vehicle
network
agent
federal
time slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210395450.XA
Other languages
Chinese (zh)
Inventor
包金鸣
林艳
陶奕宇
张一晋
邹骏
李骏
束锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210395450.XA
Publication of CN114980123A
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02 Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10 Dynamic resource partitioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/03 Protecting confidentiality, e.g. by encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30 Services specially adapted for particular environments, situations or purposes
    • H04W4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W52/00 Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/02 Power saving arrangements
    • H04W52/0203 Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W84/00 Network topologies
    • H04W84/02 Hierarchically pre-organised networks, e.g. paging networks, cellular networks, WLAN [Wireless Local Area Network] or WLL [Wireless Local Loop]
    • H04W84/10 Small scale networks; Flat hierarchical networks
    • H04W84/12 WLAN [Wireless Local Area Networks]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which specifically comprises the following steps: the Internet of Vehicles environment is input, the agents' local Q networks and the federal network parameters are initialized, and the optimization problem is modeled; the vehicle agents are divided into α-type and β-type according to whether they can obtain the reward, and in the current time slot the two types of vehicle agents each observe their local state and input it into their Q network; the Q network outputs are encrypted, and the joint action decision of the two types of vehicle agents is output through the federal network; the α-type vehicle agents then obtain the global reward fed back by the system, while the buffer pool stores the sample data of the current time slot; when there are enough samples, the α-type and β-type vehicle agents update the parameters of the local Q networks and the federal network respectively; after the current training round ends, the Internet of Vehicles environment is reset and the next round of training begins. The invention improves the connectivity of the Internet of Vehicles on the premise of privacy protection while reducing handover overhead and energy loss.

Description

Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning.
Background
In recent years, vehicles, as an indispensable means of human travel, have become important carriers for driving rapid economic development and improving people's quality of life. Internet of Vehicles technology, as a product of modern industry and the rapid development of mobile communication, can effectively handle the delay-sensitive services of road safety, traffic efficiency, automatic driving and real-time information interaction in intelligent transportation systems. However, with the explosive growth of the number of vehicle users, it is difficult for traditional single-access methods and communication resource allocation schemes to simultaneously satisfy diverse travel requirements. Since most Internet of Vehicles services occurring around a vehicle only require short-range communication in its vicinity and most tasks are time-sensitive, dedicated short-range communication technology is widely used to support high-speed transmission, mainly in vehicle-to-vehicle and vehicle-to-infrastructure communication (Jameel F, Wyne S, Javed M A, et al. Interference-aided vehicular networks: Future research opportunities and challenges [J]. IEEE Communications Magazine, 2018, 56(10): 36-42.). However, as a typical heterogeneous wireless network with high mobility, it is difficult for the fixed-coverage nodes of the Internet of Vehicles to provide continuous and stable service, which causes expensive handover overhead, may lead to transmission interruption, and can even result in inefficient handling of traffic safety services. Therefore, an efficient and fast user access scheme is essential for realizing a high-efficiency, high-performance Internet of Vehicles.
With the development of artificial intelligence, deep reinforcement learning (DRL) is regarded as an effective solution to complex sequential decision problems: it can learn the optimal policy by interacting directly with the environment, without prior knowledge of any system dynamics (e.g., wireless channel, vehicle position, etc.), avoiding the state transition matrices required when solving Markov decision problems with traditional methods. In recent years, researchers have tried to introduce DRL algorithms to solve the edge resource allocation problem. For example, Khan et al. decompose the task centrally processed by the cloud onto each roadside unit for local training and design a user access scheme using a distributed reinforcement learning framework, which minimizes the network coordination overhead while meeting the transmission rate requirement of each vehicle (Khan H, Elgabli A, Samarakoon S, et al.). However, although existing research designs various user access schemes for Internet of Vehicles scenarios and provides cooperative management strategies for communication and energy resources based on different service requirements, the training of DRL algorithms depends on a large amount of information interaction among vehicle users, which brings serious privacy leakage problems.
Disclosure of Invention
The invention aims to provide a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, so that the connectivity of the Internet of Vehicles is improved and the handover overhead and energy loss are reduced on the premise of privacy protection.
The technical solution for realizing the purpose of the invention is as follows: a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning comprises the following specific steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, dividing the agents into α-type vehicle agents and β-type vehicle agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenate these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going directly to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation (a high-level sketch of this training loop is given after these steps).
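For illustration only, a high-level sketch of this training loop is given below. The env, agent and buffer interfaces (reset, observe, decide, step, update, clear) are assumed names introduced to make the control flow of steps 1-8 concrete; they are not part of the patented method itself.

```python
def train(env, alpha_agents, beta_agents, federal_net, buffer,
          n_rounds=2000, min_samples=32):
    """Control-flow sketch of steps 1-8 of the proposed method."""
    for rnd in range(n_rounds):                                  # step 8: repeat until convergence
        env.reset()                                              # step 1: (re)input the IoV environment
        done = False
        while not done:                                          # step 7: loop over the time slots of one round
            obs = {a: a.observe(env) for a in alpha_agents + beta_agents}   # step 2: local observations
            joint_action = federal_net.decide(obs)               # step 3: encrypted Q outputs -> joint action
            reward, done = env.step(joint_action)                # step 4: only alpha-type agents see the reward
            buffer.push(obs, joint_action, reward)
            if len(buffer) >= min_samples:                       # steps 5-6: update once enough samples exist
                for agent in alpha_agents + beta_agents:
                    agent.update(buffer, federal_net)
                buffer.clear()                                   # empty the buffer pool after the update
```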
Compared with the prior art, the invention has the following remarkable advantages: (1) aiming at the massive information interaction caused by centralized processing in single-agent reinforcement learning algorithms, a multi-agent scenario in the Internet of Vehicles is considered, in which the vehicle users make cooperative decisions, so as to improve the connectivity of the Internet of Vehicles and reduce handover overhead and energy loss; (2) the output values of the agents' local Q networks are privacy-protected by the Gaussian differential method, so that the agents do not share the environmental state information they observe, which greatly improves the security of local data; (3) the federal multi-agent reinforcement learning scheme promotes model training by sharing encrypted training data, improving the performance indicators of the system.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the Federal multi-agent reinforcement learning-based vehicle networking edge resource allocation method of the invention.
Fig. 2 is a schematic model diagram of a car networking system according to an embodiment of the present invention.
FIG. 3 is a graph showing the convergence performance of the Internet of vehicles system according to the embodiment of the invention.
FIG. 4 is a graph showing the average performance of the vehicle users of the Internet of vehicles system as a function of the number of roadside units in the embodiment of the invention.
Fig. 5 is a graph of average performance of vehicle users of an internet of vehicles system as a function of a maximum threshold value of downlink transmission power in an embodiment of the present invention.
Detailed Description
The invention provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning. In each unit time slot t, every vehicle user acts as an agent: by observing the channel state between the vehicle and the roadside units and the position information of the roadside units, a neural network is trained to produce a joint action policy, so as to improve the transmission rate, reduce handover overhead and energy loss, and achieve a trade-off among the three. With reference to FIGS. 1-2, the method comprises the following steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, classifying the agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenate these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation.
As a specific implementation, the Internet of Vehicles environment input in step 1 specifically includes:
(1) Time slot model: the continuous training time is discretized into a number of time slots, denoted by the set 𝒯 = {1, 2, …, T}, where the channel state information and the system parameters remain constant within a single time slot but may vary randomly from slot to slot.
(2) Network model: an urban multi-lane expressway model is established, in which roadside units supporting edge communication are uniformly distributed on both sides of the road and are denoted by the set ℛ = {1, 2, …, R}; vehicles travel toward each other from the two ends of the road and improve their data transmission rate by establishing vehicle-to-infrastructure links, the set of vehicles being denoted by 𝒦 = {1, 2, …, K}.
(3) Vehicle movement model: the speed variation of the vehicle follows the Gauss-Markov random process

v(t) = ξ·v(t−1) + (1−ξ)·v̄ + ζ·√(1−ξ²)·z(t−1),

where v(t) denotes the speed of the vehicle at time slot t, v(t−1) its speed at time slot t−1, v̄ the approximate mean of the speed, ζ the approximate variance of the speed, ξ the degree of memory, and z an uncorrelated zero-mean, unit-variance Gaussian random process.
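For reference, a minimal numerical sketch of this speed update is given below, assuming the standard discrete-time Gauss-Markov form reconstructed above; the initial speed and the mean value used in the example are illustrative, not values taken from the patent.

```python
import numpy as np

def gauss_markov_speed(v_prev, v_mean, zeta, xi, rng):
    """One Gauss-Markov update of the vehicle speed: v(t) from v(t-1).

    v_mean : approximate (asymptotic) mean of the speed
    zeta   : approximate variance parameter of the speed
    xi     : degree of memory, in [0, 1]
    """
    z = rng.standard_normal()  # uncorrelated zero-mean, unit-variance Gaussian sample
    return xi * v_prev + (1.0 - xi) * v_mean + zeta * np.sqrt(1.0 - xi ** 2) * z

rng = np.random.default_rng(0)
v = 20.0                       # illustrative initial speed in m/s
for t in range(5):
    v = gauss_markov_speed(v, v_mean=20.0, zeta=0.1, xi=0.1, rng=rng)
    print(f"slot {t}: v = {v:.3f} m/s")
```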
(4) Handover model: suppose the vehicle can observe the information of the nearby O_max roadside units and adaptively selects the associated roadside unit. The association variables between vehicle k and all roadside units are defined as the vector x_k(t) = [x_{k,1}(t), …, x_{k,R}(t)], where x_{k,r}(t) = 1 indicates that roadside unit r is associated with vehicle k at time slot t, and x_{k,r}(t) = 0 otherwise.
consider that the rsu has limited coverage and a single timeslot can only serve one vehicle. When the road side unit associated with the adjacent time slot vehicle is changed, switching occurs, namely:
Figure BDA0003598707360000048
wherein H k (t) indicates the number of switches made by vehicle k between adjacent time slots, 1 {·} Indicating that 1 is set if the constraint is satisfied and, conversely, 0 is set.
(5) Power allocation model: the transmit power of the downlink roadside unit adopts a discretized level distribution, expressed as P levels within the range [P_min, P_max]. Let p_{k,r}(t) denote the downlink transmission power configured by the roadside unit r associated with vehicle k at time slot t; the power allocation variable of vehicle k for its configured roadside unit is the vector p_k(t) = [p_k^1(t), …, p_k^P(t)] over the P levels, where p_k^p(t) = 1 indicates that vehicle k selects level p as the power of the downlink roadside unit at time slot t, i.e. p_{k,r}(t) = p, and p_k^p(t) = 0 otherwise.
(6) Wireless communication model: it is assumed that the mutual interference between vehicle-to-infrastructure links has been eliminated in the system by an interference cancellation method, and that the channel power gain consists of small-scale fading (Rayleigh fading) and path loss. Let g_{k,r}(t) denote the channel gain between vehicle k and roadside unit r. Given the roadside unit r associated with vehicle user k at time slot t and its configured downlink power, according to the Shannon formula the achievable transmission rate of vehicle k is expressed as

R_k(t) = log₂(1 + p_{k,r}(t)·g_{k,r}(t)/σ²),

where σ² denotes the noise power. Further, the minimum data rate required by every vehicle in each time slot is assumed to be identical and fixed, denoted by R_min.
As a specific implementation, the modeling of the optimization problem in step 1 is specifically: considering the joint optimization of edge access and power allocation, the optimization problem is constructed with the objective of maximizing the trade-off among connection gain, handover overhead and energy loss:

max_{x,p}  Σ_{t∈𝒯} Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ]
s.t.  C1: Σ_{r∈ℛ} x_{k,r}(t) = 1,  ∀k ∈ 𝒦, ∀t ∈ 𝒯
      C2: R_k(t) ≥ R_min,  ∀k ∈ 𝒦, ∀t ∈ 𝒯

where ω₁, ω₂, ω₃ are the weight coefficients of connection gain, handover overhead and energy loss respectively, and ω₁ + ω₂ + ω₃ = 1; constraint C1 ensures seamless connectivity for the vehicle users, and constraint C2 reflects the minimum data rate constraint required by the vehicle users.
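Assuming the weighted-difference form reconstructed above, the per-slot trade-off could be evaluated as in the following sketch; the normalisation of the three terms is an assumption rather than the patent's exact expression.

```python
def trade_off(rates, handovers, powers, w1=0.5, w2=0.25, w3=0.25):
    """Average per-user trade-off of connection gain, handover overhead and
    energy loss; the weights must satisfy w1 + w2 + w3 = 1."""
    K = len(rates)
    return sum(w1 * r - w2 * h - w3 * p
               for r, h, p in zip(rates, handovers, powers)) / K

# Example with two vehicles in one time slot
print(trade_off(rates=[7.5, 9.1], handovers=[1, 0], powers=[0.5, 0.8]))
```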
As a specific implementation, in step 2 the agents are divided into two types, α-type vehicle agents and β-type vehicle agents, according to whether they can obtain the reward.
In view of privacy protection, a vehicle agent can only observe its own state information and cannot obtain the reward fed back by the system accurately and in real time. According to whether an agent can receive the reward, the agents are divided into two categories:
α-type vehicle agent: can observe its own local state and obtain the corresponding system reward timely and accurately;
β-type vehicle agent: can observe its own local state but cannot obtain the system reward due to privacy protection.
As a specific implementation, in step 2 the α-type and β-type vehicle agents respectively obtain, in the current time slot, the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, specifically:
at time slot t, each vehicle user acts as an agent and obtains its own observation through interaction with the environment. The state of vehicle k at time slot t is expressed as

s_k(t) = { g_k(t), l_k(t), l_k^a(t−1) },

where g_k(t) denotes the channel state information between vehicle k and its observable roadside units at time slot t, l_k(t) denotes the positions of the roadside units observable by vehicle k at time slot t, and l_k^a(t−1) denotes the position of the roadside unit associated with vehicle k at time slot t−1.
As a specific implementation, in step 3 the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision are output through the federal network, specifically:
each vehicle agent determines the action it performs when interacting with the environment, including the selection of the associated roadside unit and the selection of the downlink transmission power. At time slot t, the action of vehicle k is expressed as

a_k(t) = { x_k(t), p_k(t) }.
as a specific implementation manner, the α -type vehicle agent in step 4 obtains a balance reward of connection gain, switching overhead and energy loss of system feedback, specifically:
when all vehicle agents have performed the action, the environment will be rewarded with a global reward. Defining an average trade-off per user as a global reward, expressed as:
Figure BDA0003598707360000065
as a specific implementation manner, the initializing of the intelligent local Q network and the federal network in step 1, and the encrypting of the Q network output by using the gaussian difference method in step 3 specifically include:
(1) local Q network: vehicle agent for alpha and beta type and local Q network construction for estimating action value function
Figure BDA0003598707360000066
Wherein
Figure BDA0003598707360000067
Representing a future cumulative prize with a discount factor and gamma representing the discount factor. Thus, the local Q network output value defining alpha and beta type vehicle agents is Q α (·;θ α ) And Q β (·;θ β ) Wherein theta α And theta β Respectively, are weight parameters of the deep neural network.
(2) Gaussian difference: for privacy protection, the Gaussian differential method is adopted: random variables obeying a Gaussian distribution are added to the outputs of the local Q networks of the α-type and β-type vehicle agents respectively, defined as

Q̃_α = Q_α(·; θ_α) + n_α,   Q̃_β = Q_β(·; θ_β) + n_β,

where n_α and n_β are Gaussian-distributed random variables.
(3) Federal network: a multi-layer perceptron network is adopted as the federal network; it takes the encrypted Q network outputs as input and outputs the joint decision to predict the joint action, expressed as

a(t) = MLP([ Q̃_α | Q̃_β ]; θ_MLP),

where MLP(·; θ_MLP) denotes the multi-layer perceptron network and [·|·] denotes the concatenation operation.
When the α-type and β-type vehicle agents update their network models, each treats the Q network output encrypted by the other side as a fixed value during its own update.
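A minimal PyTorch-style sketch of the two local Q networks, the Gaussian perturbation of their outputs and the federal multi-layer perceptron that fuses the encrypted outputs is given below. The use of PyTorch, the layer sizes, the noise standard deviation and the detach-based handling of the other agent's output are illustrative assumptions (the three-layer Q network with 80 hidden neurons is specified only in the embodiment).

```python
import torch
import torch.nn as nn

class LocalQNet(nn.Module):
    """Three-layer fully connected local Q network of one vehicle agent."""
    def __init__(self, state_dim, n_actions, hidden=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, s):
        return self.net(s)

class FederalNet(nn.Module):
    """MLP taking the concatenated encrypted Q outputs and predicting joint-action values."""
    def __init__(self, n_actions, hidden=80):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, q_alpha_enc, q_beta_enc):
        return self.mlp(torch.cat([q_alpha_enc, q_beta_enc], dim=-1))

def encrypt(q_values, noise_std=0.1):
    """Gaussian-difference style perturbation of a local Q output (noise_std is illustrative)."""
    return q_values + noise_std * torch.randn_like(q_values)

state_dim, n_actions = 12, 16
q_alpha, q_beta, federal = LocalQNet(state_dim, n_actions), LocalQNet(state_dim, n_actions), FederalNet(n_actions)
s_alpha, s_beta = torch.randn(1, state_dim), torch.randn(1, state_dim)
# During the alpha-side update the beta-side encrypted output is treated as a constant (detached).
joint_q = federal(encrypt(q_alpha(s_alpha)), encrypt(q_beta(s_beta)).detach())
print(joint_q.shape, joint_q.argmax(dim=-1).item())
```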
as a specific implementation manner, the inputting of the sample data of the current time slot into the cache pool in step 4 specifically includes:
in order to improve the stability of the reinforcement learning algorithm, an experience playback method is adopted; the experience buffer pool is used for storing parameters of each step in the previous learning process, and randomly extracting some samples from the previous samples for learning in the later learning process. Wherein the experience pool is represented as
Figure BDA0003598707360000074
The N groups of samples extracted by the alpha type vehicle intelligent agent are shown as
Figure BDA0003598707360000075
The sample extracted by beta type vehicle intelligent body is expressed as
Figure BDA0003598707360000076
When the number of stored samples reaches an upper limit, the old samples will be removed to reserve space for the new sample storage.
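A minimal experience pool along the lines described above is sketched below: fixed capacity, oldest samples dropped first, random mini-batch draws. The stored tuple layout (and storing reward=None for β-type agents) is an assumption for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest samples are dropped when it is full."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # A beta-type agent may store reward=None, since it cannot observe the global reward.
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.pool, batch_size)

    def clear(self):
        self.pool.clear()

    def __len__(self):
        return len(self.pool)

buf = ReplayBuffer(capacity=1000)
for i in range(100):
    buf.push(state=i, action=0, reward=1.0, next_state=i + 1)
print(len(buf.sample(32)))
```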
As a specific implementation, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively update the parameters of the local Q networks and the federal network as described in step 6, specifically:
in the actual model training process there are an online network and a target network, in order to prevent overly frequent updates and to reduce divergence and oscillation during training. The parameters of the online network are updated continuously and used to train the neural network and compute the Q estimate; the target network keeps its parameters temporarily fixed, updates them once every fixed interval, and computes the Q target value. Considering that the β-type vehicle agents cannot obtain the global reward fed back by the system, the target value of the federal network can only be calculated by the α-type vehicle agents and shared with the β-type agents, expressed as

y_j = r_j + γ·max_{a′} Q^tar(s^{j+1}, a′),

where r_j is the global reward obtained by the α-type vehicle agent, γ denotes the discount factor, and Q^tar denotes the target-network estimate of the federal network.
Accordingly, the loss functions of the networks are expressed as

L_α(θ_α, θ_MLP) = E_j[ (y_j − Q(s_α^j, a^j; θ_α, θ_MLP))² ],
L_β(θ_β, θ_MLP) = E_j[ (y_j − Q(s_β^j, a^j; θ_β, θ_MLP))² ].
In practical calculation, the stochastic gradient descent method is generally used to optimize the loss function. In the training process, first the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then the β-type vehicle agent updates its own Q network and multi-layer perceptron network by gradient descent and transmits the updated parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and the federal network parameters θ_α and θ_MLP according to the received parameters. After all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training step begins.
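A sketch of one such parameter update is given below, reusing the network and encryption interfaces assumed in the earlier sketch. Since the exact target and loss expressions appear only as images in the original, a standard DQN-style target y_j = r_j + γ·max Q and a mean-squared-error loss are used here as assumptions; the other agent class's encrypted Q output is passed in as an already-detached constant.

```python
import torch
import torch.nn.functional as F

def federated_update(batch, q_net, federal_net, target_q_net, target_federal_net,
                     optimizer, encrypt, other_q_enc, other_q_enc_next, gamma=0.9):
    """One DQN-style update of an agent's local Q network and the federal network.

    batch            : (s, a, r, s_next) tensors drawn from the experience pool
    other_q_enc(...) : encrypted Q outputs received from the other agent class,
                       treated as constants during this agent's update
    """
    s, a, r, s_next = batch

    # Target value y_j, computed on the alpha side (only alpha-type agents observe r)
    with torch.no_grad():
        q_next = target_federal_net(encrypt(target_q_net(s_next)), other_q_enc_next)
        y = r + gamma * q_next.max(dim=-1).values

    # Q estimate of the taken action through the online federal network
    q_joint = federal_net(encrypt(q_net(s)), other_q_enc)
    q_taken = q_joint.gather(-1, a.unsqueeze(-1)).squeeze(-1)

    loss = F.mse_loss(q_taken, y)          # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a sketch, the exchange of θ_β and θ_MLP back to the α-type agent would simply amount to copying the updated state of the β-side networks before the α-side performs the same update.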
The invention is further described in detail below with reference to the figures and the specific embodiments.
Examples
The embodiment provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which is described in detail as follows:
1. Establishing the Internet of Vehicles system model:
The scenario of an urban multi-lane expressway is simulated with Python. The total length of the road is set to 1 km; 12 roadside units are uniformly distributed on both sides of the road, each with a service radius of 200 m. A pair of vehicles travels toward each other from the two ends of the road; a vehicle can observe the information of at most the 4 nearest roadside units and selects 1 roadside unit for access. The vehicle movement model follows the Gauss-Markov random process with a fixed initial speed, ξ = 0.1 and ζ = 0.1. The channel between a vehicle and a roadside unit consists of the path-loss model and small-scale fading obeying the Rayleigh distribution. The downlink transmission power lies in [23, 35] dBm and the minimum transmission rate constraint is 8 bit/s/Hz.
2. Establishing the federal multi-agent reinforcement learning algorithm framework:
The federal multi-agent reinforcement learning algorithm framework is derived from the deep Q-network (DQN) algorithm and determines the action policy through value-function estimation. The framework mainly consists of the Q networks and the federal network: a three-layer fully connected neural network is adopted as each agent's local Q network, comprising an input layer, a hidden layer and an output layer, with 80 neurons in the hidden layer; a multi-layer perceptron network is adopted as the federal network, which takes the Gaussian-differentially processed Q network output values as input and outputs the action-value function to predict the joint action.
3. Training the algorithm:
First, the current states s_α(t) and s_β(t) input to the algorithm are defined as above, i.e. the channel states observable by the vehicle agent, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot. Second, the action output by the federal network is defined as the joint action of the vehicle agents, i.e. the edge access decisions of the vehicle agents and the selection of the downlink power.
After the algorithm receives the states, the joint action predicted by the federal network interacts with the environment; the α-type vehicle agent then obtains the global reward r(t), and the system transitions to the next states s_α(t+1) and s_β(t+1). The networks are then updated by minimizing the loss function, which changes at each iteration i. During training, a decaying learning rate η (from 0.01 to 0.001) is used and the discount factor is γ = 0.9. 2000 training episodes and 100 test episodes are used, the batch size is set to 32, and the reward weights are ω₁ = 0.5, ω₂ = 0.25 and ω₃ = 0.25. First, the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then, the β-type vehicle agent updates its own Q network and multi-layer perceptron network and transmits the parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and federal network parameters according to the received parameters. After all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training step begins.
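For reference, the simulation and training parameters stated in this embodiment can be collected as in the following configuration; the field names are illustrative.

```python
SIM_CONFIG = {
    "road_length_m": 1000,           # 1 km urban multi-lane expressway
    "num_rsus": 12,                  # roadside units on both sides of the road
    "rsu_radius_m": 200,             # service range radius
    "observable_rsus": 4,            # a vehicle observes at most the 4 nearest RSUs
    "tx_power_range_dbm": (23, 35),  # downlink transmission power range
    "min_rate_bps_hz": 8,            # minimum transmission rate constraint
    "gauss_markov": {"xi": 0.1, "zeta": 0.1},
}

TRAIN_CONFIG = {
    "lr_start": 0.01, "lr_end": 0.001,    # decaying learning rate
    "gamma": 0.9,                         # discount factor
    "train_episodes": 2000, "test_episodes": 100,
    "batch_size": 32,
    "reward_weights": {"w1": 0.5, "w2": 0.25, "w3": 0.25},
    "q_net_hidden_neurons": 80,           # hidden layer of the local Q network
}
```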
The following baseline scheme is considered: since α-type vehicle agents can train models independently and β-type vehicle agents share a similar system environment, it is possible to train only the α-type vehicle agents' model to obtain the joint action policy of the α-type and β-type vehicle agents, i.e. the DQN_α algorithm (which takes s_α(t) as input and outputs the joint-action Q values of the vehicle agents).
As shown in FIG. 3, first, the federal multi-agent reinforcement learning algorithm proposed in this invention has better convergence performance than the comparison algorithm. This indicates that even though the training data are noise-protected, the β-type vehicle agents can still contribute their otherwise unobservable training information to help the α-type vehicle agents learn the joint policy. Second, because the state information of both types of agents is exploited effectively, the proposed algorithm greatly reduces the number of handovers while guaranteeing a satisfactory transmission rate.
As shown in FIG. 4, as the number of roadside units increases, the average per-user trade-off of both schemes increases, and the transmission rate follows the same trend. The reason is that with more roadside units, a vehicle has more opportunities to connect to a closer roadside unit, thereby increasing the data rate. In addition, thanks to the federal policy learning between the α-type and β-type vehicle agents, the proposed federal multi-agent reinforcement learning algorithm improves the connection gain on the basis of privacy protection and significantly reduces the handover overhead.
As shown in FIG. 5, the influence of the downlink transmit power threshold on the average performance of the vehicle users is further investigated. First, as the power threshold increases, the overall trade-off and the connection gain increase, because a larger transmit power helps to increase the data rate. However, since the power allocation is not directly coupled with the handover overhead, the number of handovers remains almost unchanged. Second, the proposed federal multi-agent reinforcement learning framework shares the encrypted Q values of the α-type and β-type vehicle agents and can promote model training on the premise of privacy protection, thereby enabling the vehicle agents to make more effective joint decisions under different downlink transmit power thresholds.
In conclusion, the invention provides a vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, which, in view of the unknown highly dynamic topology and channel state characteristics of the Internet of Vehicles, studies the privacy-preserving joint edge access and power allocation problem of the Internet of Vehicles. The scheme adopts the Gaussian differential method to keep the interaction between vehicle agents private during decision training, and introduces a multi-layer perceptron model to share encrypted training data so that the agents can train the model better. On the premise of protecting the privacy of the vehicle users' local state information, the proposed method can improve the connectivity of the Internet of Vehicles and reduce the handover overhead and energy loss, thereby supporting various ultra-reliable and low-latency communication applications in the Internet of Vehicles.

Claims (10)

1. A vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning, characterized by comprising the following specific steps:
step 1, inputting the Internet of Vehicles environment, initializing the agents' local Q networks and the federal network parameters, and modeling the optimization problem;
step 2, dividing the agents into α-type vehicle agents and β-type vehicle agents according to whether they can obtain the reward; in the current time slot, the α-type and β-type vehicle agents each obtaining the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, and concatenating these observations as the input of their Q network;
step 3, encrypting the Q network outputs with the Gaussian differential method, and outputting the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision through the federal network;
step 4, the α-type vehicle agents obtaining the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, while the system stores the sample data of the current time slot in the buffer pool;
step 5, judging whether the number of samples is sufficient; if so, going to step 6, otherwise going directly to step 7;
step 6, when the number of samples is sufficient, the α-type and β-type vehicle agents respectively updating the parameters of the local Q networks and the federal network, and emptying the buffer pool after the update is completed;
step 7, judging whether the current training round is finished; if not, returning to step 2 to start the next round of training, and if so, going to step 8;
step 8, judging whether convergence is reached; if not, resetting the Internet of Vehicles environment and returning to step 1; if so, finishing the training and completing the Internet of Vehicles edge resource allocation.
2. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 1, wherein the inputting of the Internet of Vehicles environment in step 1 specifically comprises:
(1) Time slot model: the continuous training time is discretized into a number of time slots, denoted by the set 𝒯 = {1, 2, …, T}, where the channel state information and the system parameters remain constant within a single time slot but may vary randomly between different time slots;
(2) Network model: an urban multi-lane expressway model is established, in which roadside units supporting edge communication are uniformly distributed on both sides of the road and are denoted by the set ℛ = {1, 2, …, R}; vehicles travel toward each other from the two ends of the road and improve their data transmission rate by establishing vehicle-to-infrastructure links, the set of vehicles being denoted by 𝒦 = {1, 2, …, K};
(3) Vehicle movement model: the speed variation of the vehicle follows the Gauss-Markov random process

v(t) = ξ·v(t−1) + (1−ξ)·v̄ + ζ·√(1−ξ²)·z(t−1),

where v(t) denotes the speed of the vehicle at time slot t, v(t−1) the speed of the vehicle at time slot t−1, v̄ the approximate mean of the speed, ζ the approximate variance of the speed, ξ the degree of memory, and z an uncorrelated zero-mean, unit-variance Gaussian random process;
(4) Handover model: suppose the vehicle can observe the information of the nearby O_max roadside units and adaptively selects the associated roadside unit; the association variables between vehicle k and all roadside units are defined as the vector x_k(t) = [x_{k,1}(t), …, x_{k,R}(t)], where x_{k,r}(t) = 1 indicates that roadside unit r is associated with vehicle k at time slot t, and x_{k,r}(t) = 0 otherwise; considering that the coverage of a roadside unit is limited and a single time slot can only serve one vehicle, a handover occurs when the roadside unit associated with the vehicle changes between adjacent time slots, namely

H_k(t) = 1{x_k(t) ≠ x_k(t−1)},

where H_k(t) indicates the handover performed by vehicle k between adjacent time slots, and 1{·} is the indicator function, set to 1 if the condition is satisfied and to 0 otherwise;
(5) Power allocation model: the transmission power of the downlink roadside unit adopts a discretized level distribution, expressed as P levels within the range [P_min, P_max]; let p_{k,r}(t) denote the downlink transmission power configured by the roadside unit r associated with vehicle k at time slot t; the power allocation variable of vehicle k for its configured roadside unit is the vector p_k(t) = [p_k^1(t), …, p_k^P(t)] over the P levels, where p_k^p(t) = 1 indicates that vehicle k selects level p as the power of the downlink roadside unit at time slot t, i.e. p_{k,r}(t) = p, and p_k^p(t) = 0 otherwise;
(6) Wireless communication model: it is considered that the mutual interference between vehicle-to-infrastructure links has been eliminated in the system by an interference cancellation method; the channel power gain is assumed to be composed of small-scale fading, namely Rayleigh fading, and path loss; let g_{k,r}(t) denote the channel gain between vehicle k and roadside unit r; given the roadside unit r associated with vehicle user k at time slot t and its configured downlink power, according to the Shannon formula the achievable transmission rate R_k(t) of vehicle k is expressed as

R_k(t) = log₂(1 + p_{k,r}(t)·g_{k,r}(t)/σ²),

where σ² denotes the noise power; further, assuming that the minimum data rate required by all vehicles in each time slot is the same and fixed, it is denoted by R_min.
3. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 2, wherein the modeling of the optimization problem in step 1 is specifically:
considering the joint optimization of edge access and power allocation, the optimization problem is constructed with the objective of maximizing the trade-off among connection gain, handover overhead and energy loss:

max_{x,p}  Σ_{t∈𝒯} Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ]
s.t.  C1: Σ_{r∈ℛ} x_{k,r}(t) = 1,  ∀k ∈ 𝒦, ∀t ∈ 𝒯
      C2: R_k(t) ≥ R_min,  ∀k ∈ 𝒦, ∀t ∈ 𝒯

where ω₁, ω₂, ω₃ are the weight coefficients of connection gain, handover overhead and energy loss respectively, and ω₁ + ω₂ + ω₃ = 1; constraint C1 ensures seamless connectivity for the vehicle users, and constraint C2 reflects the minimum data rate constraint required by the vehicle users.
4. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 3, wherein in step 2 the agents are divided into two types, namely α-type vehicle agents and β-type vehicle agents, according to whether they can obtain the reward;
in consideration of privacy protection, a vehicle agent can only observe its own state information and cannot obtain the reward fed back by the system accurately and in real time; according to whether an agent can receive the reward, the agents are divided into two categories:
α-type vehicle agent: can observe its own local state and obtain the corresponding system reward timely and accurately;
β-type vehicle agent: can observe its own local state but cannot obtain the system reward due to privacy protection.
5. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 4, wherein in step 2 the α-type and β-type vehicle agents respectively obtain the channel state between themselves and the roadside units, the positions of the observable roadside units, and the position of the roadside unit associated in the previous time slot, specifically:
at time slot t, each vehicle user acts as an agent and obtains its own observation through interaction with the environment; the state of vehicle k at time slot t is expressed as

s_k(t) = { g_k(t), l_k(t), l_k^a(t−1) },

where g_k(t) denotes the channel state information between vehicle k and its observable roadside units at time slot t, l_k(t) denotes the positions of the roadside units observable by vehicle k at time slot t, and l_k^a(t−1) denotes the position of the roadside unit associated with vehicle k at time slot t−1.
6. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 5, wherein in step 3 the joint edge access decision of the α-type and β-type vehicle agents and the downlink power allocation decision are output through the federal network, specifically:
each vehicle agent determines the action it performs when interacting with the environment, including the selection of the associated roadside unit and the selection of the downlink transmission power; at time slot t, the action of vehicle k is expressed as

a_k(t) = { x_k(t), p_k(t) }.
7. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 6, wherein in step 4 the α-type vehicle agents obtain the trade-off reward of connection gain, handover overhead and energy loss fed back by the system, specifically:
after all vehicle agents have performed their actions, the environment feeds back the global reward of the system; the average trade-off per user is defined as the global reward, expressed as

r(t) = (1/K)·Σ_{k∈𝒦} [ ω₁·R_k(t) − ω₂·H_k(t) − ω₃·p_{k,r}(t) ].
8. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 7, wherein the agents' local Q networks and the federal network are initialized in step 1 and the Q network outputs are encrypted with the Gaussian differential method in step 3, specifically:
(1) Local Q network: local Q networks are constructed for the α-type and β-type vehicle agents respectively to estimate the action-value function, i.e. the expected future cumulative reward with discount factor γ; the local Q network output values of the α-type and β-type vehicle agents are defined as Q_α(·; θ_α) and Q_β(·; θ_β), where θ_α and θ_β are the weight parameters of the respective deep neural networks;
(2) Gaussian difference: based on privacy protection, the Gaussian differential method is adopted: random variables obeying a Gaussian distribution are added to the outputs of the local Q networks of the α-type and β-type vehicle agents respectively, expressed as

Q̃_α = Q_α(·; θ_α) + n_α,   Q̃_β = Q_β(·; θ_β) + n_β,

where n_α and n_β are Gaussian-distributed random variables;
(3) Federal network: a multi-layer perceptron network is adopted as the federal network; it takes the encrypted Q network outputs as input and outputs the joint decision to predict the action, expressed as

a(t) = MLP([ Q̃_α | Q̃_β ]; θ_MLP),

where MLP(·; θ_MLP) denotes the multi-layer perceptron network and [·|·] denotes the concatenation operation;
when the α-type and β-type vehicle agents update their network models, each of the two types of vehicle agents treats the Q network output encrypted by the other side as a constant value.
9. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 8, wherein the sample data of the current time slot are stored in the buffer pool in step 4, specifically:
an experience replay method is adopted to improve the stability of the reinforcement learning algorithm; the experience buffer pool stores the transition of each step of the earlier learning process, and in later learning some samples are drawn at random from these stored samples for training; the experience pool is denoted 𝒟, the N groups of samples drawn by an α-type vehicle agent are denoted {(s_α^j, a^j, r^j, s_α^{j+1})}, j = 1, …, N, and the samples drawn by a β-type vehicle agent are denoted {(s_β^j, a^j, s_β^{j+1})}, j = 1, …, N; when the number of stored samples reaches the upper limit, the old samples are removed to reserve space for storing new samples.
10. The vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning as claimed in claim 9, wherein when the number of samples is sufficient, the α-type and β-type vehicle agents respectively update the parameters of the local Q networks and the federal network in step 6, specifically:
an online network and a target network exist in the actual model training process, wherein the online network continuously updates its parameters and is used to train the neural network and compute the Q estimate; the target network temporarily fixes its parameters, updates them once every fixed interval, and computes the Q target value;
considering that the β-type vehicle agents cannot obtain the global reward fed back by the system, the target value of the federal network can only be calculated by the α-type vehicle agents and shared with the β-type agents, expressed as

y_j = r_j + γ·max_{a′} Q^tar(s^{j+1}, a′),

where r_j is the global reward obtained by the α-type vehicle agent, γ denotes the discount factor, and Q^tar denotes the target-network estimate of the federal network;
the loss functions of the networks are therefore expressed as

L_α(θ_α, θ_MLP) = E_j[ (y_j − Q(s_α^j, a^j; θ_α, θ_MLP))² ],
L_β(θ_β, θ_MLP) = E_j[ (y_j − Q(s_β^j, a^j; θ_β, θ_MLP))² ];

in actual calculation, the stochastic gradient descent method is adopted to optimize the loss function; in the training process, first the α-type vehicle agent computes y_j and transmits it to the β-type vehicle agent; then the β-type vehicle agent updates its own Q network and multi-layer perceptron network and transmits the parameters θ_β and θ_MLP, together with its encrypted Q network output, to the α-type vehicle agent; finally, the α-type vehicle agent updates its own Q network and federal network parameters according to the received parameters; after all networks have been updated, if the vehicle round is finished the training process ends and the optimal policy π* is output; otherwise, the next training begins.
CN202210395450.XA 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning Pending CN114980123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395450.XA CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210395450.XA CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN114980123A true CN114980123A (en) 2022-08-30

Family

ID=82977311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395450.XA Pending CN114980123A (en) 2022-04-15 2022-04-15 Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN114980123A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116506829A (en) * 2023-04-25 2023-07-28 江南大学 Federal edge learning vehicle selection method based on C-V2X communication
CN116506829B (en) * 2023-04-25 2024-05-10 广东北斗烽火台卫星定位科技有限公司 Federal edge learning vehicle selection method based on C-V2X communication
CN116582840A (en) * 2023-07-13 2023-08-11 江南大学 Level distribution method and device for Internet of vehicles communication, storage medium and electronic equipment
CN118157835A (en) * 2024-02-02 2024-06-07 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Block chain-based intelligent vehicle data acquisition system and privacy protection method

Similar Documents

Publication Publication Date Title
CN114980123A (en) Vehicle networking edge resource allocation method based on federal multi-agent reinforcement learning
Xu et al. Service offloading with deep Q-network for digital twinning-empowered internet of vehicles in edge computing
CN109803344B (en) A kind of unmanned plane network topology and routing joint mapping method
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN109862610A (en) A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN106411749A (en) Path selection method for software defined network based on Q learning
CN113867354B (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
CN116156455A (en) Internet of vehicles edge content caching decision method based on federal reinforcement learning
Zhang et al. New computing tasks offloading method for MEC based on prospect theory framework
Daeichian et al. Fuzzy Q-learning-based multi-agent system for intelligent traffic control by a game theory approach
CN114499648B (en) Unmanned aerial vehicle cluster network intelligent multi-hop routing method based on multi-agent cooperation
Wang et al. Collaborative edge computing for social internet of vehicles to alleviate traffic congestion
CN114116047A (en) V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method
CN116030623A (en) Collaborative path planning and scheduling method based on blockchain in cognitive Internet of vehicles scene
CN114449482A (en) Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
Gao et al. Fast adaptive task offloading and resource allocation via multiagent reinforcement learning in heterogeneous vehicular fog computing
CN114629769B (en) Traffic map generation method of self-organizing network
CN116963034A (en) Emergency scene-oriented air-ground network distributed resource scheduling method
Li et al. Collaborative computing in vehicular networks: A deep reinforcement learning approach
Nguyen et al. Multi-agent task assignment in vehicular edge computing: A regret-matching learning-based approach
CN117290071A (en) Fine-grained task scheduling method and service architecture in vehicle edge calculation
CN116896561A (en) Calculation unloading and resource allocation method for vehicle-road cooperative system
CN115173926B (en) Communication method and communication system of star-ground fusion relay network based on auction mechanism
CN115756873A (en) Mobile edge computing unloading method and platform based on federal reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination