CN114449482A

CN114449482A - Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning

Info

Publication number: CN114449482A
Application number: CN202210242124.5A
Authority: CN
Inventors: 陶奕宇; 林艳; 包金鸣; 张一晋; 邹骏; 李骏; 束锋
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2022-05-06
Anticipated expiration: 2042-03-11
Also published as: CN114449482B

Abstract

The invention discloses a heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning, which comprises the steps of firstly modeling a problem into a partially observable Markov decision process, then adopting the idea of decomposing a team value function, specifically establishing a centralized training distributed execution framework, and utilizing and summing the team value function and each user value function to achieve the purpose of implicitly training the user value function; then, the invention also refers to experience playback and a target network mechanism, uses an epsilon-greedy strategy to explore and select actions, utilizes a recurrent neural network to store historical information, selects a Huber loss function to calculate loss and simultaneously performs gradient descent, and finally learns the association strategy of the heterogeneous Internet of vehicles users. Compared with a multi-agent independent deep Q learning algorithm and other traditional algorithms, the multi-agent independent deep Q learning method based on the multi-agent intelligent network can effectively improve energy efficiency and reduce switching cost at the same time under the heterogeneous vehicle networking environment.

Description

Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning

Technical Field

The invention relates to the technical field of wireless communication, in particular to a heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.

Background

With the rapid development of economy in recent years, the number of global automobiles increases day by day, the probability of traffic jam and traffic accident is greatly increased while the automobiles bring convenience to people, and thus, vehicle-mounted Ad-hoc Networks (VANETs) are produced. VANETs communicate vehicles, pedestrians and roads in a certain range by using advanced wireless communication technology, sensing technology and the like, comprehensively sense the traffic and the roads, and form a special mobile ad hoc network (Cao S, Lee V C. an access and complete performance modeling of the IEEE 802.11p MAC layer for VANET [ J ]. Computer Communications,2020,149: 107-.

Since the internet of vehicles has high mobility, in order to ensure seamless communication, the user association of the vehicle faces frequent switching, failed switching, high energy consumption, etc., which in turn causes poor user experience and heavy signal load (Gures E, Shayea I, Alhammadi A, et al. A compatible substrate on mobility management in 5G heterologous networks: architecture, exchange and solutions [ J ]. IEEE Access,2020,8: 195883-. For this reason, handover overhead and energy efficiency are often targeted for optimization of user-associated policies in the internet of vehicles. However, joint optimization problems often have difficulty in obtaining a global optimal strategy due to non-convex and combination characteristics, and in this case, a reinforcement learning method is widely applied to sequence decision due to its advantages. Lin Y et al (Lin Y, Zhang Z, Huang Y, et al. heterologous user-centralized cluster simulation of connectivity-handover trade-off in the virtual networks [ J ]. IEEE Transactions on temporal Technology,2020,69(12):16027 and 16043.) propose a user-centric intelligent heterogeneous cluster migration solution that greatly reduces the amount of switching overhead while maintaining the average data transmission rate per user by means of a single-agent depth-deterministic policy gradient algorithm. Khan H et al (Khan H, Elgabli a, samarakon S, et al. relationship learning-based vehicle-cell association algorithm for high mobile alliance communication [ J ]. IEEE Transactions on Cognitive Communications and networks, 2019,5(4): 1073-. However, the above single agent method requires a lot of or almost complete information to make a central decision, which not only results in a difficult implementation due to an excessive computational dimension, but also usually results in a lot of unnecessary communication overhead, so that it is necessary to further study how to obtain better performance with less information and resources.

Disclosure of Invention

The invention aims to provide a heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.

The technical solution for realizing the purpose of the invention is as follows: a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning comprises the following steps:

step 1: initializing algorithm related parameters including weight parameters in a local online Q network and a target Q network and hidden state parameters of a recurrent neural network layer;

step 2: each vehicle user obtains local state information by observing the current environment, then inputs the local state information into a local online Q network to obtain a corresponding Q value, and selects an associated action according to an epsilon-greedy strategy;

and step 3: each vehicle user is associated to an adjacent road side unit or vehicle base station according to the selected associated action, and data transmission is carried out to obtain a team reward value from environment feedback;

and 4, step 4: each vehicle user re-observes the current local state information;

and 5: repeating the step 2 to the step 4 until all vehicle users exit the road, stopping correlation, taking the process from the road entering to the road exiting of all vehicles as a round, and storing the experience information of the whole round into an experience pool;

step 6: extracting a plurality of rounds of experience information from the experience pool, updating the online Q network by using a value decomposition network VDN algorithm, and copying online Q network parameters every T time slots to form a new target Q network;

and 7: and (5) repeating the steps 2 to 6 until the team reward value is converged, and finishing the training.

Compared with the prior art, the invention has the following remarkable advantages: (1) by adopting a multi-agent reinforcement learning technology, the vehicle users are allowed to not exchange information with each other, so that communication resources are saved, meanwhile, the algorithm calculation dimensionality is reduced, and the calculation efficiency is improved; (2) the system has the advantages that the more optimal balance of the average energy efficiency and the switching overhead of the system under the heterogeneous Internet of vehicles scene is realized, so that the continuity of energy sources is improved, and the continuity of communication is ensured.

Drawings

FIG. 1 is a flow chart of a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning.

FIG. 2 is a network structure diagram of a VDN algorithm in the heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.

FIG. 3 is a graph of different policy convergence reward values, average energy efficiency and handover overhead versus the number of RSUs.

FIG. 4 is a graph of different policy convergence reward values, average energy efficiency and handover overhead versus RSU transmit power.

Detailed Description

Aiming at the heterogeneous Vehicle networking scene that two types of communication links of a Vehicle-Vehicle (Vehicle to Vehicle, V2V) and a Vehicle-Infrastructure (Vehicle to Infrastructure, V2I) exist simultaneously, the characteristics that a Vehicle networking user moves at a high speed and only can obtain local information are considered, and the goal is to optimize the balance between switching cost and energy efficiency. Then, the invention also refers to experience playback and a target network mechanism, uses an epsilon-greedy strategy to explore and select actions, utilizes a recurrent neural network to store historical information, selects a Huber loss function to calculate loss and simultaneously performs gradient descent, and finally learns the association strategy of the heterogeneous Internet of vehicles users.

The invention provides a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning, which reduces switching overhead while improving the average energy efficiency of a system, and specifically comprises the following steps:

Further, the local online Q network and the target Q network in step 1 are both provided with two linear network layers and one gating cycle unit layer.

Further, the local state information in step 2 specifically includes:

the local state information of the vehicle user i in the time slot t is

Wherein

For the current time slot the vehicle user's own location,

being nearest to the vehicle user

The location of the individual surrounding roadside units,

for the maximum number of rsus to which each slotted vehicle user can connect,

for the location of all of the vehicle base stations,

the location of all devices connected by the vehicle user for the last time slot.

The associated actions of the vehicle user i include:

(1) selectingIs selected to at most

Connecting the road side units;

(2) only one vehicle base station is selected for connection.

Further, the epsilon-greedy policy in step 2 specifically includes: a vehicle user randomly selects one type of associated action in an action space by using a probability epsilon to search, and selects the associated action corresponding to the maximum Q value by using a probability of 1-epsilon.

Further, the team prize value in step 3 is specifically:

assuming that the vehicle users all have allocated orthogonal resource blocks, considering only small-scale fading and path loss for the communication links between vehicles and the communication links between vehicles and infrastructure, the data transmission rate V of the vehicle user i_i ^tExpressed as:

wherein

Set of road side units or vehicle base stations, p, associated with a current time slot vehicle user i_cThe transmit power of the road side unit or vehicle base station c,

for the channel gain, σ, between the RSU or vehicle base station c and this vehicle user connection²Is an additive white gaussian noise variance;

the time is dispersed into a plurality of time slots, and the switching cost of the vehicle user i in the current time slot t

Expressed as:

wherein

A road side unit set or a vehicle base station associated with the last time slot;

team award value R^tFor the sum of all vehicle reward values, i.e.

Where K is the number of vehicle users and the reward value of user i in time slot t

Comprises the following steps:

wherein

A minimum transmission rate limit for the vehicle user,

for the minimum transmitting power of the road side unit, the parameter beta belongs to [0,1 ]]Weights for the energy efficiency reward component.

Further, in step 4, each vehicle user re-observes the current local state information, specifically as follows:

each vehicle user and vehicle base station according to a movement formula

Updating the location where v_tAnd v_t-1The speed of the time slot t and the previous time slot t-1 respectively,

and

are respectively provided withFor the progressive speed and standard deviation of the vehicle, the memory parameter α represents the dependency of the current speed on the previous speed, the parameter n is an uncorrelated random gaussian process with a mean value of 0 and a variance of 1.

Further, the round experience information in step 5 is a set obtained by arranging single-step experience information of the whole round according to the sequence of the round steps, wherein the single-step experience information is a quadruple and comprises local observation information

Selected associated action

Single step reward

And local observation information of the next time slot

Further, the updating of the online Q network by using the value decomposition network VDN algorithm in step 6 specifically includes:

step 6-1: collecting the extracted vehicle users

Round experience of each user, including local observation information

Selected associated action

And local observation information of the next time slot

Respectively inputting the local online Q network and the target Q network to respectively obtain the Q value of the online Q network

And maximum Q value of target Q network

All vehicle users are obtained

1,2, K and

and correspondingly summing K to obtain a Q value of the total online Q network

And total target Q network maximum Q value

The method comprises the following specific steps:

wherein R is^tFor team prize values at time slot t, aggregated by vehicle users

Single step reward in round experience for K users

Are summed up, i.e.

Gamma is a discount factor, used to controlThe effect of long-distance rewards on the current reward, δ is used to control the sensitivity to outliers during the learning process;

step 6-3: extracting the experience of the next time slot and replacing the current experience;

step 6-4: and repeating the step 6-1 to the step 6-3 until all rounds of experience extraction in the whole round are finished.

Compared with a multi-agent independent deep Q learning algorithm and other traditional algorithms, the method can effectively improve energy efficiency and reduce switching overhead at the same time under the heterogeneous vehicle networking environment.

The invention is described in further detail below with reference to the figures and the embodiments.

Examples

The present embodiment provides that when all vehicle users walk to the end of the road, a round is completed in which the vehicle user makes every move and associated action, referred to as a time slot.

With reference to fig. 1, the heterogeneous internet of vehicles user association method based on multi-agent deep reinforcement learning in this embodiment includes the following specific steps:

step 1: algorithm-related parameters are initialized.

Initializing parameters such as related weight and bias of a local online Q network and a target Q network of a vehicle user and hidden state parameters of a Recurrent neural network layer, wherein the local online Q network and the target Q network both have two linear network layers and a Gated Recurrent Unit (GRU) layer.

Step 2: each vehicle user obtains local state information by observing the current environment, then inputs the local state information into a local network to obtain a corresponding Q value, and selects an associated action according to an epsilon-greedy strategy.

Specifically, taking the vehicle user i as an example, the local status information of the vehicle user i in the time slot t is

Wherein

For the current time slotThe location of the user of the vehicle himself,

being nearest to the vehicle user

The location of the individual surrounding roadside units,

for the location of all of the vehicle base stations,

the location of all devices connected by the vehicle user for the last time slot. Agent will observe locally

Inputting a first layer linear network, and activating a function through a ReLU; further, the output and the cyclic neural network hidden state obtained in the last time slot

Inputting GRU to obtain the hidden state of the current recurrent neural network

And a network output value; finally, the network output value is input into the last layer of linear network, and the output is the Q value corresponding to all actions of the online Q network

Wherein

Is the set of all actions. And associated actions of vehicle user i

The method specifically comprises the following steps: (1) is selected at most

Connecting the road side units; (2) only one vehicle base station is selected for connection. The epsilon-greedy strategy according to which the action is specifically selected is as follows: a vehicle user randomly selects one type of associated action in an action space by using a probability epsilon to search, and selects the associated action corresponding to the maximum Q value by using a probability of 1-epsilon.

And step 3: and each vehicle user is associated to the adjacent road side unit or vehicle base station according to the selected associated action, and data transmission is carried out to obtain the team reward value from the environmental feedback.

Assume that the vehicle users have all allocated orthogonal resource blocks. Aiming at the communication link between vehicles and infrastructure, the invention only considers the small-scale fading and the path loss, and then the data transmission rate V of the vehicle user i_i ^tCan be expressed as:

wherein

for the channel gain, σ, between the RSU or vehicle base station c and this vehicle user connection²Is an additive white gaussian noise variance.

For ease of modeling, the time is discretized into a plurality of time slots. Switching overhead of vehicle user i in current time slot t

Can be expressed as:

wherein

The set of rsus or vehicle bss associated with the previous timeslot.

The team prize value being the sum of all vehicle prize values, i.e.

Wherein the reward value r of user i in time slot t_i ^tComprises the following steps:

wherein

A minimum transmission rate limit for the vehicle user,

for the minimum transmitting power of the road side unit, the parameter beta belongs to [0,1 ]]The weight of the energy efficiency reward component is,

the maximum number of road side units which can be connected by each time slot vehicle user.

And 4, step 4: each vehicle user re-observes the current local state information.

Each vehicle user and vehicle base station according to a movement formula

Update the location, wherein

And

respectively the progressive speed and the standard deviation of the vehicle, a memory parameter alpha represents the dependency of the current speed on the previous speed, and parameters n are 0Uncorrelated random gaussian processes with values and variances of 1.

And 5: and repeating the steps 2-4 until the maximum number of steps of the round is reached, and storing the experience information of the whole round into an experience pool.

The round experience information is a set formed by arranging single-step experience information of the whole round according to the sequence of the step number, wherein the single-step information is a quadruple and comprises local observation information

Selected associated action

Single step reward r_i ^tAnd local observation information of the next time slot

Step 6: and extracting a plurality of rounds of experience from the experience pool, updating the current network by using a value decomposition network VDN algorithm, and copying current network parameters every T time slots to form a new target network. With reference to fig. 2, the algorithm learning specifically includes the following steps:

step 6-1: collecting the extracted vehicle users

Respectively inputting the round experience of each user into respective local Q network and target Q network to respectively obtain

And

all vehicle users are obtained

1,2, K and

i＝1,2, sum of K correspondence to obtain

And

the method comprises the following specific steps:

step 6-2: calculating loss according to a Huber loss function, and further performing gradient reduction on the loss to update the local network, wherein the loss function is specifically as follows:

where δ is used to control the sensitivity to outliers during learning.

step 6-4: repeating steps 1 to 3 until the round of experience is over.

And 7: and repeating the steps 2-6 until the team reward value is converged, and finishing the training.

The simulation of the invention adopts Python programming, and the parameter setting does not influence the generality. The method for comparison with the method comprises the following steps: (1) multi-agent Independent Deep Q learning algorithm (IDQN): each vehicle user has an independent deep Q network, takes other users as a part of the environment, and trains by using a deep Q learning method; (2) FULL join algorithm (FULL): connecting nearest

A road side unit; (3) strongest signal connection algorithm (RSS): the connection signal strength is the highestLarge associable devices.

In the embodiment, the road side units are uniformly distributed on two sides of the road, the length of the road is 1 kilometer, the width of the road is 7.5 meters, and the coverage ranges of the road side units and the vehicle base station are 200m and 50m respectively. One-way driving and vehicle progressive speed of vehicle user and vehicle base station

Standard deviation of

The memory parameter alpha is 0.1. The simulation sets that the number of vehicle users is 5, the number of vehicle base stations is 5, the number of road side units is 30, and the maximum connectable road side units of the vehicle users

Minimum transmission rate for vehicle users

Minimum transmit power of road side unit

The rewarding weight parameter beta is 0.6, the transmitting power of the vehicle base station is 30dBm, the transmitting power of the road side unit is 32dBm, the power density of the environmental noise is-174 dBm/Hz, and the carrier bandwidth is 180 kHz.

In a local Q network of a vehicle user, the number of two layers of hidden neuron nodes is 64, the hidden state dimension of a GRU is 64, the empirical pool capacity is 5000, the learning rate eta is 0.00002, the discount factor gamma is 0.8, the sampling Size is 32, the Huber parameter delta is 0.1, the activation function is ReLU, and the optimizer is Adam. The empirical pool capacity of the comparison algorithm IDQN is 10000, the learning rate is 0.0005, and the rest is the same as the VDN algorithm.

FIG. 3 shows the relationship between the number of roadside units and the reward value, the handover overhead, and the average energy efficiency. It can be observed that as the number of roadside units increases, all of the reward value curves trend upward. The reason for this is that the density of the roadside units increases, the communication distance decreases, the transmission rate and thus the energy efficiency increases greatly, and although the switching overhead increases as well, the magnitude is far smaller than the magnitude of the increase in energy efficiency.

Fig. 4 shows the relationship between the road side unit transmission power and the reward value, the switching overhead and the average energy efficiency. It can be observed that as the transmit power of the roadside unit increases, all of the prize value curves trend downward. The reason for this is that the switching overhead increases the transmission rate due to the fixed number of road side units without large fluctuations, but the amplitude is much less than the amplitude of the increase in the transmission power, and therefore the energy efficiency is reduced. In addition, as the transmitting power of the road side unit is increased, the reduction amplitude of the reward value of the method provided by the invention is obviously smaller than other three baselines, and the better performance is still kept.

As can be seen from fig. 3 and 4, under the condition of equal number of rsus or equal rsus, the reward convergence value of the VDN-based association method is obviously better than other baselines, and under the condition of sufficient rsus or rsus transmission power, the method has better energy efficiency and less switching overhead.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:

and 2, step: each vehicle user obtains local state information by observing the current environment, then inputs the local state information into a local online Q network to obtain a corresponding Q value, and selects an associated action according to an epsilon-greedy strategy;

and 5: repeating the step 2 to the step 4 until all vehicle users exit the road, stopping association, taking the process from the road to the road as a round, and storing the experience information of the whole round into an experience pool;

and 7: and (5) repeating the step 2 to the step 6 until the reward value of the team is converged, and finishing the training.

2. The multi-agent deep reinforcement learning-based heterogeneous car networking user association method according to claim 1, wherein the local online Q network and the target Q network in step 1 are provided with two linear network layers and one gating cycle unit layer.

3. The multi-agent deep reinforcement learning-based heterogeneous internet of vehicles user association method as claimed in claim 1, wherein the local state information in step 2 is specifically:

the local state information of the vehicle user i in the time slot t is

Wherein

For the current time slot the vehicle user's own location,

being nearest to the vehicle user

The location of the individual surrounding roadside units,

for the maximum number of rsus to which each slotted vehicle user can connect,

for the location of all of the vehicle base stations,

4. The multi-agent deep reinforcement learning-based heterogeneous vehicle networking user association method according to claim 3, wherein the association action of a vehicle user i comprises:

(1) is selected at most

Connecting the road side units;

(2) only one vehicle base station is selected for connection.

5. The heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning of claim 1, wherein the e-greedy policy in step 2 is specifically:

a vehicle user randomly selects one type of associated action in an action space by using a probability epsilon to search, and selects the associated action corresponding to the maximum Q value by using a probability of 1-epsilon.

6. The multi-agent deep reinforcement learning-based heterogeneous vehicle networking user association method according to claim 1, wherein the team reward value in step 3 is specifically:

wherein

gain, σ, of the channel between the road side unit or vehicle base station c and the vehicle user connection²Is an additive white gaussian noise variance;

Expressed as:

wherein

team award value R^tFor the sum of all vehicle reward values, i.e.

Where K is the number of vehicle users and the reward value r of user i in time slot t_i ^tComprises the following steps:

wherein

A minimum transmission rate limit for the vehicle user,

7. The multi-agent deep reinforcement learning-based heterogeneous vehicle networking user association method according to claim 1, wherein each vehicle user in step 4 re-observes current local state information as follows:

each vehicle user and vehicle base station according to a movement formula

and

respectively, the progressive speed and the standard deviation of the vehicle, the memory parameter alpha represents the dependency of the current speed on the previous speed, the parameter n is 0 mean and the variance is an irrelevant random gaussian process under 1.

8. The multi-agent based deep reinforcement learning of claim 1The heterogeneous Internet of vehicles user association method is characterized in that the round experience information in the step 5 is a set formed by arranging single-step experience information of the whole round according to the sequence of the round steps, wherein the single-step information is a quadruple and comprises local observation information

Selected associated action

9. The multi-agent deep reinforcement learning-based heterogeneous vehicle networking user association method according to claim 1, wherein the value decomposition network VDN algorithm is used for updating an online Q network in step 6, and specifically comprises the following steps:

step 6-1: collecting the extracted vehicle users

Round experience of each user, including local observation information

Selected associated action

And local observation information of the next time slot

And maximum Q value of target Q network

All vehicle users are obtained

And

the Q value of the total on-line Q network is obtained by corresponding summation

And total target Q network maximum Q value

The method comprises the following specific steps:

step 6-2: calculating loss according to a Huber loss function, and further performing gradient reduction on the loss to update the online Q network, wherein the loss function is specifically as follows:

wherein R is^tFor team prize values at time slot t, aggregated by vehicle users

Single step reward r in round experience of K users_i ^tAre summed, i.e.

Gamma is a discount factor and is used for controlling the influence of long-distance rewards on the current rewards, and delta is used for controlling the sensitivity to abnormal values in the learning process;