CN114449482A - Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning - Google Patents


Info

Publication number
CN114449482A
Authority
CN
China
Prior art keywords
vehicle
network
user
time slot
experience
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210242124.5A
Other languages
Chinese (zh)
Other versions
CN114449482B (en)
Inventor
陶奕宇 (Tao Yiyu)
林艳 (Lin Yan)
包金鸣 (Bao Jinming)
张一晋 (Zhang Yijin)
邹骏 (Zou Jun)
李骏 (Li Jun)
束锋 (Shu Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210242124.5A priority Critical patent/CN114449482B/en
Publication of CN114449482A publication Critical patent/CN114449482A/en
Application granted granted Critical
Publication of CN114449482B publication Critical patent/CN114449482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/44 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/02 Services making use of location information
    • H04W 4/021 Services related to particular areas, e.g. point of interest [POI] services, venue services or geofences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/02 Services making use of location information
    • H04W 4/023 Services making use of location information using mutual or relative location information between multiple location based services [LBS] targets or of distance thresholds
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/02 Services making use of location information
    • H04W 4/025 Services making use of location information using location based information parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/46 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for vehicle-to-vehicle communication [V2V]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning. The problem is first modeled as a partially observable Markov decision process; the idea of decomposing the team value function is then adopted by establishing a centralized-training, distributed-execution framework in which the team value function is the sum of the individual user value functions, so that each user value function is trained implicitly. The method further incorporates experience replay and a target-network mechanism, explores and selects actions with an ε-greedy strategy, stores historical information with a recurrent neural network, computes the loss with the Huber loss function and performs gradient descent, and finally learns the association strategy of heterogeneous Internet-of-Vehicles users. Compared with the multi-agent independent deep Q-learning algorithm and other traditional algorithms, the proposed method can effectively improve energy efficiency while reducing handover overhead in heterogeneous Internet-of-Vehicles environments.

Description

Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.
Background
With the rapid economic development of recent years, the number of automobiles worldwide grows day by day; while cars bring convenience, they also greatly increase the probability of traffic congestion and traffic accidents, and Vehicular Ad-hoc Networks (VANETs) have emerged in response. Using advanced wireless communication and sensing technologies, VANETs interconnect vehicles, pedestrians and roads within a certain range, comprehensively sense traffic and road conditions, and form a special mobile ad-hoc network (Cao S, Lee V C. An accurate and complete performance modeling of the IEEE 802.11p MAC layer for VANET [J]. Computer Communications, 2020, 149: 107-).
Because the Internet of Vehicles is highly mobile, vehicle user association faces frequent handovers, failed handovers and high energy consumption in order to guarantee seamless communication, which in turn degrades the user experience and increases the signaling load (Gures E, Shayea I, Alhammadi A, et al. A comprehensive survey on mobility management in 5G heterogeneous networks: architectures, challenges and solutions [J]. IEEE Access, 2020, 8: 195883-). For this reason, handover overhead and energy efficiency are common optimization targets for user-association policies in the Internet of Vehicles. However, such joint optimization problems are non-convex and combinatorial, so a globally optimal strategy is difficult to obtain; in this setting, reinforcement learning is widely applied to sequential decision-making because of its advantages. Lin Y et al. (Lin Y, Zhang Z, Huang Y, et al. Heterogeneous user-centric cluster migration improves the connectivity-handover trade-off in vehicular networks [J]. IEEE Transactions on Vehicular Technology, 2020, 69(12): 16027-16043.) propose a user-centric intelligent heterogeneous cluster-migration solution that greatly reduces the handover overhead while maintaining the average data rate of each user by means of a single-agent deep deterministic policy gradient algorithm. Khan H et al. (Khan H, Elgabli A, Samarakoon S, et al. Reinforcement learning-based vehicle-cell association algorithm for highly mobile millimeter wave communication [J]. IEEE Transactions on Cognitive Communications and Networking, 2019, 5(4): 1073-) propose a reinforcement-learning-based vehicle-cell association algorithm for highly mobile millimeter-wave communication. However, the single-agent methods above require a large amount of, or almost complete, information to make a centralized decision, which not only makes implementation difficult because of the excessive computational dimensionality but also usually incurs a large amount of unnecessary communication overhead, so it is necessary to study further how to obtain better performance with less information and fewer resources.
Disclosure of Invention
The invention aims to provide a heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.
The technical solution for realizing the purpose of the invention is as follows: a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning comprises the following steps:
Step 1: initialize algorithm-related parameters, including the weight parameters of the local online Q network and the target Q network and the hidden-state parameters of the recurrent neural network layer;
Step 2: each vehicle user obtains local state information by observing the current environment, feeds it into its local online Q network to obtain the corresponding Q values, and selects an associated action according to an ε-greedy strategy;
Step 3: each vehicle user associates with an adjacent roadside unit or vehicle base station according to the selected associated action and performs data transmission to obtain a team reward value fed back by the environment;
Step 4: each vehicle user re-observes the current local state information;
Step 5: repeat step 2 to step 4 until all vehicle users exit the road and association stops; the process from all vehicles entering the road to exiting the road is taken as one round, and the experience information of the whole round is stored into an experience pool;
Step 6: extract several rounds of experience information from the experience pool, update the online Q network with the value decomposition network (VDN) algorithm, and copy the online Q network parameters every T time slots to form a new target Q network;
Step 7: repeat step 2 to step 6 until the team reward value converges; training is then complete.
Compared with the prior art, the invention has the following remarkable advantages: (1) by adopting multi-agent reinforcement learning, vehicle users are not required to exchange information with each other, which saves communication resources, reduces the computational dimensionality of the algorithm, and improves computational efficiency; (2) a better balance between the average energy efficiency and the handover overhead of the system is achieved in the heterogeneous Internet-of-Vehicles scenario, which improves energy sustainability while ensuring communication continuity.
Drawings
FIG. 1 is a flow chart of a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning.
FIG. 2 is a network structure diagram of a VDN algorithm in the heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning.
FIG. 3 is a graph of different policy convergence reward values, average energy efficiency and handover overhead versus the number of RSUs.
FIG. 4 is a graph of different policy convergence reward values, average energy efficiency and handover overhead versus RSU transmit power.
Detailed Description
The invention addresses a heterogeneous Internet-of-Vehicles scenario in which both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication links exist, takes into account that Internet-of-Vehicles users move at high speed and can only obtain local information, and aims to optimize the trade-off between handover overhead and energy efficiency. The problem is modeled as a partially observable Markov decision process and solved under a centralized-training, distributed-execution framework that decomposes the team value function into the sum of the individual user value functions. The invention further incorporates experience replay and a target-network mechanism, explores and selects actions with an ε-greedy strategy, stores historical information with a recurrent neural network, computes the loss with the Huber loss function and performs gradient descent, and finally learns the association strategy of heterogeneous Internet-of-Vehicles users.
The invention provides a heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning, which reduces switching overhead while improving the average energy efficiency of a system, and specifically comprises the following steps:
Step 1: initialize algorithm-related parameters, including the weight parameters of the local online Q network and the target Q network and the hidden-state parameters of the recurrent neural network layer;
Step 2: each vehicle user obtains local state information by observing the current environment, feeds it into its local online Q network to obtain the corresponding Q values, and selects an associated action according to an ε-greedy strategy;
Step 3: each vehicle user associates with an adjacent roadside unit or vehicle base station according to the selected associated action and performs data transmission to obtain a team reward value fed back by the environment;
Step 4: each vehicle user re-observes the current local state information;
Step 5: repeat step 2 to step 4 until all vehicle users exit the road and association stops; the process from all vehicles entering the road to exiting the road is taken as one round, and the experience information of the whole round is stored into an experience pool;
Step 6: extract several rounds of experience information from the experience pool, update the online Q network with the value decomposition network (VDN) algorithm, and copy the online Q network parameters every T time slots to form a new target Q network;
Step 7: repeat step 2 to step 6 until the team reward value converges; training is then complete.
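For concreteness, a minimal sketch of how steps 2 to 7 can be organized into a training loop is given below. The environment interface, the per-user agent objects and the helper names (epsilon_greedy, vdn_update, replay_buffer) are hypothetical placeholders introduced only for illustration, not part of the patented method itself.

# Minimal training-loop sketch for steps 2-7. The env/agent interfaces and the
# helpers epsilon_greedy and vdn_update are hypothetical placeholders.
def train(env, agents, replay_buffer, num_episodes,
          batch_episodes, target_update_period, epsilon):
    for episode in range(num_episodes):
        obs = env.reset()                                   # step 2: initial local observations
        hidden = [agent.init_hidden() for agent in agents]  # GRU hidden states
        episode_experience = []
        done = False
        while not done:                                     # repeat steps 2-4
            actions = []
            for i, agent in enumerate(agents):
                q_values, hidden[i] = agent.online_q(obs[i], hidden[i])
                actions.append(epsilon_greedy(q_values, epsilon))  # epsilon-greedy choice
            next_obs, team_reward, done = env.step(actions)        # step 3: associate and transmit
            episode_experience.append((obs, actions, team_reward, next_obs))
            obs = next_obs                                  # step 4: re-observe
        replay_buffer.add(episode_experience)               # step 5: store the whole round
        if len(replay_buffer) >= batch_episodes:
            vdn_update(agents, replay_buffer.sample(batch_episodes))  # step 6: VDN update
        if episode % target_update_period == 0:
            for agent in agents:
                agent.sync_target()   # copy online parameters to the target net (the method does this every T time slots)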
Further, the local online Q network and the target Q network in step 1 each comprise two linear network layers and one gated recurrent unit (GRU) layer.
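A minimal PyTorch sketch of such a per-user recurrent Q network (linear layer, GRU cell, linear layer) is shown below; the 64-unit layer width and the ReLU activation follow the embodiment, while the class and argument names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalQNetwork(nn.Module):
    """Per-user Q network: linear layer -> GRU cell -> linear layer."""
    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)      # first linear layer
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # gated recurrent unit layer
        self.fc2 = nn.Linear(hidden_dim, num_actions)  # one Q value per associated action

    def init_hidden(self, batch_size=1):
        return torch.zeros(batch_size, self.gru.hidden_size)

    def forward(self, obs, hidden):
        x = F.relu(self.fc1(obs))       # ReLU activation after the first layer
        hidden = self.gru(x, hidden)    # update the recurrent hidden state
        q_values = self.fc2(hidden)     # Q values for all associated actions
        return q_values, hidden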
Further, the local state information in step 2 is specifically as follows:
The local state information of vehicle user i in time slot t is o_i^t = {l_i^t, L_i^{RSU,t}, L^{BS}, L_i^{conn,t-1}}, where l_i^t is the vehicle user's own position in the current time slot, L_i^{RSU,t} contains the positions of the N_max surrounding roadside units nearest to the vehicle user, N_max being the maximum number of roadside units each vehicle user can connect to in a time slot, L^{BS} contains the positions of all vehicle base stations, and L_i^{conn,t-1} contains the positions of all devices the vehicle user was connected to in the previous time slot.
The associated action of vehicle user i is one of the following:
(1) connect to at most N_max roadside units;
(2) connect to exactly one vehicle base station.
Further, the ε-greedy strategy in step 2 is specifically: with probability ε the vehicle user explores by selecting an associated action uniformly at random from the action space, and with probability 1-ε it selects the associated action with the maximum Q value.
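As a concrete illustration, the ε-greedy selection described above can be sketched as follows (the function name and the tensor-valued Q input are illustrative assumptions):

import random
import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the argmax action."""
    num_actions = q_values.shape[-1]
    if random.random() < epsilon:
        return random.randrange(num_actions)      # explore
    return int(torch.argmax(q_values).item())     # exploit: action with the maximum Q value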
Further, the team reward value in step 3 is specifically as follows:
Assuming that the vehicle users have all been allocated orthogonal resource blocks, and considering only small-scale fading and path loss for the vehicle-to-vehicle and vehicle-to-infrastructure communication links, the data transmission rate V_i^t of vehicle user i is expressed as
V_i^t = Σ_{c ∈ C_i^t} B log2(1 + p_c g_{i,c}^t / σ^2),
where C_i^t is the set of roadside units or vehicle base stations associated with vehicle user i in the current time slot, B is the carrier bandwidth, p_c is the transmit power of roadside unit or vehicle base station c, g_{i,c}^t is the channel gain of the link between roadside unit or vehicle base station c and this vehicle user, and σ^2 is the additive white Gaussian noise variance.
Time is discretized into time slots; the handover overhead H_i^t of vehicle user i in the current time slot t is obtained by comparing C_i^t with C_i^{t-1}, the set of roadside units or the vehicle base station associated in the previous time slot, so that changes of association between consecutive time slots are counted.
The team reward value R_t is the sum of the reward values of all vehicles, i.e. R_t = Σ_{i=1}^{K} r_i^t, where K is the number of vehicle users. The reward value r_i^t of user i in time slot t combines an energy-efficiency component, weighted by the parameter β ∈ [0,1], with a handover-cost component derived from H_i^t, where V_min is the minimum transmission rate limit of a vehicle user and p_min is the minimum transmit power of a roadside unit.
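The exact closed-form reward is given in the figures of the original patent and is not reproduced here. Purely as an illustration of how the quantities defined above (rate V_i^t, handover overhead H_i^t, thresholds V_min and p_min, weight β) could be combined, a hypothetical per-user reward is sketched below; the Shannon-style rate and the normalization used here are assumptions of this sketch, not the patented formula.

import math

def data_rate(powers, gains, noise_var, bandwidth=180e3):
    """Shannon-style rate over the associated devices (orthogonal resource blocks assumed)."""
    return sum(bandwidth * math.log2(1.0 + p * g / noise_var)
               for p, g in zip(powers, gains))

def user_reward(rate, total_power, handover_count, n_max, v_min, p_min, beta):
    """Illustrative reward: beta-weighted energy efficiency minus a handover-cost term.
    The normalization by (v_min / p_min) and n_max is an assumption of this sketch."""
    energy_efficiency = rate / total_power           # bits per joule (per-second basis)
    ee_term = energy_efficiency / (v_min / p_min)    # normalized energy-efficiency component
    ho_term = handover_count / n_max                 # normalized handover-cost component
    return beta * ee_term - (1.0 - beta) * ho_term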
Further, in step 4 each vehicle user re-observes the current local state information, as follows:
Each vehicle user and each vehicle base station updates its position according to the movement formula
v_t = α v_{t-1} + (1 - α) v_mean + σ_v sqrt(1 - α^2) n,
where v_t and v_{t-1} are the speeds in time slot t and in the previous time slot t-1 respectively, v_mean and σ_v are respectively the asymptotic mean speed and the standard deviation of the vehicle speed, the memory parameter α represents the dependency of the current speed on the previous speed, and n is an uncorrelated random Gaussian process with mean 0 and variance 1.
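A minimal sketch of this speed update, together with a one-dimensional position update (the position-integration step and the slot duration are assumptions of this sketch), is:

import random

def gauss_markov_speed(v_prev, v_mean, v_std, alpha):
    """Speed update: v_t = a*v_{t-1} + (1-a)*v_mean + v_std*sqrt(1-a^2)*n."""
    n = random.gauss(0.0, 1.0)   # zero-mean, unit-variance Gaussian sample
    return alpha * v_prev + (1.0 - alpha) * v_mean + v_std * (1.0 - alpha ** 2) ** 0.5 * n

def move(position, v_prev, v_mean, v_std, alpha, slot_duration=1.0):
    """Advance a vehicle (or vehicle base station) along the road for one time slot."""
    v_t = gauss_markov_speed(v_prev, v_mean, v_std, alpha)
    return position + v_t * slot_duration, v_t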
Further, the round experience information in step 5 is the set of the single-step experience information of the whole round, arranged in step order, where each single-step experience is a quadruple consisting of the local observation o_i^t, the selected associated action a_i^t, the single-step reward r_i^t, and the local observation o_i^{t+1} of the next time slot.
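A minimal episodic replay buffer matching this description could look as follows; the class and method names are illustrative, and the default capacity simply reuses the value 5000 given later in the embodiment (the patent does not state whether that capacity counts rounds or single steps).

import random
from collections import deque

class EpisodeReplayBuffer:
    """Stores whole rounds; each round is a list of per-step quadruples
    (local observations, associated actions, single-step rewards, next observations)."""
    def __init__(self, capacity=5000):
        self.episodes = deque(maxlen=capacity)

    def add(self, episode):
        self.episodes.append(episode)   # one entry per round, ordered by step

    def sample(self, num_episodes):
        return random.sample(list(self.episodes), num_episodes)

    def __len__(self):
        return len(self.episodes)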
Further, updating the online Q network with the value decomposition network (VDN) algorithm in step 6 specifically includes:
Step 6-1: for each of the K vehicle users, the sampled round experience of that user, including the local observation o_i^t, the selected associated action a_i^t and the local observation o_i^{t+1} of the next time slot, is fed into its local online Q network and target Q network to obtain the online Q value Q_i(o_i^t, a_i^t) and the maximum target Q value max_a Q'_i(o_i^{t+1}, a). The values of all K vehicle users are then summed to obtain the total online Q value and the total maximum target Q value:
Q_tot^t = Σ_{i=1}^{K} Q_i(o_i^t, a_i^t),
Q'_tot^t = Σ_{i=1}^{K} max_a Q'_i(o_i^{t+1}, a).
Step 6-2: the loss is computed with the Huber loss function and a gradient-descent step is taken on it to update the online Q network. With the temporal-difference error e_t = R_t + γ Q'_tot^t - Q_tot^t, the Huber loss is
L = (1/2) e_t^2 if |e_t| ≤ δ, and L = δ (|e_t| - δ/2) otherwise,
where R_t is the team reward value in time slot t, obtained by summing the single-step rewards r_i^t of the K vehicle users in the round experience, i.e. R_t = Σ_{i=1}^{K} r_i^t, γ is the discount factor controlling the influence of future rewards on the current reward, and δ controls the sensitivity to outliers during learning.
Step 6-3: extract the experience of the next time slot and replace the current experience with it.
Step 6-4: repeat steps 6-1 to 6-3 until all the experience of the whole round has been processed.
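A compact sketch of this update for one time step of a sampled round is given below. The Huber loss call and the tensor operations are standard PyTorch, while the agent containers, batch layout (batch size 1) and hidden-state handling are illustrative assumptions.

import torch
import torch.nn.functional as F

def vdn_step_loss(online_nets, target_nets, obs, actions, team_reward,
                  next_obs, hidden, target_hidden, gamma=0.8, delta=0.1):
    """One TD step: sum per-user Q values (VDN) and apply the Huber loss."""
    q_sum, max_q_next_sum = 0.0, 0.0
    for i, (online, target) in enumerate(zip(online_nets, target_nets)):
        q_i, hidden[i] = online(obs[i], hidden[i])
        q_sum = q_sum + q_i[0, actions[i]]                 # Q_i(o_i^t, a_i^t)
        with torch.no_grad():
            q_next, target_hidden[i] = target(next_obs[i], target_hidden[i])
            max_q_next_sum = max_q_next_sum + q_next.max(dim=-1).values[0]
    td_target = team_reward + gamma * max_q_next_sum       # R_t + gamma * max Q'_tot
    # Huber loss with threshold delta on the temporal-difference error
    return F.huber_loss(q_sum, td_target, delta=delta)

An optimizer step (for example Adam, with the learning rate given in the embodiment) would then back-propagate this loss through the online networks, and the target networks are overwritten with the online parameters every T time slots.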
Compared with a multi-agent independent deep Q learning algorithm and other traditional algorithms, the method can effectively improve energy efficiency and reduce switching overhead at the same time under the heterogeneous vehicle networking environment.
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
In this embodiment, each instant at which a vehicle user moves and takes an association action is referred to as a time slot, and a round is completed when all vehicle users have travelled to the end of the road.
With reference to fig. 1, the heterogeneous internet of vehicles user association method based on multi-agent deep reinforcement learning in this embodiment includes the following specific steps:
step 1: algorithm-related parameters are initialized.
The relevant weights and biases of each vehicle user's local online Q network and target Q network and the hidden-state parameters of the recurrent neural network layer are initialized; both the local online Q network and the target Q network consist of two linear network layers and a Gated Recurrent Unit (GRU) layer.
Step 2: each vehicle user obtains local state information by observing the current environment, then inputs the local state information into a local network to obtain a corresponding Q value, and selects an associated action according to an epsilon-greedy strategy.
Specifically, taking vehicle user i as an example, the local state information of vehicle user i in time slot t is o_i^t = {l_i^t, L_i^{RSU,t}, L^{BS}, L_i^{conn,t-1}}, where l_i^t is the vehicle user's own position in the current time slot, L_i^{RSU,t} contains the positions of the N_max surrounding roadside units nearest to the vehicle user, L^{BS} contains the positions of all vehicle base stations, and L_i^{conn,t-1} contains the positions of all devices the vehicle user was connected to in the previous time slot. The agent feeds the local observation o_i^t into the first linear layer followed by a ReLU activation function; the resulting output, together with the recurrent hidden state h_i^{t-1} obtained in the previous time slot, is fed into the GRU to obtain the current recurrent hidden state h_i^t and a network output value; finally, the network output value is fed into the last linear layer, whose output is the Q value of the online Q network for every action in the action set. The associated action a_i^t of vehicle user i is specifically one of the following: (1) connect to at most N_max roadside units; (2) connect to exactly one vehicle base station. The ε-greedy strategy according to which the action is selected is: with probability ε the vehicle user explores by selecting an associated action uniformly at random from the action space, and with probability 1-ε it selects the associated action with the maximum Q value.
Step 3: each vehicle user associates with an adjacent roadside unit or vehicle base station according to the selected associated action and performs data transmission to obtain the team reward value fed back by the environment.
It is assumed that the vehicle users have all been allocated orthogonal resource blocks. Considering only small-scale fading and path loss for the vehicle-to-vehicle and vehicle-to-infrastructure communication links, the data transmission rate V_i^t of vehicle user i can be expressed as
V_i^t = Σ_{c ∈ C_i^t} B log2(1 + p_c g_{i,c}^t / σ^2),
where C_i^t is the set of roadside units or vehicle base stations associated with vehicle user i in the current time slot, B is the carrier bandwidth, p_c is the transmit power of roadside unit or vehicle base station c, g_{i,c}^t is the channel gain of the link between roadside unit or vehicle base station c and this vehicle user, and σ^2 is the additive white Gaussian noise variance.
For ease of modeling, time is discretized into time slots. The handover overhead H_i^t of vehicle user i in the current time slot t is obtained by comparing C_i^t with C_i^{t-1}, the set of roadside units or vehicle base stations associated in the previous time slot, so that changes of association between consecutive time slots are counted.
The team reward value is the sum of the reward values of all vehicles, i.e. R_t = Σ_{i=1}^{K} r_i^t, where the reward value r_i^t of user i in time slot t combines an energy-efficiency component, weighted by the parameter β ∈ [0,1], with a handover-cost component derived from H_i^t; here V_min is the minimum transmission rate limit of a vehicle user, p_min is the minimum transmit power of a roadside unit, and N_max is the maximum number of roadside units each vehicle user can connect to per time slot.
Step 4: each vehicle user re-observes the current local state information.
Each vehicle user and each vehicle base station updates its position according to the movement formula
v_t = α v_{t-1} + (1 - α) v_mean + σ_v sqrt(1 - α^2) n,
where v_mean and σ_v are respectively the asymptotic mean speed and the standard deviation of the vehicle speed, the memory parameter α represents the dependency of the current speed on the previous speed, and n is an uncorrelated random Gaussian process with mean 0 and variance 1.
Step 5: repeat steps 2 to 4 until the maximum number of steps of the round is reached, and store the experience information of the whole round into the experience pool.
The round experience information is the set of the single-step experience information of the whole round, arranged in step order, where each single-step experience is a quadruple consisting of the local observation o_i^t, the selected associated action a_i^t, the single-step reward r_i^t, and the local observation o_i^{t+1} of the next time slot.
Step 6: and extracting a plurality of rounds of experience from the experience pool, updating the current network by using a value decomposition network VDN algorithm, and copying current network parameters every T time slots to form a new target network. With reference to fig. 2, the algorithm learning specifically includes the following steps:
step 6-1: collecting the extracted vehicle users
Figure BDA0003542751750000081
Respectively inputting the round experience of each user into respective local Q network and target Q network to respectively obtain
Figure BDA0003542751750000082
And
Figure BDA0003542751750000083
all vehicle users are obtained
Figure BDA0003542751750000084
1,2, K and
Figure BDA0003542751750000085
i=1,2, sum of K correspondence to obtain
Figure BDA0003542751750000086
And
Figure BDA0003542751750000087
the method comprises the following specific steps:
Figure BDA0003542751750000088
Figure BDA0003542751750000089
step 6-2: calculating loss according to a Huber loss function, and further performing gradient reduction on the loss to update the local network, wherein the loss function is specifically as follows:
Figure BDA00035427517500000810
where δ is used to control the sensitivity to outliers during learning.
Step 6-3: extracting the experience of the next time slot and replacing the current experience;
step 6-4: repeating steps 1 to 3 until the round of experience is over.
Step 7: repeat steps 2 to 6 until the team reward value converges; training is then complete.
The simulation is programmed in Python, and the parameter settings do not affect generality. The methods compared with the proposed method are: (1) multi-agent Independent Deep Q-learning (IDQN): each vehicle user has an independent deep Q network, treats the other users as part of the environment, and is trained with deep Q-learning; (2) FULL connection algorithm (FULL): connect to the nearest N_max roadside units; (3) strongest-received-signal algorithm (RSS): connect to the associable device with the largest signal strength.
In this embodiment, the roadside units are uniformly distributed on both sides of the road; the road is 1 kilometre long and 7.5 metres wide, and the coverage ranges of the roadside units and the vehicle base stations are 200 m and 50 m respectively. The vehicle users and vehicle base stations drive one-way with asymptotic vehicle speed v_mean, speed standard deviation σ_v, and memory parameter α = 0.1. The simulation sets 5 vehicle users, 5 vehicle base stations and 30 roadside units, together with the maximum number N_max of roadside units a vehicle user can connect to, the minimum vehicle-user transmission rate V_min, and the minimum roadside-unit transmit power p_min. The reward weight parameter β is 0.6, the transmit power of a vehicle base station is 30 dBm, the transmit power of a roadside unit is 32 dBm, the ambient noise power density is -174 dBm/Hz, and the carrier bandwidth is 180 kHz.
In each vehicle user's local Q network, the two linear layers have 64 hidden neurons each, the GRU hidden-state dimension is 64, the experience-pool capacity is 5000, the learning rate η is 0.00002, the discount factor γ is 0.8, the batch size is 32, the Huber parameter δ is 0.1, the activation function is ReLU, and the optimizer is Adam. The experience-pool capacity of the comparison algorithm IDQN is 10000 and its learning rate is 0.0005; the remaining settings are the same as for the VDN algorithm.
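For reference, the hyperparameters listed above can be collected in a single configuration dictionary along the following lines (the keys are illustrative; the values are copied from the embodiment):

VDN_CONFIG = {
    "hidden_units": 64,        # nodes in each of the two linear layers
    "gru_hidden_dim": 64,      # GRU hidden-state dimension
    "replay_capacity": 5000,   # experience-pool capacity (IDQN baseline: 10000)
    "learning_rate": 2e-5,     # eta (IDQN baseline: 5e-4)
    "discount_factor": 0.8,    # gamma
    "batch_size": 32,          # sampled rounds per update
    "huber_delta": 0.1,        # Huber loss parameter delta
    "activation": "ReLU",
    "optimizer": "Adam",
}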
FIG. 3 shows the relationship between the number of roadside units and the reward value, the handover overhead and the average energy efficiency. It can be observed that as the number of roadside units increases, all reward-value curves trend upward. The reason is that a higher roadside-unit density shortens the communication distance, so the transmission rate and hence the energy efficiency increase greatly; although the handover overhead also grows, its increase is far smaller than the gain in energy efficiency.
FIG. 4 shows the relationship between the roadside-unit transmit power and the reward value, the handover overhead and the average energy efficiency. It can be observed that as the transmit power of the roadside units increases, all reward-value curves trend downward. The reason is that, with a fixed number of roadside units, the handover overhead does not fluctuate much, and although the transmission rate rises with the transmit power, it does so by a much smaller margin than the power itself, so the energy efficiency decreases. In addition, as the roadside-unit transmit power increases, the reward value of the proposed method drops noticeably less than the other three baselines and still maintains the best performance.
As can be seen from FIG. 3 and FIG. 4, for the same number of roadside units or the same roadside-unit transmit power, the converged reward of the VDN-based association method is clearly better than the other baselines; when the roadside units are sufficiently dense or their transmit power is sufficient, the method achieves both higher energy efficiency and lower handover overhead.
The foregoing illustrates and describes the principles, general features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (9)

1. A heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning, characterized by comprising the following steps:
Step 1: initialize algorithm-related parameters, including the weight parameters of the local online Q network and the target Q network and the hidden-state parameters of the recurrent neural network layer;
Step 2: each vehicle user obtains local state information by observing the current environment, feeds it into its local online Q network to obtain the corresponding Q values, and selects an associated action according to an ε-greedy strategy;
Step 3: each vehicle user associates with an adjacent roadside unit or vehicle base station according to the selected associated action and performs data transmission to obtain a team reward value fed back by the environment;
Step 4: each vehicle user re-observes the current local state information;
Step 5: repeat step 2 to step 4 until all vehicle users exit the road and association stops; the process from entering the road to exiting the road is taken as one round, and the experience information of the whole round is stored into an experience pool;
Step 6: extract several rounds of experience information from the experience pool, update the online Q network with the value decomposition network (VDN) algorithm, and copy the online Q network parameters every T time slots to form a new target Q network;
Step 7: repeat step 2 to step 6 until the team reward value converges; training is then complete.
2. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein the local online Q network and the target Q network in step 1 each comprise two linear network layers and one gated recurrent unit (GRU) layer.
3. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein the local state information in step 2 is specifically:
the local state information of vehicle user i in time slot t is o_i^t = {l_i^t, L_i^{RSU,t}, L^{BS}, L_i^{conn,t-1}}, where l_i^t is the vehicle user's own position in the current time slot, L_i^{RSU,t} contains the positions of the N_max surrounding roadside units nearest to the vehicle user, N_max being the maximum number of roadside units each vehicle user can connect to in a time slot, L^{BS} contains the positions of all vehicle base stations, and L_i^{conn,t-1} contains the positions of all devices the vehicle user was connected to in the previous time slot.
4. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 3, wherein the associated action of vehicle user i is one of the following:
(1) connect to at most N_max roadside units;
(2) connect to exactly one vehicle base station.
5. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein the ε-greedy strategy in step 2 is specifically:
with probability ε the vehicle user explores by selecting an associated action uniformly at random from the action space, and with probability 1-ε it selects the associated action with the maximum Q value.
6. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein the team reward value in step 3 is specifically:
assuming that the vehicle users have all been allocated orthogonal resource blocks, and considering only small-scale fading and path loss for the vehicle-to-vehicle and vehicle-to-infrastructure communication links, the data transmission rate V_i^t of vehicle user i is expressed as
V_i^t = Σ_{c ∈ C_i^t} B log2(1 + p_c g_{i,c}^t / σ^2),
where C_i^t is the set of roadside units or vehicle base stations associated with vehicle user i in the current time slot, B is the carrier bandwidth, p_c is the transmit power of roadside unit or vehicle base station c, g_{i,c}^t is the channel gain of the link between roadside unit or vehicle base station c and the vehicle user, and σ^2 is the additive white Gaussian noise variance;
time is discretized into time slots, and the handover overhead H_i^t of vehicle user i in the current time slot t is obtained by comparing C_i^t with C_i^{t-1}, the set of roadside units or the vehicle base station associated in the previous time slot;
the team reward value R_t is the sum of the reward values of all vehicles, i.e. R_t = Σ_{i=1}^{K} r_i^t, where K is the number of vehicle users, and the reward value r_i^t of user i in time slot t combines an energy-efficiency component, weighted by the parameter β ∈ [0,1], with a handover-cost component derived from H_i^t, where V_min is the minimum transmission rate limit of a vehicle user and p_min is the minimum transmit power of a roadside unit.
7. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein in step 4 each vehicle user re-observes the current local state information as follows:
each vehicle user and each vehicle base station updates its position according to the movement formula
v_t = α v_{t-1} + (1 - α) v_mean + σ_v sqrt(1 - α^2) n,
where v_t and v_{t-1} are the speeds in time slot t and in the previous time slot t-1 respectively, v_mean and σ_v are respectively the asymptotic mean speed and the standard deviation of the vehicle speed, the memory parameter α represents the dependency of the current speed on the previous speed, and n is an uncorrelated random Gaussian process with mean 0 and variance 1.
8. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein the round experience information in step 5 is the set of the single-step experience information of the whole round, arranged in step order, where each single-step experience is a quadruple consisting of the local observation o_i^t, the selected associated action a_i^t, the single-step reward r_i^t, and the local observation o_i^{t+1} of the next time slot.
9. The heterogeneous Internet-of-Vehicles user association method based on multi-agent deep reinforcement learning according to claim 1, wherein updating the online Q network with the value decomposition network (VDN) algorithm in step 6 specifically comprises:
Step 6-1: for each of the K sampled vehicle users, the round experience of that user, including the local observation o_i^t, the selected associated action a_i^t and the local observation o_i^{t+1} of the next time slot, is fed into its local online Q network and target Q network to obtain the online Q value Q_i(o_i^t, a_i^t) and the maximum target Q value max_a Q'_i(o_i^{t+1}, a); the values of all K vehicle users are summed to obtain the total online Q value and the total maximum target Q value:
Q_tot^t = Σ_{i=1}^{K} Q_i(o_i^t, a_i^t),
Q'_tot^t = Σ_{i=1}^{K} max_a Q'_i(o_i^{t+1}, a);
Step 6-2: the loss is computed with the Huber loss function and a gradient-descent step is taken on it to update the online Q network; with the temporal-difference error e_t = R_t + γ Q'_tot^t - Q_tot^t, the loss is
L = (1/2) e_t^2 if |e_t| ≤ δ, and L = δ (|e_t| - δ/2) otherwise,
where R_t is the team reward value in time slot t, obtained by summing the single-step rewards r_i^t of the K vehicle users in the round experience, i.e. R_t = Σ_{i=1}^{K} r_i^t, γ is the discount factor controlling the influence of future rewards on the current reward, and δ controls the sensitivity to outliers during learning;
Step 6-3: extract the experience of the next time slot and replace the current experience with it;
Step 6-4: repeat steps 6-1 to 6-3 until all the experience of the whole round has been processed.
CN202210242124.5A 2022-03-11 2022-03-11 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning Active CN114449482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210242124.5A CN114449482B (en) 2022-03-11 2022-03-11 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210242124.5A CN114449482B (en) 2022-03-11 2022-03-11 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114449482A true CN114449482A (en) 2022-05-06
CN114449482B CN114449482B (en) 2024-05-14

Family

ID=81359674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210242124.5A Active CN114449482B (en) 2022-03-11 2022-03-11 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114449482B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018017A (en) * 2022-08-03 2022-09-06 中国科学院自动化研究所 Multi-agent credit allocation method, system and equipment based on ensemble learning
CN115185190A (en) * 2022-09-13 2022-10-14 清华大学 Urban drainage system control method and device based on multi-agent reinforcement learning
CN117234785A (en) * 2023-11-09 2023-12-15 华能澜沧江水电股份有限公司 Centralized control platform error analysis system based on artificial intelligence self-query

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829797A (en) * 2018-04-25 2018-11-16 苏州思必驰信息科技有限公司 Multiple agent dialog strategy system constituting method and adaptive approach
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
CN110956851A (en) * 2019-12-02 2020-04-03 清华大学 Intelligent networking automobile cooperative scheduling lane changing method
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111785045A (en) * 2020-06-17 2020-10-16 南京理工大学 Distributed traffic signal lamp combined control method based on actor-critic algorithm
WO2021097391A1 (en) * 2019-11-16 2021-05-20 Uatc, Llc Systems and methods for vehicle-to-vehicle communications for improved autonomous vehicle operations
CN112995951A (en) * 2021-03-12 2021-06-18 南京航空航天大学 5G Internet of vehicles V2V resource allocation method adopting depth certainty strategy gradient algorithm
CN113568675A (en) * 2021-07-08 2021-10-29 广东利通科技投资有限公司 Internet of vehicles edge calculation task unloading method based on layered reinforcement learning
US20220014963A1 (en) * 2021-03-22 2022-01-13 Shu-Ping Yeh Reinforcement learning for multi-access traffic management
CN113952733A (en) * 2021-05-31 2022-01-21 厦门渊亭信息科技有限公司 Multi-agent self-adaptive sampling strategy generation method
CN114143891A (en) * 2021-11-30 2022-03-04 南京工业大学 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829797A (en) * 2018-04-25 2018-11-16 苏州思必驰信息科技有限公司 Multiple agent dialog strategy system constituting method and adaptive approach
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
WO2021097391A1 (en) * 2019-11-16 2021-05-20 Uatc, Llc Systems and methods for vehicle-to-vehicle communications for improved autonomous vehicle operations
CN110956851A (en) * 2019-12-02 2020-04-03 清华大学 Intelligent networking automobile cooperative scheduling lane changing method
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111785045A (en) * 2020-06-17 2020-10-16 南京理工大学 Distributed traffic signal lamp combined control method based on actor-critic algorithm
CN112995951A (en) * 2021-03-12 2021-06-18 南京航空航天大学 5G Internet of vehicles V2V resource allocation method adopting depth certainty strategy gradient algorithm
US20220014963A1 (en) * 2021-03-22 2022-01-13 Shu-Ping Yeh Reinforcement learning for multi-access traffic management
CN113952733A (en) * 2021-05-31 2022-01-21 厦门渊亭信息科技有限公司 Multi-agent self-adaptive sampling strategy generation method
CN113568675A (en) * 2021-07-08 2021-10-29 广东利通科技投资有限公司 Internet of vehicles edge calculation task unloading method based on layered reinforcement learning
CN114143891A (en) * 2021-11-30 2022-03-04 南京工业大学 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Emre Gures; Ibraheem Shayea; Abdulraqeb Alhammadi; Mustafa Ergen: "A Comprehensive Survey on Mobility Management in 5G Heterogeneous Networks: Architectures, Challenges and Solutions", IEEE Access, 13 October 2020 (2020-10-13) *
Hamza Khan; Anis Elgabli; Sumudu Samarakoon; Mehdi Bennis: "Reinforcement Learning-Based Vehicle-Cell Association Algorithm for Highly Mobile Millimeter Wave Communication", IEEE Transactions on Cognitive Communications and Networking, 12 September 2019 (2019-09-12) *
Yan Lin; Zhengming Zhang; Yongming Huang; Jun Li; Feng Shu; Lajos Hanzo: "Heterogeneous User-Centric Cluster Migration Improves the Connectivity-Handover Trade-Off in Vehicular Networks", IEEE Transactions on Vehicular Technology, 1 December 2020 (2020-12-01) *
Yesheng Zhang; Wei Zu; Yang Gao; Hongxing Chang: "Research on autonomous maneuvering decision of UCAV based on deep reinforcement learning", 2018 Chinese Control and Decision Conference (CCDC), 9 July 2018 (2018-07-09) *
Liu Lei; Chen Chen; Feng Jie; Xiao Tingting; Pei Qingqi: "A Survey of Computation Offloading Techniques for Vehicular Edge Computing" (in Chinese), Acta Electronica Sinica, 15 May 2021 (2021-05-15) *
Wu Jingyuan; Sun Liang; Yang Shu; Li Yan: "An Operation and Testing Fusion Framework for Vehicle-Road Crowd-Intelligence Collaboration" (in Chinese), Radio Engineering, 5 January 2022 (2022-01-05) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018017A (en) * 2022-08-03 2022-09-06 中国科学院自动化研究所 Multi-agent credit allocation method, system and equipment based on ensemble learning
CN115185190A (en) * 2022-09-13 2022-10-14 清华大学 Urban drainage system control method and device based on multi-agent reinforcement learning
CN117234785A (en) * 2023-11-09 2023-12-15 华能澜沧江水电股份有限公司 Centralized control platform error analysis system based on artificial intelligence self-query
CN117234785B (en) * 2023-11-09 2024-02-02 华能澜沧江水电股份有限公司 Centralized control platform error analysis system based on artificial intelligence self-query

Also Published As

Publication number Publication date
CN114449482B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
Elsayed et al. AI-enabled future wireless networks: Challenges, opportunities, and open issues
CN114449482A (en) Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
Shi et al. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach
Liu et al. Asynchronous deep reinforcement learning for collaborative task computing and on-demand resource allocation in vehicular edge computing
CN109068391B (en) Internet of vehicles communication optimization algorithm based on edge calculation and Actor-Critic algorithm
CN113132943B (en) Task unloading scheduling and resource allocation method for vehicle-side cooperation in Internet of vehicles
Wang et al. Collaborative mobile computation offloading to vehicle-based cloudlets
Hossain et al. Multi-objective Harris hawks optimization algorithm based 2-Hop routing algorithm for CR-VANET
CN116156455A (en) Internet of vehicles edge content caching decision method based on federal reinforcement learning
Mekki et al. Vehicular cloud networking: evolutionary game with reinforcement learning-based access approach
Zhou et al. DRL-based low-latency content delivery for 6G massive vehicular IoT
CN115277845A (en) Multi-agent near-end strategy-based distributed edge cache decision method for Internet of vehicles
CN111132083A (en) NOMA-based distributed resource allocation method in vehicle formation mode
Şahin et al. Reinforcement learning scheduler for vehicle-to-vehicle communications outside coverage
Kumar et al. Optimized clustering for data dissemination using stochastic coalition game in vehicular cyber-physical systems
CN115866787A (en) Network resource allocation method integrating terminal direct transmission communication and multi-access edge calculation
Xu et al. Deep reinforcement learning for multi-objective resource allocation in multi-platoon cooperative vehicular networks
Alablani et al. An SDN/ML-based adaptive cell selection approach for HetNets: a real-world case study in London, UK
Ji et al. Multi-Agent Reinforcement Learning Resources Allocation Method Using Dueling Double Deep Q-Network in Vehicular Networks
Wang et al. Energy efficiency resource management for D2D-NOMA enabled network: A dinkelbach combined twin delayed deterministic policy gradient approach
Bhadauria et al. QoS based deep reinforcement learning for V2X resource allocation
Liu et al. A Q-learning based adaptive congestion control for V2V communication in VANET
Bali et al. Learning automata-assisted predictive clustering approach for vehicular cyber-physical system
CN117354833A (en) Cognitive Internet of things resource allocation method based on multi-agent reinforcement learning algorithm
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Lin Yan

Inventor after: Tao Yiyu

Inventor after: Bao Jinming

Inventor after: Zhang Yijin

Inventor after: Zou Jun

Inventor after: Li Jun

Inventor after: Shu Feng

Inventor before: Tao Yiyu

Inventor before: Lin Yan

Inventor before: Bao Jinming

Inventor before: Zhang Yijin

Inventor before: Zou Jun

Inventor before: Li Jun

Inventor before: Shu Feng

GR01 Patent grant
GR01 Patent grant