CN117412391A - Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Info

Publication number
CN117412391A
CN117412391A (application CN202311322831.6A)
Authority
CN
China
Prior art keywords
link
network
delay
agent
kth
Prior art date
Legal status
Pending
Application number
CN202311322831.6A
Other languages
Chinese (zh)
Inventor
张文静
宋晓勤
张莉涓
雷磊
吴志豪
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202311322831.6A priority Critical patent/CN117412391A/en
Publication of CN117412391A publication Critical patent/CN117412391A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/14Spectrum sharing arrangements between different networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/46Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for vehicle-to-vehicle communication [V2V]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473Wireless resource allocation based on the type of the allocated resource the resource being transmission power
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/40Resource management for direct mode communication, e.g. D2D or sidelink
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/535Allocation or scheduling criteria for wireless resources based on resource usage policies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/541Allocation or scheduling criteria for wireless resources based on quality criteria using the level of interference
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/543Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a wireless resource allocation method for the Internet of Vehicles based on an enhanced double deep Q network (EDDQN), applicable to urban road resource allocation scenarios in highly dynamic vehicular environments. The sub-band and power allocation strategy of each V2V link is optimized by introducing prioritized experience replay and multi-step learning into a multi-agent double deep Q network. The algorithm minimizes the total network cost while satisfying user delay, reliability and related constraints. The EDDQN algorithm can allocate resources dynamically according to interference conditions, converges well in the vehicular environment, effectively solves the joint optimization of V2V link channel allocation and power selection, and shows good reliability and robustness under different loads and numbers of links.

Description

Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
Technical Field
The invention relates to Internet of Vehicles technology, in particular to a wireless resource allocation method for the Internet of Vehicles, and more particularly to a wireless resource allocation method for the Internet of Vehicles based on an enhanced double deep Q network (EDDQN, Enhanced Double Deep Q-Network).
Background
With continuous breakthroughs in artificial intelligence, big data, mobile communication and related fields, vehicular networks are developing rapidly. As the information-bearing platform of the new generation of intelligent transportation systems (ITS), the Internet of Vehicles is crucial to improving traffic management in terms of road safety, transportation efficiency and Internet access, and it also provides driving-safety guarantees and real-time road-condition monitoring for emerging applications such as autonomous driving. The Third Generation Partnership Project (3GPP) introduced the Vehicle-to-Everything (V2X) architecture, which covers communication between vehicles and their surroundings and infrastructure, including vehicle-to-vehicle (V2V), vehicle-to-infrastructure, vehicle-to-pedestrian and vehicle-to-network communication, so as to provide wireless communication services to vehicles; the latest Release 17 further enhances and optimizes Internet of Vehicles communication capabilities. As a key application of Internet of Things technology in intelligent transportation systems, the Internet of Vehicles therefore enables intelligent coordination among vehicles, infrastructure and other entities, greatly improving road safety while raising traffic management efficiency.
However, vehicular communication systems face significant challenges when handling large volumes of data. Limited communication resources among vehicles cannot meet the requirement of transmitting large amounts of data in real time, which causes collisions in the information exchanged between communicating vehicle nodes. In addition, because vehicles in the Internet of Vehicles have different demands and priorities, communication resources are allocated unevenly, increasing the transmission delay of some important data. In a complex traffic environment, the Doppler effect and rapidly changing channel state information caused by high-speed vehicle movement, together with the multi-user interference produced by many V2X users sharing the spectrum, strongly affect the reliability and delay of data transmission, so the communication performance of vehicular users degrades easily. It is therefore essential to allocate wireless communication resources in the Internet of Vehicles reasonably and efficiently.
The invention provides an EDDQN-based distributed multi-objective joint optimization resource allocation algorithm for the Internet of Vehicles, targeting urban road resource allocation scenarios in highly dynamic vehicular environments with the goal of minimizing the total network cost. The method introduces prioritized experience replay and multi-step learning into the double deep Q network (DDQN), optimizes the sub-band and power allocation strategy of each V2V link, and achieves good reliability and robustness.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, a method for allocating wireless resources of the Internet of Vehicles based on an enhanced double deep Q network is provided. For urban road resource allocation scenarios in a highly dynamic vehicular environment, a DDQN-based deep reinforcement learning model is adopted, and prioritized experience replay and multi-step learning are introduced into the model, so that the weighted sum of total delay and energy consumption in the system is minimized while the delay limit and the signal-to-noise-ratio threshold are satisfied.
Technical scheme: for urban road resource allocation scenarios in a highly dynamic vehicular environment, reasonable and efficient resource allocation is used to minimize the weighted sum of network delay and energy consumption. To reduce the system cost and improve spectrum utilization, V2V links share channel resources under limited spectrum resources and a given communication density. V2V users (VUEs) that communicate directly with neighbouring vehicles access the network through the PC5 interface, share the limited spectrum resources, realize low-delay and high-reliability direct communication, and efficiently exchange safety-critical information such as inter-vehicle distance and speed. With the distributed resource allocation method, the base station does not need to schedule channel state information centrally; each V2V link is regarded as an agent that selects its channel and transmit power by observing local state information and the channel information broadcast by the base station. A multi-agent deep reinforcement learning model based on DDQN is constructed, and prioritized experience replay and multi-step learning are then introduced to optimize the deep reinforcement learning model. The optimal V2V user transmit power and channel allocation strategy is obtained from the optimized EDDQN model. The invention is realized by the following technical scheme: a method for allocating wireless resources of the Internet of Vehicles based on an enhanced double deep Q network, comprising the following steps:
(1) Constructing a wireless resource distribution system of the Internet of vehicles with vehicle movement and channel time-varying characteristics so as to meet the requirements of low delay and high reliability of V2V communication service in urban bidirectional road environment;
(2) To jointly allocate radio resources (transmission power and subbands) for each V2V link in a reasonably efficient manner, a communication model is established that includes K pairs of V2V links and M subbands;
(3) Calculating the network delay and energy consumption of each V2V link, comprehensively considering the weighted sum of the network delay and the energy consumption to obtain the total cost of the network, and taking the minimum total cost of the network as an optimization target under the condition of meeting the V2V link delay and reliability;
(4) According to the optimization target, constructing a multi-agent deep reinforcement learning model based on DDQN;
(5) To enhance the performance of the deep reinforcement learning model, priority experience playback and multi-step learning skills are introduced to the multi-agent DDQN;
(6) Training the optimized deep reinforcement learning model;
(7) In the execution stage, each V2V link obtains a state according to the current local observation, and loads a trained model to obtain the optimal V2V user transmitting power and channel allocation strategy;
further, the step (1) includes the following specific steps:
(1a) Establishing a V2V user resource allocation system model, wherein the model environment is an urban bidirectional road, road direction and network topology information are introduced, and the road capacity is limited;
(1b) In the system model, the positions of the vehicle users are randomly generated following a spatial Poisson distribution, and each vehicle's direction of travel is determined by its lane;
(1c) V2V users that communicate directly with neighbouring vehicles access the network through the PC5 interface, share the limited spectrum resources, realize low-delay and high-reliability direct communication, and efficiently exchange safety-critical information such as inter-vehicle distance and speed.
Further, the step (2) includes the following specific steps:
(2a) Establishing a communication model for Internet of Vehicles resource allocation. The system contains M sub-bands and K pairs of V2V links, denoted by the sets \mathcal{M} = \{1, 2, \ldots, M\} and \mathcal{K} = \{1, 2, \ldots, K\} respectively; the user equipment of each V2V link requests service through the URLLC slice. The total licensed bandwidth W_0 is divided equally into M sub-channels of bandwidth W. The model adopts orthogonal frequency division multiplexing (OFDM, Orthogonal Frequency Division Multiplexing) for channel transmission, so the sub-bands are mutually orthogonal and do not interfere with one another. However, the same sub-band can be shared by several users, so V2V link users sharing the same sub-band interfere with each other, which affects the achievable transmission rate;
(2b) Within each sub-band the channel power gain is flat. The channel power gain g[m] consists of large-scale fading and small-scale fading and is expressed as:

g[m] = \alpha |h[m]|^2    (Expression 1)

where \alpha is the large-scale fading, including path loss and shadowing, which is constant within a sub-band, and |h[m]|^2 is the small-scale fading, which follows Rayleigh fading and is uncorrelated across sub-bands and time;
(2c) The SINR of the k-th V2V link on the m-th sub-band is expressed as:

\gamma_k[m] = \frac{P_k[m]\, g_k[m]}{\sigma^2 + I_k[m]}    (Expression 2)

where P_k[m] is the transmit power of the k-th V2V link user, g_k[m] is the channel power gain of the channel used by the k-th V2V link, \sigma^2 is the system noise power, and I_k[m] is the interference experienced by the k-th V2V link;
(2d) The interference experienced by the k-th V2V link on the m-th sub-band is expressed as:

I_k[m] = \sum_{k' \neq k} \rho_{k'}[m]\, P_{k'}[m]\, g_{k',k}[m]    (Expression 3)

where \rho_{k'}[m] is the sub-band allocation indicator: \rho_{k'}[m] = 1 means that the k'-th V2V link user multiplexes the m-th sub-band, otherwise \rho_{k'}[m] = 0; P_{k'}[m] is the transmit power of the k'-th V2V link user; and g_{k',k}[m] is the interference channel power gain from the k'-th V2V link to the k-th V2V link.
For the k-th V2V link, the sub-band selection information is expressed as:

\rho_k = \{\rho_k[1], \rho_k[2], \ldots, \rho_k[M]\}    (Expression 4)

It is specified that each link can select only one resource block for transmission at a time, i.e. \sum_{m=1}^{M} \rho_k[m] = 1;
(2e) The transmission rate of the k-th V2V link on the m-th sub-band is expressed as:

R_k[m] = W \log_2(1 + \gamma_k[m])    (Expression 5)
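The per-sub-band SINR and rate computation of Expressions 2-5 can be illustrated with a short sketch. The following Python fragment is only an illustration of the formulas above, not part of the patent; the array layout and variable names (P, g, rho, W, noise_power) are assumptions, and g[kp, k] is taken to denote the gain from the kp-th transmitter to the k-th receiver.

```python
import numpy as np

def sinr_and_rate(P, g, rho, W, noise_power):
    """SINR (Expressions 2-3) and achievable rate (Expression 5) on one sub-band.

    P           : (K,) transmit powers P_k[m] of the K V2V links on this sub-band
    g           : (K, K) channel power gains; g[kp, k] is the gain from the kp-th
                  transmitter to the k-th receiver, g[k, k] is the desired link
    rho         : (K,) 0/1 sub-band allocation indicators rho_k[m]
    W           : sub-band bandwidth in Hz
    noise_power : system noise power sigma^2
    """
    K = len(P)
    sinr = np.zeros(K)
    for k in range(K):
        interference = sum(rho[kp] * P[kp] * g[kp, k] for kp in range(K) if kp != k)
        sinr[k] = rho[k] * P[k] * g[k, k] / (noise_power + interference)
    rate = W * np.log2(1.0 + sinr)   # bit/s, Expression 5
    return sinr, rate
```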
Further, the step (3) comprises the following specific steps:
(3a) Calculating the transmission delay of each V2V link in the communication model. The total transmission delay of the network is composed of the transmission delays of all V2V links and is expressed as:

T = \sum_{k=1}^{K} T_k, \qquad T_k = \frac{D_k}{\sum_{m=1}^{M} \rho_k[m] R_k[m]}    (Expression 6)

where T_k is the transmission delay of the k-th V2V link and D_k is the total load to be transmitted by the k-th V2V link;
(3b) Accordingly, the total transmission energy consumption of the V2V links is expressed as:

E = \sum_{k=1}^{K} E_k, \qquad E_k = \sum_{m=1}^{M} \rho_k[m] P_k[m]\, T_k    (Expression 7)

where E_k is the transmission energy consumption of the k-th V2V link;
(3c) Considering the cost of the V2V links comprehensively, the total network cost, which is the optimization target, is defined as the weighted sum of the total transmission delay and the total transmission energy consumption of the V2V links:

\zeta = \lambda_1 T + \lambda_2 E    (Expression 8)

where \lambda_1 is the weight of the V2V link transmission delay and \lambda_2 is the weight of the V2V link transmission energy consumption, with \lambda_1 + \lambda_2 = 1; the weights measure the relative importance of transmission delay and transmission energy consumption. Since delay and energy consumption may differ in magnitude, \lambda_1 and \lambda_2 are adjusted empirically so that the two terms are brought to the same order of magnitude;
(3d) When considering decentralized resource allocation at the V2V link user side, only the transmission delay is counted as the delay of the V2V link; other scheduling delays of the MAC layer are not considered. The delay constraint of the V2V link is therefore expressed as:

T_k \le T_{\max}    (Expression 9)

where T_{\max} is the maximum tolerable delay of the k-th V2V link;
(3e) The reliability constraint of V2V communication is expressed as:

\gamma_k \ge \gamma_{th}    (Expression 10)

where \gamma_{th} is the user signal-to-noise-ratio threshold of the k-th V2V link;
(3f) In summary, the following objective function and constraints are established:

\min_{\{\rho_k[m],\, P_k[m]\}} \ \zeta
s.t. C1: T_k \le T_{\max}, \forall k
     C2: \gamma_k \ge \gamma_{th}, \forall k
     C3: \sum_{m=1}^{M} \rho_k[m] P_k[m] \le P_{\max}, \forall k
     C4: \rho_k[m] \in \{0, 1\}, \forall k, m
     C5: \sum_{m=1}^{M} \rho_k[m] = 1, \forall k    (Expression 11)

where the objective function minimizes the total cost of the network; constraints C1 and C2 are the delay constraint and the reliability constraint on the V2V links; constraint C3 states that the total power transmitted by a V2V link user over all sub-bands cannot exceed the maximum rated transmit power P_{\max}; and constraints C4 and C5 mean that each V2V link can be allocated only one sub-band at a time, while the same sub-band may be accessed by several V2V links simultaneously;
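To make the objective of Expression 11 concrete, the sketch below evaluates the network cost of Expression 8 and the feasibility checks C1-C3 for a candidate allocation. It is only an illustration under the per-link reconstruction T_k = D_k / R_k and E_k = P_k T_k used above; all parameter names are assumptions.

```python
import numpy as np

def network_cost_and_feasibility(loads, rates, powers, sinr, lam1, lam2,
                                 T_max, gamma_th, P_max):
    """Network cost zeta (Expression 8) and the C1-C3 checks of Expression 11.

    loads  : (K,) payload D_k in bits per V2V link
    rates  : (K,) achieved rates R_k in bit/s (assumed positive)
    powers : (K,) transmit powers P_k in W on the selected sub-band
    sinr   : (K,) achieved SINR gamma_k per link
    """
    delays = loads / rates                     # T_k = D_k / R_k   (Expression 6)
    energy = powers * delays                   # E_k = P_k * T_k   (Expression 7)
    cost = lam1 * delays.sum() + lam2 * energy.sum()   # zeta (Expression 8)
    feasible = (np.all(delays <= T_max)        # C1: delay constraint
                and np.all(sinr >= gamma_th)   # C2: reliability constraint
                and np.all(powers <= P_max))   # C3: rated transmit power
    return cost, feasible
```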
further, the step (4) comprises the following specific steps:
(4a) Each V2V link is regarded as an agent. At each time t, each V2V link obtains its current state from the state space S according to its local observation; the k-th V2V link obtains the current state s_t^k. The agent then uses the DDQN action-value function Q(s_t^k, a; \theta_k) to form the policy \pi and selects an action a_t^k from the action space A; the action selection consists of choosing a transmission sub-band and the corresponding transmit power. According to the policy selections of all V2V links, the environment transitions to a new state s_{t+1}^k, and all agents share an instant reward r_t;
(4b) The state space S is defined as the local observation information related to resource allocation together with low-dimensional fingerprint information, including the local instantaneous channel information G_k[m] of each sub-channel m, the total interference power I_k[m] received on each sub-band from the V2V links sharing it, the total load D_k that the V2V user needs to transmit, the remaining load B_k, the training round number e of the agent, and the random exploration variable \epsilon of the \epsilon-greedy algorithm. The state is expressed as:

s_t^k = \{ G_k[m], I_k[m], D_k, B_k, e, \epsilon \}, \quad m \in \mathcal{M}    (Expression 12)
(4c) The action space A is defined as the transmit power P_k and sub-band C_k selected by the agent, where P_k \in \{1, 2, \ldots, P\} is the transmit power of the V2V link user and C_k \in \{1, 2, \ldots, M\} is the sub-channel accessed by the V2V link user. The action is expressed as:

a_t^k = \{ C_k, P_k \}    (Expression 13)
(4d) Defining a joint reward function r_t. The goal of resource allocation is to minimize the total cost of the network while taking the SINR thresholds and delay constraints of the links into account, so the reward of each agent (Expressions 14 and 15) combines the network cost \zeta with penalty terms for violating the constraints, where C and A_1 are two fixed, relatively large constants, \lambda_3 and \lambda_4 are weight values measuring the importance of the signal-to-noise ratio and the delay, and A_2 is a constant.
The reward function is designed so that the obtained reward is largest when the load of the V2V link has been fully transmitted; during link transmission, a smaller network cost yields a larger reward, and a signal-to-noise ratio or transmission delay that does not meet the requirements incurs a penalty;
(4e) Based on the defined states, actions and rewards, a deep reinforcement learning model is built on top of Q-learning using the DDQN algorithm. When the network is updated, each agent minimizes a loss function by gradient descent; the loss function is expressed as:

L(\theta_t^k) = E_{(s_t^k, a_t^k, r_t, s_{t+1}^k) \sim D} \Big[ \big( r_t + \beta\, Q\big(s_{t+1}^k, \arg\max_{a} Q(s_{t+1}^k, a; \theta_t^k); \theta_t^{k'}\big) - Q(s_t^k, a_t^k; \theta_t^k) \big)^2 \Big]    (Expression 16)

where D is the sample space, \beta \in [0, 1] is the discount factor (\beta \to 1 emphasizes future rewards, while \beta \to 0 emphasizes the current reward), and \theta_t^k and \theta_t^{k'} are the parameters of the real network and the target network of the k-th agent at time t;
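A minimal PyTorch sketch of the double-DQN loss of Expression 16 is given below; it is not the patent's implementation. The network classes, tensor shapes and the mean-squared-error form of the loss are assumptions; the defining feature shown is that the online (real) network selects the next action while the target network evaluates it.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(online_net, target_net, batch, beta):
    """Double-DQN loss (Expression 16) for one mini-batch.

    batch = (states, actions, rewards, next_states, dones); `actions` is an
    int64 tensor of chosen action indices, `dones` marks terminal transitions.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t; theta) from the real (online) network
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # action selected by the online network, evaluated by the target network
        best_a = online_net(next_states).argmax(dim=1, keepdim=True)
        q_next = target_net(next_states).gather(1, best_a).squeeze(1)
        target = rewards + beta * (1.0 - dones) * q_next
    return F.mse_loss(q_sa, target)
```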
further, the step (5) comprises the following specific steps:
(5a) Prioritized experience replay and multi-step learning are introduced into the multi-agent DDQN to alleviate the Q-value overestimation problem. Multi-step learning combines the instant reward obtained from an action with the n-step estimated return of that action; the accumulated discounted reward of multi-step learning is expressed as:

R_t^{(n)} = \sum_{i=0}^{n-1} \beta^{i}\, r_{t+i}    (Expression 17)

Accordingly, the loss function is further adjusted, and the multi-step loss function is expressed as:

L(\theta_t^k) = E\Big[ w_t \big( R_t^{(n)} + \beta^{n} Q\big(s_{t+n}^k, \arg\max_{a} Q(s_{t+n}^k, a; \theta_t^k); \theta_t^{k'}\big) - Q(s_t^k, a_t^k; \theta_t^k) \big)^2 \Big]    (Expression 18)

where w_t is the importance weight of the sample drawn at time t and n is the learning step length of multi-step learning;

w_t = (x_t + \epsilon)^{\sigma}    (Expression 19)

where x_t is the priority of the sample at time t, measured by its empirical TD error; \epsilon is a small constant that prevents the weight from becoming 0; and \sigma is a hyper-parameter that controls the sampling probability of the samples, usually taking a value in [0, 1];
(5b) Every fixed number of training iterations, the target network parameters are updated with the real network parameters. The parameter update of the real network is expressed as:

\theta_k \leftarrow \theta_k - \eta\, \nabla_{\theta_k} L(\theta_k)    (Expression 20)

where \nabla_{\theta_k} L(\theta_k) is the gradient computed by the k-th agent and \eta is the learning rate, a hyper-parameter;
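The multi-step return of Expression 17 and the priority-based sampling implied by Expression 19 can be sketched as follows. This is an illustration only: the interpretation that the quantity (x_t + \epsilon)^\sigma serves both as the sampling priority and, after normalization, as the probability distribution used in step (6g) is an assumption, and the default values of eps and sigma are illustrative.

```python
import numpy as np

def n_step_return(rewards, beta):
    """Accumulated discounted reward of Expression 17 over n consecutive steps."""
    return sum((beta ** i) * r for i, r in enumerate(rewards))

def sampling_probabilities(td_errors, eps=1e-6, sigma=0.6):
    """Priorities x_t built from TD errors, turned into sampling probabilities.

    sigma in [0, 1] controls how strongly replay favours high-error samples
    (sigma = 0 recovers uniform replay).
    """
    priorities = (np.abs(td_errors) + eps) ** sigma     # Expression 19
    return priorities / priorities.sum()
```

The hard update of step (5b), \theta'_k \leftarrow \theta_k, is then performed every fixed number of gradient steps, each gradient step applying Expression 20 to the real network parameters.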
further, the step (6) comprises the following specific steps:
(6a) Start the environment simulator to generate vehicles and links, compute the channel fading and other related parameters, set the initial load D_k = d and the maximum tolerable delay T_{\max} = \Gamma for all V2V links, and initialize the real network parameters \theta_k and the target network parameters \theta'_k of each agent;
(6b) Initialize the training round counter p;
(6c) Update the vehicle positions, channel large-scale fading, total load and other parameters; initialize the time step t within round p, and update the channel small-scale fading, remaining load and other parameters;
(6d) Each agent asynchronously outputs an action a_t^k based on its input state s_t^k and obtains an instant reward r_t; at the same time the environment moves to the next state s_{t+1}^k, thereby producing the training data (s_t^k, a_t^k, r_t, s_{t+1}^k);
(6e) When t > N, compute the N-step reward and store the training data in the respective experience pools;
(6f) Each agent computes the TD error and the priority x_t of its own samples;
(6g) Each agent normalizes all sample priorities x_t in its own experience pool and obtains the corresponding probability distribution;
(6h) Each agent randomly draws a small batch data set D from its experience pool according to these probabilities and feeds it into the real network \theta_k;
(6i) Each agent computes the loss value through the real network and the target network, and updates the parameters of its real network \theta_k using a mini-batch gradient descent strategy with the back-propagation algorithm of the neural network;
(6j) When the number of training iterations reaches the target-network update interval, the target network parameters \theta'_k are updated from the real network parameters \theta_k;
(6k) Judge whether t < K is satisfied, where K here denotes the total number of time steps in round p; if so, set t = t + 1 and go to step (6c), otherwise go to step (6l);
(6l) Judge whether p < I is satisfied, where I is the preset threshold on the number of training rounds; if so, set p = p + 1 and go to step (6c), otherwise the optimization is finished and the optimized deep reinforcement learning model is obtained;
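The training procedure of steps (6a)-(6l) can be summarized by the skeleton below. It is a sketch only: the env and agent objects and their methods (reset, step, act, store, ready, learn, sync_target, updates) are assumed interfaces, not part of the patent.

```python
def train_eddqn(env, agents, episodes, steps_per_episode, batch_size,
                target_sync_interval):
    """Skeleton of the EDDQN training loop in steps (6a)-(6l)."""
    for episode in range(episodes):                      # (6b) round counter
        observations = env.reset()                       # (6c) positions, fading, loads
        for t in range(steps_per_episode):
            # (6d) each agent acts on its own local observation
            actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
            next_observations, rewards, done = env.step(actions)
            for agent, obs, act, rew, nxt in zip(agents, observations, actions,
                                                 rewards, next_observations):
                agent.store(obs, act, rew, nxt, done)    # (6e) n-step buffering inside
                if agent.ready(batch_size):
                    agent.learn(batch_size)              # (6f)-(6i) prioritized mini-batch update
                    if agent.updates % target_sync_interval == 0:
                        agent.sync_target()              # (6j) theta' <- theta
            observations = next_observations
            if done:                                     # (6k)/(6l) loop bookkeeping
                break
```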
further, the step (7) comprises the following specific steps:
(7a) Each V2V link obtains its state s_t^k from the current local observation and feeds s_t^k into the EDDQN deep reinforcement learning model that has been optimized and trained;
(7b) The model outputs the optimal action strategy a_t^k, from which the optimal V2V user transmit power P_k and allocated channel C_k are obtained.
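In the execution stage no exploration is needed; each link simply acts greedily on the trained real network. A minimal sketch follows (the network class and the state encoding are assumptions):

```python
import torch

def execute_policy(online_net, state):
    """Greedy action selection with the trained real network (step 7)."""
    with torch.no_grad():
        q_values = online_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    # the flat action index is decoded elsewhere into the pair (C_k, P_k)
    return int(q_values.argmax(dim=1).item())
```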
Beneficial effects: in the Internet of Vehicles wireless resource allocation method based on the enhanced double deep Q network, prioritized experience replay and multi-step learning are introduced for urban road resource allocation scenarios in highly dynamic vehicular environments, and the optimized deep reinforcement learning model is used to obtain the optimal joint strategy for V2V link channel allocation and transmit power. By selecting appropriate transmit powers and channels, the V2V users minimize the total network cost while satisfying the user delay, reliability and related constraints. The algorithm allocates resources dynamically according to interference conditions and converges well in the vehicular environment.
In summary, the distributed multi-objective joint optimization method for Internet of Vehicles wireless resource allocation based on the enhanced double deep Q network performs well in minimizing the network cost while guaranteeing reasonable resource allocation, low interference among V2V links and low computational complexity.
Drawings
FIG. 1 is a schematic diagram of the DDQN deep reinforcement learning algorithm framework with prioritized experience replay and multi-step learning provided by an embodiment of the invention;
FIG. 2 is a convergence plot of deep Q network training under the EDDQN algorithm provided by an embodiment of the invention;
FIG. 3 is a simulation plot of the relationship between the total network cost and the load under the EDDQN algorithm provided by an embodiment of the invention.
Detailed Description
The core idea of the invention is as follows: for urban road resource allocation scenarios in a highly dynamic vehicular environment, each V2V link is regarded as an agent; prioritized experience replay and multi-step learning are introduced into the multi-agent double deep Q network, and the optimized deep reinforcement learning model is used to obtain the optimal joint strategy for V2V link channel allocation and transmit power. By selecting appropriate transmit powers and channels, the V2V link users minimize the total network cost while satisfying the user delay, reliability and related constraints.
The present invention is described in further detail below.
Step (1), constructing a vehicle networking wireless resource distribution system with vehicle movement and channel time-varying characteristics so as to meet the low-delay and high-reliability requirements of V2V communication service in urban bidirectional road environments, wherein the method specifically comprises the following steps:
(1a) Establishing a V2V user resource allocation system model, wherein the model environment is an urban bidirectional road, road direction and network topology information are introduced, and the road capacity is limited;
(1b) In the system model, the positions of the vehicle users are randomly generated following a spatial Poisson distribution, and each vehicle's direction of travel is determined by its lane (see the sketch after step (1c));
(1c) V2V users that communicate directly with neighbouring vehicles access the network through the PC5 interface, share the limited spectrum resources, realize low-delay and high-reliability direct communication, and efficiently exchange safety-critical information such as inter-vehicle distance and speed;
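Step (1b) can be illustrated by the following sketch, which drops vehicles along an urban bidirectional road using exponential inter-vehicle gaps (the one-dimensional equivalent of a spatial Poisson process). All parameter names and the lane/direction convention are illustrative assumptions.

```python
import numpy as np

def drop_vehicles(road_length_m, density_per_m, num_lanes=2, rng=None):
    """Place vehicles following a spatial Poisson process on a bidirectional road."""
    rng = rng or np.random.default_rng()
    vehicles = []
    for lane in range(num_lanes):
        direction = +1 if lane < num_lanes // 2 else -1   # half the lanes drive each way
        x = 0.0
        while True:
            x += rng.exponential(1.0 / density_per_m)     # Poisson process: exponential gaps
            if x > road_length_m:
                break
            vehicles.append({"lane": lane, "position": x, "direction": direction})
    return vehicles
```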
step (2), in order to jointly allocate radio resources (transmission power and subbands) for each V2V link in a reasonably efficient manner, a communication model is established comprising K pairs of V2V links and M subbands, specifically:
(2a) Establishing a communication model for Internet of Vehicles resource allocation. The system contains M sub-bands and K pairs of V2V links, denoted by the sets \mathcal{M} = \{1, 2, \ldots, M\} and \mathcal{K} = \{1, 2, \ldots, K\} respectively; the user equipment of each V2V link requests service through the URLLC slice. The total licensed bandwidth W_0 is divided equally into M sub-channels of bandwidth W. The model adopts orthogonal frequency division multiplexing for channel transmission, so the sub-bands are mutually orthogonal and do not interfere with one another. However, the same sub-band can be shared by several users, so the VUEs sharing the same sub-band interfere with each other, which affects the achievable transmission rate;
(2b) Within each sub-band the channel power gain is flat. The channel power gain g[m] consists of large-scale fading and small-scale fading and is expressed as (a sampling sketch is given after step (2e)):

g[m] = \alpha |h[m]|^2    (Expression 1)

where \alpha is the large-scale fading, including path loss and shadowing, which is constant within a sub-band, and |h[m]|^2 is the small-scale fading, which follows Rayleigh fading and is uncorrelated across sub-bands and time;
(2c) The SINR of the k-th V2V link on the m-th sub-band is expressed as:

\gamma_k[m] = \frac{P_k[m]\, g_k[m]}{\sigma^2 + I_k[m]}    (Expression 2)

where P_k[m] is the transmit power of the k-th V2V link user, g_k[m] is the channel power gain of the channel used by the k-th V2V link, \sigma^2 is the system noise power, and I_k[m] is the interference experienced by the k-th V2V link;
(2d) The interference experienced by the k-th V2V link on the m-th sub-band is expressed as:

I_k[m] = \sum_{k' \neq k} \rho_{k'}[m]\, P_{k'}[m]\, g_{k',k}[m]    (Expression 3)

where \rho_{k'}[m] is the sub-band allocation indicator: \rho_{k'}[m] = 1 means that the k'-th V2V link user multiplexes the m-th sub-band, otherwise \rho_{k'}[m] = 0; P_{k'}[m] is the transmit power of the k'-th V2V link user; and g_{k',k}[m] is the interference channel power gain from the k'-th V2V link to the k-th V2V link.
For the k-th V2V link, the sub-band selection information is expressed as:

\rho_k = \{\rho_k[1], \rho_k[2], \ldots, \rho_k[M]\}    (Expression 4)

It is specified that each link can select only one resource block for transmission at a time, i.e. \sum_{m=1}^{M} \rho_k[m] = 1;
(2e) The transmission rate of the k-th V2V link on the m-th sub-band is expressed as:

R_k[m] = W \log_2(1 + \gamma_k[m])    (Expression 5)
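The fading model of Expression 1 can be illustrated with a short sketch; the Rayleigh small-scale term is drawn as a unit-variance complex Gaussian sample. Variable names are illustrative, not from the patent.

```python
import numpy as np

def channel_power_gain(alpha, rng=None):
    """Sample g[m] = alpha * |h[m]|^2 of Expression 1 for one sub-band.

    alpha : large-scale fading (path loss and shadowing), fixed within the sub-band
    """
    rng = rng or np.random.default_rng()
    # Rayleigh small-scale fading: |h|^2 of a unit-variance complex Gaussian is Exp(1)
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2.0)
    return alpha * np.abs(h) ** 2
```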
Step (3), calculating the network delay and energy consumption of each V2V link, comprehensively considering the weighted sum of the network delay and the energy consumption to obtain the total network cost, and taking the minimum total network cost as the optimization target subject to the V2V link delay and reliability requirements, specifically:
(3a) Calculating the transmission delay of each V2V link in the communication model. The total transmission delay of the network is composed of the transmission delays of all V2V links and is expressed as:

T = \sum_{k=1}^{K} T_k, \qquad T_k = \frac{D_k}{\sum_{m=1}^{M} \rho_k[m] R_k[m]}    (Expression 6)

where T_k is the transmission delay of the k-th V2V link and D_k is the load to be transmitted by the k-th V2V link;
(3b) Accordingly, the total transmission energy consumption of the V2V links can be expressed as:

E = \sum_{k=1}^{K} E_k, \qquad E_k = \sum_{m=1}^{M} \rho_k[m] P_k[m]\, T_k    (Expression 7)

where E_k is the transmission energy consumption of the k-th V2V link;
(3c) Considering the costs of the V2V links comprehensively, the total network cost, which is the optimization target, is defined as the weighted sum of the transmission delay and the transmission energy consumption of the V2V links:

\zeta = \lambda_1 T + \lambda_2 E    (Expression 8)

where \lambda_1 is the weight of the V2V link transmission delay and \lambda_2 is the weight of the V2V link transmission energy consumption, with \lambda_1 + \lambda_2 = 1; the weights measure the relative importance of transmission delay and transmission energy consumption. Since delay and energy consumption may differ in magnitude, \lambda_1 and \lambda_2 are adjusted empirically so that the two terms are brought to the same order of magnitude;
(3d) When considering decentralized resource allocation at the V2V link user side, only the transmission delay is counted as the delay of the V2V link; other scheduling delays of the MAC layer are not considered. The delay constraint of the V2V link is therefore expressed as:

T_k \le T_{\max}    (Expression 9)

where T_{\max} is the maximum tolerable delay of the k-th V2V link;
(3e) The reliability constraint of V2V communication is expressed as:

\gamma_k \ge \gamma_{th}    (Expression 10)

where \gamma_{th} is the user signal-to-noise-ratio threshold of the k-th V2V link;
(3f) In summary, the following objective function and constraints can be established:

\min_{\{\rho_k[m],\, P_k[m]\}} \ \zeta
s.t. C1: T_k \le T_{\max}, \forall k
     C2: \gamma_k \ge \gamma_{th}, \forall k
     C3: \sum_{m=1}^{M} \rho_k[m] P_k[m] \le P_{\max}, \forall k
     C4: \rho_k[m] \in \{0, 1\}, \forall k, m
     C5: \sum_{m=1}^{M} \rho_k[m] = 1, \forall k    (Expression 11)

where the objective function minimizes the total cost of the network; constraints C1 and C2 are the delay constraint and the reliability constraint on the V2V links; constraint C3 states that the total power transmitted by a V2V link user over all sub-bands cannot exceed the maximum rated transmit power P_{\max}; and constraints C4 and C5 mean that each V2V link can be allocated only one sub-band at a time, while the same sub-band allows access by several V2V links simultaneously;
step (4), constructing a multi-agent deep reinforcement learning model based on DDQN according to an optimization target, wherein the multi-agent deep reinforcement learning model specifically comprises the following steps:
(4a) Each V2V link is regarded as an agent. At each time t, each V2V link obtains its current state from the state space S according to its local observation; the k-th V2V link obtains the current state s_t^k. The agent then uses the DDQN action-value function Q(s_t^k, a; \theta_k) to form the policy \pi and selects an action a_t^k from the action space A; the action selection consists of choosing a transmission sub-band and the corresponding transmit power. According to the policy selections of all V2V links, the environment transitions to a new state s_{t+1}^k, and all agents share an instant reward r_t;
(4b) The state space S is defined as the local observation information related to resource allocation together with low-dimensional fingerprint information, including the local instantaneous channel information G_k[m] of each sub-channel m, the total interference power I_k[m] received on each sub-band from the V2V links sharing it, the total load D_k that the V2V user needs to transmit, the remaining load B_k, the training round number e of the agent, and the random exploration variable \epsilon of the \epsilon-greedy algorithm. The state is expressed as:

s_t^k = \{ G_k[m], I_k[m], D_k, B_k, e, \epsilon \}, \quad m \in \mathcal{M}    (Expression 12)
(4c) The action space A is defined as the transmit power P_k and sub-band C_k selected by the agent, where P_k \in \{1, 2, \ldots, P\} is the transmit power of the V2V link user and C_k \in \{1, 2, \ldots, M\} is the sub-channel accessed by the V2V link user (an encoding sketch is given after step (4e)). The action is expressed as:

a_t^k = \{ C_k, P_k \}    (Expression 13)
(4d) Defining a joint reward function r_t. The goal of resource allocation is to minimize the total cost of the network while taking the SINR thresholds and delay constraints of the links into account, so the reward of each agent (Expressions 14 and 15) combines the network cost \zeta with penalty terms for violating the constraints, where C and A_1 are two fixed, relatively large constants, \lambda_3 and \lambda_4 are weight values measuring the importance of the signal-to-noise ratio and the delay, and A_2 is a constant. The reward function is designed so that the obtained reward is largest when the load of the V2V link has been fully transmitted; during link transmission, a smaller network cost yields a larger reward, and a signal-to-interference-plus-noise ratio or transmission delay that does not meet the requirements incurs a penalty.
(4e) Based on the defined states, actions and rewards, a deep reinforcement learning model is built on top of Q-learning using the DDQN algorithm. When the network is updated, each agent minimizes a loss function by gradient descent; the loss function is expressed as:

L(\theta_t^k) = E_{(s_t^k, a_t^k, r_t, s_{t+1}^k) \sim D} \Big[ \big( r_t + \beta\, Q\big(s_{t+1}^k, \arg\max_{a} Q(s_{t+1}^k, a; \theta_t^k); \theta_t^{k'}\big) - Q(s_t^k, a_t^k; \theta_t^k) \big)^2 \Big]    (Expression 16)

where D is the sample space, \beta \in [0, 1] is the discount factor (\beta \to 1 emphasizes future rewards, while \beta \to 0 emphasizes the current reward), and \theta_t^k and \theta_t^{k'} are the real network and target network parameters of the k-th agent at time t;
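The state of Expression 12 and the action of Expression 13 can be encoded for a value network as in the sketch below. The flattening order and the mapping from a flat action index to the pair (C_k, P_k) are assumptions made for illustration.

```python
import numpy as np

def build_state(G_k, I_k, D_k, B_k, episode, epsilon):
    """Flatten the local observation of Expression 12 into one feature vector."""
    return np.concatenate([np.asarray(G_k, dtype=float),    # per-sub-band channel gains
                           np.asarray(I_k, dtype=float),    # per-sub-band interference powers
                           [D_k, B_k, episode, epsilon]])   # loads and fingerprint information

def decode_action(index, power_levels, num_subbands):
    """Map a flat Q-network output index to the (sub-band C_k, power P_k) of Expression 13."""
    subband = index // len(power_levels)            # sub-band index in {0, ..., num_subbands - 1}
    power = power_levels[index % len(power_levels)] # one of the discrete transmit power levels
    assert subband < num_subbands
    return subband, power
```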
and (5) introducing skills such as preferential experience playback and multi-step learning to the multi-agent DDQN in order to enhance the performance of the deep reinforcement learning model, wherein the method comprises the following steps of:
(5a) Prioritized experience replay and multi-step learning are introduced into the multi-agent DDQN to alleviate the Q-value overestimation problem. Multi-step learning combines the instant reward obtained from an action with the n-step estimated return of that action; the accumulated discounted reward of multi-step learning is expressed as:

R_t^{(n)} = \sum_{i=0}^{n-1} \beta^{i}\, r_{t+i}    (Expression 17)

Accordingly, the loss function is further adjusted, and the multi-step loss function is expressed as:

L(\theta_t^k) = E\Big[ w_t \big( R_t^{(n)} + \beta^{n} Q\big(s_{t+n}^k, \arg\max_{a} Q(s_{t+n}^k, a; \theta_t^k); \theta_t^{k'}\big) - Q(s_t^k, a_t^k; \theta_t^k) \big)^2 \Big]    (Expression 18)

where w_t is the importance weight of the sample drawn at time t and n is the step length of multi-step learning;

w_t = (x_t + \epsilon)^{\sigma}    (Expression 19)

where x_t is the priority of the sample at time t, measured by its empirical TD error; \epsilon is a small constant that prevents the weight from becoming 0; and \sigma is a hyper-parameter that controls the sampling probability of the samples, usually taking a value in [0, 1];
(5b) Every fixed number of training iterations, the target network parameters are updated with the real network parameters. The parameter update of the real network is expressed as:

\theta_k \leftarrow \theta_k - \eta\, \nabla_{\theta_k} L(\theta_k)    (Expression 20)

where \nabla_{\theta_k} L(\theta_k) is the gradient computed by the k-th agent and \eta is the learning rate, a hyper-parameter;
the step (6) of training the optimized reinforcement learning model specifically comprises the following steps:
(6a) Start the environment simulator to generate vehicles and links, compute the channel fading and other related parameters, set the initial load D_k = d and the maximum tolerable delay T_{\max} = \Gamma for all V2V links, and initialize the real network parameters \theta_k and the target network parameters \theta'_k of each agent;
(6b) Initialize the training round counter p;
(6c) Update the vehicle positions, channel large-scale fading, total load and other parameters; initialize the time step t within round p, and update the channel small-scale fading, remaining load and other parameters;
(6d) Each agent asynchronously outputs an action a_t^k based on its input state s_t^k and obtains an instant reward r_t; at the same time the environment moves to the next state s_{t+1}^k, thereby producing the training data (s_t^k, a_t^k, r_t, s_{t+1}^k);
(6e) When t > N, compute the N-step reward and store the training data in the respective experience pools;
(6f) Each agent computes the TD error and the priority x_t of its own samples;
(6g) Each agent normalizes all sample priorities x_t in its own experience pool and obtains the corresponding probability distribution;
(6h) Each agent randomly draws a small batch data set D from its experience pool according to these probabilities and feeds it into the real network \theta_k (a minimal experience-pool sketch is given after step (6l));
(6i) Each agent computes the loss value through the real network and the target network, and updates the parameters of its real network \theta_k using a mini-batch gradient descent strategy with the back-propagation algorithm of the neural network;
(6j) When the number of training iterations reaches the target-network update interval, the target network parameters \theta'_k are updated from the real network parameters \theta_k;
(6k) Judge whether t < K is satisfied, where K here denotes the total number of time steps in round p; if so, set t = t + 1 and go to step (6c), otherwise go to step (6l);
(6l) Judge whether p < I is satisfied, where I is the preset threshold on the number of training rounds; if so, set p = p + 1 and go to step (6c), otherwise the optimization is finished and the optimized deep reinforcement learning model is obtained;
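Steps (6e)-(6h) maintain a per-agent experience pool with priorities. A minimal sketch is given below; the list-based storage (instead of a sum-tree), the default eps and sigma values, and the use of the normalized priority directly as the weight w_t are simplifying assumptions.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal per-agent prioritized experience pool for steps (6e)-(6h)."""

    def __init__(self, capacity, eps=1e-6, sigma=0.6, rng=None):
        self.capacity, self.eps, self.sigma = capacity, eps, sigma
        self.data, self.priorities = [], []
        self.rng = rng or np.random.default_rng()

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:        # overwrite the oldest sample when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.sigma)  # priority x_t

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()                # step (6g): normalized priorities
        idx = self.rng.choice(len(self.data), size=batch_size, p=probs)
        weights = probs[idx]                       # used as importance weights w_t in the loss
        return [self.data[i] for i in idx], weights, idx
```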
Step (7), in the execution phase, each V2V link obtains its state s_t^k from the current local observation and loads the trained model to obtain the optimal V2V user transmit power and channel allocation strategy, specifically:
(7a) Each V2V link obtains its state s_t^k from the current local observation and feeds s_t^k into the EDDQN deep reinforcement learning model that has been optimized and trained;
(7b) The model outputs the optimal action strategy a_t^k, from which the optimal V2V user transmit power P_k and allocated channel C_k are obtained.
FIG. 1 shows the framework of the DDQN deep reinforcement learning algorithm with prioritized experience replay and multi-step learning. It can be seen that a V2V link acting as an agent takes the priority of the samples into account when storing and drawing samples, thereby realizing prioritized experience replay, and considers the sample information over N time steps when computing the loss function, thereby realizing multi-step learning.
FIG. 2 shows the training convergence of the deep Q network under the EDDQN algorithm. It can be seen that the average reward of the optimized deep Q network gradually and smoothly converges as the number of iterations increases.
FIG. 3 shows the simulation results for the relationship between the total network cost and the load under the EDDQN algorithm. It can be seen that, under different V2V link load conditions, the EDDQN algorithm reduces the network cost by about 4% compared with the D3QN algorithm, by about 11% compared with DDQN, and by more than 47% compared with a random allocation algorithm.
From the above description, it should be apparent to those skilled in the art that the distributed multi-objective joint optimization method for Internet of Vehicles wireless resource allocation based on the enhanced double deep Q network effectively solves the joint optimization problem of V2V link channel allocation and power selection.
What is not described in detail in the present application belongs to the prior art known to those skilled in the art.

Claims (1)

1. An Internet of Vehicles wireless resource allocation method based on an enhanced double deep Q network, characterized by comprising the following steps:
(1) Constructing a wireless resource distribution system of the Internet of vehicles with vehicle movement and channel time-varying characteristics so as to meet the low-delay and high-reliability requirements of communication service between vehicles (V2V) in urban bidirectional road environments;
(2) To jointly allocate radio resources (transmission power and subbands) for each V2V link in a reasonably efficient manner, a communication model is established that includes K pairs of V2V links and M subbands;
(3) Calculating the network delay and energy consumption of each V2V link, comprehensively considering the weighted sum of the network delay and the energy consumption to obtain the total cost of the network, and taking the minimum total cost of the network as an optimization target under the condition of meeting the V2V link delay and reliability;
(4) Constructing a multi-agent deep reinforcement learning model based on a double-depth Q network (DDQN) according to an optimization target;
(5) To enhance the performance of the deep reinforcement learning model, priority experience playback and multi-step learning skills are introduced to the multi-agent DDQN;
(6) Training the optimized deep reinforcement learning model;
(7) In the execution stage, each V2V link obtains a state according to the current local observation, and loads a trained model to obtain the optimal V2V user transmitting power and channel allocation strategy;
further, the step (4) comprises the following specific steps:
(4a) Each V2V link is regarded as an agent. At each time t, each V2V link obtains its current state from the state space S according to its local observation; the k-th V2V link obtains the current state s_t^k. The agent then uses the DDQN action-value function Q(s_t^k, a; \theta_k) to form the policy \pi and selects an action a_t^k from the action space A; the action selection consists of choosing a transmission sub-band and the corresponding transmit power. According to the policy selections of all V2V links, the environment transitions to a new state s_{t+1}^k, and all agents share an instant reward r_t;
(4b) Defining the state space S as the local observation information related to resource allocation together with low-dimensional fingerprint information, including the local instantaneous channel information G_k[m] of each sub-channel m, the total interference power I_k[m] received on each sub-band from the V2V links sharing it, the total load D_k that the V2V link user needs to transmit, the remaining load B_k, the training round number e of the agent, and the random exploration variable \epsilon of the \epsilon-greedy algorithm. The state is expressed as:

s_t^k = \{ G_k[m], I_k[m], D_k, B_k, e, \epsilon \}, \quad m \in \mathcal{M}    (Expression 12)

where \mathcal{M} is the set of sub-channels;
(4c) Defining the action space A as the transmit power P_k and sub-band C_k selected by the agent, where P_k \in \{1, 2, \ldots, P\} is the transmit power of the V2V link user and C_k \in \{1, 2, \ldots, M\} is the sub-channel accessed by the V2V link user. The action is expressed as:

a_t^k = \{ C_k, P_k \}    (Expression 13)
(4d) Defining a joint reward function r_t. The goal of resource allocation is to minimize the total cost of the network while taking the SINR thresholds and delay constraints of the links into account, so the reward of each agent (Expressions 14 and 15) combines the network cost with penalty terms for violating the constraints, where C and A_1 are two fixed, relatively large constants, \zeta is the total cost of the network, \lambda_3 and \lambda_4 are weight values measuring the importance of the signal-to-noise ratio and the delay, K and M are the total number of V2V links and the total number of sub-bands respectively, \rho_k[m] is the sub-band allocation indicator (\rho_k[m] = 1 means that the k-th V2V link user multiplexes the m-th sub-band, otherwise \rho_k[m] = 0), \gamma_k is the signal-to-noise ratio of the k-th V2V link, \gamma_{th} is the user signal-to-noise-ratio threshold of the k-th V2V link, T_k is the transmission delay of the k-th V2V link, T_{\max} is the maximum tolerable delay of the k-th V2V link, and A_2 is a constant.
The reward function is designed so that the obtained reward is largest when the load of the V2V link has been fully transmitted; during link transmission, a smaller network cost yields a larger reward, and a signal-to-noise ratio or transmission delay that does not meet the requirements incurs a penalty;
(4e) Based on the defined states, actions and rewards, a deep reinforcement learning model is built on top of Q-learning using the DDQN algorithm. When the network is updated, each agent minimizes a loss function by gradient descent; the loss function is expressed as:

L(\theta_t^k) = E_{(s_t^k, a_t^k, r_t, s_{t+1}^k) \sim D} \Big[ \big( r_t + \beta\, Q\big(s_{t+1}^k, \arg\max_{a} Q(s_{t+1}^k, a; \theta_t^k); \theta_t^{k'}\big) - Q(s_t^k, a_t^k; \theta_t^k) \big)^2 \Big]    (Expression 16)

where D is the sample space, \beta \in [0, 1] is the discount factor (\beta \to 1 emphasizes future rewards, while \beta \to 0 emphasizes the current reward), and \theta_t^k and \theta_t^{k'} are the real network parameters and the target network parameters of the k-th agent at time t.
CN202311322831.6A 2023-10-12 2023-10-12 Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method Pending CN117412391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311322831.6A CN117412391A (en) 2023-10-12 2023-10-12 Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311322831.6A CN117412391A (en) 2023-10-12 2023-10-12 Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Publications (1)

Publication Number Publication Date
CN117412391A true CN117412391A (en) 2024-01-16

Family

ID=89488195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311322831.6A Pending CN117412391A (en) 2023-10-12 2023-10-12 Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Country Status (1)

Country Link
CN (1) CN117412391A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117793801A (en) * 2024-02-26 2024-03-29 北京理工大学 Vehicle-mounted task unloading scheduling method and system based on hybrid reinforcement learning
CN117793801B (en) * 2024-02-26 2024-04-23 北京理工大学 Vehicle-mounted task unloading scheduling method and system based on hybrid reinforcement learning
CN118175646A (en) * 2024-03-13 2024-06-11 重庆理工大学 Resource allocation method of NGMA IoE network supporting 6G

Similar Documents

Publication Publication Date Title
CN112995951B (en) 5G Internet of vehicles V2V resource allocation method adopting depth certainty strategy gradient algorithm
CN111970733B (en) Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network
CN114389678B (en) Multi-beam satellite resource allocation method based on decision performance evaluation
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
Zhang et al. Team learning-based resource allocation for open radio access network (O-RAN)
CN114885426B (en) 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
CN110769514B (en) Heterogeneous cellular network D2D communication resource allocation method and system
Park et al. Network resource optimization with reinforcement learning for low power wide area networks
Wang et al. Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC
CN105379412B (en) A kind of system and method controlling multiple radio access nodes
CN115278707B (en) NOMA terahertz network energy efficiency optimization method based on intelligent reflector assistance
CN114885420A (en) User grouping and resource allocation method and device in NOMA-MEC system
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
CN111629352B (en) V2X resource allocation method based on Underlay mode in 5G cellular network
CN109819422B (en) Stackelberg game-based heterogeneous Internet of vehicles multi-mode communication method
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN115866787A (en) Network resource allocation method integrating terminal direct transmission communication and multi-access edge calculation
CN116582860A (en) Link resource allocation method based on information age constraint
CN116546462A (en) Multi-agent air-ground network resource allocation method based on federal learning
CN118042633B (en) Joint interference and AoI perception resource allocation method and system based on joint reinforcement learning
CN115173922A (en) CMADDQN network-based multi-beam satellite communication system resource allocation method
CN117715219A (en) Space-time domain resource allocation method based on deep reinforcement learning
CN110753365B (en) Heterogeneous cellular network interference coordination method
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination