CN112995951B - 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm - Google Patents

5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Info

Publication number
CN112995951B
CN112995951B (application CN202110273529.0A)
Authority
CN
China
Prior art keywords
link
channel
resource allocation
user
vehicles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110273529.0A
Other languages
Chinese (zh)
Other versions
CN112995951A (en)
Inventor
王书墨
宋晓勤
柴新越
缪娟娟
王奎宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202110273529.0A priority Critical patent/CN112995951B/en
Publication of CN112995951A publication Critical patent/CN112995951A/en
Application granted granted Critical
Publication of CN112995951B publication Critical patent/CN112995951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/46 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for vehicle-to-vehicle communication [V2V]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/06 Testing, supervising or monitoring using simulated traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/0215 Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices
    • H04W 28/0221 Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices power availability or consumption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/0231 Traffic management, e.g. flow control or congestion control based on communication conditions
    • H04W 28/0236 Traffic management, e.g. flow control or congestion control based on communication conditions radio quality, e.g. interference, losses or delay
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/44 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention provides a vehicle-to-vehicle (V2V) communication resource allocation method based on the deep deterministic policy gradient (DDPG) algorithm. V2V communication accesses a 5G network through network slicing, and a deep reinforcement learning optimization strategy is used to obtain the optimal joint optimization policy for V2V user channel allocation and transmission power. By selecting appropriate transmit powers and channels, V2V users reduce mutual interference among V2V links, and the total system throughput of the V2V links is maximized while the link delay constraint is satisfied. The DDPG algorithm effectively solves the joint optimization problem of V2V user channel allocation and power selection and performs stably when optimizing over a continuous action space.

Description

5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm
Technical Field
The invention relates to Internet of Vehicles technology, in particular to a resource allocation method for the Internet of Vehicles, and more particularly to a Vehicle-to-Vehicle (V2V) communication resource allocation method for the 5G Internet of Vehicles that adopts a Deep Deterministic Policy Gradient (DDPG) algorithm.
Background
Vehicle-to-Everything (V2X) communication is a typical application of the Internet of Things (IoT) in the field of Intelligent Transportation Systems (ITS), and refers to a ubiquitous intelligent vehicle network formed from the in-vehicle network, the Internet, and the mobile vehicle-mounted network. The Internet of Vehicles shares and exchanges data according to agreed communication protocols and data interaction standards. By sensing and cooperating with pedestrians, roadside infrastructure, vehicles, networks, and the cloud in real time, intelligent traffic management and services are realized, for example improving road safety, enhancing road condition awareness, and reducing traffic congestion.
Reasonable Internet of Vehicles resource allocation is crucial for mitigating interference, improving network efficiency, and ultimately optimizing wireless communication performance. Conventional resource allocation schemes mostly allocate resources using slowly varying large-scale fading channel information. The literature has proposed a heuristic, location-dependent uplink resource allocation scheme that features spatial resource reuse without requiring complete channel state information, thus reducing signaling overhead. Other research has developed a framework comprising vehicle grouping, multiplexed channel selection, and power control that can reduce the overall interference of V2V users to the cellular network while maximizing the sum rate or the minimum achievable rate of V2V users. However, as communication traffic grows and the required communication rates increase substantially, the fast variation of the wireless channel caused by high mobility introduces great uncertainty into resource allocation, and traditional resource allocation methods can no longer meet the high-reliability and low-delay requirements of the Internet of Vehicles.
Deep learning provides multi-layered computational models that can learn efficient data representations with multiple levels of abstraction from unstructured sources, providing a powerful data-driven approach to problems traditionally considered difficult. Compared with traditional resource allocation algorithms, resource allocation schemes based on deep reinforcement learning can better meet the high-reliability and low-delay requirements of the Internet of Vehicles. The literature has proposed a novel distributed vehicle-to-vehicle communication resource allocation mechanism based on deep reinforcement learning that can be applied to unicast and broadcast scenarios. In this distributed resource allocation mechanism, the agent, i.e., a V2V link or vehicle, can make a decision to find the best sub-band and transmission power level without waiting for global state information. However, existing V2V resource allocation algorithms based on deep reinforcement learning cannot meet the differentiated service requirements of scenarios such as high bandwidth, large capacity, ultra-high reliability, and low delay in a 5G network.
Therefore, the resource allocation method provided by the invention adopts 5G network slicing, which can provide differentiated services for different application scenarios in a 5G network. At the same time, it adopts the DDPG algorithm, which performs stably when optimizing over a continuous action space, to allocate V2V resources, takes maximizing system throughput as the optimization objective of V2V resource allocation, and achieves a good balance between complexity and performance.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, a V2V user resource allocation method based on the deep reinforcement learning DDPG algorithm is provided, in which V2V communication accesses a 5G network through network slicing. The method can realize V2V user resource allocation that maximizes system throughput at low V2V link delay, under the condition that the V2V links do not interfere with the V2I links.
The technical scheme is as follows: taking the V2V link delay into account, the aim is to maximize the throughput of the communication system through reasonable resource allocation. The 5G network slicing technique is adopted: the V2V links and the V2I links use different slices, so the V2V links do not interfere with the V2I links. With the distributed resource allocation method, the base station is not required to centrally schedule channel state information; each V2V link is treated as an agent that, in each time slot, selects a channel and transmit power based on its instantaneous state information and the information shared by its neighbors. A deep reinforcement learning model is established and optimized with the DDPG algorithm, and the optimal V2V user transmit power and channel allocation strategy is obtained from the optimized model. The invention is realized by the following technical scheme: a method for allocating V2V resources based on 5G network slices using the DDPG algorithm, comprising the following steps:
(1) communication services in the Internet of Vehicles are divided into two types, namely broadband multimedia data transmission between vehicles and roadside infrastructure (V2I) and driving-safety-related data transmission between vehicles (V2V);
(2) dividing V2I and V2V communication traffic into different slices respectively by using a 5G network slicing technology;
(3) the constructed user resource allocation system model is that K pairs of V2V users share a channel with authorized bandwidth B;
(4) by adopting a distributed resource allocation method, under the condition of considering V2V link delay, a deep reinforcement learning model is constructed with the aim of maximizing the throughput of a communication system;
(5) considering the joint optimization problem in a continuous action space, the deep reinforcement learning model is optimized with a deep deterministic policy gradient (DDPG) algorithm comprising three mechanisms: deep learning fitting, soft update, and experience replay;
(6) and obtaining the optimal V2V user transmitting power and channel allocation strategy according to the optimized deep reinforcement learning model.
Further, the step (4) comprises the following specific steps:
(4a) the state space S is specifically defined as the channel information related to resource allocation, including the instantaneous channel information G_t[m] of the V2V link on subchannel m, the interference strength I_{t-1}[m] received on subchannel m in the previous time slot, the number of times N_{t-1}[m] that subchannel m was selected by neighboring V2V links in the previous time slot, the remaining load L_t to be transmitted by the V2V user, and the remaining delay budget U_t, i.e.

s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}

Treating each V2V link as an agent, at each time step the V2V link selects a channel and transmit power based on the current state s_t ∈ S;
(4b) the action space A is defined as the transmit power and the selected channel, denoted as

a_t = {p_t^k, c_t^k[m]}

where p_t^k is the transmit power of the k-th V2V link user and c_t^k[m] describes the use of the m-th channel by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is to select a spectrum sub-band and transmit power for the V2V link that maximize the system throughput of the V2V links while meeting the delay constraint and causing little interference to other V2V links. The reward function can thus be expressed as:

r_t = λ_d · Σ_k C_v[k] − λ_p · (T_0 − U_t)

where T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases.
(4d) according to the established S, A and R, a deep reinforcement learning model is built on the basis of Q-learning. The evaluation function Q(s_t, a_t) represents the discounted reward obtained by performing action a_t from state s_t, and the Q-value update function is:

Q(s_t, a_t) = r_t + γ · max_{a ∈ A} Q(s_{t+1}, a)

where r_t is the instantaneous reward, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by the actions a_t.
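For illustration, a small Python sketch of how one agent's state vector s_t and reward r_t of steps (4a) and (4c) could be assembled is given below; the helper names, the weight values λ_d and λ_p, and the example numbers are assumptions made for the sketch, not values fixed by the invention.

import numpy as np

def build_state(G_t, I_prev, N_prev, L_t, U_t):
    """Assemble s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t} for one V2V agent.

    G_t, I_prev, N_prev are per-subchannel arrays of length M;
    L_t (remaining load) and U_t (remaining delay budget) are scalars.
    """
    return np.concatenate([G_t, I_prev, N_prev, [L_t, U_t]]).astype(np.float32)

def reward(capacities, U_t, T0=0.1, lambda_d=1.0, lambda_p=1.0):
    """Reward following the form in step (4c): weighted V2V sum throughput minus a delay penalty.

    capacities: C_v[k] of the V2V links; T0, lambda_d, lambda_p are illustrative
    example values, not values fixed by the patent.
    """
    return lambda_d * np.sum(capacities) - lambda_p * (T0 - U_t)

# Example usage with M = 4 subchannels (illustrative numbers only)
M = 4
s_t = build_state(np.random.rand(M), np.random.rand(M), np.zeros(M), 1e3, 0.08)
r_t = reward(np.array([2.1e6, 1.7e6]), U_t=0.08)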
Advantageous effects: the invention provides a V2V resource allocation method based on 5G network slicing that adopts the deep deterministic policy gradient algorithm. V2V communication accesses the 5G network through network slicing, a deep reinforcement learning optimization strategy is used to obtain the optimal joint optimization policy for V2V user channel allocation and transmission power, V2V users reduce mutual interference between V2V links by selecting appropriate transmit powers and allocated channels, and the system throughput of the V2V links is maximized under the link delay constraint. The DDPG algorithm effectively solves the joint optimization problem of V2V user channel allocation and power selection and performs stably when optimizing over a continuous action space.
In conclusion, under the conditions of reasonable resource allocation, low interference between V2V links, and low computational complexity, the V2V resource allocation method based on 5G network slicing that adopts the deep deterministic policy gradient algorithm excels at maximizing the throughput of the V2V system.
Drawings
FIG. 1 is a flowchart of a 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a V2V user resource allocation model based on 5G network slicing technology according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a deep reinforcement learning framework based on an Actor-Critic model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a V2V communication deep reinforcement learning model according to an embodiment of the present invention;
Detailed Description
The core idea of the invention is: V2V communication accesses the 5G network through network slicing; a distributed resource allocation method is adopted, each V2V link is treated as an agent, a deep reinforcement learning model is established, and the model is optimized with the DDPG algorithm. The optimal V2V user transmit power and channel allocation strategy is then obtained from the optimized deep reinforcement learning model.
The present invention is described in further detail below.
Step (1), the communication services in the Internet of Vehicles are divided into two types, namely broadband multimedia data transmission between vehicles and roadside infrastructure (V2I) and driving-safety-related data transmission between vehicles (V2V).
Step (2), V2I and V2V traffic are divided into different slices respectively using the 5G network slicing technology.
Step (3), the constructed user resource allocation system model is that K pairs of V2V users share a channel with authorized bandwidth B; the specific steps are as follows:
(3a) a V2V user resource allocation system model is established in which the system contains K pairs of V2V users (VUEs), represented by the set K = {1, 2, ..., K}; the total authorized bandwidth B is divided equally into M sub-channels of bandwidth B_0, and the set of sub-channels is denoted M = {1, 2, ..., M};
(3b) the SINR of the k-th V2V link may be expressed as:

γ_v[k] = p_k · g_k / (σ² + G_d)

where p_k is the transmit power of the k-th V2V link, σ² is the noise power, G_d = Σ_{k'≠k} p_{k'} · g̃_{k'} is the total interference power of all V2V links sharing the same RB, g_k is the channel gain of the k-th V2V link Internet-of-Vehicles user, and g̃_{k'} is the interference gain of the k'-th V2V link to the k-th V2V link. The channel capacity of the k-th V2V link may be expressed as:

C_v[k] = W · log(1 + γ_v[k])   (Expression 3)
(3c) for the k-th V2V link, its channel selection information at time t is denoted by the vector

c_t^k = [c_t^k[1], ..., c_t^k[M]], with c_t^k[m] ∈ {0, 1}

If c_t^k[m] = 1, the m-th channel is used by the k-th V2V link, while c_t^k[i] = 0 for every i ≠ m, i.e.

Σ_{m=1}^{M} c_t^k[m] ≤ 1

where K is the total number of V2V links and M is the total number of channels available in the slice accessed by the V2V links.
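For concreteness, the following Python sketch evaluates the SINR and channel capacity of step (3b) for K V2V links sharing one resource block; the bandwidth, noise power, and gain values are illustrative assumptions, and the base-2 logarithm is assumed for capacity in bit/s.

import numpy as np

def v2v_sinr_and_capacity(p, g, g_tilde, W=180e3, noise_power=1e-13):
    """Per-link SINR and capacity C_v[k] = W * log2(1 + sinr) for K V2V links
    sharing the same resource block.

    p: transmit powers of the K links (W)
    g: direct channel gains g_k
    g_tilde: K x K matrix, g_tilde[kp, k] = interference gain of link k' on link k
    W, noise_power: illustrative bandwidth and noise values (assumptions).
    """
    K = len(p)
    sinr = np.empty(K)
    for k in range(K):
        interference = sum(p[kp] * g_tilde[kp, k] for kp in range(K) if kp != k)
        sinr[k] = p[k] * g[k] / (noise_power + interference)
    capacity = W * np.log2(1.0 + sinr)
    return sinr, capacity

# Example with K = 2 links (illustrative numbers only)
sinr, cap = v2v_sinr_and_capacity(
    p=np.array([0.2, 0.2]),
    g=np.array([1e-6, 8e-7]),
    g_tilde=np.array([[0.0, 1e-8], [2e-8, 0.0]]),
)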
Step (4), a distributed resource allocation method is adopted, and a deep reinforcement learning model is constructed with the goal of maximizing the throughput of the communication system under the condition of considering the delay of the V2V link, and the method comprises the following specific steps:
(4a) the state space S is specifically defined as the observation information related to resource allocation, including the instantaneous channel information G_t[m] of the V2V link on subchannel m, the interference strength I_{t-1}[m] received on subchannel m in the previous time slot, the number of times N_{t-1}[m] that channel m was selected by neighboring V2V links in the previous time slot, the remaining V2V load L_t, and the remaining delay budget U_t, i.e.

s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}
(4b) the action space A is defined as the transmit power and the selected channel, denoted as

a_t = {p_t^k, c_t^k[m]}

where p_t^k is the transmit power of the k-th V2V link user and c_t^k[m] describes the use of the m-th channel by the k-th V2V link user: c_t^k[m] = 1 indicates that the m-th channel is used by the k-th V2V link user, and c_t^k[m] = 0 indicates that the m-th channel is not used by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is to select a spectrum sub-band and transmit power for the V2V link that maximize the system throughput of the V2V links while meeting the delay constraint and causing little interference to other V2V links. The reward function can thus be expressed as:

r_t = λ_d · Σ_k C_v[k] − λ_p · (T_0 − U_t)

where T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases. To obtain a good return over the long term, both the immediate return and future returns should be considered. Thus, the main goal of reinforcement learning is to find a strategy that maximizes the expected cumulative discounted return

G_t = E[ Σ_{j=0}^{∞} β^j · r_{t+j} ]

where β ∈ [0, 1] is a discount factor;
(4d) according to the established S, A and R, a deep reinforcement learning model is built on the basis of Q-learning: the evaluation function Q(s_t, a_t) represents the discounted reward obtained by performing action a_t from state s_t, and the Q-value update function is

Q(s_t, a_t) = r_t + γ · max_{a ∈ A} Q(s_{t+1}, a)

where r_t is the instantaneous reward, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by the actions a_t.
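As a small numerical illustration of the discounted return in (4c) and the one-step Q target in (4d), the following Python sketch computes both from made-up reward and Q values; the discount factors are examples, not values fixed by the patent.

import numpy as np

def discounted_return(rewards, beta=0.9):
    """G_t = sum_j beta^j * r_{t+j} for a finite reward trace (beta is illustrative)."""
    return sum(beta ** j * r for j, r in enumerate(rewards))

def q_target(r_t, q_next_per_action, gamma=0.9):
    """One-step Q-learning target: r_t + gamma * max_a Q(s_{t+1}, a)."""
    return r_t + gamma * np.max(q_next_per_action)

print(discounted_return([1.0, 0.5, 0.25]))        # 1.0 + 0.9*0.5 + 0.81*0.25
print(q_target(1.0, np.array([0.2, 0.7, 0.4])))   # 1.0 + 0.9*0.7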
Step (5): to solve the V2V resource allocation problem based on 5G network slices, the action space of the deep reinforcement learning model built with each V2V link as an agent contains two variables, transmit power and channel selection, and the transmit power is allowed to vary continuously within a certain range. To solve this joint optimization problem in a high-dimensional, and in particular continuous, action space, the deep reinforcement learning model is optimized with the DDPG algorithm, which comprises three mechanisms: deep learning fitting, soft update, and experience replay.
Deep learning fitting means that the DDPG algorithm, based on the Actor-Critic framework, fits a deterministic policy a = μ(s|θ) and an action value function Q(s, a|δ) with deep neural networks parameterized by θ and δ respectively, as shown in figure 3 of the specification.
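As an illustration of this fitting, a minimal PyTorch-style sketch of the two networks μ(s|θ) and Q(s, a|δ) is given below; the layer sizes and the way the continuous power output and per-channel scores are produced are assumptions made for the example, not an architecture fixed by the patent.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s|theta): maps a state to a transmit power and channel scores."""
    def __init__(self, state_dim, num_channels, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.power_head = nn.Linear(hidden, 1)               # continuous, normalized transmit power
        self.channel_head = nn.Linear(hidden, num_channels)  # one score per sub-channel

    def forward(self, state):
        h = self.body(state)
        power = torch.sigmoid(self.power_head(h))             # power level in [0, 1]
        channel_scores = self.channel_head(h)                 # argmax later picks the channel
        return torch.cat([power, channel_scores], dim=-1)

class Critic(nn.Module):
    """Action-value function Q(s, a|delta)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))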
Soft update means the following: the parameters of the action value network are updated frequently by gradient descent and are used to compute the gradient of the policy network, so the learning process of the action value network can easily become unstable; therefore the target networks are updated in a soft manner.
An online network and a target network are created for the policy network and for the action value network respectively: the online policy network μ(s|θ) and target policy network μ′(s|θ′), and the online action value network Q(s, a|δ) and target action value network Q′(s, a|δ′). The online networks are continuously updated by gradient descent during training, and the target networks are updated as

θ′ = τθ + (1 − τ)θ′   (Expression 9)
δ′ = τδ + (1 − τ)δ′   (Expression 10)
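Expressions 9 and 10 amount to a Polyak-style averaging of the online parameters into the target parameters; a minimal PyTorch-style sketch (the value of τ is illustrative, not fixed by the patent) is:

import torch

def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005):
    """theta' = tau * theta + (1 - tau) * theta' for every parameter pair."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), online_net.parameters()):
            tgt.data.mul_(1.0 - tau).add_(tau * src.data)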
The experience replay mechanism addresses the fact that the state-transition samples generated by interaction with the environment are temporally correlated, which easily biases the fitting of the action value function. Therefore, borrowing the experience replay mechanism of the DQN algorithm, collected samples are first placed in a sample pool, and mini-batches are then drawn from the pool at random to train the networks. This removes the correlation and dependency between samples, alleviates the problems of correlated data and non-stationary data distributions, and makes the algorithm easier to converge.
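A compact Python sketch of such a sample pool, assuming transitions are stored as (s_t, a_t, r_t, s_{t+1}) tuples of NumPy arrays, could look like this:

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size pool of (s_t, a_t, r_t, s_{t+1}) transitions with uniform mini-batch sampling."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        batch = random.sample(self.pool, batch_size)
        s, a, r, s_next = map(np.array, zip(*batch))
        return s, a, r, s_next

    def __len__(self):
        return len(self.pool)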
The procedure for optimizing the deep reinforcement learning model with the DDPG algorithm, comprising the three mechanisms of deep learning fitting, soft update and experience replay, is as follows:
(5a) initializing the training round index p;
(5b) initializing the time step t within round p;
(5c) the online Actor policy network receives the state s_t as input and outputs the action a_t, obtaining an immediate reward r_t and moving to the next state s_{t+1}, thereby producing the training tuple (s_t, a_t, r_t, s_{t+1});
(5d) the training tuple (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay pool;
(5e) m training tuples (s_t, a_t, r_t, s_{t+1}) are randomly sampled from the experience replay pool to form a data set, which is fed to the online Actor policy network, the online Critic evaluation network, the target Actor policy network, and the target Critic evaluation network;
(5f) the estimated Q value (target value) is set as

y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1}|θ′) | δ′)   (Expression 11)

and the loss function of the online Critic evaluation network is defined as

L(δ) = (1/m) · Σ_i ( y_i − Q(s_i, a_i|δ) )²

all parameters δ of the current Critic network are then updated through gradient back-propagation in the neural network;
(5g) the sampled policy gradient of the online Actor policy network is defined as

∇_θ J ≈ (1/m) · Σ_i ∇_a Q(s, a|δ)|_{s=s_i, a=μ(s_i)} · ∇_θ μ(s|θ)|_{s=s_i}

and all parameters θ of the current Actor network are updated through gradient back-propagation in the neural network (a Python sketch of steps (5c)–(5h) follows this list);
(5h) if the number of online training steps reaches the target network update frequency, the target network parameters δ′ and θ′ are updated from the online network parameters δ and θ respectively, according to Expressions 9 and 10;
(5i) judging whether t < K, where K is the total number of time steps in round p; if so, set t = t + 1 and go to step (5c); otherwise, go to step (5j);
(5j) judging whether p < I, where I is the threshold set for the number of training rounds; if so, set p = p + 1 and go to step (5b); otherwise, the optimization is finished and the optimized deep reinforcement learning model is obtained.
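Steps (5c)–(5h) can be condensed into a single update routine. The following PyTorch-style sketch assumes actor/critic networks and mini-batches along the lines of the earlier sketches, with illustrative values of γ and τ; it is a sketch of the update logic, not the patent's fixed implementation.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9, tau=0.005):
    """One DDPG optimization step over a sampled mini-batch (steps (5e)-(5h))."""
    s, a, r, s_next = batch  # torch tensors: states, actions, rewards, next states

    # (5f) target y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}|theta')|delta')
    with torch.no_grad():
        y = r.unsqueeze(-1) + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss: mean squared error between y_i and Q(s_i, a_i|delta)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # (5g) actor objective: maximize Q(s, mu(s|theta)|delta), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # (5h) soft update of the target networks (Expressions 9 and 10)
    with torch.no_grad():
        for tgt, src in zip(target_actor.parameters(), actor.parameters()):
            tgt.data.mul_(1.0 - tau).add_(tau * src.data)
        for tgt, src in zip(target_critic.parameters(), critic.parameters()):
            tgt.data.mul_(1.0 - tau).add_(tau * src.data)
    return critic_loss.item(), actor_loss.item()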
Step (6), the optimal V2V user transmit power and channel allocation strategy is obtained from the optimized deep reinforcement learning model, comprising the following steps:
(6a) the state information s_k(t) of the system at a given moment is fed into the deep reinforcement learning model trained with the DDPG algorithm;
(6b) the model outputs the optimal action strategy a_k*(t), from which the optimal V2V user transmit power p_k*(t) and allocated channel c_k*(t) are obtained.
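At inference time, querying the trained Actor for the transmit power and channel of step (6) could look like the sketch below; the mapping of the network outputs to a power value and a channel index follows the hypothetical Actor sketch given earlier, and the power cap is an illustrative assumption.

import torch

def select_action(actor, state_vector, max_power_w=0.2):
    """Query the trained policy: returns (transmit power in watts, selected channel index).

    max_power_w is an illustrative power cap, not a value fixed by the patent.
    """
    actor.eval()
    with torch.no_grad():
        out = actor(torch.as_tensor(state_vector, dtype=torch.float32))
    power = float(out[0]) * max_power_w      # continuous transmit power p_k*
    channel = int(torch.argmax(out[1:]))     # allocated channel c_k*
    return power, channel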
Finally, the drawings in the specification are explained in detail.
In fig. 1, the flow of the 5G Internet of Vehicles V2V resource allocation method using the deep deterministic policy gradient algorithm is described: V2V communication accesses the 5G network through network slicing, and the DDPG-optimized deep reinforcement learning model is used to obtain the optimal joint V2V user channel allocation and transmit power strategy.
In fig. 2, a V2V user resource allocation model based on the 5G network slicing technique is depicted, with V2V communications and V2I communications using different slices.
In fig. 3, the deep learning fitting is illustrated: based on the Actor-Critic framework, the DDPG algorithm fits a deterministic policy a = μ(s|θ) and an action value function Q(s, a|δ) using deep neural networks with parameters θ and δ, respectively.
In FIG. 4, the V2V communication deep reinforcement learning model is described. It can be seen that the V2V link acts as an agent that, based on the current state s_t ∈ S, selects a channel and transmit power according to the reward function.
Based on the description of the present invention, it should be apparent to those skilled in the art that the V2V resource allocation method based on the deep reinforcement learning DDPG algorithm and the 5G network slicing technique can improve system throughput while ensuring that the communication delay meets safety requirements.
Details not described in the present application are well within the skill of those in the art.

Claims (1)

1. A 5G Internet of Vehicles V2V resource allocation method adopting a deep deterministic policy gradient algorithm, characterized by comprising the following steps:
(1) communication services in the Internet of Vehicles are divided into two types, namely broadband multimedia data transmission between vehicles and roadside infrastructure (V2I) and driving-safety-related data transmission between vehicles (V2V);
(2) dividing V2I and V2V communication traffic into different slices respectively by using a 5G network slicing technology;
(3) the constructed user resource allocation system model is that K pairs of V2V users share a channel with authorized bandwidth B;
(4) by adopting a distributed resource allocation method and taking the V2V link delay into account, a deep reinforcement learning model is constructed with the goal of maximizing the throughput of the communication system, in which the state space S is the observation information related to resource allocation, the action space A is the transmit power and the selected channel, and the reward R is a weighted sum of the system throughput and the system delay; specifically:
(4a) the state space S is specifically defined as the observation information related to resource allocation, including the instantaneous channel state information G_t[m] of the V2V link on subchannel m, the interference strength I_{t-1}[m] received on subchannel m in the previous time slot, the number of times N_{t-1}[m] that subchannel m was selected by neighboring V2V links in the previous time slot, the remaining load L_t to be transmitted by the V2V user, and the remaining delay budget U_t, i.e.

s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}

treating the V2V link as an agent, at each time step the V2V link selects a channel and transmit power based on the current state s_t ∈ S;
(4b) the action space A is defined as the transmit power and the selected channel, denoted as

a_t = {p_t^k, c_t^k[m]}

where p_t^k is the transmit power of the k-th V2V link user and c_t^k[m] describes the use of the m-th channel by the k-th V2V link user: c_t^k[m] = 1 indicates that the m-th channel is used by the k-th V2V link user, and c_t^k[m] = 0 indicates that the m-th channel is not used by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is that the V2V link selects a spectrum sub-band and transmit power that maximize the system throughput of the V2V links while satisfying the delay constraint, so the reward function can be expressed as:

r_t = λ_d · Σ_k C_v[k] − λ_p · (T_0 − U_t)

where C_v[k] is the channel capacity of the k-th V2V link, T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases;
(4d) according to the established S, A and R, a deep reinforcement learning model is built on the basis of Q-learning, where the evaluation function Q(s_t, a_t) represents the discounted reward obtained by performing action a_t from state s_t; the Q-value update function is:

Q(s_t, a_t) = r_t + γ · max_{a ∈ A} Q(s_{t+1}, a)

where r_t is the instantaneous reward, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by the actions a_t;
(5) considering the joint optimization problem in a continuous action space, the deep reinforcement learning model is optimized with a deep deterministic policy gradient (DDPG) algorithm comprising three mechanisms: deep learning fitting, soft update, and experience replay;
(6) and obtaining the optimal V2V user transmitting power and channel allocation strategy according to the optimized deep reinforcement learning model.
CN202110273529.0A 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm Active CN112995951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273529.0A CN112995951B (en) 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110273529.0A CN112995951B (en) 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Publications (2)

Publication Number Publication Date
CN112995951A CN112995951A (en) 2021-06-18
CN112995951B true CN112995951B (en) 2022-04-08

Family

ID=76335240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273529.0A Active CN112995951B (en) 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Country Status (1)

Country Link
CN (1) CN112995951B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676958B (en) * 2021-07-28 2023-06-02 北京信息科技大学 Vehicle-to-vehicle network slice bandwidth resource allocation method and device
CN113727306B (en) * 2021-08-16 2023-04-07 南京大学 Decoupling C-V2X network slicing method based on deep reinforcement learning
CN113709882B (en) * 2021-08-24 2023-10-17 吉林大学 Internet of vehicles communication resource allocation method based on graph theory and reinforcement learning
CN113766661B (en) * 2021-08-30 2023-12-26 北京邮电大学 Interference control method and system for wireless network environment
CN113965944A (en) * 2021-09-14 2022-01-21 中国船舶重工集团公司第七一六研究所 Method and system for maximizing delay certainty by ensuring system control performance
CN114245345B (en) * 2021-11-25 2024-04-19 西安电子科技大学 Imperfect channel state information-oriented Internet of vehicles power control method and system
CN114245344A (en) * 2021-11-25 2022-03-25 西安电子科技大学 Internet of vehicles uncertain channel state information robust power control method and system
CN114449482A (en) * 2022-03-11 2022-05-06 南京理工大学 Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
CN114885426B (en) * 2022-05-05 2024-04-16 南京航空航天大学 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
CN114827956A (en) * 2022-05-12 2022-07-29 南京航空航天大学 High-energy-efficiency V2X resource allocation method for user privacy protection
CN114641041B (en) * 2022-05-18 2022-09-13 之江实验室 Internet of vehicles slicing method and device oriented to edge intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110753319A (en) * 2019-10-12 2020-02-04 山东师范大学 Heterogeneous service-oriented distributed resource allocation method and system in heterogeneous Internet of vehicles
CN111083942A (en) * 2018-08-22 2020-04-28 Lg 电子株式会社 Method and apparatus for performing uplink transmission in wireless communication system
CN111137292A (en) * 2018-11-01 2020-05-12 通用汽车环球科技运作有限责任公司 Spatial and temporal attention based deep reinforcement learning for hierarchical lane change strategies for controlling autonomous vehicles
CN111267831A (en) * 2020-02-28 2020-06-12 南京航空航天大学 Hybrid vehicle intelligent time-domain-variable model prediction energy management method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9775045B2 (en) * 2015-09-11 2017-09-26 Intel IP Corporation Slicing architecture for wireless communication
US10986602B2 (en) * 2018-02-09 2021-04-20 Intel Corporation Technologies to authorize user equipment use of local area data network features and control the size of local area data network information in access and mobility management function
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN110972107B (en) * 2018-09-29 2021-12-31 华为技术有限公司 Load balancing method and device
CN112469000A (en) * 2019-09-06 2021-03-09 杨海琴 System and method for vehicle network service on 5G network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111083942A (en) * 2018-08-22 2020-04-28 Lg 电子株式会社 Method and apparatus for performing uplink transmission in wireless communication system
CN111137292A (en) * 2018-11-01 2020-05-12 通用汽车环球科技运作有限责任公司 Spatial and temporal attention based deep reinforcement learning for hierarchical lane change strategies for controlling autonomous vehicles
CN110753319A (en) * 2019-10-12 2020-02-04 山东师范大学 Heterogeneous service-oriented distributed resource allocation method and system in heterogeneous Internet of vehicles
CN111267831A (en) * 2020-02-28 2020-06-12 南京航空航天大学 Hybrid vehicle intelligent time-domain-variable model prediction energy management method

Also Published As

Publication number Publication date
CN112995951A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112995951B (en) 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm
Ning et al. Deep reinforcement learning for vehicular edge computing: An intelligent offloading system
Chen et al. Cooperative edge caching with location-based and popular contents for vehicular networks
CN112954651B (en) Low-delay high-reliability V2V resource allocation method based on deep reinforcement learning
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
Mlika et al. Network slicing with MEC and deep reinforcement learning for the Internet of Vehicles
Qian et al. Leveraging dynamic stackelberg pricing game for multi-mode spectrum sharing in 5G-VANET
CN111970733A (en) Deep reinforcement learning-based cooperative edge caching algorithm in ultra-dense network
CN111132074B (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
CN114885426B (en) 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
Wang et al. Energy-delay minimization of task migration based on game theory in MEC-assisted vehicular networks
Zhang et al. Fuzzy logic-based resource allocation algorithm for V2X communications in 5G cellular networks
CN110267274B (en) Spectrum sharing method for selecting sensing users according to social credibility among users
Lin et al. Popularity-aware online task offloading for heterogeneous vehicular edge computing using contextual clustering of bandits
Vu et al. Multi-agent reinforcement learning for channel assignment and power allocation in platoon-based c-v2x systems
CN115134779A (en) Internet of vehicles resource allocation method based on information age perception
CN116582860A (en) Link resource allocation method based on information age constraint
Ouyang Task offloading algorithm of vehicle edge computing environment based on Dueling-DQN
Zhou et al. Multi-agent few-shot meta reinforcement learning for trajectory design and channel selection in UAV-assisted networks
Gao et al. Reinforcement learning based resource allocation in cache-enabled small cell networks with mobile users
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
Bhadauria et al. QoS based deep reinforcement learning for V2X resource allocation
Mei et al. Semi-decentralized network slicing for reliable V2V service provisioning: A model-free deep reinforcement learning approach
Chaowei et al. Collaborative caching in vehicular edge network assisted by cell-free massive MIMO
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant