CN112995951A - 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm - Google Patents

5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Info

Publication number
CN112995951A
CN112995951A
Authority
CN
China
Prior art keywords
link
resource allocation
channel
user
vehicles
Prior art date
Legal status
Granted
Application number
CN202110273529.0A
Other languages
Chinese (zh)
Other versions
CN112995951B (en)
Inventor
王书墨
宋晓勤
柴新越
缪娟娟
王奎宇
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110273529.0A
Publication of CN112995951A
Application granted
Publication of CN112995951B
Legal status: Active

Classifications

    • H04W 4/44: Services specially adapted for vehicles; communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • H04W 4/46: Services specially adapted for vehicles; vehicle-to-vehicle communication [V2V]
    • H04W 24/02: Supervisory, monitoring or testing arrangements; arrangements for optimising operational condition
    • H04W 24/06: Testing, supervising or monitoring using simulated traffic
    • H04W 28/0221: Traffic management, e.g. flow control or congestion control, based on user or device properties, e.g. power availability or consumption
    • H04W 28/0236: Traffic management based on communication conditions; radio quality, e.g. interference, losses or delay
    • Y02D 30/70: Reducing energy consumption in wireless communication networks
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a vehicle-to-vehicle (V2V) communication resource allocation method based on the deep deterministic policy gradient (DDPG) algorithm. V2V communication accesses the 5G network through network slicing, and deep reinforcement learning is used to obtain a jointly optimized channel allocation and transmit power policy for V2V users. By selecting appropriate transmit powers and channels, V2V users reduce mutual interference between V2V links, and the total system throughput of the V2V links is maximized while the link delay constraint is satisfied. The DDPG algorithm effectively solves the joint optimization of V2V user channel allocation and power selection and performs stably when optimizing over continuous action spaces.

Description

5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm
Technical Field
The invention relates to Internet of Vehicles technology, in particular to a resource allocation method for the Internet of Vehicles, and more particularly to a vehicle-to-vehicle (V2V) communication resource allocation method for the 5G Internet of Vehicles that adopts the deep deterministic policy gradient (DDPG) algorithm.
Background
Vehicle-to-Everything (V2X) communication is a typical application of the Internet of Things (IoT) in the field of Intelligent Transportation Systems (ITS): a ubiquitous intelligent vehicle network formed from intranets, the Internet, and mobile vehicle-mounted networks. The Internet of Vehicles shares and exchanges data according to agreed communication protocols and data interaction standards. By sensing and coordinating pedestrians, roadside facilities, vehicles, networks, and clouds in real time, it enables intelligent traffic management and services, for example improving road safety, enhancing road-condition awareness, and reducing traffic congestion.
Reasonable resource allocation in the Internet of Vehicles is crucial to mitigating interference, improving network efficiency, and ultimately optimizing wireless communication performance. Conventional resource allocation schemes mostly rely on slowly varying large-scale fading channel information. One heuristic, location-dependent uplink resource allocation scheme proposed in the literature features spatial resource reuse without requiring complete channel state information, thereby reducing signaling overhead. Other research has developed a framework comprising vehicle grouping, multiplexed channel selection, and power control that reduces the overall interference of V2V users to the cellular network while maximizing the sum or minimum achievable rate of V2V users. However, as communication traffic grows and rate requirements rise sharply, the rapid variation of wireless channels caused by high mobility introduces great uncertainty into resource allocation, and traditional methods can no longer meet the high-reliability and low-delay requirements placed on the Internet of Vehicles.
Deep learning provides multi-layer computational models that can learn efficient data representations at multiple levels of abstraction from unstructured sources, offering a powerful data-driven approach to many problems traditionally considered difficult. Compared with traditional resource allocation algorithms, schemes based on deep reinforcement learning can better meet the high-reliability and low-delay requirements of the Internet of Vehicles. One paper proposes a novel distributed vehicle-to-vehicle resource allocation mechanism based on deep reinforcement learning that applies to both unicast and broadcast scenarios: the agent, i.e., a V2V link or vehicle, decides on the best sub-band and transmission power level without waiting for global state information. However, existing deep-reinforcement-learning V2V resource allocation algorithms cannot meet the differentiated service requirements of 5G scenarios such as high bandwidth, large capacity, ultra-reliability, and low delay.
The resource allocation method provided by the invention therefore adopts 5G network slicing, which can provide differentiated services for different application scenarios in a 5G network, and allocates V2V resources with the DDPG algorithm, which performs stably when optimizing over continuous action spaces. Taking maximization of system throughput as the optimization target of V2V resource allocation, the method strikes a good balance between complexity and performance.
Disclosure of Invention
Purpose of the invention: aiming at the problems in the prior art, a V2V user resource allocation method based on the deep-reinforcement-learning DDPG algorithm is provided, in which V2V communication accesses the 5G network through network slicing. The method achieves V2V user resource allocation that maximizes system throughput at low V2V link delay, under the condition that the V2V links do not interfere with the V2I links.
The technical scheme is as follows: under the V2V link delay constraint, reasonable resource allocation is used to maximize the throughput of the communication system. The 5G network slicing technique places the V2V links and the V2I links in different slices, so the V2V links do not interfere with the V2I links. With the distributed resource allocation method, the base station need not centrally schedule channel state information; each V2V link is treated as an agent that, in every time slot, selects a channel and transmit power based on instantaneous state information and information shared by its neighbors. A deep reinforcement learning model is established and optimized with the DDPG algorithm, and the optimal V2V user transmit power and channel allocation policy is obtained from the optimized model. The invention is realized by the following technical scheme: a DDPG-based method for allocating V2V resources over 5G network slices, comprising the following steps:
(1) communication services in the Internet of Vehicles are divided into two types: broadband multimedia data transmission between vehicles and roadside infrastructure (V2I), and driving-safety data transmission between vehicles (V2V);
(2) V2I and V2V communication traffic are placed in different slices using 5G network slicing;
(3) a user resource allocation system model is constructed in which K pairs of V2V users share a channel of licensed bandwidth B;
(4) using the distributed resource allocation method and accounting for V2V link delay, a deep reinforcement learning model is constructed with the goal of maximizing communication system throughput;
(5) considering the joint optimization problem over a continuous action space, the deep reinforcement learning model is optimized with the deep deterministic policy gradient (DDPG) algorithm, which comprises three mechanisms: deep-learning function fitting, soft updates, and experience replay;
(6) the optimal V2V user transmit power and channel allocation policy is obtained from the optimized deep reinforcement learning model.
Further, step (4) comprises the following specific steps:
(4a) the state space S is specifically defined as the channel information related to resource allocation, including the instantaneous channel information G_t[m] of the V2V link on sub-channel m, the interference strength I_{t-1}[m] received on sub-channel m in the previous time slot, the number of times N_{t-1}[m] that sub-channel m was selected by adjacent V2V links in the previous time slot, the remaining load L_t to be transmitted by the V2V user, and the remaining delay U_t, i.e.
s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}
Treating the V2V link as an agent, at each step the V2V link selects a channel and transmit power based on the current state s_t ∈ S;
(4b) defining the action space A as the transmit power and the selected channel, denoted
a_t = {P_t^k, c_t^k[m]}
where P_t^k is the transmit power of the k-th V2V link user and c_t^k[m] indicates whether the m-th channel is used by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is to select the spectrum sub-band and transmit power for the V2V link so as to maximize the system throughput of the V2V links while meeting the delay constraint and causing little interference to other V2V links. The reward function can thus be expressed as:
r_t = λ_d·Σ_k C_v[k] − λ_p·(T_0 − U_t)
where T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases.
(4d) establishing a deep reinforcement learning model on the basis of Q-learning according to the established S, A and R, with the evaluation function Q(s_t, a_t) representing the discounted reward obtained by performing action a_t from state s_t; the Q-value update function is:
Q(s_t, a_t) = r_t + γ·max_{a∈A} Q(s_{t+1}, a)
where r_t is the instantaneous reward function, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by actions a_t.
Beneficial effects: the invention provides a V2V resource allocation method over 5G network slices using the deep deterministic policy gradient algorithm. V2V communication accesses the 5G network through network slicing, deep reinforcement learning yields a jointly optimized V2V user channel allocation and transmit power policy, V2V users reduce mutual interference between V2V links by selecting appropriate transmit powers and channels, and the system throughput of the V2V links is maximized while the link delay constraint is satisfied. The DDPG algorithm effectively solves the joint optimization of V2V user channel allocation and power selection and performs stably when optimizing over continuous action spaces.
In conclusion, while ensuring reasonable resource allocation, low interference between V2V links, and low computational complexity, the proposed 5G-network-slice V2V resource allocation method using the deep deterministic policy gradient algorithm is superior at maximizing V2V system throughput.
Drawings
FIG. 1 is a flowchart of the 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the V2V user resource allocation model based on 5G network slicing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the deep reinforcement learning framework based on the Actor-Critic model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the V2V communication deep reinforcement learning model according to an embodiment of the present invention.
Detailed Description
The core idea of the invention is as follows: V2V communication accesses the 5G network through network slicing; a distributed resource allocation method treats each V2V link as an agent; a deep reinforcement learning model is established and optimized with the DDPG algorithm; and the optimal V2V user transmit power and channel allocation policy is obtained from the optimized model.
The present invention is described in further detail below.
Step (1), the communication services in the Internet of Vehicles are divided into two types: broadband multimedia data transmission between vehicles and roadside infrastructure (V2I), and driving-safety data transmission between vehicles (V2V).
Step (2), V2I and V2V communication traffic are placed in different slices using 5G network slicing.
Step (3), the constructed user resource allocation system model is that K pairs of V2V users share a channel of licensed bandwidth B.
The specific steps are as follows:
(3a) establishing the V2V user resource allocation system model: the system comprises K pairs of V2V users (VUEs), denoted by the set K = {1, 2, …, K}, and the total licensed bandwidth B is divided equally into M sub-channels of bandwidth B_0 each, denoted by the set M = {1, 2, …, M};
(3b) the SINR of the k-th V2V link can be expressed as:
γ_v[k] = P_k·g_k / (σ² + G_d)
where G_d = Σ_{k'≠k} P_{k'}·g̃_{k'} is the total interference power of all V2V links sharing the same RB, g_k is the channel gain of the k-th V2V link user, g̃_{k'} is the interference gain of the k'-th V2V link to the k-th V2V link, P_k is the transmit power of the k-th V2V link, and σ² is the noise power. The channel capacity of the k-th V2V link can be expressed as:
C_v[k] = W·log(1 + γ_v[k])   (Expression 3)
(3c) for the k-th V2V link, its channel selection at time t is:
c_t^k = [c_t^k[1], c_t^k[2], …, c_t^k[M]], with c_t^k[m] ∈ {0, 1}
If c_t^k[m] = 1, the m-th channel is used by the k-th V2V link, and c_t^k[i] = 0 for all i ≠ m, i.e.
Σ_{m=1}^{M} c_t^k[m] ≤ 1
where K is the total number of V2V links and M is the total number of channels available in the slice accessed by the V2V links.
Step (4), using the distributed resource allocation method and accounting for V2V link delay, a deep reinforcement learning model is constructed with the goal of maximizing communication system throughput. The specific steps are as follows:
(4a) the state space S is specifically defined as the observation information related to resource allocation, including the instantaneous channel information G_t[m], m ∈ M, of the V2V link on each sub-channel, the interference strength I_{t-1}[m], m ∈ M, received on each sub-channel in the previous time slot, the number of times N_{t-1}[m], m ∈ M, that channel m was selected by adjacent V2V links in the previous time slot, the remaining V2V load L_t, and the remaining delay U_t, i.e.
s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}
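For concreteness, the sketch below assembles the state s_t of one agent as a flat vector suitable for a neural-network input; all numbers are hypothetical.

```python
import numpy as np

M = 3  # number of sub-channels (illustrative)

G_t = np.array([0.8, 0.3, 0.5])     # instantaneous channel info per sub-channel
I_prev = np.array([0.1, 0.6, 0.2])  # interference strength in the previous slot
N_prev = np.array([2, 0, 1])        # times each channel was picked by neighbours
L_t = 0.4                           # remaining load to transmit (normalised)
U_t = 0.7                           # remaining delay budget (normalised)

# s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}, flattened for a network input
s_t = np.concatenate([G_t, I_prev, N_prev, [L_t, U_t]]).astype(np.float32)
print(s_t.shape)  # (3*M + 2,) = (11,)
```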
(4b) defining the action space A as the transmit power and the selected channel, denoted
a_t = {P_t^k, c_t^k[m]}
where P_t^k is the transmit power of the k-th V2V link user and c_t^k[m] indicates the use of the m-th channel by the k-th V2V link user: c_t^k[m] = 1 indicates that the m-th channel is used by the k-th V2V link user, and c_t^k[m] = 0 indicates that the m-th channel is not used by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is to select the spectrum sub-band and transmit power for the V2V link so as to maximize the system throughput of the V2V links while meeting the delay constraint and causing little interference to other V2V links. The reward function can thus be expressed as:
r_t = λ_d·Σ_k C_v[k] − λ_p·(T_0 − U_t)
where T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases. To obtain a good return over the long run, both the immediate return and future returns should be considered; the main goal of reinforcement learning is therefore to find a policy that maximizes the expected cumulative discounted return
G_t = E[ Σ_{n=0}^{∞} β^n·r_{t+n} ]
where β ∈ [0, 1] is a discount factor;
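A minimal sketch of the reward of (4c) and the discounted return, assuming the reconstructed form r_t = λ_d·Σ_k C_v[k] − λ_p·(T_0 − U_t); the weights, delay budget, and capacities below are placeholder values.

```python
import numpy as np

lambda_d, lambda_p = 0.1, 1.0  # weights of the throughput and delay terms (assumed)
T0 = 100e-3                    # maximum tolerable delay in seconds (assumed)

def reward(C_v, U_t):
    """r_t = lambda_d * sum_k C_v[k] - lambda_p * (T0 - U_t)."""
    return lambda_d * np.sum(C_v) - lambda_p * (T0 - U_t)

def discounted_return(rewards, beta=0.9):
    """G_t = sum_n beta**n * r_{t+n}."""
    return sum(beta ** n * r for n, r in enumerate(rewards))

# Three steps with a shrinking delay budget U_t and placeholder capacities
rs = [reward(np.array([1.0, 2.0]), U_t) for U_t in (90e-3, 70e-3, 40e-3)]
print(discounted_return(rs))
```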
(4d) according to the established S, A and R, a deep reinforcement learning model is built on the basis of Q-learning, with the evaluation function Q(s_t, a_t) representing the discounted reward obtained by performing action a_t from state s_t; the Q-value update function is
Q(s_t, a_t) = r_t + γ·max_{a∈A} Q(s_{t+1}, a)
where r_t is the instantaneous reward function, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by actions a_t.
Step (5), to solve the V2V resource allocation problem over 5G network slices, the action space of the deep reinforcement learning model established with each V2V link as an agent contains two variables, transmit power and channel selection, and the transmit power varies continuously within a certain range. To handle this joint optimization problem in a high-dimensional, and in particular continuous, action space, the deep reinforcement learning model is optimized with the DDPG algorithm, which comprises three mechanisms: deep-learning function fitting, soft updates, and experience replay.
Deep-learning function fitting means that the DDPG algorithm, built on the Actor-Critic framework, fits a deterministic policy a = μ(s|θ) and an action-value function Q(s, a|δ) with deep neural networks parameterized by θ and δ respectively, as shown in FIG. 3.
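A sketch of these two fitted functions as small PyTorch networks follows; the layer widths, activations, and tanh output head are assumptions, not specified in the disclosure.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy a = mu(s | theta)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a | delta)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```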
Soft updating addresses the following issue: the parameters of the action-value network are updated frequently by gradient steps and are also used to compute the gradient of the policy network, so the learning process of the action-value network is liable to be unstable; the target networks are therefore updated in a soft manner.
An online network and a target network are created for the policy network and the action-value network respectively: the online policy network μ(s|θ), the target policy network μ'(s|θ'), the online action-value network Q(s, a|δ), and the target action-value network Q'(s, a|δ'). The online networks are continuously updated by gradient descent during training, and the target networks are updated as
θ' ← τ·θ + (1 − τ)·θ'   (Expression 9)
δ' ← τ·δ + (1 − τ)·δ'   (Expression 10)
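Expressions 9 and 10 translate directly into a soft-update helper; the value of τ (here 0.005) is a typical assumption rather than a figure from the disclosure.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """theta' <- tau*theta + (1 - tau)*theta' (Expression 9; likewise delta', Expression 10)."""
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * op)
```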
The experience replay mechanism addresses the fact that the state-transition samples generated by interaction with the environment are temporally correlated, which easily biases the fitting of the action-value function. Borrowing the experience replay mechanism of the DQN algorithm, collected samples are first placed in a sample pool, and mini-batches are then randomly drawn from the pool to train the networks. This removes the correlation and dependency among samples, alleviates the problems of data correlation and non-stationary data distribution, and makes the algorithm easier to converge.
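A minimal replay pool matching this description is sketched below; the capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        return random.sample(self.pool, batch_size)
```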
The deep reinforcement learning model is optimized with the DDPG algorithm, comprising the three mechanisms of deep-learning function fitting, soft updates, and experience replay, through the following steps:
(5a) initialize the number of training rounds p;
(5b) initialize the time step t within round p;
(5c) the online Actor policy network outputs action a_t according to the input state s_t, obtains the instantaneous reward r_t, and moves to the next state s_{t+1}, yielding the training tuple (s_t, a_t, r_t, s_{t+1});
(5d) store the training tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool;
(5e) randomly sample m training tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool to form a mini-batch, and feed it to the online Actor policy network, the online Critic evaluation network, the target Actor policy network, and the target Critic evaluation network;
(5f) set the estimated Q target
y_i = r_i + γ·Q'(s_{i+1}, μ'(s_{i+1}|θ')|δ')   (Expression 11)
define the loss function of the online Critic evaluation network as
L(δ) = (1/m)·Σ_{i=1}^{m} (y_i − Q(s_i, a_i|δ))²
and update all parameters δ of the online Critic network by back-propagating the gradient through the neural network;
(5g) define the sampled policy gradient of the online Actor policy network as
∇_θ J ≈ (1/m)·Σ_{i=1}^{m} ∇_a Q(s, a|δ)|_{s=s_i, a=μ(s_i)} · ∇_θ μ(s|θ)|_{s=s_i}
and update all parameters θ of the online Actor network by back-propagating the gradient through the neural network;
(5h) if the number of online training steps reaches the target-network update frequency, update the target network parameters δ' and θ' from the online network parameters δ and θ according to Expressions 9 and 10;
(5i) check whether t < K, where K is the total number of time steps in round p; if so, set t = t + 1 and return to step (5c); otherwise proceed to step (5j);
(5j) check whether p < I, where I is the set threshold on the number of training rounds; if so, set p = p + 1 and return to step (5b); otherwise the optimization is complete and the optimized deep reinforcement learning model is obtained.
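Steps (5a) through (5j) combine into the training loop sketched below, reusing the Actor/Critic, soft_update, and ReplayPool sketches above. The environment interface (env.reset and env.step returning next state and reward), the Gaussian exploration noise, and every hyperparameter are assumptions for illustration; unlike step (5h), this sketch soft-updates the targets every step for simplicity.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_ddpg(env, state_dim, action_dim, rounds=500, steps_per_round=100,
               gamma=0.99, batch_size=64):
    actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
    actor_t, critic_t = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
    actor_t.load_state_dict(actor.state_dict())    # targets start as copies
    critic_t.load_state_dict(critic.state_dict())
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = ReplayPool()

    for p in range(rounds):                        # (5a), (5j)
        s = env.reset()
        for t in range(steps_per_round):           # (5b), (5i)
            with torch.no_grad():                  # (5c): act with exploration noise
                a = actor(torch.as_tensor(s, dtype=torch.float32))
                a = (a + 0.1 * torch.randn(action_dim)).numpy()
            s_next, r = env.step(a)
            pool.store(s, a, r, s_next)            # (5d)
            s = s_next
            if len(pool.pool) < batch_size:
                continue
            batch = pool.sample(batch_size)        # (5e)
            S, A, R, S2 = (torch.as_tensor(np.array(x), dtype=torch.float32)
                           for x in zip(*batch))
            with torch.no_grad():                  # (5f): y_i = r_i + gamma*Q'(s_{i+1}, mu'(s_{i+1}))
                y = R.unsqueeze(1) + gamma * critic_t(S2, actor_t(S2))
            critic_loss = F.mse_loss(critic(S, A), y)  # mean squared error to the target y
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
            actor_loss = -critic(S, actor(S)).mean()   # (5g): ascend Q along mu
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
            soft_update(actor_t, actor)            # (5h): Expressions 9 and 10
            soft_update(critic_t, critic)
    return actor
```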
Step (6), obtaining the optimal V2V user transmit power and channel allocation policy from the optimized deep reinforcement learning model, comprises the following steps:
(6a) using the deep reinforcement learning model trained by the DDPG algorithm, input the state information s_k(t) of the system at a given moment;
(6b) output the optimal action policy a_k*(t) = μ(s_k(t)|θ), obtaining the optimal V2V user transmit power P_k* and channel allocation c_k*[m].
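Once trained, step (6) is a single forward pass through the online actor. The decoding of the raw action vector into a power level and a channel index below (first component = power, remaining components = channel scores) is an assumed convention, not specified in the disclosure.

```python
import numpy as np
import torch

P_MAX = 0.2  # maximum V2V transmit power in watts (assumed value)

@torch.no_grad()
def allocate(actor, s_k, num_channels):
    """Map state s_k(t) to a transmit power P_k* and channel c_k* via the trained policy."""
    a = actor(torch.as_tensor(s_k, dtype=torch.float32)).numpy()
    power = P_MAX * (a[0] + 1.0) / 2.0               # rescale tanh output [-1, 1] to [0, P_MAX]
    channel = int(np.argmax(a[1:1 + num_channels]))  # choose the highest-scoring channel
    return power, channel
```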
Finally, the drawings in the specification are explained in detail.
In FIG. 1, the flow of the 5G Internet of Vehicles V2V resource allocation method using the deep deterministic policy gradient algorithm is depicted: V2V communication accesses the 5G network through network slicing, and the DDPG-optimized deep reinforcement learning model yields the jointly optimized V2V user channel allocation and transmit power policy.
In FIG. 2, the V2V user resource allocation model based on 5G network slicing is depicted, with V2V and V2I communications using different slices.
In FIG. 3, deep-learning function fitting is depicted: based on the Actor-Critic framework, the DDPG algorithm fits a deterministic policy a = μ(s|θ) and an action-value function Q(s, a|δ) with deep neural networks parameterized by θ and δ respectively.
In FIG. 4, the V2V communication deep reinforcement learning model is depicted. The V2V link acts as an agent that, based on the current state s_t ∈ S, selects a channel and transmit power according to the reward function.
Based on this description, it should be apparent to those skilled in the art that the V2V resource allocation method of the invention, which combines 5G network slicing with the deep-reinforcement-learning DDPG algorithm, can improve system throughput while ensuring that the communication delay meets safety requirements.
Details not described in the present application are well within the skill of those in the art.

Claims (2)

1. A 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm, characterized by comprising the following steps:
(1) communication services in the Internet of Vehicles are divided into two types: broadband multimedia data transmission between vehicles and roadside infrastructure (V2I), and driving-safety data transmission between vehicles (V2V);
(2) V2I and V2V communication traffic are placed in different slices using 5G network slicing;
(3) a user resource allocation system model is constructed in which K pairs of V2V users share a channel of licensed bandwidth B;
(4) using the distributed resource allocation method and accounting for V2V link delay, a deep reinforcement learning model is constructed with the goal of maximizing communication system throughput;
(5) considering the joint optimization problem over a continuous action space, the deep reinforcement learning model is optimized with the deep deterministic policy gradient (DDPG) algorithm, which comprises three mechanisms: deep-learning function fitting, soft updates, and experience replay;
(6) the optimal V2V user transmit power and channel allocation policy is obtained from the optimized deep reinforcement learning model.
2. The 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm according to claim 1, characterized in that step (4) comprises the following specific steps:
(4a) the state space S is specifically defined as the observation information related to resource allocation, including the instantaneous channel state information G_t[m] of the V2V link on sub-channel m, the interference strength I_{t-1}[m] received on sub-channel m in the previous time slot, the number of times N_{t-1}[m] that sub-channel m was selected by adjacent V2V links in the previous time slot, the remaining load L_t to be transmitted by the V2V user, and the remaining delay U_t, i.e.
s_t = {G_t, I_{t-1}, N_{t-1}, L_t, U_t}
Treating the V2V link as an agent, at each step the V2V link selects a channel and transmit power based on the current state s_t ∈ S;
(4b) defining the action space A as the transmit power and the selected channel, denoted
a_t = {P_t^k, c_t^k[m]}
where P_t^k is the transmit power of the k-th V2V link user and c_t^k[m] indicates the use of the m-th channel by the k-th V2V link user: c_t^k[m] = 1 indicates that the m-th channel is used by the k-th V2V link user, and c_t^k[m] = 0 indicates that the m-th channel is not used by the k-th V2V link user;
(4c) defining the reward function R: the goal of V2V resource allocation is for the V2V link to select the spectrum sub-band and transmit power so as to maximize the system throughput of the V2V link while satisfying the delay constraint; the reward function can thus be expressed as:
r_t = λ_d·Σ_k C_v[k] − λ_p·(T_0 − U_t)
where T_0 is the maximum tolerable delay, λ_d and λ_p are the weights of the two parts, and T_0 − U_t is the time already spent on transmission; the penalty increases as the transmission time increases;
(4d) establishing a deep reinforcement learning model on the basis of Q-learning according to the established S, A and R, with the evaluation function Q(s_t, a_t) representing the discounted reward obtained by performing action a_t from state s_t; the Q-value update function is:
Q(s_t, a_t) = r_t + γ·max_{a∈A} Q(s_{t+1}, a)
where r_t is the instantaneous reward function, γ is the discount factor, s_t is the state information of the V2V link at time t, s_{t+1} is the state after the V2V link performs a_t, and A is the action space formed by actions a_t.
CN202110273529.0A 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm Active CN112995951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273529.0A CN112995951B (en) 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110273529.0A CN112995951B (en) 2021-03-12 2021-03-12 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Publications (2)

Publication Number Publication Date
CN112995951A (en) 2021-06-18
CN112995951B CN112995951B (en) 2022-04-08

Family

ID=76335240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273529.0A Active CN112995951B (en) 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm

Country Status (1)

Country Link
CN (1) CN112995951B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170079059A1 (en) * 2015-09-11 2017-03-16 Intel IP Corporation Slicing architecture for wireless communication
US20190174449A1 (en) * 2018-02-09 2019-06-06 Intel Corporation Technologies to authorize user equipment use of local area data network features and control the size of local area data network information in access and mobility management function
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN111083942A (en) * 2018-08-22 2020-04-28 Lg 电子株式会社 Method and apparatus for performing uplink transmission in wireless communication system
CN110972107A (en) * 2018-09-29 2020-04-07 华为技术有限公司 Load balancing method and device
CN111137292A (en) * 2018-11-01 2020-05-12 通用汽车环球科技运作有限责任公司 Spatial and temporal attention based deep reinforcement learning for hierarchical lane change strategies for controlling autonomous vehicles
CN112469000A (en) * 2019-09-06 2021-03-09 杨海琴 System and method for vehicle network service on 5G network
CN110753319A (en) * 2019-10-12 2020-02-04 山东师范大学 Heterogeneous service-oriented distributed resource allocation method and system in heterogeneous Internet of vehicles
CN111267831A (en) * 2020-02-28 2020-06-12 南京航空航天大学 Hybrid vehicle intelligent time-domain-variable model prediction energy management method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAI YU: "A Reinforcement Learning Aided Decoupled RAN Slicing Framework for Cellular V2X", 《GLOBECOM 2020 - 2020 IEEE GLOBAL COMMUNICATIONS CONFERENCE》 *
Guo Caili (郭彩丽): "Research on spectrum sensing and sharing technology for the cognitive Internet of Vehicles driven by dynamic spatio-temporal data", Chinese Journal on Internet of Things (物联网学报) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676958B (en) * 2021-07-28 2023-06-02 北京信息科技大学 Vehicle-to-vehicle network slice bandwidth resource allocation method and device
CN113676958A (en) * 2021-07-28 2021-11-19 北京信息科技大学 Vehicle-to-vehicle network slice bandwidth resource allocation method and device
CN113727306A (en) * 2021-08-16 2021-11-30 南京大学 Decoupling C-V2X network slicing method based on deep reinforcement learning
CN113709882A (en) * 2021-08-24 2021-11-26 吉林大学 Vehicle networking communication resource allocation method based on graph theory and reinforcement learning
CN113709882B (en) * 2021-08-24 2023-10-17 吉林大学 Internet of vehicles communication resource allocation method based on graph theory and reinforcement learning
CN113766661A (en) * 2021-08-30 2021-12-07 北京邮电大学 Interference control method and system for wireless network environment
CN113766661B (en) * 2021-08-30 2023-12-26 北京邮电大学 Interference control method and system for wireless network environment
CN113965944A (en) * 2021-09-14 2022-01-21 中国船舶重工集团公司第七一六研究所 Method and system for maximizing delay certainty by ensuring system control performance
CN114245345A (en) * 2021-11-25 2022-03-25 西安电子科技大学 Internet of vehicles power control method and system for imperfect channel state information
CN114245344A (en) * 2021-11-25 2022-03-25 西安电子科技大学 Internet of vehicles uncertain channel state information robust power control method and system
CN114245345B (en) * 2021-11-25 2024-04-19 西安电子科技大学 Imperfect channel state information-oriented Internet of vehicles power control method and system
CN114449482A (en) * 2022-03-11 2022-05-06 南京理工大学 Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
CN114449482B (en) * 2022-03-11 2024-05-14 南京理工大学 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning
CN114786201A (en) * 2022-04-28 2022-07-22 合肥工业大学 Dynamic cooperative optimization method for communication delay and channel efficiency of wireless network
CN114786201B (en) * 2022-04-28 2024-09-03 合肥工业大学 Dynamic cooperative optimization method for communication delay and channel efficiency of wireless network
CN114885426A (en) * 2022-05-05 2022-08-09 南京航空航天大学 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
CN114885426B (en) * 2022-05-05 2024-04-16 南京航空航天大学 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
CN114827956A (en) * 2022-05-12 2022-07-29 南京航空航天大学 High-energy-efficiency V2X resource allocation method for user privacy protection
CN114827956B (en) * 2022-05-12 2024-05-10 南京航空航天大学 High-energy-efficiency V2X resource allocation method for user privacy protection
CN114641041B (en) * 2022-05-18 2022-09-13 之江实验室 Internet of vehicles slicing method and device oriented to edge intelligence
CN114641041A (en) * 2022-05-18 2022-06-17 之江实验室 Edge-intelligent-oriented Internet of vehicles slicing method and device
CN115515101A (en) * 2022-09-23 2022-12-23 西北工业大学 Decoupling Q learning intelligent codebook selection method for SCMA-V2X system

Also Published As

Publication number Publication date
CN112995951B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN112995951B (en) 5G Internet of Vehicles V2V resource allocation method using a deep deterministic policy gradient algorithm
Chen et al. Cooperative edge caching with location-based and popular contents for vehicular networks
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN112954651B (en) Low-delay high-reliability V2V resource allocation method based on deep reinforcement learning
CN109862610A (en) A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN114885426B (en) 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
Qian et al. Leveraging dynamic stackelberg pricing game for multi-mode spectrum sharing in 5G-VANET
CN111970733A (en) Deep reinforcement learning-based cooperative edge caching algorithm in ultra-dense network
Zhang et al. Fuzzy logic-based resource allocation algorithm for V2X communications in 5G cellular networks
CN111132074B (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
Lin et al. Popularity-aware online task offloading for heterogeneous vehicular edge computing using contextual clustering of bandits
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method
CN110267274A (en) A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
CN115134779A (en) Internet of vehicles resource allocation method based on information age perception
Ouyang Task offloading algorithm of vehicle edge computing environment based on Dueling-DQN
Zhou et al. Multi-agent few-shot meta reinforcement learning for trajectory design and channel selection in UAV-assisted networks
Bhadauria et al. QoS based deep reinforcement learning for V2X resource allocation
Wu et al. AoI minimization for UAV-to-device underlay communication by multi-agent deep reinforcement learning
Mei et al. Semi-decentralized network slicing for reliable V2V service provisioning: A model-free deep reinforcement learning approach
Khan et al. Sum throughput maximization scheme for NOMA-Enabled D2D groups using deep reinforcement learning in 5G and beyond networks
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
Yang et al. Task-driven semantic-aware green cooperative transmission strategy for vehicular networks
Yadav et al. Joint mode selection and resource allocation for cellular V2X communication using distributed deep reinforcement learning under 5G and beyond networks
Masaracchia et al. User mobility into NOMA assisted communication: analysis and a reinforcement learning with neural network based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant