CN116405904A - TACS network resource allocation method based on deep reinforcement learning - Google Patents

TACS network resource allocation method based on deep reinforcement learning

Info

Publication number
CN116405904A
Authority
CN
China
Prior art keywords
train
communication
agent
reinforcement learning
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310358415.5A
Other languages
Chinese (zh)
Inventor
望前程
赵恒凯
郑国莘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310358415.5A priority Critical patent/CN116405904A/en
Publication of CN116405904A publication Critical patent/CN116405904A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/42 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for mass transport vehicles, e.g. buses, trains or aircraft
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/04 Wireless resource allocation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/535 Allocation or scheduling criteria for wireless resources based on resource usage policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/56 Allocation or scheduling criteria for wireless resources based on priority criteria
    • H04W 72/566 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient
    • H04W 72/569 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient of the traffic information
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention relates to a TACS network resource allocation method based on deep reinforcement learning, which comprises the following steps: constructing a communication scene in a tunnel environment for the train autonomous operation system, and dividing the communication services in the scene into signaling priorities; under the constraint of the maximum delay of the inter-train communication links in the scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system; training the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay; obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources according to the optimal action strategy. Compared with the prior art, the invention effectively improves the total capacity of the TACS system and reduces the transmission delay of the T2T links.

Description

TACS network resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of rail transit, in particular to a TACS network resource allocation method based on deep reinforcement learning.
Background
With the acceleration of urbanization and the growing awareness of green travel, rail transit accounts for an ever larger share of public transport, and the demands for sustainable development, improved operating quality and efficiency, and safe, efficient, high-tech rail transit are increasingly urgent. Communication-based train control systems are widely used in rail transit. As train control systems continue to be upgraded, the TACS communication mode can markedly improve train control safety through direct information exchange between the leading and following trains.
In the Train Autonomous Operation System (TACS) mode of rail transit train communication, there are both Train-to-Train (T2T) communication and train-to-trackside (T2W) communication, which ensure the transmission of important information and improve the utilization of the overall communication resources. With the popularization of machine learning algorithms, more advanced TACS communication resource allocation schemes based on multi-agent deep reinforcement learning (MADRL) using the deep Q-Network (DQN) algorithm have been proposed. However, the DQN algorithm cannot guarantee convergence and cannot handle continuous actions, so the transmit power in simulation is usually a discrete value, which does not match the real tunnel environment; meanwhile, existing work does not consider the signaling priority of inter-train communication.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a TACS network resource allocation method based on deep reinforcement learning, which establishes a multi-agent deep reinforcement learning model for a communication scene in a tunnel environment after considering signaling priority, trains the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay, obtains the network resource allocation result from the trained model, and improves the efficiency of network resource allocation in the tunnel environment.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a TACS network resource allocation method based on deep reinforcement learning, which comprises the following steps:
constructing a communication scene in a tunnel environment aiming at a train autonomous operation system, and dividing signaling priority for communication service in the communication scene;
under the constraint of the maximum delay of the inter-train communication links in the communication scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system;
training the multi-agent deep reinforcement learning model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay;
obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources based on the optimal action strategy.
As a preferred technical solution, the construction process of the multi-agent deep reinforcement learning model comprises the following steps:
taking each train-to-train communication link as an agent; each agent acquires a state in the state space, correspondingly takes an action in the action space, and allocates network resources according to a strategy, wherein the strategy is determined by a probabilistic action value function representing the probability that the agent executes a certain action in a certain state;
each agent obtains a reward by interacting with the communication scene, which is used to determine the action to be selected in the next new state, wherein the reward is obtained based on the capacities of the train-to-train and train-to-trackside communication links and the delay of the train-to-train communication link.
As a preferred technical scheme, the reward is obtained by adopting the following formula:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum delay of the train-to-train communication link, λ_c, λ_d and λ_p are weights, (T_0 − U_t) is the time taken for transmission, C_c[m] is the channel capacity of the m-th train-to-trackside communication link, C_d[n] is the channel capacity of the n-th train-to-train communication link, and M and N are the numbers of train-to-trackside and train-to-train communication links, respectively.
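A minimal Python sketch of this reward as reconstructed above; the weight values and capacity inputs below are illustrative placeholders, not values given in the disclosure.

import numpy as np

def reward(C_c, C_d, T0, U_t, lam_c=0.1, lam_d=0.9, lam_p=1.0):
    # C_c: capacities of the M train-to-trackside (T2W) links;
    # C_d: capacities of the N train-to-train (T2T) links;
    # (T0 - U_t): time already spent transmitting under the latency budget T0.
    return lam_c * np.sum(C_c) + lam_d * np.sum(C_d) - lam_p * (T0 - U_t)

# Example call with synthetic capacities (bit/s/Hz) and a 100 ms latency budget.
r = reward(C_c=np.array([3.2, 2.8]), C_d=np.array([4.1, 3.7, 3.9]), T0=100e-3, U_t=92e-3)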
As a preferable technical solution, the training process of the multi-agent deep reinforcement learning model includes the following steps:
initializing, for each agent in the multi-agent deep reinforcement learning model, an Actor deep learning neural network and a decentralized Critic deep learning neural network based on a multi-head attention mechanism;
resetting the environment of the Internet of vehicles, updating the train position, acquiring updated large-scale fading and updated small-scale fading according to the updated train position, acquiring a training sample and storing the training sample in a preset experience playback pool;
and randomly selecting a training sample with a small size from the experience playback pool to form a data set, and training the Actor deep learning neural network and the Critic deep learning neural network of each intelligent agent.
As a preferred technical solution, the training samples include an input state and the corresponding output action, the state at the next moment, and the common reward obtained after all agents perform their actions, where the input state and the state at the next moment each include the channel gain of the train-to-train communication link in the current time slot, the channel gain of the train-to-trackside communication link in the current time slot, the link interference, the neighbours' sub-channel selection in the previous time slot, the transmission load, and the remaining time satisfying the delay constraint.
As a preferred technical solution, the channel gain of the train-to-train and train-to-trackside communication links is: g_n[m] = α_n h_n[m], where h_n[m] is the small-scale fading power component of the n-th train-to-train communication transmitter on the m-th sub-band, and α_n is the large-scale fading component of the n-th train-to-train communication transmitter.
As a preferable technical scheme, training is carried out on the Actor deep learning neural network and the Critic deep learning neural network of the intelligent agent based on a loss function, wherein the loss function is obtained based on rewards and probability action cost functions obtained by interaction of each intelligent agent with a communication scene.
As a preferred technical solution, the process of implementing allocation of network resources based on the optimal action policy includes the following steps:
and acquiring the optimal transmitting power and channel allocation data of the communication link between the trains based on the optimal action strategy, and acquiring the capacities of the communication link between the trains and the trackside and the transmission delay of the communication link between the trains based on the trained multi-agent deep reinforcement learning model.
As a preferred technical solution, the communication scene comprises an LTE-M-based core network, base station and backbone network, as well as a vehicle-mounted controller and a target controller, wherein the vehicle-mounted controller is used for realizing communication between adjacent trains according to the line plan, exchanging train position and resource information, and generating movement authority, and the target controller is used for registering and unlocking the line resource occupation condition, and for degraded-mode functions such as tracking non-communicating trains, recovering operation resources and route safety protection.
As a preferred technical solution, the process of prioritizing the signaling for the communication traffic in the communication scenario includes the following steps:
the automatic train operation, automatic train protection and automatic train monitoring services are classified into a first priority, and the passenger information system, the video monitoring system and the high data rate service are classified into a second priority.
Compared with the prior art, the invention has the following advantages:
(1) Good allocation performance in the tunnel environment: the method establishes a multi-agent deep reinforcement learning model for a communication scene in the tunnel environment with signaling priority taken into account, trains the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay, and obtains the network resource allocation result from the trained model.
(2) Improved efficiency of information interaction, decision accuracy and training stability: in a TACS scene, each agent needs to interact with other agents to complete tasks cooperatively. The traditional approach considers the information of all agents equally before making a decision; in complex scenes this can cause information redundancy and unnecessary communication overhead. By introducing an attention mechanism, the method lets agents focus on the important information and reduces unnecessary communication, thereby improving the efficiency of information interaction and predicting future states and behaviours more accurately. In addition, the performance of the MADDPG algorithm in a multi-agent game is affected by training stability; the attention mechanism allows the model to learn and converge more stably, improving the stability and convergence speed of training.
Drawings
Fig. 1 is a flowchart of a TACS network resource allocation method based on deep reinforcement learning in embodiment 1;
fig. 2 is a schematic diagram of each part of the communication scenario in embodiment 1;
FIG. 3 is a schematic diagram of a model optimization process.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Example 1
As shown in fig. 1, this embodiment first divides the contents of different communications in the train control system into different priorities, then models the TACS communication mode as a MADRL problem, and proposes a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm based on distributed execution, applied in the communication scenario shown in fig. 2, to solve the power control problem of each agent's continuous actions. The method comprises the following steps:
s1: and constructing a T2T system based on LTE-M communication. The method comprises the following substeps:
s11: the TACS system based on LTE-M communication is established on the existing train control system based on communication and composed of a core network, a base station and a backbone network, but the functions of the original computer interlocking and a regional controller are integrated into VOBC (virtual basic block controller) to realize direct communication between adjacent trains according to a line plan, so that key information such as train position and resources can be exchanged, and mobile authorization can be timely generated;
s12: the OC is additionally arranged to register and unlock the line resource occupation condition, and the OC realizes the system degradation functions of non-communication train tracking, driving resource recovery, route safety protection and the like, so that the compatibility and the easy deployment of the signal system are realized.
S2: and dividing priorities of different signaling according to different communication contents, and simulating a TACS communication real scene. The method comprises the following substeps:
s21: according to different communication contents, dividing priorities of different signaling;
s22: the 5MHz bandwidth in 20MHz bandwidth of LTE-M is divided and is specially used for transmission of higher priority service (such as ATO instruction), and the rest 15MHz bandwidth participates in a frequency spectrum sharing resource allocation scheme, and the allocation mode is shown in table 1.
Table 1 spectrum shared resource allocation
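As a rough illustration of the spectrum split described in s22 (the service names and the exact mapping below are illustrative assumptions, not the contents of Table 1):

TOTAL_BW_MHZ = 20
RESERVED_BW_MHZ = 5                              # dedicated to first-priority signalling, e.g. ATO commands
SHARED_BW_MHZ = TOTAL_BW_MHZ - RESERVED_BW_MHZ   # pooled for the spectrum-sharing scheme

PRIORITY = {
    "ATO": 1, "ATP": 1, "ATS": 1,              # train operation / protection / supervision
    "PIS": 2, "IMS": 2, "high_rate_data": 2,   # passenger info, video monitoring, bulk data
}

def band_for(service: str) -> str:
    # First-priority services use the reserved band; everything else shares the rest.
    return "reserved_5MHz" if PRIORITY.get(service, 2) == 1 else "shared_15MHz"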
S3: constructing different resource allocation frameworks of a TACS scene, and deducing a signal-to-interference-and-noise ratio formula and a channel capacity formula of a T2T link and a T2W link. The derivation process is as follows:
the signal gain is: g n [m]=α n h n [m]
In the formula, h n [m]For frequency-dependent small-scale fading power components, alpha n Is a large scale fading effect, mainly path loss.
T2W link signal-to-interference-noise ratio:
Figure BDA0004164101860000061
wherein,,
Figure BDA0004164101860000062
for the transmission power, sigma, of the mth T2W link 2 Is noise power +.>
Figure BDA0004164101860000063
For the transmission power of the nth T2T transmitter on the mth subband, M e {1, …, M }, N e {1, …, N }. ρ n [m]Is a binary spectrum allocation index, ρ n [m]=1, meaning that the nth T2T link uses the mth subband, otherwise ρ n [m]=0. Suppose that each T2T link has access to only one sub-band, i.e., Σ n ρ n [m]≤1。
Signal-to-interference-and-noise ratio of T2T link:
Figure BDA0004164101860000064
wherein,,
Figure BDA0004164101860000065
representing the transmit power of the nth T2T transmitter on the mth subband, +.>
Figure BDA0004164101860000066
Represents the interference channel from the mth T2T transmitter to the nth T2T receiver in the mth sub-band g n′,n [m]Similarly.
T2W link channel capacity:
Figure BDA0004164101860000067
T2T link channel capacity:
Figure BDA0004164101860000068
where B is the bandwidth of each spectral subband.
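A Python sketch of the SINR and capacity expressions reconstructed above; all channel gains, transmit powers, the allocation matrix rho and the array shapes are synthetic assumptions for illustration.

import numpy as np

def t2w_sinr(P_c, g_cB, rho, P_d, g_dB, sigma2):
    # P_c[m], g_cB[m]: transmit power and gain of the m-th T2W link; rho[n, m] in {0, 1};
    # P_d[n, m], g_dB[n, m]: T2T transmit power and interference gain towards the T2W receiver.
    interference = (rho * P_d * g_dB).sum(axis=0)            # aggregate T2T interference per sub-band
    return P_c * g_cB / (sigma2 + interference)

def t2t_sinr(P_d, g_d, rho, P_c, g_c2d, g_cross, sigma2):
    # g_c2d[m, n]: interference gain from the m-th T2W transmitter to the n-th T2T receiver;
    # g_cross[k, n, m]: interference gain from the k-th to the n-th T2T link on sub-band m.
    N, M = P_d.shape
    sinr = np.zeros((N, M))
    for n in range(N):
        for m in range(M):
            inter = P_c[m] * g_c2d[m, n]
            inter += sum(rho[k, m] * P_d[k, m] * g_cross[k, n, m] for k in range(N) if k != n)
            sinr[n, m] = P_d[n, m] * g_d[n, m] / (sigma2 + inter)
    return sinr

def capacities(B, sinr_c, sinr_d, rho):
    C_c = B * np.log2(1.0 + sinr_c)                          # per T2W link
    C_d = B * (rho * np.log2(1.0 + sinr_d)).sum(axis=1)      # per T2T link, over its selected sub-band
    return C_c, C_d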
S4: the spectrum sharing problem in TACS communication is modeled as one MADRL problem. The method comprises the following substeps:
s41: the spectrum sharing scenario is modeled as a MADRL problem in conjunction with reinforcement learning algorithms. In connection with fig. 3, it is evident that each T2T link acts as an agent, one state S being observed in the state space S t And accordingly takes an action a in action space a t The sub-band and the transmission power are selected according to a strategy pi. Policy v may be defined by action cost function Q (S t ,A t ) Determination that it represents an agent in a certain state S t In the case of (a) executing a certain action A t I.e., pi (a |s) =p (a t =a∣S t =s)。
S42: individual agents obtain rewards R by interacting with tunnel communication environment t Thereby guiding itself in the next new state S t+1 Action A to be selected for execution t+1 . Rewards R t System capacity by T2T and T2W linksThe amount and the delay constraint of the corresponding T2T link.
S43: multiple T2T agents represent multiple trains in a tunnel scenario, all of which jointly explore the environment and improve spectrum allocation and power control strategies based on their own observations of the environment state, as described in fig. 3.
S5: the DDPG algorithm is applied to the MADRL model to provide the MADDPG algorithm, and a state space, a reward function, a cost function, an algorithm flow and the like of the algorithm are provided, so that a reinforcement learning distributed execution algorithm is designed. The method specifically comprises the following substeps:
s51: state space: s is S t ={G t ,H t ,I t-1 ,E t-1 ,F t ,U t }
Wherein, the channel gain G of the T2T link in the current time slot t Channel gain H of T2W link at current time slot t Interference I caused by other links before t-1 Selection E of last time slot neighbor subchannel t-1 The transmission load F t Residual time U satisfying time delay constraint t
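A minimal sketch of assembling this local observation for one T2T agent; the environment object env and all of its field names are hypothetical placeholders.

import numpy as np

def build_state(env, n):
    return np.concatenate([
        env.t2t_gain[n],          # G_t: T2T channel gain on each sub-band
        env.t2w_gain[n],          # H_t: T2W channel gain on each sub-band
        env.interference[n],      # I_{t-1}: interference received in the previous slot
        env.neighbor_channel[n],  # E_{t-1}: neighbours' sub-band choices in the previous slot
        [env.remaining_load[n]],  # F_t: remaining payload to transmit
        [env.remaining_time[n]],  # U_t: time left under the latency constraint
    ]).astype(np.float32)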
Reward function:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum acceptable delay, λ_c, λ_d and λ_p are the weights of the three parts, and (T_0 − U_t) is the time taken for transmission.
Final (cumulative) reward:
R = E[ Σ_t β^t R_t ]
where β ∈ [0, 1] is the attenuation (discount) factor.
Action value function update:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + β max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]
where α represents the learning rate.
Loss function:
L(θ) = E[ (y_t − Q(S_t, A_t; θ))² ],  with y_t = R_t + β Q(S_{t+1}, A_{t+1}; θ′)
where θ′ are the parameters of the target Q network.
S52: the specific algorithm flow is as follows:
initializing the actual network and target network parameters of the Actor and Critic of each agent;
initializing the size B of each agent's experience pool k
Resetting the Internet of vehicles environment;
updating the vehicle position and the large-scale fading alpha;
the reality Actor strategy network is based on the input state S t Output action A t The intelligent agent obtains the state S at the next moment after executing the action t+1 All agents perform actions and acquire a common prize R t
Updating small-scale fading of the channel;
obtaining training data (S) t ,A t ,R t ,S t+1 );
Storing training data in an experience playback pool;
randomly sampling m training data from an experience playback pool to form a data set, and transmitting the data set to an Actor reality network, a Critic reality network, an Actor target network and a Critic target network;
setting the Q estimation as follows:
Figure BDA0004164101860000075
the loss function defining the online Critic evaluation network is:
Figure BDA0004164101860000076
updating an Actor target network;
updating the Critic target network;
updating all parameters delta of the current network of the Actor through gradient back propagation of the neural network;
if the online training times reach the target network updating frequency, respectively updating the target network parameters delta according to the online network parameters delta and theta And theta
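A condensed Python sketch of this loop under stated assumptions: network sizes, learning rates, beta and tau are placeholders, and it illustrates the generic MADDPG update rather than the filing's exact implementation.

import copy
import torch
import torch.nn as nn

class MLP(nn.Module):
    # Simple fully connected network used below for both the Actor and the Critic.
    def __init__(self, in_dim, out_dim, out_act=None):
        super().__init__()
        layers = [nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, out_dim)]
        if out_act is not None:
            layers.append(out_act)
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)

def make_agent(obs_dim, act_dim, n_agents):
    actor = MLP(obs_dim, act_dim, out_act=nn.Tanh())       # continuous sub-band/power action
    critic = MLP(n_agents * (obs_dim + act_dim), 1)        # centralized critic over all agents
    return {"actor": actor, "critic": critic,
            "actor_target": copy.deepcopy(actor), "critic_target": copy.deepcopy(critic),
            "actor_opt": torch.optim.Adam(actor.parameters(), lr=1e-4),
            "critic_opt": torch.optim.Adam(critic.parameters(), lr=1e-3)}

def maddpg_update(agents, batch, beta=0.99, tau=0.005):
    # batch: S, S2 with shape [batch, n_agents, obs_dim]; A: [batch, n_agents, act_dim]; R: [batch, 1].
    S, A, R, S2 = batch
    n = S.shape[1]
    with torch.no_grad():
        A2 = torch.stack([ag["actor_target"](S2[:, i]) for i, ag in enumerate(agents)], dim=1)
    for i, ag in enumerate(agents):
        # Critic: minimise (y - Q(S, A))^2 with the target y = R + beta * Q'(S', mu'(S')).
        with torch.no_grad():
            y = R + beta * ag["critic_target"](torch.cat([S2.flatten(1), A2.flatten(1)], dim=1))
        q = ag["critic"](torch.cat([S.flatten(1), A.flatten(1)], dim=1))
        critic_loss = nn.functional.mse_loss(q, y)
        ag["critic_opt"].zero_grad(); critic_loss.backward(); ag["critic_opt"].step()
        # Actor: ascend the centralized Q with respect to this agent's own action.
        acts = [A[:, j] if j != i else ag["actor"](S[:, i]) for j in range(n)]
        actor_loss = -ag["critic"](torch.cat([S.flatten(1), torch.stack(acts, dim=1).flatten(1)], dim=1)).mean()
        ag["actor_opt"].zero_grad(); actor_loss.backward(); ag["actor_opt"].step()
        # Soft-update the target networks towards the online networks.
        for net, tgt in (("actor", "actor_target"), ("critic", "critic_target")):
            for p, p_t in zip(ag[net].parameters(), ag[tgt].parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)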
And S6, considering the joint optimization problem in the continuous action space, and optimizing a deep reinforcement learning model by using a DDPG algorithm comprising three mechanisms of deep learning fitting, attention mechanism and experience playback. The method specifically comprises the following substeps:
s61: deep learning fitting refers to the invention of fitting deterministic strategies and action value functions using deep neural networks of different parameters;
s62: because communication between trains only needs to pay attention to communication between trains adjacent in the same direction in practice, and not much attention is required to trains traveling in different directions, the attention mechanism can be introduced to improve the efficiency and accuracy of information interaction.
The method adopts centralized training and distributed execution; since an agent cannot observe the complete state of the environment, execution is decentralized.
In particular, the following steps may be employed to introduce the attention mechanism (a minimal sketch follows the list):
(1) The direction and speed information of each agent is encoded as a vector as input data.
(2) For each agent, according to the information such as the direction and the relative speed of the agent and other agents, an attention vector is calculated and used for reflecting the attention degree and the importance of the current agent to other agents.
(3) And performing dot product operation on the attention vector of each agent and the input vectors of other agents to obtain attention weighted input vectors of each agent to the other agents.
(4) The attention weighted input vector for each agent is used as input to calculate the motion or state of the vehicle.
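An illustrative single-head sketch of the four steps above, with assumed feature and key dimensions; the patent does not specify this exact architecture.

import torch
import torch.nn as nn

class AgentAttention(nn.Module):
    def __init__(self, feat_dim, key_dim=32):
        super().__init__()
        self.query = nn.Linear(feat_dim, key_dim)
        self.key = nn.Linear(feat_dim, key_dim)
        self.value = nn.Linear(feat_dim, key_dim)

    def forward(self, own, others):
        # own: [batch, feat_dim] encoding of this agent's direction/speed (step 1);
        # others: [batch, n_others, feat_dim] encodings of the other agents.
        q = self.query(own).unsqueeze(1)                        # [batch, 1, key_dim]
        k, v = self.key(others), self.value(others)             # [batch, n_others, key_dim]
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5           # steps 2-3: scaled dot-product attention
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # importance assigned to each other agent
        return (weights * v).sum(dim=1)                         # step 4: attention-weighted input to the head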
S63: experience playback by setting an experience playback mechanism, a sample pool is collected first, and then some small-size samples are randomly selected from the sample pool for training.
S7: and obtaining an optimal T2T transmitting power selection model, an allocation channel decision, a T2W link capacity with better performance, a T2T system capacity and a lower T2T transmission delay according to the optimized deep reinforcement learning model. The method specifically comprises the following substeps:
s71: outputting the optimal action strategy to obtain the optimal T2T transmitting power and the allocation channel strategy;
s72: and obtaining the T2W link capacity, the T2T system capacity and the lower T2T transmission delay with better performance according to the optimized deep reinforcement learning model.
As shown in fig. 2, the Vehicle On-Board Controller (VOBC) communicates with the target Controller (Object Controller, OC), the passenger information system (Passenger Information System, PIS) and the video monitoring system (Image Monitoring System, IMS) through a base station-backbone-core network; wherein, step S1 includes:
aiming at the spectrum sharing problem of multi-agent deep reinforcement learning based on continuous action space in a TACS communication scene, the simulation power of the reinforcement learning algorithm in the past is generally discrete value, so that a tunnel scene cannot be truly simulated, and meanwhile, the signaling priority problem of communication between trains is not considered in the existing work. In the embodiment, the communication modes are divided into two types of T2T and T2W, the communication content is divided into two different priorities according to the safety importance of the train, the resource sharing is modeled as a Multi-agent deep reinforcement learning problem, and a Multi-agent deep deterministic strategy gradient (Multi-Agent Deep Deterministic Policy Gradient, MADDPG) algorithm based on distributed execution is provided. Each agent continuously interacts with the tunnel environment and observes own local state to obtain a common reward, and the Critic network is intensively trained by summarizing actions of other agents, so that the power control selected by each agent is improved. By designing the reward function and the training mechanism, the multi-agent algorithm can realize distributed resource allocation, effectively improve the total capacity of the TACS system and reduce the transmission delay of the T2T link.
Example 2
The present embodiment provides an electronic device, including: one or more processors and a memory having stored therein one or more programs comprising instructions for performing the deep reinforcement learning based TACS network resource allocation method as described in embodiment 1.
Example 3
The present embodiment provides a computer-readable storage medium comprising one or more programs for execution by one or more processors of an electronic device, the one or more programs comprising instructions for performing the deep reinforcement learning-based TACS network resource allocation method described in embodiment 1.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A TACS network resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
constructing a communication scene in a tunnel environment aiming at a train autonomous operation system, and dividing signaling priority for communication service in the communication scene;
under the constraint of the maximum delay of the inter-train communication links in the communication scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system;
training the multi-agent deep reinforcement learning model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay;
obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources based on the optimal action strategy.
2. The TACS network resource allocation method based on deep reinforcement learning according to claim 1, wherein the construction process of the multi-agent deep reinforcement learning model comprises the following steps:
taking each train-to-train communication link as an agent; each agent acquires a state in the state space, correspondingly takes an action in the action space, and allocates network resources according to a strategy, wherein the strategy is determined by a probabilistic action value function representing the probability that the agent executes a certain action in a certain state;
each agent obtains a reward by interacting with the communication scene, which is used to determine the action to be selected in the next new state, wherein the reward is obtained based on the capacities of the train-to-train and train-to-trackside communication links and the delay of the train-to-train communication link.
3. The TACS network resource allocation method according to claim 2, wherein the reward is obtained by the following formula:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum delay of the train-to-train communication link, λ_c, λ_d and λ_p are weights, (T_0 − U_t) is the time taken for transmission, C_c[m] is the channel capacity of the m-th train-to-trackside communication link, C_d[n] is the channel capacity of the n-th train-to-train communication link, and M and N are the numbers of train-to-trackside and train-to-train communication links, respectively.
4. The TACS network resource allocation method of claim 1, wherein the training of the multi-agent deep reinforcement learning model comprises the steps of:
initializing, for each agent in the multi-agent deep reinforcement learning model, an Actor deep learning neural network and a decentralized Critic deep learning neural network based on a multi-head attention mechanism;
resetting the environment of the Internet of vehicles, updating the train position, acquiring updated large-scale fading and updated small-scale fading according to the updated train position, acquiring a training sample and storing the training sample in a preset experience playback pool;
and randomly selecting a training sample with a small size from the experience playback pool to form a data set, and training the Actor deep learning neural network and the Critic deep learning neural network of each intelligent agent.
5. The TACS network resource allocation method based on deep reinforcement learning according to claim 4, wherein the training samples include an input state and the corresponding output action, the state at the next moment, and the common reward obtained after all agents perform their actions, wherein the input state and the state at the next moment each include the channel gain of the train-to-train communication link in the current time slot, the channel gain of the train-to-trackside communication link in the current time slot, the link interference, the neighbours' sub-channel selection in the previous time slot, the transmission load, and the remaining time satisfying the delay constraint.
6. The TACS network resource allocation method according to claim 5, wherein the channel gain of the train-to-train and train-to-trackside communication links is: g_n[m] = α_n h_n[m], where h_n[m] is the small-scale fading power component of the n-th train-to-train communication transmitter on the m-th sub-band, and α_n is the large-scale fading component of the n-th train-to-train communication transmitter.
7. The TACS network resource allocation method of claim 4, wherein the agent's Actor deep learning neural network and Critic deep learning neural network are trained based on a loss function, the loss function being obtained based on rewards and probabilistic action cost functions obtained by each agent through interaction with a communication scene.
8. The TACS network resource allocation method according to claim 1, wherein the process of implementing allocation of network resources based on the optimal action policy comprises the steps of:
and acquiring the optimal transmitting power and channel allocation data of the communication link between the trains based on the optimal action strategy, and acquiring the capacities of the communication link between the trains and the trackside and the transmission delay of the communication link between the trains based on the trained multi-agent deep reinforcement learning model.
9. The TACS network resource allocation method according to claim 1, wherein the communication scenario includes an LTE-M-based core network, base station and backbone network, as well as a vehicle-mounted controller and a target controller, wherein the vehicle-mounted controller is configured to implement communication between adjacent trains according to the line plan, exchange train position and resource information, and generate movement authority, and the target controller is configured to register and unlock the line resource occupation condition, and to perform degraded-mode functions such as tracking non-communicating trains, recovering operation resources and route safety protection.
10. The TACS network resource allocation method according to claim 1, wherein the process of prioritizing the traffic within the communication scenario comprises the steps of:
the automatic train operation, automatic train protection and automatic train monitoring services are classified into a first priority, and the passenger information system, the video monitoring system and the high data rate service are classified into a second priority.
CN202310358415.5A 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning Pending CN116405904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310358415.5A CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310358415.5A CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116405904A true CN116405904A (en) 2023-07-07

Family

ID=87006995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310358415.5A Pending CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116405904A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117098192A (en) * 2023-08-07 2023-11-21 北京交通大学 Urban rail ad hoc network resource allocation method based on capacity and time delay optimization
CN117098192B (en) * 2023-08-07 2024-04-26 北京交通大学 Urban rail ad hoc network resource allocation method based on capacity and time delay optimization
CN117931461A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Scheduling method of computing resources, training method of strategy network and device

Similar Documents

Publication Publication Date Title
CN111376954B (en) Train autonomous scheduling method and system
CN116405904A (en) TACS network resource allocation method based on deep reinforcement learning
CN111369181B (en) Train autonomous scheduling deep reinforcement learning method and device
CN111311959B (en) Multi-interface cooperative control method and device, electronic equipment and storage medium
CN113867354B (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
CN111619624A (en) Tramcar operation control method and system based on deep reinforcement learning
CN114116047A (en) V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning
CN112565377B (en) Content grading optimization caching method for user service experience in Internet of vehicles
CN114786152A (en) Credible collaborative computing system for intelligent rail transit
Lv et al. Edge computing task offloading for environmental perception of autonomous vehicles in 6G networks
CN116017479A (en) Distributed multi-unmanned aerial vehicle relay network coverage method
Qi et al. Social prediction-based handover in collaborative-edge-computing-enabled vehicular networks
CN106685608A (en) Resource scheduling method, device and node of vehicle-road collaborative communication system
Dong et al. Support vector machine for channel prediction in high-speed railway communication systems
CN114745699B (en) Vehicle-to-vehicle communication mode selection method and system based on neural network and storage medium
CN112367638A (en) Intelligent frequency spectrum selection method for vehicle-vehicle communication of urban rail transit vehicle
CN115208892B (en) Vehicle-road collaborative online task scheduling method and system based on dynamic resource demand
CN113534784A (en) Decision method of intelligent body action and related equipment
CN115100866B (en) Vehicle-road cooperative automatic driving decision-making method based on layered reinforcement learning
CN116588138A (en) Block chain-based distributed intelligent auxiliary automatic driving method
CN115715021A (en) Internet of vehicles resource allocation method and system
CN116151538A (en) Simulation system and method applied to intelligent bus demand balanced scheduling
Deng et al. Enhancing Multi-Agent Communication Collaboration through GPT-Based Semantic Information Extraction and Prediction
CN113592100B (en) Multi-agent reinforcement learning method and system
Huang et al. A novel method on probability evaluation of ZC handover scenario based on SMC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination