CN116405904A - TACS network resource allocation method based on deep reinforcement learning - Google Patents

TACS network resource allocation method based on deep reinforcement learning

Info

Publication number
CN116405904A
Authority
CN
China
Prior art keywords
train
communication
agent
reinforcement learning
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310358415.5A
Other languages
Chinese (zh)
Inventor
望前程
赵恒凯
郑国莘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310358415.5A priority Critical patent/CN116405904A/en
Publication of CN116405904A publication Critical patent/CN116405904A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/42 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for mass transport vehicles, e.g. buses, trains or aircraft
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/04 Wireless resource allocation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/535 Allocation or scheduling criteria for wireless resources based on resource usage policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/56 Allocation or scheduling criteria for wireless resources based on priority criteria
    • H04W 72/566 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient
    • H04W 72/569 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient of the traffic information
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention relates to a TACS network resource allocation method based on deep reinforcement learning, which comprises the following steps: constructing a communication scene in a tunnel environment for the train autonomous operation system, and dividing the communication services in the scene into signaling priorities; under the constraint of the maximum delay of the inter-train communication links in the scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system; training the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay; obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources according to the optimal action strategy. Compared with the prior art, the invention effectively improves the total capacity of the TACS system and reduces the transmission delay of the T2T links.

Description

TACS network resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of rail transit, in particular to a TACS network resource allocation method based on deep reinforcement learning.
Background
With the acceleration of urbanization and the growing awareness of green travel, rail transit accounts for an ever larger share of public transport, and the demands for sustainable development, improved operating quality and efficiency, and safe, efficient, high-tech rail transit are increasingly urgent. Communication-based train control systems are widely used in rail transit. As train control systems continue to be upgraded, the TACS communication mode can markedly improve train control safety through direct information exchange between the leading and following trains.
In the Train Autonomous Operation System (TACS) mode of rail transit train communication, there are both Train-to-Train (T2T) communication and train-to-trackside (T2W) communication, which ensure the transmission of important information and improve the utilization of the overall communication resources. With the popularization of machine learning algorithms, more advanced TACS communication resource allocation schemes based on multi-agent deep reinforcement learning (MADRL) using the deep Q-Network (DQN) algorithm have been proposed. However, the DQN algorithm cannot guarantee convergence and cannot handle continuous actions, so the transmit power in simulation is usually a discrete value, which does not match the real tunnel environment; meanwhile, existing work does not consider the signaling priority of inter-train communication.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a TACS network resource allocation method based on deep reinforcement learning, which establishes a multi-agent deep reinforcement learning model for a communication scene in a tunnel environment after considering signaling priority, trains the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay, obtains the network resource allocation result from the trained model, and improves the efficiency of network resource allocation in the tunnel environment.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a TACS network resource allocation method based on deep reinforcement learning, which comprises the following steps:
constructing a communication scene in a tunnel environment aiming at a train autonomous operation system, and dividing signaling priority for communication service in the communication scene;
under the constraint of the maximum delay of the inter-train communication links in the communication scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system;
training the multi-agent deep reinforcement learning model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay;
obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources based on the optimal action strategy.
As a preferred technical solution, the construction process of the multi-agent deep reinforcement learning model comprises the following steps:
taking each train-to-train communication link as an agent; each agent acquires a state in the state space, correspondingly takes an action in the action space, and allocates network resources according to a strategy, wherein the strategy is determined by a probabilistic action value function representing the probability that the agent executes a certain action in a certain state;
each agent obtains a reward by interacting with the communication scene, which is used to determine the action to be selected in the next new state, wherein the reward is obtained based on the capacities of the train-to-train and train-to-trackside communication links and the delay of the train-to-train communication link.
As a preferred technical scheme, the reward is obtained by adopting the following formula:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum delay of the train-to-train communication link, λ_c, λ_d and λ_p are weights, (T_0 − U_t) is the time taken for transmission, C_c[m] is the channel capacity of the m-th train-to-trackside communication link, C_d[n] is the channel capacity of the n-th train-to-train communication link, and M and N are the numbers of train-to-trackside and train-to-train communication links, respectively.
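A minimal Python sketch of this reward as reconstructed above; the weight values and capacity inputs below are illustrative placeholders, not values given in the disclosure.

import numpy as np

def reward(C_c, C_d, T0, U_t, lam_c=0.1, lam_d=0.9, lam_p=1.0):
    # C_c: capacities of the M train-to-trackside (T2W) links;
    # C_d: capacities of the N train-to-train (T2T) links;
    # (T0 - U_t): time already spent transmitting under the latency budget T0.
    return lam_c * np.sum(C_c) + lam_d * np.sum(C_d) - lam_p * (T0 - U_t)

# Example call with synthetic capacities (bit/s/Hz) and a 100 ms latency budget.
r = reward(C_c=np.array([3.2, 2.8]), C_d=np.array([4.1, 3.7, 3.9]), T0=100e-3, U_t=92e-3)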
As a preferable technical solution, the training process of the multi-agent deep reinforcement learning model includes the following steps:
initializing, for each agent in the multi-agent deep reinforcement learning model, an Actor deep learning neural network and a decentralized Critic deep learning neural network based on a multi-head attention mechanism;
resetting the environment of the Internet of vehicles, updating the train position, acquiring updated large-scale fading and updated small-scale fading according to the updated train position, acquiring a training sample and storing the training sample in a preset experience playback pool;
and randomly selecting a training sample with a small size from the experience playback pool to form a data set, and training the Actor deep learning neural network and the Critic deep learning neural network of each intelligent agent.
As a preferred technical solution, the training samples include an input state and the corresponding output action, the state at the next moment, and the common reward obtained after all agents perform their actions, where the input state and the state at the next moment each include the channel gain of the train-to-train communication link in the current time slot, the channel gain of the train-to-trackside communication link in the current time slot, the link interference, the neighbours' sub-channel selection in the previous time slot, the transmission load, and the remaining time satisfying the delay constraint.
As a preferred technical solution, the channel gain of the train-to-train and train-to-trackside communication links is: g_n[m] = α_n h_n[m], where h_n[m] is the small-scale fading power component of the n-th train-to-train communication transmitter on the m-th sub-band, and α_n is the large-scale fading component of the n-th train-to-train communication transmitter.
As a preferable technical scheme, training is carried out on the Actor deep learning neural network and the Critic deep learning neural network of the intelligent agent based on a loss function, wherein the loss function is obtained based on rewards and probability action cost functions obtained by interaction of each intelligent agent with a communication scene.
As a preferred technical solution, the process of implementing allocation of network resources based on the optimal action policy includes the following steps:
and acquiring the optimal transmitting power and channel allocation data of the communication link between the trains based on the optimal action strategy, and acquiring the capacities of the communication link between the trains and the trackside and the transmission delay of the communication link between the trains based on the trained multi-agent deep reinforcement learning model.
As a preferred technical solution, the communication scene comprises an LTE-M-based core network, base station and backbone network, as well as a vehicle-mounted controller and a target controller, wherein the vehicle-mounted controller is used for realizing communication between adjacent trains according to the line plan, exchanging train position and resource information, and generating movement authority, and the target controller is used for registering and unlocking the line resource occupation condition, and for degraded-mode functions such as tracking non-communicating trains, recovering operation resources and route safety protection.
As a preferred technical solution, the process of prioritizing the signaling for the communication traffic in the communication scenario includes the following steps:
the automatic train operation, automatic train protection and automatic train monitoring services are classified into a first priority, and the passenger information system, the video monitoring system and the high data rate service are classified into a second priority.
Compared with the prior art, the invention has the following advantages:
(1) Good allocation performance in the tunnel environment: the method establishes a multi-agent deep reinforcement learning model for a communication scene in the tunnel environment with signaling priority taken into account, trains the model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay, and obtains the network resource allocation result from the trained model.
(2) Improved efficiency of information interaction, decision accuracy and training stability: in a TACS scene, each agent needs to interact with other agents to complete tasks cooperatively. The traditional approach considers the information of all agents equally before making a decision; in complex scenes this can cause information redundancy and unnecessary communication overhead. By introducing an attention mechanism, the method lets agents focus on the important information and reduces unnecessary communication, thereby improving the efficiency of information interaction and predicting future states and behaviours more accurately. In addition, the performance of the MADDPG algorithm in a multi-agent game is affected by training stability; the attention mechanism allows the model to learn and converge more stably, improving the stability and convergence speed of training.
Drawings
Fig. 1 is a flowchart of a TACS network resource allocation method based on deep reinforcement learning in embodiment 1;
fig. 2 is a schematic diagram of each part of the communication scenario in embodiment 1;
FIG. 3 is a schematic diagram of a model optimization process.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Example 1
As shown in fig. 1, this embodiment first divides the contents of different communications in the train control system into different priorities, then models the TACS communication mode as a MADRL problem, and proposes a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm based on distributed execution, applied in the communication scenario shown in fig. 2, to solve the power control problem of each agent's continuous actions. The method comprises the following steps:
s1: and constructing a T2T system based on LTE-M communication. The method comprises the following substeps:
s11: the TACS system based on LTE-M communication is established on the existing train control system based on communication and composed of a core network, a base station and a backbone network, but the functions of the original computer interlocking and a regional controller are integrated into VOBC (virtual basic block controller) to realize direct communication between adjacent trains according to a line plan, so that key information such as train position and resources can be exchanged, and mobile authorization can be timely generated;
s12: the OC is additionally arranged to register and unlock the line resource occupation condition, and the OC realizes the system degradation functions of non-communication train tracking, driving resource recovery, route safety protection and the like, so that the compatibility and the easy deployment of the signal system are realized.
S2: and dividing priorities of different signaling according to different communication contents, and simulating a TACS communication real scene. The method comprises the following substeps:
s21: according to different communication contents, dividing priorities of different signaling;
s22: the 5MHz bandwidth in 20MHz bandwidth of LTE-M is divided and is specially used for transmission of higher priority service (such as ATO instruction), and the rest 15MHz bandwidth participates in a frequency spectrum sharing resource allocation scheme, and the allocation mode is shown in table 1.
Table 1 spectrum shared resource allocation
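As a rough illustration of the spectrum split described in s22 (the service names and the exact mapping below are illustrative assumptions, not the contents of Table 1):

TOTAL_BW_MHZ = 20
RESERVED_BW_MHZ = 5                              # dedicated to first-priority signalling, e.g. ATO commands
SHARED_BW_MHZ = TOTAL_BW_MHZ - RESERVED_BW_MHZ   # pooled for the spectrum-sharing scheme

PRIORITY = {
    "ATO": 1, "ATP": 1, "ATS": 1,              # train operation / protection / supervision
    "PIS": 2, "IMS": 2, "high_rate_data": 2,   # passenger info, video monitoring, bulk data
}

def band_for(service: str) -> str:
    # First-priority services use the reserved band; everything else shares the rest.
    return "reserved_5MHz" if PRIORITY.get(service, 2) == 1 else "shared_15MHz"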
S3: constructing different resource allocation frameworks of a TACS scene, and deducing a signal-to-interference-and-noise ratio formula and a channel capacity formula of a T2T link and a T2W link. The derivation process is as follows:
the signal gain is: g n [m]=α n h n [m]
In the formula, h n [m]For frequency-dependent small-scale fading power components, alpha n Is a large scale fading effect, mainly path loss.
T2W link signal-to-interference-noise ratio:
Figure BDA0004164101860000061
wherein,,
Figure BDA0004164101860000062
for the transmission power, sigma, of the mth T2W link 2 Is noise power +.>
Figure BDA0004164101860000063
For the transmission power of the nth T2T transmitter on the mth subband, M e {1, …, M }, N e {1, …, N }. ρ n [m]Is a binary spectrum allocation index, ρ n [m]=1, meaning that the nth T2T link uses the mth subband, otherwise ρ n [m]=0. Suppose that each T2T link has access to only one sub-band, i.e., Σ n ρ n [m]≤1。
Signal-to-interference-and-noise ratio of T2T link:
Figure BDA0004164101860000064
wherein,,
Figure BDA0004164101860000065
representing the transmit power of the nth T2T transmitter on the mth subband, +.>
Figure BDA0004164101860000066
Represents the interference channel from the mth T2T transmitter to the nth T2T receiver in the mth sub-band g n′,n [m]Similarly.
T2W link channel capacity:
Figure BDA0004164101860000067
T2T link channel capacity:
Figure BDA0004164101860000068
where B is the bandwidth of each spectral subband.
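A Python sketch of the SINR and capacity expressions reconstructed above; all channel gains, transmit powers, the allocation matrix rho and the array shapes are synthetic assumptions for illustration.

import numpy as np

def t2w_sinr(P_c, g_cB, rho, P_d, g_dB, sigma2):
    # P_c[m], g_cB[m]: transmit power and gain of the m-th T2W link; rho[n, m] in {0, 1};
    # P_d[n, m], g_dB[n, m]: T2T transmit power and interference gain towards the T2W receiver.
    interference = (rho * P_d * g_dB).sum(axis=0)            # aggregate T2T interference per sub-band
    return P_c * g_cB / (sigma2 + interference)

def t2t_sinr(P_d, g_d, rho, P_c, g_c2d, g_cross, sigma2):
    # g_c2d[m, n]: interference gain from the m-th T2W transmitter to the n-th T2T receiver;
    # g_cross[k, n, m]: interference gain from the k-th to the n-th T2T link on sub-band m.
    N, M = P_d.shape
    sinr = np.zeros((N, M))
    for n in range(N):
        for m in range(M):
            inter = P_c[m] * g_c2d[m, n]
            inter += sum(rho[k, m] * P_d[k, m] * g_cross[k, n, m] for k in range(N) if k != n)
            sinr[n, m] = P_d[n, m] * g_d[n, m] / (sigma2 + inter)
    return sinr

def capacities(B, sinr_c, sinr_d, rho):
    C_c = B * np.log2(1.0 + sinr_c)                          # per T2W link
    C_d = B * (rho * np.log2(1.0 + sinr_d)).sum(axis=1)      # per T2T link, over its selected sub-band
    return C_c, C_d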
S4: the spectrum sharing problem in TACS communication is modeled as one MADRL problem. The method comprises the following substeps:
s41: the spectrum sharing scenario is modeled as a MADRL problem in conjunction with reinforcement learning algorithms. In connection with fig. 3, it is evident that each T2T link acts as an agent, one state S being observed in the state space S t And accordingly takes an action a in action space a t The sub-band and the transmission power are selected according to a strategy pi. Policy v may be defined by action cost function Q (S t ,A t ) Determination that it represents an agent in a certain state S t In the case of (a) executing a certain action A t I.e., pi (a |s) =p (a t =a∣S t =s)。
S42: individual agents obtain rewards R by interacting with tunnel communication environment t Thereby guiding itself in the next new state S t+1 Action A to be selected for execution t+1 . Rewards R t System capacity by T2T and T2W linksThe amount and the delay constraint of the corresponding T2T link.
S43: multiple T2T agents represent multiple trains in a tunnel scenario, all of which jointly explore the environment and improve spectrum allocation and power control strategies based on their own observations of the environment state, as described in fig. 3.
S5: the DDPG algorithm is applied to the MADRL model to provide the MADDPG algorithm, and a state space, a reward function, a cost function, an algorithm flow and the like of the algorithm are provided, so that a reinforcement learning distributed execution algorithm is designed. The method specifically comprises the following substeps:
s51: state space: s is S t ={G t ,H t ,I t-1 ,E t-1 ,F t ,U t }
Wherein, the channel gain G of the T2T link in the current time slot t Channel gain H of T2W link at current time slot t Interference I caused by other links before t-1 Selection E of last time slot neighbor subchannel t-1 The transmission load F t Residual time U satisfying time delay constraint t
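A minimal sketch of assembling this local observation for one T2T agent; the environment object env and all of its field names are hypothetical placeholders.

import numpy as np

def build_state(env, n):
    return np.concatenate([
        env.t2t_gain[n],          # G_t: T2T channel gain on each sub-band
        env.t2w_gain[n],          # H_t: T2W channel gain on each sub-band
        env.interference[n],      # I_{t-1}: interference received in the previous slot
        env.neighbor_channel[n],  # E_{t-1}: neighbours' sub-band choices in the previous slot
        [env.remaining_load[n]],  # F_t: remaining payload to transmit
        [env.remaining_time[n]],  # U_t: time left under the latency constraint
    ]).astype(np.float32)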
Reward function:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum acceptable delay, λ_c, λ_d and λ_p are the weights of the three parts, and (T_0 − U_t) is the time taken for transmission.
Final (cumulative) reward:
R = E[ Σ_t β^t R_t ]
where β ∈ [0, 1] is the attenuation (discount) factor.
Action value function update:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + β max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]
where α represents the learning rate.
Loss function:
L(θ) = E[ (y_t − Q(S_t, A_t; θ))² ],  with y_t = R_t + β Q(S_{t+1}, A_{t+1}; θ′)
where θ′ are the parameters of the target Q network.
S52: the specific algorithm flow is as follows:
initializing the actual network and target network parameters of the Actor and Critic of each agent;
initializing the size B of each agent's experience pool k
Resetting the Internet of vehicles environment;
updating the vehicle position and the large-scale fading alpha;
the reality Actor strategy network is based on the input state S t Output action A t The intelligent agent obtains the state S at the next moment after executing the action t+1 All agents perform actions and acquire a common prize R t
Updating small-scale fading of the channel;
obtaining training data (S) t ,A t ,R t ,S t+1 );
Storing training data in an experience playback pool;
randomly sampling m training data from an experience playback pool to form a data set, and transmitting the data set to an Actor reality network, a Critic reality network, an Actor target network and a Critic target network;
setting the Q estimation as follows:
Figure BDA0004164101860000075
the loss function defining the online Critic evaluation network is:
Figure BDA0004164101860000076
updating an Actor target network;
updating the Critic target network;
updating all parameters delta of the current network of the Actor through gradient back propagation of the neural network;
if the online training times reach the target network updating frequency, respectively updating the target network parameters delta according to the online network parameters delta and theta And theta
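A condensed Python sketch of this loop under stated assumptions: network sizes, learning rates, beta and tau are placeholders, and it illustrates the generic MADDPG update rather than the filing's exact implementation.

import copy
import torch
import torch.nn as nn

class MLP(nn.Module):
    # Simple fully connected network used below for both the Actor and the Critic.
    def __init__(self, in_dim, out_dim, out_act=None):
        super().__init__()
        layers = [nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, out_dim)]
        if out_act is not None:
            layers.append(out_act)
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)

def make_agent(obs_dim, act_dim, n_agents):
    actor = MLP(obs_dim, act_dim, out_act=nn.Tanh())       # continuous sub-band/power action
    critic = MLP(n_agents * (obs_dim + act_dim), 1)        # centralized critic over all agents
    return {"actor": actor, "critic": critic,
            "actor_target": copy.deepcopy(actor), "critic_target": copy.deepcopy(critic),
            "actor_opt": torch.optim.Adam(actor.parameters(), lr=1e-4),
            "critic_opt": torch.optim.Adam(critic.parameters(), lr=1e-3)}

def maddpg_update(agents, batch, beta=0.99, tau=0.005):
    # batch: S, S2 with shape [batch, n_agents, obs_dim]; A: [batch, n_agents, act_dim]; R: [batch, 1].
    S, A, R, S2 = batch
    n = S.shape[1]
    with torch.no_grad():
        A2 = torch.stack([ag["actor_target"](S2[:, i]) for i, ag in enumerate(agents)], dim=1)
    for i, ag in enumerate(agents):
        # Critic: minimise (y - Q(S, A))^2 with the target y = R + beta * Q'(S', mu'(S')).
        with torch.no_grad():
            y = R + beta * ag["critic_target"](torch.cat([S2.flatten(1), A2.flatten(1)], dim=1))
        q = ag["critic"](torch.cat([S.flatten(1), A.flatten(1)], dim=1))
        critic_loss = nn.functional.mse_loss(q, y)
        ag["critic_opt"].zero_grad(); critic_loss.backward(); ag["critic_opt"].step()
        # Actor: ascend the centralized Q with respect to this agent's own action.
        acts = [A[:, j] if j != i else ag["actor"](S[:, i]) for j in range(n)]
        actor_loss = -ag["critic"](torch.cat([S.flatten(1), torch.stack(acts, dim=1).flatten(1)], dim=1)).mean()
        ag["actor_opt"].zero_grad(); actor_loss.backward(); ag["actor_opt"].step()
        # Soft-update the target networks towards the online networks.
        for net, tgt in (("actor", "actor_target"), ("critic", "critic_target")):
            for p, p_t in zip(ag[net].parameters(), ag[tgt].parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)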
And S6, considering the joint optimization problem in the continuous action space, and optimizing a deep reinforcement learning model by using a DDPG algorithm comprising three mechanisms of deep learning fitting, attention mechanism and experience playback. The method specifically comprises the following substeps:
s61: deep learning fitting refers to the invention of fitting deterministic strategies and action value functions using deep neural networks of different parameters;
s62: because communication between trains only needs to pay attention to communication between trains adjacent in the same direction in practice, and not much attention is required to trains traveling in different directions, the attention mechanism can be introduced to improve the efficiency and accuracy of information interaction.
The method adopts centralized training and distributed execution; since an agent cannot observe the complete state of the environment, execution is decentralized.
In particular, the following steps may be employed to introduce the attention mechanism (a minimal sketch follows the list):
(1) The direction and speed information of each agent is encoded as a vector as input data.
(2) For each agent, according to the information such as the direction and the relative speed of the agent and other agents, an attention vector is calculated and used for reflecting the attention degree and the importance of the current agent to other agents.
(3) And performing dot product operation on the attention vector of each agent and the input vectors of other agents to obtain attention weighted input vectors of each agent to the other agents.
(4) The attention weighted input vector for each agent is used as input to calculate the motion or state of the vehicle.
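An illustrative single-head sketch of the four steps above, with assumed feature and key dimensions; the patent does not specify this exact architecture.

import torch
import torch.nn as nn

class AgentAttention(nn.Module):
    def __init__(self, feat_dim, key_dim=32):
        super().__init__()
        self.query = nn.Linear(feat_dim, key_dim)
        self.key = nn.Linear(feat_dim, key_dim)
        self.value = nn.Linear(feat_dim, key_dim)

    def forward(self, own, others):
        # own: [batch, feat_dim] encoding of this agent's direction/speed (step 1);
        # others: [batch, n_others, feat_dim] encodings of the other agents.
        q = self.query(own).unsqueeze(1)                        # [batch, 1, key_dim]
        k, v = self.key(others), self.value(others)             # [batch, n_others, key_dim]
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5           # steps 2-3: scaled dot-product attention
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # importance assigned to each other agent
        return (weights * v).sum(dim=1)                         # step 4: attention-weighted input to the head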
S63: experience playback by setting an experience playback mechanism, a sample pool is collected first, and then some small-size samples are randomly selected from the sample pool for training.
S7: and obtaining an optimal T2T transmitting power selection model, an allocation channel decision, a T2W link capacity with better performance, a T2T system capacity and a lower T2T transmission delay according to the optimized deep reinforcement learning model. The method specifically comprises the following substeps:
s71: outputting the optimal action strategy to obtain the optimal T2T transmitting power and the allocation channel strategy;
s72: and obtaining the T2W link capacity, the T2T system capacity and the lower T2T transmission delay with better performance according to the optimized deep reinforcement learning model.
As shown in fig. 2, the Vehicle On-Board Controller (VOBC) communicates with the target Controller (Object Controller, OC), the passenger information system (Passenger Information System, PIS) and the video monitoring system (Image Monitoring System, IMS) through a base station-backbone-core network; wherein, step S1 includes:
aiming at the spectrum sharing problem of multi-agent deep reinforcement learning based on continuous action space in a TACS communication scene, the simulation power of the reinforcement learning algorithm in the past is generally discrete value, so that a tunnel scene cannot be truly simulated, and meanwhile, the signaling priority problem of communication between trains is not considered in the existing work. In the embodiment, the communication modes are divided into two types of T2T and T2W, the communication content is divided into two different priorities according to the safety importance of the train, the resource sharing is modeled as a Multi-agent deep reinforcement learning problem, and a Multi-agent deep deterministic strategy gradient (Multi-Agent Deep Deterministic Policy Gradient, MADDPG) algorithm based on distributed execution is provided. Each agent continuously interacts with the tunnel environment and observes own local state to obtain a common reward, and the Critic network is intensively trained by summarizing actions of other agents, so that the power control selected by each agent is improved. By designing the reward function and the training mechanism, the multi-agent algorithm can realize distributed resource allocation, effectively improve the total capacity of the TACS system and reduce the transmission delay of the T2T link.
Example 2
The present embodiment provides an electronic device, including: one or more processors and a memory having stored therein one or more programs comprising instructions for performing the deep reinforcement learning based TACS network resource allocation method as described in embodiment 1.
Example 3
The present embodiment provides a computer-readable storage medium comprising one or more programs for execution by one or more processors of an electronic device, the one or more programs comprising instructions for performing the deep reinforcement learning-based TACS network resource allocation method described in embodiment 1.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A TACS network resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
constructing a communication scene in a tunnel environment aiming at a train autonomous operation system, and dividing signaling priority for communication service in the communication scene;
under the constraint of the maximum delay of the inter-train communication links in the communication scene, constructing a multi-agent deep reinforcement learning model with the goal of maximizing the throughput of the train autonomous operation system;
training the multi-agent deep reinforcement learning model using a deep deterministic policy gradient algorithm based on deep learning fitting, an attention mechanism and prioritized experience replay;
obtaining an optimal action strategy based on the trained multi-agent deep reinforcement learning model, and allocating network resources based on the optimal action strategy.
2. The TACS network resource allocation method based on deep reinforcement learning according to claim 1, wherein the construction process of the multi-agent deep reinforcement learning model comprises the following steps:
taking each train-to-train communication link as an agent; each agent acquires a state in the state space, correspondingly takes an action in the action space, and allocates network resources according to a strategy, wherein the strategy is determined by a probabilistic action value function representing the probability that the agent executes a certain action in a certain state;
each agent obtains a reward by interacting with the communication scene, which is used to determine the action to be selected in the next new state, wherein the reward is obtained based on the capacities of the train-to-train and train-to-trackside communication links and the delay of the train-to-train communication link.
3. The TACS network resource allocation method according to claim 2, wherein the reward is obtained by the following formula:
R_t = λ_c · Σ_{m=1..M} C_c[m] + λ_d · Σ_{n=1..N} C_d[n] − λ_p · (T_0 − U_t)
where T_0 is the maximum delay of the train-to-train communication link, λ_c, λ_d and λ_p are weights, (T_0 − U_t) is the time taken for transmission, C_c[m] is the channel capacity of the m-th train-to-trackside communication link, C_d[n] is the channel capacity of the n-th train-to-train communication link, and M and N are the numbers of train-to-trackside and train-to-train communication links, respectively.
4. The TACS network resource allocation method of claim 1, wherein the training of the multi-agent deep reinforcement learning model comprises the steps of:
initializing, for each agent in the multi-agent deep reinforcement learning model, an Actor deep learning neural network and a decentralized Critic deep learning neural network based on a multi-head attention mechanism;
resetting the environment of the Internet of vehicles, updating the train position, acquiring updated large-scale fading and updated small-scale fading according to the updated train position, acquiring a training sample and storing the training sample in a preset experience playback pool;
and randomly selecting a training sample with a small size from the experience playback pool to form a data set, and training the Actor deep learning neural network and the Critic deep learning neural network of each intelligent agent.
5. The TACS network resource allocation method based on deep reinforcement learning according to claim 4, wherein the training samples include an input state and the corresponding output action, the state at the next moment, and the common reward obtained after all agents perform their actions, wherein the input state and the state at the next moment each include the channel gain of the train-to-train communication link in the current time slot, the channel gain of the train-to-trackside communication link in the current time slot, the link interference, the neighbours' sub-channel selection in the previous time slot, the transmission load, and the remaining time satisfying the delay constraint.
6. The TACS network resource allocation method according to claim 5, wherein the channel gain of the train-to-train and train-to-trackside communication links is: g_n[m] = α_n h_n[m], where h_n[m] is the small-scale fading power component of the n-th train-to-train communication transmitter on the m-th sub-band, and α_n is the large-scale fading component of the n-th train-to-train communication transmitter.
7. The TACS network resource allocation method of claim 4, wherein the agent's Actor deep learning neural network and Critic deep learning neural network are trained based on a loss function, the loss function being obtained based on rewards and probabilistic action cost functions obtained by each agent through interaction with a communication scene.
8. The TACS network resource allocation method according to claim 1, wherein the process of implementing allocation of network resources based on the optimal action policy comprises the steps of:
and acquiring the optimal transmitting power and channel allocation data of the communication link between the trains based on the optimal action strategy, and acquiring the capacities of the communication link between the trains and the trackside and the transmission delay of the communication link between the trains based on the trained multi-agent deep reinforcement learning model.
9. The TACS network resource allocation method according to claim 1, wherein the communication scenario includes an LTE-M-based core network, base station and backbone network, as well as a vehicle-mounted controller and a target controller, wherein the vehicle-mounted controller is configured to implement communication between adjacent trains according to the line plan, exchange train position and resource information, and generate movement authority, and the target controller is configured to register and unlock the line resource occupation condition, and to perform degraded-mode functions such as tracking non-communicating trains, recovering operation resources and route safety protection.
10. The TACS network resource allocation method according to claim 1, wherein the process of prioritizing the traffic within the communication scenario comprises the steps of:
the automatic train operation, automatic train protection and automatic train monitoring services are classified into a first priority, and the passenger information system, the video monitoring system and the high data rate service are classified into a second priority.
CN202310358415.5A 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning Pending CN116405904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310358415.5A CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310358415.5A CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116405904A true CN116405904A (en) 2023-07-07

Family

ID=87006995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310358415.5A Pending CN116405904A (en) 2023-04-06 2023-04-06 TACS network resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116405904A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117098192A (en) * 2023-08-07 2023-11-21 北京交通大学 Urban rail ad hoc network resource allocation method based on capacity and time delay optimization
CN117098192B (en) * 2023-08-07 2024-04-26 北京交通大学 Urban rail ad hoc network resource allocation method based on capacity and time delay optimization
CN117931461A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Scheduling method of computing resources, training method of strategy network and device

Similar Documents

Publication Publication Date Title
CN111376954B (en) Train autonomous scheduling method and system
CN116405904A (en) TACS network resource allocation method based on deep reinforcement learning
CN111369181B (en) Train autonomous scheduling deep reinforcement learning method and device
CN111311959B (en) Multi-interface cooperative control method and device, electronic equipment and storage medium
CN113867354B (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
CN111619624A (en) Tramcar operation control method and system based on deep reinforcement learning
CN114116047A (en) V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning
CN112565377B (en) Content grading optimization caching method for user service experience in Internet of vehicles
CN114786152A (en) Credible collaborative computing system for intelligent rail transit
Lv et al. Edge computing task offloading for environmental perception of autonomous vehicles in 6G networks
CN116017479A (en) Distributed multi-unmanned aerial vehicle relay network coverage method
Qi et al. Social prediction-based handover in collaborative-edge-computing-enabled vehicular networks
CN106685608A (en) Resource scheduling method, device and node of vehicle-road collaborative communication system
Dong et al. Support vector machine for channel prediction in high-speed railway communication systems
CN114745699B (en) Vehicle-to-vehicle communication mode selection method and system based on neural network and storage medium
CN112367638A (en) Intelligent frequency spectrum selection method for vehicle-vehicle communication of urban rail transit vehicle
CN115208892B (en) Vehicle-road collaborative online task scheduling method and system based on dynamic resource demand
CN113534784A (en) Decision method of intelligent body action and related equipment
CN115100866B (en) Vehicle-road cooperative automatic driving decision-making method based on layered reinforcement learning
CN116588138A (en) Block chain-based distributed intelligent auxiliary automatic driving method
CN115715021A (en) Internet of vehicles resource allocation method and system
CN116151538A (en) Simulation system and method applied to intelligent bus demand balanced scheduling
Deng et al. Enhancing Multi-Agent Communication Collaboration through GPT-Based Semantic Information Extraction and Prediction
CN113592100B (en) Multi-agent reinforcement learning method and system
Huang et al. A novel method on probability evaluation of ZC handover scenario based on SMC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination