CN116600267A - Anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system - Google Patents

Anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system

Info

Publication number
CN116600267A
CN116600267A (application CN202310417624.2A)
Authority
CN
China
Prior art keywords
doppler
speed
network
agent
maddpg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310417624.2A
Other languages
Chinese (zh)
Inventor
李佳珉
凌捷
张曦照
朱鹏程
王东明
尤肖虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202310417624.2A priority Critical patent/CN116600267A/en
Publication of CN116600267A publication Critical patent/CN116600267A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/42Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for mass transport vehicles, e.g. buses, trains or aircraft
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/0413MIMO systems
    • H04B7/0452Multi-user MIMO systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/0413MIMO systems
    • H04B7/0456Selection of precoding matrices or codebooks, e.g. using matrices antenna weighting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L27/00Modulated-carrier systems
    • H04L27/26Systems using multi-frequency codes
    • H04L27/2601Multicarrier modulation systems
    • H04L27/2647Arrangements specific to the receiver only
    • H04L27/2655Synchronisation arrangements
    • H04L27/2657Carrier synchronisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/24Cell structures
    • H04W16/28Cell structures using beam steering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system, comprising the following steps: forming beamforming networks at the transmitter and the receiver with large-scale uniform linear arrays (ULAs) to separate signals arriving from different angles, where the transmitter beam branches in different directions correspond to different Doppler frequency offsets (DFOs); modeling the relevant quantities of the high-speed rail scenario as the environment states, actions, and rewards of multi-agent deep deterministic policy gradient (MADDPG) reinforcement learning; when the high-speed rail communicates with multiple base stations simultaneously, treating each antenna array as an agent, with the high-speed rail side performing precoding and pre-compensation according to the MADDPG output and the base-station side performing beamforming according to the MADDPG output, thereby separating the different DFOs in the angle domain and compensating the corresponding Doppler frequency offsets; and continuously training the MADDPG network according to environmental feedback until convergence.

Description

Anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system
Technical Field
The invention relates to a wireless communication method for high-speed railway communication scenarios and belongs to the technical field of mobile communications.
Background
With the development of wireless communication systems, communication in high-mobility scenarios, particularly high-speed rail, is receiving increasing attention. However, compared with communication in static scenarios, high mobility faces many challenges. When the terminal moves at high speed, the relative motion between the base station and the terminal causes a Doppler frequency shift in the received signal and introduces inter-subcarrier interference in OFDM systems, which greatly increases the complexity of channel estimation and equalization and severely degrades communication performance. Solving this problem is therefore of great importance.
How to combat the Doppler effect has been widely studied. For example, the time-varying channel can be estimated and predicted directly, but when different paths are mixed together at the receiver the complexity and difficulty are high. The Doppler effect can be regarded as a superposition of signals in the frequency domain, and Doppler diversity techniques can obtain diversity gain by processing and recombining the signals differently in a fast time-varying channel, but for a complex high-speed mobile environment it remains very difficult to perform Doppler diversity. Pilots, cyclic prefixes, and radio environment maps can be used to estimate the Doppler frequency offset, but in high-speed mobile scenarios time-domain or frequency-domain methods can hardly separate, compensate, or eliminate the DFOs fundamentally; other approaches search for the optimal beamforming vector directly without explicit channel estimation, but do not consider or compensate the Doppler effect. In addition, since the Doppler frequency offset is related to the propagation angle, i.e., different angles correspond to different Doppler frequency offsets uniquely determined by the maximum Doppler frequency offset and the incident angle, some studies have attempted to eliminate the Doppler effect from the spatial domain, but the computation and cost of jointly estimating a massive MIMO channel matrix and DFO parameters with traditional methods such as maximum likelihood (ML) are large.
Some studies address the problem in high-speed mobile scenarios with deep learning, which has strong learning ability, wide coverage, and good adaptability. However, a deep learning algorithm cannot perform unbiased estimation of the underlying law of the data in an application scenario that can only provide a limited amount of data. Achieving good accuracy requires large amounts of data, so such methods are severely limited by the amount of training data. For example, in a high-speed railway communication scenario, if deep learning is used to measure and store offline the multipath or DFO spatial distribution at different track positions, an extremely large amount of data is required for training, and when the amount of data is insufficient the accuracy and performance of the algorithm suffer.
To overcome this problem, deep reinforcement learning algorithms have been proposed, in which the agent continuously learns from the rewards or penalties obtained in its interaction with the environment and thus adapts better to the environment. A deep neural network is first constructed to extract high-level features from raw data and automatically obtain low-dimensional feature representations of high-dimensional data, and decisions are then made using reinforcement learning theory. In practice, many scenarios involve interactions among multiple agents; during training the policy of each agent changes, so from the perspective of any single agent the environment becomes highly non-stationary (the actions of the other agents change the environment). Multiple agents therefore need to cooperate continuously, and MADDPG was proposed for this purpose.
In addition, if a conventional cellular massive MIMO system is still used in a high-speed mobile communication scenario, rapid and frequent cell handovers cause severe interference and seriously affect communication performance. A cell-free massive MIMO system deploys a large number of access points within the communication range, and these access points cooperate with each other, providing higher spatial diversity and multiplexing gain and significantly improving system spectral efficiency. Moreover, each user in the system can be served by all access points, the average distance between users and access points is greatly shortened, and a more balanced quality of service can be provided. Therefore, designing a Doppler compensation scheme based on deep reinforcement learning for cell-free massive MIMO systems in high-speed mobile scenarios has both theoretical and practical significance.
Disclosure of Invention
Technical problems: the invention aims to provide an anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system so as to maximize the achievable rate of the communication system. Multiple base stations jointly serve the high-speed rail, and the high-speed rail and the base stations perform joint training and interact with the environment. Finally, precoding and angle-domain Doppler compensation are performed on the user side, and each base station performs the corresponding beamforming when receiving.
Technical scheme: the anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system disclosed by the invention comprises the following steps:
The wireless network communication system, a cell-free massive MIMO system in a high-speed mobile scenario, comprises a high-speed rail, a plurality of access points (APs), and a central processing unit (CPU), wherein the APs are connected to the CPU through fronthaul links and jointly serve the high-speed rail on the same time-frequency resource; a mobile relay is arranged on top of the high-speed train and connects to the in-train access points, forming a two-hop structure that alleviates penetration loss and reduces the handover scale; in the high-speed mobile scenario, the relative motion between the receiver and the transmitter causes Doppler frequency offsets, which cause inter-subcarrier interference in the OFDM system and seriously affect communication performance, and the Doppler frequency offsets are corrected by a Doppler compensation method based on multi-agent deep reinforcement learning.
The Doppler compensation method based on multi-agent deep reinforcement learning comprises the following steps:
step one: establish a communication model between a plurality of APs and the high-speed rail, wherein in the cell-free massive MIMO system the APs simultaneously serve the high-speed rail and are connected to the central processing unit through fronthaul links;
step two: at the start of training, initialize the agents, states, actions, and rewards of the multi-agent deep deterministic policy gradient (MADDPG) algorithm, corresponding respectively to the channel state information, the selected angles, and the achievable system rate; initialize the actor network and the critic network, and set and initialize the parameters of the two networks;
step three: judge whether the agent corresponds to the relay on the high-speed rail side or a large-scale antenna array on the AP side, and perform the respective operation;
step four: if the agent is the high-speed rail side antenna array, perform precoding and Doppler pre-compensation according to the angle of departure (AOD) output by the actor network, so as to reduce the Doppler frequency offsets corresponding to different paths;
step five: if the agent is an AP-side antenna array, perform AP-side beam alignment according to the angle of arrival (AOA) output by the actor network;
step six: update the MADDPG network parameters according to the reward obtained by the system, and repeat steps three to six until convergence.
The parameter setting and initializing of the actor network and the critic network in the second step specifically comprises the following steps:
step 2.1, MADDPG parameter setting and initialization;
Deep reinforcement learning mainly consists of an agent, an environment, states, actions, and rewards. At the t-th time step, assume the current state of the environment is s_t ∈ S; after the agent performs action a_t ∈ A, the environment transitions to a new state s_{t+1} and obtains the corresponding reward value r_t = R(s_t, a_t), where S is the state space, A is the action space, and R is the reward function;
the specific setting and joint optimization problems are as follows:
status: the information of all the agents, i.e. the large-scale antenna arrays, constitutes the state at time t; the state of the m-th agent is recorded as s_t^m, and the states of all agents in the cell-free massive MIMO system in the high-speed mobile scenario can be defined as s_t = {s_t^1, s_t^2, ..., s_t^M}, where M is the number of agents and the state of each agent is its own channel state information;
actions: in each time step, each agent selects an angle, where an AP-side antenna array selects a receiving angle (AOA) and the train-side antenna array selects a transmitting angle (AOD) at the current sampling time;
rewards: the goal is to maximize the total achievable rate of the uplink at the CPU; in this case each agent knows what state it is in and what action it should take to obtain the maximum system sum reward;
thus the environment, states, actions, and rewards of MADDPG are defined; a minimal data-structure sketch is given below.
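As an illustration only (not part of the original disclosure), the per-agent state (channel state information), action (selected beam angle), and shared reward (system achievable rate at the CPU) described above could be represented as follows; all names and shapes are assumptions made for this sketch.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AgentExperience:
    """One MADDPG transition for a single antenna-array agent (illustrative only)."""
    state: np.ndarray        # instantaneous CSI observed by this agent (flattened complex vector)
    action: float            # selected beam angle in radians (AOA at an AP, AOD at the train)
    reward: float            # shared reward: total uplink achievable rate computed at the CPU
    next_state: np.ndarray   # CSI at the next sampling instant

def joint_state(per_agent_states):
    """Concatenate all agents' states s_t = {s_t^1, ..., s_t^M} for the centralized critic."""
    return np.concatenate([np.asarray(s).ravel() for s in per_agent_states])
```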
The high-speed rail side precoding and Doppler pre-compensation processing comprises the following steps (a numerical sketch follows the list):
step 4.1, for a given position, judge whether the agent is a high-speed rail side or an AP-side antenna array;
step 4.2, if it is the high-speed rail side antenna array, execute the action output by the learning algorithm to obtain the AOD angles corresponding to the different paths;
step 4.3, with the AOD angles of the different paths, i.e. the different Doppler frequency offsets, separated in step 4.2, perform Doppler frequency offset pre-compensation and precoding in the angle domain before transmission;
thereby Doppler compensation at the high-speed rail side transmitter is completed; in addition, because a multi-agent deep reinforcement learning algorithm is used, the delay of traditional algorithms for separating the different angles can be reduced.
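The following is a minimal numerical sketch of the angle-domain pre-compensation in steps 4.1 to 4.3, assuming a uniform linear array at the train and one line-of-sight beam branch per selected AOD; the helper names, default parameters, and the simple conjugate-phase compensator are assumptions for illustration, not the exact implementation of the invention.

```python
import numpy as np

def ula_steering(theta, n_ant, d_over_lambda=0.5):
    """ULA steering vector for angle theta (radians), half-wavelength spacing by default."""
    k = np.arange(n_ant)
    return np.exp(1j * 2 * np.pi * d_over_lambda * k * np.sin(theta)) / np.sqrt(n_ant)

def precompensate_branch(symbols, aod, speed, wavelength, t_s):
    """Remove the Doppler phase rotation of one beam branch before transmission.

    The branch toward departure angle `aod` sees a DFO of f_d * cos(aod), with
    f_d = speed / wavelength, so sample n is rotated back by exp(-j*2*pi*f_d*cos(aod)*n*t_s).
    """
    f_d = speed / wavelength
    n = np.arange(len(symbols))
    return symbols * np.exp(-1j * 2 * np.pi * f_d * np.cos(aod) * n * t_s)

def transmit_signal(symbols, aods, n_tx, d_over_lambda=0.5,
                    speed=100.0, wavelength=0.1, t_s=1e-6):
    """Pre-compensate each branch for its own DFO, then precode it onto its AOD beam."""
    x = np.zeros((n_tx, len(symbols)), dtype=complex)
    for aod in aods:                                   # one beam branch per selected AOD
        branch = precompensate_branch(symbols, aod, speed, wavelength, t_s)
        x += np.outer(ula_steering(aod, n_tx, d_over_lambda), branch)
    return x
```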
The AP-side beam alignment of step five specifically comprises the following steps (a sketch follows the list):
step 5.1, for a given position, judge whether the agent is a high-speed rail side or an AP-side antenna array;
step 5.2, if it is an AP-side antenna array, execute the action output by the MADDPG algorithm to obtain the AOA angles corresponding to the different paths;
step 5.3, with the AOA angles of the different paths separated in step 5.2, the AP side performs beamforming according to the different angles when receiving the signal;
thereby beam alignment at the AP-side receiver is completed.
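A small illustrative sketch of the AP-side receive beamforming in steps 5.1 to 5.3: each AP steers a matched-filter beam toward the AOA chosen by its actor network. The function name and the conjugate-beamformer choice are assumptions about one reasonable realization rather than the patented implementation.

```python
import numpy as np

def receive_beamform(y, aoa, d_over_lambda=0.5):
    """Matched-filter (conjugate) beamformer steered toward the AOA chosen by the AP's agent.

    y : complex array of shape (n_rx, n_samples) -- signal at the AP's N_R antennas.
    Returns the beamformed sample stream of shape (n_samples,).
    """
    n_rx = y.shape[0]
    k = np.arange(n_rx)
    w = np.exp(1j * 2 * np.pi * d_over_lambda * k * np.sin(aoa)) / np.sqrt(n_rx)
    return w.conj() @ y
```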
The updating of the MADDPG network parameters in step six according to the reward obtained by the system specifically comprises the following steps (a simplified update sketch follows the list):
step 6.1, for MADDPG, the reward value is set to the total achievable rate of the system at the CPU, taking the inter-subcarrier interference into account;
step 6.2, during training, the actor network of each antenna array decides the action to execute in the current state, where the current state is the instantaneous channel state information of each agent, so as to obtain the best effect, and the critic network evaluates the action by computing the state-action function;
step 6.3, based on step 6.2, the critic network is trained and updated with the estimated and target state-action functions through the temporal-difference (TD) method, which then serves as a supervision signal for the actor network and guides the actor network to update its policy;
step 6.4, training is iterated continuously according to the total reward value of the system until convergence;
by combining the MADDPG algorithm and training and updating continuously with the goal of maximizing the achievable rate of the system, the different agents execute their respective optimal actions, so the Doppler frequency offsets are compensated correspondingly in the angle domain, and the influence of Doppler frequency offsets in high-speed mobile scenarios is effectively alleviated.
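For illustration only, the following PyTorch-style sketch shows the kind of temporal-difference critic update and deterministic policy-gradient actor update described in steps 6.2 and 6.3 for a single agent with a centralized critic; the network sizes, optimizers, tensor shapes, and the assumption that this agent owns action slot 0 are all illustrative choices, not taken from the patent.

```python
import math
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an agent's local CSI observation to a beam angle in [-pi/2, pi/2]."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Tanh())
    def forward(self, obs):
        return self.net(obs) * (math.pi / 2)

class Critic(nn.Module):
    """Centralized critic: scores the joint state and the joint action of all agents."""
    def __init__(self, joint_state_dim, n_agents, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_state_dim + n_agents, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, joint_state, joint_action):
        return self.net(torch.cat([joint_state, joint_action], dim=-1))

def td_update(actor, critic, target_critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One TD step for one agent: fit the critic to r + gamma * Q', then ascend the critic
    with respect to this agent's own action (next_joint_a is assumed to come from the
    target actors of all agents, computed elsewhere)."""
    obs, joint_s, joint_a, reward, next_joint_s, next_joint_a = batch
    with torch.no_grad():
        y = reward + gamma * target_critic(next_joint_s, next_joint_a)   # TD target
    critic_loss = nn.functional.mse_loss(critic(joint_s, joint_a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient: this agent re-selects its action (slot 0)
    # and maximizes the critic's evaluation of the resulting joint action.
    new_joint_a = torch.cat([actor(obs), joint_a[:, 1:]], dim=-1)
    actor_loss = -critic(joint_s, new_joint_a).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```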
Beneficial effects: the invention considers multiple base stations jointly serving the high-speed rail and exploits the fact that the Doppler frequency offset is directly related to the angle; precoding and angle-domain Doppler compensation are performed on the user side, each base station performs the corresponding beamforming when receiving, and the Doppler frequency offset compensation is finally completed in the angle domain. At the same time, by combining the MADDPG algorithm, the computational complexity is reduced. The method provides good Doppler compensation and, running with low complexity in beam alignment, improves system performance and obtains performance comparable to exhaustive search.
Drawings
Fig. 1 is a schematic diagram of the cell-free massive MIMO system in a high-speed mobile scenario according to the invention;
FIG. 2 is a schematic flow chart of the present invention for angle domain Doppler frequency offset compensation;
FIG. 3 is a schematic diagram of the Doppler frequency offset compensation in combination with MADDPG according to the present invention;
FIG. 4 is a graph of simulation contrast of achievable rates for different deep neural network structures of the present invention;
FIG. 5 is a graph of simulated comparison of the achievable rates of different agents of the invention;
FIG. 6 is a simulated contrast chart of the Doppler frequency offset compensation effect of the present invention;
fig. 7 is a comparative graph of beam alignment performance simulations for different schemes of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and the specific embodiments.
The invention relates to an anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system. As shown in Fig. 1, the network architecture comprises the high-speed rail, a plurality of access points (APs), and a central processing unit (CPU), wherein the APs are connected to the CPU through fronthaul links and jointly serve the high-speed rail on the same time-frequency resource; a mobile relay is arranged on top of the train and forms a two-hop structure with the in-car access points, which alleviates the penetration loss and group handover problems. In a high-speed mobile scenario, the relative motion between the receiver and the transmitter causes Doppler frequency offsets and inter-subcarrier interference in the OFDM system, seriously affecting communication performance. As shown in Fig. 2, the Doppler compensation method based on multi-agent deep reinforcement learning is as follows:
step one: establish a communication model between a plurality of APs and the high-speed rail, wherein in the cell-free massive MIMO system the APs simultaneously serve the high-speed rail and are connected to the central processing unit through fronthaul links;
step two: at the start of training, initialize the agents, states, actions, and rewards of MADDPG as the channel information, the selected beam angles, and the system performance, respectively; initialize the actor and critic networks and their parameters;
step three: judge whether the agent corresponds to the relay on the high-speed rail side or a large-scale antenna array on the AP side, and perform the respective operation;
step four: if the agent is the high-speed rail side antenna array, perform precoding and Doppler pre-compensation according to the AOD output by the actor network, so as to reduce the Doppler frequency offsets of the different paths and complete the Doppler frequency offset compensation;
step five: if the agent is an AP-side antenna array, perform beam alignment according to the AOA output by the actor network;
step six: update the MADDPG network parameters according to the reward obtained by the system, and repeat steps three to six until convergence. A simplified sketch of this interaction loop is given below.
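Purely as an illustration of how steps one to six fit together, the following pseudocode-style Python sketch outlines the interaction loop; the environment object, the rate computation, and the agent classes are hypothetical placeholders rather than the patented implementation.

```python
def train_anti_doppler(env, train_agent, ap_agents, n_episodes=500, n_steps=200):
    """Outer MADDPG interaction loop (illustrative): each antenna array is an agent,
    the train agent picks AODs for pre-compensation/precoding, each AP agent picks
    an AOA for receive beamforming, and the shared reward is the CPU sum rate."""
    for episode in range(n_episodes):
        states = env.reset()                           # per-agent CSI observations
        for step in range(n_steps):
            # Steps three to five: each agent acts according to its role.
            train_action = train_agent.act(states["train"])                    # AODs
            ap_actions = {i: a.act(states[i]) for i, a in ap_agents.items()}   # AOA per AP
            # The environment applies pre-compensation, precoding, and beamforming,
            # then returns the total achievable rate at the CPU as the reward.
            next_states, reward = env.step(train_action, ap_actions)
            # Step six: store the transition and update every agent's networks.
            for agent in [train_agent, *ap_agents.values()]:
                agent.remember(states, train_action, ap_actions, reward, next_states)
                agent.update()                          # TD critic update + policy update
            states = next_states
    return train_agent, ap_agents
```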
The theoretical model is established and analyzed as follows:
Consider a cell-free scenario in which L base stations serve a high-speed rail, where each base station (AP) is equipped with N_R antennas and the high-speed rail is equipped with N_T antennas. Assuming that the base stations are time-synchronized and that time is divided equally into frames, the uplink channel from the train to the l-th AP at the n-th sampling instant is modeled as:
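Under the standard line-of-sight high-speed-rail model implied by the definitions that follow, the channel can plausibly be written as shown below; this is an assumed reconstruction for readability, not the patent's verbatim formula.

```latex
H_l[n] \;=\; \sqrt{\beta_l}\, e^{j\eta_l}\, e^{j 2\pi f_d \cos\theta_l\, n T_s}\,
\mathbf{a}_R(\varphi_l)\, \mathbf{a}_T^{H}(\theta_l)
```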
where β_l is the large-scale fading between the base station array and the train-end array, η_l is a random phase shift, f_d = v/λ is the maximum Doppler frequency offset, with v and λ the train speed and the wavelength respectively, θ_l and φ_l are the AOD and AOA between the l-th AP array and the train array, a_R(φ_l) and a_T(θ_l) are the antenna array responses of the base station and the train, T_s is the sampling interval, and (·)^H denotes the conjugate transpose of a matrix or vector. For a ULA antenna array:
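A standard ULA array response consistent with this description (the unit-norm scaling and sine convention are assumptions) is:

```latex
\mathbf{a}(\theta) \;=\; \frac{1}{\sqrt{N}}
\left[\, 1,\; e^{j 2\pi \frac{d}{\lambda}\sin\theta},\; \ldots,\;
e^{j 2\pi \frac{d}{\lambda}(N-1)\sin\theta} \,\right]^{T}
```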
where d is the antenna spacing. Define X = [X[0], X[1], ..., X[N-1]]^T as the information symbols in one OFDM block, where N is the number of subcarriers, i.e. the frequency-domain transmit signal on the k-th subcarrier is X[k]. After applying an N-point inverse discrete Fourier transform (IDFT) and adding a cyclic prefix (CP) of length N_cp, the time-domain signal at the n-th instant can be expressed as:
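The conventional OFDM time-domain expression matching this description is presumably of the form:

```latex
x[n] \;=\; \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N},
\qquad n = -N_{cp}, \ldots, N-1
```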
Assume that, based on MADDPG, the l-th base station and the train end select their receiving and transmitting angles respectively, and that the receiver timing is ideally synchronized; the train end performs precoding and DFO compensation on the transmit signal at the selected angle, and the base-station side performs the corresponding beamforming. The received signal y_l[n] of the l-th base station at the n-th instant is then:
Recovering the frequency-domain signal from the received signal y_l[n] by the discrete Fourier transform (DFT) gives:
This can be further decomposed as follows: the first term corresponds to the transmitted signal on the subcarrier, i.e. the desired signal; the second term corresponds to interference from the other subcarriers, which is affected by the residual DFO; and the third term corresponds to interference from the other precoding and compensation directions. The SINR of the k-th subcarrier at the l-th AP is then expressed accordingly, where ‖·‖ denotes the vector 2-norm. Finally, the total achievable rate of the system is:
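Assuming the usual definition, and consistent with the reward referred to later as expression (8), the total achievable rate presumably takes the form:

```latex
R \;=\; \sum_{l=1}^{L} \sum_{k=0}^{N-1} \log_2\!\bigl(1 + \mathrm{SINR}_{l,k}\bigr)
```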
The invention combines MADDPG multi-agent deep reinforcement learning, as shown in Fig. 3; the specific process is as follows:
(1) Deep reinforcement learning DRL
Deep reinforcement learning mainly consists of an agent, an environment, states, actions, and rewards. At the t-th time step, assume the current state of the environment is s_t ∈ S; after the agent performs action a_t ∈ A, the environment transitions to a new state s_{t+1} and obtains the corresponding reward value r_t = R(s_t, a_t).
The specific setting and joint optimization problems are as follows:
status: the information of all the agents (the large-scale antenna arrays) constitutes the state at time t; the state of the m-th agent is recorded as s_t^m, and the states of all agents in the cell-free massive MIMO system in the high-speed mobile scenario can be defined as s_t = {s_t^1, s_t^2, ..., s_t^M}, where the state of each agent is its own channel state information;
actions: in each time step, each agent selects an angle, where an AP-side antenna array selects a receiving angle (AOA) and the train-side antenna array selects a transmitting angle (AOD) at the current sampling time;
rewards: the goal is to maximize the total achievable rate of the uplink at the CPU, i.e., at each time step t, expression (8) is used as the reward function r_t. The final goal is to find the optimal policy that maximizes the discounted return R_t, defined as R_t = Σ_{τ=0}^{∞} γ^τ r_{t+τ}, where γ ∈ [0, 1) is a discount factor weighting long-term rewards. In this case, each agent knows what state it is in and what action it should take to obtain the maximum system sum reward.
(2) MADDPG-based Doppler compensation
In the multi-agent system of the invention, each antenna array is an agent, and a change in the state of one agent affects not only the behavior of the other agents but also the environment; the correspondence of the other specific parameters is shown in Table 2. The MADDPG network adopts centralized training and decentralized execution and improves learning stability through interaction with the high-speed rail (HSR) environment. As shown by the algorithm in Table 3, during training the actor network of each antenna array decides the action performed in the current state, which is the instantaneous channel state information of each agent, so as to achieve the best effect, and the critic network evaluates the expected future return of the action by computing the Q value.
P(s'|s, a) is the probability of transitioning to state s' when taking action a in state s, and the corresponding reward is R(s, a|s'). The critic network is trained and updated with the estimated and target Q values through the temporal-difference (TD) method and then serves as a supervision signal for the actor network, guiding the actor network to update its policy. We use J(π_i) as the expected return of the i-th agent; based on its observation, the actor network of agent i is optimized, where the action μ(s_t^i) selected by the actor network gives the receiving or transmitting angle and the Q value obtained from the critic network is responsible for evaluating the action. The critic network itself is optimized to minimize the loss between its estimated Q value and the target value computed under the target-network weights.
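The actor gradient and critic loss that this passage describes correspond, as an assumed reconstruction in standard MADDPG notation, to:

```latex
\nabla_{\theta_i} J(\pi_i) \;=\;
\mathbb{E}\!\left[\nabla_{\theta_i}\mu_i\!\left(s_t^i\right)\,
\nabla_{a_i} Q_i\!\left(s_t, a_1, \ldots, a_M\right)\Big|_{a_i=\mu_i(s_t^i)}\right],
\qquad
\mathcal{L}(\phi_i) \;=\;
\mathbb{E}\!\left[\bigl(Q_i(s_t, a_1, \ldots, a_M) - y_t\bigr)^2\right],
\quad
y_t \;=\; r_t + \gamma\, Q_i'\!\left(s_{t+1}, a_1', \ldots, a_M'\right)
```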
This technical scheme provides good Doppler compensation and, running with low complexity in beam alignment, can improve system performance and obtain performance close to exhaustive search. The specific performance analysis is elaborated in the simulation results.
The simulation parameters are shown in Table I, and the simulation results are shown in FIG. 4, FIG. 5, FIG. 6 and FIG. 7.
Table I parameter settings
Fig. 4 compares different deep neural network structures and analyzes their effect on algorithm convergence, where several values of the total number of units num_units from 8 to 128 were set. It can be seen that as the number of training iterations increases, the total reward value of the system, i.e. the achievable rate, increases, and the structure with 64 units performs best; this parameter is set identically in the simulations below.
Fig. 5 shows the relationship between each agent and the system reward. As the training time increases, the reward values of all agents tend toward their optima. However, in multi-agent deep reinforcement learning, due to differences in the environment, the convergence speed of each agent's reward value differs and deviates from the maximum reward. Since the algorithm produces the optimal solution through the competition and cooperation of the agents, the system reward eventually converges.
Fig. 6 compares the proposed scheme with the initial system with and without Doppler frequency offset; it can be seen that the proposed MADDPG-based angle-domain Doppler compensation method is effective and obtains better performance.
Fig. 7 compares the simulated beam-alignment performance of the random method, the exhaustive method, and the method of the invention. It can be observed that, when only the beam alignment gain is considered,
(1) the MADDPG-based algorithm far outperforms the random method;
(2) the MADDPG-based algorithm can achieve performance approaching that of the exhaustive method.
Meanwhile, compared with traditional algorithms, multi-agent deep reinforcement learning greatly reduces the computational time complexity and promotes further development of communications.
Table 2 is a parameter lookup table of the madppg algorithm of the present invention:
table 3 is a flow chart of the MADDPG algorithm of the present invention;

Claims (6)

1. An anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system, wherein: the wireless network communication system, a cell-free massive MIMO system in a high-speed mobile scenario, comprises a high-speed rail, a plurality of access points (APs), and a central processing unit (CPU), the APs being connected to the CPU through fronthaul links and jointly serving the high-speed rail on the same time-frequency resource; a mobile relay is arranged on top of the high-speed train and connects to the in-train access points, forming a two-hop structure that alleviates penetration loss and reduces the handover scale; in the high-speed mobile scenario, the relative motion between the receiver and the transmitter causes Doppler frequency offsets, which cause inter-subcarrier interference in the OFDM system and seriously affect communication performance, and the Doppler frequency offsets are corrected by a Doppler compensation method based on multi-agent deep reinforcement learning.
2. The anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system according to claim 1, wherein the Doppler compensation method based on multi-agent deep reinforcement learning comprises the following steps:
step one: establishing a communication model between a plurality of APs and the high-speed rail, wherein in the cell-free massive MIMO system the APs simultaneously serve the high-speed rail and are connected to the central processing unit through fronthaul links;
step two: at the start of training, initializing the agents, states, actions, and rewards of the multi-agent deep deterministic policy gradient (MADDPG) algorithm, corresponding respectively to the channel state information, the selected angles, and the achievable system rate; initializing the actor network and the critic network, and setting and initializing the parameters of the two networks;
step three: judging whether the agent corresponds to the relay on the high-speed rail side or a large-scale antenna array on the AP side, and performing the respective operation;
step four: if the agent is the high-speed rail side antenna array, performing precoding and Doppler pre-compensation according to the angle of departure (AOD) output by the actor network, so as to reduce the Doppler frequency offsets corresponding to different paths;
step five: if the agent is an AP-side antenna array, performing AP-side beam alignment according to the angle of arrival (AOA) output by the actor network;
step six: updating the MADDPG network parameters according to the reward obtained by the system, and repeating steps three to six until convergence.
3. The anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system according to claim 2, wherein the parameter setting and initialization of the actor network and the critic network in step two specifically comprises the following steps:
step 2.1, MADDPG parameter setting and initialization;
deep reinforcement learning mainly consists of an agent, an environment, states, actions, and rewards; at the t-th time step, assume the current state of the environment is s_t ∈ S; after the agent performs action a_t ∈ A, the environment transitions to a new state s_{t+1} and obtains the corresponding reward value r_t = R(s_t, a_t), where S is the state space, A is the action space, and R is the reward function;
the specific setting and joint optimization problem are as follows:
status: the information of all the agents, i.e. the large-scale antenna arrays, constitutes the state at time t; the state of the m-th agent is recorded as s_t^m, and the states of all agents in the cell-free massive MIMO system in the high-speed mobile scenario can be defined as s_t = {s_t^1, s_t^2, ..., s_t^M}, where the state of each agent is its own channel state information;
actions: in each time step, each agent selects an angle, where an AP-side antenna array selects a receiving angle (AOA) and the train-side antenna array selects a transmitting angle (AOD) at the current sampling time;
rewards: the goal is to maximize the total achievable rate of the uplink at the CPU; in this case, each agent knows what state it is in and what action it should take to obtain the maximum system sum reward;
thus the environment, states, actions, and rewards of MADDPG are defined.
4. The anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system according to claim 2, wherein the high-speed rail side precoding and Doppler pre-compensation processing in step four comprises the following steps:
step 4.1, for a given position, judging whether the agent is a high-speed rail side or an AP-side antenna array;
step 4.2, if it is the high-speed rail side antenna array, executing the action output by the machine learning algorithm to obtain the AOD angles corresponding to the different paths;
step 4.3, with the AOD angles of the different paths, i.e. the different Doppler frequency offsets, separated through step 4.2, performing Doppler frequency offset pre-compensation and precoding in the angle domain before transmission;
thereby Doppler compensation at the high-speed rail side transmitter is completed; in addition, because a multi-agent deep reinforcement learning algorithm is used, the delay of traditional algorithms for separating the different angles can be reduced.
5. The anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system according to claim 2, wherein the AP-side beam alignment in step five specifically comprises the following steps:
step 5.1, for a given position, judging whether the agent is a high-speed rail side or an AP-side antenna array;
step 5.2, if it is an AP-side antenna array, executing the action output by the MADDPG algorithm to obtain the AOA angles corresponding to the different paths;
step 5.3, with the AOA angles of the different paths separated through step 5.2, performing beamforming according to the different angles when receiving the signal;
thereby beam alignment at the AP-side receiver is completed.
6. The anti-Doppler method based on deep reinforcement learning in a high-speed rail cell-free system according to claim 2, wherein updating the MADDPG network parameters in step six according to the reward obtained by the system specifically comprises the following steps:
step 6.1, for MADDPG, the reward value is set to the total achievable rate of the system at the CPU, taking the inter-subcarrier interference into account;
step 6.2, during training, the actor network of each antenna array decides the action to execute in the current state, where the current state is the instantaneous channel state information of each agent, so as to obtain the best effect, and the critic network evaluates the action by computing the state-action function;
step 6.3, based on step 6.2, the critic network is trained and updated with the estimated and target state-action functions through the temporal-difference (TD) method, which then serves as a supervision signal for the actor network and guides the actor network to update its policy;
step 6.4, training is iterated continuously according to the total reward value of the system until convergence;
by combining the MADDPG algorithm and training and updating continuously with the goal of maximizing the achievable rate of the system, the different agents execute their respective optimal actions, so the Doppler frequency offsets are compensated correspondingly in the angle domain, and the influence of Doppler frequency offsets in high-speed mobile scenarios is effectively alleviated.
CN202310417624.2A 2023-04-19 2023-04-19 Doppler resistance method based on deep reinforcement learning in high-speed rail honeycomb-free system Pending CN116600267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310417624.2A CN116600267A (en) 2023-04-19 2023-04-19 Doppler resistance method based on deep reinforcement learning in high-speed rail honeycomb-free system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310417624.2A CN116600267A (en) 2023-04-19 2023-04-19 Doppler resistance method based on deep reinforcement learning in high-speed rail honeycomb-free system

Publications (1)

Publication Number Publication Date
CN116600267A true CN116600267A (en) 2023-08-15

Family

ID=87598144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310417624.2A Pending CN116600267A (en) 2023-04-19 2023-04-19 Doppler resistance method based on deep reinforcement learning in high-speed rail honeycomb-free system

Country Status (1)

Country Link
CN (1) CN116600267A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116761150A (en) * 2023-08-18 2023-09-15 华东交通大学 High-speed rail wireless communication method based on AP and STAR-RIS unit selection
CN116761150B (en) * 2023-08-18 2023-10-24 华东交通大学 High-speed rail wireless communication method based on AP and STAR-RIS unit selection
CN117478474A (en) * 2023-10-16 2024-01-30 中国人民解放军国防科技大学 Channel precompensation-based antagonistic sample signal waveform generation method
CN117478474B (en) * 2023-10-16 2024-04-19 中国人民解放军国防科技大学 Channel precompensation-based antagonistic sample signal waveform generation method
CN117485410A (en) * 2024-01-02 2024-02-02 成都工业学院 Data communication system and method of train operation control system
CN117485410B (en) * 2024-01-02 2024-04-02 成都工业学院 Data communication system and method of train operation control system

Similar Documents

Publication Publication Date Title
CN116600267A (en) Doppler resistance method based on deep reinforcement learning in high-speed rail honeycomb-free system
Mehrabi et al. Decision directed channel estimation based on deep neural network $ k $-step predictor for MIMO communications in 5G
CN108650003B (en) Hybrid transmission method for joint Doppler compensation in large-scale MIMO high-speed mobile scene
Zheng et al. Cell-free massive MIMO-OFDM for high-speed train communications
CN109547076A (en) Mixing precoding algorithms in the extensive MIMO of millimeter wave based on DSBO
CN111786923A (en) Channel estimation method for time-frequency double-channel selection of orthogonal frequency division multiplexing system
Chu et al. Deep reinforcement learning based end-to-end multiuser channel prediction and beamforming
CN109995403A (en) The improved LAS detection algorithm of simulated annealing thought is based in extensive mimo system
Choi et al. Downlink extrapolation for FDD multiple antenna systems through neural network using extracted uplink path gains
CN115733530A (en) Combined precoding method for reconfigurable intelligent surface assisted millimeter wave communication
CN117560043B (en) Non-cellular network power control method based on graph neural network
Li et al. Selective uplink training for massive MIMO systems
Zhang et al. Adversarial training-aided time-varying channel prediction for TDD/FDD systems
CN114665930A (en) Downlink blind channel estimation method of large-scale de-cellular MIMO system
Banerjee et al. Access point clustering in cell-free massive MIMO using multi-agent reinforcement learning
Lee et al. Multi-agent deep reinforcement learning (MADRL) meets multi-user MIMO systems
CN114268348A (en) Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning
CN112887233A (en) Sparse Bayesian learning channel estimation method based on 2-dimensional cluster structure
CN109818891B (en) Lattice reduction assisted low-complexity greedy sphere decoding detection method
Yi et al. 6G intelligent distributed uplink beamforming for transport system in highly dynamic environments
Dai et al. Channel estimation with predictor antennas in high-speed railway
Ma et al. Model-driven deep learning based channel estimation for millimeter-wave massive hybrid MIMO systems
EP4402866A1 (en) Improved pilot assisted radio propagation channel estimation based on machine learning
Dreifuerst et al. Neural Codebook Design for Network Beam Management
He et al. Generalized Regression Neural Network Based Channel Identification and Compensation Using Scattered Pilot.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination