CN115296709A - Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning - Google Patents

Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning

Info

Publication number
CN115296709A
CN115296709A
Authority
CN
China
Prior art keywords
interference
layer
learning
function
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210785303.3A
Other languages
Chinese (zh)
Inventor
郝传辉 (Hao Chuanhui)
孙绪保 (Sun Xubao)
王胜利 (Wang Shengli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN202210785303.3A priority Critical patent/CN115296709A/en
Publication of CN115296709A publication Critical patent/CN115296709A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/06Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
    • H04B7/0613Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
    • H04B7/0615Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
    • H04B7/0617Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/391Modelling the propagation channel
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L25/00Baseband systems
    • H04L25/02Details ; arrangements for supplying electrical power along data transmission lines
    • H04L25/0202Channel estimation
    • H04L25/024Channel estimation channel estimation algorithms
    • H04L25/0254Channel estimation channel estimation algorithms using neural network algorithms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Power Engineering (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention discloses a self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning, belonging to the technical field of navigation and comprising the following steps: constructing a GPS terminal signal model, which comprises a 2 × 2 dual-polarized antenna array, manipulated variables, a time-varying gain interference variable and Gaussian noise; constructing a deep learning convolutional neural network (CNN), which comprises a data feature extraction network layer, a convolutional network layer, a pooling layer, an activation function layer and a full link layer; carrying out reinforcement learning processing and decision implementation of the Q network on the deep learning convolutional neural network to obtain a deep reinforcement learning Q network; and training the deep reinforcement learning Q network to obtain a trained deep reinforcement learning convolutional neural network.
The invention performs beam control on the array signals and, by adjusting the manipulated variables, achieves main-lobe pointing and side-lobe nulling of interference signals; the deep Q network learns the interference data characteristics and automatically determines the next action to execute without human intervention, which greatly improves the autonomy of the smart antenna.

Description

Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning
Technical Field
The invention discloses a self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning, and belongs to the technical field of navigation.
Background
Adaptive anti-interference beam forming is one of the main research topics of smart-antenna anti-interference technology. It is widely applied in transportation, surveying and mapping, telecommunications, water conservancy, fishery, natural disaster relief, aerospace and other fields, and has very high commercial value. Current adaptive anti-interference beam forming techniques adjust the weighting factor of each array element signal according to a rule-based algorithm, thereby adjusting the radiation pattern of the antenna array so as to enhance the desired signal and suppress interference signals. However, in practical wireless channel transmission, because of the complex electromagnetic environment and dynamically varying interference, array beam forming suffers from non-uniform dispersion imbalance, multi-directional anisotropy and non-deterministic variation. These adverse factors distort the wavefront of the antenna array and produce an angle-spreading phenomenon, so that beam components are scattered along the transmission channel direction of an interference source and give rise to variable virtual interference signals, i.e. spatial-domain superposition interference. Research on array beam forming algorithms for wavefront distortion caused by dynamic interference is still relatively scarce; with the development of artificial intelligence, however, machine-learning anti-interference strategies have been applied to beam forming, in which the desired signal under wavefront distortion is modeled, learned and constrained in a dynamic interference environment.
Related publications include CN202110568887.4, a fast adaptive anti-interference method for large arrays based on a convolutional neural network, which uses a convolutional neural network to address the large computational load and poor beam-shape preservation of adaptive beam forming for large phased arrays in the prior art, and controls the beam forming to achieve anti-interference. Other work studies machine-learning methods for anti-interference, for example: [1] Z. Xiao, B. Gao, S. Liu and L. Xiao, "Learning-Based Power Control for mmWave Massive MIMO Against Jamming," IEEE Global Communications Conference (GLOBECOM), pp. 1-6, 2018, in which Xiao et al. adopt a DQN learning method to improve the sum rate of an anti-interference system in an unknown, low-complexity environment; [2] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao and Q. Wu, "Deep Reinforcement Learning-Based Intelligent Reflecting Surface for Secure Wireless Communications," IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 375-388, Jan. 2021; and a deep reinforcement learning approach in IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, May 2018, which realizes an anti-interference beam forming strategy through a DRL algorithm that obtains the optimal convex solution of the anti-interference beam and maximizes the system sum rate. Learning-based anti-interference decisions have two limitations: i) part of the information in the direction of the interfering beam may be lost because the environment is unknown; and ii) although the anti-interference strategy is switched intelligently according to the dynamic environment, the interference signal is difficult to track in real time.
Reinforcement learning mainly comprises two types of methods: value-based and policy-based (probability-based) methods. Value-based methods optimize, through accumulated experience, the estimation functions of the action values in different states so as to obtain the optimal action control strategy; reinforcement learning has exceeded human performance in most Atari games. Deep neural networks have achieved remarkable results in computing, particularly in computer vision, where convolutional neural networks effectively extract the convolutional features of an image, and deep-neural-network-based methods have achieved excellent results in nonlinear fitting, object localization, object recognition and image semantic segmentation. However, when facing phase disturbance of the antenna array, random system errors or channel-transmission clutter, an interference source can still corrupt the array beam and produce distortion, so that the spatial superposition of interference is strengthened and the steering vector used for beam forming undergoes phase spreading, which causes a rank-bias phenomenon in the signal covariance matrix. This phenomenon is equivalent to the interference signals being split in the channel environment, which greatly consumes the degrees of freedom of the array, loses part of the information in the interference-beam direction, and severely degrades the transmission performance of the system. Therefore, eliminating wavefront distortion in practical engineering applications and establishing a stable anti-interference strategy are of great significance for maximizing the total transmission rate of the system under the array-beam spatial superposition interference model.
Disclosure of Invention
The invention provides a self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning, which solves the prior-art problem that interference signals are split in the channel environment and part of the information in the interference-beam direction is lost.
The self-adaptive anti-interference beam forming method based on the preprocessing deep reinforcement learning comprises the following steps:
S1, constructing a GPS terminal signal model, wherein the GPS terminal signal model comprises a 2 × 2 dual-polarized antenna array, manipulated variables, a time-varying gain interference variable and Gaussian noise;
S2, constructing a deep learning convolutional neural network (CNN), wherein the CNN comprises a data feature extraction network layer, a convolutional network layer, a pooling layer, an activation function layer and a full link layer;
S3, carrying out reinforcement learning processing and decision implementation of the Q network on the deep learning convolutional neural network to obtain a deep reinforcement learning Q network;
S4, training the deep reinforcement learning Q network to obtain a trained deep reinforcement learning convolutional neural network.
Preferably, in S1, the 2 × 2 dual-polarized antenna array obtains 8 polarization port gains, the manipulated variables realize beam patterns at different azimuth angles, the time-varying gain interference variables satisfy time-varying Rayleigh attenuation, and Gaussian noise is used as an interference auxiliary quantity;
the 2 × 2 dual-polarized antenna array is composed of a double-layer substrate of "Rogers RO3010" material, with a relative dielectric constant of 10.2 and a dielectric loss tangent of 0.0035.
Preferably, in S1, the manipulated variables are specified for the i-th pitch and azimuth of the downlink navigation transmission 1# link; their expressions are rendered as equation images in the original filing. In these expressions, θ is the azimuth angle, the second angular argument is the initial pitch angle, and the quantity they parameterize is the beam signal steering vector; diag() denotes a diagonal matrix, exp() the exponential function, j the imaginary unit, ω the angular frequency, sin()·cos() a product of trigonometric functions, and R = d/λ is the ratio of the array element spacing d to the resonant wavelength λ.
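Because the manipulated-variable and steering-vector expressions survive only as images, the following NumPy sketch shows one conventional way such a steering vector and its diagonal manipulated-variable matrix could be formed; the planar-array phase model, the element indexing and the names steering_vector and Phi_1x are illustrative assumptions, not the filing's exact formula.

```python
import numpy as np

def steering_vector(theta, phi, R):
    """Illustrative steering vector for a 2 x 2 planar array.

    theta : azimuth angle (rad); phi : initial pitch angle (rad);
    R     : ratio of array element spacing d to the resonant wavelength (R = d / lambda).
    The common planar-array phase term exp(j*2*pi*R*(m*sin(theta)*cos(phi) + n*sin(theta)*sin(phi)))
    is assumed here; the filing's exact expression is given only as an image.
    """
    phases = [2.0 * np.pi * R * (m * np.sin(theta) * np.cos(phi) + n * np.sin(theta) * np.sin(phi))
              for m in range(2) for n in range(2)]
    return np.exp(1j * np.array(phases))          # 4-element steering vector

# Manipulated variable built as a diagonal matrix from the steering vector (diag() in the text).
a = steering_vector(np.deg2rad(30.0), np.deg2rad(10.0), R=0.5)
Phi_1x = np.diag(a)
```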
Preferably, in S1, the time-varying gain interference variable and the Gaussian noise are the interference quantities of the total GPS received-signal model on the downlink navigation transmission 3# and 4# links; the structural formula of the time-varying gain interference variable is rendered as an equation image in the original filing, in which one term is an interference quantity with a structure similar to that of the desired signal and E_j H_t is the Rayleigh-fading interference quantity at time t. The total GPS signal model is then obtained from the downlink navigation transmission links (also rendered as an equation image), in which the left-hand side is the overall received time-varying GPS signal, one term is the expected GPS signal on the 1# link, two terms are the manipulated variables of the expected GPS signal on the 1# and 2# links respectively, another term is the manipulated variable of the interference signals on the 3# and 4# links, and n(t) is a Gaussian noise signal. The manipulated variables take the form of an IRS sensor phase-keying N × 1 matrix (equation image in the original filing), where i = 1x, 2x or j, j is the imaginary unit, and the amplitude and phase of each IRS phase-keying unit satisfy η ∈ [0, 1] and φ ∈ [0, 2π].
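As a concrete illustration of the total received-signal model described above (whose exact formula is an equation image), the following NumPy sketch sums the terms named in the text: manipulated desired 1#/2# link signals, Rayleigh-faded 3#/4# link interference, and Gaussian noise n(t). The port count N, the function names and the random draws are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4   # number of ports used in this toy example (the filing's exact dimensioning is not recoverable)

def irs_phase_matrix(eta, phi):
    """IRS phase-keying N x 1 vector with amplitudes eta in [0, 1] and phases phi in [0, 2*pi)."""
    return (eta * np.exp(1j * phi)).reshape(-1, 1)

def received_signal(s1, s2, jam, Phi_1x, Phi_2x, Phi_j, noise_std=0.1):
    """Sum of the terms named in the text: manipulated desired 1#/2# link signals,
    manipulated 3#/4# link interference, and Gaussian noise n(t)."""
    n_t = noise_std * (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) / np.sqrt(2)
    return Phi_1x * s1 + Phi_2x * s2 + Phi_j * jam + n_t

Phi_1x = irs_phase_matrix(rng.uniform(0.0, 1.0, N), rng.uniform(0.0, 2.0 * np.pi, N))
Phi_2x = irs_phase_matrix(rng.uniform(0.0, 1.0, N), rng.uniform(0.0, 2.0 * np.pi, N))
Phi_j = irs_phase_matrix(rng.uniform(0.0, 1.0, N), rng.uniform(0.0, 2.0 * np.pi, N))
E_jH_t = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)   # time-varying Rayleigh gain
y_t = received_signal(1.0 + 0j, 1.0 + 0j, E_jH_t * (1.0 + 0j), Phi_1x, Phi_2x, Phi_j)
```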
Preferably, in S2, the data feature extraction network layer is connected to a convolutional network layer, the convolutional network layer is connected to a pooling layer, and the pooling layer and the activation function layer are respectively connected to a full link layer;
the data feature extraction network layer is the network layer, following the input layer, that performs dimension-reduction and standardization of the data: it extracts the feature values of the input data and feeds them to the data feature extraction network, whose output is the corresponding convolutional feature value after standardization at time t;
the convolutional network layer progresses from local (regional) perception to higher-level perception regions; its convolution output formula is rendered as an equation image in the original filing, in which the operator is the Cartesian inner product and X_t, W_c and b are, respectively, the input, weight and bias variables of the convolutional layer;
the pooling layer reduces dimensionality and compresses the data to avoid overfitting, and the maximum-pooling pattern is selected to filter excessive noise: s = s_x · δ(s_x ≥ n_th), where s_x is the raw network input, δ(s_x ≥ n_th) is the impulse (indicator) function of the condition s_x ≥ n_th, and n_th is the minimum noise threshold;
the activation function layer selects the linear rectification function ReLU as the activation function to realize the nonlinear conversion from convolution pooling to the full link layer, with expression f_ReLu(x) = max(0, x), where f_ReLu(x) denotes the ReLU activation function and max(0, x) is the maximum of 0 and x;
the full link layer combines all features of the previous layer's nodes at time t: Y_t = W_f · X_t and Y_{t+1} = W_f · X_{t+1} = Y_t + W_f · ΔX_{t+1}, where Y_t and Y_{t+1} are the output data at times t and t+1, W_f is the weight vector of the full link layer, and ΔX_{t+1} = X_{t+1} − X_t is the error variable between the input data X at times t and t+1; by the recursive property of the Markov decision process, ΔX_{t+1} = {X_{t+1} − X_t, X_t − X_{t−1}, …, X_{t−T+2} − X_{t−T+1}}, so a change of the input information only affects times t and t+1 and remains unchanged at time t−1.
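The layer operations spelled out above in closed form can be checked with a short NumPy sketch; the array sizes and names below are illustrative, while the threshold gate s = s_x·δ(s_x ≥ n_th), the rectifier f_ReLu(x) = max(0, x) and the incremental full-link update Y_{t+1} = Y_t + W_f·ΔX_{t+1} follow the text.

```python
import numpy as np

def threshold_pool(s_x, n_th):
    """Noise-gated pooling s = s_x * delta(s_x >= n_th): values below the threshold are set to zero."""
    return s_x * (s_x >= n_th)

def relu(x):
    """Linear rectification f_ReLu(x) = max(0, x)."""
    return np.maximum(0.0, x)

def full_link_update(Y_t, W_f, X_t, X_t1):
    """Incremental full-link output Y_{t+1} = Y_t + W_f . (X_{t+1} - X_t)."""
    return Y_t + W_f @ (X_t1 - X_t)

s_x = np.array([0.05, 0.4, -0.1, 0.9])
s = relu(threshold_pool(s_x, n_th=0.2))            # [0., 0.4, 0., 0.9]

W_f = 0.1 * np.ones((1, 8))                        # toy full-link weight vector
X_t, X_t1 = np.zeros((8, 1)), np.ones((8, 1))
Y_t = W_f @ X_t
Y_t1 = full_link_update(Y_t, W_f, X_t, X_t1)       # equals W_f @ X_t1
```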
Preferably, S3 comprises:
S3.1. Under the π strategy, based on the state s_t and the available action a_t at time t, a low-variance estimate of the desired Q-function is formed, and deep Q-learning is implemented with the expectation operator (both expressions are rendered as equation images in the original filing); s denotes the initial state of Q-learning, a denotes the initial learning action, and the π strategy is the mapping from the system state at time t to the available actions, expressed as the probability distribution π(s_t, a_t);
S3.2. From the state s_{t+1} and the available action a_{t+1} at time t+1, the optimal Q-value is acquired, and the desired target Q-value is obtained from the i-th weight w_i under the π strategy (equation image in the original filing), where r_t is the cost function at time t, μ is a controllable factor, and max() is the maximum solver;
S3.3. The nonlinear CNN is preprocessed by randomly selecting uniformly distributed elements so as to approximate it to the target desired value (equation image in the original filing), with reward function r(a_t) = α(a_t) − γ·δ(a_t ≠ a_{t−1}), where the reward factor is α(a_t) = δ(ξ_k(a_t) ≥ ξ_th), ξ_k(a_t) is the reward amount of the available action a_t at time t, γ is the loss-cost coefficient of the transmitted signal power, and ξ_th is the threshold set for data transmission of the GNSS transmitter (a worked sketch of this reward follows S3.6);
S3.4. For the input state, a cross-entropy error is used to prevent the gradient from vanishing each time the loss function is trained; the loss function is rendered as an equation image in the original filing, where log() is the logarithmic function;
S3.5. The gradient descent of the loss function in S3.4 is optimized at learning rate β to obtain the updated Q-function algorithm (equation image in the original filing), where ∇_t is the gradient value at time t;
S3.6. To avoid local convergence, an ε-greedy strategy is adopted to select the expected target action: the maximum Q-table value is selected when the random probability satisfies ε ≤ p ≤ 1 − ε, and the current action is selected when p ≤ ε; the ε-greedy strategy of the agent action is expressed by a formula rendered as an equation image in the original filing.
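To make steps S3.3 and S3.6 concrete, the following Python sketch implements the reward r(a_t) = α(a_t) − γ·δ(a_t ≠ a_{t−1}) with the embodiment's values γ = 0.2 and ξ_th = 15 dB, together with a conventional ε-greedy action selection; the function names, the reading of ξ_k(a_t) as an achieved quality measure, and the exploration-probability convention are illustrative assumptions rather than the filing's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(a_t, a_prev, xi_k, xi_th=15.0, gamma=0.2):
    """r(a_t) = alpha(a_t) - gamma * delta(a_t != a_{t-1}), with alpha(a_t) = delta(xi_k(a_t) >= xi_th).

    xi_k  : reward amount earned by action a_t (e.g. an achieved SINR in dB -- illustrative);
    xi_th : GNSS transmitter data-transmission threshold (15 dB in the embodiment);
    gamma : loss-cost coefficient of the transmitted signal power (0.2 in the embodiment).
    """
    alpha = 1.0 if xi_k >= xi_th else 0.0
    return alpha - (gamma if a_t != a_prev else 0.0)

def epsilon_greedy(q_row, explore_prob):
    """Conventional epsilon-greedy rule: explore with probability explore_prob, otherwise take the
    maximum Q-table value (the filing states the equivalent p-threshold form epsilon <= p <= 1 - epsilon)."""
    if rng.random() < explore_prob:
        return int(rng.integers(len(q_row)))   # explore: random action
    return int(np.argmax(q_row))               # exploit: maximum Q-table value

q_row = np.array([0.1, 0.7, 0.3, 0.2])
a_t = epsilon_greedy(q_row, explore_prob=0.1)
r_t = reward(a_t, a_prev=0, xi_k=16.4)         # 1.0 - 0.2 when the action changes, 1.0 otherwise
```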
preferably, S4 comprises:
S4.1. At each learning time t, the learning agent uses the CNN to preprocess the state information s_t and observes the system state at each event step in order to perform an action, the action comprising an anti-interference weight and a phase shift;
S4.2. The optimal Q-function is selected with the ε-greedy strategy to balance exploration and exploitation;
S4.3. The maximum Q-function is obtained in each time slot according to the probability range ε ≤ p ≤ 1 − ε; after executing (s_t, a_t), the obtained reward r(s_t, a_t) and the next state s_{t+1} are stored in the set D to test the sample at the next time slot t + 1;
and S4.4, feeding back the updated Q value to the CNN so as to take the next action until the loop reaches the maximum iteration number.
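Steps S4.1–S4.4 describe a standard Q-learning loop with experience storage. The condensed Python sketch below follows that loop with a toy stand-in environment; the state/action sizes, the env_step function and the tabular Q stand-in for the CNN are assumptions, while the discount μ = 0.7 and the 200-iteration limit come from the embodiment.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 8, 4
Q = np.zeros((n_states, n_actions))     # stand-in for the deep Q network output
D = []                                  # experience set D of (s_t, a_t, r_t, s_{t+1}) tuples
beta, mu, explore_prob = 0.1, 0.7, 0.1  # learning rate, discount factor, exploration probability

def env_step(s, a):
    """Toy stand-in for the GNSS anti-jamming environment (illustrative only)."""
    s_next = int(rng.integers(n_states))
    r = 1.0 if a == s % n_actions else -0.2
    return r, s_next

s = int(rng.integers(n_states))
for t in range(200):                    # maximum number of iterations (200 in the embodiment)
    a = int(rng.integers(n_actions)) if rng.random() < explore_prob else int(np.argmax(Q[s]))
    r, s_next = env_step(s, a)          # execute (s_t, a_t), obtain the reward and the next state
    D.append((s, a, r, s_next))         # store the transition in set D for the next time slot
    # feed the updated Q value back: target = r_t + mu * max_a Q(s_{t+1}, a)
    Q[s, a] += beta * (r + mu * np.max(Q[s_next]) - Q[s, a])
    s = s_next
```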
Compared with the prior art, the invention has the following beneficial effects. The invention uses deep reinforcement learning to perform beam control on the array signals, and realizes beam main-lobe pointing and side-lobe nulling of interference signals by controlling the manipulated variables. In the anti-interference process, the deep Q network learns the characteristics of the interference data to determine the next action to execute, without manual intervention, which greatly improves the autonomy of the smart antenna. The representation-learning capability of the deep convolutional network is used to perform translation-invariant classification of the input information (beam forming), the input data are max-pooled, and, thanks to convolution-kernel parameter sharing in the hidden layers and the sparsity of inter-layer connections, the convolutional neural network can process the beam-wavefront covariance matrix with a smaller amount of computation so that it approaches the target value of the Q-function. The reinforcement learning method reinforces the target Q-value, weakens the spatial superposition of the interference signals and thereby eliminates wavefront-distortion data, seeks to maximize the total transmission rate of the target anti-interference system, and reduces the consumption of the degrees of freedom of the array antenna.
Drawings
FIG. 1 is a technical flow chart of the present invention;
FIG. 2 is a schematic diagram of a deep learning convolutional neural network CNN operation according to the method of the present invention;
FIG. 3 is a schematic diagram of a deep reinforcement learning Q network training phase of the method of the present invention;
FIG. 4 is a graph showing comparison of the results of examples of the present invention.
Detailed Description
The following description will further illustrate embodiments of the present invention with reference to specific examples:
A method for adaptive anti-interference beam forming based on preprocessing deep reinforcement learning, as shown in FIG. 1, includes:
S1, constructing a GPS terminal signal model, wherein the GPS terminal signal model comprises a 2 × 2 dual-polarized antenna array, manipulated variables, a time-varying gain interference variable and Gaussian noise;
S2, constructing a deep learning convolutional neural network CNN, as shown in FIG. 2, comprising a data feature extraction network layer, a convolutional network layer, a pooling layer, an activation function layer and a full link layer;
S3, carrying out reinforcement learning processing and decision implementation of the Q network on the deep learning convolutional neural network to obtain a deep reinforcement learning Q network;
S4, training the deep reinforcement learning Q network, as shown in FIG. 3, to obtain a trained deep reinforcement learning convolutional neural network.
In S1, the 2 × 2 dual-polarized antenna array obtains 8 polarization port gains, the manipulated variables realize beam patterns at different azimuth angles, the time-varying gain interference variables satisfy time-varying Rayleigh attenuation, and Gaussian noise serves as an interference auxiliary quantity;
the 2 × 2 dual-polarized antenna array is composed of a double-layer substrate of "Rogers RO3010" material, with a relative dielectric constant of 10.2 and a dielectric loss tangent of 0.0035.
In S1, the manipulated variables are specified for the i-th pitch and azimuth of the downlink navigation transmission 1# link; their expressions are rendered as equation images in the original filing. In these expressions, θ is the azimuth angle, the second angular argument is the initial pitch angle, and the quantity they parameterize is the beam signal steering vector; diag() denotes a diagonal matrix, exp() the exponential function, j the imaginary unit, ω the angular frequency, sin()·cos() a product of trigonometric functions, and R = d/λ is the ratio of the array element spacing d to the resonant wavelength λ.
In S1, the time-varying gain interference variable and the Gaussian noise are the interference quantities of the total GPS received-signal model on the downlink navigation transmission 3# and 4# links; the structural formula of the time-varying gain interference variable is rendered as an equation image in the original filing, in which one term is an interference quantity with a structure similar to that of the desired signal and E_j H_t is the Rayleigh-fading interference quantity at time t. The total GPS signal model is then obtained from the downlink navigation transmission links (also rendered as an equation image), in which the left-hand side is the overall received time-varying GPS signal, one term is the expected GPS signal on the 1# link, two terms are the manipulated variables of the expected GPS signal on the 1# and 2# links respectively, another term is the manipulated variable of the interference signals on the 3# and 4# links, and n(t) is a Gaussian noise signal. The manipulated variables take the form of an IRS sensor phase-keying N × 1 matrix (equation image in the original filing), where i = 1x, 2x or j, j is the imaginary unit, and the amplitude and phase of each IRS phase-keying unit satisfy η ∈ [0, 1] and φ ∈ [0, 2π].
In S2, the data feature extraction network layer is connected with a convolution network layer, the convolution network layer is connected with a pooling layer, and the pooling layer and the activation function layer are respectively connected with a full link layer;
the data feature extraction network layer is the network layer, following the input layer, that performs dimension-reduction and standardization of the data: it extracts the feature values of the input data and feeds them to the data feature extraction network, whose output is the corresponding convolutional feature value after standardization at time t;
the convolutional network layer progresses from local (regional) perception to higher-level perception regions; its convolution output formula is rendered as an equation image in the original filing, in which the operator is the Cartesian inner product and X_t, W_c and b are, respectively, the input, weight and bias variables of the convolutional layer;
the pooling layer reduces dimensionality and compresses the data to avoid overfitting, and the maximum-pooling pattern is selected to filter excessive noise: s = s_x · δ(s_x ≥ n_th), where s_x is the raw network input, δ(s_x ≥ n_th) is the impulse (indicator) function of the condition s_x ≥ n_th, and n_th is the minimum noise threshold;
the activation function layer selects the linear rectification function ReLU as the activation function to realize the nonlinear conversion from convolution pooling to the full link layer, with expression f_ReLu(x) = max(0, x), where f_ReLu(x) denotes the ReLU activation function and max(0, x) is the maximum of 0 and x;
the full link layer combines all features of the previous layer's nodes at time t: Y_t = W_f · X_t and Y_{t+1} = W_f · X_{t+1} = Y_t + W_f · ΔX_{t+1}, where Y_t and Y_{t+1} are the output data at times t and t+1, W_f is the weight vector of the full link layer, and ΔX_{t+1} = X_{t+1} − X_t is the error variable between the input data X at times t and t+1; by the recursive property of the Markov decision process, ΔX_{t+1} = {X_{t+1} − X_t, X_t − X_{t−1}, …, X_{t−T+2} − X_{t−T+1}}, so a change of the input information only affects times t and t+1 and remains unchanged at time t−1.
S3 comprises the following steps:
S3.1. Under the π strategy, based on the state s_t and the available action a_t at time t, a low-variance estimate of the desired Q-function is formed, and deep Q-learning is implemented with the expectation operator (both expressions are rendered as equation images in the original filing); s denotes the initial state of Q-learning, a denotes the initial learning action, and the π strategy is the mapping from the system state at time t to the available actions, expressed as the probability distribution π(s_t, a_t);
S3.2. From the state s_{t+1} and the available action a_{t+1} at time t+1, the optimal Q-value is acquired, and the desired target Q-value is obtained from the i-th weight w_i under the π strategy (equation image in the original filing), where r_t is the cost function at time t, μ is a controllable factor, and max() is the maximum solver;
S3.3. The nonlinear CNN is preprocessed by randomly selecting uniformly distributed elements so as to approximate it to the target desired value (equation image in the original filing), with reward function r(a_t) = α(a_t) − γ·δ(a_t ≠ a_{t−1}), where the reward factor is α(a_t) = δ(ξ_k(a_t) ≥ ξ_th), ξ_k(a_t) is the reward amount of the available action a_t at time t, γ is the loss-cost coefficient of the transmitted signal power, and ξ_th is the threshold set for data transmission of the GNSS transmitter;
S3.4. For the input state, a cross-entropy error is used to prevent the gradient from vanishing each time the loss function is trained; the loss function is rendered as an equation image in the original filing, where log() is the logarithmic function;
S3.5. The gradient descent of the loss function in S3.4 is optimized at learning rate β to obtain the updated Q-function algorithm (equation image in the original filing), where ∇_t is the gradient value at time t;
S3.6. To avoid local convergence, an ε-greedy strategy is adopted to select the expected target action: the maximum Q-table value is selected when the random probability satisfies ε ≤ p ≤ 1 − ε, and the current action is selected when p ≤ ε; the ε-greedy strategy of the agent action is expressed by a formula rendered as an equation image in the original filing.
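Steps S3.4 and S3.5 above use a cross-entropy error and a gradient-descent update whose exact expressions are equation images in the original filing. The sketch below shows one conventional reading of them; the softmax normalisation of the Q outputs and the function names are assumptions, not the filing's formulas.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(q_pred, target_action):
    """Cross-entropy error -log p(target_action) over softmax-normalised Q outputs
    (an assumed reading of the loss in S3.4, used to prevent vanishing gradients)."""
    p = softmax(q_pred)
    return -np.log(p[target_action] + 1e-12)

def gradient_step(w, grad, beta=0.1):
    """Plain gradient-descent update at learning rate beta, standing in for the updated Q-function rule of S3.5."""
    return w - beta * grad

loss = cross_entropy_loss(np.array([0.2, 1.5, -0.3]), target_action=1)
```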
S4 comprises the following steps:
S4.1. At each learning time t, the learning agent uses the CNN to preprocess the state information s_t and observes the system state at each event step in order to perform an action, the action comprising an anti-interference weight and a phase shift;
S4.2. The optimal Q-function is selected with the ε-greedy strategy to balance exploration and exploitation;
S4.3. The maximum Q-function is obtained in each time slot according to the probability range ε ≤ p ≤ 1 − ε; after executing (s_t, a_t), the obtained reward r(s_t, a_t) and the next state s_{t+1} are stored in the set D to test the sample at the next time slot t + 1;
and S4.4, feeding back the updated Q value to the CNN so as to take the next action until the loop reaches the maximum iteration number.
The data calculation results are shown in FIG. 4. The simulation environment is the TensorFlow v1 and Keras learning libraries on Python v3.6.6 (Win64), and the simulation analysis is performed on a computer with a 4-core Intel(R) Core(TM) i5-6500 CPU, a 1258 GPU and 8 GB of memory. In addition, Ansoft HFSS 15.0 software was used to simulate the 2 × 2 dual-polarized GNSS smart antenna array. All experimental procedures were run in PyCharm 2018 to evaluate the anti-interference performance of the proposed DRL.
The convolutional neural network uses the Adam method to update the network parameters; the convolution depth is 32 and the deviation number is 8 bits; the maximum-pooling settings are: pooling layer 1, tensor 1, padding = 'VALID'; the input data are reshaped so that the number of samples is 1024 and each sample is an 8 × 1 matrix; the initial learning rate is 0.01, and every time the training loss stops decreasing the learning rate is divided by 2; the number of training iterations is 200; the mini-batch size is 32; the number of hidden-layer nodes is set to 1024.
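A tf.keras reconstruction of this network under the stated settings might look as follows; the 8 × 1 input, 1 × 3 kernels, 'VALID' max pooling, 1024 hidden nodes, Adam optimiser, mini-batch of 32, 200 iterations and halve-on-plateau schedule follow the listed figures, while the output width and the mse loss are assumptions (the filing trains against a cross-entropy loss).

```python
import tensorflow as tf

# Illustrative reconstruction of the embodiment's CNN; dimensions follow the listed settings,
# other details (output width, mse loss) are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu", input_shape=(8, 1)),
    tf.keras.layers.MaxPooling1D(pool_size=1, padding="valid"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),   # hidden-layer nodes = 1024
    tf.keras.layers.Dense(4),                         # one Q value per candidate action (assumed)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")

# Divide the learning rate by 2 whenever the training loss stops decreasing.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=1)
# model.fit(x, y, batch_size=32, epochs=200, callbacks=[reduce_lr])
```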
Simulation parameters: the learning rate λ = 0.1, the maximum ε-greedy value ε = 0.9, the ε increment coefficient Δε = 0.1, the discount coefficient in the reward function μ = 0.7, the consumption-cost coefficient γ = 0.2, and the threshold ξ_th = 15 dB; the Rayleigh fading gain parameters are set to λ = 1, f_c = 1.5 GHz, c = 3 × 10^8 m/s, α = 2.8 and d_l = 2 m; in the channel loss L (dB), PL_0 = 30 dB, d_0 = 1 m and γ = 0.8. In addition, the number of CNN convolution kernels is set to 5, where the convolutional layer (CL) has 8 kernels of size 1 × 3, the fully connected layer (FCL) has 32 neurons, and maximum pooling (PL) is used. To compare the performance of the proposed method, the following three methods were set:
i) The state space of the GNSS transmission system is optimized by a neural-network-based RL method (Q-learning), where the neural network uses 8 neurons, 1 ReLU layer and 1 output layer (named reinforcement learning);
ii) A greedy learning method without Q-learning, which uses a CNN and an ε-greedy strategy to optimize the 1# transmit power sent by the GNSS dual-polarized antenna array to the user (named greedy learning);
iii) A minimum-variance distortionless adaptive anti-interference beam forming method (spatial filtering), in which the weight vector is adaptively changed for the received data in a minimum variance distortionless response manner so that the average power output by the array is minimized (named minimum distortionless response).
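For reference, baseline iii) follows the classical minimum variance distortionless response beamformer, whose standard weight formula w = R⁻¹a / (aᴴR⁻¹a) can be sketched as follows; the snapshot generation and look direction are illustrative, and this is the textbook MVDR form rather than code taken from the filing.

```python
import numpy as np

def mvdr_weights(R, a):
    """Classical MVDR beamformer w = R^{-1} a / (a^H R^{-1} a): minimise the average array output
    power subject to a distortionless response toward the steering vector a."""
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / (a.conj().T @ Ri_a)

rng = np.random.default_rng(3)
X = (rng.standard_normal((4, 256)) + 1j * rng.standard_normal((4, 256))) / np.sqrt(2)  # array snapshots
R = (X @ X.conj().T) / X.shape[1]                                                      # sample covariance
a = np.exp(1j * np.pi * np.arange(4) * np.sin(np.deg2rad(20.0))).reshape(-1, 1)        # look direction
w = mvdr_weights(R, a)
y = w.conj().T @ X                                                                     # beamformer output
```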
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (7)

1. The self-adaptive anti-interference beam forming method based on the preprocessing deep reinforcement learning is characterized by comprising the following steps of:
S1, constructing a GPS terminal signal model, wherein the GPS terminal signal model comprises a 2 × 2 dual-polarized antenna array, manipulated variables, a time-varying gain interference variable and Gaussian noise;
S2, constructing a deep learning convolutional neural network (CNN), wherein the CNN comprises a data feature extraction network layer, a convolutional network layer, a pooling layer, an activation function layer and a full link layer;
S3, carrying out reinforcement learning processing and decision implementation of the Q network on the deep learning convolutional neural network to obtain a deep reinforcement learning Q network;
S4, training the deep reinforcement learning Q network to obtain a trained deep reinforcement learning convolutional neural network.
2. The adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning of claim 1, wherein in S1, the 2 × 2 dual-polarized antenna array obtains 8 polarization port gains, the manipulated variables realize beam patterns at different azimuth angles, the time-varying gain interference variables satisfy time-varying Rayleigh attenuation, and Gaussian noise is used as an interference auxiliary quantity;
the 2 × 2 dual-polarized antenna array is composed of a double-layer substrate made of Rogers RO3010 material, with a relative dielectric constant of 10.2 and a dielectric loss tangent of 0.0035.
3. The adaptive interference rejection beamforming method based on preprocessing deep reinforcement learning of claim 2, wherein in S1, the manipulated variables are specified for the i-th pitch and azimuth of the downlink navigation transmission 1# link; their expressions are rendered as equation images in the original filing. In these expressions, θ is the azimuth angle, the second angular argument is the initial pitch angle, and the quantity they parameterize is the beam signal steering vector; diag() denotes a diagonal matrix, exp() the exponential function, j the imaginary unit, ω the angular frequency, sin()·cos() a product of trigonometric functions, and R = d/λ is the ratio of the array element spacing d to the resonance wavelength λ.
4. The adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning as claimed in claim 3, wherein in S1, the time-varying gain interference variable and the Gaussian noise are the interference quantities of the total GPS received-signal model on the downlink navigation transmission 3# and 4# links; the structural formula of the time-varying gain interference variable is rendered as an equation image in the original filing, in which one term is an interference quantity with a structure similar to that of the desired signal and E_j H_t is the Rayleigh-fading interference quantity at time t. The total GPS signal model is then obtained from the downlink navigation transmission links (also rendered as an equation image), in which the left-hand side is the overall received time-varying GPS signal, one term is the expected GPS signal on the 1# link, two terms are the manipulated variables of the expected GPS signal on the 1# and 2# links respectively, another term is the manipulated variable of the interference signals on the 3# and 4# links, and n(t) is a Gaussian noise signal. The manipulated variables take the form of an IRS sensor phase-keying N × 1 matrix (equation image in the original filing), where i = 1x, 2x or j, j is the imaginary unit, and the amplitude and phase of each IRS phase-keying unit satisfy η ∈ [0, 1] and φ ∈ [0, 2π].
5. The adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning as claimed in claim 4, wherein in S2, the data feature extraction network layer is connected with a convolution network layer, the convolution network layer is connected with a pooling layer, and the pooling layer and the activation function layer are respectively connected with a full link layer;
the data feature extraction network layer is the network layer, following the input layer, that performs dimension-reduction and standardization of the data: it extracts the feature values of the input data and feeds them to the data feature extraction network, whose output is the corresponding convolutional feature value after standardization at time t;
the convolutional network layer progresses from local (regional) perception to higher-level perception regions; its convolution output formula is rendered as an equation image in the original filing, in which the operator is the Cartesian inner product and X_t, W_c and b are, respectively, the input, weight and bias variables of the convolutional layer;
the pooling layer reduces dimensionality and compresses the data to avoid overfitting, and the maximum-pooling pattern is selected to filter excessive noise: s = s_x · δ(s_x ≥ n_th), where s_x is the raw network input, δ(s_x ≥ n_th) is the impulse (indicator) function of the condition s_x ≥ n_th, and n_th is the minimum noise threshold;
the activation function layer selects the linear rectification function ReLU as the activation function to realize the nonlinear conversion from convolution pooling to the full link layer, with expression f_ReLu(x) = max(0, x), where f_ReLu(x) denotes the ReLU activation function and max(0, x) is the maximum of 0 and x;
the full link layer combines all features of the previous layer's nodes at time t: Y_t = W_f · X_t and Y_{t+1} = W_f · X_{t+1} = Y_t + W_f · ΔX_{t+1}, where Y_t and Y_{t+1} are the output data at times t and t+1, W_f is the weight vector of the full link layer, and ΔX_{t+1} = X_{t+1} − X_t is the error variable between the input data X at times t and t+1; by the recursive property of the Markov decision process, ΔX_{t+1} = {X_{t+1} − X_t, X_t − X_{t−1}, …, X_{t−T+2} − X_{t−T+1}}, so a change of the input information only affects times t and t+1 and remains unchanged at time t−1.
6. The adaptive interference rejection beamforming method based on preprocessing deep reinforcement learning according to claim 5, wherein S3 comprises:
S3.1. Under the π strategy, based on the state s_t and the available action a_t at time t, a low-variance estimate of the desired Q-function is formed, and deep Q-learning is implemented with the expectation operator (both expressions are rendered as equation images in the original filing); s denotes the initial state of Q-learning, a denotes the initial learning action, and the π strategy is the mapping from the system state at time t to the available actions, expressed as the probability distribution π(s_t, a_t);
S3.2. From the state s_{t+1} and the available action a_{t+1} at time t+1, the optimal Q-value is acquired, and the desired target Q-value is obtained from the i-th weight w_i under the π strategy (equation image in the original filing), where r_t is the cost function at time t, μ is a controllable factor, and max() is the maximum solver;
S3.3. The nonlinear CNN is preprocessed by randomly selecting uniformly distributed elements so as to approximate it to the target desired value (equation image in the original filing), with reward function r(a_t) = α(a_t) − γ·δ(a_t ≠ a_{t−1}), where the reward factor is α(a_t) = δ(ξ_k(a_t) ≥ ξ_th), ξ_k(a_t) is the reward amount of the available action a_t at time t, γ is the loss-cost coefficient of the transmitted signal power, and ξ_th is the threshold set for data transmission of the GNSS transmitter;
S3.4. For the input state, a cross-entropy error is used to prevent the gradient from vanishing each time the loss function is trained; the loss function is rendered as an equation image in the original filing, where log() is the logarithmic function;
S3.5. The gradient descent of the loss function in S3.4 is optimized at learning rate β to obtain the updated Q-function algorithm (equation image in the original filing), where ∇_t is the gradient value at time t;
S3.6. To avoid local convergence, an ε-greedy strategy is adopted to select the expected target action: the maximum Q-table value is selected when the random probability satisfies ε ≤ p ≤ 1 − ε, and the current action is selected when p ≤ ε; the ε-greedy strategy of the agent action is expressed by a formula rendered as an equation image in the original filing.
7. The adaptive interference rejection beamforming method based on preprocessing deep reinforcement learning according to claim 6, wherein S4 comprises:
S4.1. At each learning time t, the learning agent uses the CNN to preprocess the state information s_t and observes the system state at each event step in order to perform an action, the action comprising an anti-interference weight and a phase shift;
S4.2. The optimal Q-function is selected with the ε-greedy strategy to balance exploration and exploitation;
S4.3. The maximum Q-function is obtained in each time slot according to the probability range ε ≤ p ≤ 1 − ε; after executing (s_t, a_t), the obtained reward r(s_t, a_t) and the next state s_{t+1} are stored in the set D to test the sample at the next time slot t + 1;
S4.4. The updated Q-value is fed back to the CNN to take the next action, and the loop continues until the maximum number of iterations is reached.
CN202210785303.3A 2022-06-30 2022-06-30 Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning Pending CN115296709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210785303.3A CN115296709A (en) 2022-06-30 2022-06-30 Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210785303.3A CN115296709A (en) 2022-06-30 2022-06-30 Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115296709A true CN115296709A (en) 2022-11-04

Family

ID=83822862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210785303.3A Pending CN115296709A (en) 2022-06-30 2022-06-30 Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115296709A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116996919A (en) * 2023-09-26 2023-11-03 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning
CN116996919B (en) * 2023-09-26 2023-12-05 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning
CN117610317A (en) * 2024-01-19 2024-02-27 湖北工业大学 Multi-bit super-surface phase arrangement optimization method based on deep learning
CN117610317B (en) * 2024-01-19 2024-04-12 湖北工业大学 Multi-bit super-surface phase arrangement optimization method based on deep learning
CN118446119A (en) * 2024-06-19 2024-08-06 中国人民解放军国防科技大学 Terahertz flat-top beam forming method and device based on cone optimization and deep learning

Similar Documents

Publication Publication Date Title
CN115296709A (en) Self-adaptive anti-interference beam forming method based on preprocessing deep reinforcement learning
CN107024681A (en) MIMO radar transmit-receive combination optimization method under the conditions of not known based on clutter knowledge
CN109004970A (en) A kind of adaptive sparse array beams forming method of zero norm constraint
Wang et al. Optimal pattern synthesis of linear array and broadband design of whip antenna using grasshopper optimization algorithm
Yang et al. A learning-aided flexible gradient descent approach to MISO beamforming
He et al. GBLinks: GNN-based beam selection and link activation for ultra-dense D2D mmWave networks
CN105282761B (en) A kind of method of quick LMS Adaptive beamformers
CN115942494A (en) Multi-target safe Massive MIMO resource allocation method based on intelligent reflecting surface
Dudczyk et al. Adaptive forming of the beam pattern of microstrip antenna with the use of an artificial neural network
Jiang et al. Active sensing for two-sided beam alignment and reflection design using ping-pong pilots
Omondi et al. Variational autoencoder-enhanced deep neural network-based detection for MIMO systems
Mallipeddi et al. Near optimal robust adaptive beamforming approach based on evolutionary algorithm
CN116192206B (en) Large-scale conformal array real-time wave beam synthesis method based on generalized regression neural network
Omid et al. Deep Reinforcement Learning-Based Secure Standalone Intelligent Reflecting Surface Operation
Haider et al. GAN-based Channel Estimation for IRS-aided Communication Systems
CN117318769A (en) Beam searching method and device and electronic equipment
CN110346766B (en) Null broadening method based on sparse constraint control side lobe
Hsu et al. Memetic algorithms for optimizing adaptive linear array patterns by phase-position perturbation
Papari et al. Robust adaptive beamforming algorithm based on sampling function neural network
Elpidio et al. Comparison of evolutionary algorithms for synthesis of linear array of antennas with minimal level of sidelobe
Mallioras et al. Zero Forcing Beamforming With Sidelobe Suppression Using Neural Networks
Hao et al. Adaptive anti-jamming beamforming based on the preprocessing deep reinforcement learning for downlink navigation communication
Shelim et al. Learning wireless power allocation through graph convolutional regression networks over Riemannian manifolds
CN118118069B (en) Robust self-adaptive beam forming method based on deep expansion network
CN115102589B (en) Deep learning hybrid precoding method of terahertz large-scale MIMO system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination