CN115296705A - Active monitoring method in MIMO communication system - Google Patents

Active monitoring method in MIMO communication system

Info

Publication number
CN115296705A
Authority
CN
China
Prior art keywords
antenna
transmitter
listener
parameter
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210470392.2A
Other languages
Chinese (zh)
Other versions
CN115296705B (en)
Inventor
Lan Tang (唐岚)
Delin Guo (郭德邻)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202210470392.2A priority Critical patent/CN115296705B/en
Publication of CN115296705A publication Critical patent/CN115296705A/en
Application granted granted Critical
Publication of CN115296705B publication Critical patent/CN115296705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B 7/00 Radio transmission systems, i.e. using radiation field
    • H04B 7/02 Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B 7/04 Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B 7/0413 MIMO systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/06 Testing, supervising or monitoring using simulated traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/04 Wireless resource allocation
    • H04W 72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0473 Wireless resource allocation based on the type of the allocated resource the resource being transmission power
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an active monitoring method in a MIMO communication system comprising a suspicious transmitter A, a suspicious receiver B and a legitimate monitor E. The transmitter A sends information to the receiver B; the monitor E makes decisions based on partially known channels to improve its monitoring performance, while A makes corresponding decisions to thwart the monitoring, which gives rise to a monitoring and anti-monitoring game between the source node A and the monitor E. The invention designs a reinforcement learning algorithm to optimize the transmit-power strategies of the monitor E and the transmitter A and to obtain the Nash equilibrium of the monitoring and anti-monitoring game between them.

Description

Active monitoring method in MIMO communication system
Technical Field
The invention belongs to the field of wireless communication, and particularly relates to an active monitoring method in a Multiple Input Multiple Output (MIMO) system, and more particularly relates to an optimization method of a power distribution strategy based on a multi-agent reinforcement learning algorithm (FSP-SAC).
Background
In recent decades, wireless communication has played a very important role in people's daily life by providing an efficient and convenient way to connect people.
At present, physical-layer monitoring can be divided into two categories: passive monitoring and active monitoring. Passive monitoring simply receives the leaked radio signal with a silent receiver. However, with the deployment of high-band channels and MIMO in fifth-generation (5G) networks, the beams carrying the signals become increasingly narrow and directional, so passive monitoring can hardly capture useful information. According to information theory, the information can be decoded in the physical-layer sense as long as the capacity of the communication channel is smaller than the capacity of the monitoring channel. Therefore, to improve monitoring efficiency, active monitoring methods that use interference signals to reduce the capacity of the communication channel have begun to be widely used.
Existing active monitoring methods usually consider only a static monitored target. With the development of anti-monitoring measures, more and more illegal information transmitters intelligently adjust their transmission power and use noise signals to reduce the capacity of the monitoring channel. This creates a game between monitoring and anti-monitoring and greatly complicates legitimate active monitoring. It is therefore important to construct a method that finds the Nash equilibrium of this monitoring and anti-monitoring game.
Disclosure of Invention
The purpose of the invention is as follows: in view of the problems and deficiencies of the prior art, an object of the present invention is to provide an active monitoring method in a MIMO communication system that optimizes the transmit-power strategy of the monitor and the transmit-power strategy of the suspicious source node so that the two strategies reach a Nash equilibrium.
The technical scheme is as follows: to achieve the above object, the present invention adopts an active monitoring method in a MIMO communication system, comprising the following steps:
(1) In each time slot t, a multi-antenna transmitter (transmitter for short) A transmits an information signal x_AB(t) to a multi-antenna receiver (receiver for short) B and transmits an interference signal x_AE(t) to a multi-antenna listener (listener for short) E, in order to reduce the channel capacity from A to E and thus prevent listening by E. The action of transmitter A is a_A(t) = {Q_AB(t), Q_AE(t)}, where Q_AB(t) and Q_AE(t) are the covariance matrices of x_AB(t) and x_AE(t), respectively. Based on its local channel information o_A(t), transmitter A selects an action using a strategy π_A: A samples from the conditional probability distribution π_A(·|o_A(t)), and the sampled value is the selected action a_A(t).
(2) In each time slot t, the listener E transmits an interference signal x_E(t) to the receiver B, in order to reduce the channel capacity between transmitter A and receiver B and thereby increase the listening success rate. The action of listener E is a_E(t) = Q_E(t), where Q_E(t) is the covariance matrix of x_E(t). Based on its local channel information o_E(t), listener E selects an action using a strategy π_E: E samples from the conditional probability distribution π_E(·|o_E(t)), and the sampled value is the selected action a_E(t).
(3) In each time slot t, after both the transmitter A and the listener E have performed their actions, they receive rewards r_A(t) and r_E(t), respectively. Let π = {π_A, π_E}. The average reward function of transmitter A is defined as J_A(π) = E_π[r_A(t)], where E_π[·] denotes the mathematical expectation over the time slots t under the joint strategy π; the average reward function of listener E is J_E(π) = E_π[r_E(t)]. The strategy π_A is optimized to maximize J_A(π) and the strategy π_E is optimized to maximize J_E(π), so as to reach the Nash equilibrium of the monitoring and anti-monitoring game.
Further, the step (3) further comprises the following steps:
1) For any device n, where n ∈ {A, E}, initialize a strategy β_n parameterized by θ_n, a strategy π̄_n parameterized by ψ_n, a value function V_n parameterized by ω_n and a value function Q_n parameterized by φ_n, and assign the parameter φ_n to the parameter φ̄_n of the target value function Q̄_n.
2) In each time slot t, device n selects an action using the strategy β_n with probability 0.1 and using the strategy π̄_n with probability 0.9, and stores the collected data d_n(t) in a first storage area M_n^RL, where, when n = A, the data are d_A(t) = (o_A(t), a_A(t), r_A(t), o_A(t+1)), and when n = E, the data are d_E(t) = (o_E(t), a_E(t), r_E(t), o_E(t+1)). If the action was selected by the strategy β_n, the data (o_n(t), a_n(t)) are also stored in a second storage area M_n^SL.
3) Randomly select a mini-batch of length L from M_n^RL and compute the gradient Δω_n of the loss of the value function V_n, where ∇_x denotes the gradient with respect to the variable x, the actions used in this gradient are sampled from the strategy β_n, and the temperature parameter α ∈ [0,1]; compute the gradient Δθ_n of the loss of the strategy β_n; and compute the gradient Δφ_n of the loss of the value function Q_n, where the discount factor γ ∈ (0,1). Then randomly select a mini-batch of length L from M_n^SL and compute the gradient Δψ_n of the loss of the strategy π̄_n. Update the parameters θ_n ← θ_n + ηΔθ_n, ω_n ← ω_n + ηΔω_n, φ_n ← φ_n + ηΔφ_n, φ̄_n ← νφ_n + (1 − ν)φ̄_n, ψ_n ← ψ_n + ηΔψ_n, where η is the learning rate with value range (0,1), ν is the moving-average parameter with value range (0,1), and the symbol ← means that the value on the right of the arrow is assigned to the variable on the left. Then return to step 2) until the strategy parameters θ_n no longer change.
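The parameter updates in this step are plain gradient ascent on the respective losses plus a moving-average update of the target parameters. A minimal Python sketch of this update rule, assuming the gradients Δθ_n, Δω_n, Δφ_n, Δψ_n have already been computed as arrays (the dictionary keys and the Polyak form of the target update are illustrative assumptions, not code from the patent), is:

```python
import numpy as np

def apply_updates(params, grads, eta=3e-4, nu=0.005):
    """Gradient step x <- x + eta * dx for each learned parameter vector,
    followed by moving-average tracking of the target-value parameters."""
    for name in ("theta", "omega", "phi", "psi"):
        params[name] = params[name] + eta * grads["d" + name]
    # target parameters phi_bar track phi with moving-average coefficient nu
    params["phi_bar"] = nu * params["phi"] + (1.0 - nu) * params["phi_bar"]
    return params
```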
Beneficial effects: by designing the FSP-SAC algorithm and introducing deep reinforcement learning, the invention overcomes the curse of dimensionality faced in high-dimensional games, and by combining fictitious play with deep reinforcement learning it solves the problem that common single-agent reinforcement learning algorithms have difficulty converging in game problems, so that the algorithm gradually converges to the Nash equilibrium.
Drawings
FIG. 1 is a diagram of a system model of the present invention;
FIG. 2 is a graph comparing the performance of the method used in the present invention with other methods;
FIG. 3 is a graph comparing the performance of the method used in the present invention with other methods.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which should be understood as merely illustrative and not limiting the scope of the invention; after reading the present specification, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
As shown in fig. 1, the communication system considered consists of a multi-antenna transmitter (transmitter) A, a multi-antenna receiver (receiver) B, and a multi-antenna listener (listener) E. Let the number of transmit antennas of transmitter A be N_A and the number of receive antennas of receiver B be N_B. The listener E has two groups of antennas: one group of N_E^J antennas used to transmit interference signals, and another group of N_E^L antennas used to listen to the signal from A. In time slot t, the channel matrices between transmitter A and receiver B, between transmitter A and listener E, and between listener E and receiver B are denoted H_AB(t) ∈ C^{N_B×N_A}, H_AE(t) ∈ C^{N_E^L×N_A} and H_EB(t) ∈ C^{N_B×N_E^J}, respectively, where C^{i×j} denotes the space of complex matrices of size i×j.
In each time slot t, the signal transmitted by transmitter A is composed of an information signal x_AB(t) and an artificial noise signal x_AE(t), where x_AB(t) is expressed as x_AB(t) = W_AB(t)s_AB(t) with precoding matrix W_AB(t), and x_AE(t) is expressed as x_AE(t) = W_AE(t)s_AE(t) with precoding matrix W_AE(t). So that the artificial noise does not interfere with the information signal, W_AB(t) is formed from the N_B right singular vectors of H_AB(t) corresponding to its non-zero singular values, and W_AE(t) is formed from the remaining N_A − N_B right singular vectors corresponding to zero singular values. The total signal transmitted by transmitter A in time slot t is denoted x_A(t) = x_AB(t) + x_AE(t).
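As a concrete illustration of this null-space precoding step, the following Python sketch builds candidate precoders from the SVD of H_AB(t). The variable names (H_AB, W_AB, W_AE) mirror the reconstructed notation above and are assumptions; the patent does not publish implementation code.

```python
import numpy as np

def build_precoders(H_AB: np.ndarray, N_B: int):
    """Split the right singular vectors of H_AB into a signal precoder
    (non-zero singular values) and an artificial-noise precoder
    (null-space directions), so the noise does not disturb receiver B."""
    # H_AB has shape (N_B, N_A); full_matrices=True keeps all N_A right
    # singular vectors, including those spanning the null space of H_AB.
    _, _, Vh = np.linalg.svd(H_AB, full_matrices=True)
    V = Vh.conj().T                  # columns are right singular vectors
    W_AB = V[:, :N_B]                # directions seen by receiver B
    W_AE = V[:, N_B:]                # null-space directions for artificial noise
    return W_AB, W_AE

# Minimal usage example with a random channel (illustrative dimensions only)
N_A, N_B = 4, 2
H_AB = (np.random.randn(N_B, N_A) + 1j * np.random.randn(N_B, N_A)) / np.sqrt(2)
W_AB, W_AE = build_precoders(H_AB, N_B)
print(np.linalg.norm(H_AB @ W_AE))   # ≈ 0: the artificial noise vanishes at B
```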
The signal received by receiver B is
y_B(t) = H_AB(t)x_A(t) + H_EB(t)x_E(t) + n_B,   (1)
where x_E(t) is the interference signal transmitted by listener E to B and n_B is Gaussian white noise. The signal received by listener E is
y_E(t) = H_AE(t)x_A(t) + n_E,   (2)
where n_E is the Gaussian white noise at the listener.
From equation (1), the covariance matrix of the information signal received at B is
Σ_B(t) = H_AB(t)Q_AB(t)H_AB(t)^H,   (3)
where Q_AB(t) is the covariance matrix of x_AB(t) and the superscript (·)^H denotes the conjugate transpose of a matrix or vector. The covariance matrix of the received interference is
W_B(t) = H_AB(t)Q_AE(t)H_AB(t)^H + H_EB(t)Q_E(t)H_EB(t)^H + σ²I_{N_B},   (4)
where σ²I_{N_B} is the covariance matrix of n_B, σ² is the noise power and I_x denotes an identity matrix of size x. According to equations (3) and (4), the channel capacity between transmitter A and receiver B is
C_B(t) = log det(I_{N_B} + W_B(t)^{-1}Σ_B(t)),   (5)
where the function det denotes the determinant of a matrix and the superscript (·)^{-1} denotes the matrix inverse.
According to equation (2), the covariance matrix of the information signal at the listener E is
Σ_E(t) = H_AE(t)Q_AB(t)H_AE(t)^H,   (6)
and the covariance matrix of the interference at E is
W_E(t) = H_AE(t)Q_AE(t)H_AE(t)^H + σ²I,   (7)
where I is an identity matrix matching the size of E's listening antenna array. From equations (6) and (7), the channel capacity between transmitter A and listener E is
C_E(t) = log det(I + W_E(t)^{-1}Σ_E(t)).   (8)
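The capacity expressions (5) and (8) can be evaluated with a few lines of linear algebra. The sketch below is an illustrative implementation under the reconstructed notation above (the function name and the use of base-2 logarithms are assumptions, not details taken from the patent).

```python
import numpy as np

def mimo_capacity(H_sig, Q_sig, interference_terms, sigma2):
    """log-det capacity log2 det(I + W^{-1} Sigma), with Sigma the covariance
    of the useful signal and W the covariance of interference plus noise."""
    n_rx = H_sig.shape[0]
    Sigma = H_sig @ Q_sig @ H_sig.conj().T                # received signal covariance
    W = sigma2 * np.eye(n_rx, dtype=complex)              # white-noise part
    for H, Q in interference_terms:                       # colored interference terms
        W = W + H @ Q @ H.conj().T
    M = np.eye(n_rx) + np.linalg.solve(W, Sigma)          # I + W^{-1} Sigma
    return float(np.log2(np.linalg.det(M).real))

# C_B as in (5): useful signal through H_AB; interference = artificial noise
# through H_AB plus E's jamming through H_EB.
# C_E as in (8): useful signal through H_AE; interference = artificial noise
# through H_AE plus thermal noise.
```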
In time slot t, transmitter A and listener E can each obtain only partial channel information, which we call local observation information (or local channel information). In time slot t, the local observation information of transmitter A is defined as o_A(t) ∈ O_A, where O_A is the local observation space of A; the local observation information of listener E is defined as o_E(t) ∈ O_E, where O_E is the local observation space of E; and the global state is defined as s_t ∈ S, where S is the global state space.
In each time slot t, transmitter A decides the power allocation of the transmitted signals x_AB(t) and x_AE(t), i.e. the covariance matrices Q_AB(t) and Q_AE(t). The power of the information signal is p_AB(t) = tr(Q_AB(t)), where tr denotes the trace of a matrix, and the power of the artificial noise is p_AE(t) = tr(Q_AE(t)). If transmitter A does not know the channel H_AE(t), the noise power p_AE(t) can be assumed to be equally distributed over each artificial noise stream, i.e. Q_AE(t) = (p_AE(t)/(N_A − N_B)) I_{N_A−N_B}. If each stream of the signal is assumed to be uncorrelated with the others, then Q_AB(t) and Q_AE(t) are both positive semidefinite symmetric matrices. The signal power and the artificial noise power must satisfy the total power constraint
tr(Q_AB(t)) + tr(Q_AE(t)) ≤ P_A^max,
where P_A^max is the maximum power of transmitter A. The action of transmitter A in time slot t is defined as a_A(t) = {Q_AB(t), Q_AE(t)}.
The covariance matrix of the interference signal x_E(t) of listener E is Q_E(t). Assuming that each stream of the signal is uncorrelated with the others, Q_E(t) is a positive semidefinite symmetric matrix. The power of the interference signal must satisfy the total power constraint
tr(Q_E(t)) ≤ P_E^max,
where P_E^max is the maximum transmit power of listener E. The action of E in time slot t is defined as a_E(t) = Q_E(t), and the joint action is defined as a_t = (a_A(t), a_E(t)).
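To make the action space concrete, the following sketch constructs positive-semidefinite covariance matrices and scales them so the trace constraints above hold. Parameter names such as P_A_max, P_E_max and the power-split fraction rho mirror the constraints in the text and are otherwise assumptions.

```python
import numpy as np

def random_psd(n: int) -> np.ndarray:
    """Random positive-semidefinite matrix, used as an unnormalized covariance."""
    X = np.random.randn(n, n) + 1j * np.random.randn(n, n)
    return X @ X.conj().T

def transmitter_action(N_A, N_B, P_A_max, rho=0.5):
    """Split A's power budget: a fraction rho for the information signal,
    the rest spread evenly over the N_A - N_B artificial-noise streams."""
    Q_AB = random_psd(N_B)
    Q_AB *= rho * P_A_max / np.trace(Q_AB).real          # tr(Q_AB) = rho * P_A_max
    p_AE = (1.0 - rho) * P_A_max
    Q_AE = (p_AE / (N_A - N_B)) * np.eye(N_A - N_B)      # equal power per noise stream
    return Q_AB, Q_AE

def listener_action(N_EJ, P_E_max):
    """Jamming covariance of listener E scaled to its power budget."""
    Q_E = random_psd(N_EJ)
    return Q_E * (P_E_max / np.trace(Q_E).real)

Q_AB, Q_AE = transmitter_action(N_A=4, N_B=2, P_A_max=1.0)
assert np.trace(Q_AB).real + np.trace(Q_AE).real <= 1.0 + 1e-9
```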
According to information theory, if C_E(t) is greater than C_B(t), the listener can decode, with arbitrarily small error in the physical-layer sense, the information that transmitter A transmits to receiver B. The reward function of listener E is therefore defined as in equation (9), where 1{·} denotes the Boolean indicator function, which outputs 1 when the input condition is true and 0 otherwise; the reward is scaled so that its variation with the action is amplified exponentially. Whenever information transmitted by A is eavesdropped, a penalty is imposed for each portion of the eavesdropped data. Transmitter A aims to reduce the amount of eavesdropped information while maximizing the transmission rate, so the reward function of transmitter A is defined as in equation (10), where ζ > 0 is a coefficient that balances the transmission rate against the information-leakage penalty. We abbreviate r_E(s_t, a_t) and r_A(s_t, a_t) as r_E(t) and r_A(t), respectively.
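One plausible instantiation of these rewards, written purely as an assumption consistent with the description (an indicator-gated eavesdropping reward for E and a rate-minus-penalty reward for A, without the exponential scaling mentioned above), is:

```python
def listener_reward(C_B: float, C_E: float) -> float:
    # Assumed form: E is rewarded with the eavesdropped rate only when it can
    # decode the transmission, i.e. when C_E >= C_B (physical-layer condition).
    return C_B if C_E >= C_B else 0.0

def transmitter_reward(C_B: float, C_E: float, zeta: float = 2.0) -> float:
    # Assumed form: transmission rate minus zeta times the leakage penalty.
    return C_B - zeta * listener_reward(C_B, C_E)
```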
the strategy for defining the emitter A is pi A The selection of the action is carried out,
Figure BDA00036222001500000717
is based on
Figure BDA00036222001500000718
The observed information is
Figure BDA00036222001500000719
When A chooses an action using the probability distribution
Figure BDA00036222001500000720
The policy for definition of listener E is π E The selection of the action is carried out,
Figure BDA00036222001500000721
is based on
Figure BDA00036222001500000722
The observed information is
Figure BDA00036222001500000723
The transmitter A uses the probability distribution to select an action
Figure BDA00036222001500000724
The joint strategy of the two is expressed as pi = (pi) AE ). The objective function of the transmitter A is
Figure BDA0003622200150000081
Meaning that under the conditions of the strategy pi,
Figure BDA0003622200150000082
mathematical expectation in the time dimension, i.e. the average prize value. Likewise, the listener E has an objective function of
Figure BDA0003622200150000083
The optimization objectives for transmitter a are:
Figure BDA0003622200150000084
the optimization objectives for listener E are:
Figure BDA0003622200150000085
to solve the problems (11) and (12), a common reinforcement learning algorithm can be applied to the transmitter a and the listener E to solve the problems, but the learning results are difficult to converge due to the fact that the strategies of the two parties are changed. Therefore, a multi-agent reinforcement learning algorithm FSP-SAC is designed to learn respectively optimal strategies pi for the transmitter A and the monitor E A And pi E . The method stabilizes the learning process by training an average strategy and an optimal response strategy.
We present the process of solving the problems (11) and (12) using FSP-SAC, since both problems (11) and (12) are solved using FSP-SAC, for simplicity of description, one of the transmitter a or the listener E is represented by n, i.e. n ∈ { a, E }, and the algorithm process is as follows:
1) For any n ∈ {A, E}, initialize a best-response strategy β_n parameterized by θ_n, an average strategy π̄_n parameterized by ψ_n, a value function V_n parameterized by ω_n, and a value function Q_n parameterized by φ_n. To stabilize the learning process, also initialize a target value function Q̄_n and assign the parameter φ_n to its parameter φ̄_n.
2) Data collection: the notation x ∼ p(x) means that x obeys the probability distribution p(x). In each time slot t, given the local observation information o_n(t), device n selects an action with probability 0.1 using the strategy β_n, i.e. a_n(t) ∼ β_n(·|o_n(t)), and with probability 0.9 using the strategy π̄_n, i.e. a_n(t) ∼ π̄_n(·|o_n(t)). We call this probabilistic selection a mixed strategy, and the mixture of β_n and π̄_n is denoted π_n. After both transmitter A and listener E have performed their actions, the system transitions to the next state, and n receives the reward r_n(t) and observes the local observation information o_n(t+1) of the next time slot. The collected data (o_n(t), a_n(t), r_n(t), o_n(t+1)) are stored in the storage area M_n^RL. If the action was selected by the strategy β_n, the data (o_n(t), a_n(t)) are also stored in another storage area M_n^SL. Assuming that the data-collection phase lasts T steps, when t = T the data collection ends and the optimization learning phase begins.
3) Reinforcement-learning phase: the data in M_n^RL are used to update β_n, V_n and Q_n. A mini-batch τ_RL of length L is randomly selected from M_n^RL. First, the gradient Δω_n of the loss of the value function V_n is computed, where ∇_x denotes the gradient with respect to the variable x, the actions used in this gradient are sampled from the strategy β_n rather than taken from the sample τ_RL, and the temperature parameter α ∈ [0,1]. Next, the gradient Δθ_n of the loss of the strategy β_n is computed. Then the gradient Δφ_n of the loss of the value function Q_n is computed, where the discount factor γ ∈ (0,1). The parameters are then updated: θ_n ← θ_n + ηΔθ_n, ω_n ← ω_n + ηΔω_n, φ_n ← φ_n + ηΔφ_n, φ̄_n ← νφ_n + (1 − ν)φ̄_n, where η is the learning rate with value range (0,1), ν is the moving-average parameter with value range (0,1), and the symbol ← means that the value on the right of the arrow is assigned to the variable on the left.
4) Supervised-learning phase: the data in M_n^SL are used to update π̄_n. A mini-batch of length B is randomly selected from M_n^SL, the gradient Δψ_n of the supervised loss of the average strategy π̄_n is computed, and the parameters are updated: ψ_n ← ψ_n + ηΔψ_n. Then return to step 2) until the strategy π_n converges to a steady state.
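The overall structure of this FSP-SAC procedure can be summarized by the following Python-style skeleton. It is a sketch of the control flow only, under stated assumptions: the agent and environment objects (best_response, average_policy, rl_buffer, sl_buffer, sac_update, supervised_update, env.step) are placeholder names the reader must supply, not code disclosed in the patent.

```python
import random

ETA_BR = 0.1        # probability of playing the best-response strategy beta_n
T_COLLECT = 1000    # length of the data-collection phase (T in the description)

def fsp_sac(agents, env, n_iterations):
    """Skeleton of the FSP-SAC loop: mixed-strategy data collection into two
    buffers, then a SAC (best-response) update and a supervised
    (average-strategy) update for every device n in {A, E}."""
    for _ in range(n_iterations):
        obs = env.reset()
        # --- data collection with the mixed strategy ---------------------
        for _t in range(T_COLLECT):
            actions, used_br = {}, {}
            for n, agent in agents.items():
                use_br = random.random() < ETA_BR
                policy = agent.best_response if use_br else agent.average_policy
                actions[n] = policy.sample(obs[n])
                used_br[n] = use_br
            next_obs, rewards = env.step(actions)        # both act, system transitions
            for n, agent in agents.items():
                agent.rl_buffer.add(obs[n], actions[n], rewards[n], next_obs[n])
                if used_br[n]:                           # best-response data only
                    agent.sl_buffer.add(obs[n], actions[n])
            obs = next_obs
        # --- optimization phase -------------------------------------------
        for agent in agents.values():
            rl_batch = agent.rl_buffer.sample(128)       # length-L mini-batch
            agent.sac_update(rl_batch)                   # updates theta, omega, phi, target
            sl_batch = agent.sl_buffer.sample(128)       # length-B mini-batch
            agent.supervised_update(sl_batch)            # updates psi (average strategy)
```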
Finally, we simulate the system. The simulation parameters are set as follows: σ² = 10⁻⁸ mW, N_A = 4. The distances between transmitter A, receiver B and listener E are all 200 m, and the path-loss exponent is 3.48. The coefficient ζ = 2 in formula (10). The strategy and value functions are parameterized by multilayer perceptrons (a type of artificial neural network) with ReLU (Rectified Linear Unit) activation, each with two hidden layers of 128 neurons. η = 0.0003, α = 0.05, ν = 0.005, γ = 0.99, T = 1000, L = 128.
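As an illustration of the parameterization described above (two hidden layers of 128 ReLU units), a minimal network definition might look like the following sketch. The class name and the Gaussian-policy head are assumptions, since the patent does not publish its network code.

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """Two hidden layers of 128 ReLU units, as in the simulation setup;
    outputs mean and log-std of a Gaussian over the (flattened) action."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.mean = nn.Linear(128, act_dim)
        self.log_std = nn.Linear(128, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        return self.mean(h), self.log_std(h).clamp(-20, 2)

# learning rate eta = 0.0003 as in the simulation setup
optimizer = torch.optim.Adam(PolicyMLP(16, 8).parameters(), lr=3e-4)
```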
In FIG. 3 we compare several other methods: the SAC (Soft Actor-Critic) method is from "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", and the WoLF-PPO method is from "Win or Learn Fast Proximal Policy Optimisation". FIG. 2 and FIG. 3 show the learning curves of transmitter A and listener E, respectively. It can be seen that the result curve of the multi-agent reinforcement learning algorithm FSP-SAC converges asymptotically to a steady-state value, while the other learning algorithms used for comparison fluctuate severely; the invention therefore solves the problem that other reinforcement learning methods have difficulty converging in game problems. According to the relationship between convergence and rationality in games, and since FSP-SAC inherits rationality from SAC, it can be concluded that the result of the FSP-SAC method converges to the Nash equilibrium.

Claims (2)

1. An active monitoring method in a MIMO communication system comprises the following steps:
(1) In each time slot t, the multi-antenna transmitter A transmits an information signal x_AB(t) to the multi-antenna receiver B and transmits an interference signal x_AE(t) to the multi-antenna listener E, in order to reduce the channel capacity from the multi-antenna transmitter A to the multi-antenna listener E and thus prevent listening by the multi-antenna listener E. The action of the multi-antenna transmitter A is a_A(t) = {Q_AB(t), Q_AE(t)}, where Q_AB(t) and Q_AE(t) are the covariance matrices of x_AB(t) and x_AE(t), respectively. Based on its local channel information o_A(t), the multi-antenna transmitter A selects an action using a strategy π_A: it samples from the conditional probability distribution π_A(·|o_A(t)), and the sampled value is the selected action a_A(t).
(2) In each time slot t, the multi-antenna listener E transmits an interference signal x_E(t) to the multi-antenna receiver B, in order to reduce the channel capacity between the multi-antenna transmitter A and the multi-antenna receiver B and thereby increase the listening success rate. The action of the multi-antenna listener E is a_E(t) = Q_E(t), where Q_E(t) is the covariance matrix of x_E(t). Based on its local channel information o_E(t), the multi-antenna listener E selects an action using a strategy π_E: it samples from the conditional probability distribution π_E(·|o_E(t)), and the sampled value is the selected action a_E(t).
(3) In each time slot t, after the multi-antenna transmitter A and the multi-antenna listener E have performed their actions, they receive rewards r_A(t) and r_E(t), respectively. Let π = {π_A, π_E}. The average reward function of the multi-antenna transmitter A is defined as J_A(π) = E_π[r_A(t)], where E_π[·] denotes the mathematical expectation over the time slots t under the joint strategy π; the average reward function of the multi-antenna listener E is J_E(π) = E_π[r_E(t)]. The strategy π_A is optimized to maximize J_A(π) and the strategy π_E is optimized to maximize J_E(π), so as to reach the Nash equilibrium of the monitoring and anti-monitoring game.
2. The active listening method in a MIMO communication system according to claim 1, wherein said step (3) further comprises the steps of:
1) For any device n, where n ∈ {A, E}, initialize a strategy β_n parameterized by θ_n, a strategy π̄_n parameterized by ψ_n, a value function V_n parameterized by ω_n and a value function Q_n parameterized by φ_n, and assign the parameter φ_n to the parameter φ̄_n of the target value function Q̄_n.
2) In each time slot t, device n selects an action using the strategy β_n with probability 0.1 and using the strategy π̄_n with probability 0.9, and stores the collected data d_n(t) in a first storage area M_n^RL, where, when n = A, the data are d_A(t) = (o_A(t), a_A(t), r_A(t), o_A(t+1)), and when n = E, the data are d_E(t) = (o_E(t), a_E(t), r_E(t), o_E(t+1)); if the action was selected by the strategy β_n, the data (o_n(t), a_n(t)) are also stored in a second storage area M_n^SL.
3) Randomly select a mini-batch of length L from M_n^RL and compute the gradient Δω_n of the loss of the value function V_n, where ∇_x denotes the gradient with respect to the variable x, the actions used in this gradient are sampled from the strategy β_n, and the temperature parameter α ∈ [0,1]; compute the gradient Δθ_n of the loss of the strategy β_n; and compute the gradient Δφ_n of the loss of the value function Q_n, where the discount factor γ ∈ (0,1). Randomly select a mini-batch of length L from M_n^SL and compute the gradient Δψ_n of the loss of the strategy π̄_n. Then update the parameters θ_n ← θ_n + ηΔθ_n, ω_n ← ω_n + ηΔω_n, φ_n ← φ_n + ηΔφ_n, φ̄_n ← νφ_n + (1 − ν)φ̄_n, ψ_n ← ψ_n + ηΔψ_n, where η is the learning rate with value range (0,1), ν is the moving-average parameter with value range (0,1), and the symbol ← means that the value on the right of the arrow is assigned to the variable on the left; then return to step 2) until the strategy parameters θ_n no longer change.
CN202210470392.2A 2022-04-28 2022-04-28 Active monitoring method in MIMO communication system Active CN115296705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210470392.2A CN115296705B (en) 2022-04-28 2022-04-28 Active monitoring method in MIMO communication system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470392.2A CN115296705B (en) 2022-04-28 2022-04-28 Active monitoring method in MIMO communication system

Publications (2)

Publication Number Publication Date
CN115296705A true CN115296705A (en) 2022-11-04
CN115296705B CN115296705B (en) 2023-11-21

Family

ID=83819503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470392.2A Active CN115296705B (en) 2022-04-28 2022-04-28 Active monitoring method in MIMO communication system

Country Status (1)

Country Link
CN (1) CN115296705B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090088176A1 (en) * 2007-09-27 2009-04-02 Koon Hoo Teo Method for Reducing Inter-Cell Interference in Wireless OFDMA Networks
US10069592B1 (en) * 2015-10-27 2018-09-04 Arizona Board Of Regents On Behalf Of The University Of Arizona Systems and methods for securing wireless communications
CN112840600A (en) * 2018-08-20 2021-05-25 瑞典爱立信有限公司 Immune system for improving sites using generation of countermeasure networks and reinforcement learning
WO2021136070A1 (en) * 2019-12-30 2021-07-08 三维通信股份有限公司 Resource allocation method for simultaneous wireless information and power transfer, device, and computer
CN111726845A (en) * 2020-07-01 2020-09-29 南京大学 Base station switching selection and power distribution method in multi-user heterogeneous network system
CN112087749A (en) * 2020-08-27 2020-12-15 华北电力大学(保定) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
CN114363908A (en) * 2022-01-13 2022-04-15 重庆邮电大学 A2C-based unlicensed spectrum resource sharing method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DELIN GUO: "A Proactive Eavesdropping Game in MIMO Systems Based on Multiagent Deep Reinforcement Learning", 《 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS》 *
DELIN GUO: "Eavesdropping Game Based on Multi-Agent Deep Reinforcement Learning", 《 2022 IEEE 23RD INTERNATIONAL WORKSHOP ON SIGNAL PROCESSING ADVANCES IN WIRELESS COMMUNICATION (SPAWC)》 *
吴伟;胡冰;胡峰;: "基于全双工的主动监听系统中合法通信速率最大化方法设计", 南京邮电大学学报(自然科学版), no. 02 *
李奕男: "基于博弈论的移动Ad hoc网络入侵检测模型", 《电子与信息学报》 *

Also Published As

Publication number Publication date
CN115296705B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
Zhang et al. NAS-AMR: Neural architecture search-based automatic modulation recognition for integrated sensing and communication systems
CN109617584B (en) MIMO system beam forming matrix design method based on deep learning
CN112600772B (en) OFDM channel estimation and signal detection method based on data-driven neural network
CN109274456B (en) Incomplete information intelligent anti-interference method based on reinforcement learning
Zhao et al. Cognitive radio engine design based on ant colony optimization
CN109302262A (en) A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
CN108712748B (en) Cognitive radio anti-interference intelligent decision-making method based on reinforcement learning
CN111224905B (en) Multi-user detection method based on convolution residual error network in large-scale Internet of things
CN112906035B (en) Method for generating frequency division duplex system key based on deep learning
CN118054828B (en) Intelligent super-surface-oriented beam forming method, device, equipment and storage medium
CN109412661A (en) A kind of user cluster-dividing method under extensive mimo system
CN114449584B (en) Distributed computing unloading method and device based on deep reinforcement learning
Zhang et al. Deep reinforcement learning-empowered beamforming design for IRS-assisted MISO interference channels
Fan et al. Demodulator based on deep belief networks in communication system
Zhang et al. Resource management for heterogeneous semantic and bit communication systems
CN115296705A (en) Active monitoring method in MIMO communication system
Omid et al. Deep Reinforcement Learning-Based Secure Standalone Intelligent Reflecting Surface Operation
Zhou et al. QoS-aware power management with deep learning
Zhang et al. Beyond supervised power control in massive MIMO network: Simple deep neural network solutions
Dai et al. Power allocation for multiple transmitter-receiver pairs under frequency-selective fading based on convolutional neural network
Miao et al. A novel millimeter wave channel estimation algorithm based on IC-ELM
CN112087275A (en) Cooperative spectrum sensing method based on birth and death process and viscous hidden Markov model
Tingting et al. Dynamic threshold spectrum sensing method based on DQN combined with clustered cooperative sensing architecture
Zhang et al. Machine Learning enabled Heterogeneous Semantic and Bit Communication
Tan et al. Personalized Recognition for Distributed Jamming in Dynamic Environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant