CN114845332A - Millimeter wave communication link blocking prediction method based on visual information fusion - Google Patents

Millimeter wave communication link blocking prediction method based on visual information fusion

Info

Publication number
CN114845332A
Authority
CN
China
Prior art keywords
sequence
model
module
embedding
coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210480580.3A
Other languages
Chinese (zh)
Inventor
杨绿溪
张明寒
邓淼佩
周婷
李春国
黄永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210480580.3A priority Critical patent/CN114845332A/en
Publication of CN114845332A publication Critical patent/CN114845332A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/147Network analysis or design for predicting network behaviour
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/08Reselecting an access point
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/24Reselection being triggered by specific parameters
    • H04W36/30Reselection being triggered by specific parameters by measured or perceived connection quality data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a millimeter wave communication link blocking prediction method based on visual information fusion. The method effectively predicts moving blockages during communication, enabling a user to proactively switch to another line-of-sight base station before a blockage occurs, so that communication always remains on a line-of-sight link and the stability of the millimeter wave communication system is improved.

Description

Millimeter wave communication link blocking prediction method based on visual information fusion
Technical Field
The invention belongs to the field of wireless communication and deep learning, and particularly relates to a millimeter wave communication system beam blocking prediction method based on visual information fusion.
Background
Millimeter waves and massive MIMO are among the key technologies of 5G mobile communication. The large bandwidth of millimeter waves greatly increases channel capacity and can meet the high data-rate requirements of future applications such as autonomous driving and virtual reality. Using beamforming, the base station can aim the signal beam at the user's position, improving the communication signal-to-noise ratio.
However, a key challenge faced by millimeter wave communication systems is the susceptibility of high-frequency signals to blockage. Because of their high free-space loss and weak reflection, high-frequency signals are transmitted mainly over line-of-sight links. When an obstacle lies between the user and the base station, the received signal-to-noise ratio drops dramatically, which may cause a sudden interruption of communication and seriously affect its stability. When the link between the user and the base station is blocked, a new line-of-sight link usually has to be re-established, which takes processing time; for massive MIMO systems in particular, beam training brings a large time overhead. Given the low-latency requirements of future communication networks, a communication system should not only maintain line-of-sight connectivity but also be able to sense future blockages in advance.
Studies have shown that machine learning models can use wireless channel data (e.g., the channel or the received power) to distinguish line-of-sight from non-line-of-sight links; for example, blockage prediction can be performed by feeding a user's beam sequence into a gated recurrent unit (GRU) network. However, such algorithms suit the case of stationary blockages and cannot predict moving blockages well.
Multimodal deep learning designs algorithms so that a model can simultaneously use information from several modalities such as text, images and sound, and has recently achieved excellent performance in many natural language processing tasks. In a communication system, multimodal techniques can combine wireless channel data with data of other modalities, improving the algorithm's awareness of the environment.
Disclosure of Invention
The invention aims to provide a Transformer-based, vision-fused beam blockage prediction method to cope with the complex multi-directional moving-blockage scenes found in real communication networks, so that sudden blockages in a millimeter wave communication system can be sensed in advance. The scheme allows a user to proactively switch to another line-of-sight base station before the blockage occurs, avoiding the sudden drop in signal-to-noise ratio caused by blockage and ensuring the stability of the communication process.
In order to achieve this, the technical scheme adopted by the invention is as follows: a millimeter wave communication link blocking prediction method based on visual information fusion, comprising the following steps:
Step (1): model the beam blockage prediction problem as a binary classification problem based on multimodal information, where the model consists of a target detection module, a camera selection module, an embedding module, a Transformer module and a classification module. Initialize the model parameters, including the neural network weights and biases of each module;
The target detection module locates the coordinates of suspected obstacles in the acquired image; the embedding module encodes the input beam sequence and target coordinate sequence into vectors of a specified dimension; the camera selection module predicts, from the input beam sequence, the index of the camera covering the user; the Transformer module is an attention-based encoder; and the classification module outputs the final binary classification result.
The millimeter wave base station is equipped with three cameras, two of which are side cameras with a 75 degree field of view and one of which is a center camera with a 110 degree field of view.
Step (2): for user u, at each slot τ, construct a beam sequence of length r, $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$, and an image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$ as the training sample sequence $S_u$. Simultaneously construct a link state sequence of length r′, $\{a_u[\tau+1], \ldots, a_u[\tau+r']\}$, as the training sample label $q_u$.
(2.1) Defining the input sequence: the aim is to develop a deep learning model that uses an RGB image sequence and a beam sequence to predict the blocking condition of the communication link. For any user u in the communication environment, the image sequence and beam sequence observed over r unit time intervals form a group of input sequences. For any time slot τ, the input sequence is

$$S_u[\tau] = \{(X_n[t],\, b_u[t])\}_{t=\tau-r+1}^{\tau}$$
where $X_n[t] \in \mathbb{R}^{W \times H \times C}$ denotes the RGB image captured by the n-th camera in the t-th time slot, and W, H, C denote the width, height and number of color channels of the image, respectively. $b_u[t]$ denotes the index, in the codebook $\mathcal{F}$, of the beamforming vector used to serve user u in the t-th slot. $r \in \mathbb{Z}^{+}$ denotes the length of the observation interval.
(2.2) Defining the output variable $q_u$: let $a_u[t] \in \{0,1\}$ denote the communication link state of user u in the t-th time slot, where 0 denotes line-of-sight and 1 denotes non-line-of-sight communication. The link connection state $q_u$ of user u within a future time window of length r′ is

$$q_u[\tau] = \max\{a_u[\tau+1], \ldots, a_u[\tau+r']\} \in \{0,1\}$$

where 0 indicates that user u maintains line-of-sight communication throughout the window, and 1 indicates that a link blockage occurs within the window.
(2.3) Defining the model function: the algorithm aims to establish a function $f_\Theta(S)$ that receives the observed image-beam sequence pairs and predicts the future link state $\hat{q}_u$, where Θ denotes the parameter set of the model, learned from the labeled sequence dataset. The goal of model training can be expressed as

$$f_{\Theta}^{*} = \operatorname*{argmax}_{f_{\Theta}}\; \mathbb{P}\left(\hat{q}_u = q_u \mid S_u[\tau]\right)$$
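As an illustration of steps (2.1)-(2.3), the following minimal Python sketch (hypothetical helper name and array layouts; not part of the patent) builds one training sample $S_u$ and its label $q_u$ from per-slot logs:

```python
import numpy as np

def build_sample(images, beams, link_states, tau, r, r_prime):
    """Build one (S_u, q_u) pair as defined in (2.1)-(2.2).

    images:      [T, W, H, C] RGB frames X_n[t] from one camera
    beams:       [T]          optimal codeword indices b_u[t]
    link_states: [T]          a_u[t] in {0, 1} (0 = LOS, 1 = NLOS)
    """
    S_u = {
        "X": images[tau - r + 1 : tau + 1],  # observed image sequence
        "b": beams[tau - r + 1 : tau + 1],   # observed beam sequence
    }
    # q_u = 1 if any blockage occurs in the future window of length r'
    q_u = int(link_states[tau + 1 : tau + r_prime + 1].max())
    return S_u, q_u
```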
Step (3): input the image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$ into the target detection module, which outputs the obstacle detection-box coordinate sequence $\{d_n[\tau-r+1], \ldots, d_n[\tau]\}$;
The target detection module needs two basic capabilities: 1) fast and accurate detection of object coordinates, and 2) effective identification of object types. The YOLO detector satisfies both requirements; this module adopts the recent YOLOv5 framework with modifications. The original architecture is adapted to detect objects of interest in the scene, i.e., objects that may block the user's communication link, such as buses, trucks, trees and buildings.
For a given time slot τ, the following steps are performed in order:
(3.1) Obtain the RGB image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$, $X_n[t] \in \mathbb{R}^{W \times H \times C}$;
(3.2) Input the sequence $X_n$ into the YOLO detector to obtain the bounding-box coordinates of the detected targets;
(3.3) Convert each bounding-box coordinate into a 6-dimensional vector comprising the center coordinate $[x_{cent}, y_{cent}]$, the upper-left corner coordinate $[x_1, y_1]$ and the lower-right corner coordinate $[x_2, y_2]$. The coordinates are normalized to the interval [0,1]; together they mark the exact position of an object in the scene;
(3.4) Stack the converted coordinate vectors of one image into a high-dimensional vector $d_n[t] \in \mathbb{R}^{6M}$, where M denotes the number of target objects detected in the image and $t \in \{\tau-r+1, \ldots, \tau\}$. Since the scene is a dynamic communication environment, the number of detected objects in the image varies over time, so the length of $d_n[t]$ is variable. Padding with N−M zero vectors therefore yields a fixed-length vector $d_n[t] \in \mathbb{R}^{6N}$;
(3.5) The module finally outputs the detection-box coordinate sequence $\{d_n[\tau-r+1], \ldots, d_n[\tau]\}$.
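The conversion and zero-padding of steps (3.3)-(3.4) can be sketched as follows (a minimal illustration assuming pixel-coordinate boxes; the function name and argument layout are hypothetical):

```python
import numpy as np

def boxes_to_vector(boxes, img_w, img_h, n_max):
    """Convert detected bounding boxes into the fixed-length vector d_n[t].

    boxes: [M, 4] pixel corners (x1, y1, x2, y2); n_max: padded size N.
    Returns a [n_max, 6] array, zero-padded when M < n_max.
    """
    out = np.zeros((n_max, 6), dtype=np.float32)
    for i, (x1, y1, x2, y2) in enumerate(boxes[:n_max]):
        xc, yc = (x1 + x2) / 2, (y1 + y2) / 2  # center coordinate
        vec = np.array([xc, yc, x1, y1, x2, y2], dtype=np.float32)
        vec /= np.array([img_w, img_h] * 3, dtype=np.float32)  # normalize to [0, 1]
        out[i] = vec
    return out
```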
Step (4): input the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ into the beam embedding module to obtain the beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$. Input the beam embedding sequence into the camera selection module to determine the camera covering the user at this moment. Input the detection coordinate sequence $d_n$ of that camera into the coordinate embedding module, which outputs the corresponding detection-coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$;
(4.1) Obtain the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ of user u, where each beam is the index of the optimal codeword in the codebook $\mathcal{F}$ serving the user. The optimal codeword is defined as

$$b_u[t] = \operatorname*{argmax}_{\mathbf{f}_m \in \mathcal{F}} \sum_{k=1}^{K} \log_2\!\left(1 + \frac{P_s}{K\sigma_n^2}\,\bigl|\mathbf{h}_{u,k}^{H}\mathbf{f}_m\bigr|^2\right)$$

where $\mathbf{f}_m \in \mathbb{C}^{N_m}$ is a codeword of the codebook $\mathcal{F}$, $N_m$ is the number of base station antennas, $\mathbf{h}_{u,k} \in \mathbb{C}^{N_m}$ is the downlink channel between the base station and the user, $P_s$ denotes the transmit power, $\sigma_n^2$ denotes the noise power, and k denotes the k-th carrier.
(4.2) The beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ is input to the beam embedding module. Since the algorithm receives data of two modalities (beams and images) whose dimensions differ, both must be converted into vectors of the same dimension by the embedding module.
For the beam sequence, a lookup table of size $|\mathcal{F}|$ is generated; given an input beam codeword index $b_u[t]$, the embedding layer returns the embedding vector corresponding to that index, $b[t] \in \mathbb{R}^{d_{model}}$, where $d_{model}$ is the defined feature vector dimension.
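In PyTorch, the beam embedding lookup table can be sketched as below; the codebook size follows the 128-codeword DFT codebook of the simulation, while d_model = 64 is an assumed value (the patent does not fix it):

```python
import torch
import torch.nn as nn

codebook_size = 128  # |F|
d_model = 64         # assumed feature vector dimension

beam_embedding = nn.Embedding(codebook_size, d_model)  # lookup table of size |F|

beam_seq = torch.tensor([[5, 9, 12, 17, 23, 31, 40, 52]])  # b_u indices, [batch, r]
b = beam_embedding(beam_seq)  # beam embedding sequence, shape [1, 8, d_model]
```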
(4.3) The beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$ is input into the camera selection model $\mathrm{NET}_s$, which outputs the feature vector $\hat{y}_s$. The camera selection module comprises $L_s$ fully connected layers, and the model can be expressed as

$$\hat{y}_s = \mathrm{NET}_s\bigl(\{b[\tau-r+1], \ldots, b[\tau]\};\, \Theta_s\bigr)$$

where $\Theta_s = \{W_s, b_s\}$ denotes the weights and biases of the fully connected layers, and the nonlinear function applied in each layer is the ReLU activation

$$g(x) = \mathrm{ReLU}(x) = \max(0, x)$$
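A minimal sketch of the camera selection network $\mathrm{NET}_s$ (the hidden width and window length r are assumed values; the patent only specifies $L_s$ fully connected layers with ReLU). Returning raw logits keeps it compatible with the cross-entropy loss sketched in step (6):

```python
import torch
import torch.nn as nn

class CameraSelector(nn.Module):
    """Fully connected layers with ReLU; one logit per camera."""
    def __init__(self, d_model=64, r=8, hidden=128, n_cameras=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                  # [batch, r, d_model] -> [batch, r * d_model]
            nn.Linear(r * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_cameras),  # logits for the three cameras
        )

    def forward(self, b):
        return self.net(b)  # raw logits; apply softmax for probabilities

y_s_logits = CameraSelector()(torch.randn(1, 8, 64))  # -> shape [1, 3]
```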
Step (5): fuse the target-detection coordinate embedding sequence with the beam embedding sequence, feed the fused sequence into the Transformer encoder module for encoding, then feed the encoded sequence into the classification module for binary classification, predicting the link connection state $\hat{q}_u$ of user u within a future time window of length r′;
(5.1) Since the Transformer relies only on the attention mechanism and has no recurrent or convolutional structure, information with absolute position must be injected into the input sequence so that the model can exploit its order. The invention encodes the input sequence with position embedding and modal-type embedding.
The position embedding is calculated as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos \in \{0, \ldots, L_{seq}-1\}$ denotes the position of the token in the sequence, $L_{seq}$ denotes the sequence length, and $i \in [0, d_{model}/2)$ indexes the dimensions of the position embedding.
The beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$ and the target-detection coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$ are each passed through the position encoding function $F_{PE}(\cdot)$:

b = b + F_PE(b)
d = d + F_PE(d)

The modal-type embedding mainly lets the model distinguish the two modalities:

ME_b = full_like(b, 0)
ME_d = full_like(d, 1)
b = b + ME_b
d = d + ME_d

where full_like(x, n) constructs a vector with the same dimensions as x, filled with n.
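Both embeddings can be sketched in a few lines of PyTorch (shapes assumed [batch, length, d_model]; adding ME_b = 0 is a no-op, kept only to mirror the formulas above):

```python
import torch

def position_encoding(seq_len, d_model):
    """Sinusoidal position embedding F_PE from (5.1)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def add_position_and_modal(b, d):
    b = b + position_encoding(b.shape[1], b.shape[2])  # b = b + F_PE(b)
    d = d + position_encoding(d.shape[1], d.shape[2])  # d = d + F_PE(d)
    b = b + torch.full_like(b, 0.0)  # modal-type embedding ME_b
    d = d + torch.full_like(d, 1.0)  # modal-type embedding ME_d
    return b, d
```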
(5.2) The beam sequence b and the target-detection coordinate sequence d are concatenated and input into the Transformer encoder model. The Transformer encoder is a stack of L multi-head attention layers and feedforward neural network layers. The multi-head attention mechanism is computed as follows:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where the input of the MultiHead function is

$$Q = K = V = \{b[\tau-r+1], \ldots, b[\tau],\, d[\tau-r+1], \ldots, d[\tau]\}$$
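The encoder itself maps directly onto PyTorch's built-in Transformer encoder; the hyper-parameters below (d_model, heads, layers) are assumed for illustration:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 8, 4  # assumed hyper-parameters

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads,
    dim_feedforward=4 * d_model, batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

b = torch.randn(1, 8, d_model)     # beam embedding sequence
d = torch.randn(1, 8, d_model)     # coordinate embedding sequence
tokens = torch.cat([b, d], dim=1)  # concatenation, i.e. Q = K = V
y = encoder(tokens)                # encoded features, shape [1, 16, d_model]
```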
(5.3) The Transformer encoder model produces a feature vector $y_t$. Inputting $y_t$ to the classification module $\mathrm{NET}_o$ gives the final prediction

$$\hat{q}_u = \mathrm{NET}_o(y_t;\, \Theta_o)$$

where $\mathrm{NET}_o$ is a fully connected network whose activation is the ReLU function $g(x) = \max(0, x)$.
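A sketch of the classification module $\mathrm{NET}_o$; mean-pooling the encoded tokens into $y_t$ is an assumption, since the patent does not state how the feature vector is extracted from the encoder output:

```python
import torch
import torch.nn as nn

class BlockagePredictor(nn.Module):
    """NET_o: pool the encoder output, then FC + ReLU + sigmoid."""
    def __init__(self, d_model=64, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, y):
        y_t = y.mean(dim=1)                # pooled feature vector y_t (assumed)
        h = torch.relu(self.fc1(y_t))
        return torch.sigmoid(self.fc2(h))  # \hat{q}_u in (0, 1)
```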
Step (6): compute the loss between the predicted value $\hat{q}_u$ and the label $q_u$, and update the model parameters Θ by back-propagating the gradient;
The classification model outputs the blockage prediction $\hat{q}_u$, with blockage label $q_u \in \{0,1\}$. In step (4), the model outputs the camera selection $\hat{y}_s$, with camera label $y_s \in \{(0,0,1), (0,1,0), (1,0,0)\}$.
The loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathcal{L}_{block} + \alpha\, \mathcal{L}_{cam}$$
$$\mathcal{L}_{block} = -\bigl[q_u \log \hat{q}_u + (1-q_u)\log(1-\hat{q}_u)\bigr]$$
$$\mathcal{L}_{cam} = -\sum_{j} y_{s,j} \log \hat{y}_{s,j}$$

where $\mathcal{L}_{block}$ is the blockage prediction loss, $\mathcal{L}_{cam}$ is the camera prediction loss, $\mathcal{L}$ is the total model loss, and α is the weight coefficient of the camera loss. The model parameters Θ are updated by stochastic gradient descent:

$$\Theta \leftarrow \Theta - \lambda\, \nabla_{\Theta}\, \mathcal{L}$$

where λ is the learning rate. Steps (2) to (6) are executed cyclically until the algorithm converges.
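A sketch of one training step under these definitions (α = 0.1 is an assumed weight; the camera branch follows the usual logits-plus-class-index cross-entropy convention):

```python
import torch
import torch.nn.functional as F

def total_loss(q_hat, q, cam_logits, cam_label, alpha=0.1):
    """L = L_block + alpha * L_cam."""
    loss_block = F.binary_cross_entropy(q_hat, q.float())  # blockage BCE
    loss_cam = F.cross_entropy(cam_logits, cam_label)      # camera CE (logits + class index)
    return loss_block + alpha * loss_cam

# One SGD update, Theta <- Theta - lambda * grad(L):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# loss = total_loss(q_hat, q, cam_logits, cam_label)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```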
The invention has the beneficial effects that:
1) Using a machine learning algorithm, a user can predict an impending communication link blockage and switch the communication to another line-of-sight link in advance, ensuring stable communication;
2) The model uses bimodal beam-and-image information; whereas models based only on wireless information are limited to stationary blockages, this method suits complex scenes with multi-directional moving blockages;
3) Using the attention-based Transformer model instead of networks such as RNN and LSTM greatly improves the parallel computing capability of the model, shortening training and inference time and facilitating large-scale practical application;
4) The model trains and infers using only the user's beam sequence and image information, and is insensitive to signal-to-noise-ratio changes in the communication environment.
Drawings
FIG. 1 is a flow chart of a method for beam blockage prediction;
FIG. 2 is a schematic diagram of a Transformer module;
FIG. 3 is a validation set ROC graph.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings and the detailed implementation mode.
In order to cope with the complex multi-directional moving-blockage scenes in real communication networks, the invention provides a Transformer-based, vision-fused beam blockage prediction method, enabling sudden blockages in a millimeter wave communication system to be sensed in advance. The scheme allows a user to proactively switch to another line-of-sight base station before the blockage occurs, avoiding the sudden drop in signal-to-noise ratio caused by blockage and ensuring stable communication.
As shown in figs. 1-3, the beam blockage prediction problem is modeled as a multimodal binary classification problem based on beams and images. The model consists of a target detection module, a camera selection module, an embedding module, a Transformer encoder module and a classification module. The target detection module locates the coordinates of suspected obstacles in the acquired image; the embedding module encodes the input beam sequence and target coordinate sequence into vectors of a specified dimension; the camera selection module predicts, from the input beam sequence, the index of the camera covering the user; the Transformer encoder module is an attention-based encoder; and the classification module outputs the final binary classification result.
1. Simulation environment construction
The simulated communication environment is built on an open-source scenario, the ViWi multi-user scenario "ASUDT1_28". This is an outdoor millimeter wave communication environment built with a game engine and ray-tracing software, developed using the ViWi data generation framework. The scene depicts a typical busy street with vehicles, pedestrians, trees and buildings. The cars moving in the scene represent the communication users, while large vehicles such as moving buses and trucks act as dynamic blockages during user communication. The simulation scenario includes 50 cars, 8 buses and 2 trucks in total, all moving at different speeds.
A millimeter wave base station operating at 28 GHz is deployed beside the street. The base station carries three cameras, 4.5 meters high and facing different directions: two side cameras with 75-degree fields of view and one center camera with a 110-degree field of view. The base station has a uniform linear array of N antennas and uses a predefined DFT codebook $\mathcal{F} = \{\mathbf{f}_m\}_{m=1}^{N}$, where the m-th codeword $\mathbf{f}_m$ can be expressed as

$$\mathbf{f}_m = \frac{1}{\sqrt{N}}\left[1,\; e^{j\frac{2\pi}{N}m},\; \ldots,\; e^{j\frac{2\pi}{N}m(N-1)}\right]^{T}$$
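Under this definition, the codebook can be generated with a few lines of NumPy (a sketch; the phase convention matches the expression above):

```python
import numpy as np

def dft_codebook(n_antennas=128):
    """N x N DFT codebook for a uniform linear array; column m is codeword f_m."""
    n = np.arange(n_antennas).reshape(-1, 1)  # antenna index
    m = np.arange(n_antennas).reshape(1, -1)  # codeword index
    return np.exp(1j * 2 * np.pi * n * m / n_antennas) / np.sqrt(n_antennas)

F = dft_codebook()  # shape [128, 128]
```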
User u in the simulation system communicates via OFDM with K subcarriers; the received downlink signal is

$$y_{u,k} = \mathbf{h}_{u,k}^{H}\,\mathbf{f}\, s_k + n_k$$

where $y_{u,k}$ is the received signal of user u on carrier k, $\mathbf{h}_{u,k} \in \mathbb{C}^{N}$ is the channel between the base station and user u at carrier k, and $n_k$ is random noise following the Gaussian distribution $\mathcal{CN}(0, \sigma^2)$. The channel $\mathbf{h}_{u,k}$ is expressed as

$$\mathbf{h}_{u,k} = \sum_{l=1}^{L} \alpha_l\, e^{\,j\left(\upsilon_l - \frac{2\pi k}{K}\tau_l B\right)}\, \mathbf{a}\!\left(\phi_l^{az},\, \phi_l^{el}\right)$$

where $\alpha_l$ is the attenuation coefficient of the l-th path, $\phi_l^{az}$ is the departure azimuth angle of the l-th path, $\phi_l^{el}$ is the departure elevation of the l-th path, $\upsilon_l$ is the phase of path l, $\tau_l$ is the propagation delay of path l, B is the signal bandwidth, K is the number of carriers, and $\mathbf{a}(\cdot)$ is the array response vector.
For user u, the beam at the current time is the index of the optimal codeword between the user and the base station; the optimal codeword is defined as

$$b_u[t] = \operatorname*{argmax}_{\mathbf{f}_m \in \mathcal{F}} \sum_{k=1}^{K} \log_2\!\left(1 + \frac{P_s}{K\sigma_n^2}\,\bigl|\mathbf{h}_{u,k}^{H}\mathbf{f}_m\bigr|^2\right)$$

where $\mathbf{f}_m \in \mathbb{C}^{N_m}$ is a codeword of the codebook $\mathcal{F}$, $N_m$ is the number of base station antennas, $\mathbf{h}_{u,k} \in \mathbb{C}^{N_m}$ is the downlink channel between the base station and user u, $P_s$ denotes the transmit power, $\sigma_n^2$ denotes the noise power, and k denotes the k-th carrier.
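A sketch of the beam (optimal codeword) selection implied by this definition, assuming the channel-matrix and codebook layouts shown in the comments:

```python
import numpy as np

def best_codeword(H, F, p_s=1.0, noise_power=1e-3):
    """Return the codebook index maximizing the rate summed over carriers.

    H: [K, N] downlink channels h_{u,k};  F: [N, M] codebook (column = codeword).
    """
    K = H.shape[0]
    gain = np.abs(H.conj() @ F) ** 2  # |h_{u,k}^H f_m|^2, shape [K, M]
    rate = np.log2(1 + p_s / (K * noise_power) * gain).sum(axis=0)
    return int(np.argmax(rate))       # beam index b_u[t]
```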
The simulation environment parameters are as follows:
1) The base station antennas form a uniform linear array of 128 antennas; the user side adopts a single omnidirectional antenna;
2) The base station operating frequency is 28 GHz;
3) A DFT codebook with 128 codewords is used;
4) The number of carriers K is 64.
2. Communication blocking prediction method training process based on visual information fusion
Construct $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ as the original beam input sequence, where τ is the current time slot and r is the time window length. Construct $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$, n ∈ {1, 2, 3}, as the original image input sequence, where n is the camera index. For any time slot τ, the input sequence is represented as

$$S_u[\tau] = \{(X_n[t],\, b_u[t])\}_{t=\tau-r+1}^{\tau}$$
Let $a_u[t] \in \{0,1\}$ denote the communication link state of user u in the t-th time slot, where 0 denotes line-of-sight and 1 denotes non-line-of-sight communication. The link connection state $q_u$ of user u within a future time window of length r′ is

$$q_u[\tau] = \max\{a_u[\tau+1], \ldots, a_u[\tau+r']\} \in \{0,1\}$$

where 0 indicates that user u maintains line-of-sight communication throughout the window, and 1 indicates that a link blockage occurs within the window.
Establish a function $f_\Theta(S)$ that receives the observed image-beam sequence pairs and predicts the future link state $\hat{q}_u$, where Θ denotes the parameter set of the model, learned from the labeled sequence dataset. The goal of model training can be expressed as

$$f_{\Theta}^{*} = \operatorname*{argmax}_{f_{\Theta}}\; \mathbb{P}\left(\hat{q}_u = q_u \mid S_u[\tau]\right)$$
Input the sequence $X_n$ into the YOLO detector to obtain the bounding-box coordinates of the detected objects, and convert each bounding-box coordinate into a 6-dimensional vector comprising the center coordinate $[x_{cent}, y_{cent}]$, the upper-left corner coordinate $[x_1, y_1]$ and the lower-right corner coordinate $[x_2, y_2]$. The coordinates are normalized to the interval [0,1]; together they mark the exact position of an object in the scene. Stack the converted coordinate vectors of one image into a high-dimensional vector $d_n[t] \in \mathbb{R}^{6M}$, where M denotes the number of target objects detected in the image and $t \in \{\tau-r+1, \ldots, \tau\}$. Since the scene is a dynamic communication environment, the number of detected objects in the image varies over time, so the length of $d_n[t]$ is variable. Padding with N−M zero vectors therefore yields a fixed-length vector $d_n[t] \in \mathbb{R}^{6N}$.
Input the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ into the beam embedding module to obtain the beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$. Input the beam embedding sequence into the camera selection module to determine the camera covering the user at this moment. Input the detection coordinate sequence $d_n$ of that camera into the coordinate embedding module, which outputs the corresponding detection-coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$.
The target-detection coordinate embedding sequence and the beam embedding sequence are fused and fed into the Transformer encoder module. Since the Transformer relies only on the attention mechanism and has no recurrent or convolutional structure, information with absolute position must be injected into the input sequence so that the model can exploit its order. The invention encodes the input sequence with position embedding and modal-type embedding.
The position embedding is calculated as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos \in \{0, \ldots, L_{seq}-1\}$ denotes the position of the token in the sequence, $L_{seq}$ denotes the sequence length, and $i \in [0, d_{model}/2)$ indexes the dimensions of the position embedding.
The beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$ and the target-detection coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$ are each passed through the position encoding function $F_{PE}(\cdot)$:

b = b + F_PE(b)
d = d + F_PE(d)

The modal-type embedding mainly lets the model distinguish the two modalities:

ME_b = full_like(b, 0)
ME_d = full_like(d, 1)
b = b + ME_b
d = d + ME_d

where full_like(x, n) constructs a vector with the same dimensions as x, filled with n.
Concatenate the beam sequence b and the target-detection coordinate sequence d and input them into the Transformer encoder model, which produces a feature vector $y_t$. Inputting $y_t$ to the classification module $\mathrm{NET}_o$ gives the final prediction

$$\hat{q}_u = \mathrm{NET}_o(y_t;\, \Theta_o)$$

where $\mathrm{NET}_o$ is a fully connected network whose activation is the ReLU function $g(x) = \max(0, x)$.
Compute the loss between the predicted value $\hat{q}_u$ and the label $q_u$, and update the model parameters Θ by back-propagating the gradient.
The classification model outputs the blockage prediction $\hat{q}_u$, with blockage label $q_u \in \{0,1\}$. The model also outputs the camera selection $\hat{y}_s$, with camera label $y_s \in \{(0,0,1), (0,1,0), (1,0,0)\}$.
The loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathcal{L}_{block} + \alpha\, \mathcal{L}_{cam}$$
$$\mathcal{L}_{block} = -\bigl[q_u \log \hat{q}_u + (1-q_u)\log(1-\hat{q}_u)\bigr]$$
$$\mathcal{L}_{cam} = -\sum_{j} y_{s,j} \log \hat{y}_{s,j}$$

where $\mathcal{L}_{block}$ is the blockage prediction loss, $\mathcal{L}_{cam}$ is the camera prediction loss, $\mathcal{L}$ is the total model loss, and α is the weight coefficient of the camera loss. The model parameters Θ are updated by stochastic gradient descent:

$$\Theta \leftarrow \Theta - \lambda\, \nabla_{\Theta}\, \mathcal{L}$$
The training results are as follows: after model training is complete, 2050 samples in total are input as the test set, of which 1280 are non-blocked samples (label 0) and 770 are blocked samples (label 1).
The verification results (confusion matrix, rows = true label, columns = predicted label) are as follows:

real \ predicted      0       1
0                  1187      93
1                    49     721
the prediction accuracy is as follows:
Figure BDA00036274422000001011
recall is defined as the proportion of samples correctly predicted to be blocked to all samples labeled as blocked
Figure BDA00036274422000001012
Precision is defined as the ratio of correctly predicted as blocked samples to all predicted as blocked samples
Figure BDA00036274422000001013
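These figures can be re-derived directly from the confusion matrix above (TN = 1187, FP = 93, FN = 49, TP = 721):

```python
tn, fp, fn, tp = 1187, 93, 49, 721  # from the verification table

accuracy = (tp + tn) / (tp + tn + fp + fn)  # ~0.931
recall = tp / (tp + fn)                     # ~0.936
precision = tp / (tp + fp)                  # ~0.886
print(f"acc={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```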
The ROC curves for model prediction are shown in fig. 3.
It should be noted that modifications and adaptations may occur to those skilled in the art without departing from the principles of the present invention and should be considered within the scope of the present invention.

Claims (7)

1. A millimeter wave communication link blocking prediction method based on visual information fusion is characterized by comprising the following steps:
step (1): modeling the beam blockage prediction problem as a binary classification problem based on multimodal information, wherein the model consists of a target detection module, a camera selection module, an embedding module, a Transformer module and a classification module; initializing model parameters, including the neural network weights and biases of the modules;
step (2): for user u, at each time slot τ, constructing a beam sequence of length r, $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$, and an image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$ as the training sample sequence $S_u$; simultaneously constructing a link state sequence of length r′, $\{a_u[\tau+1], \ldots, a_u[\tau+r']\}$, as the training sample label $q_u$;
step (3): inputting the image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$ into the target detection module, which outputs the obstacle detection-box coordinate sequence $\{d_n[\tau-r+1], \ldots, d_n[\tau]\}$;
step (4): inputting the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ into the beam embedding module to obtain the beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$; inputting the beam embedding sequence into the camera selection module to determine the camera covering the user at this moment; inputting the detection coordinate sequence $d_n$ of that camera into the coordinate embedding module, which outputs the corresponding detection-coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$;
step (5): fusing the target-detection coordinate embedding sequence with the beam embedding sequence, feeding the fused sequence into the Transformer module for encoding, then feeding the encoded sequence into the classification module for binary classification, predicting the link connection state $\hat{q}_u$ of user u within a future time window of length r′;
step (6): computing the loss between the predicted value $\hat{q}_u$ and the label $q_u$, and updating the model parameters Θ by back-propagating the gradient;
executing steps (2) to (6) cyclically until the algorithm converges.
2. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein: the step (1) specifically comprises the following steps:
the target detection module locates the coordinates of suspected obstacles in the acquired image; the embedding module encodes the input beam sequence and target coordinate sequence into vectors of a specified dimension; the camera selection module predicts, from the input beam sequence, the index of the camera covering the user; the Transformer module is an attention-based encoder; and the classification module outputs the final binary classification result;
the millimeter wave base station is equipped with three cameras, two of which are side cameras with a 75 degree field of view and one of which is a center camera with a 110 degree field of view.
3. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein: the step (2) specifically comprises:
(2.1) defining the input sequence: a deep learning model is developed that uses an RGB image sequence and a beam sequence to predict the blocking condition of the communication link; for any user u in the communication environment, the image sequence and beam sequence observed over r unit time intervals form a group of input sequences; for any time slot τ, the input sequence is

$$S_u[\tau] = \{(X_n[t],\, b_u[t])\}_{t=\tau-r+1}^{\tau}$$

where $X_n[t] \in \mathbb{R}^{W \times H \times C}$ denotes the RGB image captured by the n-th camera in the t-th time slot, W, H, C denote the width, height and number of color channels of the image respectively, $b_u[t]$ denotes the index in the codebook $\mathcal{F}$ of the beamforming vector used to serve user u in the t-th slot, and $r \in \mathbb{Z}^{+}$ denotes the length of the observation interval;
(2.2) defining the output variable $q_u$: let $a_u[t] \in \{0,1\}$ denote the communication link state of user u in the t-th time slot, where 0 denotes line-of-sight and 1 denotes non-line-of-sight communication; the link connection state $q_u$ of user u within a future time window of length r′ is

$$q_u[\tau] = \max\{a_u[\tau+1], \ldots, a_u[\tau+r']\} \in \{0,1\}$$

where 0 indicates that user u maintains line-of-sight communication throughout the window, and 1 indicates that a link blockage occurs within the window;
(2.3) defining the model function: a function $f_\Theta(S)$ is established that receives the observed image-beam sequence pairs and predicts the future link state $\hat{q}_u$, where Θ denotes the parameter set of the model, learned from the labeled sequence dataset; the goal of model training can be expressed as

$$f_{\Theta}^{*} = \operatorname*{argmax}_{f_{\Theta}}\; \mathbb{P}\left(\hat{q}_u = q_u \mid S_u[\tau]\right)$$
4. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein: the step (3) specifically comprises the following steps:
(3.1) obtaining the RGB image sequence $\{X_n[\tau-r+1], \ldots, X_n[\tau]\}$, $X_n[t] \in \mathbb{R}^{W \times H \times C}$;
(3.2) inputting the sequence $X_n$ into the YOLO detector to obtain the bounding-box coordinates of the detected targets;
(3.3) converting each bounding-box coordinate into a 6-dimensional vector comprising the center coordinate $[x_{cent}, y_{cent}]$, the upper-left corner coordinate $[x_1, y_1]$ and the lower-right corner coordinate $[x_2, y_2]$; the coordinates are normalized to the interval [0,1]; together they mark the exact position of an object in the scene;
(3.4) stacking the converted coordinate vectors of one image into a high-dimensional vector $d_n[t] \in \mathbb{R}^{6M}$, where M denotes the number of target objects detected in the image and $t \in \{\tau-r+1, \ldots, \tau\}$; since the scene is a dynamic communication environment, the number of detected objects in the image varies over time, so the length of $d_n[t]$ is variable; padding with N−M zero vectors therefore yields a fixed-length vector $d_n[t] \in \mathbb{R}^{6N}$;
(3.5) the module finally outputs the detection-box coordinate sequence $\{d_n[\tau-r+1], \ldots, d_n[\tau]\}$.
5. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein: the step (4) specifically comprises the following steps:
(4.1) obtaining the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ of user u, where each beam is the index of the optimal codeword in the codebook $\mathcal{F}$ serving the user; the optimal codeword is defined as

$$b_u[t] = \operatorname*{argmax}_{\mathbf{f}_m \in \mathcal{F}} \sum_{k=1}^{K} \log_2\!\left(1 + \frac{P_s}{K\sigma_n^2}\,\bigl|\mathbf{h}_{u,k}^{H}\mathbf{f}_m\bigr|^2\right)$$

where $\mathbf{f}_m \in \mathbb{C}^{N_m}$ is a codeword of the codebook $\mathcal{F}$, $N_m$ is the number of base station antennas, $\mathbf{h}_{u,k} \in \mathbb{C}^{N_m}$ is the downlink channel between the base station and the user, $P_s$ denotes the transmit power, $\sigma_n^2$ denotes the noise power, and k denotes the k-th carrier;
(4.2) the beam sequence $\{b_u[\tau-r+1], \ldots, b_u[\tau]\}$ is input to the beam embedding module; since the algorithm receives data of two modalities whose dimensions differ, both must be converted into vectors of the same dimension by the embedding module;
for the beam sequence, a lookup table of size $|\mathcal{F}|$ is generated; given an input beam codeword index $b_u[t]$, the embedding layer returns the embedding vector corresponding to that index, $b[t] \in \mathbb{R}^{d_{model}}$, where $d_{model}$ is the defined feature vector dimension;
(4.3) the beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$ is input into the camera selection model $\mathrm{NET}_s$, which outputs the feature vector $\hat{y}_s$; the camera selection module comprises $L_s$ fully connected layers, and the model is expressed as

$$\hat{y}_s = \mathrm{NET}_s\bigl(\{b[\tau-r+1], \ldots, b[\tau]\};\, \Theta_s\bigr)$$

where $\Theta_s = \{W_s, b_s\}$ denotes the weights and biases of the fully connected layers, and the nonlinear function applied in each layer is the ReLU activation

$$g(x) = \mathrm{ReLU}(x) = \max(0, x)$$
6. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein: the step (5) specifically comprises:
(5.1) since the Transformer relies only on the attention mechanism and has no recurrent or convolutional structure, information with absolute position must be injected into the input sequence so that the model can exploit its order; the input sequence is encoded with position embedding and modal-type embedding;
the position embedding is calculated as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos \in \{0, \ldots, L_{seq}-1\}$ denotes the position of the token in the sequence, $L_{seq}$ denotes the sequence length, and $i \in [0, d_{model}/2)$ indexes the dimensions of the position embedding;
the beam embedding sequence $\{b[\tau-r+1], \ldots, b[\tau]\}$ and the target-detection coordinate embedding sequence $\{d[\tau-r+1], \ldots, d[\tau]\}$ are each passed through the position encoding function $F_{PE}(\cdot)$:

b = b + F_PE(b)
d = d + F_PE(d)

the modal-type embedding mainly lets the model distinguish the two modalities:

ME_b = full_like(b, 0)
ME_d = full_like(d, 1)
b = b + ME_b
d = d + ME_d

where full_like(x, n) constructs a vector with the same dimensions as x, filled with n;
(5.2) concatenating the beam sequence b and the target-detection coordinate sequence d and inputting them into the Transformer model; the Transformer encoder is a stack of L multi-head attention layers and feedforward neural network layers; the multi-head attention mechanism is computed as follows:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where the input of the MultiHead function is

$$Q = K = V = \{b[\tau-r+1], \ldots, b[\tau],\, d[\tau-r+1], \ldots, d[\tau]\}$$

(5.3) the Transformer encoder model produces a feature vector $y_t$; inputting $y_t$ to the classification module $\mathrm{NET}_o$ gives the final prediction

$$\hat{q}_u = \mathrm{NET}_o(y_t;\, \Theta_o)$$

where $\mathrm{NET}_o$ is a fully connected network whose activation is the ReLU function $g(x) = \max(0, x)$.
7. The millimeter wave communication link blocking prediction method based on visual information fusion of claim 1, wherein in step (6), the classification model outputs the blockage prediction $\hat{q}_u$, with blockage label $q_u \in \{0,1\}$; in step (4), the model outputs the camera selection $\hat{y}_s$, with camera label $y_s \in \{(0,0,1), (0,1,0), (1,0,0)\}$;
the loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathcal{L}_{block} + \alpha\, \mathcal{L}_{cam}$$
$$\mathcal{L}_{block} = -\bigl[q_u \log \hat{q}_u + (1-q_u)\log(1-\hat{q}_u)\bigr]$$
$$\mathcal{L}_{cam} = -\sum_{j} y_{s,j} \log \hat{y}_{s,j}$$

where $\mathcal{L}_{block}$ is the blockage prediction loss, $\mathcal{L}_{cam}$ is the camera prediction loss, $\mathcal{L}$ is the total model loss, and α is the weight coefficient of the camera loss; the model parameters Θ are updated by stochastic gradient descent:

$$\Theta \leftarrow \Theta - \lambda\, \nabla_{\Theta}\, \mathcal{L}$$

where λ is the learning rate; steps (2) to (6) are executed cyclically until the algorithm converges.
CN202210480580.3A 2022-05-05 2022-05-05 Millimeter wave communication link blocking prediction method based on visual information fusion Pending CN114845332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210480580.3A CN114845332A (en) 2022-05-05 2022-05-05 Millimeter wave communication link blocking prediction method based on visual information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210480580.3A CN114845332A (en) 2022-05-05 2022-05-05 Millimeter wave communication link blocking prediction method based on visual information fusion

Publications (1)

Publication Number Publication Date
CN114845332A true CN114845332A (en) 2022-08-02

Family

ID=82568278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210480580.3A Pending CN114845332A (en) 2022-05-05 2022-05-05 Millimeter wave communication link blocking prediction method based on visual information fusion

Country Status (1)

Country Link
CN (1) CN114845332A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113595608A (en) * 2021-06-23 2021-11-02 清华大学 Millimeter wave/terahertz communication method, device and system based on visual perception
CN113709701A (en) * 2021-08-27 2021-11-26 西安电子科技大学 Millimeter wave vehicle networking combined beam distribution and relay selection method
WO2021255640A1 (en) * 2020-06-16 2021-12-23 King Abdullah University Of Science And Technology Deep-learning-based computer vision method and system for beam forming

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021255640A1 (en) * 2020-06-16 2021-12-23 King Abdullah University Of Science And Technology Deep-learning-based computer vision method and system for beam forming
CN113595608A (en) * 2021-06-23 2021-11-02 清华大学 Millimeter wave/terahertz communication method, device and system based on visual perception
CN113709701A (en) * 2021-08-27 2021-11-26 西安电子科技大学 Millimeter wave vehicle networking combined beam distribution and relay selection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Wenyan; Qi Chenhao: "Deep learning based beam selection method for millimeter wave communication in uplink transmission", Journal of Hefei University of Technology (Natural Science), no. 12, 28 December 2019 (2019-12-28) *

Similar Documents

Publication Publication Date Title
Charan et al. Vision-aided 6G wireless communications: Blockage prediction and proactive handoff
US11940803B2 (en) Method, apparatus and computer storage medium for training trajectory planning model
CN109961019A (en) A kind of time-space behavior detection method
JP2020119501A (en) Learning method and learning device for improving segmentation performance to be used for detecting events including pedestrian event, automobile event, falling event, and fallen event by using edge loss, and test method and test device using the same
CN111461251A (en) Indoor positioning method of WiFi fingerprint based on random forest and self-encoder
WO2021183993A1 (en) Vision-aided wireless communication systems
WO2021255640A1 (en) Deep-learning-based computer vision method and system for beam forming
CN114266938A (en) Scene recognition method based on multi-mode information and global attention mechanism
Comiter et al. Localization convolutional neural networks using angle of arrival images
Yang et al. Environment semantics aided wireless communications: A case study of mmWave beam prediction and blockage prediction
CN114844545A (en) Communication beam selection method based on sub6GHz channel and partial millimeter wave pilot frequency
CN116580322A (en) Unmanned aerial vehicle infrared small target detection method under ground background
US20230362039A1 (en) Neural network-based channel estimation method and communication apparatus
CN114845332A (en) Millimeter wave communication link blocking prediction method based on visual information fusion
CN116503723A (en) Dense multi-scale target detection method in low-visibility environment
Wu et al. Enhanced path loss model by image-based environmental characterization
Lin et al. Multi-camera view based proactive bs selection and beam switching for v2x
CN113030853B (en) RSS and AOA combined measurement-based multi-radiation source passive positioning method
CN115426671A (en) Method, system and equipment for graph neural network training and wireless cell fault prediction
NL2026432B1 (en) Multi-source target tracking method for complex scenes
CN112765892A (en) Intelligent switching judgment method in heterogeneous Internet of vehicles
Yapar et al. The First Pathloss Radio Map Prediction Challenge
Lin et al. Multi-Camera Views Based Beam Searching and BS Selection with Reduced Training Overhead
Neema et al. User spatial localization for vision aided beam tracking based millimeter wave systems using convolutional neural networks
Xiang et al. Computer Vision Aided Beamforming Fused with Limited Feedback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination