CN117278154A - Spectrum prediction method based on attention mechanism - Google Patents
- Publication number: CN117278154A
- Application number: CN202311379839.6A
- Authority: CN
- Country: China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B17/00—Monitoring; Testing
- H04B17/30—Monitoring; Testing of propagation channels
- H04B17/382—Monitoring; Testing of propagation channels for resource allocation, admission control or handover
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The spectrum prediction method based on the attention mechanism addresses the problem of predicting the spectrum occupancy state at as many future times as possible with higher accuracy, and belongs to the field of spectrum prediction. The invention comprises the following steps: a gated recursive unit is embedded into each sub-module of an attention-based Transformer model to obtain a spectrum prediction network; the gated recursive unit extracts the local correlation of the spectrum occupancy information and outputs a pre-extraction result that carries position encoding. The input data of the training set comprise an input sequence and an output sequence, where the input sequence is the historical spectrum information sequence and the output sequence is the future spectrum information sequence shifted one position to the right; the output data are the future spectrum information. After the spectrum prediction network has been trained on the training set, the current spectrum information is taken as the input sequence, the spectrum information at the last moment of the input sequence is taken as the first input of the decoding sub-module's output sequence, and spectrum prediction is performed autoregressively with the spectrum prediction network.
Description
Technical Field
The invention relates to a spectrum prediction method based on an attention mechanism, and belongs to the field of spectrum prediction.
Background
Spectrum prediction technology complements cognitive radio: by mining and analyzing the correlations within historical spectrum data, it predicts the spectrum occupancy of future time slots, so that spectrum sensing only needs to scan the frequency bands predicted to be idle. This greatly reduces the energy and time spent on sensing, and lets spectrum decisions be made accurately and efficiently in a shorter time. In spectrum sharing, which follows spectrum prediction, a secondary user (SU) can prepare an appropriate sharing policy in advance according to its own service requirements, compensating for the time needed to respond. Likewise, proactive handover based on spectrum prediction lets the SU judge the channel occupancy state of future time slots from the prediction results and decide in advance whether spectrum handover is needed at one or more future instants, reducing the probability of collision between the SU and a primary user (PU).
At present, although the development of neural networks has greatly advanced spectrum prediction technology, the widely applied LSTM and its variant structures still handle long-term dependence on the input sequence poorly, and the capacity of the sequence-transduction (Seq-to-Seq) model to store and convey correlation information is limited by the length of its intermediate vector. In a practical environment, spectrum resources are allocated across multiple channels, and the final occupancy is shaped by both the allocation strategy and user behavior, so correlation exists not only in time but also between channels. In addition, if only single-step prediction is performed, each round scans the channel occupancy of just one future time slot, so prediction and sensing must be repeated frequently; during spectrum handover, short-step prediction forces the user to switch channels often, and the SU must repeatedly make decisions according to its own service requirements. This mode is inefficient and of limited practical use.
Disclosure of Invention
Aiming at the problem of predicting the spectrum occupancy state at as many future times as possible with higher accuracy, the invention provides a spectrum prediction method based on an attention mechanism.
The invention discloses a spectrum prediction method based on an attention mechanism, which comprises the following steps:
S1, establishing a spectrum prediction network: the spectrum prediction network is formed by embedding a gated recursive unit into each sub-module of an attention-based Transformer model, with the length of the gated recursive unit equal to the length of the input sequence; in the encoding and decoding sub-modules, the input sequence first enters the gated recursive unit, which performs local correlation extraction on the spectrum occupancy information and outputs a pre-extraction result that carries position encoding;
S2, taking spectrum occupancy state data sorted in descending order of channel priority as the training set: the input data in the training set comprise an input sequence and an output sequence, where the input sequence is the historical spectrum information sequence and the output sequence is the future spectrum information sequence shifted one position to the right; the output data in the training set are the future spectrum information;
S3, training the spectrum prediction network with the training set;
S4, prediction: take the current spectrum information as the input sequence, take the spectrum information at the last moment of the input sequence as the first input of the decoding sub-module's output sequence, and perform spectrum prediction autoregressively with the trained spectrum prediction network.
Preferably, the attention mechanism in the encoding sub-module is a multi-head attention mechanism, and the two attention mechanisms in the decoding sub-module are a cross multi-head attention mechanism and a masked multi-head attention mechanism. Global correlation extraction is performed by the attention mechanisms on top of the local information extracted by the gated recursive unit: the multi-head attention mechanism in the encoding sub-module extracts the global correlation among historical information, the masked multi-head attention mechanism in the decoding sub-module extracts the correlation of future-slot spectrum occupancy, and the cross multi-head attention mechanism extracts the correlation between historical spectrum information and future spectrum information.
Preferably, the nth encoding sub-module is:

g^n = LSTM(h^{n-1})
a^n = LayerNorm(g^n + MultiHead(g^n, g^n, g^n))
h^n = LayerNorm(a^n + FFN(a^n))

where g_t^n is the output at time t of the gated recursive unit in the nth encoding sub-module and h_t^n is the output at time t of the nth encoding sub-module; LSTM() denotes the gated recursive unit, LayerNorm() denotes layer normalization, MultiHead() denotes multi-head attention using the scaled dot-product attention scoring function, and FFN() denotes the feed-forward transformation.
Preferably, the nth decoding sub-module is:

u^n = LSTM(s^{n-1})
b^n = LayerNorm(u^n + MaskMultiHead(u^n, u^n, u^n))
c^n = LayerNorm(b^n + MultiHead(b^n, h^N, h^N))
s^n = LayerNorm(c^n + FFN(c^n))

where u_t^n is the output at time t of the gated recursive unit in the nth decoding sub-module, s_t^n is the output at time t of the nth decoding sub-module, and h^N is the output of the last encoding sub-module; MaskMultiHead() denotes the masked multi-head attention mechanism, which also uses the scaled dot-product attention scoring function.
Preferably, the normalization in the Transformer model is layer normalization.
Preferably, a dropout mechanism is added in the training process of the spectrum prediction network.
Preferably, the forward propagation formulas of the gated recursive unit are:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes the Hadamard product; the gating units i_t, f_t and o_t are the outputs of the input gate, the forget gate and the output gate respectively; c_t is the output of the memory cell at the current moment, c̃_t is the candidate memory, h_{t-1} is the hidden state at the previous moment, and x_t is the input at the current moment. W_xi and W_hi are the weight matrices applied to the input sequence and the hidden state in the input gate during feature extraction, W_xf and W_hf are the corresponding matrices in the forget gate, W_xo and W_ho those in the output gate, and W_xc and W_hc those of the memory cell; b_i, b_f, b_o and b_c are the bias terms of the gates, and σ and tanh are activation functions.
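As an illustration (not part of the patent text), one forward step of such a gated recursive unit can be sketched in NumPy. The stacked parameter layout W, U, b is an assumption made for compactness:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of the gated recursive unit.

    W (4d x d_in) maps the current input x_t, U (4d x d) maps the
    previous hidden state h_{t-1}; the four stacked blocks correspond
    to the input gate, forget gate, output gate and candidate memory.
    """
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # stacked pre-activations
    i = sigmoid(z[0:d])                 # input gate i_t
    f = sigmoid(z[d:2 * d])             # forget gate f_t
    o = sigmoid(z[2 * d:3 * d])         # output gate o_t
    c_tilde = np.tanh(z[3 * d:4 * d])   # candidate memory
    c = f * c_prev + i * c_tilde        # Hadamard products
    h = o * np.tanh(c)                  # new hidden state
    return h, c
```

Running this step over a whole input sequence yields the position-encoded pre-extraction result that the sub-modules feed to the attention layers.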
Preferably, the feed-forward neural network in the Transformer model is:

FFN(x) = w_2 relu(w_1 x + b_1) + b_2

where FFN(x) is the output of the feed-forward neural network, x is the input, w_1 and w_2 are weight matrices, b_1 and b_2 are bias terms, and relu() is the activation function.
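A minimal NumPy sketch of this formula (the hidden width, i.e. the number of columns of w_1, is a free choice not fixed by the text):

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = w2 · relu(w1 · x + b1) + b2, applied position-wise."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # relu activation
    return hidden @ w2 + b2
```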
The invention fuses a recursive structural unit into each sub-module of the attention-based Transformer model. The superior local-correlation processing capacity of the recursive structure and its position-encoded output complement the Transformer model's strengths of parallel data processing and efficient global information extraction. This overcomes both the long-term dependence problem of LSTM and the Transformer model's tendency to overfit, and in particular achieves high-accuracy spectrum prediction in the multi-channel, multi-step setting that is closest to the actual environment.
Drawings
FIG. 1 is a schematic diagram of the prediction mode of the model for multi-channel multi-step prediction;
FIG. 2 is the overall block diagram of the LSTM-Transformer model;
FIG. 3 is the block diagram of the gated recursive unit in the LSTM-Transformer model;
FIG. 4 is a schematic diagram of the algorithmic implementation of the attention mechanism;
FIG. 5 is a structural diagram of the multi-head attention mechanism obtained by modifying the attention mechanism for parallel operation;
FIG. 6 is a graph of multi-channel multi-step prediction showing the model's superiority, with the abscissa representing the prediction step size and the ordinate the accuracy.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
Description of the problem:
In an actual radio environment, both the allocation policy of spectrum resources and the frequency-usage behavior of users make the occupancy states of the channels interdependent. Therefore, an M/G/5 queuing-theory model close to the real environment is adopted here. The arrival process is assumed to obey a Poisson distribution with parameter λ, i.e. the number x of users arriving in a period t obeys

P(x = k) = ((λt)^k / k!) e^{-λt}, k = 0, 1, 2, …

where λ is the average number of users arriving per unit time and x is the number of users that actually arrive. The time interval between two adjacent arrivals then follows an exponential distribution with parameter λ, with probability density

f(t) = λ e^{-λt}, t > 0

so that 1/λ is the average interval between two adjacent arrivals. The service time is assumed to obey a general service-time distribution with parameter μ, with probability

P(x = k) = μ(1 − μ)^{k−1}, k = 1, 2, …, N

where μ is the probability that a user is served in a unit time, i.e. the average number of users served per unit time; 1/μ is the average time a user occupies a channel, and x is the time the user actually occupies the channel.
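As a sketch of the data-generation process described above (the function name, parameter values and slot-based discretization are illustrative assumptions, not taken from the patent), occupancy data with Poisson arrivals, geometric holding times and priority allocation can be produced as follows:

```python
import numpy as np

def generate_occupancy(T=1000, n_ch=5, lam=1.2, mu=0.4, seed=0):
    """Poisson(lam) arrivals per slot, geometric(mu) channel-holding
    times, and a priority rule that fills lower-index (higher-priority)
    channels first."""
    rng = np.random.default_rng(seed)
    remaining = np.zeros(n_ch, dtype=int)      # slots left on each channel
    occ = np.zeros((T, n_ch), dtype=np.int8)   # 1 = occupied, 0 = idle
    for t in range(T):
        for _ in range(rng.poisson(lam)):      # users arriving this slot
            free = np.flatnonzero(remaining == 0)
            if free.size:                      # highest-priority free channel
                remaining[free[0]] = rng.geometric(mu)
        occ[t] = (remaining > 0).astype(np.int8)
        remaining = np.maximum(remaining - 1, 0)
    return occ
```

The resulting 0/1 matrix plays the role of the data set for model training and testing.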
On this basis, a hypothetical priority-allocation rule is added to the queuing model: the priorities of the 5 channels decrease in order. The spectrum occupancy state data generated in this way serve as the data set for model training and testing, simulating the interdependence among channels in an actual environment. The occupancy states of the 5 channels at the previous n historical moments serve as the referenceable history, where 0 indicates that a channel is unoccupied and 1 that it is occupied. The prediction mode is shown in fig. 1, where the first box represents the historical information referenced by the model and the second box represents the information at the future times to be predicted. The objective of this embodiment is, after analyzing the temporal and inter-channel correlations of a certain number of historical spectrum occupancy states, to predict the occupancy state at as many future times as possible with higher accuracy. However, when the mature LSTM-based Seq-to-Seq model is applied in the field of spectrum prediction, its expression of the correlation among historical information is constrained by the length of the intermediate vector and by how that vector is used in the decoding sub-module. The attention-based Transformer model excels in natural language processing, but when transferred to the spectrum prediction field it easily overfits and loses relative position information during the attention computation, so it cannot predict multi-step future spectrum occupancy with high precision.
To address the above problems, this embodiment constructs an LSTM-Transformer model that combines a temporal recursive structure with the attention mechanism: a gated recursive unit whose length equals that of the historical information is inserted into each sub-module of the Transformer; the gated recursive unit performs local correlation extraction on the spectrum occupancy information and outputs a pre-extraction result that carries position encoding. The Transformer model's capacity for efficient parallel correlation computation and global correlation extraction then complements this over the long term. The model both remedies the deficiency of LSTM in handling long-term dependence and relieves the overfitting problem of the Transformer model, greatly improving the accuracy of spectrum prediction. The spectrum prediction method based on the attention mechanism of this embodiment specifically comprises the following steps:
step 1, establishing a spectrum prediction network: the LSTM-transducer model is shown in FIG. 2. The spectrum prediction network is characterized in that a gating recursion unit is embedded in each sub-module of a transducer model based on an attention mechanism, the length of the gating recursion unit is equal to the length of an input sequence, the chunk_size and pre_step_size parameters of the model are modified according to the required window length and the prediction step length, in a coding sub-module and a decoding sub-module, the input sequence is firstly input into the gating recursion unit, and the gating recursion unit performs local correlation extraction on spectrum occupation information and outputs an information pre-extraction result with position codes;
Step 2, taking spectrum occupancy state data sorted in descending order of channel priority as the training set: the input data in the training set comprise an input sequence and an output sequence, where the input sequence is the historical spectrum information sequence and the output sequence is the future spectrum information sequence shifted one position to the right; the output data in the training set are the future spectrum information;
Step 3, training the spectrum prediction network with the training set;
Step 4, prediction: take the current spectrum information as the input sequence, take the spectrum information at the last moment of the input sequence as the first input of the decoding sub-module's output sequence, and perform spectrum prediction autoregressively with the trained spectrum prediction network. The output of the decoding sub-module is then the predicted spectrum occupancy for the next pre_step_size time slots.
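The autoregressive loop of step 4 can be sketched as follows; `model(src, tgt)` is a hypothetical callable standing in for the trained network, assumed to return one predicted slot per decoder position:

```python
import numpy as np

def predict_autoregressive(model, history, pre_step_size):
    """The last observed slot seeds the decoder, and each freshly
    predicted slot is appended to the decoder input."""
    decoder_in = [history[-1]]                 # spectrum info at last moment
    for _ in range(pre_step_size):
        out = model(np.asarray(history), np.asarray(decoder_in))
        decoder_in.append(out[-1])             # feed newest prediction back
    return np.asarray(decoder_in[1:])          # pre_step_size predicted slots
```

With a trained network, one call to this loop yields the occupancy of the next pre_step_size slots at once.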
The spectrum prediction network of this embodiment maintains its accuracy in multi-step prediction. The basic idea is to fuse the gated recursive unit completely with the attention-based Transformer architecture, so that the complementary advantages of the two structures offset each other's deficiencies. Conventional multi-channel spectrum prediction often adopts a Seq-to-Seq structure based on LSTM and LSTM variants; the prediction accuracy of that approach is not high enough, and it drops sharply as the prediction step size grows, which conflicts with the multi-step prediction requirement of an actual radio environment. Therefore, to obtain the spectrum occupancy of several future time slots at once while keeping the prediction accuracy sufficiently high, this embodiment first uses the gated recursive unit's superior local information extraction and its position-encoded output to pre-extract the correlation between known and unknown information, enriching the input of the Transformer. The Transformer model's efficient global information extraction then realizes high-accuracy multi-channel multi-step prediction, improving the utilization efficiency of spectrum resources. The prediction accuracy obtained by this model is higher than that of traditional models and remains high over predictions dozens of steps into the future, greatly improving the working efficiency of the subsequent cognitive radio and, in turn, the utilization of spectrum resources.
In this embodiment, because the gated recursive unit outputs information from which the local temporal correlation has already been extracted, there is no need to worry about whether correlations at a distance can still be extracted. Its output also carries position-encoding information, so no extra position encoding needs to be added when it is used as the input of the Transformer model. The correlation pre-extraction is equivalent to enriching the data set of the Transformer model, reducing the chance of overfitting, while the Transformer layers supplement the correlation information learned by the gated recursive unit layer and extract the long-term dependence information. The model also follows the encoding sub-module/decoding sub-module structure; the nth encoding sub-module is:

g^n = LSTM(h^{n-1})
a^n = LayerNorm(g^n + MultiHead(g^n, g^n, g^n))
h^n = LayerNorm(a^n + FFN(a^n))

where g_t^n is the output at time t of the gated recursive unit in the nth encoding sub-module and h_t^n is the output at time t of the nth encoding sub-module; LSTM() denotes the gated recursive unit, LayerNorm() denotes layer normalization, MultiHead() denotes multi-head attention using the scaled dot-product attention scoring function, and FFN() denotes the feed-forward transformation;
the nth decoding sub-module is:

u^n = LSTM(s^{n-1})
b^n = LayerNorm(u^n + MaskMultiHead(u^n, u^n, u^n))
c^n = LayerNorm(b^n + MultiHead(b^n, h^N, h^N))
s^n = LayerNorm(c^n + FFN(c^n))

where u_t^n is the output at time t of the gated recursive unit in the nth decoding sub-module, s_t^n is the output at time t of the nth decoding sub-module, and h^N is the output of the last encoding sub-module; MaskMultiHead() denotes the masked multi-head attention mechanism, which also uses the scaled dot-product attention scoring function. The form of the attention mechanism adopted in the encoding and decoding sub-modules is the same as in the Transformer model.
The structure of the gated recursive unit in the model is shown in FIG. 3. At each moment there are three inputs: the hidden state h_{t-1} at the previous moment, the memory cell c_{t-1}, and the input x_t at the current moment. There are three gates: the forget gate, the input gate and the output gate. Through a sigmoid function and an element-wise product they determine how much information of the previous memory cell c_{t-1} is retained, how much of the input x_t and the previous hidden state h_{t-1} is added to the current memory cell c_t, and how much of c_t is output or becomes the next hidden state h_t. The generic form of a gate is:
g(x)=σ(ωx+b)
The sigmoid function maps the obtained real value into (0, 1) to represent how much of the information from the previous moment is preserved or discarded: if g(x) is close to 0, no information passes; if it is close to 1, all information passes.
The forward propagation formulas of the gated recursive unit are:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes the Hadamard product; the gating units i_t, f_t and o_t correspond to the input gate, the forget gate and the output gate; c_t denotes the memory cell at the current moment, h_{t-1} the hidden state at the previous moment, and x_t the input at the current moment. W_xi and W_hi are the weight matrices applied to the input sequence and the hidden state in the input gate during feature extraction, and the other weight matrices take the analogous form for the forget gate, output gate and memory cell; b_i, b_f, b_o and b_c are the bias terms of the gates. σ usually takes the sigmoid function, keeping each gate between 0 and 1 to describe how much information passes; the cell activation (tanh above) may also be taken as relu, chosen according to the practical situation.
The attention mechanism in the encoding sub-module of this embodiment is a multi-head attention mechanism, and the two attention mechanisms in the decoding sub-module are a cross multi-head attention mechanism and a masked multi-head attention mechanism. Global correlation extraction is performed by the attention mechanisms on top of the local information extracted by the gated recursive unit: the multi-head attention mechanism in the encoding sub-module extracts the global correlation between historical information, the masked multi-head attention mechanism in the decoding sub-module extracts the correlation of future-slot spectrum occupancy, and the cross multi-head attention mechanism extracts the correlation between historical spectrum information and future spectrum information.
The attention mechanism is the most important part for capturing correlations within sequences, and its structure is shown in FIG. 4. It can be described as a correlation computation between a query and a set of keys that yields attention score values, i.e. attention weights, with which the values are combined in a weighted sum. The attention weights are computed by an attention scoring function. The output is:

f(q, (k_1, v_1), …, (k_m, v_m)) = Σ_{i=1}^{m} softmax(a(q, k_i)) v_i

where a is the attention scoring function; after softmax, the values it produces are converted into attention weights that sum to 1.
There are generally two kinds of attention scoring functions: the additive attention mechanism and the dot-product attention mechanism. Additive attention can effectively summarize the important information of a sequence in linear complexity, and is usually chosen when the query and key vectors have different lengths. Because matrix multiplication has many efficient implementations, dot-product attention is computationally more efficient and more widely used, but requires the query and key vectors to have the same length d. Assuming all components of the query and key are independent random variables with mean 0 and variance 1, the dot product of the vectors has mean 0 and variance d, the vector dimension. To make the variance independent of the dimension, the dot product is divided by √d, giving an attention score with mean 0 and variance 1 that is not constrained by the vector dimension. The scaled dot-product attention scoring function is:

a(q, k) = q · k / √d

or, in matrix form over queries Q, keys K and values V:

Attention(Q, K, V) = softmax(QK^T / √d) V
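A minimal NumPy sketch of scaled dot-product attention, returning both the output and the weight matrix:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights
```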
To improve parallelism, scaled dot-product attention is modified into the multi-head attention mechanism, whose structure is shown in FIG. 5. The query, key and value are first linearly transformed and split into several parts of equal dimension; scaled dot-product attention is computed on each part separately, and the results are concatenated and linearly transformed again:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
The decoding sub-module also uses a masked multi-head attention layer. This is because during training the entire right-shifted output sequence is fed to the decoding sub-module at once, whereas in the actual prediction process, when the ith vector is predicted, the vectors after i are unknown. The correlations with the vectors after i, i.e. their attention weight values, must therefore be masked to avoid "cheating" behavior of the model during prediction.
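This masking is commonly realized as an additive mask on the score matrix before softmax; a sketch:

```python
import numpy as np

def causal_mask(L):
    """Position i may attend only to positions <= i; masked scores
    become -inf so their softmax weight is exactly 0."""
    return np.triu(np.full((L, L), -np.inf), k=1)

def masked_attention_weights(scores):
    z = scores + causal_mask(scores.shape[0])
    z = z - z.max(axis=-1, keepdims=True)      # row max is finite (diagonal)
    e = np.exp(z)                              # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)
```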
The feed-forward neural network part is computed as:

FFN(x) = w_2 relu(w_1 x + b_1) + b_2

The FFN layer contains only one hidden layer, with relu selected as the activation function.
In the transform block, a residual connection is used, i.e. the output result of the current layer is added to the value input to the layer, so that the effect obtained by the network with a deeper layer is ensured not to be poorer than the effect obtained by the network with a shallower layer.
The attention mechanism appears in three places in the Transformer block. The attention mechanisms of the encoding sub-modules all take the form of multi-head attention, with all queries, keys and values derived from the output of the previous layer of the encoding sub-module. The decoding sub-module uses two attention mechanisms: cross multi-head attention and masked multi-head attention. The query of the cross multi-head attention comes from the output of the previous layer of the decoding sub-module, while its key and value come from the output of the encoding sub-module. The query, key and value of the masked multi-head attention all come from the output of the previous layer of the decoding sub-module.
The normalization in the Transformer model of this embodiment is layer normalization (Layer Normalization, LN), which differs from the batch normalization (Batch Normalization, BN) commonly used in CNNs. BN normalizes the inputs of each neuron in a layer across all samples of a batch, and is therefore limited by the batch size: when batch_size is small, only a small amount of data is normalized and the result cannot reflect the overall statistics. In addition, for time series the net input distribution of a neuron changes dynamically through the network, so batch normalization cannot be used. LN instead normalizes all neurons of each layer for each sample separately, without any limitation from the sequence length in each batch, and is thus better suited to time-series structures.
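The difference between the two normalizations reduces to which axis the statistics are taken over; a minimal sketch (without the learnable gain and bias, which are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize across the feature dimension of each sample/time step
    # independently, so the statistics do not depend on the batch size.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # BN: normalize each feature across the batch dimension — unreliable when
    # the batch is small, which is why LN is preferred for time series.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```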
In this embodiment, a dropout mechanism is added during the training of the spectrum prediction network. The Transformer part is prone to overfitting, and dropout also addresses the co-adaptation problem among network nodes. Because different nodes have different representational capacities, nodes with stronger capacity are continuously strengthened as training proceeds, while weaker nodes are continuously weakened until they become negligible. This is equivalent to training only part of the network, wasting its depth and width and limiting the training effect. Dropout randomly discards some neurons with a certain probability during model training; in other words, each training pass trains a different subset of neurons, and since any two neurons are not necessarily retained in the same pass, the weight and bias updates in the network do not depend on fixed co-occurrences. This mechanism breaks the co-adaptation among neurons and makes the learned network more robust.
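A sketch of the standard "inverted dropout" formulation (an illustrative example; the embodiment does not specify which variant it uses):

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    # Inverted dropout: randomly zero activations with probability p during
    # training and rescale the survivors by 1/(1-p) so the expected value of
    # each activation is unchanged; at inference time it is the identity.
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```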
In this embodiment, all the modules are fused to construct an LSTM-Transformer model. Spectrum occupancy information of 5 channels generated by an M/G/5 model is used as the data set, divided into training, validation and test sets in the ratio 6:2:2. The LSTM-based Seq-to-Seq model, the Transformer model, and the LSTM-Transformer model are trained separately. All three models use the widely adopted Adam optimizer with an initial learning rate of 10^-3; the loss function is the mean squared error, and the prediction accuracy is selected as the evaluation index. In the Seq-to-Seq model, the number of hidden units is set to 200 and the learning-rate decay of the optimizer to 10^-6, with 100 training epochs. In the Transformer model, the number of hidden-layer units is set to 200, the batch size to 128, and the learning-rate decay to 5×10^-6, with 200 training epochs. For the LSTM-Transformer model, the number of hidden-layer units is set to 256, the batch size to 64, the learning-rate decay is the same as for the Transformer model, and 200 training epochs are used. In each model, the window length with the highest prediction accuracy is selected; in our model this is 30, i.e. 30 steps of history and the correlation information among channels are referenced to predict the channel occupancy over the next 30 time slots.
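The data split and the LSTM-Transformer hyperparameters described above can be collected in a small sketch (the values are taken from this embodiment; the split helper itself is illustrative):

```python
def split_dataset(data, ratios=(0.6, 0.2, 0.2)):
    # Chronological 6:2:2 split into training, validation and test sets.
    n = len(data)
    i = int(round(n * ratios[0]))
    j = i + int(round(n * ratios[1]))
    return data[:i], data[i:j], data[j:]

# Hyperparameters of the LSTM-Transformer model as stated in the embodiment.
config = {
    "optimizer": "Adam",
    "initial_lr": 1e-3,
    "lr_decay": 5e-6,
    "loss": "MSE",
    "hidden_units": 256,
    "batch_size": 64,
    "epochs": 200,
    "window_length": 30,  # 30 history steps -> 30 future time slots
}
```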
The multi-channel prediction accuracy of the three models on the validation set is compared in FIG. 6. The LSTM-Transformer model constructed in this embodiment shows excellent performance in multi-channel multi-step spectrum prediction: even for long prediction horizons the accuracy remains at about 98%, which meets the accuracy requirements of a practical radio environment and supports the efficient performance of the subsequent cognitive-radio operations (spectrum sensing, spectrum decision, spectrum sharing and spectrum switching).
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.
Claims (10)
1. A method of spectrum prediction based on an attention mechanism, the method comprising:
S1, establishing a spectrum prediction network, wherein the spectrum prediction network is formed by embedding a gating recursion unit in each submodule of an attention-based Transformer model, the length of the gating recursion unit being equal to the length of the input sequence; in the coding submodule and the decoding submodule, the input sequence first enters the gating recursion unit, which performs local correlation extraction on the spectrum occupancy information and outputs an information pre-extraction result carrying position coding;
S2, taking spectrum occupancy state data sorted in descending order of channel priority as a training set, wherein the input data in the training set comprise an input sequence and an output sequence, the input sequence being a sequence of historical spectrum information and the output sequence being the future spectrum information sequence shifted one position to the right, and the output data in the training set being the future spectrum information;
s3, training the spectrum prediction network by using a training set;
S4, predicting: taking the current spectrum information as the input sequence, taking the spectrum information at the last moment of the input sequence as the first input of the decoder output sequence, and performing spectrum prediction in an autoregressive manner using the trained spectrum prediction network.
2. The attention mechanism-based spectrum prediction method as claimed in claim 1, wherein the attention mechanism in the coding submodule is a multi-head attention mechanism, and the two attention mechanisms in the decoding submodule are a cross multi-head attention mechanism and a masked multi-head attention mechanism respectively; global correlation extraction is performed by the attention mechanisms on the basis of the local information extraction by the gating recursion unit, wherein the multi-head attention mechanism in the coding submodule extracts the global correlation within the history information, the masked multi-head attention mechanism in the decoding submodule extracts the correlation within the future-time-slot spectrum occupancy, and the cross multi-head attention mechanism extracts the correlation between the historical spectrum information and the future spectrum information.
3. The attention-based spectrum prediction method of claim 2, wherein the nth coding submodule is:
wherein the two quantities in the formula denote, respectively, the output of the gating recursion unit at time t in the nth coding submodule and the output at time t of the nth coding submodule; LSTM() represents the gating recursion unit, LayerNorm() represents layer normalization, MultiHead() represents multi-head attention using the scaled dot-product attention scoring function, and FFN() represents the FFN transformation.
4. A method of spectrum prediction based on an attention mechanism as claimed in claim 3, wherein the nth decoding submodule is:
wherein MaskMultiHead() represents the masked multi-head attention mechanism using the scaled dot-product attention scoring function, and the remaining quantity represents the output at time t of the nth decoding submodule.
5. The attention-based spectrum prediction method of claim 2, wherein the normalization in the Transformer model is layer normalization.
6. The attention-based spectrum prediction method as recited in claim 2, wherein a dropout mechanism is added during training of the spectrum prediction network.
7. The attention-based spectrum prediction method of claim 1, wherein the forward propagation formula of the gating recursion unit is:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

wherein ⊙ denotes the Hadamard product operator; the gating units i_t, f_t and o_t correspond to the outputs of the input gate, the forget gate and the output gate respectively; c_t represents the output of the memory cell at the current moment; h_{t-1} represents the hidden state at the previous moment; x_t represents the input at the current moment; W_xi and W_hi are the weight matrices for feature extraction of the input sequence and the hidden state in the input gate, W_xf and W_hf those in the forget gate, W_xo and W_ho those in the output gate, and W_xc and W_hc those in the memory cell; b_i, b_f, b_o and b_c are the bias terms of each gate; σ represents the activation function.
8. The attention-based spectrum prediction method of claim 1, wherein the feedforward neural network in the Transformer model is:

FFN(x) = w_2 ReLU(w_1 x + b_1) + b_2

wherein FFN(x) is the output of the feedforward neural network, x is the input, w_1 and w_2 are weight matrices, b_1 and b_2 are bias terms, and ReLU() is the activation function.
9. A computer-readable storage device storing a computer program, characterized in that the computer program when executed implements the attention-based spectrum prediction method according to any of claims 1 to 8.
10. An attention-based spectrum prediction apparatus comprising a storage device, a processor and a computer program stored in the storage device and executable on the processor, wherein execution of the computer program by the processor implements the attention-based spectrum prediction method as claimed in any one of claims 1 to 8.
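For illustration only (not part of the claims), the forward propagation of the gating recursion unit recited in claim 7 can be sketched in NumPy; the parameter names and the gate ordering here are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each hold the parameters of the four gates in the order:
    # input gate, forget gate, output gate, candidate memory.
    W_i, W_f, W_o, W_c = W
    U_i, U_f, U_o, U_c = U
    b_i, b_f, b_o, b_c = b
    i_t = sigmoid(x_t @ W_i + h_prev @ U_i + b_i)      # input gate
    f_t = sigmoid(x_t @ W_f + h_prev @ U_f + b_f)      # forget gate
    o_t = sigmoid(x_t @ W_o + h_prev @ U_o + b_o)      # output gate
    c_tilde = np.tanh(x_t @ W_c + h_prev @ U_c + b_c)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                 # * is the Hadamard product
    h_t = o_t * np.tanh(c_t)                           # hidden state / output
    return h_t, c_t
```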
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311379839.6A CN117278154A (en) | 2023-10-23 | 2023-10-23 | Spectrum prediction method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117278154A true CN117278154A (en) | 2023-12-22 |
Family
ID=89206201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311379839.6A Pending CN117278154A (en) | 2023-10-23 | 2023-10-23 | Spectrum prediction method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117278154A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118233035A (en) * | 2024-05-27 | 2024-06-21 | 烟台大学 | Multiband spectrum prediction method and system based on graph convolution inversion transform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||