CN113811009B - Multi-base-station network resource intelligent allocation method based on space-time feature extraction - Google Patents

Multi-base-station network resource intelligent allocation method based on space-time feature extraction

Info

Publication number
CN113811009B
CN113811009B
Authority
CN
China
Prior art keywords
network
vector
algorithm
current
value
Prior art date
Legal status
Active
Application number
CN202111118071.8A
Other languages
Chinese (zh)
Other versions
CN113811009A (en)
Inventor
李荣鹏
肖柏狄
郭荣斌
赵志峰
张宏纲
Current Assignee
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202111118071.8A
Publication of CN113811009A
Application granted
Publication of CN113811009B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/53 Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-base-station network resource intelligent allocation method based on space-time feature extraction. The method extracts the position information and spatial features of each 5G base station through a graph attention mechanism, learns the behavior habits of network users and extracts temporal features through a long short-term memory mechanism, and analyzes the fluctuation of the data packets of each slice in space and time. Compared with resource allocation strategies based on optimization algorithms and genetic algorithms, and with resource allocation strategies based on traditional reinforcement learning, it obtains a higher system return, namely higher spectral efficiency and better user experience. At the same time, the method can adapt to a dynamically changing environment and has greater flexibility and robustness.

Description

Multi-base-station network resource intelligent allocation method based on space-time feature extraction
Technical Field
The invention relates to the technical field of wireless communication, in particular to a multi-base-station network resource intelligent allocation method based on space-time feature extraction.
Background
Currently, 5G networks have become an indispensable key link in the development of the digital society. Compared with 4G networks, 5G networks provide a massive range of services that can meet much wider demands, most of which cannot be realized by 4G.
The ITU defines three main application scenarios for 5G: enhanced mobile broadband (eMBB), massive machine-type communication (mMTC), and ultra-reliable low-latency communication (URLLC). With its high bandwidth, eMBB is mainly applied to services such as AR/VR (augmented reality/virtual reality); with its high connection density, mMTC is applied to services such as the Internet of Things and smart homes; and URLLC, with its low latency and high reliability, can be applied to services such as autonomous driving and remote operation.
However, if a 5G network used a dedicated network for each specific service, as 4G networks do, it would waste resources to a large extent. This is because, as mentioned above, different 5G services place different requirements on communication latency, bandwidth, mobility, reliability and other performance metrics, and covering all services with multiple dedicated networks would incur huge deployment costs.
Therefore, researchers have proposed network slicing (NS) technology. Network slicing can flexibly allocate existing network resources according to different user requirements. Compared with a single network, it can provide higher-performance logical networks, flexibly allocate limited bandwidth resources, and allocate network resources reasonably without mutual interference, with higher reliability and security. To meet changing user requirements and the frequent switching between base stations caused by user mobility, optimizing the deployment and adjusting the resource allocation of network slices in real time is a significant challenge for current 5G services. The key technical indicator is as follows: while satisfying the service level agreements (SLA) of slice subscribers as far as possible to improve the user service satisfaction rate (SSR), the spectrum efficiency (SE) should be maximized to reduce resource costs and meet the needs of more subscribers.
Traditional dedicated resource allocation schemes and resource allocation strategies based on optimization algorithms and heuristic algorithms often rely on strict constraints and complex derivations to form a specific optimization problem; they lack flexibility and scalability, and when the user characteristics and the proportions of users with different performance requirements change, these algorithms cannot respond well. It is therefore necessary to allocate spectrum resources to the different slices dynamically and intelligently according to users' service requests, so as to maximize SE while guaranteeing a basic SSR.
Reinforcement learning learns, in a trial-and-error manner, the optimal behavior strategy that maximizes the return by constantly interacting with the environment, capturing state information from the environment, and selecting actions accordingly. Traditional reinforcement learning methods have difficulty handling continuous or high-dimensional state spaces, so deep learning feature extraction and prediction methods have been introduced into reinforcement learning: a deep neural network extracts deep features of the state and represents the state value function, allowing deep reinforcement learning algorithms to predict the optimal action selection strategy over a larger state space. Typical deep reinforcement learning algorithms include the Deep Q Network (DQN) and Advantage Actor-Critic (A2C).
Although convolutional neural networks have achieved great success in processing structured information, the data involved in many interesting tasks cannot be represented by a grid-like structure and instead live in irregular domains, where graph structures are the natural representation. There is growing interest in generalizing convolution to the graph domain, and graph convolutional neural networks continue to develop from it. The graph attention mechanism is a representative graph convolutional network mechanism; it introduces multi-head masked attention to assign different influence weights to neighbor nodes.
In addition, user movement causes the same user's demand to switch continuously between different base stations, so user behavior needs to be predicted and the demand met in time. The long short-term memory mechanism, a typical recurrent neural network mechanism, can integrate and discard information from time series and extract the temporal features of a sequence.
With these two mechanisms, the cooperation between the nodes in the graph can be strengthened, changes in user behavior can be predicted in advance and information can be aggregated, and the system becomes more robust to noise from neighbor nodes and to user movement.
Disclosure of Invention
The invention aims to provide a multi-base-station network resource intelligent allocation method based on space-time feature extraction. Compared with traditional optimization algorithms and heuristic algorithms, the proposed method has better flexibility and scalability. Compared with other reinforcement learning algorithms, the proposed method can strengthen the cooperation between base stations to predict the variation trend of packet traffic, and predict the trend of user behavior so as to reduce the negative influence that changes in the number of users in a base station, caused by user mobility, have on the prediction of the state-action value function. A reinforcement learning algorithm based on space-time feature extraction is therefore adopted for resource allocation prediction in the multi-base-station cooperative wireless network, which improves prediction accuracy and greatly improves wireless network performance.
In order to achieve the purpose, the invention provides the following technical scheme:
the application discloses a multi-base-station network resource intelligent allocation method based on space-time feature extraction, which is characterized by comprising the following steps:
S1, building and initializing the algorithm network G and the target network Ĝ;
S11, dividing the algorithm network G into a state vector encoding network Embed, a long short-term memory network LSTM, a graph attention network GAT and a deep Q network DQN;
S12, the state vector encoding network Embed consists of two fully connected layers and is written as h_m = Embed(s_m) = σ(W_e s_m + b_e), where W_e and b_e are the weight matrix and bias of the layer and σ is the activation function; the N-dimensional state vector s_m of agent m in multi-agent reinforcement learning is input into the state vector encoding network Embed, which outputs a K-dimensional encoded vector h_m;
S13, the encoded vectors h_m of the current agent m and h_j of the agents on nodes adjacent to m in the directed graph, j ∈ D_m, are used as the input vectors of the graph attention network GAT, where D_m denotes the set of agents on nodes adjacent to the current agent m in the directed graph; attention influence coefficients are calculated and normalized; the normalized attention influence coefficients are multiplied by the input vectors to calculate the first-layer output of the graph attention network GAT; the calculation of the attention influence coefficients, the normalization and the first-layer output are written compactly as h'_m = GAT¹(h_m, {h_j, j ∈ D_m}); the second-layer output of the graph attention network GAT is h''_m = GAT²(h'_m, {h'_j, j ∈ D_m});
S14, for the current agent m, the first-layer outputs of the graph attention network GAT over the T consecutive time steps up to the current step are combined into a sequence (h'_m(t-T+1), …, h'_m(t)), and the second-layer outputs over the same T steps are combined into a sequence (h''_m(t-T+1), …, h''_m(t)); the two sequences are used as the input vector sequences of the long short-term memory network LSTM to integrate the temporal features of the sequences; the LSTM consists of several units, each of which contains a memory gate, a forget gate and an output gate; each unit takes the output vectors C_{t-1} and H_{t-1} of the previous unit and the vector x_t of the current time as input, and outputs the integrated information C_t and H_t; data are processed with the memory gate, forget gate and output gate at the core, and the LSTM finally outputs the vectors H_t^(1) and H_t^(2), where C_{t-1} represents the integrated information of all vectors at the first t-1 moments and H_{t-1} represents the information in the vector at time t-1 that is relevant to the current moment;
S15, the deep Q network DQN consists of several fully connected layers; the first-layer output h'_m and second-layer output h''_m of the graph attention network GAT and the output vectors H_t^(1) and H_t^(2) processed by the long short-term memory network LSTM are used as the input of the deep Q network DQN, which outputs the return values of the different actions executed in the current state; the action with the highest return is selected and executed to interact with the environment;
S16, after the network structure is defined, the weight matrices in the algorithm network G are randomly initialized from a Gaussian distribution and a target network Ĝ is constructed; its structure is identical to that of the algorithm network G, and its weights are initialized by copying the weight parameters of G;
s2, executing resource allocation;
S3, repeating the resource allocation of step S2 N_pre times, then training the algorithm network G;
S4, each time the algorithm network G of step S3 has been trained X times, assigning the weight parameters of the algorithm network G to the target network Ĝ to update Ĝ;
S5, after step S3 has been executed N_train times, finishing the training process of the algorithm network G.
Preferably, in sub-step S13 the attention influence coefficient is calculated as e_mj = ATT(W_s h_m, W_t h_j) = (W_s h_m)^T (W_t h_j), the attention influence coefficient is normalized as α_mj = exp(e_mj) / Σ_{k∈D_m} exp(e_mk), and the first-layer output of the graph attention network is calculated as h'_m = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈D_m} α_mj^k W^k h_j ), where W_s, W_t and W^k are the weight matrices of the layer and are also the network parameters to be trained.
Preferably, in step S14 the memory gate is calculated as i_t = σ(W_i·[H_{t-1}, x_t] + b_i), the forget gate as f_t = σ(W_f·[H_{t-1}, x_t] + b_f), and the output gate as o_t = σ(W_o·[H_{t-1}, x_t] + b_o); the integrated information is calculated as C̃_t = tanh(W_C·[H_{t-1}, x_t] + b_C), C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t and H_t = o_t ∘ tanh(C_t), where W_i, W_f, W_o, W_C, b_i, b_f, b_o and b_C are the weight matrices and biases of the layer and are the network parameters to be trained, and tanh is an activation function.
Preferably, the step S2 includes the following substeps:
S21, the radio resource manager obtains the network state vector s_t of each of the M base stations at the current time t; a random number is drawn from the uniform distribution on (0, 1); if the random number is greater than ε, the radio resource manager randomly selects a valid action for each base station; if the random number is less than or equal to ε, the radio resource manager combines s_t with the state vectors of the previous T-1 time points and inputs them to the network G of step S1, and each base station obtains the action a_t with the maximum return value; after executing action a_t, the radio resource manager receives the system benefit value and observes the network state vector s_{t+1} at the next moment;
S22, setting two hyper-parameters c by the wireless resource system manager1、c2And a threshold value c3The real-time report is calculated,
Figure GDA00035045718400000412
Figure GDA00035045718400000413
wherein
Figure GDA00035045718400000414
Represents the mean value of the SSR slices in each base station acquired from the system, wherein c1Is 3 to 6, c2Is 1 to 3, c3The value of (a) is 0.75-1;
S23, the radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a replay buffer B of size |B|, where |B| takes a value of 3000 to 10000.
Preferably, step S3 includes the following process: p quadruples are selected from the buffer B as training samples; the p network state vectors s_t are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s_1, s_2, …, s_p]^T, which is input to the algorithm network G constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the return values corresponding to [a_1, a_2, …, a_p]^T are selected and recorded as the predicted return values G(s_1, a_1), G(s_2, a_2), …, G(s_p, a_p) under the current network parameters; the p network state vectors s_{t+1} in the samples are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s'_1, s'_2, …, s'_p]^T, which is input to the target network Ĝ constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the maximum return values, corresponding to actions â_1, â_2, …, â_p, are selected and recorded as Ĝ(s'_1, â_1), Ĝ(s'_2, â_2), …, Ĝ(s'_p, â_p); the loss function of the algorithm network G is L = (1/p) Σ_{i=1}^{p} (r_i + γ·Ĝ(s'_i, â_i) - G(s_i, a_i))², where r_i is the immediate return corresponding to each sample and γ is the discount factor, taking a value of 0.75 to 0.9; the weight parameters of the algorithm network G are trained by a batch gradient descent method.
Preferably, step S5 includes the following process: the radio resource manager combines the current network state vector s_t with the state vectors of the previous T-1 moments and inputs them into the algorithm network G; for each base station agent, the algorithm network G outputs the return value corresponding to each action, and the action corresponding to the maximum return value is selected as the allocation strategy of the current base station and executed.
Preferably, X takes a value of 100 to 500, N_pre a value of 500 to 3000, and N_train a value of 1000 to 5000.
Preferably, the number p of the quadruples is 32.
Preferably, the batch gradient descent method is Adam, and the learning rate is 0.001.
Preferably, the initial value of ε in sub-step S21 is 0, and after each execution ε is increased according to ε = ε_max - max(0, e^(-train_step/decay_step)), where ε_max takes a value of 0.85 to 0.95, train_step is the number of training steps at the current moment, and decay_step takes a value of 2000 to 4000.
The invention has the beneficial effects that:
(1) The invention uses a graph attention mechanism and a long short-term memory mechanism to preprocess the state vectors, extracts temporal and spatial features, enlarges the receptive field under limited communication conditions, and strengthens the cooperation between base stations and the prediction of user behavior. Through network training, the influence weights of surrounding base stations on the current base station are obtained, which increases the positive influence of effective variables, reduces the negative influence caused by noise and user movement, and enhances the robustness of the system.
(2) The method estimates the state-action value function with deep reinforcement learning and selects the optimal resource allocation strategy. The reinforcement learning algorithm can generate the sample data required for training by interacting with the environment and requires no empirical or prior assumptions about the distribution of the state-action value function, so the method can adapt to more complex scenarios and has better flexibility.
(3) Compared with traditional resource sharing and numerical analysis algorithms, the radio resource allocation strategy obtained through the cooperation of multiple base stations achieves a higher system benefit value, i.e., it improves the utilization of spectrum resources while guaranteeing a basic user service satisfaction rate, thereby improving the user experience.
The features and advantages of the present invention will be described in detail by embodiments in conjunction with the accompanying drawings.
Drawings
FIG. 1 is a flowchart of a multi-base-station cooperative network resource allocation method based on temporal feature extraction reinforcement learning according to the present invention;
FIG. 2 shows how the system benefit values of the method of the present invention and of several comparison allocation methods change during radio resource allocation, using the parameters specified in the embodiment below.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings.
Referring to FIG. 1, which shows the flowchart of the multi-base-station cooperative network resource allocation method based on time feature extraction reinforcement learning of the present invention, the method specifically includes the following steps:
S1, building and initializing the algorithm network G and the target network Ĝ, specifically comprising the following sub-steps:
S11, the algorithm network G of the method is divided into a state vector encoding network Embed, a long short-term memory network LSTM, a graph attention network GAT and a deep Q network DQN.
S12, the state vector encoding network Embed consists of two fully connected layers and is written as

h_m = Embed(s_m) = σ(W_e s_m + b_e),    (1)

where W_e and b_e are the weight matrix and bias of the layer and σ is the ReLU activation function. The N-dimensional state vector s_m (the state vector of the m-th agent) in multi-agent reinforcement learning is input into Embed, which outputs the K-dimensional encoded vector h_m.
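For illustration, a minimal Python (numpy) sketch of the encoding step of formula (1); the dimensions N and K and the random initialization are example assumptions, not values taken from the patent:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

N, K = 12, 32                       # hypothetical state and embedding dimensions
W_e = np.random.randn(K, N) * 0.1   # weight matrix of the encoding layer
b_e = np.zeros(K)                   # bias of the encoding layer

def embed(s_m):
    """Encode an N-dimensional state vector s_m into a K-dimensional vector h_m."""
    return relu(W_e @ s_m + b_e)

h_m = embed(np.random.rand(N))
```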
S13, the encoded vectors h_m of the current agent m and h_j of the agents on nodes adjacent to m in the directed graph, j ∈ D_m (where D_m denotes the set of agents on nodes adjacent to the current agent m, and Euclidean distance is used as the criterion for constructing the directed graph), are used as the input vectors of the graph attention network; the attention influence coefficients are calculated and normalized,

e_mj = ATT(W_s h_m, W_t h_j) = (W_s h_m)^T (W_t h_j),    (2)

α_mj = softmax_j(e_mj) = exp(e_mj) / Σ_{k∈D_m} exp(e_mk),    (3)

and the normalized attention influence coefficients are multiplied by the input vectors to calculate the first-layer output of the graph attention network through formula (4), where the multi-head attention parameter K takes a value of 2 to 20,

h'_m = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈D_m} α_mj^k W^k h_j ).    (4)

The three steps of calculating the attention influence coefficients, normalizing them and calculating the output are written compactly as

h'_m = GAT¹(h_m, {h_j, j ∈ D_m}).    (5)

The graph attention network has two layers; the second layer has the same structure as the first and is written as

h''_m = GAT²(h'_m, {h'_j, j ∈ D_m}),    (6)

where W_s, W_t and W^k are the weight matrices of the layer and are also the network parameters to be trained.
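A single-head sketch of formulas (2) to (4) in numpy; the matrix W_v stands in for the per-head matrices W^k, and using ReLU for σ is an assumption for the example:

```python
import numpy as np

def gat_layer(h_m, neighbor_hs, W_s, W_t, W_v):
    """One attention head of a GAT layer.
    h_m: K-dimensional encoded vector of the current base station agent m.
    neighbor_hs: list of K-dimensional encoded vectors h_j, j in D_m.
    W_s, W_t, W_v: weight matrices of the layer (W_v replaces the per-head W^k).
    """
    # attention influence coefficients e_mj = (W_s h_m)^T (W_t h_j)   -- formula (2)
    e = np.array([(W_s @ h_m) @ (W_t @ h_j) for h_j in neighbor_hs])
    # softmax normalisation over the neighbourhood D_m                -- formula (3)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # weighted aggregation of neighbour features with activation      -- formula (4)
    agg = sum(a * (W_v @ h_j) for a, h_j in zip(alpha, neighbor_hs))
    return np.maximum(0.0, agg)    # sigma taken as ReLU in this sketch
```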
S14, for the current agent m, the first-layer and second-layer outputs of the GAT over the T consecutive time steps up to the current step are combined into the sequences (h'_m(t-T+1), …, h'_m(t)) and (h''_m(t-T+1), …, h''_m(t)), which are used as the input vector sequences of the long short-term memory network LSTM to integrate the temporal features of the sequences. The LSTM consists of several units, each of which contains a memory gate, a forget gate and an output gate; each unit takes the output vectors C_{t-1} and H_{t-1} of the previous unit and the vector x_t of the current time as input, and outputs C_t and H_t, where C_{t-1} represents the integrated information of all vectors at the first t-1 moments and H_{t-1} represents the information in the vector at time t-1 that is relevant to the current moment.

The output of the memory gate is calculated as

i_t = σ(W_i·[H_{t-1}, x_t] + b_i),    (7)

the output of the forget gate is calculated as

f_t = σ(W_f·[H_{t-1}, x_t] + b_f),    (8)

the output of the output gate is calculated as

o_t = σ(W_o·[H_{t-1}, x_t] + b_o),    (9)

and the integrated information is calculated as

C̃_t = tanh(W_C·[H_{t-1}, x_t] + b_C),  C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t,    (10)

H_t = o_t ∘ tanh(C_t),    (11)

where W_i, W_f, W_o, W_C, b_i, b_f, b_o and b_C are the weight matrices and biases of the layer and are the network parameters to be trained, and tanh is an activation function.

With these three gates at the core of the data processing, the LSTM finally outputs the vectors H_t^(1) and H_t^(2) for the two input sequences.
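A numpy sketch of one LSTM unit following formulas (7) to (11); the concatenated-input form and the dictionary packing of the weights are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, H_prev, C_prev, W, b):
    """One LSTM unit with a memory (input) gate, a forget gate and an output gate.
    x_t is the GAT output at the current step, H_prev and C_prev are the previous
    unit's outputs, W and b are dicts holding W_i, W_f, W_o, W_C and b_i, b_f, b_o, b_C.
    """
    z = np.concatenate([H_prev, x_t])
    i_t = sigmoid(W['i'] @ z + b['i'])        # memory gate, formula (7)
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate, formula (8)
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate, formula (9)
    C_tilde = np.tanh(W['C'] @ z + b['C'])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde        # integrated information of all steps, formula (10)
    H_t = o_t * np.tanh(C_t)                  # information relevant to the current step, formula (11)
    return H_t, C_t
```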
S15, the deep Q network DQN consists of several fully connected layers; the output vectors h'_m and h''_m of the two GAT layers and the output vectors H_t^(1) and H_t^(2) processed by the long short-term memory network LSTM are used as the input of the deep Q network DQN, which outputs the return values of the different actions executed in the current state; the action with the highest return is selected and executed to interact with the environment.
S16, after the network structure is defined, the weight matrices in the algorithm network G are randomly initialized from a Gaussian distribution and a target network Ĝ is constructed; its structure is identical to that of the algorithm network G, and its weights are initialized by copying the weight parameters of G.
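A sketch of the DQN head of S15 in numpy; the concatenation order, the use of ReLU in the hidden layers and the list-of-(W, b) layer representation are assumptions for the example:

```python
import numpy as np

def q_head(h1, h2, H1, H2, layers):
    """Deep Q network head: concatenate the two GAT outputs and the two LSTM outputs
    and map them to one return value per candidate bandwidth-allocation action.
    `layers` is a list of (W, b) pairs for the fully connected layers."""
    x = np.concatenate([h1, h2, H1, H2])
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)   # hidden fully connected layers with ReLU
    W, b = layers[-1]
    return W @ x + b                     # vector of predicted returns, one per action
```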
S2, performing resource allocation, specifically including the following substeps:
S21, the radio resource manager obtains the network state vector s_t of each base station at the current time t, the number of base stations being M. Using an ε-greedy algorithm, the radio resource manager draws a random number from the uniform distribution on (0, 1); if the random number is greater than ε, the radio resource manager randomly selects a valid action for each base station. If the random number is less than or equal to ε, the radio resource manager combines s_t with the state vectors of the previous T-1 time points and inputs them to the network G of step S1, and each base station obtains the action a_t with the maximum return value. After action a_t is executed, the radio resource manager receives the system benefit value and observes the network state vector s_{t+1} at the next moment. The initial value of ε is 0, and after each execution ε is increased according to ε = ε_max - max(0, e^(-train_step/decay_step)), where ε_max takes a value of 0.85 to 0.95, train_step is the number of training steps at the current moment, and decay_step takes a value of 2000 to 4000.
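A sketch of the ε-greedy selection and the annealing schedule of S21, written as stated in the text; the default ε_max and decay_step values are example choices within the stated ranges:

```python
import numpy as np

def epsilon(train_step, eps_max=0.9, decay_step=2000):
    """Exploration schedule of sub-step S21 as stated in the text; early in training
    the value is small (or negative), so actions are almost always chosen randomly."""
    return eps_max - max(0.0, np.exp(-train_step / decay_step))

def select_action(q_values, train_step, rng=np.random.default_rng()):
    """Epsilon-greedy choice: if the random number exceeds epsilon, pick a random
    valid action; otherwise pick the action with the largest predicted return."""
    if rng.uniform(0.0, 1.0) > epsilon(train_step):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```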
S22, the radio resource system manager sets two hyper-parameters c_1, c_2 and a threshold c_3 and calculates the immediate return r_t from the system benefit value and the mean slice SSR of each base station obtained from the system. c_1 takes a value of 3 to 6, c_2 a value of 1 to 3, and c_3 a value of 0.75 to 1.
S23, the radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a replay buffer B of size |B|, where |B| takes a value of 3000 to 10000. If B is full, a first-in first-out policy is adopted: the quadruple stored earliest is deleted and the latest quadruple is stored.
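A sketch of the replay buffer of S23; a Python deque with maxlen gives the described first-in first-out behaviour, and the buffer size 5000 is one value within the stated range:

```python
from collections import deque

buffer_size = 5000                      # |B|, chosen from the 3000-10000 range
replay_buffer = deque(maxlen=buffer_size)

def store_transition(s_t, a_t, r_t, s_next):
    """Store the (s_t, a_t, r_t, s_{t+1}) quadruple; once the buffer is full the
    oldest entry is dropped automatically (first in, first out)."""
    replay_buffer.append((s_t, a_t, r_t, s_next))
```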
S3, the resource allocation of step S2 is repeated N_pre times, where N_pre takes a value of 500 to 3000, so that the buffer contains enough data for training the current network parameters; the process of training the network G is as follows:

p quadruples are selected from the buffer B as training samples; the p network state vectors s_t are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s_1, s_2, …, s_p]^T, which is input to the algorithm network G constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the return values corresponding to [a_1, a_2, …, a_p]^T are selected and recorded as the predicted return values G(s_1, a_1), G(s_2, a_2), …, G(s_p, a_p) under the current network parameters.

The p network state vectors s_{t+1} in the samples are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s'_1, s'_2, …, s'_p]^T, which is input to the target network Ĝ constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the maximum return values, corresponding to the actions â_1, â_2, …, â_p, are selected and recorded as Ĝ(s'_1, â_1), Ĝ(s'_2, â_2), …, Ĝ(s'_p, â_p).

The loss function of the algorithm network G is

L = (1/p) Σ_{i=1}^{p} (r_i + γ·Ĝ(s'_i, â_i) - G(s_i, a_i))²,

where r_i is the immediate return corresponding to each sample and γ is the discount factor, taking a value of 0.75 to 0.9. The weight parameters of the algorithm network G are trained by a batch gradient descent method, with Adam selected as the optimizer and the learning rate set to 0.001.
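A sketch of the sampling targets and loss of step S3, assuming `net` and `target_net` are callables that map a state sequence to a vector of per-action return values (a simplification of the full G and Ĝ networks):

```python
import numpy as np

def td_targets(batch, target_net, gamma=0.9):
    """Regression targets y_i = r_i + gamma * max_a G_hat(s_{i,t+1}, a)."""
    targets = []
    for (s_t, a_t, r_t, s_next) in batch:
        q_next = target_net(s_next)              # return values from the target network
        targets.append(r_t + gamma * np.max(q_next))
    return np.array(targets)

def dqn_loss(batch, net, target_net, gamma=0.9):
    """Mean squared TD error over the p sampled quadruples."""
    y = td_targets(batch, target_net, gamma)
    q = np.array([net(s_t)[a_t] for (s_t, a_t, _, _) in batch])
    return np.mean((y - q) ** 2)
```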
S4, each time the algorithm network G of step S3 has been trained X times, where X takes a value of 100 to 500, the weight parameters of G are assigned to the target network Ĝ to update Ĝ.
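A sketch of the periodic target-network update of S4, assuming the networks are Keras models so that get_weights/set_weights can be used:

```python
def maybe_sync_target(net, target_net, train_step, X=200):
    """Every X training steps, copy the weights of G into the target network G_hat."""
    if train_step % X == 0:
        target_net.set_weights(net.get_weights())
```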
S5, after step S3 has been executed N_train times, where N_train takes a value of 1000 to 5000, the training process of the algorithm network G is complete. The radio resource manager combines the current network state vector s_t with the state vectors of the previous T-1 moments and inputs them into the algorithm network G; for each base station agent, the algorithm network G outputs the return value corresponding to each action, and the action corresponding to the maximum return value is selected as the allocation strategy of the current base station and executed.
On a server configured as shown in Table 1, a simulation environment was written in Python, the network framework was built with Keras, and tests were performed taking 3 different types of services (call, video and ultra-reliable low-latency service) as an example. The system has 19 base stations, i.e. M = 19, arranged in a honeycomb layout; the total bandwidth of each base station is 10 M and the allocation granularity is set to 0.5 M, giving 171 allocation strategies in total, i.e. the number of valid actions is 171. The discount factor γ is set to 0.9 and the multi-head attention coefficient K to 8. Furthermore, ε_max takes the value 0.95 and decay_step the value 2000. The replay buffer B has a size of 5000, N_pre is 2000 and N_train is 10000. The optimizer in the batch gradient descent algorithm used to train the algorithm network G is Adam, with a learning rate of 0.001. The other parameters are: X = 200, c_1 = 5.5, c_2 = 2, c_3 = 0.8, p = 32.
TABLE 1 System test platform parameters
Processor          Intel i9-9900KF 3.6 GHz
Memory             64 GB
Graphics card      NVIDIA GTX 2080
Software platform  Keras 2.2.4
The method of the present invention is compared with several resource allocation methods, including the hard slicing algorithm (Hard Slicing), the DQN algorithm, the LSTM-A2C algorithm, and the GAT-DQN algorithm without LSTM. The hard slicing algorithm evenly distributes the total bandwidth of the base station to each network slice; the LSTM-A2C algorithm combines a long short-term memory network with deep reinforcement learning. Referring to FIG. 2, it shows how the system benefit values obtained by the various methods change during radio resource allocation, where the system benefit value represents the average return value of the 19 base stations. For ease of analysis, the median value is plotted every 100 steps. The curves show that in the first 4000 steps, because the deep reinforcement learning algorithms still need to train their network parameters, their return values fluctuate more and their median return is lower than that of the even-allocation method. Once network training is completed, i.e. after 4000 steps, the system benefit value of each deep reinforcement learning algorithm improves significantly, and the method of the present invention performs best, with better system stability and a higher system benefit value.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A multi-base station network resource intelligent allocation method based on space-time feature extraction is characterized by comprising the following steps:
S1, building and initializing the algorithm network G and the target network Ĝ;
S11, dividing the algorithm network G into a state vector encoding network Embed, a long short-term memory network LSTM, a graph attention network GAT and a deep Q network DQN;
S12, the state vector encoding network Embed consists of two fully connected layers and is written as h_m = Embed(s_m) = σ(W_e s_m + b_e), where W_e and b_e are the weight matrix and bias of the layer and σ is the activation function; the N-dimensional state vector s_m of agent m in multi-agent reinforcement learning is input into the state vector encoding network Embed, which outputs a K-dimensional encoded vector h_m;
S13, the encoded vectors h_m of the current agent m and h_j of the agents on nodes adjacent to m in the directed graph, j ∈ D_m, are used as the input vectors of the graph attention network GAT, where D_m denotes the set of agents on nodes adjacent to the current agent m in the directed graph; attention influence coefficients are calculated and normalized; the normalized attention influence coefficients are multiplied by the input vectors to calculate the first-layer output of the graph attention network GAT; the calculation of the attention influence coefficients, the normalization and the first-layer output are written compactly as h'_m = GAT¹(h_m, {h_j, j ∈ D_m}); the second-layer output of the graph attention network GAT is h''_m = GAT²(h'_m, {h'_j, j ∈ D_m});
S14, for the current agent m, the first-layer outputs of the graph attention network GAT over the T consecutive time steps up to the current step are combined into a sequence (h'_m(t-T+1), …, h'_m(t)), and the second-layer outputs over the same T steps are combined into a sequence (h''_m(t-T+1), …, h''_m(t)); the two sequences are used as the input vector sequences of the long short-term memory network LSTM to integrate the temporal features of the sequences; the LSTM consists of several units, each of which contains a memory gate, a forget gate and an output gate; each unit takes the output vectors C_{t-1} and H_{t-1} of the previous unit and the vector x_t of the current time as input, and outputs the integrated information C_t and H_t; data are processed with the memory gate, forget gate and output gate at the core, and the LSTM finally outputs the vectors H_t^(1) and H_t^(2), where C_{t-1} represents the integrated information of all vectors at the first t-1 moments and H_{t-1} represents the information in the vector at time t-1 that is relevant to the current moment;
S15, the deep Q network DQN consists of several fully connected layers; the first-layer output h'_m and second-layer output h''_m of the graph attention network GAT and the output vectors H_t^(1) and H_t^(2) processed by the long short-term memory network LSTM are used as the input of the deep Q network DQN, which outputs the return values of the different actions executed in the current state; the action with the highest return is selected and executed to interact with the environment;
S16, after the network structure is defined, the weight matrices in the algorithm network G are randomly initialized from a Gaussian distribution and a target network Ĝ is constructed, whose structure is identical to that of the algorithm network G and whose weights are initialized by copying the weight parameters of G;
s2, executing resource allocation;
S3, repeating the resource allocation of step S2 N_pre times, then training the algorithm network G;
S4, each time the algorithm network G of step S3 has been trained X times, assigning the weight parameters of the algorithm network G to the target network Ĝ to update Ĝ;
S5, after step S3 has been executed N_train times, finishing the training process of the algorithm network G.
2. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein: in sub-step S13 the attention influence coefficient is calculated as e_mj = ATT(W_s h_m, W_t h_j) = (W_s h_m)^T (W_t h_j), the attention influence coefficient is normalized as α_mj = exp(e_mj) / Σ_{k∈D_m} exp(e_mk), and the first-layer output of the graph attention network is calculated as h'_m = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈D_m} α_mj^k W^k h_j ), where W_s, W_t and W^k are the weight matrices of the layer and are also the network parameters to be trained.
3. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein: in step S14 the memory gate is calculated as i_t = σ(W_i·[H_{t-1}, x_t] + b_i), the forget gate as f_t = σ(W_f·[H_{t-1}, x_t] + b_f), and the output gate as o_t = σ(W_o·[H_{t-1}, x_t] + b_o); the integrated information is calculated as C̃_t = tanh(W_C·[H_{t-1}, x_t] + b_C), C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t and H_t = o_t ∘ tanh(C_t), where W_i, W_f, W_o, W_C, b_i, b_f, b_o and b_C are the weight matrices and biases of the layer and are the network parameters to be trained, and tanh is an activation function.
4. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein the step S2 comprises the following sub-steps:
S21, the radio resource manager obtains the network state vector s_t of each of the M base stations at the current time t; a random number is drawn from the uniform distribution on (0, 1); if the random number is greater than ε, the radio resource manager randomly selects a valid action for each base station; if the random number is less than or equal to ε, the radio resource manager combines s_t with the state vectors of the previous T-1 time points and inputs them to the network G of step S1, and each base station obtains the action a_t with the maximum return value; after executing action a_t, the radio resource manager receives the system benefit value and observes the network state vector s_{t+1} at the next moment;
S22, the radio resource system manager sets two hyper-parameters c_1, c_2 and a threshold c_3 and calculates the immediate return r_t from the system benefit value and the mean slice SSR of each base station obtained from the system, where c_1 takes a value of 3 to 6, c_2 a value of 1 to 3, and c_3 a value of 0.75 to 1;
S23, the radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a replay buffer B of size |B|, where |B| takes a value of 3000 to 10000.
5. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein the step S3 comprises the following process: p quadruples are selected from the buffer B as training samples; the p network state vectors s_t are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s_1, s_2, …, s_p]^T, which is input to the algorithm network G constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the return values corresponding to [a_1, a_2, …, a_p]^T are selected and recorded as the predicted return values G(s_1, a_1), G(s_2, a_2), …, G(s_p, a_p) under the current network parameters; the p network state vectors s_{t+1} in the samples are each combined with the state vectors of their previous T-1 moments to obtain a matrix [s'_1, s'_2, …, s'_p]^T, which is input to the target network Ĝ constructed in step S1 to obtain the return values produced by executing the different actions in the p states; the maximum return values, corresponding to the actions â_1, â_2, …, â_p, are selected and recorded as Ĝ(s'_1, â_1), Ĝ(s'_2, â_2), …, Ĝ(s'_p, â_p); the loss function of the algorithm network G is L = (1/p) Σ_{i=1}^{p} (r_i + γ·Ĝ(s'_i, â_i) - G(s_i, a_i))², where r_i is the immediate return corresponding to each sample and γ is the discount factor, taking a value of 0.75 to 0.9; the weight parameters of the algorithm network G are trained by a batch gradient descent method.
6. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein the step S5 comprises the following process: the radio resource manager combines the current network state vector s_t with the state vectors of the previous T-1 moments and inputs them into the algorithm network G; for each base station agent, the algorithm network G outputs the return value corresponding to each action, and the action corresponding to the maximum return value is selected as the allocation strategy of the current base station and executed.
7. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 1, wherein: X takes a value of 100 to 500, N_pre a value of 500 to 3000, and N_train a value of 1000 to 5000.
8. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 5, wherein: the number p of the quadruples is 32.
9. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 5, wherein: the batch gradient descent method is Adam, and the learning rate is 0.001.
10. The method for intelligently allocating network resources of multiple base stations based on spatio-temporal feature extraction as claimed in claim 4, wherein: in sub-step S21 the initial value of ε is 0, and after each execution ε is increased according to ε = ε_max - max(0, e^(-train_step/decay_step)), where ε_max takes a value of 0.85 to 0.95, train_step is the number of training steps at the current moment, and decay_step takes a value of 2000 to 4000.
CN202111118071.8A 2021-09-24 2021-09-24 Multi-base-station network resource intelligent allocation method based on space-time feature extraction Active CN113811009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111118071.8A CN113811009B (en) 2021-09-24 2021-09-24 Multi-base-station network resource intelligent allocation method based on space-time feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111118071.8A CN113811009B (en) 2021-09-24 2021-09-24 Multi-base-station network resource intelligent allocation method based on space-time feature extraction

Publications (2)

Publication Number Publication Date
CN113811009A CN113811009A (en) 2021-12-17
CN113811009B true CN113811009B (en) 2022-04-12

Family

ID=78896400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111118071.8A Active CN113811009B (en) 2021-09-24 2021-09-24 Multi-base-station network resource intelligent allocation method based on space-time feature extraction

Country Status (1)

Country Link
CN (1) CN113811009B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615183B (en) * 2022-03-14 2023-09-05 广东技术师范大学 Routing method, device, computer equipment and storage medium based on resource prediction
CN117313551A (en) * 2023-11-28 2023-12-29 中国科学院合肥物质科学研究院 Radionuclide diffusion prediction method and system based on GAT-LSTM

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111212019A (en) * 2018-11-22 2020-05-29 阿里巴巴集团控股有限公司 User account access control method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111182637B (en) * 2019-12-24 2022-06-21 浙江大学 Wireless network resource allocation method based on generation countermeasure reinforcement learning
CN112749005B (en) * 2020-07-10 2023-10-31 腾讯科技(深圳)有限公司 Resource data processing method, device, computer equipment and storage medium
CN112396492A (en) * 2020-11-19 2021-02-23 天津大学 Conversation recommendation method based on graph attention network and bidirectional long-short term memory network
CN112512070B (en) * 2021-02-05 2021-05-11 之江实验室 Multi-base-station cooperative wireless network resource allocation method based on graph attention mechanism reinforcement learning
CN113051822A (en) * 2021-03-25 2021-06-29 浙江工业大学 Industrial system anomaly detection method based on graph attention network and LSTM automatic coding model

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111212019A (en) * 2018-11-22 2020-05-29 阿里巴巴集团控股有限公司 User account access control method, device and equipment

Also Published As

Publication number Publication date
CN113811009A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN110334201B (en) Intention identification method, device and system
CN113811009B (en) Multi-base-station network resource intelligent allocation method based on space-time feature extraction
CN112512070B (en) Multi-base-station cooperative wireless network resource allocation method based on graph attention mechanism reinforcement learning
CN111339433B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
US20210097646A1 (en) Method and apparatus for enhancing video frame resolution
CN106448670A (en) Dialogue automatic reply system based on deep learning and reinforcement learning
CN110147711A (en) Video scene recognition methods, device, storage medium and electronic device
CN113852432B (en) Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN112183742B (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN107943583A (en) Processing method, device, storage medium and the electronic equipment of application program
CN110633735B (en) Progressive depth convolution network image identification method and device based on wavelet transformation
Zhang et al. Optimization of image transmission in a cooperative semantic communication networks
CN114490618A (en) Ant-lion algorithm-based data filling method, device, equipment and storage medium
Kaushik et al. Traffic prediction in telecom systems using deep learning
CN114449536B (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Usha et al. Dynamic spectrum sensing in cognitive radio networks using ML model
KR102608408B1 (en) Method for predicting depression occurrence using artificial intelligence model and computer readable record medium thereof
CN112906640B (en) Space-time situation prediction method and device based on deep learning and readable storage medium
CN115496175A (en) Newly-built edge node access evaluation method and device, terminal equipment and product
CN114363671A (en) Multimedia resource pushing method, model training method, device and storage medium
CN114283837A (en) Audio processing method, device, equipment and storage medium
CN113282821A (en) Intelligent application prediction method, device and system based on high-dimensional session data fusion
CN112383965B (en) Cognitive radio power distribution method based on DRQN and multi-sensor model
CN117350304B (en) Multi-round dialogue context vector enhancement method and system
CN113704319B (en) Mobile terminal service prediction method combined with context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant