CN114615744A - Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method - Google Patents

Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Info

Publication number
CN114615744A
CN114615744A (application CN202210185185.2A)
Authority
CN
China
Prior art keywords
base station
network
time
resource
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210185185.2A
Other languages
Chinese (zh)
Inventor
赵楠
任凡
杜威
陈金莲
陈哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202210185185.2A priority Critical patent/CN114615744A/en
Publication of CN114615744A publication Critical patent/CN114615744A/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/53 Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/70 Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/54 Allocation or scheduling criteria for wireless resources based on quality criteria

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources. According to the differentiated service requirements of mobile edge network slice users, and considering constraints such as the network slice communication-sensing-computing resources, user equipment delay and energy consumption, a collaborative optimization problem model is established with maximization of the total throughput of the user equipment as the optimization target. On this basis, the optimization problem is modeled as a multi-agent stochastic game, a knowledge-transfer reinforcement learning algorithm among the multiple agents is studied, the exploration efficiency and scalability of the collaborative optimization strategy are improved, and collaborative optimization of network slice communication-sensing-computing resources is realized under diversified service scenarios.

Description

Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources.
Background
With the rapid development of 5G mobile communication and the Internet of Things, massive IoT devices and ubiquitous connectivity requirements are driving mobile communication toward intelligent evolution. As a new network paradigm, the mobile edge network can significantly reduce transmission energy consumption, network congestion and processing delay by migrating communication and computing resources to the network edge, and promotes the deep fusion of communication, sensing and computing.
In a mobile edge network, network slicing can meet diversified low-delay and high-reliability requirements by sharing communication, sensing, computing and infrastructure resources. Existing network slice resource allocation methods have been studied mainly from two aspects. The first is the type of resource being optimized: compared with single-resource optimization strategies, multi-dimensional resource collaborative optimization is more complex, mainly adopts a centralized mode, and incurs more communication and control overhead. The second is the resource allocation mode: compared with static allocation, dynamic allocation strategies change with the network environment and can achieve real-time dynamic optimization of resources. In an actual mobile edge network, slice performance is often related to multiple types of resources whose complex dependencies are difficult to capture with an exact mathematical model; high-dimensional dynamic network states, such as the randomness of wireless channel states and the time-varying traffic of slice users, also restrict improvement of the network slicing quality of service.
Reinforcement learning, as a model-free method with powerful decision-making capability in high-dimensional spaces, is regarded as one of the promising solutions to the above problems. However, existing work pays little attention to the safe and efficient optimization of multi-dimensional resources under communication-sensing-computing fusion.
Disclosure of Invention
In order to overcome the problem that high-dimensional dynamic networks restrict the quality of service of network slices, the invention aims to provide a safe and efficient optimization method for multi-dimensional network slice resources under communication-sensing-computing fusion.
In order to achieve the above purpose, the invention adopts the following technical scheme: a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
Step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model;
Step 1, constructing the network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
Assume that M base stations share K resource blocks and F computing resources and can support access by N edge sensing devices. At time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device: if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j. Since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices.
Likewise, a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t: if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j. Since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices.
At time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs. Therefore, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
The objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively. In the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations.
Step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs.
Step 2.1, modeling the multi-agent stochastic game: the optimization problem is modeled as a multi-agent stochastic game in which each base station acts as an agent.
The state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t.
The action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f.
The reward function of each base station is defined as follows: the reward r_i^t of the ith base station at time t reflects the sum of the throughputs of all EDs served by the ith base station.
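As an illustration of this stochastic-game formulation, the Python sketch below shows one possible way to hold each base-station agent's state, action and reward; the container and field names are hypothetical and only mirror the quantities listed above.

    from dataclasses import dataclass
    from typing import Dict, List

    SLICES = ("e", "u", "m")  # eMBB, mMTC, uRLLC

    @dataclass
    class BSState:
        # sensing-computing task demands observed per slice and edge device
        tasks: Dict[str, List[float]]          # slice type -> per-ED task sizes

    @dataclass
    class BSAction:
        num_resource_blocks: Dict[str, int]    # per-slice resource blocks owned
        num_compute_units: Dict[str, int]      # per-slice computing resources Y_i^l(t)
        num_edge_devices: Dict[str, int]       # per-slice admitted EDs
        rb_allocation: Dict[int, int]          # resource block k -> ED j (-1 if unused)
        cpu_allocation: Dict[int, int]         # computing resource f -> ED j (-1 if unused)

    def reward(throughputs: List[float]) -> float:
        # r_i^t: sum of the throughputs of all EDs served by base station i
        return sum(throughputs)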
Step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1. At time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state.
The student mode is based on a deep deterministic policy gradient model. The Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode.
A multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M). With the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n. The final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding. The student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time.
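Since the attention formulas are not reproduced here, the sketch below illustrates a standard scaled dot-product attention over the teacher histories, with P1, P2 and P3 assumed to act as query, key and value projections and Ps as the advice decoder; these roles are an assumption for illustration, not a statement of the filed design.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_advice(h_student, teacher_histories, teacher_params, P1, P2, P3, Ps):
        """Aggregate the teachers' policy parameters into one advice vector.

        h_student:         (D,)     hidden state of the student base station
        teacher_histories: (M-1, D) encoded histories of the teacher base stations
        teacher_params:    (M-1, P) policy-network parameters theta_n of the teachers
        P1, P2, P3, Ps:    learnable projections (roles assumed: query/key/value/decode)
        """
        D = teacher_histories.shape[1]
        query = P1 @ h_student                          # (d,)
        keys = teacher_histories @ P2.T                 # (M-1, d)
        weights = softmax(keys @ query / np.sqrt(D))    # attention weights over teachers
        values = teacher_params @ P3.T                  # (M-1, d)
        return Ps @ (weights @ values)                  # linearly decoded weighted sum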
the gain in base station learning performance from student mode is defined herein as a student reward
Figure BDA0003522981460000055
The student Actor-criticic network is trained using a trained attention selection model. By minimizing the student's loss function
Figure BDA0003522981460000056
Updating student Critic network parameters of base station i
Figure BDA0003522981460000057
Figure BDA0003522981460000058
Wherein the content of the first and second substances,
Figure BDA0003522981460000059
is given by the parameter
Figure BDA00035229814600000510
The student goal Critic network of (1),
Figure BDA00035229814600000511
and
Figure BDA00035229814600000512
respectively representing the hidden state at the current time t and the hidden state at the next time t,
Figure BDA00035229814600000513
and
Figure BDA00035229814600000514
respectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,
Figure BDA00035229814600000515
and
Figure BDA00035229814600000516
representing the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,
Figure BDA00035229814600000517
γ is the discount factor for the student reward function.
The student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters.
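As a sketch of this student-mode update, the PyTorch-style code below implements a generic DDPG-like Critic loss against a target Critic together with a deterministic policy-gradient Actor step over hidden states; the interfaces, names and hyperparameters are assumptions for illustration, not the filed implementation.

    import torch
    import torch.nn as nn

    def student_update(actor, critic, target_critic, batch, gamma, actor_opt, critic_opt):
        """One DDPG-style update of the student Actor-Critic of one base station.

        batch: dict of tensors 'h' (hidden states), 'a' (student actions),
               'r' (student rewards), 'h_next' (next hidden states).
        """
        h, a, r, h_next = batch["h"], batch["a"], batch["r"], batch["h_next"]

        # Critic: minimise the squared TD error against the student target Critic.
        with torch.no_grad():
            y = r + gamma * target_critic(h_next, actor(h_next))
        critic_loss = nn.functional.mse_loss(critic(h, a), y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: deterministic policy gradient, ascend the Critic's value.
        actor_loss = -critic(h, actor(h)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()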
in the self-learning mode, if the student Actor of the base station i selects the self-learning mode, the student Actor will hide the base station i
Figure BDA00035229814600000520
And sending the data to a self-learning network module, wherein each base station independently performs resource optimization action decision by adopting a Deep Q Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network. The current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network. After a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network.
The weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor.
Meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy. In its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B. During learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network. Continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum. In addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.
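For the self-learning mode, the sketch below shows a generic DQN update with a target value network and the experience replay memory B described above, assuming hypothetical q_net and target_net modules and stored 4-tuple transitions (with actions stored as long indices); it illustrates the technique rather than the exact filed networks.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    replay_B = deque(maxlen=100_000)   # experience replay memory B of transitions (h, a, r, h_next)

    def dqn_step(q_net, target_net, optimizer, batch_size, gamma):
        """Sample a mini-batch from B and minimise the squared TD error of the current value network."""
        if len(replay_B) < batch_size:
            return
        h, a, r, h_next = map(torch.stack, zip(*random.sample(replay_B, batch_size)))
        with torch.no_grad():
            # target: self-learning reward + discounted max Q of the target value network
            target = r + gamma * target_net(h_next).max(dim=1).values
        q = q_net(h).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # every so many rounds, copy the current value network's weights into the target value network:
    # target_net.load_state_dict(q_net.state_dict())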
The advantages of the method are as follows. The constructed network slice communication-sensing-computing fusion resource collaborative optimization model is complete; since the network environment of the model is unknown, each base station finds the optimal allocation strategy for user resources and computing resources through continuous interaction with the environment, which gives the method high practical value. By combining transfer learning with deep reinforcement learning, the existing knowledge of an already-trained base station can be reused and quickly transferred into the training process of other base stations, preserving prior work and providing a timeliness advantage; this improves the decision efficiency of deep reinforcement learning in a high-dimensional space, effectively strengthens the learning and generalization capability of the base stations, avoids the complexity of manually allocating resources to operating base stations in an uncertain environment, and enables the base stations to complete collaborative resource optimization more safely and efficiently.
Drawings
FIG. 1: multi-agent knowledge migration reinforcement learning framework
Detailed Description
The present invention will be described in further detail below with reference to an example, for the purpose of facilitating understanding and practice of the invention by those of ordinary skill in the art; it is to be understood that the embodiment described here is illustrative and the invention is not to be construed as limited thereto.
According to the differentiated service requirements of mobile edge network slice users, and considering constraints such as the network slice communication-sensing-computing resources, user equipment delay and energy consumption, a collaborative optimization problem model is established with maximization of the total throughput of the user equipment as the optimization target. On this basis, the optimization problem is modeled as a multi-agent stochastic game, a knowledge-transfer reinforcement learning algorithm among the multiple agents is studied, the exploration efficiency and scalability of the collaborative optimization strategy are improved, and collaborative optimization of network slice communication-sensing-computing resources is realized under diversified service scenarios.
Step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model;
step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model, specifically as follows:
Assume that M = 5 base stations share K = 100 resource blocks and F = 100 computing resources and can support access by N = 75 edge sensing devices. At time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device: if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j. Since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices.
Likewise, a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t: if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j. Since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices.
At time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs. Then, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
The objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively; the total user delay T and the total energy consumption E are determined according to the actual engineering conditions. In the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations.
Step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent deep reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs.
Step 2.1, modeling the multi-agent stochastic game: the optimization problem is modeled as a multi-agent stochastic game in which each base station acts as an agent.
The state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t.
The action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f.
The reward function of each base station is defined as follows: the reward r_i^t of the ith base station at time t reflects the sum of the throughputs of all EDs served by the ith base station.
Step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1. At time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state. In actual engineering, setting z too small leads to an insufficient amount of data for network training, while setting it too large lowers the training efficiency of the system.
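To illustrate how the most recent z state-action pairs could be encoded into the hidden state by the LSTM unit b_i, the sketch below uses a standard torch.nn.LSTM; the dimensions and the concatenation of states and actions are assumptions chosen for illustration.

    import torch
    import torch.nn as nn

    class HistoryEncoder(nn.Module):
        """LSTM unit b_i: encode the last z (state, action) pairs into a hidden state."""
        def __init__(self, state_dim, action_dim, hidden_dim):
            super().__init__()
            self.lstm = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)

        def forward(self, states, actions):
            # states: (batch, z, state_dim), actions: (batch, z, action_dim)
            x = torch.cat([states, actions], dim=-1)
            _, (h_n, _) = self.lstm(x)
            return h_n[-1]                     # (batch, hidden_dim) hidden state

    # e.g. z = 8 recent steps; too small starves training of data, too large slows training
    encoder = HistoryEncoder(state_dim=16, action_dim=8, hidden_dim=64)
    hidden = encoder(torch.randn(1, 8, 16), torch.randn(1, 8, 8))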
The student mode is based on a deep deterministic policy gradient model. The Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode. When the threshold G is less than 0.5, base station i is more inclined to select the student mode; conversely, when G is greater than 0.5, base station i is more inclined to select the self-learning mode.
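A minimal sketch of this mode-selection rule, assuming the Actor exposes the probability P'_ss of choosing the student mode directly:

    def choose_mode(p_student: float, G: float = 0.5) -> str:
        """Select the behaviour mode of base station i from P'_ss and threshold G."""
        # P'_ss > G -> student mode (request advice from the other base stations);
        # otherwise -> self-learning mode. A smaller G favours the student mode.
        return "student" if p_student > G else "self-learning"

    mode = choose_mode(p_student=0.7, G=0.5)   # -> "student"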
A multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M). With the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n. The final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding. The student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time.
the gain in base station learning performance from student mode is defined herein as student rewards
Figure BDA0003522981460000108
The student Actor-criticic network is trained using a trained attention selection model. By minimizing the student's loss function
Figure BDA0003522981460000109
Updating student Critic network parameters of base station i
Figure BDA00035229814600001010
Figure BDA00035229814600001011
Wherein the content of the first and second substances,
Figure BDA00035229814600001012
is given by the parameter
Figure BDA00035229814600001013
The student goal Critic network of (1),
Figure BDA00035229814600001014
and
Figure BDA00035229814600001015
respectively representing the hidden state at the current time t and the hidden state at the next time t,
Figure BDA00035229814600001016
and
Figure BDA00035229814600001017
respectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,
Figure BDA00035229814600001018
and
Figure BDA00035229814600001019
representing the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,
Figure BDA0003522981460000111
γ is the discount factor for the student reward function.
The student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters.
in the self-learning mode, if the student Actor of the base station i selects the self-learning mode, the student Actor will hide the base station i
Figure BDA0003522981460000114
And sending the data to a self-learning network module, wherein each base station independently performs resource optimization action decision by adopting a Deep Q Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network. The current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network. After a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network.
The weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor.
Meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy. In its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B. During learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network. Continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum. In addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.

Claims (1)

1. A knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model;
step 1, constructing the network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
assuming that M base stations share K resource blocks and F computing resources and can support access by N edge sensing devices; at time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type;
at time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device, with 1 ≤ i ≤ M, 1 ≤ j ≤ N and 1 ≤ k ≤ K; if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j; since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices;
a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t, with 1 ≤ f ≤ F; if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j; since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices;
at time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs; then, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
the objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively; in the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations;
step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs;
Step 2.1, modeling the multi-agent random game process: modeling the optimization problem into a multi-agent random game process, and equivalently using each base station as an agent;
the state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t;
the action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f;
The reward function of each base station is defined as:
Figure FDA0003522981450000037
wherein r isi tThe reward function of the ith base station at the moment t is represented, and the sum of the throughputs of all EDs in the ith base station is reflected;
step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1; at time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy;
at time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state;
the student mode is based on a deep deterministic policy gradient model; the Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode;
a multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M); with the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n; the final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding; the student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time;
the gain in the base station's learning performance brought by the student mode is defined as the student reward; the student Actor-Critic network is trained with the trained attention selection model: the student Critic network parameters of base station i are updated by minimizing the student loss function, which measures the expected squared error between the state-action value of the student Critic network and the target formed from the student reward and the discounted state-action value of a separately parameterized student target Critic network, evaluated at the hidden states and student policies of base station i at the current time t and at the next time instant, where E[·] denotes expectation and γ is the discount factor for the student reward function;
the student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters;
in the self-learning mode, if the student Actor of base station i selects self-learning, it sends the hidden state of base station i to the self-learning network module, and each base station independently makes its resource optimization action decisions using the Deep Q-Network (DQN) method;
the DQN algorithm framework consists of a current value network and a target value network; the current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network; after a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network;
the weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor;
meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy; in its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B; during learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network; continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum; in addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.
CN202210185185.2A 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method Withdrawn CN114615744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210185185.2A CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210185185.2A CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Publications (1)

Publication Number Publication Date
CN114615744A true CN114615744A (en) 2022-06-10

Family

ID=81858654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210185185.2A Withdrawn CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Country Status (1)

Country Link
CN (1) CN114615744A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361301A (en) * 2022-10-09 2022-11-18 之江实验室 Distributed computing network cooperative traffic scheduling system and method based on DQN
US12021751B2 (en) 2022-10-09 2024-06-25 Zhejiang Lab DQN-based distributed computing network coordinate flow scheduling system and method
CN115496208A (en) * 2022-11-15 2022-12-20 清华大学 Unsupervised multi-agent reinforcement learning method with collaborative mode diversity guidance

Similar Documents

Publication Publication Date Title
Wang et al. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach
Huang et al. Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing
Yu et al. Toward resource-efficient federated learning in mobile edge computing
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
CN109474980A (en) A kind of wireless network resource distribution method based on depth enhancing study
Chen et al. Multiuser computation offloading and resource allocation for cloud–edge heterogeneous network
CN114615744A (en) Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method
Wang et al. Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC
CN111611062B (en) Cloud-edge collaborative hierarchical computing method and cloud-edge collaborative hierarchical computing system
CN111475274A (en) Cloud collaborative multi-task scheduling method and device
CN113364859B (en) MEC-oriented joint computing resource allocation and unloading decision optimization method in Internet of vehicles
Meng et al. Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems
CN113573363B (en) MEC calculation unloading and resource allocation method based on deep reinforcement learning
CN105743985A (en) Virtual service migration method based on fuzzy logic
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
KR20230007941A (en) Edge computational task offloading scheme using reinforcement learning for IIoT scenario
Li et al. Computation offloading with reinforcement learning in d2d-mec network
Li et al. Task computation offloading for multi-access edge computing via attention communication deep reinforcement learning
Wang et al. Improving the performance of tasks offloading for internet of vehicles via deep reinforcement learning methods
Zhu et al. Computing offloading decision based on multi-objective immune algorithm in mobile edge computing scenario
Iqbal et al. Convolutional neural network-based deep Q-network (CNN-DQN) resource management in cloud radio access network
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN110392377A (en) A kind of 5G super-intensive networking resources distribution method and device
Cui et al. Resource-Efficient DNN Training and Inference for Heterogeneous Edge Intelligence in 6G

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20220610