CN113543342A - Reinforcement learning resource allocation and task offloading method based on NOMA-MEC - Google Patents

Reinforcement learning resource allocation and task offloading method based on NOMA-MEC Download PDF

Info

Publication number
CN113543342A
Authority
CN
China
Prior art keywords
network
task
state
mobile equipment
noma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110756466.4A
Other languages
Chinese (zh)
Other versions
CN113543342B (en)
Inventor
李君
沈国丽
仲星
丁文杰
张茜茜
朱明浩
王秀敏
李正权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ictehi Technology Development Jiangsu Co ltd
Binjiang College of Nanjing University of Information Engineering
Original Assignee
Ictehi Technology Development Jiangsu Co ltd
Binjiang College of Nanjing University of Information Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ictehi Technology Development Jiangsu Co ltd, Binjiang College of Nanjing University of Information Engineering filed Critical Ictehi Technology Development Jiangsu Co ltd
Priority to CN202110756466.4A priority Critical patent/CN113543342B/en
Publication of CN113543342A publication Critical patent/CN113543342A/en
Application granted granted Critical
Publication of CN113543342B publication Critical patent/CN113543342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/53 Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453 Resources in frequency domain, e.g. a carrier in FDMA

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a reinforcement learning resource allocation and task offloading method based on NOMA-MEC, belonging to the technical field of mobile communication networks. The invention effectively handles the large task volume on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation mode in different environments, and improves the utilization efficiency of channel resources.

Description

Reinforcement learning resource allocation and task offloading method based on NOMA-MEC
Technical Field
The invention belongs to the technical field of mobile communication networks, and in particular relates to a resource allocation and task offloading method based on NOMA-MEC reinforcement learning.
Background
With the development of the times, mobile terminals such as smartphones have become widely popular, and mobile applications such as face recognition, online interactive games, and augmented reality are emerging in large numbers and attracting great attention. These applications typically require large amounts of resources and intensive computation, demand low latency, and consume considerable energy, while handsets with limited computing resources and battery life can hardly support them. Mobile Edge Computing (MEC) can meet the high computational requirements of these tasks, and the application of Non-Orthogonal Multiple Access (NOMA) technology can further reduce the delay of multi-task offloading.
Mobile edge computing is a form of distributed computing in which data processing, application execution, and even the realization of some functional services are placed on nodes at the edge of the network. The mobile edge consists of one or more edge servers: servers with computing and storage functions are deployed at the traditional base station, upgrading it into a mobile edge computing base station. By offloading computation-intensive or delay-sensitive applications to nearby MEC servers, resource-constrained mobile devices can reduce task processing time while reducing mobile energy consumption and transmission costs.
Non-orthogonal multiple access is one of the key technologies of the fifth-generation cellular network. By allocating different powers to terminal users, it can serve multiple users simultaneously on the same frequency band, effectively improving spectrum utilization. Compared with traditional Orthogonal Multiple Access (OMA), it can provide task offloading for more users under the same channel resources. Considering the many factors that influence the task offloading process, the invention accesses users to the communication system in the non-orthogonal multiple access (NOMA) mode.
Machine Learning (ML) refers to algorithms that automatically analyze data to obtain rules and use those rules to predict unknown data; as an emerging technology with broad application prospects, it is being studied by more and more scholars. Today, 5G mobile communication networks receive ever stronger support from machine learning. By learning method, machine learning is divided into four major categories: supervised learning, semi-supervised learning, unsupervised learning, and Reinforcement Learning (RL). Unlike the other three types of learning, RL does not require complete prior information: an agent learns continuously through interaction with the environment and eventually finds the optimal strategy. RL theory plays a key role in solving problems of dynamic programming, system control, and decision making; in particular, when dealing with dynamic optimization problems, the optimal solution is finally obtained through continual trial-and-error learning in a changing environment. For the joint resource allocation problem of subcarrier channels and task offloading in the NOMA-MEC environment, the diversity of the transmission environment greatly increases the difficulty of designing a resource allocation strategy, and the application of RL theory in wireless communication systems provides a brand-new design idea for solving the resource allocation problem.
Each mobile device can offload all or part of its computing task to the MEC server by selecting a carrier channel, thereby reducing delay and energy consumption and obtaining a good user experience. Existing traditional algorithms are feasible for solving task offloading and resource allocation, but they are not suitable for MEC systems with strict real-time requirements. In the invention, each mobile device acts as an agent, and each agent continuously learns and improves its strategy; as a result the environment is dynamically non-stationary from the perspective of each agent, and key DQN techniques such as experience replay cannot be used directly.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the shortcomings of low spectrum efficiency and limited computing capacity of user equipment, the invention provides a resource allocation and task offloading method based on NOMA-MEC reinforcement learning; mobile devices adopting the NOMA technique and the MADDPG algorithm can intelligently allocate subcarrier channels and offload tasks, thereby reducing system delay.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:
a resource allocation and task unloading method based on NOMA-MEC reinforcement learning comprises the following steps:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total;
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique; the joint optimization model covers carrier channel allocation and task offloading for all mobile devices in the network;
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined;
Step 4: the learning network is trained with the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Further, the step 2 specifically includes the following steps:
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device. This estimate is then subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in turn according to power until the signals of all mobile devices are decoded; the post-decoding signal-to-noise ratio is then obtained;
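For reference, the superposed signal and the post-decoding signal-to-noise ratio described above can be sketched in the standard uplink NOMA form below; the transmit powers p_n, channel gains g_{n,m}, transmitted symbols x_n, noise power σ² and user set U_m are assumed symbols that the original text does not define, so this is an illustrative sketch rather than the patent's exact expressions:

$$X_m=\sum_{n\in\mathcal{U}_m}\sqrt{p_n}\,x_n,\qquad \gamma_{n,m}=\frac{p_n\,g_{n,m}}{\sum_{j\in\mathcal{U}_m:\;p_j g_{j,m}<p_n g_{n,m}}p_j\,g_{j,m}+\sigma^2}$$

where U_m is the set of mobile devices sharing subchannel m and γ_{n,m} is the signal-to-noise ratio of user n after successive interference cancellation.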
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode. The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000031 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image BDA0003147409150000032 in the original)
where f_n is the computing capacity of the mobile user.
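The rate and delay expressions appear only as images in the original text. A common form consistent with the symbols already defined is sketched below as an assumption, not as the patent's exact formulas; B denotes an assumed subchannel bandwidth and η an assumed number of CPU cycles needed per bit of task data:

$$R_{n,m}=B\log_2\!\left(1+\gamma_{n,m}\right),\qquad T_{n,m}^{\mathrm{off}}=\frac{x_n}{R_{n,m}}+\frac{\eta\,x_n}{c_k}+\frac{r_n}{R_{n,m}},\qquad T_n^{\mathrm{loc}}=\frac{\eta\,x_n}{f_n}$$

Here the offloading delay is the sum of the time to upload the task data x_n, the MEC computation time, and the time to return the result r_n, while the local delay depends only on the device's own computing capacity f_n.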
Further, in step 3, the parameters are initialized. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
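The action, state and reward defined above can be represented directly in code. The sketch below is only an illustration under assumed container types (the field G_m is taken to be a complex channel gain, and total_delay stands in for EE(d_n, c_n)); it is not part of the patented method itself.

```python
from dataclasses import dataclass

@dataclass
class Action:
    d_n: float  # offloading ratio in [0, 1]: 0 = compute locally, 1 = offload entirely
    c_n: int    # index of the selected subcarrier channel

@dataclass
class State:
    X_n: int       # 0 = subcarrier channel idle, 1 = busy
    x_n: float     # data size of the task to offload
    G_m: complex   # channel information of subchannel m (assumed complex gain)

def reward(total_delay: float) -> float:
    # r_n = -EE(d_n, c_n): the reward is the negative of the system delay
    return -total_delay
```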
Further, the step 4 includes the following steps:
step 4.1) adopting MADDPG algorithm to update the mobile equipment user network, wherein each mobile equipment user comprises an Actor network and a criticic network, the Actor network and the criticic network have respective estimation network and target network, and theta is [ theta ]1,θ2...θn]Parameters representing n agent policies, for the resulting state siEach agent generates an action a according to the deterministic policy of the Actor networkiWhile receiving an instant prize riEnters a next state s'iThe combined state, the motion, the reward and the state [ x, a, r, x 'at the next time are set']Stored into experience pool D for subsequent training, x ═ s1,s2...sn]Representing observation vectors, i.e. states, a ═ a1,a2...an]Represents an action, r ═ r1,r2...rn]Denotes prize, x '═ s'1,s′2...s′n]Indicating the state at the next time.
Step 4.2) When the number of samples in the experience pool D reaches a certain number, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000041 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step 4.3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network. For the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as
(formula image BDA0003147409150000042 in the original)
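For reference, the deterministic policy gradient used by MADDPG is conventionally written as below; this is the standard form from the literature, offered as a sketch of what the image formula above expresses:

$$\nabla_{\theta_i}J(\mu_i)=\mathbb{E}_{x,a\sim D}\Big[\nabla_{\theta_i}\mu_i(a_i\mid s_i)\,\nabla_{a_i}Q_i^{\mu}(x,a_1,\dots,a_n)\big|_{a_i=\mu_i(s_i)}\Big]$$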
Step 4.4) Steps 4.2) and 4.3) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method;
this continues until the set number of iterations is reached and the network is trained. Afterwards, only the state s_t at the current time needs to be input into the Actor network; the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
Beneficial effects: compared with the prior art, the NOMA-MEC reinforcement learning resource allocation and task offloading method of the invention treats each mobile device in the network system as an independent agent and adopts the MADDPG method based on the Actor-Critic network structure, so that each mobile device can learn a suitable strategy to minimize delay and energy consumption. The invention effectively handles the large task volume on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation mode in different environments, and improves the utilization efficiency of channel resources.
Drawings
FIG. 1 is a system model diagram;
FIG. 2 is a schematic diagram illustrating the steps of the present invention;
FIG. 3 is a block diagram of MADDPG.
Detailed Description
The present invention will be further described with reference to the following embodiments.
Assume there are N mobile devices in the entire network, each with a corresponding task to execute. Each mobile device can select a carrier channel to offload its task to the MEC server, which effectively improves the spectrum utilization efficiency of the system. To reduce interference and minimize offloading delay, a distributed reinforcement learning algorithm for the NOMA-MEC environment is adopted to solve the joint problem of subcarrier channel allocation and task offloading. As training progresses, the policy of each agent changes, so the environment becomes non-stationary from the perspective of any single agent; on the other hand, policy gradient methods typically exhibit large variance when multiple agents must coordinate. This also prevents the direct use of experience replay, so traditional reinforcement learning methods (e.g. Q-learning or policy gradients) are not suitable for multi-agent environments. The invention therefore employs the MADDPG algorithm with its idea of centralized training and distributed execution, in which each mobile device performs resource allocation based only on its own observed environment.
Since each agent is continuously learning and improving its strategy, the environment is dynamically non-stationary from the viewpoint of each agent, and key techniques such as experience replay cannot be used directly. In the invention, each mobile device in the network system is regarded as an independent agent, and the MADDPG method based on the Actor-Critic network structure enables each mobile device to learn a suitable strategy to minimize delay and energy consumption.
Step 1, setting N mobile devices (intelligent agents) in a network, wherein the N mobile devices are expressed as {1, 2., N., N }, and N is less than or equal to N; a total of M subchannels, denoted as {1, 2., M., M }, M ≦ M; the task for the mobile device is denoted t1,t2...,tn,...,tNH, total tNA task;
and 2, establishing a task unloading and resource allocation combined optimization model by adopting the NOMA technology. Establishing a joint optimization model aiming at carrier channel allocation and task unloading of all mobile equipment in a network;
step 3, converting the combined optimization model into a Markov decision process model, and setting the state, action and reward in the Markov decision process;
and 4, training the learning network by using the MADDPG algorithm, wherein the training aims at minimizing the time delay of the mobile equipment, and the optimal joint subcarrier channel allocation and task unloading strategy is obtained as a result.
Examples
The system model of the invention is shown in FIG. 1; it mainly comprises a base station integrated with an MEC server and N mobile devices. The implementation of the solution is described in further detail below.
The specific implementation steps of the invention are as follows:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total.
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique.
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of the user of the n-th mobile device on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the estimate of the n-th mobile device. This estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained.
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode.
The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000061 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image BDA0003147409150000062 in the original)
where f_n is the computing capacity of the mobile user.
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined.
The parameters are initialized. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1}, with 0 indicating that the subcarrier channel is idle and 1 that it is busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
Step 4: the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1) The MADDPG algorithm is adopted to update the networks of the mobile device users. Each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; a block diagram is shown in FIG. 3. θ = [θ_1, θ_2, ..., θ_n] denotes the parameters of the n agents' policies. For the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
Step (2) When the samples in the experience pool D reach a certain number, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000071 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step (3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as ∇_{θ_i} J(μ_i).
Step (4) Steps (2) and (3) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method.
After the set number of iterations is reached and the network is trained, only the state s_t at the current time needs to be input into the Actor network; the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
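A compact PyTorch sketch of the per-agent update described in steps (1)-(4) is given below. The network sizes, hyperparameters, and the attribute names on the agent objects (actor, critic, their targets and optimisers) are assumptions made for illustration; it follows the general MADDPG scheme of a centralised Critic over the joint state and action with decentralised Actors, and is not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps one agent's local state to its action (outputs kept in [0, 1])."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Centralised critic scoring the joint observation x and joint action a."""
    def __init__(self, joint_state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=-1))

def soft_update(target, source, tau=0.01):
    # Lagging ("soft") update of the target-network parameters theta'.
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_((1.0 - tau) * t.data + tau * s.data)

def maddpg_update(i, agents, batch, gamma=0.95):
    """One update of agent i's Critic and Actor estimation networks.

    `agents[j]` is assumed to hold actor, critic, actor_target, critic_target,
    actor_opt and critic_opt; `batch` holds per-agent tensors (states, actions,
    rewards, next_states) sampled from the experience pool D.
    """
    states, actions, rewards, next_states = batch
    x  = torch.cat(states, dim=-1)        # joint observation x
    a  = torch.cat(actions, dim=-1)       # joint action a
    x2 = torch.cat(next_states, dim=-1)   # joint next-time observation x'

    # Critic update: minimise (Q(x, a) - y_i)^2 with y_i from the target networks.
    with torch.no_grad():
        a2 = torch.cat([ag.actor_target(s2) for ag, s2 in zip(agents, next_states)], dim=-1)
        y  = rewards[i] + gamma * agents[i].critic_target(x2, a2)
    critic_loss = F.mse_loss(agents[i].critic(x, a), y)
    agents[i].critic_opt.zero_grad(); critic_loss.backward(); agents[i].critic_opt.step()

    # Actor update along the deterministic policy gradient of J(mu_i).
    a_pred = list(actions)
    a_pred[i] = agents[i].actor(states[i])
    actor_loss = -agents[i].critic(x, torch.cat(a_pred, dim=-1)).mean()
    agents[i].actor_opt.zero_grad(); actor_loss.backward(); agents[i].actor_opt.step()

    # Soft (lagging) update of the target networks.
    soft_update(agents[i].critic_target, agents[i].critic)
    soft_update(agents[i].actor_target, agents[i].actor)
```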
This example merely illustrates the process by which channel allocation and task offloading of the mobile devices minimize the system delay in this invention, and does not constrain the data parameters of the invention.
The following describes in detail, by way of example, the process of the MADDPG-based task offloading and resource allocation scheme using the NOMA technique. The concrete steps are as follows:
Step 1: the network contains 10 mobile devices (agents) and 5 subchannels in total, and the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total.
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique.
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device. This estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained.
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode.
The total delay for the user of the n-th mobile device to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000081 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server; the delay when the user of the n-th mobile device computes locally is:
(formula image BDA0003147409150000091 in the original)
where f_n is the computing capacity of the mobile user.
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined.
More specifically, the 10 mobile devices are each regarded as an agent. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
Step 4: the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1) The MADDPG algorithm is adopted to update the networks of the mobile device users. Each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; a block diagram is shown in FIG. 3. θ = [θ_1, θ_2, ..., θ_n], where θ_n denotes the policy parameters of the n-th mobile device user. For the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
Step (2) When the number of samples in the experience pool D reaches 400, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000101 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step (3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as ∇_{θ_i} J(μ_i).
Step (4) Steps (2) and (3) are repeated, and every 100 iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method.
After 2000 iterations the network is trained; then only the state s_t at the current time needs to be input into the Actor network, and the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
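The example's concrete settings can be gathered into a single configuration; the values below that do not appear in the text (batch size, discount factor, soft-update rate, learning rates, hidden size) are assumptions added only to make the sketch complete.

```python
CONFIG = {
    "num_devices": 10,         # mobile devices / agents in the example
    "num_subchannels": 5,      # NOMA subcarrier channels in the example
    "replay_start": 400,       # begin sampling once the experience pool holds 400 samples
    "soft_update_every": 100,  # iterations between target-network soft updates
    "max_iterations": 2000,    # total training iterations
    "batch_size": 64,          # assumed
    "gamma": 0.95,             # assumed discount factor
    "tau": 0.01,               # assumed soft-update rate
    "actor_lr": 1e-4,          # assumed
    "critic_lr": 1e-3,         # assumed
    "hidden_units": 128,       # assumed
}
```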
This example merely illustrates the process by which channel allocation and task offloading of the mobile devices minimize the system delay in this invention, and does not constrain the data parameters of the invention.
The above description is only a preferred embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as falling within the scope of the present invention.

Claims (6)

1. A resource allocation and task offloading method based on NOMA-MEC reinforcement learning, characterized in that the method comprises the following steps:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total;
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique; the joint optimization model covers carrier channel allocation and task offloading for all mobile devices in the network;
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined;
Step 4: the learning network is trained with the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
2. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: the step 2 specifically comprises the following steps:
when the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices; for subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}; the received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device; this estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained;
the Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode; the total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image FDA0003147409140000011 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image FDA0003147409140000021 in the original)
where f_n is the computing capacity of the mobile user.
3. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 3, the parameters are initialized; a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
4. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in the step 4, the learning network is trained by the MADDPG algorithm, which specifically comprises the following steps:
the MADDPG algorithm is adopted to update the networks of the mobile device users; each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; θ = [θ_1, θ_2, ..., θ_n] denotes the parameters of the n agents' policies; for the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i; the joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
5. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 4, the training aims at minimizing the time delay of the mobile device, and specifically includes the following steps:
4.1) when the number of samples in the experience pool D reaches a certain number, a batch of data is sampled from D for network training; the state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function; the next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i; the Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image FDA0003147409140000031 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j;
4.2) each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as
(formula image FDA0003147409140000032 in the original)
6. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 5, wherein: in step 4, steps 4.1) and 4.2) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method; after the set number of iterations is reached and the network is trained, the state s_t at the current time is input into the Actor network, and the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized; when the network state changes, a new state is input into the Actor network again to obtain a new allocation scheme.
CN202110756466.4A 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method Active CN113543342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756466.4A CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756466.4A CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Publications (2)

Publication Number Publication Date
CN113543342A true CN113543342A (en) 2021-10-22
CN113543342B CN113543342B (en) 2024-03-29

Family

ID=78097770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756466.4A Active CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Country Status (1)

Country Link
CN (1) CN113543342B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN114665952A (en) * 2022-03-24 2022-06-24 重庆邮电大学 Low-orbit satellite network beam hopping optimization method based on satellite-ground fusion architecture
CN114938381A (en) * 2022-06-30 2022-08-23 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning and computer program product
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
WO2023179010A1 (en) * 2022-03-22 2023-09-28 南京邮电大学 User packet and resource allocation method and apparatus in noma-mec system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180317157A1 (en) * 2017-04-27 2018-11-01 Samsung Electronics Co., Ltd. Method and apparatus for registration type addition for service negotiation
CN111800828A (en) * 2020-06-28 2020-10-20 西北工业大学 Mobile edge computing resource allocation method for ultra-dense network
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
US20210034970A1 (en) * 2018-02-05 2021-02-04 Deepmind Technologies Limited Distributed training using actor-critic reinforcement learning with off-policy correction factors
CN112601284A (en) * 2020-12-07 2021-04-02 南京邮电大学 Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
CN112788764A (en) * 2020-12-23 2021-05-11 华北电力大学 Method and system for task unloading and resource allocation of NOMA ultra-dense network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180317157A1 (en) * 2017-04-27 2018-11-01 Samsung Electronics Co., Ltd. Method and apparatus for registration type addition for service negotiation
US20210034970A1 (en) * 2018-02-05 2021-02-04 Deepmind Technologies Limited Distributed training using actor-critic reinforcement learning with off-policy correction factors
CN111800828A (en) * 2020-06-28 2020-10-20 西北工业大学 Mobile edge computing resource allocation method for ultra-dense network
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
CN112601284A (en) * 2020-12-07 2021-04-02 南京邮电大学 Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
CN112788764A (en) * 2020-12-23 2021-05-11 华北电力大学 Method and system for task unloading and resource allocation of NOMA ultra-dense network

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN114375066B (en) * 2022-01-08 2024-03-15 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
WO2023179010A1 (en) * 2022-03-22 2023-09-28 南京邮电大学 User packet and resource allocation method and apparatus in noma-mec system
CN114665952A (en) * 2022-03-24 2022-06-24 重庆邮电大学 Low-orbit satellite network beam hopping optimization method based on satellite-ground fusion architecture
CN114938381A (en) * 2022-06-30 2022-08-23 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning and computer program product
CN114938381B (en) * 2022-06-30 2023-09-01 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN116367223B (en) * 2023-03-30 2024-01-02 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113543342B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN113543342B (en) NOMA-MEC-based reinforcement learning resource allocation and task unloading method
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Bi et al. Joint optimization of service caching placement and computation offloading in mobile edge computing systems
CN111405568B (en) Computing unloading and resource allocation method and device based on Q learning
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN110798849A (en) Computing resource allocation and task unloading method for ultra-dense network edge computing
CN113612843A (en) MEC task unloading and resource allocation method based on deep reinforcement learning
CN112738185B (en) Edge computing system control joint optimization method based on non-orthogonal multiple access and application
Chen et al. Cache-assisted collaborative task offloading and resource allocation strategy: A metareinforcement learning approach
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
Wang et al. Multi-agent reinforcement learning-based user pairing in multi-carrier NOMA systems
Zheng et al. Channel assignment for hybrid NOMA systems with deep reinforcement learning
Ai et al. Dynamic offloading strategy for delay-sensitive task in mobile-edge computing networks
Li et al. Deep neural network based computational resource allocation for mobile edge computing
Zhang et al. A deep reinforcement learning approach for online computation offloading in mobile edge computing
Shang et al. Computation offloading and resource allocation in NOMA-MEC: A deep reinforcement learning approach
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
Chen et al. Joint optimization of task offloading and resource allocation via deep reinforcement learning for augmented reality in mobile edge network
Hu et al. Dynamic task offloading in MEC-enabled IoT networks: A hybrid DDPG-D3QN approach
Yang et al. Distributed reinforcement learning for NOMA-enabled mobile edge computing
Gan et al. A multi-agent deep reinforcement learning approach for computation offloading in 5G mobile edge computing
CN113315806B (en) Multi-access edge computing architecture for cloud network fusion
Li et al. DQN-based computation-intensive graph task offloading for internet of vehicles
Li et al. Task computation offloading for multi-access edge computing via attention communication deep reinforcement learning
Xie et al. Backscatter-aided hybrid data offloading for mobile edge computing via deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant